The UN Committee of Experts on Big Data and Data Science for Official Statistics is launching a pilot lab programme to make international data sharing more secure by using Privacy Enhancing Technologies (PETs).
Announced today at Dubai Expo 2020, the ‘UN PET Lab’ pilot programme will demonstrate how PETs can enable fully compliant data sharing and insights between organisations, using publicly available trade data from UN Comtrade.
Four National Statistical Offices (NSOs) — the US Census Bureau; Statistics Netherlands; the Italian National Institute of Statistics; and the UK’s Office for National Statistics — will be involved in the project.
The PET Lab will see statistical bodies collaborate with technology providers that offer PETs.
The team at @2i2c_org has finished our first ~full year of operations! To celebrate, we’re writing a 3-post series describing our experiences and challenges.
Our first post: designing and piloting services for use-cases in research and education
Web dumps are one of the key data sources for training LMs. But in their raw form, they have lots of undesirable content (e.g., boilerplate, hatespeech). Researchers typically apply tools called *quality filters* to select text considered “high quality” from these sources. /2
We argue that quality filtering implies a language ideology – a sociolinguistics term for a subjective belief about language use. These ideologies are often implicit/undocumented. What language is high quality enough to be included in the corpus? Whose language is excluded? /3
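As a rough illustration (not any specific team's pipeline), heuristic quality filters applied to web dumps often score documents on simple surface features; the thresholds below are hypothetical:

```python
def passes_quality_filter(text: str,
                          min_words: int = 50,
                          max_symbol_ratio: float = 0.1,
                          min_mean_word_len: float = 3.0,
                          max_mean_word_len: float = 10.0) -> bool:
    """Toy heuristic filter; thresholds are illustrative, not from any
    published corpus pipeline. Production filters also use learned
    classifiers, deduplication, and blocklists."""
    words = text.split()
    if len(words) < min_words:
        return False  # too short to be a useful training document
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    if symbols / max(len(text), 1) > max_symbol_ratio:
        return False  # likely boilerplate or markup residue
    mean_len = sum(len(w) for w in words) / len(words)
    # Thresholds like this are where implicit language ideology creeps in:
    # text in non-standard dialects or other scripts can fail them too.
    return min_mean_word_len <= mean_len <= max_mean_word_len
```

Even this toy version makes the thread's point concrete: every cutoff is a judgment about whose language counts as "high quality."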
from The Daily Californian student newspaper, Clara Rodas
In an effort to make the data privacy sector more accessible to underrepresented communities, the campus Privacy Office hosted a panel with data privacy experts from across California on Wednesday.
Panelists spoke about their own experiences in the field while also advising prospective attendees on what actions to take if they wished to pursue a career in data privacy. Many panelists emphasized that a background in data science, computer science or the law was not necessary, and encouraged those in other fields to consider how data privacy affected their own interests.
“There’s absolutely nothing that distinguishes me from anybody else who’s interested and wants to learn about data privacy,” said Thea Bullock, UC Irvine campus privacy official. “This field is not going to look the same in five years or 10 years — it will change and you can choose to come in the door and be part of the change.”
UC San Diego campus privacy officer Pegah Parsi recommended that prospects seek out people in the data privacy field and offer to help because privacy departments are often understaffed.
from Harvard University, Institute for Quantitative Social Science
The Dataverse Project at IQSS is joining the Office of Data Science Strategy (ODSS) at the National Institutes of Health (NIH) and five other data repositories in launching a new data curation, sharing, and interoperability initiative. Through this collaboration, the Generalist Repository Ecosystem Initiative (GREI), the Dataverse Project plans to facilitate access to NIH-funded data by building on the existing Harvard Dataverse Repository. To supplement the NIH’s existing domain-specific repositories, GREI aims to expand the NIH data ecosystem into additional repositories so that researchers can more easily and effectively find and share data from NIH-funded studies.
[Qiaozhu] Mei and colleagues at the University of Michigan School of Information developed a strategy to not only monitor the emotional health of workers but also predict work behaviors. In a new study in PLOS ONE, the team tracked emoji use as a marker of emotions and examined how emoji use in work communications can predict remote worker dropout.
“We saw a report from GitHub about the status of developers at the early stage of the COVID-19 pandemic,” said study lead author Xuan Lu, research fellow at UMSI.
Developers were showing signals of burnout at the start of the pandemic. The report spurred the team to look at how to better track the state of mind of remote workers, she said.
In August of 2021, the Santa Fe Institute hosted a workshop on collective intelligence as part of its Foundations of Intelligence project. This project seeks to advance the field of artificial intelligence by promoting interdisciplinary research on the nature of intelligence. The workshop brought together computer scientists, biologists, philosophers, social scientists, and others to share their insights about how intelligence can emerge from interactions among multiple agents, whether those agents are machines, animals, or human beings. In this report, we summarize each of the talks and the subsequent discussions. We also draw out a number of key themes and identify important frontiers for future research.
from Nature Medicine; Pranav Rajpurkar, Emma Chen, Oishi Banerjee & Eric J. Topol
Artificial intelligence (AI) is poised to broadly reshape medicine, potentially improving the experiences of both clinicians and patients. We discuss key findings from a 2-year weekly effort to track and share key developments in medical AI. We cover prospective studies and advances in medical image analysis, which have reduced the gap between research and deployment. We also address several promising avenues for novel medical AI research, including non-image data sources, unconventional problem formulations and human–AI collaboration. Finally, we consider serious technical and ethical challenges in issues spanning from data scarcity to racial bias. As these challenges are addressed, AI’s potential may be realized, making healthcare more accurate, efficient and accessible for patients worldwide.
With more than half of states having completed their newly redrawn congressional maps, a few trends have emerged from the U.S. redistricting process. One is that there are fewer highly competitive seats up for grabs compared to the old maps, which Michael Li of the Brennan Center for Justice predicted in MapLab last November would likely advantage Republicans in coming elections.
Another is that the more politicized the redistricting process, the more partisan the resulting map. That’s according to the latest grades on the Redistricting Report Card, a project led by the Princeton Gerrymandering Project and RepresentUs that scores new maps using data-driven metrics. To learn more about that effort and what it reveals, I spoke with Joe Kabourek, senior campaign director at RepresentUs. This interview has been edited for clarity and length.
from Vox Media, The Verge, Epic Magazine, Kristen Radtke and Amelia Holowaty Krales
In the early ’80s, Susan and her friends pulled increasingly elaborate phone scams until they nearly shut down phone service for the entire city. As two of her friends, Kevin Mitnick and Lewis DePayne, were being convicted for cybercrime, she made an appearance on 20/20, demonstrating their tradecraft to Geraldo Rivera. Riding her celebrity, she went briefly legit, testifying before the US Senate and making appearances at security conventions, spouting technobabble in cowboy boots and tie-dye. Then, without a trace, she left the world behind.
I went looking for the great lost female hacker of the 1980s. I should have known that she didn’t want to be found.
The algorithm, known as Pattern, overpredicted the risk that many Black, Hispanic and Asian people would commit new crimes or violate rules after leaving prison. https://n.pr/35rr3zJ
It took just one virus to cripple the world’s economy and kill millions of people; yet virologists estimate that trillions of still-unknown viruses exist, many of which might be lethal or have the potential to spark the next pandemic. Now, they have a new—and very long—list of possible suspects to interrogate. By sifting through unprecedented amounts of existing genomic data, scientists have uncovered more than 100,000 novel viruses, including nine coronaviruses and more than 300 related to the hepatitis Delta virus, which can cause liver failure.
“It’s a foundational piece of work,” says J. Rodney Brister, a bioinformatician at the National Center for Biotechnology Information’s National Library of Medicine who was not involved in the new study. The work expands the number of known viruses that use RNA instead of DNA for their genes by an order of magnitude. It also “demonstrates our outrageous lack of knowledge about this group of organisms,” says disease ecologist Peter Daszak, president of the EcoHealth Alliance, a nonprofit research group in New York City that is raising money to launch a global survey of viruses. The work will also help launch so-called petabyte genomics—the analyses of previously unfathomable quantities of DNA and RNA data. (One petabyte is 10¹⁵ bytes.)
Somewhere between $1 million and $3 million of the university’s portion of American Rescue Plan Act money will go to the computing school, Baldwin said. This is federal money meant to help states recover economically and socially from the COVID-19 pandemic. Wyoming is set to receive about $1 billion in ARPA money over the next two years.
While the ARPA money jump started the program, the university had been planning on allocating money to the project since 2020.
The university also has been eliminating and merging other academic programs, which freed up some money for new programs. About $3 million is planned to be reallocated to the computing school from other areas in the budget.
A $1.75 million gift from Meta (formerly Facebook) to the Center for Community Engagement, Environmental Justice and Health (CEEJH) in the University of Maryland’s School of Public Health will significantly bolster the group’s ability to advance environmental justice across the United States.
The gift will support new and ongoing activities led by CEEJH, including a new paid internship program, new staff hires and the annual University of Maryland Symposium on Environmental Justice (EJ) and Health Disparities, a multiday event for community organizers, policymakers and environmental health experts using innovative policy, legal and public health tools to address pressing EJ issues and help communities to advocate for themselves.
from Yale University, Yale Daily News student newspaper, Valentina Simon
An interdisciplinary team of Yale researchers has designed brain-machine interface chips that, when implanted in humans, can reduce the rate of epileptic seizures.
More than three million people in the United States experience epileptic seizures, and 60 to 70 percent of patients can successfully treat the condition with medicine. For the remaining individuals, surgically removing the parts of the brain where seizures arise, regardless of their role in everyday function, has been the only path toward mitigating the issue. A team of Yale computer scientists, engineers and surgeons has found that short-circuiting the path neurons fire along during an epileptic seizure can successfully reduce the rate of seizures in patients. The Swebilius Foundation recently awarded the team a grant to continue its research.
“When the signature traits of a seizure are observed, the device stimulates that part of the brain, and it is not curative, but over time 60 percent of patients will get 50 percent fewer seizures than they had before,” said Dennis Spencer, professor emeritus of neurosurgery, who implants these brain-computer interface chips in patients.
from University of Oklahoma, Vice President for Research and Partnerships
“In the peptide-engineering field, the general approach is to take those natural proteins and make incremental changes to identify the properties of the end aggregated products, and then find an application for which the identified properties would be useful,” [Handan] Acar said. “However, there are more than 500 natural and unnatural amino acids. Especially when you consider the size of the peptides, this approach is just not practical.”
Machine learning has great potential to counter this challenge, but Acar says the complex way peptides assemble and disassemble has prevented artificial intelligence methods from being effective so far.
“Clearly, computational methods, such as machine learning, are necessary,” she said. “Yet, the peptide aggregation is very complex. It is currently not possible to identify the effects of individual amino acids with computational methods.”
For almost five years, Boston University and Red Hat, a leading provider of open-source computer software solutions, have collaborated to drive innovative research and education in open-source technology. Now that partnership has announced the first recipients of the Red Hat Collaboratory Research Incubation Awards. (Open source means that the original source code is made available for use or modification by users and developers.)
The awards are administered through BU’s Red Hat Collaboratory, housed within the Rafik B. Hariri Institute for Computing and Computational Science & Engineering, and Red Hat Research. “This collaborative model gives us the opportunity to increase the diversity and richness of open engineering and operations projects we undertake together, and also allows us to pursue fundamental research under one umbrella,” says Heidi Picher Dempsey, Red Hat research director, Northeast United States.
You are being asked to voluntarily participate in a research study. We are doing this study to identify features of biodiversity informatics training valued by the biodiversity research, conservation, management, and education community. If you choose to participate, you will be asked to complete a survey that is anticipated to take up to 20 minutes. You will not be paid to take part in this study.
SPONSORED CONTENT
The eScience Institute’s Data Science for Social Good program is now accepting applications for student fellows and project leads for the 2021 summer session. Fellows will work with academic researchers, data scientists and public stakeholder groups on data-intensive research projects that will leverage data science approaches to address societal challenges in areas such as public policy, environmental impacts and more. Student applications due 2/15 – learn more and apply here. DSSG is also soliciting project proposals from academic researchers, public agencies, nonprofit entities and industry who are looking for an opportunity to work closely with data science professionals and students on focused, collaborative projects to make better use of their data. Proposal submissions are due 2/22.
2021 saw many exciting advances in machine learning (ML) and natural language processing (NLP). In this post, I will cover the papers and research areas that I found most inspiring. I tried to cover the papers that I was aware of but likely missed many relevant ones. Feel free to highlight them as well as ones that you found inspiring in the comments. I discuss the following highlights:
Universal Models
Massive Multi-task Learning
Beyond the Transformer
Prompting
Efficient Methods
Benchmarking
Conditional Image Generation
ML for Science
Program Synthesis
Bias
Retrieval Augmentation
Token-free Models
Temporal Adaptation
The Importance of Data
Meta-learning
from Twitter, Silviu Paun, Morgan & Claypool Publishers
New! The focus of this book is primarily on Natural Language Processing: models of language interpretation and production. Most of the methods discussed here are applicable to other areas of AI and Data Science.
We are introducing embeddings, a new endpoint in the OpenAI API that makes it easy to perform natural language and code tasks like semantic search, clustering, topic modeling, and classification. Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts. Our embeddings outperform top models in 3 standard benchmarks, including a 20% relative improvement in code search.
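Whatever model produces them, embeddings are typically compared the same way: related concepts should map to vectors pointing in similar directions. A minimal sketch of cosine similarity over made-up four-dimensional vectors (real embedding vectors have hundreds or thousands of dimensions, and the numbers below are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of three phrases.
cat = [0.9, 0.1, 0.05, 0.0]
kitten = [0.85, 0.15, 0.1, 0.05]
invoice = [0.0, 0.1, 0.9, 0.4]

# Semantic search reduces to ranking candidates by similarity to a query.
assert cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice)
```

Tasks like clustering, topic modeling, and classification then operate on these vectors rather than on raw text.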