Data Science newsletter – January 29, 2022

Newsletter features journalism, research papers and tools/software for January 29, 2022

 

UN launches privacy lab pilot to unlock cross-border data sharing benefits

Information Age, Aaron Hurst


from

The UN Committee of Experts on Big Data and Data Science for Official Statistics is launching a pilot lab programme to make international data sharing more secure by using Privacy Enhancing Technologies (PETs).

Announced today at Dubai Expo 2020, the ‘UN PET Lab’ pilot programme will look to demonstrate how PETs can allow for fully compliant data sharing and insights between organisations, utilising publicly available trade data from UN Comtrade.

Four National Statistical Offices (NSOs) — the US Census Bureau; Statistics Netherlands; the Italian National Institute of Statistics; and the UK’s Office for National Statistics — will be involved in the project.

The PET Lab will see statistical bodies collaborate with technology providers that offer PETs.


It’s amazing to see how much 2i2c has grown and learned in the past year, under Chris’s thoughtful leadership.

Twitter, Ryan Abernathy, Chris Holdgraf


from

The team at @2i2c_org has finished our first ~full year of operations! To celebrate, we’re writing a 3-post series describing our experiences and challenges.

Our first post: designing and piloting services for use-cases in research and education


More fun publisher surveillance: Elsevier embeds a hash in the PDF metadata that is *unique for each time a PDF is downloaded*, this is a diff between metadata from two of the same paper.

Twitter, Jonny Saunders


from

Combined with access timestamps, they can uniquely identify the source of any shared PDFs.
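The thread's evidence is a diff between the metadata of two copies of the same paper. A toy sketch of that comparison, assuming invented field names and values (the `psn` key and hash strings below are hypothetical, not Elsevier's actual schema):

```python
# Compare two hypothetical metadata records for the same paper and report
# which fields differ. In the scenario described above, only the
# per-download hash field would change between downloads.

def metadata_diff(meta_a, meta_b):
    """Return {key: (value_a, value_b)} for every key whose values differ."""
    keys = set(meta_a) | set(meta_b)
    return {k: (meta_a.get(k), meta_b.get(k))
            for k in keys if meta_a.get(k) != meta_b.get(k)}

download_1 = {"Title": "Some Paper", "Author": "A. Researcher",
              "psn": "a1b2c3d4"}  # hypothetical per-download hash field
download_2 = {"Title": "Some Paper", "Author": "A. Researcher",
              "psn": "e5f6a7b8"}

print(metadata_diff(download_1, download_2))
# → {'psn': ('a1b2c3d4', 'e5f6a7b8')}
```

Run against real files, the same key-by-key comparison on extracted PDF metadata would surface exactly one differing field per download, which is what makes each copy traceable.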


In our new interdisciplinary work, “Whose Language Counts as High Quality?”, we empirically demonstrate that the data selection procedures for language models like GPT-3 implicitly favor text written by authors from powerful social positions.

Twitter, Suchin Gururangan


from

Web dumps are one of the key data sources for training LMs. But in their raw form, they have lots of undesirable content (e.g., boilerplate, hatespeech). Researchers typically apply tools called *quality filters* to select text considered “high quality” from these sources. /2

We argue that quality filtering implies a language ideology – a sociolinguistics term for a subjective belief about language use. These ideologies are often implicit/undocumented. What language is high quality enough to be included in the corpus? Whose language is excluded? /3
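A deliberately crude stand-in for the quality filters the thread describes. Real pipelines train classifiers; this toy version scores text with two invented heuristics and an arbitrary threshold, purely to show the mechanism, and in doing so illustrates the thread's point: the chosen heuristics and cutoff encode a subjective judgment about whose text counts.

```python
# Toy "quality filter": score documents and keep only those above a
# threshold. Both heuristics and the threshold are assumptions for
# illustration, not any production system's actual criteria.

def quality_score(text):
    """Crude proxy: longer average tokens score higher, ALL-CAPS tokens lower."""
    tokens = text.split()
    if not tokens:
        return 0.0
    avg_len = sum(len(t) for t in tokens) / len(tokens)
    caps_ratio = sum(t.isupper() for t in tokens) / len(tokens)
    return avg_len / 10 - caps_ratio

def quality_filter(docs, threshold=0.3):
    return [d for d in docs if quality_score(d) >= threshold]

docs = [
    "The committee reviewed the proposal and published its findings.",
    "CLICK HERE NOW BUY BUY BUY",
]
print(quality_filter(docs))  # keeps only the first document
```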


‘Be part of the change’: California data experts talk diversity

The Daily Californian student newspaper, Clara Rodas


from

In an effort to make the data privacy sector more accessible to underrepresented communities, the campus Privacy Office hosted a panel with data privacy experts from across California on Wednesday.

Panelists spoke about their own experiences in the field while also giving prospective attendees advice regarding what actions to take if they wished to pursue a career in data privacy. Many panelists emphasized that having a background in data science, computer science or the law was not necessary and encouraged those in other fields to see how data privacy affected their own interests.

“There’s absolutely nothing that distinguishes me from anybody else who’s interested and wants to learn about data privacy,” said Thea Bullock, UC Irvine campus privacy official. “This field is not going to look the same in five years or 10 years — it will change and you can choose to come in the door and be part of the change.”

UC San Diego campus privacy officer Pegah Parsi recommended that prospects seek out people in the data privacy field and offer to help because privacy departments are often understaffed.


Dataverse Joins NIH in Increasing Access to Biomedical Data | Institute for Quantitative Social Science

Harvard University, Institute for Quantitative Social Science


from

The Dataverse Project at IQSS is joining the Office of Data Science Strategy (ODSS) at the National Institutes of Health (NIH) and five other data repositories in launching a new data curation, sharing, and interoperability initiative. Through this collaboration, the Generalist Repository Ecosystem Initiative (GREI), the Dataverse Project plans to facilitate access to NIH-funded data by building on the existing Harvard Dataverse Repository. In order to supplement the NIH’s existing domain-specific repositories, the goal of the GREI is to expand its data ecosystem into additional repositories so that researchers can more easily and effectively find and share data from studies funded by the NIH.


Can emoji use be the key in detecting remote-work burnout?

University of Michigan, Michigan News


from

[Qiaozhu] Mei and colleagues at the University of Michigan School of Information developed a strategy to not only monitor the emotional health of workers but also predict work behaviors. In a new study in PLOS ONE, the team tracked emoji use as a marker of emotions and examined how emoji use in work communications can predict remote-worker dropout.

“We saw a report from GitHub about the status of developers at the early stage of the COVID-19 pandemic,” said study lead author Xuan Lu, research fellow at UMSI.

Developers were showing signals of burnout at the start of the pandemic. The report spurred the team to look at how to better track the state of mind of remote workers, she said.


Frontiers in Collective Intelligence: A Workshop Report

Complexity Digest; Tyler Millhouse, Melanie Moses, Melanie Mitchell


from

In August of 2021, the Santa Fe Institute hosted a workshop on collective intelligence as part of its Foundations of Intelligence project. This project seeks to advance the field of artificial intelligence by promoting interdisciplinary research on the nature of intelligence. The workshop brought together computer scientists, biologists, philosophers, social scientists, and others to share their insights about how intelligence can emerge from interactions among multiple agents, whether those agents be machines, animals, or human beings. In this report, we summarize each of the talks and the subsequent discussions. We also draw out a number of key themes and identify important frontiers for future research.


AI in health and medicine

Nature Medicine; Pranav Rajpurkar, Emma Chen, Oishi Banerjee & Eric J. Topol


from

Artificial intelligence (AI) is poised to broadly reshape medicine, potentially improving the experiences of both clinicians and patients. We discuss key findings from a 2-year weekly effort to track and share key developments in medical AI. We cover prospective studies and advances in medical image analysis, which have reduced the gap between research and deployment. We also address several promising avenues for novel medical AI research, including non-image data sources, unconventional problem formulations and human–AI collaboration. Finally, we consider serious technical and ethical challenges in issues spanning from data scarcity to racial bias. As these challenges are addressed, AI’s potential may be realized, making healthcare more accurate, efficient and accessible for patients worldwide.


MapLab: How to Make Fairer Congressional Maps

Bloomberg, CityLab, Laura Bliss


from

With more than half of states having completed their newly redrawn congressional maps, a few trends have emerged from the U.S. redistricting process. One is that there are fewer highly competitive seats up for grabs than under the old maps, which Michael Li of the Brennan Center for Justice predicted in MapLab last November would likely advantage Republicans in coming elections.

Another is that the more politicized the redistricting process, the more partisan the resulting map. That’s according to the latest grades on the Redistricting Report Card, a project led by the Princeton Gerrymandering Project and RepresentUs that scores new maps using data-driven metrics. To learn more about that effort and what it reveals, I spoke with Joe Kabourek, senior campaign director at RepresentUs. This interview has been edited for clarity and length.


Searching for Susy Thunder

Vox Media, The Verge, Epic Magazine, Kristen Radtke and Amelia Holowaty Krales


from

In the early ’80s, Susan and her friends pulled increasingly elaborate phone scams until they nearly shut down phone service for the entire city. As two of her friends, Kevin Mitnick and Lewis DePayne, were being convicted for cybercrime, she made an appearance on 20/20, demonstrating their tradecraft to Geraldo Rivera. Riding her celebrity, she went briefly legit, testifying before the US Senate and making appearances at security conventions, spouting technobabble in cowboy boots and tie-dye. Then, without a trace, she left the world behind.

I went looking for the great lost female hacker of the 1980s. I should have known that she didn’t want to be found.


Many academics and civil rights groups raised serious concerns about the Pattern algorithm back in early 2020. Yet here we are. See @civilrightsorg letter here

Twitter, Dr. Kate Crawford


from

The algorithm, known as Pattern, overpredicted the risk that many Black, Hispanic and Asian people would commit new crimes or violate rules after leaving prison. https://n.pr/35rr3zJ
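The overprediction described here is usually quantified as a gap in group-wise false positive rates: people flagged high-risk who did not in fact reoffend. A minimal sketch of that audit step, using fabricated records (not Pattern's data or categories):

```python
# Compute the false positive rate per group: among people who did NOT
# reoffend, what fraction were nonetheless flagged high-risk? A large
# gap between groups is the kind of disparity the letter raised.

def false_positive_rate(records):
    negatives = [r for r in records if not r["reoffended"]]
    if not negatives:
        return 0.0
    return sum(r["flagged_high_risk"] for r in negatives) / len(negatives)

records = [  # fabricated illustrative data
    {"group": "A", "flagged_high_risk": True,  "reoffended": False},
    {"group": "A", "flagged_high_risk": True,  "reoffended": False},
    {"group": "A", "flagged_high_risk": False, "reoffended": False},
    {"group": "B", "flagged_high_risk": False, "reoffended": False},
    {"group": "B", "flagged_high_risk": True,  "reoffended": False},
    {"group": "B", "flagged_high_risk": False, "reoffended": False},
]

for g in ("A", "B"):
    rate = false_positive_rate([r for r in records if r["group"] == g])
    print(g, round(rate, 2))
# → A 0.67
# → B 0.33
```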


New dangers? Computers uncover 100,000 novel viruses in old genetic data

Science, Elizabeth Pennisi


from

It took just one virus to cripple the world’s economy and kill millions of people; yet virologists estimate that trillions of still-unknown viruses exist, many of which might be lethal or have the potential to spark the next pandemic. Now, they have a new—and very long—list of possible suspects to interrogate. By sifting through unprecedented amounts of existing genomic data, scientists have uncovered more than 100,000 novel viruses, including nine coronaviruses and more than 300 related to the hepatitis Delta virus, which can cause liver failure.

“It’s a foundational piece of work,” says J. Rodney Brister, a bioinformatician at the National Center for Biotechnology Information’s National Library of Medicine who was not involved in the new study. The work expands the number of known viruses that use RNA instead of DNA for their genes by an order of magnitude. It also “demonstrates our outrageous lack of knowledge about this group of organisms,” says disease ecologist Peter Daszak, president of the EcoHealth Alliance, a nonprofit research group in New York City that is raising money to launch a global survey of viruses. The work will also help launch so-called petabyte genomics—the analyses of previously unfathomable quantities of DNA and RNA data. (One petabyte is 10^15 bytes.)


UW’s new School of Computing expected to benefit multiple disciplines

Rawlins Times (WY), Abby Vander Graaf


from

Somewhere between $1 million and $3 million of the university’s portion of American Rescue Plan Act money will go to the computing school, Baldwin said. This is federal money meant to help states recover economically and socially from the COVID-19 pandemic. Wyoming is set to receive about $1 billion in ARPA money over the next two years.

While the ARPA money jump-started the program, the university had been planning to allocate money to the project since 2020.

The university also has been eliminating and merging other academic programs, which freed up some money for new programs. About $3 million is planned to be reallocated to the computing school from other areas in the budget.


Meta Gifts $1.75 Million to UMD to Boost Environmental Justice and Health Equity

University of Maryland, Maryland Today


from

A $1.75 million gift from Meta (formerly Facebook) to the Center for Community Engagement, Environmental Justice and Health (CEEJH) in the University of Maryland’s School of Public Health will significantly bolster the group’s ability to advance environmental justice across the United States.

The gift will support new and ongoing activities led by CEEJH, including a new paid internship program, new staff hires and the annual University of Maryland Symposium on Environmental Justice (EJ) and Health Disparities, a multiday event for community organizers, policymakers and environmental health experts using innovative policy, legal and public health tools to address pressing EJ issues and help communities to advocate for themselves.


Yale researchers receive grant to develop novel epilepsy brain-computer chip treatment

Yale University, Yale Daily News student newspaper, Valentina Simon


from

An interdisciplinary team of Yale researchers has designed brain-machine interface chips that, when implanted in humans, can reduce the rate of epileptic seizures.

More than three million people experience epileptic seizures in the United States, with 60 to 70 percent of patients able to successfully treat the condition with medicine. For the remaining individuals, surgically removing the parts of the brain where seizures arise, regardless of their role in everyday function, has been the only path toward mitigating the issue. A team of Yale computer scientists, engineers and surgeons has found that short-circuiting the path neurons fire during an epileptic seizure can successfully reduce the rate of seizures in patients. The Swebilius Foundation recently awarded the team a grant to continue its research.

“When the signature traits of a seizure are observed, the device stimulates that part of the brain, and it is not curative, but over time 60 percent of patients will get 50 percent fewer seizures than they had before,” said Dennis Spencer, professor emeritus of neurosurgery, who implants these brain-computer interface chips in patients.


OU Engineers Build a Molecular Framework to Bridge Experimental and Computer Sciences for Peptide-Based Materials Engineering

University of Oklahoma, Vice President for Research and Partnerships


from

“In the peptide-engineering field, the general approach is to take those natural proteins and make incremental changes to identify the properties of the end aggregated products, and then find an application for which the identified properties would be useful,” [Handan] Acar said. “However, there are more than 500 natural and unnatural amino acids. Especially when you consider the size of the peptides, this approach is just not practical.”

Machine learning has great potential to counter this challenge, but Acar says the complex way peptides assemble and disassemble has prevented artificial intelligence methods from being effective so far.

“Clearly, computational methods, such as machine learning, are necessary,” she said. “Yet, the peptide aggregation is very complex. It is currently not possible to identify the effects of individual amino acids with computational methods.”


BU and Red Hat Announce First Research Incubation Awards

Boston University, BU Today


from

For almost five years, Boston University and Red Hat, a leading provider of open-source computer software solutions, have collaborated to drive innovative research and education in open-source technology. Now that partnership has announced the first recipients of the Red Hat Collaboratory Research Incubation Awards. (Open source means that the original source code is made available for use or modification by users and developers.)

The awards are administered through BU’s Red Hat Collaboratory, housed within the Rafik B. Hariri Institute for Computing and Computational Science & Engineering, and Red Hat Research. “This collaborative model gives us the opportunity to increase the diversity and richness of open engineering and operations projects we undertake together, and also allows us to pursue fundamental research under one umbrella,” says Heidi Picher Dempsey, Red Hat research director, Northeast United States.


AAAS honors computer, data scientist Juliana Freire as Lifetime Fellow

EurekAlert!, NYU Tandon School of Engineering


from

Freire joins four other NYU faculty members in becoming 2021 Fellows, as well as two NYU Tandon faculty members who have received AAAS fellowships in the past.


Deadlines



Biodiversity Informatics Summary

You are being asked to voluntarily participate in a research study. We are doing this study to identify features of biodiversity informatics training valued by the biodiversity research, conservation, management, and education community. If you choose to participate, you will be asked to complete a survey that is anticipated to take up to 20 minutes. You will not be paid to take part in this study.

SPONSORED CONTENT





The eScience Institute’s Data Science for Social Good program is now accepting applications for student fellows and project leads for the 2021 summer session. Fellows will work with academic researchers, data scientists and public stakeholder groups on data-intensive research projects that will leverage data science approaches to address societal challenges in areas such as public policy, environmental impacts and more. Student applications due 2/15 – learn more and apply here. DSSG is also soliciting project proposals from academic researchers, public agencies, nonprofit entities and industry who are looking for an opportunity to work closely with data science professionals and students on focused, collaborative projects to make better use of their data. Proposal submissions are due 2/22.

 


Tools & Resources



ML and NLP Research Highlights of 2021

Sebastian Ruder


from

2021 saw many exciting advances in machine learning (ML) and natural language processing (NLP). In this post, I will cover the papers and research areas that I found most inspiring. I tried to cover the papers that I was aware of but likely missed many relevant ones. Feel free to highlight them as well as ones that you found inspiring in the comments. I discuss the following highlights:

Universal Models
Massive Multi-task Learning
Beyond the Transformer
Prompting
Efficient Methods
Benchmarking
Conditional Image Generation
ML for Science
Program Synthesis
Bias
Retrieval Augmentation
Token-free Models
Temporal Adaptation
The Importance of Data
Meta-learning


It’s finally out! Your one-stop resource for annotation analysis: we cover agreement, aggregation, and training methods for multi-annotated corpora. Have a look! @poesio Ron Artstein

Twitter, Silviu Paun, Morgan & Claypool Publishers


from

New! The focus of this book is primarily on Natural Language Processing; models of language interpretation and production. Most of the methods discussed here are applicable to other areas of AI and Data Science.
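One of the agreement measures a book on annotation analysis covers is Cohen's kappa, which corrects raw agreement between two annotators for chance agreement. A minimal implementation from the standard formula, with made-up labels (this is the textbook metric, not code from the book):

```python
# Cohen's kappa for two annotators over the same items:
# kappa = (observed agreement - chance agreement) / (1 - chance agreement),
# where chance agreement comes from each annotator's label distribution.

from collections import Counter

def cohens_kappa(ann1, ann2):
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    expected = sum((c1[lab] / n) * (c2[lab] / n)
                   for lab in set(ann1) | set(ann2))
    return (observed - expected) / (1 - expected)

ann1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
ann2 = ["pos", "neg", "neg", "neg", "pos", "neg"]
print(round(cohens_kappa(ann1, ann2), 3))
# → 0.667
```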


Introducing Text and Code Embeddings in the OpenAI API

OpenAI


from

We are introducing embeddings, a new endpoint in the OpenAI API that makes it easy to perform natural language and code tasks like semantic search, clustering, topic modeling, and classification. Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts. Our embeddings outperform top models in 3 standard benchmarks, including a 20% relative improvement in code search.
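Embeddings reduce semantic comparison to vector arithmetic: a search query and each document become vectors, and cosine similarity ranks the documents. The 4-dimensional vectors below are made up for illustration (real embeddings from the API have far more dimensions); only the similarity step is the point.

```python
# Rank documents against a query by cosine similarity of their embeddings.
# The vectors here are invented stand-ins, not actual API output.

import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

query    = [0.9, 0.1, 0.0, 0.2]   # hypothetical embedding of a search query
doc_hit  = [0.8, 0.2, 0.1, 0.3]   # hypothetical embedding of a relevant doc
doc_miss = [0.0, 0.9, 0.8, 0.1]   # hypothetical embedding of an unrelated doc

print(cosine_similarity(query, doc_hit) > cosine_similarity(query, doc_miss))
# → True: the relevant document scores higher, so it ranks first.
```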
