Data Science newsletter – February 11, 2020

Newsletter features journalism, research papers, events, tools/software, and jobs for February 11, 2020


Data Science News

FDA approves AI-based software to help doctors perform echocardiograms

STAT, Matthew Herper


The Food and Drug Administration has approved a software product from an artificial intelligence startup that is aimed at making it easier for doctors and other medical professionals to take ultrasound pictures of the heart, also known as echocardiograms.

The technology, developed by San Francisco-based Caption Health, could help more hospitals use the diagnostic test, which currently requires expertise that is in short supply.

How Google Got Its Employees to Eat Their Vegetables

Medium, OneZero, Jane Black


The tech giant is engineering a way to encourage its employees to eat healthier — and it might just help the rest of the country

Schwarzman backs artificial intelligence to make sure it’s done right

Yahoo Finance, Julia LaRoche


Billionaire private equity tycoon Stephen A. Schwarzman recognizes the velocity of changes in computing technology, particularly artificial intelligence (AI).

Economists, workers and even some technology CEOs are worried about how AI will impact people’s livelihoods.

Yet the CEO of Blackstone (BX) is deeply invested in the sector: A year ago, Schwarzman donated $350 million to MIT to establish his namesake College of Computing to advance the study of A.I., and he explained why the technology is significant.

“[It’s] important these types of technologies are implemented in a way that serves society. Because if you sometimes do things too quickly, you can have real dislocation,” Schwarzman told Yahoo Finance in a wide-ranging exclusive interview.

Once wary of feds, state election leaders now welcome help

Fifth Domain, Andrew Eversden


The latest test for the once-fraught working relationship between the federal government and state election officials came Jan. 2 when a U.S. drone strike killed Iranian Maj. Gen. Qassem Soleimani, stoking fears about a potential cyber response from Iran.

Within 24 hours, leaders from the Department of Homeland Security’s critical infrastructure protection agency were on the phone with the state leaders, discussing Iranian threats with election officials and those who oversee critical infrastructure.

Mac Warner, the Republican secretary of state of West Virginia, said feds warned that Iranian actors may already be inside their systems but hadn’t yet engaged in malicious activity.

“I really hadn’t thought through that,” Warner told Fifth Domain. “We were watching just for somebody to get inside, but what if they were already inside and I didn’t know it?”

Predicting chaos using aerosols and AI

Washington University in St. Louis, The Source


If a poisonous gas were released in a bioterrorism attack, the ability to predict the path of its molecules — through turbulent winds, temperature changes and unstable buoyancies — could mean life or death. Understanding how a city will grow and change over a 20-year period could lead to more sustainable planning and affordable housing.

Deriving equations to solve such problems — adding up all of the relevant forces — is, at best, difficult to the point of near-impossibility and, at worst, actually impossible. But machine learning can help.

Using the motion of aerosol particles through a system in flux, researchers from the McKelvey School of Engineering at Washington University in St. Louis have devised a new model, based on a deep learning method, that can help researchers predict the behavior of chaotic systems, whether those systems are in the lab, in the pasture or anywhere else.

What to do when you don’t trust your data anymore

Kate Laskowski


Science is built on trust. Trust that your experiments will work. Trust in your collaborators to pull their weight. But most importantly, trust that the data we so painstakingly collect are accurate and as representative of the real world as they can be.

And so when I realized that I could no longer trust the data that I had reported in some of my papers, I did what I think is the only correct course of action. I retracted them.

Retractions are seen as a comparatively rare event in science, and this is no different for my particular field (evolutionary and behavioral ecology), so I know that there is probably some interest in understanding the story behind it. This is my attempt to explain how and why I came to the conclusion that these papers needed to be removed from the scientific record.

A fresh new name for UC Berkeley’s data science division

University of California-Berkeley, Berkeley News


UC Berkeley’s new interdisciplinary division, launched in November 2018 under the provisional title of Division of Data Science and Information, now has a permanent name: the Division of Computing, Data Science, and Society (CDSS).

The moniker, announced today (Feb. 5) after input and nominations from across campus, was chosen to reflect the broad span of the division, which seeks to unite the vast array of data science-related research and teaching that is popping up in all corners of campus.

Entrepreneur opening doors for women in data analytics

The Columbus Dispatch, Columbus CEO, Cynthia Bent Findlay


Rehgan Avon is only about three years out of college, but she’s already at the helm of an enterprise with annual revenue approaching $1 million. What’s more, that’s her part-time gig. … Avon already has held multiple positions in data science and recently joined local mobility data infrastructure startup Mobikit as head of solutions. Columbus CEO talked to her about the promise of data analytics and Ohio’s part in the growing industry.

How maps tracking climate disasters fall short—and endanger lives

Fast Company, Suzanne LaBarre


Last fall, a fire swept across Sonoma County, California, burning more than 77,000 acres and scattering particulate matter throughout the Bay Area, where I live. I had just had a baby, and as the sky over my house turned a strange popsicle orange, I took to obsessively refreshing air-quality maps on my phone. Was it safe to take my newborn outside? Should my family leave town, as we had done the previous fall—smoke refugees from the deadliest fire season on California record?

The maps were not helpful. My neighborhood appeared yellow on one map to indicate “moderate” air quality, and red (“unhealthy”) on another. At one point, the latter map turned violet (“very unhealthy”), even though the first map was still yellow. I couldn’t get a clear answer. So for days, I hunkered down indoors and parked my baby by the air purifier. Better safe than sorry.

Hey Alexa! Sorry I fooled you …



MIT’s new system TextFooler can trick the types of natural-language-processing systems that Google uses to help power its search results, including audio for Google Home.

Hidden Donors Play Significant Role in Political Campaigns

Caltech, News


A new Caltech study reveals that so-called hidden donors in a political campaign—those contributors who donate less than $200—can make up a sizable fraction of a candidate’s campaign funds.

The study, appearing in the Election Law Journal, specifically looked at the 2016 presidential campaign of Bernie Sanders. Unlike many other campaigns at the time, that of Sanders used an intermediary online fundraising service, called ActBlue, which meant that small contributions were required to be reported to the Federal Election Commission (FEC). Typically, donations from a single donor that add up to $200 or less do not need to be reported, but for intermediary fundraising services, the rules are more strict.

The Inside Story of the First Picture of a Black Hole

IEEE Spectrum, Katherine L. Bouman


Last April, a research team that I’m part of unveiled a picture that most astronomers never dreamed they would see: one of a massive black hole at the center of a distant galaxy. Many were shocked that we had pulled off this feat. To accomplish it, our team had to build a virtual telescope the size of the globe and pioneer new techniques in radio astronomy.

Our group—made up of more than 200 scientists and engineers in 20 countries—combined eight of the world’s most sensitive radio telescopes, a network of synchronized atomic clocks, two custom-built supercomputers, and several new algorithms in computational imaging. After more than 10 years of work, this collective effort, known as the Event Horizon Telescope (EHT) project, was finally able to illuminate one of the greatest mysteries of nature.

World’s largest linguistics database is getting too expensive for some researchers

Science, Catherine Matacic


It was 2015 when Gary Simons knew that something had to change. That was the year spare funds started to dry up at the Summer Institute of Linguistics (SIL), a Bible translation group that helped revolutionize the documentation of endangered languages in the mid–20th century. SIL’s budget had long supported Simons’s passion project: Ethnologue—or “the Ethnologue” as many researchers call it—a massive online database considered by many to be the definitive source for information on the world’s languages.

Artificial Intelligence and Public Standards: Committee publishes report

GOV.UK, Committee on Standards in Public Life


The Committee on Standards in Public Life today published its report and recommendations to the Prime Minister to ensure that high standards of conduct are upheld as technologically assisted decision making is adopted more widely across the public sector.

The Committee also published new polling on public attitudes to AI.

Researchers develop artificial intelligence that can understand social cues

University of Arizona, The Daily Wildcat student newspaper, Gabriella Cobian


The Defense Advanced Research Projects Agency awarded a $7.5 million grant to researchers at the University of Arizona to develop artificial intelligence that can comprehend social signals and human exchanges.

Researchers plan to study the AI in a video game format, with human players.


Berkeley EECS Annual Research Symposium (BEARS) 2020

University of California-Berkeley, EECS Department


Berkeley, CA February 13, starting at 9 a.m., International House (2299 Piedmont Ave.) “This day-long symposium will highlight the latest research in the EECS department and will feature a slate of informative talks by distinguished EECS faculty members and directors of some of our world-renowned centers and labs.” [registration required]

Jill Lepore, Historian, The End of Knowledge: From Facts to Data – Avenali Lecture

University of California-Berkeley, Townsend Center for the Humanities


Berkeley, CA February 19, starting at 5 p.m., 315 Wheeler Hall at University of California-Berkeley. “In her Avenali Lecture, ‘The End of Knowledge: From Facts to Data,’ Lepore traces the shifting form and purpose of elemental units of knowledge. Situating the current crisis over the ‘death of the fact’ within a long historical arc, she argues that facts were replaced by numbers which have since been replaced by data — with consequences not only for how we know what we know, but for how we form (or dismantle) political communities.”

Boston GHC Scholarship Applications Informational Community network


Cambridge, MA February 18, starting at 6 p.m., Microsoft New England Research and Development (NERD) Center (One Memorial Drive). “Scholarships for the 2020 Grace Hopper Celebration are now open! Please join the Boston chapter to hear from former GHC Scholarship application reviewers Mariana Carvalho and Melissa Greenlee.” [registration required]


Data science and machine learning conferences (NeurIPS, ICML, AISTATS, ICLR, UAI, …): Allow remote paper & poster presentations at conferences

Sign the Petition – Allow remote paper & poster presentations at scientific conferences

Technology Transfer Days (TTD) 2020 – Open Call for Speakers

New York, NY April 27-May 1 at Microsoft Reactor. “Submit your talk, lightning talk, workshop, demo, tech talk, mentors round table, workshop, challenge, or intro talks.” Deadline for speaker submissions is March 15.

Tools & Resources

CCMatrix: A billion-scale bitext data set for training translation models

Facebook Artificial Intelligence, Holger Schwenk and Armand Joulin


CCMatrix is the largest data set of high-quality, web-based bitexts for training translation models. With more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public data set, CCMatrix is more than 50 times larger than the WikiMatrix corpus that we shared last year. Gathering a data set of this size required modifying our previous bitext mining approach used for WikiMatrix, assuming that the translation of one sentence could be found anywhere on CommonCrawl, which functions as an open archive of the internet. To address the significant computational challenges posed by comparing billions of sentences to determine which ones are mutual translations, we used massively parallel processing, as well as our highly efficient FAISS library for fast similarity searches.
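The general idea of pairing sentences by similarity can be illustrated with a toy sketch. This is not Facebook's pipeline: the three-dimensional "embeddings" below are made-up values, the function names are hypothetical, and a brute-force comparison stands in for an indexed similarity search such as the one FAISS provides. A pair is kept only when each sentence is the other's closest match.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_match(query, candidates):
    """Index of the candidate vector most similar to the query."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))


def mine_pairs(src_vecs, tgt_vecs):
    """Keep (i, j) only when i and j are each other's nearest neighbour."""
    pairs = []
    for i, vec in enumerate(src_vecs):
        j = best_match(vec, tgt_vecs)
        if best_match(tgt_vecs[j], src_vecs) == i:
            pairs.append((i, j))
    return pairs


# Hypothetical sentence embeddings; real systems use learned, high-dimensional vectors.
english = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.5, 0.5, 0.5]]
french = [[0.1, 0.9, 0.3], [0.9, 0.2, 0.1], [0.0, 0.1, 1.0]]

print(mine_pairs(english, french))
```

Brute force compares every source sentence with every target sentence, which is why a corpus of billions of sentences needs parallel processing and a dedicated similarity-search library.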

Contextualise: Manage Your Knowledge

GitHub – brettkromkamp


Contextualise is a simple and flexible tool particularly suited for organising information-heavy projects and activities consisting of unstructured and widely diverse data and information resources — think of investigative journalism, personal and professional research projects, world building (for books, movies or computer games) and many kinds of hobbies.

How can we manage distractions in the workplace?

World Economic Forum, Quartz, Heather Landy


There are all kinds of apps available now to help you plan, hack, and track your workday. That’s on the forethought side. And on the follow-through side, one of the biggest roadblocks has all but disappeared. “It used to be that I couldn’t do what I said I was going to do because I didn’t know how to do it,” says author and angel investor Nir Eyal. “Today that’s not really an excuse anymore. If you don’t know, you Google it.”

What we haven’t addressed is the flip side to figuring out how to do more, which is figuring out how to do less. Or, as Eyal puts it, “We haven’t learned how to stop getting distracted.”

ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters

Microsoft Research, DeepSpeed Team, Rangan Majumder and Junhua Wang


The latest trend in AI is that larger natural language models provide better accuracy; however, larger models are difficult to train because of cost, time, and ease of code integration. Microsoft is releasing an open-source library called DeepSpeed, which vastly advances large model training by improving scale, speed, cost, and usability, unlocking the ability to train 100-billion-parameter models. DeepSpeed is compatible with PyTorch. One piece of that library, called ZeRO, is a new parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained. Researchers have used these breakthroughs to create Turing Natural Language Generation (Turing-NLG), the largest publicly known language model at 17 billion parameters, which you can learn more about in this accompanying blog post.

The Zero Redundancy Optimizer (abbreviated ZeRO) is a novel memory optimization technology for large-scale distributed deep learning. ZeRO can train deep learning models with 100 billion parameters on the current generation of GPU clusters at three to five times the throughput of the current best system. It also presents a clear path to training models with trillions of parameters, demonstrating an unprecedented leap in deep learning system technology. We are releasing ZeRO as part of DeepSpeed, our high-performance library for accelerating distributed deep learning training.
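A back-of-the-envelope sketch shows why splitting model state across devices matters at this scale. The excerpt gives no memory breakdown, so the figures here are assumptions: 16 bytes of model state per parameter (a common estimate for mixed-precision Adam training), an even split across devices, and a hypothetical cluster of 400 GPUs. Activations and temporary buffers are ignored.

```python
def model_state_gb(num_params, bytes_per_param=16, num_devices=1):
    """Per-device model-state memory in GB when states are split evenly.

    bytes_per_param=16 is an assumed estimate for mixed-precision Adam:
    2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 optimizer states).
    """
    return num_params * bytes_per_param / num_devices / 1e9


params = 100e9  # 100 billion parameters, the scale cited in the excerpt

# Every device holds a full copy of the model state.
replicated = model_state_gb(params)

# Model state partitioned evenly over a hypothetical 400-GPU cluster.
partitioned = model_state_gb(params, num_devices=400)

print(f"replicated:  {replicated:.0f} GB per device")
print(f"partitioned: {partitioned:.0f} GB per device")
```

Under these assumptions a fully replicated copy would need about 1,600 GB per device, far beyond any single GPU, while an even split over 400 devices would need about 4 GB each.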



Post Graduate Research Position

U.S. Army Research Laboratory, Human Research and Engineering Directorate; Aberdeen, Maryland

Full-time, non-tenured academic positions

Research Officer – Data and AI Ethics

London School of Economics, Department of Media and Communications; London, England

Software Engineer

California Institute of Technology, Climate Modeling Alliance; Pasadena, CA

Full-time positions outside academia

Computer Vision and Deep Learning Researcher

Metrica Sports; Amsterdam, Netherlands

Product Manager

Bowery Farming; New York, NY
