Data Science newsletter – September 30, 2019

Newsletter features journalism, research papers, events, tools/software, and jobs for September 30, 2019


Data Science News

Google Research India: an AI lab in Bangalore

Google AI, Jay Yagnik


People have been using technology to solve problems and improve their quality of life for centuries, from sharing knowledge with the printing press to going online to build a small business. These days, artificial intelligence is opening up the next phase of technological advances. And with its world-class engineering talent, strong computer science programs and entrepreneurial drive, India has the potential to lead the way in using AI to tackle big challenges. In fact, there are already many examples of this happening in India today: from detecting diabetic eye disease to improving flood forecasting and teaching kids to read.

To take this to the next level we’ve created Google Research India—an AI lab we’re starting in Bangalore. This team will focus on two pillars: First, advancing fundamental computer science and AI research by building a strong team and partnering with the research community across the country. Second, applying this research to tackle big problems in fields like healthcare, agriculture, and education while also using it to make apps and services used by billions of people more helpful.

Enhancing Data Sharing, One Dataset at a Time

National Institutes of Health, National Library of Medicine, Susan Gregurick


The National Institutes of Health (NIH) has an ambitious vision for a modernized, integrated biomedical data ecosystem. How we plan to achieve this vision is outlined in the NIH Strategic Plan for Data Science, and the long-term goal is to have NIH-funded data be findable, accessible, interoperable, and reusable (FAIR). To support this goal, we have made enhancing data access and sharing a central theme throughout the strategic plan.

While the topic of data sharing itself merits greater discussion, in this post I’m going to focus on one primary method for sharing data, which is through domain-specific and generalist repositories.

The landscape of biomedical data repositories is vast and evolving. Currently, NIH supports many repositories for sharing biomedical data. These data repositories all have a specific focus, either by data type (e.g., sequence data, protein structure, continuous physiological signals) or by biomedical research discipline (e.g., cancer, immunology, or clinical research data associated with a specific NIH institute or center), and often form a nexus of resources for their research communities. These domain-specific, open-access data-sharing repositories, whether funded by NIH or other sources, are good first choices for researchers, and NIH encourages their use.

Google contractors vote to unionize in Pittsburgh

Slate, April Glaser


On Tuesday afternoon, a group of tech professionals in Pittsburgh who work at Google, but not for Google, voted to form a union. It’s likely the first time that white-collar workers in the technology industry have done so. The workers are employed by the Indian-owned firm HCL America. Forty-nine of them voted yes in the election, which they’ve asked the National Labor Relations Board to certify. Twenty-four voted against it.

The contractors work on the Google Shopping platform, in the same Pittsburgh offices as full-time Googlers directly employed by the company.

Amazon, Microsoft, Salesforce, and others launch initiative to bring multiple voice assistants to devices

VentureBeat, Kyle Wiggers


Amazon and more than 30 other industry partners hope to give consumers greater choice in voice services. To this end, they together announced the Voice Interoperability Initiative today, a new program to ensure that voice-enabled products like smart speakers and smart displays provide users with “choice and flexibility” through multiple, interoperable intelligent assistants.

Contributing Data to Deepfake Detection Research

Google AI Blog, Nick Dufour and Andrew Gully


As we published in our AI Principles last year, we are committed to developing AI best practices to mitigate the potential for harm and abuse. Last January, we announced our release of a dataset of synthetic speech in support of an international challenge to develop high-performance fake audio detectors. The dataset was downloaded by more than 150 research and industry organizations as part of the challenge, and is now freely available to the public.

Today, in collaboration with Jigsaw, we’re announcing the release of a large dataset of visual deepfakes we’ve produced that has been incorporated into the Technical University of Munich and the University Federico II of Naples’ new FaceForensics benchmark, an effort that Google co-sponsors.

$100M health initiative aims to democratize data science

Devex, Catherine Cheney


On Wednesday, the Rockefeller Foundation announced a new effort to prevent 6 million maternal and child deaths in 10 countries by 2030.

Launched on the sidelines of the United Nations General Assembly, and on the heels of the U.N. High-Level Meeting on Universal Health Coverage, the $100 million Precision Public Health initiative aims to ensure that frontline health workers have access to data science tools such as predictive analytics, artificial intelligence, and machine learning.

Aptiv’s $4B joint venture with large automaker to be based in Boston

Boston Business Journal, Julia Mericle


Self-driving vehicle company Aptiv (NYSE: APTV) announced a partnership with Hyundai Motor Group.

Karl Iagnemma, president of Boston-based Aptiv Autonomous Mobility, will lead the joint venture, which will be headquartered in Boston. Aptiv’s Seoul office will serve as a key technology and testing center for the partnership, as well.

Aptiv, which is headquartered in Dublin, and Hyundai Motor Group will each take a 50 percent ownership stake in the $4 billion joint venture.

Embark Trucks raises $70M Series C, readies self-driving truck hubs

The Robot Report


There are approximately 2 million truck drivers in the U.S. today, but there is a shortfall of 60,000 drivers, largely due to low unemployment and a preference among younger workers to stay close to home, stated Embark Trucks. The company claimed that the trucking industry is worth close to $800 billion in the U.S. alone, more than the entire global software industry.

As a result, several companies are developing driverless trucks. They include Ike, which raised a $52 million Series A in February; Starsky Robotics, whose Series A in February was $16.5 million; Torc Robotics, now majority-owned by Daimler and testing in Virginia; Kodiak Robotics in Texas; and TuSimple, which last month got funding from UPS.

Scientific Societies Update Policies to Address #MeToo

The Scientist Magazine®, Diana Kwon


When a #MeToo controversy roiled an archaeology meeting in April, Twitter erupted with angry posts. Scientists took to social media to denounce the Society for American Archaeology for failing to respond immediately to the presence of a professor, banned from his own campus after being credibly accused of sexual harassment, at its annual conference. That same month, a group of scholars called for a boycott of a meeting organized by the European Society for the study of Human Evolution due to the organization’s inaction in response to sexual harassment allegations against its president.

In both cases, academics took matters involving sexual harassment into their own hands when they felt conference organizers failed to address them. Such actions are part of the broader #MeToo movement in the sciences, which has led to investigations into alleged harassers, lawsuits against universities for how they’ve addressed reports of misconduct, and a broader call for systemic change.

Honored to lead the new Center for Science of Science and Innovation.

Twitter, Dashun Wang


Look fwd to working with my colleagues and students to better establish @KelloggSchool global thought leadership in this field, and help grow our community at NU and beyond.

To truly reform the criminal justice system, CA must stop lagging on transparency

Sacramento Bee, Opinion, Jeff Reisig


Last week, Assembly Bill 1331 passed the California Legislature and is on its way to Gov. Gavin Newsom’s office, a much-needed step toward improving the state’s data and bringing some long-overdue transparency to its criminal justice system. Here’s why: Better data means better outcomes for the thousands of people who encounter that system every day, as well as improved public safety overall.

Global initiative aims to use AI to improve public health in developing world

STAT, Rebecca Robbins


The initiative will partner with organizations including Medic Mobile, a nonprofit that builds software for frontline health workers and, in so doing, has collected potentially valuable health data in Uganda and other countries, Mitchell said.

One of the Rockefeller Foundation’s biggest concerns as it embarks on the effort? The quality of the data it’s working with, said Dr. Naveen Rao, the foundation’s senior vice president of health. Rao said the initiative is enlisting the help of data scientists to examine questions such as: “How can we make sure that our models are based on quality data that we believe actually is real? And how can we test it in real time?”

New federal rules limit police searches of family tree DNA databases

Science, Jocelyn Kaiser


The U.S. Department of Justice (DOJ) released new rules yesterday governing when police can use genetic genealogy to track down suspects in serious crimes—the first-ever policy covering how these databases, popular among amateur genealogists, should be used in law enforcement attempts to balance public safety and privacy concerns.

The value of these websites for law enforcement was highlighted last year when Joseph DeAngelo was charged with a series of rapes and murders that had occurred decades earlier. Investigators tracked down the suspect, dubbed the Golden State Killer, by uploading a DNA profile from a crime scene to a public ancestry website, identifying distant relatives, then using traditional genealogy and other information to narrow their search. The approach has led to arrests in at least 60 cold cases around the country.

But these searches also raise privacy concerns. Relatives of those in the database can fall under suspicion even if they have never uploaded their own DNA.


CADE Data Ethics Seminar: The promises, and dangers of AI. Brian Brackeen

Meetup, Rachel Thomas


San Francisco, CA October 4, starting at 12:30 p.m., part of University of San Francisco Seminar Series in Data Science. [rsvp required]

CADE Tech Policy Workshop

University of San Francisco, Center for Applied Data Ethics


San Francisco, CA November 16-17. The USF Center for Applied Data Ethics will be hosting a Tech Policy Workshop the weekend of Nov 16 to 17. Systemic problems, such as increasing surveillance, spread of disinformation, concerning uses of predictive policing, and the magnification of unjust bias, all require systemic solutions. [registration required]


Apply by Oct. 8 for the 2020-2021 AAAS Leshner Leadership Institute in Artificial Intelligence!

“This program convenes mid-career scientists who demonstrate leadership in their research careers and in promoting meaningful dialogue between science and society.” Deadline for applications is October 8.

MinneMUDAC 2019: Student Data Science Challenge

“The Challenge: Student teams have several weeks to analyze data before presenting their findings to judges from the analytics community at the main event Nov. 9. Teams with the highest scores move on to the finals round in the auditorium. Cash prizes are awarded to top teams in each division.” Teams will present findings on November 9.

Tools & Resources

How to Interpret What Your AI Engineer is Telling You

Medium, OSDC


You’ve hired your first AI engineer, and communication has been…tricky. It’s not that you can’t get things accomplished or that the work has been shoddy. Instead, it seems like you’re talking around each other in meetings and status updates. The good news is that you can learn what your AI engineer is really saying to you by remembering why you hired that position in the first place. Get your pipeline back on track by understanding what these common sentences mean.

Everything a Data Scientist Should Know About Data Management*

TOPBOTS, Phoebe Wong


To be a real “full-stack” data scientist, or what many bloggers and employers call a “unicorn,” you have to master every step of the data science process, all the way from storing your data to putting your finished product (typically a predictive model) in production. But the bulk of data science training focuses on machine/deep learning techniques; data management knowledge is often treated as an afterthought. Data science students usually learn modeling skills with processed and cleaned data in text files stored on their laptop, ignoring how the data sausage is made. Students often don’t realize that in industry settings, getting the raw data from various sources to be ready for modeling is usually 80% of the work.

rlpyt: A Research Code Base for Deep Reinforcement Learning in PyTorch

The Berkeley Artificial Intelligence Research Blog, Adam Stooke


We are pleased to share rlpyt, which leverages this commonality to offer all three algorithm families built on a shared, optimized infrastructure, in one repository. Available from BAIR, it contains modular implementations of many common deep RL algorithms in Python using PyTorch, a leading deep learning library. Among numerous existing implementations, rlpyt is a more comprehensive open-source resource for researchers.

rlpyt is designed as a high-throughput code base for small- to medium-scale research in deep RL.

Containers Return Work-Life Balance Back to Developers

The New Stack, Sanjay Chandru


We’re finding that containers make the lives of developers significantly easier. Suddenly, many of the time-consuming manual processes involved in the application life cycle fall away. Developers spend more time developing and less time preparing, and the time saved in preparing means that development time is spent more creatively solving the tougher problems presented by deploying enterprise-wide processes.

At the heart of this notion is the idea of being subtractive—and I mean that in a positive way. When you give the chance for your developers to be subtractive; that is, to remove the need to repeat unnecessary tasks, you create a work environment that is more efficient and allows your team to move much quicker.

New Dryad is Here

Dryad Digital Repository


The Dryad team has worked over the past year to understand what features are required to best support the research community’s ever-evolving needs. We are proud to announce the launch of our new Dryad platform and we are excited to share with the research community the enhancements that we have made!


Tenured and tenure track faculty positions

Cluster hire of Tenure Track Faculty in Urban Science and Engineering

New York University, Tandon School of Engineering; Brooklyn, NY

Open Faculty Position in Complex Systems

University of Michigan, Center for the Study of Complex Systems; Ann Arbor, MI

Full-time positions outside academia

Sr. Manager, Data Engineering, Marketing Data Platforms

Capital One; McLean, VA
