Data Science newsletter – January 16, 2020

Data Science Newsletter features journalism, research papers, events, tools/software, and jobs for January 16, 2020


Data Science News

Largest gaseous structure ever seen in our galaxy is discovered

Harvard Gazette


Astronomers at Harvard University have discovered a monolithic, wave-shaped gaseous structure — the largest ever seen in our galaxy — made up of interconnected stellar nurseries. Dubbed the “Radcliffe Wave” in honor of the collaboration’s home base, the Radcliffe Institute for Advanced Study, the discovery transforms a 150-year-old vision of nearby stellar nurseries as an expanding ring into one featuring an undulating, star-forming filament that reaches trillions of miles above and below the galactic disk.

The work, published in Nature, was enabled by a new analysis of data from the European Space Agency’s Gaia spacecraft, launched in 2013 with the mission of precisely measuring the position, distance, and motion of the stars.

Clemson mathematician helps deepen understanding of Earth’s mysterious mantle

Clemson University, Newsstand


College of Science researcher Timo Heister is part of a multi-institutional team of Earth scientists and mathematicians that recently received a $2.5 million National Science Foundation grant to develop a new framework for integrated geodynamic models that will provide realistic simulations from our planet’s mantle boundary to its surface.

“Most physical phenomena can be described by partial differential equations that explain energy balances or loss,” said Heister, an associate professor of mathematical sciences who will receive $393,000 of the overall funding. “My geoscience colleagues will develop the equations to describe the phenomena and I’ll write the algorithms that solve their equations quickly and accurately.”

Warning: I just got an invite to “Fixing Science” conference from NAS. This is not @theNASEM! It’s @NASorg, a political organization that has weaponized reproducibility to derail science policy.

Twitter, Lenny Teytelman


They are clever and dangerous.

Facebook doubles down on division

Revue, The Interface newsletter, Casey Newton


In October, Facebook made the controversial decision to exempt most political ads from fact-checking. The announcement met with a swift backlash, particularly among leading Democratic candidates for president. As criticism mounted, Facebook began to hint that it would further refine its policy to address lawmakers’ concerns. One change that seemed likely was to limit the ability of candidates to use the company’s sophisticated targeting tools, particularly after hundreds of employees wrote an open letter to Mark Zuckerberg asking for it.

On Thursday, Facebook unveiled the refinements to its policy that it had been promising. But restrictions on targeting were nowhere to be found. Instead, the company doubled down on its current policy, and said the only major change in 2020 would be to allow users to see “fewer” ads. (Fewer than what? It didn’t say.)

Challenges to the Reproducibility of Machine Learning Models in Health Care

JAMA Network; Andrew L. Beam, Arjun K. Manrai, Marzyeh Ghassemi


The discussion around reproducibility and replication has primarily focused on traditional statistical models and the results from randomized clinical trials, but these considerations can and should apply equally to machine learning studies. Challenges to reproducibility and replication include confounding, multiple hypothesis testing, randomness inherent to the analysis procedure, incomplete documentation, and restricted access to the underlying data and code. The last concern, data access, is especially germane for medicine, as privacy barriers are important considerations for data sharing. However, by definition, replication does not require access to the original data or code because a replication exercise examines the extent to which the original phenomenon generalizes to new contexts and new populations.

This Viewpoint focuses on reproducibility, even though it is important to acknowledge that replication is often the ultimate goal. Replication is especially important for studies that use observational data (which is almost always the case for machine learning studies) because these data are often biased, and models could operationalize this bias if not replicated. Reproducing a machine learning model trained by another research team can be difficult, perhaps even prohibitively so, even with unfettered access to raw data and code.
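
One of the challenges listed above, randomness inherent to the analysis procedure, can be illustrated with a small sketch. The example below is not from the Viewpoint; the function name `bootstrap_mean`, the toy data, and the seed values are all hypothetical, and a bootstrap estimate stands in for any randomized analysis step. It shows that a fixed seed makes the result repeatable, while a different seed will usually shift it.

```python
import random
import statistics


def bootstrap_mean(values, n_resamples=200, seed=None):
    """Average of resampled means; the result depends on the random draws."""
    # seed=None lets Python seed from system entropy, so unseeded runs differ.
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the original sample size.
        sample = [rng.choice(values) for _ in values]
        means.append(statistics.fmean(sample))
    return statistics.fmean(means)


# Hypothetical toy measurements, not data from the Viewpoint.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9]

run_a = bootstrap_mean(data, seed=42)
run_b = bootstrap_mean(data, seed=42)
run_c = bootstrap_mean(data, seed=7)

print(run_a == run_b)  # same seed, same draws, same estimate
print(run_a, run_c)    # a different seed will usually give a slightly different estimate
```

Recording the seed alongside the code and data removes only this one source of variation; the other challenges named above, such as incomplete documentation and restricted data access, are untouched by it.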

Top tech grads are increasingly unwilling to work for Big Tech, viewing it as a new, unethical Wall Street

Boing Boing, Cory Doctorow


About five years ago, I was trying to get a bunch of Big Tech companies to take the right side of an urgent online civil rights fight, and I called an old friend who was very senior at one of the biggest tech companies in the world; they told me that it wasn’t going to work, in part because the kinds of people who were coming to tech were there because they wanted to get as rich as possible, no matter what they had to do. My friend contrasted this with earlier eras — even the dotcom bubble — when the financial motive was blended with a genuine excitement for the transformative potential of tech to make a fairer, more equitable world. Now, my friend said, the kind of kid who would have gotten an MBA was instead getting an electrical engineering or computer science degree — not out of any innate love for the subject, but because that was a path to untold riches.

But things are changing. Not only are young people far more skeptical of capitalism and concerned that it will annihilate the human race, but the tech companies’ masks have slipped, revealing their willingness to supply ICE and the Chinese government alike, to help the oil industry torch the planet, and to divert their fortunes to supporting white nationalist causes. Companies that tout their ethical center have harbored and even rewarded serial sexual predators and busted nascent union movements.

US Bureau of Labor Statistics plans data center move

Data Centre Dynamics, Sebastian Moss


The United States Bureau of Labor Statistics plans to move its Washington, D.C. headquarters and data center ahead of its current lease expiring.

In late December, President Trump signed the Further Consolidated Appropriations Act, 2020, covering funding for certain federal agencies for the year ahead. Among the projects planned is the BLS’s office and data center move.

Springville police consider partnering with artificial intelligence company to solve crimes more quickly

The Daily Herald (Provo, UT), Connor Richards


The amount of time it takes for emergency responders to assist a victim or for police to find an abducted child could soon drop drastically in Springville as the city’s police department considers partnering with an artificial intelligence company aimed at helping law enforcement work more efficiently.

Banjo, a technology company based in Park City, gathers real-time data from various sources — 911 dispatch calls, traffic cameras, emergency alarms, social media posts — and synthesizes the information in a way that lets police respond to emergencies significantly more quickly than they would otherwise be able to.

Lawmakers introduce bill to bolster artificial intelligence, quantum computing

TheHill, Emily Birnbaum


Key senators on Tuesday introduced a bill to bolster U.S. investments in artificial intelligence, quantum computing and next-generation wireless technology, an effort to ease the country’s transition into the “industries of the future.”

The legislation, from bipartisan members of the Senate Commerce Committee including Chairman Roger Wicker (R-Miss.), comes amid a larger push from the White House and presidential adviser Ivanka Trump to invest more government resources in emerging technologies like quantum and artificial intelligence, which are set to reshape how the country works and interacts.

Yale researchers identify fossils using machine learning

Yale Daily News, Jessica Pevner


In recent years, machine learning has risen in popularity as an exciting frontier in data science. This type of statistical technique can be leveraged to gain insight in all sorts of applications, from suggesting the perfect song on Spotify to predicting the weather.

The newest application? Classifying plankton fossils. Recently, a Yale-led team created a first-of-its-kind machine learning model that can identify the species of almost 7,000 plankton fossil images.

The model works extremely well — even better than human experts.

“Our best-performing model gets the answer right 87.4 percent of the time, which is better than our average human expert accuracy,” said Allison Hsiang, who is a postdoctoral researcher at the GeoBio-Center at Ludwig Maximilian University of Munich and the lead author of the study.

What the struggles of pizza and coffee-making robots mean for investors

Fortune, Jonathan Vanian


In recent weeks, two high-profile robotics startups specializing in food preparation have both significantly slashed costs. Automated coffee shop Café X shut down three San Francisco-based stores, laid off some staff, and is now focusing on two robotic cafes in airports, Axios reported. Meanwhile, Zume Pizza fired over 200 employees and has pivoted from pizza-making robots to creating sustainable packaging for food, Business Insider reported.

Imagine a world without hunger, then make it happen with systems thinking

Nature, Editorial


Take nutrition. In its latest report on global food security, the United Nations Food and Agriculture Organization says that the number of undernourished people in the world has been rising since 2015, despite great advances in nutrition science. For example, tracking of 150 biochemicals in food by the US Department of Agriculture and various databases has been important in revealing the relationships between calories, sugar, fat, vitamins and the occurrence of common diseases. But using machine learning and artificial intelligence, network scientist Albert László Barabási at Northeastern University in Boston, Massachusetts, and his colleagues propose that human diets consist of at least 26,000 biochemicals — and that the vast majority are not known (Nature Food 1, 33–37; 2020). This shows that we have some way to travel before achieving the first objective of systems thinking — which, in this example, is to identify more components of the nutrition system.

5G Security

Bruce Schneier


The security risks inherent in Chinese-made 5G networking equipment are easy to understand. Because the companies that make the equipment are subservient to the Chinese government, they could be forced to include backdoors in the hardware or software to give Beijing remote access. Eavesdropping is also a risk, although efforts to listen in would almost certainly be detectable. More insidious is the possibility that Beijing could use its access to degrade or disrupt communications services in the event of a larger geopolitical conflict. Since the internet, especially the “internet of things,” is expected to rely heavily on 5G infrastructure, potential Chinese infiltration is a serious national security threat.

But keeping untrusted companies like Huawei out of Western infrastructure isn’t enough to secure 5G. Neither is banning Chinese microchips, software, or programmers. Security vulnerabilities in the standards, the protocols and software for 5G, ensure that vulnerabilities will remain, regardless of who provides the hardware and software. These insecurities are a result of market forces that prioritize costs over security and of governments, including the United States, that want to preserve the option of surveillance in 5G networks. If the United States is serious about tackling the national security threats related to an insecure 5G network, it needs to rethink the extent to which it values corporate profits and government espionage over security.

Throw Out Your Assumptions About Whistleblowing

Harvard Business Review, Kyle Welch and Stephen Stubben


Our research on employee whistleblowing, using previously unavailable data, shows for the first time that we may be in the golden age of accountability systems. In 2018, NAVEX Global, the leading provider of employee hotline and incident management systems, provided us secure, anonymized access to more than 2 million internal reports made by employees of more than 1,000 publicly traded U.S. companies.

Our study of the data led us to two important findings: First, whistleblowers are crucial to keeping firms healthy. The average manager seems to take these reports seriously and uses them to learn of and address issues early, before they evolve into larger, more costly problems. We also found that secondhand reports are more credible and more valuable, on average, than firsthand reports.


Can We Build Social Media that’s Good for Society?

University of Massachusetts, College of Social and Behavioral Sciences and the College of Information and Computer Sciences


Amherst, MA January 23, starting at 4 p.m. Speaker: Ethan Zuckerman. [free, registration required]

Who Can We Trust? Technology’s Impact on Democracy

Town Hall Seattle, UW Center for an Informed Public, UW Communication Leadership Program, KUOW


Seattle, WA January 23, starting at 7:30 p.m., Town Hall Seattle (1119 Eighth Avenue). “The University of Washington’s Center for an Informed Public explores questions and solutions for building our trust in modern media. Participate in an active dialogue that covers ideas and solutions from the community, and hear from researchers on the cutting edge of information and communication.” [$]


Rice Business Plan Competition

“The World’s Richest and Largest Student Startup Competition” Deadline for application is January 27.

ASA Symposium on Data & Statistics

Pittsburgh, PA June 3-6. Deadline for abstract submissions is January 30.

Data Science for Social Good, Summer 2020, Applications are Open – Data Science 101

“The Data Science for Social Good Summer Fellowship, now hosted at Carnegie Mellon University, is accepting applications. This is a 12-week program to train data scientists about working on projects which positively impact society. There are a number of roles available.” Deadline for applications is January 31.

Indy Autonomous Challenge

“Program an automated Dallara IL-15 Indy Lights race car to outrace and outmaneuver fellow innovators in a head-to-head, high-speed race at the Indianapolis Motor Speedway, the world’s most famous racetrack.” Deadline for Round 1 registration is February 28.


Tools & Resources

Use IEEE DataPort to Share Your Research Data Sets

IEEE Spectrum, Melissa Handa


IEEE DataPort made its debut to the public last year, and to date more than 200,000 people have used the Web-based platform, uploading more than 1,000 data sets. Developed and supported by IEEE, the product allows researchers to store, share, access, and manage their research data sets in a single trusted location.


Full-time positions outside academia

VP of Data Science & Analytics

Mozilla; Mountain View, CA

Research Scientist

Bay Area Environmental Research Institute; Moffett Field, CA

Senior Software Engineer

Columbia University, The Brown Institute for Media Innovation; New York, NY

Staff Editor – Statistical Modeling

The New York Times; New York, NY


Internships and other temporary positions

Summer paid research internships

University of Southern California, Information Sciences Institute; Marina Del Rey, CA
