Data Science newsletter – April 3, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for April 3, 2018


Data Science News

Shulkin out before signing Cerner contract

POLITICO, Morning eHealth, Darius Tahir


After weeks of rumors, David Shulkin is out from atop the Department of Veterans Affairs, defenestrated by a presidential tweet. Donald Trump is nominating Rear Admiral Ronny Jackson, the White House doctor, as the agency secretary, with Robert Wilkie, the Department of Defense’s under secretary for personnel and readiness, acting as interim replacement.

The decision has potentially vast implications for health IT at Veterans Affairs, which has pioneered electronic health records and telemedicine. Shulkin was — apparently soon — going to sign a contract with Cerner estimated at $16 billion. What effects his firing will have on the contract are uncertain. Cerner referred questions to the VA.

New twist in Shulkinology

POLITICO, Morning eHealth, Darius Tahir


The Shulkin saga didn’t subside over the weekend, as the now-former VA secretary insisted that he was fired, while the White House is saying he quit. The distinction is more than a prideful one, and it might have an impact on the VA’s long-running EHR contract process. As our colleague Andrew Restuccia reports, an obscure law called the Vacancies Act gives the president pretty expansive powers to fill, well, vacancies in a federal agency with an acting replacement. But those powers only clearly apply when the void is created by resignation; it’s unclear whether any acting official can be legally appointed in this fashion when a spot opened up because of a firing.

And here’s where the EHR deal comes in. Let’s say acting VA secretary Robert Wilkie decides to sign the Cerner contract. It’s theoretically possible that a competitor could sue the VA and argue it wasn’t validly signed, because Wilkie shouldn’t be the acting VA secretary. And, for what it’s worth, there’s already litigation arguing the no-bid EHR award wasn’t proper in the first place. So, putting two and two together, we might see some novel legal territory explored.

Gary King on Big Data Analysis

SAGE Research Methods, Dave Edmonds


In this conversation, King uses text analysis as an example of big data analytics. Social media has likely brought with it the largest increase in the expressive capacity of the human race in the history of the world. Roughly 650 million social media messages are produced every day. So, for someone trying to make statements about what those messages contain, would having 750 million messages make anything better? “Having bigger data,” King says, “only makes things more difficult.” The real innovation is in the ways of analysing those data. [audio, 26:10]

Penn State-developed plant-disease app recognized by Google

Penn State University, Penn State News


A mobile app designed by Penn State researchers to help farmers and others diagnose crop diseases has earned recognition from one of the world’s tech giants.

PlantVillage, developed by a team led by David Hughes, associate professor of entomology and biology, was the subject of a keynote video presented at Google’s TensorFlow Developer Summit 2018, held March 30 in Mountain View, California. The event brought together a diverse mix of machine learning users from around the world for a full day of technical talks, demonstrations and conversation with the TensorFlow team and community.

Inside Antarctica: the continent whose fate will affect millions

Financial Times, Pilita Clark


The struggle to understand a continent whose fate affects millions of people worldwide, yet is fearsomely hard to study.

Science has long played an outsized role in Antarctica. Nations wishing to help run the continent, which has no indigenous people or central government, have had to prove their commitment to scientific research since the Antarctic Treaty came into force in 1961, turning the remote white expanse into a gigantic natural laboratory.

Antarctic scientists discovered the hole in the ozone layer, along with ice cores that shed new light on the planet’s climate history. Yet for most of the 20th century, Antarctica was widely thought to be frozen in time.

Knowledge Applied podcast: Luis Bettencourt

UChicago News


On this episode of Knowledge Applied, we talk with Bettencourt about how he’s combining science and policy and using data to capture “the magic of cities for the common good.” [audio, 14:05]

Inria takes part in PRAIRIE Institute launch



CNRS, Inria and PSL University, together with Amazon, Criteo, Facebook, Faurecia, Google, Microsoft, NAVER LABS, Nokia Bell Labs, PSA Group, SUEZ and Valeo, are joining forces, pooling academic and industrial perspectives, to create the PRAIRIE Institute in Paris, whose objective is to become an international reference in the field of artificial intelligence.

Google appoints veteran engineers to lead Search, Artificial Intelligence amid ‘AI first’ push

9to5Google, Abner Li


Two years ago, Google appointed its head of artificial intelligence to lead Search in a move that reflected the future of the company. Today, John Giannandrea is stepping down from those positions, with Google veterans Ben Gomes and Jeff Dean taking over.

Appointed in early 2016, Giannandrea served as senior vice president of engineering and joined Google in 2010 following the acquisition of Metaweb Technologies. Metaweb’s technology later became the Knowledge Graph, which powers what Assistant and Search “know” when queried.

As reported by The Information today and confirmed by the company, his role is being split between two longtime Googlers. Jeff Dean will lead Google’s AI efforts, with the 19-year veteran and widely revered engineer continuing to lead Google Brain — the company’s internal machine learning research team.

How Grubhub Analyzed 4,000 Dishes to Predict Your Next Order

WIRED, Business, Adam Rogers


With 14.5 million active users ordering from 80,000 restaurants, Grubhub data ought to be able to tell you a lot about food. Grubhub CEO Matt Maloney wanted to be able to segment, quantify, and compare who was ordering what across neighborhoods and cities. He wanted to algorithmically recommend dishes, help restaurants optimize their food choices, attract new customers with slicker service, and frankly get customers all over the country to act more like New Yorkers, who order from somewhere at least once a week.

Today Grubhub does indeed have an algorithm that can look across a country’s worth of take-out orders and tell a user what Indian joint near them delivers the most popular chicken tikka masala. But getting there required solving a seemingly impossible data problem, some high-end machine learning, and a cookbook author from Brooklyn.
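The dish-ranking side of this could be sketched as a simple popularity count per cuisine and neighborhood. The data and names below are hypothetical, and Grubhub’s actual system (which also had to match dish names across wildly inconsistent menus) is far more involved:

```python
from collections import Counter

def top_dishes(orders, cuisine, area, k=3):
    """Rank dishes of a given cuisine within an area by order count.

    `orders` is an iterable of (area, cuisine, dish) tuples, a toy
    stand-in for a real order log.
    """
    counts = Counter(
        dish for a, c, dish in orders if a == area and c == cuisine
    )
    return [dish for dish, _ in counts.most_common(k)]

# Hypothetical order log
orders = [
    ("brooklyn", "indian", "chicken tikka masala"),
    ("brooklyn", "indian", "chicken tikka masala"),
    ("brooklyn", "indian", "saag paneer"),
    ("queens", "indian", "biryani"),
]
print(top_dishes(orders, "indian", "brooklyn"))
```

The hard part the article describes, reconciling “the same” dish across thousands of differently worded menus, happens before a count like this is even possible.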

Move Over Moore’s Law, Make Way for Huang’s Law

IEEE Spectrum, Tekla S. Perry


Graphics processors are on a supercharged development path that eclipses Moore’s Law, says Nvidia’s Jensen Huang
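To see what a faster doubling period compounds to over a decade, compare two hypothetical rates. The two-year period is the conventional Moore’s-law figure; the one-year GPU period is an illustrative assumption, not Huang’s exact claim:

```python
def growth(doubling_period_years, years):
    """Total performance multiple after `years`, given a doubling period."""
    return 2 ** (years / doubling_period_years)

# Doubling every 2 years (a Moore's-law pace) vs. every year
# (an assumed faster GPU pace, for illustration only).
print(growth(2.0, 10))  # 32.0
print(growth(1.0, 10))  # 1024.0
```

Even a modestly shorter doubling period opens a 32x gap over ten years, which is why the comparison matters.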

Rise of the smartish machines

Chemical & Engineering News, Rick Mullin


As in other industries that have practical experience with intelligent machines, researchers in the drug industry have taken off the table the prospect of robots making humans obsolete. Many drug researchers consider the technology an indispensable aid and enabler.

AI, however, has also shown that adding decision-making and the ability to “learn” to a computer’s traditional number-crunching role is changing the work done by the research scientist. Uncertainty regarding the extent and nature of that change is the source of some anxiety. That’s certainly true for medicinal chemists.

Canadian Scientists Least Likely to Share Data: Survey

The Scientist Magazine®, Kerry Grens


Among more than 7,000 respondents to a survey given to researchers around the world, Canadians, Americans, and Australians were the least likely to share their data, while scientists in Poland, Germany, and Switzerland were the most open. Just 50 percent of Canadians said they shared their data, while 76 percent of survey participants from Poland provided data through a repository or supplement.

The survey, conducted by Springer Nature, found that organizing data was the most commonly cited barrier to data sharing. Researchers in the medical sciences were specifically concerned about copyright and licensing. Another common challenge researchers faced was knowing which repository to use to deposit their data.

States requiring CS for all students may be making a mistake: Responding to unfunded mandates

Mark Guzdial, Computing Education Research Blog


As of this writing, New Jersey and Wyoming are the latest states to require CS for all their students (as described in this article) or to require that it be offered in all their schools (as described in this post and this news article), respectively. Wyoming has a particularly hard hill to climb. As measured by involvement in AP exams, there’s just not much there — only 8 students took the AP CS A exam in the whole state last year, and 13 took AP CS Principles.

In 2014, I wrote an article titled “The Danger of Requiring Computer Science in K-12 Schools.” I still stand by the claim that we should not mandate computer science for US schoolchildren yet. We don’t know how to do it, and we’re unlikely to fund it to do it well.

The physics of finance helps solve a century-old mystery

Tokyo Institute of Technology


Researchers at Tokyo Tech have brought the worlds of physics and finance one step closer to each other.

In a study published in Physical Review Letters, the team successfully demonstrated the close parallels between random movements of particles in a fluid (called physical Brownian motion) and price fluctuations in financial markets (known as financial Brownian motion).

In doing so, they revive the seminal work of French mathematician Louis Bachelier, who in 1900 was the first to describe the stochastic process that later became known as Brownian motion in the context of financial modeling. Extraordinarily, Bachelier’s findings were published five years before Albert Einstein published his first paper on physical Brownian motion.
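Bachelier’s model treats prices as a random walk, the same mathematics Einstein later applied to suspended particles. A minimal seeded simulation shows the hallmark of Brownian motion: the spread of the displacement grows like the square root of time:

```python
import math
import random

def simulate_walks(n_walks=2000, n_steps=400, seed=42):
    """Simulate symmetric random walks (Bachelier-style price paths)
    and return the standard deviation of the final displacement."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_walks):
        pos = 0
        for _ in range(n_steps):
            pos += rng.choice((-1, 1))
        finals.append(pos)
    mean = sum(finals) / n_walks
    var = sum((x - mean) ** 2 for x in finals) / n_walks
    return math.sqrt(var)

# For a walk of t = 400 unit steps, the spread should come out
# close to sqrt(t) = 20, whether the walker is a particle or a price.
print(simulate_walks())
```

The same sqrt-of-time scaling is what lets the Tokyo Tech team line up particle trajectories and market microdata side by side.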

Closing the Loop: The Importance of External Engagement in Computer Science Research

John Regehr, Embedded in Academia blog


Computer scientists tend to work by separating the essence of a problem from its environment, solving it in an abstract form, and then figuring out how to make the abstract solution work in the real world. For example, there is an enormous body of work on solving searching and sorting problems and in general it applies to finding and rearranging things regardless of whether they live in memory, on disk, or in a filing cabinet.


Data Visualization Summit

Innovation Enterprise


San Francisco, CA April 12-13. “With 20+ Industry Speakers & 150+ delegates, Data Visualization brings together the world’s leaders in the industry. Running concurrently with our Big Data Innovation Summit, networking opportunities are second to none.” [invitation required]

2018 CGA Conference: Illuminating Space and Time in Data Science

Harvard University Center for Geographic Analysis


Cambridge, MA April 26-27 at Harvard University Center for Geographic Analysis. [registration required]

Data Science Salon – Dallas

Formulated By


Dallas, TX April 27. “The Data Science Salon is a destination conference which brings together specialists face-to-face to educate each other, illuminate best practices, and innovate new solutions in a casual atmosphere with food, drinks, and entertainment.” [$$$]

4th Data Science Summit



Tel Aviv, Israel May 28 at Tel Aviv Convention Center, co-organized by IGTCloud, Intel and O’Reilly Media, in collaboration with eBay and IBM. [$$$]

NIH mHealth Technology Showcase

National Institutes of Health


Bethesda, MD June 4, Natcher Conference Center. “The goal for the meeting is to discuss how these communities can work together to improve the specificity, reliability, and validity of health indicators identified from data collected from wearable and mobile sensors, in the context of rapidly evolving and increasingly complex and diverse technologies.” [free, attendee application required]

Visual Trumpery with Alberto Cairo

Meetup, Data Visualization Group in the Bay Area


Los Gatos, CA May 3, starting at 6 p.m., Netflix (131 Albright Way, Bldg D). [rsvp required]


Quantifying Probabilistic Expressions

If a future event is ____ to happen (or going to happen with ____), what percentage of the time would you estimate it ends up happening?

Please assign the probability value that you associate with the following list of words or phrases.
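Responses to a survey like this are typically summarized per phrase. A minimal sketch, with entirely hypothetical phrases and numbers:

```python
from statistics import median

# Hypothetical survey responses: each phrase maps to the probabilities
# (as percentages) that individual respondents assigned to it.
responses = {
    "almost certainly": [95, 97, 90, 99],
    "likely":           [70, 75, 65, 80],
    "a toss-up":        [50, 50, 45, 55],
    "unlikely":         [20, 25, 15, 30],
}

# Summarize each phrase by its median response, a robust central estimate.
summary = {phrase: median(vals) for phrase, vals in responses.items()}
print(summary)
```

The interesting output of such surveys is usually not the medians but the spread: how widely people disagree about what “likely” means.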

Announcing the DeepGlobe Satellite Challenge for CVPR 2018

“Challengers will be provided with high-resolution satellite image datasets (courtesy of DigitalGlobe) and the corresponding training data. We expect them to learn the expected urban elements for each category: road extraction, building detection and land cover classification.” Deadline for submissions is May 1.

Data Science Summer School (DS3)

Paris, France June 25-29, co-organised by the Data Science Initiative of École Polytechnique and DATAIA Institute. Deadline for applications is May 2.

The 1st Workshop on Machine Learning and Data Mining for Podcasts (MLDM4P), 2018

London, England August 20, a KDD 2018 workshop. “Podcast content has become a major channel for information, entertainment, and advertising.” … “research into podcast content modeling, recommendation, and interaction is relatively neglected.” Deadline for submissions is May 8.

Forecasting Counterfactuals in Uncontrolled Settings (FOCUS)

“The FOCUS program seeks to develop and empirically evaluate systematic approaches to counterfactual forecasting. Counterfactual forecasts are statements about what would have happened if different circumstances had occurred.” Deadline for proposals is June 29.

USD 2 Million up for Grabs in 2018 Sheikh Hamad Translation Prize

“The prize money will be shared by winners across three categories — Translation Prizes (USD 800,000), Achievement Prizes (USD 1m), and Prize for International Understanding (USD 200,000).” Nominations are accepted until August 31, 2018.

Tools & Resources

How Facebook handles account deletions

Ramy Khuffash


Before we dive in, I want to quickly explain the difference between deactivation and deletion.

When you deactivate your account, you don’t show up in search or on friend lists, but you can log back in whenever you want to re-activate it. Deleting your account is more permanent. Once it’s deleted, you can’t go back to it and continue using it as if you never attempted to abandon it.
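The distinction can be modeled as a tiny state machine. This is an illustrative sketch of the behavior described above, not Facebook’s actual implementation:

```python
class Account:
    """Toy model of the deactivation/deletion distinction."""

    def __init__(self):
        self.state = "active"

    def deactivate(self):
        if self.state == "deleted":
            raise ValueError("deleted accounts cannot be deactivated")
        self.state = "deactivated"   # hidden from search and friend lists

    def reactivate(self):
        if self.state != "deactivated":
            raise ValueError("only deactivated accounts can be reactivated")
        self.state = "active"        # logging back in restores the account

    def delete(self):
        self.state = "deleted"       # permanent: no path back to "active"

acct = Account()
acct.deactivate()
acct.reactivate()    # deactivation is reversible
acct.delete()
print(acct.state)    # deletion is not
```

The key design point is that "deactivated" has an edge back to "active" while "deleted" is a terminal state.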

code2vec: Learning Distributed Representations of Code

arXiv:1803.09473, Computer Science > Learning; Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav


We present a neural model for representing snippets of code as continuous distributed vectors. The main idea is to represent code as a collection of paths in its abstract syntax tree, and aggregate these paths, in a smart and scalable way, into a single fixed-length code vector, which can be used to predict semantic properties of the snippet.
We demonstrate the effectiveness of our approach by using it to predict a method’s name from the vector representation of its body. We evaluate our approach by training a model on a dataset of 14M methods. We show that code vectors trained on this dataset can predict method names from files that were completely unobserved during training. Furthermore, we show that our model learns useful method name vectors that capture semantic similarities, combinations, and analogies.
Compared with previous techniques over the same dataset, our approach obtains a relative improvement of over 75%, making it the first to successfully predict method names based on a large, cross-project corpus.
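The core idea, decomposing a snippet into AST paths and aggregating them into one fixed-length vector, can be sketched with Python’s `ast` module. Hashing paths into a count vector here is a crude, unlearned stand-in for the paper’s attention-weighted aggregation over leaf-to-leaf paths:

```python
import ast
import zlib

def path_vector(code, dim=32):
    """Hash root-to-leaf node-type paths of a snippet's AST into a
    fixed-length count vector. A simplified stand-in for code2vec,
    which uses leaf-to-leaf paths and learned aggregation weights."""
    vec = [0] * dim
    def walk(node, prefix):
        prefix = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:  # reached a leaf: record the whole path
            vec[zlib.crc32("/".join(prefix).encode()) % dim] += 1
        for child in children:
            walk(child, prefix)
    walk(ast.parse(code), [])
    return vec

v = path_vector("def add(a, b):\n    return a + b")
print(len(v), sum(v))  # fixed dimension; total = number of leaf paths
```

Because every snippet maps to the same dimension, vectors for different-sized methods become directly comparable, which is what makes downstream tasks like method-name prediction possible.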




TensorFlow.js

TensorFlow team, Google


“A WebGL accelerated, browser based JavaScript library for training and deploying ML models.”


MacroBase

Stanford DAWN


“MacroBase is a new analytic monitoring engine designed to prioritize human attention in large-scale datasets and data streams. Unlike a traditional analytics engine, MacroBase is specialized for one task: finding and explaining unusual or interesting trends in data.”
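The kind of explanation MacroBase produces can be sketched by scoring attribute values by their risk ratio, i.e. how much more frequent they are among outlier records than inlier records. The data below are hypothetical, and this is a conceptual sketch, not MacroBase’s actual API:

```python
def risk_ratios(outliers, inliers):
    """Score attribute values by how much more common they are among
    outlier records than among inlier records."""
    scores = {}
    for v in set(outliers) | set(inliers):
        p_out = outliers.count(v) / len(outliers)
        p_in = inliers.count(v) / max(len(inliers), 1)
        scores[v] = p_out / p_in if p_in else float("inf")
    return scores

# Hypothetical telemetry: the device model attached to each reading.
outliers = ["model_B", "model_B", "model_B", "model_A"]
inliers = ["model_A"] * 90 + ["model_B"] * 10
print(risk_ratios(outliers, inliers))
```

A value with a high ratio ("model_B" here, over-represented 7.5x among outliers) is exactly the sort of attribute a monitoring engine would surface as an explanation for the unusual readings.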

A workshop on Linux containers: Rebuild Docker from Scratch

GitHub – Fewbytes


The preparatory talk covers all the basics you’ll need for this workshop, including:

  • Linux syscalls and glibc wrappers
  • chroot vs pivot_root
  • namespaces
  • cgroups
  • capabilities
  • and more

Archiving Large-Scale Legacy Multimedia Research Data: A Case Study

International Journal of Digital Curation; Claudia Yogeswaran


In this paper we provide a case study of the creation of the DCAL Research Data Archive at University College London. In doing so, we assess the various challenges associated with archiving large-scale legacy multimedia research data, given the lack of literature on archiving such datasets. We address issues such as the anonymisation of video research data, the ethical challenges of managing legacy data and historic consent, ownership considerations, the handling of large-size multimedia data, as well as the complexity of multi-project data from a number of researchers and legacy data from eleven years of research.


Full-time, non-tenured academic positions

Data Scientist

University of Chicago, Urban Crime Labs; Chicago, IL

Full-time positions outside academia

Associate Deputy Director – Discretionary Programs

National Foundation on the Arts and the Humanities, Institute of Museum and Library Services; Washington, DC
