Data Science newsletter – May 4, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for May 4, 2017


Data Science News

How Companies Say They’re Using Big Data

Harvard Business Review, Randy Bean


Health Datapalooza 2017 Day 1: Data Liberation, Sharing, and Analytics

Medgadget, Michael Batista


The now-annual event was launched in 2010 by the Obama administration as a hackathon-style program where attendees were challenged to develop prototype applications in 30 days from 30 data sets. Today, Health Datapalooza includes presentations from government and private sector healthcare experts, breakout panel sessions that dive into specific areas of interest, and an exhibit hall. The common thread of discussion throughout these activities is how health data is, can be, or will eventually be used to improve health outcomes. Over the next two days, we’ll be giving you a snapshot of the presentations and topics being covered at this year’s event with a few specific highlights as we find out what’s on the horizon for health data technology.

Introducing the 4th Global Open Data Index – Advancing the State of Open Data Through Dialogue

Open Knowledge International Blog


We are pleased to present the 4th edition of the Global Open Data Index (GODI), a global assessment of open government data publication. GODI compares national governments in 94 places across the 15 key datasets that have been assessed by our community as the most useful for solving social challenges.

For this edition, we received 1410 submissions of datasets, but only 10% of these are open according to the Open Definition. At Open Knowledge International (OKI), we believe it’s important to look further than just the numbers. GODI is not just a benchmark; it can be and should be used as a tool to improve open data publication and make data findable, useful and impactful. This is why we include a new phase, the dialogue phase, for this edition.

Brain Mapping Tech Inflates Tissue 20x to Reveal Remarkable Detail

SingularityHub, Shelly Fan


By embedding the brain into a gel that swells up when pumped with water, the team blew up mouse brain tissue to roughly 20 times its original size, while preserving the normal structure and connections of neurons and their dendrites.

Using this method, aptly dubbed expansion microscopy (ExM), the team reconstructed a tiny piece of the mouse brain in 3D. Normally, dendrites entangle into a jumbled mess, making it hard to tease apart individual synaptic connections with a conventional light microscope.

Fun bits, ear candy, and long reads

OKCupid has some of the best company blog posts. They just put out a new piece on the pitfalls of A/B testing featuring lines like, “the control group would have a bunch of ignored messages and unreciprocated love”.

“Should you get a PhD or not?” by Chris Albon gets real, involves sobbing.

David Hasselhoff has a new role in which all his lines were written by a robot. I used to get confused, thinking David Hasselhoff was a robot. I realize I may not have been alone in that confusion.

Eric Horvitz, Microsoft Research Technical Fellow, spoke at Data & Society about “failures of automation in the open world, biased data and algorithms, opacity of reasoning, adversarial attacks on AI systems, and runaway AI.” Horvitz has just been named ‘head of all research’ at Microsoft Research.

Worms can digest plastic (maybe).

Andrew Russell (SUNY-Polytechnic in Utica) and Lee Vinsel (Stevens Institute of Technology) are angling to launch a sub-field of science and technology studies focusing on maintenance and maintainers. They question the all too often unbridled, uncritical celebration of innovation: “Entire societies have come to talk about innovation as if it were an inherently desirable value, like love, fraternity, courage, beauty, dignity, or responsibility. Innovation-speak worships at the altar of change, but it rarely asks who benefits, to what end? A focus on maintenance provides opportunities to ask questions about what we really want out of technologies. What do we really care about? What kind of society do we want to live in? Will this help get us there?”

Love tech company bios? This week we’ve got the story of Venmo, a money-transfer app widely used by 20-somethings to facilitate splitting restaurant bills, cab fares, rent, and utilities. One person pays upfront, and everyone else pays her back using Venmo, which is now part of PayPal. In the beginning, as with so many tech companies, mistakes were made. The biggest: no fraud detection whatsoever, in flagrant violation of US securities law and common sense: “the service launched with virtually no regulatory compliance built into it”. Move fast and break stuff, indeed.

This is absolutely not a ‘fun bit’. A years-old anti-vaccination campaign in Minnesota has kicked off a measles outbreak with 41 cases reported and more expected in Minnesota’s Somali community. Andrew Wakefield, the one-time doctor who published the original ‘vaccines cause autism’ paper which has been rescinded (so has his medical license), traveled to Minnesota three times in recent years to address the Somali population there. He spread his toxic gospel that the measles, mumps, rubella (MMR) vaccine causes autism, a disease that continues to impact that community. Vaccination rates within the Minnesotan Somali community plummeted from 92 percent in 2004, before Wakefield’s visits, to 42 percent in 2014, according to the Minnesota Department of Health. At NYU, an MBA student was diagnosed with mumps last week. The National Hockey League has reported cases of mumps for the past two years, reminding all of us – especially academics in university settings – to make sure we receive the recommended MMR booster in adulthood.

Automation is coming for the jobs of poor people, enriching the innovators who create it (like the readers of this newsletter, I imagine), and the investors backing the innovators. A Bloomberg analysis found, “the difference in annual income between households in the top 20 percent and those in the bottom 20 percent ballooned by $29,200 to $189,600 between 2010 and 2015”. Yes, the gap between the top quintile and the bottom quintile is now more than three times the size of median household income ($56,516) in the US.
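
For anyone who wants to check the arithmetic, here is a minimal sketch that uses only the figures quoted in this item. The variable names are illustrative and come from neither Bloomberg nor the Census Bureau.

```python
# Figures quoted in the item above.
gap_2015 = 189_600       # top-20% minus bottom-20% annual household income, 2015
gap_growth = 29_200      # how much that gap grew between 2010 and 2015
median_income = 56_516   # US median household income cited above

# Implied size of the gap in 2010.
gap_2010 = gap_2015 - gap_growth

# The 2015 gap expressed as a multiple of median household income.
ratio = gap_2015 / median_income

print(f"Implied 2010 gap: ${gap_2010:,}")
print(f"2015 gap as a multiple of median income: {ratio:.2f}")
```

The ratio comes out a little above three, which is where the "more than three times" claim comes from.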

Raj Chetty, David Grusky et al., in a study of income mobility, show that Americans’ ability to move from one income stratum to another is fading: “rates of absolute mobility have fallen from approximately 90 percent for children born in 1940 to 50 percent for children born in the 1980s.” As an 80s baby, yeah, even chance sounds about right.

£3m awarded to Oxford-led consortium for national computing facility to support machine learning

Oxford University, News and Events


A consortium of eight UK universities, led by the University of Oxford, has been awarded £3 million by the Engineering and Physical Sciences Research Council (EPSRC) to establish a national high-performance computing facility to support machine learning.

The ground truth about metadata and community detection in networks

Science Advances; Leto Peel, Daniel B. Larremore, and Aaron Clauset


Across many scientific domains, there is a common need to automatically extract a simplified view or coarse-graining of how a complex system’s components interact. This general task is called community detection in networks and is analogous to searching for clusters in independent vector data. It is common to evaluate the performance of community detection algorithms by their ability to find so-called ground truth communities. This works well in synthetic networks with planted communities because these networks’ links are formed explicitly based on those known communities. However, there are no planted communities in real-world networks. Instead, it is standard practice to treat some observed discrete-valued node attributes, or metadata, as ground truth. We show that metadata are not the same as ground truth and that treating them as such induces severe theoretical and practical problems. We prove that no algorithm can uniquely solve community detection, and we prove a general No Free Lunch theorem for community detection, which implies that there can be no algorithm that is optimal for all possible community detection tasks. However, community detection remains a powerful tool and node metadata still have value, so a careful exploration of their relationship with network structure can yield insights of genuine worth. We illustrate this point by introducing two statistical techniques that can quantify the relationship between metadata and community structure for a broad class of models. We demonstrate these techniques using both synthetic and real-world networks, and for multiple types of metadata and community structures.
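
The "standard practice" the abstract criticizes, scoring a detected partition against node metadata, is often done with an agreement measure such as normalized mutual information (NMI). The sketch below is not one of the two statistical techniques the authors introduce. NMI is an assumption here (the abstract names no particular score), and the labels are toy data made up for illustration.

```python
from collections import Counter
from math import log, sqrt

def entropy(labels):
    """Shannon entropy (in nats) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same nodes."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * log((c * n) / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )

def nmi(a, b):
    """Mutual information normalized by the geometric mean of the entropies."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0 or hb == 0:
        return 0.0  # one labeling puts every node in a single group
    return mutual_information(a, b) / sqrt(ha * hb)

# Toy example: metadata labels and two hypothetical detected partitions.
metadata = ["a", "a", "b", "b"]
aligned = [0, 0, 1, 1]      # matches the metadata exactly
unrelated = [0, 1, 0, 1]    # statistically independent of the metadata

score_aligned = nmi(metadata, aligned)
score_unrelated = nmi(metadata, unrelated)
print(score_aligned, score_unrelated)
```

A low score on its own cannot say whether the algorithm failed or whether the metadata simply do not track the network's structure, which is the ambiguity the paper highlights.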

How do Temperature and Rainfall Interact to Control Tropical Forest Carbon Cycling?



Tropical forests dominate global terrestrial carbon exchange, but long-term climate variability might affect their ability to uptake and store carbon dioxide. To better understand tropical forest carbon dynamics, the Tropical Nutrient Limitation Working Group assembled published datasets to determine how temperature and rainfall interact to control carbon cycling in tropical forests. Their results are featured in a recent Ecology Letters publication.

How fluids flow through shale

EurekAlert! Science News, American Institute of Physics


New computer simulations, described this week in the journal Physics of Fluids, can better probe the underlying physics, potentially leading to more efficient extraction of oil and gas.

Peter Norvig on “As We May Program”

Vimeo, LispNYC


The general theme of the presentation is the future of programming and computer science and Peter touches on numerous aspects of these topics.

Peter envisions how our interaction with technology will evolve as intelligent agents become increasingly prevalent in our lives and how computer science is gradually becoming an empirical science. [video, 1:37:13]

The statistical crisis in science: How is it relevant to clinical neuropsychology?

Andrew Gelman, Statistical Modeling, Causal Inference, and Social Science blog


There is currently increased attention to the statistical (and replication) crisis in science. Biomedicine and social psychology have been at the heart of this crisis, but similar problems are evident in a wide range of fields. We discuss three examples of replication challenges from the field of social psychology and some proposed solutions, and then consider the applicability of these ideas to clinical neuropsychology. In addition to procedural developments such as preregistration and open data and criticism, we recommend that data be collected and analyzed with more recognition that each new study is a part of a learning process. The goal of improving neuropsychological assessment, care, and cure is too important to not take good scientific practice seriously.

NuTonomy joins forces with French automaker

The Boston Globe, Hiawatha Bray


NuTonomy will integrate its software and sensors into the Peugeot 3008, a gasoline-powered crossover SUV. Converted vehicles will undergo real-world testing on the streets of Singapore, where nuTonomy last year launched the world’s first taxi service built around self-driving vehicles.

[Karl] Iagnemma said Peugeot and nuTonomy plan to “get a couple of cars on the road quickly in Singapore this summer, and then scale up rapidly from there.”

Microsoft Looks to Regain Lost Ground in the Classroom

The New York Times, Nick Wingfield and Natasha Singer


Last week, Satya Nadella, the chief executive of Microsoft, slipped on a glove made of cardboard and clenched his hand into a fist, causing a robotic hand with fingers made of drinking straws to mimic his movements.

The glove was one of several engineering projects built in a makeshift laboratory on Microsoft’s campus. The company spent the last year talking to thousands of teachers and designing high-tech experiments that require mostly low-cost parts. It will give the designs to schools for free so teachers can use them in their lesson plans.

The projects are part of a major push the company announced Tuesday at an event in New York to make its products more attractive to school administrators, students and teachers.

How to Prepare for an Automated Future

The New York Times, The Upshot blog, Claire Cain Miller


We don’t know how quickly machines will displace people’s jobs, or how many they’ll take, but we know it’s happening — not just to factory workers but also to money managers, dermatologists and retail workers.

The logical response seems to be to educate people differently, so they’re prepared to work alongside the robots or do the jobs that machines can’t. But how to do that, and whether training can outpace automation, are open questions.

Smarr, Others Talk Healthtech, AI at Xconomy’s Impact of Innovation

Xconomy, Bruce V. Bigelow


In the not-too-distant future, a “planetary” computer will be able to create a computational model of your body, with the ability to run simulations of your health and to anticipate chronic disease before you show any symptoms.

This is the direction we’re headed, according to Larry Smarr, founding director of the California Institute of Telecommunications & Information Technology at UC San Diego. While his expertise lies in computer networks and infrastructure, Smarr has emerged as a de facto leader in quantified health—largely due to his relentless curiosity about his own health. In 2011, Smarr diagnosed his own Crohn’s disease long before he showed any symptoms.

Can We Stop Ad Creep?

Fast Company, Mark Bartholomew


Places that used to be ad-free—from the living room to our friendships—are now becoming sites for ads or surveillance designed to make them more effective.


Data Science Fundamentals summit

University of Illinois


Urbana, IL May 16 at NCSA. [free, registration required]

MobileMonday NYC API-First Hackathon

IBM Watson, StrongLoop, Bluemix


New York, NY June 2-4 at Galvanize NYC. [free, registration required]

PyCon Warmup @ 2U

The New York Python Meetup Group


New York, NY A night of PyCon warmup talks! Thursday, May 11 at 7 p.m., 2U Office (601 W 26th St, Suite 1255) [free, rsvp required]

The AI Conference

ML Conference


San Francisco, CA June 2. The Artificial Intelligence Conference is an annual event where leading AI researchers and top industry practitioners meet and collaborate. [$$$]


Call for submissions: stories of federal data use in your research

The Social Science Research Council’s Digital Culture program is issuing a call to our networks, fellows, and the scholarly community in general for personal stories of government data use.


Como, Italy FATREC Workshop on Responsible Recommendation at RecSys 2017 is a venue for discussing questions of social responsibility in building, maintaining, evaluating, and studying recommender systems. Workshop is August 31. Deadline for paper submissions is June 22.

Funding Opportunity: 2018 Conservation Project Grants from the International Elephant Foundation

Applications must clearly contribute to in situ or ex situ conservation of Asian and African elephants. Deadline for submissions is August 11.

Tools & Resources

Introducing Yelp’s Local Graph

Yelp Engineering blog, Tomer Elmalem


“We’re continuously adding new features to our API to make it easier for developers to integrate with our data and share great local businesses through their apps. Today, we’re releasing access to query our data via GraphQL, a graph query language. This is available immediately through our developer beta program.”

How to do an NLG Evaluation: Metrics

Ehud Reiter


A metric-based evaluation gives an NLG system a score by computing how similar its output text is to “gold-standard” reference texts. There are a number of different metrics (including BLEU, METEOR, and ROUGE), which are based on different scoring functions.

I am not a great fan of metric-based evaluation, for reasons I explain below, and would be very dubious if, for example, I was asked to review a paper on NLG which only presented a metric-based evaluation. Nevertheless, I will also below give some advice on best practice for such evaluations.
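
To make "how similar its output text is to reference texts" concrete, here is a minimal sketch of clipped n-gram precision, the overlap count that BLEU builds on. It is a simplification rather than BLEU itself: it omits the brevity penalty, the combination across n-gram orders, and support for multiple references. The function names and example sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" step).
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0  # candidate too short to contain any n-gram of this order
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

reference = "the cat is on the mat"
candidate = "the cat sat on the mat"

unigram_p = clipped_precision(candidate, reference, n=1)
bigram_p = clipped_precision(candidate, reference, n=2)
print(f"unigram precision: {unigram_p:.3f}")
print(f"bigram precision:  {bigram_p:.3f}")
```

Scores like these reward surface overlap only, so a fluent paraphrase of the reference can score poorly. That limitation is one commonly cited reason for caution about metric-only evaluation.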

Transfer Learning – Machine Learning’s Next Frontier

Sebastian Ruder


Over the course of this blog post, I will first contrast transfer learning with machine learning’s most pervasive and successful paradigm, supervised learning. I will then outline reasons why transfer learning warrants our attention. Subsequently, I will give a more technical definition and detail different transfer learning scenarios. I will then provide examples of applications of transfer learning before delving into practical methods that can be used to transfer knowledge. Finally, I will give an overview of related directions and provide an outlook into the future.

Data Science & Machine Learning Platforms for the Enterprise

Algorithmia, Ahmad AlNaimi


A resilient Data Science Platform is a necessity for every centralized data science team within a large corporation. It helps them centralize, reuse, and productionize their models at petascale. We’ve built Algorithmia Enterprise for that purpose.


Full-time positions outside academia

National Climate Assessment Program Coordinator

ICF International; Fairfax, VA

SDE II, Decision Services team

Microsoft; New York, NY

Tenured and tenure track faculty positions

Assistant/Associate professorships in Data Science

IT University of Copenhagen; Copenhagen, Denmark
