Data Science newsletter – May 17, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for May 17, 2017


Data Science News

Should computer programming be a prerequisite for learning statistics?

Andrew Gelman, Statistical Modeling, Causal Inference, and Social Science blog


Sure, no need to learn programming before you take that statistics course. No need to learn math, either. If you had to choose between them, I’d choose programming. Better to have both, though. Programming and math are both useful. Programming’s more useful, but math helps too.

[1705.04098] A Generative Model of People in Clothing

arXiv, Computer Science > Computer Vision and Pattern Recognition; Christoph Lassner, Gerard Pons-Moll, Peter V. Gehler


We present the first image-based generative model of people in clothing in a full-body setting. We sidestep the commonly used complex graphics rendering pipeline and the need for high-quality 3D scans of dressed people. Instead, we learn generative models from a large image database. The main challenge is to cope with the high variance in human pose, shape and appearance. For this reason, pure image-based approaches have not been considered so far. We show that this challenge can be overcome by splitting the generating process in two parts. First, we learn to generate a semantic segmentation of the body and clothing. Second, we learn a conditional model on the resulting segments that creates realistic images. The full model is differentiable and can be conditioned on pose, shape or color. The result are samples of people in different clothing items and styles. The proposed model can generate entirely new people with realistic clothing. In several experiments we present encouraging results that suggest an entirely data-driven approach to people generation is possible.

The New Frontier of Internet-Enabled Cities Slated for Toronto

Planetizen, James Brasuell


“Sidewalk Labs LLC, the urban innovation unit of Page’s Alphabet Inc., has applied to develop a 12-acre strip in downtown Toronto, responding to a recent city agency request for proposals,” reports Mark Bergen.

Some of the information made public about the idea so far came from a speech by Sidewalk Labs Chief Executive Officer Dan Doctoroff at the Smart Cities NYC conference earlier in May, although some of the plans are still private. One big soundbite from early reports is that the company hopes to build an urban zone “from the internet up.”

As Bergen explains, any new project by Sidewalk Labs would be a big expansion beyond the company’s previous efforts—the LinkNYC network of Wi-Fi kiosks located around New York and a traffic data platform in partnership with the U.S. Department of Transportation.

Bill Gates Pens An Essay For the Class Of 2017

GOOD, Tod Perry


In his essay, Gates also shares what he wishes he knew while in college. “For one thing, intelligence is not quite as important as I thought it was, and it takes many different forms,” Gates wrote. “In the early days of Microsoft, I believed that if you could write great code, you could also manage people well or run a marketing team or take on any other task. I was wrong about that. I had to learn to recognize and appreciate people’s different talents. The sooner you can do this, if you don’t already, the richer your life will be.”

David Beck appointed associate director of DIRECT program

University of Washington, eScience Institute


David Beck, eScience senior data science fellow, director of research and life sciences, and research assistant professor in chemical engineering, has been promoted to associate director of the Data Intensive Research Enabling Clean Technologies (DIRECT) program. DIRECT, a UW Clean Energy Institute graduate training program in data-enabled discovery and design of advanced materials for clean energy, is funded by the National Science Foundation Research Traineeship Program.

Democratizing AI Through Microsoft Certifications in Data Science & Machine Learning

Microsoft, Cortana Intelligence and Machine Learning Blog


This blog is the first in a series where we will discuss several new certification exams for data professionals and the resources to prepare for them.

When individuals decide which certification to take, their decision is typically based on their role and – in this particular domain – their relationship with data and analytics. Common roles or titles we encounter in our domain include data scientists, data analysts, data engineers, data architects and data developers, although by no means is this list exhaustive. What’s more, with the democratization of artificial intelligence, it is inevitable that almost all current and future data professionals will be using, or will need to be educated about, machine learning and artificial intelligence, and many will want to have certifications to prove their skills.

Applying Artificial Intelligence in Medicine: Our Early Results

Medium, Cardiogram, Avesh Singh


Cardiogram is taking the first step down that path. We’ve developed an algorithm to use the Apple Watch to detect atrial fibrillation — the most common heart arrhythmia — with higher accuracy than previously validated methods. Our work is being presented at the Heart Rhythm Society, and has been picked up by TechCrunch, Buzzfeed, and CNET.

One year ago, we teamed up with UCSF Cardiology to start the mRhythm study, which 6,158 Cardiogram users enrolled in. Cardiogram trained a deep neural network on the Apple Watch’s heart rate readings and was able to obtain an AUC of 0.97, enabling us to detect atrial fibrillation with 98.04% sensitivity and 90.2% specificity.
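
For readers less familiar with the two metrics quoted above, here is a minimal sketch of how sensitivity and specificity are computed from a confusion matrix. The counts are hypothetical, chosen only for illustration; they are not taken from the mRhythm study, and the function names are our own.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of actual positive cases that the detector flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of actual negative cases that the detector clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical confusion-matrix counts, NOT the mRhythm study's data.
tp, fn, tn, fp = 90, 10, 160, 40

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity={sens:.1%}, specificity={spec:.1%}")
```

High sensitivity means few missed cases of atrial fibrillation; high specificity means few false alarms among people without it.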

The Partnership on AI adds Intel, Salesforce and others as it formalizes Grand Challenges and work groups

TechCrunch, John Mannes


Intel, Salesforce, eBay, Sony, SAP, McKinsey & Company, Zalando and Cogitai are joining the Partnership on AI, a collection of companies and non-profits that have committed to sharing best practices and communicating openly about the benefits and risks of artificial intelligence research. The new members will be working alongside existing partners that include Facebook, Amazon, Google, IBM, Microsoft and Apple.

Improving official statistics in emerging markets using machine learning and mobile phone data

EPJ Data Science, Eaman Jahani


Mobile phones are one of the fastest growing technologies in the developing world with global penetration rates reaching 90%. Mobile phone data, also called CDR, are generated every time phones are used and recorded by carriers at scale. CDR have generated groundbreaking insights in public health, official statistics, and logistics. However, the fact that most phones in developing countries are prepaid means that the data lacks key information about the user, including gender and other demographic variables. This precludes numerous uses of this data in social science and development economic research. It furthermore severely prevents the development of humanitarian applications such as the use of mobile phone data to target aid towards the most vulnerable groups during crisis. We developed a framework to extract more than 1400 features from standard mobile phone data and used them to predict useful individual characteristics and group estimates. We here present a systematic cross-country study of the applicability of machine learning for dataset augmentation at low cost. We validate our framework by showing how it can be used to reliably predict gender and other information for more than half a million people in two countries. We show how standard machine learning algorithms trained on only 10,000 users are sufficient to predict an individual’s gender with an accuracy ranging from 74.3 to 88.4% in a developed country and from 74.5 to 79.7% in a developing country using only metadata. This is significantly higher than previous approaches and, once calibrated, gives highly accurate estimates of gender balance in groups. Performance suffers only marginally if we reduce the training size to 5,000, but significantly decreases in a smaller training set. We finally show that our indicators capture a large range of behavioral traits using factor analysis and that the framework can be used to predict other indicators of vulnerability such as age or socio-economic status. Mobile phone data has a great potential for good and our framework allows this data to be augmented with vulnerability and other information at a fraction of the cost. [full text]

Paul Ryan Has Worst Ratio on Twitter

SelectAll blog, Madison Malone Kircher


There are many ways to gauge a tweet’s performance on Twitter. Maybe you’re hoping that it gets a lot of likes. Maybe you’re hoping that a famous person acknowledges your tweet. Maybe all you care about is that it makes you laugh. But for people looking for more advanced and nuanced measurements of Twitter performance, there is only one important figure: “The Ratio.”

The ratio, explicated earlier this year by Luke O’Neil at Esquire, works like this: If you tweet something good, people will retweet or “like” it, as a sign of agreement or endorsement. If you tweet something bad, people will reply to it, usually angrily. The ratio between replies and retweets, then, can tell you if you did a good tweet, or a bad one. For example, this charmer from New York congressman John Faso supporting the AHCA. Faso, at the time, had just over 2,000 Twitter followers. His bad tweet has since racked up 4,000 replies to just 119 faves. This, friends, isn’t a good ratio.
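
The arithmetic behind the Faso example can be sketched in a few lines. This is a minimal illustration using only the two figures quoted above; dividing replies by endorsements (rather than the reverse) is our assumption about how to express the ratio, and the function name is hypothetical.

```python
def reply_ratio(replies: int, endorsements: int) -> float:
    """Replies per endorsement (like or retweet); higher suggests a worse reception."""
    if endorsements <= 0:
        raise ValueError("endorsements must be positive")
    return replies / endorsements


# Figures quoted above for the John Faso tweet: 4,000 replies, 119 faves.
faso_ratio = reply_ratio(4000, 119)
print(f"{faso_ratio:.1f} replies per fave")
```

By this measure the tweet drew more than 30 replies for every fave.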

Like sabermetricians revolutionizing baseball statistics in the 1990s, Mike Williams and Sepand Ansari over at Fast Forward Labs scraped Twitter data to discover the ratio for several Twitter accounts, by writing code — Twitter’s API doesn’t track reply counts — that looked at tweets from seven different accounts, from January 2016 to now, that received over 50 replies.

AI Mines Hundreds of Thousands of News Articles Per Hour for Stock Tips

VICE, Motherboard, Michael Byrne


An investment management firm called Triumph Asset Management is making a go of it with a new system that parses large volumes of news articles for indicators that can be incorporated into predictive models.

Singapore to launch data science consortium and AI initiative to enhance digital economy



Today, at the opening ceremony of Innovfest Unbound 2017 — the anchor event of Singapore’s week long Smart Nations Innovation festival — Minister For Communications And Information Dr Yaacob Ibrahim unveiled two new initiatives that will ramp up the nation’s digital economy.

The first is AI.SG. This programme, to be launched by the National Research Foundation (NRF), will see various government agencies leverage on AI to tackle key challenges facing Singapore’s society.

Part Three: The End of Privacy – In politics, the best data wins.

Stanford Graduate School of Business


Assistant Professor Michal Kosinski says data driven politics is here to stay. Is that good or bad? [video, 9:28]

The Census Bureau Is in Crisis. Here’s Why It Matters to You

The Fiscal Times


The decennial census unquestionably is one of the most important functions of the Commerce Department’s Census Bureau – a laborious and often highly imprecise operation mandated by the Constitution.

That’s because the census generates vital information and data for the nation that is used — among other things — to reapportion the 435 seats in the House of Representatives, realign the boundaries of legislative districts in every state, and provide invaluable social, demographic and economic profiles of the country essential in developing public policy and conducting research.

“It’s enormously important,” said Andrew Reamer, a research professor at George Washington University’s public policy institute and an expert on demographics. “It’s hard to overstate its importance.”

Better screenings through artificial intelligence

Lehigh University


Artificial intelligence, commonly known as AI, is already exceeding human abilities. Self-driving cars use AI to perform some tasks more safely than people. E-commerce companies use AI to tailor product ads to customers’ tastes more quickly and precisely than any breathing marketing analyst can.

And soon AI will be used to “read” biomedical images more accurately than medical personnel alone—providing better early cervical cancer detection at lower cost than current methods.

Artificial Intelligence Course Creates AI Teaching Assistant

Georgia Tech News Center


College of Computing Professor Ashok Goel teaches Knowledge Based Artificial Intelligence (KBAI) every semester. It’s a core requirement of Georgia Tech’s online master’s of science in computer science program. And every time he offers it, Goel estimates, his 300 or so students post roughly 10,000 messages in the online forums — far too many inquiries for him and his eight teaching assistants (TA) to handle.

That’s why Goel added a ninth TA this semester. Her name is Jill Watson, and she’s unlike any other TA in the world. In fact, she’s not even a “she.” Jill is a computer — a virtual TA — implemented, in part, using technologies from IBM’s Watson platform.

“The world is full of online classes, and they’re plagued with low retention rates,” Goel said. “One of the main reasons many students drop out is because they don’t receive enough teaching support. We created Jill as a way to provide faster answers and feedback.”

Yale People: Economist Costas Arkolakis prizes cross-disciplinary approach

Yale University, YaleNews


In the course of his research, Yale economist Costas Arkolakis often utilizes tools and methods from mathematics and other scientific disciplines, such as physics and computer science.

To that end, working at a university that is strong across disciplines provides an advantage, said Arkolakis, the Henry Kohn Associate Professor of Economics.

“I can go to seminars in different fields and talk to people from different departments,” said Arkolakis, whose research focuses on spatial economics — the study of how space and frictions of movement affect economic activity and welfare. “I can take courses in engineering or a course in finance at the School of Management, which keeps me on my toes.”

This cross-disciplinary approach has proven fruitful: Arkolakis recently received the 2017 Bodossaki Foundation Distinguished Young Scientists Award for social-economic sciences — awarded annually to scholars of Greek nationality or descent who are younger than 45.

Experts Ponder Artificial Intelligence and Cities

CityLab, Richard Florida


When we think of the city of the future, we might think about flying cars and scenes from “Star Trek” or “The Jetsons.” But coming new technologies are shaping deeper and more fundamental changes in our cities.

These changes are already well underway. CityLab readers already know how ride-hailing companies are transforming the nature of mobility and car ownership. Cities have overtaken suburbs to become a major center for high tech firms and the talent that drives them. Initiatives like Google’s Sidewalk Labs are attempting to deepen the connection between technology and urbanism and transform the city itself into a platform for new technology and innovation.

A report by a panel of leading experts on technology, business, and cities takes a deep dive into the changes that will come about as a result of one key new technology—artificial intelligence.

Forensic DNA Reveals More Than We Thought, And That’s Both Good And Bad

Stanford Medicine, Scope Blog


What’s the most private piece of information someone could learn about you? The answer is probably different for everyone, but your genes have to rank high on the list – so much so that questions about genetic privacy figured heavily into recent legal battles over a Maryland law that permitted police to gather and store 13 particular snippets of DNA from anyone they arrested, not just those convicted of crimes.

The Supreme Court upheld that law on the assumption that it wouldn’t violate anyone’s privacy – really, how far could you get trying to determine a person’s physical traits, predisposition to disease, and so on based on just 13 genetic markers?

Pretty far, actually. Michael D. “Doc” Edge, PhD, a recent Stanford graduate; Noah Rosenberg, PhD, a professor of biology, and colleagues discovered that by starting with just 13 forensic DNA markers from an individual, they could find that same person’s record in a distinct, theoretically anonymous database containing hundreds of thousands of additional markers, which could be mined for private genetic information.

Guardant Health Raises $360 Million In Race To Create Cancer Blood Tests

Forbes, Matthew Herper


Guardant Health, a Redwood City, Calif., biotechnology company that sells blood tests to track and, potentially, detect cancer, is announcing this morning that it has raised $360 million from investors, with the goal of deploying its test in 1 million people over the next five years.


AAAS and Big Data Hubs Offer a Deep Dive into Data Visualization

South Big Data Hub


Washington, DC Friday, July 14, at the headquarters of the American Association for the Advancement of Science (1200 New York Ave. NW). [registration coming soon]

WeRobotics Global 2017



New York, NY Thursday, June 1, at the Rockefeller Foundation (420 Fifth Avenue). WeR Global is invite-only.


Annual Conference on Cognitive Computational Neuroscience

New York, NY September 6-8 at Columbia University. The CCN submission deadline is extended to May 26.

Tools & Resources

General Tips for Web Scraping with Python

Jack Schultz, Big-Ish Data blog


“The great majority of the projects about machine learning or data analysis I write about here on Bigish-Data have an initial step of scraping data from websites. And since I get a bunch of contact emails asking me to give them either the data I’ve scraped myself, or help with getting the code to work for themselves. Because of that, I figured I should write something here about the process of web scraping!”

Understanding the Kubernetes ecosystem

O'Reilly Media, Brian Anderson


I recently sat down with Sebastien Goasguen, Senior Director of Cloud Technologies at Bitnami, to talk about Kubernetes and its ecosystem: why the tool is becoming increasingly popular, the tools you can use to support it, and how to adopt containerized architectures at your organization. Here are some highlights from our talk.

Python for Scientists and Engineers – Python For Engineers

Shantru, Python for Engineers


Python for Scientists and Engineers is now free to read online.

Coarse Discourse: A Dataset for Understanding Online Discussions

Google Research Blog; Praveen Paritosh and Ka Wong


“We are releasing the Coarse Discourse dataset, the largest dataset of annotated online discussions to date. The Coarse Discourse contains over half a million human annotations of publicly available online discussions on a random sample of over 9,000 threads from 130 communities from”

alexafsm, A Finite-State Machine Python Library for Building Complex Alexa Skills

Medium, Allen Institute blog, Vu Ha


In building Alexa skills, developers have a number of toolkits to choose from. For JavaScript developers, the Alexa team offers the official NodeJS SDK, which provides support for handling session attributes, skill state persistence, response building, and behavior modeling. For Pythonistas, John Wheeler’s Flask-Ask seems to be the go-to option, as it is built on top of Flask, a popular micro-framework for building web applications. While these tools are excellent choices, they leave out a key aspect of (modeling) complex conversational logics with more than a handful of intents and states: finite-state machine (FSM). Enter alexafsm, an open-source Python library from the Allen Institute of Artificial Intelligence.
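
To make the finite-state-machine idea concrete, here is a minimal plain-Python sketch of a conversation modeled as states and intent-driven transitions. It does not use or imitate the alexafsm API; the state names, intent names, and helper functions are all hypothetical.

```python
# Hypothetical states and intents for illustration; this is not the alexafsm API.
TRANSITIONS = {
    ("start", "SearchIntent"): "showing_results",
    ("showing_results", "NextIntent"): "showing_results",
    ("showing_results", "SelectIntent"): "item_details",
    ("item_details", "BackIntent"): "showing_results",
}


def step(state: str, intent: str) -> str:
    """Return the next state, or stay put if the intent is not valid here."""
    return TRANSITIONS.get((state, intent), state)


def run(intents, state: str = "start") -> str:
    """Feed a sequence of intents through the machine and return the final state."""
    for intent in intents:
        state = step(state, intent)
    return state


final_state = run(["SearchIntent", "NextIntent", "SelectIntent"])
print(final_state)
```

Keeping every legal transition in one table is what makes a skill with many intents and states easier to reason about than scattered if/else checks.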


Roboschool

OpenAI, John Schulman, Jack Clark & Oleg Klimov


Roboschool provides new OpenAI Gym environments for controlling robots in simulation. Eight of these environments serve as free alternatives to pre-existing MuJoCo implementations, re-tuned to produce more realistic motion. We also include several new, challenging environments.

How we built Tagger News: machine learning on a tight schedule

David Robinson, Variance Explained blog


This weekend three friends (Chris Riederer, Nathan Gould, and my twin brother Dan) and I took part in the 2017 TechCrunch Disrupt Hackathon. We’d all been to several of these hackathons before, and we enjoy the challenge of building a usable application in a short timeframe while learning some new technologies along the way.

Since three of the four of us are data scientists, we knew we were looking for a data-driven project, and since the best hackathon projects tend to be usable apps (as opposed to an analysis or a library) we figured we’d build a machine-learning driven product. We quickly landed on the idea of a classifier for the programmer community Hacker News, which would automatically assign topics to each submission based on its text.

Adaptive Mesh Refinement: An Essential Ingredient in Computational Science

SIAM News, Paul Davis


Adaptive mesh refinement may be to computational science and engineering (CSE) what Bolognese sauce is to Italian cooking: part of many meals and integral to the repertoire of most cooks, nearly all of whom are happy to share and serve their special recipes. In that spirit, adaptive mesh refinement was widely served at invited presentations, minisymposia, and poster sessions at the 2017 SIAM Conference on Computational Science and Engineering (CSE17) in Atlanta, Ga., this February; “adaptive mesh refinement” and “AMR” are mentioned on 52 distinct occasions in the meeting’s abstracts!

Like Bolognese, AMR has but a few simple ingredients. The trick is managing them well — across many, many pots on a very large stove. AMR aims to efficiently accommodate the vast variations in scale inherent in most physical phenomena by using smaller scales only when and where required.
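
The "smaller scales only when and where required" idea can be shown with a one-dimensional toy. This is a minimal sketch, not a production AMR framework (real codes refine 2D or 3D meshes inside a PDE solver): a coarse grid is bisected wherever the midpoint value differs from the linear interpolation of the cell's endpoints by more than a tolerance. The test function, tolerance, and function names are all assumptions made for illustration.

```python
import math


def refine_cell(f, a, b, tol, depth):
    """Return grid points on [a, b], bisecting while the midpoint error exceeds tol."""
    mid = 0.5 * (a + b)
    err = abs(f(mid) - 0.5 * (f(a) + f(b)))  # deviation from linear interpolation
    if depth == 0 or err <= tol:
        return [a, b]
    left = refine_cell(f, a, mid, tol, depth - 1)
    right = refine_cell(f, mid, b, tol, depth - 1)
    return left + right[1:]  # drop the duplicated midpoint


def adaptive_grid(f, a, b, coarse_cells, tol, max_depth):
    """Start from a uniform coarse grid and refine each cell independently."""
    edges = [a + (b - a) * i / coarse_cells for i in range(coarse_cells + 1)]
    grid = [edges[0]]
    for lo, hi in zip(edges, edges[1:]):
        grid.extend(refine_cell(f, lo, hi, tol, max_depth)[1:])
    return grid


def front(x):
    """Hypothetical test function: a sharp transition near x = 0.3."""
    return math.tanh(40.0 * (x - 0.3))


points = adaptive_grid(front, 0.0, 1.0, coarse_cells=4, tol=1e-3, max_depth=8)
spacings = [hi - lo for lo, hi in zip(points, points[1:])]
print(len(points), min(spacings), max(spacings))
```

The resulting grid is fine near the sharp front and stays coarse where the function is flat, which is the saving AMR is after.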


Full-time positions outside academia

Senior Research Scientist

Spotify; Stockholm, Sweden, London, England, or New York, NY

Postdoctoral Fellowship: Climate Variability on the Nutritional Value of Crops

SESYNC; Annapolis, MD

Full-time, non-tenured academic positions

HCI Lecturer

Stanford Computer Science; Palo Alto, CA
