Data Science newsletter – May 2, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for May 2, 2017


Data Science News

Data Visualization of the Week

Twitter, Conrad Hackett


Tweet of the Week

Twitter, Ryan North


Scalable Genomics Data Processing Pipeline with Alluxio, Mesos, and Minio

Minio, Guardant Health, Nitish Tiwari


Guardant Health is the world leader in comprehensive liquid biopsy. Oncologists order our blood test to help determine if their advanced cancer patients are eligible for certain drugs that target specific genomic alterations in tumour DNA. Each test produces huge amounts of genomic data that we process into easily interpretable test results.

Spread of Zika virus in the Americas

Proceedings of the National Academy of Sciences, Qian Zhang et al.


Mathematical and computational modeling approaches can be essential in providing quantitative scenarios of disease spreading, as well as projecting the impact in the population. Here we analyze the spatial and temporal dynamics of the Zika virus epidemic in the Americas with a microsimulation approach informed by high-definition demographic, mobility, and epidemic data. The model provides probability distributions for the time and place of introduction of Zika in Brazil, the estimate of the attack rate, timing of the epidemic in the affected countries, and the projected number of newborns from women infected by Zika. These results are potentially relevant in the preparation and analysis of contingency plans aimed at Zika virus control.

No bones needed: ancient DNA in soil can tell if humans were around

Ars Technica, Cathleen O'Grady


Humans, modern and otherwise, have lived in Denisova Cave in Siberia for tens of thousands of years, where they left behind a treasury of archaeological artifacts. The cave is famous for giving its name to Denisovans, a species of human closely related to Neanderthals. But Neanderthals have lived there, too.

In the cave’s Main Gallery, stone tools had been left behind by people who lived thousands of years ago. Those people were probably Neanderthals, according to a paper in Science this week: The soil says so. Even though no Neanderthal bones have been found with the tools, the paper’s authors are the first to be able to detect the presence of humans based on DNA found in the soil. This allows them to paint a much more detailed picture of the past, in Denisova Cave and elsewhere.

Cloudera’s new data science tool aims to boost big data and machine learning for businesses

TechRepublic, Conner Forrest


Cloudera has announced the general availability of its Data Science Workbench, a new self-service tool that could help speed the time to value for advanced analytics and deep learning.

ICLR2017 — deep thought vs exaflops

Medium, Libby Kinsey


Many of the papers now judged most original and significant rely on massive compute resources, usually beyond the financial reach of academia. So where does that leave academic research?

Paul Allen’s AI2 aims to teach computers grade-school science, and it isn’t easy

The Seattle Times, Rachel Lerman


The Allen Institute for Artificial Intelligence is working to teach computers to answer science questions at a grade-school level, a task that might sound simple but requires the computer to decipher images and diagrams and to understand the contextual meaning of what is written.

Facebook told advertisers it can identify teens feeling ‘insecure’ and ‘worthless’

The Guardian, Sam Levin


Facebook showed advertisers how it has the capacity to identify when teenagers feel “insecure”, “worthless” and “need a confidence boost”, according to a leaked document based on research quietly conducted by the social network.

The internal report produced by Facebook executives, and obtained by the Australian, states that the company can monitor posts and photos in real time to determine when young people feel “stressed”, “defeated”, “overwhelmed”, “anxious”, “nervous”, “stupid”, “silly”, “useless” and a “failure”.

The Australian reported that the document was prepared by two top Australian executives, David Fernandez and Andy Sinn.

Supercomputers assist in search for new, better cancer drugs

University of Texas, Texas Advanced Computing Center


Finding new drugs that can more effectively kill cancer cells or disrupt the growth of tumors is one way to improve survival rates for ailing patients. Researchers are using supercomputers to find new chemotherapy drugs and to test known compounds to determine if they can fight different types of cancer. Recent efforts have yielded promising drug candidates, potential plant-derived compounds and insights into how to design more effective drugs.

Exploratory Analysis: Data Scientist Salaries Across the USA

Medium, Towards Data Science, Valeria Rozenbaum


In my previous post, I talked about scraping for Data Scientist jobs across the United States. While I was able to scrape a little over 10,500 listings, few of them contained salary data and many of the salaries were hourly, monthly or weekly. After running a massive clean up on the data, I was left with 493 salaries to use for the modeling. The median salary was $100K with 236 of the listings being above the median and 257 below. I was excited to get to the modeling. However, before jumping to the grand finale, I wanted to see what other insights I could gain from the data. This task called for me to pull one of my favorite data exploration tools from my data scientist toolbox: Tableau!

Machine Learning Security at ICLR 2017

Approximately Correct blog, Victoria Krakovna


The overall theme of the ICLR conference setting this year could be summarized as “finger food and ships”. More importantly, there were a lot of interesting papers, especially on machine learning security, which will be the focus of this post.

President Bollinger Names Microsoft Research Head Jeannette Wing to Lead Columbia’s Data Science Institute

Columbia University, Columbia News


Columbia University President Lee C. Bollinger today announced that Jeannette Wing, currently corporate vice president of Microsoft Research, will become the Avanessians Director of Columbia’s Data Science Institute and Professor of Computer Science.

“Jeannette Wing is a pioneering figure in the world of computer science research and education. Her addition to the University’s academic leadership team reflects the continuing expansion of our work in this field,” said Bollinger. “Our Data Science Institute is indispensable to virtually every scholarly initiative at the University dedicated to addressing a societal problem. The benefits to be derived from Jeannette’s leadership and her presence here will be immense.”


Astro Hack Week 2017

Moore-Sloan Data Science Environments at University of California Berkeley, New York University, and the University of Washington


Seattle, WA Monday-Friday, August 28-September 1, at the University of Washington. Deadline to apply is May 31.

JSM 2017

American Statistical Association


Baltimore, MD July 29-August 3. [$$$]

Great Lakes Software Excellence Conference



Grand Rapids, MI May 22 at Eberhard Center GVSU [$$$]

Kenneth Prewitt: The Transformation of the National Statistical System in the Era of Digital Data — Without a Roadmap

Columbia University


New York, NY May 9 at 6 p.m. Part of Columbia University Computational Social Science speaker series. [free]


ESSI Summer Camp

Ann Arbor, MI The University of Michigan Exercise & Sport Science Initiative, in collaboration with the Michigan Institute for Data Science, will be hosting a data science summer camp for high-school students who are interested in sport analytics. Deadline to apply is May 20.

Tools & Resources

The pitfalls of A/B testing in social networks

OK Cupid, Tech blog, Brenton McMenamin


It’s the perfect tool for most testing situations. Unfortunately, if you’re doing tests for a product that relies heavily on interaction between users — such as a dating app — doing random assignment on a per-user basis can lead to unreliable experiments and misleading conclusions.
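
The interference problem described above can be illustrated with a minimal sketch. It assumes a hypothetical setup, not taken from the post, in which users interact only within their own cluster (here labeled as cities), and it compares per-user assignment with cluster-level assignment, which is one common mitigation and not necessarily the approach the post recommends. The names `assign_arm`, `mixed_pair_fraction`, and the sample data are all invented for illustration.

```python
import hashlib


def assign_arm(key: str, salt: str = "exp-1") -> str:
    """Deterministically map a key to 'treatment' or 'control' via a hash."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"


def mixed_pair_fraction(pairs, arm_of):
    """Fraction of interacting pairs whose two users sit in different arms."""
    mixed = sum(1 for a, b in pairs if arm_of[a] != arm_of[b])
    return mixed / len(pairs)


# Hypothetical data: 50 users spread over 5 cities; users only
# interact with other users in the same city.
users = {f"user{i}": f"city{i % 5}" for i in range(50)}
pairs = [(a, b) for a in users for b in users
         if a < b and users[a] == users[b]]

# Per-user assignment: each user is hashed independently, so interacting
# users can land in different arms and contaminate each other's results.
per_user = {u: assign_arm(u) for u in users}

# Cluster-level assignment: every user inherits the arm of their city,
# so interacting users always share an arm.
per_cluster = {u: assign_arm(city) for u, city in users.items()}

per_user_mixed = mixed_pair_fraction(pairs, per_user)
per_cluster_mixed = mixed_pair_fraction(pairs, per_cluster)
print(f"mixed pairs, per-user assignment:    {per_user_mixed:.2f}")
print(f"mixed pairs, per-cluster assignment: {per_cluster_mixed:.2f}")
```

Under these assumptions, cluster-level assignment has no mixed pairs by construction, at the cost of fewer independent units (5 cities instead of 50 users), which reduces statistical power.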

Check Out the First Data Science in Production Magazine



Many organisations develop successful proofs of concept but then don’t manage to take the models beyond their laptops. Taking models into production requires a professional workflow, high quality standards, and scalable code and infrastructure. Data Science in Production is dedicated to reaping benefit from data by taking data-driven applications into production. [pdf download]

ICLR – Videos

Facebook, ICLR


All videos, 18 total.

IARPA Announces Publication of Forecasting Data

SIGNAL Magazine


Forecasting data collected during the Intelligence Advanced Research Projects Activity’s (IARPA’s) Aggregative Contingent Estimation (ACE) program by team Good Judgment is now available for use by the public and the research community.


Full-time positions outside academia

Senior Developer

Flatiron School; New York, NY

Full-stack Engineer

The Information; San Francisco, CA

Research Scientist, Core Data Science NYC

Facebook; New York, NY
