Data Science newsletter – May 25, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for May 25, 2017


Data Science News

After years of planning, California is likely to roll out its earthquake warning system next year

Los Angeles Times, Rong-Gong Lin II


California will likely roll out a limited public earthquake early warning system sometime next year, researchers building the network say.

New earthquake sensing stations are being installed in the ground, software is being improved, and operators are being hired to make sure the system is properly staffed, Caltech seismologist Egill Hauksson said at a joint meeting of the Japan Geoscience Union and American Geophysical Union.

Earth-observing companies push for more-advanced science satellites

Nature News & Comment, Gabriel Popkin


Never have so many private eyes looked down at Earth. In the past decade, about a dozen companies have formed to launch Earth-observing satellites. Few have sought to compete with sophisticated government-built instruments, but that is changing.

Private firms have begun to develop satellite radar systems and other advanced technologies in a bid to court scientists and other users, even as the US government is threatening to pare back its stable of satellites. Later this year, for example, the Finnish firm Iceye plans to launch a prototype radar instrument — the first step, the company says, towards a constellation of 20 such probes. Until recently, commercial firms had shied away from pursuing radar satellites because they require heavy instruments and consume a lot of power.

Company Data Science News

Facebook was fined 110m Euros for “misleading” the European Commission around the time of the WhatsApp acquisition. When Facebook bought WhatsApp and its global user base, it claimed there was no technical way to automatically link user accounts across the two platforms. *eyebrows raise* Yeah. The European Commission didn’t buy that either, but it took over two years to push the case through.

A new analysis reveals that Facebook’s HR team is highly competitive when it comes to recruiting data scientists: “The percentage of new hires attracted to Facebook from another top company was 3X the average.”

Airbnb is running its own courses for employees. Teaching them data science may be more efficient than trying to compete for top talent. It’s also much better for the employees who will be able to improve their skills without having to go to grad school or muster the self-motivation to get through Coursera and Udacity material.

Amazon has a physical bookstore in New York. Take a peek around.

Ionic Materials is developing a battery that uses solid rather than liquid electrolytes, led by Tufts University professor Mike Zimmerman. The goal is to make cheaper, safer batteries to replace the moderately dangerous lithium-ion defaults we’re carrying around with us. While battery technology is not ready for manufacturing yet, you can get excited by watching a NOVA special.

Google had its I/O conference last week and posted a bunch of the talks on YouTube, including this one: “Past, present and future of AI/Machine Learning” featuring Stanford Prof and Google Cloud Chief Scientist Fei-Fei Li, data visualization expert/artist Fernanda Viegas, Daphne Koller, and speech recognition expert Francoise Beaufays, who wears sunglasses the whole time.

Uber has upset residents and civic leaders in Pittsburgh, the city where they rolled out their driverless cars. The city alleges that the company charged for rides that people thought would be free, failed to create jobs, doesn’t share data as promised, and withdrew support from the city’s application for a $50m federal grant. Uber is the company with the most negative coverage in this newsletter. Even though they rebut these claims using something like a “we didn’t ever put our promises in writing” excuse, I tend to side with Pittsburgh from sheer ethical inertia.

Viz, a SF-based startup doing AI for medical imaging, just closed a funding round. This is the type of company that may want to pay close attention to Trump’s health care plan and the consequences of knowledge in a situation where pre-existing conditions are punishable offenses.

Photos: Inside Amazon’s first New York City bookstore

Recode, Dan Frommer


After helping drive many U.S. bookstore chains out of business, Amazon has been opening its own retail stores, starting in Seattle in late 2015.

Its first Amazon Books location in New York City opens Thursday morning in Manhattan’s Shops at Columbus Circle, which was previously home to a pretty large — and now closed — Borders Books and Music.

Researcher returning to Hilo for data science program launch

University of Hawaiʻi System News


Grady Weyenberg, a statistical researcher who grew up in Hilo, is the first faculty member hired to create a new data science program at the University of Hawaiʻi at Hilo.

Universities Roll Out New Data Science Programs

datanami, Alex Woodie


College grads looking to take the next step in their data science careers will have several new graduate-level data science programs to enroll in this fall, including a new Master’s of Applied Data Science at Syracuse University, a new Doctorate of Data Science at the University of Tennessee, and a new

Demand from students for graduate-level data science degrees, as well as demand from companies for more skilled data scientists to crunch big data, led Syracuse University leaders to create the new 18-month, multi-disciplinary program, says Associate Professor Jeff Saltz.

Airbnb is running its own internal university to teach data science

TechCrunch, John Mannes


Tech companies, and increasingly even non-tech companies, are struggling with the fact that there are not enough trained data scientists to fill market demand. Every company has their own strategy for hiring and training, but Airbnb has taken things a step further — running its own university-style program, complete with a custom course-numbering system.

Data University is Airbnb’s attempt to make its entire workforce more data literate. Traditional online programs like Coursera and Udacity just weren’t getting the job done because they were not tailored to Airbnb’s internal data and tools. So the company decided to design a bunch of courses of its own around three levels of instruction for different employee needs.

Norwegian Minister visits Imperial’s innovative Data Science Institute

Imperial College London


Jan Tore Sanner, Norway’s Minister for Local Government and Modernisation, visited the College to explore how data science can be used by governments.

The Norwegian delegation visited Imperial’s Data Science Institute to see first-hand the possibilities of AI and data science for the public sector.

Facebook AI director Yann LeCun on the importance of emotional machines

Gigaom, Derrick Harris


Yann LeCun, the deep learning expert and recently appointed director of artificial intelligence research at Facebook, held an Ask Me Anything session on Reddit last week. He went deep into the methodologies of AI and deep learning, the best academic training for excelling in the field and even touched on how to deal with the ethical issues that will arise from the advent of advanced AI. The most interesting exchange, however, might have been about the role of emotions in AI systems.

Essentially, LeCun argued that a system like the one popularized in the recent film Her is nowhere near the realm of possibility right now because of its deep understanding of human emotions, but that understanding of emotions is critical to truly useful systems. “Science fiction often depicts AI systems as devoid of emotions,” he wrote, “but I don’t think real AI is possible without emotions.”

A Visit to Seoul Brings Our Writer Face-to-Face With the Future of Robots

Smithsonian magazine, Gary Shteyngart


Striving for perfection in mind, body and spirit is a Korean way of life, and the cult of endless self-improvement begins as early as the hagwons, the cram schools that keep the nation’s children miserable and sleep-deprived, and sends a sizable portion of the population under the plastic surgeon’s knife. If The Great Gatsby were written today, the hero’s last name would be Kim or Park. And as though human competition isn’t enough, when I land in Seoul I learn that Korea’s top Go champion—Go is a mind-bendingly complex strategic board game played in East Asia—has been roundly beaten by a computer program called AlphaGo, designed by Google DeepMind, based in London, one of the world’s leading developers of artificial intelligence.

The country I encounter is in a mild state of shock.

Origins of the Modern Chemical Database

Collaborative Drug Discovery


The chemical database of today is inextricably linked to several historical threads, not all of them chemical, that began a long time before the modern computational world existed. Understanding these origins provides a deeper understanding of the modern chemical database and yields perspective on the paths that brought us here.

Flight delay? Lost luggage? Don’t blame airline mergers, research shows

News at IU Bloomington


It’s often said that airline mergers lead to more headaches for travelers, including more flight delays, late arrivals and missed connections. But an analysis of 15 years of U.S. Department of Transportation statistics found that airline consolidation has had little negative impact on on-time performance.

In fact, two Indiana University researchers found evidence that mergers lead to long-term improvements, likely due to improved efficiencies. The research is forthcoming in the Journal of Industrial Economics.

The Dimension Question

Simons Foundation, Emily Singer


Scientists can routinely track the activity of hundreds — sometimes even thousands — of neurons in the brain of awake animals. But how many cells do they need to monitor to truly understand how the brain functions? That question has become a hot topic among researchers doing large-scale neural recordings. The answer could shape the future of the field, influencing how scientists design experiments and new technologies.

“The question is whether we can infer something about what the larger network is doing from sampling a subset of its neurons,” says Byron Yu, a neuroscientist at Carnegie Mellon University in Pittsburgh and an investigator with the Simons Collaboration on the Global Brain (SCGB). “What can we learn from populations of neurons that would be different from what we learn from one neuron at a time?”

“Now that we can record from 100 or 1,000 neurons at the same time, is this giving us enough of a view of what the brain is doing?” asks John Cunningham, a statistician and computational neuroscientist at Columbia University and an SCGB investigator. “Can we look at 100 neurons and say, ‘I think this is a good summary of brain activity’?”

Making Data Count

UC-Santa Barbara, National Center for Ecological Analysis and Synthesis


NCEAS is excited to announce that DataONE, the California Digital Library, and DataCite received a 2-year, $747K grant from the Alfred P. Sloan Foundation. This grant will support the collection of usage and citation metrics for data objects, and launch a new service that collates and shares these metrics.

When artificial intelligence botches your medical diagnosis, who’s to blame?

Quartz, Robert Hart


Recent reports show systems already capable of matching specialists when diagnosing skin cancer or identifying a rare eye condition responsible for around 10% of global childhood vision-loss. AI systems can even exceed human doctors in identifying certain types of lung cancer. These successes will continue to grow as the technology matures. Add to this the benefits derived from faster diagnoses, reduced costs, and a more personalized medicine, and it’s easy to see there are compelling reasons to adopt AI throughout medical practice.

Of course, there are downsides. AI raises profound questions regarding medical responsibility. Usually when something goes wrong, it is a fairly straightforward matter to determine blame. A misdiagnosis, for instance, would likely be the responsibility of the presiding physician. A faulty machine or medical device that harms a patient would likely see the manufacturer or operator held to account. What would this mean for an AI?

Facial appearance affects science communication

Proceedings of the National Academy of Sciences, Ana I. Gheorghiu, Mitchell J. Callan, and William J. Skylark


The dissemination of scientific findings to the wider public is increasingly important to public opinion and policy. We show that this process is influenced by the facial appearance of the scientist. We identify the traits that engender interest in a scientist’s work and the perception that they do high-quality work, and show that these face-based impressions influence both the selection and evaluation of science news. These findings inform theories of person perception and illuminate a potential source of bias in the public’s understanding of science.


Webinar: UMETRICS Data in the Federal Statistical Research Data Centers

University of Michigan, Institute for Research on Innovation & Science


Online June 8. This webinar is to highlight the work of the Big Data Center and the Innovation Measurement Initiative at Census, provide a detailed description of the dataset itself, give potential researchers information about how to access the data, and address questions raised by participants. [requires system login]

ICML 2017 Accepted Papers

Thirty-fourth International Conference on Machine Learning


Sydney, Australia August 6-11 [$$$]

Schedule – Statistical Inference for Network Models

International School and Conference on Network Science


Indianapolis, IN June 20, part of NetSci 2017. [$$$]

Sabermetrics, Scouting, and the Science of Baseball

Boston University, CompNet


Boston, MA August 5-6. Weekend seminar for the benefit of the Jimmy Fund and the Angioma Alliance, puts you up close with some of baseball’s top coaches, statisticians, scouts, doctors, and scientists. [$$$]

Ready, Set, Robots!

Indiana University


Bloomington, IN June 8-9. “Introduces teens to technology-related fields and concepts including computer programming, high performance computing, and networking as they work side by side with IU technology professionals and researchers.” [free, registration required]


13th AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE’17)

Snowbird, UT October 5-9. Deadline for Papers, Demos, DC, & Playable Experiences is June 1.

Zillow Prize: Zillow’s Home Value Prediction (Zestimate)

Can you improve the algorithm that changed the world of real estate in this Kaggle competition? $1.2 million in prize money. Rules Acceptance/Team Merger Deadline is October 2.


Tools & Resources

CRAN – Package readtext

Kenneth Benoit


“Functions for importing and handling text files and formatted text files with additional meta-data, such including ‘.csv’, ‘.tab’, ‘.json’, ‘.xml’, ‘.pdf’, ‘.doc’, ‘.docx’, ‘.xls’, ‘.xlsx’, and others.”

[1705.06273] Transfer Learning for Named-Entity Recognition with Neural Networks

arXiv, Computer Science > Computation and Language; Ji Young Lee, Franck Dernoncourt, Peter Szolovits


Recent approaches based on artificial neural networks (ANNs) have shown promising results for named-entity recognition (NER). In order to achieve high performances, ANNs need to be trained on a large labeled dataset. However, labels might be difficult to obtain for the dataset on which the user wants to perform NER: label scarcity is particularly pronounced for patient note de-identification, which is an instance of NER. In this work, we analyze to what extent transfer learning may address this issue. In particular, we demonstrate that transferring an ANN model trained on a large labeled dataset to another dataset with a limited number of labels improves upon the state-of-the-art results on two different datasets for patient note de-identification.

MapR Releases New Ecosystem Pack with Optimized Security and Performance for Apache Spark



MapR, provider of the Converged Data Platform that converges the essential data management and application processing technologies on a single, horizontally scalable platform, announced its next major release of the MapR Ecosystem Pack (MEP) program. MEP is a broad set of open source ecosystem projects that enable big data applications running on the MapR Converged Data Platform with inter-project compatibility. Version 3.0 of MEP provides enhanced security for Spark, new Spark connectors for MapR-DB and HBase, significant updates and integrations with Drill, and a faster version of Hive.

Plotnine is the best Python implementation of R’s ggplot2

Paul Teehan


“plotnine is a new Python library that implements R’s ggplot2. It is a better implementation than ggpy, which was the best option until now for Python users. In this article I show a few examples comparing plotnine, ggpy, and ggplot2.”

Quick, Draw! The Data

A.I. Experiment, Google


Over 15 million players have contributed millions of drawings playing Quick, Draw! These doodles are a unique data set that can help developers train new neural networks, help researchers see patterns in how people around the world draw, and help artists create things we haven’t begun to think of. That’s why we’re open-sourcing them, for anyone to play with.
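
For readers who want to poke at the doodles, here is a minimal sketch of how a few records might be summarized in Python. It assumes the simplified release is newline-delimited JSON in which each record carries a `word`, a `recognized` flag, and a `drawing` list of strokes, with each stroke stored as a list of x values followed by a list of y values. The two sample records and the `summarize` helper are made up for illustration and are not part of Google's release.

```python
import json

# Two hypothetical records shaped like the assumed simplified ndjson format:
# "drawing" is a list of strokes, each stroke is [x_values, y_values].
SAMPLE_NDJSON = """\
{"word": "cat", "countrycode": "US", "recognized": true, "drawing": [[[0, 10, 20], [5, 15, 5]], [[30, 40], [50, 60]]]}
{"word": "cat", "countrycode": "KR", "recognized": false, "drawing": [[[1, 2], [3, 4]]]}
"""


def summarize(ndjson_text):
    """Return (word, stroke_count, point_count, recognized) per record."""
    rows = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        strokes = record["drawing"]
        # Each stroke's first element holds its x coordinates, one per point.
        points = sum(len(stroke[0]) for stroke in strokes)
        rows.append((record["word"], len(strokes), points, record["recognized"]))
    return rows


for row in summarize(SAMPLE_NDJSON):
    print(row)
```

Counting strokes and points per drawing is a cheap first check before training anything, since very short doodles are often the unrecognized ones worth filtering.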

TechBlog: My digital toolbox: Julia Stewart Lowndes

Naturejobs Blog


Julia Stewart Lowndes, a marine data scientist at the National Center for Ecological Analysis and Synthesis (NCEAS) at the University of California at Santa Barbara, published a paper this week laying out the challenges her team faces as they try to share and reuse data on the world’s oceans. Here, some key lessons.


Full-time positions outside academia

Senior Data Scientist

General Mills; Minneapolis, MN
