Data Science newsletter – June 9, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for June 9, 2017


Data Science News

Facebook is speeding up training for visual recognition models

TechCrunch, John Mannes


Every minute spent training a deep learning model is a minute not spent doing something else, and in today’s fast-paced world of research, that minute is worth a lot. Facebook published a paper this morning detailing its personal approach to this problem. The company says it has managed to reduce the training time of a ResNet-50 deep learning model on ImageNet from 29 hours to one.

Facebook managed to reduce training time so dramatically by distributing training in larger “minibatches” across a greater number of GPUs. In the previous benchmark case, batches of 256 images were spread across eight GPUs. But today’s work involves batch sizes of 8192 images distributed across 256 GPUs.
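The per-GPU workload implied by those figures can be checked with a little arithmetic. This is a minimal sketch that uses only the batch sizes and GPU counts quoted above; the function name is illustrative and does not come from the paper or the article.

```python
def per_gpu_batch(total_batch: int, num_gpus: int) -> int:
    """Images each GPU processes per step when a minibatch is split evenly."""
    if total_batch % num_gpus != 0:
        raise ValueError("minibatch does not divide evenly across GPUs")
    return total_batch // num_gpus

# Figures quoted in the article.
baseline = per_gpu_batch(256, 8)     # previous benchmark
large = per_gpu_batch(8192, 256)     # Facebook's setup
gpu_scale = 256 // 8                 # how many times more GPUs are used

print(baseline, large, gpu_scale)
```

Each GPU handles 32 images per step in both setups, so the gain comes from using 32 times as many GPUs rather than from loading each one more heavily. Under ideal linear scaling, 29 hours divided by 32 is roughly 0.9 hours, which is in line with the reported one hour.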

Google Brain Residency

Ryan Dahl


Last year, after nerding out a bit on TensorFlow, I applied and was accepted into the inaugural class of the Google Brain Residency Program. The program invites two dozen people, with varying backgrounds in ML, to spend a year at Google’s deep learning research lab in Mountain View to work with the scientists and engineers pushing on the forefront of this technology.

The year has just concluded and this is a summary of how I spent it.

Data loss and recovery in the age of paper

SSRC, Parameters, Sheila A. Brennan


Not all lost federal records can be reconstituted or reconstructed. Some losses of federal records have been devastating.

Margaret Chan: AI in healthcare must be for the good of all

Wired UK, Liat Clark


While artificial intelligence stands to bring rapid improvements to the healthcare sector, director-general of the World Health Organisation Margaret Chan has warned that it must be for the good of everybody – not just the wealthiest countries.

“What good does it do to get early diagnoses of skin or breast cancer if a country does not provide the opportunity for treatment or if the price of medicines are not affordable?” she asked the audience at the UN’s AI Summit for Good. “Many developing countries don’t have health data to mine. And they don’t have functioning systems for registering vital causes of death stats.”

Wanted: More Data, the Dirtier the Better

Quanta Magazine, Esther Landhuis


The computational immunologist Purvesh Khatri [Stanford] embraces messy data as a way to capture the messiness of disease. As a result, he’s making elusive genomic discoveries.

Trends > Cobesity

Contexts, Sven E. Wilson


In the academic literature, the phenomenon of both spouses having obesity is called “spousal concordance in obesity.” I prefer the simpler nickname “cobesity.” As shown in the figure above, the prevalence of cobesity has risen significantly in recent decades. This poses new public health challenges.

Using nationally representative data from the Health and Retirement Study (HRS) in the United States, the graph shows the rising trend in cobesity prevalence among married, heterosexual couples in late mid-life. Since 1992, cobesity among couples where the husband is aged 55-59 has almost tripled, rising from just over 6% in 1992 to around 16% in 2012.

Apple HomePod could be a defining moment for consumer AI

Medium, ArchiTecht, Derrick Harris


Apple announced its HomePod smart speaker / home assistant on Monday, so all parties in the battle to own the smart home are now present and accounted for. And because Apple is making its entrance in typical Apple fashion (i.e., expensive — $349 — and aesthetically pleasing), HomePod’s success relative to the Amazon Echo and Google Home devices could teach us a whole lot about what consumers really value when it comes to artificial intelligence. (Also digital platforms.)

What Intelligent Machines Need to Learn From the Neocortex

IEEE Spectrum, Jeff Hawkins


Machines won’t become intelligent unless they incorporate certain features of the human brain. Here are three of them.

Location, location, location goes high-tech: Facts and FAQs about satellite-based wildlife tracking

Mongabay, Wildtech, Sue Palminteri


How has satellite technology advanced wildlife tracking? And how do satellite-based tags differ from radio transmitter tags in function and application?

[1706.01869] StreetStyle: Exploring world-wide clothing styles from millions of photos

arXiv, Computer Science > Computer Vision and Pattern Recognition; Kevin Matzen, Kavita Bala, Noah Snavely


Each day billions of photographs are uploaded to photo-sharing services and social media platforms. These images are packed with information about how people live around the world. In this paper we exploit this rich trove of data to understand fashion and style trends worldwide. We present a framework for visual discovery at scale, analyzing clothing and fashion across millions of images of people around the world and spanning several years. We introduce a large-scale dataset of photos of people annotated with clothing attributes, and use this dataset to train attribute classifiers via deep learning. We also present a method for discovering visually consistent style clusters that capture useful visual correlations in this massive dataset. Using these tools, we analyze millions of photos to derive visual insight, producing a first-of-its-kind analysis of global and per-city fashion choices and spatio-temporal trends.

Apple Needs to Reinvent Itself. It Just Might Be Doing So.

The New York Times, Farhad Manjoo


For some time now, Apple has faced questions about its growth and what rabbits it can pull out of its hat next, especially as rivals including Google, Facebook and Amazon appear to have gotten the jump on it with emerging technologies like artificial intelligence, virtual reality and augmented reality. The Apple iPhone remains the most profitable computing device in the world, and Apple’s immediate future looks sunny, but its long-term outlook has begun to look partly cloudy. In a world that seems to care less and less about beautiful hardware and more about services that help you from afar, over the air, without your ever having to touch a machine, Apple risks becoming an anachronism.

HomePod will be a test of how Apple responds to these difficulties. That’s because for Apple to outdo Amazon in the home assistant game, it will need to prioritize skills that have long been on its back burner — cloud services and A.I., for instance.

Safe Crime Prediction – Homomorphic Encryption and Deep Learning for More Effective, Less Intrusive Digital Surveillance

i am trask blog


TLDR: What if it was possible for surveillance to only invade the privacy of criminals or terrorists, leaving the innocent unsurveilled? This post proposes a way with a prototype in Python.

How One Boston Startup is Overcoming Flight Delays and Cancellations

RE•WORK | Blog, Katie Pollitt


Everyone knows that sinking feeling when you’re waiting at the airport and your flight flashes up ‘delayed’. You wish you’d got to the airport a couple of hours later or, in the case of a cancelled flight, not bothered at all. But how do we know ‘which of the over 30 million commercial flights in the US will actually get delayed or cancelled?’ Freebird has built a business on using data science to answer that question. Its number one priority is to eliminate the stress, delay, and massive inconvenience delays can cause; the company knows that ‘getting there matters’.

Sam Zimmerman, CTO & Co-founder, spoke at the Deep Learning Summit Boston last week, where he explained that the Freebird team have created a real-time predictive analytics engine based on dynamic data sets and deep-learning algorithms. In the event of a cancellation or severe delay, with Freebird you can skip the line and instantly book a new ticket (on any airline) at no extra cost.

Wide-Open: Accelerating public data release by automating detection of overdue datasets

PLOS Biology; Maxim Grechkin, Hoifung Poon, Bill Howe


Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
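The two steps the abstract describes, finding dataset references in article text and then checking whether each dataset is still private, can be sketched roughly as follows. This is not the Wide-Open implementation. The accession patterns are a simplifying assumption, the accession numbers in the sample text are made up, and `is_public` is a hypothetical stand-in for a real repository query.

```python
import re

# Assumed accession patterns: GEO series such as "GSE12345" and
# SRA studies such as "SRP012345". A real pipeline may need more prefixes.
ACCESSION_RE = re.compile(r"\b(GSE\d+|SRP\d+)\b")

def find_accessions(article_text):
    """Return the unique dataset accessions mentioned in an article, in order."""
    seen = []
    for match in ACCESSION_RE.findall(article_text):
        if match not in seen:
            seen.append(match)
    return seen

def overdue(article_text, is_public):
    """Accessions cited in a published article that are not yet public.

    `is_public` is a hypothetical callable standing in for a repository query.
    """
    return [acc for acc in find_accessions(article_text) if not is_public(acc)]

# Made-up article text and release status, for illustration only.
text = ("Data are available at GEO under GSE11111 and GSE22222; "
        "reads are in SRA (SRP033333). See GSE11111.")
released = {"GSE11111"}

print(overdue(text, lambda acc: acc in released))
```

Passing the repository check in as a callable keeps the text-mining step testable without network access; a real system would replace the lambda with a query against GEO or SRA.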

Peer review is a thankless job. One firm wants to change that

The Economist


One solution is to make peer review more desirable and less of a duty. That is the idea behind Publons, a firm which allows scientists to track and showcase their peer-reviewing contributions. It has just been bought for a tidy sum by Clarivate Analytics, which runs Web of Science, an index that tracks how often researchers cite each other’s papers. Scientists who sign up will get a verifiable, trackable measure of their contributions. Their reviews will even be given their own “DOI” numbers, unique identifiers currently used for keeping track of papers.

The hope is that once scientists can quantify their reviewing work and boast about it on their CVs, universities and funding bodies will take it into account when handing out promotions or cash. Making scientists keener to review papers could also speed up publishing, says Andrew Preston, one of the firm’s founders. At the moment, much of a journal editor’s time is spent tracking down potential peer reviewers, then badgering them to contribute. By making reviewing more attractive, he hopes researchers might start volunteering instead. Since Publons’s founding in 2012, more than 150,000 researchers have signed up, writing more than 800,000 reviews.

‘SnotBots,’ Whales, and the Health of Humanity

Intel, IT Peer Network, Alyson Griffin


This brings us to the work of Parley for the Oceans, the collaboration network and space where creators, thinkers, and leaders raise awareness for the beauty and fragility of the oceans and work together to end their destruction. Parley addresses major threats to our oceans, the largest and most important ecosystem on Earth, through creative collaboration and eco-innovation. The organization’s network of partners and ambassadors includes renowned whale researcher and marine biologist Dr. Iain Kerr. Together with Dr. Kerr and a team of experts, Parley is dedicated to advancing research and education focused on conserving whales and the oceans that sustain them. This quest begins with scientific research and a commitment to changing and improving the ways we explore, preserve and protect ocean habitats.

A vital part of this quest involves SnotBot—a new, noninvasive way to study whales. SnotBots are modified drones designed to hover over the surface of the ocean and collect the blow, or snot, that whales exhale. That blow is rich with biological data stemming from DNA, stress hormones, pregnancy hormones, viruses, bacteria, and toxins. The SnotBot then relays collected samples to researchers on boats that are a comfortable distance away from the whales.

The Art of Exoplanets



Recently, NASA’s Spitzer Space Telescope helped reveal that a star called TRAPPIST-1 is circled by seven Earth-size planets—the first system of this kind found to date. The data collected by Spitzer and other telescopes reveal the exoplanets’ sizes and distances to their stars, while theoretical models predict additional information about the planets’ atmospheres and surfaces. But what do these planets really look like? Do they have continents, lava, or oceans?

Researchers Devise Nobel Approach to Faster Fibers

IEEE Spectrum, Jeff Hecht


Three decades of steady increases in fiber-optic transmission capacity have powered the growth of the Internet and information-based technology. But sustaining that growth has required increasingly complex and costly new technology. Now, a new experiment has shown that an elegant bit of laser physics called a frequency comb—which earned Theodor Hänsch and John Hall the 2005 Nobel Prize in Physics—could come to the rescue by greatly improving optical transmitters.

In a Nature paper published today, researchers at the Karlsruhe Institute of Technology in Germany and the Swiss Federal Institute of Technology in Lausanne report using a pair of silicon-nitride microresonator frequency combs to transmit 50 terabits of data through 75 kilometers of single-mode fiber using 179 separate carrier wavelengths. They also showed that microresonator frequency combs could serve as local oscillators in receivers, which could improve transmission quality, and lead to petabit-per-second (1000 terabit) data rates inside data centers.

Tweet of the Week

Twitter, Cal King


Data Visualization of the Week

Twitter, Randy Olson



PIDapalooza 2018

ORCID, Crossref, California Digital Library, DataCite


Girona, Spain PIDapalooza is a two-day Persistent Identifier festival. January 23-24, 2018. [save the date]

Great User Research in an Agile World

New York City User Experience Professionals Association


New York, NY Tuesday, June 20, starting at 6 p.m., XO Group (195 Broadway, 25th Floor). Perspective from three members of the Bloomberg UX team: a project manager, a user researcher and an interaction designer. [$$]

Careers in Data Science, Data Engineering, and Artificial Intelligence



New York, NY Tuesday, June 13, starting at 12 noon, NYU School of Medicine (550 1st Avenue, Skirball 3rd Floor Seminar Room). “We invite you to gain a perspective on data science from the team behind Insight’s Data Science Fellows Program.” [free, registration required]


Call for Papers: Data for Good Exchange 2017

New York, NY Event date: Sunday, September 24. Deadline for abstracts is July 1.

Call for Proposals | Sci Viz NYC

New York, NY Sci Viz NYC, the event, is on December 1. Deadline is July 15 to submit proposals for six 10-minute “lightning” presentations.

NYU Center for Data Science News

Exciting news: Brenden Lake will join @NYUDataScience and Psychology as an assistant professor starting Fall 2017.

Twitter, Claudio Silva


Welcome Brenden!

Tools & Resources

Distributed TensorFlow

MGH & BWH Center for Clinical Data Science, Neil Tennenholtz


“In this post, we will explore the mechanisms through which computation in TensorFlow can be distributed.”

Machine Learning

Apple Developer


“Take advantage of Core ML, a new foundational machine learning framework used across Apple products, including Siri, Camera, and QuickType. Core ML delivers blazingly fast performance with easy integration of machine learning models enabling you to build apps with intelligent new features using just a few lines of code.”

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Facebook Research; Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He


Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images.
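The idea of dividing an SGD minibatch over a pool of parallel workers can be illustrated with a toy example. This is a minimal sketch, assuming a 1-D linear model and workers simulated in a single process. It is not the paper's implementation and leaves out the optimization fixes the abstract alludes to; all names and values are illustrative.

```python
import random

def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def sync_sgd_step(w, xs, ys, num_workers, lr):
    """One synchronous SGD step: split the minibatch evenly across workers,
    compute each worker's gradient, average the gradients, then update once."""
    shard = len(xs) // num_workers
    assert shard * num_workers == len(xs), "equal shards assumed"
    grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    avg_grad = sum(grads) / num_workers
    return w - lr * avg_grad

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(64)]
ys = [3.0 * x for x in xs]  # synthetic data with true weight 3.0

# One step on the whole minibatch versus one step split over 8 workers.
w_single = 0.0 - 0.1 * grad_mse(0.0, xs, ys)
w_sync = sync_sgd_step(0.0, xs, ys, num_workers=8, lr=0.1)
print(abs(w_single - w_sync) < 1e-12)
```

With equal shards, the average of the per-worker gradients matches the full-minibatch gradient up to floating-point rounding, so adding workers changes how the work is divided rather than the update itself. The per-worker shard only stays large if the total minibatch grows with the number of workers, which is the growth in minibatch size the abstract refers to.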

An Overview of Multi-Task Learning for Deep Learning

Sebastian Ruder

