|
|
Data Science News
|
Science AMA Series: I’m Abe Davis, last week my research video about “Interactive Dynamic Video” (IDV) hit the front-page of Reddit, and a bunch of people expressed interest in learning more about it. So here I am, AMA!
|
reddit.com/r/science
from August 23, 2016
My name is Abe Davis and I’m a PhD student at MIT’s Computer Science and Artificial Intelligence Lab (CSAIL), where I’m a member of the lab’s Computer Graphics group.
Last week my research video about “Interactive Dynamic Video” hit the front-page of Reddit and a bunch of people wanted to learn more about it. So here I am, AMA!
|
|
Open-source mapping company Mapbox acquires passive fitness app-maker Human for its anonymized data
|
MobiHealthNews
from August 22, 2016
San Francisco-based Mapbox, an open-source mapping platform, has acquired fitness-tracking app Human, which allows users to track activity all day and aggregates the data anonymously to create information about cities.
|
|
Innovators Under 35
|
MIT Technology Review
from August 22, 2016
The people in our 16th annual celebration of young innovators are disrupters and dreamers. They’re inquisitive and persistent, inspired and inspiring. No matter whether they’re pursuing medical breakthroughs, refashioning energy technologies, making computers more useful, or engineering cooler electronic devices—and regardless of whether they are heading startups, working in big companies, or doing research in academic labs—they all are poised to be leaders in their fields.
|
|
Decoupled Neural Interfaces using Synthetic Gradients
|
arXiv, Computer Science > Learning; Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, Koray Kavukcuoglu
from August 18, 2016
Training directed neural networks typically requires forward-propagating data through a computation graph, followed by backpropagating an error signal, to produce weight updates. All layers, or more generally, modules, of the network are therefore locked, in the sense that they must wait for the remainder of the network to execute forwards and propagate error backwards before they can be updated. In this work we break this constraint by decoupling modules by introducing a model of the future computation of the network graph. These models predict what the modelled subgraph will produce using only local information. In particular we focus on modelling error gradients: by using the modelled synthetic gradient in place of true backpropagated error gradients we decouple subgraphs, and can update them independently and asynchronously, i.e. we realise decoupled neural interfaces. We show results for feed-forward models, where every layer is trained asynchronously; recurrent neural networks (RNNs), where predicting one’s future gradient extends the time over which the RNN can effectively model; and a hierarchical RNN system with ticking at different timescales. Finally, we demonstrate that in addition to predicting gradients, the same framework can be used to predict inputs, resulting in models which are decoupled in both the forward and backwards pass, amounting to independent networks which co-learn such that they can be composed into a single functioning corporation.
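A toy sketch of the core idea in plain Python (not the paper's models, which are full networks): a single scalar "layer" updates immediately using a synthetic gradient predicted from its local activation, and the predictor itself is regressed toward the true gradient whenever that later becomes available. All constants are illustrative.

```python
# Toy sketch of a decoupled update with a synthetic gradient.

def true_gradient(w, x, y):
    # d/dw of the squared loss 0.5 * (w*x - y)**2
    return (w * x - y) * x

class SyntheticGradientModel:
    """Predicts a module's gradient from its local activation only."""
    def __init__(self):
        self.a, self.b = 0.0, 0.0  # linear predictor: grad ~= a*act + b

    def predict(self, activation):
        return self.a * activation + self.b

    def update(self, activation, true_grad, lr=0.1):
        # Regress the predictor toward the later-arriving true gradient.
        err = self.predict(activation) - true_grad
        self.a -= lr * err * activation
        self.b -= lr * err

w, lr = 0.5, 0.05
x, y = 1.0, 2.0  # one fixed data point; the optimal weight is 2.0
sg = SyntheticGradientModel()
for _ in range(500):
    activation = w * x
    w -= lr * sg.predict(activation)               # decoupled: no waiting
    sg.update(activation, true_gradient(w, x, y))  # then train the predictor
print(round(w, 1))
```

The weight converges toward the optimum 2.0 even though it is never updated with a true backpropagated gradient, only with the local predictor's output.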
|
|
Population genetic history and polygenic risk biases in 1000 Genomes populations
|
bioRxiv; Alicia R Martin, Christopher R Gignoux, Raymond K Walters, Genevieve L Wojcik, Simon Gravel, Mark J Daly, Carlos D Bustamante, Eimear E Kenny
from August 23, 2016
We disentangle the population history of the widely-used 1000 Genomes Project reference panel, with an emphasis on underrepresented Hispanic/Latino and African descent populations. By leveraging haplotype sharing, linkage disequilibrium decay, and ancestry deconvolution along chromosomes in admixed populations, we gain insights into ancestral allele frequencies, the origins, rates, and timings of admixture, and sex-biased demography. We make empirical observations to evaluate the impact of population structure in association studies, with conclusions that inform rare variant association in diverse populations, how we use standard GWAS tools, and transferability of findings across populations. Finally, we show through coalescent simulations that inferred polygenic risk scores derived from European GWAS are biased when applied to diverse populations. … Our study provides fine-scale insight into the sampling, genetic origins, divergence, and sex-biased history of admixture in the 1000 Genomes Project populations.
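The paper's final point concerns polygenic risk scores, which are just weighted sums of risk-allele dosages; this is why effect sizes estimated in European GWAS carry their biases into other populations. A minimal sketch with invented numbers:

```python
# Effect sizes (betas) and genotypes below are made-up illustrative values.

def polygenic_risk_score(dosages, effect_sizes):
    # Weighted sum of risk-allele counts (0, 1, or 2) per variant.
    return sum(d * b for d, b in zip(dosages, effect_sizes))

effect_sizes = [0.12, -0.05, 0.30, 0.08]  # hypothetical GWAS betas
genotype = [2, 1, 0, 1]                   # risk-allele dosages per variant
score = polygenic_risk_score(genotype, effect_sizes)
print(round(score, 2))  # → 0.27
```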
|
|
CrowdAI Builds Smarter Image Recognition
|
Y Combinator, The Macro blog
from August 19, 2016
“CrowdAI is building smarter image recognition. They are currently working with satellite, drone and self-driving car companies to provide them with scalable image recognition.” … “We sat down with Devaki Raj, Pablo Garcia, and Nic Borensztein to talk about what they’re building.”
More Image Recognition:
Full Resolution Image Compression with Recurrent Neural Networks (August 18, arXiv, Computer Science > Computer Vision and Pattern Recognition; George Toderici et al.)
Segmenting and refining images with SharpMask (August 25, Facebook Code, Engineering Blog; Piotr Dollár)
Science AMA Series: I’m Abe Davis, last week my research video about “Interactive Dynamic Video” (IDV) hit the front-page of Reddit (August 23, reddit.com/r/science)
|
|
Full Resolution Image Compression with Recurrent Neural Networks
|
arXiv, Computer Science > Computer Vision and Pattern Recognition; George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, Michele Covell
from August 18, 2016
This paper presents a set of full-resolution lossy image compression methods based on neural networks. Each of the architectures we describe can provide variable compression rates during deployment without requiring retraining of the network: each network need only be trained once. All of our architectures consist of a recurrent neural network (RNN)-based encoder and decoder, a binarizer, and a neural network for entropy coding. We compare RNN types (LSTM, associative LSTM) and introduce a new hybrid of GRU and ResNet. We also study “one-shot” versus additive reconstruction architectures and introduce a new scaled-additive framework. We compare to previous work, showing improvements of 4.3%-8.8% AUC (area under the rate-distortion curve), depending on the perceptual metric used. As far as we know, this is the first neural network architecture that is able to outperform JPEG at image compression across most bitrates on the rate-distortion curve on the Kodak dataset images, with and without the aid of entropy coding.
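The additive-reconstruction idea can be sketched without any neural networks: encode the residual at each iteration, so more iterations means more bits and lower distortion, giving variable rates from one procedure. In this toy the "encoder" is a one-bit sign quantizer with a shrinking step size; the paper's RNN encoder/decoder and entropy coder are replaced by identity operations.

```python
# Toy sketch of additive, progressive compression (no neural nets).

def binarize(residual, scale):
    # One bit per value: the sign of the residual at a fixed step size.
    return [scale if r >= 0 else -scale for r in residual]

def compress(x, iterations):
    recon = [0.0] * len(x)
    codes = []
    for k in range(iterations):
        residual = [xi - ri for xi, ri in zip(x, recon)]
        code = binarize(residual, scale=0.5 ** (k + 1))
        codes.append(code)
        # Additive reconstruction: accumulate each iteration's contribution.
        recon = [ri + ci for ri, ci in zip(recon, code)]
    return recon, codes

x = [0.3, -0.7, 0.1]
recon4, _ = compress(x, 4)
recon8, _ = compress(x, 8)
distortion = lambda a, b: sum(abs(ai - bi) for ai, bi in zip(a, b))
print(distortion(x, recon8) < distortion(x, recon4))  # → True
```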
|
|
Your Garbage Data Is A Gold Mine
|
Fast Company
from August 24, 2016
… Say, for example, you are interested in the seasonal profitability of supermarkets over time. Foot traffic data may not be the cause of profitability, as more store visitors do not necessarily translate directly into profit or even sales. But foot traffic may be statistically related to sales volume and so may be one useful clue, just as body temperature is one signal of a person’s overall well-being. And when combined with massive amounts of other signals using data analytics techniques, it can provide valuable new insights.
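The article's "signal" framing amounts to measuring statistical association rather than causation; a Pearson correlation on fabricated foot-traffic and sales figures makes the point:

```python
# Fabricated daily figures; the point is association, not causation.
from statistics import mean, pstdev

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

foot_traffic = [120, 135, 150, 160, 180, 210]  # visitors per day
sales = [2400, 2500, 2900, 3000, 3600, 4100]   # dollars per day
r = pearson(foot_traffic, sales)
print(r > 0.9)  # → True: strongly related, yet still only one signal
```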
|
|
Bloomberg Media Names Global Head of Data Science
|
FishbowlNY, Chris O'Shea
from August 24, 2016
“Bloomberg Media has named Michelle Lynn global head of data science and insights” … “Lynn comes to Bloomberg from Dentsu Aegis Network, where she served as chief insights officer.”
More in Data Science in Business:
Apple Acquires Personal Health Data Startup Gliimpse (August 22, Fast Company, Christina Farr and Mark Sullivan)
An Exclusive Look at How AI and Machine Learning Work at Apple (August 24, Medium, Backchannel, Steven Levy)
Digital Feeding Frenzy Erupts: Internet of Things, Analytics Drive M&A Activity To Record Levels (August 18, Forbes, Joe McKendrick)
WhatsApp to Share User Data With Facebook (August 25, Wall Street Journal, Deepa Seetharaman and Brian R. Fitzgerald)
|
|
Videos from Deep Learning Summer School, Montreal 2016
|
VideoLectures.NET
from August 07, 2016
The Deep Learning Summer School 2016 is aimed at graduate students and industrial engineers and researchers who already have some basic knowledge of machine learning (and possibly but not necessarily of deep learning) and wish to learn more about this rapidly growing field of research. [35 video presentations]
|
|
Voice Recognition Software Finally Beats Humans At Typing, Study Finds
|
NPR, All Tech Considered
from August 24, 2016
Computers have already beaten us at chess, Jeopardy and Go, the ancient board game from Asia. And now, in the raging war with machines, human beings have lost yet another battle — over typing.
Turns out voice recognition software has improved to the point where it is significantly faster and more accurate at producing text on a mobile device than we are at typing on its keyboard. That’s according to a new study by Stanford University, the University of Washington and Baidu, the Chinese Internet giant. The study ran tests in English and Mandarin Chinese.
|
|
An Exclusive Look at How AI and Machine Learning Work at Apple
|
Medium, Backchannel, Steven Levy
from August 24, 2016
On July 30, 2014, Siri had a brain transplant.
Three years earlier, Apple had been the first major tech company to integrate a smart assistant into its operating system. Siri was the company’s adaptation of a standalone app it had purchased, along with the team that created it, in 2010. Initial reviews were ecstatic, but over the next few months and years, users became impatient with its shortcomings. All too often, it erroneously interpreted commands. Tweaks wouldn’t fix it.
So Apple moved Siri voice recognition to a neural-net based system for US users on that late July day (it went worldwide on August 15, 2014).
|
|
Designing AI Systems that Obey Our Laws and Values
|
Communications of the ACM; Amitai Etzioni, Oren Etzioni
from August 24, 2016
Operational AI systems (for example, self-driving cars) need to obey both the law of the land and our values. We propose AI oversight systems (“AI Guardians”) as an approach to addressing this challenge, and to respond to the potential risks associated with increasingly autonomous AI systems. These AI oversight systems serve to verify that operational systems did not stray unduly from the guidelines of their programmers and to bring them back into compliance if they do stray. The introduction of such second-order, oversight systems is not meant to suggest strict, powerful, or rigid (from here on, ‘strong’) controls. Operational systems need a great degree of latitude in order to follow the lessons of their learning from additional data mining and experience and to be able to render at least semi-autonomous decisions (more about this later). However, all operational systems need some boundaries, both in order not to violate the law and to adhere to ethical norms. Developing such oversight systems, AI Guardians, is a major new mission for the AI community.
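A minimal sketch of the Guardian pattern (the speed-limit scenario and all names here are invented): a second-order check that verifies an operational decision against a hard boundary and intervenes only when the decision strays:

```python
# Invented illustration of a second-order "AI Guardian" check.

def operational_controller(desired_speed):
    # Stand-in for a learned, semi-autonomous system with wide latitude.
    return desired_speed

def guardian(decision, legal_limit):
    """Verify a decision against a legal boundary; correct only if it strays."""
    if decision > legal_limit:
        return legal_limit, "flagged: clipped to legal limit"
    return decision, "ok"

speed, status = guardian(operational_controller(72), legal_limit=65)
print(speed, status)  # → 65 flagged: clipped to legal limit
```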
|
|
Why Data Citation Is a Computational Problem
|
Communications of the ACM; Peter Buneman, Susan Davidson, James Frew
from August 24, 2016
Citation is essential to traditional scholarship. Citations identify the cited material, help retrieve it, give credit to its creator, date it, and so on. In the context of printed materials (such as books and journals), citation is well understood. However, the world is now digital. Most scholarly and scientific resources are held online, and many are in some kind of database, or a structured, evolving collection of data. For example, most biological reference works have been replaced by curated databases, and vast amounts of basic scientific data—geospatial, astronomical, molecular, and more—are now available online. There is strong demand [13, 23] that these databases should be given the same scholarly status and appropriately cited, but how can this be done effectively?
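One computational approach the article points toward is to cite a versioned query rather than a static document, with a checksum so the cited extract can be verified later. A sketch with made-up field names and data:

```python
# Field names and data are invented; the pattern is: dataset + version +
# query + access date + a checksum of the retrieved rows.
import hashlib
import json

def cite_query(database, version, query, rows, accessed):
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return {
        "database": database,
        "version": version,
        "query": query,
        "accessed": accessed,
        "sha256": hashlib.sha256(payload).hexdigest()[:12],
    }

rows = [{"gene": "BRCA2", "variant": "c.68-7T>A"}]
c = cite_query("ExampleVariantDB", "2016-08", "gene=BRCA2", rows, "2016-08-24")
print(c["database"], c["version"], c["sha256"])
```

Anyone re-running the cited query against the cited version can recompute the checksum and confirm they retrieved the same data.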
|
|
Events
|
Clinical Trial Transparency and Reproducibility Discussion Panel and Workshop at NYU
New York, NY; NYU, Center for Open Science, and AllTrials USA; Thursday, September 29, at 1:30 pm, Langone Medical Center, Skirball 4th Floor Seminar Room [free]
|
|
OpenTrials launch date + Hack Day
Berlin, Germany “OpenTrials will officially launch its beta on Monday 10th October 2016 at the World Health Summit in Berlin. After months of work behind-the-scenes meeting, planning, and developing, we’re all really excited about demoing OpenTrials to the world and announcing how to access and use the site!” … “If that wasn’t enough, we also have a confirmed date and location for the OpenTrials Hack Day – it will take place on Saturday 8th October at the German office of Wikimedia in Berlin.”
|
|
Deadlines
|
Data by the People, for the People: Join the White House Open Data Innovation Summit
|
deadline: Conference
|
Washington DC “The White House will showcase recent and future open data and My Data achievements at the September 28th Open Data Innovation Summit with Solutions Showcase!”
Deadline to be considered as a speaker or Solutions Showcase exhibitor is Thursday, September 1.
|
|
Tools & Resources
|
RNNs in Tensorflow, a Practical Guide and Undocumented Features – WildML
|
Denny Britz, WildML blog
from August 21, 2016
“Using an RNN should be as easy as calling a function, right? Unfortunately that’s not quite the case. In this post I want to go over some of the best practices for working with RNNs in Tensorflow, especially the functionality that isn’t well documented on the official site.”
|
|
Deep Deterministic Policy Gradients in TensorFlow
|
Patrick Emami
from August 21, 2016
“Deep Reinforcement Learning has recently gained a lot of traction in the machine learning community due to the significant amount of progress that has been made in the past few years. Traditionally, reinforcement learning algorithms were constrained to tiny, discretized grid worlds, which seriously inhibited them from gaining credibility as being viable machine learning tools.”
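One small, standard ingredient of DDPG (sketched here independently of Emami's post) is the "soft" target-network update, theta_target <- tau * theta + (1 - tau) * theta_target, which makes the target networks trail the online networks slowly for stability:

```python
# Soft target update, applied elementwise; weights are plain lists of
# floats for illustration rather than real network parameters.

def soft_update(target, online, tau=0.001):
    return [tau * o + (1 - tau) * t for t, o in zip(target, online)]

target_weights = [0.0, 0.0]
online_weights = [1.0, -1.0]
for _ in range(1000):
    target_weights = soft_update(target_weights, online_weights, tau=0.01)
print([round(w, 2) for w in target_weights])  # → [1.0, -1.0]
```

With tau well below 1, the targets converge toward the online weights only gradually, which damps the feedback loop between the critic and its own bootstrapped targets.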
|
|
The Open Source Data Science Masters
|
datamasters
from August 25, 2015
Curriculum for Data Science
|
|
Text summarization with TensorFlow
|
Google Research Blog, Peter Liu
from August 24, 2016
“We’re open-sourcing TensorFlow model code for the task of generating news headlines on Annotated English Gigaword, a dataset often used in summarization research. We also specify the hyper-parameters in the documentation that achieve better than published state-of-the-art on the most commonly used metric as of the time of writing.”
|
|
Debugging machine learning
|
Hal Daumé III, natural language processing blog
from August 24, 2016
I’ve been thinking, mostly in the context of teaching, about how to teach debugging of machine learning specifically. Personally I find it very helpful to break things down in terms of the usual error terms: Bayes error (how much error is there in the best possible classifier), approximation error (how much do you pay for restricting to some hypothesis class), estimation error (how much do you pay because you only have finite samples), and optimization error (how much do you pay because you didn’t find a global optimum to your optimization problem). I’ve generally found that trying to isolate errors to one of these pieces, and then debugging that piece in particular (e.g., pick a better optimizer versus a better hypothesis class), has been useful.
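The decomposition turns debugging into comparing gaps: with (invented) error measurements for one hypothetical task, the largest gap tells you which piece to attack first.

```python
# All error values are invented measurements for one hypothetical task.
bayes_error    = 0.05  # irreducible noise: best possible classifier
best_in_class  = 0.09  # best achievable within our hypothesis class
best_from_data = 0.12  # best fit achievable from our finite sample
what_we_found  = 0.18  # what the optimizer actually returned

approximation_error = best_in_class - bayes_error     # richer model class?
estimation_error    = best_from_data - best_in_class  # more data?
optimization_error  = what_we_found - best_from_data  # better optimizer?

gaps = {
    "approximation": approximation_error,
    "estimation": estimation_error,
    "optimization": optimization_error,
}
worst = max(gaps, key=gaps.get)
print(worst)  # → optimization
```

Here the optimization gap (0.06) dominates, so tuning the optimizer should come before collecting more data or enlarging the model class.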
|
|
Translational Software Releases Genomics API to Speed Up Precision Medicine
|
ProgrammableWeb
from August 23, 2016
Translational Software, a clinical decision support tools developer, recently announced an API that labs and tech providers can utilize to hasten the development of medical apps. The API was built on HL7’s Fast Healthcare Interoperability Resources (FHIR) standard, a developing standard for the electronic exchange of healthcare information. The API queries databases for drug-drug-gene interaction data, which can be used to alert clinicians to adverse interaction possibilities.
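As a hypothetical illustration only (the server URL, resource name, and parameters below are assumptions, not Translational Software's actual API), a FHIR-style query for drug-gene interaction data might be constructed like this:

```python
# Request construction only; nothing is sent over the network.
from urllib.parse import urlencode

base = "https://fhir.example.com"  # placeholder FHIR server
params = {
    "patient": "12345",
    "medication": "clopidogrel",
    "gene": "CYP2C19",
}
url = base + "/MedicationRiskAssessment?" + urlencode(params)
print(url)
```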
|
|
New package tokenizers joins rOpenSci
|
rOpenSci, Lincoln Mullen
from August 23, 2016
The R package ecosystem for natural language processing has been flourishing in recent days. R packages for text analysis have usually been based on the classes provided by the NLP or tm packages. Many of them depend on Java. But recently there have been a number of new packages for text analysis in R, most notably text2vec, quanteda, and tidytext. These packages are built on top of Rcpp instead of rJava, which makes them much more reliable and portable. And instead of the classes based on NLP, which I have never thought to be particularly idiomatic for R, they use standard R data structures. The text2vec and quanteda packages both rely on the sparse matrices provided by the rock solid Matrix package. The tidytext package is idiosyncratic (in the best possible way!) for doing all of its work in data frames rather than matrices, but a data frame is about as standard as you can get.
|
|
Careers
|
Internships and other temporary positions |
Software Curator for Systems and Environments
Rhizome; New York, NY
|
|
Full-time positions outside academia |
Junior Developer
StatDNA; Seattle, WA
|
|
|