NYU Data Science Newsletter – June 10, 2015

NYU Data Science Newsletter features journalism, research papers, events, tools/software, and jobs for June 10, 2015


 
Data Science News



ANNOUNCING: first public beta of stack

FP Complete


from June 09, 2015

stack is a new, complete, cross-platform development tool aimed at new and experienced Haskell developers alike. It handles installing and setting up the compiler, installing needed packages, and building, testing, or benchmarking one or more packages in a project at a time. It’s the whole stack.

 

Airbnb Opens Data, Machine Learning Code

Medium, Cool moments of the week


from June 08, 2015

… Aerosolve represents Airbnb’s latest effort in machine learning: the creation of systems that automate the continuous improvement of algorithms. It powers Price Tips, a new product the company unveiled on Thursday.

 

I am Dan Altman, founder of North Yard Analytics and soccer contributor for The New Yorker: IAmA

reddit.com/r/IAmA


from June 09, 2015

I’m an economist and data analyst working for soccer clubs and other clients around the world to improve results on and off the field.

 

Crime Lab to study three programs in effort to reduce youth violence in Chicago

UChicago News


from June 09, 2015

In an effort to find effective ways to reduce youth violence in Chicago, the University of Chicago Crime Lab will evaluate three promising programs, selected from more than 200 ideas submitted during its inaugural design competition.

The recipients of the funding, whose programs the Crime Lab will evaluate scientifically, are Children’s Home + Aid, a leading child and family services agency in Illinois; the David Lynch Foundation, an organization that teaches meditation to children and adults to heal traumatic stress in at-risk populations; and the Sweet Water Foundation, a group that provides hands-on learning of urban agriculture practices for community development.

 

Watch Stanford’s Laptop Orchestra Play A Giant VR Music Engine

Fast Company


from June 09, 2015

… During a recent performance in Stanford’s Bing Concert Hall, the Laptop Orchestra, “a large-scale, computer-mediated ensemble” directed by Ge Wang—the founder of the music app powerhouse Smule—played a piece called Carillon, which is “a networked (virtual reality) instrument that brings you inside a massive virtual bell tower.”

Carillon was built using the Unreal Engine, a game development toolset. The orchestra used a combination of Oculus Rift VR goggles and Leap Motion’s hands-free gesture control system, which allowed the musicians to “slave each performer’s avatar arms and hands to the controller,” Leap Motion wrote in a blog post. [video, 7:28]

 

Reddit button ends, and here’s the click data

Flowing Data blog


from June 09, 2015

On April 1, Reddit posted a simple button with a 60-second timer that counted down to zero. Every time the button was pressed by a unique Reddit user, the timer reset to 60 seconds. Yesterday, more than two months and 1,008,316 presses later, the timer finally made it to zero seconds without a press.
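
The mechanic is simple enough to simulate. Below is a minimal sketch (not Reddit’s code; the press timings are hypothetical) of a countdown that resets on every press and ends at the first 60-second gap:

```python
import random

def run_button(press_gaps, timer=60):
    """Simulate the button: each press resets a 60-second countdown;
    the run ends at the first gap that reaches the full timer."""
    presses = 0
    for gap in press_gaps:
        if gap >= timer:      # nobody pressed in time; timer hit zero
            break
        presses += 1          # a press lands and the countdown resets
    return presses

# Hypothetical seconds between consecutive presses (mean gap of 20s).
gaps = [random.expovariate(1 / 20) for _ in range(100_000)]
print(run_button(gaps), "presses before the timer ran out")
```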

 

Data Scientists in Demand: Salaries Rise as Talent Shortage Looms

Bloomberg Business


from June 04, 2015

… A study by McKinsey projects that “by 2018, the U.S. alone may face a 50 percent to 60 percent gap between supply and requisite demand of deep analytic talent.” The shortage is already being felt across a broad spectrum of industries, including aerospace, insurance, pharmaceuticals, and finance. When the consulting firm Accenture surveyed its clients on their big-data strategies in April 2014, more than 90 percent said they planned to hire more employees with expertise in data science—most within a year. However, 41 percent of the more than 1,000 respondents cited a lack of talent as a chief obstacle. “It will get worse before it gets better,” says Narendra Mulani, senior managing director at Accenture Analytics.

 

Can Silicon Valley Fix Women’s Fashion?

BuzzFeed News


from June 04, 2015

… Stitch Fix uses a combination of data science and personal stylists to select a “Fix” of five items sent directly to your front door. If a customer keeps one item, the $20 styling fee is applied toward its purchase. If she keeps everything, she receives 25% off the total cost of the box. Items are priced between $28 for a pair of earrings and $188 for a pair of jeans, sourced from lesser-known brands (Kut from the Kloth denim) and proprietary ones (six in total). Stitch Fix doesn’t release statistics about its number of customers, but the company projects more than $200 million in revenue in 2015.
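
The pricing rules reduce to a few lines of arithmetic. A minimal sketch, assuming only the rules as stated above (the box contents are hypothetical):

```python
def box_total(prices, kept, styling_fee=20.0, keep_all_discount=0.25):
    """Amount paid for a Fix: the $20 styling fee is credited toward
    any purchase, and keeping every item takes 25% off the box."""
    subtotal = sum(prices[i] for i in kept)
    if kept and len(kept) == len(prices):
        subtotal *= 1 - keep_all_discount
    # The fee is charged regardless and credited against the purchase,
    # so the customer never pays less than the fee itself.
    return max(subtotal, styling_fee)

prices = [28, 48, 64, 88, 188]                  # hypothetical five-item box
print(box_total(prices, kept=[4]))              # keep only the jeans: 188.0
print(box_total(prices, kept=[0, 1, 2, 3, 4]))  # keep everything: 312.0
print(box_total(prices, kept=[]))               # return it all: 20.0
```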

 

Another Tottering Step Toward a New Era of Data-Making

Dart Throwing Chimp blog


from June 09, 2015

Ken Benoit, Drew Conway, Benjamin Lauderdale, Michael Laver, and Slava Mikhaylov have an article forthcoming in the American Political Science Review that knocked my socks off when I read it this morning. Here is the abstract from the ungated version I saw:

Empirical social science often relies on data that are not observed in the field, but are transformed into quantitative variables by expert researchers who analyze and interpret qualitative raw sources. While generally considered the most valid way to produce data, this expert-driven process is inherently difficult to replicate or to assess on grounds of reliability. Using crowd-sourcing to distribute text for reading and interpretation by massive numbers of non-experts, we generate results comparable to those using experts to read and interpret the same texts, but do so far more quickly and flexibly. Crucially, the data we collect can be reproduced and extended transparently, making crowd-sourced datasets intrinsically reproducible. This focuses researchers’ attention on the fundamental scientific objective of specifying reliable and replicable methods for collecting the data needed, rather than on the content of any particular dataset. We also show that our approach works straightforwardly with different types of political text, written in different languages. While findings reported here concern text analysis, they have far-reaching implications for expert-generated data in the social sciences.
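
The mechanics of the approach: many cheap, noisy readings per sentence, aggregated into a point estimate whose reliability can itself be measured. A minimal sketch with hypothetical codings (not the paper’s data or scale):

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical crowd codings: (sentence_id, one worker's policy score).
codings = [
    ("s1", 1), ("s1", 2), ("s1", 1), ("s1", 1), ("s1", 2),
    ("s2", -1), ("s2", 0), ("s2", -2), ("s2", -1), ("s2", -1),
]

by_sentence = defaultdict(list)
for sid, score in codings:
    by_sentence[sid].append(score)

# Averaging over workers gives a point estimate, and the spread across
# workers gives the reliability measure that a single expert reading
# cannot report; re-running the collection reproduces the dataset.
for sid, scores in sorted(by_sentence.items()):
    print(f"{sid}: mean={mean(scores):+.2f}  sd={stdev(scores):.2f}  n={len(scores)}")
```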

 

Visualizing and Understanding Recurrent Networks

arXiv, Computer Science


from June 05, 2015

Recurrent Neural Networks (RNNs), and specifically a variant with Long Short-Term Memory (LSTM), are enjoying renewed interest as a result of successful applications in a wide range of machine learning problems that involve sequential data. However, while LSTMs provide exceptional results in practice, the source of their performance and their limitations remain rather poorly understood. Using character-level language models as an interpretable testbed, we aim to bridge this gap by providing a comprehensive analysis of their representations, predictions and error types. In particular, our experiments reveal the existence of interpretable cells that keep track of long-range dependencies such as line lengths, quotes and brackets. Moreover, an extensive analysis with finite horizon n-gram models suggests that these dependencies are actively discovered and utilized by the networks. Finally, we provide detailed error analysis that suggests areas for further study.
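
The inspection itself is mechanically simple: run text through the model character by character and record individual cell states. A minimal sketch with random, untrained weights, so the traced cell is meaningless here; in the paper this is done on trained models, where some cells turn out to track quotes, brackets, and line length:

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 16, 128  # hidden size, byte-level vocabulary

# Random stand-in weights; a trained character-level LSTM goes here.
Wx = rng.normal(0, 0.1, (4 * H, V))
Wh = rng.normal(0, 0.1, (4 * H, H))
b = np.zeros(4 * H)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    """One LSTM step: input, forget, output gates plus candidate."""
    z = Wx @ x + Wh @ h + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    return sigmoid(o) * np.tanh(c), c

text = 'he said "hello (world)" and left'
h, c = np.zeros(H), np.zeros(H)
for ch in text:
    x = np.zeros(V)
    x[ord(ch) % V] = 1.0
    h, c = lstm_step(x, h, c)
    print(f"{ch!r}: cell0={c[0]:+.3f}")  # trace one cell's state
```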

 

Costs Of Slipshod Research Methods May Be In The Billions

NPR, Shots blog


from June 09, 2015

Laboratory research seeking new medical treatments and cures is fraught with pitfalls: Researchers can inadvertently use bad ingredients, design the experiment poorly, or conduct inadequate data analysis. Scientists working on ways to reduce these sorts of problems have put a staggering price tag on research that isn’t easy to reproduce: $28 billion a year.

That figure, published Tuesday in the journal PLOS Biology, represents about half of all spending on preclinical medical research that’s conducted in labs (in contrast to research on human volunteers). And the finding comes with some important caveats.

The $28 billion doesn’t just represent out-and-out waste, the team that did the research cautions. It also includes some studies that produced valid results — but that couldn’t be repeated by others because of the confusing way the methods were described, or because of other shortcomings. [audio, 3:28]

 

Mortgages Are About Math: Open-Source Loan-Level Analysis of Fannie and Freddie

Todd W. Schneider


from June 09, 2015

Fannie Mae and Freddie Mac began reporting loan-level credit performance data in 2013 at the direction of their regulator, the Federal Housing Finance Agency. The stated purpose of releasing the data was to “increase transparency, which helps investors build more accurate credit performance models in support of potential risk-sharing initiatives.”

The so-called government-sponsored enterprises went through a nearly $200 billion government bailout during the financial crisis, motivated in large part by losses on loans that they guaranteed, so I figured there must be something interesting in the loan-level data. I decided to dig in with some geographic analysis, an attempt to identify the loan-level characteristics most predictive of default rates, and more. As part of my efforts, I wrote code to transform the raw data into a more useful PostgreSQL database format, and some R scripts for analysis. The code for processing and analyzing the data is all available on GitHub.
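
As a flavor of the loading step (Schneider’s actual code is on GitHub; the sketch below assumes the raw files are pipe-delimited text, and the table, columns, and file name are hypothetical, not his schema):

```python
import psycopg2

# Hypothetical schema and file name, for illustration only; see the
# GitHub repo mentioned above for the real processing code.
conn = psycopg2.connect("dbname=agency_loans")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS loan_performance (
            loan_id      text,
            report_month date,
            current_upb  numeric,
            dq_status    text
        )
    """)
    with open("Performance_2015Q1.txt") as f:
        # Bulk-load the pipe-delimited rows straight into Postgres.
        cur.copy_expert(
            "COPY loan_performance FROM STDIN WITH (FORMAT csv, DELIMITER '|')",
            f,
        )
conn.close()
```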

 

Officer Involved: Visualizing Police Brutality

First Look, The Intercept


from June 09, 2015

In the United States, there is no official accounting of the people killed by police. To address that void in information, non-governmental and news organizations have been collecting data on such incidents.

Intercept data artist Josh Begley’s new project, “Officer Involved,” uses databases on police brutality compiled by The Guardian to present the problem in a new way. Begley’s project (like several others he has done) is an intervention that makes visible the violence behind the way we live. “Officer Involved” reveals the lack of innocence in the landscape, and, without sensationalism or sentimentality, challenges us to think about a deep injustice that so many of us accept as normal.

 
CDS News



International Conference on Computational Social Science

YouTube, iccss2015


from June 10, 2015

The new era of Computational Social Science has begun. IC2S2 2015 is the first global event on computational social science.

 
