Data Science newsletter – March 30, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for March 30, 2017


Data Science News

What Your Therapist Doesn’t Know

The Atlantic, Tony Rousmaniere


Big Data has transformed everything from sports to politics to education. It could transform mental-health treatment, too—if only psychologists would stop ignoring it.

Setting a National Standard for Data-Smart Cities

CityLab, Linda Poon


Bloomberg Philanthropies is introducing a new certification initiative that aims to set the national standard for how local governments enact evidence-based policies. Think of it as a Good Housekeeping Seal, or LEED certification, but for data-smart cities.

We Have Some Good News on the California Drought. Take a Look.

The New York Times, Mike McPhate, Derek Watkins and Jim Wilson


Knowing with precision how much snow has accumulated is crucial for farmers and water managers.

That’s where a mapping project at NASA’s Jet Propulsion Laboratory known as the Airborne Snow Observatory comes in. Using measurements gathered by specialized instruments on a plane, scientists have been able to gain an unprecedented understanding of the amount of water present in the Sierra’s snow.

Vector Institute is just the latest in Canada’s AI expansion

BBC News, Jessica Murphy


Toronto will soon get the Vector Institute for Artificial Intelligence, geared to fuelling “Canada’s amazing AI momentum”.

The new research facility, which will be officially launched on Thursday, will be dedicated to expanding the applications of AI through explorations in deep learning and other forms of machine learning. It has received about C$170m (US$127m/£102m) in funding from the Canadian and Ontario governments and a group of 30 businesses, including Google and RBC.

New Research Shows that Data Science in the UK is Flourishing

DataInformed, Dan Somers


New research by Warwick Analytics shows that the number of data scientists in the UK looks to grow by around 50 percent in 2017. However, skilled resources are scarce and a key constraint is the amount of time data scientists manually spend processing data. This means that the key to unlocking the full potential of the UK data-science market is not just to train more data scientists, but to speed up a lot of the manual processes that transform and prepare data for analysis.

Chase Had Ads on 400,000 Sites. Then on Just 5,000. Same Results.

The New York Times, Sapna Maheshwari


The change illustrates the new skepticism with which major marketers are approaching online ad platforms and the automated technology placing their brands on millions of websites. In recent years, advertisers have increasingly shunned buying ads on individual sites in favor of cheaply targeting groups of people across the web based on their browsing habits, a process known as programmatic advertising — enabling, say, a Gerber ad to show up on a local mother’s blog, or a purse in an online shopping cart to follow a person around the internet for weeks.

Nvidia’s Deep-Learning Chips May Give Medicine a Shot in the Arm

MIT Technology Review, Will Knight


The chip maker Nvidia is riding the current artificial-intelligence boom with hardware designed to power cutting-edge learning algorithms. And the company sees health care and medicine as the next big market for its technology.

Kimberly Powell, who leads Nvidia’s efforts in health care, says the company is working with medical researchers in a range of areas and will look to expand these efforts in coming years.

Hedge Funds Are Training Their Computers to Think Like You

Bloomberg, Saijel Kishan


Hedge funds have been trying to teach computers to think like traders for years.

Now, after many false dawns, an artificial intelligence technology called deep learning that loosely mimics the neurons in our brains is holding out promise for firms. WorldQuant is using it for small-scale trading, said a person with knowledge of the firm. Man AHL may soon begin betting with it too. Winton and Two Sigma are also getting into the brain game.

The quant firms hope this A.I. — a kind of machine learning on steroids — will give them an edge in the escalating technological arms race in global finance.

New Technology Platform Designed to Provide African Park Rangers Real-Time Tools to Protect Iconic

PR Newswire, Vulcan


Responding to the elephant poaching crisis illustrated in 2016’s Great Elephant Census (GEC), philanthropist Paul G. Allen and his team of technologists and conservation experts are partnering with park managers across Africa to provide a new technology platform to better protect this iconic species and other wildlife threatened by human activities.

Open data is a right

Simon Rogers


A right to data. It’s worth taking a moment to let that sink in. This is not a random luxury, but a right we are all entitled to.

Open data requires open data journalism to make sense of it all, of course. Journalists are uniquely placed to make sense of the numbers and open them up for readers desperate for facts and a greater understanding of what’s going on around them. Whether it’s crime data or health statistics, the promise of open data was too great to be fulfilled in just a few years but promised a new era of greater awareness for all of us.

[1501.00960] Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution

arXiv, Physics > Physics and Society; Eitan Adam Pechenick, Christopher M. Danforth, Peter Sheridan Dodds


It is tempting to treat frequency trends from the Google Books data sets as indicators of the “true” popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical to academic articles but less common in general, such as references to time in the form of citations. We highlight these dynamics by examining and comparing major contributions to the statistical divergence of English data sets between decades in the period 1800–2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts, in clear contrast to the first version of the fiction data set and both unfiltered English data sets. Our findings emphasize the need to fully characterize the dynamics of the Google Books corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.
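The comparison between decades described in the abstract can be illustrated with a small sketch. It assumes the "statistical divergence" is the Jensen-Shannon divergence (the abstract does not name it), and the word counts below are invented for illustration, not taken from the Google Books data.

```python
import math
from collections import Counter


def jsd_contributions(counts_a, counts_b):
    """Per-word contributions to the Jensen-Shannon divergence, in bits.

    Assumption: JSD stands in for the unspecified "statistical divergence".
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    contributions = {}
    for word in set(counts_a) | set(counts_b):
        p = counts_a.get(word, 0) / total_a
        q = counts_b.get(word, 0) / total_b
        m = 0.5 * (p + q)  # mixture of the two distributions
        term = 0.0
        if p > 0:
            term += 0.5 * p * math.log2(p / m)
        if q > 0:
            term += 0.5 * q * math.log2(q / m)
        contributions[word] = term
    return contributions


# Hypothetical word counts for two decades (made-up numbers).
decade_early = Counter({"time": 50, "love": 40, "figure": 10})
decade_late = Counter({"time": 45, "love": 20, "figure": 35})

parts = jsd_contributions(decade_early, decade_late)
for word, value in sorted(parts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {value:.4f} bits")
print(f"total divergence: {sum(parts.values()):.4f} bits")
```

Sorting by contribution shows which words drive the divergence between two periods, which is the kind of ranking the authors describe examining.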

IEEE Global Initiative Aims to Advance Ethical Design of AI and Autonomous Systems

IEEE Spectrum, Raja Chatila, Kay Firth-Butterfield, John C. Havens and Konstantinos Karachalios


Algorithms with learning abilities collect personal data that are then used without users’ consent and even without their knowledge; autonomous weapons are under discussion in the United Nations; robots simulating emotions are deployed with vulnerable people; research projects are funded to develop humanoid robots; and artificial intelligence-based systems are used to evaluate people. One can consider these examples of AI and autonomous systems (AS) as great achievements or claim that they are endangering human freedom and dignity.

We need to make sure that these technologies are aligned to humans in terms of our moral values and ethical principles to fully benefit from their potential.

Sneak Preview inside the New Manhattanville Campus at Columbia University

Untapped Cities, Michelle Young


Columbia University President Lee Bollinger welcomed staff and press to a preview of the first two buildings at the university’s new Manhattanville Campus in Harlem. At more than seventeen acres in size, this ambitious and controversial project has been in the making for nearly fifteen years, the largest expansion in over a century. Yet it is clear the university hopes that the Manhattanville Campus will not only provide a much-needed, modernized counterpart to its McKim, Mead and White campus in Morningside Heights but also serve as a community partner for the neighborhood.

Inventing Tools for Detecting Life Elsewhere



Recently, astronomers announced the discovery that a star called TRAPPIST-1 is orbited by seven Earth-size planets. Three of the planets reside in the “habitable zone,” the region around a star where liquid water is most likely to exist on the surface of a rocky planet. Other potentially habitable worlds have also been discovered in recent years, leaving many people wondering: How do we find out if these planets actually host life?

Computers learn to cooperate better than humans

Science, Latest News, Jackie Snow


For the first time, computers have taught themselves how to cooperate in games in which the objective is to reach the best possible outcome for all players. The feat is far harder than training artificial intelligence (AI) to triumph in a win-lose game such as chess or checkers, researchers say. The advance could help enhance human-machine cooperation.

A Grand New Adventure

Medium, Kim Rees


I am elated to announce the next chapter in my life. My trajectory of innovating with data has led me to join Capital One as Head of Data Visualization!

Scaling Deep Learning on an 18,000 GPU Supercomputer

The Next Platform, Nicole Hemsoth


The emphasis on machine learning scalability has often been focused on node counts in the past for single-model runs. This is useful for some applications, but as neural networks become more integrated into existing workflows, including those in HPC, there is another way to consider scalability. Interestingly, the lesson comes from an HPC application area like weather modeling where, instead of one monolithic model to predict climate, an ensemble of forecasts run in parallel on a massive supercomputer are meshed together for the best result. Using this ensemble method on deep neural networks allows for scalability across thousands of nodes, with the end result being derived from an average of the ensemble–something that is acceptable in an area that does not require the kind of precision (in more ways than one) that some HPC calculations do.

This approach has been used on the Titan supercomputer at Oak Ridge, which is a powerhouse for deep learning training given its high GPU counts. Titan’s 18,688 Tesla K20X GPUs have proven useful for a large number of scientific simulations and are now pulling double-duty on deep learning frameworks, including Caffe, to boost the capabilities of HPC simulations (classification, filtering of noise, etc.).
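The ensemble idea described above, where many models run independently and their outputs are averaged, can be sketched in a few lines. This is a minimal illustration only: the function names and the probability vectors are hypothetical, and it does not reproduce the Titan or Caffe setup.

```python
from statistics import fmean


def ensemble_average(member_predictions):
    """Average class-probability vectors from independently trained models."""
    n_classes = len(member_predictions[0])
    if any(len(p) != n_classes for p in member_predictions):
        raise ValueError("all members must predict the same number of classes")
    return [fmean(p[k] for p in member_predictions) for k in range(n_classes)]


def predict_class(member_predictions):
    """Return the index of the class with the highest averaged probability."""
    averaged = ensemble_average(member_predictions)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Hypothetical outputs of three ensemble members for one input (3 classes).
members = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]

print(ensemble_average(members))
print(predict_class(members))
```

Because each member is trained and evaluated on its own, the members need no communication until the final averaging step, which is why the approach spreads easily across many nodes.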


Data Science @ Cal Day

Berkeley Data Science


Berkeley, CA April 22, starting at 11 a.m., 190 Doe Library. [free]

Philomathia Forum on Energy and Environment

Berkeley Energy and Climate Institute


Berkeley, CA Organized by Berkeley Energy and Climate Institute, April 20 starting at 8:30 a.m., Banatao Auditorium, Sutardja Dai Hall. [$$]


Call For Papers – Computational Intelligence & Games 2017

New York, NY Conference is August 22-25 at NYU. Deadline for full technical papers is April 4.

2017 NYU Game Center Incubator Applications Open

Projects will be selected by the Incubator Advisory Board with guidance from the NYU Game Center Faculty and Staff. Online applications will be reviewed first. Deadline for applications is Sunday, April 16. A subset of the online applicants will be invited to pitch in person at The Game Center on the evening of April 26.

Mechanism Design for Social Good – Call for Papers

Cambridge, MA 1st Workshop on Mechanism Design for Social Good will be taking place at this year’s ACM Conference on Economics and Computation at MIT on June 26. Deadline for submissions is April 27.

NOAA Fisheries Steller Sea Lion Population Count – How many sea lions do you see?

Kagglers are invited to develop algorithms which accurately count the number of sea lions in aerial photographs. Deadline for entries is June 20.

Conference on Genome Informatics

Cold Spring Harbor, NY November 1-4. Deadline for abstracts is August 18.

NYU Center for Data Science News

Improving Stress Tests on Financial Portfolios

NYU Center for Data Science


Bud Mishra, an affiliated professor at CDS, Gelin Gao (NYU Courant), and Daniele Ramazzotti (Stanford) have suggested a new method for stress testing financial portfolios in their newly released paper, Efficient Simulation of Financial Stress Testing Scenarios with Suppes-Bayes Causal Networks (SBCNs).

Tools & Resources

Adversarial Autoencoders (with Pytorch)

Paperspace, Felipe Nicolas Ducau


“In this post we will look at a recently developed architecture, Adversarial Autoencoders, which are inspired in VAEs, but give us more flexibility in how we map our data into a latent dimension (if this is not clear as of now, don’t worry, we will revisit this idea along the post). One of the most interesting ideas about Adversarial Autoencoders is how to impose a prior distribution to the output of a neural network by using adversarial learning.”

Analyzing Scrabble Games

RPubs, James P. Curley


Re-reading all of these posts made me wonder about putting some scrabble data together into a package so myself and others could do some fun data analysis. The result is the package scrabblr. In this package I decided to collate every turn played by two ‘expert’ level computer sims when playing against each other.

JASP Tutorial: Data Editing

YouTube, JASP Statistics


In this video we explain how to edit your data using JASP statistical software.

[1703.09710] Fast and scalable Gaussian process modeling with applications to astronomical time series

arXiv, Astrophysics > Instrumentation and Methods for Astrophysics; Daniel Foreman-Mackey, Eric Agol, Ruth Angus, Sivaram Ambikasaran


The growing field of large-scale time domain astronomy requires methods for probabilistic data analysis that are computationally tractable, even with large datasets. Gaussian Processes are a popular class of models used for this purpose but, since the computational cost scales as the cube of the number of data points, their application has been limited to relatively small datasets. In this paper, we present a method for Gaussian Process modeling in one-dimension where the computational requirements scale linearly with the size of the dataset. We demonstrate the method by applying it to simulated and real astronomical time series datasets. These demonstrations are examples of probabilistic inference of stellar rotation periods, asteroseismic oscillation spectra, and transiting planet parameters. The method exploits structure in the problem when the covariance function is expressed as a mixture of complex exponentials, without requiring evenly spaced observations or uniform noise. This form of covariance arises naturally when the process is a mixture of stochastically-driven damped harmonic oscillators – providing a physical motivation for and interpretation of this choice – but we also demonstrate that it is effective in many other cases. We present a mathematical description of the method, the details of the implementation, and a comparison to existing scalable Gaussian Process methods. The method is flexible, fast, and most importantly, interpretable, with a wide range of potential applications within astronomical data analysis and beyond. We provide well-tested and documented open-source implementations of this method in C++, Python, and Julia.
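To make the cubic-cost baseline concrete, here is a minimal sketch of the standard (dense) Gaussian Process log-likelihood. It is not the linear-scaling method from the paper. It uses a single exponentially damped cosine term as an example of the covariance family the abstract describes; the parameter names `a`, `c`, `d` and the data values are assumptions made for illustration.

```python
import math


def kernel(tau, a=1.0, c=0.5, d=2.0):
    """One damped-oscillator term: a * exp(-c|tau|) * cos(d * tau)."""
    return a * math.exp(-c * abs(tau)) * math.cos(d * tau)


def cholesky(matrix):
    """Dense Cholesky factorization; this O(n^3) step dominates the cost."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def gp_log_likelihood(times, values, noise_var=0.1):
    """Zero-mean GP log-likelihood with white noise added on the diagonal."""
    n = len(times)
    cov = [
        [kernel(times[i] - times[j]) + (noise_var if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    lower = cholesky(cov)
    # Forward substitution: solve lower @ z = values.
    z = []
    for i in range(n):
        s = sum(lower[i][k] * z[k] for k in range(i))
        z.append((values[i] - s) / lower[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(lower[i][i]) for i in range(n))
    return -0.5 * (quad + log_det + n * math.log(2.0 * math.pi))


# Unevenly spaced, made-up observations.
times = [0.0, 0.7, 1.1, 3.0, 3.4]
values = [0.3, -0.1, -0.4, 0.2, 0.5]
print(gp_log_likelihood(times, values))
```

The dense factorization works for any sampling pattern but becomes impractical as the number of points grows, which is the limitation the paper's method is designed to remove for this family of covariance functions.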

An Introduction to Stock Market Data Analysis with R (Part 1)

Curtis Miller


“This post is the first in a two-part series on stock data analysis using R, based on a lecture I gave on the subject for MATH 3900 (Data Science) at the University of Utah. In these posts, I will discuss basics such as obtaining the data from Yahoo! Finance using pandas, visualizing stock data, moving averages, developing a moving-average crossover strategy, backtesting, and benchmarking.”
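As a rough illustration of the moving-average crossover idea mentioned in the post, the sketch below flags a "buy" when a fast average rises above a slow one and a "sell" when it falls back below. It is written in Python rather than R, the window lengths and prices are made up, and it is not the author's code.

```python
def moving_average(prices, window):
    """Trailing simple moving average; None until a full window is available."""
    averages = []
    running = 0.0
    for i, price in enumerate(prices):
        running += price
        if i >= window:
            running -= prices[i - window]  # drop the value leaving the window
        averages.append(running / window if i >= window - 1 else None)
    return averages


def crossover_signals(prices, fast=3, slow=5):
    """Return (day, 'buy' | 'sell') pairs where the fast average crosses the slow."""
    fast_ma = moving_average(prices, fast)
    slow_ma = moving_average(prices, slow)
    signals = []
    previous = None
    for day, (f, s) in enumerate(zip(fast_ma, slow_ma)):
        if f is None or s is None:
            continue  # not enough history yet
        regime = f > s
        if previous is not None and regime != previous:
            signals.append((day, "buy" if regime else "sell"))
        previous = regime
    return signals


# Hypothetical daily closing prices.
closes = [10, 10, 10, 10, 10, 9, 8, 7, 8, 10, 12, 14]
print(crossover_signals(closes, fast=2, slow=4))
```

Backtesting, as the post describes it, would then replay such signals against historical prices to see how the strategy would have performed.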
