Facebook’s hardware development division on Wednesday announced a new partnership with Harvard, Princeton and 15 other universities intended to allow swifter collaboration on technology research projects.
The agreement between Facebook’s Building 8 and the universities comes as the social media company seeks new revenue streams in virtual reality and artificial intelligence, after signaling last month that it had begun to hit advertising growth limits on its network of 1.8 billion monthly active users.
Research partnerships between universities and companies typically take nine to 12 months to set up, but the new agreement will allow collaboration on new ideas to begin within weeks, said Regina Dugan, who joined the company in April to run the new Building 8 unit.
As the tech industry continues to build VR’s social future, the very systems that enable immersive experiences are already establishing new forms of shockingly intimate surveillance. Once they are in place, researchers warn, the psychological aspects of digital embodiment — combined with the troves of data that consumer VR products can freely mine from our bodies, like head movements and facial expressions — will give corporations and governments unprecedented insight and power over our emotions and physical behavior.
To forecast when and where specific aquifers around the globe might be drained to the point that they’re unusable, Inge de Graaf, a hydrologist at the Colorado School of Mines in Golden, Colorado, developed a new model simulating regional groundwater dynamics and withdrawals from 1960 to 2100. She found that California’s agricultural powerhouses, the Central Valley, Tulare Basin, and southern San Joaquin Valley, which produce a substantial portion of the nation’s food, could run out of accessible groundwater as early as the 2030s. Aquifers beneath India’s Upper Ganges Basin and parts of southern Spain and Italy could be depleted between 2040 and 2060. And the southern part of the Ogallala aquifer under Kansas, Oklahoma, Texas, and New Mexico could be depleted between 2050 and 2070.
Advertising on the internet has never been easier. Data and automation increasingly allow companies large and small to reach millions of people every month, and to tailor ads to specific groups based on their browsing habits or demographics.
Now, however, the marketing industry faces a moral quandary amid a national debate over the role that fake news played in the presidential election and the realization that many websites promoting false and misleading stories are motivated by the money they can make from online advertising.
from National Bureau of Economic Research; Garret S. Christensen, Edward Miguel
There is growing interest in enhancing research transparency and reproducibility in economics and other scientific fields. We survey existing work on these topics within economics, and discuss the evidence suggesting that publication bias, inability to replicate, and specification searching remain widespread in the discipline. We next discuss recent progress in this area, including through improved research design, study registration and pre-analysis plans, disclosure standards, and open sharing of data and materials, drawing on experiences in both economics and other social sciences. We discuss areas where consensus is emerging on new practices, as well as approaches that remain controversial, and speculate about the most effective ways to make economics research more credible in the future.
The University of Oregon proposes to graduate a new type of techie who knows how to find, analyze and map trends using big sets of data.
The UO’s geography department hopes to launch a new bachelor’s degree in Spatial Data Science and Technology next fall, an “absolutely booming” field, according to UO officials.
from Proceedings of the National Academy of Sciences; Isabel M. Kloumann, Johan Ugander, and Jon Kleinberg
Methods for ranking the importance of nodes in a network have a rich history in machine learning and across domains that analyze structured data. Recent work has evaluated these methods through the “seed set expansion problem”: given a subset S of nodes from a community of interest in an underlying graph, can we reliably identify the rest of the community? We start from the observation that the most widely used techniques for this problem, personalized PageRank and heat kernel methods, operate in the space of “landing probabilities” of a random walk rooted at the seed set, ranking nodes according to weighted sums of landing probabilities of different length walks. Both schemes, however, lack an a priori relationship to the seed set objective. In this work, we develop a principled framework for evaluating ranking methods by studying seed set expansion applied to the stochastic block model. We derive the optimal gradient for separating the landing probabilities of two classes in a stochastic block model and find, surprisingly, that under reasonable assumptions the gradient is asymptotically equivalent to personalized PageRank for a specific choice of the PageRank parameter α that depends on the block model parameters. This connection provides a formal motivation for the success of personalized PageRank in seed set expansion and node ranking generally. We use this connection to propose more advanced techniques incorporating higher moments of landing probabilities; our advanced methods exhibit greatly improved performance, despite being simple linear classification rules, and are even competitive with belief propagation.
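The abstract's core observation, that personalized PageRank ranks nodes by a weighted sum of the landing probabilities of a random walk rooted at the seed set, can be sketched in a few lines. The toy graph, seed set, and parameter choices below are hypothetical illustrations, not the paper's experiments:

```python
# Sketch: personalized PageRank as a geometrically weighted sum of the
# landing probabilities of a random walk rooted at a seed set.
# Toy graph and parameters are assumptions for illustration only.

def landing_probabilities(adj, seed, steps):
    """Return [x_0, ..., x_steps], where x_k[v] is the probability that a
    random walk started uniformly in `seed` is at node v after k steps."""
    x = {v: 0.0 for v in adj}
    for s in seed:
        x[s] = 1.0 / len(seed)
    out = [dict(x)]
    for _ in range(steps):
        nxt = {v: 0.0 for v in adj}
        for u, nbrs in adj.items():
            for v in nbrs:
                nxt[v] += x[u] / len(nbrs)  # uniform step to a neighbor
        x = nxt
        out.append(dict(x))
    return out

def personalized_pagerank(adj, seed, alpha=0.15, steps=50):
    """PPR(v) = alpha * sum_k (1 - alpha)^k * x_k[v], truncated at `steps`."""
    probs = landing_probabilities(adj, seed, steps)
    return {v: alpha * sum((1 - alpha) ** k * xk[v] for k, xk in enumerate(probs))
            for v in adj}

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3,
# a tiny stand-in for two communities in a stochastic block model.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
scores = personalized_pagerank(adj, seed={0})
ranking = sorted(adj, key=scores.get, reverse=True)
```

With seed {0}, the ranking places the seed's own triangle {0, 1, 2} ahead of the other community, which is the seed set expansion behavior the abstract analyzes: early, heavily weighted walk steps stay near the seed, so its community dominates the weighted sum.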
Portfolio managers at hedge funds have another thing to worry about: the $2 million data scientist.
Matt Ober, who left WorldQuant for Third Point, will be paid more than $2 million by Dan Loeb’s hedge fund, according to a breach-of-contract claim filed by his former employer. Ober, 32, who starts next month as Third Point’s chief data scientist, said in a filing that he will be paid a base salary of $200,000, the same as WorldQuant gave him, plus bonuses, and disputed that $2 million in compensation is guaranteed.
Loeb is joining other boldface hedge fund names in developing big data and quantitative investing to boost returns. Scientists and coders who mine, clean and model information are in high demand after being relegated for years to back-office status. Experienced data scientists can earn $500,000 to $700,000, and those with extensive backgrounds can earn as much as three times that, according to recruiter Alexey Loganchuk.
Seattle, WA: The 2017 Neural Computation and Engineering Connection will be held on the afternoon of Thursday, January 19 and all day Friday, January 20. [free, registration required]
The Urban Science Intensive partners graduate student teams with a faculty advisor and a public agency or private sector organization that is looking to address a critical urban issue. Deadline for project proposals is Friday, January 20.
from arXiv, Statistics > Other Statistics; Stephanie C. Hicks, Rafael A. Irizarry
Demand for data science education is surging and traditional courses offered by statistics departments are not meeting the needs of those seeking this training. This has led to a number of opinion pieces advocating for an update to the Statistics curriculum. The unifying recommendation is that computing should play a more prominent role. We strongly agree with this recommendation, but advocate that the main priority is to bring applications to the forefront as proposed by Nolan and Speed (1999). We also argue that the individuals tasked with developing data science courses should not only have statistical training, but also have experience analyzing data with the main objective of solving real-world problems. Here, we share a set of general principles and offer a detailed guide derived from our successful experience developing and teaching data science courses centered entirely on case studies. We argue for the importance of statistical thinking, as defined by Wild and Pfannkuch (1999) and describe how our approach teaches students three key skills needed to succeed in data science, which we refer to as creating, connecting, and computing. This guide can also be used for statisticians wanting to gain more practical knowledge about data science before embarking on teaching a course.
This tutorial will explore statistical learning, that is, the use of machine learning techniques with the goal of statistical inference: drawing conclusions from the data at hand.
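A minimal sketch of what that distinction looks like in practice: fit a predictive model, then read its fitted parameters as inferential statements about the data rather than using them only for prediction. The dataset and variable names below are invented for illustration, not taken from the tutorial:

```python
# Statistical learning sketch: ordinary least squares by the closed-form
# solution for y = a + b*x, then interpreting the fitted slope.
# Data below is hypothetical, for illustration only.

def ols_fit(xs, ys):
    """Return (intercept, slope) minimizing squared error for y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 74]
intercept, slope = ols_fit(hours, score)
# Inference step: the fitted slope estimates the association between one
# additional hour of study and the score, for this sample.
```

The "machine learning" part is the fitting procedure; the "statistical inference" part is treating the slope as a conclusion about the relationship in the data, which is the framing the tutorial description uses.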