Data Science newsletter – December 12, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for December 12, 2017


 
 
Data Science News



Report: Securing Social Science and Humanities Government Data

SSRC, Parameters



Social science has long depended on data generated from government sources—from the US census to climate and environmental data—to address problems and advance informed policy for the public good. In a new essay concluding Items’ “Just Environments” series, Lindsey Dillon and Christopher Sellers from the Environmental Data and Governance Initiative (EDGI) identify the dangers of losing access to those data: “Erasure of data and other information from government websites can also sow doubt and uncertainty on climate change and other important environmental issues.” EDGI was formed in response to threats to scholarly access to government data, and the essay outlines how members are securing knowledge necessary for their work, including “data rescue” events, careful monitoring of data sources, and collaboration with federal employees. The Items essay is likely relevant to Parameters readers, and is worth reading in full. You can read it here.

Today, alongside EDGI’s Items essay, the Council is publishing a report, Securing Social Science and Humanities Government Data, on the SSRC website.


Twitter can reveal our shared mood

University of Bristol



In the largest study of its kind, researchers from the University of Bristol have analysed mood indicators in text from 800 million anonymous messages posted on Twitter. These tweets were found to reflect strong patterns of positive and negative moods over the 24-hour day.
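
The Bristol analysis comes down to aggregating mood indicators in tweet text by hour of day across the whole collection. As a rough illustration of that kind of hour-of-day aggregation, here is a minimal pandas sketch; the file name, the single precomputed "sentiment" column, and the schema are hypothetical, and the study itself tracked word-frequency mood indicators rather than one score.

```python
import pandas as pd

# Illustrative only: aggregate a per-tweet sentiment score by hour of day
# to look for a 24-hour mood cycle. Assumes a hypothetical tweets.csv with
# "timestamp" and "sentiment" columns; the Bristol study used word-frequency
# mood indicators rather than a single precomputed score.
tweets = pd.read_csv("tweets.csv", parse_dates=["timestamp"])
tweets["hour"] = tweets["timestamp"].dt.hour

hourly_mood = tweets.groupby("hour")["sentiment"].agg(["mean", "count"])
print(hourly_mood)  # average mood and sample size for each of the 24 hours
```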


The Case for Learned Index Structures

arXiv, Computer Science > Databases; Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis



Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that, by using neural nets, we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order of magnitude in memory over several real-world data sets. More importantly though, we believe that the idea of replacing core components of a data management system through learned models has far-reaching implications for future systems designs and that this work just provides a glimpse of what might be possible.
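
The premise lends itself to a toy illustration. Below is a minimal sketch, assuming only NumPy, of learning the key-to-position mapping of a sorted array with a single linear fit and then correcting the prediction with a bounded local search; the paper itself builds a hierarchy of learned models rather than this one-model toy, so treat it purely as an illustration of the idea.

```python
import numpy as np

# Toy "learned index": approximate the mapping key -> position in a sorted
# array with a linear model, record the model's worst-case error, and at
# lookup time search only within that error band around the prediction.
keys = np.sort(np.random.lognormal(mean=0.0, sigma=1.0, size=100_000))
positions = np.arange(len(keys))

slope, intercept = np.polyfit(keys, positions, deg=1)         # fit key -> position
pred = np.clip((slope * keys + intercept).round(), 0, len(keys) - 1).astype(int)
max_err = int(np.max(np.abs(pred - positions)))               # worst-case model error

def lookup(key):
    """Predict a position, then binary-search only inside the error band."""
    guess = int(np.clip(round(slope * key + intercept), 0, len(keys) - 1))
    lo, hi = max(0, guess - max_err), min(len(keys), guess + max_err + 1)
    i = lo + np.searchsorted(keys[lo:hi], key)
    return i if i < len(keys) and keys[i] == key else None

assert lookup(keys[1234]) == 1234
```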


Study: What Skills Does the Typical Data Scientist Have?

Datascience.com, Iliya Valchanov



To gain a better understanding of the typical data scientist profile, 365 Data Science conducted a study in which it collected information from the LinkedIn profiles of 1,001 data science professionals.

Unlike previous publications, the primary source of data was not job ads, which skew findings toward the employers’ point of view. Instead, the research relied on information posted by data scientists themselves, on the assumption that a person’s LinkedIn profile is a good proxy for his or her resume.


No, Google’s AI Program Can’t Build Your Genome Sequence

Forbes, Steven Salzberg



Let’s look at some of the ways that the Google announcement, and the Wired article that followed it, are misleading or over-hyped.

1. The Google program doesn’t assemble genomes. That’s right: even though the Wired piece opens with the promise of “getting the full picture” of your genome, the new Google program, DeepVariant, doesn’t do anything of the sort. DeepVariant is a program for identifying small mutations, mostly changes of a single letter (called SNPs).


Estonia, the Digital Republic

The New Yorker, Nathan Heller



Its government is virtual, borderless, blockchained, and secure. Has this tiny post-Soviet nation found the way of the future?


Facebook app for kids sparks privacy concerns

TheHill; Ali Breland



A new Facebook chat app designed for kids is raising concern among lawmakers and children’s groups over data privacy and safety.

The app, Messenger Kids, is targeted toward children aged 6-12, who are still too young to use Facebook.

Unveiled Monday, it differs from the existing Facebook Messenger app in key ways.

An account can only be set up by a parent, who must also add any contacts for their child. Facebook also won’t advertise to children within the app or sell any data it collects to third-party advertisers. Children don’t need to set up a Facebook account, and after the age of 12 they won’t be pushed onto the adult app.


NYU Team Triumphs in U.N. Data for Climate Action Challenge

NYU Tandon School of Engineering



An assistant professor in Tandon’s Department of Civil and Urban Engineering and the NYU Center for Urban Science and Progress (CUSP), who directs the Urban Intelligence Lab and serves as Deputy Director for Academics at CUSP, recently led a team that participated in the U.N. Data for Climate Action Challenge, a call to the world’s data scientists, researchers, analysts, and innovators to develop viable, data-driven climate solutions.

The team, which included Yuan Lai, Bartosz Bonczak, Boyeong Hong, Sokratis Papadopoulos, Awais Malik, and Nick Johnson, developed a first-of-its-kind high-resolution spatial-temporal model of urban greenhouse gas emissions from buildings, transportation systems, and point sources such as power generation facilities and industrial plants.


New Model Predicts Lightning Strikes; Alert System to Follow

Eos, Kimberly M.S. Carter



The model’s development was sparked by a meeting between lightning experts and scientists working with big data and data mining, said Kristin Calhoun, lead project scientist on the model and alert system. After the meeting, “it clicked with me that we could actually use these techniques to predict lightning,” said Calhoun, a research scientist at the University of Oklahoma’s Cooperative Institute for Mesoscale Meteorological Studies in Norman and the National Oceanic and Atmospheric Administration’s (NOAA) National Severe Storms Laboratory (NSSL), also in Norman.


Research on Applying Deep Learning to Long-Term Investing

Euclidean Technologies, Zachary Lipton



In this paper, we describe how we use deep learning techniques to predict future company fundamental data, such as earnings, revenue, and debt, from past fundamental data. We also show that these predictions can be used to meaningfully improve the investment performance of widely researched and commercially applied quantitative investment strategies that use valuation ratios.

The results reflect our use of long short-term memory (LSTM) recurrent neural networks and multilayer perceptrons (MLPs) to predict future company fundamentals from a time series of past company fundamentals. We were motivated by the intuition that future fundamentals should have a bigger impact than current fundamentals on a stock’s future price. We confirm this intuition through an empirical study demonstrating that knowledge of future fundamentals (a clairvoyant prediction) would, if it were possible, dramatically improve the investment performance of simulated portfolios. We then show that portfolios constructed with valuation ratios that use predicted fundamentals outperform portfolios constructed with valuation ratios based on current fundamentals.
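
As a rough sketch of the general shape of such a model (and emphatically not Euclidean’s actual architecture or data), here is a minimal Keras example that maps a window of past fundamentals to the next period’s fundamentals with an LSTM; the panel dimensions, random data, and hyperparameters are all made up.

```python
import numpy as np
from tensorflow import keras

# Illustrative sketch: predict next-period fundamentals from a window of past
# fundamentals with an LSTM. Shapes, data, and hyperparameters are invented;
# this is not Euclidean Technologies' model.
n_companies, n_quarters, n_features = 500, 20, 8
X = np.random.randn(n_companies, n_quarters, n_features).astype("float32")
y = np.random.randn(n_companies, n_features).astype("float32")  # next-period fundamentals

model = keras.Sequential([
    keras.layers.LSTM(64, input_shape=(n_quarters, n_features)),
    keras.layers.Dense(n_features),   # predict all fundamentals jointly
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)
```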


As at home DNA tests become more common, people must grapple with surprises about their parents

CNBC, Christina Farr



Until recently, Andrea Ramirez, 43, thought she was part Mexican.

But the results from an at-home genetic test from 23andMe revealed that she is a mix of Northern European, North African and a little Native American.


Tempus unveils a standalone tool for structuring clinical data at scale

MedCity News, Juliet Preston



Why do some patients respond to immuno-oncology drugs when others don’t?

It’s one of many million-dollar questions in medicine that confound companies, researchers, and clinicians alike. And the really frustrating part? We know where many of the answers lie. They’re trapped in electronic health records (EHRs) and siloed by disparate health systems.

Chicago, Illinois-based Tempus is working to extract that information at scale. The two-year-old company recently began offering an operating system, dubbed Tempus O, designed to structure, cleanse, and annotate clinical data.


New Global Health Institute announced at Yale

Yale University, YaleNews



The new Yale Institute for Global Health (YIGH), approved by the Yale Corporation on Dec. 8, further advances President Salovey’s goal for the university to have a greater impact on complex international issues. Led by the Schools of Medicine, Nursing, and Public Health, YIGH is a university-wide effort to address global health issues, and will serve as the focal point for research, education, and engagement with global partners to improve the health of individuals and populations worldwide.

“It is critical for Yale to have a substantial role in addressing health problems that face populations around the globe, including in the United States,” said Ann Kurth, dean of the Yale School of Nursing. “We want to harvest all the talent and the distinct assets we have across the university, so we no longer work in a distributed way but with a more cohesive, interdisciplinary approach to make a deeper impact with our global initiatives. No one discipline can solve global health problems. YIGH will provide a catalyzing center for these collaborations.”


Government Data Science News

The FCC has voted to repeal net neutrality, setting itself up for lawsuits from a long and growing list of states. Before the vote, a group of 21 people who had a hand in creating the internet wrote a letter explaining the technical folly of removing net neutrality and offering “nearly a dozen different examples of consumer harm” that could result. Ajit Pai ignored it.

Arizona is considering passing legislation that could severely restrict AI research in the state, even though it was written with the intention of curtailing ticket scalping. “Sen. John Kavanagh, R-Fountain Hills, wants to make it a crime to use any computer or software that conceals its real identity ‘to simulate or impersonate the action of a human.’ Violations would be a felony subjecting the owners or operators to a year in state prison.” Felony charges? A year in prison? Data scientists in Arizona, please go see your university’s legal team immediately.



New York will become the first city in the country to adopt legislation promoting algorithmic fairness, accuracy, and transparency. It will stand up a task force to assess agencies’ existing algorithms and their political implications. The task force will be composed of experts on transparency and fairness, along with staff from non-profits.



Googler Christopher Shallue and UT-Austin postdoc Andrew Vanderburg used NASA’s Kepler data to find an eighth planet orbiting Kepler-90, a sun-like star. The new planet is about 30% larger than Earth, rocky, and very hot (>800 degrees F).


Artificial Intelligence and Supercomputers to Help Alleviate Urban Traffic Problems

The University of Texas at Austin, UT News



Look above the traffic light at a busy intersection in your city and you will probably see a camera. These devices may have been installed to monitor traffic conditions and provide visuals in the case of a collision. But can they do more? Can they help planners optimize traffic flow or identify sites that are most likely to have accidents? And can they do so without requiring individuals to slog through hours of footage?

Researchers from the Texas Advanced Computing Center (TACC), the University of Texas Center for Transportation Research and the City of Austin believe so. Together, they are working to develop tools that allow sophisticated, searchable traffic analyses using deep learning and data mining.

At the IEEE International Conference on Big Data this month, they will present a new deep learning tool that uses raw traffic camera footage from City of Austin cameras to recognize objects – people, cars, buses, trucks, bicycles, motorcycles and traffic lights – and characterize how those objects move and interact. This information can then be analyzed and queried by traffic engineers and officials to determine, for instance, how many cars drive the wrong way down a one-way street.
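
Once the detector and tracker have turned footage into structured records, a question like the wrong-way count becomes ordinary filtering and aggregation. A hypothetical pandas sketch, assuming a detections table with per-frame positions and track IDs (not the team’s actual schema):

```python
import pandas as pd

# Hypothetical structured output of a detector + tracker: one row per tracked
# object per frame. File name and columns are illustrative, not the
# TACC / City of Austin format.
det = pd.read_csv("detections.csv")   # columns: track_id, frame, cls, x, y

# Net movement of each tracked car along the street axis (here, y).
cars = det[det["cls"] == "car"].sort_values("frame")
net_travel = cars.groupby("track_id")["y"].agg(lambda y: y.iloc[-1] - y.iloc[0])

# If legal travel on this one-way street is toward increasing y, a negative
# net movement means the car drove the wrong way.
wrong_way = int((net_travel < 0).sum())
print(f"{wrong_way} cars drove against the one-way direction")
```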


University Data Science News

New York City-area students, please consider taking my Ethics of Data Science class next semester. It’s on Thursdays from 4:55 – 7:05, reading the newsletter satisfies the pre-req, and it will equip students with the critical thinking skills to be *leaders* in data science. Leaders know how to answer the toughest questions: Should we build this? What are we missing? How will a new technology fit into existing human organizational, cultural, ethical, and legal contexts?
Hit reply and let me know if you want to sign up. It is open to NYU, Columbia, Princeton, CUNY, and Fordham students.

The New Yorker has profiled Jim Simons. He’s the man behind the Simons Foundation, which he funded by killing it at Renaissance Technologies, a hedge fund he ran with a bunch of PhD physicists and mathematicians. The Simons Foundation maintains that fondness for physics, but is also pursuing computational research in genomics. Considering a job at the Simons Foundation-supported Flatiron Institute? Note that Simons (the person) smokes constantly. Indoors. Figures. He can afford whatever the fine is.



If you aren’t following Yann LeCun on Facebook, you’re missing all the times he “cringes” at AI hyperbole. This week it was about the idea that “neural networks copy the human brain” which is just not right because neural nets are “loosely inspired by some aspects of the brain, just as airplanes are loosely inspired by birds.” Plus, he is a huge star here at the newsletter, having posted a link to our sign up way back in 2015.



A team from NYU Center for Urban Science and Progress won the UN Climate Data Challenge with a visualization project that combines traffic data, tree locations, weather, and building GHG emissions to estimate real-time exposure to carbon emissions in New York City.

Lightning can strike the same place twice, but when? A new model predicts where lightning will strike next thanks to University of Oklahoma and NOAA scientist Kristin Calhoun. An alert system is in the works which would highlight areas with a high likelihood of being struck up to an hour in advance.



MIT Media Lab Social Machines group researchers are using deep neural nets trained on movie data to predict and manipulate humans’ emotions. The goal, I guess, is to create more emotionally gripping movies. Is anyone else unsettled by this application? Surely human emotions can be manipulated – I don’t doubt that – but do we want to perfect this? Imagine using it in political campaign season, the way Cambridge Analytica implied they could do last cycle. I don’t know if they were super successful because they aren’t sharing their results, but I have a feeling they promised more than they delivered.



The NIPS community is doing some soul-searching around inclusivity, its tolerance of bad behavior, and short-sighted planning. It’s no fun to hear the stories of exclusion, but it *is* good that the conversations can happen and are likely to lead to positive changes, such as child care for parents. I, for one, love having kids and babies at conferences.



Along those lines, a new study of 23,918 grant applications submitted over five years to the Canadian Institutes of Health Research found that “female grant applicants are equally successful when peer reviewers assess the science, but not when they assess the scientist”. The gap in success rates jumped from 0.9% in favor of men to 4% after the review process was amended to explicitly consider the PI’s “caliber”.



Meanwhile, I have to serve as my floor’s designated fire safety “female searcher” because I’m the only woman employee who’s there every day. When is it appropriate to suggest that we may need to instate “female searching” every day in order to locate all the missing women in data science?



The University of Maryland’s machine learning program will open a 7,500-square-foot innovation lab adjacent to campus in partnership with Capital One. The company will make an initial $3 million investment to endow a chair. The Department of Commerce kicked in an additional $2.1 million, with which the university was able to endow two more professorships. (This is a lesson in charging entities what we perceive they can afford.)



Let’s keep that fundraising lesson alive: Bill and Melinda Gates donated $15 million to the University of Washington in Seattle for a new computer science building, which hit a construction milestone last week. The gift will allow the CS department to double enrollment, a major imperative as undergraduate interest in CS continues to skyrocket (up 74% between 2009 and 2015). The building will be named after Bill Gates. Microsoft was also a major donor to the building.



A gripping report by CS professors at Stanford warns that the brain drain to “voracious industry” means that, “hiring and retaining CS faculty is currently an acute challenge that limits institutions’ abilities to respond to increasing CS enrollments.” They also point out that even though universities have increased hiring, it dramatically lags the growth in enrollment. Extrapolating from current trends in CS PhD job placement, they find, “institutions can therefore expect to hire roughly 0.2 new Ph.D.s per year, or one Ph.D. every five years”. The vast majority of those who stay in academia will go to R1 schools, leaving only 17 graduates to fill positions at the other 1462 schools. They note this crisis is likely to exacerbate rather than alleviate the lack of gender and racial diversity in the field.



The University of British Columbia poached Harvard CS professor Margo Seltzer by drawing on funds from the Canadian federal government through the Canada 150 Research Chairs program. She’ll get $1m per year for seven years in research funds. This, friendly readers, is what the high-stakes game of CS hiring will often look like going forward.


Pandas: Meet Wes McKinney, the man behind the most important tool in data science

Quartz, Dan Kopf



Wes McKinney hates the idea of researchers wasting their time. “Scientists unnecessarily dealing with the drudgery of simple data manipulation tasks makes me feel terrible,” he says.

Perhaps more than any other person, McKinney has helped fix that problem. McKinney is the developer of “Pandas”, one of the main tools used by data analysts working in the popular programming language Python.

Millions of people around the world use Pandas. In October 2017 alone, Stack Overflow, a website for programmers, recorded 5 million visits to questions about Pandas from more than 1 million unique visitors.
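
For readers who have not used it, here is a minimal example of the routine data manipulation pandas is built to make painless; the file and column names are made up for illustration.

```python
import pandas as pd

# Everyday drudgery pandas absorbs: load, clean, group, summarize.
# "survey.csv" and its columns are hypothetical.
df = pd.read_csv("survey.csv")
df = df.dropna(subset=["salary"])                 # drop rows missing a key field
by_role = df.groupby("job_title")["salary"].median().sort_values(ascending=False)
print(by_role.head(10))                           # top roles by median salary
```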

 
Deadlines



Uptake.org Data Fellows program

The “Data Fellows program is a six month fellowship designed to connect data professionals at non-profits, foundations, and social enterprises to experts in data science. The fellowship begins with a week in Chicago to meet with mentors, learn hard skills in data science, and network with like-minded data-for-good professionals.” Deadline for applications is December 22.

Next Generation of Electronic Health Records workshop

Copenhagen, Denmark. The workshop is March 21, 2018. Deadline for submissions is February 15, 2018.
 
Tools & Resources



Turi Create

Apple



Turi Create simplifies the development of custom machine learning models. You don’t have to be a machine learning expert to add recommendations, object detection, image classification, image similarity or activity classification to your app.
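
A minimal sketch of the kind of workflow Turi Create advertises, for an image classifier trained from a folder of labeled images. The directory layout is assumed, and while the calls follow Turi Create’s documented pattern, verify the exact API against the current docs.

```python
import turicreate as tc

# Sketch of an image-classification workflow with Turi Create. Assumes an
# "images/" directory whose subfolder names are the class labels; treat the
# exact calls as assumptions to check against the Turi Create documentation.
data = tc.image_analysis.load_images("images/", with_path=True)
data["label"] = data["path"].apply(lambda p: p.split("/")[-2])

train, test = data.random_split(0.8)
model = tc.image_classifier.create(train, target="label")

print(model.evaluate(test))                     # accuracy on held-out images
model.export_coreml("MyClassifier.mlmodel")     # ship the model in an iOS app
```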


How to quickly experiment with Dataflow

Medium, Google Cloud Platform



“One of my colleagues showed me this trick to quickly experiment with Cloud Dataflow” … “The cool new way takes advantage of the Python REPL (the command-line interpreter) and the fact that Python lists can function as a Dataflow source.”
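
A minimal sketch of the trick being described, assuming the Apache Beam Python SDK: a plain Python list fed through beam.Create acts as the source, and the pipeline runs on the local runner, so transforms can be tried interactively before being pointed at Cloud Dataflow.

```python
import apache_beam as beam

# Quick local experiment: an in-memory list is the source (beam.Create) and
# the default DirectRunner executes the pipeline, so the same transforms can
# later run unchanged on Cloud Dataflow.
with beam.Pipeline() as p:
    (
        p
        | "MakeData" >> beam.Create(["alpha", "beta", "gamma", "beta"])
        | "Pair"     >> beam.Map(lambda w: (w, 1))
        | "Count"    >> beam.CombinePerKey(sum)
        | "Print"    >> beam.Map(print)
    )
```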


Social Media, Open Science, and Data Science Are Inextricably Linked

Neuron; Bradley Voytek



Should scientists use social media? Why practice open science? What is data science? Ten years ago, these phrases hardly existed. Now they are ubiquitous. Here I argue that these phenomena are inextricably linked and reflect similar underlying social and technological transformations. [full text]

 
Careers


Postdocs

Postdoctoral Researchers



University of California-Berkeley, Foundations of Data Analysis Institute; Berkeley, CA
