Data Science newsletter – August 3, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for August 3, 2017


Data Science News

LinkedIn data can predict how likely you are to quit, and it’s being sued to keep it public

Quartz, Oliver Staley


LinkedIn has 500 million profiles online, an extraordinary wealth of information about the education and career paths of nearly 7% of all of humanity—and an absolute treasure trove for companies that build recruitment and human resources software. One of them is hiQ Labs, a startup that scrapes LinkedIn data to build an algorithm to predict whether employees will quit.

HiQ relies on the small portion of LinkedIn profiles that are publicly available, and sells its products to employers looking to prevent their best workers from jumping ship. It’s suing LinkedIn—now a unit of Microsoft—to ensure it keeps access to the data, a preemptive strike after LinkedIn sent hiQ a cease-and-desist letter in May, according to the Wall Street Journal (paywall).

LinkedIn says its data is proprietary, and says that hiQ violates hacking statutes by scraping its data. HiQ says LinkedIn is stretching the definition of the law, and is asking a federal judge to declare it has acted legally.

Measuring Social Connectedness

National Bureau of Economic Research; Michael Bailey, Ruiqing (Rachel) Cao, Theresa Kuchler, Johannes Stroebel, Arlene Wong


We introduce a new measure of social connectedness between U.S. county-pairs, as well as between U.S. counties and foreign countries. Our measure, which we call the “Social Connectedness Index” (SCI), is based on the number of friendship links on Facebook, the world’s largest online social networking service. Within the U.S., social connectedness is strongly decreasing in geographic distance between counties: for the population of the average county, 62.8% of friends live within 100 miles. The populations of counties with more geographically dispersed social networks are generally richer, more educated, and have a higher life expectancy. Region-pairs that are more socially connected have higher trade flows, even after controlling for geographic distance and the similarity of regions along other economic and demographic measures. Higher social connectedness is also associated with more cross-county migration and patent citations. Social connectedness between U.S. counties and foreign countries is correlated with past migration patterns, with social connectedness decaying in the time since the primary migration wave from that country. Trade with foreign countries is also strongly related to social connectedness. These results suggest that the SCI captures an important role of social networks in facilitating both economic and social interactions. Our findings also highlight the potential for the SCI to mitigate the measurement challenges that pervade empirical research on the role of social interactions across the social sciences.

Op-ed: Should Artificial Intelligence Be Regulated?

Future of Life Institute; Anthony Aguirre, Ariel Conn, and Max Tegmark


Should artificial intelligence be regulated? Can it be regulated? And if so, what should those regulations look like?

These are difficult questions to answer for any technology still in development stages – regulations, like those on the food, pharmaceutical, automobile and airline industries, are typically applied after something bad has happened, not in anticipation of a technology becoming dangerous. But AI has been evolving so quickly, and the impact of AI technology has the potential to be so great that many prefer not to wait and learn from mistakes, but to plan ahead and regulate proactively.

In a Robot Economy, All Humans Will Be Marketers

Bloomberg View, Tyler Cowen


Let’s consider the ATM. Contrary to what many people think, the widespread adoption of automated teller machines in the 1990s didn’t significantly diminish the demand for bank tellers. ATMs made bank branches easier and cheaper to operate, and that led banks to hire more staff, including tellers.

These tellers play a smaller role in counting cash and handling deposits than before, so what are they doing instead? Economist James Bessen explained: “Their ability to market and their interpersonal skills in terms of dealing with bank clients has become more important. So the transition — what the ATM machine did was effectively change the job of the bank teller into one where they are more of a marketing person. They are part of what banks call the ‘customer relationship team.’”

Data Science: Challenges and Directions

Communications of the ACM, Longbing Cao


While data science has emerged as an ambitious new scientific field, related debates and discussions have sought to address why science in general needs data science and what even makes data science a science. However, few such discussions concern the intrinsic complexities and intelligence in data science problems and the gaps in and opportunities for data science research. Following a comprehensive literature review [5,6,10,11,12,15,18], I offer a number of observations concerning big data and the data science debate. For example, discussion has covered not only data-related disciplines and domains like statistics, computing, and informatics but traditionally less data-related fields and areas like social science and business management as well. Data science has thus emerged as a new inter- and cross-disciplinary field. Although many publications are available, most (likely over 95%) concern existing concepts and topics in statistics, data mining, machine learning, and broad data analytics. This limited view demonstrates how data science has emerged from existing core disciplines, particularly statistics, computing, and informatics. The abuse, misuse, and overuse of the term “data science” is ubiquitous, contributing to the hype, and myths and pitfalls are common [4]. While specific challenges have been covered [13,16], few scholars have addressed the low-level complexities and problematic nature of data science or contributed deep insight about the intrinsic challenges, directions, and opportunities of data science as an emerging field.

The Natural Science of Computing

Communications of the ACM; Dominic Horsman, Vivien Kendon, Susan Stepney


Technology changes science. In 2016, the scientific community thrilled to news that the LIGO collaboration had detected gravitational waves for the first time. LIGO is the latest in a long line of revolutionary technologies in astronomy, from the ability to ‘see’ the universe from radio waves to gamma rays, or from detecting cosmic rays and neutrinos (the Laser Interferometer Gravitational-Wave Observatory—LIGO—is an NSF-supported collaborative effort by the U.S. National Science Foundation and is operated by Caltech and MIT). Each time a new technology is deployed, it can open up a new window on the cosmos, and major new theoretical developments can follow rapidly. These, in turn, can inform future technologies. This interplay of technological and fundamental theoretical advance is replicated across all the natural sciences—which include, we argue, computer science. Some early computing models were developed as abstract models of existing physical computing systems. Most famously, for the Turing Machine these were human ‘computers’ performing calculations. Now, as novel computing devices—from quantum computers to DNA processors, and even vast networks of human ‘social machines’—reach a critical stage of development, they reveal how computing technologies can drive the expansion of theoretical tools and models of computing. With all due respect to Dijkstra, we argue that computer science is as much about computers as astronomy is about telescopes.

Cognitive scientist calls for integration in language sciences

Cornell Chronicle


In a new opinion piece in a major publication, Morten Christiansen, professor of psychology, describes how the study of language has fragmented into many highly-specialized areas of study that tend not to talk to each other. He calls for a new era of integration in the paper, published July 31 in Nature Human Behaviour.

The crux of the problem is that each area of study proceeds in isolation from the others, he said.

“For example, people working on language evolution often know little about how language develops and is processed,” Christiansen said. “And researchers working on language acquisition are typically not aware of how the system they’re studying works in adulthood and how it might have evolved in our species.”

We are still waiting for the robot revolution

Tim Harford


James Bessen of Boston University points out that the ATM did not, in fact, replace bank tellers — there are more bank teller jobs in the US now than when the ATM was introduced.

This should not entirely be a surprise: the original story of the cash machine is that its inventor John Shepherd-Barron had the door of his local bank slammed in his face on a Saturday lunchtime, and was frustrated that there was no way to get his money until Monday morning. Mr Shepherd-Barron didn’t invent a replacement for human tellers so much as a way to get cash at any time of the day or night. Banks opened more branches and employed humans to cross-sell loans, mortgages and credit cards instead. The automated teller worked alongside more human tellers than ever.

The ATM is no outlier here. Mr Bessen found that in the 19th century, 98 per cent of the labour required to weave cloth was automated — yet employment in the weaving industry increased as our demand for clothes more than offset the labour-saving automation.
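The weaving arithmetic can be made concrete with a minimal sketch. It assumes, purely for illustration, that employment is proportional to labour per unit of cloth times total output; this is a simplification, not Mr Bessen's model, and the output multiples below are hypothetical.

```python
from fractions import Fraction

# Share of the original labour still needed per unit of cloth
# once 98 per cent of it has been automated.
labour_per_unit = 1 - Fraction(98, 100)  # 1/50


def employment_ratio(output_growth):
    """Employment after automation relative to before, for a given output multiple."""
    return labour_per_unit * output_growth


# Output multiple at which employment is unchanged (illustrative assumption).
break_even_growth = 1 / labour_per_unit
```

Under these assumptions, output has to grow more than fiftyfold before weaving employment rises, which gives a sense of how large the increase in demand for cloth must have been.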

Facebook and Google Policing the Web Will Do More Harm Than Good

WIRED, Business, Tara Wadhwa and Gabriel Ng


The influence and proliferation of extremist content, hate speech, and state-sponsored propaganda on the internet has risen around the globe, as demonstrated by Russia’s involvement in the US election and the rise of ISIS recruitment online. As a result, the pressure that governments, media, and civil society are placing on technology companies to take meaningful action to stem the flow of this content is at an all-time high.

A recent law passed in Germany will require social media companies like Facebook and Twitter to remove illegal, racist, or slanderous content within 24 hours after it’s flagged by a user, or face fines as large as $57 million. Although this legislation was passed overseas, its effects will be felt stateside, as the sites that will bear the brunt of the law are American.

Amazon’s New Robo-Picker Champion Is Proudly Inhuman

MIT Technology Review, Jamie Condliffe


A robot that owes rather a lot to an annoying arcade game has captured victory in Amazon’s annual Robotics Challenge.

E-commerce companies like Amazon and Ocado, the world’s largest online-only grocery retailer, currently boast some of the most heavily automated warehouses in the world. But items for customers’ orders aren’t picked by robots, because machines cannot yet reliably grasp a wide range of different objects.

That’s why Amazon gathers together researchers each year to test out machines that pick and stow objects.

What analysis programs drive conservation science?

Christopher J. Brown, Seascape Models blog


With the International Congress for Conservation Biology on at the end of July I was wondering, what analysis programs are supporting conservation science? And, what programs support spatial analysis and mapping?

I ran a quick poll on my blog (you can take it here) to find out. Here are the results (as of 30th July 2017). Voters are allowed to pick multiple categories.

I think the results are informative, particularly if you are a scientist in training and are wondering what programs to learn.

Nvidia and Remedy use neural networks for eerily good facial animation

Ars Technica, Mark Walton


Remedy, the developer behind the likes of Alan Wake and Quantum Break, has teamed up with GPU-maker Nvidia to streamline one of the more costly parts of modern games development: motion capture and animation. As showcased at Siggraph, by using a deep learning neural network—run on Nvidia’s costly eight-GPU DGX-1 server, naturally—Remedy was able to feed in videos of actors performing lines, from which the network generated surprisingly sophisticated 3D facial animation. This, according to Remedy and Nvidia, removes the hours of “labour-intensive data conversion and touch-ups” that are typically associated with traditional motion-capture animation.

[1707.09476] FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in City Cameras

arXiv, Computer Science > Computer Vision and Pattern Recognition; Shanghang Zhang, Guanhang Wu, João P. Costeira, José M. F. Moura


In this paper, we develop deep spatio-temporal neural networks to sequentially count vehicles from low quality videos captured by city cameras (citycams). Citycam videos have low resolution, low frame rate, high occlusion and large perspective, making most existing methods lose their efficacy. To overcome limitations of existing methods and incorporate the temporal information of traffic video, we design a novel FCN-rLSTM network to jointly estimate vehicle density and vehicle count by connecting fully convolutional neural networks (FCN) with long short term memory networks (LSTM) in a residual learning fashion. Such design leverages the strengths of FCN for pixel-level prediction and the strengths of LSTM for learning complex temporal dynamics. The residual learning connection reformulates the vehicle count regression as learning residual functions with reference to the sum of densities in each frame, which significantly accelerates the training of networks. To preserve feature map resolution, we propose a Hyper-Atrous combination to integrate atrous convolution in FCN and combine feature maps of different convolution layers. FCN-rLSTM enables refined feature representation and a novel end-to-end trainable mapping from pixels to vehicle count. We extensively evaluated the proposed method on different counting tasks with three datasets, with experimental results demonstrating their effectiveness and robustness. In particular, FCN-rLSTM reduces the mean absolute error (MAE) from 5.31 to 4.21 on TRANCOS, and reduces the MAE from 2.74 to 1.53 on WebCamT. Training process is accelerated by 5 times on average.
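The residual formulation described in the abstract can be illustrated with a small sketch: the count for a frame starts from the sum of the estimated density map, and a learned residual corrects it. The function name, the toy density values, and the residual value are all hypothetical; in the paper the density map comes from the FCN and the residual from the LSTM, neither of which is modelled here.

```python
def count_from_density(density_map, residual=0.0):
    """Vehicle count = sum of per-pixel densities + a residual correction."""
    base_count = sum(sum(row) for row in density_map)
    return base_count + residual


# Toy 3x3 density map (made-up values, not from the paper).
density = [
    [0.0, 0.5, 0.5],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0],
]

base = count_from_density(density)                        # density sum only
corrected = count_from_density(density, residual=-0.25)   # with a hypothetical residual
```

Because the density sum is already close to the true count, the recurrent part only needs to learn a small correction, which is one plausible reading of why the authors report faster training.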

$9M grant will create neurotech research hub at Cornell

Cornell Chronicle


As neuroscientists examine challenging questions about the complexities of the central nervous system, new tools to be developed at Cornell will provide them with an unprecedented glimpse into the inner workings of the brain thanks to a five-year, $9 million grant from the National Science Foundation.

The grant will establish the Cornell Neurotechnology NeuroNex Hub, which will focus on researching, developing and disseminating new optical imaging tools for noninvasive recording of neural activity in animals.

Elsevier Acquires bepress

The Scholarly Kitchen, Roger C. Schonfeld


Today, Elsevier announces its acquisition of bepress. In a move entirely consistent with its strategy to pivot beyond content licensing to preprints, analytics, workflow, and decision-support, Elsevier is now a major if not the foremost single player in the institutional repository landscape. If successful, and there are some risks, this acquisition will position Elsevier as an increasingly dominant player in preprints, continuing its march to adopt and coopt open access.





San Francisco, CA October 10. The recurrent theme for this annual event: “Dare to Know.” [$$$]

Women of Color in Political Science Workshop 2017

Women of Color in Political Science


San Francisco, CA Wednesday, August 30 (pre-APSA).

Summer Quarter Research Data Management Workshop @ UW Libraries

University of Washington, UW Libraries


Online. From August 14-17, the UW Libraries is offering Data Management Planning, an asynchronous online workshop for UW community members engaged in research with data. Topics will include getting started with data management planning, funder requirements for data sharing, metadata, tips to help keep you organized, sharing, archiving and preservation, and an introduction to tools and on-campus support to aid researchers.


Northeast Workshop in Empirical Political Science, Call for Papers

Princeton, NJ Our ninth meeting will be a half-day event held on September 11. Limited travel resources will be available for out-of-town presenters only. Deadline for submissions is August 7.

TDWI Analytics Accelerator Award

Have you completed an analytics or data science project that you think is a game changer for your company? Saddle up and submit your entry for TDWI’s Analytics Accelerator Award. Deadline for entries is August 17.

MSc in Social Data Science, University of Oxford

The MSc in Social Data Science provides the ability to collect, manipulate, and analyse large volumes of structured and unstructured data, which is becoming a core social science skill that is in high demand on the job market. Deadline to apply is September 17.
Tools & Resources

All your streaming data are belong to Kafka

InfoWorld, Matt Asay


Apache Kafka continues its ascent as attention shifts from lumbering Hadoop and data lakes to real-time streams

Beta release of PyKE v3.0 out now

NASA, Kepler & K2, Zé Vinícius


A new major version of PyKE has been released in beta. PyKE is a Python-based set of data analysis tools that offers a user-friendly way to inspect and analyze the pixels and lightcurves obtained by Kepler and K2.

Lesser known dplyr tricks

Econometrics and Free Software blog


In this blog post I share some lesser-known (at least I believe they are) tricks that use mainly functions from dplyr.

A Practical Guide to Tree Based Learning Algorithms

Sadanand Singh, Sadanand's Notes blog


Tree based learning algorithms are quite common in data science competitions. These algorithms empower predictive models with high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well. Common examples of tree based models are: decision trees, random forest, and boosted trees.
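To illustrate the claim about non-linear relationships, here is a minimal sketch in plain Python that fits a one-split regression tree (a "stump") and an ordinary least squares line to the same step-shaped toy data. The helper names and the data are made up for this example and are not taken from the linked guide; the stump fitter assumes at least two distinct x values.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid threshold between equal x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - slope * x_mean, slope


# Step-shaped toy data: y jumps from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

threshold, left_mean, right_mean = fit_stump(xs, ys)
intercept, slope = fit_line(xs, ys)

# Sum of squared errors of each model on the training data.
stump_sse = sum((y - (left_mean if x <= threshold else right_mean)) ** 2
                for x, y in zip(xs, ys))
line_sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
```

On this data the stump places its split between the two plateaus and reproduces the step exactly, while the straight line cannot. Random forests and boosted trees combine many such splits to capture more complex shapes.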

Citing Open Source Code

Dan Foreman-Mackey


PSA: Please use the preferred citation method for any open source code that you use in a paper – it’s an academic tool builder’s livelihood!

The Ins and Outs of Deep Learning with Apache Spark

The New Stack, TC Currie


Developing for deep learning requires a specialized set of expertise, explained Databricks software engineer Tim Hunter during the recent NVIDIA GPU Technology Conference in San Jose.

Databricks was founded in 2013 to help people build big data platforms using the Apache Spark data processing framework. It provides Spark-as-a-Platform and expertise in deep learning using GPUs, which can greatly assist in speeding up deep learning jobs.

There are multiple ways to integrate Spark and deep learning, but there is currently no consensus on how to best use Spark for deep learning, Hunter said.

Docker vs. Kubernetes vs. Apache Mesos: Why What You Think You Know is Probably Wrong

Mesosphere, Amr Abdelrazik


While all three technologies make it possible to use containers to deploy, manage, and scale applications, in reality they each solve for different things and are rooted in very different contexts. In fact, none of these three widely adopted toolchains is completely like the others.

Instead of comparing the overlapping features of these fast-evolving technologies, let’s revisit each project’s original mission, architectures, and how they can complement and interact with each other.


Full-time positions outside academia

Planetary Protection Officer

NASA, Headquarters; Washington, DC

Service Designer

City of Philadelphia; Philadelphia, PA


Universität Regensburg, Institute of Zoology; Regensburg, Germany

Postdoctoral Fellowship – Supply Chain Commitments

SESNYC; Annapolis, MD
Full-time, non-tenured academic positions

Project Director, Data Policy

Center for Policing Equity; New York, NY
