Data Science newsletter – January 22, 2019

Newsletter features journalism, research papers, events, tools/software, and jobs for January 22, 2019


Data Science News

Why Do People Fall for Fake News?

The New York Times, Opinion, Gordon Pennycook and David Rand


What makes people susceptible to fake news and other forms of strategic misinformation? And what, if anything, can be done about it?

These questions have become more urgent in recent years, not least because of revelations about the Russian campaign to influence the 2016 United States presidential election by disseminating propaganda through social media platforms. In general, our political culture seems to be increasingly populated by people who espouse outlandish or demonstrably false claims that often align with their political ideology.

The good news is that psychologists and other social scientists are working hard to understand what prevents people from seeing through propaganda. The bad news is that there is not yet a consensus on the answer. Much of the debate among researchers falls into two opposing camps. One group claims that our ability to reason is hijacked by our partisan convictions: that is, we’re prone to rationalization. The other group — to which the two of us belong — claims that the problem is that we often fail to exercise our critical faculties: that is, we’re mentally lazy.

However, recent research suggests a silver lining to the dispute: Both camps appear to be capturing an aspect of the problem. Once we understand how much of the problem is a result of rationalization and how much a result of laziness, and as we learn more about which factor plays a role in what types of situations, we’ll be better able to design policy solutions to help combat the problem.

NASA’s NeMO-Net Will Give Scientists A Valid Excuse To Play Video Games

Dogo News, Kate Harveston


[Ved] Chirayath and his team developed a high-performance camera, FluidCam, which can be attached to small drones to take up-close photos of coral systems and create 3-D images. The data collected by FluidCam, which has been mapping the shallow coral reef systems in the South Pacific for the past two years, combined with existing satellite imagery, has resulted in thousands of high-quality 2-D and 3-D images, which need to be categorized. Chirayath and his colleagues hope to crowdsource the monumental task with their cleverly entitled NeMO-Net (Neural Multi-Modal Observation and Training Network) video game.

Foreign Students Sour on America, Jeopardizing a $39 Billion Industry

Bloomberg Education, Nick Leiber


Last May, Luis Carlos Soldevilla graduated with one of the best grade point averages in his Mexico City high school. For his senior project, he even tackled Goldbach’s conjecture, a famous number theory problem. Soldevilla considered attending Boston University and the University of Washington, both of which had accepted him. He also had fond memories of the University of California, Berkeley, where during the summer of 2016 he took a computer science course.

But instead of enrolling at a U.S. school, Soldevilla started this fall at the University of Toronto, Mississauga, where he’s pursuing a double major in computer science and mathematics. Why did he pick the Canadian school over those big American names?

“A very important factor of my decision was that there was no Trump,” the 19-year-old said.

Meituan Drives Instant Food Delivery With AI “Super Brain”



From Beijing to Barcelona to Buenos Aires, startups like Uber Eats, Deliveroo, Swiggy, Zomato and Go-Jek are revolutionizing urban food delivery. In the first quarter of 2018, food delivery accounted for 13 percent of Uber trips worldwide, and that figure is increasing.

Urban food delivery giants in China include Meituan-Dianping and DiDi Waimai, while startups like Shansong (FlashEx) and New Dada also handle local and short-distance food delivery. The biggest of the bunch is Meituan, which reported revenue of US$2.3 billion in the first half of 2018, up a whopping 90 percent from 2017. At last month’s AI Developers Conference (AI NEXTCon), Meituan’s AI-powered logistics team lead Renqing He shared his thoughts on recent developments in urban food delivery and the application of machine learning in the field.

[P] Downloading all publicly available #10YearChallenge images as weekend project : MachineLearning


To set expectations, I am not a machine learning expert of any sort, just an ML engineer building products for startups. What kind of embeddings / annotations would you guys like me to do on images so that people can build interesting things using that data – like build GANs to generate images of how people will look when 10 years older?

Researchers receive grant to study the invisible work of maintaining open-source software

Berkeley Institute for Data Science


Researchers at the UC Berkeley Institute for Data Science (BIDS), the University of California, San Diego, and the University of Connecticut have been awarded a grant of $138,055 from the Sloan Foundation and the Ford Foundation as part of a broad initiative to investigate the sustainability of digital infrastructures. The grant funds research into the maintenance of open-source software (OSS) projects, particularly focusing on the visible and invisible work that project maintainers do to support their projects and communities, as well as issues of burnout and maintainer sustainability. The research project will be led by BIDS staff ethnographer and principal investigator Stuart Geiger and will be conducted in collaboration with Lilly Irani and Dorothy Howard at UC San Diego, Alexandra Paxton at the University of Connecticut, and Nelle Varoquaux and Chris Holdgraf at UC Berkeley.

Duncan Watts Appointed Penn Integrates Knowledge University Professor

University of Pennsylvania, Annenberg School for Communication


President Amy Gutmann and Provost Wendell Pritchett are pleased to announce the appointment of Duncan Watts as the University of Pennsylvania’s twenty-third Penn Integrates Knowledge University Professor, effective July 1, 2019.

Watts, a pioneer in the use of data to study social networks, will be the Stevens University Professor, with joint faculty appointments in the Department of Computer and Information Science in the School of Engineering and Applied Science, the Annenberg School for Communication, and the Department of Operations, Information and Decisions in the Wharton School, where he will also be the inaugural Rowan Fellow.

Expanding the Research Data Management Service Portfolio at Bielefeld University According to the Three-pillar Principle Towards Data FAIRness

Data Science Journal, Practice Papers; Jochen Schirrwagen, Philipp Cimiano, Vidya Ayer, Christian Pietsch, Cord Wiljes, Johanna Vompras, Dirk Pieper


Research Data Management at Bielefeld University is considered as a cross-cutting task among central facilities and research groups at the faculties. While initially started as project “Bielefeld Data Informium” lasting over seven years (2010–2015), it is now being expanded by setting up a Competence Center for Research Data. The evolution of the institutional RDM is based on the three-pillar principle: 1. Policies, 2. Technical infrastructure and 3. Support structures. The problem of data quality and the issues with reproducibility of research data is addressed in the project Conquaire. It is creating an infrastructure for the processing and versioning of research data which will finally allow publishing of research data in the institutional repository. Conquaire extends the existing RDM infrastructure in three ways: with a Collaborative Platform, Data Quality Checking, and Reproducible Research.

Microsoft CTO: Understanding AI is part of being an informed citizen in the 21st century

VentureBeat, Khari Johnson


Microsoft CTO Kevin Scott believes understanding AI in the future will help people become better citizens.

“I think to be a well-informed citizen in the 21st century, you need to know a little bit about this stuff [AI] because you want to be able to participate in the debates. You don’t want to be someone to whom AI is sort of this thing that happens to you. You want to be an active agent in the whole ecosystem,” he said.

In an interview with VentureBeat in San Francisco this week, Scott shared his thoughts on the future of AI, including facial recognition software and manufacturing automation.

How Secrecy Fuels Facebook Paranoia

The New York Times, John Herrman


The biggest internet platforms are businesses built on asymmetric information. They know far more about their advertising, labor and commerce marketplaces than do any of the parties participating in them. We can guess, but can’t know, why we were shown a friend’s Facebook post about a divorce, instead of another’s about a child’s birth. We can theorize, but won’t be told, why YouTube thinks we want to see a right-wing polemic about Islam in Europe after watching a video about travel destinations in France. Everything that takes place within the platform kingdoms is enabled by systems we’re told must be kept private in order to function. We’re living in worlds governed by trade secrets. No wonder they’re making us all paranoid.

ShakeAlert moving closer to becoming a reality

Monterey Herald, Bailey Bedford


After more than a decade in development, ShakeAlert is finally about to become a reality for tens of millions of West Coast residents. The system is designed to alert them just seconds before the shaking starts so they can take cover or find a safer place to ride out the quake.

“The biggest commodity within the world of earthquake early warning is time,” said Robert-Michael de Groot, a U.S. Geological Survey scientist who is one of the coordinators of the ShakeAlert system.

“The concept of early warning is outstanding in my opinion,” said Gerry Malais, the emergency services manager. “Just a few seconds can give people enough warning to take cover and not sustain possible injury. In addition to our responders like our fire departments that can exit the station and not be trapped and be unable to respond.”

Facebook and Stanford researchers design a chatbot that learns from its mistakes

VentureBeat, Kyle Wiggers


Chatbots rarely make great conversationalists. With the exception of perhaps Microsoft’s Xiaoice in China, which has about 40 million users and averages 23 back-and-forth exchanges, and Alibaba’s Dian Xiaomi, an automated sales agent that serves nearly 3.5 million customers a day, most can’t hold humans’ attention for much longer than 15 minutes. But that’s not tempering bot adoption any — in fact, Gartner predicts that they’ll power 85 percent of all customer service interactions by the year 2020.

Fortunately, continued advances in the field of AI research promise to make conversant AI much more sophisticated by then. In a preprint paper published this week (“Learning from Dialogue after Deployment: Feed Yourself, Chatbot!”), scientists from Facebook’s AI Research and Stanford University describe a chatbot that can self-improve by extracting training data from conversations.

How to Rapidly Image Entire Brains at Nanoscale Resolution

Howard Hughes Medical Institute, Janelia Research Campus


A powerful new technique combines expansion microscopy with lattice light-sheet microscopy for nanoscale imaging of fly and mouse neuronal circuits and their molecular constituents that’s roughly 1,000 times faster than other methods.


Pinterest Labs Tech Talk

Pinterest, Jure Leskovec


San Francisco, CA January 28, starting at 5:15 p.m., Pinterest (505 Brannan St). Speaker: Ricardo Baeza-Yates. [registration required]

Documenting data science and documentation in data science: an ethnographic exploration

University of Washington, eScience Institute


Seattle, WA January 24, starting at 4:30 p.m., University of Washington Bagley Hall, room 154. Speaker: Stuart Geiger. [free]

Tools & Resources

Open sourcing bioinstruments

Lior Pachter, Bits of DNA blog


The long-standing practice of data sharing in genomics can be traced to the Bermuda principles, which were formulated during the human genome project (Contreras, 2010). While the Bermuda principles focused on open sharing of DNA sequence data, they heralded the adoption of other open source standards in the genomics community. For example, unlike many other scientific disciplines, most genomics software is open source and this has been the case for a long time (Stajich and Lapp, 2006). The open principles of genomics have arguably greatly accelerated progress and facilitated discovery.

While open sourcing has become de rigueur in genomics dry labs, wet labs remain beholden to commercial instrument providers that rarely open source hardware or software, and impose draconian restrictions on instrument use and modification. With a view towards joining others who are working to change this state of affairs, we’ve posted a new preprint in which we describe an open source syringe pump and microscope system called poseidon.

What makes people smarter than machines?

Twitter, Brenden Lake


Reading list for my NYU class “Advancing AI through Cognitive Science” has paired papers in AI and CogSci organized by topic, highlighting key ingredients for building machines that learn and think like people.

Flexible sensitivity analysis for observational studies without observable implications

arXiv, Statistics > Methodology; Alexander Franks, Alexander D'Amour, Avi Feller


A fundamental challenge in observational causal inference is that assumptions about unconfoundedness are not testable from data. Assessing sensitivity to such assumptions is therefore important in practice. Unfortunately, some existing sensitivity analysis approaches inadvertently impose restrictions that are at odds with modern causal inference methods, which emphasize flexible models for observed data. To address this issue, we propose a framework that allows (1) flexible models for the observed data and (2) clean separation of the identified and unidentified parts of the sensitivity model. Our framework extends an approach from the missing data literature, known as Tukey’s factorization, to the causal inference setting. Under this factorization, we can represent the distributions of unobserved potential outcomes in terms of unidentified selection functions that posit an unidentified relationship between the treatment assignment indicator and the observed potential outcomes. The sensitivity parameters in this framework are easily interpreted, and we provide heuristics for calibrating these parameters against observable quantities. We demonstrate the flexibility of this approach in two examples, where we estimate both average treatment effects and quantile treatment effects using Bayesian nonparametric models for the observed data.
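
As a rough illustration of the factorization idea described in the abstract, here is a minimal sketch. It assumes a normal model for the observed treated outcomes and a logistic selection function with a single sensitivity parameter `gamma`; those modeling choices, and all function names, are assumptions made for this example rather than the authors' specification. Under these assumptions the unobserved outcome distribution is an exponentially tilted copy of the observed one, so its mean shifts by `gamma * sigma**2`.

```python
import math


def missing_mean_normal(mu_obs, sigma_obs, gamma):
    """Mean of Y(1) among the untreated under exponential tilting.

    Tukey's factorization gives
        f(y | T=0)  proportional to  f(y | T=1) * P(T=0 | y) / P(T=1 | y).
    Assuming P(T=1 | Y(1)=y) = expit(alpha + gamma * y), the odds term is
    exp(-(alpha + gamma * y)), so an observed N(mu, sigma^2) density is
    tilted into N(mu - gamma * sigma^2, sigma^2).
    """
    return mu_obs - gamma * sigma_obs ** 2


def marginal_mean(mu_obs, sigma_obs, p_treated, gamma):
    """E[Y(1)] as a mixture of the observed part and the tilted part."""
    mu_mis = missing_mean_normal(mu_obs, sigma_obs, gamma)
    return p_treated * mu_obs + (1.0 - p_treated) * mu_mis


def tilted_mean_numeric(mu_obs, sigma_obs, gamma, half_width=12.0, steps=20001):
    """Check the closed form by summing the unnormalized tilted density on a grid."""
    lo = mu_obs - half_width * sigma_obs
    hi = mu_obs + half_width * sigma_obs
    dy = (hi - lo) / (steps - 1)
    num = 0.0
    den = 0.0
    for i in range(steps):
        y = lo + i * dy
        # Unnormalized normal density times the tilt exp(-gamma * y).
        w = math.exp(-0.5 * ((y - mu_obs) / sigma_obs) ** 2 - gamma * y)
        num += y * w
        den += w
    return num / den


if __name__ == "__main__":
    # Hypothetical inputs: observed treated mean 2.0, sd 1.5, 40% treated.
    for gamma in (-0.5, 0.0, 0.5):
        print(gamma, marginal_mean(2.0, 1.5, 0.4, gamma))
```

Setting `gamma = 0` recovers the unconfounded case, in which the treated and untreated share the same distribution of Y(1). Sweeping `gamma` over a plausible range shows how far the estimate could move under unmeasured selection.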

How to Set Up an AI Center of Excellence

Harvard Business Review, Thomas H. Davenport and Shivaji Dasgupta


The idea of establishing a CC or COE in AI is not particularly radical. In one recent survey of U.S. executives from large firms using AI, 37% said they had already established such an organization. Deutsche Bank, J.P. Morgan Chase, Pfizer, Procter & Gamble, Anthem, and Farmers Insurance are among the non-tech firms that have created centralized AI oversight groups.

Certain AI technologies are well known within many organizations. Machine learning derives its roots from statistical regression. This raises the issue of whether an AI CC or COE should be combined with analytics groups. If an existing analytics group is already doing some predictive analytics work, analysts who are willing to learn and grow can probably master many AI projects, and a combined organization would make sense.

The Dangers of Overpersonalization

Nielsen Norman Group, Kim Flaherty and Kate Moran


Many users know that every interaction online is tracked and analyzed. All of this data is tagged and segmented to create individualized customer profiles, which drive the delivery of personalized content (stories, products, ads, and information) to us online. But when does personalization become a problem?

In our research, we observed some of the downsides of personalization on the web. In particular, one of the problems with a personalized experience is that users are placed into a niche and start experiencing only information that goes into that niche. However, individuals are often multifaceted and change over time. A system that caters to a single user facet risks becoming boring or even annoying and can miss opportunities.



Smart Cities Postdoctoral Fellowship

New York University, Center for Urban Science and Progress; New York, NY

Postdoctoral Research Associate

Northeastern University, Network Science Institute; Boston, MA

Full-time positions outside academia

Senior Data Scientist

iHeartRadio; New York, NY

Events and Digital Marketing Coordinator

NumFOCUS; Austin, TX

Baseball Systems Software Engineer

Chicago Cubs; Chicago, IL

Baseball Operations Analyst

San Francisco Giants; San Francisco, CA
