Data Science newsletter – August 28, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for August 28, 2018


Data Science News

How TripAdvisor changed travel

The Guardian, Linda Kinstler


The world’s biggest travel site has turned the industry upside down – but now it is struggling to deal with the same kinds of problems that are vexing other tech giants like Facebook, Google and Twitter.

University Data Science News

Collective action against for-profit publishers is taking shape in a movement known as Free Our Knowledge. Their position is that “the global academic community need to own up to the fact that we are complicitly responsible for this mess, and take action to correct it.” They plan to gather academic signatures on a petition, still being developed, until support crosses a threshold set by the anonymous supporters. Once the threshold has been hit, signatories will “flip the switch” and stop submitting to, reviewing for, or editing for-profit journals. Elsevier’s 40% profit margin is gag-inducing, so I applaud the effort. I also wonder if there is a way to capture some of those funds and feed them into revenue-positive, non-profit publishing. Expecting academics to do all the work the journals do without remuneration of any kind is not a recipe for harmony, either. Or for robust indexing.

Dartmouth and the New England Journal of Medicine add to our coverage of the politics of publishing this week. Leading cancer researcher H. Gilbert Welch apparently used data gathered by his colleague Samir Soneji, an associate professor. Welch published on the data with nary an acknowledgement of Soneji, who had explicitly noted he would like co-authorship if the data were used for publication; Welch had assured him the data would only be used for teaching. Soneji formally objected, but neither Welch nor NEJM is taking corrective action. Professor Welch believes the paper is part of the “natural progression” of his line of work, which does raise some questions about his line of work. Thievery? Dartmouth, to its credit, has cited Professor Welch for “research misconduct.”

And an op-ed in a PLOS blog argues that we should get rid of pre-publication peer review because it is biased (against women, towards senior scholars; yes, I know these groups aren’t mutually exclusive), discourages publication of negative findings, is slow, and takes too much damn free labor that benefits large for-profit publishers.

A group of researchers gathered at the Howard Hughes Medical Institute wrote an open letter calling for the publication of peer reviews. Over a hundred journals in biological research have already implemented their request and several more prominent publishers – such as PLOS – signed on. The academic publishing revolution continues.

And to round out our flurry of stories about peer review and academic publishing, Daniel Acuna of Syracuse University, James Evans of the University of Chicago, and Konrad Kording of the University of Pennsylvania have a National Science Foundation grant in hand to study how peer review processes shape the research that gets done and the way studies are disseminated.

The field of nutritional epidemiology is having a reckoning. Meta-analyses reveal that eating anything appears to lead to increased mortality risk. (In extremely robust research outside of nutritional epidemiology, it has been repeatedly shown that eating no food consistently leads to significantly increased mortality risk.) John Ioannidis (Stanford), bless his deadpan style, writes that big data, improperly used, is part of the problem: “With more research involving big data, almost all nutritional variables will be associated with almost all outcomes. Moreover, given the complicated associations of eating behaviors and patterns with many time-varying social and behavioral factors that also affect health, no currently available cohort includes sufficient information to address confounding in nutritional associations.” Rock on, John, you bubble-bursting rockstar. Eating specific foods is neither going to cure us nor kill us, so you only have to eat walnuts, celery, and bee pollen if you actually like them.

University of North Carolina at Chapel Hill, USC, and Indiana University landed two $1m grants from the NSF to build an end-to-end data science platform that can help detect unintentional errors in data processing. The funding seems rather small for such an undertaking, but let’s see what they can do.

Andrew Moore has stepped down as Dean of the School of Computer Science at Carnegie Mellon University. He has long been an advocate of a revolving door between academia and industry and it looks like he will once again be leaving the school. The move is sudden; no interim successor has been named. The school, as expected, will launch a national search for his permanent replacement. In the past, Moore has pointed to large industry salaries as a perk top university scientists should not have to forgo. Ka-ching.

A new randomized controlled study by Damon Jones, David Molitor, and Julian Reif finds that workplace wellness programs do not work. For Kate Crawford, Ifeoma Ajunwa, and Jason Schulz this raises serious questions about the privacy trade-offs workers face when sharing tracking and health behavior data with employers.

The National Center for Ecological Analysis and Synthesis, part of the University of California-Santa Barbara, is working on important, non-glamorous data rescue efforts. This time they aren’t worried about rescuing data from the whims of presidents or prime ministers. They are saving long-form data from retiring along with the researchers who collected it. One of the key features of data is its durability across time. Anyone who has actually worked with data knows that in order for data to continue to be available and make sense, it needs proper care and maintenance, starting with getting into legible, standardized, documented, digital formats in the first place.

The University of Illinois is turning Illini Hall, located at the heart of Campustown, into “a state-of-the-art classroom and research facility focused on statistics, data analysis and machine learning”. The Department of Statistics is already located in the building, which will gain an extra 10,000–30,000 square feet.

University of Texas-Austin is getting a $60 million supercomputer called Frontera next year. Everything is bigger in Texas. Their existing supercomputer, Stampede2, is covered in a painting of stampeding longhorns. I cannot wait to see what they paint on Frontera.

Predicting anything about earthquakes – when they’ll hit, how big they’ll be, where they will strike – is hard. A new multi-authored paper in Nature used machine learning to predict where aftershocks will occur. It’s a small part of the problem, but I’ll take progress however it comes.

The University of Rochester has appointed its first Chief Data Officer. It may be the first university to ever create such a role, though they are common in government and, increasingly, in industry. Sandra Cannon (Econ PhD), who previously served for 24 years in the Federal Reserve System, will figure out where to store data, how to provide access to it, and how to best combine data resources to maximize scientific gain.

The Challenge of Reforming Nutritional Epidemiologic Research

JAMA, The JAMA Network, Viewpoint; John P. A. Ioannidis


Some nutrition scientists and much of the public often consider epidemiologic associations of nutritional factors to represent causal effects that can inform public health policy and guidelines. However, the emerging picture of nutritional epidemiology is difficult to reconcile with good scientific principles. The field needs radical reform.

In recent updated meta-analyses of prospective cohort studies, almost all foods revealed statistically significant associations with mortality risk.1 Substantial deficiencies of key nutrients (eg, vitamins), extreme overconsumption of food, and obesity from excessive calories may indeed increase mortality risk. However, can small intake differences of specific nutrients, foods, or diet patterns with similar calories causally, markedly, and almost ubiquitously affect survival?

This is what filter bubbles actually look like

MIT Technology Review, John Kelly and Camille François


American public life has become increasingly ideologically segregated as newspapers have given way to screens. But societies have experienced extremism and fragmentation without the assistance of Silicon Valley for centuries. And the polarization in the US began long ago, with the rise of 24-hour cable news. So just how responsible is the internet for today’s divisions? And are they really as bad as they seem?

In this Twitter map (below, and in 3-D form above) of the US political landscape, accounts that follow one another are clustered together, and they are color-coded by the kinds of content they commonly share. At first glance, it might seem reassuring: although there are clear echo chambers, there is also an intertwined network of elected officials, the press, and political and policy professionals. There are extremes, but they are mediated through a robust middle.

The Desperate Quest for Genomic Compression Algorithms

IEEE Spectrum, Dmitri Pavlichin and Tsachy Weissman


Have you had your genome sequenced yet? Millions of people around the world already have, and by 2025 that number could reach a billion.

The more genomics data that researchers acquire, the better the prospects for personal and public health. Already, prenatal DNA tests screen for developmental abnormalities. Soon, patients will have their blood sequenced to spot any nonhuman DNA that might signal an infectious disease. In the future, someone dealing with cancer will be able to track the progression of the disease by having the DNA and RNA of single cells from multiple tissues sequenced daily.

And DNA sequencing of entire populations will give us a more complete picture of society-wide health. That’s the ambition of the United Kingdom’s Biobank, which aims to sequence the genomes of 500,000 volunteers and follow them for decades. Already, population-wide genome studies are routinely used to identify mutations that correlate with specific diseases. And regular sequencing of organisms in the air, soil, and water will help track epidemics, food pathogens, toxins, and much more.

This vision will require an almost unimaginable amount of data to be stored and analyzed.
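The core of the problem is information density. A minimal sketch of the naive baseline, before any real modeling: DNA uses a four-letter alphabet, so each base needs only 2 bits rather than the 8 bits of an ASCII character, an immediate 4x reduction. (Production tools go far beyond this by exploiting repeats and reference genomes; this toy packer is only meant to make the arithmetic concrete.)

```python
# Naive 2-bit packing of a DNA sequence: 4 bases per byte instead of
# 1 base per byte as ASCII text. Real genomic compressors (e.g. the
# CRAM format) achieve much higher ratios with reference-based and
# statistical modeling; this shows only the alphabet-size baseline.

BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {v: k for k, v in BASE_TO_BITS.items()}

def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base (length stored separately)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(group))  # left-align a final partial group
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, n_bases: int) -> str:
    """Recover the original sequence; n_bases trims trailing padding."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])

seq = "GATTACAGATTACA"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
print(len(seq), "bytes as ASCII ->", len(packed), "bytes packed")
```

At a billion genomes of roughly 3 billion bases each, even that 4x saving is nowhere near enough, which is what drives the search for the smarter, reference-aware algorithms the article describes.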

The ethical problems that AI can pose are clear, the time has come to understand how we can solve them

Wired UK, Abigail Beall


Whether it’s robots coming to take your job or AI being used in military drones, there is no shortage of horror stories about artificial intelligence. Yet for all the potential it has to do harm, AI might have just as much potential to be a force for good in the world.

Harnessing the power for good will require international cooperation, and a completely new approach to tackling difficult ethical questions, the authors of an editorial published in the journal Science argue.

“From diagnosing cancer and understanding climate change to delivering risky and consuming jobs, AI is already showing its potential for good,” says Mariarosaria Taddeo, deputy director of the Digital Ethics Lab at Oxford University and one of the authors of the commentary. “The question is how can [we] harness this potential?”

The Geography of Urban Violence

CityLab, Richard Florida


On the campaign trail, candidate Donald Trump falsely proclaimed that America’s cities have become increasingly dangerous. “We have an increase in murder within our cities, the biggest in 45 years,” he said. But the reality is that urban violence has declined substantially since its peak during the early 1990s. Then, not more than a year later, Attorney General Jeff Sessions claimed that Trump’s policies were in fact responsible for a decline in urban crime and the end of “American carnage.”

All of which raises the question: Is violence rising or falling? How do we know?

The confusion about trends in crime and violence was the motivation for a new mapping and data visualization tool put together by my colleague Patrick Sharkey at New York University and a team of researchers at NYU’s Marron Institute of Urban Management. Their site, which was created with support from the Bill and Melinda Gates Foundation, compiles data on murder rates for 80-plus cities over the period 1990 to 2017, spanning the high point of violent crime to the recent decline.

Exclusive: The woman behind the scenes who helped capture the Golden State Killer

San Jose Mercury News, Matthias Gafni


“How sure are you?” the FBI agent asked Barbara Rae-Venter.

“As long as we have all the descendants in the family tree, then I’m sure,” the retired intellectual property attorney and genetic genealogist told him.

About a week later, on April 25, Joseph DeAngelo was arrested; he has since been charged with 13 counts of murder and 13 counts of kidnapping. Investigators say the 72-year-old Citrus Heights man, a former police officer, is the Golden State Killer, the notorious serial murderer and rapist responsible for a terrifying crime spree up and down the state during the 1970s and ’80s.

Five lessons from the US government’s ultimate innovators

Tim Harford


Sunday July 29 is an important day in the history of innovation. It is the 60th anniversary of the founding of the US space agency Nasa, but that is only indirectly the reason. The incidental benefit of Nasa’s creation was that it stripped another young organisation of its funding, projects and purpose.

Founded in 1958, Arpa — the Advanced Research Projects Agency, part of the US Department of Defense — started the space race, but lost its role to Nasa a few months later and was described by Aviation Week as “a dead cat hanging in the fruit closet”.

But apparently cats really do have nine lives, because Arpa resurrected itself, and went on to play a foundational role in the creation of the internet, the Global Positioning System and, more recently, self-driving cars.

So what did Arpa do, does it deserve so much credit and, if so, can the trick be repeated in other fields such as clean energy or medicine? When it comes to an invention such as the internet, it is never easy to know whether success appeared by design or by luck. Still, here are five lessons I draw.

Top 10 Data Science Use Cases in Retail

Medium, ActiveWizards, Igor Bobriakov


The retail sector is developing rapidly. Retailers analyze data to build detailed psychological portraits of customers and learn their pain points, leaving customers easily influenced by the persuasion techniques retailers develop.

This article presents the top 10 data science use cases in retail, so you can stay aware of current trends and tendencies.

The Impossible Job: Inside Facebook’s Struggle to Moderate Two Billion People

VICE, Motherboard, Jason Koebler and Joseph Cox


Moderating billions of posts a week in more than a hundred languages has become Facebook’s biggest challenge. Leaked documents and nearly two dozen interviews show how the company hopes to solve it.

How AI and Big Data Impact the Structure of the Financial Industry

Bloomberg TV


Antoinette Schoar, an economist and professor of finance at MIT Sloan School of Management, discusses how artificial intelligence and big data are impacting the structure of the financial industry. She speaks with Bloomberg’s Mike McKee at the Federal Reserve’s annual economic symposium in Jackson Hole, Wyoming.

Noelle Selin named director of the MIT Technology and Policy Program

MIT News, Institute for Data, Systems, and Society


Selin will spearhead the master’s program for students whose research addresses societal challenges at the intersection of technology and policy.

How NYC Crunches Taxi Data

Socrata, Inc., Madeleine Burry


For decades, New York City’s been known for its iconic yellow taxis, which can be hailed with a raised arm or a piercing whistle. But in recent years, far more for-hire cars are driving the streets of the city. In addition to yellow taxis, there are lime-green taxis, black livery cars, and vehicles in every color of the rainbow from app-based companies like Uber and Lyft.

In a recent talk at Socrata Connect, Fausto Lopez, Policy Analytics Manager for NYC’s Taxi and Limousine Commission (TLC), shared the complexity of the for-hire vehicle options in New York, along with some of the data the TLC collects and how that data is used.

Better data for better nutrition

Biodiversity International, Irene Induli


A new tool, developed by Bioversity International and partners in Kenya, will ease the burden of measuring food intake, thus helping decision-making in nutrition interventions. With a few adjustments, the tool could also be used in other parts of the world.


First UK Mobile, Wearable and Ubiquitous Systems Research Symposium

Cambridge University


Cambridge, England September 12-13 at Cambridge University, The Computer Laboratory. [$$]

Berkeley Distinguished Lectures in Data Science

Berkeley Institute for Data Science


Berkeley, CA September 18, starting at 4:10 p.m., 190 Doe Library. Speaker: Dan Kammen, University of California-Berkeley Energy & Resources Group. [free]

Tools & Resources

The different flavors of AutoML, Erin LeDell


The term “AutoML” (Automatic Machine Learning) refers to automated methods for model selection and/or hyperparameter optimization. AutoML is also a subfield of machine learning that has a rich academic history, an annual workshop at the International Conference on Machine Learning (ICML), and academic research labs devoted to this topic (e.g. University of Freiburg Machine Learning Lab in Germany).

The AutoML field began by developing methods for automating hyperparameter optimization in single models, and now includes such techniques as automated stacking (ensembles), neural architecture search, pipeline optimization and feature engineering.
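That original flavor of AutoML, automated hyperparameter search, can be sketched in a few lines. This is a hypothetical illustration: `train_and_score` is a stand-in objective, and real AutoML libraries replace the plain random sampling below with smarter strategies such as Bayesian optimization.

```python
# A minimal random-search sketch of hyperparameter optimization, the
# problem early AutoML systems automated. `train_and_score` is a
# hypothetical stand-in for fitting a real model and returning its
# validation score; the search space and its ranges are illustrative.
import math
import random

SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),  # log-uniform
    "n_estimators":  lambda rng: rng.randrange(50, 500),
    "max_depth":     lambda rng: rng.randrange(2, 12),
}

def train_and_score(params):
    # Toy objective: pretend validation accuracy peaks at a moderate
    # learning rate and depth. Swap in real model training here.
    return (1.0
            - abs(math.log10(params["learning_rate"]) + 2.5) * 0.1
            - abs(params["max_depth"] - 6) * 0.02)

def random_search(n_trials=50, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: sample(rng) for name, sample in SEARCH_SPACE.items()}
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best, score = random_search()
print("best hyperparameters:", best, "score: %.3f" % score)
```

The later techniques LeDell lists (stacking, neural architecture search, pipeline optimization) generalize the same loop: a bigger search space and a more expensive objective, with the "trial, score, keep the best" skeleton intact.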

CoQA: A Conversational Question Answering Challenge

arXiv, Computer Science > Computation and Language; Siva Reddy, Danqi Chen, Christopher D. Manning


Humans gather information by engaging in conversations involving a series of interconnected questions and answers. For machines to assist in information gathering, it is therefore essential to enable them to answer conversational questions. We introduce CoQA, a novel dataset for building Conversational Question Answering systems. Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage. We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning. We evaluate strong conversational and reading comprehension models on CoQA. The best system obtains an F1 score of 65.1%, which is 23.7 points behind human performance (88.8%), indicating there is ample room for improvement. We launch CoQA as a challenge to the community at this http URL
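The F1 scores quoted in the abstract follow the SQuAD-style convention of token-level overlap between predicted and gold answers. A minimal sketch of that metric, omitting the answer normalization (lowercasing, punctuation and article stripping) that official evaluation scripts also apply:

```python
# Token-overlap F1 as used in extractive/conversational QA evaluation:
# precision and recall over the multiset of shared tokens between the
# predicted answer and the gold answer. Normalization steps from the
# official evaluation scripts are omitted for brevity.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset overlap
    n_same = sum(common.values())
    if n_same == 0:
        return 0.0
    precision = n_same / len(pred_tokens)
    recall = n_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial answers get partial credit: 3 shared tokens out of 3 predicted
# and 4 gold yields F1 = 6/7 ≈ 0.857.
print(token_f1("in the garden", "in the back garden"))
```

Averaged over all questions, this is the metric on which the best system's 65.1% trails human performance at 88.8%.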

CoQA – A Conversational Question Answering Challenge

Siva Reddy, Danqi Chen


“CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation.”



Post-Doctoral Fellow

Jacobs Technion-Cornell Institute and Zucker Hillside Hospital; New York, NY
Full-time, non-tenured academic positions

Senior Research Specialist

University of Arizona Health Sciences; Tucson, AZ
