Data Science newsletter – February 26, 2020

Newsletter features journalism, research papers, events, tools/software, and jobs for February 26, 2020


Data Science News

New machine learning method from Stanford, with Toyota researchers, could accelerate battery development for EVs

Green Car Congress


At every stage of the battery development process, new technologies must be tested for months or even years to determine how long they will last. Now, a team led by Stanford professors Stefano Ermon and William Chueh has developed a machine learning-based method that slashes these testing times by 98%.

How Universities Cover Up Scientific Fraud

Areo Magazine, Justin T Pickett


I learned a hard lesson last year, after blowing the whistle on my coauthor, mentor and friend: not all universities can be trusted to investigate accusations of fraud, or even to follow their own misconduct policies. Then I found out how widespread the problem is: experts have been sounding the alarm for over thirty years.

One in fifty scientists fakes research by fabricating or falsifying data. They make off with government grant money, which they share with their universities, and their made-up findings guide medical practice, public policy and ordinary people’s decisions about things like whether or not to vaccinate their children. The fraudulent science we know about has caused thousands of deaths and wasted millions in taxpayer dollars. That is only scratching the surface, however—because most fraudsters are never caught. As Ivan Oransky notes in Gaming the Metrics, “the most common outcome for those who commit fraud is: a long career.”

There are two reasons for this. First, many scientists who witness fraud don’t report it, because they believe nothing would happen if they did and they fear retaliation. Second, when fraud is reported, the job of investigating it falls to the fraudsters’ universities. Most whistleblowers inform their universities directly. Even if they don’t, federal agencies, like the National Science Foundation and the National Institutes of Health, refer fraud accusations back to universities for investigation, and publishers and the Committee on Publication Ethics tell journal editors to do the same.

Cleveland Clinic biorepository project moving forward after 2 years of discussions

Michelle Jarboe


Two years and three sites after the Cleveland Clinic revealed its plans to add a biorepository to its main campus, developers hope to start foundation work for the $11.3 million project next month.

The two-story building will house 400 freezers full of tissue samples used for research. The Clinic owns the land, on the south side of Cedar Avenue between East 97th and East 100th streets. Local developers Jim Doyle and Michael Panzica will construct the building and lease it to Brooks Life Sciences, a company that specializes in specimen storage and analysis.

The Untold History of Facebook’s Most Controversial Growth Tool

Medium, Marker, Steven Levy


‘People You May Know’ helped the social media giant grow exponentially. One man made it happen. An exclusive excerpt from ‘Facebook: The Inside Story’

‘Inside Story’ Sheds Light On Facebook’s Effort To Connect The World

NPR, Fresh Air, Terry Gross


“You’d be hard-pressed to name an American company that’s more distrusted and yet more influential today than Facebook. The social media site has been rocked by scandals involving the misuse of its users’ personal information and harsh criticism of its role in the 2016 election. And yet it remains huge, with nearly 3 billion users, and profitable, with annual earnings in the billions of dollars. Our guest, Steven Levy, is a veteran technology journalist who’s been reporting on Facebook for years and has written a new in-depth history of the company. Facebook’s founder and CEO Mark Zuckerberg gave Levy nine interviews and permission to talk to many other present and former employees of the company. Levy writes that virtually every problem Facebook has confronted since 2016 is a consequence of its unprecedented mission to connect the world and its reckless haste to do so.” [audio, 35:00]

How to Generate Infinite Fake Humans

The Atlantic, Ian Bogost


You encounter so many people every day, online and off-, that it is almost impossible to be alone. Now, thanks to computers, those people might not even be real. Pay a visit to the website This Person Does Not Exist: Every refresh of the page produces a new photograph of a human being—men, women, and children of every age and ethnic background, one after the other, on and on forever. But these aren’t photographs, it turns out, though they increasingly look like them. They are images created by a generative adversarial network, a type of machine-learning system that fashions new examples modeled after a set of specimens on which the system is trained. Piles of pictures of people in, images of humans who do not exist out.

It’s startling, at first. The images are detailed and entirely convincing: an icy-eyed toddler who might laugh or weep at any moment; a young woman concerned that her pores might show; that guy from your office. The site has fueled ongoing fears about how artificial intelligence might dupe, confuse, and generally wreak havoc on commerce, communication, and citizenship.

But are these people who don’t exist any different, really, from all the Tinder profiles on which you swiped left, or the faces in the crowd on the subway whom you might never see again?

The police want your phone data. Here’s what they can get — and what they can’t.

Vox, Recode, Sara Morrison


Our lives are in our phones, making them a likely source of evidence if police suspect you’ve committed a crime. But as we’ve seen in recent cases of suspected terrorists with passcode-protected iPhones that Apple refused to help the FBI unlock, it’s not always as simple as getting a warrant and breaking down a metaphorical door.

When the key to unlock your phone is in your own mind or on the tip of your finger, it becomes a legal question that judges have to rely on decades-old, pre-modern-technology precedent to answer. And in many places, this question hasn’t yet been answered.

Here are some of the main ways the government can get information off of your phone, including why they’re allowed and how they’d do it.

The dark side of social movements: Social identity, non-conformity, and the lure of conspiracy theories

PsyArXiv; Anni Sternisko, Aleksandra Cichocka, Jay Van Bavel


Social change does not always equal social progress – there is a dark side of social movements. We discuss conspiracy theory beliefs – beliefs that a powerful group of people are secretly working towards a malicious goal – as one contributor to destructive social movements. Research has linked conspiracy theory beliefs to anti-democratic attitudes, prejudice and non-normative political behavior. We propose a framework to understand the motivational processes behind conspiracy theories and associated social identities and collective action. We argue that conspiracy theories comprise at least two components – content and qualities – that appeal to people differently based on their motivations. Social identity motives draw people foremost to contents of conspiracy theories while uniqueness motives draw people to qualities of conspiracy theories.

How screwed are we? Experts don’t see a bright future for technology’s impact on democracy (or journalism)

Nieman Journalism Lab, Hanaa' Tameez


A new Pew Research Center report (done in partnership with Elon University’s Imagining the Internet Center) surveyed hundreds of technology experts about whether or not digital disruption will help or hurt democracy by 2030. Of the 979 responses, about 49 percent of these respondents said use of technology “will mostly weaken core aspects of democracy and democratic representation in the next decade” while 33 percent said the use of technology “will mostly strengthen core aspects of democracy.” The remaining 18 percent expect no significant change in the next decade.
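
For a sense of scale, the percentages above can be turned into approximate head counts. This is a rough sketch only: the report describes the shares as "about" figures, so the counts below are estimates derived from the rounded percentages, not numbers taken from the Pew report.

```python
# Approximate head counts implied by the rounded percentages in the excerpt.
# The shares are "about" figures, so these counts are estimates only.
total_responses = 979

# Percentages as quoted in the excerpt (whole numbers).
share_percent = {
    "mostly weaken": 49,
    "mostly strengthen": 33,
    "no significant change": 18,
}

# Convert each percentage into an approximate number of respondents.
approx_counts = {
    label: round(total_responses * pct / 100)
    for label, pct in share_percent.items()
}

for label, count in approx_counts.items():
    print(f"{label}: ~{count} of {total_responses}")
```

That works out to roughly 480 respondents expecting technology to weaken democracy, about 323 expecting it to strengthen democracy, and about 176 expecting no significant change.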

Cornell joins network to expand public interest tech

Cornell University, Cornell Chronicle


When Karen Levy, assistant professor of information science, teaches her class on privacy and surveillance, she’s often struck by how students want to build technology for good.

“They say, ‘I want to be thinking much more consciously about the ethical aspects of my work, or I want to use my skills to help address social issues. What do I do?’ And my answers are not that satisfying,” said Levy, also an associate member of the faculty of Cornell Law School. “There has not been a pipeline for folks who want to do that work.”

As part of its efforts to develop that pipeline, Cornell is joining the Public Interest Technology University Network (PIT-UN), a collaboration of 36 colleges and universities committed to building the field of public interest technology and preparing a generation of civic-minded technologists.

HHS creates plan to make EHRs less painful for docs

MedCity News, Elise Reuter


On Monday, the Department of Health and Human Services put out a series of strategies to reduce the amount of time physicians spend on documentation. The report includes suggestions to standardize certain elements of EHRs and bring federal reporting requirements up to date.

A Vaccine Won’t Stop the New Coronavirus

The Atlantic, James Hamblin


With its potent mix of characteristics, this virus is unlike most that capture popular attention: It is deadly, but not too deadly. It makes people sick, but not in predictable, uniquely identifiable ways. Last week, 14 Americans tested positive on a cruise ship in Japan despite feeling fine—the new virus may be most dangerous because, it seems, it may sometimes cause no symptoms at all.

Team USA Tells Athletes to Proceed as Planned for Tokyo Olympics

Bloomberg Business, Eben Novy-Williams


The U.S. Olympic & Paralympic Committee is telling its teams to train and prepare as planned for the 2020 Summer Games in Tokyo, as it continues to monitor the coronavirus outbreak spreading rapidly through Asia.

Team USA, which plans to send 620 athletes and twice as many coaches and executives to Tokyo, last week told its individual sport bodies that it’s “been given no reason to deviate from any of our Tokyo Games planning and preparation.” The games are set for July 24 through Aug. 9, with the Paralympic Games a few weeks later.

A new AI ‘Super Nurse’ monitors patients in Israeli hospital

ISRAEL21c, Brian Blum


Artificial intelligence can spot potential deterioration before a human nurse or doctor could, and it can predict which patients will be readmitted.

University adds new, profitable major that could help fight climate change

University of Delaware, The Review student newspaper, Wyatt Patterson


The Department of Geography and Spatial Sciences recently developed a new program allowing students to obtain a Bachelor of Science in GIScience and Environmental Data Analytics.

The program defines GIScience, or Geographic Information Science, as the analysis and mapping of large geospatial data sets to better understand the world. The new major focuses on teaching students how to analyze and harness GPS and geospatial data, and is considered ideal for those who want to apply mathematical and scientific rigor to environmental problems.


Come join the first-ever Bay Area Grafana Labs User Group!



San Francisco, CA February 27, starting at 3 p.m., Strava Headquarters (208 Utah St). “Learn best practices and what’s next for Prometheus and Grafana” [registration required]


Argonne Training Program on Extreme-Scale Computing

St. Charles, IL July 26-August 7. The program “provides intensive, two-week training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future.” Deadline to apply is March 2.

Request for Information: Public Access to Peer-Reviewed Scholarly Publications, Data and Code Resulting From Federally Funded Research

“OSTP and the National Science and Technology Council’s (NSTC) Subcommittee on Open Science (SOS) are engaged in ongoing efforts to facilitate implementation and compliance with the 2013 memorandum Increasing Access to the Results of Federally Funded Scientific Research [1] and to address recommended actions made by the Government Accountability Office in a November 2019 report.[2] OSTP and the SOS continue to explore opportunities to increase access to unclassified published research, digital scientific data, and code supported by the U.S. Government. This RFI aims to provide all interested individuals and organizations with the opportunity to provide recommendations on approaches for ensuring broad public access to the peer-reviewed scholarly publications, data, and code that result from federally funded scientific research.” Deadline to submit comments is March 16.

ICWSM2020: The Data Challenge: Safety

“ICWSM 2020 is hosting the first ICWSM data challenge to bring together researchers from across disciplines to solve societally-relevant problems together as a community. This will be enabled by fostering collaboration and exchange of ideas in a structured setting. This year’s data challenge theme is Safety. To achieve this, we invite participants to work on two pertinent datasets in the areas of Misinformation and Abusive behavior in social media.” Deadline for submissions is April 25.

Tools & Resources

Fully Differentiable Procedural Content Generation through Generative Playing Networks

arXiv, Computer Science > Artificial Intelligence; Philip Bontrager, Julian Togelius


To procedurally create interactive content such as environments or game levels, we need agents that can evaluate the content; but to train such agents, we need content they can train on. Generative Playing Networks is a framework that learns agent policies and generates environments in tandem through a symbiotic process. Policies are learned using an actor-critic reinforcement learning algorithm so as to master the environment, and environments are created by a generator network which tries to provide an appropriate level of challenge for the agent. This is accomplished by the generator learning to make content based on estimates by the critic. Thus, this process provides an implicit curriculum for the agent, creating more complex environments over time. Unlike previous approaches to procedural content generation, Generative Playing Networks is end-to-end differentiable and does not require human-designed examples or domain knowledge. We demonstrate the capability of this framework by training an agent and level generator for a 2D dungeon crawler game.

Open Source Data Developers – A discussion forum for open source data project developers

Open Source Data Developers


This forum is intended for the maintainers and contributors of open source data processing and related computing. The idea is to facilitate cross-project design discussions or development initiatives that are not specific to a single open source project. In the past, many discussions would not happen at all, or they would take place on a particular project’s mailing list or GitHub issues, which makes it hard to stay organized and build community. The forum is not intended for end users or people needing help with using particular projects (except in the context of project developers depending on other developers’ projects).

Unsupervised Question Decomposition for Question Answering

arXiv, Computer Science > Computation and Language; Ethan Perez, Patrick Lewis, Wen-tau Yih, Kyunghyun Cho, Douwe Kiela


We aim to improve question answering (QA) by decomposing hard questions into easier sub-questions that existing QA systems can answer. Since collecting labeled decompositions is cumbersome, we propose an unsupervised approach to produce sub-questions. Specifically, by leveraging >10M questions from Common Crawl, we learn to map from the distribution of multi-hop questions to the distribution of single-hop sub-questions. We answer sub-questions with an off-the-shelf QA model and incorporate the resulting answers in a downstream, multi-hop QA system. On a popular multi-hop QA dataset, HotpotQA, we show large improvements over a strong baseline, especially on adversarial and out-of-domain questions. Our method is generally applicable and automatically learns to decompose questions of different classes, while matching the performance of decomposition methods that rely heavily on hand-engineering and annotation.
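
The flow the abstract describes (decompose a hard question, answer the sub-questions, hand those answers to a downstream model) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `decompose`, `answer_single_hop`, and `answer_multi_hop` are hypothetical stand-ins, and the made-up lookup tables take the place of the learned decomposition model and the real QA systems.

```python
from typing import Dict, List, Tuple

# Made-up toy data standing in for learned models (all hypothetical).
TOY_DECOMPOSITIONS: Dict[str, List[str]] = {
    "Where was the author of Book X born?": [
        "Who is the author of Book X?",
        "Where was Author A born?",
    ],
}
TOY_SINGLE_HOP_ANSWERS: Dict[str, str] = {
    "Who is the author of Book X?": "Author A",
    "Where was Author A born?": "City C",
}


def decompose(question: str) -> List[str]:
    """Stand-in for the unsupervised decomposition model."""
    # Fall back to the original question when no decomposition is known.
    return TOY_DECOMPOSITIONS.get(question, [question])


def answer_single_hop(sub_question: str) -> str:
    """Stand-in for an off-the-shelf single-hop QA model."""
    return TOY_SINGLE_HOP_ANSWERS.get(sub_question, "unknown")


def answer_multi_hop(question: str, sub_qa: List[Tuple[str, str]]) -> str:
    """Stand-in for the downstream multi-hop QA system.

    A real system would read the question together with the sub-questions
    and their answers; this placeholder simply returns the last sub-answer.
    """
    return sub_qa[-1][1] if sub_qa else "unknown"


def answer_question(question: str) -> str:
    # 1. Break the hard question into easier sub-questions.
    sub_questions = decompose(question)
    # 2. Answer each sub-question with the single-hop model.
    sub_qa = [(q, answer_single_hop(q)) for q in sub_questions]
    # 3. Pass the question plus sub-answers to the multi-hop model.
    return answer_multi_hop(question, sub_qa)


print(answer_question("Where was the author of Book X born?"))
```

The point of the structure is that only the first step needs to be learned without labels; the second and third steps reuse existing QA models.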
