Data Science newsletter – May 16, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for May 16, 2018


Data Science News

Microsoft Lobbyist Leaving for Role at Google, Sources Say

Bloomberg, Naomi Nix


Pablo Chavez, a former aide to Senator John McCain and a veteran Washington lobbyist, is leaving Microsoft Corp. to lead Alphabet Inc.’s Google’s global public policy for cloud services.

The hire comes as Google is attempting to reshuffle its policy shop as big tech companies come under greater scrutiny in Washington for the size of their platforms and over Russian interference in the 2016 presidential election. The search giant is seeking a replacement for Caroline Atkinson, who stepped aside in September. She was a former deputy national security adviser to President Barack Obama. Leslie Miller, a Google policy director based in California, is filling in on an interim basis.

University Data Science News

Seriously, academics, put the rights of your research subjects ahead of your science goals. The hierarchy of objectives is just that simple. Why am I all twisted out of sorts this week?

There’s more fallout from the Facebook and Cambridge Analytica case. Another app, myPersonality, also run by academics at the University of Cambridge, also collecting sensitive personal information, who also went on to start a consulting company selling access to the insights from the data has “leaked” the responses of ~3m people. By “leaked” I mean shared. But in this case sharing is not caring, it is callous. David Stillwell and Michal Kosinski posted a login and password to download the original data on GitHub. People – mostly other academics, I’m guessing – used that password and login from 2014 until now to download the dataset.

When others in academia do this kind of thing, we all risk losing access to funding and to trusting relationships with research subjects.

Next issue: There are not enough computer science professors to meet student demand for computer science education and the students are speaking out. One recent study found that, “18 percent of computer science faculty searches in 2017 failed entirely” because there weren’t enough candidates to fill vacancies. This is in contrast to skyrocketing demand. At PhD granting universities, the rate of new bachelor degrees minted climbed 300% between 2009 and 2015. The big R1 schools take 80% of the newly minted PhD holders, leaving smaller institutions in a bind: they can expect to hire one new Ph.D. every 27 years. Academia will be outcompeted by nimbler for-profit educational institutions if we cannot solve this problem. But, hey, if students are getting the education they want outside the slow-moving ivory tower, it is very difficult to argue that this is a bad development.

York University in Canada has launched a part-time, continuing studies Certificate in Machine Learning. The certificate is for tech employees and will deliver, “the requisite technical skills, as well as an in-depth, real-world understanding of the ethical, social, and business implications of their work.” I am so glad to see the emphasis at that sociotechnical intersection.

UC Berkeley is also offering an extension course in health informatics for health care practitioners that will focus on “one of the big challenges: confidentiality.” In particular, health care practitioners are bound by the privacy principles established by the Health Insurance Portability and Accountability Act (HIPAA).

Tuskegee University launched a new bachelors program that “will focus on computer hardware design and cybersecurity engineering.” This is great for the HBCU, and I hope they can find competent faculty to staff it.

Brian Silliman of Duke University combined a bunch of diverse datasets and determined that successful conservation efforts mean humans are likely to be seeing more alligators on beaches. In fact, “alligators, sea otters, river otters, gray whales, gray wolfs, mountain lions, orangutans and bald eagles, among other large predators are increasingly expanding their habitats, some of which are encroaching on human locales.” Successful conservation efforts, coming soon to an Instagram account near you. Meanwhile, Kansas University professor Daniel Reuman just got a grant to figure out how ecosystem spanning events lead to synchronous cross-species responses. There is already “theoretical evidence that changes in synchrony can influence extinction risks.”

University of Waterloo researchers have developed a shirt that collects data in order to predict the onset of cardiovascular or respiratory disease. Lead researcher, Thomas Beltrame, completed this research in Canada but has since taken a position at the University of Campinas in Brazil.

OpenAI looked at the compute required to train a model. Since 2012, total compute operations have been increasing by 10x per year. They credit this to researchers willing to bear the economic costs of producing better models.
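To put the 10x-per-year figure in more familiar terms, here is a minimal Python sketch that converts an annual growth multiple into a doubling time. It assumes the rate is constant over the whole period, which is a simplification; the function names and the six-year horizon are illustrative, not taken from the OpenAI analysis.

```python
import math


def doubling_time_months(growth_per_year: float) -> float:
    """Months needed to double under a constant multiplicative growth per year."""
    return 12 * math.log(2) / math.log(growth_per_year)


def growth_over(years: float, growth_per_year: float) -> float:
    """Total multiplicative growth after `years` at a constant yearly multiple."""
    return growth_per_year ** years


# Assumption: a steady 10x per year, as stated in the blurb above.
months = doubling_time_months(10.0)
print(f"10x per year implies doubling roughly every {months:.1f} months")

# Illustrative horizon only: six years of growth at that constant rate.
print(f"Six years at that rate: {growth_over(6, 10.0):,.0f}x")
```

Under that assumption, the doubling time works out to a little over three and a half months.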

The London School of Economics has launched its own open access academic publishing platform to compete with for-profit journals and, as its focus is social science, with the newly formed SocArXiv. Researchers internal and external to LSE can start their own open access journals on the platform.

In science as in every other industry, it’s really hard to prove gender discrimination as a case against the Salk Institute is demonstrating. My guess? They’ll settle at some point and the terms will be sealed.

Machine learning masters molecules | Research

Chemistry World, Matthew Gunther


Research papers and patents contain huge numbers of molecular structures and experimental data that could be used in virtual screening programs, but getting it out of the documents is laborious. ‘First you have to identify what compounds in the publication you want to actually extract,’ comments [Joshua] Staker. ‘So, you read through the paper and then … go into some drawing software and draw it manually.’ Once the molecule is re-drawn in a computer-readable format (commonly known as SMILES), the information can be used in a screening program.
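For readers who have not met SMILES, it is a way of writing a molecular structure as a single line of text. The sketch below is only an illustration of the format, not the researchers' method: the three example strings are well-known SMILES chosen here, and the atom counter is a hypothetical helper that handles just simple, bracket-free strings.

```python
import re

# Well-known SMILES strings picked for illustration (not from the article).
EXAMPLES = {
    "ethanol": "CCO",
    "benzene": "c1ccccc1",
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
}

# Atoms of the bracket-free "organic subset"; two-letter symbols are tried first.
# Lowercase letters denote aromatic atoms.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_count(smiles: str) -> int:
    """Count non-hydrogen atoms in a simple, bracket-free SMILES string."""
    if "[" in smiles:
        raise ValueError("bracket atoms are outside the scope of this sketch")
    return len(ATOM.findall(smiles))


for name, smiles in EXAMPLES.items():
    print(f"{name}: {smiles} has {heavy_atom_count(smiles)} heavy atoms")
```

Hydrogens are implicit in these strings, which is why only the heavier atoms are counted. A real screening pipeline would use a full cheminformatics toolkit rather than a regular expression.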

‘Doing this for hundreds of compounds in a large patent, it becomes tedious,’ laments Staker. ‘[It] starts to become easier and easier to make mistakes over data entry.’

Staker and [Kyle] Marshall came up with a solution to cut out the middle man. In fact, to cut out men and women altogether. The team has developed a deep neural network that can find images of molecular structures in a document and convert them into a digital format, without being told anything about molecules beforehand. ‘That’s really the beauty and simplicity of it in that there are no complex rules or features that we need to engineer as humans,’ says Staker.

York University Continuing Studies Announces New Machine Learning Program

Cision, Newswire, York University


As tech giants and other businesses increasingly rely on data and artificial intelligence (AI) to stay competitive, there is greater demand than ever for machine learning specialists with the requisite technical skills, as well as an in-depth, real-world understanding of the ethical, social, and business implications of their work. The York University School of Continuing Studies recently launched a part-time Certificate in Machine Learning to address this demand. The only dedicated program of its type in Canada, and created in collaboration with industry leaders in AI and machine learning, the certificate focuses on preparing students to become qualified candidates for the rewarding, desirable jobs on offer by top employers.

Implementing a Research Data Policy at Leiden University

International Journal of Digital Curation; Fieke Schoots, Laurents Sesink, Peter Verhaar, Floor Frederiks


In this paper, we discuss the various stages of the institution-wide project that led to the adoption of the data management policy at Leiden University in 2016. We illustrate this process by highlighting how we have involved all stakeholders. Each organisational unit was represented in the project teams. Results were discussed in a sounding board with both academic and support staff. Senior researchers acted as pioneers and raised awareness and commitment among their peers. By way of example, we present pilot projects from two faculties. We then describe the comprehensive implementation programme that will create facilities and services that must allow implementing the policy as well as monitoring and evaluating it. Finally, we will present lessons learnt and steps ahead. The engagement of all stakeholders, as well as explicit commitment from the Executive Board, has been an important key factor for the success of the project and will continue to be an important condition for the steps ahead.

Evaluating Sampling Methods for Content Analysis of Twitter Data

Social Media + Society; Hwalbin Kim, S. Mo Jang, Sei-Hill Kim, Anan Wan


Despite the existing evaluation of the sampling options for periodical media content, only a few empirical studies have examined whether probability sampling methods can be applicable to social media content other than simple random sampling. This article tests the efficiency of simple random sampling and constructed week sampling, by varying the sample size of Twitter content related to the 2014 South Carolina gubernatorial election. We examine how many weeks were needed to adequately represent 5 months of tweets. Our findings show that a simple random sampling is more efficient than a constructed week sampling in terms of obtaining a more efficient and representative sample of Twitter data. This study also suggests that it is necessary to produce a sufficient sample size when analyzing social media content.
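The two sampling schemes compared in the abstract can be sketched in a few lines of Python. This is a rough illustration, not the authors' procedure: it assumes the sampling unit is a calendar day (with all tweets from a chosen day then coded), and the date window, seed, and function names are hypothetical.

```python
import random
from collections import defaultdict
from datetime import date, timedelta


def simple_random_sample(days, k, rng):
    """Draw k distinct days uniformly at random from the whole period."""
    return sorted(rng.sample(days, k))


def constructed_weeks(days, n_weeks, rng):
    """Build n_weeks composite weeks.

    For each weekday (Monday through Sunday), draw n_weeks distinct dates
    that fall on that weekday, so every weekday is equally represented.
    """
    by_weekday = defaultdict(list)
    for d in days:
        by_weekday[d.weekday()].append(d)
    picked = []
    for weekday in range(7):
        picked.extend(rng.sample(by_weekday[weekday], n_weeks))
    return sorted(picked)


# Hypothetical five-month window (153 days); not the study's actual dates.
start = date(2014, 6, 1)
days = [start + timedelta(days=i) for i in range(153)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
srs = simple_random_sample(days, 14, rng)
cw = constructed_weeks(days, 2, rng)

print("simple random sample:", [d.isoformat() for d in srs])
print("two constructed weeks:", [d.isoformat() for d in cw])
```

Both samples contain 14 days, but only the constructed weeks guarantee exactly two Mondays, two Tuesdays, and so on. That stratification is what makes the method popular for periodical media with weekly rhythms.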

A sit down with Danny Lange, the creator of ML systems at some of the world’s leading tech companies

Medium, City.AI, Christoph Auer-Welsbach


A sit down with Danny Lange, VP of AI and machine learning at Unity Technologies, and formerly of Uber, Amazon and Microsoft. Danny is a computer scientist and researcher with a passion for teaching machines to talk…properly…with a little more charisma than Siri, and a few fewer memes than Alexa.

1. Danny, how did you become the person heading up the development of AI at some of the most prestigious tech companies of our time?

From the very beginning, I have been guided by two core principles. The first is that autonomous systems are needed to successfully address increasingly complex system challenges, and the second is that real impact is achieved through the enablement of broad developer adoption. Let me give you some concrete examples of that from my past. I spent years in two startups creating voice-enabled virtual assistants just like today’s Siri and Alexa. The vision was to achieve dynamic dialog generation driven by the feedback between the human user and the computer.

UC Berkeley Extension to launch health informatics program

The Daily Californian, Alexandra Stassinopoulos


UC Berkeley Extension is launching a new health informatics program in July that aims to empower patients and health care providers.

The program comprises four two-unit courses and is taught entirely online. Beginning with “Introduction to Health Informatics” and ending with “Population Health Management,” the course sequence covers topics associated with data analysis as well as record management.

Government Data Science News

Canada won a small skirmish in the North American STEM brain drain battle. The Collision Conference, a rapidly growing tech conference, announced it will stop hosting in the US for at least the next three years. I do not think it is a coincidence that this time period overlaps with Trump’s presidency. Organizers noted that some would-be conference attendees were denied visas to attend this year’s conference in New Orleans noting, “intolerance is on the rise” as evidenced by “some countries around the world [that] seem to be shutting their borders.” Onwards to Canada.

Cities like Oakland, Cambridge, MA, and Seattle are quietly adopting policies that would limit the use of police surveillance technology. This is absolutely necessary: police departments have adopted insecure technologies (see: Securus leaking individuals’ phone geolocation data to anyone), had policing technology implemented without local knowledge (see: Palantir bypassing the New Orleans Police Department), and seen negative consequences from over-surveillance (apologies, people who have been stopped 11-30 times per month in LA after Palantir swooped in).

Meanwhile, in the UK, police forces experimenting with facial recognition technology found it to be horribly inaccurate. A trial run identified 102 people as potential suspects leading to exactly 0 arrests. Face-palm, not face recognition.

In China, a country that does not have a long history with credit scoring, there has been a swift uptake of credit scores to determine everything from whether or not Chinese citizens will be allowed to get travel visas to how their dating profiles will be ranked. Given the lack of the type of credit history available in the US, the models are using everything from an individual’s social graph to what kind of phone they own. This type of system seems likely to replicate existing wealth gradients and social hierarchies. I’m not arguing that the US credit scoring system is perfect. I simply reserve the right to criticize the status quo and the innovative future with equal weight.

South Korea has announced the opening of six new research centers for machine learning and artificial intelligence. The country has committed to spend $2 billion by 2022 to foster AI/ML research and development. The plan includes funding for 4500 scholarships.

Beacon Labs is a new non-profit that just launched out of the Bayes Impact incubator. They “work alongside government agencies at the federal, state, and local levels” to identify and bridge gaps between policies and actual people in real communities. They mostly do this with data driven approaches. I’m fascinated by the approach and hope they’re able to have the impact they want.

Can Computers Create Art?

Arts journal, Aaron Hertzmann


This essay discusses whether computers, using Artificial Intelligence (AI), could create art. First, the history of technologies that automated aspects of art is surveyed, including photography and animation. In each case, there were initial fears and denial of the technology, followed by a blossoming of new creative and professional opportunities for artists. The current hype and reality of Artificial Intelligence (AI) tools for art making is then discussed, together with predictions about how AI tools will be used. It is then speculated about whether it could ever happen that AI systems could be credited with authorship of artwork. It is theorized that art is something created by social agents, and so computers cannot be credited with authorship of art in our current understanding. A few ways that this could change are also hypothesized.

Can A.I. usher in a new era of hyper-personalized food?

New Food Economy, Nadia Berenstein


Artificial intelligence is helping to develop flavors tweaked precisely for your age, ethnicity, and gender. Think beer and snacks as unique as your fingerprint, and a future where your food knows more about you.

Company Data Science News

Alphabet employees continue to oppose the company’s Project Maven collaboration with the US Department of Defense. Over 4,000 have now signed a petition requesting Alphabet’s CEO Sundar Pichai to cancel the existing contract and refuse any future work “in the business of war.” Last week the company introduced a new “just let Google do it” mantra; it used to tout the phrase “Don’t be evil.” Now over 1000 members of the academic community have signed their own petition urging Alphabet to drop the contract, promise never to build technology for war, and to specifically prevent any work on autonomous weapons technology. The signatory list is basically a who’s who of global sociotechnical scholars.

Facebook has hired 20,000 human workers to identify and remove inappropriate content. Some of this is easy – Facebook officially opposes Holocaust denials, porn, threats of physical harm, and other mostly clear-cut categories of trash. They need this giant group of people because there are plenty of posts that fall into a grey area. The line between an unfortunate-but-acceptable exercise of free speech and a delete-worthy post is unclear.

Beam Dental wants to sell you a smart toothbrush that will keep track of your brushing and flossing. Then it will give the most orally fastidious among us discounts on dental insurance. To me, this is the perfect example of another affordance that benefits the wealthy (most people don’t even have dental insurance) to the detriment of the poor (who won’t be able to afford this toothbrush or dental insurance and will pay more for their dental care). I also place Lyft and Uber in this category of inequality enablers.

Walmart, on the other hand, is turning the lens the other direction, using health care data to identify good doctors and reward them. In 2016, medical error was the third leading cause of death, so there is definitely room for improvement.

Oracle just bought a data-science-as-a-platform company.

Google may have faked its big AI demo in which an AI assistant called a hair salon to make an appointment. I’m not sure I care as much as lots of people do. Please feel free to email me and explain why I should be deeply concerned.

System Crash – Computer science students on a number of campuses complain that their departments can’t meet demand. Their professors are also stressed. But experts say there is no clear fix for nationwide shortage of computer science faculty.

Inside Higher Ed, Colleen Flaherty


At Haverford College last month, computer science students playfully but pointedly hung April Fool’s Day-inspired signs around campus, asking where their computer science professors were.

The faculty members weren’t missing, per se, since they were never there: Haverford, like the vast majority of other institutions, suffers from a computer science faculty shortage.

Haverford’s “dire shortage of faculty has created what can only be described as a crisis for students interested in computer science, lotterying us out of required introductory and upper-level classes,” a group of undergraduates wrote in a related letter to the campus’s Education Policy Committee published in the independent student newspaper, The Clerk.

How artificial intelligence is reimagining work

MIT Sloan School of Management, Newsroom, Brian Eastwood


Paul Daugherty, chief technology and innovation officer at Accenture, sees three myths surrounding artificial intelligence: Robots are coming for us, machines will take our jobs, and current approaches to business processes will still apply.

The three myths represent “conventional changes to linear processes,” he said. The reality is more transformative. An example: Newark, New Jersey-based AeroFarms grows seeds indoors without soil or sunlight. Seeds are harvested in less than three weeks and the process requires 95 percent less water than conventional farming methods.

AI plays a key role, Daugherty said. AeroFarms’ scientists monitor 130,000 data points, analyzing everything from light sensitivity to nutrient absorption.

“How do we get the conventional mindset [of AI] from beating Go to reimagining business?” he said. “That’s what we like to think about.”

Transparency in climate science

RealClimate, Gavin Schmidt


I was invited to give a short presentation to a committee at the National Academies last week on issues of reproducibility and replicability in climate science for a report they have been asked to prepare by Congress. My slides give a brief overview of the points I made, but basically the issue is not that there isn’t enough data being made available, but rather there is too much!

A small selection of climate data sources is given on our (cleverly named) “Data Sources” page and these and others are enormously rich repositories of useful stuff that climate scientists and the interested public have been diving into for years. Claims that have persisted for decades that “data” aren’t available are mostly bogus (to save the commenters the trouble of angrily demanding it, here is a link for data from the original hockey stick paper. You’re welcome!).

The issues worth talking about are however a little more subtle.


31st CVS Symposium, Frontiers in Virtual Reality

University of Rochester, Center for Visual Science


Rochester, NY June 1-3. “This symposium will bring together leaders in vision and multisensory research, optics, computer science and clinical applications whose work connects to new developments in virtual and augmented reality.” [$$$]


Summer Camps!

DawgBytes is UW’s Computer Science & Engineering K-12 outreach program. Deadline to apply is May 23.

Welcome to Fighting Game AI Competition

“You are invited to develop an AI controller for Java based fighting game “FightingICE,” also wrapped for Python by Py4J. The current version of FightingICE also supports development of visual-based AI controllers. Please submit your AI controller and beat down your opponents to win the competition! The competition and the platform are organized and maintained by Intelligent Computer Entertainment Lab, Ritsumeikan University.” Deadline for midterm submissions is May 30.

NIPS 2018 Call for Workshops

Montreal, Quebec, Canada. Following the NIPS 2018 main conference, workshops on a variety of current topics will be held on Friday, December 7, 2018 and Saturday, December 8, 2018. Deadline for workshop submissions is June 11.

Tools & Resources

MegaVeridicality (v1.0) dataset

FACTS.lab at the University of Rochester and the Semantics Lab at Johns Hopkins University


“This MegaVeridicality dataset consists of ordinal veridicality judgments as well as ordinal acceptability judgments for 517 clause-embedding verbs of English. The data were collected on Amazon’s Mechanical Turk using Turktools.”

Movidius Neural Compute SDK

Intel Movidius


“As of V2.04.00, SDK has been refactored and contains many new features and structural changes. It is recommended you read the documentation to familiarize with the new features and contents.”

Qualitative before Quantitative: How Qualitative Methods Support Better Data Science

Medium, Indeed Data Science, Robyn Rap and Vicky Zhang


“Data scientists and their models can benefit greatly from qualitative methods. Without doing qualitative research, data scientists risk making assumptions about how users behave. These assumptions could lead to:”

  • neglecting critical parameters,
  • missing a vital opportunity to empathize with those using our products, or
  • misinterpreting data.

Full guide to developing REST API’s with AWS API Gateway and AWS Lambda

Sourcerer.IO, David Herron


“In this article we’ll explore using AWS Lambda to develop a service using Node.js. Amazon recently announced an upgrade where developers using Lambda can now use an 8.10 runtime, which lets them use async functions. We’ll be sure to use async functions in the application.”

LRWSN-hardware: The Long-range Wireless Sensor Network hardware.

Santa Fe New Mexican, Janette Rose Frigo


The Long-range Wireless Sensor Network developed by researchers at Los Alamos National Laboratory and West Virginia University easily, efficiently, and affordably collects, processes, and transmits data in all kinds of rugged and remote outdoor environments: areas with few roads, little to no infrastructure, no electricity or cellphone service, or extremely cold or hot temperatures. In fact, the researchers have already demonstrated continuous operation of the sensor network in remote areas for up to five years.

This invention grew out of the Laboratory’s decades of experience in developing rugged, low-power satellite components for a really remote and harsh environment: space. Now the Lab has applied this expertise to develop these novel long-range wireless sensor networks for harsh environments and low resource situations on earth.


Full-time positions outside academia

Data Scientist

Feedzai; Lisbon, Portugal

Interdisciplinary General Engineer / Physical Scientist

Government Accountability Office; Washington, DC
