Data Science newsletter – February 21, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for February 21, 2018


Data Science News

DigitalGlobe’s parent Maxar Technologies moving headquarters to Colorado

The Denver Post, Tamara Chuang


Make way for Maxar Technologies. The parent of Westminster’s DigitalGlobe said Wednesday that it is moving its global headquarters to Colorado and bringing 800 new jobs to the area.

The international space company, which until a year ago was based in Canada, will consolidate administrative operations in the Denver area as part of its recent $2.4 billion October purchase of DigitalGlobe, known for taking high-resolution photos of Earth from space. But beyond centralizing duties such as accounting, finance and taxes, Maxar officials expect to grow its technical workforce in Colorado.

“The vast majority (of employees) we have are engineers and scientists. We certainly hope to have significant growth in the DigitalGlobe business,” said Maxar CEO Howard L. Lance, who will move to Colorado.




PlanScore presents the most comprehensive historical dataset of partisan gerrymandering ever assembled. We also provide tools for policymakers and litigators to transparently score new plans and assess their fairness.

Roche buying Flatiron Health, backed by Alphabet

CNBC, Christina Farr


Alongside Alphabet’s GV (formerly Google Ventures), Roche is one of the biggest backers in Flatiron, having led the most recent venture round of $175 million in 2016. It already owned 12% of the company. As part of the deal, Roche agreed to acquire several of the company’s software products, which was intended to put the company in a position to go public.

Flatiron has an electronic medical record system used by doctors who are treating patients with cancer. It then uses this data to help researchers and life sciences companies figure out better treatments for cancer — for instance, by making sure that the right patients are being recruited for clinical trials.

The company was founded by former Google employees Nat Turner and Zach Weinberg.

University Data Science News

The past couple weeks have seen a series of unflattering glimpses into academic life. An historian wrote a much-forwarded quit-lit essay on what it’s like to be left behind when colleagues disappear from scholarship. We all know there aren’t enough tenure-track positions, but the scholarly and emotional impacts are rarely discussed because we’re all so busy patting ourselves on the back for having collected the best and brightest. The counterfactual to this ‘best and brightest’ claim may be that the impact factor is bad for science, incentivizing scientists who “are driven mostly by accolades….in positions of immense power.” This challenging viewpoint was offered by an anonymous author.

C. Titus Brown, open software scientist, felt no need to hide behind Anonymous when arguing that papers “are merely artifacts” that do little to advance the scientific process. Instead, “software, trainees, blog posts, and training materials are more impactful.”

Kate Weisshaar looked at three possible explanations for the lower number of women in tenured positions and found that even in sociology – a field heavily invested in gender equality, at least in terms of the research undertaken – “some productivity measures partially account for the gender gap in promotion, but large portions of the gender gap are not explained either by research productivity or by the department/school context. In other words, the results suggest that gender inequality in the promotion evaluation processes are contributing to the gender gap in promotion among professors.”

It could be so much worse! Singularity University, Silicon Valley’s attempt to prove that academia is organized all wrong, what with our silly tuition and codes of conduct, is having trouble adulting. “A teacher allegedly sexually assaulted a former student, an executive stole more than $15,000, a former staffer alleges gender and disability discrimination, and Singularity dismissed 14 of about 170 staffers and suspended GSP, now called the Global Solutions Program, after Google withdrew funding last year.” In a banal but fundamentally more damaging accusation, “Alumni say for-profit Singularity is becoming just another organizer of conferences and executive seminars.” Very few intelligent people want to saddle up for the for-profit conference rodeo circuit as their *only* professional activity.

PlanScore, a mapping and political science non-profit led by Michal Migurski, is offering 1) a large dataset of partisan gerrymandering in US history and 2) a tool for lawmakers to assign a partisanship score to new plans for redistricting.

Stanford’s OpenfMRI project has succeeded so swimmingly that after 6 months it is becoming OpenNeuro. OpenNeuro will accept and distribute all types of neurological data, not just fMRI data. All historical datasets will be transferred and efforts will be made to prevent link rot. Mazel tov.

There’s also a Canadian version of the neurological data sharing movement, the Canadian Open Neuroscience Platform, which McGill University’s Montreal Neurological Institute has joined.

Getting to the psychiatric side of brains, Dr. Tom Insel, former director of the National Institute of Mental Health, now founder and CEO of Mindstrong, gave the best interview of the week detailing the intersection between data science and mental health. The field, known for publishing the DSM, has largely been building knowledge based on subjective reports. Dr. Insel wants to change that by finding biomarkers, an approach riddled with controversy. His earlier efforts to insert genomics and imaging failed: “I came in with the bias that we could fix the diagnostic problem with genomics and imaging. I was wrong. We spent a lot of money on both of those efforts, and at the end of the day we found only a little evidence.”

I have been teaching my class that social media is a winner-take-all-market (Frank & Cook, 1996) and Mathias Bartl, a professor at Offenburg University of Applied Sciences, has new data out suggesting that the top 3% of accounts on YouTube capture 90% of the attention economy. YouTubers who want to quit their day jobs need to be in the top 1% to make a decent living.

Georgia Tech has a new technique for authenticating individuals that combines facial recognition with a behavioral component. A user’s facial image would be captured by the device trying to authenticate them (single-authentication) and then they’d be asked to speak a prompted word within a very short time period. The time period would be too short for current malfeasant actors to spoof, even if they could send the correct facial image data. I’m semi-queasy about static biometric identifiers, but there’s great promise in behavioral authentication.

Penn State’s Conrad Tucker won a ~$350,000 grant from the US Air Force to predict global threats using social media data. Right. This is one of those grants Tucker had to frame in terms of national security interests when what he’s actually doing is basic science. Tucker notes, “Regardless of what domain you are looking at, the fundamental problem is when you go to sample data or acquire information, how do you know which pieces of data to include in your model and how do you know which ones to leave out?” It’s a classic big data question and I’m pleased he’s spending some of our gargantuan military budget to address it.

Georgia State University is using a behavioral tracking system to reduce its dropout rate by 22%. Advisors overseeing 300 students (wow, public education is underfunded) get access to risk scores, allowing them to dedicate their time and effort to the students who need it most. Score one for the bots. And the people.

Susan Lozier at Duke University reported on a network of 53 new ocean sensors called OSNAP that will eventually settle long-standing oceanographic questions about “how much of the overturning circulation is driven by the winds and how much by the formation of cold, dense water that sinks into the deep sea.” This is also a reminder about how much ground truth we lack as we face questions about the impacts of climate change. The earliest year for which the OSNAP network has data: 2014.

Brown University raised tuition by 10% in five of its master’s degree programs, including computer science and data science. The rationale is capitalism: “The increases follow the decision to apply variable tuition rates to these programs based on market prices.”

Machine learning identifies potential inorganic complexes for switches and sensors

Chemical & Engineering News, Sam Lemonick


Machine learning, in which computers train on large data sets to make predictions, can be a fast way to find promising molecules for various applications, but it’s only as good as the data it trains on. A new strategy could make the method more useful for identifying leads among inorganic complexes, for which reliable data can be harder to come by.

Amazon Go – Deep Learning Conquers Retail

insideBIGDATA, Daniel Gutierrez


Seattle is one of my favorite tech-friendly cities and I always look forward to heading out to the Pacific Northwest for a conference. Sometimes I take off on foot from my favorite downtown hotel to take in the feel of the city. Yes, Seattle is that cool. This time, I stumbled upon a real gem – the new Amazon Go store located at 2131 7th Ave., just a few blocks from my hotel. I soon found out this experimental retail outlet is a bold new move to transform retail that’s powered by deep learning technology, specifically image recognition algorithms.

You immediately get a sense for its uniqueness – there are no cashiers or registers anywhere! There’s no need for any. You enter the store through a row of gates resembling a BART station in San Francisco. Only shoppers with the store’s smartphone app are allowed in. Inside, the complement of products is like that of a small bodega. The store itself is only 1,800 square feet.

OpenfMRI becomes OpenNeuro

Stanford Center for Reproducible Neuroscience


Six months ago we launched a new platform for sharing and analyzing neuroimaging data: OpenNeuro. Over 600 users have registered on the platform since it launched, helping us test all of the new exciting features (such as client-side BIDS dataset validation, resumable uploads, and running analysis apps). We are now ready to make OpenNeuro the sole destination for sharing neuroimaging data provided by our center. This means that the OpenfMRI platform will no longer accept new data submissions – instead we encourage users to upload their data to OpenNeuro.

Deep learning applications in railroads: Predicting carloads

Predictive Analytics Times, Railway Age, Offei Adarkwa, Ph.D., and Nii Attoh-Okine, Ph.D., PE


The freight railroad industry is considered by many as having a critical role to play in the development and economic growth of the United States. In 2014 alone, it generated $33 billion in tax revenues as part of a total of $274 billion contribution to the economy (Lord, 2016).

Unsurprisingly, expert analysts and investors including Warren Buffett view freight carloads as one of the most important economic indicators (Perry, 2010). Multiple efforts have been made in the past to predict rail freight volume. Accurate prediction is important for freight railroad companies in planning and decision-making for operations, maintenance, labor allocation and capital investment. In the midst of rapid technological development and marketplace integration, prediction of rail freight volumes is all the more crucial. With the right information on future freight volumes, rail companies can efficiently allocate resources to maximize their returns on investments. At the administrative level, this provides a scientific basis for developing policies aimed at improving service delivery (Yang and Yu, 2015). Due to the recent success of TensorFlow, a machine-learning tool used in business applications, this work explored how these tools can be used to predict rail freight carloads in the United States.

Science Pick of the Week: UO undergrad is using social media to create more constructive dialogue about ocean conservation

University of Oregon, Daily Emerald student newspaper, Franklin Lewis


If she had it her way, Ellie Jones would have started campaigning for ocean life 30 years ago. The only problem: she was not born yet.

Jones, a junior marine biology major at the University of Oregon, is the creator and administrator of Everblue, an ocean conservation awareness project. The project is simple: sift through scientific journals and databases to find facts or new discoveries related to ocean ecosystems. Jones along with her diverse team of undergrads, graduate students and professors from across the country translate these findings into tips, which are pushed out through Everblue’s Instagram, Facebook — and as of last week — Twitter.

“Paring it down to just one or two sentences has definitely been a challenge,” Jones said. “If you’re scrolling through Instagram, nobody is going to take the time to read paragraphs.”

Facebook Still Lying About Its Role in the 2016 Election

Talking Points Memo, Josh Marshall


I flagged this on Twitter before President Trump started flogging it. But I’m not at all surprised that he did. Because, somewhat to my surprise, it revealed that Facebook seems still to be committed to lying, albeit now more artfully, about its role in the 2016 election and more broadly as a channel of choice for propaganda and misinformation.

#UMassCICS joins the universities mentioned in this @nytimes article in offering a course in #ComputingEthics.

Twitter, UMass CICS


COMPSCI 590E Ethical Considerations in Computing was offered for the first time in Fall ’17.

Northcentral University Introduces New Approach to Data Science Graduate Education

Chicago Evening Post, Ari Roul


Northcentral University’s School of Technology, led by Dean J. Robert Sapp, EdD, has announced a groundbreaking approach to data science graduate education, now available through the school’s Master of Science and Doctor of Philosophy in Data Science programs.

The current demand for data scientists is irrefutable. In fact, IBM and the Business-Higher Education Forum estimate that there will be 2.7 million open data and analytics positions in the US by 2020. What’s more, the industry itself is changing in real time, with continuous advancements and new applications across every sector.

The Neuro joins neuroscience data sharing partnership

McGill University, Newsroom


Modern neuroscience research can produce massive amounts of data, which researchers can use to find patterns revealing anything from the first physiological signs of Alzheimer’s disease to a new drug target that could stop neurodegeneration. However, this data must be stored, processed, and distributed effectively.

To improve access to critical data, and to continue its policy of being a leader in open science, The Montreal Neurological Institute and Hospital (The Neuro) of McGill University has joined The Canadian Open Neuroscience Platform (CONP), a new data sharing partnership that will break down the barriers to collaboration, facilitating the distribution of data across the Canadian neuroscience community and beyond.

Computational Social Science and Digital Humanities

Area 51 – Stack Exchange


Proposed Q&A site for researchers and practitioners who use computers to model, simulate, and analyze phenomena in social sciences and humanities, such as computational economics, computational sociology, cliodynamics, culturomics, and content analysis.

Real-Time Captcha Technique Improves Biometric Authentication

Georgia Tech, News Center


A new login authentication approach could improve the security of current biometric techniques that rely on video or images of users’ faces. Known as Real-Time Captcha, the technique uses a unique challenge that’s easy for humans — but difficult for attackers who may be using machine learning and image generation software to spoof legitimate users.

The Real-Time Captcha requires users to look into their mobile phone’s built-in camera while answering a randomly-selected question that appears within a Captcha on the screens of the devices. The response must be given within a limited period of time that’s too short for artificial intelligence or machine learning programs to respond. The Captcha would supplement image- and audio-based authentication techniques that can be spoofed by attackers who may be able to find and modify images, video and audio of users — or steal them from mobile devices.
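The timing idea described above can be illustrated with a short Python sketch. This is not Georgia Tech's implementation: the two-second window, the prompt list, and the function names are all hypothetical, and the face-recognition and speech-recognition steps are reduced to plain inputs.

```python
import secrets
import time
from dataclasses import dataclass

# Hypothetical values: the article does not state the real limit or prompts.
RESPONSE_WINDOW_SECONDS = 2.0
PROMPT_WORDS = ["apple", "river", "seven", "orange", "window"]


@dataclass
class Challenge:
    word: str          # the prompt the user must answer
    issued_at: float   # clock reading when the prompt was shown


def issue_challenge(now=None):
    """Pick an unpredictable prompt and record when it was issued."""
    issued_at = time.monotonic() if now is None else now
    return Challenge(word=secrets.choice(PROMPT_WORDS), issued_at=issued_at)


def verify_response(challenge, spoken_word, face_matches, now=None):
    """Accept only a correct, timely answer paired with a matching face.

    `spoken_word` stands in for speech-recognizer output and `face_matches`
    for the result of a face-recognition check; both are assumed inputs.
    """
    answered_at = time.monotonic() if now is None else now
    elapsed = answered_at - challenge.issued_at
    in_time = 0 <= elapsed <= RESPONSE_WINDOW_SECONDS
    correct = spoken_word.strip().lower() == challenge.word
    return bool(face_matches) and in_time and correct
```

The `now` parameter exists only so the timing rule can be exercised without waiting on a real clock. A late answer is rejected even when the word and the face are both right, which is the property the technique relies on.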


A Gilbert Public Lecture: Microsoft’s Brad Smith: The Rise of Artificial Intelligence: Who Will Have Jobs in the Future?

Princeton CITP


Princeton, NJ Thursday, March 1, starting at 4:30 p.m., McCosh 50. [registration requested]




Paris, France April 19-20, before RECOMB conference. “The meeting will focus on proteogenomics, single cell systems biology and cancer epidemiology, and how crowdsourced science, data sharing and a culture of collaboration can help advance research in these fields.” [$$$]

Privacy Law Forum: Silicon Valley

Berkeley Center for Law & Technology


East Palo Alto, CA March 23, Four Seasons Hotel. [$$$]


Supported Undergraduate Positions on Summer Research Teams

“The Johns Hopkins University Center for Language and Speech Processing is hosting the Fifth Frederick Jelinek Memorial Summer Workshop. We are seeking outstanding members of the current junior class to join a summer research workshop on language engineering from June 11 to August 3, 2018.” Deadline for applications is March 2.

MRQA 2018: Machine Reading for Question Answering

Melbourne, Australia July 19, workshop at ACL 2018. Deadline for submissions is April 23.

NYU Center for Data Science News

NYU Scientist Tells Why the AI Apocalypse Isn’t the End of the World

Observer, Sissi Cao


“Machines still have a long way to go to replace humans,” Kyunghyun Cho, a scientist at Facebook AI Research and a data science professor at New York University, told Observer in a recent interview.

Cho is a rising star in machine translation, a subfield of computational linguistics that has seen major breakthroughs in recent years thanks to the application of AI. Cho was named on Bloomberg’s list of “people to watch in 2018.” Geoffrey Hinton, a computer science professor at the University of Toronto, who is regarded as “the Godfather of AI,” told Bloomberg that Cho’s work had a huge impact on machine translation.

Episode #115: Data Ethics with Laura Noren & Hetan Shah

Jon Schwabish


“On last week’s episode, I sat outside Facebook and chatted with Andy Kirk about our experience at the Social Science Foo Camp, a two-and-a-half day conference at Facebook that brought together all sorts of social scientists. One of the first sessions I attended was about Data Ethics, co-hosted by Laura Noren from the Institute for Public Knowledge at NYU and Hetan Shah, the Executive Director of the Royal Statistical Society. After my recent interview with Jenn Schiffer, I was curious about what and how they thought about what data ethics means and how we can all be more ethical with it.” [audio, 20:17]

Tools & Resources




Luna is a data processing and visualization environment built on a principle that people need an immediate connection to what they are building. It provides an ever-growing library of highly tailored, domain specific components and an extensible framework for building new ones.

Using Lyrics to Predict Genre

Timothy Dobbins


In the last post we analyzed rap lyrics using word vectors. We mostly did some first-pass analysis and very little prediction. In this post, we’ll actually focus on predictions and visualizing our results. I’ll use Python’s machine-learning library, scikit-learn, to build a naive Bayes classifier to predict a song’s genre given its lyrics. To get the data, we’ll use Cypher, a new Python package I recently released that retrieves music lyrics. To visualize the results, I’ll use D3 and D3Plus, which is a nice wrapper for D3.
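For readers who want to see what a naive Bayes genre classifier does under the hood, here is a minimal sketch in plain Python. It is not the post's code: the post uses scikit-learn and lyrics fetched with Cypher, whereas this version hand-rolls a multinomial naive Bayes with add-one smoothing, and the four "songs" are made-up toy data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would do more."""
    return text.lower().split()


class NaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing strength
        self.word_counts = defaultdict(Counter)  # genre -> word -> count
        self.doc_counts = Counter()              # genre -> number of songs
        self.vocab = set()

    def fit(self, lyrics, genres):
        for text, genre in zip(lyrics, genres):
            words = tokenize(text)
            self.word_counts[genre].update(words)
            self.doc_counts[genre] += 1
            self.vocab.update(words)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_genre, best_score = None, -math.inf
        for genre in self.doc_counts:
            counts = self.word_counts[genre]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Start from the log prior, then add a log likelihood per word.
            score = math.log(self.doc_counts[genre] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + self.alpha) / denom)
            if score > best_score:
                best_genre, best_score = genre, score
        return best_genre


# Toy, invented training data (not real lyrics).
lyrics = [
    "money cars street flow",
    "street rhyme beat flow",
    "truck road whiskey heart",
    "heart road home whiskey",
]
genres = ["rap", "rap", "country", "country"]
model = NaiveBayes().fit(lyrics, genres)
print(model.predict("street flow beat"))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that is absent from one genre from zeroing out that genre's score.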

Getting Text into Tensorflow with the Dataset API

Medium, Tal Perry


This post will discuss consuming text in Tensorflow with the Dataset API, which makes things almost easy. To illustrate the ideas in this post, I’ve uploaded a repo with an implementation of the end to end process described here. It contains a model that reads a verse from the bible character by character and predicts which book it came from. (e.g. “in the beginning…” came from the book of Genesis). The model itself is not the point, rather I hope the repo serves as a living example of how to use the Dataset API to work with textual data.


Full-time positions outside academia

Communications and Member Relations Director

DataCite; Global

Senior Analyst, Quantitative Analysis

New York Yankees; Bronx, NY
