Data Science newsletter – June 12, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for June 12, 2018


Data Science News

Majority of Americans Believe It Is Essential That the U.S. Remain a Global Leader in Space

Pew Research Center, Cary Funk and Mark Strauss


Sixty years after the founding of the National Aeronautics and Space Administration (NASA), most Americans believe the United States should be at the forefront of global leadership in space exploration. Majorities say the International Space Station has been a good investment for the country and that, on balance, NASA is still vital to the future of U.S. space exploration even as private space companies emerge as increasingly important players.

Roughly seven-in-ten Americans (72%) say it is essential for the U.S. to continue to be a world leader in space exploration, and eight-in-ten (80%) say the space station has been a good investment for the country, according to a new Pew Research Center survey conducted March 27-April 9, 2018.

The most exciting thing about Canada legalising weed? Data

Wired UK, Bryan Borzykowski


When the information floodgates open, researchers from all fields – health, criminology, policy, economy and more – will be able to collect information about cannabis use that they weren’t able to get before.

Johns Hopkins professor Alex Szalay is leading a new effort to manage the ‘deluge’ of scientific data

Baltimore, Stephen Babcock


The Open Storage Network is looking to provide a more efficient way for researchers to manage data. The National Science Foundation provided $1.8 million for the pilot.

UH starts data science programming with the basics: What, exactly, is a data scientist?

Houston Chronicle, Lindsay Ellis


Data science in Houston is a loaded term.

It conjures a scuttled University of Texas System project, attractive skills on a job listing and, perhaps, an endorsement by Mayor Sylvester Turner, calling for collaborations between universities to bring Houston into the burgeoning field.

On Thursday, experts in a quiet, sunlit University of Houston classroom made concrete the vague field by giving examples of data science in action. The campus’s new institute hosted roughly 40 people, who listened to field leaders define the term, show how data science can be used in political science and demonstrate a model of coding for large-scale analysis.

UH announced its center in October, but the summer series beginning this month will be the first public events from the institute, meant to fortify the city’s entry into a field that has attracted significant interest around Texas and nationwide.

Artificial intelligence is in a bubble: Here’s why we should build it anyway

The Globe and Mail, Stephen Piron


Today’s euphoria about AI feels eerily similar. As a co-founder of a Toronto-based AI company, I get weekly pitches from investment bankers looking to take us – a young company slightly more than two years old – public. While flattering, this attention seems premature. I also imagine we’re not the only AI company investment banks are calling on. These days, the world is hot for AI given its potential to transform our lives.

When it comes to the web’s first wave of hype, I’m sure most people recall how quickly the dot-com bubble burst. What the internet could deliver in those days wasn’t what people had been promised – the reality couldn’t surpass the hype.

This is something people should be thinking about more – a lot more – when it comes to AI, especially here in Canada.

Rice team to unveil PlinyCompute at SIGMOD conference

Rice University, Rice Engineering


Computer scientists from Rice University’s DARPA-funded Pliny Project believe they have the answer for every stressed-out systems programmer who has struggled to implement complex objects and workflows on ‘big data’ platforms like Spark and thought: “Isn’t there a better way?”

Rice’s PlinyCompute will be unveiled here Thursday at the 2018 ACM SIGMOD conference. In a peer-reviewed conference paper, the team describes PlinyCompute as “a system purely for developing high-performance, big data codes.”

Like Spark, PlinyCompute aims for ease of use and broad versatility, said Chris Jermaine, the Rice computer science professor leading the platform’s development. Unlike Spark, PlinyCompute is designed to support the intense kinds of computation that have only previously been possible with supercomputers, or high-performance computers (HPC).

MITx MicroMasters Program in Statistics and Data Science opens enrollment

MIT News, MIT Open Learning


The new MITx MicroMasters Program in Statistics and Data Science, which opened for enrollment today, will help online learners develop their skills in the booming field of data science. The program offers learners an MIT-quality, professional credential, while also providing an academic pathway to pursue a PhD at MIT or a master’s degree elsewhere.

“There are many online programs that provide a professional overview of data science, but they don’t offer the level of detail learners gain from an actual, residential master’s program,” says Professor Devavrat Shah, faculty director of the program and MIT professor in the Department of Electrical Engineering and Computer Science (EECS). “This new MicroMasters program in Statistics and Data Science is bringing the quality, rigor, and structure of a master’s-level, residential program in data science at MIT to a wider audience around the world, and at a very accessible price, so people can learn anywhere they are while keeping their day jobs.”

In all, seven universities will be accepting the new MicroMasters Statistics and Data Science (SDS) credential towards a master’s degree, including the Rochester Institute of Technology (United States), Doane University (United States), Galileo University (Guatemala), Reykjavik University (Iceland), Curtin University (Australia), Deakin University (Australia), and RMIT University (Australia).

Why the Future of Machine Learning is Tiny

Pete Warden's blog


When Azeem asked me to give a talk at CogX, he asked me to focus on just a single point that I wanted the audience to take away. A few years ago my priority would have been convincing people that deep learning was a real revolution, not a fad, but there have been enough examples of shipping products that that question seems answered. I knew this was true before most people not because I’m any kind of prophet with deep insights, but because I’d had a chance to spend a lot of time running hands-on experiments with the technology myself. I could be confident of the value of deep learning because I had seen with my own eyes how effective it was across a whole range of applications, and knew that the only barrier to seeing it deployed more widely was how long it takes to get from research to deployment.

Instead I chose to speak about another trend that I am just as certain about, and will have just as much impact, but which isn’t nearly as well known. I’m convinced that machine learning can run on tiny, low-power chips, and that this combination will solve a massive number of problems we have no solutions for right now. That’s what I’ll be talking about at CogX, and in this post I’ll explain more about why I’m so sure.

Why Silicon Valley CEOs now have France on their to-do lists

CNET, Katie Collins


Paris in the spring is always worth a visit, especially if you’re a Silicon Valley exec and you’ve been personally summoned by the president of France.

In late May, the tech world’s version of a star-studded event took place in the French capital. Facebook’s Mark Zuckerberg, Microsoft’s Satya Nadella, Google’s Eric Schmidt and scores of other tech leaders appeared at a relatively new and fairly obscure conference called VivaTech. It’s a lineup that might not be unusual in California, but in Europe it’s almost unheard of.

To understand how the conference managed to snag famed names to appear on its stage, you only need look to France’s charismatic, tech-savvy president: Emmanuel Macron. Macron has made great efforts to woo Silicon Valley, and the resulting love story is romance for the ages.


Climate Informatics Workshop

Boulder, CO September 19-21. “The Climate Informatics workshop series seeks to build collaborative relationships between researchers from statistics, machine learning and data mining and researchers in climate science. Because climate models and observed datasets are increasing in complexity and volume, and because the nature of our changing climate is an urgent area of discovery, there are many opportunities for such partnerships.” Deadline for paper submissions is June 30.

SSRC Abe Fellowship

“The Abe Fellowship is designed to encourage international multidisciplinary research on topics of pressing global concern. The program seeks to foster the development of a new generation of researchers who are interested in policy-relevant topics of long-range importance and who are willing to become key members of a bilateral and global research network built around such topics.” Deadline for applications is September 1.

Moore-Sloan Data Science Environment News

Jupyter Notebooks: The influential software system at the center of data science

Gordon and Betty Moore Foundation


Our world is whirling in data. So much so, there is now an entire field (data science) and a new profession (data scientist) expected to make sense of it all.

As a field, data science obtains, scrubs, explores, models and interprets data. And the data scientist – called the “Sexiest Job of the 21st Century” by Harvard Business Review – helps interpret and manage the data and solve problems. Data scientists build algorithms, write code and visualize data, all of which can get difficult to manage.

The partner to the data scientist is Jupyter Notebooks – free, open-source software designed as a computational notebook for documenting, visualizing and running complex code with data.

New report on Career Paths and Prospects in Academic Data Science

University of California-Berkeley, Berkeley Institute for Data Science


We are excited to release the Career Paths and Prospects in Academic Data Science: Report of the Moore-Sloan Data Science Environments (MSDSE) Survey. The survey and report are a joint collaboration between researchers at the Berkeley Institute for Data Science at UC-Berkeley, the eScience Institute at UW-Seattle, and the Center for Data Science at New York University. These three institutes are funded by the Gordon and Betty Moore and Alfred P. Sloan Foundations to support data-intensive research across fields. In this project, we surveyed 167 researchers who were affiliated with these three institutes for data science, with our respondents spanning many fields, roles, and career stages.

Tools & Resources

Year 1: They Don’t Teach This in Grad School

Medium, thewulab, Eugene Wu


… “It’s hard to scale out research without students to help do the work. Since I didn’t have any PhD students and didn’t yet have a sense of how to assess applicants, I was liberal in accepting undergraduate and masters research applications (almost 10) to work on a variety of promising research leads.”

“The hard part is that lots of students don’t yet have the technical background nor research experience to jump onto an open-ended problem, and need more structure. That was something I did not anticipate nor provide, which led to a huge amount of wasted effort on my and, unfortunately, the students’ part. Most students disappeared by the end of the semester, and many within a few weeks.”

Scaling Neural Machine Translation

arXiv, Computer Science > Computation and Language; Myle Ott, Sergey Edunov, David Grangier, Michael Auli


“Sequence to sequence learning models still require several days to reach state of the art performance on large benchmark datasets using a single machine. This paper shows that reduced precision and large batch training can speedup training by nearly 5x on a single 8-GPU machine with careful tuning and implementation. On WMT’14 English-German translation, we match the accuracy of (Vaswani et al 2017) in under 5 hours when training on 8 GPUs and we obtain a new state of the art of 29.3 BLEU after training for 91 minutes on 128 GPUs. We further improve these results to 29.8 BLEU by training on the much larger Paracrawl dataset.”
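One standard ingredient of the reduced-precision training the abstract describes is loss scaling: the loss is multiplied by a large constant before backpropagation so that small gradients stay above float16’s underflow threshold, then the gradients are divided back out in float32 before the parameter update. A toy numpy illustration of why the scale matters (the scale value here is illustrative, not the one used in the paper):

```python
import numpy as np

loss_scale = 1024.0
true_grad = np.float32(1e-8)   # a gradient this small underflows in float16

# Naive half precision: the tiny gradient is flushed to zero.
naive_fp16 = np.float16(true_grad)

# Loss scaling: multiplying the loss (and therefore every gradient that
# flows back through the float16 tensors) by loss_scale keeps the value
# above float16's underflow threshold.
scaled_fp16 = np.float16(true_grad * loss_scale)

# The scale is divided back out in float32 before the weight update,
# recovering (approximately) the true gradient.
recovered = np.float32(scaled_fp16) / loss_scale
```

In a real training loop the same idea is applied to the loss tensor itself, and the scale is often adjusted dynamically when overflows are detected.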

Plotting Shots Using Statsbomb Data



In the previous tutorial we learnt how to create a plot map background for use with Statsbomb data. Let’s use what we learned to plot the shots from one particular game.
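A minimal sketch of the extraction-and-plot step, assuming events in the shape of Statsbomb’s open-data JSON (a nested `type.name` and a two-element `location` per event); the function names here are illustrative, not part of the tutorial:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

def shot_locations(events, team=None):
    """Collect (x, y) pitch coordinates for every shot event,
    optionally filtered to a single team."""
    shots = []
    for e in events:
        if e.get("type", {}).get("name") != "Shot":
            continue
        if team is not None and e.get("team", {}).get("name") != team:
            continue
        shots.append(tuple(e["location"]))
    return shots

def plot_shots(events, team=None):
    """Scatter the shots onto a blank 120 x 80 Statsbomb pitch frame."""
    xy = shot_locations(events, team)
    fig, ax = plt.subplots(figsize=(8, 5))
    if xy:
        xs, ys = zip(*xy)
        ax.scatter(xs, ys, s=40, alpha=0.7)
    ax.set_xlim(0, 120)  # Statsbomb pitches are 120 units long
    ax.set_ylim(80, 0)   # flip y so the plot matches pitch orientation
    return fig, ax
```

In practice you would draw these on top of the pitch background from the previous tutorial rather than a bare axes.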

Improving Language Understanding with Unsupervised Learning

OpenAI, Alec Radford


“We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach is a combination of two existing ideas: transformers and unsupervised pre-training. These results provide a convincing example that pairing supervised learning methods with unsupervised pre-training works very well; this is an idea that many have explored in the past, and we hope our result motivates further research into applying this idea on larger and more diverse datasets.”

It’s Time to Make Your Data Count!

Make Data Count


One year into our Sloan funded Make Data Count project, we are proud to release Version 1 of standardized data usage and citation metrics!

As a community that values research data, it is important for us to have a standard and fair way to compare metrics for data sharing. We know of and are involved in a variety of initiatives around data citation infrastructure and best practices, including Scholix, Crossref and DataCite Event Data. But data usage metrics are tricky, and until now there had not been a group focused on processes for evaluating and standardizing data usage. Last June, members from the MDC team and COUNTER began talking through what a recommended standard could look like for research data.

Making HST Public Data Available on AWS



The Hubble Space Telescope has undeniably expanded our understanding of the universe during its 28 years in space so far, but this is not just due to its superior view from space: One of the major advantages to Hubble is that every single image it takes becomes public within six months (and in many cases immediately) after it is beamed back to Earth. The treasure trove that is the Hubble archive has produced just as many discoveries by scientists using the data “second hand” as it has from the original teams who requested the observations. Providing access to archives is at the core of our mission.

For all its richness, however, the archive of Hubble observations has been geared to individual astronomers analyzing relatively small sets of data. The data access model has always been that an astronomer first downloads the data and then analyzes it on their own computer. Currently, most astronomers are limited both by the volume of data they can reasonably download and by their access to large-scale computing resources.

ELMo – Deep contextualized word representations

AllenNLP; Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer


ELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.
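Concretely, those “learned functions of the internal states” amount to a softmax-weighted sum of the biLM’s layer activations, scaled by a task-specific scalar (the paper’s gamma). A minimal numpy sketch of that mixing step (shapes and names here are illustrative, not AllenNLP’s API):

```python
import numpy as np

def elmo_combine(layer_states, scalar_weights, gamma=1.0):
    """Mix biLM layer states into one contextual vector per token.

    layer_states   : (L, T, D) array -- L biLM layers, T tokens,
                     D-dimensional hidden states.
    scalar_weights : length-L task-specific weights, softmax-normalized
                     here before mixing.
    gamma          : task-specific scale applied to the whole vector.
    """
    w = np.asarray(scalar_weights, dtype=float)
    s = np.exp(w - w.max())   # numerically stable softmax
    s /= s.sum()
    # Weighted sum over the layer axis: result is (T, D).
    return gamma * np.einsum("l,ltd->td", s, np.asarray(layer_states))
```

In the full model both the scalar weights and gamma are trained jointly with the downstream task, which is what lets different tasks favor different biLM layers.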


Full-time positions outside academia

Senior Data Manager

The Health Foundation; London, England
