Data Science newsletter – May 31, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for May 31, 2017


Data Science News

For Modern Astronomers, It’s Learn to Code or Get Left Behind

WIRED, Science, Sarah Scoles


Astronomer Meredith Rawls was in an astronomy master’s program at San Diego State University in 2008 when her professor threw a curveball. “We’re going to need to do some coding,” he said to her class. “Do you know how to do that?”

Not really, the students said.

And so he taught them—at lunch, working around their regular class schedule. But what he meant by “coding” was Fortran, a language IBM developed in the 1950s. Later, working on her PhD at New Mexico State, Rawls decided her official training wasn’t going to cut it. She set out to learn a more modern language called Python, which she saw other astronomers switching to. “It’s going to suck,” she remembers telling herself, “but I’m just going to do it.”

And so she started teaching herself, and signed up for a workshop called SciCoder.

Using Open Government Data to predict sense of local community

University of Oxford, Oxford Internet Institute, The Policy and Internet Blog


Community-based approaches are widely employed in programmes that monitor and promote socioeconomic development. And building the “capacity” of a community — i.e. the ability of people to act individually or collectively to benefit the community — is key to these approaches. The various definitions of community capacity all agree that it comprises a number of dimensions — including opportunities and skills development, resource mobilization, leadership, participatory decision making, etc. — all of which can be measured in order to understand and monitor the implementation of community-based policy. However, measuring these dimensions (typically using surveys) is time consuming and expensive, and the absence of such measurements is reflected in a greater focus in the literature on describing the process of community capacity building, rather than on describing how it’s actually measured.

A cheaper way to measure these dimensions, for example by applying predictive algorithms to existing secondary data like socioeconomic characteristics, socio-demographics, and condition of housing stock, would certainly help policy makers gain a better understanding of local communities.

25 Tweets to Know You: A New Model to Predict Personality with Social Media

Medium, Synced


To provide personalized ads, tech giants such as Google and Facebook try to infer their users’ personalities from their posts on social media, so predicting personality from written text is an essential capability for social networking applications. Existing approaches, however, require too much input data to be used realistically. In this paper, the authors develop a model that predicts personality with a reduced data requirement, achieving better performance than state-of-the-art techniques while requiring eight times less data.

Chris Manning: How computers are learning to understand language​

Stanford University, Stanford Engineering


A computer scientist discusses the evolution of computational linguistics and where it’s headed next. He was recently named the Thomas M. Siebel Professor in Machine Learning.

John Deere opens SF office to make its tractors smarter

Stacey Higginbotham, Stacey on IoT blog


Farm equipment company John Deere is no stranger to the internet of things; it was connecting sensors and actuators on the farm twenty years ago. The next big thing in farming is what connected devices enable: precision agriculture, which combines connected devices with machine learning so that equipment can make faster and more precise decisions, possibly without a farmer’s input.

To make precision ag the new reality, John Deere needs Silicon Valley skills. That’s why last week it opened an office in San Francisco’s SoMa neighborhood to connect with local talent.

Science Needs a Solution for the Temptation of Positive Results

The New York Times, The Upshot blog, Aaron E. Carroll


Research is hard, and rarely perfect. A better understanding of methodology, and the flaws inherent within, might yield more reproducible work.

The research environment, and its incentives, compound the problem. Academics are rewarded professionally when they publish in a high-profile journal. Those journals are more likely to publish new and exciting work. That’s what funders want as well. This means there is an incentive, barely hidden, to achieve new and exciting results in experiments.

Some researchers may be tempted to make sure that they achieve “new and exciting results.” This is fraud. As much as we want to believe it never happens, it does. Clearly, fabricated results are not going to be replicable in follow-up experiments.

But fraud is rare.

Steve Ballmer shows how AI and data will enhance NBA broadcasts for fans

GeekWire, Taylor Soper


Ballmer’s NBA team, the Los Angeles Clippers, will partner with Los Angeles-based startup Second Spectrum to release a new product that overlays the traditional TV broadcast with new data and animations largely driven by artificial intelligence.

Led by two former USC professors, Second Spectrum has built technology that essentially allows computers to watch sports and track player/ball movement at a granular level. It then applies machine learning and AI to help derive new insights for coaches, players, and even media-related customers.

Why your art degree could equal job security in the AI age

World Economic Forum, Quartz, Dave Gershgorn


When machines control all the world’s finances and run factory floors, what will humans be left to do?

We’ll make art, says Kai-Fu Lee, a former Google and Microsoft executive who has since launched VC firm Sinovation Ventures.

“Art and beauty is very hard to replicate with AI. Given AI is more objective, analytical, data driven, maybe it’s time for some of us to switch to the humanities, liberal arts, and beauty,” Lee told Quartz editor-in-chief Kevin Delaney during a live Q&A session. “Maybe professions where it’s hard to find a job might be good to study.”

As Computer Coding Classes Swell, So Does Cheating

The New York Times, Jess Bidgood and Jeremy B. Merrill


“There’s a lot of discussion about it, both inside a department as well as across the field,” said Randy H. Katz, a professor in the electrical engineering and computer science department at the University of California, Berkeley, who discovered in one year that about 100 of his roughly 700 students in one class had violated the course policy on collaborating or copying code.

Computer science professors are now delivering stern warnings at the start of each course, and, like colleagues in other subjects, deploy software to flag plagiarism. They have unearthed numerous examples of suspected cheating.

At Brown University, more than half the 49 allegations of academic code violations last year involved cheating in computer science.
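As a generic illustration of how similarity-flagging software works (a sketch only; real tools such as MOSS fingerprint normalized token streams and are far harder to fool than this), Python’s standard library can score how alike two submissions are:

```python
import difflib


def similarity(code_a: str, code_b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two source files.

    This compares raw text, so trivial edits lower the score only
    slightly. Production plagiarism detectors normalize identifiers
    and whitespace before fingerprinting, so simple variable renaming
    does not hide copying.
    """
    return difflib.SequenceMatcher(None, code_a, code_b).ratio()


original = "def add(a, b):\n    return a + b\n"
copied = "def add(x, y):\n    return x + y\n"  # same code, renamed variables
unrelated = "print('hello world')\n"

# The renamed copy still scores far higher than unrelated code.
print(similarity(original, copied))
print(similarity(original, unrelated))
```

A real course-scale tool would compare every pair of submissions and flag pairs whose score exceeds a threshold for human review; the score alone is evidence, not proof.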

Banks Eager For Artificial Intelligence, But Slow To Adopt

Forbes, Adelyn Zhou


Only a few outliers in the banking sector, such as Capital One, have been able to ship AI products as quickly as their counterparts in Silicon Valley. While many financial institutions have publicly announced ambitious plans to integrate artificial intelligence and machine learning, customers are still waiting months later for these proposed products and services.

So why are banks – typically among the most capable and tech-intensive players in the business world – acting like Luddites with AI? And how can AI entrepreneurs and developers building products for the industry nail their pitches and drive home deals?

Cornell’s Climate-Conscious Urban Campus Arises

The New York Times


A new complex on Roosevelt Island aims to produce as much energy as it uses. Here’s how the designers are making it happen.

Prescription from health clinic: Play this Bellevue startup’s video game

The Seattle Times, Rachel Lerman


Patients play Litesprite’s video game, and clinics get information about which coping mechanisms seem to work best for each person. Health professionals can then decide which treatment options are best.


Bridging disciplines in analysing text as social and cultural data (#aTTACheD*)

London, England, September 21-22 at the Alan Turing Institute. This workshop aims to address the gap between research methodologies in NLP/ML and those in the humanities and the social sciences. The deadline for attendee applications is June 7.

Tools & Resources

Practical Techniques for Data Preparation



“Trifacta has released Principles of Data Wrangling: Practical Techniques for Data Preparation, the first how-to guide on data wrangling. But why should you read this book? It’s simple – because your time is as valuable as your data.” [free to download]

RISELab Announces 3 Open Source Releases

University of California-Berkeley, RISELab, Joe Hellerstein


“Part of the Berkeley tradition—and the RISELab mission—is to release open source software as part of our research agenda. Six months after launching the lab, we’re excited to announce initial v0.1 releases of three RISElab open-source systems: Clipper, Ground and Ray.”

7 Ways to Handle Large Data Files for Machine Learning

Machine Learning Mastery, Jason Brownlee


“How do I load my multiple gigabyte data file? Algorithms crash when I try to run my dataset; what should I do? Can you help me with out-of-memory errors? In this post, I want to offer some common suggestions you may want to consider.”
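A minimal sketch of one suggestion that posts like this commonly make: stream the file in fixed-size batches instead of loading it whole (pandas offers the same pattern via `read_csv(chunksize=...)`). The function and column names here are illustrative, not from the article:

```python
import csv
import io


def stream_row_count(fileobj, chunk_size=2):
    """Aggregate over a CSV in fixed-size batches.

    Instead of loading the whole file into memory (which crashes on
    multi-gigabyte data), pull `chunk_size` rows at a time, update a
    running aggregate, and discard the batch. The same loop structure
    works for sums, means, or feeding mini-batches to an online
    learning algorithm.
    """
    reader = csv.reader(fileobj)
    header = next(reader)
    total = 0
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == chunk_size:
            total += len(batch)  # replace with any per-batch computation
            batch = []
    total += len(batch)  # leftover rows smaller than one full batch
    return header, total


# io.StringIO stands in for an arbitrarily large file on disk.
data = io.StringIO("x,y\n1,2\n3,4\n5,6\n")
header, n_rows = stream_row_count(data)
print(header, n_rows)
```

Because only one batch is resident at a time, peak memory is bounded by `chunk_size` regardless of file size.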


GitHub – ropenscilabs


emldown is an R package for creating a human-readable website from EML (Ecological Metadata Language) metadata.

A Tour of PyTorch Internals (Part I)

GitHub Gist – killeent


“The fundamental unit in PyTorch is the Tensor. This post will serve as an overview for how we implement Tensors in PyTorch, such that the user can interact with it from the Python shell. In particular, we want to answer four main questions:”

1. How does PyTorch extend the Python interpreter to define a Tensor type that can be manipulated from Python code?
2. How does PyTorch wrap the C libraries that actually define the Tensor’s properties and methods?
3. How does PyTorch cwrap work to generate code for Tensor methods?
4. How does PyTorch’s build system take all of these components to compile and generate a workable application?
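The first two questions concern how a compiled core is exposed to Python. As a loose analogy only (PyTorch’s actual mechanism uses the CPython C API plus cwrap-generated bindings, not ctypes), the standard-library ctypes module shows the basic idea of wrapping a C library so its functions become callable from Python; the `libm.so.6` fallback is an assumption about the host platform:

```python
import ctypes
import ctypes.util

# Locate the C math library; the exact filename is platform-dependent,
# so fall back to the common glibc soname if lookup fails.
libm_path = ctypes.util.find_library("m") or "libm.so.6"
libm = ctypes.CDLL(libm_path)

# C functions carry no Python-visible type information, so the
# signature must be declared by hand. Conceptually, this is what
# generated wrapper code does for every method on a C-backed type
# such as a Tensor.
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

result = libm.sqrt(9.0)
print(result)
```

A framework avoids writing these declarations by hand for hundreds of functions, which is exactly why code generators like cwrap exist.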

Software simplified

Nature News & Comment, Andrew Silver


Containerization technology takes the hassle out of setting up software and can boost the reproducibility of data-driven research.


Full-time, non-tenured academic positions

Spacecraft Systems Engineer

Johns Hopkins University, Space Telescope Science Institute (STScI); Baltimore, MD

Lecturer/Senior Lecturer in Computer Science

University of St. Andrews; Fife, Scotland

Full-time positions outside academia

Data Scientist

The Football Association; St. George’s Park, England

Post-doc, Spatio-temporal Modeling

University of Washington, Department of Biostatistics; Seattle, WA
