Data Science newsletter – July 11, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for July 11, 2018

Data Science News

Cory Doctorow: Zuck’s Empire of Oily Rags

Locus Online, Cory Doctorow

from July 02, 2018

It’s great that the privacy-matters message is finally reaching a wider audience, and it’s exciting to think that we’re approaching a tipping point for indifference to privacy and surveillance.

But while the acknowledgment of the problem of Big Tech is most welcome, I am worried that the diagnosis is wrong.

The problem is that we’re confusing automated persuasion with automated targeting. Laughable lies about Brexit, Mexican rapists, and creeping Sharia law didn’t convince otherwise sensible people that up was down and the sky was green.

Extra Extra

The PGA is installing high-resolution cameras on high profile putting greens to capture players and balls in motion. It’s clearly still early days because the data scientists they’ve hired at Microsoft are hoping “shot data” plus “artificial intelligence” will “find the most relevant, most interesting stats that are contextual.” In other words, big data + algorithm = insight magic. Sure. Right. We’ll check back in three years.

Disney engineers clearly had fun cooking up fearless stunt-double robots.

Avocado

Jacob Schreiber, Timothy Durham, Jeffrey Bilmes, and William Noble

from July 09, 2018

The human epigenome has been experimentally characterized by measurements of protein binding, chromatin accessibility, methylation, and histone modification in hundreds of cell types. The result is a huge compendium of data, consisting of thousands of measurements for every basepair in the human genome. These data are difficult to make sense of, not only for humans, but also for computational methods that aim to detect genes and other functional elements, predict gene expression, characterize polymorphisms, etc. To address this challenge, we propose a deep neural network tensor factorization method, Avocado, that compresses epigenomic data into a dense, information-rich representation of the human genome. We use data from the Roadmap Epigenomics Consortium to demonstrate that this learned representation of the genome is broadly useful: first, by imputing epigenomic data more accurately than previous methods, and second, by showing that machine learning models that exploit this representation outperform those trained directly on epigenomic data on a variety of genomics tasks. These tasks include predicting gene expression, promoter-enhancer interactions, and elements of 3D chromatin architecture. Our findings suggest the broad utility of Avocado’s learned latent representation for computational genomics and epigenomics.
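For readers wondering what "deep neural network tensor factorization" might look like in code, here is a rough sketch of the prediction step only. The abstract does not spell out the architecture, so everything below is an assumption for illustration: one learned vector per cell type, per assay, and per genomic position, concatenated and passed through a small one-hidden-layer network. The class name `TinyFactorizer`, the dimensions, and the random untrained weights are all hypothetical, and training is omitted.

```python
import random


def embeddings(count, dim, rng):
    """Create `count` small random vectors of length `dim`."""
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(count)]


class TinyFactorizer:
    """Hypothetical toy model: latent factors feed a small neural network.

    This is NOT the authors' code. Weights are random and untrained.
    """

    def __init__(self, n_cells, n_assays, n_positions, dim=4, hidden=8, seed=0):
        rng = random.Random(seed)
        # One latent vector per index along each axis of the data tensor.
        self.cell = embeddings(n_cells, dim, rng)
        self.assay = embeddings(n_assays, dim, rng)
        self.position = embeddings(n_positions, dim, rng)
        # One hidden layer that mixes the three concatenated factors.
        self.w1 = embeddings(hidden, 3 * dim, rng)
        self.b1 = [0.0] * hidden
        self.w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden)]
        self.b2 = 0.0

    def predict(self, cell, assay, position):
        """Predict one tensor entry from its three indices."""
        # Concatenate the three latent vectors into a single input.
        x = self.cell[cell] + self.assay[assay] + self.position[position]
        # ReLU hidden layer followed by a linear output.
        hidden = [
            max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2


if __name__ == "__main__":
    model = TinyFactorizer(n_cells=3, n_assays=2, n_positions=10)
    print(model.predict(cell=0, assay=1, position=5))
```

In a setup like this, the per-position vectors in `model.position` would play the role of the compressed "representation of the genome" once the model had been fit to observed measurements.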

NLP’s ImageNet moment has arrived

The Gradient, Sebastian Ruder

from July 08, 2018

Big changes are underway in the world of Natural Language Processing (NLP).

The long reign of word vectors as NLP’s core representation technique has seen an exciting new line of challengers emerge: ELMo, ULMFiT, and the OpenAI transformer. These works made headlines by demonstrating that pretrained language models can be used to achieve state-of-the-art results on a wide range of NLP tasks. Such methods herald a watershed moment: they may have the same wide-ranging impact on NLP as pretrained ImageNet models had on computer vision.

How artificial intelligence is helping Pearson refocus assessment technology

Edscoop, Ryan Johnston

from July 06, 2018

The company’s new head of AI and personalized learning sees an opportunity to create enhanced ways of evaluating students’ work.

The canary in the city: indicator groups as predictors of local rent increases

EPJ Data Science; Aike A. Steentoft, Ate Poorthuis, Bu-Sung Lee and Markus Schläpfer

from July 06, 2018

As cities grow, certain neighborhoods experience a particularly high demand for housing, resulting in escalating rents. Despite far-reaching socioeconomic consequences, it remains difficult to predict when and where urban neighborhoods will face such changes. To tackle this challenge, we adapt the concept of ‘bioindicators’, borrowed from ecology, to the urban context. The objective is to use an ‘indicator group’ of people to assess the quality of a complex environment and its changes over time. Specifically, we analyze 92 million geolocated Twitter records across five US cities, allowing us to derive socio-economic user profiles based on individual movement patterns. As a proof-of-concept, we define users with a ‘high-income-profile’ as an indicator group and show that their visitation patterns are a suitable indicator for expected future rent increases in different neighborhoods. The concept of indicator groups highlights the potential of closely monitoring only a specific subset of the population, rather than the population as a whole. If the indicator group is defined appropriately for the phenomenon of interest, this approach can yield early predictions while simultaneously reducing the amount of data that needs to be collected and analyzed. [full text]

Watch Disney’s acrobatic robots – Video

CNET

from July 09, 2018

Disney’s Stuntronics robots can perform high-flying maneuvers on their own, and may be coming to a theme park near you.

University Data Science News

Correction from last week: Kepler space telescope data has been used to identify about 24 HUNDRED new planets, not 24 new planets. Those orders of magnitude matter.

A team led by Jacob Schreiber at the University of Washington published a pre-print demonstrating how they used tensor factorization and deep neural networks to “reduce redundancy, noise, and bias, so that variance in the representation corresponds to meaningful biological differences” in the human epigenome. They call their model “Avocado.” (Because avocados are hot right now? It’s unclear why they call it Avocado.) The model modestly outperforms existing strategies in 36 out of 47 cell types and offers significant efficiency benefits over them. I like this paper because it’s a great example of data science at work within a domain science, in this case genetics/biology.

Sebastian Ruder, a PhD student in Ireland, thinks that the combination of ELMo, ULMFiT, and the OpenAI transformer will usher in a new era for natural language processing (NLP) the same way ImageNet accelerated computer vision research. He argues that instead of starting with pre-trained word embeddings, researchers will increasingly start with whole pre-trained language models. People who study NLP more closely, what are your thoughts? Are we at a major acceleration moment in NLP?

The University of Connecticut is milking cows with sensor-laden robots that help monitor their health, yield, and happiness. Just kidding. The “happy cow” thing is just bovine marketing. Real scientists care about healthy cows who have the agency to decide when to be milked. Sensors, actually a “cow-CPS” (cyber-physical system), follow cows throughout the rest of their day to complete the “precision dairying” data ecosystem. AgTech is so damn cool.

University of Chicago Booth School of Business researchers Marianne Bertrand and Emir Kamenica found that a person’s “race, education, gender and income bracket” can be predicted with at least 90% accuracy by their purchases. Gender and income are easy – men don’t buy mascara or tampons; wealthier people simply buy more. But race also turned out to be surprisingly easy: owning pets and/or flashlights are the strongest predictors of whiteness. I own neither and had no idea that makes me a white outlier.

Urban studies researchers in Singapore published a proof-of-concept that adapted the concept of ‘bioindicators’ from ecology to see if they could identify ‘indicator groups’ in urban environments to predict which neighborhoods would hit gentrification tipping points and see rents increase quickly. They used Twitter data and found that visits from wealthy to moderately priced neighborhoods were a suitable indicator for rapid increases in rent in US cities.

John Tech, a medical intern, reviewed a bunch of AI for radiology studies and found that the Conv Nets were often “learning” from features in the images that are strongly predictive, but have little to do with the disease presentation. They were learning differences based on machines – portable machines yield differences in images that are true indicators of sicker people. Portable machines exist to be hauled to the sickest patients who are too sick to come to static machines. We’ve seen this before in radiology studies. Always review predictive features for reasonable suitability.

Zachary Lipton and Jacob Steinhardt put together a thoughtful, reflective paper on some of the less flattering aspects of machine learning as a field. They gave this talk at ICML and took pains to note that they are at times guilty of bad habits such as using complicated math to obfuscate, confusing explanation with speculation, and generally using language to conceal rather than reveal potential weaknesses or gaps in knowledge. Senior scholars, I call upon you to read carefully and propagate widely.

The University of California system shifted its admissions priorities to admit more in-state transfer students from community colleges. This will not impact the number of in-state freshmen admits, but it does signal a slight move away from increasing the number of out-of-state students who pay higher tuition. There is still not enough capacity in the UC system to educate the state’s growing population of talented young people.

A Swiss research team at Ecole Polytechnique Federale de Lausanne is using data science to design a bike that can go faster than the current land speed record of 83 mph. Europeans are serious about their bikes.

David Rosenberg, an adjunct professor at NYU’s Center for Data Science and a data scientist at Bloomberg, is now offering a free online machine learning course. He’s the real deal, readers. You’ll get more than what you pay for, but do note that he expects a strong math background.

Neighborhood Synergy

University of Michigan, Michigan Research

from July 06, 2018

[Yvonne] Lewis is among a cohort of Flint leaders who have bridged the gap between community needs and academic research. The result is the Healthy Flint Research Coordinating Center (HFRCC).

Launched in 2016 with support from the University of Michigan (led by founding co-directors Rebecca Cunningham and Marc Zimmerman), UM-Flint and Michigan State University, the center helps address, through coordinated research efforts, Flint’s current and future health challenges.

A lack of coordination was evident in 2015 when local physicians urged Flint residents to avoid their tap water after discovering elevated blood lead levels in children.

“It was like a double whammy because, first, we get hit with this water crisis, and then right after, a flood of people from across the country arrived here to do research,” said Ella Greene-Moton, who has lived in Flint for 50 years and worked with Lewis and others to establish HFRCC. “Many of them came to ‘poor little old Flint’ not realizing we had been a close partner in academic research for years. It was almost like we didn’t have a voice.”

Laser-Shooting Planes Uncover the Horror and Humanity of World War I

WIRED, Transportation, Nick Stockton

from July 09, 2018

In the west of Belgium, near the French border, the A19 motorway ends in a four-lane, unfinished overpass. There’s no mountain here, no ocean, no city center. Nothing to explain why the heavy machinery stopped paving through the farms, and the traffic gets diverted to surface streets.

What stopped the Belgian government from paving over this landscape in the early 2000s was the insight that this land contained evidence that might reveal what it was like to live through one of humanity’s greatest horrors. During World War I, this stretch of pastoral landscape, which the generals (and now historians) called the Ypres Salient, was one of the most heavily trenched, mined, mortared, bombed, gassed, pillaged, burned, and bullet-riddled places along the Western Front.

For the archaeologists charged with recovering this landscape’s memories, digging into the past with a vast shovel-and-pickaxe party was out of the question. Not only is the Ypres Salient huge, its scars are so dense they practically form a contiguous strata in the soil. “And, this is an area where people live and plow,” says Birger Stichelbaut, an archaeologist at both Ghent University and the In Flanders Fields Museum. “Our goal was not to turn it into a World War I Disneyland.” They needed non-invasive ways to survey the landscape, identify important sites from the war, and plan for the best way to preserve or protect the artifacts therein.

Zero One – Generative video for Zero One Technology Festival

Creative Applications Network, Filip Visnjic

from July 06, 2018

Created by Raven Kwok in collaboration with L.A. based producer / music technologist Mike Gao, Zero One is a code-based generative video commissioned by Zero One Technology Festival 2018 in Shenzhen, PR China.

This project consists of multiple interlinked generative systems, each of which has its customized features, but collectively share the core concept of an evolving elementary cellular automaton. The entire video is programmed and generated using Processing with minor edits in Premiere during composition.
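If "elementary cellular automaton" is unfamiliar, the idea fits in a few lines. The sketch below is not Raven Kwok's Processing code; it is a minimal Python illustration, assuming the standard Wolfram rule numbering, a wrap-around edge, and a single live cell as the starting state. The function names `step` and `run` are made up for this example.

```python
def step(cells, rule):
    """Advance a row of 0/1 cells by one generation.

    `rule` is a Wolfram rule number (0..255). Each cell's next state is the
    bit of `rule` selected by its (left, center, right) neighborhood.
    Edges wrap around, which is an assumption of this sketch.
    """
    n = len(cells)
    next_cells = []
    for i in range(n):
        left, center, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        pattern = (left << 2) | (center << 1) | right  # value 0..7
        next_cells.append((rule >> pattern) & 1)
    return next_cells


def run(width, generations, rule):
    """Return the full history, starting from one live cell in the middle."""
    cells = [0] * width
    cells[width // 2] = 1
    history = [cells]
    for _ in range(generations):
        cells = step(cells, rule)
        history.append(cells)
    return history


if __name__ == "__main__":
    # Print each generation as a line of text.
    for row in run(width=31, generations=15, rule=30):
        print("".join("#" if c else "." for c in row))
```

Changing the `rule` argument (try 90 or 110) gives very different patterns from the same starting row, which is presumably part of what makes such a simple system a good seed for generative visuals.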

Events

The Explorables Jam! – explain an idea through play

Maarten Lambrechts

from July 28, 2018

What is an explorable explanation?

It’s our way-too-long phrase for a thing that explains an idea through play. You can say “explorable” for short. Here’s some examples of explorables made in 3 weeks or less, so you can get an idea of what you can make during this jam!

Deadlines

Sage Digital Health Catalyst Program

“The often costly nature of leveraging emerging technologies creates a high barrier to entry for researchers wanting to effectively deploy digital health technologies at scale. The Digital Health (DH) Catalyst Program aims to address this by providing pro bono consulting and in-kind infrastructure to support innovative ideas for biomedical research studies that leverage digital health technologies to answer a pressing scientific question.” Deadline for applications is July 31.

Call for Proposals: Media, Technology and Democracy in Historical Context

“To encourage historically informed research on the impact of recent technological changes on both media and democracy, the Media & Democracy program at the Social Science Research Council is proud to announce an open call for papers for a research workshop to be held in New York City on December 13–14, 2018.” Deadline for application materials is August 10.

NetSci-X 2019

Santiago, Chile, January 3-5. Deadline for submissions is October 1.

Tools & Resources

DING magazine

Mozilla

from July 06, 2018

Welcome to DING, a magazine about the internet and things. We founded this magazine to anthologize the sprawling online conversations and provide a place of reflection for people interested in crafting technology in more responsible ways. It is our place of refuge to discuss internet health, emerging technologies and visions for the future. [best viewed in Firefox]

Essential Chrome Developer Tools: Beginner to Master

amanpreet singh

from March 25, 2018

With the advent of modern frameworks, ES6 and the increasing risk of security every day, knowing how to use Chrome developer tools can give you a major boost in productivity and help in easy diagnosis of a website’s performance. This article aims to give an overview of different features available in Chrome developer tools and their usage. Most of the content of the article can be extrapolated to Mozilla Firefox and Microsoft Edge developer tools. We will divide the article in sections on the basis of various panels available in Chrome developer tools of Google Chrome version 62.

Design Patterns for Production NLP Systems

Delip Rao

from July 09, 2018

Production NLP systems can be complex. When building an NLP system, it is important to remember that the system you are building is solving a task and is simply a means to that end. During system building, the engineers, researchers, designers, and product managers have several choices to make. While our book has focused mostly on techniques or foundational building blocks, putting those building blocks together to come up with complex structures to suit your needs will require some pattern thinking. Pattern thinking and a language to describe the patterns is a “method of describing good design practices or patterns of useful organization within a field of expertise”. This is popular in many disciplines (Alexander, 1979), including software engineering. In this section, we will describe a few common design and deployment patterns of production NLP systems. These are choices or tradeoffs teams often have to make to align the product development with technical, business, strategic, and operational goals. We examine these design choices under six axes …

Glow: Better Reversible Generative Models

OpenAI, Prafulla Dhariwal & Durk Kingma

from July 09, 2018

OpenAI introduces Glow, “a reversible generative model which uses invertible 1×1 convolutions. It extends previous work on reversible generative models and simplifies the architecture. Our model can generate realistic high resolution images, supports efficient sampling, and discovers features that can be used to manipulate attributes of data.”
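To make "invertible 1×1 convolutions" concrete: a 1×1 convolution just multiplies the channel vector at every pixel by the same square matrix, so it can be undone by multiplying with that matrix's inverse. The sketch below is not OpenAI's implementation; it is a plain-Python toy restricted to two channels so the inverse can be written out by hand, and the function names are hypothetical. The log-determinant term, height × width × log|det W|, is what a flow-based model like this adds to its likelihood for such a layer.

```python
import math


def conv1x1(image, weight):
    """Mix the channels of every pixel with the same matrix `weight`.

    image: nested lists shaped height x width x channels.
    weight: nested lists shaped channels x channels.
    """
    return [
        [[sum(w * x for w, x in zip(row, pixel)) for row in weight]
         for pixel in line]
        for line in image
    ]


def invert2x2(m):
    """Closed-form inverse of a 2x2 matrix (this toy handles 2 channels only)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("weight matrix is singular, so the layer is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]


def log_det(image, weight):
    """Log-determinant of the layer's Jacobian: height * width * log|det(weight)|."""
    (a, b), (c, d) = weight
    height, width = len(image), len(image[0])
    return height * width * math.log(abs(a * d - b * c))


if __name__ == "__main__":
    image = [[[3, 5], [1, 0]], [[-2, 4], [0.5, 2]]]  # 2 x 2 pixels, 2 channels
    weight = [[2, 1], [1, 1]]
    mixed = conv1x1(image, weight)
    restored = conv1x1(mixed, invert2x2(weight))
    print(mixed)
    print(restored)
```

Running the forward pass and then the same function with the inverted matrix recovers the original pixels, which is the "reversible" property the excerpt refers to.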

Careers

Full-time, non-tenured academic positions

Research Associate in Code Generation for Finite Element Simulation

Imperial College London; London, England

Research Associate (Fixed Term)

University of Cambridge, Cancer Research UK Cambridge Institute; Cambridge, England
