Data Science newsletter – June 14, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for June 14, 2017

GROUP CURATION: N/A

 
 
Data Science News



US intelligence agencies are beginning to build AI spies

Quartz, Dave Gershgorn


A US intelligence director says a lot of espionage is more boring than you might think, and much of it could be handed over to artificial intelligence.

“A significant chunk of the time, I will send [my employees] to a dark room to look at TV monitors to do national security essential work,” Robert Cardillo, head of the National Geospatial-Intelligence Agency, told reporters including Foreign Policy. “But boy is it inefficient.”

Cardillo points to recent advances in artificial intelligence that give algorithms the ability to analyze vast amounts of imagery and video to find patterns, extract information about the landscape, and identify unusual objects. This kind of work is critical for assessing national security concerns like foreign missile-silo activity, or even just to check in on North Korean volleyball games.


SAP Academic Conference North America at @SAPNextGen

YouTube, SAPNextGen


SAP Academic Conference North America at @SAPNextGen at the @SAPLeonardo Center New York


Inspecting Algorithms for Bias

MIT Technology Review, Matthias Spielkamp


Courts, banks, and other institutions are using automated data analysis systems to make decisions about your life. Let’s not leave it up to the algorithm makers to decide whether they’re doing it appropriately.


NIH’s All of Us precision medicine program gets rolling

MobiHealthNews, Jonah Comstock


When former director of health innovation and policy at Intel and cancer survivor Eric Dishman took over the NIH’s ambitious program to collect a deep research dataset on a diverse population of a million Americans, it was called the NIH Precision Medicine Initiative Cohort Program. These days the program is no less ambitious, but it has a much friendlier name: the All of Us Research Program.

“Given how participant-centered we’re being, some of these folks don’t know what a cohort is … so we changed the name,” Dishman said in the opening keynote of the HIMSS Precision Medicine Summit in Boston today. “This wasn’t for the researchers, this was for the participants that we need to reach who have never been involved in biomedical research before.”


Empty rhetoric over data sharing slows science

Nature News & Comment, Editorial


Everyone agrees that there are good reasons for having open data. It speeds research, allowing others to build promptly on results. It improves replicability. It enables scientists to test whether claims in a paper truly reflect the whole data set. It helps them to find incorrect data. And it improves the attribution of credit to the data’s originators. But who will pay? And who will host?

Only rarely does a research-funding agency step up to both of these plates. Examples include NASA, the US National Institutes of Health, and the European Bioinformatics Institute. The National Natural Science Foundation of China has ambitions to host the outputs of those that it supports. The European Commission hopes to offer such platforms with its European Open Science Cloud. The UK Data Archive for the social sciences and humanities, and DANS, the Netherlands Institute for Permanent Access to Digital Research Resources, represent other good models of support from governments.

But in too many cases, government agencies lack the funds to build platforms for data sharing and resist taking responsibility for such infrastructure. They may hope that universities will host data, but the development of institutional repositories is patchy, and to rely on them is effectively to discourage common data standards and curation.


This Microsoft team volunteered their time to develop a SIDS research tool

MedCity News, Erin Dietsche


The team combed through an enormous CDC data set, which included information on 29 million births and more than 27,000 sudden infant deaths from 2004 to 2010. They loaded the information into Microsoft Azure, a cloud computing platform, to surface possible correlations between the data and SIDS, and then displayed the results in Power BI.

Through the tool, researchers can validate their existing knowledge of SIDS.

“It’s a good opportunity for researchers to be able to dive into the data and understand pretty rapidly what’s correlated and what’s not,” Kahan said.
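
As a rough illustration of the kind of correlation check such a tool supports (not the Microsoft team's actual Azure pipeline, and with invented file and column names), a pandas sketch might look like this:

```python
# Illustrative sketch only: checking which birth-record fields correlate with a
# SIDS outcome flag. The filename and column names ("mother_age", "sids_death",
# ...) are invented and do not reflect the actual CDC schema or Azure pipeline.
import pandas as pd

births = pd.read_csv("cdc_birth_cohort.csv")   # hypothetical export of the data set

# Correlation of numeric factors with the 0/1 SIDS outcome
numeric_factors = ["mother_age", "birth_weight_g", "gestation_weeks"]
correlations = births[numeric_factors].corrwith(births["sids_death"])
print(correlations.sort_values())

# For categorical factors, compare SIDS rates across groups instead
print(births.groupby("smoking_during_pregnancy")["sids_death"].mean())
```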


UMD Computerized System Beats Human Quiz Bowl Team at Atlanta Exhibition

University of Maryland, UMIACS


A computerized question-answering system built by researchers from the University of Maryland (UMD) and the University of Colorado recently bested a team of human competitors during a quiz bowl exhibition match in Atlanta, Georgia.

The artificial intelligence (AI) system, known as QANTA—which stands for “question answering is not a trivial activity”—won by a score of 260 to 215 against a volunteer team of students at the High School National Championship Tournament, which features top quiz bowl teams from across the U.S.

The exhibition match was not part of the official competition, but was conceived as a way to test recent improvements in artificial intelligence algorithms, says Jordan Boyd-Graber, one of the UMD-affiliated faculty involved in the project.


7 reasons to teach an online course

Coursera Blog, Brian Caffo


What are the benefits of creating an open online course? Can teaching online actually benefit your on-campus courses and career? Brian Caffo, professor of Biostatistics at Johns Hopkins University and one of three instructors of the popular Data Science Specialization on Coursera, recently shared some of his answers to these questions. When we asked him to elaborate, he shared 7 key benefits he’s seen from teaching 15+ courses on Coursera—from expanding his network, to sparking innovation, to winning a prestigious NIH grant.

#1: INCREASE ACCESS TO EDUCATION


NYU Tandon & NSF launch experiment to attract women & minorities to STEM entrepreneurship

NYU Tandon School of Engineering


The National Science Foundation (NSF) has awarded $500,000 to New York University’s Tandon School of Engineering to attract, instruct, and mentor student entrepreneurs — particularly women — in ways to use STEM (science, technology, engineering, and mathematics).

The first cohort of students from across NYU has already been identified, and they will spend the summer enhancing the efficiency of fuel cells for electric cars; creating a fully functional, self-supporting polymer 3D printer; exploring ways to lessen food waste; analyzing urban geography through big data; and more. More than half the student teams in the first cohort will be led by women — a departure from national STEM trends. The most recent U.S. Census data revealed that although women make up nearly half of the working population, they represent only 26 percent of STEM workers. The percentage of women in computer careers, one of the fastest-growing segments, has actually declined since the 1990s.


Undergraduates Shape Berkeley’s Digital Humanities

University of California-Berkeley, Digital Humanities


There’s a new interdisciplinary student association forming on campus… Berkeley’s first undergraduate-led group for Digital Humanities: BUDHA


Epic’s Tim Sweeney: Deep Learning A.I. Will Open New Frontiers in Game Design

Medium, Synced


Admittedly, a few barriers currently prevent the gaming industry from using heuristic algorithms for decision making. A successful industrial product needs to strike a fine balance among functionality, speed, robustness, and reliability; this applies to everything from factory controllers to SpaceX's Falcon rocket. No video game company will tolerate errors in neural network training, or the large amounts of time and computing power that training requires. Given the large opportunity cost of building and changing AI systems for games, industry players will probably be reluctant to do so for now.

However, Sweeney predicts that video game companies will ultimately adopt advanced AI techniques once VR/AR games (the so-called “Metaverse” games) become popular, thanks to advances in cameras and displays.


How artificial intelligence is revolutionizing customer management

The Next Web, Ben Dickson


Making sense of data, both structured and unstructured, is something that artificial intelligence is becoming increasingly proficient at. While we’re still at least decades away from human-level synthetic intelligence—AI that will match the human brain in reasoning and decision making—machine learning algorithms, computer vision, natural language processing and generation (NLP/NLG), and other forms of narrow artificial intelligence are proving to be the best complement for human activity.

AI-powered tools are now helping scale the efforts of sales teams by gleaning useful patterns from data, finding successful courses of action, and taking care of the bulk of the work in addressing customer needs and grievances.


Foundation supports Crops in silico Project

National Center for Supercomputing Applications at the University of Illinois


The Foundation for Food and Agriculture Research (FFAR) has awarded Principal Investigator Amy Marshall-Colón, Assistant Professor of Plant Biology at the University of Illinois at Urbana-Champaign, $274,000 to continue her research in support of Crops in silico (Cis), a project to develop a suite of virtual plant models that may help resolve a growing gap between food supply and demand in the face of global climate change.

As the planet warms, growing environments around the world are changing faster than traditional crop breeding programs can create new well-adapted varieties. Fully realized, Cis will give crop researchers a tool to examine the effects of environmental challenges on a molecular, cellular, and organ level within a plant to determine the best targets for genetic engineering.

“Science is accelerating faster than ever before, and the Foundation for Food and Agriculture Research is committed to harnessing cutting-edge science for the benefit of the agricultural system,” FFAR Executive Director Sally Rockey said. “Crops in silico will integrate some of today’s most advanced plant models, providing new and exciting insights into how a plant functions that will undoubtedly accelerate our ability to improve plants. I look forward to the results of this exciting project.”


New system allows optical “deep learning”

MIT News


“Deep learning” computer systems, based on artificial neural networks that mimic the way the brain learns from an accumulation of examples, have become a hot topic in computer science. In addition to enabling technologies such as face- and voice-recognition software, these systems could scour vast amounts of medical data to find patterns that could be useful diagnostically, or scan chemical formulas for possible new pharmaceuticals.

But the computations these systems must carry out are highly complex and demanding, even for the most powerful computers.

Now, a team of researchers at MIT and elsewhere has developed a new approach to such computations, using light instead of electricity, which they say could vastly improve the speed and efficiency of certain deep learning computations. Their results appear today in the journal Nature Photonics in a paper by MIT postdoc Yichen Shen, graduate student Nicholas Harris, professors Marin Soljačić and Dirk Englund, and eight others.
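
For context (an editorial sketch, not taken from the paper itself): the expensive core of deep learning inference is repeated matrix multiplication, which is the kind of linear algebra an optical processor could in principle carry out with light. A minimal numpy illustration of one dense layer:

```python
# Minimal sketch of why inference is expensive: each dense layer is essentially a
# matrix-vector multiply followed by a nonlinearity. This is the kind of linear
# algebra an optical processor could, in principle, perform with light.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))   # layer weights
x = rng.standard_normal(1024)           # input activations

def dense_layer(W, x):
    return np.maximum(W @ x, 0.0)       # matrix multiply + ReLU

y = dense_layer(W, x)                   # ~1M multiply-accumulates for a single layer
print(y.shape)
```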


Robot Uses Deep Learning and Big Data to Write and Play its Own Music

Georgia Tech News Center


A marimba-playing robot with four arms and eight sticks is writing and playing its own compositions in a lab at the Georgia Institute of Technology. The pieces are generated using artificial intelligence and deep learning.

Researchers fed the robot nearly 5,000 complete songs — from Beethoven to the Beatles to Lady Gaga to Miles Davis — and more than 2 million motifs, riffs and licks of music. Aside from giving the machine a seed, or the first four measures to use as a starting point, no humans are involved in either the composition or the performance of the music.
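
To make the “seed, then generate” idea concrete, here is a toy sketch (not the Georgia Tech deep learning model, just a next-note sampler built from bigram counts) showing the generation loop:

```python
# Toy illustration of the "give the machine a seed, let it generate the rest" idea.
# This is NOT the Georgia Tech deep learning model; it is a tiny next-note sampler
# built from bigram counts, standing in for a learned model.
import numpy as np
from collections import defaultdict

corpus = [60, 62, 64, 65, 67, 65, 64, 62, 60, 62, 64, 62, 60]  # toy MIDI pitches

# Count next-note frequencies for each note
counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1

def generate(seed, length, rng=np.random.default_rng(0)):
    out = list(seed)
    for _ in range(length):
        options = counts.get(out[-1])
        if not options:
            break
        notes, freqs = zip(*options.items())
        probs = np.array(freqs, dtype=float) / sum(freqs)
        out.append(int(rng.choice(notes, p=probs)))
    return out

print(generate(seed=[60, 62], length=16))   # seed motif, then sampled continuation
```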


First grant ever for NumPy, thanks to Moore

NumFOCUS


For the first time ever, NumPy—a core project for the Python scientific computing stack—has received grant funding. The proposal, “Improving NumPy for Better Data Science,” will receive $645,020 from the Moore Foundation over two years, with the funding going to the UC Berkeley Institute for Data Science. The principal investigator is Dr. Nathaniel Smith. NumFOCUS congratulates Nathaniel and all of the NumPy contributors for achieving this important milestone!


How to clean inside the LHC

symmetry magazine, Sarah Charley


The insides of the beam pipes need to be spotless, which is why the LHC is thoroughly cleaned every year before it ramps up its summer operations program.

It’s not dirt or grime that clogs the LHC. Rather, it’s microscopic air molecules.

“The LHC is incredibly cold and under a strong vacuum, but it’s not a perfect vacuum,” says LHC accelerator physicist Giovanni Rumolo.


From post-it note to prototype: The journey of our Firefly

Medium, Waymo Team


From the beginning, Firefly was intended as a platform to experiment and learn, not for mass production. By designing and building a truly self-driving vehicle from scratch, we were able to crack some of the earliest self-driving puzzles — where to place the sensors, how to integrate the computer, what controls passengers need in a car that drives itself. In answering these questions, Firefly defined some of our most recognizable features, like the dome on top of every Waymo car (by putting the LiDAR and cameras in a central spot, our sensors can see further and our computer can process data more efficiently).


How to keep tabs on Atlantic hurricanes – An array of sensors stretches from space to the deep ocean

The Economist


America’s suite of hurricane sensors has grown since 1961. The current Atlantic hurricane season, which began on June 1st, sees the country running a stack of instruments that reach from orbit to a kilometre beneath the ocean. TIROS-3’s successors keep a constant watch on storms’ tracks and sizes. Gulfstream jets fly over and around storms, dropping sensors into them to measure wind speeds. Propeller-driven planes fly right into storms, measuring their properties with radar and its modern, laser-based cousin, lidar. Unmanned drones fly in even deeper. And floats, buoys and aquatic drones survey storms from below.

All of the data these machines gather are transmitted directly to computer models which are used to forecast two things. The first is what track a hurricane will follow, and thus whether, where and when it will make landfall. The second is how much energy it will dump on North America if it does indeed cross the coast—a value known as its intensity.


Catalyst for Collaborative Solutions funds first three interdisciplinary teams

Stanford News


This week, the Catalyst announced its first results, which include a three-year, $3 million grant to a team led by materials scientist Jennifer Dionne. They aim to develop a technology to diagnose bacterial infections within hours rather than days or weeks. To turn this breakthrough technology into a frugal medical product, team members with business, economics and public health experience will work to make sure this quick diagnostic tool is affordable in developing nations.

The Catalyst also awarded $450,000 in one-time, proof-of-principle funding to two teams. Dabiri said this will give these cross-campus teams, which came together on a relatively compressed timeline in this inaugural funding round, an opportunity to further refine their approach.


Powering Twitch and Medium, search startup Algolia raises $53 million

VentureBeat, Bérénice Magistretti


Matching the power of web search engines like Google on your website is no easy task. But Algolia is trying to tackle this problem by providing businesses with the infrastructure, engine, and tools needed to create intuitive searches for their customers. With a fresh round of $53 million, led by Accel, the San Francisco-based startup is planning to further develop its product and expand internationally. A mix of new and existing investors joined the round, including Alven Capital, Point Nine Capital, Storm Ventures, and Jason Lemkin’s SaaStr fund.

After signing up on Algolia’s website, businesses push the data they want to search, configure the relevance of results, and implement the search user interface (UI) they want for their customers. “We also offer ready-to-use integrations for ecommerce platforms to make the process even easier,” wrote Algolia cofounder and CEO Nicolas Dessaigne in an email to VentureBeat. Additionally, businesses can integrate their own analytics tools to determine what their customers are searching for on the website.
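
The push-then-search flow looks roughly like the sketch below, assuming the algoliasearch Python client's v2-style API (SearchClient.create, init_index, save_objects, set_settings, search); method names vary across client versions, and the credentials, index name, and records here are placeholders:

```python
# Sketch of the "push data, then search it" flow described above, assuming the
# algoliasearch Python client's v2-style API; names differ across client versions,
# and the credentials, index name, and records are placeholders.
from algoliasearch.search_client import SearchClient

client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")
index = client.init_index("products")          # hypothetical index name

# Push the records you want to be searchable
index.save_objects([
    {"objectID": "1", "name": "Blue running shoes", "price": 79},
    {"objectID": "2", "name": "Red trail shoes", "price": 99},
])

# Configure which attributes matter for relevance
index.set_settings({"searchableAttributes": ["name"]})

# Query as a user would from the search UI
results = index.search("running")
print([hit["name"] for hit in results["hits"]])
```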

 
Events



ICML 2017 – Workshops

Thirty-fourth International Conference on Machine Learning


Sydney, Australia, August 6-11 [$$$$]

 
Deadlines



Insight Data Science Fellows Program

An intensive 7-week post-doctoral training fellowship bridging the gap between academia and data science. Programs in Boston, New York, and Silicon Valley. The deadline for the program starting in September is Monday, June 26.
 
Tools & Resources



[1706.02777] Summary Analysis of the 2017 GitHub Open Source Survey

arXiv, Computer Science > Computers and Society; R. Stuart Geiger


This report is a high-level summary analysis of the 2017 GitHub Open Source Survey dataset, presenting frequency counts, proportions, and frequency or proportion bar plots for every question asked in the survey.
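
A minimal pandas sketch of the kind of per-question summary the report describes (the filename and question column are placeholders, not the dataset's actual schema):

```python
# Sketch of the kind of per-question summary the report describes: frequency
# counts, proportions, and a bar plot. Assumes the public survey CSV with one
# column per question; the filename and column name here are placeholders.
import pandas as pd
import matplotlib.pyplot as plt

survey = pd.read_csv("survey_data.csv")

question = "OSS_IDENTIFICATION"              # placeholder question column
counts = survey[question].value_counts(dropna=False)
proportions = counts / counts.sum()

print(pd.DataFrame({"count": counts, "proportion": proportions.round(3)}))

proportions.plot.bar()
plt.ylabel("Proportion of respondents")
plt.title(question)
plt.tight_layout()
plt.show()
```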


A global dataset of crowdsourced land cover and land use reference data

Scientific Data; Steffen Fritz et al


Global land cover is an essential climate variable and a key biophysical driver for earth system models. While remote sensing technology, particularly satellites, has played a key role in providing land cover datasets, large discrepancies have been noted among the available products. Global land use is typically more difficult to map and in many cases cannot be remotely sensed. In-situ or ground-based data and high-resolution imagery are thus an important requirement for producing accurate land cover and land use datasets, and this is precisely what is lacking. Here we describe the global land cover and land use reference data derived from the Geo-Wiki crowdsourcing platform via four campaigns. These global datasets provide information on human impact, land cover disagreement, wilderness, and land cover and land use. Hence, they are relevant for the scientific community that requires reference data for global satellite-derived products, as well as those interested in monitoring global terrestrial ecosystems in general.


Deep Learning Toolkit (DLTK) for Medical Imaging

Martin Rajchl


DLTK is a neural network toolkit written in Python on top of TensorFlow. Its modular architecture is closely inspired by Sonnet, and it was developed to enable fast prototyping and ensure reproducibility in image analysis applications, with a particular focus on medical imaging. Its goal is to provide the community with state-of-the-art methods and models and to accelerate research in this exciting field.
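
This is not DLTK's own API, but a minimal TensorFlow/Keras sketch of the kind of small 3D convolutional network used on volumetric medical images, to indicate the problem domain the toolkit targets:

```python
# Not DLTK's actual API: a minimal TensorFlow/Keras sketch of a tiny 3D conv net
# for volumetric medical image patches (e.g. MRI), just to show the problem domain.
import tensorflow as tf

def tiny_3d_classifier(input_shape=(32, 32, 32, 1), num_classes=2):
    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Conv3D(16, kernel_size=3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling3D(pool_size=2),
        tf.keras.layers.Conv3D(32, kernel_size=3, padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling3D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = tiny_3d_classifier()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```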


Learning from Human Preferences

OpenAI; Dario Amodei, Paul Christiano & Alex Ray.


One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior. In collaboration with DeepMind’s safety team, we’ve developed an algorithm which can infer what humans want by being told which of two proposed behaviors is better.
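
The core idea is to fit a reward model so that the probability one behavior is preferred over another is a logistic function of their reward difference. A toy numpy sketch of that fitting step (a linear reward on synthetic features, not the OpenAI/DeepMind implementation):

```python
# Minimal sketch of the core idea: fit a reward model so that
# P(A preferred over B) = sigmoid(r(A) - r(B)), training on pairwise labels.
# A linear reward on toy features stands in for the actual neural network.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])                 # hidden "human" preference

# Synthetic dataset: pairs of behavior feature vectors and which one was preferred
A = rng.standard_normal((500, 3))
B = rng.standard_normal((500, 3))
prefer_A = (A @ true_w > B @ true_w).astype(float)  # 1 if A judged better

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(3)
lr = 0.1
for _ in range(200):                                # gradient descent on logistic loss
    p = sigmoid((A - B) @ w)                        # predicted P(A preferred)
    grad = (A - B).T @ (p - prefer_A) / len(A)
    w -= lr * grad

print("learned reward weights:", np.round(w, 2))    # should point in true_w's direction
```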


Automating low-level tasks for data scientists

IBM Research blog


A team of IBM researchers in Ireland has completed the first phase of a project that helps automate the feature engineering step at the push of a button. Called the “One Button Machine” project, it computes aggregate features that can be used as input for machine learning models.

The team has successfully applied the One Button Machine in various data science competitions, where it outperformed most human teams and ranked among the top 16-24 percent of participants. In a client project with a social service provider from the U.S., it helped improve the accuracy of a complex classification task (involving a database with more than 20 tables) from 57 percent to 64 percent. The One Button Machine produced these results within a few hours of effort, whereas manually engineering the features would have taken days or even weeks to reach the same level of accuracy.
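
As a rough illustration of the kind of aggregate feature this produces (invented tables, not IBM's code), statistics from a child table are summarized per entity and joined onto the main table:

```python
# Illustration of aggregate feature engineering across related tables: summarize a
# child table (transactions) into per-client features joined onto the main table.
# Table and column names are invented; this is not IBM's One Button Machine code.
import pandas as pd

clients = pd.DataFrame({"client_id": [1, 2, 3]})
transactions = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 2, 3],
    "amount":    [10.0, 25.0, 5.0, 7.5, 3.0, 100.0],
})

# Aggregate the child table per client: count, mean, max, sum of transactions
agg = (transactions.groupby("client_id")["amount"]
       .agg(["count", "mean", "max", "sum"])
       .add_prefix("txn_amount_")
       .reset_index())

features = clients.merge(agg, on="client_id", how="left")
print(features)
```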


Can you help me gather open speech data?

Pete Warden's blog


I miss having a dog, and I’d love to have a robot substitute! My friend Lukas built a $100 Raspberry Pi robot using TensorFlow to wander the house and recognize objects, and with the person detection model it can even follow me around. I want to be able to talk to my robot though, and at least have it understand simple words. To do that, I need to write a simple speech recognition example for TensorFlow.
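
A hedged sketch of the first preprocessing step such a tutorial would need, turning a short labelled WAV clip into a log-spectrogram a small classifier can consume (the filename is a placeholder, and this is not the eventual TensorFlow example code):

```python
# Sketch of the first step a simple keyword-recognition example needs: turn a
# short labelled WAV clip into a log-spectrogram "image" a small classifier can
# consume. The filename is a placeholder; assumes a mono clip.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, samples = wavfile.read("yes_0001.wav")          # e.g. one utterance of "yes"
freqs, times, spec = spectrogram(samples.astype(float), fs=rate,
                                 nperseg=256, noverlap=128)
log_spec = np.log(spec + 1e-10)                        # compress dynamic range

print(log_spec.shape)   # (frequency bins, time frames) -> input for a small conv net
```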

As I looked into it, one of the biggest barriers was the lack of suitable open data sets. I need something with thousands of labelled utterances of a small set of words, from a lot of different speakers. TIDIGITS is a pretty good start, but it’s a bit small, a bit too clean, and more importantly you have to pay to download it, so it’s not great for an open source tutorial. I like https://github.com/Jakobovski/free-spoken-digit-dataset, but it’s still small and only includes digits. LibriSpeech is large enough, but isn’t broken down into individual words, just sentences.

To solve this, I need your help! I’ve put together a website at https://open-speech-commands.appspot.com/ that asks you to speak about 100 words into the microphone, records the results, and then lets you submit the clips.
