Data Science newsletter – July 26, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for July 26, 2017

GROUP CURATION: N/A

 
 
Data Science News



Data Breaches Happening at Record Pace, Report Finds

NBC News, Herb Weisbaum



The number of data breaches in the U.S. jumped 29 percent in the first half of this year, hitting a record high of 791, according to a new report from the Identity Theft Resource Center and CyberScout, the data risk management company.

“Frankly, I was surprised at how significantly the number of breaches has grown,” said Eva Velasquez, ITRC’s president and CEO. “We knew this was a trend, we knew that the thieves would continue to find this lucrative, but the sheer volume of growth has been really surprising.”


Program readies analytics experts for big business

Business in Vancouver, Tyler Nyquvest



The University of British Columbia (UBC) has developed a new program to help graduates tap into the growing market for analytics talent in the new age of data science.

“By tradition, our focus was more narrow; now it is becoming more broad,” said Harish Krishnan, head of the graduate business analytics program at UBC’s Sauder School of Business. “Before, it might be more focused on operational problems like capacity planning, process improvement or patient flow through a hospital.”

“The new program recognizes that while these things are important, there are other areas within various business disciplines that have a strong demand for analytical talent,” Krishnan said.


DOE User Facilities Join Forces to Tackle Biology’s Big Data

US Department of Energy, Joint Genome Institute



Six proposals have been selected to participate in a new partnership between two U.S. Department of Energy (DOE) user facilities through the “Facilities Integrating Collaborations for User Science” (FICUS) initiative. The expertise and capabilities available at the DOE Joint Genome Institute (JGI) and the National Energy Research Scientific Computing Center (NERSC) – both at the Lawrence Berkeley National Laboratory (Berkeley Lab) – will help researchers explore the wealth of genomic and metagenomic data generated worldwide through access to supercomputing resources and computational science experts to accelerate discoveries.

“As we bring researchers into the FICUS program, we are introducing a new user community to the power of supercomputers. Scientists will use whatever tools are readily available to investigate a hypothesis and to date, only a small set of biological tools have needed a supercomputer, but this is changing quickly,” says Kjiersten Fagnan, who serves a dual role as the DOE JGI’s Chief Informatics Officer and NERSC’s Data Science Engagement Group Lead.


Government Data Science News

The City of Syracuse, New York, just released DataCuse, a new open data portal powered by Esri. In my experience, one of the most useful aspects of city-level open data is when a city has an API for real-time transit data, which Syracuse does.

The European Commission wants to establish an ‘open science cloud’, but has discovered there won’t be enough data scientists and computational domain scientists to run it as envisioned. The Commission called for universities to offer more master’s degrees in data science and adjacent specializations. Buyer beware: master’s degree quality varies considerably. Hirer beware: competition for talent in this field leads to very high salaries.

The National Science Foundation surveyed 704 PIs on their biology grants and found that “the most important unmet needs were…training in data integration and data management”. Improvements in the data pipeline significantly improve the technical ability for researchers to make their projects reproducible, though there may be additional technical reasons as well as culture-and-climate impediments to fully reproducible science. Publishing in ReScience, a journal that lives on GitHub, is an intriguing option for computer scientists. (Not yet available in other disciplines.)

Thanks to data science, the fact that NASA’s TESS satellite has out-of-focus cameras is not going to be a major problem.

The large, powerful, contentious Thirty Meter Telescope project that may someday sit atop Mauna Kea in Hawaii cleared one of many hurdles this week. Native Hawaiians continue to oppose building on the Mauna Kea site.

Seattle hired Kate Garman to be its first smart city coordinator. Meanwhile, Tampa Bay may be washed off the map if it is hit by a category 3 hurricane. City officials are uninterested in taking meaningful action, though they truly appreciate the gorgeous beaches. What’s my point? Civic tech and OpenGov – which are having huge impacts at the local level – are empowering some cities to slingshot themselves into a much different future than others.

Washington DC, long an overlooked tech stronghold (especially if we include northern Virginia), recently saw its city government open The Lab @ DC to catalyze civic data science there.

Pittsburgh has become a certified tech hub, which makes sense given how dominant Carnegie Mellon University is in AI and robotics. I recently visited the city and discovered it offers cheap rides on a funicular. How delightful!

500 cities will soon have access to a City Health Dashboard to monitor social, economic, clinical, and environmental factors like primary care coverage, unemployment, housing affordability, opioid deaths, and smoking rates. The project was funded by the Robert Wood Johnson Foundation and developed in a collaboration between Marc Gourevitch’s team at NYU Langone Medical Center and the National Resource Network.


Tech, data center firms increase US lobbying spend

DatacenterDynamics, Sebastian Moss



Over the past three months, lobbying efforts by technology firms to influence US government policy have intensified.

The period from April 1 to June 30 saw record spending by giants like Amazon, Google and Microsoft on a myriad of issues, as well as increased contributions from data center players like Equinix and Iron Mountain, which are interested in federal data center consolidation initiatives and broader regulation of the sector.


How better training can help fix the research reproducibility crisis

Inside Higher Ed, Erin Becker



In a recent survey of 704 principal investigators for National Science Foundation biology grants, the majority said their most important unmet data needs were not software or infrastructure, but training in data integration and data management.[1]

This lack of data skills is holding back progress toward more reproducible research by making it harder for researchers to share, review, and reanalyze one another’s data. In a recent survey by Nature, the top four solutions scientists identified for improving reproducibility related to better understanding of statistics and research design and improved mentoring, supervision, and teaching of researchers.[2] Data skills need to be an integral part of academic training in order to ensure that research is reliable, transparent, and reproducible.

My organization, Data Carpentry, and our sister organization, Software Carpentry, are among the groups filling this gap by training researchers in the latest technologies and techniques for cleaning, organizing, cataloging, analyzing, and managing their data. We see this training as an important part of a larger project to transform academic culture to make research more reproducible and transparent.


The Rise of AI Is Forcing Google and Microsoft to Become Chipmakers

WIRED, Business, Tom Simonite



By now our future is clear: We are to be cared for, entertained, and monetized by artificial intelligence. Existing industries like healthcare and manufacturing will become much more efficient; new ones like augmented reality goggles and robot taxis will become possible.

But as the tech industry busies itself with building out this brave new artificially intelligent, and profit boosting, world, it’s hitting a speed bump: Computers aren’t powerful and efficient enough at the specific kind of math needed. While most attention to the AI boom is understandably focused on the latest exploits of algorithms beating humans at poker or piloting juggernauts, there’s a less obvious scramble going on to build a new breed of computer chip needed to power our AI future.


Company Data Science News

Google’s DeepMind describes an “imagination-based” approach to agents that plan and learn. This kind of leap-ahead thinking is what we’ve come to expect from DeepMind.

Microsoft also has a paper out on what they’re calling Machine Teaching.

Intel, fighting to keep up in the AI hardware race, announced a new $79 deep-learning USB stick. The product was developed by Movidius, a start-up Intel acquired in 2016. If any of you readers use this stick, let me know what tasks you’re sending to it and how well it works.


Microsoft has a loosely related hardware play, introducing an accelerator for deep neural networks to the HPU in its HoloLens. The accelerator will draw on the HoloLens battery, which means it has to be fairly efficient. Tom Simonite attempts to make sense of the corporate strategies involving AI + custom chips for Wired.

LinkedIn has always had excellent-bordering-on-creepy data science. Much to my dismay, the week I finally caved in and joined the chorus of spammy invites to “join my network on LinkedIn!”, they came up with another spy tool camouflaged as business intelligence. See the thing is, now that I have consumed the Kool-Aid while holding my nose, I really do invite you to join my network, dearest readers.

Replika is a shadow-bot that tracks what you’re up to on your computer and mimics your style, attitude, and tendencies in order to text like you would. The inventor used it to mimic the presence of a dearly departed friend.

NatureSweet has used AI to boost harvests by 4%. The company projects it can bring that up to 20%, which might be too optimistic. Still, I try to bring you some AgTech news because it’s one of the areas having measurable early success with AI applications.

Amazon is reportedly building a health analytics team based in Seattle. They are calling the team 1492…in reference to the flood of disease and pestilence brought by European settlers to native populations without immunity???


Characterizing and Managing Missing Structured Data in Electronic Health Records

bioRxiv; Brett K. Beaulieu-Jones, Daniel R. Lavage, John W. Snyder, Jason H. Moore, Sarah A. Pendergrass, Christopher R. Bauer



Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR) based analyses. Failure to appropriately consider missing data can lead to biased results. Here, we provide detailed procedures for when and how to conduct imputation of EHR data. We demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered. We analyzed clinical lab measures from 602,366 patients in the Geisinger Health System EHR. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness. Our results show that several methods including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute consistently imputed missing values with low error; however, only a subset of the MICE methods were suitable for multiple imputation. The analyses described provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs and all of our methods and code are publicly available.
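
The authors’ code is public, but if you want a quick feel for the workflow the abstract describes, here is a minimal sketch (mine, not theirs) that simulates missing-completely-at-random values in a tiny synthetic lab table, fills them in with a MICE-style iterative imputer from scikit-learn, and scores the imputations against the held-back truth. The column names, missingness rate, and error metric are all illustrative.

```python
# Minimal sketch (not the paper's pipeline): simulate values missing completely
# at random in a numeric table, impute with a MICE-style iterative imputer,
# and score the imputations against the held-back truth.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic "complete case" lab panel: three measures, two of them correlated.
n = 5_000
base = rng.normal(size=n)
labs = pd.DataFrame({
    "lab_a": base + rng.normal(scale=0.3, size=n),
    "lab_b": 2 * base + rng.normal(scale=0.5, size=n),
    "lab_c": rng.normal(size=n),
})

# Simulate ~20% missing completely at random (MCAR) in one column.
mask = rng.random(n) < 0.20
observed = labs.copy()
observed.loc[mask, "lab_b"] = np.nan

# MICE-style imputation: each feature is modeled from the others in turn.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(observed), columns=labs.columns)

# Root-mean-square error on the cells we deliberately blanked out.
rmse = np.sqrt(((imputed.loc[mask, "lab_b"] - labs.loc[mask, "lab_b"]) ** 2).mean())
print(f"RMSE on simulated-missing lab_b values: {rmse:.3f}")
```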


How Your Brain Is Like the Cosmic Web

Nautilus, Franco Vazza & Alberto Feletti



The task of comparing brains and clusters of galaxies is a difficult one. For one thing it requires dealing with data obtained in drastically different ways: telescopes and numerical simulations on the one hand, electron microscopy, immunohistochemistry, and functional magnetic resonance on the other.

It also requires us to consider enormously different scales: The entirety of the cosmic web—the large-scale structure traced out by all of the universe’s galaxies—extends over at least a few tens of billions of light-years. This is 27 orders of magnitude larger than the human brain. Plus, one of these galaxies is home to billions of actual brains. If the cosmic web is at least as complex as any of its constituent parts, we might naively conclude that it must be at least as complex as the brain.
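
The 27-orders-of-magnitude claim is easy to sanity-check with rough numbers (a few tens of billions of light-years against a roughly 10 cm brain):

```python
import math

# Back-of-the-envelope check of the "27 orders of magnitude" claim:
# a few tens of billions of light-years versus a ~10 cm human brain.
LIGHT_YEAR_M = 9.461e15              # meters in one light-year
cosmic_web_m = 3e10 * LIGHT_YEAR_M   # ~30 billion light-years, in meters
brain_m = 0.1                        # rough linear size of a human brain, meters

print(f"ratio ~ 10^{math.log10(cosmic_web_m / brain_m):.0f}")  # prints 10^27
```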


Artificial intelligence holds great potential for both students and teachers – but only if used wisely

The Conversation, Simon Knight and Simon Buckingham Shum



Artificial intelligence (AI) enables Siri to recognise your question, Google to correct your spelling, and tools such as Kinect to track you as you move around the room.

Data, big and small, have come to education, from creating online platforms to increasing standardised assessments. But how can AI help us use and improve that data?


[1707.04393] Sustainable computational science: the ReScience initiative

arXiv, Computer Science > Digital Libraries; Nicolas P. Rougier et al.



Computer science offers a large set of tools for prototyping, writing, running, testing, validating, sharing and reproducing results; computational science, however, lags behind. In the best case, authors may provide their source code as a compressed archive and feel confident their research is reproducible. But this is not exactly true. James Buckheit and David Donoho proposed more than two decades ago that an article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code, and data that produced the result. This implies new workflows, in particular in peer review. Existing journals have been slow to adapt: source code is rarely requested and hardly ever actually executed to check that it produces the results advertised in the article. ReScience is a peer-reviewed journal that targets computational research and encourages the explicit replication of already published research, promoting new and open-source implementations in order to ensure that the original research can be replicated from its description. To achieve this goal, the whole publishing chain is radically different from that of other traditional scientific journals. ReScience resides on GitHub, where each new implementation of a computational study is made available together with comments, explanations, and software tests.


This Local Company Uses Data Science to Rank Data Scientists

BostInno, Lucia Maffei



To connect companies with data scientists, local startup Experfy is working on both creating a marketplace of available experts as well as training current employees or people who want to start a career in data science.

First, the company created an online network of over 30,000 experts in data science, machine learning, artificial intelligence, and the Internet of Things, including academics and full-time consultants. By using the marketplace, companies can post the projects they need help with, connect with the right people, and hire on-demand data scientists.

“On an average, I would say that every project gets about 10 to 15 proposals,” said Jyotsna Khan, a senior manager at Experfy, in an interview. “The client would select like, say, three people and … pick times to interview them. Everything happens through the platform.”

 
Events



EMNLP 2017 – Papers accepted for publication

SIGDAT, ACL



Copenhagen, Denmark Empirical Methods in Natural Language Processing, September 7-11. [$$$]


Medicine X 2017

Stanford University



Stanford, CA Stanford Medicine X is a multifaceted program that represents a new way of solving health care’s most pressing problems, September 15-17. [$$$$]

 
Deadlines



AI Grant

Get $2,500 for your AI project. Apply in five minutes. Applications due August 25, 2017

Data & Society Workshop: Lessons From The Field

New York, NY On October 30, 2017, Data & Society will host a workshop in NYC on how data-driven technologies intersect with society. For this workshop, we invite researchers who do empirical work using qualitative methods to submit papers. Deadline for applications is August 25.
 
Tools & Resources



Entity Deduplication

University Of Chicago, Center for Data Science and Public Policy



“Combining datasets and performing large aggregate analyses are a powerful new way to improve service across large populations. Critically important in this task is the deduplication of identities across multiple data sets that were rarely designed to work together. Inconsistent data entry, typographical errors, and real world identity changes pose significant challenges to this process. To help, we have built a tool called pgdedupe.”
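
I won’t reproduce pgdedupe’s configuration here, but the core problem it tackles is easy to illustrate. Below is a toy sketch (not pgdedupe’s API) that flags candidate duplicate identities by blending name similarity with an exact date-of-birth match; production tools layer blocking, learned weights, and clustering on top of this idea.

```python
# Illustrative sketch only (not pgdedupe's API): flag candidate duplicate
# identities by comparing name and date-of-birth fields with a simple
# string-similarity score.
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "Jonathan Q. Smith", "dob": "1980-03-14"},
    {"id": 2, "name": "Jon Smith",         "dob": "1980-03-14"},
    {"id": 3, "name": "Maria Garcia",      "dob": "1975-11-02"},
]

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(r1: dict, r2: dict) -> float:
    """Weighted blend of name similarity and an exact date-of-birth match."""
    name_sim = similarity(r1["name"], r2["name"])
    dob_match = 1.0 if r1["dob"] == r2["dob"] else 0.0
    return 0.6 * name_sim + 0.4 * dob_match

for r1, r2 in combinations(records, 2):
    score = match_score(r1, r2)
    if score > 0.7:  # threshold chosen arbitrarily for the example
        print(f"possible duplicate: {r1['id']} <-> {r2['id']} (score {score:.2f})")
```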


Australia releases data from MH370 search that will help both science and the fishing industry

Quartz, Steve Mollman



Australia released today a vast trove of underwater data gathered during the search for the missing Boeing 777 airliner, which disappeared in March 2014 while en route from Kuala Lumpur to Beijing with 239 people on board. To find the plane, search teams armed with high-tech equipment and relatively large budgets painstakingly surveyed about 120,000 sq km (46,000 sq miles) of seafloor, turning an otherwise remote part of the Indian Ocean into one of the world’s most thoroughly mapped regions of deep ocean.


37 Reasons why your Neural Network is not working

Slav Ivanov



Where do you start checking if your model is outputting garbage (for example, predicting the mean of all outputs, or scoring really poor accuracy)?

A network might not be training for a number of reasons. Over the course of many debugging sessions, I would often find myself doing the same checks. I’ve compiled my experience along with the best ideas around in this handy list. I hope it will be of use to you, too.
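
One of the cheapest checks implied by that opening question (is the model just predicting the mean?) can be automated. Here is a small, generic sketch, with y_true and y_pred standing in for your own validation targets and predictions:

```python
# Quick sanity check in the spirit of the post: if your regression model
# barely beats a "predict the global mean" baseline, it is effectively
# outputting garbage. `y_true` and `y_pred` are placeholders for your own
# validation targets and model predictions.
import numpy as np

def mean_baseline_check(y_true: np.ndarray, y_pred: np.ndarray) -> None:
    baseline = np.full_like(y_true, y_true.mean(), dtype=float)
    mse_model = np.mean((y_true - y_pred) ** 2)
    mse_baseline = np.mean((y_true - baseline) ** 2)
    print(f"model MSE:    {mse_model:.4f}")
    print(f"baseline MSE: {mse_baseline:.4f}")
    if mse_model >= 0.95 * mse_baseline:
        print("warning: model is doing little better than predicting the mean")

# Example with made-up numbers: predictions hugging the mean trigger the warning.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([2.9, 3.0, 3.1, 3.0, 2.95])
mean_baseline_check(y_true, y_pred)
```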


Lessons Learned From Benchmarking Fast Machine Learning Algorithms

Microsoft, Cortana Intelligence and Machine Learning Blog; Miguel Fierro, Mathew Salvaris, Guolin Ke and Tao Wu



Boosted decision trees are responsible for more than half of the winning solutions in machine learning challenges hosted at Kaggle, according to KDNuggets. In addition to superior performance, these algorithms have practical appeal as they require minimal tuning. In this post, we evaluate two popular tree boosting software packages: XGBoost and LightGBM, including their GPU implementations.
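
The post’s benchmarks use large datasets and GPU builds of both libraries; for a quick CPU-only feel for this kind of head-to-head, here is a minimal sketch on scikit-learn’s synthetic data (parameters are illustrative, not the blog’s tuned settings).

```python
# Minimal CPU-only sketch of the kind of head-to-head comparison in the post,
# using scikit-learn's synthetic data; the blog's actual benchmarks use much
# larger datasets, tuned parameters, and GPU builds of both libraries.
import time

import lightgbm as lgb
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "XGBoost": xgb.XGBClassifier(n_estimators=200, max_depth=6),
    "LightGBM": lgb.LGBMClassifier(n_estimators=200, num_leaves=63),
}

for name, model in models.items():
    start = time.time()
    model.fit(X_tr, y_tr)
    elapsed = time.time() - start
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name:8s}  train time: {elapsed:6.2f}s  AUC: {auc:.4f}")
```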

 
Careers


Full-time positions outside academia

AI Writer



Facebook; Menlo Park, CA

Data Scientist



Bowery Farming; New York, NY

Full-Stack Software Engineer



Metamarkets; San Francisco and New York
