Data Science newsletter – March 14, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for March 14, 2018


 
 
Data Science News



NVIDIA’s Artificial Intelligence Tech Has Begun Conquering the Multitrillion-Dollar Oil and Gas Industry

The Motley Fool, Beth McKenna



NVIDIA’s (NASDAQ:NVDA) graphics processing unit (GPU)-based approach to high-performance computing and deep learning, a category of artificial intelligence (AI) in which machines are trained to make inferences from data the way humans do, has begun making inroads into the global oil and gas industry.

This is great news for investors, as this is a multitrillion-dollar industry that forms the foundation of the global economy. While renewable forms of energy have been steadily displacing fossil fuels to generate electricity and electric vehicles (EVs) have begun lessening the transportation industry’s ravenous appetite for petroleum products, full transformations of these realms will take decades. Moreover, beyond being used to produce just about everything, oil derivatives are key ingredients in products ranging from plastics and fertilizers to the asphalt that paves our roads and the synthetic fibers that clothe many of us.


Voice assistants are helping Utah keep the conversation going with citizens

StateScoop, Ryan Johnston



When Utah Chief Technology Officer Dave Fletcher thinks about how to integrate technology into state government, two criteria come to mind.

“What we try to do is focus on the things that will improve quality of life and keep our citizens engaged,” Fletcher told StateScoop.

Utah, which has emerged as a leader among its contemporaries in online services, has found that the emerging technology of voice-activated digital assistants like Amazon’s Alexa fits the bill for satisfaction and engagement among all kinds of citizens.


Waymo self-driving trucks to transport Google Atlanta data center gear

DatacenterDynamics, Sebastian Moss



Alphabet’s self-driving vehicle division Waymo has expanded its tests from just passenger cars to include trucks, and its first trials will help Google’s data centers.

After nearly a year of its fully self-driving cars whizzing around Phoenix, Arizona, the company plans to enlist its new trucks for carrying freight bound for Google’s data centers in Atlanta, Georgia.


Google Launches Three New Artificial Intelligence Experiments That Could Be Godsends for Artists, Museums & Designers

Open Culture



You’ll recall, a few months ago, when Google made it possible for all of your Facebook friends to find their doppelgängers in art history. As so often with that particular company, the fun distraction came as the tip of a research-and-development-intensive iceberg, and they’ve revealed the next layer in the form of three artificial intelligence-driven experiments that allow us to navigate and find connections among huge swaths of visual culture with unprecedented ease.

Google’s new Art Palette, as explained in the video at the top of the post, allows you to search for works of art held in “collections from over 1500 cultural institutions,” not just by artist or movement or theme but by color palette.
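The article doesn't describe Art Palette's internals, but the core idea of searching by color palette can be sketched as a nearest-neighbor lookup over dominant colors. A minimal toy version, with invented artwork data:

```python
# Toy sketch of palette-based search: rank artworks by the distance between
# their dominant colors and a query palette. All artwork data is invented.

def palette_distance(p1, p2):
    """Sum of per-color squared RGB distances between two equal-length palettes."""
    return sum(
        (a - x) ** 2 + (b - y) ** 2 + (c - z) ** 2
        for (a, b, c), (x, y, z) in zip(p1, p2)
    )

# Each artwork is summarized by a couple of dominant RGB colors.
artworks = {
    "seascape": [(30, 60, 120), (200, 210, 220)],
    "sunset":   [(240, 120, 40), (90, 30, 60)],
}

def search_by_palette(query, collection):
    """Return artwork names ordered from most to least similar palette."""
    return sorted(collection, key=lambda name: palette_distance(query, collection[name]))

results = search_by_palette([(20, 70, 110), (190, 200, 230)], artworks)
```

A production system would extract dominant colors per image (e.g., by clustering pixels) and index them for fast lookup, but the ranking step reduces to a distance comparison like this one.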


Artificial intelligence can transform industries, but California lawmakers are worried about your privacy

Los Angeles Times, Jazmine Ulloa



The use of bots to meddle in political elections. Algorithms that learn who people are and keep them coming back to social media platforms. The rise of autonomous vehicles and drones that could displace hundreds of thousands of workers.

The “robot apocalypse” that some envisioned with the rise of artificial intelligence hasn’t arrived, but machine learning systems are becoming part of Californians’ everyday lives, tech experts told state lawmakers in Sacramento last week. As use of the technology becomes more widespread, so will the challenges for legislators who will have to grapple with how and when they should step in to protect people’s personal data.


MilCloud 2.0 Rollout Reaches for the Sky

SIGNAL Magazine, Kimberly Underwood



After the success of the Defense Information Systems Agency’s bold step in 2013 to build an on-premise cloud platform called the milCloud 1.0 Cloud Service Offering based on commercial technology, the agency went for more with milCloud version 2.0, driven by extraordinary customer interest, cloud computing’s advantages and cost savings. Unlike milCloud 1.0, for which mission partners paid a monthly fee regardless of usage, version 2.0 is utility-based, and customers only pay for what they use. This allows military customers to scale usage up or down depending on operational requirements.

Last June, the U.S. Department of Defense awarded a $498 million support contract to CSRA, Falls Church, Virginia, to supply, develop, manage and roll out milCloud 2.0. The company has a tall order to fill with what is a unique commercial system under military authority and protection on DOD bases.

Last week, the agency, known as DISA, granted milCloud 2.0 a Defense Department Provisional Authorization for Impact Level 5 systems (DOD PA IL5) to operate as a cloud portal and to provide infrastructure as a service (IaaS). The approval allows the cloud platform to host unclassified national security systems and high-sensitivity systems.


With Robert Lightfoot Retiring, Who’s Running NASA?

The Atlantic, Marina Koren



An interim administrator has been overseeing the space agency for more than 13 months—and now he’s leaving, too.


Computational problems of cosmic proportions

Harvard University, John A. Paulson School of Engineering and Applied Sciences



Nearly 2,000 years ago, Chinese astronomers recorded what historians believe to be the first documented sighting of a supernova, a star that “dies” in a catastrophic explosion. Technological advances, from the simple telescopes of the early 1600s to the first computer-controlled supernova search in the 1960s, offered new insights into these dramatic astronomical events.

Now, students in the computational science and engineering master’s program offered by the Institute for Applied Computational Science (IACS) at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS), are applying the latest data science techniques to help scientists shed new light on supernovae.

Led by Pavlos Protopapas, Scientific Program Director of IACS, the students were participating in the fifth year of a Harvard research and education collaboration with Chilean researchers and scientists.


Your next computer could improve with age

MIT Technology Review, Will Knight



Generally, computers slow down as they age. Their processors struggle to handle newer software. Apple even deliberately slows its iPhones as their batteries degrade. But Google researchers have published details of a project that could let a laptop or smartphone learn to do things better and faster over time.

The researchers tackled a common problem in computing, called prefetching. Computers process information much faster than they can pull it from memory to be processed. To avoid bottlenecks, they try to predict which information is likely to be needed and pull it in advance. As computers get more powerful, this prediction becomes progressively harder.

In a paper posted online this week, the Google team describes using deep learning—an AI method that employs a large simulated neural network—to improve prefetching. Although the researchers haven’t shown how much this speeds things up, the boost could be big, given what deep learning has brought to other tasks.
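For readers unfamiliar with prefetching, a toy illustration helps: classical prefetchers learn patterns in the stream of memory-address deltas and fetch the predicted next address ahead of time. The sketch below uses a simple frequency table over deltas (not the paper's neural-network model, which replaces exactly this kind of table with a learned predictor):

```python
# Toy prefetcher (not Google's deep-learning model): predict the next
# memory-address delta from the most recent delta via a frequency table.
from collections import Counter, defaultdict

class DeltaPrefetcher:
    def __init__(self):
        self.table = defaultdict(Counter)  # last delta -> counts of following delta
        self.prev_addr = None
        self.prev_delta = None

    def observe(self, addr):
        """Record a memory access and update the delta-transition table."""
        if self.prev_addr is not None:
            delta = addr - self.prev_addr
            if self.prev_delta is not None:
                self.table[self.prev_delta][delta] += 1
            self.prev_delta = delta
        self.prev_addr = addr

    def predict_next(self):
        """Prefetch candidate: current address plus the most likely next delta."""
        if self.prev_delta is None or not self.table[self.prev_delta]:
            return None
        delta, _ = self.table[self.prev_delta].most_common(1)[0]
        return self.prev_addr + delta

p = DeltaPrefetcher()
for addr in [100, 104, 108, 112, 116]:  # a simple stride-4 access stream
    p.observe(addr)
```

Real access streams are far less regular than this stride-4 example, which is why the Google team's move to a learned model is interesting: tables like this one fail on the irregular patterns that dominate real workloads.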


University of Arizona Predicting Freshman Dropouts Using Student ID Card Data

Complex, Eric Skelton



Researchers at the University of Arizona have been tracking the swipes of freshmen students’ ID cards to predict which of them are most likely to drop out.

Over the last three years, a professor of management information systems named Sudha Ram has been analyzing data gathered from the cards about how often students enter residence halls, libraries, and student recreation centers—as well as how frequently they make purchases.

“It’s kind of like a sensor that’s embedded in them, which can be used for tracking them,” Ram explains in a University of Arizona press release. “It’s really not designed to track their social interactions, but you can, because you have a timestamp and location information.”
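The kind of inference Ram describes can be made concrete with a toy sketch: given only (student, place, timestamp) swipe records, repeated near-simultaneous swipes at the same place suggest two students spend time together. The data and the time window below are invented for illustration:

```python
# Hedged toy of inferring social ties from swipe logs. Records are
# (student, place, timestamp-in-seconds); all values are made up.
from itertools import combinations

swipes = [
    ("ana", "library", 100), ("ben", "library", 130),
    ("ana", "gym", 500),     ("ben", "gym", 520),
    ("cal", "dorm", 900),
]

def co_swipes(records, window=60):
    """Count pairs of students swiping at the same place within `window` seconds."""
    counts = {}
    for (s1, p1, t1), (s2, p2, t2) in combinations(records, 2):
        if s1 != s2 and p1 == p2 and abs(t1 - t2) <= window:
            pair = tuple(sorted((s1, s2)))
            counts[pair] = counts.get(pair, 0) + 1
    return counts

pairs = co_swipes(swipes)
```

That a few lines of counting suffice is the point of Ram's "sensor" remark: the cards were not designed for social tracking, but timestamps plus locations make it trivial.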


A Game Changer: Metagenomic Clustering Powered by HPC

Lawrence Berkeley Lab



New Berkeley Lab algorithm allows biologists to harness the capabilities of massively parallel supercomputers to make sense of a genomic ‘data deluge’


Making Better Use of Health Care Data

Harvard Business Review, Benson S. Hsu and Emily Griese



At Sanford Health, a $4.5 billion rural integrated health care system, we deliver care to over 2.5 million people in 300 communities across 250,000 square miles. In the process, we collect and store vast quantities of patient data – everything from admission, diagnostic, treatment and discharge data to online interactions between patients and providers, as well as data on providers themselves. All this data clearly represents a rich resource with the potential to improve care, but until recently was underutilized. The question was, how best to leverage it.

While we have a mature data infrastructure including a centralized data and analytics team, a standalone virtual data warehouse linking all data silos, and strict enterprise-wide data governance, we reasoned that the best way forward would be to collaborate with other institutions that had additional and complementary data capabilities and expertise.

We reached out to potential academic partners who were leading the way in data science, from university departments of math, science, and computer informatics to business and medical schools and invited them to collaborate with us on projects that could improve health care quality and lower costs. In exchange, Sanford created contracts that gave these partners access to data whose use had previously been constrained by concerns about data privacy and competitive-use agreements. With this access, academic partners are advancing their own research while providing real-world insights into care delivery.


Big Data and Machine Learning in Health Care

JAMA, The JAMA Network, Viewpoint; Andrew L. Beam and Isaac S. Kohane



Nearly all aspects of modern life are in some way being changed by big data and machine learning. Netflix knows what movies people like to watch and Google knows what people want to know based on their search histories. Indeed, Google has recently begun to replace much of its existing non–machine learning technology with machine learning algorithms, and there is great optimism that these techniques can provide similar improvements across many sectors.

It is no surprise then that medicine is awash with claims of revolution from the application of machine learning to big health care data. Recent examples have demonstrated that big data and machine learning can create algorithms that perform on par with human physicians.1 Though machine learning and big data may seem mysterious at first, they are in fact deeply related to traditional statistical models that are recognizable to most clinicians. It is our hope that elucidating these connections will demystify these techniques and provide a set of reasonable expectations for the role of machine learning and big data in health care.
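The connection the authors draw can be shown in a few lines: logistic regression, a staple of clinical statistics, trained by gradient descent exactly as many machine learning models are. The single risk factor and outcomes below are made up:

```python
# Sketch of the statistics/ML bridge: logistic regression fit by gradient
# descent on invented data (one risk-factor value per patient, 0/1 outcome).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)]  # (risk factor, outcome)

w, b = 0.0, 0.0
lr = 0.5
for _ in range(2000):  # minimize the log-loss by gradient descent
    gw = gb = 0.0
    for x, y in data:
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    w -= lr * gw / len(data)
    b -= lr * gb / len(data)

def predict(x):
    """Predicted probability of the outcome for risk-factor value x."""
    return sigmoid(w * x + b)
```

A clinician would recognize the fitted model as an odds-ratio calculation; a machine learning practitioner would recognize the training loop as a one-layer neural network. That overlap is the authors' point.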


Investigating ad transparency mechanisms in social media: a case study of Facebook’s explanations

Adrian Colyer, the morning paper blog



Investigating ad transparency mechanisms in social media: a case study of Facebook’s explanations Andreou et al., NDSS’18

Let me start out by saying that I think it’s good Facebook are making an effort to provide more transparency to advertising. It’s good that Twitter announced they will do something similar too. It’s a shame though that in Twitter’s case they don’t seem to have followed through by actually releasing anything (it’s possible I’m wrong on this and missed the release – let me know if so!). And as today’s paper choice reveals, it’s a shame that in Facebook’s implementation they seem to be making some choices that considerably mask the truth, making the explanations provided significantly less useful than they otherwise could be.

I particularly like the experiment setup in the paper. If you’re going to assess advertisement explanations, then ideally you need to compare the explanation given to the ground truth (i.e., the targeting attributes that were actually used). Unfortunately the ground truth is an unknown of course… unless you also happen to be the advertiser!


The cybernetic newsroom: horses and cars

Reuters, Reginald Chua



In newsrooms, machines do some things very well – they analyze and sift through data tirelessly, and at speed and on demand. Humans, on the other hand, are good at asking the right questions, bringing news judgment to bear, and understanding context. Or to put it the opposite way – and very generally – machines write bad stories and journalists struggle with mounds of data.

So that’s why Reuters is building a “cybernetic newsroom” – marrying the best of machine capability and human judgment to drive better journalism, rather than asking one to be a second-rate version of the other.

 
Events



Building Tools to Democratize Access to Medical Data

SVAI, Doc.ai



San Francisco, CA “Join us March 21st in SF for a sneak preview of Doc.ai’s developer platform which will soon be open for the community to use and build upon. Doc.ai’s engineering and AI teams will present, along with Chief Science Officer Jeremy Howard.” [free, registration required]


2018 INFORMS Business Analytics Conference: Driving the Future of Analytics

INFORMS, KDnuggets



Baltimore, MD April 15-17. “2018 conference will gather 1,000 experts and thought leaders in industry, government, and academia, with over 150 sessions by leading universities and organizations. Register by March 12 to get early rates.” [$$$$]


2018 Annual Summit of the Northeast Big Data Innovation Hub

Northeast Big Data Hub



New York, NY March 27 at Columbia University. “Join us and learn how the Hub has grown over the past year, including updates on new cross-sector initiatives, lightning talks from our Big Data Spoke PIs, and opportunities to collaborate with our stakeholders in breakout sessions on data literacy, ethics, and health.” [free, registration required]


Health And… Data Science and Public Action Conference

NYU Langone Health, Department of Population Health



New York, NY May 21 at NYU Vanderbilt Hall (40 Washington Square South). Hosted by NYU Langone Health, Department of Population Health. [registration required]

 
Deadlines



Call for a High-Level Expert Group on Artificial Intelligence

“Artificial intelligence (AI) is transforming our economy and society. To make the most of it for the citizens and the economy, the European Commission will engage in a dialogue with all actors on the future of AI in Europe, to allow for an open discussion of all aspects of AI development and its impact on the economy and society. A high-level expert group on AI will be selected as a result of this call, to serve as a steering group for this broad multi-stakeholder forum.” Deadline to apply is April 9.

Grand Challenges Explorations – Brazil: Data Science Approaches to Improve Maternal and Child Health in Brazil

“This joint call for proposals targeted specifically to Brazilian researchers is the result of a partnership between the Ministry of Health (MoH), the National Council for Scientific and Technological Development (CNPq), the National Council for State Funding Agencies (CONFAP), engaged State Funding Agency (FAPs) and The Bill & Melinda Gates Foundation (BMGF). This call is part of another initiative funded by the Gates Foundation in 2010 named Healthy Birth, Growth and Development Knowledge integration (HBGDki). The goal of this program is to use data science tools to develop a deep understanding of the risk factors contributing to poor outcomes in preterm birth, physical growth faltering and impaired neurocognitive development. Through Grand Challenges Explorations – Brazil, the above-mentioned partners share this goal and wish to build on the growing expertise in data science, epidemiology and public health in Brazil to address priority issues in maternal and child health.” Deadline for proposals is May 5.
 
Moore-Sloan Data Science Environment News



5 minutes with Suzanne McIntosh

Medium, NYU Center for Data Science



Software solutions for space systems? As part of this month’s Women in Data Science series, we catch up with Suzanne McIntosh, Clinical Associate Professor of Computer Science.

 
Tools & Resources



[1803.02324] Annotation Artifacts in Natural Language Inference Data

arXiv, Computer Science > Computation and Language; Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, Noah A. Smith



Large-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise), and asking them to generate three new sentences (hypotheses) that it entails, contradicts, or is logically neutral with respect to. We show that, in a significant portion of such data, this protocol leaves clues that make it possible to identify the label by looking only at the hypothesis, without observing the premise. Specifically, we show that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI (Bowman et al., 2015) and 53% of MultiNLI (Williams et al., 2017). Our analysis reveals that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes. Our findings suggest that the success of natural language inference models to date has been overestimated, and that the task remains a hard open problem.
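A toy version of the artifact the authors describe (with invented examples, far simpler than their text categorization model): crowd workers often wrote contradictions by negating the premise, so a classifier that looks only at the hypothesis for negation words can beat chance.

```python
# Hedged toy of a hypothesis-only classifier. This is an illustration of the
# annotation artifact, not the paper's model; examples and word list invented.

NEGATION = {"not", "no", "never", "nobody", "nothing"}

def hypothesis_only_guess(hypothesis):
    """Guess an NLI label from the hypothesis alone, ignoring the premise."""
    tokens = set(hypothesis.lower().split())
    return "contradiction" if tokens & NEGATION else "entailment"

label = hypothesis_only_guess("The man is not sleeping")
```

The paper's point is that such surface cues leak label information, so benchmark accuracy overstates how much models actually use the premise.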


Automating global water maps using Sentinel-2 imagery in GBDX Notebooks

DigitalGlobe Blog, Eric Ong and Darren Wiens



Water is essential for life on Earth, as we all know. And when a shortage occurs, as is happening now in Cape Town, South Africa, it’s a crisis. But mapping water across the planet to track changes and potentially prevent water shortages is challenging. For one, because water levels are dynamic and always changing, data collected one day can become obsolete fast; and two, piecing together imagery from areas around the globe from all available sources to get the big view of water locations has been a time-consuming process, which meant available maps were often outdated.

Until now. The GBDX team at DigitalGlobe developed a workflow that enables us to automate the creation of worldwide water surface layer maps using open source, high-resolution 10 m Sentinel-2 satellite imagery available on Amazon S3, run on the GBDX platform. This workflow processes an immense volume of data into a usable vector format, and also enables the aggregation of those layers to show relative change between years.
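The post doesn't spell out the detection algorithm, but a standard starting point for finding water in Sentinel-2 imagery is the Normalized Difference Water Index, NDWI = (green − nir) / (green + nir), thresholded into a water mask. A minimal sketch on two invented pixels:

```python
# NDWI-based water masking sketch (a common technique, not necessarily the
# exact GBDX workflow). Inputs are parallel lists of per-pixel reflectance.

def ndwi(green, nir):
    """Per-pixel NDWI from green and near-infrared reflectance values."""
    return [(g - n) / (g + n) if (g + n) else 0.0 for g, n in zip(green, nir)]

def water_mask(green, nir, threshold=0.0):
    """True where NDWI exceeds the threshold (water is green-bright, NIR-dark)."""
    return [v > threshold for v in ndwi(green, nir)]

# Two invented pixels: one water-like, one vegetation-like.
mask = water_mask(green=[0.30, 0.10], nir=[0.05, 0.40])
```

At global scale the interesting engineering is elsewhere, as the post notes: running this kind of per-pixel computation over petabytes of imagery and vectorizing the results, which is what the GBDX platform automates.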


Making Healthcare Data Work Better with Machine Learning

Google Research Blog, Patrik Sundberg and Eyal Oren



Over the past 10 years, healthcare data has moved from being largely on paper to being almost completely digitized in electronic health records. But making sense of this data involves a few key challenges. First, there is no common data representation across vendors; each uses a different way to structure their data. Second, even sites that use the same vendor may differ significantly, for example, they typically use different codes for the same medication. Third, data can be spread over many tables, some containing encounters, some containing lab results, and yet others containing vital signs.

The Fast Healthcare Interoperability Resources (FHIR) standard addresses most of these challenges: it has a solid yet extensible data-model, is built on established Web standards, and is rapidly becoming the de-facto standard for both individual records and bulk-data access. But to enable large-scale machine learning, we needed a few additions: implementations in various programming languages, an efficient way to serialize large amounts of data to disk, and a representation that allows analyses of large datasets.

Today, we are happy to open source a protocol buffer implementation of the FHIR standard, which addresses these issues. The current version supports Java, and support for C++, Go, and Python will follow soon. Support for profiles will follow shortly as well, plus tools to help convert legacy data into FHIR.
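The interoperability problem described above can be made concrete with a hypothetical sketch (this is an illustration of the problem FHIR solves, not Google's protocol buffer library): two sites code the same medication differently, and a shared mapping normalizes both records into one representation.

```python
# Hypothetical sketch of cross-vendor code normalization. Site names, vendor
# codes, and the mapping are all invented for illustration.

SITE_CODE_MAP = {  # (site, vendor-specific code) -> a shared code
    ("site_a", "MED-001"): "rxnorm:197361",
    ("site_b", "AMOX500"): "rxnorm:197361",
}

def normalize(record):
    """Return a copy of a medication record with a site-neutral code, if known."""
    shared = SITE_CODE_MAP.get((record["site"], record["code"]))
    return {**record, "code": shared or record["code"]}

a = normalize({"site": "site_a", "code": "MED-001", "patient": "p1"})
b = normalize({"site": "site_b", "code": "AMOX500", "patient": "p2"})
```

Once records from different sites share one representation, they can be pooled into the large, consistent datasets that machine learning needs, which is the motivation for the FHIR work the post announces.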

 
Careers


Full-time positions outside academia

Researcher/Exposition Project manager



Children and Screens: Institute of Digital Media and Child Development; New York, NY

Internships and other temporary positions

Research Internships



Allen Institute for Artificial Intelligence; Seattle, WA
