Data Science newsletter – June 6, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for June 6, 2017


Data Science News

University Data Science News

MIT robotics researchers out of Ted Adelson’s lab introduced sensor technology that crosses one of the remaining boundaries in broader artificial intelligence research: GelSight gives robots the ability to draw rich 3D surface maps from touch.

Harvard has a new master’s degree program in health data science, and the University of Wisconsin-Madison now offers a Master’s in Biomedical Data Science. Domain-specific data science master’s degrees may provide better training than pan-disciplinary data science master’s programs, though I’m not aware of anyone who has evaluated these programs. If anyone has thoughts on how to assess the efficacy of data science master’s degree programs, please email me. It’s a topic I have been thinking about on and off for the past couple of years.

Cornell’s SC Johnson College of Business will offer online data science courses for business executives. Executive data science coursework of this kind was inevitable, and I predict we will see many more flexible-scheduling data science/analytics master’s programs for business people.

Specialized fintech courses (not full degrees) are also popping up to meet student demand at Stanford, MIT, NYU, Georgetown, Columbia, and Penn State. There appears to be some disagreement about how to teach (and define) fintech. Does a course have to include cryptocurrencies? Should coursework focus on what quants at big banks are doing? What if students, many of whom are MBAs or undergrads, have no training in computer programming?

I remain skeptical that current procedures for informed consent can cover the type of consent asked of the 10,000 New Yorkers whom the Kavli Human Project wants to include. Participants will be asked to share “every aspect of their lives, virtually all the time” including stool samples for two decades or more. I can see the benefits for science of having such detailed data, though I would like to know if researchers have explicit plans to use it or are simply geeking out about being able to have so much in a ‘Gather now, ask the research questions later’ style design. I am unsure that a participant could assess what it means to allow the kind of extensive surveillance for such a long period of time when the research questions are not made clear. What does it mean to inform someone that their *lives* will be dedicated to science? What does this mean for their family and friends who are undoubtedly part of their lives? Are the shadows of those (unconsented and possibly unconsenting) people’s data going to be visible in the data collected?

UW-Seattle, Microsoft, and the City of Bellevue have partnered to use video analytics of roadways to reduce traffic accidents. Elements like cars, bikes, and pedestrians in the videos will be labeled by human volunteers to generate a training data set.

Andrew Ng, co-founder of Coursera, founder of the Google Brain project, recently departed head of Baidu’s AI group and currently a Stanford professor, appears poised to build an AI business of his own.

Maxim Grechkin, Hoifung Poon and Bill Howe have created a process called “Wide-Open,” for identifying as-yet unreleased data and prodding researchers into making it public. Writing in PLOS ONE they note, “Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.”

Extracting Insight from the Data Deluge is a Hard-to-Do Must-Do



“Today’s hardware is ill-suited to handle such data challenges, and these challenges are only going to get harder as the amount of data continues to grow exponentially,” said Trung Tran, a program manager in DARPA’s Microsystems Technology Office (MTO). To take on that technology shortfall, MTO last summer unveiled its Hierarchical Identify Verify Exploit (HIVE) program, which has now signed on five performers to carry out HIVE’s mandate: to develop a powerful new data-handling and computing platform specialized for analyzing and interpreting huge amounts of data with unprecedented deftness. “It will be a privilege to work with this innovative team of performers to develop a new category of server processors specifically designed to handle the data workloads of today and tomorrow,” said Tran, who is overseeing HIVE.

The quintet of performers includes a mix of large commercial electronics firms, a national laboratory, a university, and a veteran defense-industry company: Intel Corporation (Santa Clara, California), Qualcomm Intelligent Solutions (San Diego, California), Pacific Northwest National Laboratory (Richland, Washington), Georgia Tech (Atlanta, Georgia), and Northrop Grumman (Falls Church, Virginia).

New ways of representing information could transform digital technology



Many people who use computers and other digital devices are aware that all the words and images displayed on their monitors boil down to a sequence of ones and zeros. But few likely appreciate what is behind those ones and zeros: microscopic arrays of “magnetic moments” (imagine tiny bar magnets with positive and negative poles). When aligned in parallel in ferromagnetic materials such as iron, these moments create patterns and streams of magnetic bits—the ones and zeros that are the lifeblood of all things digital.

These magnetic bits are stable against perturbations, such as from heat, by a form of strength in numbers: any moment that inadvertently has its orientation reversed is flipped back by the magnetic interaction with the rest of the aligned moments in the bit. For decades, engineers have been able to increase computing capabilities by shrinking these magnetic domains using advances in manufacturing and novel techniques for reading and writing the data.

Alexander Peysakhovich’s Theory on Artificial Intelligence

Pacific Standard, Avital Andrews


Alexander Peysakhovich is technically a behavioral economist, but he bristles a bit at being defined that narrowly. He’s a scientist in Facebook’s artificial intelligence research lab, as well as a prolific scholar, having posted five papers in 2016 alone. He has a Ph.D. from Harvard University, where he won a teaching award, and has published articles in the New York Times, Wired, and several prestigious academic journals.

Despite these accomplishments, Peysakhovich says, “I’m most proud of the fact that I’ve managed to learn enough of lots of different fields so that I can work on problems that I’m interested in using those methods. I’ve co-authored with economists, game theorists, computer scientists, neuroscientists, psychologists, evolutionary biologists, and statisticians.”

Government Use of Social Data

The Big Boulder Initiative


“I know there was just a state-wide internet outage here; for the record, I had nothing to do with that.”

That was the opening line from Andrew Hallman, Deputy Director of CIA for Digital Innovation. That level of self-awareness and an understanding of the biases typically held against the Intelligence Community would prove to be not just a central theme of this session, but also an effective way to challenge those listening to keep their minds open.

Hallman described his particular role as one whose inception was a natural extension of an agency-wide modernization in 2015. Much like brands and other organizations, CIA realized that it was not optimized to deal with the growing complexity of the global threat landscape. Hallman’s job was to come in and unify efforts and help digitally remaster the business of intelligence as the Intelligence Community had always known it.

How to Call B.S. on Big Data: A Practical Guide

The New Yorker, Michelle Nijhuis


The results of our most recent Presidential election notwithstanding, West and Bergstrom maintain that humans are pretty good at detecting verbal bullshit. Members of the species have, after all, been talking rot for millennia, and its warning signs are well known. Bullshit expressed as data, on the other hand, is relatively new outside scientific circles. Multivariate graphs didn’t begin to appear in the popular press until the nineteen-eighties, and only in the past decade, as smartphones and other information-gathering devices have accelerated the accumulation of Big Data, have complex visualizations been routinely presented to the general public. While data can be used to tell remarkably deep and memorable stories, Bergstrom told me, its apparent sophistication and precision can effectively disguise a great deal of bullshit.

Artificial Intelligence Helps in Learning How Children Learn

Scientific American, Alison Gopnik


Bayesian inference considers both the strength of new evidence and the strength of your existing hypotheses. This characteristic of Bayesian statistics lends them a combination of stability and flexibility. Both toddlers and scientists hold on to well-confirmed hypotheses, but eventually enough new evidence can overturn even the most cherished idea. Several studies show that youngsters integrate existing knowledge and new evidence in this way. Elizabeth Bonawitz of Rutgers University and Laura Schulz of the Massachusetts Institute of Technology found that four-year-olds begin by thinking that a psychological state, such as being anxious, is unlikely to have much effect on their physical well-being, such as making your stomach ache, and reject evidence to the contrary. But if you give them accumulating evidence in favor of this “psychosomatic” hypothesis, they gradually become more open to this suspect idea. A Bayesian model, moreover, can predict just when and how a child’s view will change quite precisely.
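The gradual shift described above can be illustrated with a minimal sketch of repeated Bayesian updating on a single yes/no hypothesis. The starting belief (0.05) and the two likelihoods (0.8 and 0.4) are illustrative assumptions, not values from the Bonawitz and Schulz study, and the function names are hypothetical.

```python
def posterior(prior: float, p_obs_if_true: float, p_obs_if_false: float) -> float:
    """Bayes' rule for a binary hypothesis after one supporting observation."""
    numerator = p_obs_if_true * prior
    return numerator / (numerator + p_obs_if_false * (1.0 - prior))


def update_sequence(prior: float, n_observations: int,
                    p_obs_if_true: float = 0.8,
                    p_obs_if_false: float = 0.4) -> list:
    """Return the belief after each of n identical supporting observations.

    The likelihood values are assumed for illustration: each observation is
    twice as probable if the hypothesis is true as if it is false.
    """
    beliefs = [prior]
    for _ in range(n_observations):
        beliefs.append(posterior(beliefs[-1], p_obs_if_true, p_obs_if_false))
    return beliefs


if __name__ == "__main__":
    # A skeptical starting point: the "psychosomatic" idea is given 5% credence.
    for step, belief in enumerate(update_sequence(0.05, 8)):
        print(f"after {step} observations: belief = {belief:.3f}")
```

A low prior is stable against one or two observations, yet under these assumed numbers the belief crosses 50% once enough evidence accumulates. That is the combination of stability and flexibility the excerpt describes.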

Master’s Programs in Data Science and Analytics (Continued …)

Amstat News, Steve Pierson


More universities are starting master’s programs in data science and analytics, of which statistics is foundational, due to the wide interest from students and employers. Amstat News reached out to those in the statistical community who are involved in such programs. Given their interdisciplinary nature, we identified programs involving faculty with expertise in different disciplines to jointly reply to our questions. In our April issue, we profiled four universities; here are several more.

[1705.10783] A generalized model of social and biological contagion

arXiv, Physics > Physics and Society; Peter Sheridan Dodds, Duncan J. Watts


We present a model of contagion that unifies and generalizes existing models of the spread of social influences and micro-organismal infections. Our model incorporates individual memory of exposure to a contagious entity (e.g., a rumor or disease), variable magnitudes of exposure (dose sizes), and heterogeneity in the susceptibility of individuals. Through analysis and simulation, we examine in detail the case where individuals may recover from an infection and then immediately become susceptible again (analogous to the so-called SIS model). We identify three basic classes of contagion models which we call “epidemic threshold”, “vanishing critical mass”, and “critical mass” classes, where each class of models corresponds to different strategies for prevention or facilitation. We find that the conditions for a particular contagion model to belong to one of these three classes depend only on memory length and the probabilities of being infected by one and two exposures respectively. These parameters are in principle measurable for real contagious influences or entities, thus yielding empirical implications for our model. We also study the case where individuals attain permanent immunity once recovered, finding that epidemics inevitably die out but may be surprisingly persistent when individuals possess memory.
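The ingredients named in the abstract (memory of exposure, variable dose sizes, heterogeneous susceptibility, SIS-style recovery) can be sketched as a toy simulation. This is not the authors' code. The well-mixed population, the log-normal dose distribution, the two threshold values, and every parameter name below are assumptions made for illustration.

```python
import random
from collections import deque


def simulate_contagion(n=500, steps=200, memory=5, p=0.5, recover_prob=1.0,
                       initial_infected=0.1, seed=0):
    """Toy SIS-style contagion with dose memory; returns infected fraction per step."""
    rng = random.Random(seed)
    # Heterogeneous susceptibility: each individual has its own dose threshold
    # (the two values are an arbitrary illustrative choice).
    thresholds = [rng.choice([1.0, 2.0]) for _ in range(n)]
    # Each individual remembers only its last `memory` doses.
    doses = [deque([0.0] * memory, maxlen=memory) for _ in range(n)]
    infected = [rng.random() < initial_infected for _ in range(n)]
    history = [sum(infected) / n]

    for _ in range(steps):
        snapshot = infected[:]  # synchronous update from the previous state
        for i in range(n):
            # Contact one other individual chosen uniformly at random.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            dose = 0.0
            if snapshot[j] and rng.random() < p:
                dose = rng.lognormvariate(0.0, 0.5)  # variable dose size
            doses[i].append(dose)
            if sum(doses[i]) >= thresholds[i]:
                infected[i] = True
            elif snapshot[i] and rng.random() < recover_prob:
                # Recovered individuals are immediately susceptible again (SIS).
                infected[i] = False
        history.append(sum(infected) / n)
    return history


if __name__ == "__main__":
    for memory in (1, 5, 20):
        final = simulate_contagion(memory=memory)[-1]
        print(f"memory={memory:2d}: final infected fraction = {final:.2f}")
```

Varying `memory` and `p` in a sketch like this is one way to get a feel for the dependence on memory length and exposure probabilities that the abstract describes. It does not reproduce the paper's classification into the three classes.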

It’s All About the Data

Data Science for Social Good – Atlanta, Takeria Blunt


The Housing Justice team partners with Atlanta Legal Aid (ALA) to address the alleged predatory lending practices of Harbour Portfolio Advisors and gentrification on the historic Westside of Atlanta. We have conducted extensive background research on housing topics such as contract-for-deed, liens, mortgages etc. in order to understand key components and arguments surrounding the assignment. Our project leverages the mapping tool “Mapping Justice” created by a team of students here at Georgia Tech.

What if Your Cellphone Data Can Reveal Whether You Have Alzheimer’s?

Slate, Visar Berisha and Julie Liss


We wondered: What if we can measure aspects of everyday activities like talking, walking, and typing to objectively quantify and track our brains’ health and function over time?

Paul Allen’s AI group built a voice search for Alexa skills, but Amazon rejected it

GeekWire, Nat Levy


Amazon’s digital brain Alexa is very skillful, now at more than 12,000 third-party capabilities, but it can sometimes be difficult to wade through them all and discover new skills, and the platform lacks a central voice search to help users find the right skill for a particular task.

Paul Allen’s Allen Institute for Artificial Intelligence wanted to fill that void, and it built a voice-activated search for Alexa skills. But one problem: Amazon said no.

Here is the answer the Allen Institute team got: “Thank you for the recent submission of your skill, ‘Skill Search’. Unfortunately, your skill has not been published on Amazon Alexa. We don’t allow skills that recommend skills to customers at this time. We will contact you if this feature becomes available.”

Research aims to make artificial intelligence explain itself

Oregon State University, News and Research Communications


The four-year grant from DARPA will support the development of a paradigm to look inside that black box, by getting the program to explain to humans how decisions were reached.

“Ultimately, we want these explanations to be very natural – translating these deep network decisions into sentences and visualizations,” said Alan Fern, principal investigator for the grant and associate director of the College of Engineering’s recently established Collaborative Robotics and Intelligent Systems Institute.

Intel predicts a $7 trillion self-driving future

The Verge, Kirsten Korosec


The race to be the first to deploy autonomous vehicles is on among carmakers, emerging startups, and tech giants. Amid this constant news cycle of deals and drama, the purpose of all of it can get lost — or at least a bit muddied. What exactly are these companies racing for?

A $7 trillion annual revenue stream, according to a study released Thursday by Intel. The companies that don’t prepare for self-driving risk failure or extinction, Intel says. The report also finds that over half a million lives could be saved by self-driving over just one decade.

Oregon Health & Science University licenses out data center dome

DatacenterDynamics, Max Smolaks


The original design is an 8,000 square foot, 4MW data center that spends just 0.2MW on non-IT equipment. Each rack can support up to 25kW of hardware, making it especially suitable for high-density computing often used for research purposes.

There are large air intakes toward the bottom of the building and large air vents toward the top. The dome airfoil helps improve exhaust – much like the ‘chicken coop’ design used by Yahoo, among others – and a ‘vegetative bio-swale’ around the facility provides additional cooling and air supply. Portland Business Journal previously reported that the project cost $22 million.
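The power figures quoted above (4MW and 0.2MW of non-IT load) can be turned into a rough power usage effectiveness (PUE) estimate, where PUE is total facility power divided by IT power. The excerpt does not state a PUE and does not say whether 4MW is the total facility load or the IT load alone, so this sketch computes both readings. The function names are hypothetical.

```python
def pue_if_total_given(total_mw: float, non_it_mw: float) -> float:
    """PUE when the quoted capacity is the total facility load."""
    it_mw = total_mw - non_it_mw
    return total_mw / it_mw


def pue_if_it_given(it_mw: float, non_it_mw: float) -> float:
    """PUE when the quoted capacity is the IT load alone."""
    return (it_mw + non_it_mw) / it_mw


if __name__ == "__main__":
    # Figures from the excerpt; which reading applies is an assumption.
    print(f"4MW read as total load: PUE = {pue_if_total_given(4.0, 0.2):.3f}")
    print(f"4MW read as IT load:    PUE = {pue_if_it_given(4.0, 0.2):.3f}")
```

Both readings land at roughly 1.05, so the ambiguity barely matters for a back-of-the-envelope estimate.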


SINET Innovation Summit 2017



New York, NY. Attendees will hear from some of the greatest cyber defenders in industry. Tuesday, June 20, starting at 8:30 a.m., Times Center (242 West 41st Street). [$$$]


PSB-2018 Session on Text Mining and Visualization for Precision Medicine

Kohala Coast, Big Island, Hawaii. The Pacific Symposium on Biocomputing will be held January 3-7, 2018. “This session focuses on efforts where informatics researchers are actively collaborating with bench scientists and clinicians for the deployment of integrative approaches in precision medicine that could impact scientific discovery.” Deadline for submissions is August 1.


Tools & Resources

This new tool will help your newsroom create better email newsletters

Nieman Journalism Lab, Joseph Lichterman


The guide, called Opt In, offers a best-practice guide to starting and optimizing email newsletters with tips for design, revenue generation, content suggestions, metrics to follow, and more depending on what you want to accomplish with your newsletter.

The Quartz Directory of Essential Data

Quartz, Christopher Groskopf


A curated list of useful datasets published by important sources.

Does open data make you happy? An introduction to Kaggle Kernels

Medium, Megan Risdal


How much you like open and accessible data probably depends on the kind of person you are — I happen to like it a lot! But, it turns out that at a national scale, more open and accessible government data is positively correlated with happiness.

In this post, I want to share with you how I used Kaggle Kernels — our in-browser code execution environment — to explore two very interesting open datasets on Kaggle’s Datasets platform to come to this conclusion.


Full-time, non-tenured academic positions

Research Associate in Procedural Content Generation

Falmouth University, Digital Creativity Labs; Penryn, England
