Data Science newsletter – March 6, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for March 6, 2017


Data Science News

University Data Science News

Jennifer Preece at the University of Maryland urges citizen scientists to help preserve species diversity. She recommends seven apps.

SocArXiv has raised $20k from the libraries of MIT and UCLA to support the new preprint archive, which is housed at the University of Maryland and is currently being incubated by the Center for Open Science.

In 2015 Gert Storms et al. asked “Are we wasting a good crisis?” about the limp response of the field of psychology to numerous claims of data fabrication and other research fraud. Now Storms is on the editorial board of a journal within the American Psychological Association, refusing to accept studies that do not share data. Robert Greene, the journal’s editor in chief, asked Storms to resign because Storms’ “policy conflicts with that of the journal”. Since psychology is a field that obtains personally identifiable data which can be sensitive, many psychologists believe they are protecting their subjects’ identities when they refuse to release data to anonymous peer reviewers. Given the evidence of fraud within psychology and the (low-level but) pervasive fear of getting scooped in many academic fields, it is reasonable to develop field-specific strategies before iron-fisting a mandatory data sharing procedure. If there were technologies that allowed reviewers to use the data in a way that prevented local storage and audited it for anonymity before sharing, that might alleviate many privacy concerns for subjects. Scooping (i.e. a privacy concern for researchers) would be more difficult, but not impossible.

Dina Zielinski and Yaniv Erlich of the New York Genome Center and Columbia University are using DNA to compress, store, and decompress data in a teeny-tiny material footprint compared to current data storage techniques: “a one terabyte hard drive currently weighs around 150 grams. Using their methods, Erlich and Zielinski can fit 215,000 times as much data in a single gram of DNA.” This advance should ameliorate concerns that Moore’s Law will hit (or has hit) an asymptote. Using DNA as a storage device is still expensive, but prices are dropping quickly. Plus, DNA is so critical to biological research that it is impossible to imagine a time when it will be obsolete. The same cannot be said of floppy disks, zip drives, or USB drives. Erlich was also in the news this week discussing the difficulty of providing privacy protection to human gene data given how rapidly genomic knowledge is increasing.

Climate change is real. This prediction from thirty years ago was “prophetic”.

The Ohio State University received $2.4m from the Gordon and Betty Moore Foundation to build three new space observation units so that they can scan the entire night sky, every night. All data will be made publicly available.

UC-Berkeley introduces a student-led Data Scholars program to spread data science training among groups historically under-represented in STEM fields.

Claudia Geib has this week’s long-read on the social life of particle discoveries in physics. She provides compelling evidence that the physics community’s acceptance of new particles as scientifically relevant is tied closely to the social position of the team claiming the discovery. For instance, any particle potentially ‘discovered’ at CERN is likely to be more widely discussed than one discovered by, say, a Hungarian team of physicists.

The median starting salary for an assistant professor in statistics last year was $88,820. Glassdoor reports that data scientists make an average of $113,436 nationwide, putting them at the median range for a newly promoted full Professor in statistics. Yes, I realize it is unwise to compare means and medians, but Glassdoor doesn’t publish a median.
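A quick illustration of why the mean-versus-median caveat matters: in a right-skewed pay distribution, a few very large salaries pull the mean above the median. The salary figures below are invented for illustration and are not Glassdoor's or anyone else's data.

```python
from statistics import mean, median

# Hypothetical salaries in USD, invented for illustration only.
# One very high earner skews the distribution to the right.
salaries = [70_000, 85_000, 95_000, 100_000, 110_000, 120_000, 300_000]

average_salary = mean(salaries)    # pulled upward by the 300k outlier
median_salary = median(salaries)   # the middle value, unaffected by the outlier

# A positive gap means the mean overstates what a typical person earns.
gap = average_salary - median_salary
print(f"mean={average_salary:,.0f} median={median_salary:,.0f} gap={gap:,.0f}")
```

If data scientist pay is skewed like this, comparing Glassdoor's mean to a median for professors would likely flatter the data scientists somewhat.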

NYU psychology professor Gary Marcus, who joined Uber as the head of its AI lab four months ago when it acquired his AI firm Geometric Intelligence, is now leaving that post. On Facebook, he noted he is moving back to New York where he will continue as a “special advisor” to Uber. That was quick.

Our former Moore-Sloan Fellow Dan Cervone, who now works in sports, posted a great explainer on using win probability models (WPMs) with sports data.

The newly formed Global Life Sciences Data Resources Coalition “aim[s] to create a sustainable and accessible data infrastructure that will benefit scientists worldwide.” It will encompass stand-alone databases like Flybase and the Protein Data Bank in a funding model that is not reliant on short-term grants.

Elsewhere in data sharing, the New England Journal of Medicine partnered with the NIH to release a multi-year clinical dataset of blood pressure treatment data to researchers ahead of schedule. The original research team, led by Jackson Wright and called SPRINT, had spent over a decade gathering the data and were upset that “one-third of the 60 papers” they were planning to write got scooped when other researchers published using the data. There should be mechanisms for giving proper credit to Wright and the SPRINT team for gathering data so useful that hundreds of other researchers are using it to advance science. Gathering good data and sharing it is a force multiplier for scientific achievement and should be recognized as such regardless of how many papers the SPRINT team publishes.

Company Data Science News

“Judging by the number of companies talking up their amazing AI projects, the entire Fortune 500 went from bozo status to the Mensa society,” according to Matt Asay in a right-sizing of many companies’ “artificially inflated” claims of immense AI integration.

So which companies can back up their claims? Facebook, Amazon, Google, Apple, Spotify, and LinkedIn top my personal list of companies with well-integrated, robust AI. Fortune used a moderately more rigorous approach to identify the top fifty AI startups.

Martin Stumpe and Lily Peng of Google Research demonstrated a promising approach to cancer detection from pathology slides. The model improves substantially upon the detection of specific types of cancer. However, it is not trained to detect other cancer types or to spot evidence of non-cancerous disease the way pathologists are. Further, the model is allowed to turn up more false positives than pathologists. In other words, pathologists stand to improve their jobs, not to lose their jobs as a result of this technological advance.

Travis Deyle and Erik Schluntz of Cobalt Robotics teamed up with A-list designer Yves Behar to build a stylish security guard robot that detects security and maintenance problems and calls for help automatically. The Cobalt could put office security guards out of work.

Giant hedge fund Citadel sponsored 60 hackathons at universities around the world to identify and recruit top talent. For winner Eric Munsing, PhD student in civil engineering at UC-Berkeley, the hackathon made him aware that he could work in finance which, he reported, “does seem like an attractive path…the most intellectually stimulating”.

Tesla’s Autopilot does not perform well, according to Backchannel. Errors include unexpected braking, lack of expected braking (!), exiting at every right lane exit, and veering into oncoming lanes. Mobileye, the company producing Autopilot’s camera, pulled out of their partnership with Tesla citing safety precautions; Tesla said “commercial reasons” forced the break-up.

Apache Kafka, an open source stream-processing project, has a commercial company built on top of it called Confluent. Confluent can go out and raise venture capital…to the tune of $80m total, with a $50m round led by Sequoia with Benchmark and Index Ventures announced last week.

Cambridge Analytica, a data analytics consultancy, worked with the Trump campaign and is now under fire for claiming to have used psychographic profiling. The first problem is a privacy and ethics issue, which is largely mitigated by the second and third problems: Cambridge Analytica didn’t actually use psychographic profiling, nor is there evidence that any models they might have developed could function.

I was out with an unnamed colleague who works in sports analytics last night. We agreed that sports teams are often overly reliant on psychometrics that they may or may not actually be using in their stats department (which may or may not be an outside consulting firm). Sometimes I feel like I am selling tickets to the circus. The data science circus.

Google Cloud is buying Kaggle and will keep the service running. It’s a great way to gain an edge in recruiting global data science talent and “to combine forces with ImageNet creators Fei-Fei Li and Jia Li”.

Spotify has acquired audio detection startup Sonalytic for an undisclosed amount. Sonalytic’s technology can “identify songs, mixed content and audio clip” and “track copyright-protected material”. Spotify appears to be moving to combine its user-similarity model with a music-similarity model to improve its recommendations.

50 Companies Leading the Artificial Intelligence Revolution

Fortune, Brian O'Keefe, Nicolas Rapp


We know that artificial intelligence will soon reshape our world. But which companies will lead the way? To help answer that question, research firm CB Insights recently selected the “AI 100,” a list of the 100 most promising artificial intelligence startups globally. The private companies were chosen (from a pool of over 1,650 candidates) by CB Insights’ Mosaic algorithm, based on factors like financing history, investor quality, business category, and momentum. A look at the 50 largest startups on the list, ranked by total funds raised, shows that investment in AI is surging worldwide. But, for now at least, the U.S. appears to be leading the revolution.

AI won’t kill you, but ignoring it might kill your business, experts say

Chicago Tribune, Blue Sky Innovation blog, Cheryl V. Jackson


Relax. Artificial intelligence is making our lives easier, but won’t be a threat to human existence, according to a panel of practitioners in the space.

“One of the biggest misconceptions today about autonomous robots is how capable they are,” said Brenna Argall, faculty research scientist at the Rehabilitation Institute of Chicago, during a Chicago Innovation Awards event Wednesday.

“We see a lot of videos online showing robots doing amazing things. What isn’t shown is the hours of footage where they did the wrong thing,” she said.

Technique reveals never-before-seen states as protein unfolds

Chemical & Engineering News, Stu Borman


An improved version of single-molecule force spectroscopy (SMFS) has enabled researchers to study the unfolding of a membrane protein with much higher time resolution and force precision than ever before, revealing a multitude of new details.

[1703.00534] Skin cancer reorganization and classification with deep neural network

arXiv, Computer Science > Computer Vision and Pattern Recognition; Hao Chang


As one kind of skin cancer, melanoma is very dangerous. Dermoscopy based early detection and recarbonization strategy is critical for melanoma therapy. However, well-trained dermatologists dominant the diagnostic accuracy. In order to solve this problem, many effort focus on developing automatic image analysis systems. Here we report a novel strategy based on deep learning technique, and achieve very high skin lesion segmentation and melanoma diagnosis accuracy: 1) we build a segmentation neural network (skin_segnn), which achieved very high lesion boundary detection accuracy; 2) We build another very deep neural network based on Google inception v3 network (skin_recnn) and its well-trained weight. The novel designed transfer learning based deep neural network skin_inceptions_v3_nn helps to achieve a high prediction accuracy.

Peer-review activists push psychology journals towards open data

Nature News & Comment, Gautam Naik


An editor on the board of a journal published by the prestigious American Psychological Association (APA) has been asked to resign in a controversy over data sharing in peer review.

Gert Storms — who says he won’t step down — is one of a few hundred scientists who have vowed that, from the start of this year, they will begin rejecting papers if authors won’t publicly share the underlying data, or explain why they can’t.

Assisting Pathologists in Detecting Cancer with Deep Learning

Google Research Blog; Martin Stumpe and Lily Peng


To address these issues of limited time and diagnostic variability, we are investigating how deep learning can be applied to digital pathology, by creating an automated detection algorithm that can naturally complement pathologists’ workflow. We used images (graciously provided by the Radboud University Medical Center) which have also been used for the 2016 ISBI Camelyon Challenge to train algorithms that were optimized for localization of breast cancer that has spread (metastasized) to lymph nodes adjacent to the breast.

The results? Standard “off-the-shelf” deep learning approaches like Inception (aka GoogLeNet) worked reasonably well for both tasks, although the tumor probability prediction heatmaps produced were a bit noisy. After additional customization, including training networks to examine the image at different magnifications (much like what a pathologist does), we showed that it was possible to train a model that either matched or exceeded the performance of a pathologist who had unlimited time to examine the slides.

Scientists Develop New Tool to Monitor Reef Health

Eos, Sarah Stanley


Coral reefs around the world face destruction due to ocean acidification, pollution, overfishing, and other factors. These stressors affect reef health on short timescales, but scientists have struggled to capture the details of these rapid changes. Takeshita et al. have now developed and successfully tested a new tool that could improve real-time reef monitoring worldwide.

The Benthic Ecosystem and Acidification Measurements System (BEAMS) is the first tool that can make autonomous, simultaneous measurements of two important reef health indicators: net community production (NCP) and net community calcification (NCC). NCP is the balance between organic production and respiration, and NCC is the balance between calcification (reef building) and dissolution (reef erosion).
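As a rough sketch of those two definitions, each balance can be written as a simple difference. The function names, the sign convention (positive means net production or net reef building), and the sample numbers are assumptions for illustration, not code or measurements from Takeshita et al.

```python
def net_community_production(organic_production, respiration):
    """NCP: organic production minus respiration (assumed sign convention)."""
    return organic_production - respiration


def net_community_calcification(calcification, dissolution):
    """NCC: calcification (reef building) minus dissolution (reef erosion)."""
    return calcification - dissolution


# Hypothetical rates in arbitrary but consistent units, invented for illustration.
ncp = net_community_production(organic_production=12.0, respiration=9.5)
ncc = net_community_calcification(calcification=4.0, dissolution=5.0)

# Under this convention a negative NCC means the reef is dissolving
# faster than it is being built.
reef_is_eroding = ncc < 0
print(ncp, ncc, reef_is_eroding)
```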

A Two-Way Relationship Between the Atlantic and Pacific Oceans

Eos, Brendan Bane


Scientists have long known of an apparent one-way relationship between the Atlantic and Pacific oceans: Unusually warm sea surface temperatures in the eastern and central tropical Pacific during El Niño can curb tropical cyclones in the Atlantic. However, in a new study, Patricola et al. sought to find out if varying sea surface temperatures in the Atlantic Ocean remotely influence tropical cyclones in the eastern and central North Pacific.

Using cyclone records from 1950 to 2015 and simulated idealized climate cycles, the authors found that when the Atlantic’s surface warms, subsequent tropical cyclones in the eastern Pacific are less frequent and weaker.

Data Science Job Report 2017: R Passes SAS, But Python Leaves Them Both Behind

Bob Muenchen


I’ve just updated another section of The Popularity of Data Science Software. It is reproduced below to save you the trouble of reading the entire article.

This Speck of DNA Contains a Movie, a Computer Virus, and an Amazon Gift Card

The Atlantic, Ed Yong


Yaniv Erlich and Dina Zielinski from the New York Genome Center and Columbia University encoded the movie, along with a computer operating system, a photo, a scientific paper, a computer virus, and an Amazon gift card.

They used a new strategy, based on the codes that allow movies to stream reliably across the Internet. In this way, they managed to pack the digital files into record-breakingly small amounts of DNA. A one terabyte hard drive currently weighs around 150 grams. Using their methods, Erlich and Zielinski can fit 215,000 times as much data in a single gram of DNA. You could fit all the data in the world in the back of a car.
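The arithmetic behind that comparison is easy to check. The sketch below assumes the quote means one gram of DNA holds 215,000 times the contents of a one-terabyte drive, and it uses decimal units (1 PB = 1,000 TB); both readings are assumptions, not statements from the article.

```python
# Figures taken from the excerpt above.
HARD_DRIVE_TB = 1            # capacity of the reference hard drive, in terabytes
HARD_DRIVE_GRAMS = 150       # approximate weight of that drive, in grams
DNA_MULTIPLIER = 215_000     # "215,000 times as much data in a single gram"

# Data held by one gram of DNA under the assumed reading of the quote.
dna_tb_per_gram = HARD_DRIVE_TB * DNA_MULTIPLIER
dna_pb_per_gram = dna_tb_per_gram / 1000   # decimal petabytes

# Per-gram density advantage: the drive spreads 1 TB over 150 g,
# so the ratio is the multiplier times the drive's weight.
density_ratio = DNA_MULTIPLIER * HARD_DRIVE_GRAMS / HARD_DRIVE_TB

print(f"{dna_pb_per_gram:.0f} PB per gram, {density_ratio:,.0f}x denser by weight")
```

On those assumptions a gram of DNA holds about 215 petabytes, which is roughly 32 million times the storage density of the hard drive by weight.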

Climate change computer model vindicated 30 years later by what has actually happened

The Independent (UK), Ian Johnston


Nearly 30 years ago, scientists developed a computer model of the Earth’s climate that predicted the level of global warming – to the ridicule of ‘sceptics’ at a time when there still seemed to be a debate over the issue.

Now two leading researchers have compared the model’s results with what actually happened over the last three decades and, to their surprise, found they were “very similar”.

Why literature is the ultimate big-data challenge

The Economist, Prospero


In a few decades, statistical analysis of literature has gone from crackpot theorising to cutting-edge research

Becoming a Data Scientist: Profiling Cisco’s Data Science Certification Program

Kaggle, No Free Hunch blog


In an interview with Kristen Burton, Director for the Enterprise Data Science Office and Digital Process Transformation, and Justin Norman, Manager of Cisco’s Enterprise Data Science Office, I learned about Cisco’s Data Science Certification Program. Now in its 4th year, the continuous education program is helping Cisco develop big data skills in their employees in support of Cisco’s digital transformation. For many companies, Cisco’s tactics might serve as a helpful blueprint for developing similar learning plans. Plus, for every level of the four-stage program, I include tips and resources for readers forging their own path towards a career in data science.


Julia in Finance Seminar in London on the 16th of March

Julia Computing, CQF Institute


London, England, Thursday, March 16 at 6 p.m., Fitch Learning (The Corn Exchange, 55 Mark Lane) [free, please register]


National High School Design Competition: Good For Al

Problems with the availability of food are not specific to one group; they occur in both rural and urban communities and pose challenges to people of all ages, races, and household structures. Deadline for submissions is March 20.

Challenge | Erie Hack

The Erie Hack is a data and engineering competition that unites coders, developers, engineers, and water experts to generate enduring solutions to Lake Erie’s biggest challenges. The competition includes $100,000 in prizes for the most creative and effective hacks. Deadline to register teams to participate is April 12.

USDA offers Conservation Innovation Grants

Proposed projects must occur within Oregon and may be county based or statewide in scope. CIG proposals must involve farmers and ranchers who meet the eligibility criteria for the Environmental Quality Incentive Program. Deadline to submit applications is April 28.

NYU Center for Data Science News

Using Robots To Make Decisions

NYU Center for Data Science


Are robots taking over? Vasant Dhar, a data scientist and professor at CDS, focuses his research on the balance between automation and humans and has recently published his work in Harvard Business Review.

Tools & Resources

Join the InfluxData Developer Community

InfluxData, Paul Dix


With this post, I’m announcing a Discourse forum with an InfluxData flavor.

From today we will begin routing any technical questions on usage, code samples, how to, and general questions to the forum.

Deep and Hierarchical Implicit Models

Dustin Tran


I think this is quite a dense paper—chock full of simple ideas that are rife with deep implications. There are many nuggets of wisdom that I could ramble on about, and I just might in separate blog posts.

As a practical example, we show how you can take any standard neural network and turn it into a deep implicit model: simply inject noise into the hidden layers. The hidden units in these layers are now interpreted as latent variables. Further, the induced latent variables are astonishingly flexible, going beyond Gaussians (or exponential families (Ranganath, Tang, Charlin, & Blei, 2015)) to arbitrary probability distributions. Deep generative modeling could not be any simpler!
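To make the noise-injection idea concrete, here is a minimal sketch in plain Python. It is not the paper's code: the tiny two-layer network, the choice of Gaussian noise added to the hidden pre-activations, and every name below are assumptions made for illustration.

```python
import math
import random


def noisy_mlp(x, w1, w2, rng, noise_scale=1.0):
    """One forward pass of a two-layer network with noise injected into the hidden layer.

    Gaussian noise is added to each hidden pre-activation, so the hidden
    units behave like latent variables and the output becomes a random draw.
    """
    hidden = []
    for row in w1:
        pre_activation = sum(w * xi for w, xi in zip(row, x))
        pre_activation += rng.gauss(0.0, noise_scale)  # the injected noise
        hidden.append(math.tanh(pre_activation))
    return sum(w * h for w, h in zip(w2, hidden))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical random weights: 2 inputs, 4 hidden units, 1 output.
w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
w2 = [rng.uniform(-1, 1) for _ in range(4)]
x = [0.5, -1.0]

# Repeated passes on the same input give different outputs: samples from
# a distribution that is only defined implicitly, by the sampling procedure.
samples = [noisy_mlp(x, w1, w2, rng) for _ in range(1000)]

# With the noise switched off the network is an ordinary deterministic function.
baseline = noisy_mlp(x, w1, w2, rng, noise_scale=0.0)
print(min(samples), max(samples), baseline)
```

The point of the sketch is only that the same input now maps to a whole distribution of outputs; fitting such a model to data takes the inference machinery the paper describes.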


Full-time positions outside academia

Commissioning Editor

SAGE; London, England