Data Science newsletter – February 27, 2017

Data Science Newsletter features journalism, research papers, events, tools/software, and jobs for February 27, 2017


Data Science News

UW, UBC’s Cascadia urban data project gets $1M from Microsoft

The Seattle Times, Matt Day


The Cascadia Urban Analytics Cooperative, seeded by a $1 million grant from Microsoft, will aim to bring data science to bear on issues faced by communities in Washington state and British Columbia.

Announcing the first SHA1 collision

Google Online Security Blog; Marc Stevens (CWI Amsterdam), Elie Bursztein (Google), Pierre Karpman (CWI Amsterdam), Ange Albertini (Google), Yarik Markov (Google), Alex Petit Bianco (Google), Clement Baisse (Google)


Cryptographic hash functions like SHA-1 are a cryptographer’s swiss army knife. You’ll find that hashes play a role in browser security, managing code repositories, or even just detecting duplicate files in storage. Hash functions compress large amounts of data into a small message digest. As a cryptographic requirement for wide-spread use, finding two messages that lead to the same digest should be computationally infeasible. Over time however, this requirement can fail due to attacks on the mathematical underpinnings of hash functions or to increases in computational power.

Today, more than 20 years after SHA-1 was first introduced, we are announcing the first practical technique for generating a collision.
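To make the excerpt's terms concrete, here is a minimal sketch in Python using the standard library's hashlib. It shows that inputs of very different sizes map to a digest of the same fixed length, and what "collision" means. The helper names sha1_hex and is_collision are illustrative, not taken from the Google post, and the sketch does not reproduce the attack itself.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as 40 hexadecimal characters."""
    return hashlib.sha1(data).hexdigest()


def is_collision(a: bytes, b: bytes) -> bool:
    """True when two different messages share the same SHA-1 digest."""
    return a != b and sha1_hex(a) == sha1_hex(b)


short_msg = b"abc"
long_msg = b"a" * 1_000_000  # one million bytes

# Both digests are 160 bits (40 hex characters), whatever the input size.
print(sha1_hex(short_msg))  # a9993e364706816aba3e25717850c26c9cd0d89d
print(len(sha1_hex(long_msg)))  # 40

# Ordinary, unrelated inputs are not expected to collide.
print(is_collision(b"abc", b"abd"))  # False
```

A practical collision attack is a way to construct two distinct inputs for which is_collision would return True.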

Tackling the Promise and Pitfalls of Data Science

National Institutes of Health, Data@NIH blog, Patti Brennan


The amount and diversity of data generated by NIH-funded research programs continue to grow rapidly; safe, scalable storage solutions, new analytic approaches, and an adaptable workforce are urgently needed. As this unprecedented revolution in biomedical information unfolds and NIH looks to the future of data science, some pitfalls remain! We must ensure that researchers have the ability to make meaningful use of this increasingly massive biomedical data resource. It is timely and critical for NIH to identify and implement new strategies to improve data discoverability, utility, and sustainability, including moving many large data sets into the cloud and making them adherent to the FAIR Principles (Findable, Accessible, Interoperable, and Reusable). Success in meeting this challenge will require leveraging the findings from the Big Data to Knowledge (BD2K) program, and a major infusion of resources.

Over the next several months, I will be working with NIH’s 27 ICs to develop efficient strategies to improve data discoverability, utility, and sustainability for the biomedical research community. I will work with the Division of Program Coordination, Planning, and Strategic Initiatives to engage the various pilot projects that are identifying critical points of success for future data science efforts. As part of this effort, the second phase of NIH’s cornerstone data science initiative, the Big Data to Knowledge (BD2K) program, will include investments to accelerate progress in the development of these new strategies through a pilot program for an NIH data Commons.

[1702.05532] Fast machine learning models of electronic and energetic properties consistently reach approximation errors better than DFT accuracy

arXiv, Physics > Chemical Physics; Felix A. Faber, Luke Hutchison, Bing Huang, Justin Gilmer, Samuel S. Schoenholz, George E. Dahl, Oriol Vinyals, Steven Kearnes, Patrick F. Riley, O. Anatole von Lilienfeld


We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of thirteen electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed with learning curves which report approximation errors as a function of training set size. Molecular structures and properties at hybrid density functional theory (DFT) level of theory used for training and testing come from the QM9 database [Ramakrishnan et al, Scientific Data 1 140022 (2014)] and include dipole moment, polarizability, HOMO/LUMO energies and gap, electronic spatial extent, zero point vibrational energy, enthalpies and free energies of atomization, heat capacity and the highest fundamental vibrational frequency. Various representations from the literature have been studied (Coulomb matrix, bag of bonds, BAML and ECFP4, molecular graphs (MG)), as well as newly developed distribution based variants including histograms of distances (HD), and angles (HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian ridge regression (BR) and linear regression with elastic net regularization (EN)), random forest (RF), kernel ridge regression (KRR) and two types of neural networks, graph convolutions (GC) and gated graph networks (GG). We present numerical evidence that ML model predictions for all properties can reach an approximation error to DFT which is on par with chemical accuracy. These findings indicate that ML models could be more accurate than DFT if explicitly electron correlated quantum (or experimental) data was provided.
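For readers unfamiliar with the representations named in the abstract, the following is a minimal sketch of one of them, the Coulomb matrix, in plain Python. The abstract does not define it. The formula below is the one commonly given in the literature (diagonal 0.5·Z^2.4, off-diagonal Z_i·Z_j divided by the interatomic distance), and the function name, the example molecule, and the use of atomic units are assumptions made for illustration.

```python
import math


def coulomb_matrix(charges, coords):
    """Build the Coulomb matrix for a molecule.

    charges: nuclear charges Z_i, one per atom.
    coords:  3D positions, one (x, y, z) tuple per atom (assumed in bohr).
    """
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: 0.5 * Z_i ** 2.4
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Coulomb repulsion between nuclei i and j
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Hypothetical example: two hydrogen nuclei 1.4 bohr apart.
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
for row in h2:
    print(row)
```

A regressor such as kernel ridge regression would then be trained on a flattened or sorted form of this matrix; that step is omitted here.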

Botched Zika testing at D.C. public health lab was a failure of ‘basic arithmetic’

The Washington Post, Aaron C. Davis


When Anthony Tran took over the District’s public health lab late last year, he had a feeling something was wrong with its testing for the Zika virus. He had just come from the public health lab in New York City, where technicians had been finding markers for Zika in the blood of arriving travelers almost every day. In the smaller, but still international, city of Washington, the same test was negative — every time.

Soon, U.S. health officials joined in Tran’s concern: Samples supplied by the federal government of the frightening, mosquito-borne virus that were tested in the lab as a control were appearing as if they contained no virus.

“I knew then that something was tremendously wrong,” Tran said late last week in an interview. He halted testing, and with help from analysts at the Centers for Disease Control and Prevention, traced the problem to a mistake that any high school chemistry student could understand.

Company Data Science News

Palantir is combining “information on a subject’s schooling, family relationships, employment information, phone records, immigration history, foreign exchange program status, personal connections, biometric traits, criminal records, and home and work addresses” to aid Immigration and Customs Enforcement (ICE) in immigration and deportation cases.

The W3C released standards for web annotations that spell out “the data model, protocol and vocabulary for annotations.” This move will likely change the storage model for data from web commenting, especially if browsers become the commenting host rather than the current model where comments live on individual websites or in a handful of commenting services.

As the tech side of health care heats up (keep watching precision medicine), executives at top ten tech companies are leaving their jobs to launch health tech startups.

Apple is continuing to beef up its AI recruiting ability by announcing it will open a new office in Seattle, which has become Silicon Valley North. Apple is also endowing a professorship in AI and machine learning at UDub.

Chipmaker Intel is competing with GPU maker Nvidia. The question for hardware makers: will specialized chips designed to accommodate machine learning and neural nets eventually move onto the CPU? Other once stand-alone silicon has. Hat tip to The Economist’s coverage of AI. It is routinely excellent.

Stanford Prof and Pinterest Chief Scientist Jure Leskovec announced the launch of Pinterest Labs, another company where we see top staff splitting their time between industry and academia.

NASA Selects Proposals for First-Ever Space Technology Research Institutes



NASA has selected proposals for the creation of two multi-disciplinary, university-led research institutes that will focus on the development of technologies critical to extending human presence deeper into our solar system.

The new Space Technology Research Institutes (STRIs) created under these proposals will bring together researchers from various disciplines and organizations to collaborate on the advancement of cutting-edge technologies in bio-manufacturing and space infrastructure, with the goal of creating and maximizing Earth-independent, self-sustaining exploration mission capabilities.

“NASA is establishing STRIs to research and exploit cutting-edge advances in technology with the potential for revolutionary impact on future aerospace capabilities,” said Steve Jurczyk, associate administrator for NASA’s Space Technology Mission Directorate in Washington.

Is that a boy or a girl? – Exploring a neural network’s construction of gender

Medium, Kerry Rodden


I’ve always been curious about what makes someone “look” male or female, probably because I’m female but have never looked conventionally feminine. I was a tomboy as a child and remained one as an adult, and I’m also tall, with unruly hair that’s easiest to keep short. So strangers often assume that I’m male: in restaurants and on planes, I’m often addressed as “sir”.

People who know me well are usually surprised that anyone could think I was male. But I don’t find it that surprising — we don’t tend to really look closely at strangers, and just make broad assumptions about them based on their outlines. Children are often an exception — they will scrutinize me for a while and then ask their embarrassed parents, “is that a boy or a girl?”

Knowing that there has been huge progress in recent years in using machine learning to classify images, I got curious: could I train a model to classify photos of people according to their gender? What “rules” would it learn, for making the decision? And how would it classify me?

10 Breakthrough Technologies 2017

MIT Technology Review


These technologies all have staying power. They will affect the economy and our politics, improve medicine, or influence our culture. Some are unfolding now; others will take a decade or more to develop. But you should know about all of them right now.

Meet the Math Professor Who’s Fighting Gerrymandering With Geometry

The Chronicle of Higher Education, Shannon Najmabadi


A Tufts University professor has a proposal to combat gerrymandering: give more geometry experts a day in court.

Moon Duchin is an associate professor of math and director of the Science, Technology and Society program at Tufts. She realized last year that some of her research about metric geometry could be applied to gerrymandering — the practice of manipulating the shape of electoral districts to benefit a specific party, which is widely seen as a major contributor to government dysfunction.

At first, she says, her plans were straightforward and research-oriented — “to put together a team to do some modeling and then maybe consult with state redistricting commissions.” But then she got more creative. “I became convinced that it’s probably more effective to try to help train a big new generation of expert witnesses who know the math side pretty well,” she says.

Machine learning is the new “plastics” and four more HIMSS17 observations

MedCity News, Dr. John D. Halamka


This week, 42,000 of my closest friends each walked an average of 5 miles per day through the Orlando Convention Center at the annual HIMSS conference. One journalist told me “It’s overwhelming. You do your best to look professional and wear comfy shoes!”

After 50 meetings, and 12 meals in 3 days, here are my impressions of the experience.

1. Wearables, while still relevant, have gone from the peak of the hype curve to the trough of disillusionment.

Sanger Institute’s COSMIC database expands cancer cloud capabilities at the Institute for Systems Biology

Institute for Systems Biology, Wellcome Trust Sanger Institute


The Wellcome Trust Sanger Institute’s Catalog of Somatic Mutations in Cancer (COSMIC) team announces a new agreement to provide their data to the U.S.-based Institute for Systems Biology (ISB).

COSMIC is an expert-curated cancer mutation database, and is the world’s largest and most comprehensive resource for exploring the impact of somatic mutations in human cancers.

With this agreement, ISB has embedded the COSMIC data within the ISB Cancer Genomics Cloud (CGC), which is a cloud-based platform that uses Google BigQuery technology to bring unprecedented computing power to researchers around the world.

Why Former Tech Execs Are Leaving Google And Twitter To Start Health Care Companies

Fast Company, Christina Farr


When Stephanie Tilenius, a former senior executive at eBay and Google, decided to start a health-coaching app, many in her network were incredulous. “Everyone thought I was crazy,” she recalls. “Some people loved that I wanted to do something to help others, but a lot socially ostracized me.”

For many entrepreneurs, the health sector offers an enticing opportunity—with strings attached. It’s an estimated $3 trillion market and is still dominated by a cadre of traditional players. But many in the technology sector have shied away from the industry after witnessing many high-profile failures and realizing that change doesn’t happen quickly. “Silicon Valley operators and investors see that health care needs better technology,” explains veteran health IT consultant Ben Rooks. “But they learn quickly that health care isn’t about radical disruption; it’s about slow evolution.”

Despite the challenges, a small but growing group of former technologists from companies like Google and Twitter are in it for the long haul. In many cases, their motivations are deeply personal.

UC Berkeley to Join Major Tech Companies in Advancing 5G Networks

The Daily Californian, Ani Vahradyan


Intel announced in a press release Tuesday that UC Berkeley will join several major tech companies in forming the 5G Innovators Initiative, which is focused on bringing technology and academia together to advance fifth generation networks in the United States.

In collaboration with Intel, Ericsson, Honeywell and General Electric, the campus will work toward transforming network infrastructure in a variety of fields, including telecommunications, healthcare, finance and security. Fifth generation, or 5G, mobile networks, are a new industry standard and they move beyond communications to adapt to the growing needs of various industries, according to Nimish Radia, director of research at the Bay Area branch of Ericsson.

Do political beliefs affect online dating? Q&A with political scientist Gregory Huber

Yale University, YaleNews


How did political orientation compare to religious orientation in driving people’s interest in potential dates?

Religion matching is very important. Catholics want to date other Catholics. Jews want to date other Jews, and so on. That effect is actually quite a bit larger than the political effect, which is still reasonably significant.

Interestingly, disinterest in politics has an effect. People who aren’t interested in politics are not that excited about dating people who are really interested in politics. If you know people who are not interested in politics, then this strikes me as completely accurate.

The Story Behind The Development Of A Brain-computer Interface

Stanford Medicine, Scope Blog


Earlier this week my colleague shared some very cool news: A group of researchers here developed an experimental brain-controlled prosthesis that allows people with paralysis to type on a keyboard just by thinking about moving their hands. The scientists had a long journey reaching this point, and writer Elizabeth Svoboda shared some of it in an online piece.

HIMSS Wrap-up: Digital health news and views from the annual conference



This week was the HIMSS Annual Conference, the biggest week in Health IT. We’ve been covering some of the major digital health news from the conference right along, but today we’re bringing it all together in one wrap-up of the event.

We started off the conference with our own event (in partnership with the Personal Connected Health Alliance), the Digital and Personal Connected Health event, where we heard from a number of providers working on digital health projects. Read all about that event below, and then read on for a roundup of more news from the show.

Foundation Data Science News

$11.5m in grants is being given out to fight racial bias in policing with data science.

The Gates Foundation paid five designers to visualize international health data. Yeah, this is click bait in the sense that none of you are likely to make animations any time soon, but it’s substantively worth it. Saving lives!!

Cofounder of the Julia Language, Stefan Karpinski, explains why they decided to create a new language for scientific computing in the first place. Wasn’t Python enough?

Andreas Mueller of scikit-learn and Columbia University talks about using grant funding to improve women’s participation in open source communities.

Risa Lavizzo-Mourey is leaving the Robert Wood Johnson Foundation, where she served as president for fourteen years, to become a university professor at the University of Pennsylvania.

Apple is expanding its Seattle offices to focus on AI and machine learning

The Verge, James Vincent


In many ways, the tech world’s AI arms race is really a fight for talent. Skilled engineers are in short supply, and Silicon Valley’s biggest companies are competing to nab the best minds from academia and rival firms. Which is why it makes sense that Apple has announced it’s expanding its offices in Seattle, where much of its AI and machine learning work is done.

Seattle is home not only to the University of Washington and its renowned computer science department, but also the Allen Institute for Artificial Intelligence. Microsoft and Amazon are headquartered nearby, and AI startups are finding a home in the region, too. Last August, Apple even bought a Seattle-based machine learning and artificial intelligence startup named Turi for an estimated $200 million, and the team is said to be moving into Apple’s offices at Two Union Square as part of the expansion.

Novel ‘barcode’ tracking of T cells in immunotherapy patients identifies likely cancer-killers

Fred Hutchinson Cancer Center


A new discovery by researchers at the Fred Hutchinson Cancer Research Center in Seattle makes an important step in identifying which specific T cells within the diverse army of a person’s immune system are best suited to fight cancer.

The findings were published February 24 in Science Immunology.

“We found that the cells in each patient’s immune system that will ultimately have a clinical effect are incredibly rare,” said Dr. Aude Chapuis, lead author of the paper and a member of the Clinical Research Division at Fred Hutch. “Knowing what we’ve found, we can now refine the selection of the cells that we will ultimately use for adoptive T cell transfer, so that the cells persist and keep the tumors at bay longer in our patients.”

Microsoft $1M gift for new UW, UBC urban data partnership

University of Washington, eScience Institute


The University of Washington and the University of British Columbia are partnering on a new collaboration called the Cascadia Urban Analytics Cooperative (CUAC), which will connect researchers, students and public stakeholders working on urban issues across the region.

A $1 million donation from Microsoft will allow the two universities to work together using data science to research, innovate and discover sustainable solutions to civic problems in areas such as homelessness, health and transportation.



National Institutes of Health, NIH Bioinformatics Special Interest Group


Bethesda, MD. The third annual celebration of Pi Day at NIH takes place on Tuesday, March 14. [free, registration required]

NYU Center for Data Science News

Voices of New York

NYU Center for the Humanities


Voices of New York, created by Renee Blake and her students, wants to hear the voices of immigrant communities in New York City (NYC), and learn about the people behind them. The students have traveled throughout the boroughs of NYC to seek out neighborhoods and communities where ethnic cultures are thought to be flourishing.

Student Research: How Can Data Science Improve National Security?

NYU Center for Data Science


After the CDS Academy Awards last week, we caught up with some of our student winners to find out more about their work.

Scooping up the prize for the project with the greatest social impact was Yiqiu Shen, Zemin Yu, and Xinsheng Zhang’s fascinating research on how data science can be used to combat terrorism.

A major challenge facing us today is how to quickly identify the groups responsible for terrorist attacks so that the relevant individuals can be apprehended. With roughly 3290 unique terrorist groups around the globe, each with their own set of characteristics and motivations, law enforcement agencies require additional tools to help keep our country safe.

Tools & Resources

How Chief Data Officers Can Get Their Companies to Collect Clean Data

Harvard Business Review, Gahl Berkooz


In analytics, nothing matters more than data quality. The practical way to control data quality is to do it at the point where the data is created. Cleaning up data downstream is expensive and not scalable, because data is a byproduct of business processes and operations like marketing, sales, plant operations, and so on. But controlling data quality at the point of creation requires a change in the behaviors of those creating the data and the IT tools they use.

Enter the chief data officer, or CDO. CEOs are increasingly adding the CDO role to their management teams to tackle the big business issues that come with data. Plenty of CDOs want to improve data quality, but motivating this change requires that CDOs create new organizational incentives and processes. Without the ability to do both, their efforts will fall flat.

Tips From 5 Top Designers On Making Unwieldy Data Tangible

Fast Company, Meg Miller


The Gates Foundation partnered with Fast Company to ask five top designers to translate this year’s global health data. Here’s how they did it.

An experiment in open source

GitHub – deptofdefense


The U.S. Department of Defense struggles with open licensing of its software and seeks feedback on a proposed standard open source license agreement.

How we made TensorFlow run on a Raspberry Pi using Rust

Medium, Snips Blog, Thibaut Lorrain


TLDR: They needed to build a new tensorflow_c library to do it.

Reinforcement Learning in R

R Blog


This slide deck “is positioned at the intersection of teaching the basic idea of reinforcement learning and providing practical insights into R.”


Career Advice

Mathematicians becoming data scientists: Should you? How to?

Jordan Ellenberg, Quomodocumque blog

Internships and other temporary positions

PhD Research Fellow in Humanitarian Technology

University of Agder; Grimstad, Norway

Full-time positions outside academia


The Human Diagnostic Project; San Francisco, CA
