Data Science newsletter – March 22, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for March 22, 2018


Data Science News

Science PhDs lead to enjoyable jobs

Nature, Career News, Chris Woolston


As universities around the world award science PhDs at an ever-increasing rate, some doctoral students might wonder whether the degree is still worth all the time, effort and sacrifice.

But two recent projects tracking the journeys of PhD holders in the United Kingdom and Canada offer reason for optimism: graduates in the sciences and other fields are highly employable, even if they don’t always end up where they expected. “There’s a lot of pessimism about an oversupply of PhDs,” says Sally Hancock, an education researcher at the University of York, UK, who led the study in her nation — one of only a few of its kind worldwide. “These data can help demystify what happens next.”

Facebook Exit Hints at Dissent on Handling of Russian Trolls

The New York Times, Nicole Perlroth, Sheera Frenkel and Scott Shane


As Facebook grapples with a backlash over its role in spreading disinformation, an internal dispute over how to handle the threat and the public outcry is resulting in the departure of a senior executive.

The impending exit of that executive — Alex Stamos, Facebook’s chief information security officer — reflects heightened leadership tension at the top of the social network. Much of the internal disagreement is rooted in how much Facebook should publicly share about how nation states misused the platform and debate over organizational changes in the run-up to the 2018 midterm elections, according to current and former employees briefed on the matter.

Cosmologists Create Largest Simulation of Galaxy Formation, Break Their Own Record

Gauss Centre for Supercomputing


The vastness of our galaxy—let alone our entire universe—means experiments to understand our origins are expensive, difficult, and time consuming. In fact, experiments are impossible for studying certain aspects of astrophysics, meaning that in order to gain greater insight into how galaxies formed, researchers rely on supercomputing.

In an attempt to develop a more complete picture of galaxy formation, researchers from the Heidelberg Institute for Theoretical Studies, the Max-Planck Institutes for Astrophysics and for Astronomy, the Massachusetts Institute of Technology, Harvard University, and the Center for Computational Astrophysics in New York have turned to supercomputing resources at the High-Performance Computing Center Stuttgart (HLRS), one of the three world-class German supercomputing facilities that comprise the Gauss Centre for Supercomputing (GCS). The resulting simulation will help to verify and expand on existing experimental knowledge about the universe’s early stages.

Learning to Represent Words by how They’re Spelled

Machine Learning @ Georgia Tech, Mark Riedl


A fundamental question in Natural Language Processing (NLP) is how to represent words. If we have a paragraph we want to translate, or a product review we want to classify as positive or negative, or a question we want to answer, ultimately the easiest building block to start from is the individual word. The main problem of this approach is that treating each word as just a symbol loses a lot of information. How can we tell from such a representation that the relationship between the symbol PAGE and the symbol PAPER is not the same as that between PAGE and MOON?

Some popular techniques exist that try to learn an abstract representation which identifies these relationships and preserves them. In essence, what these methods do is go over a huge body of text (a corpus), like the entire English Wikipedia, word by word, and come up with a representation assigned to each word. Using mathematical operations over these resulting representations, our model can then know that PAGE is very “similar” to PAPER, DOG is similar to CAT, to KENNEL and to the verb BARK, and so on. These methods are strong enough to also represent analogies: the “difference” between MAN and WOMAN is very similar to that between KING and QUEEN.
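The "mathematical operations" mentioned above are typically cosine similarity and vector arithmetic. The sketch below is a minimal illustration, not the method from the article: the five-dimensional vectors are hand-picked, hypothetical values (a real model learns hundreds of dimensions from a corpus), and the helper names `cosine` and `analogy` are our own.

```python
import math

# Hypothetical toy vectors, hand-picked for illustration only.
# Dimensions (assumed): royalty, male, female, writing, sky.
VECTORS = {
    "king":  [1.0, 1.0, 0.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0, 0.0],
    "page":  [0.0, 0.0, 0.0, 1.0, 0.0],
    "paper": [0.0, 0.0, 0.0, 0.9, 0.1],
    "moon":  [0.0, 0.0, 0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors=VECTORS):
    """Return the word closest to vec(b) - vec(a) + vec(c), skipping a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


if __name__ == "__main__":
    print("page vs paper:", cosine(VECTORS["page"], VECTORS["paper"]))
    print("page vs moon: ", cosine(VECTORS["page"], VECTORS["moon"]))
    print("man is to woman as king is to:", analogy("man", "woman", "king"))
```

With these toy values, PAGE scores far closer to PAPER than to MOON, and the MAN/WOMAN offset applied to KING lands on QUEEN. Learned embeddings behave similarly, but only approximately.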

Robots Are Coming To Pick Your Berries. So Far, They’re Not So Great At It

NPR, The Salt blog, Dan Charles


Robots have taken over many of America’s factories. They can explore the depths of the ocean, and other planets. They can play ping-pong.

But can they pick a strawberry?

“You kind of learn, when you get into this — it’s really hard to match what humans can do,” says Bob Pitzer, an expert on robots and co-founder of a company called Harvest CROO Robotics. (CROO is an acronym. It stands for Computerized Robotic Optimized Obtainer.)

Any 4-year-old can pick a strawberry, but machines, for all their artificial intelligence, can’t seem to figure it out. Pitzer says the hardest thing for them is just finding the fruit. The berries hide behind leaves in unpredictable places. [audio, 4:15]

Amazon’s secretive health team talking with AARP about making products for older people

CNBC, Christina Farr


CNBC has learned that Amazon has in fact been meeting with the AARP since 2015 to discuss potential collaborations and share research, and it is interested in designing technology for aging populations.

Amazon has been very quiet about its plans to develop technologies for aging populations, as well as its health ambitions, but its vice president Babak Parviz spoke about both topics at a very rare public appearance last month.

“Something…we’ve been building for some period of time and we deeply care about… relates to what happens to older people,” said Parviz, the company’s vice president of special projects, at an event hosted by health marketing firm Klick Health.

Google launches $300 million effort to help news publishers

Axios, Sara Fischer


Google is launching the Google News Initiative (GNI), in an effort to help journalism thrive in the digital age. The tech giant says it’s committing more than $300 million over the next three years toward building products that address the news industry’s biggest needs, like building sustainable business models and elevating quality journalism.

Why it matters: Publishers have struggled to navigate distribution and content partnerships with Google, Facebook and other open platforms because the economic dynamics of those relationships often didn’t fall in favor of those who own the content. As a result, these companies suffered reputational damage for not being perceived as caring enough about fake news, misinformation and the survival of real journalism.

Commoditisation of AI, digital forgery and the end of trust: how we can fix it

Giorgio Patrini, Simone Lini, Hamish Ivey-Law and Morten Dahl


TLDR; It is becoming widely evident that technology will enable total manipulation of video and audio content, as well as its digital creation from scratch. As a consequence, the meaning of evidence and trust will be critically challenged and pillars of modern society such as information, justice and democracy will be shaken up and go through a period of crisis. Once tools for fabrication become a commodity, the effects will be more dramatic than the current phenomenon of fake news. In tech circles the issue is discussed only at a philosophical level; no clear solution is known at present. This post discusses pros and cons of two classes of potential solutions: digital signatures and learning-based detection systems. We also ran a brief “weekend experiment” to measure the effectiveness of machine learning for detection of face manipulation, on the wave of deepfakes. In the limited scope of the experiment, our model is able to spot image manipulation that is imperceptible to the human eye.
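The signature-based class of solutions rests on one idea: publish a tag alongside the content so that any change to the bytes makes verification fail. The sketch below is only a rough illustration of that idea and is not taken from the post. It uses an HMAC from Python's standard library as a stand-in. An HMAC uses a shared secret key, whereas a true digital signature uses a private/public key pair so that anyone can verify without being able to sign. The function names `tag_content` and `verify_content` and the key value are hypothetical.

```python
import hashlib
import hmac


def tag_content(content: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the raw content bytes."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def verify_content(content: bytes, key: bytes, tag: str) -> bool:
    """Return True only if the tag matches the content under this key."""
    expected = tag_content(content, key)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, tag)


if __name__ == "__main__":
    key = b"hypothetical-shared-secret"   # assumption: demo key only
    original = b"frame bytes of a video clip"
    tag = tag_content(original, key)

    tampered = b"frame bytes of a doctored clip"
    print("original verifies:", verify_content(original, key, tag))
    print("tampered verifies:", verify_content(tampered, key, tag))
```

A scheme like this can only show that content is unchanged since it was tagged. It says nothing about whether the content was genuine to begin with.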

UW and Allen Institute researchers develop new method for large-scale analysis of gene activity to advance disease research

University of Washington, Allen School News


A team of researchers at the University of Washington and the Allen Institute have come up with a highly efficient, scalable approach for measuring gene activity at the cellular level that could aid the fight against potentially devastating diseases. The researchers described their novel technique — called SPLiT-seq, short for Split Pool Ligation-based Transcriptome sequencing — in a paper published this week in the journal Science.

SPLiT-seq enables scientists to identify the cellular origin of ribonucleic acid (RNA) molecules, which are essential to the regulation and expression of genes, without having to rely on expensive instrumentation. It employs an approach called combinatorial barcoding, in which the cells go through multiple rounds of sorting and labeling with a DNA identifier, or barcode, through a process known as in-cell ligation. Each time the cells are sorted, all of the cells in a particular pool — and their corresponding RNA — receive the same barcode. Four rounds of sorting and labeling produced a unique barcode combination for each cell that could be used to identify its RNA during bulk sequencing.
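To see why a few rounds of pooling can label nearly every cell distinctly, the split-pool step can be simulated. This is a rough sketch and not the authors' protocol or software. The number of wells per round (48) and the number of cells are assumptions chosen for illustration, since the excerpt specifies only the four rounds, and the function names are our own.

```python
import random
from collections import Counter


def split_pool_barcodes(num_cells, wells_per_round, rounds, seed=0):
    """Simulate split-pool labeling.

    In each round every cell lands in a random well, and all cells in the
    same well receive that well's barcode. A cell's final label is the
    sequence of well barcodes it collected across the rounds.
    """
    rng = random.Random(seed)
    labels = [[] for _ in range(num_cells)]
    for _ in range(rounds):
        for label in labels:
            label.append(rng.randrange(wells_per_round))
    return [tuple(label) for label in labels]


def collision_fraction(barcodes):
    """Fraction of cells whose full barcode is shared with another cell."""
    counts = Counter(barcodes)
    shared = sum(c for c in counts.values() if c > 1)
    return shared / len(barcodes)


if __name__ == "__main__":
    wells, rounds, cells = 48, 4, 10_000   # assumed values, for illustration
    barcodes = split_pool_barcodes(cells, wells, rounds)
    print("possible combinations:", wells ** rounds)
    print("cells sharing a barcode:", collision_fraction(barcodes))
```

The number of possible combinations grows as wells raised to the number of rounds, so collisions stay rare as long as the cell count is small relative to that total.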

The Machine Learning Reproducibility Crisis

Pete Warden's blog


I was recently chatting to a friend whose startup’s machine learning models were so disorganized it was causing serious problems as his team tried to build on each other’s work and share it with clients. Even the original author sometimes couldn’t train the same model and get similar results! He was hoping that I had a solution I could recommend, but I had to admit that I struggle with the same problems in my own work. It’s hard to explain to people who haven’t worked with machine learning, but we’re still back in the dark ages when it comes to tracking changes and rebuilding models from scratch. It’s so bad it sometimes feels like stepping back in time to when we coded without source control.

Extra Extra

Structural racism is a scourge on the US. That is not news. A new study by researchers at Stanford, Harvard and the Census Bureau with data visualizations by Kevin Quealy and Adam Pearce of The New York Times shows readers just how difficult it is for a black boy to stay in the upper class if he was born there or to leave the lower class if that’s where he was born. White boys’ chances for ending up at the top are better if that’s where they started and their chances for exiting the lower class if that’s where they started are also better. Black and white women are much more similar to one another.

Duke University is now under strict budgetary regulations for all National Institutes of Health grants due to “research misconduct and grant management” issues. The NIH would not comment on specifics, but it appears that there have been a couple cases in which data fabrication led to more than 10 retractions and called into question the basis for awards that had been granted. More than one lab was involved.

Tufts Offers New Data Science Degree

Tufts University, Tufts Now


A new undergraduate degree-granting program in data science will be offered within the School of Engineering starting this fall.

The Bachelor of Science in Data Science encompasses principles and practices that support real-world problem solving through data analysis, said Alva Couch, an associate professor of computer science who co-wrote the program proposal with Shuchin Aeron, an associate professor in the Department of Electrical and Computer Engineering.

CSU to offer new major, first in Rocky Mountain region

Coloradoan, Kelly Ragan


Colorado State University is set to launch a new major in data science starting this fall, and it will be the first program of its kind in the Rocky Mountain region, according to CSU.

The major will offer 10 new dedicated data science courses ranging from data wrangling to data graphics and visualization to a group capstone project, according to CSU. It will also give majors a broad foundation in computer science, mathematics and statistics.

Twitter for Scientists: an Idea Whose Time Has Finally Come?

The Chronicle of Higher Education, Paul Basken


Known as Polyplexus, meaning “a network of many,” it’s a compilation of 300-character summaries of research findings, created with the idea of driving crossfield discoveries and spawning public and private funding for follow-up studies.

Unlike Twitter, it’s meant to be “a professional environment for research-and-development professionals,” said a Polyplexus developer, John A. Main, a program manager at the Pentagon’s Defense Advanced Research Projects Agency, or Darpa.

New software could run the lab of the future

Chemical & Engineering News, Sam Lemonick


Robots have repeatedly demonstrated that they can do the work that humans do. And chemists haven’t escaped this automation trend. Food, pharmaceutical, and other industries have sped up routine processes by using robots to sample and analyze products. Chemists have designed programs like Chematica to let computers plan synthetic routes. Research groups have even demonstrated nearly autonomous systems that use machine learning to design, carry out, and evaluate experiments.

But the cost and complexity of some of these systems make them unattainable for many chemistry labs. At the American Chemical Society national meeting in New Orleans on Sunday, Alán Aspuru-Guzik of Harvard University described a free software package called ChemOS that he developed to make the tools of automation available to more scientists.

“If you look at the chemistry lab of the 16th century or even the 21st century, you will see the same thing,” in terms of how chemists run experiments, Aspuru-Guzik says. “Nothing has changed really. If we really want to rethink discovery, we need to rethink the laboratory.”


Summer Institute on Cyberinfrastructure for SES

Annapolis, MD “The National Socio-Environmental Synthesis Center (SESYNC) invites applications to a short course on data and coding skills for socio-environmental synthesis. The 5th annual Summer Institute will be held July 24-27 at SESYNC.” Deadline to apply is April 27.

BELIV Workshop 2018

Berlin, Germany October 21. “Evaluation and Beyond – Methodological Approaches for Visualization,” a one-day workshop at IEEE VIS 2018. Deadline for submissions is June 30.

Biological Data Science

Cold Spring Harbor, NY November 7-10. “We are pleased to announce the third meeting on Biological Data Science.” Deadline for submissions is August 17.

Tools & Resources

Laboratory Assistant

Fabrício Olivetti de Franca


This webpage was developed to show the possibility of applying the SymTree algorithm as a lightweight alternative to common Symbolic Regression algorithms. This tool can be used as an assistant in physics and engineering experimental labs to verify equations and functions seen in theory classes. The algorithm was developed entirely in JavaScript and, as such, it runs in the browser without any further requirements. Experiment and enjoy!

U-Index, a dataset and an impact metric for informatics tools and databases

Nature, Scientific Data, Alison Cahill et al.


Measuring the usage of informatics resources such as software tools and databases is essential to quantifying their impact, value and return on investment. We have developed a publicly available dataset of informatics resource publications and their citation network, along with an associated metric (u-Index) to measure informatics resources’ impact over time. Our dataset differentiates the context in which citations occur to distinguish between ‘awareness’ and ‘usage’, and uses a citing universe of open access publications to derive citation counts for quantifying impact. Resources with a high ratio of usage citations to awareness citations are likely to be widely used by others and have a high u-Index score. We have pre-calculated the u-Index for nearly 100,000 informatics resources. We demonstrate how the u-Index can be used to track informatics resource impact over time. The method of calculating the u-Index metric, the pre-computed u-Index values, and the dataset we compiled to calculate the u-Index are publicly available.


Full-time, non-tenured academic positions

Associate Director Recruitment Analytics and Strategy

University of Chicago, Office of the Provost; Chicago, IL

Postdoctoral Fellowships in Computational Bioscience

University of Colorado-Denver; Denver, CO
