Data Science newsletter – August 22, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for August 22, 2017


 
 
Data Science News



Tandon Faculty and Researchers Discuss Urban Resiliency with The National Academies

NYU Tandon School of Engineering



In the aftermath of Hurricanes Sandy and Katrina, it was easy to see how prepared or unprepared a community was for a devastating natural disaster. Understanding and measuring a city or state’s resiliency, though, is quite another task — one that the National Academies of Sciences, Engineering, and Medicine is taking on in an effort to reinforce local communities. The renowned institution, which addresses national challenges within science, engineering and medicine and advises on policy-making, aims to identify effective methods for measuring a community’s resilience to extreme events, as well as strategies to help cities like New York build and maintain resilience.

The organization’s Committee on Measuring Community Resilience convened at NYU Tandon in August to exchange ideas and findings with professors, researchers, and community leaders from various New York institutions, who discussed how their projects can help measure and improve resiliency within New York and beyond through data analysis and community collaboration.


Employee retention rate at top tech companies

Business Insider, Becky Peterson



Despite the high pay and jaw-dropping perks found across the tech industry, giants like Amazon and Apple can’t avoid the job-hopping nature of today’s workforce.

The company with the longest average tenure is Facebook, where employees stay around 2.02 years, according to Paysa, a company that compiles salary data.

On the low end of the spectrum is Uber, where employees spend an average of just 1.23 years. That makes former CEO Travis Kalanick an outlier, with his 6.5-year stint at the ride-sharing company — though as a cofounder it’s natural that he would have a much lengthier tenure than the average staffer.


Prospecting for Space Resources with Intel® Nervana™

Intel Nervana, Katie Frisch



We live in an exciting time for commercial space exploration. Every week we see leaps forward: reusable rockets, plans to mine the moon, and missions to colonize Mars. Sound like science fiction? It’s not. A little-known NASA program is there to accelerate the pace of that innovation. The NASA Frontier Development Laboratory (FDL), run by the SETI Institute (yes, the one involved with finding extra-terrestrials), is a public-private partnership chartered with solving problems in space exploration in conjunction with industry partners such as Intel, NVIDIA, and IBM.


University Data Science News

Samford University in Alabama launched its Center for Sports Analytics, which reflects a trend in analytics education: focusing on a single application area. Samford’s program crosses traditional academic disciplines, gathering medical, marketing, media, operations, and quantitative skills and applying them to “fan engagement, sponsorship, player tracking, sports medicine, sports media and operations.”

UC-Berkeley was back in session last week and held the first meeting of Foundations of Data Science, aka Data 8. Standing room only in a room that holds 732 students! All materials are online. Expect forward-thinking universities to copy the Berkeley model and teach a super-sized undergraduate intro course in data science that feeds into narrower, field-specific courses tailored to applications within individual disciplines.



The American Chemical Society launched a pre-print archive, ChemRxiv, on 14 August. Just one week earlier, Elsevier launched a pre-print archive for chemistry, too, called ChemRN. Direct competition could be good, or it could dilute the effort. This newsletter recommends ChemRxiv from an efficiency and open science perspective. Elsevier has not historically been committed to open science.

UCSF is using their Chan Zuckerberg money for a major impact project: 10,000 Immunomes Project. It’s a dataset including things like HLA, flow cytometry, gene expression, multiplex ELISA, and many other pieces of immunologically relevant biological data from 10,000 control subjects. A big hope is that there may be a way to measure immunological robustness through a finger prick test within our lifetimes. The dataset includes pregnant women, which is awesome because 1) pregnant women’s immune systems change so they can share their bodies with other humans-in-the-making and 2) not all datasets plan to include enough pregnant women to produce robust findings about pregnant women.

Arizona State University, the public university most dedicated to online education, is now offering an online graduate degree in complexity science. ASU is partnering with the Santa Fe Institute in this effort.

Alice H. Wu scraped and analyzed text from Economics Job Market Rumors – a job board turned general forum for economists. Her research demonstrated widespread sexism; she has not yet looked for racism. (Certainly glad a robot did all the reading here.) In my first 30 seconds browsing the site I found this response to a question about “hiring servants”:

Poster B: “I also thought my wife would be happy for us to pay money so she can have more leisure but somehow I was very wrong about this.”

Poster C: “They know what happens between you and a short fat and ugly maid from Guatemala.”

My apologies to women, people whose jobs entail cleaning up after others, people who clean up after others without pay, and everyone from Guatemala as well as to economists who wouldn’t dream of writing this way. Working with these forum posters or for these forum posters is likely an experience riddled with frustration and alienation. But there I go, being a woman sociologist and using terms like alienation. Click through to read the 30 most frequently used words in discussions about women: top three are hotter, lesbian, and bb (“baby”).

In an unrelated piece on a very similar topic, science journalist Ushma Neill published a set of emails containing unwanted advances from men in science to preface her request: “If you don’t want what you’ve written to be published; sent to your daughter, sister or wife; or turned around and said to you, don’t write it.”

McGill University researchers are using AI and data drawn from PET scans to detect dementia two years before symptoms appear. This is one of those ethically dubious uses of AI that has outpaced our social and medical capacity to handle the news with grace. There are no treatments that can be administered at the T-2 year time point, nor is it clear that knowing about impending dementia is a social good. In the absence of any treatment options, two years that may have been spent in (blissful?) ignorance can now be shrouded by the knowledge of impending cognitive decline. I suppose, for academics, it may kick off a couple of years of immense productivity. There has to be a better way to get that book written, rather than trying to outrun looming mental decline.

NYU’s Center for Data Science made it into the news last week when Bloomberg declared us a source for “super quants” looking for jobs on Wall Street. That is misleading. We do place some of our MS students in finance jobs, but the largest part of our graduating class goes to work for tech companies. Some of our graduates actually prefer to work in jobs where they make a large social impact, like, say, at a hospital. Our PhD students may want to stay in academia and become professors, though these days that does not preclude also working for a financial firm, start up, or established tech company.

Reproducibility is an extremely important part of science. A study with results that cannot be replicated should not become the basis for treating patients, right? Anti-ageing researchers dug into reproducibility flops in their field. Tiny differences in techniques and reagents – slow gentle rocking versus vigorous stirring at one point in a protocol – have huge impacts on results. Documentation is crucial. Using the exact same reagents and water is key (people have been known to lug water from an old lab to a new lab to maintain consistency). Does it make sense to videotape procedures? Or are there other lessons to be learned from attempting to reproduce others’ work that go beyond replicating it?

Stanford University researchers have released a protocol that can grant near-complete privacy to individuals whose genetic material is research-active. (I love the idea of my genetic material being research active without me.)

Stanford researchers in another lab discovered that scientists know very little about 99 percent of the microbes living in our bodies. Many of the microbes they found are not even in the catalogue of known microbes. They also affirmed that we know very little about how the immune system changes during pregnancy.



University of Utah chemical engineers have developed a blood test that can detect liver cancer in two minutes for $3 without using a lab.


GANs for Biological Image Synthesis

arXiv, Computer Science > Computer Vision and Pattern Recognition; Anton Osokin, Anatole Chessel, Rafael E. Carazo Salas, Federico Vaggi



In this paper, we propose a novel application of Generative Adversarial Networks (GAN) to the synthesis of cells imaged by fluorescence microscopy. Compared to natural images, cells tend to have a simpler and more geometric global structure that facilitates image generation. However, the correlation between the spatial pattern of different fluorescent proteins reflects important biological functions, and synthesized images have to capture these relationships to be relevant for biological applications. We adapt GANs to the task at hand and propose new models with causal dependencies between image channels that can generate multi-channel images, which would be impossible to obtain experimentally. We evaluate our approach using two independent techniques and compare it against sensible baselines. Finally, we demonstrate that by interpolating across the latent space we can mimic the known changes in protein localization that occur through time during the cell cycle, allowing us to predict temporal evolution from static images.
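
The “causal dependency” between channels can be pictured as one generator producing the first fluorescence channel and a second generator producing the next channel conditioned on it. The sketch below is a deliberately tiny, hypothetical illustration in PyTorch, not the authors’ architecture (they use convolutional GANs on real microscopy data); the class names and sizes are invented.

    import torch
    import torch.nn as nn

    class GreenGenerator(nn.Module):
        """Generates the first fluorescence channel from a latent code."""
        def __init__(self, latent_dim=64, img_size=32):
            super().__init__()
            self.img_size = img_size
            self.net = nn.Sequential(
                nn.Linear(latent_dim, 256), nn.ReLU(),
                nn.Linear(256, img_size * img_size), nn.Tanh())

        def forward(self, z):
            return self.net(z).view(-1, 1, self.img_size, self.img_size)

    class RedGenerator(nn.Module):
        """Generates the second channel conditioned on the first: the red
        protein's localization is modelled given the green one."""
        def __init__(self, latent_dim=64, img_size=32):
            super().__init__()
            self.img_size = img_size
            self.net = nn.Sequential(
                nn.Linear(latent_dim + img_size * img_size, 256), nn.ReLU(),
                nn.Linear(256, img_size * img_size), nn.Tanh())

        def forward(self, z, green):
            x = torch.cat([z, green.flatten(1)], dim=1)
            return self.net(x).view(-1, 1, self.img_size, self.img_size)

    # Sample a batch of two-channel synthetic cell images.
    z = torch.randn(8, 64)
    g_green, g_red = GreenGenerator(), RedGenerator()
    green = g_green(z)
    red = g_red(z, green)                        # red channel depends on green
    fake_cells = torch.cat([green, red], dim=1)  # shape: (8, 2, 32, 32)

Interpolating the latent code z between two samples would, in the same spirit as the paper, morph both generated channels together.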


Microsoft’s speech recognition system hits a new accuracy milestone

TechCrunch, Catherine Shu



Microsoft announced today that its conversational speech recognition system has reached a 5.1% error rate, its lowest so far. This surpasses the 5.9% error rate reached last year by a group of researchers from Microsoft Artificial Intelligence and Research and puts its accuracy on par with professional human transcribers who have advantages like the ability to listen to text several times.

Both studies transcribed recordings from the Switchboard corpus, a collection of about 2,400 telephone conversations that have been used by researchers to test speech recognition systems since the early 1990s. The new study was performed by a group of researchers at Microsoft AI and Research with the goal of achieving the same level of accuracy as a group of human transcribers who were able to listen to what they were transcribing several times, access its conversational context and work with other transcribers.
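
The 5.1% and 5.9% figures are word error rates: the number of word substitutions, insertions, and deletions needed to turn the system’s transcript into the reference, divided by the length of the reference. A minimal illustration of that metric (a standard edit-distance computation, not Microsoft’s evaluation code):

    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + insertions + deletions) / reference length,
        computed with the usual Levenshtein dynamic program over words."""
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("see you at the meeting", "see you at a meeting"))  # 0.2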


Using machine learning to improve patient care

MIT News, CSAIL



One team created a machine-learning approach called “ICU Intervene” that takes large amounts of intensive-care-unit (ICU) data, from vitals and labs to notes and demographics, to determine what kinds of treatments are needed for different symptoms. The system uses “deep learning” to make real-time predictions, learning from past ICU cases to make suggestions for critical care, while also explaining the reasoning behind these decisions.

“The system could potentially be an aid for doctors in the ICU, which is a high-stress, high-demand environment,” says PhD student Harini Suresh, lead author on the paper about ICU Intervene. “The goal is to leverage data from medical records to improve health care and predict actionable interventions.”
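
As described above, ICU Intervene reads sequences of ICU measurements and predicts needed treatments while learning from past cases. Below is a heavily simplified, hypothetical sketch of that kind of setup: an LSTM over hourly vitals predicting whether an intervention such as ventilation will be needed. None of the names or dimensions come from the paper.

    import torch
    import torch.nn as nn

    class InterventionPredictor(nn.Module):
        """Toy stand-in for an ICU time-series model: reads a sequence of
        hourly measurements and outputs the probability that an intervention
        (e.g. mechanical ventilation) will be needed in the next window."""
        def __init__(self, n_features=12, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, x):                  # x: (batch, hours, n_features)
            _, (h, _) = self.lstm(x)
            return torch.sigmoid(self.head(h[-1]))

    model = InterventionPredictor()
    vitals = torch.randn(4, 24, 12)            # 4 patients, 24 hours, 12 signals
    print(model(vitals).shape)                 # torch.Size([4, 1])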

Another team developed an approach called “EHR Model Transfer” that can facilitate the application of predictive models on an electronic health record (EHR) system, despite being trained on data from a different EHR system. Specifically, using this approach the team showed that predictive models for mortality and prolonged length of stay can be trained on one EHR system and used to make predictions in another.
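
The transfer idea can be caricatured as mapping both record systems into a shared feature vocabulary so that a model fitted on one hospital’s data can score patients from another. A hedged scikit-learn sketch follows; the column names, mapping, and toy labels are all invented for illustration.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Hypothetical shared feature space the two EHR systems are mapped into.
    SHARED = ["age", "heart_rate", "creatinine"]

    def to_shared(df, column_map):
        """Rename system-specific columns to the shared vocabulary."""
        return df.rename(columns=column_map)[SHARED]

    # System A uses one schema, system B another.
    ehr_a = pd.DataFrame({"pt_age": [67, 54], "hr": [88, 110], "creat": [1.1, 2.3]})
    ehr_b = pd.DataFrame({"age_years": [71], "pulse": [95], "creatinine_mgdl": [1.8]})

    X_a = to_shared(ehr_a, {"pt_age": "age", "hr": "heart_rate", "creat": "creatinine"})
    X_b = to_shared(ehr_b, {"age_years": "age", "pulse": "heart_rate",
                            "creatinine_mgdl": "creatinine"})

    model = LogisticRegression().fit(X_a, [0, 1])   # trained on system A outcomes
    print(model.predict_proba(X_b))                 # applied to system B patients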


Evidence of a Toxic Environment for Women in Economics

The New York Times, The Upshot blog, Justin Wolfers



But the intersection of two technological shifts has opened up new avenues for research. First, many “water cooler” conversations have migrated online, leaving behind a computerized archive. In addition, machine-learning techniques have been adapted to explore patterns in large bodies of text, and as a result, it’s now possible to quantify the tenor of that kind of gossip.

This is what Ms. Wu did in her paper, “Gender Stereotyping in Academia: Evidence From Economics Job Market Rumors Forum.”

Ms. Wu mined more than a million posts from an anonymous online message board frequented by many economists. The site, commonly known as econjobrumors.com (its full name is Economics Job Market Rumors), began as a place for economists to exchange gossip about who is hiring and being hired in the profession. Over time, it evolved into a virtual water cooler frequented by economics faculty members, graduate students and others.
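
As the article describes it, Wu’s analysis asks which words most strongly predict that a post is about a woman. Here is a rough, hypothetical sketch of that style of analysis: a logistic regression over word counts with made-up example posts. The real paper works with over a million posts and a far more careful design.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    posts = ["she is hotter than her research",
             "his job market paper is brilliant",
             "she got the offer because of her looks",
             "he solved the model quickly"]
    about_woman = [1, 0, 1, 0]   # label: does the post discuss a woman?

    vec = CountVectorizer().fit(posts)
    clf = LogisticRegression().fit(vec.transform(posts), about_woman)

    # Words with the largest positive coefficients are most predictive of
    # woman-related threads -- the kind of ranking behind Wu's "top 30" list.
    vocab = vec.get_feature_names_out()
    print(sorted(zip(clf.coef_[0], vocab), reverse=True)[:5])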


How do you bring artificial intelligence from the cloud to the edge?

The Next Web, Ben Dickson



Despite their enormous speed at processing reams of data and providing valuable output, artificial intelligence applications have one key weakness: Their brains are located thousands of miles away.

Most AI algorithms need huge amounts of data and computing power to accomplish tasks. For this reason, they rely on cloud servers to perform their computations, and aren’t capable of accomplishing much at the edge: the mobile phones, computers and other devices where the applications that use them run.

In contrast, we humans perform most of our computation and decision-making at the edge (in our brain) and only refer to other sources (internet, library, other people…) where our own processing power and memory won’t suffice.


Teaching Kids Coding, by the Book

The New York Times, Alexandra Alter



“How many of you take computer science class at your schools?” she asked. Hands shot up. “Are you the only girls in your class?” she asked. Most of the girls nodded.

Over the past five years, some 40,000 girls have learned to code through the organization’s summer camps and afterschool programs. But Ms. Saujani wanted to expand the group’s reach, and was looking for new ways to recruit girls into the tech industry.

For a tech evangelist, her solution was surprisingly retro and analog: books. Girls Who Code is creating a publishing franchise, and plans to release 13 books over the next two years through a multibook deal with Penguin. The titles range from board books and picture books for babies and elementary school children, to nonfiction coding manuals, activity books and journals, and a series of novels featuring girl coders.


Growing Up with Alexa

MIT Technology Review, Rachel Metz



What will it do to kids to have digital butlers they can boss around?


Inside the fighter jet of the future where AI is the pilot

New Scientist, Field Notes, Timothy Revell



Next-gen planes won’t have controls – or maybe even a cockpit. Timothy Revell got on board to find out whether pilots are getting the ejector seat


Google Is Using Machine Learning To Study The Eclipse

Fast Company, Meg Miller



It’s been a century since the U.S. has seen an eclipse like this, and Google has spent years preparing to learn from it.


AI in Space

IEEE Spectrum, Katherine Bourzac



Last Thursday at an event at Intel, participants in the NASA Frontier Development Laboratory research accelerators presented results showing how artificial intelligence can speed up space science. The lab, part of an effort by NASA to test the machine learning waters, is run by the SETI Institute; engineers at private companies including Intel, IBM, NVIDIA, and Lockheed Martin, among others, helped support the projects.

Companies such as Facebook and Google use machine learning to predict people’s buying habits and tag photos, but so far it hasn’t been widely applied to basic science problems, says Bill Diamond, CEO of the SETI Institute. Through Frontier Development Laboratory, which just finished its second year, NASA is exploring the possibilities. The lab sponsors small groups of computer and planetary science researchers to work on important problems in space science for two months each summer.

 
Deadlines



Helix and Illumina Accelerator collaborate to aid genomics startups

Helix, a San Francisco-based genomics company, has teamed up with Illumina Accelerator, a startup creation engine also based in the Bay Area.

Through the collaboration, they will partner with entrepreneurs looking to promote innovation in the genomics space.

Are you a startup interested in being part of this alliance? If so, you should be developing a DNA-driven product geared toward consumers. The deadline to apply is less than two weeks away (September 1). All qualified startups will compete against the other applicants being assessed by Illumina Accelerator.

 
NYU Center for Data Science News



Predictive Analytics Interview Series: Anasse Bari, New York University

Predictive Analytics Times, Eric Siegel



In anticipation of his upcoming conference presentation, Wall Street and the New Data Paradigm at Predictive Analytics World for Financial in New York, Oct 29-Nov 2, 2017, we asked Anasse Bari, University Professor of Computer Science at New York University, a few questions about his work in predictive analytics.

Professor Anasse Bari of New York University, formerly with the World Bank, is steadily becoming the go-to advisor for Wall Street. Although an outsider, he is providing data-driven insights that can help Wall Street hedge funds and other institutions make sound investment decisions.

 
Tools & Resources



Backprop is not just the chain rule

Tim Vieira



Almost everyone I know says that “backprop is just the chain rule.” Although that’s basically true, there are some subtle and beautiful things about automatic differentiation techniques (including backprop) that will not be appreciated with this dismissive attitude.

This leads to a poor understanding. As I have ranted before: people do not understand basic facts about autodiff.
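
One of the subtleties the post points to is that reverse-mode autodiff propagates adjoints through a program’s computation graph rather than symbolically multiplying out derivatives. Below is a toy scalar reverse-mode engine written from scratch as an illustration; it is not Vieira’s code.

    class Var:
        """Minimal scalar reverse-mode autodiff node."""
        def __init__(self, value, parents=()):
            self.value, self.parents, self.grad = value, parents, 0.0

        def __add__(self, other):
            return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

        def __mul__(self, other):
            return Var(self.value * other.value,
                       [(self, other.value), (other, self.value)])

        def backward(self, seed=1.0):
            # Accumulate adjoints: each parent receives seed * d(self)/d(parent).
            # (Naive recursion; real systems sweep a topological ordering once.)
            self.grad += seed
            for parent, local_grad in self.parents:
                parent.backward(seed * local_grad)

    x, y = Var(2.0), Var(3.0)
    z = x * y + x          # z = x*y + x
    z.backward()
    print(x.grad, y.grad)  # 4.0 (= y + 1), 2.0 (= x)

The naive recursion revisits shared nodes, which blows up on larger graphs; efficient autodiff runs one backward sweep over a topological ordering, which is exactly the kind of detail lost in “it’s just the chain rule.”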


[1708.05866] A Brief Survey of Deep Reinforcement Learning

arXiv, Computer Science > Learning; Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, Anil Anthony Bharath



“In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning.”
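
For readers new to the algorithms the survey covers, the heart of the deep Q-network is a regression toward a Bellman target computed from a periodically synced copy of the network. A minimal PyTorch sketch of one such update on random toy data (not code from the survey):

    import torch
    import torch.nn as nn

    q_net = nn.Linear(4, 2)        # toy Q-network: 4-dim state, 2 actions
    target_net = nn.Linear(4, 2)   # periodically synced copy used for targets
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    gamma = 0.99

    # One batch of transitions (state, action, reward, next_state, done flag).
    s = torch.randn(32, 4); a = torch.randint(0, 2, (32,))
    r = torch.randn(32);    s2 = torch.randn(32, 4); done = torch.zeros(32)

    with torch.no_grad():
        # Bellman target: r + gamma * max_a' Q_target(s', a') for non-terminal s'.
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for taken actions
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()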


In-Database Analytics for Large Array Data

Intel Science & Technology Center for Big Data, Jack Dongarra, Piotr Luszczek and Thomas Herault



The split between the analytics (compute) world and the reliable-storage world is clear, and bridging the two commonly requires data copies to fit each side’s needs. But as data grows, those copies become a significant overhead, with increasingly high costs for data extraction, redistribution, and copy-back operations.

We will show how in-database analytics addresses this problem directly with the use of modern database systems and established numerical algorithms.
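
The contrast the authors draw, copying data out to the application versus pushing computation to where the data lives, can be illustrated even with plain SQL from Python. This is a generic sqlite example, not the array-database systems the project actually targets:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor INTEGER, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)",
                     [(i % 3, float(i)) for i in range(9)])

    # Extract-then-compute: copy every row into the application, then aggregate.
    rows = conn.execute("SELECT sensor, value FROM readings").fetchall()
    means_app_side = {}
    for sensor, value in rows:
        means_app_side.setdefault(sensor, []).append(value)
    means_app_side = {k: sum(v) / len(v) for k, v in means_app_side.items()}

    # In-database analytics: ship the computation to the data, copy back results.
    means_db_side = dict(conn.execute(
        "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"))

    assert means_app_side == means_db_side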


Allen School’s open-source TVM framework bridges the gap between deep learning and hardware innovation

University of Washington, Paul G. Allen School of Computer Science & Engineering



Deep learning has become increasingly indispensable for a broad range of applications, including machine translation, speech and facial recognition, drug discovery, and social media filtering. This growing reliance on deep learning has been fueled by a combination of increased computational power, decreased data storage costs, and the emergence of scalable deep learning systems like TensorFlow, MXNet, Caffe and PyTorch that enable companies and organizations to analyze and extract value from vast amounts of data with the help of neural networks.

But existing systems have limitations that hinder their deployment across a range of devices. Because they are built to be optimized for a narrow range of hardware platforms, such as server-class GPUs, it takes considerable engineering effort and expense to adapt them for other platforms — not to mention provide ongoing support. The Allen School’s novel TVM framework aims to bridge that gap between deep learning systems, which are optimized for productivity, and the multitude of programming, performance and efficiency constraints enforced by different types of hardware.
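
For a sense of how TVM separates the description of a computation from its schedule for a particular device, here is a sketch along the lines of the project’s vector-add tutorial. The API shown is from early TVM releases and has since been reorganized (for example under tvm.te), so treat it as illustrative rather than definitive:

    import tvm
    import numpy as np

    # Declare the computation (a hardware-independent description).
    n = tvm.var("n")
    A = tvm.placeholder((n,), name="A")
    B = tvm.placeholder((n,), name="B")
    C = tvm.compute(A.shape, lambda i: A[i] + B[i], name="C")

    # Schedule and compile it for a specific backend (CPU via LLVM here;
    # the same computation could be scheduled for a GPU or another target).
    s = tvm.create_schedule(C.op)
    fadd = tvm.build(s, [A, B, C], target="llvm")

    ctx = tvm.cpu(0)
    a = tvm.nd.array(np.random.rand(1024).astype("float32"), ctx)
    b = tvm.nd.array(np.random.rand(1024).astype("float32"), ctx)
    c = tvm.nd.array(np.zeros(1024, dtype="float32"), ctx)
    fadd(a, b, c)
    np.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())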
