Data Science newsletter – April 11, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for April 11, 2018


 
 
Data Science News



A finger bone from an unexpected place and time upends the story of human migration out of Africa

Los Angeles Times, Karen Kaplan



Using lasers, researchers collected a microscopically small sample of the fossilized bone and used it to gauge its age. The analysis showed that the bone was about 87,600 years old, give or take 2,500 years.

Ben Marwick, an archaeologist at the University of Washington in Seattle who was not involved in the study, said evidence for humans leaving Africa before 60,000 years ago has become difficult to ignore over the last few years.

Still, he said the new study stands out.

“What makes the Al Wusta find especially important is that it is a direct date on a bone,” he said. “Most of the other evidence that support this pre-65,000 year idea have not been able to directly date human bones.”


University Data Science News

Casey Greene and Gregory Way of the University of Pennsylvania teamed up with Yolanda Sanchez of Dartmouth to use machine learning techniques to bring transcriptome sequencing to precision medicine. They are working within the field of precision medicine to predict which therapies are effective in combatting cancer. Simply sequencing the DNA of tumors has not proven perfectly effective in matching the therapy to the cancer. Their technique combines genetic data from 33 cancer types with transcriptome sequencing and data about the impacts of specific drugs. Not only is this an important step for cancer therapies, it presents a data science case study that demonstrates a mechanism and rationale for combining heterogeneous data types at scale.



In a related development, Columbia University researchers led by Richard Mann are working on a new transcription technique that can decode more insight from binding sites with differing binding affinities. Called No Read Left Behind, it will hopefully deliver its promised results more effectively than President Bush’s No Child Left Behind program.



Tony Zador, an MD/PhD working at Cold Spring Harbor, and a Simons Foundation grant recipient, has established a new technique for mapping the neurological connectome at the level of single neurons. This improves upon current techniques by getting more granular and may lead to advancements in our understanding of diseases like schizophrenia and autism that are rooted in circuitry.



The University of Oregon is continuing its development of data science by establishing the Center for Big Learning, which will be an industry-academia crossroads.



The University of Waterloo launched the Waterloo Artificial Intelligence Institute. It is another academic-industry hybrid research center that will “bring together almost 100 faculty members to tackle practical and fundamental problems brought to them by partners in business, government and the non-profit sector.”



Not to be forgotten, the University of Alberta announced $940,000 in scholarships funded by ATB Financial for students interested in pursuing artificial intelligence and machine learning. The race for data science success is on in Canada.

Stylishly late to the data science party, the University of Chicago announced they are launching a new Center for Data and Applied Computing. Like many similar centers at universities around the world, this one will focus on “new methods in computation and data analytics” with hopes to bring a broad, interdisciplinary “spectrum of science and scholarship” together.



Vasant Dhar of NYU gave a thoughtful TED talk that asks what we know about how, when, why, and with what consequences humans trust machines and what this means for social life.



A new tool could help academics get to the scientific papers they want to read more efficiently. Kopernio is “an artificial intelligence tool developed by the co-founders of research sharing platform Mendeley and media monitoring service Newsflo. It automatically detects what individual or institutional subscriptions a user already has access to and facilitates instant retrieval.”



Gunes Acar, Steve Englehardt, and Arvind Narayanan at Princeton University show us just how easy it is to recover readable, accurate email addresses from hashed data that is supposed to hide them. Mostly, it’s as easy as buying this service from an un-hashing company, but if you want to know how it’s done, read on at Freedom to Tinker, the excellent blog hosted by Princeton’s Center for Information Technology Policy.



Andreas Rossler, a PhD student in Matthias Nießner’s group at the Technical University of Munich, published a paper to the arXiv that demos a new algorithm for detecting face swaps in video. Thank you, Mr. Rossler, for combatting deepfakes.



Graduate students in the bench sciences, robots may soon be taking your jobs.



The National Science Foundation announced a $22.5 million grant (including private partner contributions) to New York City and a team of researchers including faculty from the NYU Tandon Department of Electrical Engineering and NYU Wireless, along with Rutgers and Columbia University, to design and build COSMOS: an experimental testbed for developing 5G wireless networking technologies.



This week there is good news in the ongoing series, The Passions and the p-values. In order to foster the concept of “living publications” and make it easier to publish reproducible research, University of Notre Dame researchers are working to create a new Whole Tale platform. Researchers will be able to upload the “whole tale” of their research, so that “when someone else goes to view that research tale, they will be able to see the source data used in analyses and models as well as the resulting data and variations all within the working environment the research was conducted in,” according to Bertram Ludäscher, of the University of Illinois at Urbana-Champaign.



University of California-Berkeley is about to vote in a new undergraduate major in data science. This follows the launch of their massively successful Data8 course with more than 1000 students enrolled on campus and the online version, Data8X, with more than 37,000 students enrolled from 181 countries. Winning.



University of New Hampshire physicist Jiadong Zang is building tiny, stable, computational storage devices that will help us continue pursuing Moore’s law. The problem with existing tiny storage is that it’s fragile; data can easily be erased when the device loses power. Always smart to keep an eye on hardware developments, even though they are slower.


New Brain Maps With Unmatched Detail May Change Neuroscience

WIRED, Science, Monique Brouillette



What [Tony] Zador showed me was a map of 50,000 neurons in the cerebral cortex of a mouse. It indicated where the cell bodies of every neuron sat and where they sent their long axon branches. A neural map of this size and detail has never been made before. Forgoing the traditional method of brain mapping that involves marking neurons with fluorescence, Zador had taken an unusual approach that drew on the long tradition of molecular biology research at Cold Spring Harbor, on Long Island. He used bits of genomic information to imbue a unique RNA sequence or “bar code” into each individual neuron. He then dissected the brain into cubes like a sheet cake and fed the pieces into a DNA sequencer. The result: a 3-D rendering of 50,000 neurons in the mouse cortex (with as many more to be added soon) mapped with single cell resolution.


New UO research center to focus on ‘deep learning’

University of Oregon, Around the O



Imagine an application that could translate speech in real time, including slang words, without any grammatical errors. International travel would become a breeze.

That’s just the kind of innovation that could be fueled by a new UO research center that will leverage collective wisdom from academia, industry, governments and the UO’s High Performance Computing Research Core Facility. Known as the Center for Big Learning, it will be located on the third floor of Deschutes Hall.

The recently launched center combines artificial intelligence, big data and large-scale “deep learning” — a class of machine learning algorithms that allow computers to perform operations that usually require human thought.


YouTube Kids Is Going To Release A Whitelisted, Non-Algorithmic Version Of Its App

BuzzFeed News, Alex Kantrowitz



YouTube will approve all channels allowed to post, a source told BuzzFeed News, giving parents a firewall against an algorithm that’s often proved lacking.


Company Data Science News

Your favorite Twitter app may be about to break. Twitter announced it won’t support streaming services after June 19th, 2018. That means no more push notifications and no more automatic refreshing of timelines. If you are used to using Twitter on your iPhone, or through Tweetbot, Talon, Tweetings, or Twitterrific, a small part of your way of life is coming to an end. Whoever decided that moving away from smartphones back to the laptop would be a good idea is…living in a different century? Nobody has a good explanation for this, so far as I’ve read.



YouTube is releasing an algorithm-free product for children that will help assuage parents’ (apparently well-founded) fears that the recommender system had been leading their children to content beyond their maturity level. This is quite a bit more interesting than it may at first seem. It could be an admission that recommender systems revert to what an average person would like, suggesting weaknesses in their ability to exercise strong market segmentation.



Want to build a data science team in your organization? Here’s how Instapage did it. And here’s how Wish built an analytics team of 30. (I’m thinking of you, Foster P., if you’re reading this.)



Uber posted Uber Movement, which allows residents of Pittsburgh, Boston, San Francisco, Cincinnati, Washington DC, and Toronto, as well as cities in Europe, Africa, South Asia, Australia, and South America, to figure out how long it will take to drive from Point A to Point B based on years of Uber data. The full dataset is also available for download.



The only thing I’ll say about Facebook this week is that behavioral scientists suggest that only a vanishingly small number of people are likely to change their behavior as a result of Facebook’s revelations about what it does (save, model, sell) and doesn’t do (delete) with users’ data. We are creatures of habit, which is one of the fundamental principles of human behavior Facebook can take to the bank. Over and over and over again.



Thomson Reuters has looped in lawtech company Logical Construct to take pre-Brexit contracts and turn them into functional post-Brexit contracts. This sounds eminently sensible, as do many applications of data science in the legal community. Their structured data is just begging for robotic companions.


Why Reddit Data Matters for Consumer Insights

Crimson Hexagon, Garrett Huddy



What are consumers saying about your brand online? Most brands would answer that question with data from top social media sites like Twitter, Facebook, and Instagram, but they are likely missing a hugely insightful chunk of data: consumer conversations on Reddit and other forums.

Brands’ lack of attention to Reddit may be a result of a lack of access to the data or lack of understanding of the value that the data can provide. Reddit data can and should be analyzed just like the rest of social media, but conversation on Reddit has some key differences and advantages.

In this post we’ll go through some of the key benefits of analyzing Reddit data and why it’s so powerful for uncovering consumer insights.


To work for society, data scientists need a hippocratic oath with teeth

Wired UK, Tom Upchurch



Data scientists need to understand the weight of their influence and the limitations of their wisdom, says Cathy O’Neil. The Weapons of Math Destruction author lays out her plan for an effective system.


The Moscow Midterms

FiveThirtyEight, Clare Malone



The first Americans to line up to vote on Nov. 6, 2018, will be the East Coast’s earliest risers. As early as 5 a.m. EST, rubbing the sleep from their eyes and clutching travel thermoses of coffee, they will start the procession of perhaps 90 million Americans to vote that day. The last to cast ballots will be Hawaiians, who will do so until 11 p.m. East Coast time. When all is said and done, the federal election will unfold something like an 18-hour-long ballet of democracy: 50 states, dozens of different kinds of voting machines and an expectation that everything should be counted up in time for TV networks to broadcast the results before Americans head to bed. Election Day 2018 is expected to unfold no differently than it has in years past.

Except it might.

While Americans are well-acquainted with Russian online trolls’ 2016 disinformation campaign, there’s a more insidious threat of Russian interference in the coming midterms. The Russians could hack our very election infrastructure, disenfranchising Americans and even altering the vote outcome in key states or districts. Election security experts have warned of it, but state election officials have largely played it down for fear of spooking the public. We still might not know the extent to which state election infrastructure was compromised in 2016, nor how compromised it will be in 2018.


OpenAI Charter

OpenAI



We’re releasing a charter that describes the principles we use to execute on OpenAI’s mission. This document reflects the strategy we’ve refined over the past two years, including feedback from many people internal and external to OpenAI. The timeline to AGI remains uncertain, but our charter will guide us in acting in the best interests of humanity throughout its development.


Columbia Scientists Build Better Way to Decode the Genome

Columbia University, Zuckerman Institute



The genome is the body’s instruction manual. It contains the raw information — in the form of DNA — that determines everything from whether an animal walks on four legs or two, to one’s potential risk for disease. But this manual is written in the language of biology, so making sense of all that it encodes has proven challenging. Now, Columbia University researchers have developed a computational tool that shines a light on the genome’s most hard-to-translate segments. With this tool in hand, scientists can get closer to understanding how DNA guides everything from growth and development to aging and disease.

The researchers recently published their findings in the Proceedings of the National Academy of Sciences.

“The genomes of even simple organisms such as the fruit fly contain 120 million letters worth of DNA, much of which has yet to be decoded because the cues it provides have been too subtle for existing tools to pick up,” said Richard Mann, PhD, a principal investigator at Columbia’s Mortimer B. Zuckerman Mind Brain Behavior Institute and a senior author of the paper. “But our new algorithm lets us sweep through these millions of lines of genetic code and pick up even the faintest signals, resulting in a much more complete picture of what DNA encodes.”


AI and robotics researchers call off boycott of KAIST

The Engineer (UK)



A boycott by global AI & robotics researchers of South Korea’s KAIST has been called off after the university’s president committed not to develop lethal autonomous weapons.


Opinion | Mark Zuckerberg Can Still Fix This Mess

The New York Times, Jonathan Zittrain



On the policy front, we should look to how the law treats professionals with specialized skills who get to know clients’ troubles and secrets intimately. For example, doctors and lawyers draw lots of sensitive information from, and wield a lot of power over, their patients and clients. There’s not only an ethical trust relationship there but also a legal one: that of a “fiduciary,” which at its core means that the professionals are obliged to place their clients’ interests ahead of their own.

The legal scholar Jack Balkin has convincingly argued that companies like Facebook and Twitter are in a similar relationship of knowledge about, and power over, their users — and thus should be considered “information fiduciaries.”


America should borrow from Europe’s data-privacy law

The Economist



The GDPR’s premise, that consumers should be in charge of their own personal data, is the right one.

 
Events



BrainHack

Indiana University; Eleftherios Garyfallidis, Valentin Pentchev and Franco Pestilli



Bloomington, IN May 2-4. “Brainhack is a unique conference that convenes researchers from across the globe and a myriad of disciplines to work together on innovative projects related to neuroscience.” [registration opens on April 12]


CFEM Seminar: Alexander Lipton (Stronghold Labs) – Asset-backed Currencies in Retrospective and Perspective

Cornell Engineering, Operations Research and Information Engineering



New York, NY April 18, starting at 6 p.m., Cornell Tech (2 W Loop Rd). [free]


DARPA Announces First Annual Electronics Resurgence Initiative Summit

DARPA



San Francisco, CA July 23-25. “DARPA announced in June 2017 that it would coalesce a broad series of programs into the Electronics Resurgence Initiative (ERI). ERI, which received an additional $75 million allocation in the FY18 budget, calls for innovative new approaches to microsystems materials, designs, and architectures.” [registration opens May 1]

 
Deadlines



14th International Workshop on Mining and Learning with Graphs

London, England August 20, in conjunction with KDD 2018. Deadline for paper submissions is May 8.
 
Tools & Resources



FASTER: A Concurrent Key-Value Store with In-Place Updates

Microsoft Research; Badrish Chandramouli et al.



Over the last decade, there has been a tremendous growth in data-intensive applications and services in the cloud. Data is created on a variety of edge sources, e.g., devices, browsers, and servers, and processed by cloud applications to gain insights or take decisions. Applications and services either work on collected data, or monitor and process data in real time. These applications are typically update intensive and involve a large amount of state beyond what can fit in main memory. However, they display significant temporal locality in their access pattern. This paper presents FASTER, a new key-value store for point operations. FASTER combines a highly cache-optimized concurrent hash index with a “hybrid log”: a concurrent log-structured record store that spans main memory and storage, while supporting fast in-place updates of the hot set in memory. FASTER extends the standard key-value store interface to handle read-modify-writes, blind updates, and CRDT-based updates. Experiments show that FASTER achieves orders-of-magnitude better throughput – up to 160M operations per second on a single machine – than alternative systems deployed widely today, and exceeds the performance of pure in-memory data structures when the workload fits in memory.
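To make the "hybrid log" idea in the abstract above more concrete, here is a toy, single-threaded sketch in Python. It is not FASTER's actual API or implementation: the class name `ToyHybridLog`, its methods, and the `mutable_size` parameter are all hypothetical, and it ignores concurrency, disk storage, and CRDT updates entirely. It only illustrates a hash index pointing into a log whose recent tail is updated in place, while updates to older records are appended as fresh copies.

```python
class ToyHybridLog:
    """Toy key-value store: a hash index over an append-only log.

    Hypothetical illustration only. Records in the last `mutable_size`
    slots of the log are updated in place; older records are treated as
    read-only, so updating them appends a new copy at the tail.
    """

    def __init__(self, mutable_size=2):
        self.log = []                 # list of [key, value] records
        self.index = {}               # key -> position of latest record
        self.mutable_size = mutable_size

    def _is_mutable(self, pos):
        # A record is mutable if it sits in the tail region of the log.
        return pos >= len(self.log) - self.mutable_size

    def read(self, key, default=None):
        pos = self.index.get(key)
        return default if pos is None else self.log[pos][1]

    def upsert(self, key, value):
        """Blind update: write the value without reading the old one."""
        pos = self.index.get(key)
        if pos is not None and self._is_mutable(pos):
            self.log[pos][1] = value              # in-place update
        else:
            self.log.append([key, value])         # copy to the tail
            self.index[key] = len(self.log) - 1

    def rmw(self, key, update, initial):
        """Read-modify-write: apply `update` to the old value, or store `initial`."""
        if key in self.index:
            new_value = update(self.read(key))
        else:
            new_value = initial
        self.upsert(key, new_value)
        return new_value


store = ToyHybridLog(mutable_size=2)
for k in ["a", "b", "c"]:
    store.rmw(k, lambda v: v + 1, initial=1)      # three fresh counters
store.rmw("a", lambda v: v + 1, initial=1)        # "a" is old: appended
store.rmw("a", lambda v: v + 1, initial=1)        # "a" is now hot: in place
print(store.read("a"), len(store.log))
```

In this sketch a key that is updated repeatedly stays in the mutable tail and stops growing the log, which is the intuition behind exploiting temporal locality; the real system's design is described in the paper itself.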


Four cents to deanonymize: Companies reverse hashed email addresses

Princeton CITP, Freedom to Tinker blog; Gunes Acar, Steve Englehardt, and Arvind Narayanan



Your email address is an excellent identifier for tracking you across devices, websites and apps. Even if you clear cookies, use private browsing mode or change devices, your email address will remain the same. Due to privacy concerns, tracking companies including ad networks, marketers, and data brokers use the hash of your email address instead, purporting that hashed emails are “non-personally identifying”, “completely private” and “anonymous”. But this is a misleading argument, as hashed email addresses can be reversed to recover original email addresses. In this post we’ll explain why, and explore companies which reverse hashed email addresses as a service.
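The claim that hashed email addresses can be reversed is easy to illustrate with a minimal Python sketch. The assumptions here are mine, not the post's: the tracker hashes a trimmed, lowercased address with SHA-256, and the party doing the reversing holds a list of candidate addresses (the three addresses below are made up). Because a hash function is deterministic, hashing every candidate and looking up the target hash is enough.

```python
import hashlib


def hash_email(email: str) -> str:
    """Return the hex SHA-256 digest of a trimmed, lowercased email address."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_lookup(candidates):
    """Precompute a hash -> address table from a list of candidate addresses."""
    return {hash_email(address): address for address in candidates}


# Hypothetical candidate list; in practice this could come from any large
# collection of known or guessed addresses.
candidates = ["alice@example.com", "bob@example.com", "carol@example.org"]
lookup = build_lookup(candidates)

# A "pseudonymous" identifier as a tracker might store it.
tracked_hash = hash_email("  Bob@Example.com ")

# Reversing it is a dictionary lookup, not a cryptographic attack.
recovered = lookup.get(tracked_hash)
print(recovered)
```

The lookup fails only for addresses absent from the candidate list, so the protection offered by hashing shrinks as that list grows.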


tf-tutorial

GitHub – dfm



“This repository contains an interactive IPython worksheet (worksheet.ipynb) designed to introduce you to model fitting using TensorFlow. Only very minimal experience with Python should be necessary to get something out of this.”

 
Careers


Full-time, non-tenured academic positions

Spatial Biodiversity Data Scientist.



Max Planck – Yale Center for Biodiversity Movement and Global Change; New Haven, CT

Full-time positions outside academia

Kepler/K2 Support Scientist



NASA Ames Research Center; Moffett Field, CA
