Data Science newsletter – March 29, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for March 29, 2017


Data Science News

candidate: Tweet of the Week

Twitter, Mark Cuban


Government Data Science News

President Trump announced he would pursue an overall 18 percent reduction in the NIH budget. NYU President Andrew Hamilton penned an impassioned op-ed in the Washington Post against the cuts. Steven Eastlack was already writing a blog post about how scarce funding impacts early career researchers, a post that he amended in the 11th hour to reflect Trump’s new fist-shaking at an already beleaguered group of scientists rattling the pennies in their proverbial tin cups.

Victoria Herrmann, an Arctic researcher, explained in an op-ed that her citations and data keep disappearing from federal databases. She noted that recent data rescue and data refuge events fall short of perfection: “Volunteers tried tirelessly to save what they could, but the federal government is a massive warehouse of information. Some data was bound to get left behind.”

Under new leadership (Ajit Pai, formerly a Verizon executive), the FCC has moved to revoke net neutrality and allow ISPs to sell consumer data. These changes could be signed into law by President Trump “any day now” according to Variety or at a vague future time according to Fortune. Either way, you may want to get comfortable using a virtual private network (VPN).

The Senate voted to allow internet service providers to sell individuals’ data to tech companies without obtaining consent. Note also: the individuals would not be paid for their data, but that has never been the real issue here. The real issue has always been about how much power individuals have to determine the use of their sensitive information. I’m tired of the argument that this will lead to better ads. Really? Better ads are more valuable than protecting a key civil liberty? … Data sharing always sounds so nice. Pro-tip that works on the playground, in the boardroom and so many other places: never let the bullies or mean girls convince you that coercion is sharing. … Meanwhile big banks are fighting to hold onto their customers’ data, refusing to share it with the same big tech companies as well as potential fintech start-ups who might draw customers and transactions away from traditional banks.

A new study out of the Columbia Law School’s Data Analytics Working Group provides a framework for using data science to identify and fight public corruption. It has a pun-ny title: Taking a Byte out of Corruption, which seems like a mashup between McGruff the Crime Dog and that part of the late 1990s when the word cyber was edgy.

The National Science Foundation’s Jetstream, an HPC cloud-computing resource made available to researchers, is sequencing rattlesnake genomes to investigate their venom. Jetstream is great for gene sequencing and for snake researchers who prefer air-conditioned safety over schlepping through the withering heat of the Arizona desert looking for animals with a fatal bite. One definition of research collaboration: you go find the deadly snakes, I’ll look for patterns in the data.

Andrew Nicklin argues that cities and states need Chief Data Officers to “do for data what fleet managers do for ambulances…ensuring their government’s valuable data gets the same care as any other strategic asset”.

Brownsville, Brooklyn will be the first New York City neighborhood to have a lab dedicated to smart city planning. (Outside the university system, that is. NYU and Cornell have both gone to great lengths to run urban data labs on their Brooklyn and Roosevelt Island campuses, respectively.) The Brownsville lab will be far more accessible to broader age ranges and economic backgrounds than the universities typically are. Full disclosure: NYU’s Center for Urban Science and Progress is a partner in the new Brownsville project.

Bloomberg Philanthropies is underwriting a LEED-style certification program for data-smart cities. Cities can earn points on 50 criteria, adding up to silver, gold, or platinum status. Bloomberg’s What Works Cities program already has 77 participating cities.

The Art and Artificial Intelligence Lab

YouTube, StateoftheArtsNJ


At the Art & Artificial Intelligence Lab at Rutgers University, computer scientists Ahmed Elgammal and Babak Saleh are teaching computers how to see and think like human beings – more specifically, like art historians.

Netflix introduces Hermes, a platform for screening its translators

Translation news


Netflix has introduced a translator screening test for its original content called Hermes. Via Hermes, translators can take a test to be rated qualified to translate Netflix content. The test is scored on a scale of 1 to 100, with 80 being the minimum score to be eligible. Hermes is the result of efforts to improve Netflix’s translated content.

[1703.07950] Failures of Deep Learning

arXiv, Computer Science > Learning; Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah


In recent years, Deep Learning has become the go-to solution for a broad range of applications, often outperforming state-of-the-art. However, it is important, for both theoreticians and practitioners, to gain a deeper understanding of the difficulties and limitations associated with common approaches and algorithms. We describe four families of problems for which some of the commonly used existing algorithms fail or suffer significant difficulty. We illustrate the failures through practical experiments, and provide theoretical insights explaining their source, and how they might be remedied.

Banks and Tech Firms Battle Over Something Akin to Gold: Your Data

The New York Times, Dealbook blog, Nathaniel Popper


The big banks and Silicon Valley are waging an escalating battle over your personal financial data: your dinner bill last night, your monthly mortgage payment, the interest rates you pay.

Technology companies like Mint and Betterment have been eager to slurp up this data, mainly by building services that let people link all their various bank-account and credit-card information. The selling point is to make budgeting and bookkeeping easier. But the data is also being used to offer new kinds of loans and investment products.

Now, banks have decided they aren’t letting the data go without a fight.

ADATS Could Assist X-planes With Large, Super-Fast Data Transmission

NASA Armstrong Flight Research Center


A network and communication architecture that can more efficiently move data from research aircraft, while using half the bandwidth of traditional methods, could eventually also enable data collection of precise measurements needed for testing the next generation of X-planes.

Researchers at NASA Armstrong Flight Research Center in California recently integrated the new system, called the Advanced Data Acquisition and Telemetry System, or ADATS, into a NASA King Air for a series of three flights following extensive ground testing. The new system can move 40 megabits per second, which is the equivalent of streaming eight high-definition movies from an online service each second, said Otto Schnarr, principal investigator.
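For a rough sense of scale, the comparison can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the eight movies are streamed simultaneously and share the 40 megabits per second evenly, and the function names are ours, not NASA's.

```python
# Back-of-the-envelope check of the ADATS throughput comparison.
# Assumption: the eight HD movies stream at the same time and split
# the link evenly. Only the two constants come from the article.

LINK_RATE_MBPS = 40  # megabits per second, as quoted
HD_STREAMS = 8       # number of HD movies in the comparison


def per_stream_mbps(link_rate_mbps, streams):
    """Megabits per second available to each stream on an evenly shared link."""
    return link_rate_mbps / streams


def megabytes_per_second(megabits_per_second):
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return megabits_per_second / 8


print(per_stream_mbps(LINK_RATE_MBPS, HD_STREAMS))
print(megabytes_per_second(LINK_RATE_MBPS))
```

Under that reading, each stream gets 5 megabits per second, and the whole link carries 5 megabytes of data per second.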

Adapting ideas from neuroscience for AI

O'Reilly Radar, Jack Clark


Jack Clark: Why should we look at the brain when developing AI systems, and what aspects should we focus on?

Geoff Hinton: The main reason is that it’s the thing that works. It’s the only thing we know that’s really smart and has general purpose intelligence. The second reason is that, for many years, a subset of people thought you should look at the brain to try to make AI work better, and they didn’t really get very far—they made a push in the 80s, but then it got stalled, and they were kind of laughed at by everyone in AI saying “you don’t look at a bumblebee to design a 747.” But it turned out the inspiration they got from looking at the brain was extremely relevant, and without that, they probably wouldn’t have gone in that direction. It’s not just that we have an example of something that’s intelligent; we also have an example of a methodology that worked, and I think we should push it further.

The AI Misinformation Epidemic

Zachary C. Lipton, Approximately Correct blog


This pairing of interest with ignorance has created a perfect storm for a misinformation epidemic. The outsize demand for stories about AI has created a tremendous opportunity for impostors to capture some piece of this market.

When founding Approximately Correct, I lamented that too few academics possessed either the interest or the talent for both expository writing and for addressing social issues. And on the other hand, too few journalists possess the technical strength to relate developments in machine learning to the public faithfully. As a result, there are not enough voices engaging the public in the non-sensational way that seems necessary now.

Unfortunately, the paucity of clear and informed voices has not resulted in a silent media.

It’s time for Canada to invest in developing artificial intelligence

The Globe and Mail, Alan Bernstein, Pierre Boivin and David McKay


Despite our early scientific lead, we’re losing ground to the AI superpowers. One indicator of that: Canadian companies last year acquired only 18 AI startups, out of 658 that were acquired globally.

So while our goal should be to ensure that Canada is a global centre for AI science, we also need to push Canadian companies, entrepreneurs and investors to seize the moment. The opportunities go hand in hand.

Crystal structure databases to have single portal

Chemical & Engineering News, Jyllian Kemsley


Two of the main crystallographic structure databases, the Cambridge Structural Database (CSD) and the Inorganic Crystal Structure Database (ICSD), will have a single portal for users to deposit and search for data starting later this year, their operating organizations announced yesterday.

Attention Markets & the Law by Tim Wu

SSRN, Tim Wu


Human attention is a resource. An increasingly large and important sector of the economy, including firms such as Google, Facebook, Snap, along with parts of the traditional media, currently depends on attentional markets for its revenue. Their business model, however, presents a challenge for laws premised on the presumption of cash markets. This paper introduces a novel economic and legal analysis of attention markets centered on the “attention broker,” the firms that attract and resell attention to advertisers.

The analysis has important payouts for two areas: antitrust analysis, and in particular the oversight of mergers in high technology markets, as well as the protection of the captive audiences from so-called “attentional theft.”

Artificial Intelligence: The Park Rangers of the Anthropocene

The Atlantic, Ed Yong


In Australia, autonomous killer robots are set to invade the Great Barrier Reef. Their target is the crown-of-thorns starfish—a malevolent pincushion with a voracious appetite for corals. To protect ailing reefs, divers often cull the starfish by injecting them with bile or vinegar. But a team of Australian scientists has developed intelligent underwater robots called COTSBots that can do the same thing. The yellow bots have learned to identify the starfish among the coral, and can execute them by lethal injection.

These robots probably aren’t going to be the saviors of the reef, but that’s not the point. It’s the approach that matters. The work of conservationists typically involves reducing human influence: breeding the species we’ve killed, killing the species we’ve introduced, removing the pollutants we’ve added, and so on. But all of these measures involve human action—some, intensively so. The COTSBots are different: They’re of us, but designed to ultimately operate without us. They represent a burgeoning movement to remove human influence from conservation—to save wild ecosystems by taking us out of the picture entirely.

NYU Helps Open Neighborhood Innovation Lab

Washington Square News, Greta Chevance


NYU is helping to improve urban life throughout New York City by eliminating technical difficulties arising on a daily basis through its collaboration with New York City’s first Neighborhood Innovation Lab.

New York’s Mayor’s Office of Technology Innovation, the Economic Development Corporation and NYU’s Center for Urban Science and Progress worked together to open the city’s first Neighborhood Innovation Lab at Osborn Plaza in Brownsville, Brooklyn last week. The first forum open to the community is set to occur in May.


Computational Challenges in Machine Learning

Simons Institute for the Theory of Computing


Berkeley, CA The aim of this Simons Institute workshop is to bring together a broad set of researchers looking at algorithmic questions that arise in machine learning. May 1-5. [registration required]

Future Labs AI Summit

NYU Tandon School of Engineering


New York, NY Wednesday, April 5, starting at 12 noon, NYU Skirball Center [$$$]

IPAM Workshop – New Deep Learning Techniques



Los Angeles, CA February 5-9, 2018, at UCLA’s Luskin Conference Center. [$$$]

The Art of Data Visualization: Art or Knowledge?

Columbia University Libraries


New York, NY Day 1 is a series of presentations and demos on different data visualizations. April 11 at 11 a.m., Columbia University, Davis Auditorium. [registration required]

Data Science Challenges for Cancer Immunotherapy 

South Big Data Hub


Chapel Hill, NC, and Online South Big Data Hub Data Science Roundtable on the burgeoning age of immuno-oncology, Thursday, April 13, starting at 12 noon. [please rsvp]

GPUniversity at University of Washington: Deep Learning and Beyond



Seattle, WA Join NVIDIA for GPUniversity Day on April 14, at 10:30 a.m. at the University of Washington in the Husky Union Building (HUB). [free, registration required]


International Symposium on Robotics Research 2017 Call for Papers

Puerto Varas, Chile The 18th International Symposium on Robotics Research (ISRR ’17) has released a call for papers. Deadline for submissions is May 15.

NYU Center for Data Science News

Creating synthetic languages

Medium, NYU Center for Data Science


Studying the underlying morpheme structures within a word has been one focus of research by Professor Jason Eisner of Johns Hopkins University. Speaking at the Center’s NLP and Text As Data seminar, Eisner explained that his group has so far used computational techniques to automatically analyze the pronunciations of a set of words to identify the underlying morphemes that they share.

Tools & Resources

IEEE DataPort



“IEEE DataPort™ is an easily accessible repository of datasets, including Big Data datasets. IEEE DataPort™ is designed to make storage of datasets easier, provide access to valuable datasets across a broad scope of topics, facilitate analysis of datasets, and retain referenceable data for reproducible research.”

ISTC Releases Open Source Code for BigDAWG Polystore System

Intel Science & Technology Center for Big Data, Dr. Tim Mattson with Dr. Vijay Gadepally and Kyle O’Brien (MIT Lincoln Laboratory)


“The ISTC for Big Data released the first version of BigDAWG, our polystore system for simplifying integration and analytics of disparate data at scale. BigDAWG is open-source software and available for download at”

Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning

Journal of Machine Learning Research; Guillaume Lemaître, Fernando Nogueira, Christos K. Aridas


imbalanced-learn is an open-source Python toolbox aiming at providing a wide range of methods to cope with the problem of imbalanced datasets frequently encountered in machine learning and pattern recognition. The implemented state-of-the-art methods can be categorized into 4 groups: (i) under-sampling, (ii) over-sampling, (iii) combination of over- and under-sampling, and (iv) ensemble learning methods. The proposed toolbox depends only on numpy, scipy, and scikit-learn and is distributed under MIT license. Furthermore, it is fully compatible with scikit-learn and is part of the scikit-learn-contrib supported project. Documentation, unit tests as well as integration tests are provided to ease usage and contribution. Source code, binaries, and documentation can be downloaded from
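To make the second group (over-sampling) concrete, here is a minimal sketch of plain random over-sampling written with only the Python standard library. It is not the imbalanced-learn API; the function name `random_oversample` and the toy data are hypothetical, chosen for illustration. The toolbox itself offers this and more refined methods.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class.

    Hypothetical helper for illustration; not part of imbalanced-learn.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    out_samples = list(samples)
    out_labels = list(labels)
    for label, count in counts.items():
        # All original samples belonging to this class.
        pool = [s for s, l in zip(samples, labels) if l == label]
        # Draw with replacement until the class reaches the target size.
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: four samples of class 0, one sample of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

After resampling, both classes have four samples. Because the minority points are exact duplicates, this simple approach can encourage overfitting, which is one reason the other groups of methods exist.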



Postdoctoral Research Scientist in Statistical Functional Genomics

Wellcome Trust Centre for Human Genetics; Oxford, England
