NYU Data Science newsletter – June 6, 2016

NYU Data Science Newsletter features journalism, research papers, events, tools/software, and jobs for June 6, 2016

 
Data Science News



A match-making service pairs neuroscientists with designers to explain scientific breakthroughs

Quartz, Anne Quito


from June 01, 2016

A recent initiative to bridge art and science is the Leading Strand. Conceived by neuroscientist-turned-art-director Amanda Phingbodhipakkiya, the start-up pairs academic researchers with design professionals to create visual representations of scientific breakthroughs that are often too technical and abstract for the general public to grasp.

The project’s goal is to counter what Phingbodhipakkiya identifies as “budget cuts, media misrepresentation, and public apathy” toward science in the US. Although scientific institutes such as the National Institutes of Health, NASA and the Department of Energy received a boost in the 2016 federal budget, public funding for scientific research and development has been declining or stagnant since 1968.

 

Meet Jackrabbot: the social robot

Stanford Engineering


from June 01, 2016

This three-foot tall prototype uses machine learning to discern the unwritten rules of pedestrian behavior so it can move safely as it mingles with humans.

 

Two-hundred-terabyte maths proof is largest ever

Nature News & Comment


from May 26, 2016

Three computer scientists have announced the largest-ever mathematics proof: a file that comes in at a whopping 200 terabytes, roughly equivalent to all the digitized text held by the US Library of Congress. The researchers have created a 68-gigabyte compressed version of their solution — which would allow anyone with about 30,000 hours of spare processor time to download, reconstruct and verify it — but a human could never hope to read through it.

Computer-assisted proofs too large to be directly verifiable by humans have become commonplace, and mathematicians are familiar with computers that solve problems in combinatorics — the study of finite discrete structures — by checking through umpteen individual cases. Still, “200 terabytes is unbelievable”, says Ronald Graham, a mathematician at the University of California, San Diego.

 

[1605.07725] Virtual Adversarial Training for Semi-Supervised Text Classification

arXiv, Statistics > Machine Learning; Takeru Miyato, Andrew M. Dai, Ian Goodfellow


from May 25, 2016

Adversarial training provides a means of regularizing supervised learning algorithms while virtual adversarial training is able to extend supervised learning algorithms to the semi-supervised setting. However, both methods require making small perturbations to numerous entries of the input vector, which is inappropriate for sparse high-dimensional inputs such as one-hot word representations. We extend adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself. The proposed method achieves state of the art results on multiple benchmark semi-supervised and purely supervised tasks. We provide visualizations and analysis showing that the learned word embeddings have improved in quality and that while training, the model is less prone to overfitting.
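The paper's core trick can be sketched without a recurrent network. Below is a minimal NumPy illustration (my own, not the authors' code) of an adversarial perturbation applied to an embedding rather than to a one-hot input, using a logistic model so the input gradient has a closed form:

```python
import numpy as np

def logistic_loss(w, x, y):
    # y in {-1, +1}; standard logistic loss on a single embedding x
    return np.log1p(np.exp(-y * w.dot(x)))

def adversarial_perturbation(w, x, y, eps=0.1):
    # gradient of the loss w.r.t. the input embedding (analytic for this model),
    # scaled to an epsilon-ball as in adversarial training
    g = -y * w / (1.0 + np.exp(y * w.dot(x)))
    return eps * g / (np.linalg.norm(g) + 1e-12)

rng = np.random.default_rng(0)
w = rng.normal(size=50)   # toy classifier weights
x = rng.normal(size=50)   # stands in for a word/sentence embedding
y = 1

r_adv = adversarial_perturbation(w, x, y)
# adversarial training would add logistic_loss(w, x + r_adv, y) to the objective
```

The perturbed embedding raises the loss, which is exactly what the added regularization term penalizes; the paper does the same with an LSTM's word embeddings and a backpropagated gradient.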

 

Tech moguls declare era of artificial intelligence

Reuters


from June 02, 2016

Artificial intelligence and machine learning will create computers so sophisticated and godlike that humans will need to implant “neural laces” in their brains to keep up, Tesla Motors and SpaceX CEO Elon Musk told a crowd of tech leaders this week.

While Musk’s description of an injectable human-computer link may sound like science fiction, top tech executives repeatedly said that artificial intelligence (AI) was on the verge of changing everyday life, during discussion at a conference by online publication Recode this week.

 

The New Digital Economy: Shared, Collaborative and On Demand

Pew Research Center


from May 19, 2016

A number of new commercial online services have emerged in recent years, each promising to reshape some aspect of the way Americans go about their lives. Some of these services offer on-demand access to goods or services with the click of a mouse or swipe of a smartphone app. Others promote the commercialized sharing of products or expertise, while still others seek to connect communities of interest and solve problems using open, collaborative platforms. These services have sparked a wide-ranging cultural and political debate on issues such as how they should be regulated, their impact on the changing nature of jobs and their overall influence on users’ day-to-day lives.

A national Pew Research Center survey of 4,787 American adults – its first-ever comprehensive study of the scope and impact of the shared, collaborative and on-demand economy – finds that usage of these platforms varies widely across the population. In total, 72% of American adults have used at least one of 11 different shared and on-demand services. And some incorporate a relatively wide variety of these services into their daily lives: Around one-in-five Americans have used four or more of these services, and 7% have used six or more.
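The excerpt does not state a margin of error for these figures; as a rough illustration of what a sample of 4,787 supports (assuming simple random sampling, which understates the error for a weighted panel), the standard 95% margin-of-error formula gives:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a proportion under simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)

# the headline figure: 72% of 4,787 adults have used at least one service
moe = margin_of_error(0.72, 4787)
print(f"72% +/- {100 * moe:.1f} percentage points")
```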

 

The ICML 2016 Space Fight

John Langford, Machine Learning (Theory) blog


from June 04, 2016

The space problem started long ago.

At ICML last year and the year before, the capacity needed to fit everyone on any single day was about 1500. My advice was to expect 2000 and have capacity for 2500 because “New York” and “Machine Learning”. Was history right? Or New York and buzz?

 

A matter of integrity: Can improved curation efforts prevent the next data sharing disaster?

London School of Economics, The Impact Blog; Limor Peer


from June 02, 2016

Increasingly, data repositories accept a variety of materials, mostly driven by the fact that data and software are increasingly inextricable. Code may be integral to using the data because it actually creates or models the data, because it is necessary for reading ASCII data into statistical software packages, or because it is used to analyze, interpret, and visualize the data. So when aiming to vouch for the quality of the data we have to take into account code as well. Code review is an established practice in computer science, and data repositories would be advised to heed the Recomputation Manifesto which states, among other things, that “tools and repositories can help recomputation become standard.”

 

Researchers Uncover a Flaw in Europe’s Tough Privacy Rules

The New York Times


from June 03, 2016

Europe likes to think it leads the world in protecting people’s privacy, and that is particularly true for the region’s so-called right to be forgotten. That legal right allows people connected to the Continent to ask search engines like Google to remove links about themselves from online search results under certain conditions.

Yet that right — one of the world’s most widespread efforts to protect people’s privacy online — may not be as effective as many European policy makers think, according to new research by computer scientists based, in part, at New York University.

 

Why Cancer Patients And Doctors Should Rethink The Value Of Phase 1 Trials

Forbes, Elaine Schattner


from June 02, 2016

Of all the reports being presented at this year’s big cancer meeting, the one I think most important is not about a particular drug or malignancy. It’s about the design of clinical trials.

For cancer patients trying an experimental drug, participating in a “matched” study (one that uses biomarkers, like genetics, to link their condition to a treatment) offers much greater chances of clinical benefit than does participating in a similar, unmatched study. The abstract*, authored by a geographically wide research group, will be delivered in Chicago by Maria Schwaederlé, PharmD, of the University of California, San Diego.

The results, while not surprising, are remarkable for their clarity. In phase 1 trials, a precision strategy boosted the response rate from 4.9% to 30.5%. That is a huge difference. The meta-analysis included 346 studies published from 2011 through 2013, a fairly recent data set, involving 13,203 research subjects. The “p-value”–a statistical term–is impressive, at <0.0001. The point is, this is a clinically meaningful and significant find.
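The article gives only the pooled rates and the total N; a back-of-the-envelope two-proportion z-test with a hypothetical split of the 13,203 subjects between arms (my assumption, not from the study) shows why a 4.9% vs. 30.5% gap at this scale yields p < 0.0001:

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Normal-approximation z statistic for a difference of two proportions."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# hypothetical split of the 13,203 subjects: unmatched vs. matched arms
z = two_proportion_z(0.049, 8203, 0.305, 5000)
p_value = math.erfc(z / math.sqrt(2))  # two-sided normal tail
```

With samples in the thousands, the z statistic lands far into the tail, so the significance is driven as much by scale as by the effect itself.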

This study, and others like it, have the potential to upend our clinical trials system.

 

Do We Want Robot Warriors to Decide Who Lives or Dies?

IEEE Spectrum, Erico Guizzo and Evan Ackerman


from May 31, 2016

As artificial intelligence in military robots advances, the meaning of warfare is being redefined.

 
Events



WiML ICML Luncheon 2016



Underrepresented minorities and undergraduates interested in machine learning research are encouraged to attend. The luncheon is co-located with ICML.

New York, NY Tuesday, June 21, at 11 Times Square, Microsoft building (6th floor), starting at 12 noon

 

Complex Networks — From theory to interdisciplinary applications



The conference will present the state of the art of the research in complex networks in various directions: from the most advanced theoretical approaches dealing with multiplex and temporal networks, to the applications of network theory in epidemiology, in computational social science and in studies of the brain.

Marseilles, France Monday-Wednesday, July 11-13. [$$$]

 
Deadlines



ICDM2016 Workshop Data Mining in Human Activity Analysis (DMHAA)

deadline: Friday, August 12

This workshop welcomes a broad range of submissions developing and using data mining techniques for human activity analysis. We are especially interested in 1) theoretical advances as well as algorithm developments in data mining for human activity analysis, 2) reports of practical applications and system innovations in human activity analysis, and 3) novel data sets as test bed for new developments, preferably with implemented standard benchmarks.

Barcelona, Spain IEEE Conference on Data Mining (ICDM 2016) is Tuesday-Friday, December 13-16.

Deadline for workshop paper submissions is Friday, August 12.

 
Tools & Resources



Meet The Internet’s Best Productivity Tool: If This Then That

Wall Street Journal


from June 01, 2016

Nerds have all sorts of superpowers. The AC cranks up automatically when they pull into the driveway. They get an alert every time a decent vacation rental shows up on Craigslist. Their Instagrams magically post to Twitter.

You’d think any one of these tricks requires lines and lines of code. Nope. The secret is IFTTT, and you can master it in minutes.
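An IFTTT applet is just a trigger paired with an action; a toy sketch of that pattern (names are illustrative, not IFTTT's actual API) fits in a few lines:

```python
# A toy "if this then that" rule engine; names are mine, not IFTTT's API.
rules = []

def when(trigger):
    """Register the decorated action to run whenever the named trigger fires."""
    def register(action):
        rules.append((trigger, action))
        return action
    return register

def fire(trigger, payload):
    for t, action in rules:
        if t == trigger:
            action(payload)

posted = []

@when("instagram.new_photo")
def cross_post(photo):
    posted.append(f"tweeted {photo}")  # stands in for the Twitter "that" side

fire("instagram.new_photo", "sunset.jpg")
```

IFTTT's value is that the triggers and actions are hosted integrations with hundreds of services, so users wire them together without writing any of this.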

 

Defining success – Four secrets of a successful data science experiment

StatsBlogs.com, Simply Statistics


from June 03, 2016

Defining success is a crucial part of managing a data science experiment. Of course, success is often context specific. However, some aspects of success are general enough to merit discussion. A list of hallmarks of success includes:

  • New knowledge is created.
  • Decisions or policies are made based on the outcome of the experiment.
  • A report, presentation, or app with impact is created.
  • It is learned that the data can’t answer the question being asked of it.
 

Practicing Data Science Responsibly

Rahul Bhargava, Data Therapy blog


from June 03, 2016

Data science and big data driven decisions are already baked into business culture across many fields. The technology and applications are far ahead of our reflections about intent, appropriateness, and responsibility. I want to focus on that word here, which I steal from my friends in the humanitarian field. What are our responsibilities when it comes to practicing data science? Here are a few examples of why this matters, and my recommendations for what to do about it.

 

Make your own tagging system from scratch

Kequc, Nathan Lunde-Berry


from June 03, 2016

Build a tagging tool from scratch rather than using a pre-made one. You get more control, there is no unused code, and you learn something along the way. In this article I will go through the process of building a tagging system from scratch.
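The core of any such system is a pair of indexes. As a minimal sketch (my own illustration, not the article's code): a forward index from item to tags and an inverted index from tag to items, kept in sync on every update:

```python
from collections import defaultdict

class TagStore:
    """Minimal tagging store: forward index plus inverted index."""

    def __init__(self):
        self.tags_by_item = defaultdict(set)  # item -> {tags}
        self.items_by_tag = defaultdict(set)  # tag  -> {items}

    def tag(self, item, *tags):
        for t in tags:
            self.tags_by_item[item].add(t)
            self.items_by_tag[t].add(item)

    def untag(self, item, tag):
        self.tags_by_item[item].discard(tag)
        self.items_by_tag[tag].discard(item)

    def find(self, *tags):
        """Items carrying every requested tag."""
        sets = [self.items_by_tag[t] for t in tags]
        return set.intersection(*sets) if sets else set()

store = TagStore()
store.tag("post-1", "python", "tutorial")
store.tag("post-2", "python")
```

The inverted index is the design choice that makes tag queries cheap; without it, every lookup would scan all items.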

     

Magenta

Google TensorFlow, Douglas Eck


from June 01, 2016

We’re happy to announce Magenta, a project from the Google Brain team that asks: Can we use machine learning to create compelling art and music? If so, how? If not, why not? We’ll use TensorFlow, and we’ll release our models and tools in open source on our GitHub.

     
