NYU Data Science newsletter – January 8, 2016

NYU Data Science Newsletter features journalism, research papers, events, tools/software, and jobs for January 8, 2016


 
Data Science News



Undergraduates, inspired by Silicon Valley, expect universities to teach them how to make ideas into businesses.

Twitter, NYT Business


from December 28, 2015

Ten years ago, it may have sufficed to offer a few entrepreneurship courses, workshops and clubs. But undergraduates, driven by a sullen job market and inspired by billion-dollar success narratives from Silicon Valley, now expect universities to teach them how to convert their ideas into business or nonprofit ventures.

As a result, colleges — and elite institutions in particular — have become engaged in an innovation arms race. Harvard opened an Innovation Lab in 2011 that has helped start more than 75 companies. Last year, New York University founded a campus entrepreneurs’ lab, and this year Northwestern University opened a student start-up center, the Garage.

 

Berkeley Students Use Machine Learning to Predict Rap Success

Berkeley iSchool


from December 30, 2015

As a capstone project for the MIDS program at the University of California, Berkeley, our team applied machine learning techniques and data science principles to a database of rap lyrics from 1980 to 2015. After an active exploration of the data, we chose to focus our efforts on ‘hit prediction’, particularly on what it takes to make it onto the weekly Billboard Top 100 charts. Through a combination of lyric features and the support vector machine (SVM) model, we were able to obtain over 70% accuracy in the prediction of past songs on the Billboard Top 100 chart.
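A minimal sketch of the pipeline the team describes: lyric-derived features feeding an SVM classifier that predicts whether a song charted. The feature names and data here are invented for illustration; the team's actual features and corpus are not public in this excerpt.

```python
# Toy version of "hit prediction" with lyric features and an SVM.
# Features and labels are simulated, not the real rap-lyrics dataset.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Hypothetical per-song features: vocabulary size, rhyme density, repetition
X = rng.random((200, 3))
# Hypothetical label: 1 if the song made the Billboard Top 100
y = (X[:, 1] + 0.3 * X[:, 2] > 0.7).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print(f"accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}")
```

With real lyric features, the same `fit`/`predict` loop is what produces the accuracy figure the team reports.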

 

Data vs Theory: The Mathematical Battle for the Soul of Physics 

Huffington Post, David H. Bailey


from December 30, 2015

These are exciting times for the field of physics. In 2012, researchers announced the discovery of the Higgs boson, a discovery four decades in the making, costing billions of dollars (and euros, pounds, yen and yuan) and involving some of the best minds on the planet. And in December 2015, researchers at the Large Hadron Collider in Europe reported that two separate experiments have reported possible traces of a new particle, one that might lie outside the Standard Model, although much more data and scrutiny will be required before anything definite can be said.

Yet behind the scenes a far-reaching battle has been brewing. The battle pits leading figures of string theory and the multiverse on one hand, against skeptics who argue that physics is parting ways with principles of empirical testability and falsifiability that have been the hallmarks of scientific research for at least a century.

 

Artificial Intelligence: What Will the Next 20 Years Bring? 

HuffPost Science, Babak Hodjat


from December 30, 2015

In Stuart Russell and Peter Norvig’s seminal book, Artificial Intelligence: A Modern Approach, the two authors wind up their work with a chapter looking at the future of artificial intelligence (AI). Their book is still the text of choice for teaching AI at many universities, and so, I thought, reviewing the predictions they made 20 years ago could help guide us to make better predictions now, for the next 20 years of AI.

Upon rereading the book, though, their predictions from 1995 felt surprisingly salient and topical. There must be something wrong! How could a chapter on the future of AI written 20 years ago seem so similar to the many articles predicting the future of AI being written today? Have we not made any progress worth a mention in the past two decades?

 

Artificial Intelligence Finally Entered Our Everyday World

WIRED, Business


from January 01, 2016

Andrew Ng hands me a tiny device that wraps around my ear and connects to a smartphone via a small cable. It looks like a throwback—a smartphone earpiece without a Bluetooth connection. But it’s really a glimpse of the future. In a way, this tiny device allows the blind to see.

Ng is the chief scientist at Chinese tech giant Baidu, and this is one of the company’s latest prototypes. It’s called DuLight. The device contains a tiny camera that captures whatever is in front of you—a person’s face, a street sign, a package of food—and sends the images to an app on your smartphone. The app analyzes the images, determines what they depict, and generates an audio description that’s heard through your earpiece. If you can’t see, you can at least get an idea of what’s in front of you.

 

Winners of the snake Olympics can inspire better robots

Conservation Magazine


from December 30, 2015


To the uninitiated, snakes are really just variations on a theme. Some are black, some are green. Some have pointier heads, while others have flatter snouts. Many folks think of them as little more than cylindrical tubes of death and destruction. But that’s not what University of Cincinnati biologist Bruce C. Jayne sees. Show him a Boa constrictor and a brown tree snake (Boiga irregularis) and he’ll tell you that the animals couldn’t be more different. And those differences might help engineers design better robots and aid wildlife managers in keeping invasive snakes away from places they’ll cause trouble.

 

The Next Next Thing in Sequencing

GEN Feature Articles


from January 01, 2016

Though it may seem to be navigating by perceptibly unfixed stars, next-generation sequencing (NGS) is journeying ever more adventurously into the obscure, the rare, and the confoundingly heterogeneous domains within life’s molecular codescapes.

NGS is already capable of producing billions of short reads, and it can do so quickly and economically. And NGS is reaching well beyond genomics. For example, it is revolutionizing transcriptomics through advances in RNA sequencing (RNA-Seq). Yet, despite this dazzling progress, a number of significant challenges remain.

These challenges were discussed at a recent Oxford Global event, the “Seventh Annual Next Generation Sequencing Congress”. The event provided a window through which attendees could browse the NGS field’s most daunting obstacles. It also displayed technologies that could allow these obstacles to be circumvented.
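The "billions of short reads" NGS instruments produce are typically handled as FASTQ records. This toy parser (read names and sequences are invented) computes per-read GC content, a common first quality-control metric in the analyses the excerpt describes.

```python
# Minimal FASTQ parsing sketch: each record is 4 lines
# (name, sequence, separator, quality string).
fastq = """@read1
ACGTGGCC
+
IIIIIIII
@read2
ATATATAT
+
IIIIIIII
"""

def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a read."""
    return sum(base in "GC" for base in seq) / len(seq)

lines = fastq.strip().splitlines()
reads = {lines[i][1:]: lines[i + 1] for i in range(0, len(lines), 4)}
for name, seq in reads.items():
    print(name, f"{gc_content(seq):.2f}")
```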

 

The Crowd-Sourced Supercomputer Speeds Research Results

GEN News Highlights


from December 31, 2015

The human brain has evolved over millions of years to become the best tool there is for visual perception and creative abstraction, though it still requires rest and skilled perception to process effectively. In contrast, modern computers can be relied on to process huge amounts of data but lack the intelligence to truly “understand” big-picture concepts. (And while some computers are almost capable of reaching the processing speed of a human brain, they require millions of times more energy than the standard 20-watt biological model.) By integrating human cognitive sophistication with tireless computational networking, scientists are creating a new kind of supercomputer.

 

The Star Wars social networks – who is the central character?

KDnuggets, Evelina Gabasova


from December 30, 2015

Some of us are looking forward to Christmas, and some of us are looking forward to the new film in the Star Wars franchise, The Force Awakens. Meanwhile, I decided to look at the whole 6-movie cycle from a quantitative point of view and extract the Star Wars social networks, both within each film and across the whole Star Wars universe. Looking at the social network structure reveals some surprising differences between the original trilogy and the prequels.
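The kind of network extraction described above can be sketched in a few lines: build a weighted co-appearance graph from scene rosters, then rank characters by degree. The characters and scenes below are illustrative stand-ins, not Gabasova's actual screenplay data.

```python
# Toy co-appearance network: an edge links two characters who share a
# scene, weighted by how many scenes they share.
from itertools import combinations
from collections import Counter

scenes = [
    {"Luke", "Leia", "Han"},
    {"Luke", "Obi-Wan"},
    {"Han", "Leia"},
    {"Luke", "Han", "Chewbacca"},
]

edges = Counter()
for scene in scenes:
    for a, b in combinations(sorted(scene), 2):
        edges[(a, b)] += 1  # weight = number of shared scenes

degree = Counter()  # weighted degree, a simple centrality measure
for (a, b), w in edges.items():
    degree[a] += w
    degree[b] += w

for name, d in degree.most_common(3):
    print(name, d)
```

Running the same procedure over full screenplays is what lets the original analysis compare the trilogies quantitatively.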

 

How Facebook’s news feed algorithm works.

Slate, Will Oremus


from January 03, 2016


Every time you open Facebook, one of the world’s most influential, controversial, and misunderstood algorithms springs into action. It scans and collects everything posted in the past week by each of your friends, everyone you follow, each group you belong to, and every Facebook page you’ve liked. For the average Facebook user, that’s more than 1,500 posts. If you have several hundred friends, it could be as many as 10,000. Then, according to a closely guarded and constantly shifting formula, Facebook’s news feed algorithm ranks them all, in what it believes to be the precise order of how likely you are to find each post worthwhile. Most users will only ever see the top few hundred.

No one outside Facebook knows for sure how it does this, and no one inside the company will tell you. And yet the results of this automated ranking process shape the social lives and reading habits of more than 1 billion daily active users—one-fifth of the world’s adult population.
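The ranking step described above, reduced to a toy: score each candidate post with a weighted relevance formula and sort. The signals and weights here are invented for illustration; as the article notes, Facebook's actual formula is closely guarded.

```python
# Hypothetical feed ranking: higher affinity, higher predicted
# engagement, and fresher posts rank higher.
posts = [
    {"id": 1, "friend_affinity": 0.9, "predicted_like": 0.2, "age_hours": 1},
    {"id": 2, "friend_affinity": 0.4, "predicted_like": 0.8, "age_hours": 5},
    {"id": 3, "friend_affinity": 0.1, "predicted_like": 0.1, "age_hours": 48},
]

def score(post):
    recency = 1.0 / (1.0 + post["age_hours"])
    return (0.5 * post["friend_affinity"]
            + 0.4 * post["predicted_like"]
            + 0.1 * recency)

feed = sorted(posts, key=score, reverse=True)
print([p["id"] for p in feed])
```

Scale this from 3 posts to the ~1,500 a typical user's network generates per week and you have, in outline, the process the article investigates.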

 

2016: What Do You Consider the Most Interesting Recent [Scientific] News? What Makes It Important? — Weather Prediction Has Quietly Gotten a Lot Better

Edge.org, Samuel Arbesman


from January 03, 2016

Surveying the landscape of scientific and technological change, there are a number of small and relatively steady advances that have unobtrusively combined to yield something startling, almost without anyone noticing. Through a combination of computer hardware advances (Moore’s Law marching on and so forth), ever-more sophisticated algorithms for solving certain mathematical challenges, and larger amounts of data, we have gotten something new: really good weather prediction.

According to a recent paper on this topic, “Advances in numerical weather prediction represent a quiet revolution because they have resulted from a steady accumulation of scientific knowledge and technological advances over many years that, with only a few exceptions, have not been associated with the aura of fundamental physics breakthroughs.” Despite these advances seeming to be profoundly unsexy, these predictive systems have yielded enormous progress.

 

Scientists Find Minor Flu Strains Pack Bigger Punch

NYU News


from January 04, 2016

Minor variants of flu strains, which are not typically targeted in vaccines, carry a bigger viral punch than previously realized, a team of scientists has found. Its research, which examined samples from the 2009 flu pandemic in Hong Kong, shows that these minor strains are transmitted along with the major strains and can replicate and elude immunizations.

“A flu virus infection is not a homogeneous mix of viruses, but, rather, a mix of strains that gets transmitted as a swarm in the population,” explains Elodie Ghedin, a professor in New York University’s Department of Biology and College of Global Public Health.

 

AI Storytelling and Its Future in Games

OnlySP


from December 26, 2015

As in the wider field of technology to which it is subject, revelations and disruptions occur frequently in gaming. The Xbox 360 ushered in the online age of gaming, the explosive popularity of the Nintendo Wii and smartphones introduced novel input methods, and consumer-ready virtual reality headsets are set to offer something fundamentally different for gamers from next year on.

Recent years have also seen budgets ballooning at the top end of the development scale, which has resulted in some studios, including Firaxis and From Software, using procedural content generation to increase replayability. While the technique is effective in randomising environments and challenges, that is not the full extent of its usefulness.

Enter the Entertainment Intelligence Lab at Georgia Institute of Technology, led by Associate Professor Mark Riedl. Earlier this year, a small team from the lab released the first report on Scheherazade-IF (Interactive Fiction), a program capable of generating video game stories with “near-human level authoring” based on crowdsourced examples.

 

Johns Hopkins University Data Science Team doing AMA on January 11th! : datascience

reddit.com/r/datascience


from January 05, 2016

JHU’s Data Science Team and creators of the Coursera Data Science Specialization [Roger Peng, Brian Caffo, and Jeff Leek] will be conducting an AMA at 3pm next Monday over at /r/IAmA

 

The big data of bad driving, and how insurers plan to track your every turn

The Washington Post, The Switch blog


from January 04, 2016

For years, insurance companies have used estimates of your annual mileage to determine your car insurance rates. But with recent changes in technology, insurers now have an unprecedented ability to judge your actual driving habits. Armed with detailed data on how often you slam on the brakes and what times of day you’re on the road, insurance companies are increasingly relying on precise, technological means of assessing risk — and using that information to set your monthly premiums.

Liberty Mutual, the country’s third-largest property-and-casualty insurer, took the latest step in that direction Monday when it announced a partnership with Subaru. Beginning later this year, Subaru drivers who have paid for the automaker’s Starlink infotainment system will be able to download an app to their cars that notifies them when they are accelerating too aggressively or braking too hard.
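The telematics signal insurers describe can be sketched simply: flag hard-braking events from periodic speed samples. The threshold and the data below are illustrative assumptions, not any insurer's actual criteria.

```python
# Detect hard-braking events from once-per-second speed samples.
speeds_kmh = [50, 50, 48, 30, 28, 27, 45, 44]
HARD_BRAKE_KMH_PER_S = 10  # assumed deceleration threshold

events = [
    i for i in range(1, len(speeds_kmh))
    if speeds_kmh[i - 1] - speeds_kmh[i] >= HARD_BRAKE_KMH_PER_S
]
print(f"hard-brake events at samples: {events}")
```

Counting such events per mile driven, along with time-of-day data, is the kind of feature a usage-based premium model would consume.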

 

Open journals that piggyback on arXiv gather momentum

Nature News & Comment, Elizabeth Gibney


from January 04, 2016

An astrophysicist has launched a low-cost community peer-review platform that circumvents traditional scientific publishing — and by making its software open-source, he is encouraging scientists in other fields to do the same.

The Open Journal of Astrophysics works in tandem with manuscripts posted on the pre-print server arXiv. Researchers submit their papers from arXiv directly to the journal, which evaluates them by conventional peer review. Accepted versions of the papers are then re-posted to arXiv and assigned a DOI, and the journal publishes links to them.

 

The unsung heroes of scientific software

Nature News & Comment, Dalmeet Singh Chawla


from January 04, 2016

Creators of computer programs that underpin experiments don’t always get their due — so the website Depsy is trying to track the impact of research code.

 

How to write 107,000 stories

Columbia Journalism Review


from December 30, 2015

When Frank Matt finally received the Department of Labor database after an eight-month FOIA battle, it was a mess: 70 million records in a few gigantic spreadsheets. Matt was the data journalist on McClatchy DC’s “Irradiated,” a year-long investigation into the unseen costs of America’s nuclear weapons program, and when he got the spreadsheets, his work was just beginning.

The database contained claims submitted to a government program, launched in 2001, to compensate former nuclear employees suffering from illnesses related to their work. For the government, it was a database of cases to be adjudicated, and was documented accordingly. But for Matt and the other McClatchy journalists, the spreadsheets held the lives and fates of tens of thousands of cold war workers. They just had to decode it.

“Our goal,” says Matt, “was to humanize an anonymous dataset.” The data came with unique identifiers for every worker instead of names, to protect their privacy, and because any given worker might have multiple claims.

The final project combined the data with deep on-the-ground reporting and an ambitious presentation that used a homebrewed algorithm to transform the massive database into 107,392 micro-stories.
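The "107,392 micro-stories" approach can be sketched as a template algorithm that turns each database record into a short narrative. The field names and records below are invented; McClatchy's actual schema isn't described in this excerpt.

```python
# Hypothetical claim records -> one templated micro-story per worker.
cases = [
    {"worker_id": "A-1043", "site": "Hanford", "claims": 3, "approved": 1},
    {"worker_id": "B-2217", "site": "Oak Ridge", "claims": 1, "approved": 0},
]

def micro_story(case):
    if case["approved"]:
        outcome = f"had {case['approved']} claim(s) approved"
    else:
        outcome = "had no claims approved"
    return (f"Worker {case['worker_id']} at {case['site']} filed "
            f"{case['claims']} claim(s) and {outcome}.")

for case in cases:
    print(micro_story(case))
```

A real version would branch on many more fields, but the principle is the same: the anonymous identifier becomes the protagonist of its own short account.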

 
Events



The 19th ACM conference on Computer-Supported Cooperative Work and Social Computing



CSCW is the premier venue for presenting research in the design and use of technologies that affect groups, organizations, communities, and networks. Bringing together top researchers and practitioners from academia and industry who are interested in the area of social computing, CSCW encompasses both the technical and social challenges encountered when supporting collaboration. The development and application of new technologies continues to enable new ways of working together and coordinating activities. Although work is an important area of focus for the conference, CSCW also embraces research and technologies supporting a wide variety of recreational and social activities using a diverse range of devices.

Saturday-Wednesday, February 27-March 2, in San Francisco, CA

 

SciDataCon



SciDataCon 2016 seeks to advance the frontiers of data in all areas of research. This means addressing a range of fundamental and urgent issues around the ‘Data Revolution’ and the recent data-driven transformation of research and the responses to these issues in the conduct of research.

Sunday-Tuesday, September 11-13, in Denver, CO

 
Deadlines



Jupyter Notebook User Experience Survey


For the past few weeks, a handful of folks and I have been working on a survey about the Jupyter Notebook user experience. The purpose of this questionnaire is to gather information from you, the Jupyter community, about how you are using (or not using) Jupyter Notebook today. We hope that your answers will help uncover pain-points to address, identify core features to retain, and generate new ideas to consider in Jupyter Lab.

Deadline to submit responses is not listed.

 

The #LSST Data Science Fellowship Program


We invite applications for a postdoctoral scholar to join the leadership of the newly-established LSST Data Science Fellowship Program. The LSST Data Science Fellowship Program is a series of survey-science-focused schools, designed to supplement graduate curricula with the skills researchers will need to make best use of LSST data. The postdoctoral scholar will divide their time equally between conducting a competitive research program of their own choosing involving data science in astronomy/astrophysics, and developing this new LSST-focused educational initiative. The position is formally located at CIERA/Northwestern, but the postdoctoral scholar will also spend time working with the LSST DSFP Director Lucianne Walkowicz at the Adler Planetarium (also in Chicago).

Deadline to apply is Friday, January 15.

 
CDS News



Fall 2015 Career Information Session Retrospective

NYU Center for Data Science


from January 05, 2016

One of the most exciting aspects of pursuing an advanced degree in data science is the breadth of career opportunities available upon graduation. From nonprofits to private banking, NYU’s Master of Data Science program not only gives its students the knowledge and tools to begin a career in data science but also gives them the connections and access to top companies looking to hire the next generation of data scientists. Our career information sessions with HR representatives from companies such as Apple, Intel, American Express, and Capital One allow students to see how their personal interests can be combined with their knowledge of data science to find a post-graduation job that fits their personal and career goals.

 
Tools & Resources



Elements of Python Style

GitHub, amontalenti


from January 06, 2016

This document goes beyond PEP8 to cover the core of what I think of as great Python style. It is opinionated, but not too opinionated. It goes beyond mere issues of syntax and module layout, and into areas of paradigm, organization, and architecture. I hope it can be a kind of condensed “Strunk & White” for Python code.
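To give a flavor of the territory such a guide covers beyond raw syntax, here is one illustrative before/after (this specific example is mine, not quoted from the guide): prefer a comprehension that says *what* over a loop that says *how*.

```python
# Less idiomatic: index-juggling loop with a lookup guard.
def emails_verbose(users):
    result = []
    for i in range(len(users)):
        if "email" in users[i]:
            result.append(users[i]["email"].lower())
    return result

# More idiomatic: a comprehension expressing the intent directly.
def emails_idiomatic(users):
    return [u["email"].lower() for u in users if "email" in u]

users = [{"email": "A@X.COM"}, {"name": "no-email"}]
print(emails_idiomatic(users))
```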

 

Science is “show me,” not “trust me”

Berkeley Initiative for Transparency in the Social Sciences, Philip B. Stark


from December 31, 2015

Reproducibility and open science are about providing evidence that you are right, not just claiming that you are right. Here’s an attempt to distill the principles and practices.

Shortest: Show your work.

Next shortest: Show your work. All your work.

 

Insights-as-a-Service: Data Science Engagement Model

Sameer Dhanrajani, Demystifying Data Analytics, Decision Science & Digital blog


from January 05, 2016

Most organizations realize the value in their own data and are increasingly faced with a deluge of new “big data” from new sources. New technologies are available to help, but internal teams and data warehouse platforms are already overstretched.

At the same time, business users are demanding new data analysis to keep pace with, and get ahead of, competitors. But traditional development is often too slow and complex to address the immediate business needs.

Insight-as-a-service can become the next layer of the cloud stack (following Infrastructure-as-a-Service, Platform-as-a-Service and Software-as-a-Service). In addition to SaaS application vendors that can start offering such services, there exists an opportunity to create a new class of pure-play Insight-as-a-Service vendors.

 

Attention and Memory in Deep Learning and NLP

Denny Britz, WildML blog


from January 03, 2016

A recent trend in Deep Learning is Attention Mechanisms. In an interview, Ilya Sutskever, now the research director of OpenAI, mentioned that Attention Mechanisms are one of the most exciting advancements, and that they are here to stay. That sounds exciting. But what are Attention Mechanisms?

Attention Mechanisms in Neural Networks are (very) loosely based on the visual attention mechanism found in humans. Human visual attention is well-studied and while there exist different models, all of them essentially come down to being able to focus on a certain region of an image with “high resolution” while perceiving the surrounding image in “low resolution”, and then adjusting the focal point over time.

Attention in Neural Networks has a long history, particularly in image recognition. Examples include Learning to combine foveal glimpses with a third-order Boltzmann machine or Learning where to Attend with Deep Architectures for Image Tracking. But only recently have attention mechanisms made their way into recurrent neural networks architectures that are typically used in NLP (and increasingly also in vision). That’s what we’ll focus on in this post.
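The core computation behind these mechanisms can be sketched in a few lines of NumPy: score each hidden state against a query, softmax the scores into focus weights, and return the weighted sum (the "context" vector). This is a minimal dot-product variant for illustration; the papers above use more elaborate scoring functions.

```python
# Minimal dot-product attention over a sequence of hidden states.
import numpy as np

def attention(query, states):
    scores = states @ query                 # one relevance score per state
    weights = np.exp(scores - scores.max()) # numerically stable softmax
    weights /= weights.sum()
    return weights @ states, weights        # context vector, focus weights

states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 0.0])
context, weights = attention(query, states)
print(weights.round(3))
```

The "high resolution here, low resolution elsewhere" intuition shows up as large weights on a few states and small ones on the rest.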

 

AWS S3 vs Google Cloud vs Azure: Cloud Storage Performance

Zach Bjornson


from December 29, 2015

I’m building a new cloud product that quickly processes large amounts of scientific data. Our largest customer dataset so far is about 3,000 tables, each 40 to 80 MB and totaling 150 GB, which we aim to process in 10 seconds or less. Each table can be processed independently, so we parallelize heavily to reach this goal; our deployment uses 1,000 vCPUs or more as needed. The tricky part is rapidly reading that much data into memory. Right now about 80% of the computation time is spent reading data, and that leads to the focus of this three-part post: cloud storage performance. As part of solving this problem, I evaluated a few different approaches: object storage, database-backed storage, and attached storage, each of which I’ll be detailing in separate posts.

Midway through writing this, Google happened to release their multi-cloud PerfKitBenchmarker. As far as I can find, this post is also the first set of published results from PerfKitBenchmarker.
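The measurement such a benchmark performs can be sketched as: time many concurrent reads and report aggregate throughput. The fetch function below is a simulated stand-in; a real test would call the S3, GCS, or Azure SDKs instead.

```python
# Sketch of a parallel read-throughput measurement.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_table(i: int) -> bytes:
    """Stand-in for an object-store GET of one table."""
    time.sleep(0.01)          # simulate network latency
    return b"x" * 1_000_000   # simulate 1 MB of payload

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    total_bytes = sum(len(blob) for blob in pool.map(fetch_table, range(16)))
elapsed = time.perf_counter() - start
print(f"{total_bytes / 1e6:.0f} MB in {elapsed:.2f} s")
```

Sweeping object size and concurrency, then dividing bytes by wall-clock time, yields the throughput curves a storage comparison reports.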

 

4 Things to Know Before Building Data Science Into Your Organization

Business 2 Community


from January 01, 2016

“Data scientist” is a popular role these days. Everyone seems to have one — or claims to be one.

But how do executives know whether they have the “real thing” and how that value is best employed?

I met with Kaiser Fung, leader of the Applied Analytics program at Columbia University and co-founder of a new school for data scientists called RSquare Edge, to discuss this question. Here, I offer our perspective on capturing true value in the corporation with human talents in data science.

 

Algorithms Need Managers, Too

Harvard Business Review, Michael Luca, Jon Kleinberg and Sendhil Mullainathan


from January 03, 2016

Most managers’ jobs involve making predictions. When HR specialists decide whom to hire, they’re predicting who will be most effective. When marketers choose which distribution channels to use, they’re predicting where a product will sell best. When VCs determine whether to fund a start-up, they’re predicting whether it will succeed. To make these and myriad other business predictions, companies today are turning more and more to computer algorithms, which perform step-by-step analytical operations at incredible speed and scale.

Algorithms make predictions more accurate—but they also create risks of their own, especially if we do not understand them. High-profile examples abound. When Netflix ran a million-dollar competition to develop an algorithm that could identify which movies a given user would like, teams of data scientists joined forces and produced a winner. But it was one that applied to DVDs—and as Netflix’s viewers transitioned to streaming movies, their preferences shifted in ways that didn’t match the algorithm’s predictions.

 

Machine Learning for Economists: An Introduction

Anton Tarasenko, Economics and Development blog


from December 28, 2015

A crash course for economists who would like to learn machine learning.

Why should economists bother at all? Machine learning (ML) generally outperforms econometrics in prediction. That is why ML is becoming more popular in operations, where econometrics’ advantage in tractability is less valuable. So it’s worth knowing both and choosing the approach that best suits your goals.
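The prediction claim above can be made concrete on simulated data: when the true relationship is nonlinear, a flexible ML model typically beats a plain linear (econometric-style) fit out of sample. The data-generating process below is invented for the demonstration.

```python
# Compare out-of-sample MSE of a linear fit vs a random forest
# on a simulated nonlinear relationship.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=400)  # nonlinear truth

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

mses = {}
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    mses[type(model).__name__] = mean_squared_error(
        y_test, model.predict(X_test))
print(mses)
```

The flip side, of course, is that the linear model's coefficients have the interpretability economists prize — the tractability trade-off the post refers to.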

 

Machine learning for artists

Medium, Gene Kogan


from January 03, 2016

This spring I will be teaching a course at NYU’s Interactive Telecommunications Program (ITP) called “Machine Learning for Artists.” Since the subject is fairly uncommon outside of the realm of scientific research, I thought it would be helpful to outline my motivations for offering this class.

 

The T-Shaped Data Scientist

Medium, Phil Anderson


from January 05, 2016

At this point, it’s pretty clear that there is still no definitive consensus on what the expression “data science” refers to. Many definitions have been suggested, but the most memorable is a variation of the tongue-in-cheek “statistics performed in San Francisco,” which is hardly an exhaustive description. While an exact definition may still escape us, it is arguably clear that the term largely refers to traditional advanced analytics practiced in a more modern context, with a focus on data and applications made possible by the widespread use of the Internet.

 

Announcing the Data Science Journal

Software Carpentry


from January 06, 2016

The Data Science Journal is, as its title suggests, a journal dedicated to the advancement of data science. The first thing that’s good about it is that you won’t get random emails about it with poor grammar and wild claims about its impact factor that begin with DEAR ESTEEMED RESEARCHER….

Even though it’s about data science, it’s not obsessed with building ever better recommender algorithms for Netflix or mining twitter feeds. Its focus is very much on its application in the policies, practices, and management of open data. It tries to take as wide a definition as possible when considering the subject.

 
