NYU Data Science newsletter – March 2, 2016

The NYU Data Science Newsletter features journalism, research papers, events, tools/software, and jobs for March 2, 2016.


 
Data Science News



The games industry’s shift to deep data

Dataconomy


from February 25, 2016

When it comes to collecting customer data, the games industry is in a unique and enviable position. Unlike other sectors where customer data is often incomplete, games makers and marketers can access a wealth of live data about every single one of their players.
But it wasn’t always so.

Games used to be sold almost exclusively in boxes, and companies, up until very recently, knew little about their players beyond what market research revealed. While it’s not surprising that mobile has heralded a big shift in games, what sets the industry apart from others in the mobile space is how it’s leading data innovation. And this all boils down to the way we now play mobile games.

 

Performance @Scale 2016 recap

Facebook, Code blog


from February 29, 2016

Making fast apps and services that scale to millions or billions of people is no simple task. Dealing with the impact of poor performance can be just as challenging, and can be measured in slow user experiences and inefficient infrastructure. History is littered with examples of projects that failed to maintain their performance as they scaled. No two performance problems are ever quite the same, but there’s a lot we can learn from one another as an industry.

Last Wednesday, hundreds of engineers came to Facebook’s Menlo Park campus for Performance @Scale, an all-day event dedicated to making technology fast and efficient. Speakers from Facebook, Google, LinkedIn, Microsoft, and Netflix came together and covered topics such as low-level system profiling, production measurements, regression detection, efficient triage, and more.

 

ARM unveils processor for wearables and the Internet of Things

VentureBeat, Dean Takahashi


from February 22, 2016

Smart watches and smart light bulbs are bound to get even smarter with this announcement.

ARM is launching its smallest and lowest-power ARMv8-A processor today as it further targets next-generation wearables and Internet of Things applications.

 

Is Being A Data Scientist Really The Best Job In America?

Forbes, Bernard Marr


from February 25, 2016

… Tye Rattenbury, director of Data Science at Trifacta, says “If you look at a data science job description from five years ago, it was basically ‘advanced degree, computer skills, predictive modelling’.

“Now that’s only a third of it – the other two thirds are ‘works well with others’, ‘knows how to report and communicate’. It’s great when people are smart and can do clever stuff, but they need to be able to feed it back into the business so we can do something about it.”

 

Netflix: The Force Awakens

TechCrunch, Gene Hoffman


from February 27, 2016

I’ve argued in the past that Netflix will have the last laugh — that its success with content production will soon rival and surpass traditional TV networks and movie studios. Now I’m more convinced than ever. … Here’s another reason why Hollywood should be worried: Netflix is now able to leverage big data to produce hit shows almost routinely.

 

A Project-Based Case Study of Data Science Education

CODATA, Data Science Journal


from February 29, 2016

The discipline of data science has emerged over the past decade as a convergence of high-power computing, data visualization and analysis, and data-driven application domains. Prominent research institutions and private sector industry have embraced data science, but foundations for effective tertiary-level data science education remain absent. This is nothing new, however, as the university has an established tradition of developing its educational mission hand-in-hand with the development of novel methods for human understanding (Feingold, 1991). Thus, it is natural that universities “figure out” data science concurrent with the development of needed pedagogy. We consider data science education with respect to recent trends in interdisciplinary and experiential educational methodologies. The first iteration of the Berkeley Institute for Data Science (BIDS) Collaborative, which took place at the University of California, Berkeley in the Spring of 2015, is used as a case study.

 

Data Science and Disability

KDnuggets, Chris Pearson


from March 01, 2016

Data Science and Artificial Intelligence have come to the forefront of technology in the last few years. Learn how practitioners are taking a more philanthropic outlook, supporting people with both physical and mental disabilities.

 

A world where everyone has a robot: why 2040 could blow your mind

Nature News & Comment


from February 24, 2016

Technological change is accelerating today at an unprecedented speed and could create a world we can barely begin to imagine.

 

Q&A With Jamie Dimon on the Future of Finance

Bloomberg Business


from March 01, 2016

John Micklethwait: Your career has seen a host of companies trying to challenge the status quo in finance; some have succeeded, others have failed. The new threat is Silicon Valley. All kinds of fintech startups are coming for Wall Street. Where do you feel most vulnerable?

Jamie Dimon: Let me just give you the big picture first. The best way to look at any business is from the standpoint of the clients. So there are these certain basic things that aren’t going to change. Companies are going to have needs for equity, debt, advice, FX, and derivatives. Individuals are going to have needs for auto loans, mortgages, something that looks like a deposit account, and the ability to send money to people. Those things aren’t going to change.

 

Building Systems that Query on Compressed Data

Microsoft Research Blog


from March 02, 2016

Web services today want to support sophisticated queries, with stringent interactivity (latency and/or throughput) constraints. Many recent studies have argued that in-memory query execution is one of the keys to achieving query interactivity. However, as web services scale to larger data sizes, executing queries in memory becomes increasingly challenging. As a result, existing systems fall short of supporting sophisticated interactive queries at scale.

In this talk, I [Rachit Agarwal, UC Berkeley] will present Succinct, a distributed data store that supports functionality comparable to state-of-the-art NoSQL stores and yet, enables query interactivity for an order of magnitude larger data sizes than what is possible today (or, alternatively, up to two orders of magnitude faster queries at scale).
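For a rough feel of what “queries on compressed data” means in practice, here is a toy sketch in Python. Succinct itself builds compressed suffix-array structures; this illustration uses a plain, uncompressed suffix array and exists only to show the query pattern of searching an index directly rather than scanning the raw data.

```python
# Toy illustration of index-based substring search, loosely in the spirit of
# Succinct. Succinct queries *compressed* suffix-array structures; this sketch
# uses a plain suffix array purely to show the query pattern.
import bisect

def build_suffix_array(text):
    # Sort all suffix start positions lexicographically (toy O(n^2 log n) build).
    return sorted(range(len(text)), key=lambda i: text[i:])

def search(text, sa, query):
    """Return the start offsets of every occurrence of `query` in `text`."""
    # Materialize length-truncated keys for clarity, not speed; a real system
    # would compare against the (compressed) index in place.
    keys = [text[i:i + len(query)] for i in sa]
    lo = bisect.bisect_left(keys, query)
    hi = bisect.bisect_right(keys, query)
    return sorted(sa[lo:hi])

text = "succinct stores succinct data"
sa = build_suffix_array(text)
print(search(text, sa, "succinct"))  # [0, 16]
```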

 
Events



Computer Science Colloquium: New machine learning for ubiquitous genomics and beyond



JOB TALK — James Zou, Microsoft Research New England and MIT

Large-population human datasets are being generated that can transform science and medicine. New machine learning techniques are necessary to unlock this data resource and enable discoveries. I will first survey recent advances in human population genomics, and describe new computational techniques we have developed to connect genetic and epigenetic variations to human diseases. These methods required significant innovations in latent-variable models, non-convex optimization and histogram estimation. I collaborated closely with the largest genomics consortia to apply these approaches—which are scalable and have strong mathematical guarantees—to systematically estimate the effects of mutations and to identify disease biomarkers.

Monday, March 7, at Warren Weaver Hall 1302. Refreshments at 11:15 a.m. Presentation at 11:30 a.m.

 

Upcoming Webinar — Research Computing Skills for Scientists: Lessons, Challenges, and Opportunities from Software Carpentry



Since 1998, Software Carpentry has evolved from a week-long training course at the US national laboratories into a worldwide volunteer effort to improve researchers’ computing skills. In this webinar, Software Carpentry’s co-founder will explore what’s been learned along the way about what scientists, engineers, and other researchers actually need to know about programming in order to make their work more shareable, more reproducible, more likely to be correct, and more efficient. He will also discuss practices that the DataONE community and similar groups may be able to use to help researchers deal with large or messy data in a broad range of disciplines.

Tuesday, March 8, at 12 noon Eastern

 

PyCon 2016 in Portland, OR | May 28th – June 5th



PyCon is the largest annual gathering for the community using and developing the open-source Python programming language. PyCon is organized by the Python community for the community. We try to keep registration far cheaper than most comparable technology conferences, to make PyCon accessible to the widest group possible. PyCon is a diverse conference dedicated to providing an enjoyable experience to everyone. Our code of conduct is intended to help everyone maintain the PyCon spirit. We thank all attendees and staff for observing it.

Saturday, May 28, to Sunday, June 5, in Portland, OR

 
Deadlines



Google Summer of Code


Spend your summer break writing code for an open source software project!

Applications open March 14, 2016 at 15:00 (EDT)

Deadline for application submission is Friday, March 25.

 

Watson Developer Challenge: Conversational Applications


There’s a new direction coming for user interfaces, and we want you to help shape it. Pages of forms and static information are a thing of the past: why not let your applications talk to your users? IBM Watson has four new services that make it simple to set up a conversational interface to interact with your users as if they were talking to a person – Natural Language Classifier, Dialog, Retrieve and Rank, and Document Conversion.

We want to see you use this technology to build a conversational application.

Deadline for submissions is Friday, April 15.

 

PrecisionFDA Consistency Challenge


The goal of the FDA’s first precisionFDA challenge is to engage the genomics community in advancing quality standards so that genetic tests (related to whole human genome sequencing) produce more consistent results, in service of better personalized care.

PrecisionFDA invites all innovators to take the challenge and assess their software on the supplied reference human datasets.

Deadline for submissions is Monday, April 25.

 
Tools & Resources



Ten Things You Can Do on the Microsoft Data Science Virtual Machine

TechNet, Machine Learning Blog


from March 01, 2016

In November last year, we announced the availability of the Microsoft Data Science Virtual Machine (DSVM), an operating system image we published in the Azure Marketplace with a host of popular data science tools pre-installed and pre-configured. In January this year, we updated the image to include Microsoft R Server, an enterprise-class analytics platform based on the R language, and added support for Jupyter notebooks for browser-based data exploration in both R and Python.

The Microsoft DSVM offers a powerful development environment for all your data analytics and modeling tasks. The DSVM makes it easy to get started quickly with your data science projects for cloud, on-premises, or hybrid deployments. It can read and write data to and from various Azure data and analytics technologies, including Azure SQL Data Warehouse, Azure Data Lake, HDInsight, Blob Storage, DocumentDB, and Azure Machine Learning.

 

Canvas Offers Anonymized Canvas Network Data to Researchers

PR Newswire, Canvas


from March 01, 2016

Canvas, the leading learning management system created by the company Instructure Inc. (NYSE: INST), today announced it has opened Canvas Network data to researchers. The de-identified data comes from more than 230 massive open online courses (MOOCs) hosted on Canvas Network and represents thousands of learning experiences from around the world.

 

Hungry?

Lab41


from March 01, 2016

Just how data-hungry is deep learning? It is an important question for those of us who don’t have an ocean of data from somewhere like Google or Facebook and still want to see what this deep learning thing is all about. If you have a moderate amount of your own data and your fancy new model gets mediocre performance, it is often hard to tell whether the fault is in your model architecture or in the amount of data that you have. Learning curves and other techniques for diagnosing training-in-progress can help, and much ink has been spilled offering guidance to young deep learners. We wanted to add to this an empirical case study in the tradeoff between data size and model performance for sentiment analysis.

We asked that question ourselves in the course of our work on sunny-side-up, a project assessing deep learning techniques for sentiment analysis (check out our post on learning about deep learning, which also introduces the project). Most real-world text corpora have orders of magnitude fewer documents than, for instance, the popular Amazon Reviews dataset. Even one of the stalwart benchmark datasets for sentiment analysis, IMDB Movie Reviews, has a “mere” tens of thousands of reviews compared to the Amazon dataset’s millions. While deep learning methods have claimed exceptional performance on the IMDB set, some of the top performers are trained on outside datasets. If you were trying to do sentiment analysis on small collections of documents in under-resourced languages like Hausa or Aymara, then 30 million Amazon movie reviews might not be a great analogue.
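The data-size question in the post is easy to probe on your own corpus by training on progressively larger subsets and watching held-out accuracy. Here is a minimal sketch with scikit-learn, where `texts` and `labels` are placeholders for any labeled sentiment corpus (e.g., IMDB reviews) and logistic regression over TF-IDF is an illustrative baseline, not the deep architectures discussed in the post:

```python
# Sketch: a learning curve for text classification as training data grows.
# `texts`/`labels` stand in for any labeled sentiment corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def learning_curve(texts, labels, fractions=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0)):
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.2, random_state=0)
    results = []
    for frac in fractions:
        n = max(1, int(frac * len(X_tr)))
        vec = TfidfVectorizer()                 # refit features on each subset
        X = vec.fit_transform(X_tr[:n])
        clf = LogisticRegression(max_iter=1000).fit(X, y_tr[:n])
        acc = accuracy_score(y_te, clf.predict(vec.transform(X_te)))
        results.append((n, acc))
    return results  # plot n vs. accuracy to see where more data stops helping
```

Where the curve flattens tells you whether to spend effort collecting more data or improving the model.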

 

Training and serving NLP models using Spark MLlib

O'Reilly Media, Michelle Casbon


from March 01, 2016

Identifying critical information in a sea of unstructured data and customizing real-time human interaction are a couple of examples of how clients use our technology at Idibon, a San Francisco startup focused on natural language processing (NLP). The machine learning libraries in Spark ML and MLlib have enabled us to create an adaptive machine intelligence environment that analyzes text in any language, at a scale far surpassing the number of words per second in the Twitter firehose.

Our engineering team has built a platform that trains and serves thousands of NLP models, which function in a distributed environment. This allows us to scale out quickly and provide thousands of predictions per second for many clients simultaneously. In this post, we’ll explore the types of problems we’re working to resolve, the processes we follow, and the technology stack we use. This should be helpful for anyone looking to build out or improve their own NLP pipelines.
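As a concrete, heavily simplified picture of the train-then-serve workflow described here, a minimal Spark ML text pipeline might look like the sketch below. The schema, feature choices, and model are assumptions for illustration, not Idibon’s actual stack.

```python
# Sketch: a Spark ML text-classification pipeline, trained once and persisted
# so a serving layer can load it for predictions. All specifics illustrative.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("nlp-pipeline").getOrCreate()

train = spark.createDataFrame(
    [("good product, works great", 1.0),
     ("terrible, broke after a day", 0.0)],
    ["text", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 16),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(maxIter=20),
])

model = pipeline.fit(train)                    # train...
model.write().overwrite().save("nlp_model")    # ...and persist for serving
```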

 

How to build your own recommendation engine using machine learning on Google Compute Engine

Google Cloud Platform Blog


from March 01, 2016

… There are various components to a recommendation engine, ranging from data ingestion and analytics to machine learning algorithms. In order to provide relevant recommendations, the system must be scalable and able to handle the demands that come with processing Big Data and must provide an easy way to improve the algorithms.

Recommendation engines, particularly the scalable ones that produce great suggestions, are highly compute-intensive workloads. The following features of Google Cloud Platform are well-suited to support this kind of workload.
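For readers who want a starting point, collaborative filtering with alternating least squares (ALS) is the classic core of such an engine, and Spark ships an implementation. A minimal sketch follows; the ratings schema and hyperparameters are assumptions for illustration, not the post’s exact configuration.

```python
# Sketch: collaborative filtering with Spark ML's ALS, the kind of
# compute-heavy model a recommendation engine typically centers on.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recommender").getOrCreate()

ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 2.0), (1, 10, 5.0), (1, 12, 3.0)],
    ["userId", "itemId", "rating"])

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=10, maxIter=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(ratings)

# Top-3 item recommendations per user
model.recommendForAllUsers(3).show(truncate=False)
```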

 
