NYU Data Science newsletter – April 26, 2016

NYU Data Science Newsletter features journalism, research papers, events, tools/software, and jobs for April 26, 2016

Data Science News

Your Media Business Will Not Be Saved

Medium, Joshua Topolsky

from April 25, 2016

… What’s The Problem, you ask? The Problem is that we used to have a really neat and tidy version of a media business where very large interests controlled vast swaths of the things we read, watched, and listened to. Because that system was built on the concept of scarcity and locality (the limits of what was physically possible), it was very easy to keep the gates and fill the coffers. Put simply, there were far fewer players in the game with far fewer outlets for their content, so audiences were easy to sell to and easy to come by.

Then digital. Then you and me. And all of a sudden all those old, fixed channels started falling apart. Papers didn’t sell. Magazines died. Networks scrambled. Local news meant a lot less. Local papers even less than that. Suddenly a lot more free stuff was available online, and anyone could start a blog! But the media industry is a hulking, stupid, slow-moving beast that has little awareness of its threats and surrounding environs. I’m skipping over a few parts, but by and large the industry responded to the promise (or threat, as they treated it) of digital by ignoring it or denying it. So instead of the content creators and advertisers who paid them shifting their attention and understanding of user value towards the future (digital everything), they kept plugging away at the old system. Basically: it was really hard for them to figure out the internet, and all of the money (like subscriber dollars) was still going to traditional outlets.

5 Ways Zuckerberg Attracts And Keeps Talent – Facebook

Seeking Alpha

from April 25, 2016

In light of the high-profile hire of Regina Dugan, it’s relevant to highlight one of Facebook’s hidden competitive advantages – its ability to attract and retain talent. Dugan’s hire is of importance since she was an executive at Google, reportedly one of the best companies in the world to work for. Facebook has consistently attracted top talent, and has held on to many of its early employees who helped shape the company into what it is today. Mark Zuckerberg has created a corporate climate that employees seem to thoroughly enjoy, which is essential to reducing turnover. A report by Payscale discovered that 96% of Facebook’s 13,000 employees report high job satisfaction.

Below are five reasons as to why Facebook is so successful in attracting and retaining talent.

1. Its CEO is its Chief Recruiter

Markets Think Trump Has a Shot (not Cruz)

PredictWise

from April 24, 2016

There are plenty of reasons why you cannot just divide the probability of victory in the general election by the probability of victory in the primary and get the conditional probability of a candidate winning the general election given their nomination. First, these values are both imprecise. Second, things change conditional on nomination (i.e., there is a complicated relationship between these two estimates). Third, candidates do not technically need to be the party nominee to win the general (although it is nearly 0% that any candidate wins the general should they not be a major party nominee). I could go on, but this simple rubric is suggestive of something.

The markets have been extremely consistent that Donald Trump is more likely to win the general, conditional on getting the nomination, than Ted Cruz. (The missing data in late February is when Cruz dropped below 5% to get the nomination.)
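The "simple rubric" described above can be sketched in a few lines of Python. This is only an illustration of the naive ratio, not PredictWise's method: the function name and the probabilities below are hypothetical placeholders, and the ratio assumes a candidate can only win the general as the nominee, which the excerpt notes is not strictly true.

```python
def naive_conditional(p_general: float, p_nomination: float) -> float:
    """Naive estimate of P(win general | nominated) = P(general) / P(nomination).

    Assumes winning the general implies having been nominated, so the
    joint probability equals p_general. Both inputs are noisy market
    estimates, so the result is suggestive at best.
    """
    if not 0.0 < p_nomination <= 1.0:
        raise ValueError("p_nomination must be in (0, 1]")
    if not 0.0 <= p_general <= 1.0:
        raise ValueError("p_general must be in [0, 1]")
    # Imprecise inputs can push the ratio above 1, so cap it.
    return min(p_general / p_nomination, 1.0)


# Hypothetical market prices, NOT figures from the article:
# (probability of winning the general, probability of winning the nomination)
hypothetical_markets = {
    "Candidate A": (0.12, 0.60),
    "Candidate B": (0.03, 0.30),
}

for name, (p_gen, p_nom) in hypothetical_markets.items():
    print(f"{name}: {naive_conditional(p_gen, p_nom):.2f}")
```

The guard on `p_nomination` matters in practice: when a candidate's nomination probability gets very small, the ratio becomes unstable, which is presumably why such stretches are left out of a chart.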

Placing Authors at the Center of the Scientific Endeavor

The Official PLOS Blog

from April 19, 2016

For some time now we have turned our attention to the core of our organization: how we work with our authors and how our authors work together. Our forthcoming manuscript submission system is the result of improvements we have made both technically and in how we here at PLOS work together. For more details on this read the PLOS Tech blog, A Tech Framework for Innovations in Open Science, by PLOS Chief Technology Officer CJ Rayhill.

To honor and connect our roots in the Open Access movement to the exciting Open Science era ahead, we chose the name Aperta™ for our new submission system. Aperta means Open in Italian and brings with it the association of forthcoming and fairness, qualities that PLOS strives to bring to the process of publishing scientific research.

Race For AI: Google, Facebook, Amazon, Apple Grab Artificial Intelligence Startups

CB Insights

from April 10, 2016

Over 60% of the AI companies acquired in the last 3 years had VC backing. There have been 4 major acquisitions already in 2016.

How Big Data Creates False Confidence

Nautilus, Jessie Dunietz

from April 23, 2016

If I claimed that Americans have gotten more self-centered lately, you might just chalk me up as a curmudgeon, prone to good-ol’-days whining. But what if I said I could back that claim up by analyzing 150 billion words of text? A few decades ago, evidence on such a scale was a pipe dream. Today, though, 150 billion data points is practically passé. A feverish push for “big data” analysis has swept through biology, linguistics, finance, and every field in between.

Although no one can quite agree how to define it, the general idea is to find datasets so enormous that they can reveal patterns invisible to conventional inquiry. The data are often generated by millions of real-world user actions, such as tweets or credit-card purchases, and they can take thousands of computers to collect, store, and analyze. To many companies and researchers, though, the investment is worth it because the patterns can unlock information about anything from genetic disorders to tomorrow’s stock prices.

But there’s a problem: It’s tempting to think that with such an incredible volume of data behind them, studies relying on big data couldn’t be wrong.

Innovation Analytics: A guide to new data and measurement in innovation policy

Nesta, UK

from April 25, 2016

Good innovation policy requires accurate, reliable and timely data, but getting this data is becoming harder as change in the economy speeds up.

We have been working with new data sources and methods to address this gap between the reality of innovation, as it happens, and our ability to measure it. Our ultimate goal is to turn this into information for effective use by policymakers, and to drive decision making from an evidence-based position.

Google to encourage employee startups

San Jose Mercury News

from April 24, 2016

Add founding a new company to the list of things Google employees can do without leaving the office. The Internet search giant, which already famously offers on-site perks such as all-you-can-eat snacks, massages and fitness centers, is reportedly launching a new startup incubator that lets entrepreneurially minded employees pursue their dreams without leaving the mother ship. Dubbed “Area 120,” the incubator will be based in one of Google’s new San Francisco buildings, according to tech news site The Information, which cited anonymous “people familiar with the project.”

Also:

Race For AI: Google, Facebook, Amazon, Apple Grab Artificial Intelligence Startups (April 10, CB Insights)

Google believes its superior AI will be the key to its future (April 21, The Verge)

The Adequacy of Individual Hospital Data to Identify High Utilizers and Assess Community Health

JAMA, Research Letter; David Horrocks et al.

from April 25, 2016

This study examined the patterns of hospital use of all patients with more than 5 emergency department visits to Maryland hospitals in 2014 to determine whether high utilizers can be identified by data from individual hospitals.

Many hospitals are analyzing data from their own information systems to develop new strategies to improve population health. Such analyses aim to identify high utilizers who may benefit from additional support and to elucidate community patterns of potentially preventable serious illness, but the extent to which individual hospital data are sufficient for these purposes is unclear.
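To make the question concrete, here is a minimal Python sketch of why a single hospital's records might undercount high utilizers. The visit records, patient IDs, and hospital IDs are invented for illustration and do not come from the study; the threshold follows the excerpt's "more than 5 emergency department visits."

```python
from collections import Counter

THRESHOLD = 5  # "more than 5 emergency department visits" per the excerpt

# Hypothetical visit log: one (patient_id, hospital_id) tuple per ED visit.
visits = (
    [("p1", "H1")] * 3 + [("p1", "H2")] * 3   # 6 visits split across two hospitals
    + [("p2", "H1")] * 6                       # 6 visits, all at one hospital
    + [("p3", "H2")] * 2                       # 2 visits, not a high utilizer
)


def high_utilizers(records, threshold=THRESHOLD):
    """Return the set of patients with more than `threshold` visits in `records`."""
    counts = Counter(patient for patient, _ in records)
    return {patient for patient, n in counts.items() if n > threshold}


def high_utilizers_seen_by(records, hospital, threshold=THRESHOLD):
    """Same count, but using only the visits a single hospital can see."""
    own_records = [r for r in records if r[1] == hospital]
    return high_utilizers(own_records, threshold)


statewide = high_utilizers(visits)
seen_by_h1 = high_utilizers_seen_by(visits, "H1")
missed_by_h1 = statewide - seen_by_h1

print("statewide:", sorted(statewide))
print("visible to H1 alone:", sorted(seen_by_h1))
print("missed by H1:", sorted(missed_by_h1))
```

In this toy data, a patient whose visits are spread across hospitals crosses the threshold only when records are pooled, which is the kind of gap the research letter sets out to measure.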

How are companies harnessing APIs for healthcare?

MedCity News

from April 25, 2016

One of the first things a new developer in healthcare quickly realizes is integrating and launching your application on top of medical data can be a total nightmare and often a showstopper. There are hundreds of electronic medical record (EMR) vendors and every implementation has a different “flavor” of a handful of competing standards.

That is all about to change. There is a growing movement emerging in Health IT to finally begin standardizing the way applications can access data. The preferred method is through what is called an Application Programming Interface, or API.

APIs have been notoriously absent in Health IT until very recently. This year has seen a surge of enthusiasm and support for APIs specifically around facilitating access to Electronic Medical Record (EMR) systems. Some of these initiatives have been driven by the EMR vendors themselves, and others are cross-vendor collaborations such as HL7 FHIR.

New telescopes will search for signs of life on distant planets

Science News

from April 19, 2016

NASA’s Transiting Exoplanet Survey Satellite, or TESS, will launch in 2017 on a quest to detect many of the exoplanets that orbit the stars closest to us. One year later, the James Webb Space Telescope will launch and peek inside some of these newfound atmospheres. With their powers combined, TESS and James Webb could identify nearby planets that are good candidates for life. These worlds will probably be quite different from Earth — they’ll be a bit larger and orbit faint, red suns — but some researchers hope that a few will offer hints of alien biology.

Exponential growth of R’s open source community threatens commercial competitors

TechRepublic, Matt Asay

from April 21, 2016

With more than 2 million users and developers, how can proprietary vendors stand against the R programming language and software environment’s open source community?

Events

MLTalks Series: Helen Margetts in Conversation with Ethan Zuckerman

How does the changing use of social media affect politics? In her recent book, Political Turbulence, Helen Margetts and colleagues Peter John, Scott Hale and Taha Yasseri show how social media are now inextricably intertwined with the political behavior of ordinary citizens, and exert an unruly influence on the political world. … In this talk, Professor Margetts will discuss the implications of these findings both for political science research and the future of the modern state.

Boston, MA. Tuesday, May 3, starting at 2 p.m., MIT Media Lab, 3rd Floor Atrium

Workshop III: Cultural Patterns: Multiscale Data-driven Models (Schedule)

The proliferation of cultural data has given data-driven approaches a significant edge in modeling various cultural phenomena. This workshop focuses on such approaches that make use of mathematical tools in machine learning, data mining, network science, and computational social science. We are particularly interested in presenting methods, both normative and descriptive, that offer a gestalt or structure-first approach to culture analysis and that provide a multi-layered summarization of these phenomena suitable for exploration at multiple scales. These models are applied to various datasets such as social and information networks, social media, narrative and story detection in texts, group dynamics or behavior, and collaboration and competition leading to emergent behavior.

This workshop will include a poster session; a request for posters will be sent to registered participants in advance of the workshop.

Los Angeles, CA. Monday-Friday, May 9-13, at UCLA IPAM

SIGIR 2016 Industry Track: Invited Speakers and Accepted Papers

Invited Speakers: Hadar Shemtov (Google) and Debora Donato (StumbleUpon).

Accepted Papers listed.

SIGIR 2016 is Sunday-Thursday, July 17-21, in Pisa, Italy.

Deadlines

Call for Software Carpentry Foundation Subcommittees and Task Forces

The 2016 Steering Committee would like to encourage and invite the community to propose new initiatives in the form of subcommittees (a standing group for ongoing activities) and task forces (an ad hoc group focused on a finite task). Read on to learn more about existing initiatives and for information about how to propose new initiatives that will shape our community.

BITSS and BIDS Collaboration: Call for Reproducible Workflows

BITSS and the Reproducibility Working Group at the Berkeley Institute for Data Science are collaborating on an edited volume of reproducible workflows in the social sciences, and we are looking for submissions.

Also, see Call for Reproducibility Workflows by Cyrus Dioun and Garret Christensen at the Bad Hessian blog.

Call for Abstracts – The 2016 Conference on Complex Systems

The Conference on Complex Systems (CCS) has been a major venue for the Complex Systems community since 2003. After last year's success in the USA, we are now back in Europe. Amsterdam CCS 2016 will be the major international conference and event for complex systems and interdisciplinary science.

Amsterdam, The Netherlands. Deadline for abstract submissions is Sunday, May 15.

Complex Networks 2016

The International Workshop on Complex Networks and their Applications aims at bringing together researchers from different scientific communities working on areas related to complex networks.

Two types of contributions are welcome: theoretical developments arising from practical problems, and case studies where methodologies are applied. Both contributions are aimed at stimulating the interaction between theoreticians and practitioners.

University of Milan, Milan, Italy. Deadline for submissions is Monday, September 5.

Tools & Resources

15 Must Read Books for Entrepreneurs in Data Science

AnalyticsVidhya

from April 25, 2016

… The books listed below give immense knowledge and motivation in the technology arena. Reading these books will give you the chance to live many different entrepreneurial lives. Take them one by one. Don’t get overwhelmed. I’ve displayed a mix of technical and motivational books for entrepreneurs in data science.

Join Our Community – NumFOCUS | Open Code = Better Science

NumFOCUS

from April 25, 2016

Show your support for the NumFOCUS mission by becoming a Community Member — it’s free to join!

Or, Supporting Members provide funding that is crucial to the success of NumFOCUS projects and programs! These individuals enjoy the right to vote on certain NumFOCUS actions, receive a 20% discount on PyData conferences, and are eligible for special discounts on products from our supporting partners and sponsors.

Sequence-to-sequence model with LSTM encoder/decoders and attention

GitHub – harvardnlp

from April 24, 2016

Torch implementation of a standard sequence-to-sequence model with attention where the encoder-decoder are LSTMs. Also has the option to use characters (instead of input word embeddings) by running a convolutional neural network followed by a highway network over character embeddings to use as inputs.

Keras as a simplified interface to TensorFlow: tutorial

The Keras Blog

from April 24, 2016

If TensorFlow is your primary framework, and you are looking for a simple & high-level model definition interface to make your life easier, this tutorial is for you.

Keras layers and models are fully compatible with pure-TensorFlow tensors, and as a result, Keras makes a great model definition add-on for TensorFlow, and can even be used alongside other TensorFlow libraries. Let’s see how.

Druid Query Optimization with FIFO: Lessons from Our 5000-Core Cluster

Metamarkets, Charles Allen

from April 25, 2016

A major strength of using Druid as a data store and aggregation engine is its ability to scale horizontally. Whenever more data is in the system, or whenever faster compute times are desired, it is simply a matter of throwing more hardware at the problem, and Druid auto-detects and auto-balances its workloads. At Metamarkets we are currently ingesting over 3M events/second (replicated) into our Druid cluster and have multiple hundreds of historical nodes serving this data across multiple tiers. … the balancing algorithms in Druid are not perfect, which means segments will not be perfectly balanced across the cluster, but will usually be “good enough” for most use cases. As such, many clusters will end up with some degree of over-committing cores relative to the number of data segments that need to be scanned. This leads to an interesting aspect of Druid’s processing queue. The release of Druid 0.9.0 adds the feature flag druid.processing.fifo. Let’s take a look at where this flag comes from and how it should be used.
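For readers unfamiliar with the idea, the Python sketch below shows one common way to make a priority queue serve equal-priority work in arrival order. It is not Druid's implementation, and it rests on the assumption that druid.processing.fifo is about FIFO ordering among tasks of the same priority; the class and task names are hypothetical.

```python
import heapq
import itertools


class FifoPriorityQueue:
    """Priority queue that breaks ties between equal priorities in arrival order.

    Illustrative only: a generic technique, not Druid's code.
    """

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # monotonically increasing tie-breaker

    def put(self, priority, task):
        # heapq is a min-heap, so negate priority to pop the highest first.
        # The arrival counter orders equal-priority tasks first in, first out.
        heapq.heappush(self._heap, (-priority, next(self._arrival), task))

    def get(self):
        _, _, task = heapq.heappop(self._heap)
        return task

    def __len__(self):
        return len(self._heap)


queue = FifoPriorityQueue()
for name, priority in [("scan-a", 0), ("scan-b", 0), ("urgent", 5), ("scan-c", 0)]:
    queue.put(priority, name)

order = [queue.get() for _ in range(len(queue))]
print(order)
```

Without the arrival counter, a plain heap gives no ordering guarantee among equal priorities, so segment scans belonging to an early query could be interleaved with or delayed behind later ones on an over-committed node.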

Careers

Senior SDE, Microsoft FUSE Labs

Microsoft

Jobs · Hammer Lab — Operations Lead

Icahn Institute at Mount Sinai, Hammer Lab

Big Data: Cloud Computing Juggernauts Salesforce, Amazon, Google, Are Snapping Up MBAs

BusinessBecause

Postdoctoral Scholar in Computational Genomics University of Chicago

University of Florida Statistics Department, Statistics Jobs
