NYU Data Science newsletter – May 18, 2015

NYU Data Science Newsletter features journalism, research papers, events, tools/software, and jobs for May 18, 2015


 
Data Science News



Why Propensity Scores Should Not Be Used for Matching

Gary King


from May 15, 2015

Researchers use propensity score matching (PSM) as a data preprocessing step to selectively prune observations prior to applying a model to estimate a causal effect. The goal of PSM is to reduce imbalance in pre-treatment covariates between the treatment and control groups, thereby reducing the degree of model dependence and potential for bias. Although some applied researchers have combined PSM with various ad hoc procedures and checks to produce useful analyses, we show that the core PSM procedure itself often accomplishes the opposite of what is intended — increasing imbalance, model dependence, and bias. The weakness of PSM is that it approximates a completely randomized experiment, rather than, as with other matching methods, a more powerful fully blocked randomized experiment. PSM is therefore blind to the portion of imbalance that would have been eliminated by also approximating full blocking.
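The pruning step the abstract describes can be illustrated with a minimal sketch. It assumes the propensity scores have already been estimated elsewhere (for example, by a logistic regression of treatment on the covariates). The function name `match_on_propensity`, the `caliper` parameter, and the toy scores are hypothetical and are not taken from the paper.

```python
def match_on_propensity(treated, controls, caliper=0.1):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: dicts mapping a unit id to its (assumed, pre-estimated)
    propensity score. Returns a list of (treated_id, control_id) pairs;
    units left unmatched are pruned from the analysis.
    """
    available = dict(controls)  # controls not yet used in a pair
    pairs = []
    # Visit treated units in order of increasing score.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control, measured on the score alone.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


# Toy scores, invented for illustration only.
treated = {"t1": 0.30, "t2": 0.62, "t3": 0.95}
controls = {"c1": 0.28, "c2": 0.60, "c3": 0.65, "c4": 0.10}
pairs = match_on_propensity(treated, controls)
print(pairs)
```

In this sketch the matcher compares only the scalar score, so two units with similar scores but very different covariate values can still be paired. That is the kind of imbalance the abstract says PSM cannot see.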

 

Toolbox

Nature News & Comment


from May 15, 2015

Welcome to Toolbox: Nature’s hub for scientific software, apps and online tools.

 

To Clean or Not to Clean?

Accenture, Technology blog


from May 14, 2015

Similar to data munging described in my previous post, enterprises must clean their data to derive the best possible insights for the business. Data cleaning is part of the “transform” step in the extract, transform, load (ETL) process. Arguably the most tedious of all the steps in the data manipulation pipeline, cleaning is the act of fixing spelling errors, deleting erroneous white spaces, removing duplicate entries and performing any other action to eliminate inaccurate entries in a data set. With big data containing millions of entries, addressing these issues can be prohibitively time consuming.

What happens if a company decides not to clean its data? Embarrassing mistakes are a best-case scenario; worst case, executives draw incorrect conclusions about the business and act on them.

Thankfully, there are a number of data cleaning programs that, when used correctly, can reduce a month-long project to mere days. This is a fast-growing market, and here are a few new tools that have entered it recently.
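The cleaning actions named in this excerpt (removing stray whitespace, fixing spelling errors, and dropping duplicate entries) can be sketched in a few lines. This is only an illustration under stated assumptions: the function `clean_records`, the spelling lookup table, and the sample records are hypothetical and do not come from the tools the post reviews.

```python
import re


def clean_records(records, spelling_fixes):
    """Return records with whitespace normalized, known misspellings fixed,
    and case-insensitive duplicates and empty entries removed."""
    cleaned, seen = [], set()
    for raw in records:
        # Collapse runs of whitespace and trim both ends.
        text = re.sub(r"\s+", " ", raw).strip()
        # Replace words found in the (assumed) spelling lookup table.
        text = " ".join(spelling_fixes.get(w.lower(), w) for w in text.split(" "))
        key = text.lower()
        # Keep the first occurrence of each non-empty entry.
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned


# Sample data invented for illustration.
records = ["  New  York ", "new york", "Bostn", "Boston", ""]
fixes = {"bostn": "Boston"}
result = clean_records(records, fixes)
print(result)
```

A lookup table only catches misspellings someone has already listed, which is part of why the excerpt calls cleaning tedious at scale.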

 

Elements of Scale: Composing and Scaling Data Platforms

Ben Stopford


from April 28, 2015

… today’s data platforms range greatly in complexity. From simple caching layers or polyglotic persistence right through to wholly integrated data pipelines. There are many paths. They go to many different places. In some of these places at least, nice things are found.

So the aim for this talk is to explain how and why some of these popular approaches work. We’ll do this by first considering the building blocks from which they are composed. These are the intuitions we’ll need to pull together the bigger stuff later on.

 

GeoJSON Hexagonal “Statebins” in R

Bob Rudis, rud.is


from May 14, 2015

There’s been lots of buzz about “statebin” maps of late. A recent tweet by @andrewxhill referencing work by @dannydb pointed to a nice shapefile that ends up being a really great way to handle statebin maps (and I feel like a fool for not considering it for a more generic solution earlier).

 

A First Big Step Toward Mapping the Human Brain

WIRED, Science


from May 14, 2015

… the Allen Institute for Brain Science, a key player in the BRAIN Initiative, has launched a database of neuronal cell types that serves as a first step toward a complete understanding of the brain. It’s the first milestone in the Institute’s 10-year MindScope plan, which aims to nail down how the visual system of a mouse works, starting by developing a functional taxonomy of all the different types of neurons in the brain.

 

Local Company Earns Partnership with Google

Charlottesville Newsplex


from May 15, 2015

Commonwealth Computer Research Inc. specializes in data science and analysis. The Charlottesville company hit the big leagues of technology, landing a partnership with the multibillion-dollar company Google. … Years in the making, Google just released Cloud Bigtable to the public. It allows storage for large amounts of spatial data, which deals with geographic location.

 

New Tool Helps Researchers, Managers Plan for Sea Scallop Fishery in the Future

ScienceNewsline


from May 09, 2015

Sea scallops, one of the most valuable commercial fisheries in the United States, are a well managed and monitored fishery, yet little is known about how changing ocean temperatures and ocean chemistry and other environmental factors could impact the fishery. A new study published May 6 in PLOS ONE describes a new computer model to help inform scallop management discussions and decisions in the coming decades.

Researchers from the Woods Hole Oceanographic Institution (WHOI), NOAA Fisheries’ Northeast Fisheries Science Center (NEFSC), and Ocean Conservancy developed an integrated assessment model (IAM) to reproduce scallop population dynamics, market dynamics and seawater chemistry between 2000 and 2012. Data on actual landings, revenue, biomass, number and scallop size distribution were provided by NEFSC, along with ocean temperature, salinity and other oceanographic information.

 

How Data Became a New Medium for Artists

The Atlantic


from May 14, 2015

A growing number of artists are using data from self-tracking apps in their pieces, showing that creative work is as much a product of its technology as of its time.

 

Inside The Mind That Built Google Brain: On Life, Creativity, And Failure

Huffington Post, HuffPost Tech, Nico Pitney


from May 14, 2015

Here’s a list of universities with arguably the greatest computer science programs: Carnegie Mellon, MIT, UC Berkeley, and Stanford. These are the same places, respectively, where Andrew Ng received his bachelor’s degree, his master’s, his Ph.D., and has taught for 12 years.

Ng is an icon of the artificial intelligence world with the pedigree to match, and he is not yet 40 years old. In 2011, he founded Google Brain, a deep-learning research project supercharged by Google’s vast stores of computing power and data. Delightfully, one of its most important achievements came when computers analyzing scores of YouTube screenshots were able to recognize a cat. (The New York Times’ headline: “How Many Computers to Identify a Cat? 16,000.”) As Ng explained, “The remarkable thing was that [the system] had discovered the concept of a cat itself. No one had ever told it what a cat is. That was a milestone in machine learning.”

 
Events



Astro Hack Week 2015



Astro Hack Week is a week-long summer school / hack week / unconference focused on astrostatistics and data-intensive astronomy.

The mornings will follow a typical summer school format, with lectures and exercises covering essential skills for working effectively with large astronomical datasets. The afternoons will be entirely unstructured, and offer opportunities for collaborative research, breakout sessions on special topics, and application of the concepts covered during the morning sessions. The vision is to provide a space to encourage learning, research, collaboration, and sharing of expertise, for the benefit of young and experienced astronomical researchers alike.

Deadline for Expression of Interest: Monday, June 15

 
CDS News



Kids Put in Danger by Parents’ Online Oversharing

Tom's Guide


from May 12, 2015

Parents will do just about anything to keep their children safe. Ironically, parents may also be the primary reason their children are vulnerable online. A new study suggests that parents’ social-media postings can put young children at risk, due to what seem like fairly innocuous activities. It turns out that compromising children’s safety online doesn’t take a hacker — just a savvy individual and some information that parents are all too willing to provide.

This news comes by way of a research paper entitled “Children Seen but Not Heard: When Parents Compromise Children’s Online Privacy” issued by New York University Polytechnic School of Engineering in Brooklyn. The paper asserts that parents who share their children’s information on social media are unwittingly putting their children and their families at risk by compromising personal privacy and, by extension, security.

 
