NYU Data Science newsletter – August 24, 2016

NYU Data Science Newsletter features journalism, research papers, events, tools/software, and jobs for August 24, 2016


Data Science News

Can big data and AI fix our criminal-justice crisis?

Engadget, A.I. Week

from August 17, 2016

Body cameras and complex algorithms have a lot of potential — and political baggage.

Influential References for Data-Driven Research

Medium, Moore Data, Carly Strasser

from August 23, 2016

… The [Moore] Foundation had about 1,100 researchers apply for the investigator awards, and all of them provided a list of up to five “influential works” as part of their application. This provided quite the treasure trove of data on what data-driven researchers think are the most important works for data science. Moore Fellow Mark Stalzer and DDD Program Director Chris Mentzel recognized the potential importance of this information, and recently published a paper reviewing the works identified by the applicants.

Stalzer and Mentzel found that out of the nearly 5000 references, there were 53 works that were cited at least six times. Their Table 1 lists the 22 works cited 10 or more times, and could be used as a reading list for an intro data science course.

Fundamental structures of dynamic social networks

Proceedings of the National Academy of Sciences; Vedran Sekara, Arkadiusz Stopczynski and Sune Lehmann

from August 23, 2016

We study the dynamic network of real world person-to-person interactions between approximately 1,000 individuals with 5-min resolution across several months. There is currently no coherent theoretical framework for summarizing the tens of thousands of interactions per day in this complex network, but here we show that at the right temporal resolution, social groups can be identified directly. We outline and validate a framework that enables us to study the statistical properties of individual social events as well as series of meetings across weeks and months. Representing the dynamic network as sequences of such meetings reduces the complexity of the system dramatically. We illustrate the usefulness of the framework by investigating the predictability of human social activity.

AI’s Research Rut

MIT Technology Review, Olga Russakovsky

from August 23, 2016

When you picture AI, what do you see? A humanoid robot? When you think about a real-world application of AI, what comes to mind? Probably autonomous driving. When you think about the technical details of AI, what approach do you name? I’m willing to bet it’s deep learning.

In reality AI comes in many shapes and forms. AI machines go far beyond humanoid robots; they range from software detecting bullying on social media to wearable devices monitoring personal health risk factors to robotic arms learning to feed paralyzed people to autonomous robots exploring other planets. The potential applications of AI are limitless: personalized education, elderly assistance, wildlife behavior analysis, medical-record mining, and much more.

Spreadsheet software defaults damage science

Genome Biology; Mark Ziemann, Yotam Eren and Assam El-Osta

from August 23, 2016

As I tweeted earlier this week, Excel is bad for science. Libre Office and most spreadsheet applications are no better. The issue this time? “Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.” [full text]
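
A scan of this kind can be sketched in a few lines of Python. This is only an illustration, not the authors' actual script: the two patterns, the helper name `find_suspect_entries`, and the sample values are assumptions, chosen to match the day-month form (such as "2-Sep" for SEPT2) and the scientific-notation form (such as "2.31E+13") that converted identifiers commonly take.

```python
import re

# Assumed pattern: day-month strings such as "2-Sep" or "1-Mar", the form a
# spreadsheet may display after auto-converting symbols like SEPT2 or MARCH1.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(rf"^\d{{1,2}}-(?:{MONTHS})$", re.IGNORECASE)

# Assumed pattern: scientific notation such as "2.31E+13", which purely
# alphanumeric identifiers containing an "E" can be turned into.
FLOAT_LIKE = re.compile(r"^\d+(?:\.\d+)?E[+-]?\d+$", re.IGNORECASE)


def find_suspect_entries(gene_column):
    """Return entries that look like dates or floats rather than gene symbols."""
    suspects = []
    for entry in gene_column:
        value = entry.strip()
        if DATE_LIKE.match(value) or FLOAT_LIKE.match(value):
            suspects.append(value)
    return suspects


# Hypothetical column read from a supplementary gene list.
column = ["TP53", "2-Sep", "1-Mar", "2.31E+13", "BRCA1"]
print(find_suspect_entries(column))
```

A flagged entry is only a candidate for manual review; the original symbol cannot be recovered from the converted value alone.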

Big data and hidden cameras are emerging as dangerous weapons in the gentrification wars

Quartz, Dia Kayyali

from August 23, 2016

The gentrification wars have a dangerous new weapon: invasive surveillance technology.

Earlier this summer, the Washington Post wrote about a disturbing tenant-screening software service called Tenant Assured. The service, provided by London startup Score Assured, scans the LinkedIn, Instagram, Twitter, and Facebook accounts of prospective tenants to create a “comprehensive” personality profile and risk score. The software tracks prospective tenants’ use of keywords like “poor” or “loan,” as well as activities such as frequent check-ins at bars. Using such information, the company boasts that it can highlight the top five personality traits of a potential tenant as well as any risks, offering features such as a “new to country alert.”

You’re Being Tracked (and Tracked and Tracked) on the Web

IEEE Spectrum

from August 23, 2016

The number of third parties sending information to and receiving data from popular websites each time you visit them has increased dramatically in the past 20 years, which means that visitors to those sites may be more closely watched by major corporations and advertisers than ever before, according to a new analysis of Web tracking.

A team from the University of Washington reviewed two decades of third-party requests by using Internet Archive’s Wayback Machine. They found a four-fold increase in the number of requests logged on the average website from 1996 to 2016, and say that companies may be using these requests to more frequently track the behavior of individual users. They presented their findings at the USENIX Security Conference in Austin, Texas, earlier this month.
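
To make "third-party request" concrete, here is a minimal sketch that counts the distinct outside sites contacted while loading a page. It is not the University of Washington team's method: the function names and example URLs are hypothetical, and treating the last two labels of a hostname as the "site" is a deliberate simplification that breaks on suffixes such as co.uk.

```python
from urllib.parse import urlparse


def naive_site(url):
    """Approximate a site by the last two labels of the hostname (assumption)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def third_party_sites(page_url, request_urls):
    """Return the set of sites requested by a page that differ from its own."""
    first_party = naive_site(page_url)
    requested = {naive_site(u) for u in request_urls}
    # Drop the page's own site and any URL without a usable hostname.
    return requested - {first_party, ""}


# Hypothetical requests logged while loading one archived page.
requests_seen = [
    "https://cdn.example.com/app.js",
    "https://tracker.example.net/pixel.gif",
    "https://ads.example.org/banner",
]
print(len(third_party_sites("https://www.example.com/", requests_seen)))
```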

Chicago’s Experiment in Predictive Policing Isn’t Working

MIT Technology Review

from August 19, 2016

A new report suggests that a data-driven tool meant to reduce gun violence was ignored by police and, in a few cases, may have been misused.

[1608.05878] The ground truth about metadata and community detection in networks

arXiv, Computer Science > Social and Information Networks; Leto Peel, Daniel B. Larremore, Aaron Clauset

from August 20, 2016

Across many scientific domains, there is a common need to automatically extract a simplified view or a coarse-graining of how a complex system’s components interact. This general task is called community detection in networks and is analogous to searching for clusters in independent vector data. It is common to evaluate the performance of community detection algorithms by their ability to find so-called ground truth communities. This works well in synthetic networks with planted communities because such networks’ links are formed explicitly based on the planted communities. However, there are no planted communities in real world networks. Instead, it is standard practice to treat some observed discrete-valued node attributes, or metadata, as ground truth. Here, we show that metadata are not the same as ground truth, and that treating them as such induces severe theoretical and practical problems. We prove that no algorithm can uniquely solve community detection, and we prove a general No Free Lunch theorem for community detection, which implies that no algorithm can perform better than any other across all inputs. However, node metadata still have value and a careful exploration of their relationship with network structure can yield insights of genuine worth. We illustrate this point by introducing two statistical techniques that can quantify the relationship between metadata and community structure for a broad class of models. We demonstrate these techniques using both synthetic and real-world networks, and for multiple types of metadata and community structure.

Programmable network routers

MIT News

from August 23, 2016

Hardware is an occasionally overlooked area in data science, possibly due to our disciplinary distance from electrical engineering. Programmable routers allow for algorithm updates which are typically “hardwired into the routers’ circuitry…[so] if someone develops a better algorithm, network operators have to wait for a new generation of hardware before they can take advantage of it.”

Melding Mind and Machine

Society for Neuroscience, BrainFacts.org

from August 02, 2016

In 2015, Erik Sorto did something he thought he’d never do again: he raised a bottle of beer to his lips and took a drink. Paralyzed below the sternum for more than a decade after a gunshot damaged his spinal cord, Sorto didn’t use his own hand to hold his beverage; instead, he controlled a robotic arm with his own thoughts.

Want More Accurate Polls? Maybe Ask Twitter

WIRED, Culture

from August 22, 2016

In a Public Policy Polling survey, quite a few Texans say they’ll vote for Harambe for president in November. If you haven’t looked at the Internet in a while, Harambe was a gorilla fatally shot by a zookeeper after a toddler fell into his pen, but he’s more than that. He’s a meme, and his candidacy in Texas represents the voice of the Internet insinuating its way into polling. It’s silly, but it’s actually a sign of positive change.

Traditional polling methods aren’t working the way they used to. Upstart analytics firms like Civis and conventional pollsters like PPP, Ipsos, and Pew Research Institute have all been hunting for new, more data-centric ways to uncover the will of the whole public, rather than just the tiny slice willing to answer a random call on their landline. The trending solution is to incorporate data mined from the Internet, especially from social media. It’s a crucial, overdue shift. Even though the Internet is a cesspool of trolls, it’s also where millions of Americans go to express opinions that pollsters might not even think to ask about.

Events

Geohackweek

Seattle, WA “Geohackweek is a 5-day workshop to be held at the University of Washington eScience Institute. Participants will learn about open source technologies used to analyze geospatial datasets.” Monday-Friday, November 14-18

Deadlines

Seeking Public Input on the HHS Open Government Plan for 2016–2018

deadline: Survey

“Every two years, we’ve worked across all corners of HHS to coordinate our strategies for making government more open. Earlier this summer, we called out for your ideas on getting our plan started. Today, we’re back in the village square to engage you once again and invite you into our open government plan.”

The deadline to contribute comments is Friday, September 9.

EMNLP 2016 – Joint Call for Student Scholarship Applications and Student Volunteers

deadline: Conference

Austin, TX The 2016 Conference on Empirical Methods in Natural Language Processing will be held on Tuesday-Friday, November 1-4. Applicants for either the Student Volunteer Program or the Student Scholarship Program must be full-time students.

Deadline to apply is Tuesday, September 13, 2016

CDS News

Big Data is a junkyard.

Medium, Bruno Goncalves

from August 23, 2016

Our Moore-Sloan Fellow Bruno Goncalves reminds us of the composite nature of individuals when drawn from multi-platform social media data: “Every user shows only a piece of himself by using [particular platforms], but by carefully analyzing large amounts of users one might get a fuller picture of human behavior”.

Tools & Resources

[1608.05742] Extending the OpenAI Gym for robotics: a toolkit for reinforcement learning using ROS and Gazebo

Computer Science > Robotics; Iker Zamora et al.

from August 19, 2016

This paper presents an extension of the OpenAI Gym for robotics using the Robot Operating System (ROS) and the Gazebo simulator. The content discusses the software architecture proposed and the results obtained by using two Reinforcement Learning techniques: Q-Learning and Sarsa. Ultimately, the output of this work presents a benchmarking system for robotics that allows different techniques and algorithms to be compared using the same virtual conditions.
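
The two techniques named in the abstract differ only in how they estimate the value of the next state. The sketch below shows the standard tabular update rules under assumed settings; the dictionary-based Q-table, the state and action labels, and the `alpha` and `gamma` defaults are illustrative and are not taken from the paper's toolkit.

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap from the best action available in s_next."""
    best_next = max(q[s_next].values())
    q[s][a] += alpha * (r + gamma * best_next - q[s][a])


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap from the action actually chosen in s_next."""
    q[s][a] += alpha * (r + gamma * q[s_next][a_next] - q[s][a])


def fresh_table():
    # Hypothetical two-state Q-table with actions "L" and "R".
    return {0: {"L": 0.0, "R": 0.0}, 1: {"L": 1.0, "R": 2.0}}


q1 = fresh_table()
q_learning_update(q1, s=0, a="R", r=1.0, s_next=1)

q2 = fresh_table()
sarsa_update(q2, s=0, a="R", r=1.0, s_next=1, a_next="L")

# The estimates differ because Sarsa follows the chosen next action ("L"),
# while Q-Learning assumes the greedy one ("R").
print(q1[0]["R"], q2[0]["R"])
```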

WPRDC data guide – Protecting Privacy

Western Pennsylvania Regional Data Center

from May 24, 2016

Open data publishing requires policies, procedures, and actions to protect individuals and organizations from unintentional sharing of personally identifiable information (PII). In some cases, sharing data including PII can cause people serious harm. The proliferation of digital public records, data contained in social media, inadequate policies for managing and protecting PII, and inconsistent data management practices increase the likelihood that PII can be combined across multiple sources to uniquely identify an individual. This section of the guide is designed to help publishers identify PII, and take steps to minimize the risk and harm of a breach. … As a final check, we ask all publishers to mark a dataset as “private” the first time it is shared.

Hadoop: What you need to know

O'Reilly Media, Donald Miner

from August 23, 2016

This report is written with the enterprise decision maker in mind. The goal is to give decision makers a crash course on what Hadoop is and why it is important. Hadoop technology can be daunting at first and it represents a major shift from traditional enterprise data warehousing and data analytics. Within these pages is an overview that covers just enough to allow you to make intelligent decisions about Hadoop in your enterprise.

I Don’t Need No Stinking API – Web Scraping in 2016 and Beyond

Francis Kim

from August 24, 2016

I’m going to share everything that I’ve learnt to date from my recent love affair with Selenium automation/scraping/crawling. The purpose of this post is to illustrate some of the techniques I’ve created which I haven’t seen published anywhere else – as a broader, applicable idea to be shared around and discussed by the webdev community.

Lime: Explaining the predictions of any machine learning classifier

GitHub – marcotcr

from August 19, 2016

This project is about explaining what machine learning classifiers (or models) are doing. At the moment, we support explaining individual predictions for text classifiers or classifiers that act on tables (numpy arrays of numerical or categorical data), with a package called lime (short for local interpretable model-agnostic explanations).

Careers

Tenured and tenure track faculty positions

Associate Professor (2 openings) Population Health and Labor Demography

Max Planck Institute for Demographic Research; Rostock, Germany
