Data Science newsletter – February 2, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for February 2, 2017

GROUP CURATION: N/A

 
 
Data Science News



The Self-Driving Car’s Bicycle Problem

IEEE Spectrum, Peter Fairley


Robotic cars are great at monitoring other cars, and they’re getting better at noticing pedestrians, squirrels, and birds. The main challenge, though, is posed by the lightest, quietest, swerviest vehicles on the road.

“Bicycles are probably the most difficult detection problem that autonomous vehicle systems face,” says UC Berkeley research engineer Steven Shladover.


Ohio State University helps fuel new research into autonomous vehicles

The Ohio State University, News Room


The path toward a safe and smart autonomous car hit the fast lane Thursday with a $45 million commitment to expand The Ohio State University-affiliated Transportation Research Center. The funds will support research and innovation for autonomous – or driverless – vehicles.

President Michael V. Drake announced the funding with Gov. John Kasich at an event at Ohio State’s Center for Automotive Research. Ohio State will contribute $25 million to TRC, with an additional $20 million coming from the state.


The Internet Is Mostly Bots

The Atlantic, Adrienne LaFrance


Look around you, people of the internet. The bots. They’re everywhere.

Most website visitors aren’t humans, but are instead bots—or, programs built to do automated tasks. They are the worker bees of the internet, and also the henchmen. Some bots help refresh your Facebook feed or figure out how to rank Google search results; other bots impersonate humans and carry out devastating DDoS attacks.

Overall, bots—good and bad—are responsible for 52 percent of web traffic, according to a new report by the security firm Imperva, which issues an annual assessment of bot activity online.


Computer Diagnoses Cataracts as Well as Eye Doctors Can

Live Science, Charles Q. Choi


A new artificial-intelligence system designed to imitate the way the brain handles vision can diagnose a rare eye condition just as well as eye doctors can, a new study shows.

The new system, which focuses on identifying a rare eye condition called congenital cataracts, could also help diagnose other rare diseases someday, the researchers said.

In the study, scientists in China used an artificial neural network named CC-Cruiser. This network was a “convolutional neural network,” meaning it was designed based on the way neurons are organized in the brain region that deals with vision. The scientists wanted to see if neural networks could help address rare diseases, which afflict about 10 percent of the world’s population.
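
A minimal sketch of what a convolutional image classifier looks like in code appears below (in PyTorch). It is not the CC-Cruiser model described in the study; the layer sizes, the 64x64 input resolution, and the two-class output are illustrative assumptions only.

import torch
from torch import nn

class TinyConvNet(nn.Module):
    """Toy convolutional classifier; NOT the CC-Cruiser architecture."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Convolutional layers apply small local filters across the image,
        # loosely analogous to receptive fields in the visual cortex.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # assumes 64x64 input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)          # shape (N, 32, 16, 16) for 64x64 RGB input
        return self.classifier(x.flatten(1))

# Usage: score a batch of four random 64x64 "images" standing in for eye photos.
logits = TinyConvNet()(torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 2])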


IBM’s Watson will help you file your taxes

Engadget, Jon Fingas


Tax experts can find deductions that you might otherwise miss, but they’re only human — they can only find so many potential savings, let alone paint a larger picture of your finances. They’re about to get a helping hand, though. IBM is partnering with H&R Block to make Watson a part of the tax filing process at locations across the US starting on February 6th. After you participate in an initial interview, the artificial intelligence will offer suggestions to Tax Pros (read: experts) looking for deductions, and illustrate the bigger picture for you on a dedicated client screen. Ideally, Watson’s ability to understand context and intent will turn your statements into tangible data that leads to bigger tax breaks.


The Transportation Revolution Is Happening Faster Than You Think

PBS, NOVA Next, Tim De Chant


If it weren’t for the metal canister sprouting from the peak of its roof and the small, black hockey pucks jutting out from the front fenders, you’d probably think this was just another company car out on an errand.

Instead, it’s a relatively low-key introduction of three technologies that could fundamentally change the way we get around—autonomy, ride sharing, and electrification.

The car is one of two that began prowling this remote corner of Boston a few weeks ago, the small test fleet of nuTonomy, a Cambridge-based startup that’s developing an autonomous car-sharing system. The car’s autonomous capabilities are the most obvious of the three technologies and certainly the most captivating. They’re also the linchpin that could enable the other two—ride sharing and electrification—to become part and parcel of mobility in the 21st century.


Computing and the Fight against Epidemics

The Huffington Post, ACM, Alessandro Vespignani


While public health data collected through traditional surveillance systems are still indispensable, a new era of opportunities has been ushered in by the availability of big data streams, such as electronic health records, social media, Internet, mobile phones, and remote sensors. We can even design interactive digital platforms where anyone can contribute information on their own health status, becoming sentinels on the ground who provide earlier warnings about disease outbreaks.

Computing and data sciences are the keys to accessing and harnessing this wealth of “intel,” creating the possibility of real-time acquisition and analysis of highly resolved digital data. A recent special issue of the Journal of Infectious Diseases, which focuses on big data, provides a stunning range of examples. From mining Twitter posts to analyze the flu season to using cell phone data and satellite imagery to understand the population movements driving the dissemination of epidemic diseases, a computational approach would strengthen the usual disease surveillance system and provide public health systems with new lenses on human social behavior.


Patients are about to see a new doctor: artificial intelligence

VentureBeat, Alston Ghafourifar


The World Health Organization has estimated that there is a global shortfall of approximately 4.3 million doctors and nurses, with poorer countries disproportionately impacted. In the U.S., these shortages are less acute; instead, the country struggles with ever-increasing health care costs, which often translate into limits on the time a patient is able to spend with a doctor. One study estimated that U.S. doctors spend on average just 13 to 16 minutes with each patient.

So against this backdrop of a global shortage in doctors and nurses, and cost-driven strains in patient care, let’s take a look at some of the ways AI systems are being evaluated for use in medical care.


In the Age of Trump, Open Data Matters Now More Than Ever

Next City, Neil Kleiman


I have been struck by the national obsession — our new president’s obsession — with facts and which ones are worth proving. There has been much debate about approval ratings, how many people voted, when it rained at the inauguration, how big the crowd was. Frankly, it’s all a little trivial and far removed from actual governance and public policy — from what needs to be done to improve people’s lives.

At the municipal level, a whole other approach to facts has been brewing; let’s think of it as a street-level alternative to recent “alternative facts.” City-generated data points tend to be transparent and explicitly tied to services that citizens care about.

I was reminded of this during a recent conversation with Kate Bender, a manager in Kansas City’s performance management office. Right about when Donald Trump was being inaugurated, her city was also marking a milestone — the fifth anniversary of KCStat.


Reproducible Research Needs Some Limiting Principles

Simply Statistics, Roger Peng


Over the past 10 years of thinking and writing about reproducible research, I’ve come to the conclusion that much of the discussion is incomplete. While I think we as a scientific community have come a long way in changing people’s thinking about data and code and making them available to others, some key sticking points keep coming up that are preventing further progress in the area.

When I used to write about reproducibility, I felt that the primary challenge/roadblock was a lack of tooling. Much has changed in just the last five years, though, and many new tools have been developed to make life a lot easier. Packages like knitr (for R), Markdown, and IPython notebooks have made writing reproducible data analysis documents a lot easier. Web sites like GitHub and many others have made distributing analyses a lot simpler because now everyone effectively has a free web site (this was NOT true in 2005).

Even still, our basic definition of reproducibility is incomplete. Most people would say that a data analysis is reproducible if the analytic data and metadata are available and the code that did the analysis is available. Furthermore, it would be preferable to have some documentation to go along with both. But there are some key issues that need to be resolved to complete this general definition.
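
As a rough illustration of that working definition (analytic data, plus the code, plus some documentation of the environment), here is a minimal Python sketch. The file names, the toy summary-statistics analysis, and the choice of environment metadata are assumptions for illustration, not anything prescribed in the post.

import hashlib
import json
import platform
import sys
from pathlib import Path

import pandas as pd  # assumed dependency for the toy analysis

DATA = Path("analytic_data.csv")  # assumed to exist, with a "value" column

def fingerprint(path: Path) -> str:
    """Checksum the analytic data so readers can confirm they have the same file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def analyze(path: Path) -> dict:
    """The toy 'analysis': summary statistics of one column."""
    df = pd.read_csv(path)
    return {"n": len(df), "mean_value": float(df["value"].mean())}

if __name__ == "__main__":
    record = {
        "data_sha256": fingerprint(DATA),
        "results": analyze(DATA),
        # Minimal environment documentation; a fuller record would pin every package.
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "pandas": pd.__version__,
    }
    Path("analysis_record.json").write_text(json.dumps(record, indent=2))

Recording a checksum of the data alongside the results is one small way to address whether later readers are actually re-running the same analysis.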


[1701.08170] Computational Social Science to Gauge Online Extremism

arXiv, Computer Science > Social and Information Networks; Emilio Ferrara


Recent terrorist attacks carried out on behalf of ISIS on American and European soil by lone wolf attackers or sleeper cells remind us of the importance of understanding the dynamics of radicalization mediated by social media communication channels. In this paper, we shed light on the social media activity of a group of twenty-five thousand users whose association with ISIS online radical propaganda has been manually verified. By using a computational tool known as dynamical activity-connectivity maps, based on network and temporal activity patterns, we investigate the dynamics of social influence within ISIS supporters. We finally quantify the effectiveness of ISIS propaganda by determining the adoption of extremist content in the general population and draw a parallel between radical propaganda and epidemics spreading, highlighting that information broadcasters and influential ISIS supporters generate highly-infectious cascades of information contagion. Our findings will help generate effective countermeasures to combat the group and other forms of online extremism.


San Francisco’s Data Academy Develops a Data-Savvy Workforce

Harvard University, Data-Smart City Solutions, Blake Valenta


On a Wednesday morning, forty San Francisco city employees file into a room and find their seats. Over the next three hours, Joy Bonaguro, Chief Data Officer of the City and County of San Francisco, will walk them through the basics of information design. Matthew Drazba, an analyst from the Treasurer and Tax Collection department, is here because he wants to learn “a different way of communicating analysis.” It’s not that Matthew is poorly trained; four years at Harvard for his undergraduate degree and five and a half years in the Marine Corps taught him plenty about analysis, research, and writing. Yet in his new job he has found it difficult to translate tax codes and tax database relations in a manner others can easily understand. He heard about the information design class from a coworker and hopes it will help.

“Basics of Information Design” is part of Data Academy, a collection of tool- and skill-focused workshops designed for San Francisco employees. Started in early 2014 through a partnership between DataSF (San Francisco’s data office) and the Controller’s Office, Data Academy has grown from two courses to over thirteen. The number of students has also grown from 80 trained in FY2014, to 400 in FY2015, and over 600 in FY2016.


Credit-Card Fraud Keeps Rising, Despite New Security Chips—Study

Wall Street Journal, AnnaMaria Andriotis and Peter Rudegeair


More consumers became victims of identity fraud last year than at any point in more than a decade despite new security protections implemented by the credit-card industry, a report released Wednesday said.

Some 15.4 million U.S. consumers were victims of identity fraud in 2016, resulting in $16 billion in total losses, according to the report by consulting firm Javelin Strategy & Research and identity-theft-protection firm LifeLock Inc. The number of victims rose 18% from 2015 and was the highest since Javelin, a unit of Greenwich Associates LLC, started tracking the phenomenon in 2003.

The increase in identity fraud, the bulk of which comes from card activity, was driven in part by a 15% rise in cases of fraudulent online purchases, the study noted.


Private weather data should not replace basic research

Nature News & Comment, Editorial


It’s more than two decades since a team of US scientists proved that signals from the Global Positioning System (GPS) satellites could be used to gather atmospheric temperature data. The technique, known as radio occultation, depends on precise measurements of delays in the radio signals as they pass through the atmosphere. The first mission, GPS Meteorology, paved the way for the Constellation Observing System for Meteorology, Ionosphere, and Climate (COSMIC-1) in 2006. Radio-occultation data from COSMIC-1 and a handful of follow-on missions have been integrated into government forecasting systems around the globe, and now a trio of private companies is vying to get into the game.

As a result, recent years have seen tension between researchers at the University Corporation for Atmospheric Research (UCAR) in Boulder, Colorado, who want to continue advancing the science of radio occultation, and technology-savvy entrepreneurs, who say they can push the field forward more quickly and cheaply — while making a profit. All parties say that they are marching towards the same goal — to provide high-quality data that could improve weather forecasts. But behind the scenes, both sides have accused the other of sabotage and turf wars. The debate has been counter-productive.


Government Data Science News

The National Institutes of Health put out a request for information on data sharing; PLOS responded, citing “major cultural impediments” to data sharing in academia. If researchers do not get credit for sharing data, is it logical to be mad at them for requesting embargoes to protect their ability to publish first, something for which they do get credit? PLOS’s response, in other words, is honest, measured, and reveals both a depth of understanding and a long-term commitment to institutional change. Statistician Roger Peng, in an otherwise unrelated post, takes up another issue PLOS addresses: how long shared datasets need to be maintained.

Argonne National Laboratory put out an explainer on what’s happening at each of its four exascale computing centers. The four-center, task-centered approach seems likely to lead to a wider range of innovations than the “single best computer in the world!” computational nationalism that we may be seeing with China and Japan.

Los Alamos National Lab released 16 years of GPS solar weather data thanks to the Obama administration. Let’s agree to scrape this from data.gov (search “GPS energetic particles”) and store it locally.
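
For anyone taking up that suggestion, a minimal sketch of the search-and-archive step is below, using the public CKAN search API behind catalog.data.gov; the exact query string, the metadata fields kept, and the output file name are illustrative assumptions.

import json
import requests

CKAN_SEARCH = "https://catalog.data.gov/api/3/action/package_search"

def search_datagov(query: str, rows: int = 50) -> list:
    """Return dataset records matching `query` from the data.gov CKAN catalog."""
    resp = requests.get(CKAN_SEARCH, params={"q": query, "rows": rows}, timeout=30)
    resp.raise_for_status()
    return resp.json()["result"]["results"]

if __name__ == "__main__":
    datasets = search_datagov("GPS energetic particles")
    # Keep a local copy of titles and resource download URLs so the files
    # can still be fetched later, even if the catalog entry moves.
    records = [
        {"title": d.get("title"),
         "resources": [r.get("url") for r in d.get("resources", [])]}
        for d in datasets
    ]
    with open("gps_energetic_particles_catalog.json", "w") as fh:
        json.dump(records, fh, indent=2)
    print(f"Saved metadata for {len(records)} datasets")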

Steve Pierson echoes an appeal first made in The Guardian on January 30 to monitor and guard the integrity of U.S. government statistics. Federal statisticians who lack deep-pocketed support have lost funding (and stature) during past political cycles, but the stakes are higher now.


Astronomers explore uses for AI-generated images

Nature News & Comment, Davide Castelvecchi


Volcanoes, monasteries, birds, thistles: the varied images in Jeff Clune’s research paper could be his holiday snaps. In fact, the pictures are synthetic. They are generated by deep-learning neural networks: layers of computational units that mimic how neurons are connected in the brain.

In recent years, neural networks have made huge strides in learning to recognize and interpret information in pictures, videos and speech. But now, computer scientists such as Clune are turning those artificial-intelligence (AI) systems on their heads to create ‘generative’ networks that churn out their own realistic-seeming information. “I have reached a point in my life where I’m kind of having reality vertigo,” says Clune, who works at the University of Wyoming in Laramie.

Generative systems also give an insight into how neural networks interpret the world, says Kyle Cranmer, a particle physicist and computer scientist at New York University. Although it’s not clear how virtual neurons store and interpret information, the plausibility of the data they generate suggests that they have some handle on the real world.

 
Events



Secret Science Club presents Philosopher, Cognitive Scientist & Author Daniel Dennett



Brooklyn, NY Tuesday, February 7 at 8 p.m., the Bell House (149 7th St. in Gowanus) [free]

HackNYU



New York, Shanghai, Abu Dhabi February 17-19. Register with the downloadable HackNYU mobile app. [free]

New York R Conference



New York, NY April 21-22 at Work-Bench (110 5th Ave, 5th Fl) [$$$]
 
Deadlines



Summer Institute in Computational Social Science

From Sunday, June 18 to Saturday, July 1, the Russell Sage Foundation will sponsor the first Summer Institute in Computational Social Science, to be held at Princeton University. Participation is restricted to Ph.D. students, postdoctoral researchers, and untenured faculty within 7 years of their Ph.D. The deadline for completed applications is Friday, February 24.
 
Tools & Resources



Google will open-source its Earth Enterprise on-premises software in March

VentureBeat, Jordan Novet


Google today announced that in March it will open-source its Google Earth Enterprise software, which lets organizations deploy Google Maps and Google Earth in their on-premises data center infrastructure.


[1701.08954] CommAI: Evaluating the first steps towards a useful general AI

arXiv, Computer Science > Learning; Marco Baroni


With machine learning successfully applied to new daunting problems almost every day, general AI starts looking like an attainable goal. However, most current research focuses instead on important but narrow applications, such as image classification or machine translation. We believe this to be largely due to the lack of objective ways to measure progress towards broad machine intelligence. In order to fill this gap, we propose here a set of concrete desiderata for general AI, together with a platform to test machines on how well they satisfy such desiderata, while keeping all further complexities to a minimum.

 
Careers


Postdocs

Connected Experiences Lab Postdoc



Cornell Tech; New York, NY
