Data Science newsletter – August 16, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for August 16, 2018


Data Science News

Gazing Back at the Surveillance Cameras That Watch Us

The New York Times, Lens blog, Jordan G. Teicher


[James] Bridle, who had been researching surveillance as part of an art residency, had set out to walk the perimeter of the city’s congestion charge zone — an area in central London where motorists must pay to drive — with the goal of documenting as many closed-circuit security cameras as he could find.

When he reached the Grosvenor, about halfway along the 12-mile route, he’d recorded 427. That number included not just the cameras that record the license plates of vehicles driving in the zone, but also private cameras, and other public cameras operated by the local authorities. On subsequent walks through the zone, he captured hundreds more.

Aspirational pursuit of mates in online dating markets

Science Advances; Elizabeth E. Bruch and M. E. J. Newman (h/t Brian Nosek)


Romantic courtship is often described as taking place in a dating market where men and women compete for mates, but the detailed structure and dynamics of dating markets have historically been difficult to quantify for lack of suitable data. In recent years, however, the advent and vigorous growth of the online dating industry has provided a rich new source of information on mate pursuit. We present an empirical analysis of heterosexual dating markets in four large U.S. cities using data from a popular, free online dating service. We show that competition for mates creates a pronounced hierarchy of desirability that correlates strongly with user demographics and is remarkably consistent across cities. We find that both men and women pursue partners who are on average about 25% more desirable than themselves by our measures and that they use different messaging strategies with partners of different desirability. We also find that the probability of receiving a response to an advance drops markedly with increasing difference in desirability between the pursuer and the pursued. Strategic behaviors can improve one’s chances of attracting a more desirable mate, although the effects are modest.

Google DeepMind AI system diagnoses eye diseases and shows its work

STAT, Casey Ross


In eye care, artificial intelligence systems have shown they can match the accuracy of doctors in diagnosing specific diseases. But a new system designed by Google DeepMind and British doctors goes a crucial step further: It can show users how it reached its conclusions.

A study published Monday in Nature Medicine reports that the DeepMind system can identify dozens of diseases and point out the portions of optical coherence tomography scans that it relies upon to make its diagnoses. That’s a crucial factor in validating the safety and efficacy of AI technologies being developed for use in diagnosing or recommending treatments for a broad range of diseases, from cancer to neurological and vision problems.

The paper states that the system made the right referral recommendation in more than 94 percent of cases based on a review of historic patient scans at Moorfields Eye Hospital in London and performed as well as, or better than, top eye specialists who examined the same scans. Experts said that level of accuracy is impressive on such an open-ended query. But the bigger breakthrough is the system’s solution to the so-called “black box” problem of artificial intelligence, which refers to the inability of such systems to explain their thinking.

Rethinking social networks

Carnegie Mellon University, Engineering, News


Recent high profile attempts to manipulate public perception and sentiment via social media have demonstrated that we may not know as much about the formulation and evolution of social networks as we think.

It was this gap in understanding that motivated Radu Marculescu, professor of electrical and computer engineering, to co-author a paper in Nature Scientific Reports outlining a new model for how social networks change and develop over time. The research, conducted in close collaboration with Mihai Udrescu and Alex Topirceanu of the Computer Science Department of the Politehnica University of Timişoara, Romania, proposes what the authors term the Weighted Betweenness Preferential Attachment (WBPA) model.

In modeling social networks, a node represents a single individual, and connections between nodes represent relationships between individuals. Prior models have focused on the number of connections that an individual has, also called the node degree, as the driving force behind a node acquiring new connections.

In contrast, the core of the new WBPA model centers around the notion of “node betweenness.”
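To make the contrast between node degree and node betweenness concrete, here is a minimal sketch on a toy graph. It is an illustration only, not the WBPA model itself: the excerpt does not describe the authors' weighting scheme, so edge weights are omitted, the graph is assumed unweighted and undirected, and all function names are hypothetical. Betweenness is computed with Brandes' algorithm (it counts how many shortest paths between other pairs of nodes pass through a node).

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_weights(adj):
    """Attachment weight proportional to node degree (the classic assumption)."""
    total = sum(len(nbrs) for nbrs in adj.values())
    return {v: len(nbrs) / total for v, nbrs in adj.items()}


def betweenness(adj):
    """Unnormalized shortest-path betweenness via Brandes' algorithm."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph every pair is visited from both ends.
    return {v: b / 2 for v, b in bc.items()}


def betweenness_weights(adj):
    """Attachment weight proportional to betweenness (uniform if all are zero)."""
    bc = betweenness(adj)
    total = sum(bc.values())
    if total == 0:
        return {v: 1 / len(adj) for v in adj}
    return {v: b / total for v, b in bc.items()}


# Toy network: two tightly knit groups of four joined only through node "b".
left = ["l1", "l2", "l3", "l4"]
right = ["r1", "r2", "r3", "r4"]
edges = list(combinations(left, 2)) + list(combinations(right, 2))
edges += [("l1", "b"), ("b", "r1")]
graph = build_graph(edges)

deg = degree_weights(graph)
btw = betweenness_weights(graph)
for node in sorted(graph):
    print(f"{node}: degree weight {deg[node]:.3f}, betweenness weight {btw[node]:.3f}")
```

In this toy network the bridging node "b" has the fewest connections of any node, yet every shortest path between the two groups runs through it, so a betweenness-driven rule would favor it where a degree-driven rule would not.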

Are you liable for misconduct by scientific collaborators? What a recent court decision could mean for scientists

Retraction Watch, Richard Goldstein


Can you commit research misconduct if you fail to detect false data from another scientist?

The answer is yes and here’s how it can happen.

You work in a well-regarded laboratory that receives government funding. You are frequently a principal investigator (PI) and a lead author. The lab suffered from some disorganization so when you took over, you demanded quality work and hired a new lab administrator.

Things are generally good but life in the laboratory is demanding. The size of the lab makes it impossible for you to validate every piece of data. So, you often have to trust that a colleague’s work is reliable and truthful, including from collaborators at other facilities. Funding, as always, is a problem, which means you can’t buy enough equipment and data security software; tracking who did what is difficult. Some lab employees (inherited from your predecessor) have professional or ‘personnel’ issues and you suspect some will leave the laboratory. And of course, there is growing pressure to publish, attend conferences, make new findings, and to keep the funding stream going. There is never enough time.

All of that probably sounds familiar, but here’s where our story takes a turn for the worse.

Vanderbilt Data Science Institute launched

Vanderbilt University, myVU


Vanderbilt has established the Data Science Institute to advance foundational research and data science skills across campus and to leverage the university’s collaborative culture.

The new trans-institutional institute, one of several recommendations outlined in a May 2018 report issued by the Data Science Visions Working Group, will facilitate and promote data-driven research across all Vanderbilt schools and colleges through interdisciplinary partnership.

“This is an exciting moment for Vanderbilt,” Provost and Vice Chancellor for Academic Affairs Susan R. Wente said. “This investment in data science will help spark dramatic advances in academic discovery. We have an opportunity to become a leader in an area that is shaping and reshaping our world daily, and the institute will provide resources and support to our faculty, students and researchers to further Vanderbilt’s impact in this cutting-edge field.”

The United States finally starts to talk about data privacy legislation

MarTech Today, Robin Kurzer


Despite the US Commerce Secretary’s condemnation of Europe’s GDPR in May, there are signs that federal and state governments are starting to take data privacy seriously.

Startup AI Chip Passes Road Test – X86 vet revamps ISA for machine learning

EE Times, Rick Merritt


A startup will sample before June a 13-W machine-learning accelerator for cars, robots, and drones said to handily beat Nvidia GPUs in recognizing images. Visteon is considering using the chip in future automotive systems based on test results on an FPGA version of the device.

AlphaICs designed an instruction set architecture (ISA) optimized for deep-learning, reinforcement-learning, and other machine-learning tasks. The startup aims to produce a family of chips with 16 to 256 cores, roughly spanning 2 W to 200 W.

The market is already getting crowded with AI accelerators from startups and established companies, but money is still flowing into the space because AI represents a historic shift in computing.

Governor Murphy Names Beth Simone Noveck as New Jersey’s First Chief Innovation Officer

State of New Jersey


Deepening his commitment to New Jersey’s innovation economy, Governor Phil Murphy today named New Jersey native Beth Simone Noveck as the State’s first Chief Innovation Officer (CIO). The move further advances the Governor’s promise to spur and expand innovation across the Garden State and within State government.

“To reclaim our innovation economy, we must have fresh, cutting-edge ideas that will not only bring New Jersey into the 21st century but also improve the lives of our nine million residents,” said Governor Murphy. “I am pleased to have Beth Noveck join our team as New Jersey’s first Chief Innovation Officer. Beth is an experienced, high-caliber professional who will make New Jersey a leader in government effectiveness.”

“Governor Murphy is a strong champion for using technology and innovation to seize the opportunities of the future, namely to spur economic growth, educate our children, increase health and wellness, and create new jobs,” said Beth Noveck. “It is an honor to serve in his Administration and help advance these goals for everyone in my home state.”

Universal Method to Sort Complex Information Found

Quanta Magazine, Kevin Hartnett


A team of computer scientists has come up with a radically new way of solving nearest neighbor problems. In a pair of papers, five computer scientists have elaborated the first general-purpose method of solving nearest neighbor questions for complex data.

“This is the first result that captures a rich collection of spaces using a single algorithmic technique,” said Piotr Indyk, a computer scientist at the Massachusetts Institute of Technology and influential figure in the development of nearest neighbor search.
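For readers unfamiliar with the problem, a nearest neighbor question asks which stored item is closest to a query under some notion of distance. The sketch below is the brute-force baseline (a linear scan), not the method from the papers, which the excerpt does not describe; the function names and sample points are hypothetical. It shows that the answer depends on the distance function, which is part of why a single technique covering many kinds of spaces is notable.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.dist(p, q)


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def nearest_neighbor(query, points, distance):
    """Linear scan: return the stored point closest to `query`."""
    return min(points, key=lambda p: distance(query, p))


points = [(3, 3), (0, 5)]
query = (0, 0)

# The two distance functions disagree about which point is nearest.
print(nearest_neighbor(query, points, euclidean))
print(nearest_neighbor(query, points, manhattan))
```

A linear scan examines every stored item for every query, so its cost grows with the size of the dataset; faster methods typically trade exactness or generality for speed.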

Announcing Jeannette Wing as New PI of the Northeast Big Data Innovation Hub

Northeast Big Data Innovation Hub


We’re very pleased to announce today that Jeannette M. Wing will become the Principal Investigator of the Northeast Big Data Innovation Hub.

How a Billion-Dollar Autonomous Vehicle Startup Lost Its Way

Bloomberg BusinessWeek, Hyperdrive, Joshua Brustein and Mark Bergen


Quanergy Systems Inc. found itself in the center of a sudden frenzy over self-driving cars in 2014. It makes lidar, which bounces lasers off objects to help autonomous cars know what’s nearby. That September, the fledgling company announced a partnership with Mercedes-Benz that would put its devices on cars the automaker was using to test autonomous driving features. The deal established an enviable partnership with one of the world’s most prominent auto brands. In January 2015, the two companies showed off a Mercedes E350 sedan outfitted with Quanergy’s lidar devices at the Consumer Electronics Show in Las Vegas.

At the time, the lidar industry was dominated by Velodyne Lidar Inc., which provided the bulky, expensive sensors Google was using on its autonomous cars. Quanergy made the kind of promise Silicon Valley was built on: it’d shrink existing hardware through science, then sell it at a fraction of the price. Quanergy led the pack, as investors poured money into companies describing new techniques for lidar devices. It has raised $160 million to date at a peak valuation of more than $1.5 billion. Last fall, Quanergy began talking to banks about a potential IPO, setting it up to be one of the first public companies to emerge from the wave of firms making tech for autonomous vehicles.

This July, Daimler AG, the parent company of Mercedes, made another announcement. Daimler said it was running a test program for autonomous vehicles on urban streets in the Bay Area, which included a handful of partners, none of which was Quanergy. For lidar, the robo-taxis would use Velodyne. Daimler declined to comment on its relationship with Quanergy.

It was a troubling sign for the lidar industry’s first unicorn—and it wasn’t the only one.

Danny Hughes of the Georgia Institute of Technology and Harvey L. Neiman Health Policy Institute to Establish Health Economics and Analytics Lab

Georgia Tech, News Center


The Georgia Institute of Technology and the American College of Radiology’s Neiman Institute announced a new five-year, $3 million research partnership to establish the Health Economics and Analytics Lab (HEAL) within Georgia Tech’s Ivan Allen College of Liberal Arts. HEAL will focus on applying big data analytics and artificial intelligence to large-scale medical claims databases — with a focus on medical imaging — to better understand how evolving health care delivery and payment models affect patients and providers.

“The HEAL will provide needed research to inform the national medical imaging policy debate and develop new approaches for improving population health,” said Danny R. Hughes, a Georgia Tech professor of economics and executive director of the Neiman Institute, who will lead the lab. “Drawing on Georgia Tech’s unparalleled strength in interdisciplinary research, the HEAL is uniquely positioned to exploit the vast stores of medical data now available to ensure we move toward a sustainable health care system.”

2 CMU computer science professors — including founder of Project Olympus incubator — resign

Pittsburgh Post-Gazette, Courtney Linder


Lenore and Manuel Blum — both longtime professors of computer science at Carnegie Mellon University — have submitted their resignations.

In a Monday morning email blast to staffers in the School of Computer Science, Ms. Blum, founder of the university’s Project Olympus business incubator, made accusations about “professional harassment” and “sexist management” on the school’s Oakland campus over the past three years.

In the email obtained by the Pittsburgh Post-Gazette, she pointed specifically to changes made in recent years under a “new entrepreneurial management structure on campus.”

What a difference a year of data science makes

Harvard Gazette


After a successful first year, the Harvard Data Science Initiative (HDSI) will focus its second on five research themes designed to foster opportunities for collaboration both among Harvard’s Schools and beyond its walls.

Come fall, research activity for the initiative will coalesce around personalized health, evidence-based policy, networks and markets, data-driven scientific discovery, and methodology. HDSI will fundraise to these themes, aligning them to program offerings, research grants, and student and postdoctoral support.


Webinar: Reorganizing Federal Data Agencies

Association of Public Data Users


Online Friday, September 7, starting at 1:30 p.m. EDT. “This June the White House released a proposal that would shift the Bureau of Labor Statistics from the Department of Labor to the Department of Commerce, joining with the Bureau of Economic Analysis and the U.S. Census Bureau under a single economics and statistics agency. Join us for a facilitated conversation with Dr. Ken Poole as he talks with Dr. Nancy Potok, Chief Statistician of the United States at the Office of Management and Budget and Dr. Erica Groshen, former Commissioner of the Bureau of Labor Statistics.”

SVHealth Monthly Meetup: Success with Wearables – How To, What’s Next

Meetup, Silicon Valley Health 2.0


Sunnyvale, CA August 21, starting at 6 p.m., Plug and Play Tech Center. [rsvp required]

An Introduction to Public Data

Columbia University, The Brown Institute for Media Innovation


New York, NY September 6, starting at 9 a.m., Brown Institute for Media Innovation, Columbia University. [registration required]


The call for proposals for #StrataData in San Francisco 2019 is now open.

San Francisco, CA Conference is March 25-28, 2019. Deadline for proposal applications is September 18.

Call for Code – The issue: Natural disaster preparedness and relief. How will you answer the call?

“David Clark Cause is launching Call for Code alongside Founding Partner IBM. This multi-year global initiative is a rallying cry to developers to use their skills and mastery of the latest technologies, and to create new ones, to drive positive and long-lasting change across the world with their code. The inaugural Call for Code Challenge theme is Natural Disaster Preparedness and Relief.” Deadline for submissions is September 28.

The Microsoft AI Idea Challenge – Breakthrough Ideas Wanted!

“The Microsoft AI Idea Challenge is seeking breakthrough AI solutions from developers, data scientists, professionals and students, and preferably developed on the Microsoft AI platform and services.” Deadline for submissions is October 12.

Tools & Resources

TensorFlow 2.0 announced with a new focus on usability, preview coming later this year

Android Police, Ryne Hager


TensorFlow 2.0 will support more (unnamed) platforms compared to the previous release while removing deprecated/duplicated APIs—apparently a source of confusion for some new developers as they learn to use it. This new release is built around TensorFlow’s existing Eager Execution environment, which should make for even easier use and debugging.

Now anyone can train Imagenet in 18 minutes

fast.ai, Jeremy Howard


A team of fast.ai alum Andrew Shaw, DIU researcher Yaroslav Bulatov, and I have managed to train Imagenet to 93% accuracy in just 18 minutes, using 16 public AWS cloud instances, each with 8 NVIDIA V100 GPUs, running the fastai and PyTorch libraries. This is a new speed record for training Imagenet to this accuracy on publicly available infrastructure, and is 40% faster than Google’s DAWNBench record on their proprietary TPU Pod cluster. Our approach uses the same number of processing units as Google’s benchmark (128) and costs around $40 to run.

DIU and fast.ai will be releasing software to allow anyone to easily train and monitor their own distributed models on AWS, using the best practices developed in this project.

Distill Update 2018

Distill, Editors


A little over a year ago, we formally launched Distill as an open-access scientific journal.

It’s been an exciting ride since then! To give some very concrete metrics, Distill has had over a million unique readers, and more than 2.9 million views. Distill papers have been cited 23 times on average. More importantly, we’ve published several new papers with a strong emphasis on clarity and reproducibility, which we think is helping to encourage a new style of scientific communication.

Despite this, there are a couple ways we think we’ve fallen short or could be doing better. To that end, we’ve been reflecting a lot on what we can improve.

Gutenberg, dammit

GitHub – aparrish


“I wanted all of plaintext Project Gutenberg in an easy-to-use format, so I made this”

Better Presentations Cheatsheet

Jon Schwabish, PolicyViz blog


Last week, I published a copy of my Data Visualization Cheatsheet, an introductory summary sheet I provide to people who take my classes and workshops. Today, I’m similarly publishing my Presentation Cheatsheet, a summary sheet I pass around at my Better Presentation Skills workshops.



Postdocs (2) – Zeynep Tufekci

University of North Carolina, School of Information and Library Science; Chapel Hill, NC
