Data Science newsletter – May 22, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for May 22, 2018


Data Science News

We Asked the Guy Behind Is the L Train Fucked if the L Train Is Fucked

VICE, John Surico


Jonathan Vingiano’s simple website has told hundreds of thousands of New Yorkers over the last five years whether or not the L train is fucked. Now he offers his thoughts on how the MTA can do better in the future.

Microsoft Will Share Data, Tools to Speed Chinese AI Development

Bloomberg News


Microsoft Corp. will set up an open platform with four of China’s most prestigious universities to share its data and tools on artificial intelligence, quickening efforts to try and wrest clients away from local giants.

The company demonstrated its AI chip system, Project Brainwave, in Beijing on Monday. It showed off how “XiaoIce,” its Chinese digital assistant for chats, can emulate people’s voices, converse with humans over the phone and even compose poetry, echoing Google’s now-infamous demo to convince developers to adopt it as the chatbot of choice.

The displays encapsulate Microsoft’s efforts to use AI tools to win customers from more entrenched global rivals, including Amazon.com Inc. and Alphabet Inc. Among cloud vendors, AI services are regarded as effective at winning new business. And where Google’s services are blocked in China, Microsoft’s Brainwave and Azure cloud computing service are permitted thanks to its partnership with a local player. The Chinese market, however, remains dominated by the likes of Baidu Inc., Alibaba Group Holding Ltd. and Tencent Holdings Ltd.

Microsoft spells out its plans for Semantic Machines

VentureBeat, Khari Johnson


Microsoft announced Sunday it acquired conversational AI startup Semantic Machines, a company whose staff includes former Siri chief scientist Larry Gillick and researchers such as Percy Liang, who helped create Google Assistant.

Few details were provided in the blog post announcing the acquisition, but today Microsoft AI and Research Group chief technology officer David Ku spoke with VentureBeat to share how the acquisition will improve Cortana, help developers make interoperable voice apps, and influence products across Microsoft, as well as the company’s plans to open the first conversational AI center of excellence at the University of California, Berkeley.

AI as the New Oil: Saudi Arabia’s $500 Billion Smart City



In October 2017 Hanson Robotics’ “Sophia” became the first robot to be granted citizenship when Saudi Arabia formally made her one of theirs at a conference in the nation’s capital, Riyadh. Yesterday, Sophia joined a compatriot research team at the AI for Good Global Summit at the UN Headquarters in Geneva to discuss Saudi Vision 2030, in which the Gulf State charts a shift away from its dependence on oil revenue.

“This change will be powered by big data and artificial intelligence,” said the Kingdom’s Deputy Minister of Technology Industry and Digital Capabilities Dr. Ahmed Al Theneyan.

The jewel of the project is the smart city “NEOM”, an acronym that stands for “New Future” in Arabic. The Saudi government says it will pour US$500 billion into this mega-project, with construction expected to begin in 2020. NEOM will occupy 26,500 sq km (10,230 sq miles), 218 times larger than the city of San Francisco.
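The area figures above are easy to sanity-check. The sketch below converts NEOM's stated footprint to square miles and compares it with San Francisco. The San Francisco land area (about 121.4 sq km) is an assumption supplied here, not a number from the article, and the function names are illustrative.

```python
SQ_MILES_PER_SQ_KM = 0.386102   # standard conversion factor

NEOM_AREA_SQ_KM = 26_500        # figure quoted in the article
SF_LAND_AREA_SQ_KM = 121.4      # assumption: approximate land area of San Francisco


def sq_km_to_sq_miles(area_sq_km):
    """Convert an area from square kilometres to square miles."""
    return area_sq_km * SQ_MILES_PER_SQ_KM


def size_ratio(area_a, area_b):
    """Return how many times area_b fits into area_a."""
    return area_a / area_b


neom_sq_miles = sq_km_to_sq_miles(NEOM_AREA_SQ_KM)
ratio = size_ratio(NEOM_AREA_SQ_KM, SF_LAND_AREA_SQ_KM)

print(f"NEOM area: {neom_sq_miles:,.0f} sq miles")
print(f"NEOM / San Francisco: {ratio:.0f}x")
```

Under that assumed land area, both of the article's figures (roughly 10,230 sq miles and 218 times San Francisco) hold up to rounding.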

Extra Extra

In a combination of politics and technology, it turns out that a dating site for Trump supporters was uploading real pictures of people without their knowledge – some of whom were dead – to falsely increase the size of the dating pool, at least in largely anti-Trump New York.

Also in New York: the subway system is so bad it will cost $19 billion to fix the signals, according to Andy Byford, the MTA’s new head. This price tag has already kicked off another battle between Mayor de Blasio and Governor Cuomo that is sure to drag on for years.

People are pissed that grad students are reviewing some of the 3,000+ papers submitted to NIPS.

[1805.05238v1] Citation Count Analysis for Papers with Preprints

arXiv, Computer Science > Digital Libraries; Sergey Feldman, Kyle Lo, Waleed Ammar


We explore the degree to which papers prepublished on arXiv garner more citations, in an attempt to paint a sharper picture of fairness issues related to prepublishing. A paper’s citation count is estimated using a negative-binomial generalized linear model (GLM) while observing a binary variable which indicates whether the paper has been prepublished. We control for author influence (via the authors’ h-index at the time of paper writing), publication venue, and overall time that paper has been available on arXiv. Our analysis only includes papers that were eventually accepted for publication at top-tier CS conferences, and were posted on arXiv either before or after the acceptance notification. We observe that papers submitted to arXiv before acceptance have, on average, 65% more citations in the following year compared to papers submitted after. We note that this finding is not causal, and discuss possible next steps.
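For readers wondering how a GLM coefficient becomes a figure like "65% more citations": count models such as the negative binomial are commonly fit with a log link, in which case a coefficient beta on a binary indicator multiplies the expected count by exp(beta). The sketch below shows only that conversion. The log link is an assumption (the abstract does not state it), the function names are illustrative, and this is not the authors' code.

```python
import math


def percent_change_from_coef(beta):
    """Percent change in the expected count when a binary indicator
    goes from 0 to 1, assuming a log-link count model."""
    return (math.exp(beta) - 1.0) * 100.0


def coef_from_percent_change(pct):
    """Inverse mapping: the log-link coefficient implied by a percent change."""
    return math.log(1.0 + pct / 100.0)


# Coefficient that would correspond to the abstract's 65% figure
# (assuming a log link; the paper's actual estimate is not quoted here).
beta_preprint = coef_from_percent_change(65.0)

print(f"implied coefficient: {beta_preprint:.3f}")
print(f"back to percent: {percent_change_from_coef(beta_preprint):.1f}%")
```

The multiplicative reading is why such results are reported as percentages rather than as a fixed number of extra citations.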

Amazon urged not to sell facial recognition tool to police

Associated Press, Gene Johnson


The American Civil Liberties Union and other privacy advocates are asking Amazon to stop marketing a powerful facial recognition tool to police, saying law enforcement agencies could use the technology to “easily build a system to automate the identification and tracking of anyone.”

The tool, called Rekognition, is already being used by at least one agency — the Washington County Sheriff’s Office in Oregon — to check photographs of unidentified suspects against a database of mug shots from the county jail, which is a common use of such technology around the country.

Amazon’s Artificial Intelligence Helped TV Viewers Identify Royal Wedding Guests

Yahoo Finance, The Motley Fool, Beth McKenna


Prince Harry and American actress Meghan Markle tied the knot at Windsor Castle on Saturday during a ceremony that began at noon, London time, which was 7 a.m. EDT across the pond and 4 a.m. on the West Coast.

While many of us in the U.S. slept or started our days in other ways, millions of our fellow citizens and hundreds of millions around the world tuned in to the extensive TV coverage of the royal wedding. Those watching via British broadcaster Sky News’ livestream didn’t have to rack their brains to identify famous wedding guests, thanks to Amazon.com’s (NASDAQ: AMZN) artificial intelligence (AI).

No root for you, or how to stop worrying and love AWS China

The Register, Thomas Claburn


If you open an AWS account in China, you don’t get a root account; instead, one of Amazon’s Chinese operating partners, Sinnet or NWCD, has root access and creates an IAM admin user for you.

Nikki Bailey, senior devops engineer at Illumina, a company that builds gear to sequence genetic data, explained as much at the DevOps-focused Continuous Lifecycle London on Thursday.

She did so to illustrate the challenge of making a CI/CD pipeline work across the cloud environment.

UD announces new Data Science Institute and founding director

University of Delaware, UDaily


The University of Delaware is establishing a new institute to accelerate research in data science, and a pioneer in the interdisciplinary field, tapped from UD’s faculty ranks, will lead it.

Cathy Wu, the Unidel Edward G. Jefferson Chair in Engineering and Computer Science and an expert in bioinformatics, will serve as the founding director of the University of Delaware Data Science Institute. The appointment is effective April 1.

Biosecurity: Do synthetic biologists need a licence to operate?

PLOS Blogs Network, Kostas Vavitsas


A recent article in The New York Times about DIY biology and biohacking sparked a vigorous discussion about biosecurity and regulation of synthetic biology.

The article starts with the rather sensationalist title, “As D.I.Y. Gene Editing Gains Popularity, ‘Someone Is Going to Get Hurt.’” In it, Emily Baumgaertner explores DIY biology, with a particular take on security. Through personal stories and particular incidents, the story takes a particularly negative tone, essentially portraying DIY biology as risky to its practitioners and to society.

Pitt announces a new institute for modeling and simulation

Pittsburgh Post-Gazette, Courtney Linder


The University of Pittsburgh announced Monday morning that it will open a new institute to solve large-scale problems like the opioid epidemic through computer modeling and simulation.

The Modeling and Managing Complicated Systems Institute, part of the School of Computing and Information, will be a hub for academia, industry, foundations and government to collaborate on projects that could benefit from computer modeling or simulation.

“I guess I was one of those people who thought the world was too complicated to model until I got to DARPA. Then we started to develop AI technologies to build models of cancer, to build models of food insecurity,” Paul Cohen, founding dean of the School of Computing and Information, said in a video discussing the institute.

7 Startups, 6 Months, Zero Charge, and Up to $350,000 in Benefits

PR Newswire, NYU Tandon School of Engineering


The inaugural Catalyst NYC cohort includes:

  • enables brands to drive sales directly from social media comments
  • LandWright: creating a commercial real estate securities marketplace that brings a new form of equity financing to real estate owners and access to individuals priced out of the market
  • Omnirisk: assesses risk of commercial vehicles and their components in an era of rapidly evolving autonomous driving technologies
  • uses machine learning and a network of researchers to identify the veracity of scientific work and scientists
  • Skopos Labs: its AI software platform will turn unstructured data into accurate predictions of risk and opportunity for companies and financial markets
  • Trash TV: will enable predictive video editing in real time, with a next-generation stock video library and content platform
  • Ursa: brings context, efficiency, and actionability to meeting notes by recording notes and conversations side by side, syncing captured content, and letting users search and share key moments from meetings

    Europe’s open-access drive escalates as university stand-offs spread

    Nature, Holly Else


    Bold efforts to push academic publishing towards an open-access model are gaining steam. Negotiators from libraries and university consortia across Europe are sharing tactics on how to broker new kinds of contracts that could see more articles appear outside paywalls. And inspired by the results of a stand-off in Germany, they increasingly declare that if they don’t like what publishers offer, they will refuse to pay for journal access at all. On 16 May, a Swedish consortium became the latest to say that it wouldn’t renew its contract with publishing giant Elsevier.

    Under the new contracts, termed ‘read and publish’ deals, libraries still pay subscriptions for access to paywalled articles, but their researchers can also publish under open-access terms so that anyone can read their work for free.

    Advocates say such agreements could accelerate the progress of the open-access movement.

    Skills for a Lifetime – Nate Silver commencement speech, Kenyon College

    Kenyon College, Nate Silver


    There are lots of questions about how people are using data — we’ll get to that in a moment. But for better or worse, it’s no longer really acceptable to claim you don’t care about being data-driven. Even President Trump claims that he couches his decisions in data: “I call my own shots, largely based on an accumulation of data, and everyone knows it,” he tweeted last year.

    The flip-side to this is that the “nerds” are no longer on the outside looking in. Instead, they’re probably running the company. Power has shifted toward people and companies with a lot of proficiency in data science.

    I obviously don’t think that’s entirely a bad thing. But it’s by no means entirely a good thing, either. You should still inherently harbor some suspicion of big, powerful institutions and their potentially self-serving and short-sighted motivations. Companies and governments that are capable of using data in powerful ways are also capable of abusing it.

    What worries me the most, especially at companies like Facebook and at other Silicon Valley behemoths, is the idea that using data science allows one to remove human judgment from the equation. For instance, in announcing a recent change to Facebook’s News Feed algorithm, Mark Zuckerberg claimed that Facebook was not “comfortable” trying to come up with a way to determine which news organizations were most trustworthy; rather, the “most objective” solution was to have readers vote on trustworthiness instead. Maybe this is a good idea and maybe it isn’t, but what bothered me was the notion that Facebook could avoid responsibility for its algorithm by outsourcing the judgment to its readers.


    GeekWire Cloud Tech Summit



    Bellevue, WA June 27 at Meydenbauer Center (11100 NE 6th St). [$$$]


    NumFOCUS – Become a Member

    NumFOCUS is a 501(c)(3) nonprofit in the United States. Your tax-deductible donation supports NumFOCUS in our mission to promote sustainable high-level programming languages, open code development, and reproducible scientific research. Join us to support the open source tools you rely on every day.

    Causal Imitation in Robotics – RSS 2018 Workshop

    Pittsburgh, PA June 30. “This workshop will serve as a platform to discuss the impact and merit of algorithmic techniques in Imitation Learning and Causal Inference, and their applications in robotics.” Deadline for abstract submissions is June 3.

    DPhil in Social Data Science

    “The Oxford Internet Institute is pleased to announce the launch of a new and innovative DPhil in Social Data Science. The OII is partnering with multiple departments to jointly offer this new degree, including engineering science, statistics, sociology, computer science, and other departments across the University of Oxford. This will allow students to draw on Oxford’s wide range of disciplinary expertise in their classes and from their supervisors.” The deadline to apply for spots in the Fall 2019 cohort has not been set.

    Moore-Sloan Data Science Environment News

    So was the tassel worth the hassle? Of course!

    Facebook, NYU Center for Data Science


    “There are perhaps 10, 20, or 30 thousand people graduating today with a CS degree,” said our Deputy Director Arthur Spirling. “But how many people are graduating with Data Science degrees? I’m not sure—but it’s crazy to think that a fifth of the US’s entire data science capacity is right here in this room.” [video, 5:42]

    Tools & Resources

    Princeton Dialogues of AI and Ethics: Launching case studies

    Princeton CITP, Freedom to Tinker blog, Bendert Zevenbergen


    “The aim of this project is to develop a set of intellectual reasoning tools to guide practitioners and policy makers, both current and future, in developing the ethical frameworks that will ultimately underpin their technical and legislative decisions. More than ever before, individual-level engineering choices are poised to impact the course of our societies and human values. And yet there have been limited opportunities for AI technology actors, academics, and policy makers to come together to discuss these outcomes and their broader social implications in a systematic fashion. This project aims to provide such opportunities for interdisciplinary discussion, as well as in-depth reflection.”

    [1805.08166] Learning to Optimize Tensor Programs

    arXiv, Computer Science > Learning; Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy


    We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries such as cuDNN where only a narrow range of server class GPUs are well-supported. The reliance on hardware-specific operator libraries limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search by effective model transfer across workloads. Experimental results show that our framework delivers performance competitive with state-of-the-art hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPU.
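The core loop the abstract describes (a learned cost model steering which program variants get measured) can be illustrated with a toy. The sketch below is not the paper's method: it swaps in a nearest-neighbor predictor for the paper's statistical cost models, uses a made-up cost function in place of real hardware measurements, and searches a tiny grid of hypothetical tile sizes. Every name in it is illustrative.

```python
import itertools
import random


def synthetic_cost(tile_m, tile_n):
    """Hypothetical stand-in for timing a kernel on hardware (lower is better).
    Made-up surface whose unique minimum is at tiles (32, 16)."""
    return abs(tile_m - 32) / 32 + abs(tile_n - 16) / 16 + 1.0


class NearestNeighborCostModel:
    """Toy cost model: predicts the cost of the closest measured config."""

    def __init__(self):
        self.history = []  # list of (config, measured_cost) pairs

    def update(self, config, cost):
        self.history.append((config, cost))

    def predict(self, config):
        if not self.history:
            return 0.0  # no data yet: every config looks equally good
        nearest = min(
            self.history,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], config)),
        )
        return nearest[1]


def model_guided_search(space, measure, rounds=5, batch=4, seed=0):
    """Repeatedly measure the untried configs the model ranks cheapest."""
    rng = random.Random(seed)
    model = NearestNeighborCostModel()
    measured = {}
    for _ in range(rounds):
        untried = [c for c in space if c not in measured]
        if not untried:
            break
        rng.shuffle(untried)             # random tie-breaking
        untried.sort(key=model.predict)  # stable sort keeps the shuffled tie order
        for config in untried[:batch]:
            cost = measure(*config)      # the expensive step in a real system
            measured[config] = cost
            model.update(config, cost)
    best = min(measured, key=measured.get)
    return best, measured[best], len(measured)


tile_sizes = [4, 8, 16, 32, 64]
search_space = list(itertools.product(tile_sizes, tile_sizes))

best_config, best_cost, n_measured = model_guided_search(search_space, synthetic_cost)
print(best_config, best_cost, n_measured)
```

With the default budget the loop measures 20 of the 25 configurations, so it is not guaranteed to find the optimum. The point is only the shape of the loop: predict, pick a batch, measure, update.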


    Tenured and tenure track faculty positions

    Associate Professor in Statistics

    University of Florence; Florence, Italy

    Full-time positions outside academia

    NIH Chief Data Strategist and Director

    National Institutes of Health, Office of Data Science Strategy; Bethesda, MD

    Senior Design Researcher

    Wikimedia Foundation; San Francisco, CA
