Data Science newsletter – December 4, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for December 4, 2017


Data Science News

Tweet of the Week



AI-ON, Artificial Intelligence Open Network

Yoshua Bengio, Hugo Larochelle and Max Welling


We are an open community dedicated to advancing Artificial Intelligence by assembling teams of volunteer researchers around the world to work on problems curated by leaders of the field.

Extra Extra

Men’s professional and college level sports tend to have stats available. Not so true for women’s sports, but here’s a site with women’s college b-ball stats. They plan to add WNBA data soon.

Andreessen Horowitz put out a podcast on Data Science in the Age of AI [audio, 9:42], smart enough to realize that AI is a subset of data science. And the quality Gigaom podcast Voices in AI features University of Washington professor Pedro Domingos in episode #23 [audio, 53:50].

Atul Gawande, writing in The New Yorker, tells how patient outcomes data has begun to erode the reputations of leading U.S. medical centers.

Samir Khuller of the University of Maryland interviews Dick Karp in this Fireside Chat for the Simons Foundation.

Sara Crowe of Polaris on data in the fight against human trafficking

ArchiTecht, Derrick Harris


In this episode of the ARCHITECHT Show, Sara Crowe — the data analysis program director for non-profit Polaris — discusses how that organization is using data analysis to help combat human trafficking. Crowe describes some of the techniques Polaris uses to untangle what can be complex networks of abuse, and how it then works with law enforcement to act on what is discovered. She also explains how Polaris partners with the tech industry on both tools and training, the double-edged sword of the internet, and the need for better methods of gathering and formatting web text.

Government Data Science News

The 2020 US Census has pushed past one relatively small hurdle: AT&T has dropped its lawsuit and come to an agreement with CDW-G about which company gets to provide mobile devices for the first semi-digital census. This does not, however, address concerns raised by former Census Director and current Columbia University professor, Kenneth Prewitt, who notes that there may not be enough time to test these new technologies before deploying them. Anyone working in software is familiar with the “ship fast, patch later” strategy…and why it is a very bad idea, especially when the product has a $15 billion price tag. One of the biggest concerns may be security (or lack thereof) according to Comptroller General Gene Dodaro. Wilbur Ross has asked Congress for an additional $4.5 billion to ensure that the 2020 Census data are secure and the results are reliable.

Last week I noted that the Census is like a sacred text for social scientists. I got some pushback from a reader who thought I meant that as a joke, which is fair because I’m often light-hearted. In that case, I meant to imply that because the Census is the only attempt to measure key demographic and geographic information about every individual in the US, it undergirds all sorts of sampling estimation techniques for other surveys, not to mention establishing funding for social welfare programs and informing political boundaries for electoral representation. The Census, then, is perhaps *the* key “text” (really, it’s more data than text) for social science practice, one which should be the subject of active debate, scrutiny, and oversight because it is such a fundamental sociotechnical structuring device for research, politics, and the economy.

Meanwhile, if you think the brain drain from academia to industry is a daunting uphill battle, the federal government faces an even greater challenge. Unlike academia, the government has no pipeline to begin with: at least we have a bunch of talent in our PhD programs that we can try to support and retain. The federal government has to actively attract IT workers who are already on the job market and far more likely to field large offers from industry than a postdoc applying only for tenure-track jobs. Much as academia's starting salaries trail industry's, the federal government hires workers into rigid pay bands that are dwarfed by industry salaries and usually lower than academia's, too. Given how much data the government holds about individuals and corporations, we should all be concerned about the lack of IT talent in the public sector.

New York City’s CTO, Miguel Gamino, is working to bring affordable, robust broadband to all New Yorkers by 2025. Gamino’s job requires him to figure out “hundred-year decisions in the context of the current demands” and in line with the changing tides of politics and funding regimes.

The National Institutes of Health is warning its grantees not to publish in questionable journals with aggressive solicitation tactics and non-standard editorial practices. Instead, they want to see funded research published in “venues that will reach the most relevant audience for their work while adhering to established good practices regarding peer review, editorial quality.” Furthermore, and quite excitingly!, they note that for authors chomping at the bit to get their work out, “we would encourage authors to issue a preprint.” Things do change in academia – preprints are being touted as a great strategy by many major funding institutions – it simply takes a long time.

There’s some concern that Canada, China or (maybe?) Russia could take over the lead from the US in AI research. It wouldn’t much bother me if Canada took the top spot because their norms about publication and sharing data are productive. Different norms prevail in Russia and China.

Don’t try picking a World Cup winner. Just sit back and enjoy the show

ESPN Soccer, Simon Kuper


… Given how silly I felt after the 7-1 defeat in Belo Horizonte, this time I’ll let other people bore you with their predictions. The bookmakers are now making France, Germany and Brazil joint favorites next summer at odds of about 5-1 each. In the coming months you’ll hear endless learned analyses about the supposed qualities of all 32 squads, but the fact is that World Cups are among the most unpredictable events in sports, for three basic reasons:

1. As any honest investment adviser will tell you, past performance is a poor guide to future performance. Every World Cup, stars go in with big reputations and come out a fortnight later looking like tired old men (the possible fate of Luis Suarez, who will be 31 when the tournament starts) while unknown youngsters steal the show (recall Thomas Muller in 2010).

China’s A.I. Advances Help Its Tech Industry, and State Security

The New York Times, Paul Mozur and Keith Bradsher


During President Trump’s visit to Beijing, he appeared on screen for a special address at a tech conference.

First he spoke in English. Then he switched to Mandarin Chinese.

Mr. Trump doesn’t speak Chinese. The video was a publicity stunt, designed to show off the voice capabilities of iFlyTek, a Chinese artificial intelligence company with both innovative technology and troubling ties to Chinese state security. IFlyTek has said its technology can monitor a car full of people or a crowded room, identify a targeted individual’s voice and record everything that person says.

“IFlyTek,” the image of Mr. Trump said in Chinese, “is really fantastic.”

As China tests the frontiers of artificial intelligence, iFlyTek serves as a compelling example of both the country’s sci-fi ambitions and the technology’s darker dystopian possibilities.

Company Data Science News

NVIDIA, Facebook AI and Google listed their accepted papers at the sold-out NIPS. All NIPS 2017 papers are also listed at the conference website, along with counts of accepted papers by institution (scroll down; top 3: CMU, DeepMind, MIT). My colleagues Jennifer Hill and Foster Provost played on closing night. They are both rock star academics, in case that was not clear.

Oracle is opening a public charter school on its corporate campus in Redwood Shores, California for 550 students. Oracle employees will be available to mentor students in the $43 million facility that features a two-story fab lab. Oracle paid for the facility but will have no governance over hiring or curricula. This is the first public school to be housed on a corporate campus in the US.

Elsewhere in Silicon Valley construction projects, Microsoft is expanding its Mountain View campus to increase workspace by 35%. It does not appear that they will be including any high schoolers in their plans, which is totally fine. Let’s see what happens with Oracle’s school before jumping to copycat it.

And Microsoft Research has a new podcast [Microsoft Research Podcast] which is “first and foremost about the research” with some color coming from the researchers’ personalities and back stories.

Patreon has changed its fee structure. Right now, Patreon takes 5% of whatever creators make on the site plus some processing fees. As it stands, most creators leave 7-15% of their earnings with Patreon. Under the new plan, patrons will pay 2.9% of their pledge, plus a 35 cent per pledge fee. On the surface, this may seem like a boon to creators because patrons are paying the fees, but that may only be true for pledges above a certain threshold. For the common $1 and $2 pledges, a 35 cent per pledge fee is hefty. Creators are worried that would-be $1 patrons won’t pay anything at all if they think Patreon is taking 38% of their contribution. This is relevant to open source software contributors who rely on Patreon accounts to accept funding for their efforts.
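The arithmetic behind the creators' worry is easy to check. A quick sketch of the announced fee formula (2.9% plus a flat $0.35 per pledge) shows how the flat component dominates small pledges:

```python
def new_fee_share(pledge: float) -> float:
    """Service fee under the announced plan (2.9% of the pledge plus a
    flat $0.35), expressed as a fraction of the pledge amount."""
    fee = 0.029 * pledge + 0.35
    return fee / pledge

# The flat 35-cent component is what stings small patrons:
for pledge in (1, 2, 5, 10):
    print(f"${pledge} pledge -> {new_fee_share(pledge):.1%} fee")
# A $1 pledge carries a ~38% fee; a $10 pledge only ~6.4%.
```

So the new structure is roughly neutral or better for large pledges, while the common $1 and $2 pledges bear the heaviest proportional fees.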

Googlers Francois Chollet, Gabriel Pereyra, and Hugo Larochelle have started AI-ON, a website where senior AI researchers can list projects from their “backlog” open to anyone who wants to take a whack at them. Senior researchers get some free help; younger contributors get to work with senior people outside their organizations who they would have trouble reaching otherwise. Perhaps the most notable attribute of their about page is their attitude about fears of being scooped. They know that this type of open research platform “may raise concerns about having ideas or code taken from AI•ON by unscrupulous parties and published as their own. This is both inevitable and utterly unimportant.” One of their three open projects – few-shot music generation – is here on GitHub.

IFlyTek is a Chinese voice recognition and translation company that allowed President Trump to address Chinese audiences in Chinese. The company’s technology is definitely impressive, but its ties to the Chinese government discomfit some Americans. As we know, access to data is key and according to The New York Times, “[w]here iFlyTek gets its data is not clear. But one of its owners is China Mobile, the state-controlled cellular network giant, which has more than 800 million subscribers. IFlyTek preloads its products on millions of China Mobile phones and runs the hotline service for China Mobile”. I wasn’t sure whether to put this in company data science news or government data science news because the key stakeholders are so entwined.

Have you ever been tempted to sell your plasma? Well, what about selling your data to pharmaceutical companies and researchers? There is, of course, an app for that, run out of the US’s third tech hub in Northern Virginia by a company called Health Wizz. Because this is a high-tech kind of exchange, it runs on blockchain and cryptocurrency.

Google released DeepVariant that takes genome sequencing data and “automatically identifies small insertion and deletion mutations and single-base-pair mutations”. It uses more advanced AI than existing competitors to better sort signal from noise in low frequency variation regions. The product came out of Google Brain and Verily.

Machine learning is becoming an increasingly important part of big-ticket industries like law and investing. This was inevitable, of course, but why did it take so long? I’m setting high-frequency trading aside, as that did emerge years ago.

Reuters is also using automation to help cover breaking news stories, building in safeguards to prevent malicious actors from gaming the system to spread misinformation.

Speaking of mainstream media and ‘fake news’, Duncan Watts and David M. Rothschild argue that the mainstream media and people who select into filter bubbles, not so much tech companies, are to blame for the impact of fake news, which has been exaggerated. They cite economists Hunt Allcott and Matthew Gentzkow, who found that “the average US adult read and remembered on the order of one or perhaps several fake news articles during the election period, with higher exposure to pro-Trump articles than pro-Clinton articles.” Allcott and Gentzkow then estimated that “if one fake news article were about as persuasive as one TV campaign ad, the fake news in our database would have changed vote shares by an amount on the order of hundredths of a percentage point.” It’s intellectually invigorating to see this new research on fake news and filter bubbles. I hope the engagement continues.

Apple explained how they use differential privacy to draw inferences from users’ data without exposing their identities.
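Apple’s deployment relies on local differential privacy with more elaborate sketching schemes, but the core idea is easiest to see in the classic Laplace mechanism for a counting query. Everything below is a generic illustrative sketch, not Apple’s actual method:

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon: float) -> float:
    """Answer "how many records satisfy predicate?" with epsilon-differential
    privacy. A counting query has sensitivity 1 (one person joining or
    leaving changes the count by at most 1), so Laplace noise with scale
    1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Release an approximate count without revealing any individual's record.
data = ["yes", "yes", "no", "yes", "no"]
print(private_count(data, lambda r: r == "yes", epsilon=1.0))
```

Smaller epsilon means more noise and stronger privacy; the analyst gets a useful aggregate while no single person’s answer can be inferred from the released number.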

Stanford-led artificial intelligence index tracks emerging field

Stanford University, Stanford News


A Stanford-led team has launched the first index to track the state of artificial intelligence and measure technological progress in the same way the GDP and the S&P 500 index take the pulse of the U.S. economy and stock market.

We Can Hack That: Georgia Tech Takes First Place in Hackathon Competition

Georgia Tech, College of Computing


It’s been a grueling year, but Georgia Tech students have risen to the top of the Major League Hacking (MLH) ranks and have won 1st Place in the recently completed 2017 MLH Hackathon Season.

MLH is an international student hackathon league. Participating in sanctioned events throughout North America, student teams from across campus collectively competed in most of the 190 events held between August 2016 and July 2017. In all, more than 3,000 university and high school students participated in the North American MLH events.

Gaming Machine Learning | December 2017 | Communications of the ACM

Communications of the ACM, Samuel Greengard


Over the last few years, the quest to build fully autonomous vehicles has shifted into high gear. Yet, despite huge advances in both the sensors and artificial intelligence (AI) required to operate these cars, one thing has so far proved elusive: developing algorithms that can accurately and consistently identify objects, movements, and road conditions. As Mathew Monfort, a postdoctoral associate and researcher at the Massachusetts Institute of Technology (MIT) puts it: “An autonomous vehicle must actually function in the real world. However, it’s extremely difficult and expensive to drive actual cars around to collect all the data necessary to make the technology completely reliable and safe.”

All of this is leading researchers down a different path: the use of game simulations and machine learning to build better algorithms and smarter vehicles. By compressing months or years of driving into minutes or even seconds, it is possible to learn how to better react to the unknown, the unexpected, and unforeseen, whether it is a stop sign obscured by graffiti, a worn or missing lane marking, or snow covering the road and obscuring everything.

“A human could analyze a situation and adapt quickly. But an autonomous vehicle that doesn’t detect something correctly could produce a result ranging from annoying to catastrophic,” explains Julian Togelius, associate professor of computer science and engineering at New York University (NYU).

Medical innovation seminar brings global perspective to annual health++ hackathon | Scope Blog

Stanford Medicine, Scope Blog


The second annual health++ hackathon brought together more than 220 engineers, designers, health care professionals, and business experts to Stanford this fall for a weekend of brainstorming and building solutions for unmet clinical needs in health care affordability.

Thirty project teams competed for $13,000 in prizes sponsored by Stanford departmental and industry partners, creating software, hardware, mechanical, and business model innovations ranging from mobile applications to ameliorate pediatric malnutrition in low-resource communities to artificial intelligence-based models to predict peripheral arterial disease from raw accelerometer data.

Building the hardware for the next generation of artificial intelligence | MIT News

MIT News, School of Engineering


On a recent Monday morning, Vivienne Sze, an associate professor of electrical engineering and computer science at MIT, spoke with enthusiasm about network architecture design. Her students nodded slowly, as if on the verge of comprehension. When the material clicked, the nods grew in speed and confidence. “Everything crystal clear?” she asked with a brief pause and a return nod before diving back in.

This new course, 6.S082/6.888 (Hardware Architecture for Deep Learning), is modest in size — capped at 25 for now — compared to the bursting lecture halls characteristic of other MIT classes focused on machine learning and artificial intelligence. But this course is a little different. With a long list of prerequisites and a heavy base of assumed knowledge, students are jumping into deep water quickly. They blaze through algorithmic design in a few weeks, cover the terrain of computer hardware design in a similar period, then get down to the real work: how to think about making these two fields work together.


Day For Night

Day For Night


Houston, TX December 15-17. “Day for Night is a current snapshot of popular music as well as a showcase for trailblazers who continue to cross over from the fringes to become influencers. Eschewing the idea of musical genres, Day for Night focuses instead on acts that specialize in an inventive and highly visual approach to performing.” [$$$]

NYU Center for Data Science News

Still don’t believe in climate change? Take a look at tree ring data

Medium, NYU Center for Data Science


NASA scientists use tree ring data to measure droughts and build statistical models that predict how climate change will affect our planet

Tools & Resources

Binder 2.0, a Tech Guide

Jupyter, Chris Holdgraf


“We are undergoing a dramatic increase in the complexity of techniques for analyzing data, doing scientific research, and sharing our work with others. In early 2016, the Binder project was announced, attempting to connect these three components.” … “Binder builds a Docker image from these dependencies, and provides a URL where any user in the world can instantly recreate this environment.”
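Binder’s launch URLs for GitHub-hosted repositories follow a simple pattern, so “a URL where any user in the world can instantly recreate this environment” is cheap to construct. A small illustrative helper (the repo named below is a hypothetical placeholder):

```python
def binder_url(owner: str, repo: str, ref: str = "master") -> str:
    """Build a mybinder.org launch URL for a public GitHub repository.

    Binder builds a Docker image from the dependency files it finds in
    the repo (requirements.txt, environment.yml, Dockerfile, ...) and
    serves a live Jupyter environment at the resulting address."""
    return f"https://mybinder.org/v2/gh/{owner}/{repo}/{ref}"

# Hypothetical example repo:
print(binder_url("example-user", "demo-notebooks"))
# https://mybinder.org/v2/gh/example-user/demo-notebooks/master
```

Anyone clicking such a link gets a fresh container with the repo’s exact dependencies, which is what makes Binder useful for sharing reproducible analyses.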

Theoretical Machine Learning Lecture Series: Deep Learning and Cognition | Institute for Advanced Study

Institute for Advanced Study


Deep learning, which is the reemergence of artificial neural networks, has recently succeeded as an approach towards artificial intelligence. In many fields, including computational linguistics, deep learning approaches have largely displaced earlier machine learning approaches, due to the superior performance they provide.

In this public lecture, Christopher Manning, Thomas M. Siebel Professor in Machine Learning, Professor of Linguistics and Computer Science, discusses some of the results in computer vision, speech, and language which support the preceding claims.

Research for Practice: Vigorous Public Debates in Academic Computer Science

Communications of the ACM


This installment of Research for Practice features a special curated selection from John Regehr, who takes us on a tour of great debates in academic computer science research. In case you thought flame wars were reserved for Usenet mailing lists and Twitter, think again: the academic literature is full of dramatic, spectacular, and vigorous debates spanning file systems, operating system kernel design, and formal verification.


GitHub – AI-ON


Learning a generative model for music data using a small number of examples.

GBDX Notebooks and Amazon SageMaker for systematic mining of geospatial data

DigitalGlobe GBDX Team


DigitalGlobe’s 100-plus petabyte archive of high-resolution imagery is a rich source of information about our changing planet. But fully exploring and mining those riches requires an efficient way to manage and analyze all the data. We set out to find a solution.

Our first step to unlock the power of the DigitalGlobe image library was to load our data on to Amazon Web Services (AWS), a compute-friendly environment that manages data efficiently and enables analysis at scale. The culmination of our efforts was the launch of our Geospatial Big Data platform called GBDX, a horizontally scalable compute environment for analyzing satellite imagery. But even with a great compute environment and a growing set of analytical methods and algorithms, truly harnessing our data takes a lot of work. This is where machine learning becomes critical—to analyze vast amounts of data and extract meaningful intelligence quickly and efficiently.



NYU Moore-Sloan Data Science Fellows

NYU Center for Data Science; New York, NY
Tenured and tenure track faculty positions

Assistant Professor, Interactive Media Studies

Miami University; Oxford, OH
Full-time, non-tenured academic positions

Data Engineer

Columbia University, Zuckerman Institute; New York, NY

Research Assistant (Creative AI, Sensilab)

Monash University; Caulfield East, Australia
Internships and other temporary positions

American Statistical Association, 2018 Internships List

American Statistical Association; United States

2018 Cambridge Internship Program

Amazon; Cambridge, England
