Data Science newsletter – June 27, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for June 27, 2017


Data Science News

Georgia Tech Math Professor Appearing on American Ninja Warrior

YouTube, Georgia Tech


Professor Sal Barone will take time off from teaching math courses at Georgia Tech to test his strength on the NBC show American Ninja Warrior.

Smartphones Open a New World for Medical Researchers

Wall Street Journal, Charles Wallace


So much data can be collected automatically and accurately via mobile phones—without participants or lab workers having to log it—that some scientists believe it will be easier to conduct and monitor many trials involving drugs or exercise in larger populations than have been examined up to now in conventional studies.

What’s more, doctors believe it will be possible to give participants feedback not only about their own health but also about the population at large much faster than is possible with conventional medical studies, which often appear in scientific journals years after they’re conducted. Participants could quickly see, for example, the beneficial effects of exercise or diet and adjust their behavior accordingly.

Google joins in backing Seattle tech marketplace Algorithmia

The Seattle Times, Rachel Lerman


Algorithmia’s online marketplace collects algorithms, or sets of operations performed by a computer, and sells them to developers and companies. The startup says putting algorithms in one easy-to-find place saves companies time because they don’t have to re-create something that has already been made.

One early, popular algorithm on the site was a colorizer — a program used to add color to black and white images.

Introducing Our New Labs Blueprint

DataKind, Rita Allen Foundation


Thanks to support from the Rita Allen Foundation, we are pleased to share our Labs Blueprint, a new resource documenting our learnings and approach from our first Labs project using data science to improve traffic safety in three U.S. cities.

Labs projects differ from other DataKind projects in that they are designed to address sector-wide challenges instead of a specific organization’s. As these projects look to move the needle on sector-wide issues, our hope is this document can help others learn from and hopefully be able to launch similar projects to drive change.

A machine-learning approach to venture capital

McKinsey & Company, McKinsey Quarterly


In this interview, Hone Capital managing partner Veronica Wu describes how her team uses a data-analytics model to make better investment decisions in early-stage start-ups.

Google set to face record EU antitrust fine as soon as Tuesday: sources



EU antitrust regulators are likely to impose a record fine on Alphabet (GOOGL.O) unit Google over its shopping service as soon as Tuesday, two people familiar with the matter said on Monday, concluding one of three cases against the company.

The European Commission’s case was triggered by scores of complaints from both U.S. and European rivals, leading to a seven-year-long investigation into the world’s most popular internet search engine.

Social Networks May One Day Diagnose Disease – But at a Cost

WIRED, Science, Sam Volchenboum


It’s now entirely conceivable that Facebook or Google—two of the biggest data platforms and predictive engines of our behavior—could tell someone they might have cancer before they even suspect it. Someone complaining about night sweats and weight loss on social media might not know these can be signs of lymphoma, or that their morning joint stiffness and propensity to sunburn could herald lupus. But it’s entirely feasible that bots trolling social network posts could pick up on these clues.

Sharing these insights and predictions could save lives and improve health, but there are good reasons why data platforms aren’t doing this today. The question is, then, do the risks outweigh the benefits?

Accenture: Healthcare AI poised for explosive growth, big cost savings

MobiHealthNews, Mike Miliard


The report forecasts a 40 percent compound annual growth rate between now and 2021, with acquisitions of AI startups proceeding at a feverish pace. The technology represents “a significant opportunity for industry players to manage their bottom line in a new payment landscape,” according to the report, which examined 10 different AI applications, ranked by their potential for cost savings.

  • Robot-assisted surgery – $40 billion
  • Virtual nursing assistants – $20 billion
  • Administrative workflow assistance – $18 billion
  • Fraud detection – $17 billion
  • Dosage error reduction – $16 billion
  • Connected machines – $14 billion
  • Clinical trial participant identifier – $13 billion
  • Preliminary diagnosis – $5 billion
  • Automated image diagnosis – $3 billion
  • Cybersecurity – $2 billion
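
To get a feel for what a 40 percent compound annual growth rate implies, here is a small arithmetic sketch. It assumes four compounding years (2017 to 2021) and a starting value normalized to 1.0; both are assumptions for illustration, not figures from the report. It also adds up the ten savings estimates listed above.

```python
def compound(start, rate, years):
    """Value of `start` after growing at `rate` per year for `years` years."""
    return start * (1.0 + rate) ** years

# Assumption: four compounding periods (2017 -> 2021), start normalized to 1.0.
growth_multiple = compound(1.0, 0.40, 4)

# Savings estimates copied from the list above, in billions of US dollars.
savings_billion = {
    "Robot-assisted surgery": 40,
    "Virtual nursing assistants": 20,
    "Administrative workflow assistance": 18,
    "Fraud detection": 17,
    "Dosage error reduction": 16,
    "Connected machines": 14,
    "Clinical trial participant identifier": 13,
    "Preliminary diagnosis": 5,
    "Automated image diagnosis": 3,
    "Cybersecurity": 2,
}
total_savings_billion = sum(savings_billion.values())

print(f"Growth multiple after 4 years at 40%: {growth_multiple:.2f}x")
print(f"Sum of the ten listed estimates: ${total_savings_billion} billion")
```

Under those assumptions, a market growing 40 percent a year is a little under four times its starting size after four years.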

    Company Data Science News

    Google was handed a ferociously large fine: 2.4bn Euros ($2.7bn) by the European Commission after it found that the company promoted its own shopping results above the results of competitors. Google is considering an appeal.

    As if having the EU drop a legal anvil on Google weren’t enough, the Canadian Supreme Court ruled Google must remove certain listings from search results, not just in Canada, but everywhere. Students, please sign up for an ethics class so that you might avoid being on the receiving end of rulings and fines like this in your future positions.

    Matt Zeiler founded Clarifai after getting his PhD at NYU with Yann LeCun and Rob Fergus (both of whom now split their time between Facebook AI and NYU). Zeiler is now building a “gore and violence” classifier to identify images of blood and violent acts and, hopefully, limit their virality.

    In partnership with Palantir, Airbus has launched an open data portal for air traffic related data. Of course, it’s in the cloud.

    Facebook has reached a major milestone: 2 billion users. Makes me wonder how they can possibly keep growing at rates anywhere near this (lasers and drones to serve internet to those who don’t have it?) and what types of efforts they will devise to grow revenue if they can’t keep growing their user base this quickly.

    IBM Watson is going to be producing highlight reels from Wimbledon all by itself. Somehow I think this must be easier in tennis because the audience is typically so quiet that detecting an audio anomaly must be easier. Watson will also be scanning the crowds and using facial recognition to pick out royals and celebs in the audience.

    Nvidia has partnered with Volkswagen (and many other self-driving car aspirants) to help VW get a self-driving car on the road. The two companies are also expanding their partnership to “optimize traffic flow in cities” and work on other deep learning projects. Nvidia was also named the Smartest Company of 2017 by MIT Technology Review.

    Ofer Dekel, of Microsoft Labs, used a $35 Raspberry Pi and some image recognition software to ensure that his sprinkler system came on when a squirrel got too close to his birdfeeder. Call it round #874 in the quest to design a better mousetrap. (Elsewhere in human vs. squirrel battles: my sister had an extremely proud moment last week when she successfully shot a squirrel with a bb gun from the window over her kitchen sink. The squirrels have been eating the peaches off her tree.)

    Ford Motors has a new dedicated robotics and AI research team. It’s about time, Ford. Good luck recruiting – it’s a competitive field out there.

    How To Improve the Murder Clearance Rate in U.S. Cities

    CityLab, Anthony Williams


    U.S. law enforcement is the worst in the Western world at solving crimes. Only one-eighth of burglaries lead to an arrest. For rapes it’s only one-third, and for murder two-thirds. Murder clearance rates have generally held steady for thirty years—even as the murder rate has nosedived, albeit for reasons we don’t really know.

    Government Data Science News

    Scott Pruitt, the new head of the EPA, thinks that what climate science needs is a good “red team, blue team” debate because “what the American people deserve, I think, is a true, legitimate, peer-reviewed, objective, transparent discussion about CO2.” Readers, you are all a bunch of scientists so you know that scientists already have peer-review and a variety of other platforms for transparent discussion – conferences, seminars, email chains, blogs, hackweeks, etc. You may also be scratching your head wondering how climate science could split into two opposing ‘teams.’ I despise it when complicated issues are reduced to “two sides of the story.”

    This is the CLIMATE. It is vastly complicated. What good could possibly come from turning it into a two-sided issue and recruiting scientists to uphold one of these viewpoints in a debate? Science is not settled in debates. Science is investigated in labs, in the field, in the data, the models, and the journals. Further terrifying me, it is unclear which *active* climate research scientists Pruitt can find to argue that climate change is not caused by burning fossil fuels or that climate change is not a problem for human civilization or whatever perspective he decides needs representation. The Atlantic found a few scientists who voted for Trump. One of them, Princeton physicist William Happer, says he doesn’t believe “the hysteria about climate change.” Judith Curry of Georgia Tech called the scientists who participated in the Science March “whiners.”

    The Pentagon has gotten an agreement from Congress to launch a $70m “Algorithmic Warfare Cross-Functional Team.” The new team will address autonomous weapon systems, but that is unlikely to be all that it does. The Brookings Institution reports that the Pentagon is looking at ‘war algorithms,’ a framework for assessing how to proceed when existing international rules of engagement are insufficient for algorithmic wartime decision-making.

    The Metro in Seattle had to add new buses to get all of the Amazon summer interns to work on time.

    The Washington Post uses ModBot to moderate comments. Because of course Jeff Bezos’ newspaper would be using AI. Still, it’s probably better for democratic discourse to leave comments open as often as possible, which can only be accomplished efficiently and troll-free using bots.

    Speaking of Amazon, Brent Smith and Greg Linden have a paper out that talks about the recommender algorithm that now underpins so much of what we are all offered as we watch Netflix and YouTube and shop on Amazon and other .com retailers.

    Republican Congressman Greg Walden (OR) rode in a car with assistive braking and now wants to open US roadways to self-driving vehicles. Because his party is in power, US roads may open to autonomous cars soon.

    New York City’s Economic Development Corporation and the Mayor’s Office of Media and Entertainment have ponied up $6m and selected NYU Tandon School of Engineering to operate a VR/AR “hub” at the Brooklyn Navy Yard. The goal is to create 500 jobs over ten years and make New York a global leader in VR/AR development.

    The Alan Turing Institute in London is moving to deepen its partnership with the UK Ministry of Defence, Defence Science and Technology Lab, and Joint Forces Command. What is this set of partnerships going to work on? It’s still unclear to me, but here are the words they used: “The partnership is interested in developing data science methodologies and techniques, and in the direct application of data science. This is reflected in the initial areas of interest for the partnership that span creating intelligent data systems, securing cyber-space, enhancing data privacy and trust, and seeking a better understanding of the urban environment and its development.” Highlighting privacy and security is promising.

    Ron Jarmin, a career staffer at the Census Bureau, has been named acting director and will oversee the already highly controversial 2020 Census.

    In another embattled government agency, the Veterans Administration warned that the $543m contract to keep track of medical equipment is “careening off the rails.” The VA had been accidentally conducting colonoscopies and other procedures with equipment that had not been sterilized because they couldn’t keep track of where the clean equipment was. HIV and Hepatitis C infections ensued. The department’s wifi may not be able to support all the new location-tagged equipment pinging back to base.

    Meanwhile, nobody works in the science division of the White House Office of Science and Technology Policy. The three remaining staffers left last week. Now the office sits empty.

    Our friendly neighbors to the north have named Yoshua Bengio an Officer of The Order of Canada, the country’s highest civilian honor. He was congratulated on Facebook and Twitter by Yann LeCun.

    The Amazon effect: Metro adds buses to handle new flock of summer interns

    The Seattle Times, David Gutman


    However you measure it, Amazon’s impact on Seattle is indisputable. Here’s a new metric: summer interns. Amazon has so many that King County Metro has adjusted bus service just for them.

    The Missing Link: Where Are Medium-Size Black Holes?

    Charles Q. Choi


    For decades, while astronomers have detected black holes equal in mass either to a few suns or millions of suns, the missing-link black holes in between have eluded discovery. Now, a new study suggests such intermediate-mass black holes may not exist in the modern-day universe because of the rate at which black holes grow.

    Scientists think stellar-mass black holes — up to a few times the sun's mass — form when giant stars die and collapse in on themselves. Over the years, astronomers have detected a number of stellar-mass black holes in the nearby universe, and in 2010, researchers detected the first such black hole outside the local cluster of nearby galaxies known as the Local Group.

    The Washington Post gets more than a million comments every month, so it’s using AI to tackle them

    Poynter, Benjamin Mullin


    The Washington Post gets about a million comments every month on its stories, said Greg Barber, director of digital news projects at the newspaper. So, like The New York Times, The Post has begun using artificial intelligence and machine learning to moderate them.

    ModBot, a new tool built by The Post’s engineering team, filters new comments by referring to a record of decisions made by human moderators. It uses technology that deconstructs comments into their component parts and scores them against The Post’s discussion policy.
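
The blurb does not describe ModBot's internals, so the following is only a toy sketch of the general idea: break comments into tokens and score new ones against a record of past human moderation decisions. The example comments, the `approve`/`reject` labels, the `ModerationScorer` class, and the choice of naive Bayes scoring are all assumptions made for illustration, not The Post's implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a comment into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class ModerationScorer:
    """Toy naive Bayes scorer trained on past moderator decisions."""

    def __init__(self):
        self.token_counts = {"approve": Counter(), "reject": Counter()}
        self.doc_counts = Counter()

    def train(self, comment, decision):
        # `decision` must be "approve" or "reject" (hypothetical labels).
        self.doc_counts[decision] += 1
        self.token_counts[decision].update(tokenize(comment))

    def reject_log_odds(self, comment):
        """Positive means 'looks like rejected comments'; negative, approved.

        Assumes at least one example of each decision has been seen.
        """
        vocab_size = len(
            set(self.token_counts["approve"]) | set(self.token_counts["reject"])
        )
        n_rej = sum(self.token_counts["reject"].values())
        n_app = sum(self.token_counts["approve"].values())
        score = math.log(self.doc_counts["reject"] / self.doc_counts["approve"])
        for tok in tokenize(comment):
            # Add-one smoothing so unseen tokens do not zero out a class.
            p_rej = (self.token_counts["reject"][tok] + 1) / (n_rej + vocab_size)
            p_app = (self.token_counts["approve"][tok] + 1) / (n_app + vocab_size)
            score += math.log(p_rej / p_app)
        return score


# Hypothetical record of human moderation decisions.
history = [
    ("Thanks for the thoughtful reporting", "approve"),
    ("I disagree with the author but this is a fair point", "approve"),
    ("You are an idiot and so is the author", "reject"),
    ("What an idiot, total garbage", "reject"),
]

scorer = ModerationScorer()
for comment, decision in history:
    scorer.train(comment, decision)

for new_comment in ("you idiot", "thoughtful fair reporting"):
    print(new_comment, round(scorer.reject_log_odds(new_comment), 3))
```

A real system would need far more training data and a threshold for sending borderline comments to human moderators.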

    Retailers: Adopt Artificial Intelligence Now for Personalized and Relevant Experiences

    Adobe Retail Team


    Retail and e-commerce have always been central to the personalization and optimization conversation. From Amazon’s recommendations — which drive 30 percent of its revenue — to targeted email outreach and push alerts promoting complementary products, the most optimization-focused retailers have always pushed the experience envelope, fueling the desire for more relevance at all touch points.

    Delivering relevance on those touch points, though, is where some retailers start to lose their footing. “Taking that next step is a big leap,” says Kevin Lindsay, director of product marketing for Adobe Target. “It’s a leap of faith in terms of how much you can bite off. How much is actually doable today and what benefits can you get from incorporating AI into developing these tactics today?”

    These days, supplying personalized experiences at every touch point isn’t just something customers want, it’s what they expect. More than half of consumers want a “totally personalized experience,” and three in five are happy to have interests and behaviors shared if it leads to a more personalized journey with a retailer. However, 42 percent of retailers say they know too little to effectively engage key segments.

    Two Decades of Recommender Systems at Amazon.com

    IEEE Internet Computing, Brent Smith and Greg Linden

    Amazon.com launched item-based collaborative filtering in 1998, enabling recommendations at a previously unseen scale for millions of customers and a catalog of millions of items. Since we wrote about the algorithm in IEEE Internet Computing in 2003, it has seen widespread use across the Web, including YouTube, Netflix, and many others. The algorithm’s success has been from its simplicity, scalability, and often surprising and useful recommendations, as well as desirable properties such as updating immediately based on new information about a customer and being able to explain why it recommended something in a way that’s easily understandable.

    What was described in our 2003 IEEE Internet Computing article has faced many challenges and seen much development over the years. Here, we describe some of the updates, improvements, and adaptations for item-based collaborative filtering, and offer our view on what the future holds for collaborative filtering, recommender systems, and personalization.
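
For readers who have not seen item-based collaborative filtering before, here is a minimal sketch of the idea: compute how similar each pair of items is based on who bought them, then recommend items similar to what a customer already has. The purchase data, the function names, and the use of cosine similarity over buyer sets are assumptions for illustration; this is not the production algorithm the article describes.

```python
import math
from collections import defaultdict
from itertools import combinations


def item_similarities(purchases):
    """Cosine similarity between items, based on their sets of buyers."""
    item_buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            item_buyers[item].add(customer)

    sims = defaultdict(dict)
    for a, b in combinations(sorted(item_buyers), 2):
        common = len(item_buyers[a] & item_buyers[b])
        if common:
            s = common / math.sqrt(len(item_buyers[a]) * len(item_buyers[b]))
            sims[a][b] = s
            sims[b][a] = s
    return sims


def recommend(purchases, sims, customer, k=3):
    """Rank items the customer does not own by summed similarity to owned items."""
    owned = purchases[customer]
    scores = defaultdict(float)
    for item in owned:
        for other, s in sims.get(item, {}).items():
            if other not in owned:
                scores[other] += s
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


# Hypothetical purchase history: customer -> set of items bought.
purchases = {
    "ann": {"tent", "stove", "lantern"},
    "bob": {"tent", "stove"},
    "cat": {"tent", "lantern", "novel"},
    "dan": {"novel", "cookbook"},
    "eve": {"stove"},
}

sims = item_similarities(purchases)
print(recommend(purchases, sims, "eve"))  # items most similar to "stove"
print(recommend(purchases, sims, "bob"))
```

Because the similarity table is keyed by item, a recommendation can be explained by pointing at the owned item that contributed the score, and a new purchase changes the result immediately without retraining anything per customer.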



    Silicon Republic


    Dublin, Ireland. Inspirefest is a unique international festival of technology, science, design and the arts, covering key trends from Infosec to Blockchain, AI to Robotics, Games to Professional Development. The main conference runs July 6th & 7th in the Bord Gáis Energy Theatre in Dublin’s Silicon Docks. [$$$]

    Big Data Day LA

    Subash DSouza


    Los Angeles, CA. Big Data Day LA 2017 is on Saturday, August 5, at the University of Southern California. [free, registration required]

    viSFest – d3.unconf



    San Francisco, CA. September 22-23. [registration required]


    Clickbait Challenge – Workshop this November at Google Hamburg

    The task is to develop a classifier that rates how clickbaiting a social media post is. For each social media post, the content of the post itself as well as the main content of the linked target web page are provided as JSON objects in our datasets. The software evaluation phase ends July 31. The deadline for workshop paper submissions is August 31.
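
To make the shape of the task concrete, here is a rough sketch of reading posts as JSON objects and assigning each a score between 0 and 1. The field names (`id`, `postText`, `targetParagraphs`), the sample posts, and the hand-written cue-phrase heuristic are all assumptions for illustration; a real entry would train a model on the challenge's labeled data.

```python
import json
import re

# Hypothetical cue phrases; a trained model would learn such signals instead.
CUE_PHRASES = ("you won't believe", "what happened next", "this is why")


def clickbait_score(post):
    """Return the fraction of simple clickbait signals present in the post text.

    The linked page content (`targetParagraphs`) is ignored by this toy scorer.
    """
    text = " ".join(post.get("postText", [])).lower()
    signals = [
        any(phrase in text for phrase in CUE_PHRASES),  # stock teaser phrase
        bool(re.match(r"\d+\s", text)),                 # starts like a listicle
        text.endswith(("?", "!")),                      # ends with ? or !
    ]
    return sum(signals) / len(signals)


# Hypothetical sample data, serialized one JSON object per line.
sample_lines = [
    json.dumps({
        "id": "1",
        "postText": ["10 things you won't believe about squirrels!"],
        "targetParagraphs": ["Squirrels are rodents."],
    }),
    json.dumps({
        "id": "2",
        "postText": ["EU fines Google 2.4bn euros over shopping service"],
        "targetParagraphs": ["The European Commission announced the fine."],
    }),
]

posts = [json.loads(line) for line in sample_lines]
scores = {post["id"]: clickbait_score(post) for post in posts}
print(scores)
```
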

    Tools & Resources

    Variational Inference and Deep Learning: An Intuitive Introduction

    YouTube, The Nutty Netter (Alex Lamb)


    A lecture introducing Variational Inference and Deep Learning. Adapted from a lecture I gave for Aaron Courville’s Deep Learning course (IFT 6266).

    HPC in a day?

    Software Carpentry


    “The idea behind it can be paraphrased by ‘Help a carpentry learner to use a cluster of computers to speed up their day-to-day data lifting.’ Our efforts to brain storm a possible curriculum are currently fixed in this document. Feel free to dial over and provide comments.”

    [P] How HBO’s Silicon Valley built “Not Hotdog” with mobile TensorFlow & Keras


    “Hey, author here, just quickly wanted to mention that this subreddit was instrumental to the creation of the app… It was awesome to be able to see the latest research, chat with authors, and really get the pulse of what’s worth trying & what’s not. This comment in particular ended up being key to lowering the footprint & increasing the accuracy of the final network!”

    Good enough practices in scientific computing

    PLOS Computational Biology; Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, Tracy K. Teal


    Computers are now essential in all branches of science, but most researchers are never taught the equivalent of basic lab skills for research computing. As a result, data can get lost, analyses can take much longer than necessary, and researchers are limited in how effectively they can work with software and data. Computing workflows need to follow the same practices as lab projects and notebooks, with organized data, documented steps, and the project structured for reproducibility, but researchers new to computing often don’t know where to start. This paper presents a set of good computing practices that every researcher can adopt, regardless of their current level of computational skill. These practices, which encompass data management, programming, collaborating with colleagues, organizing projects, tracking work, and writing manuscripts, are drawn from a wide variety of published sources, from our daily lives, and from our work with volunteer organizations that have delivered workshops to over 11,000 people since 2010.


    Full-time positions outside academia

    Principal Data Scientist

    Honeywell; Golden Valley, MN
