Data Science newsletter – November 23, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for November 23, 2017


 
 
Data Science News



Four more universities join the Alan Turing Institute

University of Oxford, Mathematical Institute



The Alan Turing Institute is the national institute for data science, headquartered at the British Library. Five founding universities – Cambridge, Edinburgh, Oxford, UCL and Warwick – and the UK Engineering and Physical Sciences Research Council created The Institute in 2015. Now we are delighted to announce that four universities – Leeds, Manchester, Newcastle and Queen Mary University of London – are also set to join the Institute as university partners. The new universities will work with our growing network of partners in industry and government to advance the world-changing potential of data science.


Anonymized location-tracking data proves anything but: Apps squeal on you like crazy

The Register, Thomas Claburn



M. Keith Chen, associate professor of economics at UCLA’s Anderson School of Management, and Ryne Rohla, a doctoral student at Washington State University, accomplished this minor miracle of data science by assuming that the GPS coordinates transmitted by mobile phones between 1am and 4am over several weeks represent the location of device owners’ homes.

“If a person has a consistent early morning location over the three weeks before Thanksgiving, we use this location as a simple proxy for their ‘home,'” they explain in a draft research paper.

It’s far from foolproof, but good enough for government, or in this case academic, work.

Chen and Rohla used this information to determine that political advertising related to the divisive 2016 election had caused enough tension in families that people cut short their Thanksgiving visits last year with relatives holding opposing political views.
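
The home-location proxy described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `infer_home`, the coordinate rounding used to group nearby pings, and the `min_share` consistency threshold are all assumptions introduced here. The source specifies only the 1am to 4am window and the requirement that the early-morning location be consistent over several weeks.

```python
from collections import Counter
from datetime import datetime


def infer_home(pings, start_hour=1, end_hour=4, precision=3, min_share=0.5):
    """Return the dominant early-morning grid cell, or None if none dominates.

    pings: iterable of (timestamp, latitude, longitude) tuples.
    precision: decimal places used to snap coordinates to a grid (assumed).
    min_share: fraction of early-morning pings the top cell must hold (assumed).
    """
    # Keep only pings inside the early-morning window and snap them to a grid.
    cells = Counter(
        (round(lat, precision), round(lon, precision))
        for ts, lat, lon in pings
        if start_hour <= ts.hour < end_hour
    )
    if not cells:
        return None
    cell, count = cells.most_common(1)[0]
    total = sum(cells.values())
    # Require the top cell to be "consistent" across the observation period.
    return cell if count / total >= min_share else None


# Hypothetical pings: three early-morning readings near one spot, one at midday.
pings = [
    (datetime(2016, 11, 1, 2, 0), 34.0701, -118.4441),
    (datetime(2016, 11, 2, 3, 30), 34.0702, -118.4439),
    (datetime(2016, 11, 3, 1, 15), 34.0699, -118.4442),
    (datetime(2016, 11, 3, 14, 0), 34.0522, -118.2437),
]
home = infer_home(pings)
print(home)
```

The midday ping is ignored because it falls outside the window, which is also why "anonymized" location traces are so easy to tie back to a household.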


The Impact of Alexa and Google Home on Consumer Behavior

dashbot



We recently conducted a survey, using Survata, amongst owners of Amazon Alexa and Google Home devices for their end-user experiences.

Key takeaways from the survey include:

  • Consumers are happy with the interactions and highly recommend the devices
  • The devices are behavior changing — consumers use the devices frequently and often rely on them
  • Consumers are pleasantly surprised by the ease of use, accuracy, and functionality of the devices
  • There is an opportunity for generating awareness for 3rd party skills

    Extra Extra

    Last week we were fooled into thinking that ICML in Stockholm had already sold out. It has not. And apparently I am way too gullible. My sincerest apologies. Let’s all meet in Sweden to discuss.



    This is the best unexpected internet treasure of the week: Death by Derivatives, a fascinating historical account of the invention and exhilarating anxiety of the financial derivative. The story starts with a botched suicide attempt in Chicago.

    Writer Sam Lansky tells of his “uncanny valley of online dating.” The term is a “reference to the discomfiting effects of things that are eerily-close-to but not-quite human.” Hat tip to Longreads for the link to Lansky’s essay on Medium.


    Cheatsheet: Everything you need to know about Amazon Advertising

    Digiday, Shareen Pathak



    Amazon Advertising has grown up this year, a function of the company’s growing investment in its $1 billion-plus offering, but also increased attention from agencies and brands. We break down how the offering stands.


    Measuring and monitoring collective attention during shocking events | EPJ Data Science | Full Text

    EPJ Data Science; Xingsheng He and Yu-Ru Lin



    There has been growing interest in leveraging Web-based social and communication technologies for better crisis response. How might the Web platforms be used as an observatory to systematically understand the dynamics of the public’s attention during disaster events? And how could we monitor such attention in a cost-effective way? In this work, we propose an ‘attention shift network’ framework to systematically observe, measure, and analyze the dynamics of collective attention in response to real-world exogenous shocks such as disasters. Through tracing hashtags that appeared in Twitter users’ complete timeline around several violent terrorist attacks, we study the properties of network structures and reveal the temporal dynamics of the collective attention across multiple disasters. Further, to enable an efficient monitoring of the collective attention dynamics, we propose an effective stochastic sampling approach that accounts for the users’ hashtag adoption frequency, connectedness and diversity, as well as data variability. We conduct extensive experiments to show that the proposed sampling approach significantly outperforms several alternative methods in both retaining the network structures and preserving the information with a small set of sampling targets, suggesting the utility of the proposed method in various realistic settings.


    Stability in the SciPy ecosystem: a summary of the discussion

    Konrad Hinsen



    The plea for stability in the SciPy ecosystem that I posted last week on this blog has generated a lot of feedback, both as comments and in a lengthy Twitter thread. For the benefit of people discovering it late, here is a summary of the main arguments and my reply to them.


    Mobile-Phone Case at U.S. Supreme Court to Test Privacy Protections

    Bloomberg Politics, Greg Stohr



    A U.S. Supreme Court with a record of protecting digital privacy is taking up a case that may curb law enforcement officials’ power to track people using mobile-phone data.

    In arguments Nov. 29, the justices will consider requiring prosecutors to get a warrant before obtaining mobile-phone tower records that show a person’s location over the course of weeks or months.

    The case could have a far-reaching impact. Prosecutors seek phone-location information from telecommunications companies in tens of thousands of cases a year. Special Counsel Robert Mueller’s team used location data to build a case against George Papadopoulos, the former Trump campaign adviser who pleaded guilty to lying to federal investigators.

    Beyond location data, the case has implications for the growing number of personal and household devices that connect to the cloud — including virtual assistants, smart thermostats and fitness trackers.


    The Pudding Awards

    The Pudding



    Although we love creating stories for the internet, we are endlessly inspired by what others in our field are producing. Our team keeps copious amounts of bookmarks when we see something that wows us. We decided to share notes this year and reflect on the past 11 months of visual storytelling. Here are our five favorite projects of the year.


    AMA, Patient Groups Join “All of Us” Precision Medicine Outreach

    HealthIT Analytics, Jennifer Bresnick



    More than a dozen professional societies and patient advocacy groups have signed up to promote participation in the national Precision Medicine Initiative’s All of Us research program, announced the National Institutes of Health (NIH).

    The organizations, including the American Medical Association, American Academy of Family Physicians, National Baptist Convention, and League of United Latin American Citizens, will receive a combined $1 million to support outreach and community education aimed at recruiting a diverse group of patients to contribute their data to the effort.

    “We want to build long-term relationships with our participants based on transparency and trust. These organizations will help us in that effort,” said Eric Dishman, director of the All of Us Research Program at NIH.


    First ASA Bachelor’s Survey Provides Snapshot of 2016 Graduates

    Amstat News



    The findings reveal information that can be broadly characterized as illuminating undergraduate studies, post-graduate studies, and post-graduate employment. To highlight a few features in each category, starting with undergraduate studies, many students wished they had taken more computer science/programming/coding courses and mathematics courses and advised current students to do so.

    Also, nearly one-third of respondents double-majored, with economics and mathematics being the most common majors. Such extensive double majoring is consistent with the advice of 2016 ASA President Jessica Utts, whose initiative promoting the theme “statistics plus” showed that students can combine statistics with almost any other interest.


    Stanford-Google digital-scribe pilot study to be launched

    Stanford Medicine, Scope Blog



    Some medical practices, including the Stanford family medicine clinic on campus, use human scribes to enter information into EHRs, allowing physicians to concentrate on patients, rather than the computer. But what if a device could interpret each office visit and — using speech recognition and machine learning tools — automatically enter the information into an EHR system?

    [Steven] Lin and his Google collaborators are now launching a pilot study to investigate such a system, which they are calling a “digital-scribe.” A digital-scribe could save physician time, lessening the need to enter data. It could also improve the visit for patients, who would again have the full attention of their physician, Lin pointed out.


    Genomenon Creates Automated, Evidence-Based Cancer Gene Panel

    Xconomy, Sarah Schmid Stevenson



    Genomenon, a University of Michigan spinout developing analytics and data visualization software for the genomics industry, last week announced it has created an evidence-based cancer gene panel using automated machine learning techniques.


    Is it time to say goodbye to ClinicalTrials.Gov?

    MedCity News, Zikria Syed



    The statistics are daunting. More than 50 percent of clinical trial sites fail to meet their enrollment goals and up to 20 percent of the sites fail to enroll a single patient. It is no wonder that 80 percent of clinical trials fail to meet their enrollment timelines and approximately one-third of Phase III studies are terminated due to enrollment problems.

    However, this inability to recruit patients is not due to lack of effort or money. The industry spends over $5.9 billion on patient recruitment.

    “The problem with ClinicalTrials.gov is it is totally unbiased,” Facebook cofounder and philanthropist Sean Parker observed at an industry event in reference to the government registry of clinical trials by NIH. “It is just a directory. A lot of those clinical trials are somewhere between useless and harmful. So, it is very difficult to know anything about whether you should enroll in a trial.”


    Data Visualization of the Week

    Twitter, Mike Murphy




    New tool can track resistant malaria at unprecedented speed and detail

    ScienceNordic, Rasmus Kragh Jakobsen



    A team of scientists have now found a way to monitor resistance cheaper and quicker than ever before.

    “We’ve discovered a relatively simple and cheap method to monitor antimalarial resistance. It could be a big step forward in terms of avoiding a big catastrophe,” says Ph.D. student Sidsel Nag from the University of Copenhagen, Denmark, who is the lead author of the new study published in Nature Scientific Reports.


    UCLA engineers use deep learning to reconstruct holograms and improve optical microscopy

    UCLA Newsroom



    In two new papers, UCLA researchers report that they have developed new uses for deep learning: reconstructing a hologram to form a microscopic image of an object and improving optical microscopy.

    Their new holographic imaging technique produces better images than current methods that use multiple holograms, and it’s easier to implement because it requires fewer measurements and performs computations faster.


    Company Data Science News

    Apple has faced criticism for being closed off, called “the NSA of AI” by Stanford’s Jerry Kaplan. A BuzzFeed report followed up on that year-old claim and found much the same, with some loosening around the edges. Apple’s much-ballyhooed blog, the portal through which in-house researchers were supposedly encouraged to publish their work, has had only seven posts in four months. (That makes our weekly publication schedule at the newsletter look like a veritable information deluge, given that there are two of us and so very many working in machine learning and AI at Apple.) What’s peeving researchers even more, though, is that the blog doesn’t single out authors by name, listing only groups. This is more in keeping with a design-centered culture than an academic one. One unnamed academic went so far as to call the blog “completely useless,” though he was objecting to the lack of detail: “it amounts to bragging…it is impossible to actually learn anything from it”. Score one for staying in the academy!



    Uber has gotten itself in the news yet again for unethical behavior. This time it is related to a 2016 hacking incident in which 50 million customers and 7 million drivers had their “names, email addresses and phone numbers” compromised. Chris Hoofnagle at the Berkeley Center for Law and Technology referred to Uber’s security practices as “amateur hour”. The data was not encrypted (doh!) and then the company tried to cover up the fact that it had been stolen (legally very bad). Hoofnagle notes, “The only way one can have direct liability under security breach notification statutes is to not give notice.” As it turns out, the new CEO Dara Khosrowshahi cannot so easily claim that he’s the new adult in the room cleaning up after a newly discovered mess left by others. Reports suggest he has known about the leak since September.



    Groq, a stealthy machine learning hardware startup that had its origins somewhere in the tech behemoth known as Google, has announced it will start manufacturing semiconductor chips designed specifically for artificial intelligence applications and data centers. Their minimalist website reports a maximalist goal: the processor will run 400 trillion operations per second. Whoa! In terms of energy consumption, that’s 8 trillion operations per second per watt. It’s wise to be concerned about the environmental impact of all this computation. This could pose a serious challenge to NVidia’s dominance in the chip market. Still, it helps to have a 20-year head start. Even Google/Groq will have its hands full trying to challenge the current computational titan.



    Google’s Android phones have been continuously collecting users’ locations even when location data services have been disabled, no apps are running, and no SIM card is in the device. The phones collect the information and store it in order to send it back to Google when they’re back on the network. After being contacted by Quartz and confirming the practice, the company now reports it will push an update curtailing this practice. This is another triumph for the free press. Still, your apps are collecting geolocation data on you pretty much all the time even if your handset maker isn’t. And in this case, Google can collect location data through the many apps it supports.

    Intel is getting ready to launch new chips from recently acquired Nervana, which will compete against Intel’s existing product, a move straight from Clayton Christensen’s Innovator’s Dilemma playbook.


    I’m Testifying in Front of Congress in Washington DC about Data Breaches – What Should I Say?

    Troy Hunt



    I’ve already drafted up both my written statement and oral testimony but since I have a few days before I need to submit anything, I really wanted to reach out to the community and share a very broad overview of what I have planned. Many of the people who read this blog have had first-hand experience with data breaches themselves either by having their personal info exposed, working for a company that’s been breached, sending me data they’ve seen circulating or in some cases even being, well, let’s call them “person 0” to have seen the breach. I’d love to know what you think is important for the folks in Washington to hear.

     
    Events



    Creativity and Collaboration: Revisiting Cybernetic Serendipity

    Sackler Colloquia of the National Academy of Sciences



    Washington, DC Monday, March 12, 2018 at National Academy of Sciences. Organized by: Ben Shneiderman, Maneesh Agrawala, Donna Cox, Alyssa Goodman, Youngmoo Kim, and Roger Malina. [$$$]


    AI and Policy Event in DC

    Princeton Center for Information Technology Policy



    Washington, DC December 8 at National Press Club. Organized by Princeton Center for Information Technology Policy. [registration requested]

     
    Deadlines



    Submissions open for the PacificVis 2018 visual data storytelling contest

    “The contest is a part of PacificVis 2018, a unified visualization symposium, discussing all areas of visualization, including information, scientific, graph, security, and software visualization. Storytellers are invited to submit visual data driven stories that draw upon any of these areas.” Deadline for submissions is January 18, 2018.
     
    NYU Center for Data Science News



    PhD Candidate Profile: Sreyas Mohan

    Medium, NYU Center for Data Science



    Who are our PhD students? Where do they come from, what are they studying now, and where do they hope to go in the future? Find out more about one of our PhD candidates, Sreyas Mohan!


    Meet NYU faculty, neurobiologist and entrepreneur, André A. Fenton

    NYU Entrepreneurship, Jennifer Curtis



    Dr. André Fenton is a Professor of Neural Science at the NYU Center for Neural Science. … “Dr. Fenton is an internationally-recognized electrophysiologist trained in the extracellular recording of brain activity from freely-moving rats. He is also the founder & president of BioSignal Group, which advances the concept of developing tiny devices that wirelessly records brain data from rodents — by miniaturizing EEG machines to measure brain activity in people.”

     
    Tools & Resources



    RoboCupSimData — Bitbucket

    Oliver Obst



    a large dataset from games of some of the top teams (from 2016 and 2017) in RoboCup Soccer Simulation League (2D), where teams of 11 robots (“agents”) compete against each other. Overall, we used 10 different teams to play each other, resulting in 45 unique pairings. For each pairing, we ran 25 matches (of 10 minutes), leading to 1125 matches or more than 180 hours of game play. The generated CSV files are 17GB of data (zipped), or 229GB (unzipped).
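
The dataset's headline numbers follow from simple counting, which can be checked with a short sketch. The variable names are illustrative only; the inputs (10 teams, 25 matches per pairing, 10 minutes per match) come from the description above.

```python
from itertools import combinations

teams = 10
matches_per_pairing = 25
minutes_per_match = 10

# Every unordered pair of distinct teams is one pairing.
pairings = len(list(combinations(range(teams), 2)))
matches = pairings * matches_per_pairing
hours = matches * minutes_per_match / 60

print(pairings, matches, hours)
```

The result is 45 pairings, 1125 matches and 187.5 hours of play, which agrees with the "more than 180 hours" quoted.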


    Introducing The Topos Similarity Index and [x] Everywhere

    Medium, Topos



    While we began by developing new concepts of distance and similarity in New York City, we have since scaled our platform to 15 additional metro regions. As part of this expansion, we have been studying and computing similarities between neighborhoods across the United States. Today, we are excited to share two ways for you to play with and explore this set of relationships.


    Skymind’s Deeplearning4j, the Eclipse Foundation, and scientific computing in the JVM

    JAXenter, Chris Nicholson



    Why did Skymind join the Eclipse Foundation last month? Chris Nicholson, CEO of Skymind and creator of Deeplearning4j, explains why open sourcing its libraries was a step forward to show developers and enterprises that Deeplearning4j is mature, secure, and a safe bet for deep learning.


    After quite some discussion, scikit-learn finally merged a replacement for OneHotEncoder that can deal with strings and has a nicer interface

    Twitter, Andreas Mueller



    “After quite some discussion, scikit-learn finally merged a replacement for OneHotEncoder that can deal with strings and has a nicer interface: https://github.com/scikit-learn/scikit-learn/pull/9151 … Say hello to the CategoricalEncoder http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.CategoricalEncoder.html#sklearn-preprocessing-categoricalencoder … (only in the master branch for now)”
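
For readers unfamiliar with the idea, one-hot encoding of string categories can be sketched in plain Python. This is not scikit-learn's implementation or API: the helper `one_hot_strings` is hypothetical and exists only to show what such an encoder has to do when the inputs are strings rather than integers.

```python
def one_hot_strings(values):
    """One-hot encode a list of strings.

    Returns (categories, rows): the sorted unique categories and one 0/1 row
    per input value, with a single 1 in the column of its category.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


categories, rows = one_hot_strings(["red", "green", "red", "blue"])
print(categories)
print(rows)
```

Sorting the categories keeps the column order deterministic, which matters when the same encoding must be reapplied to new data.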


    Fairness Measures – Detecting Algorithmic Discrimination

    Fairness Measures



    “The increasing prevalence of automated decision-making process is increasing the risk associated to models that can potentially discriminate against disadvantaged groups. The Fairness Measures project contributes to the development of fairness-aware algorithms and systems by providing relevant datasets and software.”
