Data Science newsletter – February 6, 2017

Data Science Newsletter features journalism, research papers, events, tools/software, and jobs for February 6, 2017

Data Science News

High resolution global gridded data for use in population studies

Nature, Scientific Data; Christopher T. Lloyd, Alessandro Sorichetta & Andrew J. Tatem

from January 31, 2017

Recent years have seen substantial growth in openly available satellite and other geospatial data layers, which represent a range of metrics relevant to global human population mapping at fine spatial scales. The specifications of such data differ widely and therefore the harmonisation of data layers is a prerequisite to constructing detailed and contemporary spatial datasets which accurately describe population distributions. Such datasets are vital to measure impacts of population growth, monitor change, and plan interventions. To this end the WorldPop Project has produced an open access archive of 3 and 30 arc-second resolution gridded data. Four tiled raster datasets form the basis of the archive: (i) Viewfinder Panoramas topography clipped to Global ADMinistrative area (GADM) coastlines; (ii) a matching ISO 3166 country identification grid; (iii) country area; (iv) and slope layer. Further layers include transport networks, landcover, nightlights, precipitation, travel time to major cities, and waterways. Datasets and production methodology are here described. The archive can be downloaded both from the WorldPop Dataverse Repository and the WorldPop Project website.
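For a sense of scale, the 3 and 30 arc-second resolutions mentioned above can be converted to approximate ground distances. The sketch below is only an illustration and is not taken from the paper: it assumes a spherical Earth with the WGS84 equatorial radius, and the function name `arcsec_to_meters` is hypothetical. At the equator the two resolutions come out to roughly 90 m and roughly 1 km, and the east-west width shrinks with the cosine of latitude.

```python
import math

# Assumption: spherical Earth using the WGS84 equatorial radius.
WGS84_EQUATORIAL_RADIUS_M = 6378137.0


def arcsec_to_meters(arcsec, latitude_deg=0.0):
    """Approximate east-west ground width of a cell of `arcsec` arc-seconds."""
    # One arc-second is 1/3600 of a degree, and a full circle is 360 degrees.
    meters_per_arcsec = 2 * math.pi * WGS84_EQUATORIAL_RADIUS_M / (360 * 3600)
    # Circles of latitude get shorter toward the poles by cos(latitude).
    return arcsec * meters_per_arcsec * math.cos(math.radians(latitude_deg))


fine = arcsec_to_meters(3)      # 3 arc-second cell at the equator
coarse = arcsec_to_meters(30)   # 30 arc-second cell at the equator
print(f"3 arc-seconds  ~ {fine:.0f} m")
print(f"30 arc-seconds ~ {coarse:.0f} m")
```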

Using Machine Learning to predict parking difficulty

Google Research Blog, James Cook, Yechen Li and Ravi Kumar

from February 03, 2017

Much of driving is spent either stuck in traffic or looking for parking. With products like Google Maps and Waze, it is our long-standing goal to help people navigate the roads easily and efficiently. But until now, there wasn’t a tool to address the all-too-common parking woes.

Last week, we launched a new feature for Google Maps for Android across 25 US cities that offers predictions about parking difficulty close to your destination so you can plan accordingly.

Why Hollywood as We Know It Is Already Over

Vanity Fair, The Hive blog, Nick Bilton

from January 31, 2017

With theater attendance at a two-decade low and profits dwindling, the kind of disruption that hit music, publishing, and other industries is already reshaping the entertainment business. From A.I. Aaron Sorkin to C.G.I. actors to algorithmic editing, Nick Bilton investigates what lies ahead.

UM researchers awarded for using Big Data in medical studies

University of Michigan, Michigan Daily student newspaper

from January 29, 2017

Researchers at the University of Michigan will be using big data — large data sets that need to be computationally analyzed — to predict when individuals will be affected by diseases like depression and Hepatitis C. Big data will also be used to understand the applications of single-cell gene sequencing — examining genetic information from individual cells — through three projects that were recently funded.

The three projects, M-CHAMP, the Michigan Center for Single-Cell Genomic Data Analysis and the Intern Health Study, are receiving $3 million in funding from the Michigan Institute for Data Science as part of the Challenge Initiatives Program, which challenges data scientists and other research investigators to solve real-world problems in areas of transportation research, learning analytics, social sciences and health sciences. The program is part of the University’s plan to invest $100 million in Data Science Initiatives and infrastructure, which was announced in September 2015.

NYU Researchers Study Patients’ Genetic and Susceptibility Risk Factors in Hopes of Finding the Path to Cure Lymphedema

NYU News

from February 03, 2017

Each year, about 1.38 million women worldwide are diagnosed with breast cancer. Advances in diagnosis and treatment have facilitated a 90 percent five-year survival rate among those treated. However, with the increased rate and length of survival following breast cancer, patients face a lifetime risk of developing lymphedema, one of the most distressing and feared late-onset breast cancer-related effects. … Researchers from New York University Rory Meyers College of Nursing (NYU Meyers), led by Dr. Mei R. Fu, PhD, RN, FAAN, conducted a study, “Precision assessment of heterogeneity of lymphedema phenotype, genotypes and risk prediction,” to address this phenomenon and prospectively examine phenotype of arm lymphedema by limb volume and lymphedema symptoms in relation to inflammatory genes in women treated for breast cancer.

Inaugural DataFest reflects a growing interest

Harvard Gazette

from February 03, 2017

The proof of Harvard’s growing interest in data science became even clearer the third week of January when the inaugural session of the Harvard DataFest conference reached capacity (at 166), with several dozen students, researchers, and staff waitlisted.

“We were able to create new course materials, and build awareness of existing Harvard resources, for developing skills in working with data,” said Mercè Crosas, chief data science and technology officer at the Institute for Quantitative Social Science (IQSS), who organized the workshop.

How data science is transforming cancer treatment scheduling

MedCity News, Mohan Giridharadas

from February 05, 2017

A mathematical approach to infusion center scheduling promises more patients served, reduced cost of service, optimized equipment and facility utilization, and a whole lot less sitting in waiting rooms.

[1701.08884] Enabling large-scale viscoelastic calculations via neural network acceleration

arXiv, Physics > Geophysics; Phoebe R. DeVries, T. Ben Thompson, Brendan J. Meade

from January 31, 2017

One of the most significant challenges involved in efforts to understand the effects of repeated earthquake cycle activity is the computational cost of large-scale viscoelastic earthquake cycle models. Computationally intensive viscoelastic codes must be evaluated thousands of times and locations, and as a result, studies tend to adopt a few fixed rheological structures and model geometries, and examine the predicted time-dependent deformation over short (<10 yr) time periods at a given depth after a large earthquake. Training a deep neural network to learn a computationally efficient representation of viscoelastic solutions, at any time, location, and for a large range of rheological structures, allows these calculations to be done quickly and reliably, with high spatial and temporal resolution. We demonstrate that this machine learning approach accelerates viscoelastic calculations by more than 50,000%. This magnitude of acceleration will enable the modeling of geometrically complex faults over thousands of earthquake cycles across wider ranges of model parameters and at larger spatial and temporal scales than have been previously possible.

[1702.00748] QCD-Aware Recursive Neural Networks for Jet Physics

arXiv, High Energy Physics – Phenomenology; Gilles Louppe, Kyunghyun Cho, Cyril Becot, Kyle Cranmer

from February 02, 2017

Recent progress in applying machine learning for jet physics has been built upon an analogy between calorimeters and images. In this work, we present a novel class of recursive neural networks built instead upon an analogy between QCD and natural languages. In the analogy, four-momenta are like words and the clustering history of sequential recombination jet algorithms is like the parsing of a sentence. Our approach works directly with the four-momenta of a variable-length set of particles, and the jet-based tree structure varies on an event-by-event basis. Our experiments highlight the flexibility of our method for building task-specific jet embeddings and show that recursive architectures are significantly more accurate and data efficient than previous image-based networks. We extend the analogy from individual jets (sentences) to full events (paragraphs), and show for the first time an event-level classifier operating on all the stable particles produced in an LHC event.
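The analogy in the abstract above (four-momenta as words, clustering history as a parse tree) can be illustrated with a toy recursive embedding. This is a minimal sketch under stated assumptions, not the authors' architecture: the `Node` class, the `embed` function, the hidden size, the ReLU nonlinearity, the random untrained weights, and the example four-momenta are all hypothetical choices made for illustration, and the tree is assumed to be a given binary clustering history.

```python
import random

HIDDEN = 4  # hypothetical embedding size


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class Node:
    """A node in a binary clustering tree; leaves carry a particle four-momentum."""

    def __init__(self, p4=None, left=None, right=None):
        self.left, self.right = left, right
        if left is not None and right is not None:
            # A merged node's four-momentum is the sum of its children's.
            self.p4 = [a + b for a, b in zip(left.p4, right.p4)]
        else:
            self.p4 = list(p4)


def embed(node, W_u, W_h):
    """Recursively embed a tree: leaves from their four-momentum,
    inner nodes from both child embeddings plus their own four-momentum."""
    u = relu(matvec(W_u, node.p4))
    if node.left is None:
        return u
    h_left = embed(node.left, W_u, W_h)
    h_right = embed(node.right, W_u, W_h)
    return relu(matvec(W_h, h_left + h_right + u))  # list concatenation


rng = random.Random(0)
W_u = random_matrix(HIDDEN, 4, rng)           # four-momentum -> hidden
W_h = random_matrix(HIDDEN, 3 * HIDDEN, rng)  # [left, right, own] -> hidden

# Made-up (E, px, py, pz) values for three particles, clustered as ((a, b), c).
a = Node(p4=[10.0, 1.0, 2.0, 9.0])
b = Node(p4=[5.0, 0.5, -1.0, 4.5])
c = Node(p4=[2.0, -0.5, 0.5, 1.5])
jet = Node(left=Node(left=a, right=b), right=c)

embedding = embed(jet, W_u, W_h)
print(len(embedding), "dimensional jet embedding")
```

Because the recursion follows whatever tree it is handed, the same weights apply to jets with any number of particles, which is the property the abstract contrasts with fixed-size image inputs.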

Smart Underwater Imaging Aims to Save Endangered Marine Species

Intel, iQ

from January 25, 2017

Researchers at the University of California, San Diego built an artificially intelligent camera technology, powered by an Intel Edison module, that could lead to autonomous monitoring systems for tracking endangered species.

Albert Einstein and the origins of modern cosmology

Physics Today, Cormac O’Raifeartaigh

from February 03, 2017

In 1917 Einstein published a paper that applied general relativity to the universe, changing our view of the cosmos forever.

Data science is creating a tidal wave of opportunity for women to get into executive leadership

Recode, Daphne Kis

from February 03, 2017

We have reached the tipping point in big data. We can now access, manage and manipulate massive amounts of data with such ease that the real work has shifted to analysis and practical applications across industries. This new discipline, called data science, will not be exclusive to the male-dominated computer science profession, and a tidal wave of opportunity will arise for women.

The world of finance, of course, has benefited greatly from data and data science, but most of the action in big data has been reserved for computer scientists. Right now, there are fewer women graduating with computer science degrees than in the 1980s. Nevertheless, as we move from big data to data science, doors will open for more and more women.

Women now make up 40 percent of graduates with degrees in statistics — that’s a good indicator for women in an environment increasingly obsessed with data.

Events

Big Boulder 2017 Registration is open!

Boulder, CO Big Boulder 2017 will be June 1-2 and registration is now open! It’s hard to believe, but this is our 6th year and we’re excited to see everyone. [$$$$]

The 50 Years of the ACM Turing Award Celebration

San Francisco, CA Registrations are now open for ACM’s celebration of 50 years of the Turing Award. Registration is free of charge for ACM members. Space is limited. June 23 – 24. [registration required]

Deadlines

Kavli Summer Institute in Cognitive Neuroscience

The 2017 Kavli Summer Institute will be held in Santa Barbara, CA from June 26 through July 7, 2017. Deadline to apply is February 10.

UC-Berkeley-Haas School of Business Survey: Understanding researcher software needs + values

We are investigating researcher needs and values related to software and computer code. Your responses may help us and other organizations develop services that meet the needs of the research community.

Apply to attend rOpenSci unconf 2017!

Los Angeles, CA If you’d like to attend, it only takes a few minutes to nominate yourself. Submissions close Wednesday, February 22, 2017 at midnight Pacific Time.

Gertrude M. Cox Award Committee Seeks Nominees

The award recognizes a statistician in early to mid-career (fewer than 15 years after terminal degree) who has made significant contributions to one or more of the areas of applied statistics in which Gertrude Cox worked: survey methodology, experimental design, biostatistics, and statistical computing. Email nominations by February 28.

Association for Consumer Research Conference

San Diego, CA Conference is October 26-29. Deadline for submissions is Friday, March 10.

Connected Life – Conference Series

Oxford, England Connected Life 2017 is a student-run conference based at the University of Oxford dedicated to sparking exchange between disciplines and showcasing emerging Internet research. Bringing together participants from across the humanities, social sciences, physical sciences, and beyond, Connected Life seeks to foster collaborations within and beyond Oxford in pursuit of an enhanced understanding of the Internet and its multifaceted effects.

Connected Life will be held at the Faculty of Classics on Monday, June 19th, 2017. Deadline for paper submissions is April 1.

Lipari School on Computational Complex and Social Systems

Applications can be submitted up to May 31.

NYU Center for Data Science News

Data Science Environments partners publish reproducibility book

University of Washington, eScience Institute

from February 01, 2017

Researchers from the UW’s eScience Institute, New York University Center for Data Science and Berkeley Institute for Data Science (BIDS) have authored a new book titled The Practice of Reproducible Research. Representatives from the three universities, all Moore-Sloan Data Science Environments partners, joined on January 27, 2017, at a symposium hosted by BIDS. There, speakers discussed the book’s content, including case studies, lessons learned and the potential future of reproducible research practices.

The book is available online, and will also be published in print in late 2017. It is described on the BIDS webpage as “a collection of 31 case studies in reproducible research practices written by scientists and engineers working in the data-intensive sciences. Each case study presents the specific approach that the author used to achieve reproducibility in a real-world research project, including a discussion of the overall project workflow, major challenges, and key tools and practices used to increase the reproducibility of the research.”

Tools & Resources

(More than) one million requests per second in Node.js

GitHub – uWebSockets

from February 04, 2017

If you previously read the post “WIP: Faster HTTP for Node.js” you might already know what I’ve been up to for the last couple of months. If not, then let me begin with an example.

The Practice of Reproducible Research – Case Studies and Lessons from the Data-Intensive Sciences

Justin Kitzes, Daniel Turek, Fatma Deniz (Eds.)

from February 03, 2017

The book The Practice of Reproducible Research will be published in 2017. It contains a collection of 31 case studies of reproducible research workflows, written by academic researchers in the data-intensive sciences.

YouTube-8M: A Large and Diverse Labeled Video Dataset for Video Understanding Research

YouTube

from September 26, 2016

“YouTube-8M is a large-scale labeled video dataset that consists of 8 million YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities. It also comes with precomputed state-of-the-art vision features from billions of frames, which fit on a single hard disk. This makes it possible to train video models from hundreds of thousands of video hours in less than a day on 1 GPU!”

Scraping for Craft Beers: A Dataset Creation Tutorial

Kaggle, No Free Hunch blog, Jean-Nicholas Hould

from January 31, 2017

“This post is separated in two sections: scraping and tidying the data. In the first part, we’ll plan and write the code to collect a dataset from a website. In the second part, we’ll apply the “tidy data” principles to this freshly scraped dataset. At the end of this post, we’ll have a clean dataset of craft beers.”
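The two-step shape described in the quote (collect rows from a page, then tidy them) can be sketched with the Python standard library alone. This is an illustrative sketch, not the tutorial's code: the inline `SAMPLE_HTML`, the beer names, the column order, and the `RowCollector` and `tidy` helpers are all hypothetical, and a real scraper would fetch the page over HTTP rather than parse a hard-coded string.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a downloaded page; the rows are made up.
SAMPLE_HTML = """
<table>
  <tr><td>Example Pale Ale</td><td> 5.0% </td><td>12 oz.</td></tr>
  <tr><td>Sample Stout</td><td>7.2%</td><td>16 oz.</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Step 1 (scraping): collect the text of every <td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def tidy(row):
    """Step 2 (tidying): one variable per column, with numeric types and no units."""
    name, abv, size = (cell.strip() for cell in row)
    return {
        "name": name,
        "abv": float(abv.rstrip("%")) / 100,  # "5.0%" -> fraction
        "ounces": float(size.split()[0]),     # "12 oz." -> 12.0
    }


parser = RowCollector()
parser.feed(SAMPLE_HTML)
beers = [tidy(row) for row in parser.rows]
for beer in beers:
    print(beer)
```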

New Dataset: Five Years of Longitudinal Data from Scratch

Benjamin Mako Hill, copyrighteous blog

from February 03, 2017

Scratch is a block-based programming language created by the Lifelong Kindergarten Group (LLK) at the MIT Media Lab. Scratch gives kids the power to use programming to create their own interactive animations and computer games. Since 2007, the online community that allows Scratch programmers to share, remix, and socialize around their projects has drawn more than 16 million users who have shared nearly 20 million projects and more than 100 million comments. It is one of the most popular ways for kids to learn programming and among the larger online communities for kids in general.

Careers

Tenured and tenure track faculty positions

Lecturer in Data Science

Macquarie University; North Ryde, Australia

Postdocs

Postdoctoral Associates

Yale University, Digital Humanities Lab; New Haven, CT

Postdoctoral Fellowship in computational motor learning and rehabilitation

Tsukuba University; Tsukuba, Japan

Full-time, non-tenured academic positions

Associate Director for Actionable Science, Associate Director for Research (2)

University of Maryland, National Socio-Environmental Synthesis Center; Annapolis, MD
