Data Science newsletter – July 19, 2019

Newsletter features journalism, research papers, events, tools/software, and jobs for July 19, 2019


Data Science News

Funded Partnership Brings Dryad and Zenodo Closer

Dryad, Zenodo


Dryad is a leader in data curation and data publishing. For the last ten years, Dryad has focused primarily on research data, supporting a CC0 license and manually curating each incoming dataset. Zenodo, a general use repository hosted at CERN, has been paving the way in software citation and publishing. As longtime players in the open science movement, we believe that we can advance open science and open-source projects further by working together. Instead of working individually to broaden each of our scopes, building competitive features, and inefficiently using our limited resources, Dryad and Zenodo will be working together to support more seamless workflows that make the process easier for researchers.

To jumpstart this collaboration, we are proud to have been awarded an Alfred P. Sloan Foundation grant that will enable us to co-develop new solutions focused on supporting researcher and publisher workflows as well as best practices in data and software curation.

How Is Data Affecting Your Dating Life?

Dataconomy, Arwa Sutarwala


Technology has changed the way we communicate, the way we move, and the way we consume content. It’s also changing the way we meet people. Looking for a partner online is a more common occurrence than searching for one in person. According to a study by Online Dating Magazine, there are almost 8,000 dating sites out there, so the opportunity and potential to find love is limitless. Besides presenting potential partners and the opportunity for love, these sites have another thing in common — data. Have you ever thought about how dating apps use the data you give them?

How NLP Is Advancing Asset Management

Medium, SyncedReview


Unlike rule-based or statistical algorithms which depend on human-crafted rules or task-specific ad-hoc features, the deep learning approach trains a single end-to-end model, discovering the rules and features along the way. Such models are able to obtain high performance across a variety of NLP tasks. Recent research breakthroughs have also improved DL models’ understanding of context and made them better able to tackle large-scale problems.

The asset management industry is particularly interested in these capabilities. Leading firms like Citadel and AIG are leveraging signals from alternative data such as social network info, shopping history, shipping info, GPS and satellite data, etc. in a bid to increase active investment return — and are continuously exploring how NLP technology can improve efficiency and scalability in this practice. Automation of the ingestion and analysis of public filings and quick consumption and sentiment scoring of news and social media content are also being introduced by in-house teams and third-party providers in asset management.

Much Ado About Data: How America and China Stack Up

Macro Polo blog, Matt Sheehan


The reality is far more complex, because data is not a single-dimensional input into AI, something that China simply has “more” of. The relationship between data and AI prowess is analogous to the relationship between labor and the economy. China may have an abundance of workers, but the quality, structure, and mobility of that labor force is just as important to economic development.

Likewise, data is better understood as a key input with five different dimensions—quantity, depth, quality, diversity, and access—all of which affect what data can do for AI systems.

What follows is a framework for analyzing the comparative advantages of countries and companies across the five dimensions, with the aim of bringing more precision to comparisons of how America and China stack up. This is, however, just one framework, and I welcome critiques and suggestions on how to quantitatively measure each of these dimensions.

Connected fitness: how Peloton is influencing the future of exercise

The Verge, Natt Garun


… The Loechers are just two of the hundreds of thousands of people who’ve purchased connected fitness equipment in recent years. The category is quickly growing, with a variety of devices offering at-home workout solutions where users stare at screens for guided instructions instead of an in-person fitness trainer. It was creepy at first, Brittany admits. “But everyone is constantly on their phone and society is moving that way anyway,” she says, and ultimately the convenience outweighed her initial qualms.

“This is the future of fitness,” Nikolas says. “I can work out when I want, I don’t have to deal with driving to the gym, I can keep track of my fitness goals, and still have time to spend with my family.”

After spending over $10,000 on their connected fitness machines, the couple is now preparing to literally invest in connected fitness. When news about Peloton’s IPO filing broke, Nikolas immediately jumped at the opportunity, vowing to do whatever it takes to add it to his portfolio. “We called our stock broker and said we are getting that stock!” Brittany says. “I believe in this company, the product, and this future.”

Skin sensors are the future of health care

Nature, Comment; Shuai Xu, Arun Jayaraman & John A. Rogers


Thin, soft electronic systems that stick onto skin are beginning to transform health care. Millions of early versions of sensors, computers and transmitters woven into flexible films, patches, bandages or tattoos are being deployed in dozens of trials in neurology applications alone, and their numbers are growing rapidly. Within a decade, many people will wear such sensors all the time. The data they collect will be fed into machine-learning algorithms to monitor vital signs, spot abnormalities and track treatments.

Medical problems will be revealed earlier. Doctors will monitor their patients’ recovery remotely while the patient is at home, and intervene if their condition deteriorates. Epidemic spikes will be flagged quickly, allowing authorities to mobilize resources, identify vulnerable populations and monitor the safety and efficacy of drugs issued. All of this will make health care more predictive, safe and efficient.

Where are we now? The first generation of biointegrated sensors can track biophysical signals, such as cardiac rhythms, breathing, temperature and motion. More advanced systems are emerging that can track certain biomarkers (such as glucose) as well as actions such as swallowing and speech.

Supercomputer Innovations Open Science, Engineering Frontiers

Forbes, Oracle BrandVoice


“Machine learning is becoming a bigger deal for computational science,” said Jack Dongarra, a distinguished researcher at Tennessee’s Oak Ridge National Laboratory, home to the world’s fastest supercomputer. Machine learning models running on high-performance supercomputers give scientists an “approximation” of results they can later prove out with mathematical and statistical programming techniques in areas including weather forecasting and automobile crash testing. “The machine learning gives us that starting point,” said Dongarra, who leads curation of the Top 500 supercomputer list.

Amazon’s Most Ambitious Research Project Is a Convenience Store

Bloomberg Businessweek, Brad Stone and Matt Day


Jeff Bezos and his company have spent seven years and hundreds of millions of dollars getting rid of cashiers. Will it pay off?

Magic? Handheld translator helps you communicate using artificial intelligence

Dallas News, Jim Rossman


Every once in a while, I get to review a gadget that amazes me.

When you read the description, you think, “wow, this can’t possibly work,” and when it does work, you swear it must be magic.

I’ve been testing a device from Chinese electronics manufacturer Cheetah Mobile called the CM Translator, which uses artificial intelligence and your smartphone’s connection to the internet to translate English to Chinese, Japanese, Korean, Thai and Spanish.

How artificial intelligence can tackle climate change

National Geographic, Jackie Snow


Some of the biggest names in AI and machine learning—a discipline within the field—recently published a paper called “Tackling Climate Change with Machine Learning.” The paper, which was discussed at a workshop during a major AI conference in June, was a “call to arms” to bring researchers together, said David Rolnick, a University of Pennsylvania postdoctoral student and one of the authors.

“It’s surprising how many problems machine learning can meaningfully contribute to,” says Rolnick, who also helped organize the June workshop.

The paper offers up 13 areas where machine learning can be deployed, including energy production, CO2 removal, education, solar geoengineering, and finance. Within these fields, the possibilities include more energy-efficient buildings, creating new low-carbon materials, better monitoring of deforestation, and greener transportation. However, despite the potential, Rolnick points out that this is early days and AI can’t solve everything.

RPI Creates Artificial Intelligence Lab to Immerse Students in Foreign Language

The Daily Beast, Shira Feder


A student stands before a shopkeeper on a Chinese market street. The shopkeeper announces that he has many things to sell, from cake to postcards. The student, who is still learning how to master Mandarin, ponders the selection. Like many people learning a new language, he must consider grammar, pronunciation, and tenses before answering. When it comes to Mandarin—which the Foreign Service Institute of the U.S. Department of State considers one of the hardest languages to learn, requiring a minimum of 2,200 hours of study—that’s especially difficult.

This scene is not actually playing out in China, but at an American university where a classroom has been equipped with a learning game, enhanced by artificial intelligence, that promises immersion into a foreign culture with zero travel required.

The project is the result of a collaboration between IBM Research and Rensselaer Polytechnic Institute, a private university whose graduates include the inventors of the digital camera, ductile iron and the first commercial television. The six-week, four-credit class, called AI-Assisted Immersive Chinese, aims to create the kind of interactions a student could normally only get in a study-abroad program.

Africa’s science academy leads push for ethical data use

Nature, News, Linda Nordling


The goal is to create the continent’s first cross-disciplinary guidelines for collecting, storing and sharing data and specimens.

Computer password inventor dies aged 93

BBC News


Computer pioneer Fernando Corbato, who first used passwords to protect user accounts, has died aged 93.

How the American Work Day Changed in 15 Years

FlowingData, Nathan Yau


The American Time Use Survey, an ongoing survey run by the Bureau of Labor Statistics and the Census Bureau, interviews thousands of people every year for a sense of how we spend our hours. It’s been running since 2003, and the data for 2018 recently went up. I had to ask: How has time use changed during the past 15 years?

How Big Data Provides A Pivotal Foundation For VPN Data Security

Smart Data Collective, Ryan Kh


The amount of information that is out in the world is unfathomable. Every website that we visit or transaction that we complete online has a risk of being collected and stored, waiting to be analyzed and used. The information may mostly be harmless, but naturally, most of us are not going to be keen on having our information taken and stored for future use.

One of the best ways that we can protect our information and keep our digital footprint private is through the use of VPN routers for our home networks. Over 25% of all Internet users have used a VPN in the past month. That figure is rising every year. Here’s what you should know.


7th Workshop on the Challenges in the Management of Large Corpora

Corpus Linguistics conference


Cardiff, Wales; July 22, starting at 9 a.m. at Cardiff University, preceding the Corpus Linguistics 2019 conference.


AWS DeepRacer Scholarship Challenge

“AWS and Udacity are teaming up to teach machine learning and prepare students to test their skills by participating in the world’s first autonomous racing league, the AWS DeepRacer League. Students with the top lap times will earn full scholarships to the Machine Learning Engineer Nanodegree program.” The program begins August 1 and will run through October 31.


Tools & Resources

Best practices for analyzing large-scale health data from wearables and smartphone apps

npj Digital Medicine; Jennifer L. Hicks, Tim Althoff, Rok Sosic, Peter Kuhar, Bojan Bostjancic, Abby C. King, Jure Leskovec and Scott L. Delp


Smartphone apps and wearable devices for tracking physical activity and other health behaviors have become popular in recent years and provide a largely untapped source of data about health behaviors in the free-living environment. The data are large in scale, collected at low cost in the “wild”, and often recorded in an automatic fashion, providing a powerful complement to traditional surveillance studies and controlled trials. These data are helping to reveal, for example, new insights about environmental and social influences on physical activity. The observational nature of the datasets and collection via commercial devices and apps pose challenges, however, including the potential for measurement, population, and/or selection bias, as well as missing data. In this article, we review insights gleaned from these datasets and propose best practices for addressing the limitations of large-scale data from apps and wearables. Our goal is to enable researchers to effectively harness the data from smartphone apps and wearable devices to better understand what drives physical activity and other health behaviors.

20 Open Datasets for Natural Language Processing

Open Data Science, Elizabeth Wallace


Natural language processing is a significant part of machine learning use cases, but it requires a lot of data and some deftly handled training. In 25 Excellent Machine Learning Open Data Sets, we listed Amazon Reviews and Wikipedia Links for general NLP and the Stanford Sentiment Treebank and Twitter US Airlines Reviews specifically for sentiment analysis, but here are 20 more great datasets for NLP use cases.

A Data-Driven Guide to Hiring Data Scientists

HackerRank Blog, Dana Frederick


So if you’re hiring data scientists, what should you know before you start your search? We gathered insights from our 2019 Developer Skills Report—based on feedback from over 70,000 developers and technical professionals—to better understand. In this guide, we’ll unpack data from our research to help you better find, attract, and evaluate data scientist candidates.


Full-time positions outside academia

Special Projects Lead

American Civil Liberties Union; New York, NY


Internships and other temporary positions

Disaster Risk Prevention, Management, and Response Using Big Data

World Bank; Washington, DC

Postdoctoral Research Associate in Data Science Studies

University of Washington, eScience Institute; Seattle, WA
