Data Science newsletter – February 21, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for February 21, 2018


 
 
Data Science News



A University of, by and for the People

The New York Times, Sarah Vowell



Every 10 years since 1948, Montanans vote on the 6-mill levy, a property tax to keep tuition low at the state universities, including M.S.U. It has always passed, but support keeps dwindling, to about 56 percent 10 years ago from around 64 percent in 1988. According to a Pew survey last year, 58 percent of conservatives believe universities have a negative impact on America, a crucial consideration in majority-Republican Montana.

This November, for the first time in seven decades, the 6-mill levy might not pass. In the meantime, while my neighbors in the party of Lincoln mull whether or not to undermine his noblest legacy in the state of Montana, I plan on stopping by the M.S.U. campus on its 125th anniversary, to witness that new Lincoln statue being unveiled.


This Computer Uses Light—Not Electricity—To Train AI Algorithms

WIRED, Business, Tom Simonite



William Andregg ushers me into the cluttered workshop of his startup Fathom Computing and gently lifts the lid from a bulky black box. Inside, green light glows faintly from a collection of lenses, brackets, and cables that resemble an exploded telescope. It’s a prototype computer that processes data using light, not electricity, and it’s learning to recognize handwritten digits. In other experiments the device learned to generate sentences in text.

Right now, this embryonic optical computer is good, not great: on its best run it read 90 percent of scrawled numbers correctly. But Andregg, who cofounded Fathom late in 2014 with his brother Michael, sees it as a breakthrough. “We opened the champagne when it was only at about 30 percent,” he says with a laugh.


U of T study uses algorithms to predict how word meanings change over time

University of Toronto, UofT News



Using the Historical Thesaurus of English, a dataset whose records date back nearly 1,000 years to the period of Old English, researchers at the University of California, Berkeley, Lehigh University and the University of Toronto have developed an algorithm to demonstrate how words evolve.

“If you look at the history of a word, meanings of that word tend to shift or extend over time,” says Yang Xu, an assistant professor in the department of computer science and University College’s cognitive science program. “The question is: Why is this happening? How is it happening? And whether there are computational algorithms we can leverage to make predictions about the historical development of word meanings.”


Global AI Talent Pool Report

jfgagne



We are submitting our work amidst similar, though much broader, reports such as Tencent’s recent “2017 Global AI Talent White Paper,” which focused primarily on China in comparison to the United States. Tencent’s research found that currently “200,000 of the 300,000 active researchers and practitioners” are already employed in the industry, while some 100,000 are researching or studying in academia. Their number far exceeds the high end of our measure, 22,000, primarily because it includes entire technical teams and not just specially trained experts. Our report, however, focuses on finding out where the relatively small number of “AI experts” currently reside around the world.

We drew on two popular data sources for this line of inquiry. First, we used the results from several LinkedIn searches, which showed us the total number of profiles according to our own specialized parameters. Second, for an even more advanced subset, we captured the names of leading AI conference presenters who we consider to be influential experts in their respective fields of AI. Finally, we relied on other reports and anecdotes from the global community to put our numbers in greater context and see how the picture may develop in the near future.


Facebook’s next project: American inequality

POLITICO, Nancy Scola



A Stanford economist is using the company’s vast store of personal data to study why so many in the U.S. are stuck in place economically.


Extra Extra

I’m excited to read OpenAI’s first deep dive into the potential for malicious use of AI. They wisely draw on lessons from the cybersecurity industry, employing an ethical imagination that asks what could go wrong in order to foresee and forestall bad outcomes. I’ll be in the Slack channel with my Ethics class students on this. Ping me if you want to join us.



Hanna Wallach of Microsoft has an extremely thoughtful, relevant piece about how to start as a computer science/machine learning expert and become a computational social scientist. It is not the same as being a machine learning expert and adding social data!

Check out this 92-page cybersecurity style guide. Strunk and White for hackers.

MIT Sloan’s annual Sports Analytics Conference recently took place in Boston, yielding an outstanding writeup by basketball analyst Ben Falk and another excellent conference report from tennis analyst Stephanie Kovalchik.


Adding Program Evaluation to the Data Science Curriculum

Data Science Central, Howard Friedman



The list of desired skills for a Data Scientist is already quite long. Knowing that we can’t add an infinite number of required skills to the Data Scientist Toolbox, what do you think about a basic course in Program Evaluation? Would some training in Program Evaluation be helpful to round out a Data Scientist’s training?


How The Pudding discovered swing states will suffer most from job automation

Storybench, N'dea Yancey-Bragg



Will robots take your job one day? Probably. But how should that affect your choices in the voting booth? Ilia Blinderman, a data journalist at The Pudding, recently published an interactive explainer that tries to answer that very question. Blinderman combined census data with academic research to determine not only where automation will hit the hardest, but what it might mean for this country’s political future.

Storybench spoke with Blinderman about how he answered the question “Why the Republican Party wins if robots take your job?” and how simple tools can create great data visualizations.


Why AI on a chip is the start of the next IT explosion

Gigaom, Jon Collins



It’s game on in the AI-on-a-chip race. Alongside Nvidia’s successes turning Graphics Processing Units into massively performant compute devices (culminating in last year’s release of the ‘Volta’ V100 GPU), we have ARM releasing its ‘Project Trillium’ machine learning processor on Valentine’s Day and Intel making noises around bringing the fruits of its Nervana acquisition to market, currently at sample stage. Microsoft with Catapult, Google with its TPU — if you haven’t got some silicon AI going on at the moment, you are missing out. So, what’s going on?

We have certainly come a long way since John von Neumann first introduced the register-based computer processing architecture. That simple design (here’s a nice picture from Princeton), which fires data and instructions into an Arithmetic Logic Unit (ALU), still sits at the heart of what we call CPU cores today, though with their pre-fetching and other layers of smartness, today’s cores are a souped-up, drag-racing version of von Neumann’s original design.
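
To make the fetch-decode-execute idea concrete, here is a toy Python sketch of a von Neumann-style machine. The three-instruction “ISA” is invented purely for illustration; the key point is that instructions and data share one memory, as in the original design.

```python
# Toy von Neumann machine: instructions and data live in one memory,
# and a fetch-decode-execute loop feeds operands to a single "ALU".
# The three-opcode ISA here is invented purely for illustration.

memory = {
    0: ("LOAD", 100),   # acc <- mem[100]
    1: ("ADD", 101),    # acc <- acc + mem[101]
    2: ("STORE", 102),  # mem[102] <- acc
    3: ("HALT", None),
    100: 2, 101: 40, 102: 0,  # data shares the same memory as code
}

pc, acc = 0, 0          # program counter and accumulator registers
while True:
    op, addr = memory[pc]     # fetch + decode
    pc += 1
    if op == "LOAD":
        acc = memory[addr]
    elif op == "ADD":
        acc = acc + memory[addr]  # the "ALU" step
    elif op == "STORE":
        memory[addr] = acc
    elif op == "HALT":
        break

print(memory[102])  # -> 42
```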


Apple’s Move to Bring Health Care Records to the iPhone Is Great News

WIRED, Science, Aneesh Chopra and Shafiq Rab



In late January, Apple previewed an iOS feature that would allow consumers to access their electronic health records on their phones. Skeptics said the move was a decade too late given a similar (and failed) effort from Google. Optimists argued that Apple was capable of translating health data into something meaningful for consumers.

But the announcement portends great things for consumers and the app developers seeking to serve them, from our perspectives as the former US chief technology officer under President Obama, and as an early adopter of the Apple service as Rush University Medical Center’s chief information officer. That’s because Apple has committed to an open API for health care records—specifically, the Argonaut Project specification of the HL7 Fast Health Interoperability Resources—so your doctor or hospital can participate with little extra effort.
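
For developers, “open API” here means ordinary RESTful calls against a FHIR server. As a rough illustration only, here is a minimal Python sketch of fetching a patient record; the base URL, patient ID, and token are placeholders, and a real Argonaut/SMART-on-FHIR deployment would first require an OAuth2 authorization flow.

```python
import requests

# Hypothetical FHIR endpoint and patient ID -- placeholders, not a real server.
BASE_URL = "https://example-hospital.org/fhir"
PATIENT_ID = "12345"

# Argonaut/SMART-on-FHIR servers expect OAuth2; token acquisition omitted here.
headers = {
    "Accept": "application/fhir+json",
    "Authorization": "Bearer <access-token>",
}

resp = requests.get(f"{BASE_URL}/Patient/{PATIENT_ID}", headers=headers)
resp.raise_for_status()
patient = resp.json()  # a FHIR Patient resource as JSON
print(patient.get("name"))
```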


How Unreliable Data Leads to the Undercounting of Cybercrime

Pacific Standard, Josephine Wolff



For years, people have viewed the most prominently cited cybercrime statistics with suspicion. For instance, the oft-repeated estimate that cybercrime costs $1 trillion globally per year led to a 2012 ProPublica investigation that found that number was based on some truly questionable arithmetic.

The data about cybercrime, and cybersecurity breaches more generally, is simply very sketchy. Some types of cybercrime, like ransomware, appear to be on the rise, while the costs of data breaches may be dropping, at least according to some estimates. But we often don’t know how frequently these incidents occur, or how much they cost. The challenge of measuring these statistics has only grown as the “cybercrime” label has expanded to encompass pretty much any type of criminal activity involving computers, ranging from online extortion to revenge porn to denial-of-service attacks.

Part of the problem is that we’re almost always relying on companies’ self-reported estimates of how many online intrusions they’ve experienced and how much each one has cost them when we try to answer questions about the magnitude and damage of digital crimes. Depending on who is filling out these surveys at a company, respondents might try to lowball these estimates to seem more secure than they really are, or inflate them to drive greater investment in information security, or even more likely, not really know the answers in the first place.


Ocean-wide sensor array provides new look at global ocean current

Nature, News, Jeff Tollefson



The North Atlantic Ocean is a major driver of the global currents that regulate Earth’s climate, mix the oceans and sequester carbon from the atmosphere — but researchers haven’t been able to get a good look at its inner workings until now. The first results from an array of sensors strung across this region reveal that things are much more complicated than scientists previously believed.

Researchers with the Overturning in the Subpolar North Atlantic Program (OSNAP) presented their findings this week at an ocean science meeting in Portland, Oregon. With nearly two years of data from late 2014 to 2016, the team found that the strength of the Atlantic Meridional Overturning Circulation — which pumps warm surface water north and returns colder water at depth — varies with the winds and the seasons, transporting an average of roughly 15.3 million cubic metres of water per second.
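
For scale, oceanographers usually quote such transports in Sverdrups; the conversion is simple arithmetic:

```latex
1~\mathrm{Sv} \equiv 10^{6}~\mathrm{m^{3}\,s^{-1}}, \qquad
15.3 \times 10^{6}~\mathrm{m^{3}\,s^{-1}} = 15.3~\mathrm{Sv}.
```

So the OSNAP mean overturning comes out to about 15.3 Sv.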

The measurements are similar in magnitude to those from another array called RAPID, which has been operating between Florida and the Canary Islands since 2004. But scientists say they were surprised by how much the currents measured by the OSNAP array varied over the course of two years.


Developer Attempts to Transcribe a Podcast with Microsoft’s Speech API. Hilarity Ensues.

ProgrammableWeb, David Berlind



Over the last few years, the wave of machine learning and artificial intelligence APIs has been cresting as more and more businesses see the potential for differentiation and more and more API providers look to service that need. IBM clearly recognized the potential of these APIs when it acquired AlchemyAPI back in 2015. Alchemy specialized in machine-learning driven APIs like sentiment analysis and image/language processing. Those APIs are now a part of IBM’s Watson portfolio.

Now, a few years later, everyone is getting into the game. Why? For a fraction of the cost, these APIs — what we at ProgrammableWeb sometimes call the PhD APIs — can do the same things that, just a few years ago, required a team of highly paid PhDs to accomplish.

But as impressive as all this dirt cheap rocket science is, we can’t help but laugh hysterically when it falls down. And that’s exactly what happened when one developer decided to publish a tutorial on how to use Microsoft’s Bing Speech Recognition API to transcribe a podcast using the C# language.

 
Deadlines



Zurich Summer School for Women in Political Methodology

July 1-8. “The Zurich Summer School for Women in Political Methodology is a competitive fellowship program for early-career researchers who identify as women. Hosted by the Political Science Department at the University of Zurich.” Deadline to apply is March 16.

Uber AI Residency

The intensive one-year research training program is slated to begin this summer. Deadline to apply is March 18.
 
Tools & Resources



Researchers hope to use MRI for stroke treatment, recovery

University of Southern California, HSC News



Stroke is the leading cause of disability in adults, affecting more than 15 million people worldwide each year, according to the World Health Organization. Researchers are increasingly using brain imaging to assess the damage caused by stroke and to predict how patients will recover. A USC-led team has now compiled, archived and shared one of the largest open-source datasets of brain scans from stroke patients via a study published Feb. 20 in Scientific Data, a Nature journal.

The dataset, known as Anatomical Tracings of Lesion After Stroke (ATLAS), is now available for download; researchers around the world already are using the scans to develop and test algorithms that can automatically process MRI images from stroke patients. Ultimately, scientists hope to identify biological markers that forecast which patients will respond to various rehabilitation therapies and personalize treatment plans accordingly.
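
For researchers who want to poke at scans like these programmatically, here is a hypothetical sketch in Python, assuming the images are distributed as NIfTI files (the common interchange format for shared MRI data) and using the nibabel library; the file name is a placeholder, not an actual ATLAS path.

```python
import nibabel as nib

# Placeholder path -- substitute a real scan after downloading the dataset.
img = nib.load("sub-001_T1w.nii.gz")

data = img.get_fdata()   # 3D voxel array of intensities
print(data.shape)        # dimensions of the volume
print(img.affine)        # voxel-to-world coordinate transform

# Simple intensity normalization, a common preprocessing step before
# feeding scans to lesion-segmentation algorithms.
normalized = (data - data.mean()) / data.std()
```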


Visualizing Deep Learning Models at Facebook

Georgia Institute of Technology, Machine Learning @ Georgia Tech



While powerful deep learning models have significantly improved prediction accuracy, understanding these models remains a big challenge. Deep learning models are more difficult to interpret than most existing machine learning models because of their nonlinear structures and huge numbers of parameters. Thus, in practice, people often use them as “black boxes,” which can be detrimental: when the models do not perform satisfactorily, users do not understand the causes or know how to fix them.

Visualization has recently become a popular means for interpreting such complex deep learning models. Data visualization and visual analytics help people make sense of data and discover insights by transforming abstract data into meaningful visual representations and making them interactive. Deep learning models can be visualized by presenting intermediate data produced by the models (e.g., activations, weights) or by revealing relationships between datasets and model results. With such visualization, users can better understand why and how the models work to produce results for their datasets. Several visualization tools have been developed and are available, including TensorBoard and the Embedding Projector from Google’s Big Picture Group, the Deep Visualization Toolbox, and others.
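
As a concrete taste of the “presenting intermediate data” approach, the standalone Embedding Projector at projector.tensorflow.org accepts plain tab-separated files of vectors and labels. Here is a minimal Python sketch of exporting a layer’s activations in that format; the activation matrix and labels below are random stand-ins for real model outputs.

```python
import numpy as np

# Stand-ins: in practice these would be activations captured from a real
# model (e.g. the output of a hidden layer) and the corresponding labels.
activations = np.random.randn(1000, 64)
labels = [f"class_{i % 10}" for i in range(1000)]

# The Embedding Projector accepts one vector per line, tab-separated...
np.savetxt("vectors.tsv", activations, delimiter="\t")

# ...plus an optional metadata file with one label per line.
with open("metadata.tsv", "w") as f:
    f.write("\n".join(labels))

# Upload both files at projector.tensorflow.org to explore the embedding
# interactively with PCA or t-SNE.
```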


An Outsider’s Tour of Reinforcement Learning – arg min blog

Ben Recht



  • Make It Happen. Reinforcement Learning as prescriptive analytics.
  • Total Control. Reinforcement Learning as Optimal Control.
  • The Linearization Principle. If a machine learning algorithm does crazy things when restricted to linear models, it’s going to do crazy things on complex nonlinear models too.
  • The Linear Quadratic Regulator. A quick intro to LQR and why it is a great baseline for benchmarking Reinforcement Learning (see the sketch after this list).
  • A Game of Chance to You to Him Is One of Real Skill. Laying out the rules of the RL Game and comparing them to Iterative Learning Control.
  • The Policy of Truth. Policy Gradient is a Gradient Free Optimization Method.
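
For readers who haven’t met it, the Linear Quadratic Regulator has a clean closed-form answer: for linear dynamics x_{t+1} = A x_t + B u_t with quadratic state and control costs, the optimal policy is a fixed linear feedback u_t = -K x_t obtained from a Riccati equation. A minimal Python sketch with SciPy, on a made-up double-integrator system (the matrices are invented for illustration):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Made-up example system: a discrete-time double integrator.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)   # state cost
R = np.eye(1)   # control cost

# Solve the discrete algebraic Riccati equation for P, then form the
# optimal feedback gain K; the LQR policy is u_t = -K @ x_t.
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Simulate the closed loop from an arbitrary initial state.
x = np.array([[5.0], [0.0]])
for _ in range(20):
    u = -K @ x
    x = A @ x + B @ u
print(x.ravel())  # state driven toward the origin
```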

JupyterLab is Ready for Users

Jupyter Blog


“We are proud to announce the beta release series of JupyterLab, the next-generation web-based interface for Project Jupyter.”


Create your own AI-powered voice assistant with Matrix Voice

The Next Web, Tristan Greene


The Matrix Voice development board is a Raspberry Pi add-on you can use to build your own voice assistant. I got my hands on a review unit to see if someone who is “code illiterate” could make something out of it.

Some people have an affinity for programming and development. I’m not one of them. I’ve written about DIY Alexa projects before, but I’ve never had the opportunity to make my own. At first glance it seemed incredibly complex. Yet somehow I managed to make it through the project, thanks in no small part to the excellent tutorials and documentation available at the Hackster.io website.


geogrid: Turn Geospatial Polygons into Regular or Hexagonal Grids

CRAN; Joseph Bailey, Ryan Hafen and Lars Simon Zehnder


Turn irregular polygons (such as geographical regions) into regular or hexagonal grids. This package enables the generation of regular (square) and hexagonal grids through the package ‘sp’ and then assigns the content of the existing polygons to the new grid using the Hungarian algorithm, Kuhn (1955). This prevents the need for manual generation of hexagonal grids or regular grids that are supposed to reflect existing geography.
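
The assignment step is the interesting part: each original region must be matched one-to-one with a grid cell so that total displacement stays small. geogrid is an R package, but the same idea, sketched in Python with made-up points and SciPy’s Hungarian-style solver, looks like this:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian-style solver

rng = np.random.default_rng(0)

# Made-up data: centroids of 25 irregular regions, and a 5x5 regular grid.
centroids = rng.uniform(0, 5, size=(25, 2))
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)

# Cost matrix: squared distance from each centroid to each grid cell.
cost = ((centroids[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)

# One-to-one assignment minimizing total displacement.
rows, cols = linear_sum_assignment(cost)
for r, c in zip(rows[:5], cols[:5]):
    print(f"region {r} -> grid cell {grid[c].tolist()}")
```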

     
Careers


Full-time positions outside academia

Data Scientist

Fusion Media Group; New York, NY
