Data Science newsletter – August 21, 2019

Newsletter features journalism, research papers, events, tools/software, and jobs for August 21, 2019

GROUP CURATION: N/A

 
 
Data Science News



Training the next generation of ethical techies

Medium, Ethan Zuckerman



My friend Christian Sandvig, who directs the Center for Ethics, Society, and Computing at the University of Michigan, started an interesting thread on Twitter yesterday. It began:

“I’m super suspicious of the ‘rush to postdocs’ in academic #AI ethics/fairness. Where the heck are all of these people with real technical chops who are also deeply knowledgeable about ethics/fairness going to come from… since we don’t train people that way in the first place.”

Christian goes on to point out that it’s exceedingly rare for someone with PhD-level experience in machine learning to have a strong background in critical theory, intersectionality, gender studies and ethics. We’re likely to see a string of CS PhDs lost in humanities departments and well-meaning humanities scholars writing about tech issues they don’t fully understand.

I’m lucky to have students doing cutting-edge work on machine learning and ethics in my lab. But I’m also aware of just how unique individuals like Joy Buolamwini and Chelsea Barabas are. And realizing I mostly agree with Christian, I also think it’s worth asking how we start training people who can think rigorously and creatively about technology and ethics.


The Future is Here: College of Computing to Welcome First Students

Michigan Technological University, News



As the only college of its kind in the state, Michigan Tech’s College of Computing meets the growing demand for computing skills in the workforce.

On July 1, 2019, Michigan Technological University launched the state’s first College of Computing to meet the technological, economic and social needs of the 21st century, and answer industry demand for talent in artificial intelligence (AI), software engineering, data science and cybersecurity.


With open data, scientists share their work

symmetry magazine, Meredith Fore



“Open data often can be used to answer other kinds of questions that the people who collected the data either weren’t interested in asking, or they just never thought to ask,” says Kyle Cranmer, a professor at New York University. By making scientific data available, “you’re enabling a lot of new science by the community to go forward in a more efficient and powerful way.”

Cranmer is a member of ATLAS, one of the two general-purpose experiments that, among other things, co-discovered the Higgs boson at the Large Hadron Collider at CERN. He and other CERN researchers recently published a letter in Nature Physics titled “Open is not enough,” which shares lessons learned about providing open data in high-energy physics. The CERN Open Data Portal, which facilitates public access to datasets from CERN experiments, now contains more than two petabytes of information.


Hot Chips | A California Startup Has Built an AI Chip as Big as a Notebook. Why?

Synced



California artificial intelligence startup Cerebras Systems yesterday introduced its Cerebras Wafer Scale Engine (WSE), the world’s largest-ever chip built for neural network processing. Cerebras Co-Founder and Chief Hardware Architect Sean Lie introduced the gigantic chip at one of the semiconductor industry’s leading conferences, Hot Chips 31: A Symposium on High Performance Chips, hosted at Stanford University.

The 16nm WSE is a 46,225 mm2 silicon chip — slightly larger than a 9.7-inch iPad — featuring 1.2 trillion transistors, 400,000 AI-optimized cores, 18 gigabytes of on-chip memory, 9 petabyte/s memory bandwidth, and 100 petabyte/s fabric bandwidth. It is 56.7 times larger than the largest Nvidia graphics processing unit, which accommodates 21.1 billion transistors on an 815 mm2 silicon base.
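Those headline numbers are easy to sanity-check with a quick back-of-envelope calculation, using only the figures quoted above:

```python
# Sanity check of the chip figures quoted in the article.
wse_area_mm2 = 46_225          # Cerebras WSE die area
gpu_area_mm2 = 815             # largest Nvidia GPU die area cited
ratio = wse_area_mm2 / gpu_area_mm2
print(round(ratio, 1))         # 56.7, matching the stated "56.7 times larger"

# Transistor density works out roughly comparable on the two dies:
wse_density = 1.2e12 / wse_area_mm2   # ~26 million transistors per mm2
gpu_density = 21.1e9 / gpu_area_mm2   # ~26 million transistors per mm2
```

The similar densities suggest the WSE's advantage comes from sheer area rather than a finer process than the GPU it is compared against.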


Flawed Algorithms Are Grading Millions of Students’ Essays

VICE, Motherboard, Todd Feathers



Fooled by gibberish and highly susceptible to human bias, automated essay-scoring systems are increasingly being adopted, a Motherboard investigation has found.


Sometimes You Don’t Need Deep Learning

Fortune, Eye on A.I., Jonathan Vanian



Many executives expect “amazing results to the business,” [Ibrahim Gokcen] explained. The reality, however, is that the savings are often too modest to justify the high cost. For big savings, companies must have all of their industrial equipment “digitized” so that the machinery can be analyzed together, Gokcen said; the potential for cost savings then multiplies.

Schneider Electric is exploring how deep learning could help it achieve predictive maintenance nirvana. But the technology isn’t ready.

Still, the excitement over deep learning is infectious, and Schneider Electric will continue testing the technology.


Waze Hijacked LA Neighborhoods. Can Traffic Apps Be Stopped?

Los Angeles Magazine, Jonathan Littman



Traffic apps turned the city’s neighborhoods into “shortcuts.” Now furious residents are attempting to take them back, street by street.


Coming soon to Netflix: Movie trailers crafted by AI

CBS News, Sarah Min



The next movie trailer you watch on Netflix could be made by artificial intelligence.

Gregory Peters, chief product officer at Netflix, said in a July earnings call that the Los Gatos, California, company is investing in technology that can index characters and scenes in a movie “so that our trailer creators can really focus their time and energy on the creative process.”


Privacy and Security are Converging in the Data Center

Data Center Knowledge, Maria Korolov



Europe’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) are just the most visible tips of a giant data privacy iceberg bearing down on data center users.

“There are roughly 120 countries now that are signed up to a data privacy regulation,” said Darren Mann, VP of global operations at Airbiquity, an automotive telematics company that operates data centers in Europe and pays a lot of attention to these laws.

Airbiquity doesn’t handle sensitive personal data such as voting preferences or medical histories. “We deal with things like vehicle registration numbers,” Mann said. “But the GDPR ruled that device identifiers can be classified as personal data. And, obviously, if we’re dealing with location-based services, we would be collecting location-based data.”


On me, and the Media Lab

Medium, Ethan Zuckerman



A week ago last Friday, I spoke to Joi Ito about the release of documents that implicate Media Lab co-founder Marvin Minsky in Jeffrey Epstein’s horrific crimes. Joi told me that evening that the Media Lab’s ties to Epstein went much deeper, and included a business relationship between Joi and Epstein, investments in companies Joi’s VC fund was supporting, gifts and visits by Epstein to the Media Lab and by Joi to Epstein’s properties. As the scale of Joi’s involvement with Epstein became clear to me, I began to understand that I had to end my relationship with the MIT Media Lab. The following day, Saturday the 10th, I told Joi that I planned to move my work out of the MIT Media Lab by the end of this academic year, May 2020.


Teaching Schoolchildren the Fundamentals of Data Science

Columbia University, Data Science Institute



Sela Rozov is only in fifth grade, but she ably presented a data-science project she worked on for a science fair to a group of teachers, students and data scientists who gathered at Columbia University for a workshop called Data Science in the Classroom. For her project, Sela collected and evaluated Facebook data on two advertisements, trying to understand which ad lured more viewers and why.

“It was really fun to work with data and I really liked making the visualizations for my project,” said Sela, who attends the F.E. Bellows Elementary School in Mamaroneck, N.Y.

The workshop, held Aug. 13 in the Smith Learning Theater at Teachers College, explored how teachers can instruct children like Sela in the fundamentals of data science.
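The core of a two-ad comparison like the project described above reduces to comparing click-through rates. A toy sketch (the numbers are invented for illustration; the article doesn't report the actual figures):

```python
# Comparing two ads by click-through rate (CTR), with made-up data.
def click_through_rate(clicks, impressions):
    """Fraction of people shown the ad who clicked on it."""
    return clicks / impressions

ad_a = click_through_rate(clicks=45, impressions=1000)   # 0.045
ad_b = click_through_rate(clicks=30, impressions=1200)   # 0.025
winner = "Ad A" if ad_a > ad_b else "Ad B"
print(winner, round(ad_a, 3), round(ad_b, 3))  # Ad A 0.045 0.025
```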


Hundreds of extreme self-citing scientists revealed in new database

Nature, News Feature, Richard Van Noorden & Dalmeet Singh Chawla



The world’s most-cited researchers, according to newly released data, are a curiously eclectic bunch. Nobel laureates and eminent polymaths rub shoulders with less familiar names, such as Sundarapandian Vaidyanathan from Chennai in India. What leaps out about Vaidyanathan and hundreds of other researchers is that many of the citations to their work come from their own papers, or from those of their co-authors.

Vaidyanathan, a computer scientist at the Vel Tech R&D Institute of Technology, a privately run institute, is an extreme example: he has received 94% of his citations from himself or his co-authors up to 2017, according to a study in PLoS Biology this month. He is not alone. The data set, which lists around 100,000 researchers, shows that at least 250 scientists have amassed more than 50% of their citations from themselves or their co-authors, while the median self-citation rate is 12.7%.
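The self-citation rate behind these figures is straightforward to compute from citation records. A simplified sketch with invented toy data (the PLoS Biology study works from Scopus records and counts citations coming from an author or any of their co-authors):

```python
# Simplified self-citation rate: a citation counts as a self-citation
# when the citing paper shares at least one author with the cited work.
def self_citation_rate(paper_authors, citing_papers):
    """paper_authors: authors of the cited work;
    citing_papers: list of author lists, one per citing paper."""
    authors = set(paper_authors)
    self_cites = sum(1 for citers in citing_papers if authors & set(citers))
    return self_cites / len(citing_papers)

# Toy example with invented names, not real citation data:
rate = self_citation_rate(
    ["A. Author", "B. Coauthor"],
    [["A. Author"], ["C. Other"], ["B. Coauthor", "D. Other"], ["E. Other"]],
)
print(rate)  # 0.5 -- well above the 12.7% median reported in the study
```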


The Planet Needs a New Internet

Gizmodo, Earther, Maddie Stone



Huge changes will be needed because right now, the internet is unsustainable. On the one hand, rising sea levels threaten to swamp the cables and stations that transmit the web to our homes; rising temperatures could make it more costly to run the data centers handling ever-increasing web traffic; wildfires could burn it all down. On the other, all of those data centers, computers, smartphones, and other internet-connected devices take a prodigious amount of energy to build and to run, thus contributing to global warming and hastening our collective demise.

To save the internet and ourselves, we’ll need to harden and relocate the infrastructure we’ve built, find cleaner ways to power the web, and reimagine how we interact with the digital world. Ultimately, we need to recognize that our tremendous consumption of online content isn’t free of consequences—if we’re not paying, the planet is.


A writer shared her story about getting frightening genetic results online. The response was surprising

STAT; Rebecca Robbins, Damian Garde and Adam Feuerstein



Over the past decade, more than 25 million people have ordered at-home DNA testing kits from companies like 23andMe and Ancestry.com. You spit in a tube, send it away, and get notified by email when your results are ready. Initially aimed at providing information about ancestry, some companies now test for certain genetic mutations that are strongly correlated with the risk of developing cancer, Alzheimer’s disease, or other serious conditions.

Dorothy Pomerantz is one of those 25 million people. Last year, almost on a lark, she bought a 23andMe test and sent her spit to Silicon Valley. But what she learned went far beyond the lighthearted stuff advertised on TV, like connecting with long-lost family members. She wrote about that experience for STAT’s First Opinion, and spoke with STAT about the reaction to it.


Stanford researchers find smart faucets could aid in water conservation

Stanford University, Stanford News



An experiment with a water-saving “smart” faucet shows potential for reducing water use. The catch? Unbeknownst to study participants, the faucet’s smarts came from its human controller.

 
Events



Into the Dataverse! hackathon

National Security Innovation Network



Ann Arbor, MI September 20-22 at University of Michigan. “Develop an AI-enabling user interface that can intuitively capture both structured and non-structured maintenance data, and associated maintainer actions, in an efficient and user-friendly manner to produce more accurate maintenance logs.” [free, registration required]


Privacy + Security Forum | Conference

Daniel Solove and Paul Schwartz



Washington, DC October 14-16. [$$$$]

 
Tools & Resources



At long last (and very sorry for the delay): the public repo for Ethics and Data Science by @mikeloukides @dpatil and @hmason

Twitter, Mike Loukides



https://t.co/7YaQfmOrmu


Configure Your R Project for binderhub • hole punch

Karthik Ram



holepunch will read the contents of your R project on GitHub, create a DESCRIPTION file with all dependencies, write a Dockerfile, add a badge to your README, and build a Docker image. Once these 4 steps are complete, any reader can click the badge and within minutes, be dropped into a free, live, RStudio server. Here they can run your scripts and notebooks and see how everything works.


A Review Of Google’s Colab And CoCalc for Collaborative Data Science

Forbes, Gregory Ferenstein



As part of my ongoing series on learning data science and reviewing the latest tools, I ended up needing to work on data analysis with people in different countries. While big companies have their own internal tools for sharing code among teams, there were fewer options available for students and freelancers. Fortunately, two such tools, Google Colab and CoCalc, are emerging to help data scientists collaborate online.


Front-end Developer Handbook 2019 – Learn the entire JavaScript, CSS and HTML development practice!

Cody Lindley



This is a guide that everyone can use to learn about the practice of front-end development. It broadly outlines and discusses the practice of front-end engineering: how to learn it and what tools are used when practicing it in 2019.

It is specifically written with the intention of being a professional resource for potential and currently practicing front-end developers to equip themselves with learning materials and development tools. Secondarily, it can be used by managers, CTOs, instructors, and head hunters to gain insights into the practice of front-end development.


Deploying BERT in production

Towards Data Science, Omer Spillinger



In this guide, I’ll use BERT to train a sentiment analysis classifier and Cortex to deploy it as a web API on AWS. The API will autoscale to handle production workloads, support rolling updates so that models can be updated without any downtime, stream logs to make debugging easy, and support inference on CPUs and GPUs.
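The serving side of such a deployment centers on a predictor class that loads the model once and then handles each request. A minimal sketch in the style of Cortex's Python predictor interface (the class and method names follow Cortex's convention as I recall it and may vary by version; a tiny keyword lexicon stands in for the fine-tuned BERT model so the example is self-contained):

```python
# Sketch of a Cortex-style predictor. The "model" here is a keyword
# stub standing in for the trained BERT sentiment classifier, which a
# real deployment would load in __init__ (e.g. from S3).
class PythonPredictor:
    def __init__(self, config=None):
        self.positive = {"good", "great", "excellent", "love"}
        self.negative = {"bad", "terrible", "awful", "hate"}

    def predict(self, payload):
        """Called once per API request; payload is the request JSON."""
        words = set(payload.get("text", "").lower().split())
        score = len(words & self.positive) - len(words & self.negative)
        return "positive" if score >= 0 else "negative"

predictor = PythonPredictor()
print(predictor.predict({"text": "I love this great movie"}))  # positive
```

The point of the pattern is that model loading happens once at startup while `predict` stays cheap, which is what lets the API autoscale under production load.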


Spriteworld: a flexible, configurable python-based reinforcement learning environment

GitHub – deepmind



Spriteworld is a Python-based RL environment that consists of a 2-dimensional arena with simple shapes that can be moved freely. This environment was developed for the COBRA agent introduced in the paper “COBRA: Data-Efficient Model-Based RL through Unsupervised Object Discovery and Curiosity-Driven Exploration” (Watters et al., 2019). The motivation was to provide as much flexibility as possible for procedurally generating multi-object scenes while keeping the interface as simple as possible.
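To illustrate the idea of such an arena, here is a toy sketch of sprites as positions in a unit square that actions drag around. This is for intuition only; the actual Spriteworld API in the DeepMind repo differs:

```python
# Toy Spriteworld-style arena: sprites are [x, y] points in [0, 1]^2
# that an action can move, clamped to the arena bounds.
class Arena:
    def __init__(self, sprites):
        self.sprites = [list(p) for p in sprites]

    def step(self, index, dx, dy):
        """Move one sprite by (dx, dy), clamped to stay inside the arena."""
        x, y = self.sprites[index]
        self.sprites[index] = [min(1.0, max(0.0, x + dx)),
                               min(1.0, max(0.0, y + dy))]
        return self.sprites[index]

arena = Arena([(0.5, 0.5), (0.2, 0.8)])
print(arena.step(0, 0.3, -0.2))  # [0.8, 0.3]
```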

 
Careers


Full-time positions outside academia

Research Librarian



Facebook; Menlo Park, CA
