U.S. institutions awarded 54,904 research doctorate degrees in 2016, only five fewer than the previous year’s record high, according to the Survey of Earned Doctorates (SED), a federally sponsored annual census of research degree recipients.
SED provides data for Doctorate Recipients from U.S. Universities, a report published by the National Center for Science and Engineering Statistics (NCSES) within the National Science Foundation (NSF) that supplies data and analysis of the American system of doctoral education, a vital U.S. economic interest.
Since the SED began collecting data in 1957, the number of research doctorates awarded in science and engineering (S&E) fields has exceeded the number of non-S&E doctorates, and the gap is widening. In 1957, S&E doctorates made up 65 percent of all doctorates awarded. In 2016, S&E doctorates made up 75 percent.
University Data Science News
The Social Science Research Council is coordinating a new academia-industry research partnership, the Social Data Initiative. Principal investigators whose studies are approved will gain access to de-identified Facebook data far more comprehensive than the data researchers typically have. The details of how the data, the power, and publication decisions will flow are spelled out by Harvard’s Gary King and Stanford Law’s Nathaniel Persily. Academics who want to get involved can either serve on the advisory committee or apply for grants, much as they would apply for any other grant. Funding comes from private foundations including the Alfred P. Sloan Foundation, the William and Flora Hewlett Foundation, the Omidyar Network, and the Charles Koch Foundation. The initial goal of the Social Data Initiative is to investigate how Facebook use may be changing democratic practice, including in the 2016 US elections.
The National Science Foundation just published its annual report on doctoral education in the US, based on exit surveys with newly minted doctorates. The number of doctorates granted in the US in 2016 — 54,904 — was nearly identical to the 2015 high of 54,909. Hot takes:
nine of the top ten granting institutions are flagship state universities (Texas, Wisconsin, Michigan, California (UCLA and Berkeley), Minnesota, Florida, Indiana, Ohio)
women earned 46% of all doctorates granted, continuing to move towards parity. There are still substantial within-field gender gaps in math and engineering,
after graduation, science, engineering, and math (SEM) doctorate holders with “firm employment plans” went to academic employment (32.6%), postdocs (38.5%), or industry (29%). Industry positions pay almost 1.5 times as much as assistant professor positions and a little more than double postdoc salaries in SEM fields,
about 39% of science, engineering, and math PhDs had no firm employment plans when they submitted their dissertations and took the survey.
The University of Pennsylvania is now offering an ethics class in the Computer and Information Science Department, led by Ani Nenkova and Michael Kearns. What sets the course apart is its plan to teach ethics through technical skills rather than discussion alone, a welcome shift.
Alan Mislove and Christo Wilson, associate and assistant professors of computer science at Northeastern University, may move forward with their challenge to a federal computer crimes law that criminalizes breaches of websites’ terms of service, a federal district court judge has ruled. The computer scientists argue that they need to be able to submit false information to websites in order to test for gender, racial, and other types of discrimination. They are represented by the American Civil Liberties Union (ACLU). The claims of the other plaintiffs, Christian Sandvig of the University of Michigan, Karrie Karahalios of the University of Illinois, and First Look Media, were dismissed.
More and more universities are starting master’s and doctoral programs in data science and analytics—fields in which statistics is foundational—due to increasing interest from students and employers. Amstat News reached out to those in the statistical community who are involved in such programs to find out more about them. Given their interdisciplinary nature, we identified programs involving faculty with expertise in different disciplines to jointly reply to our questions. We have profiled many universities in our April, June, and December 2017 issues and January 2018 issue; here are several more.
Ian Buck doesn’t just run the Tesla accelerated computing business at Nvidia, one of the fastest-growing and most profitable businesses in the company’s twenty-five-year history. The work that Buck and other researchers started at Stanford University in 2000 and then continued at Nvidia helped to transform a graphics card shader into a parallel compute engine that is helping to solve some of the world’s toughest simulation and machine learning problems.
Nvidia held its annual GPU Technology Conference last week, and we sat down with Buck to chat about GPU-accelerated systems: what is driving the adoption of GPU computing, and technologies such as the new NVSwitch, which helps boost the performance of machine learning and perhaps HPC workloads.
It is just about impossible to seriously regulate data use with these current practices in place. We need to build security and privacy controls into software tools. Researchers have been developing techniques for doing precisely this. There exist techniques that can, for instance, ensure that an app can read camera information but not send it across the network to anybody else. My research group, in collaboration with researchers at UC Santa Cruz, UC San Diego, Harvard and MIT, is working on a set of techniques that allow programmers to attach precise, complex rules about data use — like “only my friends near me can see my location between 9 a.m. Monday and 5 p.m. Friday” — directly to sensitive data values, allowing developers to write these kinds of policies in one place and auditors to check such policies by looking in a single location. (Full disclosure: Facebook has contributed funding to my research group, and we collaborate with two Facebook employees on a non-privacy related aspect of the work. I also worked on backend privacy at Facebook as an intern in 2012.) This is part of a broader context of researchers at places like Cornell, Stanford and MIT, where there are also groups actively working on information flow security techniques for preventing these kinds of leaks. Requiring a software company like Facebook to use such techniques would make it much easier to enforce higher-level regulation.
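The core idea here, attaching a use policy directly to a sensitive value so that every read is checked against it, can be sketched in a few lines. This is a toy illustration, not the research group's actual system; the class and policy names are invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Labeled:
    """A sensitive value bundled with the rule governing its use."""
    value: Any
    policy: Callable[[dict], bool]  # maps a viewer context to allowed/denied

    def read(self, ctx: dict) -> Any:
        # The policy travels with the data: every access point enforces it,
        # so the rule is written once and auditable in one place.
        if not self.policy(ctx):
            raise PermissionError("policy forbids this access")
        return self.value

# Example policy: only friends within 5 km may see the value
def friends_nearby(ctx: dict) -> bool:
    return bool(ctx.get("is_friend")) and ctx.get("distance_km", float("inf")) < 5

location = Labeled((40.7, -74.0), friends_nearby)
print(location.read({"is_friend": True, "distance_km": 2}))  # (40.7, -74.0)
# location.read({"is_friend": False, "distance_km": 2})  -> raises PermissionError
```

Real information-flow systems go further, tracking labels through computations and across the network, but the principle is the same: the policy is part of the data, not scattered through application code.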
I spoke with Zuckerberg on Friday about the state of his company, the implications of its global influence, and how he sees the problems ahead of him.
“I think we will dig through this hole, but it will take a few years,” Zuckerberg said. “I wish I could solve all these issues in three months or six months, but I just think the reality is that solving some of these questions is just going to take a longer period of time.”
But what happens then? What has this past year meant for Facebook’s future? In a 2017 manifesto, Zuckerberg argued that Facebook would help humanity take its “next step” by becoming “the social infrastructure” for a truly global community.
As a software engineer at Microsoft, Elena Voyloshnikova’s job is to make informed recommendations about how to improve the performance of software engineering tools.
But too often, she spends her days manually analyzing the data she needs to make those decisions. Lately, her team has been discussing the potential of building machine learning models to automate that task – creating more time to focus on the decision-making.
That’s why she was intrigued when she received an email announcing an upcoming AI training session for Microsoft employees.
“I asked my manager, ‘Can I go to this?’” she said. “I thought it looked like a good overview of things I would like to know.”
Berkeley, Calif., is considering shutting out of city contracts companies that help federal immigration officials create databases and registries used to target immigrants and religious minorities.
The Sanctuary City Contracting and Investment Ordinance, to be taken up by the City Council April 3, wouldn’t allow contracts with any vendor that’s working with U.S. Immigration and Customs Enforcement to create a database that could be used to identify and round up immigrants. It would also prohibit city investments in such vendors.
The gigantic data centers that power the internet consume vast amounts of electricity and emit 3 percent of global CO2 emissions. To change that, data companies need to turn to clean energy sources and dramatically improve energy efficiency.
Apple has hired Google’s chief of search and artificial intelligence, John Giannandrea, a major coup in its bid to catch up to the artificial intelligence technology of its rivals.
Apple said on Tuesday that Mr. Giannandrea will run Apple’s “machine learning and A.I. strategy,” and become one of 16 executives who report directly to Apple’s chief executive, Timothy D. Cook.
The sonification of big data will help people better understand and analyze big data, as well as detect anomalies in the data, say researchers at Virginia Tech.
The two most important words you need to know to understand the fate of our coastlines are “grounding line.” Those words describe where Antarctica’s voluminous ice shelves begin to float, holding back a wall of ice on land.
A study published on Monday in Nature Geoscience is among the first to create a detailed snapshot of how warming ocean waters are eating away at grounding lines around the continent. Over just five years, the continent lost 564 square miles of grounded ice, an area equivalent to roughly 25 Manhattans, 12 San Franciscos, or 4 Philadelphias. This is not good news for any of those or other coastal cities.
Goldman Sachs has hired a senior employee from Amazon to run the bank’s artificial-intelligence efforts.
Charles Elkan has joined Goldman Sachs as a managing director leading the firm’s machine learning and AI strategies, according to an internal memo viewed by Business Insider.
Elkan comes from Amazon, where he was responsible for the Artificial Intelligence Laboratory at Amazon Web Services, according to the memo. He previously led the retailing giant’s Seattle-based central machine-learning team.
U.S. Army, Communications-Electronics Research, Development and Engineering Center
Aberdeen Proving Ground, MD May 2-4. “CERDEC is the Army’s applied research and advanced technology development center for command, control, communications, computers, cyber, intelligence, surveillance and reconnaissance (C5ISR) technologies and systems.” [registration required]
San Francisco, CA May 30-31. “Rev is for data science leaders and practitioners, offering interactive sessions, stimulating conversations, and tutorials about how to run, manage, and accelerate data science as an organizational capability.” [$$$]
One of the most frequent questions and topics that I see come up on community resources such as StackOverflow, the Confluent Platform mailing list, and the Confluent Community Slack group is getting data from a database into Apache Kafka®, and vice versa. Often it’s Oracle, SQL Server, DB2, and so on—but regardless of the actual technology, the options for doing it are broadly the same. In this post we’ll look at each of those options and discuss the considerations around each. It may be obvious to readers, but it’s worth restating anyway: since this is Kafka—a streaming platform—we are talking about streaming integration of data, not just bulk static copies of the data.
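One common route the post covers is Kafka Connect with the JDBC source connector, which polls a table and streams new rows into a topic. As a rough sketch only—the connection URL, table name, and key column below are placeholders you would replace with your own—a source that picks up rows by an auto-incrementing ID might be configured like this:

```json
{
  "name": "jdbc-source-orders",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db.example.com:5432/shop",
    "connection.user": "kafka_connect",
    "connection.password": "********",
    "table.whitelist": "orders",
    "mode": "incrementing",
    "incrementing.column.name": "order_id",
    "topic.prefix": "db-",
    "poll.interval.ms": "5000"
  }
}
```

With `mode` set to `incrementing`, the connector only emits rows whose `order_id` exceeds the last one it saw; timestamp-based and timestamp+incrementing modes exist for tables where rows are updated in place. Note that JDBC polling misses deletes—one reason the post also weighs log-based change data capture.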
“Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP). Most stuff here is just raw unstructured text data, if you are looking for annotated corpora or Treebanks refer to the sources at the bottom.”