Data Science newsletter – February 17, 2020

Newsletter features journalism, research papers, events, tools/software, and jobs for February 17, 2020


Data Science News

Researchers have already tested YouTube’s algorithms for political bias

Ars Technica, Zain Humayun


In August 2018, President Donald Trump claimed that social media was “totally discriminating against Republican/Conservative voices.” Not much was new about this: for years, conservatives have accused tech companies of political bias. Just last July, Senator Ted Cruz (R-Texas) asked the FTC to investigate the content moderation policies of tech companies like Google. A day after Google’s vice president insisted that YouTube was apolitical, Cruz claimed that political bias on YouTube was “massive.”

But the data doesn’t back Cruz up—and it’s been available for a while. While the actual policies and procedures for moderating content are often opaque, it is possible to look at the outcomes of moderation and determine if there’s indication of bias there. And, last year, computer scientists decided to do exactly that.

Northwestern Mutual Data Science Institute Talks Data for DNC and 2020 Election

Milwaukee Courier, Mrinal Gokhale


Social media has become a powerful tool for the data science field over the years, and the Northwestern Mutual Data Science Institute is now merging politics and big data for a project dedicated to the DNC in Milwaukee.

Amber Wichowsky, an associate professor in the Department of Political Science at Marquette University, said there are millions of online mentions on the 2020 election per month, and that doesn’t include the impeachment trial.

“Since we’re here at the DNC Summit, one thing we’re excited about is, as we head towards the national convention here…how are people talking about Milwaukee online?” said Wichowsky.

Purdue University to open Scalable Open Laboratory for Cyber Experimentation



Purdue University’s CERIAS Center for Education and Research in Information Assurance and Security has announced the addition of a new laboratory facility that dramatically increases Purdue’s cyber-physical research, emulation, and analysis capabilities.

The new SOL4CE (Scalable Open Laboratory for Cyber Experimentation) is a collaborative initiative between CERIAS, the nation’s oldest and preeminent interdisciplinary cyber/cyber-physical security research institute, and the U.S. Department of Energy’s Sandia National Laboratories.

Chicago’s DPI aims to rival Boston tech hub at Kendall Square

Chicago Business, John Pletz


The biggest hurdle between Pritzker and his Kendall Square dreams may be the roughly 150 miles between Chicago and U of I’s main campus in Urbana-Champaign. The most successful technology incubators are much closer to their main university partners.

Pritzker, who was a venture capitalist before he took over as governor last year, is betting $230 million in tax money that DPI can bridge the gap. His goal: use U of I’s tech cred as a talent magnet and catalyst for Chicago’s tech community.

Combined with $270 million for other projects at more than a dozen public universities or campuses, from Carbondale to DeKalb, it’s one of the biggest economic development bets by Illinois in more than 30 years.

From models of galaxies to atoms, simple AI shortcuts speed up simulations by billions of times

Science, Matthew Hutson


Modeling immensely complex natural phenomena such as how subatomic particles interact or how atmospheric haze affects climate can take many hours on even the fastest supercomputers. Emulators, algorithms that quickly approximate these detailed simulations, offer a shortcut. Now, work posted online shows how artificial intelligence (AI) can easily produce accurate emulators that can accelerate simulations across all of science by billions of times.

“This is a big deal,” says Donald Lucas, who runs climate simulations at Lawrence Livermore National Laboratory and was not involved in the work. He says the new system automatically creates emulators that work better and faster than those his team designs and trains, usually by hand. The new emulators could be used to improve the models they mimic and help scientists make the best of their time at experimental facilities. If the work stands up to peer review, Lucas says, “It would change things in a big way.”

The Third AI Summer, Henry Kautz, AAAI 2020 Robert S. Engelmore Memorial Award Lecture

YouTube, Henry Kautz


Talk presented by Henry Kautz, winner of the Robert S. Engelmore Memorial Lecture Award, at the 34th Annual Meeting of the Association for the Advancement of Artificial Intelligence (AAAI-2020) in New York, NY on February 10, 2020. Dr. Kautz received the award for “outstanding research contributions in the area of knowledge representation, data analytics, and data mining of social media for public good.” [video, 49:48]

Your DNA is a valuable asset, so why give it to ancestry websites for free?

The Guardian, Opinion, Laura Spinney


DNA testing companies are starting to profit from selling our data on to big pharma. Perhaps they should be paying us

Unprecedented Facebook URLs Dataset now Available for Academic Research through Social Science One

Social Science One, Gary King and Nathaniel Persily


We are excited to announce that Social Science One and Facebook have completed, and are now making available to academic researchers, one of the largest social science datasets ever constructed. We processed approximately an exabyte (a quintillion bytes, or a billion gigabytes) of raw data from the platform. The dataset itself contains a total of more than 10 trillion numbers that summarize information about 38 million URLs shared more than 100 times publicly on Facebook (between 1/1/2017 and 7/31/2019). It also includes characteristics of the URLs (such as whether they were fact-checked or flagged by users as hate speech) and the aggregated data concerning the types of people who viewed, shared, liked, reacted to, shared without viewing, and otherwise interacted with these links. This dataset enables social scientists to study some of the most important questions of our time about the effects of social media on democracy and elections with information to which they have never before had access. The full codebook for the dataset is here.

New research center will focus on socially responsible artificial intelligence

Penn State University, Penn State News


Penn State will be well positioned to recognize and interpret the social implications of artificial intelligence (AI), thanks to a new, multi-unit research center launched this spring.

The Penn State Center for Socially Responsible Artificial Intelligence promotes the thoughtful development and application of AI and studies its impact on all areas of human endeavor. In addition to supporting research focused explicitly on AI for social good and mitigating threats from its misuse, through this center, Penn State will encourage that all AI research and development activities consider social and ethical implications as well as intended and possible unintended consequences.

Ohio Using Artificial Intelligence To Cut Through Red Tape

ideastream, Andy Chow


The state of Ohio is going high-tech to weed out what could be considered overly burdensome government rules. One state agency is using an artificial intelligence (AI) program to sift through hundreds of thousands of regulations.

The AI program will sort and analyze data collected from every page of Ohio’s laws and administrative code. Staff will then go through the data to determine redundancies and outdated rules. [audio, 0:56]

Why Bill Gates thinks gene editing and artificial intelligence could save the world

GeekWire, Alan Boyle


… The rapid rise of artificial intelligence gives Gates further cause for hope. He noted that the computational power available for AI applications has been doubling every three and a half months on average, dramatically improving on the two-year doubling rate for chip density that’s described by Moore’s Law.

One project is using AI to look for links between maternal nutrition and infant birth weight. Other projects focus on measuring the balance of different types of microbes in the human gut, using high-throughput gene sequencing. The gut microbiome is thought to play a role in health issues ranging from digestive problems to autoimmune diseases to neurological conditions.

“This is an area that needed these sequencing tools and the high-scale data processing, including AI, to be able to find the patterns,” Gates said. “There’s just too much going on there if you had to do it, say, with paper and pencil to understand the 100 trillion organisms and the large amount of genetic material there. This is a fantastic application for the latest AI technology.”
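The two doubling rates quoted in this item can be compared with a little arithmetic. The sketch below is illustrative only: the function name `growth_factor` and the 24-month comparison window are assumptions made for the example, and the only figures taken from the article are the two doubling periods.

```python
def growth_factor(months, doubling_period_months):
    """Multiplicative growth over `months` for a fixed doubling period."""
    return 2.0 ** (months / doubling_period_months)


# Assumed comparison window: the time chip density takes to double once.
WINDOW_MONTHS = 24

ai_compute = growth_factor(WINDOW_MONTHS, 3.5)     # doubles every 3.5 months
chip_density = growth_factor(WINDOW_MONTHS, 24.0)  # doubles every two years

print(f"AI compute over {WINDOW_MONTHS} months: about {ai_compute:.0f}x")
print(f"Chip density over {WINDOW_MONTHS} months: {chip_density:.0f}x")
```

Under these assumptions, compute available for AI grows more than a hundredfold in the time chip density doubles once.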

How CRISPR technology is advancing

Harvard Gazette


In two papers published in Nature Biotechnology, researchers at Harvard University, the Broad Institute, and the Howard Hughes Medical Institute have invented new CRISPR tools that address both issues. The first paper describes newly designed cytosine base editors that reduce an elusive type of off-target editing by 10- to 100-fold, making new variants that are especially promising for treating human disease. The second describes a new generation of all-star CRISPR-Cas9 proteins the team evolved that are capable of targeting a much larger fraction of pathogenic mutations, including the one responsible for sickle cell anemia, which was prohibitively difficult to access with previous CRISPR methods.

“Since the era of human genome editing is in its fragile beginnings, it’s important that we do everything we can to minimize the risk of any adverse effects when we start to introduce these into people,” said David Liu, the lead author on the papers. “Minimizing this kind of elusive off-target editing is an important step toward achieving that goal.”

In Foreshadowing Cryptocurrency Regulations, U.S. Treasury Secretary Prioritizes Law Enforcement Concerns

Electronic Frontier Foundation, Rainey Reitman


U.S. Treasury Secretary Steven Mnuchin foreshadowed the Trump administration’s plans for greater surveillance of cryptocurrency users during his testimony before the Senate Finance Committee on Wednesday. He noted that cryptocurrency was a “crucial area” for the Treasury Department to examine, and said:

We are working with FinCEN and we will be rolling out new regulations to be very clear on greater transparency so that law enforcement can see where the money is going and that this isn’t used for money laundering.

Car ‘splatometer’ tests reveal huge decline in number of insects

The Guardian, Damian Carrington


Two scientific studies of the number of insects splattered by cars have revealed a huge decline in abundance at European sites in two decades.

The research adds to growing evidence of what some scientists have called an “insect apocalypse”, which is threatening a collapse in the natural world that sustains humans and all life on Earth. A third study shows plummeting numbers of aquatic insects in streams.

Disease modelers gaze into computers to see future of Covid-19

STAT, Sharon Begley


At least 550,000 cases. Maybe 4.4 million. Or something in between.

Like weather forecasters, researchers who use mathematical equations to project how bad a disease outbreak might become are used to uncertainties and incomplete data, and Covid-19, the disease caused by the new-to-humans coronavirus that began circulating in Wuhan, China, late last year, has those everywhere you look. That can make the mathematical models of outbreaks, with their wide range of forecasts, seem like guesswork gussied up with differential equations; the eightfold difference in projected Covid-19 cases in Wuhan, calculated by a team from the U.S. and Canada, isn’t unusual for the early weeks of an outbreak of a never-before-seen illness.

But infectious-disease models have been approximating reality better and better in recent years, thanks to a better understanding of everything from how germs behave to how much time people spend on buses.
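To see why early forecasts can span such a wide range, here is a minimal sketch of a textbook SIR (susceptible, infectious, recovered) model stepped forward with Euler's method. It is not the model used by the researchers in the article. Every number in it (population, initial cases, infectious period, time horizon, and the candidate reproduction numbers) is a hypothetical value chosen for illustration, and `cumulative_infections` is a made-up helper name.

```python
def cumulative_infections(r0, population=10_000_000, initial_infected=10,
                          infectious_days=5.0, days=30, dt=0.1):
    """Total people ever infected after `days` in a basic SIR model.

    All default parameter values are hypothetical, not fitted to any outbreak.
    """
    gamma = 1.0 / infectious_days  # recovery rate per day
    beta = r0 * gamma              # transmission rate per day
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    for _ in range(int(days / dt)):
        new_infections = beta * s * i / population * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
    return i + r  # currently infectious plus recovered


# Small changes in the assumed reproduction number compound quickly.
for r0 in (1.5, 2.0, 2.5):
    print(f"R0 = {r0}: about {cumulative_infections(r0):,.0f} infections")
```

Because growth in the early weeks is close to exponential, modest uncertainty in a single input such as the reproduction number turns into a large spread in projected case counts.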


Google I/O 2020

Twitter, Sundar Pichai


Mountain View May 12-14. [save the date]

Observable Community Meetup



San Francisco, CA February 25, starting at 5:30 p.m. “Come meet other Observable users and hang out with the team. We’ll have talks from community members along with show & tell so you can demo what you’ve been working on!” [rsvp required]

The R Conference | New York

Lander Analytics


New York, NY May 7-9. [$$$]

BIOMEDevice Boston

Informa Markets


Boston, MA May 6-7. “Every spring, 2,200+ industry professionals and 335+ suppliers convene in Boston — home to the nation’s highest number of medical device companies — for a two-day event focused on moving medtech design projects to the next stage of development.” [$$$]

Silicon Valley: Submitting Papers to Grace Hopper Celebration (GHC 20)

Community network


San Francisco, CA February 27, starting at 6:30 p.m., Udemy (600 Harrison St). Presenter: Karen Catlin. [registration required]


2020 IEEE/RSJ International Conference on Intelligent Robots and Systems

Las Vegas, NV October 25-29. Deadline for paper submissions is March 1.

JuliaCon 2020

Lisbon, Portugal July 27-31. “We are interested in all topics that have to do with Julia.” Deadline for speaker and workshop proposals is March 7.

Tools & Resources

Academic Program for Professors and Students: Quick Start with Driverless AI and Paperspace

R-bloggers, Gregory Kanevsky


“If you are a professor teaching or a student enrolled in machine learning program or non-technical program with a machine learning hands-on lab becoming a member of the Academic Program will get you free access to non-commercial use of software license for education and research purposes. Since November 2018 (my employer) made its ground-breaking automated machine learning (AutoML) platform Driverless AI available to academia for free.”

Acceleration without pain

Machine Learning Research Blog, Francis Bach


Acceleration is a key concept in numerical analysis and can be carried through in two main ways. The first way is to modify some steps of the algorithm (such as Nesterov acceleration for gradient descent, or Chebyshev / Jacobi acceleration for linear recursions). This requires a good knowledge of the inner structure of the underlying algorithm. A second way is to totally ignore the specifics of the algorithm, and see the acceleration problem as trying to find good “combinations” of the observed iterates that converge faster.

In this blog post, I thus consider a sequence of iterates $(x_k)_{k \geq 0}$ in $\mathbb{R}^d$ obtained from an iterative algorithm $x_{k+1} = T(x_k)$, which will typically be an optimization algorithm. The main question I will address is: Can we do better than outputting the last iterate?
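One classical way to combine observed iterates without looking inside the algorithm is Aitken's Δ² process, sketched below for a scalar fixed-point iteration. It is chosen here for brevity and is not necessarily the construction the post develops. The map `math.cos`, the starting point, the iteration counts, and the helper names `iterate` and `aitken` are all assumptions made for illustration.

```python
import math


def iterate(T, x0, n):
    """Return [x_0, x_1, ..., x_n] with x_{k+1} = T(x_k)."""
    xs = [x0]
    for _ in range(n):
        xs.append(T(xs[-1]))
    return xs


def aitken(xs):
    """Aitken's delta-squared extrapolation from each triple of iterates."""
    out = []
    for a, b, c in zip(xs, xs[1:], xs[2:]):
        denom = c - 2.0 * b + a
        # Fall back to the latest iterate if the second difference vanishes.
        out.append(c if denom == 0.0 else a - (b - a) ** 2 / denom)
    return out


xs = iterate(math.cos, 1.0, 10)            # plain fixed-point iteration
accelerated = aitken(xs)                   # combinations of the same iterates
x_star = iterate(math.cos, 1.0, 200)[-1]   # reference fixed point

print("last iterate error:", abs(xs[-1] - x_star))
print("Aitken error:      ", abs(accelerated[-1] - x_star))
```

The extrapolated value uses only the three most recent iterates, yet on this linearly converging example it lands closer to the fixed point than the last iterate does.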

Introducing FAIR Island

UC3 :: California Digital Library


Building on the Island Digital Ecosystem Avatars (IDEA) Consortium (see Davies et al. 2016), FAIR Island leverages collaboration between the University of California Gump Station, located on Moorea in French Polynesia, and Tetiaroa Society, which operates a newly established field station located on the atoll of Tetiaroa, a short distance from Moorea.

Tetiaroa is in a unique position to demonstrate how we can advance open science by creating optimal FAIR data policies governing all research conducted at the field station. By implementing mandatory registration requirements including extensive use of controlled vocabularies, personal identifiers (PIDs), and other identifiers, DMPs in this “FAIR data utopia” will be utilized as key documents for tracking provenance, attribution, compliance, deposit, and publication of all research data collected on the island.


Tenured and tenure track faculty positions

Faculty Positions in Computer Science

University of Rochester, Department of Computer Science; Rochester, NY

Internships and other temporary positions

Data Science Intern (Summer)

AARP; Washington, DC

Full-time, non-tenured academic positions

Data Library Engineer

University of California-San Francisco, Data Science CoLab; San Francisco, CA


University of Colorado, Department of Information Science; Boulder, CO
