Data Science newsletter – July 23, 2021

Newsletter features journalism, research papers and tools/software for July 23, 2021


Mysterious DNA sequences, known as ‘Borgs,’ recovered from California mud

Science, Elizabeth Pennisi


In the TV series Star Trek, the Borg are cybernetic aliens that assimilate humans and other creatures as a means of achieving perfection. So when Jill Banfield, a geomicrobiologist at the University of California, Berkeley, sifted through DNA in the mud of her backyard and discovered a strange linear chromosome that included genes from a variety of microbes, her Trekkie son proposed naming it after the sci-fi aliens. The new type of genetic material was a mystery. Maybe it was part of a viral genome. Maybe it was a strange bacterium. Or maybe it was just an independent piece of DNA existing outside of cells. Whatever it is, it’s “pretty exciting,” says W. Ford Doolittle, an evolutionary biologist at Dalhousie University who was not involved with the work.

Testing one machine learning method’s limits

Chemical & Engineering News, Sam Lemonick


New research argues that there are ways to make accurate deep learning predictions on molecular systems with data sets of only thousands of example compounds (Nat. Mach. Intell. 2021, DOI: 10.1038/s42256-021-00368-1).

That’s a relief for study author Michael Skinnider, a medical student at the University of British Columbia who earned his PhD under coauthor Leonard J. Foster. The two wanted to use deep learning to predict the molecular structures of illicit designer drugs based on mass spectrometry data, but Skinnider says only about 1,700 designer drug structures are known, well below the supposed limit of deep learning accuracy. Machine-learning algorithms develop the ability to make accurate predictions (about things like molecular structures or chemical properties) after exposure to relevant data, a process known as training.

So the researchers and their colleagues set out to determine just how many data are needed to properly train deep learning algorithms. They also wanted to find out if there were ways to modify the data, the algorithms, or the training procedures to improve accuracy when only limited data are available.

A new metric for designing safer streets

University of Pennsylvania, Penn Today


A new study published in Accident Analysis & Prevention shows how biometric data can be used to find potentially challenging and dangerous areas of urban infrastructure before a crash occurs. Lead author Megan Ryerson led a team of researchers in the Stuart Weitzman School of Design and the School of Engineering and Applied Science in collecting and analyzing eye-tracking data from cyclists navigating Philadelphia’s streets. The team found that individual-based metrics can provide a more proactive approach for designing safer roadways for bicyclists and pedestrians.

Next Year’s Freshman: How Corequisites Might Help Address COVID-19 Learning Loss

Rand Blog, Lindsay Daugherty and Trey Miller


It’s no longer breaking news that millions of students across the United States are likely to suffer COVID-19 school closure–induced learning loss in key subjects like math and English language arts. Many K–12 school districts are already planning how to address the problem: summer programs, tutoring, and parental guidance will likely prove key in the months ahead. But what will colleges do to address learning loss among the many new high school graduates who were affected by school closures?

The good news for these students is that colleges have been hard at work improving the way they provide academic support. In the not-so-distant past, colleges deemed the majority of students “not college ready” and required them to enroll in one or more semesters of noncredit developmental education courses. Studies found that many students were dropping out of these courses before ever making it to entry-level English and math courses.

Now colleges across the country have a new approach: corequisite remediation. Students immediately enter that college-level English or math course and receive some additional, aligned academic support during the same semester.

Julia Computing raises $24 mln in funding round led by Dorilton Ventures

Reuters


Julia Computing raised $24 million in a funding round led by venture capital firm Dorilton Ventures and said Bob Muglia, former chief of software provider Snowflake Inc (SNOW.N), would join the computing solutions company’s board.

Column: Amazon wants to use radar so Alexa can watch as you sleep

Los Angeles Times, David Lazarus


At first glance, it’s one of those things that appears relatively benign: Amazon received federal approval the other day to develop a device for tracking your sleep patterns.

When you look closer, though, questions arise.

Will the device’s radar sensors become an even more intrusive threat to our privacy than the microphones and cameras that the likes of Amazon, Apple and Google already have in millions of homes?

UNR, national laboratory to partner on research initiatives with a focus on climate change

The Nevada Independent, Daniel Rothberg


The project with the national laboratory will study a section of the Sierra Nevada that overlaps with the Truckee River watershed, a primary source of drinking water for the Reno area. The project, with funding from the U.S. Forest Service, is part of an initiative to improve restoration within the central Sierra Nevada, which includes the forests that surround Lake Tahoe.

In particular, the research will look at how management, including forest thinning, might affect the water supply. The researchers will then send their data to economists, who will look at the issue from an economic perspective. The study will rely on a model that was developed by Mark Wigmosta, a chief scientist for watershed hydrology at the national laboratory.

Microsoft Research to open Amsterdam lab focused on molecular simulation, led by noted physicist

GeekWire, Todd Bishop


Microsoft Research will open a new lab in Amsterdam led by Max Welling, a physicist who specializes in molecular simulation, looking to further unlock the potential of machine learning in areas such as climate change and healthcare.

Training Computers to Transfer Music from One Style to Another

University of California-San Diego, UC San Diego News Center


“People are more familiar with machine learning that can automatically convert an image in one style to another, like when you use filters on Instagram to change an image’s style,” said UC San Diego computer music professor Shlomo Dubnov. “Past attempts to convert compositions from one musical style to another came up short because they failed to distinguish between style and content.”

To fix that problem, Dubnov and co-author Conan Lu developed ChordGAN – a conditional generative adversarial network (GAN) architecture that uses chroma sampling, which only records a 12-tone note distribution profile, to separate style (musical texture) from content (i.e., tonal or chord changes).
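As a rough illustration of what a 12-tone note distribution profile is, the sketch below folds a list of notes onto the 12 pitch classes and normalizes the counts. It is not ChordGAN's code: the function name `chroma_profile` and the use of MIDI note numbers as input are assumptions made for this example only.

```python
def chroma_profile(midi_pitches):
    """Return a 12-element pitch-class distribution.

    Assumption for this sketch: notes arrive as MIDI note numbers
    (60 = middle C). The name chroma_profile is hypothetical.
    """
    counts = [0] * 12
    for pitch in midi_pitches:
        # Fold every octave onto one of the 12 pitch classes (C=0 ... B=11).
        counts[pitch % 12] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 12
    # Normalize so the profile sums to 1 regardless of how many notes played.
    return [c / total for c in counts]


# C major chord spread over two octaves: C4, E4, G4, C5.
profile = chroma_profile([60, 64, 67, 72])
```

Both C notes land in the same bin, so octave and voicing are discarded while the chord's pitch content is kept. That is the sense in which a chroma profile captures content rather than texture.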

Charging Infrastructure Analysis Leverages NREL Data Science Expertise

CleanTechnica, U.S. Department of Energy


“NREL’s infrastructure analysis, which focuses on light-duty vehicles, suggests that up to 1.2 million chargers — beyond those located at single-family homes — may be necessary to support California’s ZEV goals,” said Eric Wood, an NREL data science research engineer. “This represents a dramatic increase from the 70,000 public and shared private chargers operating in the state today.”

Models estimate that California will need more than 700,000 shared private and public chargers in 2030 to support 5 million ZEVs, as shown in green, and more than 1.2 million chargers to support 8 million ZEVs, as shown in blue. (These values do not include chargers at single-family residences.) Chart courtesy of CEC

“California’s ZEV goals call for a dramatic shift in transportation infrastructure, necessitating an unprecedented investment in residential, destination, and fast-charging infrastructure,” Wood added. “In addition to identifying the number of chargers needed, NREL’s analysis is identifying efficient charging station locations as well as ways to mitigate the impact of charging loads on the electric grid—by tapping into renewable energy and employing smart-charge technologies, for instance.”

How Managing Building Energy Demand Can Aid the Clean Energy Transition

Lawrence Berkeley National Laboratory, News Center


Since buildings consume 75% of electricity in the U.S., they offer great potential for saving energy and reducing the demands on our rapidly changing electric grid. But how much, where, and through which strategies could better management of building energy use actually impact the electricity system?

A comprehensive new study led by researchers from the Department of Energy’s Lawrence Berkeley National Laboratory (Berkeley Lab) answers these questions, quantifying what can be done to make buildings more energy efficient and flexible in granular detail by both time (including time of day and year) and space (looking at regions across the U.S.). The research team, which also included scientists from the National Renewable Energy Laboratory (NREL), found that maximizing the deployment of building demand management technologies could avoid the need for up to one-third of coal- or gas-fired power generation and would mean that at least half of all such power plants that are expected to be brought online between now and 2050 would not need to be built.

Our New Framework for Artificial Intelligence (video)

U.S. Government Accountability Office


Artificial intelligence (AI) is a transformative technology with enormous potential for good. It is not only ubiquitous in all aspects of life, it is beginning to permeate the public sector and beyond, including health care, agriculture, law enforcement, manufacturing, transportation, and defense. Much of the power of AI comes from its ability to rapidly detect patterns in data that humans simply cannot comprehend. Bringing accountability to responsible use of this technology is therefore paramount to address inherent complexities, risks, and societal consequences.

GAO developed our AI Accountability Framework to address accountability challenges in artificial intelligence and machine learning by laying out key practices, questions, and audit procedures. To develop the framework, we held a forum on AI oversight with experts from government, academia, industry, international partners, and nonprofits.

RSNA to Launch Imaging AI Certificate Program in Fall 2021

Radiological Society of North America


Radiologists are at the forefront of adopting artificial intelligence (AI) into clinical practice. However, no matter where you are on the AI learning curve — novice or expert — the world of AI is ever-changing. This can create a struggle to keep updated on the ways that AI can improve workflow processes and quality in radiology.

RSNA is launching its Imaging AI Certificate program to deliver a pathway for radiologists, including those who don’t consider themselves technologically savvy, to understand and learn how to apply AI to their practices.

The RSNA Imaging AI Certificate program offers a convenient and structured online curriculum designed to help radiologists understand how to integrate AI into their practice, especially to assist with diagnostic radiology and workflow efficiency. Customized to meet radiologists wherever they are in their AI knowledge, the RSNA Imaging AI Certificate will help radiologists learn, practice and continually refresh their AI skills.

Event Horizon Telescope captures ‘beautiful’ images of second black hole’s jet

Science, Daniel Clery


The astronomy team that 2 years ago captured the first close-up of a giant black hole, lurking at the center of the galaxy Messier 87 (M87), has now zoomed in on a second, somewhat smaller giant in the nearby active galaxy Centaurus A. The Event Horizon Telescope’s (EHT’s) latest image should help resolve questions about how such galactic centers funnel huge amounts of matter into powerful beams and fire them thousands of light-years into space. Together the images also support theorists’ belief that all black holes operate the same way, despite huge variations in their masses.

“This is really nice,” astronomer Philip Best of the University of Edinburgh says of the new EHT image. “The angular resolution is astonishing compared to previous images of these jets.”

The EHT merges dozens of widely dispersed radio dishes, from Hawaii to France and from Greenland to the South Pole, into a huge virtual telescope. By pointing a large number of dishes at a celestial object at the same time and carefully time stamping the data from each one with an atomic clock, researchers can later reassemble it with massive computing clusters—a process that takes years—to produce an image with a resolution as sharp as that of a single Earth-size dish. One challenge is getting observing time on 11 different observatories simultaneously, so the EHT only operates for a few weeks each year; poor weather and technical glitches often further narrow that window.

DNA pulled from thin air identifies nearby animals

Science, Erik Stokstad


DNA is everywhere, even in the air. That’s no surprise to anyone who suffers allergies from pollen or cat dander. But two research groups have now independently shown the atmosphere can contain detectable amounts of DNA from many kinds of animals. Their preprints, posted on bioRxiv last week, suggest sampling air may enable a faster, cheaper way to survey creatures in ecosystems.

The work has impressed other scientists. “The ability to detect so many species in air samples using DNA is a huge leap,” says Matthew Barnes, an ecologist at Texas Tech University. “It represents an exciting potential addition to the toolbox.”

“The surprising part is that you’re able to get birds and mammals—wow,” says Julie Lockwood, a molecular ecologist at Rutgers University, New Brunswick. The new studies suggest “there’s more than just spores; there’s cells and hair and all kinds of interesting things that float through the air.”


Folks, I am *so excited* for #PaCSS2021! (Aug 9-13)

Twitter, Sarah Shugars


Featuring keynotes by @j_a_tucker and @cerenbudak + panels on some of the most exciting, innovative, and important CSS work from so many great scholars!

2021 Stanford AIMI Symposium + BOLD-AIR Summit

Stanford University, Center for Artificial Intelligence in Medicine and Imaging


Online August 3-4. “The Stanford Center for Artificial Intelligence in Medicine and Imaging presents the AIMI Symposium & BOLD-AIR Summit – free and open to all” [registration required]


The big one is back: PyData Global 2021 is coming Oct 28-30. We can’t wait to see and chat with the world’s most interesting innovators once again.

“Do you have ideas to share? Submit a proposal for your presentation.” Deadline for submissions is August 15.



The eScience Institute’s Data Science for Social Good program is now accepting applications for student fellows and project leads for the 2021 summer session. Fellows will work with academic researchers, data scientists and public stakeholder groups on data-intensive research projects that will leverage data science approaches to address societal challenges in areas such as public policy, environmental impacts and more. Student applications are due 2/15. DSSG is also soliciting project proposals from academic researchers, public agencies, nonprofit entities and industry who are looking for an opportunity to work closely with data science professionals and students on focused, collaborative projects to make better use of their data. Proposal submissions are due 2/22.


Tools & Resources

NVIDIA Announces Hybrid Cloud Program for AI Deployment

RTInsights, Elizabeth Wallace


The unified approach taken by NVIDIA ensures that companies have support for AI initiatives through an end-to-end hardware and software suite.

Companies that are unable to use public cloud computing for AI initiatives will soon have a new alternative from NVIDIA. The AI Launchpad offers customers an on-demand AI solution designed to level the playing field for companies with specific needs.

Graceful AI – How to make trained systems evolve gracefully.

Amazon, Science blog, Stefano Soatto


Why is it necessary to reprocess old data? Can we design and train new learning-based models in a manner that is compatible with previous ones, so that it is not necessary to reprocess the entire gallery?

These questions generally pertain to the need to train machine-learning-based systems, not in isolation, but in reference to other models. Specifically, we want the new models to be compatible with classifiers or clustering algorithms designed for the old models, and we want them to not introduce new mistakes.
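One way to make "not introduce new mistakes" concrete is to count the samples the old model classified correctly that the new model gets wrong. The sketch below is only an illustration under that reading, not the method from the Amazon post; the helper name `new_mistake_rate` and the toy labels are hypothetical.

```python
def new_mistake_rate(labels, old_preds, new_preds):
    """Fraction of samples the old model got right but the new model gets wrong.

    Hypothetical helper for illustration; all three sequences must align
    sample by sample.
    """
    if not (len(labels) == len(old_preds) == len(new_preds)):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        return 0.0
    regressions = sum(
        1
        for truth, old, new in zip(labels, old_preds, new_preds)
        if old == truth and new != truth
    )
    return regressions / len(labels)


# Toy data: both models are right on 3 of 4 samples, but on different ones.
labels = ["cat", "dog", "cat", "bird"]
old_preds = ["cat", "dog", "dog", "bird"]
new_preds = ["cat", "cat", "cat", "bird"]
rate = new_mistake_rate(labels, old_preds, new_preds)
```

In this toy case the two models have identical accuracy, yet the new one breaks a sample the old one handled. Comparing accuracy alone would hide that, which is why an update is evaluated against the previous model rather than in isolation.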

Using Neural Networks for Your Recommender System

NVIDIA Developer Blog, Benedikt Schifferer


Deep learning (DL) is the state-of-the-art solution for many machine learning problems, such as computer vision or natural language problems and it outperforms alternative methods. Recent trends include applying DL techniques to recommendation engines. Many large companies—such as AirBnB, Facebook, Google, Home Depot, LinkedIn, and Pinterest—share their experience in using DL for recommender systems.

Recently, NVIDIA and the RAPIDS.AI team won three competitions with DL: the ACM RecSys2021 Challenge, SIGIR eCom Data Challenge, and ACM WSDM2021 Challenge.

The field of recommender systems is complex. In this post, I focus on the neural network architecture and its components, such as embedding and fully connected layers, recurrent neural network cells (LSTM or GRU), and transformer blocks. I discuss popular network architectures, such as Google’s Wide & Deep and Facebook’s Deep Learning Recommender Model (DLRM).
