Data Science newsletter – June 2, 2021

Newsletter features journalism, research papers and tools/software for June 2, 2021

 

A collaborative approach that engages the community to develop digital biomarkers

Nature Portfolio Health Community, Solveig Sieberts



… Inspired by the way brains work, deep learning models built on neural networks were overwhelmingly more accurate than signal processing-based approaches when we had massive amounts of data (in this case ~40,000 sets of sensor reads). But they performed similarly to signal processing methods when the data size was an order of magnitude smaller. Of course, data sets of sufficient size are incredibly rare and prohibitive to collect for most groups. Democratizing data access can facilitate better biomarker development.

Access to data is not the only barrier to the development of quality digital biomarkers. In an attempt to publish or perish, researchers’ own self-interests can lead them to use subtle tactics such as selective choice of data and metrics to improve the perception of their model accuracy relative to competing methods. Unfortunately, the peer-review process is often not sufficient to detect these subtle manipulations. This also makes it difficult to understand the relative performance of competing methods in a truly unbiased fashion.


/1. Yesterday at the ACS Data Users Conference, the Census Bureau described its plans to replace the American Community Survey (ACS) microdata with “fully synthetic” data over the next three years.

Twitter, Steve Ruggles



/2. Details of the methodology have not been disclosed, but the idea is to develop models describing the interrelationships of all the variables in the ACS, and then construct a simulated population consistent with those models.

/3. Such modeled data captures relationships between variables only if they have been intentionally included in the model. Accordingly, synthetic data are poorly suited to studying unanticipated relationships, which impedes new discovery.
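
The point in /3 can be illustrated with a toy sketch. This is not the Census Bureau's methodology, which the thread says has not been disclosed. It assumes a deliberately naive, hypothetical synthesizer that models each variable on its own, and the two variables and all function names are made up for illustration. A relationship the model never included disappears from the synthetic data even though each variable's own distribution is preserved.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def make_real_population(n, rng):
    """Toy 'real' microdata: two made-up variables that are strongly related."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [x + rng.gauss(0.0, 0.3) for x in a]
    return a, b


def synthesize_marginals_only(a, b, rng):
    """Hypothetical synthesizer that models each variable separately.

    Each column is resampled independently, so the per-variable
    distributions survive but the link between the columns, which
    was never part of the model, does not.
    """
    n = len(a)
    synth_a = [rng.choice(a) for _ in range(n)]
    synth_b = [rng.choice(b) for _ in range(n)]
    return synth_a, synth_b


rng = random.Random(0)  # fixed seed so the run is repeatable
real_a, real_b = make_real_population(2000, rng)
synth_a, synth_b = synthesize_marginals_only(real_a, real_b, rng)

real_corr = pearson(real_a, real_b)
synth_corr = pearson(synth_a, synth_b)
print(f"correlation in real data:      {real_corr:.2f}")
print(f"correlation in synthetic data: {synth_corr:.2f}")
```

In this sketch the real data show a strong correlation while the synthetic data show almost none. A realistic synthesizer would model many relationships, but the same logic applies to any relationship it leaves out.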


How Johns Hopkins Medicine became an information powerhouse during the COVID-19 pandemic

FierceHealthcare, Heather Landi



“I have to credit our content strategy team. In early January (2020) they began researching the coronavirus. My senior content person came to me and said ‘I think we should write an article on coronavirus’ and so we had an article up on the site in late January,” Aaron Watkins, senior director of internet strategy at Johns Hopkins Medicine, told Fierce Healthcare.

Watkins and his team are responsible for marketing, social media, search engine positioning and broad digital transformation strategies at the organization.

The teams at Johns Hopkins had to shift quickly to create a content model that was more nimble and could provide access to accurate coronavirus information in a rapidly changing environment. For example, the organization leveraged its medical expertise and content to develop an online COVID-19 self-checker.


University joins the Center for Accelerated Real Time Analytics

University of Miami, News @ the U



As Mitsu Ogihara, the director of education at the University of Miami Institute for Data Science and Computing (IDSC), notes, such smart data must be collected, analyzed, and conveyed instantly because “two days—or even two seconds—later is too late.”

Now, with a prestigious grant from the National Science Foundation (NSF), the University has become the newest academic partner of the NSF-sponsored Center for Accelerated Real Time Analytics (CARTA), which helps industry and government members drive innovations by extracting and analyzing real-time information from massive and moving data sets, such as video, voice, social networking, and the “internet of things.”


Researchers create cellular blueprint of healthy lungs

Yale University, YaleNews



In a new study, [Naftali] Kaminski’s team and a multinational team of investigators and the Human Cell Atlas project, an international initiative to describe all cells in the human body, examined the diverse characteristics of these relatively understudied cells.

In an analysis of 15,000 endothelial cells from healthy lung tissues obtained from 73 individuals, they established a reference map of endothelial cells in the lung which will help researchers identify specific abnormalities that occur in a host of lung diseases, including COVID-19.


CDS Founder Yann LeCun, Deepmind Fellow Aishwarya Kamath and CIMS Post-Doctoral Fellow Nicolas Carion propose MDETR

Medium, NYU Center for Data Science



If someone handed you an art history book and asked you to show them the famous painting of a woman in a black dress who is not quite smiling, you could probably identify Da Vinci’s Mona Lisa from the pictures within. Humans learn this ability to match images and objects to a written or oral description over many years. AI has proven capable of this as well, but not quite as capable as humans. This is especially true when the AI is asked to identify an object outside of its current vocabulary or when it only has a textual description of the object to go off of.

A team of scientists at NYU and Facebook, including CDS founder Yann LeCun, CDS Deepmind Fellow Aishwarya Kamath and CIMS Post-Doctoral Fellow Nicolas Carion, came together to try to create a solution to this problem. What they came up with is MDETR, or Modulated Detection for End-to-End Multi-Modal Understanding, which they outlined in a paper available on arXiv.


How financial data can bridge the investment gap to scale soil health

Environmental Defense Fund, Growing Returns blog, Vincent Gauthier and Camille Morse Nicholson



Conservation and agricultural organizations have advanced the business case for in-field conservation practices in recent years, but there remains a need for scalable data-gathering mechanisms that easily integrate into farmers’ and lenders’ most trusted sources of financial benchmarking information.

Leveraging the insights from a recently published guide from Environmental Defense Fund and partners, there are clear opportunities to address this data gap at regionally specific scales, ultimately reducing farmer uncertainty in the profitability of conservation practices and helping farmer advisers support wise conservation investments.


Publishers grapple with an invisible foe as huge organised fraud hits scientific journals

Chemistry World, Katrina Kramer



‘As with many hidden criminal syndicates, you don’t always know what’s happening,’ says Retraction Watch’s Ivan Oransky about paper mills. They are the biggest organised fraud perpetrated on scientific journals ever, eroding scientists’ trust in the publishing system – and in each other.

While plagiarism and fraud aren’t new – individual researchers have been caught photoshopping electron microscopy images or inventing elemental analysis data – paper mills serve up professional fakery for their customers on an industrial scale. Buyers can apparently purchase a paper, or authorship of one, on any topic based on phony results to submit to a journal. This not only makes them harder to detect and crack down on, but also exponentially increases the damage they could do.

The extent of their operations became apparent in early 2020. Two independent groups of image detectives came across a number of manuscripts, all from different authors at different institutions working on different biomedical topics, that seemed to share strange inconsistencies – as if they had all used the same stock images. The set now contains almost 600 manuscripts. Another set of 125 was discovered only a few months later. And there could be 10 times as many professionally manipulated papers that have not yet been – and might never be – found, estimates science integrity consultant Elisabeth Bik.


Ocean data can now be harnessed using sustainable floats

World Economic Forum, Formative Content, Walé Azeez



Seatrec, a five-year-old ‘blue tech’ start-up based in Vista, California, is one company rising to that challenge, with power generating technology that it hopes will bring a level of sustainability to oceanographic exploration not possible until recently.

Dead batteries on the seabed

Using ‘profiling floats’ – unmanned, robotic data gathering devices that monitor the sea’s physical, chemical and geological characteristics – is an established approach. The Argo international research project, for example, has more than 3,000 active floats dotted around the world.

Laden with sensors, these battery-fuelled floats dive to predetermined depths to carry out their programmed tasks, re-emerge on completion and transmit their data harvest to the research stations via satellite.


A new tool tracks health disparities in the U.S. — and highlights major data gaps

STAT, Katie Palmer



Over and over, the pandemic has reinforced the reality of racial disparities in the U.S. health system. But that story remains difficult to see in the data, which is still inconsistently collected and reported across the country.

On Wednesday, a coalition of researchers and advocates launched a tool they hope will fill some of those gaps: the Health Equity Tracker, a portal that collects, analyzes, and makes visible data on some of the inequities entrenched in U.S. medicine.

“For far too long it’s been ‘no data, no problem,’” said Nelson Dunlap, chief of staff at the Satcher Health Leadership Institute at Morehouse School of Medicine, which developed the tool with funding and resources from Google.org, Gilead Sciences, Annie E. Casey Foundation, and CDC Foundation. By making data that do exist on racial health disparities accessible, the tracker aims to empower local advocates to drive change in their communities — and inspire action to fill in holes in data that are themselves reinforced by structural racism. In the tracker’s display, 38% of federally-collected Covid-19 cases report unknown race and ethnicity.


Integrating Data to Find Links Between Environment and Health

Eos; Zhong Liu, Daniel Tong, Jennifer Wei, and David Meyer



To help reveal the pandemic’s impacts on economies and people’s daily lives at local, regional, and global scales around the world, scientists, policymakers, and the public have looked to information from many different sources, including various public health and population data sets as well as in situ and satellite observations of environmental conditions. Satellite data typically cover wider areas than in situ observations; in situ observations are better at identifying small-scale phenomena, and they provide ground truth for satellite observations. Thus, integrating satellite environmental data with data from multiple on-the-ground sources can provide a more holistic look at the causes and spread of disease outbreaks, as well as the effects these outbreaks have on the environment and society.

Assembling all this information into forms that allow researchers to uncover links between health and environmental factors and that decisionmakers can use to enact effective policies is far from simple. Nonetheless, this sort of geohealth data integration is crucial in facilitating research and the search for solutions during public health crises like disease outbreaks, which are often influenced by environmental conditions. This point has been demonstrated not only during the ongoing COVID-19 pandemic but also by studies of other disease outbreaks in the past.

Here we discuss the need for effective integration of data from different disciplines (e.g., health, geosciences, economy, population) and sources (satellite versus on the ground), barriers to such integration, and tools and opportunities to overcome these barriers.


Neuroscience and AI’s Future – How The Thousand Brains Theory of Intelligence may unlock machine intelligence.

Psychology Today, Cami Rosso



What we’re doing at Numenta is starting with deep learning networks, and how far we can go by modifying them. Instead of throwing everything away, how do we start moving from where we are?

We have a whole theory on sparsity, and we’ve now been able to speed up existing deep neural networks by anywhere from five to over a hundred times, depending on the network architecture. They become more robust, and they’re less prone to adversarial attacks.

We’re now adding dendrite theory that I wrote about in the book. We’re showing that we can solve another big problem in deep learning, which is continuous learning. Deep learning networks can’t be incrementally trained; you have to start over and train the whole thing again.


UW launches innovative Center for Health Disparities Research

University of Wisconsin, School of Medicine and Public Health



A new center at the UW School of Medicine and Public Health seeks to examine how a person’s environment and social conditions impact their health down to the molecular level.

The new research effort, called the UW Center for Health Disparities Research (CHDR), launched in April. As its name implies, center investigators will focus on how physical environment and social conditions intersect to influence an individual’s health. Researchers will aim to identify new therapies, precision medicine approaches and other interventions, according to Amy Kind, MD, PhD, center director, and professor of medicine at the school.


A New Tool May Make Geological Microscopy Data More Accessible

Eos, Richard J. Sima



It all started with a problem many geoscientists faced in 2020.

Alex Steiner, a doctoral student at Michigan State University, had research to do working on thin sections—slivers of geological materials that are usually analyzed under a microscope. But he and the two undergraduate students on the project were not allowed to access the lab or the geological samples they were working on. Because, well, pandemic.

It was out of this necessity that Steiner helped develop a new tool that could automatically take pictures of entire thin sections and stitch them into digital panoramic microscope images that could be analyzed anywhere.

The technical report on the device, named PiAutoStage, was recently published


New Degree in Computer Vision Launches as Artificial Intelligence Expands

PR Newswire, University of Central Florida



A new master’s degree in computer vision will launch this fall at the University of Central Florida, the first public university in the country to offer a degree in this rapidly expanding field.


Events



Save the Date! Rev, the Marquee MLOps Event, Returns This Fall

Domino Data Lab



Chicago November 9-11. “Rev 3 is dedicated to providing attendees continuous learning that improves MLOps so they can better accomplish their missions.” [save the date]


SciPy 2021 | Attend

SciPy



Online July 12-18. [$$$]


Virtual event: How artificial intelligence is transforming health care

STAT



Online July 13-14. “Technologies and procedures that would have been science fiction are now the reality of medicine. Our 2021 STAT Breakthrough Science Summit will take you inside these innovations.” [$$$]


Deadlines



The TensorFlow Microcontroller Challenge

Deadline for entries is June 19.

Help governments classify AI systems! A thread… Right now, AI systems are broadly illegible to policymakers, which is why AI policy is confusing. At the @OECD , we’re trying to make them legible via a framework people can use to classify AI systems.

“First, you could help us test out the framework by classifying an existing AI system (e.g, AlphaGo Zero, C-CORE, CASTER) using our survey (or classifying your own system using it).”

Analyze your movement with the latest video-analysis technology

Want to be part of one of the largest human movement studies to date?

It’s as simple as:
1. Recording a video with 5 sit-to-stand movements
2. Filling out a questionnaire


New Librarianship Symposia Invitation

“Ten years ago MIT Press published The Atlas of New Librarianship. We are taking the opportunity of its 10th anniversary to explore some of the key issues in librarianship that have evolved and emerged since 2011 in a series of online symposia in October and November 2021.” Deadline for abstract submissions is June 30.

The Call for Proposals for the @MWBigDataHub Community Development and Engagement program is open!

Deadline for proposals is July 5.

SPONSORED CONTENT





The eScience Institute’s Data Science for Social Good program is now accepting applications for student fellows and project leads for the 2021 summer session. Fellows will work with academic researchers, data scientists and public stakeholder groups on data-intensive research projects that will leverage data science approaches to address societal challenges in areas such as public policy, environmental impacts and more. Student applications due 2/15 – learn more and apply here. DSSG is also soliciting project proposals from academic researchers, public agencies, nonprofit entities and industry who are looking for an opportunity to work closely with data science professionals and students on focused, collaborative projects to make better use of their data. Proposal submissions are due 2/22.

 


Tools & Resources



AI for Query Understanding

Medium, Daniel Tunkelang



In the past decade, the incredible progress in word embeddings and deep learning has fueled an interest in neural information retrieval. An increasing number of folks believe that it’s time to retire the traditional inverted indexes (aka posting lists) that search engines use for retrieval and ranking.

In its place, they advocate a model where search engines use neural networks to represent documents and queries as vectors, and then use nearest neighbor search — or more sophisticated ranking models — to retrieve and rank results.

This revolutionary approach is tempting, but — in my view — misdirected. As I argue below, the right place to focus AI efforts is query understanding.
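
The two retrieval styles contrasted above can be sketched side by side. This is a minimal illustration, not code from the article: the three documents are made up, all function names are hypothetical, and `embed` is a bag-of-words stand-in for a neural encoder, which a real system would replace with a trained model.

```python
import math
from collections import defaultdict

# Made-up toy corpus, for illustration only.
DOCS = {
    "d1": "cheap flights to miami",
    "d2": "miami hotel deals",
    "d3": "deep learning for search ranking",
}


def build_inverted_index(docs):
    """Map each term to the set of document ids containing it (posting lists)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def boolean_and_search(index, query):
    """Return the documents that contain every query term."""
    postings = [index.get(term, set()) for term in query.split()]
    if not postings:
        return set()
    return set.intersection(*postings)


def embed(text, vocab):
    """Stand-in for a neural encoder: a bag-of-words count vector."""
    words = text.split()
    return [float(words.count(term)) for term in vocab]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, docs, vocab, k=2):
    """Brute-force nearest neighbor search: rank all documents by similarity."""
    q = embed(query, vocab)
    scored = [(cosine(q, embed(text, vocab)), doc_id) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by id
    return [doc_id for _, doc_id in scored[:k]]


VOCAB = sorted({term for text in DOCS.values() for term in text.split()})
INDEX = build_inverted_index(DOCS)

print(boolean_and_search(INDEX, "miami hotel"))        # exact term match
print(nearest_neighbors("miami hotel", DOCS, VOCAB))   # ranked by similarity
```

The inverted index returns only documents containing every query term, so a query such as "miami resort" matches nothing here. The vector search always returns its k closest documents, whether or not they contain the query terms.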


“Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI

Google Research, SIGCHI; Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Kumar Paritosh, Lora Mois Aroyo



AI models are increasingly applied in high-stakes domains like health and conservation. Data quality carries an elevated significance in high-stakes AI due to its heightened downstream impact, impacting predictions like cancer detection, wildlife poaching, and loan allocations. Paradoxically, data is the most under-valued and de-glamorised aspect of AI. In this paper, we report on data practices in high-stakes AI, from interviews with 53 AI practitioners in India, East and West African countries, and USA. We define, identify, and present empirical evidence on Data Cascades—compounding events causing negative, downstream effects from data issues—triggered by conventional AI/ML practices that undervalue data quality. Data cascades are pervasive (92% prevalence), invisible, delayed, but often avoidable. We discuss HCI opportunities in designing and incentivizing data excellence as a first-class citizen of AI, resulting in safer and more robust systems for all.


New post: @pacoid and I detail key challenges in monitoring #MachineLearning models and outline several key components of a model monitoring platform.

Twitter, Ben Lorica



This is a very active area with many startups rolling out new offerings

2/ Models interacting with the real world and failing to make sense of it can have serious consequences. Unfortunately, models begin to degrade the moment they get deployed. We begin by describing the many catalysts that can result in model degradation


Careers


Postdocs

Postdoctoral Fellow in Quantitative Social Science



Harvard University, Laboratory for Innovation Science; Allston, MA

Postdoctoral Fellow position in the broad area of Healthcare Analytics.



Harvard University, Mossavar-Rahmani Center for Business and Government; Cambridge, MA

Internships and other temporary positions

Data Fellow



City of Philadelphia; Philadelphia, PA

Full-time positions outside academia

Data Scientist, Ethical AI Practice



Salesforce; San Francisco, CA
