Data Science newsletter – July 24, 2019

Newsletter features journalism, research papers, events, tools/software, and jobs for July 24, 2019

Data Science News



The Human Brain Project Hasn’t Lived Up to Its Promise

The Atlantic, Ed Yong


Ten years ago, a neuroscientist said that within a decade he could simulate a human brain. Spoiler: It didn’t happen.


Anonymising personal data ‘not enough to protect privacy’, shows new study

Imperial College London, News


With the first large fines for breaching the EU General Data Protection Regulation (GDPR) upon us, and the UK government about to review GDPR guidelines, researchers have shown how even anonymised datasets can be traced back to individuals using machine learning, a type of artificial intelligence.

The researchers say their paper, published in Nature Communications, demonstrates that allowing data to be used – to train AI algorithms, for example – while preserving people’s privacy, requires much more than simply adding noise, sampling datasets, and other de-identification techniques.


Using big data to solve the world’s biggest challenges

Michigan State University, MSUToday


About 2.5 quintillion bytes of data are created every day around the world. Now a first-of-its-kind research group at Michigan State University is on a mission to extract meaningful information from immense data sets to create solutions for some of society’s biggest challenges.

The Department of Computational Mathematics, Science and Engineering’s collaborative environment brings together biologists, engineers, astronomers and mathematicians as well as undergrads and grad students. These researchers create models that help tackle complex problems like researching dying stars so we can understand the origins of matter, mapping Earth’s interior to better predict earthquakes and developing novel ideas in network theory to explain complex diseases.


Can NASA Get Its Satellite Data into the Real World?

Eos, Gabriel Popkin


Less than two months ago, the data spigot for NASA’s Ice, Cloud, and Land Elevation Satellite–2 (ICESat-2) finally turned on. It was a long-awaited moment for scientists hoping that the state-of-the-art satellite, which sends out laser pulses and detects returning photons, would provide exquisitely precise measurements of the elevation of Earth’s ice sheets, forests, and land.

Earth scientists won’t be the only ones keeping a close eye on the satellite’s performance. A group of NASA program managers will be scrutinizing not the data themselves, but whether they make their way beyond the scientific literature into practical, public-facing applications such as weather and climate forecasts, disaster prediction and response efforts, and environmental conservation plans.


NVIDIA Launches U.K. Technology Center to Advance AI Research

insideHPC


NVIDIA just launched a new technology center in the UK designed to support groundbreaking research in AI and data science — and foster engagement across the country’s higher education and research community.

EPCC, Hartree Centre, and the University of Reading are the first to join the NVIDIA AI Technology Center, which provides a collaborative community for world-class talent driving AI adoption and excellence across the UK.


AI protein-folding algorithms solve structures faster than ever

Nature, News, Matthew Hutson


The race to crack one of biology’s grandest challenges — predicting the 3D structures of proteins from their amino-acid sequences — is intensifying, thanks to new artificial-intelligence (AI) approaches.

At the end of last year, Google’s AI firm DeepMind debuted an algorithm called AlphaFold, which combined two techniques that were emerging in the field and beat established contenders in a competition on protein-structure prediction by a surprising margin. And in April this year, a US researcher revealed an algorithm that uses a totally different approach. He claims his AI is up to one million times faster at predicting structures than DeepMind’s, although probably not as accurate in all situations.

More broadly, biologists are wondering how else deep learning — the AI technique used by both approaches — might be applied to the prediction of protein arrangements, which ultimately dictate a protein’s function. These approaches are cheaper and faster than existing lab techniques such as X-ray crystallography, and the knowledge could help researchers to better understand diseases and design drugs.


Protein wrangler, serial entrepreneur, and community builder: Inside David Baker’s brain

Chemical & Engineering News, Laura Howes


Earlier this year, David Baker was sitting at his computer at the Institute for Protein Design, in Seattle, worrying about a situation familiar to anyone who’s ever been a graduate student: it was his turn to present at the weekly group meeting, but he didn’t have any results to show.

“I really hadn’t made much progress. And I was totally stressed,” Baker recalls. Knuckling down over the next few days, he coaxed some results from his computational project and met his deadline, enabling some productive conversations about what direction to head next.

The odd thing about this situation is that Baker is not a grad student, nor even a postdoc. He’s a group leader and head of the University of Washington’s Institute for Protein Design, with a staff of just over 130.


New joint major in economics and computer science available this fall

Yale University, YaleNews


A new joint major in computer science and economics will offer Yale undergraduates hands-on research opportunities and prepare them to leave their mark on the world’s digital economy.

Beginning in the fall of 2019, students can pursue the Computer Science and Economics (CSEC) interdepartmental major and explore the practical and theoretical connections between the two disciplines while developing a skillset highly prized by research universities; tech behemoths like Google, Facebook, and Amazon; and any number of other industries.


New curriculum will focus on philosophy of artificial intelligence

Arizona State University, ASU Now


Artificial intelligence algorithms have become pervasive in daily life, but should they? And what are the drawbacks and advantages of using machine learning?

Several Arizona State University faculty members have won a grant from the National Endowment for the Humanities to create a new curriculum that will challenge students to think about these complex issues while they’re learning how to create the technology.

The grant is funding a yearlong process for the School of Arts, Media and Engineering to create the new program, which will be a concentration within the existing Bachelor of Arts in digital culture. The school is housed in both the Herberger Institute for Design and the Arts and the Ira A. Fulton Schools of Engineering.


Globalization Isn’t Dying, It’s Just Evolving

Bloomberg Graphics, Shawn Donnan and Lauren Leatherby


Globalization is a force both more powerful and ancient than Trump. Too often we think of it—of economic integration and the exchange of ideas, people and goods that comes with it—as a recent phenomenon.

The reality is it has been with us since the dawn of time. Religions like Christianity and Islam are products of globalization. They also have arguably done more to both shape and promote globalization than U.S. multinationals or China’s new corporate giants.

Globalization also isn’t a static force. We associate globalization today with the shipping container, the 1950s invention that increased the efficiency and lowered the cost of the global trade in goods. Or with the outsourcing of jobs in advanced economies and the rebirth of great trading economies like China’s.

But we are entering a new era in which data is the new shipping container and there are far more disruptive forces at work in the world economy than Trump’s tariffs. New manufacturing techniques such as 3D printing and the automation of factories are reducing the economic incentives to offshore production. The smartphones we carry with us are not just products of globalization but accelerants for it. For good or bad, we are more exposed to a global culture of ideas than we have ever been. And we are only becoming more global as a result.


Estimating the success of re-identifications in incomplete datasets using generative models

Nature Communications; Luc Rocher, Julien M. Hendrickx & Yves-Alexandre de Montjoye


While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.
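
To give a feel for why a handful of demographic attributes can single people out, here is a minimal sketch that measures empirical uniqueness in a toy table. It is not the paper's copula-based generative model; it only counts how many records have an attribute combination that no other record shares. The function name `uniqueness`, the attribute names, and the records themselves are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def uniqueness(records, attributes):
    """Return the fraction of records whose values for `attributes`
    are shared with no other record in `records`."""
    # Build one key per record from the chosen attributes.
    keys = [tuple(record[name] for name in attributes) for record in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Hypothetical toy data: five people, three quasi-identifiers.
records = [
    {"zip": "10001", "birth_year": 1980, "gender": "F"},
    {"zip": "10001", "birth_year": 1980, "gender": "M"},
    {"zip": "10001", "birth_year": 1975, "gender": "F"},
    {"zip": "10002", "birth_year": 1980, "gender": "F"},
    {"zip": "10002", "birth_year": 1980, "gender": "F"},
]

# Uniqueness can only stay the same or grow as attributes are added.
for attrs in (["zip"], ["zip", "birth_year"], ["zip", "birth_year", "gender"]):
    print(attrs, uniqueness(records, attrs))
```

In this toy table, no one is unique by ZIP code alone, while three of the five records become unique once birth year and gender are added. The sketch only measures uniqueness within the table it is given; estimating uniqueness in the full population from a sample is the harder problem the paper addresses.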


2019 Google Scholar Metrics Released, CVPR Cracks the Top Ten

Medium, SyncedReview


Estimates peg the total number of academic papers and other scholarly literature indexed on Google Scholar at almost 400 million, making it the world’s largest such database. To compile the index, Google surveys hundreds of journals and websites that meet its inclusion guidelines, along with leading conferences in Engineering & Computer Science.

In a blog post last Friday, Google released its 2019 Scholar Metrics, designed to “provide an easy way for authors to quickly gauge the visibility and influence of recent articles in scholarly publications … to help authors as they consider where to publish their new research.” One of the top AI conferences, the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), ranked in the top 10 for the first time, up from 20th in 2018. The world’s most prominent scientific journals, Nature and Science, ranked first and third respectively.


Student Startup Brings New Customer Insights to Brick and Mortar Stores

Georgia Institute of Technology, College of Computing


The online advantage for retailers is obvious. Website analytics and transactional data allow e-commerce retailers to know exactly how their customers shop.

To help level the playing field, a startup company created by Georgia Tech students is piloting a new technology with several businesses in Ponce City Market that provides much of the same customer insight information to brick and mortar retailers.

The company, Countable Technologies, has devised a secure way of passively capturing a smartphone’s unique identifier, without apps or a network connection, to give retail businesses a slew of new data points on foot traffic to their stores, while maintaining customer anonymity and privacy.

“Privacy is one of our utmost concerns and our sensors that are placed in stores only passively talk to phones within range,” said Arvin Poddar, co-founder of Countable Technologies and a computer science student at Georgia Tech.


Facebook pairs its Map With AI service with OpenStreetMap project

ZDNet, Between the Lines blog, Larry Dignan


Facebook’s effort, aside from also benefiting from mapping data, is designed to work with the open source community to map millions of miles that haven’t been mapped.


How credit unions could help people make the most of personal data

MIT Sloan School of Management, Ideas Made to Matter, Dylan Walsh


In May of 2018, the EU’s General Data Protection Regulation took effect, referred to by The New York Times as “the world’s toughest rules to protect people’s online data.” Among its many safeguards, the GDPR gave individuals ownership of their personal data and thereby restricted its collection and use by businesses.

“That’s a good first start,” said Alex Pentland, a co-creator of the MIT Media Lab who played a foundational role in the development of the GDPR. “But ownership isn’t enough. Simply having the rights to your data doesn’t allow you to do much with it.” In response to this shortcoming, Pentland and his team have proposed the establishment of data cooperatives.

The idea is conceptually straightforward: Individuals would pool their personal data in a single institution — just as they pool money in banks — and that institution would both protect the data and put it to use. Pentland and his team suggest credit unions as one type of organization that could fill this role.

 
Events



The inaugural New York Sleep Fest w/ leading scientists, scholars & authorities on the topic are coming to #NewYork for a free event open to the public

Twitter, Dr. Rebecca Robbins


 
Deadlines



The Data Hub Challenge offers startups access to experienced companies and unique data sets to develop and implement promising data-driven solutions.

“Join forces with startups and other corporates to unleash your data potential.” Deadline for applications is July 31.
 
Tools & Resources



Ollivier-Ricci Curvature-Based Method to Community Detection in Complex Networks

Nature, Scientific Reports; Jayson Sia, Edmond Jonckheere & Paul Bogdan


Identification of community structures in complex networks is of crucial importance for understanding the system’s function, organization, robustness and security. Here, we present a novel Ollivier-Ricci curvature (ORC) inspired approach to community identification in complex networks. We demonstrate that the intrinsic geometric underpinning of the ORC offers a natural approach to discover inherent community structures within a network based on interaction among entities. We develop an ORC-based community identification algorithm based on the idea of sequential removal of negatively curved edges symptomatic of high interactions (e.g., traffic, attraction). To illustrate and compare the performance with other community identification methods, we examine the ORC-based algorithm with stochastic block model artificial networks and real-world examples ranging from social to drug-drug interaction networks. The ORC-based algorithm is able to identify communities with either better or comparable performance accuracy and to discover finer hierarchical structures of the network. This opens new geometric avenues for analysis of complex network dynamics.


pandas: The two cultures

Marc Garcia, datapythonista blog


In 2001, [Leo] Breiman published the paper Statistical Modeling: The Two Cultures. In it, Breiman identified two somewhat conflicting cultures in the discipline of statistical modeling. One focused on modeling (and trying to understand) the stochastic process generating some random data, while the other followed an algorithmic approach focused on obtaining results (minimizing the error between the model results and the data) and considered the stochastic process a black box. Today we would probably call them statistics and machine learning … But this post is not about machine learning, but about pandas, and about the two cultures in the pandas community, which I personally don’t think are often well identified, causing frustration to some users and complicating decisions regarding the API of the project.


New release of the COAR Controlled Vocabulary, “Resource Type”

Confederation of Open Access Repositories


COAR is pleased to announce the release of the Resource Type Vocabulary, Version 2. This vocabulary, which is now available in 15 languages, provides standardized terms for different types of content contained in a repository. Controlled vocabularies ensure that “everyone is using the same word to mean the same thing” and are key to achieving the COAR vision of a global knowledge commons, based on an interoperable, international network of open repositories. The Resource Type Vocabulary supports discovery of content by allowing readers to confidently search and browse across systems according to the “type” of content they are looking for.

 
Careers


Full-time positions outside academia

Director of Data Platform



Pluralsight; South Jordan, UT

C++ Software Engineer



Geopipe; New York, NY

Postdocs

NYUAD Postdoctoral Associate – CITIES Research Center



New York University, NYU – Global: Abu Dhabi: Social Science; Abu Dhabi, United Arab Emirates

NYUAD Post-Doctoral Associate in Computer Science



New York University, NYU – Global: Abu Dhabi: Science; Abu Dhabi, United Arab Emirates
