Data Science newsletter – November 12, 2020

Newsletter features journalism, research papers and tools/software for November 12, 2020

GROUP CURATION: N/A

 

Mounting clusters in youth sports, pandemic fatigue complicate fight against coronavirus in Mass.

The Boston Globe, Kay Lazar


from

Growing clusters of coronavirus infections in youth sports, coupled with pandemic fatigue, are complicating the battle to control new cases across Massachusetts, health department directors say.

They say the hurdles, coming as the number of cases soar across the state, are making their job of tracking down close contacts of infected residents exponentially more complex than earlier in the pandemic. That’s reflected in recent state data indicating that the source of infections in about half of new COVID-19 cases is a mystery — a troubling trend because without knowing how and where people are getting infected, it’s difficult to prevent further infections.


Free Fall

The Harvard Crimson student newspaper, Sophia S. Liang and Matteo N. Wong


from

While [Harvard President Lawrence] Bacow’s infection was well-documented and promptly treated, many Harvard custodians say they faced barriers to getting tested throughout the spring. As a result, the scope of the outbreak among them remains unclear — one custodian estimates she knows about 60 custodians across the University who have experienced COVID-like symptoms. Another knows of at least eight custodians who tested positive for COVID-19 at the Law School alone, one of whom has been in the hospital since April. A spokesperson for the custodians’ union, 32BJ SEIU, says that members have reported 13 confirmed COVID-19 cases to the union, along with several other presumed cases.


I’m kind of surprised I haven’t seen an article in @FiveThirtyEight or somewhere similar using the Total Survey Error framework as a tool for categorizing potential sources of survey error and tests for those hypotheses as election results get finalized.

Twitter, Kevin Collins


from

When survey researchers hear about “Shy Trump” voters, we hear it as measurement error, and there’s good evidence that it’s vanishingly small. But I think the broader public might also be including non-response error as part of how they understand that term

As we unpack the sources of survey error, it’s worth keeping our eye on some patterns. For instance, this comparison of survey averages to projected results by @gelliottmorris shows a correlation between 2016 and 2020, but also an intercept shift.
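An "intercept shift" here means the polling errors are systematic, not random: poll margins and results track each other closely, but the fitted line is offset from zero. A minimal sketch of the idea, using made-up numbers (not real 2016/2020 state data) purely for illustration:

```python
import numpy as np

# Hypothetical state-level margins (D minus R, percentage points).
# Results track polls almost perfectly, but sit ~3 points lower:
# high correlation, shifted intercept.
poll_margin = np.array([-8.0, -3.0, 1.0, 4.0, 9.0])
result_margin = poll_margin - 3.0 + np.array([0.2, -0.1, 0.3, -0.2, 0.1])

slope, intercept = np.polyfit(poll_margin, result_margin, 1)
r = np.corrcoef(poll_margin, result_margin)[0, 1]
print(round(slope, 2), round(intercept, 2), round(r, 3))
```

With data like this, the slope is close to 1 and the correlation near-perfect, yet the intercept is roughly −3: the kind of pattern that points to a shared, systematic error source rather than noise.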


AI has cracked a key mathematical puzzle for understanding our world

MIT Technology Review, Karen Hao


from

Researchers at Caltech have introduced a new deep-learning technique for solving PDEs that is dramatically more accurate than deep-learning methods developed previously. It’s also much more generalizable, capable of solving entire families of PDEs—such as the Navier-Stokes equation for any type of fluid—without needing retraining. Finally, it is 1,000 times faster than traditional mathematical formulas, which would ease our reliance on supercomputers and increase our computational capacity to model even bigger problems.


UX issues around voting

Statistical Modeling, Causal Inference, and Social Science blog, Bob Carpenter


from

How well do elections measure individual voter intent?

What is the probability that a voter who tries to vote has their intended votes across the ballot registered? Spoiler alert. It’s not 100%.

We also want to know if the probability of having your vote recorded depends on the vote. Or on the voter. To put it in traditional statistical terms, if we think of the actual vote count as an estimate of voter intent, what is the error and bias of the estimator?
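The estimator framing can be made concrete with a small simulation. This is a toy model (the recording probabilities are invented, not taken from the post): if ballots for one candidate are recorded slightly less often than ballots for another, the count is a biased estimator of intent.

```python
import random

def simulate_count(intents, p_record):
    """Tally a hypothetical election in which each ballot is successfully
    recorded with a per-candidate probability; unrecorded ballots are lost."""
    tally = {c: 0 for c in p_record}
    for c in intents:
        if random.random() < p_record[c]:
            tally[c] += 1
    return tally

# Hypothetical: 1,000 voters intend A and 1,000 intend B, but A's ballots
# register 99% of the time and B's only 97% -- a vote-dependent error.
random.seed(0)
intents = ["A"] * 1000 + ["B"] * 1000
runs = [simulate_count(intents, {"A": 0.99, "B": 0.97}) for _ in range(500)]
mean_a = sum(r["A"] for r in runs) / len(runs)
mean_b = sum(r["B"] for r in runs) / len(runs)
# Bias of the count as an estimator of intent: E[count] minus true intent.
print(mean_a - 1000, mean_b - 1000)  # roughly -10 for A, -30 for B
```

Neither probability is 100%, so both counts are biased downward; because the probabilities differ by candidate, the *margin* is biased too, which is the scenario the post asks about.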


The carbon footprint of artificial intelligence is growing

Anthropocene, Sarah DeWeerdt


from

The computing power used in deep learning grew 300,000-fold between 2012 and 2018, and if that pace of growth continues it’s not hard to see how artificial intelligence could have a major climate impact.

But this isn’t inevitable, say researchers at the University of Copenhagen in Denmark. As it happens, the answer to a problem with algorithms could be another algorithm: the researchers created a free, open-source program to assess and predict the carbon footprint of deep learning models.

Their program, called Carbontracker, is basically an add-on to deep learning models. It uses the Python programming language, like most such models, to make it easier to integrate. The researchers wrote their program so that it would not require a lot of computing power itself and would not interfere with the deep learning algorithms.
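The arithmetic behind such a footprint estimate is simple: energy drawn (kWh) times the grid's carbon intensity (gCO2eq per kWh). This back-of-envelope sketch is not Carbontracker's API; it just illustrates the calculation, with invented example numbers:

```python
def training_footprint_kg(avg_power_watts, hours, carbon_intensity_g_per_kwh):
    """Rough CO2-equivalent estimate for a training run:
    energy in kWh multiplied by grid carbon intensity (gCO2eq/kWh),
    converted to kilograms."""
    energy_kwh = avg_power_watts * hours / 1000.0
    return energy_kwh * carbon_intensity_g_per_kwh / 1000.0

# e.g. a single 300 W GPU running for 48 hours on a 400 gCO2/kWh grid:
print(training_footprint_kg(300, 48, 400))  # 5.76 kg CO2eq
```

Carbontracker goes further by measuring actual hardware power draw during the first training epochs and extrapolating, but the unit conversion above is the core of any such estimate.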


Google’s AIY kits offer do-it-yourself artificial intelligence

EDN, Brian Dipert


from

There’s plenty of low-cost hardware out there feasible for implementing deep learning training and (especially, along with being your likely implementation focus) inference, as well as plenty of open source (translation: free) and low-priced software, some tied to specific silicon and other more generic. Tying the two (hardware and software) together in a glitch-free and otherwise robust manner is the trick; select unwisely and you’ll waste an inordinate amount of time and effort wading through arcane settings and incomplete (and worse: incorrect) documentation, trying to figure out why puzzle pieces that should fit together perfectly aren’t.

That’s where Google’s AIY (which stands for “Artificial-Intelligence-Yourself,” a play on DIY, i.e., “Do-It-Yourself”) Project Kits come in. They’re targeted at hobbyists and professionals alike: in Google’s own words, “With our maker kits, build intelligent systems that see, speak, and understand. Then start tinkering. Take things apart, make things better. See what problems you can solve.” While the hardware and software included in each kit may not match what you end up using in your own designs after you get up the initial learning curve, the fundamentals you’ll “grok” in using them will stay with you and continue to apply.


What Pfizer’s landmark COVID vaccine results mean for the pandemic

Nature, News, Ewen Callaway


from

Scientists welcome the first compelling evidence that a vaccine can prevent COVID-19. But questions remain about how much protection it offers, to whom and for how long.


Covid & Cities: Reasons for optimism

City Observatory blog, Joe Cortright


from

There’s a reason why most of the stories we hear about people leaving the city involve mid-career or older professionals. It’s the same reason why young people, just starting their careers, gravitate toward cities: cities are great places to develop skills, build networks and establish a professional reputation. Once you’ve established all those things, usually after a couple of decades of hard work, then maybe you can think about down-shifting and working remotely from Boulder or Bend. But if you’re just starting out, small towns and rural areas offer few ways to find the challenging opportunities, hone professional skills, network with peers and mentors, and build a compelling resume.

The importance of “being there” is critically important when it comes to promotions and layoffs. When looking to promote, organizations prize workers who show zeal, enthusiasm and commitment. If nothing else, it is hard to overlook those who are in the office every day. And the reverse is true: out-of-sight—or off-site—is out of mind, making you more vulnerable to downsizing.


University of Idaho Researchers Help Introduce First-of-its-Kind Arctic Animal Database

Big Country News


from

Two University of Idaho College of Natural Resources faculty, a student, and a former U of I researcher are among the authors of an Arctic animal research paper that introduces a first-of-its-kind database and was published this week in Science, the premier scientific journal in the United States.

In “Ecological Insights from Three Decades of Animal Movement Tracking Across a Changing Arctic,” U of I student Jyoti Jennewein, professors Jan Eitel and Lee Vierling, as well as Arjan Meddens, a former postdoctoral fellow at U of I, collaborated with a team of 160 scientists worldwide to present the animal movement database.


Survey of COVID-19 research provides fresh overview

Karolinska Institute, KI News


from

In the wake of the rapid spread of COVID-19, research on the disease has escalated dramatically. Over 60,000 COVID-19-related articles have been indexed to date in the medical database PubMed. This body of research is too large to be assessed by traditional methods, such as systematic and scoping reviews, which makes it difficult to gain a comprehensive overview of the science.

“Despite COVID-19 being a novel disease, several systematic reviews have already been published,” says Andreas Älgå, medical doctor and researcher at the Department of Clinical Science and Education, Södersjukhuset at Karolinska Institutet. “However, such reviews are extremely time- and resource-consuming, generally lag far behind the latest published evidence, and only focus on a specific aspect of the pandemic.”


CAS opens data vault to MIT scientists

Chemical & Engineering News, Sam Lemonick


from

Scientists will soon get their hands on a vault of chemical data that has for years been kept from independent machine-learning researchers. But for now, database organization CAS will only allow one group, based at the Massachusetts Institute of Technology (MIT), to use these data in order to train machine learning–based synthesis planners. Agreements between CAS and academic researchers have been rare, but a CAS official says it hopes this one will be the first of many.

CAS is a division of the American Chemical Society, which publishes C&EN. CAS maintains databases of chemical information, including molecular structures and properties, as well as reaction procedures and conditions. That’s exactly the kind of data computational chemists need in order to develop machine learning tools that can carry out retrosynthesis, the process of predicting the synthetic steps needed to make a target molecule like a drug. But CAS’s standard terms of use specifically forbid machine learning algorithm training.

Connor W. Coley, the MIT computational chemist who will lead the CAS collaboration, says the agreement will give his group access to a curated dataset of several million reactions.


Is there a middle ground in communicating uncertainty in election forecasts?

Statistical Modeling, Causal Inference, and Social Science blog, Jessica Hullman


from

Beyond razing forecasting to the ground, over the last few days there’s been renewed discussion online about how election forecast communication again failed the public. I’m not convinced there are easy answers here, but it’s worth considering some of the possible avenues forward. Let’s put aside any possibility of not doing forecasts, and assume the forecasts were as good as they possibly could be this year (which is somewhat of a tautology anyway). Communication-wise, how did forecasters do and how much better could they have done?

We can start by considering how forecast communication changed relative to 2016. The biggest differences in uncertainty communication that I noticed looking at FiveThirtyEight and Economist forecast displays were:

1) More use of frequency-based presentations for probability, including reporting the odds as frequencies, and using frequency visualizations (FiveThirtyEight’s grid of maps as header and ball-swarm plot of EC outcomes).

2) De-emphasis on probability of win by FiveThirtyEight (through little changes like moving it down the page, and making the text smaller)

3) FiveThirtyEight’s introduction of Fivey Fox, who in several of his messages reminded the reader of unquantifiable uncertainty and specifically the potential for crazy (very low probability) things to happen.
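The frequency-based presentations in point 1 amount to a small reframing of the same number: "89%" becomes "89 in 100" (or "36 in 40," matching a grid of 40 outcome maps). A toy helper showing the conversion (this is an illustration, not FiveThirtyEight's actual code):

```python
def as_frequency(p, denominator=100):
    """Express a probability as a 'k in N' frequency statement,
    the framing forecasters leaned on in 2020 displays."""
    k = round(p * denominator)
    return f"{k} in {denominator}"

print(as_frequency(0.89))      # '89 in 100'
print(as_frequency(0.89, 40))  # '36 in 40'
```

The research motivation for this framing is that readers tend to interpret natural frequencies ("11 in 100 times the underdog wins") more accurately than single-event probabilities.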


What’s the Role of Developer Experience in Programming Languages Research?

SIGPLAN Blog, Jean Yang


from

In 2018, I was two years into a tenure-track junior faculty position at Carnegie Mellon University when I saw what was happening in the world of software. 2018 was when Cambridge Analytica happened; soon afterward, GDPR became law. Most academics who start companies in industry do so because they are commercializing a specific project; I left to start Akita Software because I saw that the timing was right to build the kinds of tools I wanted to build—those that could help improve the quality and security of software. For the last two years, I’ve been working to build developer tools at the API level for modern web apps.

Because the technical problem was not set in stone at the time I started Akita, I spent the first year of working on the company doing product and user research work. My academic career trained me to solve problems in a principled way. Through the process of getting Akita up and running, I learned how to select problems in a principled way. This was important because for the first time, the success of my efforts was being defined by how well I solved problems that people cared about enough to pay real-world money for. Here are my learnings.


Announcing the Objectron Dataset

Google AI Blog, Adel Ahmadyan and Liangkai Zhang


from

The state of the art in machine learning (ML) has achieved exceptional accuracy on many computer vision tasks solely by training models on photos. Building upon these successes and advancing 3D object understanding has great potential to power a wider range of applications, such as augmented reality, robotics, autonomy, and image retrieval. For example, earlier this year we released MediaPipe Objectron, a set of real-time 3D object detection models designed for mobile devices, which were trained on a fully annotated, real-world 3D dataset and can predict objects’ 3D bounding boxes.

Yet, understanding objects in 3D remains a challenging task due to the lack of large real-world datasets compared to 2D tasks (e.g., ImageNet, COCO, and Open Images). To empower the research community for continued advancement in 3D object understanding, there is a strong need for the release of object-centric video datasets, which capture more of the 3D structure of an object, while matching the data format used for many vision tasks (i.e., video or camera streams), to aid in the training and benchmarking of machine learning models.


Deadlines



Statistics and Public Policy Calls for Editor Applications and Nominations

Deadline for nominations is December 11.

Tools & Resources



A Theory of Universal Learning

arXiv, Computer Science > Machine Learning; Olivier Bousquet, Steve Hanneke, Shay Moran, Ramon van Handel, Amir Yehudayoff


from

How quickly can a given class of concepts be learned from examples? It is common to measure the performance of a supervised machine learning algorithm by plotting its “learning curve”, that is, the decay of the error rate as a function of the number of training examples. However, the classical theoretical framework for understanding learnability, the PAC model of Vapnik-Chervonenkis and Valiant, does not explain the behavior of learning curves: the distribution-free PAC model of learning can only bound the upper envelope of the learning curves over all possible data distributions. This does not match the practice of machine learning, where the data source is typically fixed in any given scenario, while the learner may choose the number of training examples on the basis of factors such as computational resources and desired accuracy.
In this paper, we study an alternative learning model that better captures such practical aspects of machine learning, but still gives rise to a complete theory of the learnable in the spirit of the PAC model. More precisely, we consider the problem of universal learning, which aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over the distribution. The main result of this paper is a remarkable trichotomy: there are only three possible rates of universal learning. More precisely, we show that the learning curves of any given concept class decay either at an exponential, linear, or arbitrarily slow rate. Moreover, each of these cases is completely characterized by appropriate combinatorial parameters, and we exhibit optimal learning algorithms that achieve the best possible rate in each case.
For concreteness, we consider in this paper only the realizable case, though analogous results are expected to extend to more general learning scenarios.
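The three rates in the trichotomy have distinct shapes as the number of training examples n grows. A toy numerical illustration of the shapes (constants are arbitrary; these are not the paper's bounds, just the functional forms it names):

```python
import math

# Error as a function of sample size n, one toy curve per regime:
def exponential_rate(n):
    return math.exp(-0.5 * n)        # e^{-cn}: error collapses very fast

def linear_rate(n):
    return 1.0 / n                   # c/n: error halves when data doubles

def arbitrarily_slow_rate(n):
    return 1.0 / math.log(n + 1)     # slower than c/n for any constant c

for n in (10, 100, 1000):
    print(n, exponential_rate(n), linear_rate(n), arbitrarily_slow_rate(n))
```

At n = 1000 the three toy curves are already separated by many orders of magnitude, which is why knowing which regime a concept class falls into (via the paper's combinatorial parameters) matters so much in practice.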


big news: we are starting a non-profit! It is called 2i2c, which stands for “The International Interactive Computing Collaboration”.

Twitter, Chris Holdgraf


from

“2i2c has a few core goals:”
– “Manage interactive computing infrastructure for research and education”
– “Develop and improve tools in interactive computing for these use-cases”
– “Support open source tools and communities that underlie this infrastructure”


First release of the Array API Standard

Consortium for Python Data API Standards


from

“The main goal of this standard: make it easier to switch from one array library to another one, or to support multiple array libraries as compute backends in downstream packages. We’d also like to emphasize that if some functionality is not present in the API standard, that does not mean it’s unimportant, or that we’re asking existing array libraries to deprecate it. Instead it simply means that that functionality at present isn’t supported – likely due to it not being present in all or most current array libraries, or not being used widely enough to have been included so far. The use cases section of the standard may provide more insight into important goals.”
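In practice, "switching from one array library to another" means consuming code asks the input array for its own namespace rather than hard-coding one library. A minimal sketch of that dispatch pattern, using the standard's `__array_namespace__` protocol and falling back to NumPy for arrays that don't yet advertise one (the `normalize` helper is our own example, not part of the standard):

```python
import numpy as np

def normalize(x):
    """Scale an array to zero mean and unit variance using only functions
    the array API standard names, dispatching to whichever library
    produced the input array."""
    get_ns = getattr(x, "__array_namespace__", None)
    xp = get_ns() if get_ns is not None else np  # fallback for older arrays
    return (x - xp.mean(x)) / xp.std(x)

out = normalize(np.array([1.0, 2.0, 3.0, 4.0]))
print(out)
```

Because the function body only uses names the standard specifies (`mean`, `std`, arithmetic operators), the same code can in principle run against any conforming backend without modification.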


Careers


Tenured and tenure track faculty positions

Endowed Chair of Data Science



Baylor University, Computer Science and Informatics Department; Waco, TX
Internships and other temporary positions

Faculty Fellows



New York University, Center for Data Science; New York, NY
