Data Science Newsletter – March 27, 2017

Data Science Newsletter features journalism, research papers, events, tools/software, and jobs for March 27, 2017

GROUP CURATION: N/A

 
 
Data Science News



Intel Forms New AI Group Reporting Directly To CEO Brian Krzanich

Forbes, Patrick Moorhead



Intel announced it is aligning all its AI efforts under Naveen Rao, former Nervana Systems CEO, in a pan-Intel organization called the Artificial Intelligence Products Group (AIPG). Rao and AIPG will report directly to Intel CEO Brian Krzanich.

AIPG will “own” the execution for certain deliverables and the overall AI strategy and roadmap, and will also “align” resources needed from other groups to have an AI roadmap spanning Intel Xeon, Intel Nervana, and Intel Xeon Phi, as well as software stacks. AIPG will also include an applied AI research lab that will look out three to five years for additional building blocks, such as new algorithms and architectures.


Howard University campus to open at Google’s headquarters

San Francisco Chronicle, Wendy Lee



Silicon Valley’s leading tech companies have long been stumped over how to diversify. Despite efforts to recruit and retain black and Latino engineers, the numbers haven’t budged much, if at all.

Google is trying a new tack. The company announced Thursday it is creating a college campus at its Mountain View headquarters that’s geared toward students at historically black colleges and universities.


UMass cyber security expert worries hackers, not CIA, pose biggest privacy threat

MassLive.com, Diane Lederman



Emery Berger noted that the CIA is prohibited from spying on American citizens, and said he believes the agency is not targeting Americans.

But, he said, there is a risk that agents could pick up Americans in surveillance when the agency is intercepting conversations with foreign citizens.


Data Can Help Local Governments Fight Corruption, Study Says

Government Technology, Zack Quaintance



Local governments can and should begin to use data more often in order to root out and eliminate corruption, a new study has found.

The study, Taking a Byte Out of Corruption, emphasized that while major advancements in data analytics have given law enforcement agencies and policymakers new tools that could potentially fight public corruption, a lot of work remains before systems are established to ensure such data is used fairly and effectively. In the report, the authors speak often of this being a starting point for municipalities that want to bridge this gap.

The study, the result of a yearlong effort, was conducted by the Center for the Advancement of Public Integrity (CAPI) Data Analytics Working Group at Columbia Law School in New York City, and support for it was provided by the Laura and John Arnold Foundation.


Assessing real-time Zika risk in the United States

bioRxiv; Spencer J. Fox et al.



Background: Confirmed local transmission of Zika virus (ZIKV) in Texas and Florida has heightened the need for early and accurate indicators of self-sustaining transmission in high-risk areas across the southern United States. Given ZIKV's low reporting rates and the geographic variability in suitable conditions, a cluster of reported cases may reflect diverse scenarios, ranging from independent introductions to a self-sustaining local epidemic.

Methods: We present a quantitative framework for real-time ZIKV risk assessment that captures uncertainty in case reporting, importations, and vector-human transmission dynamics.

Results: We assessed county-level risk throughout Texas, as of summer 2016, and found that importation risk was concentrated in large metropolitan regions, while sustained ZIKV transmission risk was concentrated in the southeastern counties, including the Houston metropolitan region and the Texas-Mexico border (where the sole autochthonous cases occurred in 2016). We found that counties most likely to detect cases are not necessarily the most likely to experience epidemics, and used our framework to identify triggers to signal the start of an epidemic based on a policymaker's propensity for risk.

Conclusions: This framework can inform the strategic timing and spatial allocation of public health resources to combat ZIKV throughout the US, and highlights the need to develop methods to obtain reliable estimates of key epidemiological parameters.
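An illustrative aside (ours, not the authors' framework): with low reporting rates, a small cluster of reported cases is compatible with a wide range of true infection counts. The sketch below assumes a 10% reporting rate and a flat prior purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
reporting_rate = 0.1   # assumed value, for illustration only
reported = 5           # a small observed cluster

# Rejection sampling: which true infection counts are consistent with observing
# exactly `reported` cases, if each infection is reported independently with
# probability `reporting_rate`?
candidates = rng.integers(reported, 300, size=200_000)   # flat prior, illustration only
accepted = candidates[rng.binomial(candidates, reporting_rate) == reported]

print("true infections consistent with 5 reported cases (10th/50th/90th percentile):",
      np.percentile(accepted, [10, 50, 90]))

Five reported cases can plausibly correspond to anywhere from a few dozen to well over a hundred infections, which is why the framework treats reporting, importation, and transmission uncertainty jointly rather than reading case counts at face value.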


[1703.07317] On the predictability of infectious disease outbreaks

arXiv, Physics > Physics and Society; Samuel V. Scarpino, Giovanni Petri



Infectious disease outbreaks recapitulate biology: they emerge from the multi-level interaction of hosts, pathogens, and their shared environment. As a result, predicting when, where, and how far diseases will spread requires a complex systems approach to modeling. Recent studies have demonstrated that predicting different components of outbreaks (e.g., the expected number of cases, the pace and tempo of cases needing treatment, importation probability, etc.) is feasible. Therefore, advancing both the science and practice of disease forecasting now requires testing for the presence of fundamental limits to outbreak prediction. To investigate the question of outbreak prediction, we study the information-theoretic limits to forecasting across a broad set of infectious diseases, using permutation entropy as a model-independent measure of predictability. Studying the predictability of a diverse collection of historical outbreaks (including gonorrhea, influenza, Zika, measles, polio, whooping cough, and mumps), we identify a fundamental entropy barrier for time series forecasting. However, we find that for most diseases this barrier to prediction is often well beyond the time scale of single outbreaks, implying prediction is likely to succeed. We also find that the forecast horizon varies by disease, and demonstrate that both shifting model structures and social network heterogeneity are the most likely mechanisms for the observed differences in predictability across contagions. Our results highlight the importance of moving beyond time series forecasting by embracing dynamic modeling approaches to prediction, and suggest challenges for performing model selection across long disease time series. We further anticipate that our findings will contribute to the rapidly growing field of epidemiological forecasting and may relate more broadly to the predictability of complex adaptive systems.
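For readers new to the predictability measure used here, below is a minimal sketch of normalized permutation entropy in its standard Bandt–Pompe form (our illustration; the paper's exact implementation and parameter choices may differ).

import math
from collections import Counter
import numpy as np

def permutation_entropy(series, order=3, delay=1):
    # Shannon entropy of ordinal patterns of length `order`, normalized to [0, 1]
    # by log(order!); values near 1 indicate a series that looks unpredictable.
    x = np.asarray(series)
    n_windows = len(x) - (order - 1) * delay
    patterns = Counter(
        tuple(np.argsort(x[i : i + order * delay : delay]))
        for i in range(n_windows)
    )
    probs = np.array(list(patterns.values()), dtype=float) / n_windows
    return float(-(probs * np.log(probs)).sum() / math.log(math.factorial(order)))

rng = np.random.default_rng(0)
print(permutation_entropy(rng.normal(size=2000)))          # white noise: close to 1
print(permutation_entropy(np.sin(np.arange(2000) * 0.3)))  # regular signal: well below 1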


A.I. Versus M.D.

The New Yorker, Siddhartha Mukherjee



… I walked with Lignelli-Dipple to her office. I was there to learn about learning: How do doctors learn to diagnose? And could machines learn to do it, too?


The Rise of FinTech and Compliance

NYU School of Law, Tom C.W. Lin



An important transformation is happening in the financial industry. The rise of new technology and compliance has dramatically altered many of the key functions and functionaries of modern finance. Artificial intelligence, algorithmic programs, and supercomputers, instead of human actors, now constitute the core of many financial operations. At the same time, compliance officers have become just as critical to financial institutions as traders, bankers, and analysts. Finance as we knew it has changed and continues to change.

My recent article, Compliance, Technology, and Modern Finance, offers a detailed commentary on these unfolding changes—the crosscutting developments in compliance, technology, and modern finance. It examines the concurrent and intersecting ascents of new financial technology and compliance as well as the potential perils linked with their ascents. It also highlights the larger implications of the changing financial landscape due to the growing roles of new technology and compliance. In particular, it focuses on the challenges of financial cybersecurity, the integration of technology and compliance, and the role of humans in modern finance.


Company Data Science News

Andrew Ng left Baidu, sending the stock plummeting 2.7 percent. Talk about knowing how much you are worth (that's $1.5 billion, if you're doing the math). Ng avoided explaining why in his Medium post. Maybe he'll be spending more time at Stanford, where he's still on faculty?

Y Combinator is trying out an all-AI cohort. The AI startups will get extra cloud computing credits and machine learning office hours.

Intel has launched an Artificial Intelligence Products Group that will report directly to its CEO, Brian Krzanich. The most exciting part of the new group is the AI research lab, which will be doing semi-blue-sky exploratory research into new algorithms, hardware, and architectures.

Researchers at OpenAI revived evolution strategies, an optimization technique that rivals reinforcement learning by yielding comparable results without as much training time and implementation difficulty.


Geoff Hinton will be the Chief Scientist of the new Vector Institute for Artificial Intelligence in Toronto. Vector is funded by C$170m from the Canadian and Ontario governments, in addition to an undisclosed amount from 30 businesses including Google and the Royal Bank of Canada. Hinton explained that his “main interest is in trying to find radically different kinds of neural nets”.

Capital One hired Kim Rees as Head of Data Visualization.

Cornell's IC3 group, based in NYC and dedicated to studying cryptocurrencies, is currently looking for a postdoc. Note also that the hedge fund Two Sigma shares Roosevelt Island space with Cornell's New York City campus.



Google’s AI Explosion in One Chart

MIT Technology Review, Antonio Regalado



These are some of the most elite academic journals in the world. And last year, one tech company, Alphabet’s Google, published papers in all of them.

The unprecedented run of scientific results by the Mountain View search giant touched on everything from ophthalmology to computer games to neuroscience and climate models. For Google, 2016 was an annus mirabilis during which its researchers cracked the top journals and set records for sheer volume.


SDI & Duke Partner on Deep Learning Research for Airport Screening

American Security Today, Tammy Waitt



Smiths Detection Inc. (SDI) is partnering with the Duke University Edmund T. Pratt Jr. School of Engineering, Department of Electrical and Computer Engineering, in a “deep learning” digital solution project to advance airport checkpoint x-ray system screening capabilities.

The U.S. Transportation Security Administration (TSA) has entered into a contract with Duke University for this deep learning initiative to refine and apply state-of-the-art machine learning techniques in the security space.


Are investors too optimistic about Amazon?

The Economist



Since the start of 2015 Amazon’s share price has risen by 173%, seven times the growth of the preceding two years. Operating profits have expanded, too, but at $4.2bn remain relatively small—which is how shareholders like it. Amazon has always emphasised the value of long-term growth (presumably with some bigger profits down the line), and investors have come to accept this. In February, when Amazon reported higher profits but lower revenue than expected, its share price temporarily dipped. Shareholders worried it might not be set to grow as quickly as they had hoped.

Morgan Stanley, a bank, expects Amazon’s sales to rise by a compound average of 16% each year from 2016 through to 2025: that is higher than its estimates for Google or Facebook.
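For a rough sense of scale (our arithmetic, not Morgan Stanley's): compounding 16% a year over that horizon implies sales multiplying several times over.

# Implied multiple on 2016 sales if revenue compounds at 16% a year through 2025;
# we read "through to 2025" as nine years of growth, so adjust `years` to taste.
growth, years = 0.16, 9
print(f"sales multiply by about {(1 + growth) ** years:.1f}x")   # roughly 3.8x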


Rattlesnakes on a Jetstream

Science Node, Tristan Fitzpatrick



Jetstream, the National Science Foundation’s (NSF) first cloud-based resource for science, technology, engineering, and mathematics (STEM) disciplines, has easy-to-use high-performance computing (HPC) capabilities that promote scientific research by allowing large amounts of data to be processed.

“We obtain massive amounts of data by sequencing thousands of genes,” says Marlis Douglas. “These data are obtained from a single rattlesnake, but we actually need to sequence 500 snakes for a realistic estimate of their diversity. Those data simply cannot be analyzed by just using desktop computers.”


On the Reuse of Scientific Data

Data Science Journal; Irene V. Pasquetto, Bernadette M. Randles, Christine L. Borgman



Over the last decade or so, a growing number of governments and funding agencies have promoted the sharing of scientific data as a means to make research products more widely available for research, education, business, and other purposes (European Commission High Level Expert Group on Scientific Data 2010; National Institutes of Health 2016; National Science Foundation 2011; Organisation for Economic Co-operation and Development 2007). Similar policies promote open access to data from observational research networks, governments, and other publicly funded agencies. Many private foundations also encourage or require the release of data from research they fund. Whereas policies for sharing, releasing, and making data open have the longest histories in the sciences and medicine, such policies have spread to the social sciences and humanities. Concurrently, they have spread from Europe and the U.S. to all continents.

The specifics of data sharing policies vary widely by research domain, country, and agency, but have many goals in common.

 
Events



Shake IT Montreal

LINKBYNET



Montreal, Quebec, Canada The only event dedicated to technological acceleration and hyperconvergence, April 19. [$$$]


Data Science Festival Mainstage Day

David Loughlan



London, England Due to the popularity of Data Science Festival events, we are now allocating event tickets via a random ballot. Registering here enters you into the ticket ballot for the Data Science Festival Mainstage day on Saturday April 29. [free]


Machine Learning in Finance Workshop 2017

Columbia University, Data Science Institute and Bloomberg



New York, NY Friday April 21 at Columbia University. [$$]


InSpire 2017: Developing the Health Informatics Workforce of the Future

American Medical Informatics Association



La Jolla, CA June 6-9, organized by American Medical Informatics Association [$$$]

 
Deadlines



Election Research Preacceptance Competition

What really happened on November 8, 2016? To enter and be eligible for prizes, you must register a design before the ANES releases the data. The organizers strongly advise that you complete your registration by Monday, March 27.

NSA Best Scientific Cybersecurity Paper Competition

In order to encourage the development of the scientific foundations of cybersecurity, the National Security Agency (NSA) established The Annual Best Scientific Cybersecurity Paper Competition. NSA invites nominations of papers that show an outstanding contribution to cybersecurity science. Deadline for submissions is March 31.

PET Award

Nominations for PET Symposium’s 2017 Caspar Bowden Award for Outstanding Research in Privacy Enhancing Technologies will be accepted through April 5.

Women in Computer Vision Workshop

Honolulu, HI July 26, in conjunction with CVPR. Deadline for abstract submissions is April 21.

Rewarding Disobedience — MIT Media Lab



The Media Lab Disobedience Award seeks to highlight effective, responsible, ethical disobedience across disciplines (scientific research, civil rights, freedom of speech, human rights, and the freedom to innovate, for example). Deadline for nominations is May 1.

DH@Guelph Summer Workshops 2017 (May 8-11, 2017) – Digital Scholarship Ontario

Guelph, Ontario, Canada The University of Guelph is hosting a series of 4-day workshops on topics related to digital humanities research and teaching from May 8-11, 2017. Deadline for registration is May 1.

Beyond ILSVRC workshop 2017

The workshop will mark the last of the ImageNet Challenge competitions and will focus on unanswered questions and directions for the future. It will 1) present current results on the challenge competitions, including new tester challenges, and 2) review the state of the art in recognition as viewed through the lens of the challenge's object detection (in images and videos) and classification competitions.
 
NYU Center for Data Science News



Using AI To Generate Images

NYU Center for Data Science



As CDS’ Kyle Cranmer recently explained in an article for Nature, a gap exists between the theory and practical engineering of neural networks. Currently, neural networks produce sound results, but how they do so is still a “black box.” Neural networks initially just mapped an input to a prediction. Now, however, generative networks produce images of dogs, cats, or galaxies that look real, suggesting that their ability to generate real-world representations has improved, thereby easing some of the apprehension scientists have had about their ‘black box’ nature.

The photorealistic images that generative models create are immensely beneficial for scientists and researchers who need to perform image reconstruction, or to fill in deformities in images through simulation. The applications of this technology, as Cranmer remarks, are “pretty endless.”
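For readers who have not met generative adversarial training, here is a deliberately tiny sketch of the idea on one-dimensional data rather than images (our illustration in PyTorch, not CDS code; the "real" distribution, layer sizes, and learning rates are arbitrary choices).

import torch
import torch.nn as nn

real_dist = lambda n: torch.randn(n, 1) * 0.5 + 2.0   # stand-in "real" data: N(2, 0.5)
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))                # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real, noise = real_dist(64), torch.randn(64, 8)
    fake = G(noise)
    # Discriminator: label real samples 1 and generated samples 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: try to make the discriminator label its samples 1.
    g_loss = bce(D(G(noise)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print("generated sample mean:", G(torch.randn(1000, 8)).mean().item())  # should drift toward ~2.0

The same adversarial loop, scaled up to convolutional networks and image data, is what produces the photorealistic dogs, cats, and galaxies described above.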


A Perspective on Natural Language Understanding Capability: An Interview with Sam Bowman

Mary Ann Liebert Publishing, Big Data journal



Prof. Vasant Dhar, (Editor-in-Chief, Big Data): Sam, can you broadly classify the approaches we have taken in the last 30–40 years to understand natural language? What have been our experiences? What has worked? What has not worked?

Prof. Bowman: I can only give a very high-level overview of the earlier part of this history, but there are three broad families of approaches to language understanding in natural language processing (NLP).

From the 1960s through right around 1990, the predominant mode of research was rule and template based and very strictly symbolic. This kind of work on language understanding involved quite a lot of knowledge engineering, of trying to build systems around expert-designed representations for the semantics of various kinds of situations and the meanings of words.

The CYC project seems like a good example of the kind of work that was at the center of a lot of this type of thinking.

 
Tools & Resources



Evolution Strategies as a Scalable Alternative to Reinforcement Learning

OpenAI; Andrej Karpathy, Tim Salimans, Jonathan Ho, Peter Chen, Ilya Sutskever, John Schulman, Greg Brockman & Szymon Sidor



We’ve discovered that evolution strategies (ES), an optimization technique that’s been known for decades, rivals the performance of standard reinforcement learning (RL) techniques on modern RL benchmarks (e.g. Atari/MuJoCo), while overcoming many of RL’s inconveniences.

In particular, ES is simpler to implement (there is no need for backpropagation), it is easier to scale in a distributed setting, it does not suffer in settings with sparse rewards, and has fewer hyperparameters. This outcome is surprising because ES resembles simple hill-climbing in a high-dimensional space based only on finite differences along a few random directions at each step.
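A minimal numpy sketch of the update described above — sample random directions, score each perturbation, and move the parameters toward the better-scoring ones. This is our illustration of the idea rather than OpenAI's code, and the quadratic toy objective merely stands in for an episode's return.

import numpy as np

rng = np.random.default_rng(0)

def reward(theta):
    # Toy objective standing in for an RL episode return: highest at `target`.
    target = np.array([0.5, 0.1, -0.3])
    return -np.sum((theta - target) ** 2)

theta = np.zeros(3)
sigma, alpha, population = 0.1, 0.02, 50

for iteration in range(300):
    noise = rng.standard_normal((population, theta.size))   # a few random directions
    rewards = np.array([reward(theta + sigma * eps) for eps in noise])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Move theta toward perturbations that scored above average; no backpropagation needed.
    theta = theta + alpha / (population * sigma) * noise.T @ advantages

print(np.round(theta, 2))   # ends up close to [0.5, 0.1, -0.3]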


Hackathons: 6 Alternative Outcomes

The Huffington Post, Liz Gerber



“Practitioners and researchers increasingly agree that people generate less code and digital artifacts [at hackathons] than through traditional programming strategies. This so-called productivity loss has led many researchers to conclude that hackathons are not productive. I question this conclusion because it is based on efficient code generation as the primary effectiveness outcome and on studies that do not examine how or why organizations use hackathons.”

“Through participation and interviews with organizers, participants, and non-participants, I suggest alternative outcomes: (1) supporting talent recruitment and selection, (2) supporting technical capacity and expertise, (3) supporting expansion of social networks, (4) providing exposure to rapid development process, (5) impressing clients or funders, and (6) providing income for the organization.”


Modules vs. microservices

O'Reilly Radar, Sander Mak



Much has been said about moving from monoliths to microservices. Besides rolling off the tongue nicely, it also seems like a no-brainer to chop up a monolith into microservices. But is this approach really the best choice for your organization? It’s true that there are many drawbacks to maintaining a messy monolithic application. But there is a compelling alternative which is often overlooked: modular application development. In this article, we’ll explore what this alternative entails and show how it relates to building microservices.


Automatically identifying wild animals in camera trap images with deep learning

Jeff Clune et al.



Having accurate, detailed, and up-to-date information about wildlife location and behavior across broad geographic areas would revolutionize our ability to study, conserve, and manage species and ecosystems. Currently such data are mostly gathered manually at great expense, and thus are sparsely and infrequently collected. Here we investigate the ability to automatically, accurately, and inexpensively collect such data, which could transform many fields of biology, ecology, and zoology into “big data” sciences. Motion-sensor cameras called “camera traps” enable pictures of wildlife to be collected inexpensively, unobtrusively, and at high volume. However, identifying the animals, animal attributes, and behaviors in these pictures remains an expensive, time-consuming, manual task often performed by researchers, hired technicians, or crowdsourced teams of human volunteers. In this paper, we demonstrate that such data can be automatically extracted by deep neural networks (aka deep learning), a cutting-edge type of artificial intelligence. In particular, we use the existing human-labeled images from the Snapshot Serengeti dataset to train deep convolutional neural networks for identifying 48 species in 3.2 million images taken from Tanzania’s Serengeti National Park. We train neural networks that automatically identify animals with over 92% accuracy, and we expect that number to improve rapidly in years to come. More importantly, we can choose to have our system classify only the images it is highly confident about, allowing valuable human time to be focused only on challenging images. In this case, our system can automate animal identification for 98.2% of the data while still performing at the same 96.6% accuracy level as crowdsourced teams of human volunteers, saving approximately 8.3 years (at 40 hours per week) of human labeling effort (i.e. over 17,000 hours) on a 3.2-million-image dataset. Those efficiency gains immediately highlight the importance of using deep neural networks to automate data extraction from camera trap images. The improvements in accuracy we expect in years to come suggest that this technology could enable the inexpensive, unobtrusive, high-volume, and perhaps even real-time collection of information about vast numbers of animals in the wild.
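The "classify only what the network is confident about" step behind the 98.2%/96.6% trade-off is easy to sketch. Below is our illustration, not the paper's code; the 0.95 cutoff is an arbitrary stand-in for whatever threshold the authors tune.

import numpy as np

def route_by_confidence(probs, threshold=0.95):
    # `probs`: per-image softmax probabilities over species (rows sum to 1).
    # Keep automatic labels only where the top class clears `threshold`;
    # everything else is routed to human volunteers.
    confidence = probs.max(axis=1)
    auto = confidence >= threshold
    labels = probs.argmax(axis=1)
    coverage = auto.mean()              # fraction of images labeled automatically
    return labels, auto, coverage

# Toy example: four images, three species.
probs = np.array([[0.98, 0.01, 0.01],
                  [0.50, 0.30, 0.20],
                  [0.05, 0.94, 0.01],
                  [0.97, 0.02, 0.01]])
labels, auto, coverage = route_by_confidence(probs)
print(labels[auto], f"{coverage:.0%} handled automatically")   # [0 0] 50% handled automatically

Raising the threshold trades coverage (how much labeling is automated) for accuracy on the automated portion, which is exactly the knob the authors turn to match crowdsourced-volunteer accuracy.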

 
Careers


Full-time positions outside academia

Deep Learning at Amazon Web Services



Amazon; Seoul, Korea and Beijing/Shanghai, China

Newsperson/ Visual Producer



Associated Press; New York, NY

Supervisory Librarian



Library of Congress; Washington, DC

Data Engineer



Carolina Hurricanes; Raleigh, NC

Full-time, non-tenured academic positions

Research Scientist, Cancer Genetics



University of Birmingham; Birmingham, England

Postdocs

Postdoc in Data Science for Comparative Effectiveness Research



Harvard T.H. Chan School of Public Health; Boston, MA
