Data Science newsletter – March 7, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for March 7, 2018


 
 
Data Science News



Canadian Researchers Get Their Third Multi-Petaflop Supercomputer

TOP500 Supercomputer Sites, Michael Feldman



Niagara, Canada’s newest multi-petaflop supercomputer tasked for academic use, is now available to researchers across the country.

The system, which is installed at a facility run by the University of Toronto, is the third in a trio of new multi-petaflop supercomputers that have been deployed in Canada over the past year. According to the announcement, Niagara will deliver over three petaflops of performance, putting it on par with its two academic brethren: Cedar, a 3.7-petaflop supercomputer installed at Simon Fraser University, and Graham, a 2.6-petaflop system deployed at University of Waterloo.


Teaching Computers to Guide Science: New Machine Learning Method Sees the Forests and the Trees

Lawrence Berkeley Lab



While it may be the era of supercomputers and “big data,” without smart methods to mine all that data, it’s only so much digital detritus. Now researchers at the Department of Energy’s Lawrence Berkeley National Laboratory (Berkeley Lab) and UC Berkeley have come up with a novel machine learning method that enables scientists to derive insights from systems of previously intractable complexity in record time.

In a paper published recently in the Proceedings of the National Academy of Sciences, the researchers describe a technique called “iterative Random Forests,” which they say could have a transformative effect on any area of science or engineering with complex systems, including biology, precision medicine, materials science, environmental science, and manufacturing, to name a few.

“Take a human cell, for example. There are 10^170 possible molecular interactions in a single cell. That creates considerable computing challenges in searching for relationships,” said Ben Brown, head of Berkeley Lab’s Molecular Ecosystems Biology Department. “Our method enables the identification of interactions of high order at the same computational cost as main effects – even when those interactions are local with weak marginal effects.”


Microsoft targets university students with new Azure for Students plan

ZDNet, Mary Jo Foley



Microsoft is adding a second student-focused plan for Azure to its line-up, which includes free access to a number of the company’s AI services.


SAS is on the brink of…something.

Thomas W. Dinsmore, ML/DL blog



SAS has a firm hold on its Global 2000 enterprise base, which isn’t going away anytime soon.

I wouldn’t be so sanguine. In fact, the most consistent message one hears from G2000 CDOs is “we want to reduce our SAS footprint.”


Strengthening AI R&D Among China’s 2018 Innovation Goals

Medium, Synced



In his address to the 13th National People’s Congress (NPC) in Beijing on March 5, Chinese Premier Keqiang Li delivered this year’s official government work report, identifying “reinforcing the new generation of artificial intelligence research and development” among the state’s innovation focuses for 2018.

Premier Li provided a three-part overview of government work, including a review of the past five years, policy direction for this year’s socioeconomic development, and suggestions for upcoming government work in 2018. The report’s keywords include “artificial intelligence” along with “agricultural supply-side structural reform,” “shared economy,” “digital healthcare,” and “fintech.”


ML 2.0: Machine learning for many

MIT News, MIT Laboratory for Information and Decision Systems



Today, when an enterprise wants to use machine learning to solve a problem, they have to call in the cavalry. Even a simple problem requires multiple data scientists, machine learning experts, and domain experts to come together to agree on priorities and exchange data and information.

This process is often inefficient, and it takes months to get results. It also only solves the immediate problem at hand. The next time something comes up, the enterprise has to do the same thing all over again.

One group of MIT researchers wondered, “What if we tried another strategy? What if we created automation tools that enable the subject matter experts to use ML, in order to solve these problems themselves?”

For the past five years, Kalyan Veeramachaneni, a principal research scientist at MIT’s Laboratory for Information and Decision Systems, along with Max Kanter and Ben Schreck, who began working with Veeramachaneni as MIT students and later co-founded machine learning startup Feature Labs, has been designing a rigorous paradigm for applied machine learning.


Lottery hacking, winning millions

FlowingData, Nathan Yau



I always love a good lottery hacking story. Jason Fagone for The Huffington Post chronicles the winnings of Gerald and Marge Selbee, a retired couple from a small town in Michigan. It is a story of probabilities, expected values, and arduously buying a lot of tickets to maximize profits.


Fewer international grad students are seeking computer science degrees in the U.S. for the first time in years

GeekWire, Monica Nickelsburg



The number of international students enrolling in American universities is declining for the first time in years, amid volatile shifts in U.S. immigration policy.

That’s according to the latest data from the federal government’s National Science Board. The number of international graduate students enrolled in U.S. science and engineering programs dropped 6 percent between 2016 and 2017 and 5 percent in non-science and engineering fields. The decline was driven by fewer international grad students seeking computer science and engineering degrees. International student enrollment had been increasing since 2012 until last year.

The total number of foreign-born students enrolled in undergraduate programs in the U.S. also fell 2 percent over the same time period. NSB notes that among undergrads, the decline isn’t happening in computer science or math majors, which actually increased. Engineering, social sciences, and non-STEM fields drove the overall drop for undergrads.


Predicting the next deadly outbreak

CNN, Victoria Brown and Tom Page



They came from different backgrounds, among them journalists, health professionals, writers and film producers. [Rainier] Mallol was a computer engineering graduate and hit it off with Malaysian public health specialist Dr Dhesi Raja on their first day on the program.

The two talked about dengue and the inability to forecast outbreaks, and conceived an algorithm-based approach to monitor the disease. Mallol created a basic artificial intelligence software in a week, feeding it Dengue statistics and asking it to predict three months ahead.

“We waited about 10 weeks then compared the results,” he says. “We obtained about 81% accuracy.”

It was the first step towards co-founding their company, Artificial Intelligence in Medical Epidemics (AIME), which today claims an improved average accuracy of 86.4%. Their company’s Dengue Outbreak Prediction platform now supplies the Malaysian government and regional governments in Brazil and the Philippines with insights to manage and curb outbreaks.


Love is All You Need

Albert-László Barabási



In the past weeks, I have received several requests to address the merits of the Anna D. Broido and Aaron Clauset (BC) preprint [1] and their fruitless search for scale-free networks in nature. The preprint’s central claim is deceptively simple: It starts from the textbook definition of a scale-free network as a network with a power law in the degree distribution [2]. It then proceeds to fit a power law to 927 networks, finding that only 4% are scale-free. The authors’ conclusion that ‘scale-free networks are rare’ is turned into the title of the preprint, helping it to get maximal attention. It worked: Quanta magazine accepted its conclusions without reservations. After The Atlantic carried the article, the un-refereed preprint received a degree of media exposure that the original discovery of scale-free networks never enjoyed.

While I saw the conceptual problems with the manuscript, I was convinced that the paper must be technically proficient. Yet, once I did dig into it, it was a real ride. If you have the patience to get to the end of this commentary, you will see where it fails at the conceptual level. But, we will learn that it also fails, repeatedly, at the technical level.


Government Data Science News

The FBI was secretly (but not secretly enough) in cahoots with Best Buy’s Geek Squad. Geek Squad agents were reviewing the contents of computers brought in for repair by customers, looking for illegal content like child pornography. The Bureau would not comment on whether it had similar arrangements with other customer-facing companies.



The extremely important 2020 US Census is nearly upon us. Statistician and data visualizer Nathan Yau put together a visual explainer about why the Census is important. Content is suitable for use with people as young as 5 or 6. That’s hard to do.



The State Department was allotted $120 million to fight foreign meddling in US elections in 2016. They have spent $0. None of their 23 analysts speak Russian, but there’s a hiring freeze, so they can’t hire anyone who does or anyone who has specific expertise in the relevant type of analysis. Sigh.



Canada has switched on its third multi-petaflop supercomputer open to academic researchers. The new system, Niagara, is installed at the University of Toronto.



China continues to make announcements about its commitment to data science, including fintech, artificial intelligence, and digital healthcare.



The National Science Board reports that the US is losing its attraction for top international graduate students. A new report finds that “international graduate students enrolled in U.S. science and engineering programs dropped 6 percent between 2016 and 2017 and 5 percent in non-science and engineering fields.”



Ajit Pai and the FCC have struck down net neutrality protections. States like Washington are attempting to institute net neutrality protections at the state level, setting the stage for a legal showdown.



The SEC is trying to protect bitcoin investors and traders from their own complicity in the bitcoin hype machine by reminding them that most exchanges are not registered with the SEC and therefore offer none of the protections associated with SEC-registered exchanges. They have taken a page from the Buzzfeed playbook and offered an article: “13 questions to see if your bitcoin exchange is run by weasels.”
[I may have changed the title slightly.]

Representative Robin Kelly, a Democrat from Illinois, is fighting to restore budgets for IT research, arguing that “this administration’s science, immigration and education policies are all working together to reduce the U.S. lead in AI technologies.” In closely related news, “The National Science Foundation … was already unable to fund $174 million in promising artificial intelligence research last year due to budget constraints.”


Google Is Helping the Pentagon Build AI for Drones

Gizmodo, Kate Conger and Dell Cameron



Google has partnered with the United States Department of Defense to help the agency develop artificial intelligence for analyzing drone footage, a move that set off a firestorm among employees of the technology giant when they learned of Google’s involvement.

Google’s pilot project with the Defense Department’s Project Maven, an effort to identify objects in drone footage, has not been previously reported, but it was discussed widely within the company last week when information about the project was shared on an internal mailing list, according to sources who asked not to be named because they were not authorized to speak publicly about the project.


Accelerating clinical research through mobile technology

PLOS Blogs Network, Carolyn Graybeal



Researchers face a number of challenges when conducting a clinical study. Investigators spend considerable time and money recruiting and screening viable participants. If recruitment takes too long, important studies can get scrapped before they are even started. Once a study is underway, participants must sacrifice their own time to make clinic visits, which, for long-term studies, can reduce participant retention. Incorporating internet and mobile technologies into a study’s design can relieve some of these burdens. Research efforts like the University of California San Francisco’s Health eHeart Study capitalize on the ubiquity and convenience of mobile technology to improve data collection and make it easier for people to participate.

The Health eHeart Study is a long-term, internet-based study exploring the causes of cardiovascular disease, the leading cause of death in the United States, affecting individuals across all ages and backgrounds. Its prevalence makes it all the more important for researchers to be able to cast a wide net for study participants. By using online surveys, smartphone apps, and at-home tests, Health eHeart makes it easier to engage participants and collect data.

“Making it easy for people to participate, making it so they don’t have to come to a clinic, is important for getting large numbers and obtaining diverse populations,” said Dr. Jeffrey Olgin, Professor of Medicine at UCSF and one of the lead investigators of the study. “For example, it can be hard for people in rural communities to participate [if they need to drive long distances to a lab] or to attract busy people.”


Privacy at the Margins | Refractive Surveillance: Monitoring Customers to Manage Workers

International Journal of Communication; Karen Levy and Solon Barocas



Collecting information about one group can facilitate control over an entirely different group—a phenomenon we term refractive surveillance. We explore this dynamic in the context of retail stores by investigating how retailers’ collection of data about customers facilitates new forms of managerial control over workers. We identify four mechanisms through which refractive surveillance might occur in retail work, involving dynamic labor scheduling, new forms of evaluation, externalization of worker knowledge, and replacement through customer self-service. Our research suggests that the effects of surveillance cannot be fully understood without considering how populations might be managed on the basis of data collected about others. [full-text pdf download]


The state of the field in pre-college computer science education: Highly recommended Google report

Mark Guzdial, Computing Education Research Blog



Google has just released a report: Pre-College Computer Science Education: A Survey of the Field (available here). The report is authored by Paulo Blikstein of Stanford. The report is innovative, developed with an unusual method. It’s terrific, and I highly recommend it.

Paulo started out with a pretty detailed survey document about the state of the literature in computer science education. He covered from the 1967 launch of Logo to modern day. Then he interviewed 14 researchers in the field (I was one). These were detailed interviews, where the interviewees got to review the transcript afterwards. Paulo integrated ideas and quotes from the interviews into the document. Here comes the really cool part: he put the whole thing on a Google doc and let everyone comment on it.

When I got the call to review the document, I just skimmed it. It looked pretty good to me. But then the debates started, and the fights broke out. That Google doc had some of the longest threads of comments I’ve ever seen. After a few weeks, Paulo closed the comments, and then integrated the threads into the document. So now, it’s not just a serious survey paper, brought up to date with interviews. It’s also a record of significant debate between over a dozen researchers, where the tensions and open questions were surfaced.

This is the document to read to figure out what should come next in computing education research. I will recommend it to all of my students.

 
Events



The Gupta Family Hackathon for Health Communication

Gupta Family Foundation, University of Michigan



Ann Arbor, MI “Application is now open for participation in marathon event March 23-25, focused on innovation for sharing information in times of crisis & beyond.” Organized by the University of Michigan Institute for Healthcare Policy & Innovation.


Machines + Media

NYC Media Lab



New York, NY Tuesday, May 15 at Bloomberg (731 Lexington Avenue, 7th Floor). “NYC Media Lab’s Machines + Media conference, generously sponsored and hosted by Bloomberg for a second year, will focus on new applications of data science and technology in media and journalism.” [application required]


Conference Programme – Conference on Data Justice

Cardiff University Data Justice Lab



Cardiff, Wales May 21-22 at Cardiff University. “This conference will examine the intricate relationship between datafication and social justice by highlighting the politics and impacts of data-driven processes and exploring different responses.” [$$]

 
Deadlines



Call for Contributions: Data Carpentry Ecology and Software Carpentry Curriculum Advisory Committees

“Due to overwhelming enthusiasm from the Maintainer community, we are now recruiting for Curriculum Advisors for the Data Carpentry Ecology lessons and the Software Carpentry full lesson stack. Applications are open to all Carpentry community members. We strongly encourage applications from community members with current classroom teaching experience, university or college faculty and staff, and Maintainers for these lessons.” Deadline for applications is March 16.

OpenAI Scholars

OpenAI is “providing 6-10 stipends and mentorship to individuals from underrepresented groups to study deep learning full-time for 3 months and open-source a project.” Deadline to apply is March 31.

Request for Information: Soliciting Input for the National Institutes of Health Strategic Plan for Data Science

“Data science is an integral component of modern biomedical research. It is the interdisciplinary field of inquiry in which quantitative and analytical approaches, processes, and systems are developed and used to extract knowledge and insights from increasingly large and/or complex sets of data. Data science has increased in importance for biomedical research over the past decade and NIH expects that trend to continue. In order to capitalize on the opportunities presented by advances in data science, and overcome key challenges, the NIH is developing a Strategic Plan for Data Science.” Deadline for responses is April 2.

Sports/Media/Tech Startup Bootcamp – Open Call

“Four teams will be selected to participate in this bootcamp. Kicking off in June of 2018, NYC Media Lab will execute an 8-week Lean Launchpad program to encourage and support customer discovery and market validation.” Deadline to apply is April 13.

Geopolitical Forecasting [GF] Challenge

“Solvers will be invited to the GF Challenge Platform to compete. Each solver or team will be assigned a unique application programming interface (API) token and login account for the platform. A steady stream of questions (roughly 25 per month) will be released for solvers to produce probabilistic forecasts against (specific requirements will be described in the rules). Questions may be binary (yes/no), multiple choice, or ordered outcome (binned quantity/date). In addition to the questions, solvers will receive access to a continuously updated stream of forecast judgments produced by a crowd of human forecasters. Solvers are encouraged to create solutions that use this stream of human judgments alongside other publicly available data streams and information to create their forecasting solutions.” This IARPA and HeroX challenge closes on September 7.
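
Probabilistic forecasts like the ones described here are commonly scored with a proper scoring rule. The sketch below uses the Brier score as one such rule. The announcement excerpt does not say how the challenge scores entries, so the choice of rule, the function name `brier_score`, and the example forecasts are all assumptions made for illustration.

```python
def brier_score(forecast, outcome_index):
    """Multi-category Brier score (illustrative, not the challenge's rule).

    Sums the squared differences between the forecast probabilities and
    the one-hot vector of what actually happened. Lower is better.
    """
    if abs(sum(forecast) - 1.0) > 1e-9:
        raise ValueError("forecast probabilities must sum to 1")
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(forecast)
    )

# Binary question (yes/no): forecast 80% "yes", and "yes" happens.
binary = brier_score([0.8, 0.2], outcome_index=0)

# Multiple-choice question with three options; the third option happens.
multi = brier_score([0.2, 0.3, 0.5], outcome_index=2)
```

Note that this plain Brier score ignores the ordering of bins, so for ordered-outcome questions a forecast that misses by one bin is penalized the same as one that misses by several.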
 
Moore-Sloan Data Science Environment News



Understanding Particle Physics: Data Science Jets into the Future

Medium, NYU Center for Data Science



In new research from Joan Bruna and Kyunghyun Cho (both CDS Faculty and Assistant Professors of Computer Science and Data Science), Kyle Cranmer (CDS Affiliated Faculty and Associate Professor of Physics), Isaac Henrion, and Johann Brehmer, jets are represented as graphs and studied with a Message Passing Neural Network (MPNN) rather than a recursive neural network. The researchers applied their MPNN to a binary classification problem, attempting to distinguish between two classes of jets: QCD jets (arising from a known mixture of quarks and gluons) and W jets (arising from bosons decaying into two quarks).

On this classification task, MPNNs performed significantly better than recursive neural networks.
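
To make the graph idea concrete, here is a minimal sketch of message passing for graph classification in NumPy. It is not the researchers' model: the fully connected toy graph, the feature sizes, the random (untrained) weights, and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def message_passing_step(node_features, adjacency, w_self, w_neighbor):
    """One round of message passing: each node sums its neighbors'
    features, then mixes that sum with its own features through a ReLU."""
    messages = adjacency @ node_features          # sum over neighbors
    updated = node_features @ w_self + messages @ w_neighbor
    return np.maximum(updated, 0.0)               # ReLU nonlinearity

def classify_graph(node_features, adjacency, w_self, w_neighbor, w_out, steps=2):
    """Run a few message-passing rounds, pool over nodes, and return a
    probability for the positive class (hypothetically, "W jet")."""
    h = node_features
    for _ in range(steps):
        h = message_passing_step(h, adjacency, w_self, w_neighbor)
    pooled = h.sum(axis=0)                        # order-independent readout
    logit = pooled @ w_out
    return 1.0 / (1.0 + np.exp(-logit))           # sigmoid

rng = np.random.default_rng(0)
n_particles, n_features = 4, 3
features = rng.normal(size=(n_particles, n_features))
# Toy fully connected graph without self-loops (hypothetical choice).
adjacency = np.ones((n_particles, n_particles)) - np.eye(n_particles)
w_self = rng.normal(size=(n_features, n_features))
w_neighbor = rng.normal(size=(n_features, n_features))
w_out = rng.normal(size=n_features)

prob = classify_graph(features, adjacency, w_self, w_neighbor, w_out)
```

Because the readout sums over nodes, the output does not depend on the order in which the particles are listed, which is one reason a graph representation suits an unordered set of particles.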

 
Tools & Resources



The Building Blocks of Interpretability

Distill, Chris Olah et al.



“Interpretability techniques are normally studied in isolation.
We explore the powerful interfaces that arise when you combine them — 
and the rich structure of this combinatorial space.”


Leveraging GANs to combat adversarial examples

Approximately Correct blog, Alex Dimakis



“How can we defend against such attacks? A deluge of recent papers have proposed methods for defenses and counter-attacks. The problem is surprisingly complex and many natural defenses can be easily defeated by creating new adversarial images, designed to break the defenses.”

“The tremendous attention by the research community is well justified. Beyond security concerns, adversarial examples illustrate that our modern complex models can be making correct predictions for completely wrong reasons.”


NEON Data page

National Ecological Observatory Network



Showing 180 of 180 data products at 79 of 79 sites, Dec 2010 – Mar 2018


Introducing Coördinator: A new open source project made at Spotify to inject some whimsy into data visualizations

Spotify Labs, Aliza Aufrichtig



“Coördinator is an open source browser interface to help you turn an SVG into XY coordinates. That means you can now take any SVG file, turn it into dots, and use those dots in a data visualization.”
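
The basic idea of turning SVG shapes into XY dots can be sketched in a few lines of Python. This is not how Coördinator itself works (it is a browser interface); the sample SVG, the function name `svg_circles_to_xy`, and the restriction to `<circle>` elements are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# SVG elements live in this XML namespace.
SVG_NS = "{http://www.w3.org/2000/svg}"

def svg_circles_to_xy(svg_text):
    """Return the (x, y) center of every <circle> in an SVG document.

    Missing cx/cy attributes default to 0, as in the SVG specification.
    """
    root = ET.fromstring(svg_text)
    return [
        (float(c.get("cx", "0")), float(c.get("cy", "0")))
        for c in root.iter(SVG_NS + "circle")
    ]

# Hypothetical two-dot SVG used only to exercise the function.
sample_svg = """<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">
  <circle cx="1" cy="2" r="0.5"/>
  <circle cx="3.5" cy="4" r="0.5"/>
</svg>"""

points = svg_circles_to_xy(sample_svg)
```

The resulting list of coordinate pairs can then be fed to any plotting or visualization library as ordinary data points.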


PLOS Criteria for Recommended Data Repositories

PLOS Blogs Network



Since 2015, the PLOS journals have maintained a list of repositories that we have determined to be suitable for authors depositing datasets that accompany PLOS articles. Selection of an appropriate repository allows researchers to maximize the visibility of their data while ensuring the data and related meta-data meet field-specific standards within their research community. We therefore encourage authors to deposit data to recommended field-specific, multi-disciplinary or institutional repositories when possible and put data in Supporting Information files only when a suitable repository is not available.

 
Careers


Full-time, non-tenured academic positions

Scholarly Publication & Communication Specialist



University of Washington, UW Libraries; Seattle, WA

Co-ordinator for Synthetic Biology Centre (Fixed Term)



University of Cambridge, Department of Plant Sciences; Cambridge, England

Django developer with an interest in Health Care



University of Oxford, Evidence-Based Medicine DataLab; Oxford, England

Postdocs

Postdoctoral Fellow



New York University Social Media and Political Participation (SMaPP) Lab; New York, NY
