Data Science newsletter – July 6, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for July 6, 2017

GROUP CURATION: N/A

 
 
Data Science News



Duke Launches Project Baseline Study Enrollment in North Carolina

Duke School of Medicine


Duke will enroll approximately 1,000 participants at two locations in North Carolina — in Kannapolis at the North Carolina Research Campus, as well as in Durham at Duke University Medical Center. Community advisory boards in both Kannapolis and Durham will provide crucial guidance to investigators to maintain the participant-centric mission of the study and ensure its success.

Anyone who is at least 18 years old, a resident of the United States, able to speak and read English or Spanish, and not severely allergic to nickel or metal jewelry may register online at projectbaseline.com to enroll in the Project Baseline registry and be considered for the study. Importantly, participants will be partners in all aspects of the study, having a voice in the direction of the study as members of committees with direct access to study leadership.


What’s in a name? Big Data reveals distinctive patterns in higher education systems

EurekAlert! Science News, University of Chicago Medical Center


Using lists of names collected from publicly available websites, two University of Chicago researchers have revealed distinctive patterns in higher education systems, ranging from ethnic representation and gender imbalance in the sciences, to the presence of academic couples, and even the illegal hiring of relatives in Italian universities.

“This study was an exercise in exploiting bare-bones techniques,” said author Stefano Allesina, PhD, professor of ecology & evolution and a member of the Computation Institute at the University of Chicago. “We wanted to analyze the simplest form of data you could imagine: lists of names. That’s all we had. We wondered what kinds of information we could extract from such a meager source of data. We also asked: how could we use this to explore real-world problems?”
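
As a rough illustration of what bare lists of names can support, here is a minimal Python sketch in the spirit of the surname analysis mentioned above: a Monte Carlo test of whether a department's surnames repeat more than chance would predict. The roster, the surname pool, and the frequencies are toy placeholders, not the study's data or method.

    import random

    def nepotism_pvalue(dept_surnames, pool_names, pool_weights, trials=10_000):
        """Monte Carlo sketch: how often does a random department of the same
        size, drawn from the national surname frequency distribution, contain
        as few (or fewer) distinct surnames as the observed department?"""
        observed = len(set(dept_surnames))
        k = len(dept_surnames)
        hits = 0
        for _ in range(trials):
            sample = random.choices(pool_names, weights=pool_weights, k=k)
            if len(set(sample)) <= observed:
                hits += 1
        return hits / trials

    # Toy, hypothetical data: a nine-person department with heavy surname repetition
    dept = ["Rossi"] * 3 + ["Bianchi"] * 2 + ["Ferrari", "Russo", "Romano", "Colombo"]
    names = [f"Surname{i}" for i in range(5000)]
    weights = [1] * len(names)  # stand-in for real national surname frequencies
    print(nepotism_pvalue(dept, names, weights))  # a small value suggests clustering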


Compliance Culture or Culture Change? The role of funders in improving data management and sharing practice amongst researchers

Research Ideas & Outcomes, Cameron Neylon


There is a wide and growing interest in promoting Research Data Management (RDM) and Research Data Sharing (RDS) from many stakeholders in the research enterprise. Funders are under pressure from activists, from government, and from the wider public agenda towards greater transparency and access to encourage, require, and deliver improved data practices from the researchers they fund.

Funders are responding to this, and to their own interest in improved practice, by developing and implementing policies on RDM and RDS. In this review we examine the state of funder policies, the process of implementation and available guidance to identify the challenges and opportunities for funders in developing policy and delivering on the aspirations for improved community practice, greater transparency and engagement, and enhanced impact.

The review is divided into three parts. The first two components are based on desk research: a survey of existing policy statements drawing in part on existing surveys and a brief review of available guidance on policy development for funders. The third part addresses the experience of policy implementation through interviews with funders, policy developers, and infrastructure providers.


Practical parallelism

MIT News


Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory have developed a new system that not only makes parallel programs run much more efficiently but also makes them easier to code.


Earn a $1.5 Million Prize at ‘Kaggle!’ (American Applicants Only, Please.)

Undark magazine, Jeremy Hsu


Earlier this year, Google undercut its Silicon Valley rivals by acquiring what it described as “the world’s largest community of data scientists and machine learning enthusiasts” in the online platform Kaggle. The acquisition gave Google more direct access to Kaggle’s one million members, who compete and earn prize money for developing artificial intelligence solutions to all manner of data analysis problems — from improving the algorithms used by the online real estate giant Zillow, to helping a satellite company use data to “track the human footprint in the rainforest.”

Response to the Google merger within the tightly-knit Kaggle community was somewhat mixed, but the platform’s international membership, which includes researchers in at least 194 countries, is now in revolt over Kaggle’s recent decision to host a U.S. government competition — one worth $1.5 million in prize money — that is legally off limits to foreigners.

The current controversy involves the Passenger Screening Algorithm Challenge sponsored by the U.S. Department of Homeland Security. Federal legislation known as the America COMPETES Act, enacted in 2007 and reauthorized under Barack Obama in 2011, bars the government, among other things, from awarding federal prize money like what DHS is offering to anyone who is not an American citizen or permanent resident.


AI is changing how we do science. Get a glimpse

Science, Latest News


Particle physicists began fiddling with artificial intelligence (AI) in the late 1980s, just as the term “neural network” captured the public’s imagination. Their field lends itself to AI and machine-learning algorithms because nearly every experiment centers on finding subtle spatial patterns in the countless, similar readouts of complex particle detectors—just the sort of thing at which AI excels. “It took us several years to convince people that this is not just some magic, hocus-pocus, black box stuff,” says Boaz Klima, of Fermi National Accelerator Laboratory (Fermilab) in Batavia, Illinois, one of the first physicists to embrace the techniques. Now, AI techniques number among physicists’ standard tools.

Particle physicists strive to understand the inner workings of the universe by smashing subatomic particles together with enormous energies to blast out exotic new bits of matter. In 2012, for example, teams working with the world’s largest proton collider, the Large Hadron Collider (LHC) in Switzerland, discovered the long-predicted Higgs boson, the fleeting particle that is the linchpin to physicists’ explanation of how all other fundamental particles get their mass.
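
The "subtle spatial patterns" task maps naturally onto image classification. Below is a minimal, hypothetical PyTorch sketch of a convolutional network labeling 2D detector readouts as signal or background; the architecture and random inputs are illustrative stand-ins, not any experiment's actual model.

    import torch
    import torch.nn as nn

    # Small CNN for 32x32 single-channel "detector images" (toy assumption)
    model = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 8 * 8, 2),  # two classes: signal vs. background
    )

    x = torch.randn(4, 1, 32, 32)  # a batch of fake readouts
    print(model(x).shape)          # torch.Size([4, 2])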


“Advances in Deep Neural Networks,” at ACM Turing 50 Celebration

reddit.com/r/machinelearning, YouTube, Association for Computing Machinery


It will be interesting to see how this discussion will be judged by the future, say, 3

Deep neural networks can be trained with relatively modest amounts of information and then successfully be applied to large quantities of unstructured data. Deep learning techniques have been applied with great success to areas such as speech recognition, image recognition, natural language processing, drug discovery and toxicology, customer relationship management, recommendation systems, and biomedical informatics. The capabilities of deep neural networks, in some domains, have proven to rival those of human beings. Panelists will explore how deep neural networks are changing our world and our jobs. They will also discuss how things may further change going forward.


Is fact-checking ‘fake news’ a waste of time?

Futurity, Boston University


A new study suggests that fact-checking has little influence on what online news media covers, and fact-checks of false news stories spreading online—”fake news”—may use up resources newsrooms could better use covering substantive stories.

“Faculty and students have been agonizing recently about the emergence of fake news—false information packaged to deceive the public into thinking it was produced by professionals with respect for truth,” notes Thomas Fiedler, dean of the Boston University College of Communication, in his spring 2017 COMtalk column.


Artificial Stupidity: Learning To Trust Artificial Intelligence (Sometimes)

Breaking Defense, Sydney J. Freedberg Jr.


As conflict on earth, in space, and in cyberspace becomes increasingly fast-paced and complex, the Pentagon’s Third Offset initiative is counting on artificial intelligence to help commanders, combatants, and analysts chart a course through chaos — what we’ve dubbed the War Algorithm (click here for the full series). But if the software itself is too complex, too opaque, or too unpredictable for its users to understand, they’ll just turn it off and do things manually. At least, they’ll try: What worked for Luke Skywalker against the first Death Star probably won’t work in real life. Humans can’t respond to cyberattacks in microseconds or coordinate defense against a massive missile strike in real time. With Russia and China both investing in AI systems, deactivating our own AI may amount to unilateral disarmament.


Digital identification systems: responsible data challenges and opportunities

The Engine Room, Zara Rahman


Last week, I had the pleasure of being part of the Ethics and Risks working group at the 2017 ID2020 Platform for Change Summit. This event focused on issues faced by people living without recognised identification, and brought together multiple stakeholders, public and private, to discuss these challenges.

Through the working group, we discussed challenges, opportunities and aims for addressing the ethical challenges that come from digital identification systems. I found it to be a fascinating challenge to apply many of the general lessons learned from our Responsible Data work to a very specific problem. Below are just some of the key points that stood out for me, as a non-expert in identification systems.

For context, our group was one of a few different working groups, some of which were focused on specific issues mentioned below – like security or legal regulations. To avoid duplication, we focused on issues that fell explicitly within “ethics and risks” rather than issues that other groups might be addressing.


Why Google’s newest AI team is setting up in Canada

Recode, Tess Townsend


Political realities also make Canada a particularly attractive place for Google to expand its AI efforts.

The Canadian government has demonstrated a willingness to invest in artificial intelligence, committing about $100 million ($125 million in Canadian currency) in its 2017 budget to develop the AI industry in the country.

This is in contrast to the U.S., where President Donald Trump’s 2018 budget request includes drastic cuts to medical and scientific research, including an 11 percent or $776 million cut to the National Science Foundation.


DeepMind expands to Canada with new research office in Edmonton, Alberta

Google DeepMind


DeepMind has always been a unique hybrid of startup culture and academia, and we’ve been lucky to collaborate with many of the best researchers from around the world. Today we’re thrilled to announce our next phase: the opening of DeepMind’s first ever international AI research office in Edmonton, Canada, in close collaboration with the University of Alberta (UAlberta).

It was a big decision for us to open our first non-UK research lab, and the fact we’re doing so in Edmonton is a sign of the deep admiration and respect we have for the Canadian research community. In fact, we’ve had particularly strong links with UAlberta for many years: nearly a dozen of its outstanding graduates have joined us at DeepMind, and we’ve sponsored the machine learning lab to provide additional funding for PhDs over the past few years.


Making Waves

IST Austria


Computer scientists use wave packet theory to develop realistic, detailed water wave simulations in real time. Their results will be presented at this year’s SIGGRAPH conference.
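
The announcement is brief, but the underlying physics is easy to sketch. Deep-water waves are dispersive: crests move at the phase speed while a packet's envelope moves at the slower group speed. The toy NumPy snippet below superposes two Gaussian wave packets under the deep-water dispersion relation omega = sqrt(g*k); it illustrates the wave-packet idea only and is not the authors' SIGGRAPH method.

    import numpy as np

    G = 9.81  # gravitational acceleration, m/s^2

    def packet_height(x, t, k, x0, width, amp):
        """Height contribution of one deep-water wave packet at positions x, time t.
        omega = sqrt(G*k); crests travel at omega/k, the envelope at omega/(2k)."""
        omega = np.sqrt(G * k)
        phase_speed = omega / k
        group_speed = phase_speed / 2.0
        envelope = amp * np.exp(-((x - x0 - group_speed * t) / width) ** 2)
        carrier = np.cos(k * (x - phase_speed * t))
        return envelope * carrier

    x = np.linspace(0.0, 100.0, 1000)
    # Two hypothetical packets with different wavenumbers disperse at different speeds
    height = (packet_height(x, t=5.0, k=0.5, x0=10.0, width=5.0, amp=0.3)
              + packet_height(x, t=5.0, k=2.0, x0=30.0, width=3.0, amp=0.1))
    print(height.max())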


Can Michael Bloomberg save American cities with data at the core?

Diginomica, Jerry Bowles


Michael Bloomberg is giving $200 million to provide new tools that allow city mayors to innovate, solve problems, and work together to move the needle on the issues he believes matter most to citizens and America’s future.


New Berkman Klein Center study examines global internet censorship

Harvard Law Today


A sharp increase in web encryption and a worldwide shift away from standalone websites in favor of social media and online publishing platforms has altered the practice of state-level internet censorship and in some cases led to broader crackdowns, a new study by the Berkman Klein Center for Internet & Society at Harvard University finds.

“The Shifting Landscape of Global Internet Censorship”, released today, documents the practice of internet censorship around the world through empirical testing in 45 countries of the availability of 2,046 of the world’s most-trafficked and influential websites, plus additional country-specific websites. The study finds evidence of filtering in 26 countries across four broad content themes: political, social, topics related to conflict and security, and internet tools (a term that includes censorship circumvention tools as well as social media platforms). The majority of countries that censor content do so across all four themes, although the depth of the filtering varies.
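
For a rough sense of the measurement involved, here is a deliberately naive Python sketch of an availability probe. Real censorship testing, including this study's, compares fetches from vantage points inside and outside each country and inspects the returned content; this sketch only checks whether a URL answers at all.

    import requests

    def check_availability(url, timeout=10):
        """Naive reachability probe: fetch a URL and report status or failure."""
        try:
            r = requests.get(url, timeout=timeout, allow_redirects=True)
            return {"url": url, "status": r.status_code, "final_url": r.url}
        except requests.RequestException as exc:
            return {"url": url, "status": None, "error": type(exc).__name__}

    for site in ["https://example.com", "https://en.wikipedia.org"]:
        print(check_availability(site))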


Wolfram Alpha Is Making It Extremely Easy for Students to Cheat

WIRED, Backchannel, Pippa Biddle


Denise Garcia knows that her students sometimes cheat, but the situation she unearthed in February seemed different. A math teacher in West Hartford, Connecticut, Garcia had accidentally included an advanced equation in a problem set for her AP Calculus class. Yet somehow a handful of students in the 15-person class solved it correctly. Those students had also shown their work, defeating the traditional litmus test for sussing out cheating in STEM classrooms.

Garcia was perplexed, until she remembered a conversation from a few years earlier. Some former students had told her about an online tool called Wolfram|Alpha that could complete complicated calculations in seconds. It provided both the answers and the steps for reaching them, making it virtually undetectable when copied as homework.


As new and lethal opioids flood U.S. streets, crime labs race to ID them

STAT, Max Blau


As newer and stronger opioids flood states from Arizona to New York to Ohio, crime labs like this one find themselves racing to identify unfamiliar drugs in hopes of saving lives. They need to know what’s on the streets in order to build legal cases against the dealers. But identifying the drugs is also vital for public health: It lets first responders know how much of the overdose antidote naloxone to carry. And it helps them understand how lethal the drug residue might be — a crucial bit of information in an era when police officers have overdosed simply from incidental exposure at a crime scene.

“What happens when fentanyl changes on a weekly or monthly basis? You need 2017 data in a 2017 crisis response,” said Dr. Daniel Ciccarone, professor of family and community medicine at the University of California, San Francisco.

 
Events



Zillow Prize Ignites Seattle

Zillow Data Science and Engineering Team


Seattle, WA. Wednesday, July 12, starting at 6 p.m., Zillow HQ. [free, registration required]

 
Deadlines



OpenCon2017 Application Form

Berlin, Germany. November 11-13. OpenCon is the conference and community for students and early career academic professionals interested in advancing Open Access, Open Education and Open Data. Deadline to apply is August 1.

IEEE VIS 2017 – Student Volunteers

Phoenix, AZ. Conference is October 1-6. Deadline for student volunteer applications is August 1.

Here’s your chance to help reduce human-wildlife conflict!

WWF is looking for technology developers to help us create early warning systems that detect approaching wild animals, so that we can reduce human-wildlife conflict. More information on this Human-Wildlife Conflict Technology Challenge can be found on the WILDLABS webpages. Proposals should be submitted no later than September 12.
 
Tools & Resources



[D] Running data science in an Agile (Scrum, Kanban, etc.) environment. Experience + best practices, tips, etc.

reddit.com/r/machinelearning


There was a thread yesterday about how to run data science in Agile environments, be they Scrum, Kanban, or whatever flavor you desire.

There’s a lot of really good information in that thread, but since the entire thread was slightly OT for the OP, I figured a dedicated post might pull in more experience (and share it with a wider audience in general).


Building an Operating System for AI

Algorithmia


The operating system on your laptop is running tens or hundreds of processes concurrently. It gives each process just the right amount of resources that it needs (RAM, CPU, IO). It isolates them in their own virtual address spaces, locks them down to a set of predefined permissions, allows them to inter-communicate, and allows you, the user, to safely monitor and control them. The operating system abstracts away the hardware layer (writing to a flash drive is the same as writing to a hard drive) and it doesn’t care what programming language or technology stack you used to write those apps – it just runs them, smoothly and consistently.

As machine learning penetrates the enterprise, companies will soon find themselves productionizing more and more models and at a faster clip. Deployment efficiency, resource scaling, monitoring and auditing will start to become harder and more expensive to sustain over time. Data scientists from different corners of the company will each have their own set of preferred technology stacks (R, Python, Julia, Tensorflow, Caffe, deeplearning4j, H2O, etc.) and data center strategies will shift from one cloud to hybrid. Running, scaling, and monitoring heterogeneous models in a cloud-agnostic way is a responsibility analogous to an operating system – that’s what we want to talk about.
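
One way to picture that operating-system responsibility is a single call interface over heterogeneous models. The toy Python sketch below illustrates the analogy only; it is not Algorithmia's implementation. Models built on any stack register behind one predict() call, which is where a real system would hang isolation, scaling, permissions, and monitoring.

    from typing import Any, Callable, Dict

    class ModelRegistry:
        """Toy 'OS for AI' sketch: heterogeneous models behind one interface."""

        def __init__(self) -> None:
            self._models: Dict[str, Callable[[Any], Any]] = {}

        def register(self, name: str, predict_fn: Callable[[Any], Any]) -> None:
            self._models[name] = predict_fn

        def predict(self, name: str, payload: Any) -> Any:
            # A real system would enforce isolation, quotas, permissions,
            # and monitoring around this call.
            return self._models[name](payload)

    registry = ModelRegistry()
    registry.register("sentiment", lambda text: "positive" if "good" in text else "negative")
    registry.register("double", lambda x: 2 * x)
    print(registry.predict("sentiment", "a good day"))
    print(registry.predict("double", 21))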


Exactly-once Support in Apache Kafka

Medium, Jay Kreps


On Thursday we released a new version of Apache Kafka that dramatically strengthens the semantic guarantees it provides.

This release came at the tail end of several years of thinking through how to do reliable stream processing in a way that is fast, practical, and correct. The implementation effort itself was on the order of about a year, including an extended period in which about a hundred pages of detailed design documents were discussed and critiqued in the Kafka community, extensive performance tests were performed, and thousands of lines of distributed torture tests specifically targeting this functionality were added.
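
The headline feature is the new transactional producer. As a minimal sketch of what the API looks like, here it is via the confluent-kafka Python client (Kafka 0.11 exposed this through the Java client first; the Python client gained equivalent calls later). The broker address, topic names, and IDs are placeholders.

    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "localhost:9092",   # placeholder broker
        "transactional.id": "demo-producer-1",   # stable ID enables exactly-once across restarts
        "enable.idempotence": True,              # broker deduplicates producer retries
    })

    producer.init_transactions()
    producer.begin_transaction()
    try:
        producer.produce("events", key="user-42", value="clicked")
        producer.produce("audit", key="user-42", value="clicked")
        producer.commit_transaction()  # both writes become visible atomically, or not at all
    except Exception:
        producer.abort_transaction()
        raise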


Rust’s 2017 roadmap, six months in

The Rust Programming Language Blog, Nicholas Matsakis


The most direct way to make Rust easier to learn is to improve the way that we teach it. To that end, we’ve been hard at work on a brand new edition of the official “Rust” book (roadmap issue), and we now have a complete draft available online. This new edition puts ownership front and center and it also has expanded coverage of a number of other areas in Rust, such as error handling, testing, matching, modules, and more. Even better, you can pre-order a printed version through No Starch Press.


[1706.09451] (Machine) Learning to Do More with Less

arXiv, High Energy Physics – Phenomenology; Timothy Cohen, Marat Freytsis, Bryan Ostdiek


Determining the best method for training a machine learning algorithm is critical to maximizing its ability to classify data. In this paper, we compare the standard “fully supervised” approach (that relies on knowledge of event-by-event truth-level labels) with a recent proposal that instead utilizes class ratios as the only discriminating information provided during training. This so-called “weakly supervised” technique has access to less information than the fully supervised method and yet is still able to yield impressive discriminating power. In addition, weak supervision seems particularly well suited to particle physics since quantum mechanics is incompatible with the notion of mapping an individual event onto any single Feynman diagram. We examine the technique in detail — both analytically and numerically — with a focus on the robustness to issues of mischaracterizing the training samples. Weakly supervised networks turn out to be remarkably insensitive to systematic mismodeling. Furthermore, we demonstrate that the event level outputs for weakly versus fully supervised networks are probing different kinematics, even though the numerical quality metrics are essentially identical. This implies that it should be possible to improve the overall classification ability by combining the output from the two types of networks. For concreteness, we apply this technology to a signature of beyond the Standard Model physics to demonstrate that all these impressive features continue to hold in a scenario of relevance to the LHC.
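
A minimal PyTorch sketch of the class-ratio idea: the network never sees per-event labels, only the known signal fraction of each mixed sample, and the loss pulls the batch-average prediction toward that fraction. The Gaussian "events" and the exact loss are toy stand-ins, not the paper's setup.

    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    def make_batch(n, fraction):
        """A mixed sample: `fraction` of events from a toy 'signal' distribution
        (shifted Gaussian), the rest from 'background'."""
        n_sig = int(n * fraction)
        sig = torch.randn(n_sig, 4) + 1.0
        bkg = torch.randn(n - n_sig, 4)
        return torch.cat([sig, bkg])

    for step in range(500):
        for fraction in (0.2, 0.8):  # the only supervision: known class ratios
            preds = net(make_batch(256, fraction))
            loss = (preds.mean() - fraction) ** 2  # match batch mean to the ratio
            opt.zero_grad()
            loss.backward()
            opt.step()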

 
Careers


Postdocs

Post-Doc Participatory and Interactive Forest and Land Monitoring



Wageningen University & Research; Wageningen, The Netherlands
