Data Science newsletter – July 11, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for July 11, 2017


 
 
Data Science News



DNA evidence is rewriting domestication origin stories

Science News, Tina Hesman Saey



One lab full of rats looks pretty much the same as another. But visiting a lab in Siberia, geneticist Alex Cagan can distinguish rats bred to be tame from those bred to be aggressive as soon as he opens the lab door.

“It’s a completely different response immediately,” he says. All of the tame rats “come to the front of the cage very inquisitively.” The aggressive rats scurry to the backs of their cages to hide. Exactly how 70 generations of breeding have ingrained friendly or hostile behaviors in the rats’ DNA is a mystery that domestication researchers are trying to solve. The rats, along with mink and silver foxes, are part of a long-running study at the Institute of Cytology and Genetics in Novosibirsk, Russia. The aim is to replay domestication to determine the genetic underpinnings that set domesticated animals apart from their wild ancestors.


Under the Hood of a Self-Driving Taxi

Voyage, Oliver Cameron



A self-driving car traditionally follows the paradigm of Sense, Plan, Act. The car senses the environment around it, utilizing sensors like LIDAR, radar and cameras. The car plans the path from point A to point B, using sensor information and other contextual information. The car then acts, executing the path that was planned by controlling its steering and speed.

Giving a car the ability to Sense, Plan & Act (SPA) requires a complex system of hardware and software, all of which work together in (hopeful) harmony to form a self-driving car. You might be familiar with some of the surface-level components in isolation (things like cameras, LIDAR, etc.), but equally important is all the plumbing necessary to make them sing together. It’s that plumbing, or the core, that we’ll cover in this post.
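The Sense, Plan, Act loop described above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not Voyage's software: the one-dimensional road, the `lookahead` distance, the cruise speed, and every function and class name here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    position: float       # current position along a 1-D road, in meters
    obstacle_ahead: bool   # True if a simulated sensor sees an obstacle nearby


@dataclass
class Command:
    speed: float           # target speed in meters per second


def sense(position, obstacles, lookahead=5.0):
    """Sense: turn simulated sensor data into an observation."""
    blocked = any(0.0 < o - position <= lookahead for o in obstacles)
    return Observation(position=position, obstacle_ahead=blocked)


def plan(obs, goal, cruise_speed=2.0):
    """Plan: choose a speed toward the goal; stop if blocked or arrived."""
    if obs.obstacle_ahead or obs.position >= goal:
        return Command(speed=0.0)
    return Command(speed=min(cruise_speed, goal - obs.position))


def act(position, cmd, dt=1.0):
    """Act: apply the command for one time step and return the new position."""
    return position + cmd.speed * dt


def drive(start, goal, obstacles, max_steps=20):
    """Run the sense -> plan -> act loop and return the visited positions."""
    position = start
    trace = [position]
    for _ in range(max_steps):
        obs = sense(position, obstacles)
        cmd = plan(obs, goal)
        if cmd.speed == 0.0:
            break  # either the goal is reached or the road is blocked
        position = act(position, cmd)
        trace.append(position)
    return trace
```

With no obstacles, `drive(0.0, 5.0, [])` returns `[0.0, 2.0, 4.0, 5.0]`; with an obstacle at 6 meters, `drive(0.0, 10.0, [6.0])` stops early and returns `[0.0, 2.0]`. A real system would replace each function with a large subsystem, but the loop structure is the same.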


PAIR: the People + AI Research Initiative

Google, The Keyword blog, Martin Wattenberg and Fernanda Viegas



We’re announcing the People + AI Research initiative (PAIR), which brings together researchers across Google to study and redesign the ways people interact with AI systems. The goal of PAIR is to focus on the “human side” of AI: the relationship between users and technology, the new applications it enables, and how to make it broadly inclusive. The goal isn’t just to publish research; we’re also releasing open source tools for researchers and other experts to use.


University Data Science News

Brown University has received $19m over four years from DARPA to build a “fully implantable wireless brain interface system able to record and stimulate neural activity.” The proposal is awe-inspiring, readers. Their vision could lead to “a ‘cortical intranet’ of tens of thousands of wireless micro-devices, each about the size of a grain of table salt, that can be safely implanted onto or into the cerebral cortex, the outer layer of the brain. The implants, dubbed ‘neurograins,’ will operate independently, interfacing with the brain at the level of a single neuron. The activity of the devices will be coordinated wirelessly by a central communications hub in the form of a thin electronic patch worn on the skin or implanted beneath it.” When our brains join the internet of things, we need to ask what we can do to make sure they aren’t hackable. Of course, the recent election and referendum cycles suggest that human brains are already highly susceptible to fake news, confirmation bias, and the cornucopia of other observational biases covered in social psychology 101.

MIT researchers found a positive correlation between the education level of residents and the likelihood that a neighborhood will improve. They used image recognition, maps, and demographic data overlays to find that highly educated residents are the motor for gentrification. They did not, however, indicate causality because this would get into tricky terrain. Do highly educated people put more effort into evaluating their neighborhood choices, thereby picking neighborhoods that were already going to gentrify anyway? Do highly educated residents have fewer ties to home communities, having spent so many years in school, and therefore make residential choices based on economic value and convenience rather than family and community inertia? Do highly educated residents see a faster uptick in their incomes than other residents – move in as a poor grad student, stick around and pump money into the community as a professor or professional worker? This long litany of questions is an example of what AI can and cannot do. Even coupling image recognition with demographic data – at least the way it was done here – was not sufficient to direct the causal arrow with any validity. I’m a huge proponent of involving statisticians and social scientists in research like this because of their disciplines’ longstanding toolkits for addressing human time series data and conducting qualitative research.


Wired is now covering the question of whether to pre-print or not to pre-print, using the field of biology as the main character.

By looking at 40,000+ papers and 2,300+ patents, researchers at MIT were able to measure the impact of proximity on co-authorship in an academic community. The main finding, “researchers located in the same workspace are more than three times as likely to collaborate compared to those who are 400 meters apart,” suggests challenges for data science, a methodological field that will benefit most from interdisciplinary research.



Awni Hannun (Stanford) is co-lead author on a paper describing algorithms that diagnose heart arrhythmias with very high accuracy, better than trained cardiologists.



The Ethics and Governance of Artificial Intelligence Fund, established last January with money from Reid Hoffman, Pierre Omidyar, and the Knight Foundation, has committed $7.6 million to be split unevenly between MIT’s Media Lab, Harvard’s Berkman Klein Center, and seven smaller research efforts including AI Now, FAT ML, and Data & Society.

Seth Shipman and his colleagues in biology at Harvard figured out how to use CRISPR to put GIFs in DNA. Yes, GIFs like internet junk GIFs. Because…wait, why? Because the technique is going to be used to build recording devices that will reveal to scientists how cells developed into the types of cells they ended up becoming. The GIF was encoded at a rate of one frame per day for five days.

Anind Dey is moving from Carnegie Mellon University to become the dean of the iSchool at UW-Seattle. Congrats, Professor Dey and UW.

Maria Klawe, President of Harvey Mudd, is nailing it when it comes to getting more women students to enroll in and complete CS degrees (55%). However, Klawe is not telling those students to lean in to VC-backed firms. She is concerned they will run into “bro culture.” Without an HR department, she notes, “if something goes wrong, it’s a matter of luck whether you have management that cares.”


NYC Media Lab White Paper: How AI is Changing Media Economics

NYC Media Lab



The media companies of today are fast adapting to a machine-driven economy, but are they adapting fast enough? What will keep them competitive in the future? In this white paper, NYC Media Lab takes a detailed look at how AI is impacting the economics, revenue and organizational structure of the media business. The paper includes economic insights and interviews with leading data science researchers and executives from NYC Media Lab’s recent Machines + Media conference.


UCSF Cancer Researcher Leads Team to Win First Ever AI Genomics Hackathon

UC San Francisco, UCSF News Center



A UC San Francisco cancer researcher has led a team of data scientists and engineers to win a first-of-its-kind Artificial Intelligence (AI) Genomics Hackathon competition.

Jyotika ‘Jo’ Varshney, DVM, PhD, a UCSF postdoctoral fellow, and her three teammates were named the winners at the event, which challenged participants to analyze a real patient’s genomic data using AI and other computational methods in order to advance the understanding of a rare genetic disease called neurofibromatosis type 2, or NF2.


What’s Challenging in Big Data Now: Integration and Privacy

datanami, Alex Woodie



It’s been said many times before, but it’s worth stating again: big data presents large opportunities for improving business and society, but it also involves sizable computing challenges, as well as moral challenges. A panel of renowned professors in the field expounded on the obstacles blocking big data’s path forward during a recent meeting of the Association for Computing Machinery (ACM). Privacy and integration issues led the way.


NIH-funded team uses smartphone data in global study of physical activity

National Institutes of Health



Using a larger dataset than for any previous human movement study, National Institutes of Health-funded researchers at Stanford University in Palo Alto, California, have tracked physical activity by population for more than 100 countries. Their research follows on a recent estimate that more than 5 million people die each year from causes associated with inactivity.

The large-scale study of daily step data from anonymous smartphone users dials in on how countries, genders, and community types fare in terms of physical activity and what results may mean for intervention efforts around physical activity and obesity. The study was published July 10, 2017, in the advance online edition of Nature.

“Big data is not just about big numbers, but also the patterns that can explain important health trends,” said Grace Peng, Ph.D., director of the National Institute of Biomedical Imaging and Bioengineering (NIBIB) program in Computational Modeling, Simulation and Analysis.


Why do some neighborhoods improve? Density of highly educated residents, rather than income or housing costs, predicts revitalization

MIT News Office



Four years ago, researchers at MIT’s Media Lab developed a computer vision system that can analyze street-level photos taken in urban neighborhoods in order to gauge how safe the neighborhoods would appear to human observers.

Now, in an attempt to identify factors that predict urban change, the MIT team and colleagues at Harvard University have used the system to quantify the physical improvement or deterioration of neighborhoods in five American cities.

In work reported today in the Proceedings of the National Academy of Sciences, the system compared 1.6 million pairs of photos taken seven years apart. The researchers used the results of those comparisons to test several hypotheses popular in the social sciences about the causes of urban revitalization.


Smart Cities and Image Recognition

Medium, Joe Hanson



Advances in artificial intelligence mean applications increasingly can take on image recognition capabilities that allow them to identify objects, detect the age of human faces and screen out adult content. The Department of Homeland Security has worked for several years to implement a biometric monitoring system to verify travelers in U.S. airports, and they recently found success with a Customs and Border Protection pilot.

The system uses facial recognition software to compare photos of passengers against a database, allowing DHS officials to identify travelers who have overstayed visas or are wanted in criminal investigations.

These developments underscore the need for the government to remain abreast of ways to manage complex technology and maintain standards of living.


Air Pollution Still Kills Thousands In U.S. Every Year

NPR, Shots blog, Rob Stein



“We are now providing bullet-proof evidence that we are breathing harmful air,” says Francesca Dominici, a professor of biostatistics at the Harvard T.H. Chan School of Public Health, who led the study. “Our air is contaminated.”

Dominici and her colleagues set out to do the most comprehensive study to date assessing the toll that air pollution takes on American lives.


Kilogram to be redefined as scientists ‘revolutionise’ measurement of mass

The Independent (UK), Sarah Kaplan



John Pratt of US’s National Institute of Standards and Technology seeking to revise how weight is recorded as metal impurities render current approach too variable and imprecise


Biology’s Roiling Debate Over Publishing Research Early

WIRED, Science, Megan Molteni



Five years ago, Daniel MacArthur set out to build a massive library of human gene sequences—one of the biggest ever. The 60,706 raw sequences, collected from colleagues all over the globe, took up a petabyte of memory. It was the kind of flashy, blockbuster project that would secure MacArthur a coveted spot in one of science’s top three journals, launching his new lab at the Broad Institute into the scientific spotlight. But before all that happened, he did something that counted as an act of radicalism in the world of biology: He put it on the internet.

Posting scientific papers online before peer review—in so-called preprint archives—isn’t a new idea. Physicists have been publishing their work this way, free to the public, for decades. But for biologists, preprints are uncharted territory. And that territory is rapidly expanding as academia and its big-time funders shift toward a culture of openness. As preprints become more popular, they’re throwing the field into a state of uncertainty.


Data Supercharges Billion-Dollar Boats in the America’s Cup, the World’s Fastest Sailing Race

WIRED, Gear, Tim Newcomb



As billion-dollar catamarans skim across the ocean vying to best each other in the world’s fastest sailing race, the win may not always go to the best sailors. Sometimes, a victory in the America’s Cup goes to the team with the best data.

“This is as much a design race as a sailing race,” says Mauricio Munoz, an engineer for the British Land Rover BAR team. Munoz’s team is one of many in the 2017 America’s Cup who used data collected during pre-Cup races to improve the designs of their boats. “Design without data, well I’m not sure what it is,” Munoz says.

During these head-to-head races—where teams from around the globe sail as fast as 30 knots—even a half-knot gain in speed can be all that’s needed to secure a victory. And since a tweak to a boat’s design or to the crew’s routine is often enough to earn it that extra brio in the water, the more data about the boat’s performance that can be gathered, the better.


DARPA Wants Brain Implants That Record From 1 Million Neurons

IEEE Spectrum, Eliza Strickland



The agency announced the six research groups that have been awarded grants under the NESD program. In a press release, DARPA says that even the 1-million-neuron goal is just a starting point. “A million neurons represents a minuscule percentage of the 86 billion neurons in the human brain. Its deeper complexities are going to remain a mystery for some time to come,” says Phillip Alvelda, who launched the program in January. “But if we’re successful in delivering rich sensory signals directly to the brain, NESD will lay a broad foundation for new neurological therapies.”


What’s coming in Data.gov’s next revamp

FCW, Adam Mazmanian



The federal government data repository Data.gov is due for a revamp, according to contracting documents released as part of a sole-source extension granted to contractor REI.

The six-month extension is required, according to the General Services Administration, to complete the migration to new infrastructure and to continue work on new features. While the value of the extension is very small by federal contracting standards, just over a half-million dollars, the documentation included to justify the move gives a clear picture of the next steps for the nascent data site.

 
Events



Scale By the Bay 2017: Schedule

Data by the Bay



San Francisco, CA November 15-18. Scale By the Bay is the 5th year of the flagship By the Bay conference. [$$$$]

 
Deadlines



2017 IEEE International Conference on Big Data – Call for Papers

Boston, MA Conference is December 11-14. Deadline for submissions is August 7.
 
Tools & Resources



Technical Debt in Machine Learning

Medium, Towards Data Science, Maksym Zavershynskyi



Many of us frown upon technical debt, but generally it is not a bad thing. Technical debt is an instrument that is justified when we need to meet release deadlines or unblock a colleague. The problem with technical debt, though, is the same as with financial debt: when the time comes to pay the debt, we give back more than we took at the beginning. That is because technical debt has a compounding effect.

Experienced teams know when to back off upon seeing debt pile up, but technical debt in machine learning piles up extremely fast. You can create months’ worth of debt in a single working day, and even the most experienced teams can miss the moment when the debt becomes so huge that it sets them back half a year, which is often enough to kill a fast-paced project.


Supporting Advanced Visualization Sharing at data.world

data.world



In data.world, you can share Vega and Vega-Lite visualizations in new, convenient, and useful ways depending on what you’re trying to do:

Upload .vg.json or .vl.json files to get full previews

If you write a Vega file, use the extension .vg.json; for Vega-Lite, use .vl.json. Upload the file to a data.world dataset and you’ll see the visualization previewed inline as a file card (in the screenshot below you can see a visualization rendered next to the data that backs it!)
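As an illustration of the file-naming convention above, the following sketch writes a minimal Vega-Lite bar chart spec to a file ending in .vl.json. The data values, field names, file name, and the schema version in the `$schema` URL are assumptions made for the example; they are not taken from the data.world post, and uploading the file is left as a manual step.

```python
import json
import tempfile
from pathlib import Path


def write_vega_lite_spec(path):
    """Write a minimal Vega-Lite bar chart spec and return the output path."""
    spec = {
        # Schema version is an assumption; use whichever your tooling expects.
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "description": "Toy bar chart with inline, made-up data.",
        "data": {
            "values": [
                {"category": "A", "count": 28},
                {"category": "B", "count": 55},
                {"category": "C", "count": 43},
            ]
        },
        "mark": "bar",
        "encoding": {
            "x": {"field": "category", "type": "nominal"},
            "y": {"field": "count", "type": "quantitative"},
        },
    }
    path = Path(path)
    path.write_text(json.dumps(spec, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Write into a throwaway directory; the .vl.json suffix marks Vega-Lite.
    with tempfile.TemporaryDirectory() as tmp:
        out = write_vega_lite_spec(Path(tmp) / "bars.vl.json")
        print(out.name)
```

A full Vega spec would be written the same way, only with the .vg.json extension.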


Learn Word2Vec by implementing it in tensorflow

Medium, Towards Data Science, aneesh joshi



I feel that the best way to understand an algorithm is to implement it. So, in this article I will be teaching you word embeddings by implementing them in TensorFlow.

The idea behind this article is to avoid all the introductions and the usual chatter associated with word embeddings/word2vec and jump straight into the meat of things. So, most of the king-man-woman-queen examples will be skipped.
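For readers who want a taste before opening the article, here is a minimal sketch of one common first step in skip-gram word2vec: turning a token sequence into (center, context) training pairs. It is not taken from the article and uses no TensorFlow; the function names and the window size are hypothetical choices for illustration.

```python
def build_vocab(tokens):
    """Map each distinct word to an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def skipgram_pairs(tokens, window=2):
    """Return (center, context) word pairs within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    tokens = "he is the king".split()
    vocab = build_vocab(tokens)
    # Convert word pairs to integer id pairs, the form a model would train on.
    id_pairs = [(vocab[c], vocab[x]) for c, x in skipgram_pairs(tokens, window=1)]
    print(id_pairs)
```

In a full implementation, these integer pairs would be fed to a small network whose hidden-layer weights become the word embeddings.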

 
Careers


Full-time positions outside academia

Research Scientist



Wikimedia Foundation; San Francisco, CA

Education Director



AI4ALL; Palo Alto, CA

Postdocs

Postdoctoral Research Associate in Artificial Intelligence Law & Policy



University of Washington, Tech Policy Lab; Seattle, WA

Full-time, non-tenured academic positions

Data Services Librarian



University of Colorado; Boulder, CO
