Data Science newsletter – January 25, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for January 25, 2017

Data Science News



Data Visualization of the Week

Twitter, David Robinson


One Nation, Divided By A Common Language

Fast Company, Virginia Heffernan


In an illuminating study published in September in The Proceedings of the International Conference on Social Informatics, several linguists from Microsoft, Columbia University, and the Georgia Institute of Technology examined millions of North American tweets and tracked the spread of slang words like jawn, hella, and ctfuu (look ’em up). Using methods from epidemiology, they found that such coinages tore through populations like infectious diseases, with single and then multiple contacts occurring fast and furious in the close, unsanitary quarters of Twitter.


Machine Vision Helps Spot New Drug Treatments

MIT Technology Review, Tom Simonite


Web users searching for photos and cops looking for suspects in video already benefit from software that understands the content of images. Chris Gibson says it can also make it easier to find treatments for diseases not targeted by existing drugs.

“By combining robotics and machine vision, we can work at large scale on hundreds of diseases simultaneously, using a small number of people,” says Gibson, who is CEO and cofounder of the 40-person startup Recursion Pharmaceuticals.


The Rise of the Data Engineer

Medium, Maxime Beauchemin


I joined Facebook in 2011 as a business intelligence engineer, and by the time I left in 2013, I was a data engineer. I was not promoted or assigned a new role; we simply came to realize that the work we were doing was transcending classic business intelligence and that the role we had created for ourselves was a new discipline.

As my team was at the forefront of this transformation, we were developing new skills, new ways of doing things, new tools, and, more often than not, turning our backs on traditional methods. We were pioneers. We were data engineers!


Alexa and Google Assistant have a problem: People aren’t sticking with voice apps they try

Recode, Jason Del Rey


Amazon Echo and Google Home were the breakaway hits of the holiday shopping season. But both devices — and the voice technologies that power them — have some major hurdles to overcome if they want to keep both consumers and software developers engaged.

That’s one of the big takeaways from a new report that an industry startup, VoiceLabs, released on Monday.


Berkeley launches RISELab, enabling computers to make intelligent real-time decisions

University of California-Berkeley, Berkeley Engineering


UC Berkeley today launched the RISELab, the successor of AMPLab, and the latest in its series of five-year intensive research labs in computer science, with the goal of improving how machines make intelligent decisions based on real-time input.

The new Berkeley lab focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). The lab kicks off this January with support from founding sponsors Amazon Web Services, Ant Financial, Capital One, Ericsson, GE Digital, Google, Huawei, Intel, IBM, Microsoft, and VMware.


The Robot Rampage

Bloomberg Gadfly, Chris Bryant and Elaine He


It took 50 years for the world to install the first million industrial robots. The next million will take only eight, according to Macquarie. Importantly, much of the recent growth happened outside the U.S., in particular in China, which has an aging population and where wages have risen.


And Suddenly Everything will be Different – Biology’s iPhone Moment

Tincture, Jamie Heywood


I believe we’re about to have our iPhone moment in personalized health and medicine. What is different today is that unlike earlier approaches, which were driven largely by genomics and generally provided limited insight into our future or “fate,” this new technology will be driven by the digitization of “state” biology. Today we can measure protein and RNA expression, immune signatures, annotations to DNA, and our biome at scale. We can now begin to combine these tools with real-world patient outcomes, experience, environment, and care, and with advanced machine learning.


South Africa’s North-West University Becomes Software and Data Carpentry’s first African Partner

Software Carpentry


At the end of 2016, the NWU entered into a gold partnership with Software and Data Carpentry. The partnership marks the beginning of a new phase of capacity development around computing and data at the university. It is the culmination of months of hard work, exciting workshops, and interesting conversations with colleagues from all over the world.


UCL students learn state-of-the-art AI in DeepMind partnership

University College London, UCL News


DeepMind is known internationally as a leader in an area of computer science called machine learning. Now senior DeepMind staff are joining forces with UCL’s Department of Computer Science to share their knowledge by delivering a state-of-the-art Master’s level training module called ‘Advanced Topics in Machine Learning’.

This new module will provide a key component of UCL’s Machine Learning Master’s programmes and will cover some of the most sophisticated topics in artificial intelligence. The first of these lectures will take place in January 2017.


Big Data Is Helping Us See Environmental Problems in a Whole New Light

SingularityHub, Peter Rejcek


In the last few years, conservationists and others have turned to big data to get the big picture on environmental degradation, helping to wrangle answers to some of the globe’s most pressing problems. Big data, in this case, comes in various forms, from satellite images to global trade databases to social media postings.


[1701.04906] Quantifying the distribution of editorial power and manuscript decision bias at the mega-journal PLOS ONE

arXiv, Computer Science > Digital Libraries; Alexander M. Petersen


We analyzed the longitudinal activity of nearly 7,000 editors at the mega-journal PLOS ONE over the 10-year period 2006-2015. Using the article-editor associations, we develop editor-specific measures of power, activity, article acceptance time, citation impact, and editorial remuneration (an analogue to self-citation). We observe remarkably high levels of power inequality among the PLOS ONE editors, with the top-10 editors responsible for 3,366 articles — corresponding to 2.4% of the 141,986 articles we analyzed. Such high inequality levels suggest the presence of unintended incentives, which may reinforce unethical behavior in the form of decision-level biases at the editorial level. Our results indicate that editors may become apathetic in judging the quality of articles and susceptible to modes of power-driven misconduct. We used the longitudinal dimension of editor activity to develop two panel regression models which test and verify the presence of editor-level bias. In the first model we analyzed the citation impact of articles, and in the second model we modeled the decision time between an article being submitted and ultimately accepted by the editor. We focused on two variables that represent social factors that capture potential conflicts of interest: (i) we accounted for the social ties between editors and authors by developing a measure of repeat authorship among an editor’s article set, and (ii) we accounted for the rate of citations directed towards the editor’s own publications in the reference list of each article he/she oversaw. Our results indicate that these two factors play a significant role in the editorial decision process. Moreover, these two effects appear to increase with editor age, which is consistent with behavioral studies concerning the evolution of misbehavior and response to temptation in power-driven environments.
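
To make the second model’s structure concrete, here is a minimal Python sketch of an editor-fixed-effects regression of decision time on the two conflict-of-interest proxies. The file and column names are invented for illustration; the paper’s actual specification differs in detail.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: one row per article, with
    #   decision_time    days from submission to acceptance
    #   repeat_author    share of the article's authors the editor handled before
    #   editor_selfcite  citations to the editor's own work in the reference list
    #   editor_id        editor identifier, used here for fixed effects
    df = pd.read_csv("plos_one_articles.csv")

    # OLS with editor fixed effects absorbing editor-level heterogeneity
    model = smf.ols(
        "decision_time ~ repeat_author + editor_selfcite + C(editor_id)", data=df
    )
    print(model.fit().summary())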


D-Wave upgrade: How scientists are using the world’s most controversial quantum computer

Nature News & Comment, Elizabeth Gibney


Scepticism surrounds the ultimate potential of D-Wave machines, but researchers are already finding uses for them.


U-M projects use Big Data to predict diseases, advance research in single-cell gene sequencing

University of Michigan News


Researchers at the University of Michigan will use Big Data and mobile technology to learn how to predict when individuals will get diseases including depression and hepatitis C, and to unlock the potential of single-cell gene sequencing under three recently funded projects.

The Michigan Institute for Data Science awarded the three interdisciplinary projects a combined $3 million under the second round of its Challenge Initiative program. The program is part of U-M’s $100 million Data Science Initiative, which was announced in September 2015.


Cancer scientists are having trouble replicating groundbreaking research

Vox, Julia Belluz


As researchers reproduce more experiments, they’re learning that they can’t always get clear answers about the reliability of the original results. Replication, it seems, is a whole lot murkier than we thought.

Take the latest findings from the large-scale Reproducibility Project: Cancer Biology. Here, researchers focused on reproducing experiments from the highest-impact papers about cancer biology published from 2010 to 2012. They shared their results in five papers in the journal eLife last week — and not one of their replications definitively confirmed the original results.


URMC Awarded up to $9M to Study Infectious Threats

University of Rochester Medical Center, Newsroom


The University of Rochester Medical Center will receive up to $9 million from the Centers for Disease Control and Prevention to conduct infectious disease surveillance and research over the next five years. The award renews the University’s role as a member of the CDC’s Emerging Infections Program, a national network that keeps hawk-like watch on the activity of several infectious threats and conducts studies that guide policy related to prevention and treatment.

 
Events



Tartan Data Science Cup: Episode III — Capital Rogue One



Pittsburgh, PA. Episode III of the Tartan Data Science Cup at Carnegie Mellon University, sponsored by Capital One, is on February 4. Events leading up to the Cup begin on January 31.

Techlore Discussion Session: Weapons of Math Destruction



Oakland, CA. Speaker: Cathy O’Neil, February 7 at 7 p.m., Kapor Center for Social Impact [free]

DataKind San Francisco Meetup: Advancing your mission with data



San Francisco, CA. Tuesday, February 7, at 6 p.m., Google on Spear Street SF [free]

Data Visualization + Virtual Reality: What’s Next for the Sports Fan Experience?



Brooklyn, NY. February 10-12, NYU MAGNET (2 MetroTech, 8th Floor) [free, application required]

Hacking the Academy: Citizen



Seattle, WA. February 21 at 4 p.m., Research Commons, Green A [free]
 
Deadlines



Open Science for Synthesis: Gulf Research Program

This 3-week intensive training, convening in July 2017 at NCEAS in Santa Barbara, CA, will revolve around scientific computing and scientific software for reproducible science. Deadline to apply is Friday, February 17.
 
NYU Center for Data Science News



the MaD Seminar

NYU Center for Data Science, Courant Institute


New York, NY. February 2 at 2 p.m., Andrew Gelman from Columbia University, hosted by the NYU Center for Data Science (60 Fifth Ave)


We have @arthur_spirling & Michael Gill on the Data Science Demystified podcast! Wikileaks & the cutting edge!

Data Science Demystified podcast


[audio, 57:27]
 
Tools & Resources



Intro to Bootstrapping

Katherine Wood, Inattentional Coffee blog


The idea behind the bootstrap is simple. We know that if we repeatedly sample groups from the population, our measurement of that population will get increasingly accurate, becoming perfect when we’ve sampled every member of the population (and thus obviating the need for statistics at all). However, this world, like the frictionless worlds of physics problems, doesn’t resemble real operating conditions. In the real world, you typically get one sample from your population. If only we could easily resample from the population a few more times.

Bootstrapping gets you the next best thing. We don’t resample from the population. Instead, we repeatedly resample our own data, with replacement, and generate a so-called bootstrapped distribution. We can then use this distribution to quantify uncertainty on all kinds of measures.
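
As a concrete illustration of that recipe, here is a minimal Python/NumPy sketch (not from the post; the simulated normal sample simply stands in for whatever data you have) that bootstraps a 95% confidence interval for a sample mean:

    import numpy as np

    np.random.seed(0)
    # Stand-in for the one sample you actually get from the population
    sample = np.random.normal(loc=10, scale=3, size=50)

    # Resample our own data, with replacement, many times
    n_boot = 10000
    boot_means = np.array([
        np.mean(np.random.choice(sample, size=sample.size, replace=True))
        for _ in range(n_boot)
    ])

    # The middle 95% of the bootstrapped distribution quantifies uncertainty
    low, high = np.percentile(boot_means, [2.5, 97.5])
    print("mean = %.2f, 95%% CI = [%.2f, %.2f]" % (sample.mean(), low, high))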


Why use SVM?

yhat, Greg


Support Vector Machine has become an extremely popular algorithm. In this post I try to give a simple explanation of how it works and give a few examples using the Python Scikits libraries. All code is available on Github. I’ll have another post on the details of using Scikits and Sklearn.
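
For a flavor of what that scikit-learn usage looks like, here is a generic sketch (not code from the post) that fits an RBF-kernel support vector classifier on a toy dataset:

    from sklearn import datasets, svm
    from sklearn.model_selection import train_test_split

    # Toy two-class problem standing in for real data
    X, y = datasets.make_classification(n_samples=200, n_features=4, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )

    # RBF-kernel SVM; the C (and gamma) hyperparameters usually need tuning
    clf = svm.SVC(kernel="rbf", C=1.0)
    clf.fit(X_train, y_train)
    print("test accuracy: %.3f" % clf.score(X_test, y_test))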

 
Careers


Tenured and tenure-track faculty positions

Assistant Professor Position, Center for Network Science



Central European University; Budapest, Hungary

Postdocs

Postdoctoral Research Fellow – Sport Business Intelligence



Victoria University; Melbourne, Australia
