Data Science newsletter – January 2, 2018

Data Science Newsletter features journalism, research papers, events, tools/software, and jobs for January 2, 2018

Data Science News



The physician of the future: a precisionist who sees patients as they are

STAT, Grace E. Terrell


… As an internal medicine physician in practice for more than 30 years, I am very aware of a crucial fact in medicine that this process ignores: patients are not prototypes, and no one is average. Some patients experience muscle pain from taking a cholesterol-lowering statin. We see this sometimes. Some patients don’t respond to sertraline in treating their major depression. We see this sometimes. Some patients develop the Stevens-Johnson reaction when they take a sulfa drug. I saw this one time. A patient of mine received a sulfa antibiotic, the evidence-based medicine treatment of choice for his condition. Soon afterward, his skin began to blister and peel; he ended up in a burn unit for six months and nearly died.

Even as we develop increasingly effective therapies for cancer, heart disease, and other modern plagues, we still struggle to develop a health care system designed around individual needs even as it addresses the population as a whole. What we are calling precision medicine or personalized medicine is essentially the antidote: a model of care built upon differentiation at the individual level that will ultimately be the disruptive force that accelerates change in health care by driving waste from the system as it improves outcomes.


University Data Science News

Jake VanderPlas is now working a split position, half at UW-Seattle, half at Google where he will be on the Colaboratory team, “Google’s drive-backed Jupyter notebook.”



NYU joined UC-Berkeley, UW-Seattle and 83 other schools in The American Talent Initiative to recruit more students from low- and moderate-income families. It’s good to see this effort to target economically disadvantaged students at a time when intergenerational economic growth has stagnated (and even started to slide backwards). Higher education is supposed to be part of what levels the playing field, but if we don’t recruit and include low- and moderate-income students, higher education will more likely serve to reinforce class divides.



Antarpreet Jutla of West Virginia University used satellite data and machine learning to predict a cholera outbreak in Yemen four weeks before it exploded.



Quantum computing is still my favorite head scratcher. A write-up about a Lawrence Berkeley National Lab team whose quantum ‘computer’ sometimes functions (which sounds exactly right) notes that, over the next three years and $3m, the research team “will design algorithms and even new math.” I understand designing new algorithms, but the new math part? Sounds like the brainiac version of fake news?? But it also sounds like the proverbial geeks in the garage: “Nobody really knows how to do this right now…As soon as Siddiqi gets his 8 qubit up and running, we are going to do some experiments on that.” Someone’s having a lot of fun with science.



Yuanjun Gao and Matthew Connelly from Columbia’s stats and history departments with Rahul Mazumder from MIT and Jack Goetz from University of Michigan are inventing new methods that apply data science techniques to government archives to supplement the historical record of 20th Century geo-politics.



Sports stats is a great tool for outreach to younger students. Hamline University is pairing up with MinneAnalytics to sponsor analytics competitions at the high school level in Minnesota while the Statistics in Sports section of the American Statistical Association will be featuring a sports stats paper competition just for undergrads this year. Really hope to see some stats work on sports outside the major leagues. Tennis, anyone?



University of Arkansas professor Matt Covington is collecting data to measure the impact of melting water on future melt rates in Greenland and ocean temperatures everywhere-ish. Climbing around in ice caves — which could lead to hypothermia and broken bones — is what science looks like. For some scientists. Think of Covington when you use data someone else collected. At least throw them a citation for all the risks and discomfort they endured to get the data you’re processing in a climate-controlled office. [Covington notes: “150 million people [are] living within 3 feet of high tide around the planet” which brings me back to my point about population level generalized anxiety]



Also in climate science: “By 2060–2080, most regions within 30° latitude of the equator may experience between 25 and 150 days per year that exceed the historical once-per-year maximum air temperature, and 25–250 days per year that exceed historical once-per-year maximum wet bulb temperature.” In other words, there will be many more heat waves. Heat waves are the deadliest natural ‘disaster’. And they are projected to hit areas in India and West Africa that are underdeveloped with respect to cooling technology. This new climate change paper by Ethan Coffel, Radley Horton, and Alex de Sherbinin is alarming. The only silver lining is that if the globe could keep a lid on carbon emissions, we could largely avoid these extreme heat events. Also in the news last week: Trump’s administration was won over by Sarah Palin’s compelling “Drill, baby, drill” policy advice.



Emily Putnam-Hornstein of the University of Southern California and Rhema Vaithianathan of Auckland University of Technology worked with Pittsburgh officials to predict which children reported to child welfare services need intervention most. The professors are working in a powerfully charged situation. Very few families actually want the state to intervene in their lives, the state has too few resources to attend to all risky cases, and without intervention some children will be neglected, abused, or worse. You won’t regret reading this story about the power of automation to aid human decision makers. It’s also a story about the problems associated with training data that likely reflects human bias.



The University of Illinois at Urbana-Champaign is using data science to predict crop yields in the corn belt. PI Kaiyu Guan describes the “allure” of his new model: “The original CLM model underestimates above-ground biomass but overestimates the harvest index of maize, leading to apparent right-yield simulation with the wrong mechanism. Our new model corrected this deficiency.” Yes, I love the diversity of topics I get to cover in this newsletter.

Gary Marcus looks back over the last five years of developments in AI (and is, of course, drawing on earlier advances) to offer “ten concerns for deep learning.” I would love to see a response from Yann LeCun.



Finally, I’m not one to push anyone out of academia, but if you are wondering what life would be like on the outside, Robyn Rap’s post about leaving academia to work as a data scientist in industry is based on 11 interviews with people who have done so. She notes the value of having a job that does not require one to move every 3-5 years to places not of one’s choosing and the freedom of having a job that does not own one’s mental space 24/7. Industry data scientists apparently stop thinking about work when they leave at the end of the day.


Study: Artificial intelligence hindered by talent shortage

Silicon Valley Business Journal, Jennifer Elias


“This year, as businesses strategized how to integrate AI into their operations, they were hampered by a shortage of experts with requisite knowledge of the technology,” Ernst & Young chief analytics officer Chris Mazzei said in a press release. “This serves to demonstrate that successful AI integration is not just about the technology, it’s about the people.”


How Law and Computer Science Can Work Together to Improve the Information Society

Communications of the ACM, Chris Marsden


In this column, I explore the various means by which lawyers can be helped by computer scientists to stop the (inevitable) collateral damage to innovation when the unstoppable force of legislation hits the irresistible innovation of the Internet. I will explore some current controversies (fake news, Net neutrality, platform regulation) from an international perspective. The conclusion is familiar: lawyers and computer scientists need each other to prevent a disastrous retrenchment toward splintered national-regional intranets. To avoid that, we need to be intellectually pragmatic in pursuing what may be a mutually disagreeable aim: minimal legislative reform to achieve co-regulation using the most independent expert advice. The alternatives are to allow libertarian advocates to so enrage politicians that severe overregulation results.


Attitudes and norms affecting scientists’ data reuse

PLOS One; Renata Gonçalves Curty, Kevin Crowston, Alison Specht, Bruce W. Grant, Elizabeth D. Dalton


The value of sharing scientific research data is widely appreciated, but factors that hinder or prompt the reuse of data remain poorly understood. Using the Theory of Reasoned Action, we test the relationship between the beliefs and attitudes of scientists towards data reuse, and their self-reported data reuse behaviour. To do so, we used existing responses to selected questions from a worldwide survey of scientists developed and administered by the DataONE Usability and Assessment Working Group (thus practicing data reuse ourselves). Results show that the perceived efficacy and efficiency of data reuse are strong predictors of reuse behaviour, and that the perceived importance of data reuse corresponds to greater reuse. Expressed lack of trust in existing data and perceived norms against data reuse were not found to be major impediments for reuse contrary to our expectations. We found that reported use of models and remotely-sensed data was associated with greater reuse. The results suggest that data reuse would be encouraged and normalized by demonstration of its value. We offer some theoretical and practical suggestions that could help to legitimize investment and policies in favor of data sharing.


What can machine learning do? Workforce implications

Science, Policy Forum, Erik Brynjolfsson and Tom Mitchell


Digital computers have transformed work in almost every sector of the economy over the past several decades (1). We are now at the beginning of an even larger and more rapid transformation due to recent advances in machine learning (ML), which is capable of accelerating the pace of automation itself. However, although it is clear that ML is a “general purpose technology,” like the steam engine and electricity, which spawns a plethora of additional innovations and capabilities (2), there is no widely shared agreement on the tasks where ML systems excel, and thus little agreement on the specific expected impacts on the workforce and on the economy more broadly. We discuss what we see to be key implications for the workforce, drawing on our rubric of what the current generation of ML systems can and cannot do [see the supplementary materials (SM)]. Although parts of many jobs may be “suitable for ML” (SML), other tasks within these same jobs do not fit the criteria for ML well; hence, effects on employment are more complex than the simple replacement and substitution story emphasized by some. Although economic effects of ML are relatively limited today, and we are not facing the imminent “end of work” as is sometimes proclaimed, the implications for the economy and the workforce going forward are profound.


Is “Big Data” racist? Why policing by data isn’t necessarily objective

Ars Technica, Andrew Ferguson


Algorithmic technologies that aid law enforcement in targeting crime must compete with a host of very human questions. What data goes into the computer model? After all, the inputs determine the outputs. How much data must go into the model? The choice of sample size can alter the outcome. How do you account for cultural differences? Sometimes algorithms try to smooth out the anomalies in the data, anomalies that can correspond with minority populations. How do you address the complexity in the data or the “noise” that results from imperfect results?

The choices made to create an algorithm can radically impact the model’s usefulness or reliability. To examine the problem of algorithmic design, imagine that police in Cincinnati, Ohio, have a problem with the Bloods gang, a national criminal gang, originating out of Los Angeles, that signifies membership by wearing the color red.


Extra Extra

@WhatTheHerp is a Twitter bot to which people submit photos of snakes and frogs for Fitch, another robot, to identify. Fitch needs a little help with his training data, so go be a citizen scientist and tag snakes you can identify.



Back in 1968 Douglas Engelbart gave what has since been dubbed ‘the mother of all demos’, showing fellow engineers a punch-card-free computer with a mouse, and ended by alluding to what would become the internet. Such a great read about computer science history.



Trust will be crucial for successful digital technology in the near future, according to Bhaskar Chakravorti from Tufts.

Julia Angwin, Alondra Nelson, and Virginia Eubanks are going to be speaking at Data & Society on January 17th at 4 pm. I know this event is already listed in our events section, but these women are so damn smart and tough and steely when it comes to speaking truth to (automated) power. It’s worth a double mention even though it’s already oversubscribed. Come listen to the livestream with me at NYU CDS. 🙂


How NASA’s Search for ET Relies on Advanced AI

Scientific American, Larry Greenemeier


Jet Propulsion Laboratory’s artificial intelligence chief describes the “ultimate” test for AI in space exploration


Berkeley Scientists Are Building a Quantum Computer

University of California-Berkeley, California Magazine, Kali Persall


Currently, working prototypes of quantum computers do exist, albeit probably fewer than a dozen. And scientists at LBL are building their own. In September, the Department of Energy awarded them a grant of $3 million per year to construct a quantum computer and the software needed to operate it.

The hardware team, co-led by [Irfan] Siddiqi and his colleague at the lab Jonathan Carter, will build the computer and study its properties and applications over five years. The software team of a dozen researchers, led by de Jong, will design algorithms and even new math over three years.


How Do You Vote? 50 Million Google Images Give a Clue

The New York Times, Steve Lohr


What vehicle is most strongly associated with Republican voting districts? Extended-cab pickup trucks. For Democratic districts? Sedans.

Those conclusions may not be particularly surprising. After all, market researchers and political analysts have studied such things for decades.

But what is surprising is how researchers working on an ambitious project based at Stanford University reached those conclusions: by analyzing 50 million images and location data from Google Street View, the street-scene feature of the online giant’s mapping service.

For the first time, helped by recent advances in artificial intelligence, researchers are able to analyze large quantities of images, pulling out data that can be sorted and mined to predict things like income, political leanings and buying habits. In the Stanford study, computers collected details about cars in the millions of images it processed, including makes and models.


Behind-the-scenes of Amazon bid, state sought personal touch

The Boston Globe, Tim Logan


Early on, Governor Charlie Baker hoped to schedule a phone call with Amazon chief executive Jeff Bezos to tout the state directly to the boss. He didn’t get one. He did, though, talk with Jay Carney, White House press secretary under President Obama and now senior vice president for corporate affairs at Amazon. And there were other talks. The state’s economic development secretary, Jay Ash, and his aides spoke and e-mailed several times with Amazon’s head of worldwide economic development. She responded within 10 minutes to Ash’s first message to Amazon’s HQ2 e-mail address and personally confirmed receipt of the Massachusetts bid on deadline day.


Greater Data Science at Baccalaureate Institutions

ASA, Journal of Computational and Graphical Statistics; Amelia McNamara, Nicholas J. Horton and Benjamin S. Baumer


Donoho’s paper is a spirited call to action for statisticians, who he points out are losing ground in the field of data science by refusing to accept that data science is its own domain. (Or, at least, a domain that is becoming distinctly defined.) He calls on writings by John Tukey, Bill Cleveland, and Leo Breiman, among others, to remind us that statisticians have been dealing with data science for years, and encourages acceptance of the direction of the field while also ensuring that statistics is tightly integrated.

As faculty at baccalaureate institutions (where the growth of undergraduate statistics programs has been dramatic (American Statistical Association 2015, “A peek into the largest, fastest-growing undergraduate statistics departments,” http://magazine.amstat.org/blog/2015/02/01/undergraduatedepts_feb2015)), we are keen to ensure statistics has a place in data science and data science education. In his paper, Donoho is primarily focused on graduate education. At our undergraduate institutions, we are considering many of the same questions. [full text]

 
Events



Code Freeze 2018: Microservice Architectures

University of Minnesota


Minneapolis, MN Thursday, January 11. “The University of Minnesota’s Software Engineering Center and DevJam bring you the 13th annual Code Freeze symposium.” This year, our theme is Microservice Architectures. [$$$]


Experimental Animation & Video with Code – P0510-18

Anderson Ranch Arts Center


Snowmass Village, CO July 2-6 with instructor Casey Reas at Anderson Ranch Arts Center.


LADS 2018: The 2018 Liberal Arts Data Science Workshop

New College of Florida


Sarasota, FL January 12-13 at New College of Florida. “This two-day workshop brings together statisticians and computer scientists to discuss how we are introducing data science at liberal arts colleges now and how we should integrate it in the future. The workshop is open to anyone interested in the topic.” [free, registration required]

 
Deadlines



NVIDIA AI City Challenge

“The NVIDIA AI City Challenge Workshop at CVPR 2018 will specifically focus on ITS problems such as: Estimating traffic flow and volume; Leveraging unsupervised approaches to detect anomalies such as lane violation, illegal U-turns, etc.; Multi-camera tracking, and object re-identification in urban environments.” Deadline for proposal submissions is January 10.


CTSP | Data for Good Competition — Call for Proposals

The Center for Technology, Society & Policy (CTSP) seeks proposals for a Data for Good Competition. The competition will be hosted and promoted by CTSP in coordination with the UC Berkeley School of Information IMSA, and made possible through funds provided by Facebook. Deadline for proposals is January 28.

The Statistics in Sports section of Amstat News is running another undergrad research contest for #jsm2018 in Vancouver.

Deadline for poster abstract submissions is February 1.

Data Visualization on Mobile Devices workshop

Montreal, Canada April 21, part of CHI 2018. Deadline for workshop paper submissions is February 2.

D4R | Data for Refugees Turkey

“The project will run from December 2017 to December 2018. A ‘D4R Committee’, composed of members of the refugee community, ministries, universities and institutions supporting the initiative, will evaluate the proposals. The selected teams will be granted access to the entire dataset under a carefully drafted user agreement. The best projects in each category will be awarded publicly.” [registration required]
 
Tools & Resources



Making Data Visual by Miriah Meyer, Danyel Fisher

O’Reilly Media


“You have a mound of data in front of you and a suite of computation tools at your disposal. Which parts of the data actually matter? Where is the insight hiding? If you’re a data scientist trying to navigate the murky space between data and insight, this practical book shows you how to make sense of your data through high-level questions, well-defined data analysis tasks, and visualizations to clarify understanding and gain insights along the way.”


Get Up To Speed Fast As A Junior Data Scientist

Data Science Weekly


You’ve got the job and have started as a junior data scientist. To get your bearings as quickly as possible you need to ask three main questions.

  1. What are the most important Key Performance Indicators in this domain?
  2. What are the most relevant classic case studies in this domain?
  3. Who are the industry thought leaders (internal & external) I should be learning from?

Reverse Curriculum Generation for Reinforcement Learning Agents

The Berkeley Artificial Intelligence Research Blog, Carlos Florensa


“Reinforcement Learning (RL) is a powerful technique capable of solving complex tasks such as locomotion, Atari games, racing games, and robotic manipulation tasks, all through training an agent to optimize behaviors over a reward function. There are many tasks, however, for which it is hard to design a reward function that is both easy to train and that yields the desired behavior once optimized. Suppose we want a robotic arm to learn how to place a ring onto a peg. The most natural reward function would be for an agent to receive a reward of 1 at the desired end configuration and 0 everywhere else. However, the required motion for this task–to align the ring at the top of the peg and then slide it to the bottom–is impractical to learn under such a binary reward, because the usual random exploration of our initial policy is unlikely to ever reach the goal, as seen in Video 1a. Alternatively, one can try to shape the reward function to potentially alleviate this problem, but finding a good shaping requires considerable expertise and experimentation. For example, directly minimizing the distance between the center of the ring and the bottom of the peg leads to an unsuccessful policy that smashes the ring against the peg, as in Video 1b. We propose a method to learn efficiently without modifying the reward function, by automatically generating a curriculum over start positions.”
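
For readers who want to see the idea in miniature, here is a toy sketch (mine, not the authors’ code) of a curriculum over start positions under a binary reward. It swaps the robot arm for a one-dimensional corridor and uses tabular Q-learning as the learner. Every name and number in it is an illustrative assumption: `reverse_curriculum`, the `r_min`/`r_max` success-rate band, the fixed neighbor offsets used to propose new starts, and the fallback used when no start falls inside the band.

```python
import random

LEFT, RIGHT = 0, 1

def step(state, action, goal):
    """Move one cell along cells 0..goal; the reward is 1 only on reaching the goal."""
    nxt = min(goal, max(0, state + (1 if action == RIGHT else -1)))
    return nxt, (1.0 if nxt == goal else 0.0)

def choose(q, state, rng, eps):
    """Epsilon-greedy action choice, breaking value ties at random."""
    left, right = q[state]
    if rng.random() < eps or left == right:
        return rng.choice((LEFT, RIGHT))
    return RIGHT if right > left else LEFT

def run_episode(q, start, goal, rng, horizon=20, eps=0.1, alpha=0.5, gamma=0.9):
    """Run one Q-learning episode from `start`; return True if the goal was reached."""
    state = start
    for _ in range(horizon):
        action = choose(q, state, rng, eps)
        nxt, reward = step(state, action, goal)
        target = reward if nxt == goal else gamma * max(q[nxt])
        q[state][action] += alpha * (target - q[state][action])
        if nxt == goal:
            return True
        state = nxt
    return False

def reverse_curriculum(goal=10, iterations=60, episodes=30, r_min=0.1, r_max=0.9, seed=0):
    """Grow the set of start cells outward from the goal while training."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(goal + 1)]  # q[state][action]
    starts = [goal - 1]                        # begin right next to the goal
    seen = set(starts)
    for _ in range(iterations):
        wins = {s: 0 for s in starts}
        tries = {s: 0 for s in starts}
        for _ in range(episodes):
            s = rng.choice(starts)
            tries[s] += 1
            wins[s] += run_episode(q, s, goal, rng)
        # "Good" starts are neither hopeless nor already mastered.
        good = [s for s in starts if tries[s] and r_min <= wins[s] / tries[s] <= r_max]
        seeds = good or starts                 # assumed fallback so the curriculum never stalls
        proposals = {s + d for s in seeds for d in (-2, -1, 1) if 0 <= s + d < goal}
        seen |= proposals
        replay = rng.sample(sorted(seen), min(2, len(seen)))  # revisit a few old starts
        starts = sorted(proposals | set(replay))
    return q, seen

q, seen = reverse_curriculum()
print("start cells visited by the curriculum:", sorted(seen))
print("learned value of stepping right from each cell:", [round(row[RIGHT], 2) for row in q])
```

The point of the toy is the ordering: the learner only ever gets the 0-or-1 reward, but because early episodes begin one step from the goal, that reward is actually reachable, and later starts are proposed next to cells the learner has already been trained from.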


How to save everything you post to social media

Popular Science, David Nield


“If you get the urge to revisit that cute photo you posted some time last year, you’ll have to scroll through your timeline for what feels like hours to track it back down. Instead, when you share a post on social media, also save it to your phone for safe-keeping. This will not only save your social media hits for posterity, but also make them easier to find if you ever need to rediscover them.”
