Data Science newsletter – August 30, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for August 30, 2017


 
 
Data Science News



Company Data Science News

Harvard Business Review reported on a McKinsey survey of 3,073 executives, which found that only “20% of our survey respondents [are] using one or more AI technologies at scale or in a core part of their business.” The upshot is that we are at the very beginning of the growth curve when it comes to applied data science in industry. This is a critically important time for academic leaders to shape the future of how applications are deployed. Let’s not waste this opportunity to institute some ethical guidelines.



Google’s Project Loon – part of Alphabet’s X unit – now has 10 balloons flying 20 km above Kenya, providing internet service from *above* the clouds. Just when you thought the cloud was the top, now Google has gone above and beyond the clouds into the stratosphere. The balloons are guided into place by algorithms fed wind-speed sensor data. They are powered by the sun; lithium-ion batteries keep them going overnight. This may be the future of the internet. As in the past, the balloons have been leased to existing telecoms, which hold the necessary radio-frequency licenses.

Google and Apple are not the kings of all things. Reportedly, they are both scaling back on their self-driving car projects.

Y Combinator is trying to appeal to AI startups by offering training in image recognition and autonomy, “desirable” data sets, and computing resources for intensive applications. This is such a predictable move it hardly seems like news. I included it because the journalist described exciting unnamed datasets as “desirable” and AI as the “ability to forge gold from iron.” Data is so hot right now!



Yves Saint Laurent also thinks AI is hot, hot enough to feature Stanford grad Alexandre Robicquet in a full page ad and a video. He looks good. I hope I’m allowed to say that.

Neurable is a little gaming startup that makes games controlled by brain waves. Players wear headsets that project VR environments and detect their brain waves. All the actions are guided by thought. This type of technology is also being used to try to give amputees more natural control over prosthetic limbs. This is *definitely* a technology to watch.



Athelas is a biotech startup using image recognition and deep neural nets to deliver fingerprick complete blood count (CBC) blood work. Results are available in two minutes, and the device is small and simple enough to use at home. Athelas isn’t offering the range of lab work that Theranos offered (yet), but it is validating its results (r = 0.96) against venous blood draws tested at the gold-standard LabCorp. The company has received $3.7m in funding from Sequoia. I’m only surprised that they aren’t investing at a higher level.
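That r = 0.96 is a Pearson correlation between the device’s counts and the reference lab’s. A minimal sketch of that kind of validation check, with made-up readings (the numbers below are purely illustrative, not Athelas data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical white-cell counts (thousands per microliter):
device = [4.1, 6.8, 5.5, 9.2, 3.7, 7.4]   # fingerprick device
lab    = [4.3, 6.5, 5.9, 9.0, 3.5, 7.8]   # venous draw at the reference lab

print(round(pearson_r(device, lab), 3))
```

An r near 1 means the device readings track the lab readings almost linearly, which is the claim being validated.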



Ernst & Young, one of the largest accounting firms in the US, is using calendar and email data from its employees to “identify patterns around who is engaging with whom, which parts of the organization are under stress, and which individuals are most active in reaching across company boundaries.” There is a huge expansion in the use of data available to employers – email, chat transcripts, calendars, call logs – to manage their human resources. In the study linked above, and in an email-data pilot study I ran as part of my dissertation, only metadata was used (in my case, recipient names and email addresses were stripped before I saw the data, too). From off-the-record conversations I’ve had, it is clear that many companies are trying out data-driven management but are concerned about employee privacy and other rights. If you find stories about human resource management, please send them my way so we can work through together what constitutes beneficent use of human resource data.
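To make “only metadata” concrete, here is a minimal, hypothetical sketch (field names invented) of reducing a message record to analysis-safe metadata: content fields are dropped and identifiers are replaced with salted hashes so patterns of who-talks-to-whom survive while names do not.

```python
import hashlib

def to_metadata(message, salt="rotate-me"):
    """Reduce a full email record to metadata: keep timing and structure,
    drop content, and pseudonymize the people involved."""
    def pseudonym(value):
        return hashlib.sha256((salt + value.lower()).encode()).hexdigest()[:12]
    return {
        "sender": pseudonym(message["from"]),
        "recipients": [pseudonym(a) for a in message["to"]],
        "sent_at": message["sent_at"],       # timing is kept for pattern analysis
        "thread_id": pseudonym(message["thread_id"]),
        # subject and body are deliberately omitted
    }

msg = {
    "from": "alice@example.com",
    "to": ["bob@example.com", "carol@example.com"],
    "sent_at": "2017-08-29T14:03:00Z",
    "thread_id": "t-8841",
    "subject": "Q3 numbers",
    "body": "(full text never leaves the source system)",
}
print(to_metadata(msg))
```

The salt would need real key management in practice; the point of the sketch is only that network structure and timing are recoverable without any message content.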

Movidius, an image processing firm active in the drone space that was purchased by Intel, is shipping new Myriad X hardware for deep learning. The chips have a “microarchitecture in [the] hardware to accelerate deep-learning inference,” moving that computing out of software into the hardware. Their big advantage over many other chipmakers? They actually ship.

Walt Disney Company wrote a case study about how they are using machine learning to improve their coverage of soccer and basketball by “teaching automatic cameras to be more human-like.”

Microsoft and Amazon have decided to let their virtual assistants, Alexa and Cortana, enter into a relationship. This is a big deal – almost like an arranged marriage of offspring – for these two tech behemoths.

Zillow explains how they concoct their famous Zestimate, the estimate of how much a piece of property will sell for. Andrew Martin takes a close look at the problem of time series modeling and feature engineering.
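For a flavor of the feature engineering that post discusses, here is a hedged sketch of two common time-based features for home-price models; the sale records, window size, and function names below are invented for illustration, not Zillow’s actual pipeline:

```python
from datetime import date
from statistics import median

# Hypothetical sale records for one neighborhood: (sale_date, price)
sales = [
    (date(2017, 3, 1), 410_000),
    (date(2017, 4, 15), 425_000),
    (date(2017, 6, 2), 433_000),
    (date(2017, 7, 20), 440_000),
]

def trailing_median(sales, as_of, window_days=180):
    """Median sale price in the trailing window -- a classic 'comps' feature."""
    recent = [p for d, p in sales if 0 <= (as_of - d).days <= window_days]
    return median(recent) if recent else None

def days_since_last_sale(sales, as_of):
    """Recency feature: how stale is the newest comparable sale?"""
    return min((as_of - d).days for d, p in sales if d <= as_of)

as_of = date(2017, 8, 30)
print(trailing_median(sales, as_of))       # 433000 (March sale falls outside the window)
print(days_since_last_sale(sales, as_of))  # 41
```

Features like these give a model a moving snapshot of the local market, which is exactly where the time-series headaches Martin describes come from: the features must be computed as of each prediction date, never with future sales leaking in.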

YouTube has managed to capture us in cycles of video-followed-by-video chaining when we only intended to watch one. How did they figure us out?! They sort of explain here, but it still boggles my mind how many hours humans can collectively spend watching cats try to fit themselves into boxes.

Facebook is opening new office space in Kendall Square near MIT.



Netflix is not headed by Todd Yellin as I erroneously reported last week. Thank you to Claus Herther for the correction. Netflix’s CEO is Reed Hastings who co-founded the firm with Marc Randolph. Yellin is Netflix’s VP of Product Innovation.


New Major Addresses Fast-Growing Field of Data Science

Bethel University



Deborah Thomas, associate professor of math and computer science, explains that preparations for this major have been in process for a while. “We were already offering the courses, getting students exposed to the field that is very hot right now,” she says. Since data science can be applied to any field, this major will allow graduates to enter the job market with ease. “The amount of data being collected is constantly increasing and will never go away,” says Thomas. “Most often, we have acquired a ton of data with no easy way of extracting hidden patterns from it. Data science allows us to do that easily. The number of opportunities for our students will only increase.”


CMU receives $1M from Chicago software company to fund machine learning projects for the social good

TribLIVE, Aaron Aupperlee



Carnegie Mellon University’s School of Computer Science received $1 million to pay for machine learning projects.

There’s a catch, however.

The students and faculty must use the money for good.

Uptake, a Chicago-based predictive analytics software company, gave the money to CMU to establish the Machine Learning for Social Good fund.


Beyond Deep Learning: a Case Study in Sports Analytics

DataScience.com, Hoang Le



When it comes to televised sporting events, the ability of the cameras to follow the action — both smoothly and accurately — can make or break the viewing experience. That’s why, last year, alongside California Institute of Technology Assistant Professor Yisong Yue, I and a team of collaborators from Disney Research and the University of British Columbia published a paper on automating broadcast cameras using machine learning.

The project was designed to improve the television broadcast experience by teaching automatic cameras to be more human-like. This involved creating machine learning algorithms that trained automated cameras to strike a delicate balance between capturing all movements accurately and maintaining a smooth viewing experience. The Walt Disney Company plans to use this research to improve its soccer and basketball coverage.

Before I elaborate on the strategies we used to achieve this objective, which involved combining a deep neural network with a model-based approach, I want to take a step back and discuss, at a higher level, machine learning, deep learning, and imitation learning.


New app could use smartphone selfies to screen for pancreatic cancer

University of Washington, UW News



University of Washington researchers are developing an app that could allow people to easily screen for pancreatic cancer and other diseases — by snapping a smartphone selfie.

BiliScreen uses a smartphone camera, computer vision algorithms and machine learning tools to detect increased bilirubin levels in a person’s sclera, or the white part of the eye. The app is described in a paper to be presented Sept. 13 at Ubicomp 2017, the Association for Computing Machinery’s International Joint Conference on Pervasive and Ubiquitous Computing.


Government Data Science News

Regina Bateson quit her job as a professor of political science at MIT to move back to California and run for office. Bateson4Congress!!!

NSF has lowered its 2018 budget request from 2016 levels by 29% for fellowships and scholarships and by 16% for postdoctoral programs. The Interdisciplinary Research in Behavioral and Social Sciences Program was cut altogether, as was the International Research Program. Astronomy and Astrophysics, however, gets an increase in its budget request, and Math remains level.

NSF has announced $17.7m for 12 projects in the Transdisciplinary Research in Principles of Data Science (TRIPODS) program. In this particular program, transdisciplinary work spans only three disciplines: statistics, mathematics, and theoretical computer science. Ethnographer David Ribes presented work over the weekend at 4S suggesting that the mathematicians had to fight their way into that group, making me wonder whether a third discipline had to step aside to make space for math, or whether we would otherwise have had a “tri”-disciplinary program consisting only of statistics and computer science. I wish that at least one discipline with a more robust, present sense of ethical accountability had been included in this transdisciplinary effort.

NOAA and NASA have developed a new forecasting model that was effective in predicting Harvey and is likely to replace less accurate models going forward.

The 2020 Census is still predicted to be inaccurate, a problem we have flagged in previous newsletters that hasn’t gone away. The Census still has no leader. Brookings Institution has a new report on how bad it could be, noting, “Its state-by-state counts determine how the 435 members of the House of Representatives are allocated among the states, and its counts by ‘Census block’ (roughly a neighborhood) shape how members of state legislatures and many city councils are allocated in those jurisdictions.”

Norway is taking a national step towards open access by making “all publicly funded Norwegian research articles openly available by 2024…without increasing total costs.”


The Marketing Behind MongoDB

nemil.com



Countless NoSQL databases competed to be the database of choice. MongoDB’s marketing strategy helped it become the winner.


Grant Will Advance Standards Promoting Open, High-Quality Data

Eos, JoAnna Wendel



Scientific advancement increasingly relies on large and complicated databases and models to thrive. Earth and space scientists, in particular, often grapple with complex systems to help society manage the threats of natural hazards and climate change. The scientific community faces many challenges to preserve often nonreproducible data and to catalog and store them in ways that are searchable and understandable.

To address these challenges, the Laura and John Arnold Foundation has awarded a grant to a coalition of scientific groups, convened by the American Geophysical Union (AGU), representing the international Earth and space science community. The grant will allow the coalition to develop standards to connect researchers, publishers, and data repositories across the sciences. These standards are meant to render data findable, accessible, interoperable, and reusable (FAIR), with the aim of enhancing the integrity and reproducibility of data and accelerating scientific discovery.


DARPA’s Drive to Keep the Microelectronics Revolution at Full Speed Ups Its Own Momentum

DARPA



To perpetuate the pace of innovation and progress in microelectronics technology over the past half-century, it will take an enormous village rife with innovators. This week, about 100 of those innovators throughout the broader technology ecosystem, including participants from the military, commercial, and academic sectors, gathered at DARPA headquarters at the kickoff meeting for the Agency’s new CHIPS program, known in long form as the Common Heterogeneous Integration and Intellectual Property (IP) Reuse Strategies program.

“Now we are moving beyond pretty pictures and mere words, and we are rolling up our sleeves to do the hard work it will take to change the way we think about, design, and build our microelectronic systems,” said Dan Green, the CHIPS program manager. The crux of the program is to develop a new technological framework in which different functionalities and blocks of intellectual property—among them data storage, computation, signal processing, and managing the form and flow of data—can be segregated into small chiplets, which then can be mixed, matched, and combined onto an interposer, somewhat like joining the pieces of a jigsaw puzzle. Conceivably an entire conventional circuit board with a variety of different but full-sized chips could be shrunk down onto a much smaller interposer hosting a huddle of yet far smaller chiplets.


Even Artificial Neural Networks Can Have Exploitable ‘Backdoors’

WIRED, Security, Tom Simonite



Early in August, NYU professor Siddharth Garg checked for traffic, and then put a yellow Post-it onto a stop sign outside the Brooklyn building in which he works. When he and two colleagues showed a photo of the scene to their road-sign detector software, it was 95 percent sure the stop sign in fact displayed a speed limit.

The stunt demonstrated a potential security headache for engineers working with machine learning software. The researchers showed that it’s possible to embed silent, nasty surprises into artificial neural networks, the type of learning software used for tasks such as recognizing speech or understanding photos.

Malicious actors can design that behavior to emerge only in response to a very specific, secret signal, as in the case of Garg’s Post-it. Such “backdoors” could be a problem for companies that want to outsource work on neural networks to third parties, or build products on top of freely available neural networks available online.


New 3D scanning campaign will reveal 20,000 animals in stunning detail

Science, Latest News, Ryan Cross



Known as the “Fish Guy,” Adam Summers earned his moniker for an odd hobby turned academic obsession: giving dead fish a computerized tomography, or CT, scan. The biomechanist at Friday Harbor Laboratories on San Juan Island, Washington, has been haphazardly scanning fluid-preserved collections of fish for 20 years—”I literally traded Snickers bars for CT scans when I started out,” he says—to create detailed 3D representations of the animals and study the intricacies of their internal architecture. Whenever he posted the beautifully rendered skeletons online, fellow fish admirers would eagerly ask what was up next. Summers’s half-joking reply: “Don’t worry, they are all next. I am scanning all fishes.”

Then last year David Blackburn, a herpetologist at the Florida Museum of Natural History in Gainesville, saw Summers’s #scanAllFish hashtag on Twitter and light-heartedly countered that he would “scan all frogs.”


Hurricane Harvey provides lab for U.S. forecast experiments

Science, ScienceInsider, Paul Voosen



For years, U.S. forecasters have envied their colleagues at the European Centre for Medium-Range Weather Forecasts (ECMWF) in Reading, U.K., whose hurricane prediction models remain the gold standard. Infamously, the National Weather Service (NWS) in 2012 failed to predict Hurricane Sandy’s turn into New Jersey, whereas ECMWF was spot on. But two innovations tested during Hurricane Harvey, one from NASA and another from the National Oceanic and Atmospheric Administration (NOAA), could help level the playing field.

NOAA’s offering is a brand-new forecasting model. Two years ago, NOAA’s Geophysical Fluid Dynamics Laboratory (GFDL) in Princeton, New Jersey, won a competition to provide the computer code for the next-generation weather model of NWS. Current NWS models must wait for results from a time-consuming global simulation before they can zoom in on a smaller area and run a high-resolution model for hurricanes.


Smart Cities, Stupid Cities, and How Data Can be Used to Solve Urban Policy Problems

Tech At Bloomberg, Ester R. Fuchs



Cities which are effectively using data to solve policy problems often choose to partner with universities or contract out to businesses for technical assistance in collecting and analyzing data. This can be a complex and, often, overwhelming process for cities without their own capacity to evaluate technology and analyze data.

Recently, Columbia University’s School of International and Public Affairs and Data Science Institute had the opportunity to partner with New York City’s Department of Environmental Protection on a project called “Stopping Trash Where It Starts.” The City’s goal was to limit floatable trash in the waterways as part of an overall plan to reduce water pollution. Since the City’s data collection efforts indicated that street litter is the primary source of floatable trash, our project focused on helping the City develop a better understanding of the causes of street litter. Our data analysis was intended to inform policy recommendations and new initiatives for reducing litter on the streets.

 
Events



Summit on Technology and Jobs

Computing Research Association



Washington, DC December 12, organized by Computing Research Association. “The goal of the summit is to put the issue of technology and jobs on the national agenda in an informed and deliberate manner.” [registration opens on September 18]


#REBNYTech Hackathon 2017

Real Estate Board of NY



New York, NY October 13-15. 200 of the brightest minds in the property and technology industries will hack cutting-edge solutions to real-world challenges faced by the world’s leading real estate companies. [$$]

 
Tools & Resources



How We Designed CrateDB as a Realtime SQL DBMS for the Internet of Things

The New Stack, Jodok Batlogg



CrateDB operates in a shared-nothing architecture as a cluster of identically configured servers (nodes) that coordinate seamlessly with each other. Execution of write and query operations is automatically distributed across the nodes in the cluster.

Increasing or decreasing database capacity is a simple matter of adding or removing nodes. We worked hard on the “simple” part by automating the sharding, replication (for fault tolerance), and rebalancing of data as the cluster changes size. CrateDB was born in the container era and allows you to scale and administer it easily via container orchestration platforms like Docker or Kubernetes in a microservices environment.
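The rebalancing idea in that paragraph can be sketched in miniature. This is not CrateDB’s actual algorithm (real systems also try to minimize how many shards move, e.g. via consistent hashing); it only illustrates the contract: shards spread evenly across nodes, and adding a node triggers reassignment. Node and shard names are invented.

```python
def rebalance(shards, nodes):
    """Assign shards round-robin so each node holds an even share."""
    assignment = {node: [] for node in nodes}
    for i, shard in enumerate(shards):
        assignment[nodes[i % len(nodes)]].append(shard)
    return assignment

shards = [f"shard-{i}" for i in range(12)]

three_nodes = rebalance(shards, ["node-a", "node-b", "node-c"])
# 12 shards over 3 nodes: each node holds 4

four_nodes = rebalance(shards, ["node-a", "node-b", "node-c", "node-d"])
# after adding node-d: each node holds 3
print({node: len(owned) for node, owned in four_nodes.items()})
```

The automation CrateDB advertises is doing this reassignment (plus replica placement for fault tolerance) behind the scenes whenever the cluster grows or shrinks.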


[1708.08719] Better together? Statistical learning in models made of modules

arXiv, Statistics > Methodology; Pierre E. Jacob, Lawrence M. Murray, Chris C. Holmes, Christian P. Robert



“In modern applications, statisticians are faced with integrating heterogeneous data modalities relevant for an inference, prediction, or decision problem.” … “In this article, we investigate why these modular approaches might be preferable to the full model in misspecified settings. We propose principled criteria to choose between modular and full-model approaches. The question arises in many applied settings, including large stochastic dynamical systems, meta-analysis, epidemiological models, air pollution models, pharmacokinetics-pharmacodynamics, and causal inference with propensity scores.”
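For readers new to “models made of modules,” a common two-module formulation (notation mine, consistent with the cut-model literature the paper builds on) makes the contrast concrete:

```latex
% Module 1 links data Y_1 to parameter \theta_1; module 2 links Y_2 to
% (\theta_1, \theta_2). The full posterior lets information flow both ways:
\[
p(\theta_1, \theta_2 \mid Y_1, Y_2)
  \propto p(\theta_1)\, p(\theta_2)\,
          p(Y_1 \mid \theta_1)\, p(Y_2 \mid \theta_1, \theta_2)
\]
% The modular ("cut") alternative blocks feedback from a possibly
% misspecified module 2 into \theta_1:
\[
p_{\mathrm{cut}}(\theta_1, \theta_2)
  = p(\theta_1 \mid Y_1)\, p(\theta_2 \mid \theta_1, Y_2)
\]
```

The paper’s question is when the second, “cut” object should be preferred to the full posterior, and how to choose between them in a principled way.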


Tutorial sobre scikit-learn

GitHub – pagutierrez



[In Spanish] This repository contains materials for a brief tutorial on scikit-learn in Python. It is based on the scikit-learn tutorial given at the SciPy 2017 conference (see references).
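For anyone who hasn’t touched scikit-learn before, a minimal example in the spirit of such tutorials – fit and score a classifier on the built-in iris dataset (the classic demo set; the classifier choice here is mine, not necessarily the repository’s):

```python
# Minimal scikit-learn workflow: load data, split, fit, evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```

Nearly every scikit-learn estimator follows this same fit/predict/score pattern, which is what makes the library so tutorial-friendly.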


A smarter way to jump into data lakes

McKinsey & Company, Mikael Hagstroem, Matthias Roggendorf, Tamim Saleh, and Jason Sharma



There is a lot for companies to like about data lakes. Because data are loaded in “raw” formats rather than preconfigured as they enter company systems, they can be used in ways that go beyond just basic capture. For instance, data scientists who may not know exactly what they are looking for can find and access data quickly, regardless of format. Indeed, a well-maintained and governed “raw data zone” can be a gold mine for data scientists seeking to establish a robust advanced-analytics program. And as companies extend their use of data lakes beyond just small pilot projects, they may be able to establish “self-service” options for business users in which they could generate their own data analyses and reports.

However, it can be time consuming and complicated to integrate data lakes with other elements of the technology architecture, establish appropriate rules for company-wide use of data lakes, and identify the supporting products, talent, and capabilities needed to deploy data lakes and realize significant business benefits from them. For instance, companies typically lack expertise in certain data-management approaches and need to find staffers who are fluent in emerging data-flow technologies such as Flume and Spark.

 
Careers


Full-time positions outside academia

Public Policy Manager (Artificial Intelligence and Emerging Technology)



Google; Mountain View, CA

Sevilleta Information Manager



The Long Term Ecological Research Network; Socorro, NM
Internships and other temporary positions

Paid internship in Data Science/Machine Learning Journalism



KDnuggets; Anywhere
