Data Science newsletter – September 10, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for September 10, 2018


Data Science News

I am Bruce Schneier, cybersecurity expert, author, and #PublicInterestTech AMA


… I’m here to answer all of your burning questions about the promise and perils of our interconnected, digitized world.

Company Data Science News

IBM and the NYPD secretly joined forces to search by hair color and/or skin tone using data from the surveilling blanket of cameras on NYC streets. The data ethics community boiled over in sad rage, reminding IBM of their sordid racist past, when they developed punch card technology to help the Nazi state catalogue its population, indicating which members were Jews or Gypsies. The cataloguing project greatly facilitated the Holocaust.

Mark Zuckerberg is media shy, so it is notable that he spoke to The New Yorker reporter Evan Osnos on four separate occasions about the ideological (and existential) challenges he is confronting at Facebook. Osnos comments that “[Zuckerberg’s] discomfort with losing is undimmed.”

Zuck prefers to reach his audience directly and, true to form, he posted a 3,200-word explanation of all that Facebook is grappling with in this “intense” year. He frames the issues the company is facing as a security problem – an “arms race” against “sophisticated, well-funded adversaries”. This is a brilliant PR move. He is drawing on lessons from political fixers who recommend militaristic posturing against outsiders when domestic problems threaten to erupt. It galvanizes citizens into a renewed community identity – a reminder that we are “us” and someone else is “them”. We shall see if this tactic also works for beleaguered CEOs who are facing a combination of user apathy and revolt. The media has mostly covered the revolt, but user apathy is possibly an even bigger threat to the company and a problem unlikely to be corrected by an “arms race”.

It’s happening. Young people are actually deleting Facebook in large numbers, based on a survey of 3,400 US Facebook users ages 18 to 29. Twenty-six percent of that group permanently deleted the app while another 42 percent have “taken a break” for several weeks or more.

Google wants to find an easier-to-use, harder-to-spoof replacement for the URL. For one thing, the length and clunkiness of URLs make it easy for phishing attackers to pose as a trusted source. They haven’t found a good strategy yet, but if you have ideas, they are accepting suggestions.

Mastercard and Google cut a deal whereby Mastercard sold Google access to Mastercard’s transaction data. This allowed Google to match online behavior to actual sales. Prior to this, Google had to use location data from their Android phones to infer when a particular retailer was visited. Accurate sales data is far more valuable (and easier to deal with), plus, it doesn’t get the company in trouble for violating its consent decree. This arrangement with Mastercard has been going on for a year, but was never made public by either company. At this point, Google is telling Google users that they can “opt out of ad tracking using Google’s ‘Web and App Activity’ online console.” Given that nobody knew the tracking was happening, the ethical problem here is that the companies violated the two main principles underlying informed consent. First, the companies should have explained to their audiences what they were doing and the consequences of being tracked. Then, the companies should have given their users/customers a reasonably easy, non-punitive way to consent or withhold their consent to being tracked. Google maintains that they have no information about individuals – no names, card numbers, or billing addresses – but all good data scientists should know that the locations of the retail establishments and time stamps of the transactions would be enough to re-identify many people. Where individuals go every day is their behavioral fingerprint. No two are alike. A reasonably sized team of Googlers should have been able to combine the transaction data and time stamps with Android phone users’ location data. Then all those names and home addresses would be linked to the transaction data, even if Mastercard never shared home addresses or names.
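
To make the re-identification worry concrete, here is a minimal sketch on made-up data. Everything in it is hypothetical (the card tokens, device IDs, store names, and the five-minute matching window); it illustrates the general linkage idea only, not anything Google or Mastercard is known to have done.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Hypothetical "anonymous" card transactions: (card_token, store, timestamp).
transactions = [
    ("card_A", "Corner Books", datetime(2018, 9, 3, 12, 5)),
    ("card_A", "Main St Cafe", datetime(2018, 9, 4, 8, 31)),
    ("card_A", "Elm Pharmacy", datetime(2018, 9, 5, 18, 2)),
    ("card_B", "Main St Cafe", datetime(2018, 9, 4, 8, 33)),
    ("card_B", "Gym on 5th", datetime(2018, 9, 4, 19, 15)),
]

# Hypothetical phone location pings tied to known accounts: (device, store, timestamp).
pings = [
    ("phone_1", "Corner Books", datetime(2018, 9, 3, 12, 3)),
    ("phone_1", "Main St Cafe", datetime(2018, 9, 4, 8, 30)),
    ("phone_1", "Elm Pharmacy", datetime(2018, 9, 5, 18, 0)),
    ("phone_2", "Main St Cafe", datetime(2018, 9, 4, 8, 32)),
    ("phone_2", "Gym on 5th", datetime(2018, 9, 4, 19, 14)),
]


def link_cards_to_devices(transactions, pings, window=timedelta(minutes=5)):
    """For each card, return the device seen at the same store within
    `window` of the largest number of that card's transactions."""
    matches = defaultdict(Counter)
    for card, store, paid_at in transactions:
        for device, place, seen_at in pings:
            if place == store and abs(paid_at - seen_at) <= window:
                matches[card][device] += 1
    # Keep the best-matching device for each card.
    return {card: counts.most_common(1)[0][0] for card, counts in matches.items()}


links = link_cards_to_devices(transactions, pings)
print(links)
```

In this toy data both phones were at the same cafe at about the same time, so that one visit is ambiguous. The other visits are what separate the two cards, which is why a handful of place-and-time pairs can act as a fingerprint.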

To keep things spicy, Amazon is moving into the online ad sales business. With Google, Facebook, and Amazon all competing for ad deals and regulators firing up consumer privacy policies, we are looking at a very spicy situation.

Sophia Cui, CTO of Jobscan, wants AI to serve broad publics (yes, good idea), feels that it is currently being used primarily to serve narrow corporate goals (true, that’s happening), and points three fingers. The most interesting blame finger points at venture capitalists for sending 75 percent of their dollars into business-to-business startups, not business-to-consumer companies. This protects the venture capitalists from having to deal with consumer demands which tend to include things like “don’t cause cancer” and “take care of the environment” and pressures the B2B companies they support to get to profitability as quickly as possible, following much the same playbook as other startups have followed. Not a lot of space or time in that pipeline for figuring out how to serve the public good *and* make a profit.

A new McKinsey study projects that applied machine learning (they call it AI) will add an additional 1.2 percent to annual GDP growth for at least a decade and that 70 percent of companies will adopt at least one form of applied machine learning by 2030. Countries that become leaders could capture 20-25 percent of the overall global growth.

Microsoft has a cool new AI tool that captures images drawn on a whiteboard and turns them into HTML (see the GitHub repo). Could be quite useful for powering through certain formulaic front-end tasks. The idea of turning drawings into code was pioneered a long time ago by Appian, one of the leading low-code companies.

Volumental is a startup that will scan your feet and the inside volume of shoes on the market to make it easier for you to find a good fit. Hopefully, this will make it (painfully) obvious that women need more round-toed shoe choices. Bunions, show yourselves. Retailers, take note. (At my new job, we’ve gone scorched earth on this question: shoes are optional.)

Authorea, the startup behind the collaborative research publishing tool of the same name, was purchased by Wiley, one of the four oligopolistic academic publishers, though they would like to be considered an open access “pioneer.” Scratches head. Authorea will now give users 3 free private articles instead of 1.

Amazon is in talks to establish a large AWS data center in Chile. One of the reasons Amazon cited is that Chile’s dry, cloudless Atacama desert is home to four powerful astronomical telescopes that generate tremendously big astrodata. The details of this proposed arrangement are still quite vague.

Samsung opened an AI research lab in New York, the second such lab in the US. They opened their first US AI research lab in Mountain View, CA.

Bursting people’s political bubbles could make them even more partisan

The Washington Post, Carolyn Y. Johnson


Far from bridging the gap, the wrong kind of contact might even entrench people deeper in their partisan views.

That became crystal clear to Christopher Bail, a sociologist who heads the Duke University Polarization Lab, after he designed an experiment to disrupt people’s echo chambers on Twitter. Bail assigned Republicans and Democrats to follow automated accounts that retweeted messages from the elected officials, thought leaders and think tanks from the other side.

Far from finding a digital utopia forged from mutual understanding, the two sides drifted apart, Bail and his colleagues reported in the Proceedings of the National Academy of Sciences. After a month, Republicans exposed to the Democratic account became much more conservative, while Democrats exposed to Republican tweets reported slightly more liberal views.

On 7/1/19, PNAS will eliminate the strict page limits and the Plus article category.

Twitter, PNAS


The preferred length of articles will remain at 6 pgs; flexible length limits up to 12 pgs will be allowed. To make this feasible, PNAS will cease producing the print edition as of 1/19.

Computational astrophysics for the future

Science, Perspective, Simon Portegies Zwart


Astronomical source code remains tiny by industrial standards, and its structure is characterized by developments during the “software crisis” of 1965 to 1985, when software was written as a long list of instructions without formal structure (4). The relative simplicity of these “dinosource” codes facilitates their survival but frustrates further development. Scientific source code is experimental in much the same way as laboratory experiments; it is not a final concept and is never ideally organized because it is intended to mediate exploration. By contrast, industrial code is mature but restrains experimentation. Furthermore, industry can afford dedicated teams to design and maintain software, whereas astronomical software development is organized in indigent “families” of researchers. A modular approach with agreed standards is essential to embolden astrophysical discovery by computer.

APIs, consumerization and advice for innovators: A Q&A with HIMSS Innovator in Residence

MobiHealthNews, Jonah Comstock


[Adam] Culbertson: I’m the innovator in residence at HIMSS. [That means I work] collaboratively with external parties to try to drive change on a problem that is difficult. … Recently my focus has been around developer programs and open APIs. I’ve been working with the community to understand what is happening in the open API space, how it can impact change in healthcare, and how it can smooth and speed the process of interoperability.

Extra Extra

A group of major American hospitals are collaborating to start their own non-profit pharmaceutical company. Civica Rx will produce 14 generic drugs to keep prices low and drugs in stock.

The inevitable convergence of IoT and Smart Agriculture has made wearable technology for plants a thing.

Paul Graham, founder of the Y Combinator startup incubator, told an audience recently that college-age wannabe entrepreneurs are, in many cases, better served by putting their energy into education, instead of startups.

University Data Science News

A survey of 11,000 researchers worldwide found that peer review fatigue is causing reviewer shortages. Editors have to ask 2.4 people per review, up from 1.9. Living in the United States, the United Kingdom and Japan is a bit of an occupational ‘privilege’. Researchers there are asked to write “nearly 2 peer reviews per submitted article of their own, compared with about 0.6 peer reviews per submission by those in emerging countries such as China.” This is likely due to the fact that editors are still clustered in the US, UK and Japan and they call on their networks for reviews. Scholars spend a median of five hours reviewing papers. Thirty-nine percent of reviewers have received no training in how to write a review.

Cornell University joins Harvey Mudd on the short list of co-ed schools boasting gender parity in their schools of computer science. Please reference these schools when the next unimaginative soul claims that there is a “pipeline problem.” For Pete’s sake, if there is something stuck in the pipeline preventing talented students from entering your field, unclog the pipe. Cornell and Harvey Mudd have demonstrated that the educational plumbing infrastructure need not produce a gender gap in computer science.

The National Science Foundation has expanded the funding available for ‘broadening participation in computing’ (BPC).

Physicist Sabine Hossenfelder is also concerned about the lack of diversity in physics, though her main objection is to the fetishization of “naturalness” and elegant formulas. With intensity, she writes, “forty years ago theorists in my discipline became convinced the laws of nature must be mathematically beautiful in specific ways…a good theory should be simple, and have symmetries, and it should not have numbers that are much larger or smaller than one….[foundational physicists] predicted that protons should be able to decay. Experiments have looked for this since the 1980s, but so far not a single proton has been caught in the act.” The fetishization of a particular aesthetic quality implicitly limits academic freedom, wastes grant funding, and leads to bad science. She picks on physics because she knows it well, but assumes that all scientific fields are riddled with similarity biases in which reviewers favor studies that adhere to their own assumptions about what “good work” looks like. “Group think”, Hossenfelder argues, rewards scientists who work on “popular topics” and ignores the rest until their careers desiccate and turn to dust.

In an equally assertive challenge to the status quo – this time in biomedical research – Henry Bourne of the University of California-San Francisco systematically walks through why biomedical research is now reliant on soft money for ~75 percent of PI salaries and what the consequences of this arrangement are for science and for scientists. Past booms in national funding caused universities to build new labs which they now want to fill with scientists who are supposed to support themselves primarily on soft money, even though there is far less soft money available. This results in PIs spending oodles of time writing grants for funding they don’t get, anxiety, stagnant wages for postdocs, and talent attrition away from research into clinical work.

If you are a sociologist, like NYU’s Michael Hout (or me), then you are well aware that social mobility in the US has been stagnant or retrograde for the past 20+ years. However, Americans still expect to do better than their parents and believe that hard work is enough to achieve such a feat, regardless of who those parents may be. Nope. First, Hout found that “mobility declined for recent birth cohorts; barely half the men and women born in the 1980s were upwardly mobile compared with two-thirds of those born in the 1940s.” The Horatio Alger story about pulling oneself up by the bootstraps is old, overused, disturbingly individualistic as well as physically and sociologically incorrect. Second, Hout found that one’s parents’ occupational status is strongly correlated with their children’s occupational status attainment. Hout writes, “Median occupational status rose approximately one-half point for every one-point increase in parents’ status” except in single-mother households. America is not currently the land of unfettered opportunity.

The Proceedings of the National Academy of Sciences will go fully digital – no more print journal – next January. Along with the discontinuation of the printed journal, they are relaxing their strict page limit. They still recommend a length of six pages but will allow articles up to 12 pages long.

Wondering where Andrew Moore, the suddenly departing dean of Carnegie Mellon’s School of Computer Science, is going? You guessed it: Google. Google Cloud AI to be more specific. He is “bursting with excitement.”

Caltech has a new $4.2 million high-performance computing cluster. It has “7,500 CPU cores and 200 NVidia P100 GPUs.” Snazzy.

The Billion Prices Project at MIT’s Sloan School of Management collects price data from around the world to measure inflation. I mentioned it years ago when it got started, but wanted to highlight it again because they make their data available to other researchers. There are many interesting, important questions we could ask with this resource.

New CS prof starts machine learning lab

The Brown Daily Herald, Jonathan Douglas


As computer science students flood into the newly-renovated CIT, they will be joined by Stephen Bach, an assistant professor who joined the department this semester. Bach’s hiring comes at a time when the discipline of computer science is exploding at Brown — both in terms of class enrollment and research. For the second year in a row, computer science was the most popular concentration at the University, conferring a total of 184 degrees in 2018, 9 percent of all degrees awarded.

Bach researches and teaches machine learning, a field of computer science that aims to detect patterns in large sets of data. Current techniques in this area require massive amounts of labeled data, limiting machine learning’s spread to companies and individuals who can afford to develop this necessary component. But Bach is working to change that. By automating the process of labeling data, he hopes to democratize machine learning and make it easier to apply it to fields such as health care and spam filtering, he said.

Bach is in the process of starting his new research lab and has secured start-up funding from the department for a year.
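
The article does not describe how Bach automates labeling, so the following is only a sketch of one common approach: write a few cheap heuristic “labeling functions” and combine their votes. The rules, function names, and example messages are all made up for illustration, using spam filtering because the article mentions it.

```python
from collections import Counter

SPAM, HAM, ABSTAIN = 1, 0, -1


# Hypothetical heuristics; each one votes SPAM, HAM, or abstains.
def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 3 else ABSTAIN


def lf_mentions_meeting(text):
    return HAM if "meeting" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_prize, lf_many_exclamations, lf_mentions_meeting]


def weak_label(text):
    """Majority vote over the heuristics that did not abstain.

    Returns ABSTAIN when no heuristic fires or when the vote is tied."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


messages = [
    "You won a PRIZE!!! Claim now!",
    "Agenda for Tuesday's meeting attached.",
    "See you later.",
]
labels = [weak_label(m) for m in messages]
print(labels)
```

Labels produced this way are noisy, and messages where the heuristics abstain or disagree stay unlabeled. The appeal is that a few rules can label a large pile of data without hand annotation.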

Topology, Physics & Machine Learning Take on Climate Research Data Challenges

HPC Wire


Two PhD students who first came to Lawrence Berkeley National Laboratory (Berkeley Lab) as summer interns in 2016 are spending six months a year at the lab through 2020 developing new data analytics tools that could dramatically impact climate research and other large-scale science data projects.

Grzegorz Muszynski is a PhD student at the University of Liverpool, U.K. studying with Vitaliy Kurlin, an expert in topology and computational geometry. Adam Rupe is pursuing his PhD at the University of California at Davis under the supervision of Jim Crutchfield, an expert in dynamical systems, chaos, information theory and statistical mechanics. Both are also currently working in the National Energy Research Scientific Computing Center’s (NERSC) Data & Analytics Services (DAS) group, and their PhDs are being funded by the Big Data Center (BDC), a collaboration between NERSC, Intel and five Intel Parallel Computing Centers launched in 2017 to enable capability data-intensive applications on NERSC’s supercomputing platforms.

NJ Hospital Association Launches Big Data Analytics Center

HealthIT Analytics, Jessica Kent


The New Jersey Hospital Association (NJHA) has launched a new center that will use big data analytics techniques to identify and address gaps in care.

The NJHA Center for Health Analytics, Research, and Transformation (CHART) will apply predictive modeling and analytics to multiple sources of data to better understand underlying socioeconomic and community issues that may impact patient access to care and long-term outcomes.

CHART seeks to place NJHA at the forefront of emerging issues so that the Association can partner with members, legislators, and other organizations for proactive responses.

Dryad welcomes Scheld as new Executive Director

Dryad Data Repository, Dryad news and views


Melissanne joins as Dryad embarks upon our 10th year of providing open, not-for-profit infrastructure for scholarly data, and as we begin a strategic partnership with California Digital Library (CDL) to address researcher needs by leading an open, community-supported initiative in research data curation and publishing.

Rental Glut Sends Chill Through the Hottest U.S. Housing Markets

Bloomberg BusinessWeek, Prashant Gopal


Seattle is known for its hip neighborhoods, soaring home prices, and being home to Amazon.com Inc., the world’s most valuable company. So why is its rental housing market experiencing the most severe slowdown in the U.S.?

Seattle-area median rents didn’t budge in July, after a 5 percent annual increase a year earlier and 10 percent the year before, according to Zillow data on apartments, houses and condos. While that’s the biggest decline among the top 50 largest metropolitan areas, it’s part of a national trend. Rents in Nashville and Portland, Oregon, have actually started falling. In the U.S., rents were up just 0.5 percent in July, the smallest gain for any month since 2012.

“This is something that we first started to see two years ago in New York and D.C.,” Aaron Terrazas, a senior economist at Zillow, said in a phone interview. “A year ago, it was San Francisco and most recently, Seattle and Portland. It’s spreading through what once were the fastest growing rental markets.”

Radical open-access plan could spell end to journal subscriptions

Nature, Holly Else


Eleven research funders in Europe announce ‘Plan S’ to make all scientific works free to read as soon as they are published.

Andrew Ng Answers CEOs’ FAQ on AI



Founder and CEO of Landing.AI Andrew Ng delivered a keynote speech on “The New Era of Artificial Intelligence Empowerment” today at the 2018 China Artificial Intelligence Summit (CAIS2018) in Nanjing. Ng said that during Landing.AI’s startup phase he often discussed how to implement AI driven solutions across a wide range of situations with CEOs from around the world. Those discussions informed today’s talk.


Big Data Ignite 2018

Big Data Ignite


Grand Rapids, MI September 19-21. “Big Data Ignite 2018 will highlight examples of symbiotic intelligence through automation as conference participants explore the state of the art and emerging trends in data science, data analytics, IoT, cloud computing, and data management, as well as in industry verticals ranging from healthcare, manufacturing, retail, and distribution, and in the nonprofit and public sectors.” [$$$]

Social Informatics 2018

Higher School of Economics, Laboratory for Internet Studies


St. Petersburg, Russia September 25-28. “SocInfo is an interdisciplinary venue for researchers from Computer Science, Informatics, Social Sciences and Management Sciences to share ideas and opinions, and present original research work on studying the interplay between socially-centric platforms and social phenomena.” [$$$]


Latinx in AI Coalition at NIPS 2018

Montreal, QC, Canada December 8. “There will be a panel discussion and a mentoring session to discuss current research trends and career choices in artificial intelligence and machine learning. While all presenters will identify primarily as latinx, all are invited to attend.” Deadline for abstract submissions is September 20.


“MapNYC is a contest for New Yorkers to earn Bitcoin by mapping the places that make this city amazing. Beginning September 24, download the MapNYC app, get outside, and start capturing photos and information about as many places across the city as possible: restaurants, shops, doctor’s offices, bars, and more…” Contest ends on October 22.

Challenge: Volatility prediction in financial markets

“Use past volatilities and price changes of financial instruments to predict future volatility and control the risk of financial portfolios.” Deadline to participate is December 27.

Tools & Resources


GitHub – HiteshGorana


The idea is to analyze datasets for 365 days and bring the learning community together.

Scaling neural machine translation to bigger data sets with faster training and inference

Facebook Code; Michael Auli, Myle Ott and Sergey Edunov


As NMT models become increasingly successful at learning from large-scale monolingual data (data that is available only in a single language), faster training becomes more important. To scale to such settings, we had to find a way to significantly reduce training time. Until very recently, the training of this type of NMT model required several weeks on a single machine, which is too slow for fast experimentation and deployment.

Thanks to several changes to our precision and training setup, we were able to train a strong NMT model in just 32 minutes, down from 24 hours — or 45x faster. In a subsequent work, we demonstrate how this new, substantially faster training setup allows us to train much more accurate models using monolingual text.
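
The quoted speedup is easy to check from the two reported training times. A tiny sketch, with variable names chosen here for illustration:

```python
# Training times as reported in the excerpt above.
baseline_minutes = 24 * 60  # 24 hours
new_minutes = 32

speedup = baseline_minutes / new_minutes
print(f"{speedup:.0f}x faster")
```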

lazydata: scalable data dependencies

GitHub – rstojnic


“lazydata is a minimalist library for including data dependencies into Python projects.” … “lazydata only stores references to data files in git, and syncs data files on-demand when they are needed.”

Notes to myself on software engineering – A laundry list of personal reminders

Medium, Francois Chollet


1. Code isn’t just meant to be executed. Code is also a means of communication across a team, a way to describe to others the solution to a problem. Readable code is not a nice-to-have, it is a fundamental part of what writing code is about. This involves factoring code clearly, picking self-explanatory variable names, and inserting comments to describe anything that’s implicit.
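
As a small illustration of this first note (the example is not from Chollet’s post, and all names in it are hypothetical), compare two versions of the same filter:

```python
# Hard to read: the names say nothing about intent.
def f(a, b):
    return [x for x in a if x[1] > b]


# Easier to read: the names and docstring carry the intent.
def filter_reviews_above_rating(reviews, min_rating):
    """Return the (title, rating) pairs whose rating is strictly above min_rating."""
    return [(title, rating) for title, rating in reviews if rating > min_rating]


reviews = [("PyPy retrospective", 5), ("Listicle", 2), ("NMT post", 4)]
top_reviews = filter_reviews_above_rating(reviews, min_rating=3)
print(top_reviews)
```

Both functions compute the same result. Only the second tells a teammate what the data is and what the threshold means.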

The First 15 Years of PyPy — a Personal Retrospective

PyPy Status Blog, Carl Friedrich Bolz-Tereick


The post does not make too many assumptions about any prior knowledge of what PyPy is, so if this is your first interaction with it, welcome! I have tried to sprinkle links to earlier blog posts and papers into the writing, in case you want to dive deeper into some of the topics.

A terrain map that shows Antarctica in stunning detail

The Ohio State University, Ohio State News


“The new map has a resolution of 2 to 8 meters, compared to 1,000 meters, which was typical for previous maps.”
