Data Science newsletter – August 24, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for August 24, 2018


Data Science News

HighQ + Kira Systems Partner for Official Launch of AI Hub

Artificial Lawyer


Legal data collaboration company HighQ has just announced legal AI company Kira Systems as its launch partner for its new AI Hub platform, which allows supported third-party AI engines to be integrated into legal processes and workflows within HighQ.

HighQ first announced the platform in May (see Artificial Lawyer story), but this goes further, with Kira now included as the first AI company to officially partner with the collaboration pioneer’s AI Hub.

At present users can automatically push documents from HighQ into Kira for analysis. HighQ then stores the ‘enriched data’ in the AI Hub, and it is then available for use in the company’s iSheets or data visualisation modules, the company said.

The AI-first startup playbook

VentureBeat, Ivy Nguyen and Mark Gorenberg


Iterative Lean Startup principles are so well understood today that a minimum viable product (MVP) is a prerequisite for institutional venture funding, but few startups and investors have extended these principles to their data and AI strategy. They assume that validating their assumptions about data and AI can be done at a future time with people and skills they will recruit later.

But the best AI startups we’ve seen figured out as early as possible whether they were collecting the right data, whether there was a market for the AI models they planned to build, and whether the data was being collected appropriately. So we believe firmly that you must try to validate your data and machine learning strategy before your model reaches the minimal algorithmic performance (MAP) required by early customers. Without that validation — the data equivalent of iterative software beta testing — you may find that the model you spend so much time and money building is less valuable than you hoped.

Company Data Science News

FloodiQ is a tool that lets users assess the flood risk of any property in Connecticut, Florida, Georgia, North Carolina, New Jersey, New York, South Carolina, and Virginia. It is freely available and designed to help real estate buyers avoid buying properties projected to flood as the climate warms, sea levels rise, and storms get more intense.

Alex Stamos, the recently departed former CISO of Facebook, who left after disagreeing with the company’s handling of the Cambridge Analytica case, has written a long read on the impact of Russian cyber warfare on US elections. He feels our collective response has been dangerously apathetic. He notes, “if the United States continues down this path, it risks allowing its elections to become the World Cup of information warfare, in which U.S. adversaries and allies battle to impose their various interests on the American electorate.”

Nvidia is getting into the autonomous train game. Given that there already are autonomous trains – I rode the AirTrain from JFK on Thursday – this is not at all surprising. However, I also rode the New York City subway and realize that automating antiquated infrastructure presents serious physical, financial, and political hurdles. Overhauling existing urban infrastructure is a much, much more complicated play than building an automated train from scratch, especially in New York.

Y Combinator, perhaps the most difficult tech accelerator for start-ups to get into, is starting a new hub in China. It will be led by a former Baidu executive.

Unpaywall is a newish not-for-profit search engine that indexes freely available academic articles and data by their digital object identifiers (DOIs). Most of these articles are tedious to discover without Unpaywall. The for-profit publisher Elsevier is integrating it into Scopus (the Elsevier abstract and citation database)…but it won’t make the full corpus of articles actually discoverable for some baffling reason. Other Unpaywall integrations are in the works that would make academic work more accessible to the general public.

Twitter CEO Jack Dorsey is thinking about making several significant changes on the platform to protect the principles of productive discourse. Twitter has already implemented a seven-day time-out period for several accounts that violated its terms of service by promulgating hate speech and destructive calls to action. He’s considering labeling bot accounts (this sounds sensible) and using truth-bots to surround misinformation with more accurate evidence. Dorsey maintained that even though President Trump would have been put in a Twitter time-out for a variety of tweets if he were anyone else, he gets a pass because of his role.

Google has another internal battle brewing between corporate leadership and its talented, articulate engineers and staffers. Coming close on the heels of a 4,000-signature petition opposing Project Maven, now 1,400 employees have requested that the company discontinue work on Project Dragonfly. Project Dragonfly would reportedly implement a censored search engine within China. Wikipedia is among the content that wouldn’t be indexed for Chinese citizens. Employees who oppose the project are disgusted with the company’s refusal to remain committed to its goal of making the world’s information available to everyone, and with its refusal to discuss the project. Employees are also extremely frustrated that what little they know about the project, they read in The Intercept instead of hearing from leadership. Traditionally, Google has allowed any employee to look at the code base for any project. Not so with Project Dragonfly. Intellectual unions are the labor rights unions of the Information Age.

UC-Berkeley’s Deirdre Mulligan and Daniel Griffin have an op-ed in The Guardian on Google’s Project Dragonfly. They note, “Google has no obligation to search for truth, but it does have the responsibility to faithfully surface the literal fact of a human rights atrocity, such as the Holocaust and the massacre at Tiananmen Square. To do so, search engines need to create a new algorithmic script. Call it a ‘bearing witness’ script.

A bearing witness script is wholly consistent with the engineers’ commitments and capability. Relying on factual accounts of human rights atrocities produced by expert public bodies is well aligned with democratic values and avoids the slipperiness of in-house determinations.”

Google is building another rather large office building in Seattle. This one is 12 stories and will open in 2021. They are also cooling a bunch of data centers with an HVAC system optimized by machine learning.

Saint Louis University is giving all of its dorm residents Alexa Echo Dots. It is unclear why the university is doing this, other than to be cool somehow? They address the obvious ethical/surveillance/student data problems with the following statement: “your Echo Dot is managed by a central system dedicated to SLU. This system is not tied to individual accounts and does not maintain any personal information for any of our users, so all use currently is anonymous. Additionally, neither Alexa nor the Alexa for Business management system maintains recordings of any questions that are asked.” I’m not hugely opposed to the language, but I am skeptical that this program’s achievements outweigh its potential privacy shortcomings.

Pedro Domingos will lead a new Machine Learning Research Group at D.E. Shaw. The new group will run in parallel to a more applied machine learning unit at the company. Domingos is the author of The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World.

Facebook is working with NYU Langone Medical School to develop technology that would allow MRIs to be taken much faster, using far less data. The idea is that a standard MRI captures a bunch of data that is not diagnostically relevant. Cutting down on the amount of data captured will make MRIs much easier for children, claustrophobic patients, and people who cannot lie still on their backs without experiencing pain. The project will identify which data need to be captured and which can be efficiently left out.

WNYC’s Radiolab has an hour-long program (available for streaming) that details how Facebook defined hate speech and offensive content. I know you’ve always wondered: how much butt is too much butt? Are all exposed breasts problematic?

Instagram added a recommended post feature to its exceedingly popular app.

IBM has proposed that anyone putting an algorithm into production should have to run a series of standardized tests to measure fairness, performance, and safety. These results would then be published in a Supplier’s Declaration of Conformity (SDoC). This is a sensible step towards ensuring compliance with certain knowable, desirable performance metrics. However, it’s not the kind of approach that will uncover objectionable unknowns.
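IBM has not published the exact tests an SDoC would contain, so as a purely illustrative sketch, here is what one commonly reported fairness check could look like: the demographic parity difference, i.e. the largest gap in positive-prediction rates between groups. The function names and metric choice are assumptions for illustration, not IBM’s actual methodology:

```python
# Hypothetical sketch of one "standardized test" an SDoC-style report might
# include: demographic parity difference across protected groups.
from collections import defaultdict

def positive_rates(predictions, groups):
    """Return the positive-prediction rate for each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred  # pred is 0 or 1
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = positive_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())

preds  = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_difference(preds, groups))  # 0.75 - 0.25 = 0.5
```

A battery of such metrics, published alongside performance and safety results, is the kind of knowable, checkable compliance IBM is proposing; as noted above, it would not surface objectionable unknowns.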

Machine learning can control tsetse flies

The Economist


One problem, since breeding males necessarily involves breeding females, too, is sorting the sexes, so that only males are irradiated and released. (Simply irradiating both sexes is problematic; a higher radiation dose is required to sterilise females, for example, which risks killing or disabling the males.) To sort the tsetses means waiting until the flies emerge from their pupae, chilling them to reduce their metabolic rates and therefore their activity, and then separating males from females by hand, with a paint brush. Male flies can be identified by external claspers that make them distinguishable from clasperless females. This process is effective, but time-consuming and labour-intensive. Zelda Moran, of Columbia University, thinks she has a better way.

In 2014 Ms Moran, who was then a researcher at the entomological laboratory of the International Atomic Energy Agency, in Vienna, which does a lot of this work, noticed that female and male tsetse pupae develop differently. Adult flies emerge from their pupae 30 days after pupation. Although tsetse-fly pupal cases are opaque, Ms Moran found that in certain lighting conditions, such as infrared, it was possible to observe that the insects’ wings began to darken beforehand. In the case of females, this happens around 25-26 days after pupation. In the case of males it happens later: 27-29 days after pupation. In principle, that gives a way to sort the flies before they emerge from their pupae.
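The sorting rule this observation implies can be sketched as a trivial classifier on the day wing darkening is first observed. The function and day windows below are an illustration of the principle only (using the 25–26 and 27–29 day figures reported above), not Ms Moran’s or the IAEA’s actual method:

```python
# Illustrative sketch: classify a tsetse pupa's sex from the day after
# pupation on which its wings are first observed to darken.
# Females darken around days 25-26; males around days 27-29.
def classify_pupa(darkening_day):
    if 25 <= darkening_day <= 26:
        return "female"
    if 27 <= darkening_day <= 29:
        return "male"
    return "unknown"  # outside both windows; flag for manual sorting

print([classify_pupa(d) for d in (25, 28, 31)])  # ['female', 'male', 'unknown']
```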

Universities Get Creative with Data Science Education

datanami, Alex Woodie


Good luck getting into Professor Yoav Freund’s big data analytics and Spark class at the University of California, San Diego this fall. “There’s always a very long waiting list to get in,” he says. But thanks to a new partnership between UCSD and EdX, anybody can take the same course online for free.

On September 3, UCSD will start its second round of MicroMasters classes with EdX, which was created by Massachusetts Institute of Technology and Harvard University in 2012 to provide online duplicates of popular classes at major colleges and universities around the country.

While about 400 lucky UCSD students will make it into Professor Freund’s course, there will be anywhere from 20,000 to 30,000 people signed up for the same class on the MicroMasters program (although only around 1,000 will actively participate).

Standing up for truth in the age of disinformation

University of California-Berkeley, Center for Technology, Society & Policy


Professor Deirdre K. Mulligan and PhD student (and CTSP Co-Director) Daniel Griffin have an op-ed in The Guardian examining how Google might weigh its human rights obligations in the face of state censorship demands: If Google goes to China, will it tell the truth about Tiananmen Square?

The op-ed advances a line of argument developed in a recent article of theirs in the Georgetown Law Technology Review: “Rescripting Search to Respect the Right to Truth”

Guest post – URSSI: Conceptualizing a US Research Software Sustainability Institute

Midwest Big Data Hub; Daniel S. Katz, Jeff Carver, Sandra Gesing, Karthik Ram, Nic Weber


The NSF-funded conceptualization of a US Research Software Sustainability Institute (URSSI) is making the case for and planning a possible institute to improve science and engineering research by supporting the development and sustainability of research software in the US.

Research software is essential to progress in the sciences, engineering, humanities, and all other fields. In many fields, research software is produced within academia, by academics who range in experience and status from students and postdocs to staff members and faculty. Although much research software is developed in academia, important components are also developed in national laboratories and industry. Wherever research software is created and maintained, it can be open source (most likely in academia and national laboratories) or commercial/closed source (most likely in industry, although industry also produces and contributes to open source).

A toolkit for data transparency takes shape

Nature, Technology Feature, Jeffrey M. Perkel


Julia Stewart Lowndes studied metre-long Humboldt squid (Dosidicus gigas), tagging them to track their dives, as a graduate student at Stanford University in California in 2011. When she wrote up her dissertation, she had data on five animals. Then, two months before the project was due, another tag was located in Hawaii.

“I got a sixth of my data within months of finishing,” Lowndes says — a lucky break. But luck favours the prepared. Each data set contained hundreds of thousands of points; Lowndes was able to use the new one easily because she had conducted her analyses with an eye on reproducibility. Computational reproducibility is the ability to repeat an analysis of a particular data set and obtain the same or similar results, says Victoria Stodden, who studies reproducibility at the University of Illinois at Urbana–Champaign. In practice, it means that researchers who publish scientific findings based on computational analyses should release ‘digital artefacts’, including data and code, so that others can test their findings.

That hasn’t always been the case.
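As a minimal illustration of the practice Stodden describes, an analysis written as a parameterized function over a data file can be rerun unchanged when a late-arriving data set appears, as Lowndes’s sixth tag did. The file format and column name here are hypothetical:

```python
# Sketch of a reproducible, parameterised analysis: the same code runs on
# any tag's data file, so a sixth data set needs no new code.
import csv

def summarise_dives(path):
    """Return count, mean, and max of the 'depth_m' column of a tag CSV."""
    depths = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            depths.append(float(row["depth_m"]))
    return {"n": len(depths),
            "mean_depth": sum(depths) / len(depths),
            "max_depth": max(depths)}

# The identical call works for tags 1-5 and for the new sixth tag:
# summarise_dives("tag06.csv")
```

Releasing such code together with the data files it reads is the ‘digital artefact’ release that makes a published analysis repeatable by others.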

Hack week: Study supports collaborative, participant-driven approach for researchers to learn data science from their peers | UW News

University of Washington, UW News


Each night, high-definition cameras mounted to telescopes collect terabytes of data about objects in the sky. Each day, scientists sequence the genomes of people, animals, plants and microbes for biomedical and evolutionary research. Each year, the Large Hadron Collider produces 30 petabytes of data on particle collisions.

Science has become a big-data endeavor. But scientists are not universally adept in “data science” — the computing and statistical skillsets needed to handle, sort, analyze and draw conclusions from big data. The shortage of know-how in data science can hamper research, medicine and even private industry.

Now a team from the University of Washington, New York University and the University of California, Berkeley has developed an interactive workshop in data science for researchers at multiple stages of their careers. The course format, called “hack week,” blends elements of traditional lecture-style pedagogy with participant-driven projects. The most recent was a neuroscience-themed event held in July on the UW campus. As the team reports in a paper published Aug. 20 in the Proceedings of the National Academy of Sciences, participants rated the hack weeks as opportunities to learn about new concepts, foster new connections, share data openly, and develop skills and work on problems that will positively affect their day-to-day research lives.

New York Is the Capital of a Booming Artificial Intelligence Industry

Bloomberg Business, Riley Griffin


AI and machine learning job postings have doubled since 2015, but there aren’t enough viable candidates to go around. That means big starting salaries for those who qualify.

Open questions: How many genes do we have?

BMC Biology, Steven Salzberg


Seventeen years after the initial publication of the human genome, we still haven’t found all of our genes. The answer turns out to be more complex than anyone had imagined when the Human Genome Project began. [full text]

Facebook will subject all of its users to “trustworthiness scores,” similar to China’s Citizen Scores

Boing Boing, Cory Doctorow


Every Facebook user will be assigned a “trustworthiness score” derived from a mix of user complaints and secret metrics drawn from spying on user activity on the system (Twitter has a comparable system).

Facebook also buys mountains of personal data from data-brokers; it’s not clear whether that data will be factored into your score.

The system is reminiscent of China’s Citizen Score system, which uses private contractors to numerically score people based on their data trails, their friends, their indebtedness, etc. Low-scoring citizens are excluded from air and high-speed rail travel, apartment leases, loans, and jobs.


TechCrunch Disrupt SF 2018



San Francisco, CA September 5-7. “Disrupt is where you’ll find the renowned Startup Battlefield competition, a virtual Hackathon, hundreds of startups in Startup Alley, Workshops and legendary networking at our After Parties” [$$$$]


Crowdsourcing Disruptive Ideas to Solve Global Water Challenges

“The SAP Next-Gen program and Startupbootcamp AfriTech are collaborating with the World Economic Forum’s Global Water Initiative and the World Bank’s Water Global Practice to tap the creativity of startups and accelerate solutions to UN Sustainable Development Goal 6 – Clean Water and Sanitation.” Deadline for preliminary phase applications is September 24.

Tools & Resources

How to create a virtuous cycle of data with your customers

VentureBeat, S. Somasegar and Daniel Li


Over the last decade, technology companies like Amazon, Apple, Google, and Facebook have risen to the top of brand value lists by outgrowing many of the traditional consumer companies like Disney, Toyota, and McDonald’s. There are many factors driving this rapid growth in tech brand value, but a large portion of the growth can be attributed to the virtuous cycle of data for tech companies. Technology companies know their customers, even anonymously, much better than the companies behind traditional consumer products, and they use that customer data to continuously improve their products which, in turn, drives brand affinity and loyalty.

Data Science Modules

University of California-Berkeley


“Data science modules are short explorations into data science that give students the opportunity to work hands-on with a data set relevant to their course and receive some instruction on the principles of data analysis, statistics, and computing. With help from the Data Science Modules development team, a module can be designed and taught in an existing course from any discipline or field.”

Scheduling Notebooks at Netflix

Medium, Netflix TechBlog; Matthew Seal, Kyle Kelley and Michelle Ufford


“At Netflix we’ve put substantial effort into adopting notebooks as an integrated development platform. The idea started as a discussion of what development and collaboration interfaces might look like in the future. It evolved into a strategic bet on notebooks, both as an interactive UI and as the unifying foundation of our workflow scheduler. We’ve made significant strides towards this over the past year, and we’re currently in the process of migrating all 10,000 of the scheduled jobs running on the Netflix Data Platform to use notebook-based execution.”


Full-time positions outside academia

Data Scientist

CARTO; Madrid, Spain
