Data Science newsletter – August 16, 2018

Newsletter features journalism, research papers, events, tools/software, and jobs for August 16, 2018


Data Science News

University Data Science News

This time, according to historian Benjamin Schmidt, the humanities are truly facing a crisis. The number of students majoring in history and English has declined by nearly half since the 2008 recession. Communications and linguistics appear to be immune to the trend, but adjacent social sciences like sociology and anthropology have also seen consistent declines over the past 10 years. The knee-jerk causal explanation is that as college costs have increased and wages have stagnated, students have increasingly sought degrees that lead to higher-paying jobs with which they can pay back their loans. There could be some truth to that; the current study did not explore causes. One response has been to eliminate humanities departments – I’m looking at you, Florida and Wisconsin. Give Tweet of the Week a look for a hot take on why this is a bad idea.

The American Statistical Association runs an annual survey on the state of its majors and found an increase in both supply and demand for undergraduate, master’s, and doctoral statistics degrees. Much of the trend is generic across statistics fields, but some is specifically driven by an uptick in biostats offerings and graduates. In the past, statistics attracted more master’s students than bachelor’s students, but that gap is narrowing. Both master’s and doctoral degrees grew about 4.5% from 2016 to 2017, but the number of new stats bachelor’s degree holders grew 22% in the same time period.

Karen Levy, a sociologist at Cornell, has been working on surveillance technology as a particularly fraught sociotechnical space ever since her dissertation work among long-distance truckers. She is now looking at the creeping incursions of surveillance into parenting, neighboring (that is not a word), and commercial transactions. She notes, “The whole idea, if you look at marketing materials, is to make surveillance warm and fuzzy. I think that is dangerous. People who grow up under conditions where they don’t feel as though they have privacy in their homes or in intimate spaces may be more likely to accept [pervasive surveillance] as the norm in other contexts.” Sociologically speaking, surveillance is not the same as interaction. Watching a worker make your pizza via camera is quite different from interacting with that person. Hello, technologically enforced power dynamics. Please refer to Michel Foucault’s discussion of Jeremy Bentham’s panopticon. If that sentence made no sense to you, the gist is this: off-loading the task of governance into the design and technology that surrounds us is highly efficient for the ruling class and deeply alienating for everyone else.

Neuroscientists at MIT have found structural underpinnings for pessimistic behavior in the caudate nuclei, located deep in the brain near the basal ganglia. They stimulated animals by sending puffs of air at their faces – a mildly unpleasant stimulus – while incrementally altering the rewards associated with receiving the puff. Those animals whose caudate nuclei were stimulated during the experiment gave far more weight to the negative impact of the puff than the positive impact of the reward. What’s more, the pessimistic behavior kept on going throughout the day. Sleep is the reset, it seems, though the region is also correlated with parts of the motor cortex, so exercise may also help snap out of the pessimism cycle. A new study has kicked off to see if people who have been diagnosed with anxiety, depression, and obsessive-compulsive disorder have activations of or structural differences in their caudate nuclei compared to people without those diagnoses.

Georgia Tech started a new online cybersecurity master’s program for $10,000, far less than the typical master’s degree. This announcement follows analysts’ reports from Gartner saying that the lack of cybersecurity labor talent will drive up costs across the cybersecurity industry.

New investigations suggest that women make up only about 10-15% of technical roles in data science and machine learning, making our field even more gender-imbalanced than computer science as a whole.

An article in Carnegie Mellon’s student newspaper gently but firmly questions the school’s sanguine position on faculty who spend 80% of their time working in industry. Andrew Moore, Dean of the CMU School of Computer Science, has promoted an “ebb and flow” model in which top talent are encouraged to work in the private sector for a while, then come back to academia, then perhaps leave again, in a revolving-door model that allows them to reap the personal rewards of high industry salaries without completely abandoning academia. The student who wrote the article implies that this fair-weather faculty model is unsatisfactory. The students have politely spoken and – unlike the dean – they aren’t thrilled.

A team from Lehigh University is building a search engine for data sets. Current indexing technology is built to index websites, photos, and videos, not spreadsheets or databases. Their work should make it much easier to discover relevant data for research purposes. Their NSF grant is open through summer 2021.

The University of Virginia Data Science Institute is launching an online Master of Science in Data Science. They say all the same things about how hot data science is in the workforce and that the demand for in-person classes is too high to meet, so they’re using the online course as an inclusivity booster. In related news at University of California-San Diego, a popular big data analytics class will educate 400 students in person while another 1000+ follow along using edX online. This is a hybrid online/in-person mix that may work to increase supply of the most popular courses.

Matt Mountain and Adam Cohen, leaders of the Association of Universities for Research in Astronomy, have an op-ed in Nature wondering if American astronomers are going to run into funding hurdles that will keep them from being able to use the best telescopes. Today’s technologies cost billions and take decades to get off the ground. NASA’s budget isn’t exactly burgeoning and the huge facilities cost a lot to run. They recommend larger budgets to handle the increasing size of the telescope projects, planning internationally, and truly thinking long term.

Arizona State University will be home to the first continental biorepository. Biological samples collected from all over North America will be housed there with grant money from NSF and the Battelle Memorial Institute.

Using social media to solve social problems

Penn State University, Penn State News


Social scientists rely on data to study social problems. However, data from traditional surveys can be difficult and time-consuming to collect, as well as inaccurate, since not all factors can be measured well. A National Science Foundation-funded Penn State project will evaluate the accuracy of using Twitter data to represent populations across different demographic groups.

According to principal investigator Guangqing Chi, associate professor of rural sociology and demography and public health sciences in the Department of Agricultural Economics, Sociology, and Education, and a Social Science Research Institute co-funded faculty member, Twitter data is generated by a large number of people in real time, is rapidly growing and easily accessible, and is drawing interest from many research disciplines.

“Twitter data has great potential for understanding population dynamics; however, the use of the data has been resisted by social scientists, largely because we know little about the users’ demographic characteristics,” said Chi.

Extra Extra

Women’s pockets are way too small, a problem that has now been visualized by the creative data visualizers over at The Pudding.

A team of 200 scientists has published 94% of the sprawling wheat genome.

There’s an intriguing (semi-paywalled) article at The Economist on how the internet has improved dating, making it easier for the marginalized to find each other.

Descartes Labs unleashes machine learning on space data

Quartz, Tim Fernholz


Satellite imagery across the visual spectrum is cascading down from the heavens. The challenge is to figure out how to process it, learn from it—and monetize it.

Advances in distributed processing and machine learning have made it possible for researchers to manipulate and analyze data on, well, a planetary scale. Descartes Labs, a company spun out of the Los Alamos National Laboratory in New Mexico, is one of the leaders in the field. Its tool for leveraging Earth-observation data on a large scale is now available.

Highlights from 2017 Degree Release: Bachelor’s Numbers Close in on Master’s

Amstat News


The ratio of the number of master’s degrees to that of bachelor’s closed to 1.2 in 2017, the closest it’s been since 1987 when it was also 1.2. The ratio grew to around 2.5 in the mid 2000s.

Figure 1: Statistics and biostatistics degrees at the bachelor’s, master’s, and doctoral levels in the United States. The dotted lines of matching colors are the number of degrees for that degree level earned by women. Data source: NCES IPEDS.

According to the latest preliminary data release from the National Center for Education Statistics, bachelor’s degrees, from 2016 to 2017, grew 22% to 3,398 (36 of which are for biostatistics) and master’s degrees increased 4% to 4,059 (693 for biostatistics). Doctoral degrees increased by 5% to 620 (201 for biostatistics), as seen in Figure 1, with the dotted lines showing the associated number of degrees earned by women.
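
As a quick check on the headline ratio, the 2017 counts quoted above can be divided directly. This is only a sketch: the variable names are ours, and the two counts are the only inputs taken from the article.

```python
# 2017 degree counts as quoted above from the NCES preliminary release.
# Variable names are illustrative, not from the source.
bachelors_2017 = 3398
masters_2017 = 4059

# Ratio of master's degrees to bachelor's degrees.
ratio = masters_2017 / bachelors_2017
print(f"master's-to-bachelor's ratio: {ratio:.2f}")
```

Rounded to one decimal place, the result matches the 1.2 reported for 2017.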

Accompanying this growth is an increase in the number of universities granting bachelor’s degrees in statistics (from 126 to 132), master’s degrees in statistics (137 to 143), doctoral degrees in biostatistics (34 to 39), and doctoral degrees in statistics (69 from a previous high of 67 in 2014), as seen in Figures 3 and 4.

Experts Weigh in on the Future of AI and Evolutionary Algorithms



Evolutionary computation is a form of artificial intelligence (AI) that can bring creative solutions to many different commercial problems, for example, trading, digital marketing, healthcare and cyber agriculture.

In the pursuit of advancing AI, some of the foremost experts in both the academic and commercial worlds of AI are coming together to share their knowledge and research on this topic. We sat down to understand more about where they see AI heading and the significant role evolutionary computation is destined to play as the new deep learning.

AI and GPUs Could Lead to Autonomous Trains

NVIDIA Blog, Scott Martin


Boosted by machine learning, image recognition and NVIDIA GPUs, trains are on track to lead the way in autonomous transportation.

How technology turns consumers into spies

Cornell University, Cornell Chronicle


Digital tools increasingly compel us to spend time and energy monitoring other people, from our own children or ailing parents to the workers preparing our pizza, a new study says.

The paper, “The Surveillant Consumer,” explores how people are turned into spies by the ubiquity of digital cameras or other surveillance technology, combined with marketing messages that suggest they’re bad parents or ill-informed consumers if they don’t use them.

“There’s a lot of research about the consumer as the target of surveillance, but less focus on how the consumer becomes the surveillor herself, and how these products encourage the consumer to do those things,” said co-author Karen Levy, assistant professor of information science. “Consumers are sold a sense of anxiety about the world, some of which is based in fact, some of which is less so. You want to know where your child is, but it can become a slippery slope.”

Government Data Science News

The Trump Administration is pushing an immigration agenda that would prevent H-4 visa holders from getting work permits in the US. H-4 visa holders are typically spouses (sometimes children) of H-1B visa holders. This move is widely opposed in the tech industry. It would hit Indian immigrants hardest; they hold 80% of H-4 visas right now.

Stacey Dixon has taken over leadership of IARPA and is frustrated that we are not effectively competing with China’s research budget for artificial intelligence.

Argonne National Laboratory announced it will put the first US exascale supercomputer online in 2021. It will be called Aurora 21. It will enable computationally intensive simulations with massive amounts of data.

You know the U.S. Defense Department’s Project Maven, the one that had employees at several companies signing petitions refusing to work on war-making technology? It got a huge 580% boost in the most recent DoD budget. The budget included huge expenditures for things like three littoral combat ships (ships for near-shore operations, price tag: $1.56 billion) when the Navy only requested one.

The Invisible Institute has released an immense data set on the disciplinary histories of members of the Chicago Police Department. The data set spans 50 years, with complete records from 2000 to 2016, and includes 240,000 allegations of misconduct for 22,000 officers.

Y Combinator to Set Up China Arm With Ex-Baidu Exec as CEO

Bloomberg, Deals, Selina Wang


Former Baidu Inc. executive Qi Lu has been named head of Y Combinator China, marking the American startup incubator’s first full-fledged international effort.

Y Combinator, which has seeded companies including Airbnb Inc., Stripe Inc., Reddit and Dropbox Inc., will start its program in China as soon as next summer. In the U.S., the accelerator selects two batches of companies a year for financing, advice and connections in exchange for a small percentage of equity. Lu will lead the Chinese program, which will be called YC China and adopt a similar approach though there may be tweaks to fit the domestic market, said Sam Altman, Y Combinator’s president.

Tinder co-founders and 8 others sue dating app’s owners

CNN Tech, Laurie Segall and Chris Isidore


Co-founders of Tinder and eight other former and current executives of the popular dating app are suing the service’s current owners, alleging that they manipulated the valuation of the company to deny them billions of dollars they were owed.

Georgia Tech Creates Cybersecurity Master’s Degree Online for Less Than $10,000

Georgia Tech, News Center


The Georgia Institute of Technology has announced a new online cybersecurity master’s degree that will be offered for less than $10,000 and delivered in collaboration with edX. The Online Master of Science in Cybersecurity (OMS Cybersecurity) is designed to address a severe global workforce shortage in the field. According to the 2017 Global Information Security Workforce Study, the shortage is expected to reach 1.8 million people by 2022.

Georgia Tech is the only nationally ranked Top 10 university to offer such a program at a tuition rate intended to increase higher education accessibility and affordability. The degree has existed on campus since 2002 and costs $20,000 for in-state students and $40,000 for those out-of-state. Applications for spring 2019 are open now until October 1, 2018.

Is Technology Killing English as a Lingua-Franca?

Lilt, Kyle Paice


The proliferation of Artificial Intelligence and Machine Learning tools has professionals everywhere nervy about the future utility of their skill set. And when you take a peek at some of the exciting advances in these fields, it’s hard not to ask yourself “will I be replaced by a robot?”

There are plenty of other blogs you can read if you’re interested in the coming regime of automaton overlords, but here, we’re more interested in the future of language. Specifically, are new language technologies phasing out the likelihood of people learning languages beyond their mother tongue? And will advances in these technologies cut out the need to learn English as a “lingua-franca”, or go-between language between two individuals of differing native languages?

In short, no.


[Women in Infra] Infrastructure for Machine Learning

Meetup, Women in Infrastructure


San Francisco, CA August 23, starting at 6:30 p.m., Tabletop Tap House (175 4th St). [rsvp required]

Tools & Resources

Phrase-Based & Neural Unsupervised Machine Translation

arXiv, Computer Science > Computation and Language; Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, Marc'Aurelio Ranzato


Machine translation systems achieve near human-level performance on some languages, yet their effectiveness strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of language pairs. This work investigates how to learn to translate when having access to only large monolingual corpora in each language. We propose two model variants, a neural and a phrase-based model. Both versions leverage a careful initialization of the parameters, the denoising effect of language models and automatic generation of parallel data by iterative back-translation. These models are significantly better than methods from the literature, while being simpler and having fewer hyper-parameters. On the widely used WMT’14 English-French and WMT’16 German-English benchmarks, our models respectively obtain 28.1 and 25.2 BLEU points without using a single parallel sentence, outperforming the state of the art by more than 11 BLEU points. On low-resource languages like English-Urdu and English-Romanian, our methods achieve even better results than semi-supervised and supervised approaches leveraging the paucity of available bitexts. Our code for NMT and PBSMT is publicly available.

Indexing Text for Both Effective Search and Accurate Analysis

Qualtrics Engineering, Stacks & Qs blog, David Norton


At Qualtrics, Text iQ is the tool that allows our users to find insights from their free response questions. Powering Text iQ are various microservices that analyze the text and build models for everything from sentiment analysis to identifying key topics. In addition, Text iQ provides a framework for users to explore their text data through search. As the name suggests, we want all aspects of Text iQ, including search, to be intuitive and feel smart. All of this requires a storage solution that allows for effective text indexing as well as accurate and complete data aggregation and retrieval.

In this article we examine CrateDB, the technology we use for storage in Text iQ, as well as the actual text processing pipeline that we use to give us the indexing capabilities that we need.


Full-time, non-tenured academic positions

Research Scientist

Indiana University, Network Science Institute; Bloomington, IN
