“There’s no better place than Singapore to do a deep tech startup, particularly anything involving cryptography.” So says Brijesh Pande, founder and managing partner of the Tembusu ICT Fund, a Singapore-based software-focused venture capital fund. Admittedly, he has a vested interest in enticing entrepreneurs to come to the island nation, but he—along with two founders of companies in his portfolio, Lawrence Hughes of Sixscape and Ramond Looi of Vi Dimensions—makes a solid argument.
Here in Singapore, Pande says, “We have no requirement for a security back door. The fact that the NSA [National Security Agency] requires U.S. companies to provide a back door makes technology developed in the U.S. less trusted around the world.”
“The natural marriage of science and sport is only strengthening,” Goldman analyst Christopher Wolf wrote in a client note. “The insatiable thirst for a competitive edge, coupled with new technologies and advanced computing power (e.g., IBM Watson) is driving the evolution of sports from the analog to the digital era.”
The Association for Computing Machinery (ACM) just concluded a celebration of 50 years of the ACM A.M. Turing Award (commonly known as the “Nobel Prize of computing”) with a two-day conference in San Francisco. The conference brought together some of the brightest minds in computing to explore how computing has evolved and where the field is headed. Big data was the focus of a number of panels and discussions at the conference. The following is a discussion with Daphne Koller, Chief Computing Officer, Calico Labs; Adjunct Professor of Computer Science, Stanford University; and winner of the 2007 ACM-Infosys Foundation Award.
University Data Science News
UC Irvine rescinded admission to 499 incoming freshmen, explaining that they either had senioritis (bad senior year grades) or failed to submit their final transcripts. Students are appealing in droves, many claiming to have sent the reportedly missing transcripts. The school admits it is facing an overenrollment crunch and used “a harder line on the terms and conditions this year” to address it. This is what happens when the pressure to get a degree increases but the funding for expanding the physical plant and hiring tenure-track professors stagnates or decreases.
Another sign of creeping credentialization? Science ran a story this week that could have been titled “Your PhD is no good here.” A PhD is apparently not enough qualification to be a science writer or to work in a tech transfer office. But there is good news! You can keep the university system afloat and spend more years of your life infantilized in the classroom by getting another degree.
David Sontag and colleagues published a breakthrough paper in Nature that proposes a two-step, human-in-the-loop noisy-OR Bayesian gate model for emergency room diagnostics. The procedure should work on other medical data, too, but it was built and tested on ER data. Deriving robust, valid insight from medical records has been a hard problem. Sontag and colleagues now recommend “a two-step process…for the construction of the knowledge graph, in which a clinician reviews and rejects some of the edges suggested by the model. Using the results of the clinical evaluation, we can infer that if a filtering step were added to the pipeline, to achieve perfect precision with a corresponding recall of 60%, physicians would have to discard fewer than 2 out of 10 suggested edges. If this step were added to the pipeline, the resulting graph would have perfect precision and recall that would far exceed that of the Google health knowledge graph, making it an attractive candidate for real life applications.” This is big progress.
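The noisy-OR gate at the heart of such models is refreshingly simple: each disease that is present gets an independent chance to “turn on” a symptom, plus a small leak term for unexplained causes. A minimal sketch of that computation (disease names and probabilities are illustrative, not taken from the paper):

```python
# Noisy-OR gate:
#   P(symptom | diseases) = 1 - (1 - leak) * prod_i (1 - p_i)^d_i
# where d_i = 1 if disease i is present and p_i is the probability
# that disease i alone would produce the symptom.

def noisy_or(disease_present, activation_probs, leak=0.01):
    """Probability a symptom fires given binary disease indicators."""
    prob_all_fail = 1.0 - leak
    for d, p in zip(disease_present, activation_probs):
        if d:
            prob_all_fail *= 1.0 - p
    return 1.0 - prob_all_fail

# Fever given flu present (activation 0.9) and pneumonia absent (0.8):
p_fever = noisy_or([1, 0], [0.9, 0.8], leak=0.01)
```

With no diseases present, the symptom probability collapses to just the leak term, which is what makes the model's edges interpretable enough for a clinician to review one by one.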
Keep in mind that this result comes a decade or so after hospitals started digitizing their records. The time it has taken to get from the idea that hospitals could and should digitize health records to now, when we’re seeing robust demonstrations of good data science that keep human doctors in the loop, is an instructive case study in what applied data science will look like. (See this new pre-print on imputing electronic health record data.) It will take time. There is no one-size-fits-most ‘answer me this’ button that can simply be installed on a data set. It will take human experts: computer scientists, statisticians, domain experts, and (increasingly, as more people adopt the job title) data scientists.
Fei-Fei Li published ImageNet in 2009 on a poster at a Miami Beach conference venue. (Note to my home discipline: Why don’t we have conferences in desirable destinations? Are there no unionized hotels near gorgeous locales?) Quartz’s David Gershgorn explains how that poster became the key dataset in image recognition and how Li rocketed into simultaneous positions at Stanford and Google.
DataCamp, one of a growing number of for-profit data science training outfits, will compete with universities for students. DataCamp just raised $4m from investors and plans to offer 80 short courses. It’s a very different model from a master’s degree – short, affordable, and uncertified – and it will be interesting to watch.
Ohio State University has a new Translational Data Analytics Institute that is inviting input from industry to develop curriculum.
Nasdaq released four new data sets into its Nasdaq Analytics Hub, which it is already selling to buy-side analysts. One of them scores data from SEC filings, two others do something indiscernible with “AI and algorithmic multi-factor ensemble voting”, and a fourth agglomerates the trades of online retail investors. The first one makes sense. I withhold judgment on the last three.
Grab your popcorn and get ready for a statistical blockbuster. The grumblings about p-hacking have erupted into a full-blown public confrontation about the role and proper level of p-values for gold-standard science.
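The core worry behind p-hacking is easy to demonstrate: run enough tests on pure noise and p < 0.05 turns up by chance alone. A quick simulation (sample sizes and the |t| > 2.0 threshold are illustrative choices, roughly matching p < 0.05 for these group sizes):

```python
import random
import statistics

def two_sample_t(a, b):
    """Welch-style t statistic for two independent samples."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

random.seed(0)
trials = 200
false_positives = 0
for _ in range(trials):
    # Both groups are drawn from the SAME distribution,
    # so any "significant" difference is pure noise.
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    if abs(two_sample_t(a, b)) > 2.0:
        false_positives += 1

rate = false_positives / trials  # expect roughly 5% by chance
```

Run 200 experiments and publish only the handful that cross the threshold, and you have manufactured an “effect” out of nothing – which is precisely what the proposed tightening of the gold-standard p-value threshold is meant to discourage.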
NVIDIA CEO Jensen Huang lit up a meetup of elite deep learning researchers at CVPR by unveiling the NVIDIA Tesla V100, the company’s latest GPU, based on its Volta architecture, and presenting one to each of the 15 participants in the NVIDIA AI Labs program.
Movidius and Intel have put deep-learning on a stick with a tiny $79 USB device that makes bringing AI to hardware a snap.
In April of last year, Movidius showed off the first iteration of this device, which they then called the Fathom Neural Compute Stick. The company wasn’t able to get the product out as quickly as they had hoped because they were a little busy getting acquired by Intel.
What we do know so far about the second-gen HPU is that it will incorporate an accelerator for deep neural networks (DNNs). The deep learning accelerator is designed to work offline and use the HoloLens’ battery, which means it should be quite efficient, while still providing significant benefits to Microsoft’s machine learning code.
Once we’ve identified that reproducibility is a big problem, the question becomes: How do we tackle it? Part of the answer has to do with changing incentives for researchers. But there are plenty of things we in the research community can do right now in the course of our scientific work.
It might come as a surprise that archaeologists are at the forefront of finding ways to improve the situation. Our recent paper in Nature demonstrates a concrete three-pronged approach to improving the reproducibility of scientific findings.
Just as everyone gets settled in, Jay Emerson strides forward, directing attention to the middle of the classroom.

He is the director of graduate studies in the Department of Statistics and Data Science at Yale University, and he’s teaching a workshop on the aircraft carrier USS George H.W. Bush (CVN 77).

Using open discussions and “hands-on” practical exercises, the “Life in the Sea of Statistics and Data Science” workshop hopes to address real-life scenarios and give sailors the opportunity to learn something new.
Amazon is sending all kinds of signals that it’s interested in the health-care industry.
CNBC reported in May that the company was on the hunt for a general manager to lead a new pharmacy unit. Since then, it has brought on a slew of health experts to bolster its cloud offering, Amazon Web Services, and rallied the industry to build applications for its Alexa voice technology. Amazon has also been selling medical supplies online for some time.
That interest is making some players in the health-care industry nervous.
We’re releasing a new class of reinforcement learning algorithms, Proximal Policy Optimization (PPO), which perform comparably to or better than state-of-the-art approaches while being much simpler to implement and tune. PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance.
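Much of PPO’s simplicity comes down to its clipped surrogate objective: the ratio between the new and old policy probabilities is clipped to [1 − ε, 1 + ε], so no single update can move the policy too far. A minimal pure-Python sketch of that objective (a pedagogical illustration, not OpenAI’s released implementation):

```python
import math

def ppo_clip_objective(new_logp, old_logp, advantages, eps=0.2):
    """Mean over min(r * A, clip(r, 1-eps, 1+eps) * A),
    where r = pi_new(a|s) / pi_old(a|s) per sampled action."""
    terms = []
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)
        clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Taking the min makes the objective pessimistic: large policy
        # moves can only hurt the objective, never help it, which keeps
        # gradient ascent close to the old policy.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)

# A probability ratio of 2.0 with a positive advantage is clipped to 1.2:
obj = ppo_clip_objective([math.log(2.0)], [0.0], [1.0])
```

In practice this objective is maximized with a stochastic gradient method over minibatches of rollout data; the clipping replaces the hard KL constraint of TRPO with something you can drop into a standard optimizer.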
I’d like to offer some advice on how to write better and more truthfully when you write articles about artificial intelligence. The reason I’m writing this is that there are a whole lot of very bad articles on AI (news articles and public interest articles) being published in newspapers and magazines. Some of them are utter nonsense, bordering on misinformation; others capture the gist of what goes on but are riddled with misunderstandings. No, I will not provide examples, but anyone working in AI and following the news can provide plenty. There are of course also many good articles about AI, but the good/bad ratio could certainly be improved.
arXiv, Computer Science > Learning; Patrice Y. Simard, Saleema Amershi, David M. Chickering, Alicia Edelman Pelton, Soroush Ghorashi, Christopher Meek, Gonzalo Ramos, Jina Suh, Johan Verwey, Mo Wang, John Wernsing
The current processes for building machine learning systems require practitioners with deep knowledge of machine learning. This significantly limits the number of machine learning systems that can be created and has led to a mismatch between the demand for machine learning systems and the ability for organizations to build them. We believe that in order to meet this growing demand for machine learning systems we must significantly increase the number of individuals that can teach machines. We postulate that we can achieve this goal by making the process of teaching machines easy, fast and above all, universally accessible.
While machine learning focuses on creating new algorithms and improving the accuracy of learners, the machine teaching discipline focuses on the efficacy of the teachers. Machine teaching as a discipline is a paradigm shift that follows and extends principles of software engineering and programming languages. We put a strong emphasis on the teacher and the teacher’s interaction with data, as well as crucial components such as techniques and design principles of interaction and visualization.
“MapD Vega is based on the open-source Vega specification developed by Jeffrey Heer and his group at the University of Washington. We’ve adapted the original specification to the MapD platform so you can use the power of SQL to investigate your data and quickly render it as a custom visualization.”
“TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions.”
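The “triple” structure is easy to picture. A hypothetical record sketched in Python (field names and contents are illustrative, not TriviaQA’s actual schema) shows how the evidence documents supply distant supervision:

```python
# One hypothetical question-answer-evidence triple. Illustrative only:
# these field names and strings are NOT TriviaQA's real schema.
triple = {
    "question": "Which planet is known as the Red Planet?",
    "answer": "Mars",
    "evidence": [
        "Mars is often called the Red Planet because of its color.",
        "Iron oxide on the surface gives Mars its reddish appearance.",
    ],
}

# Distant supervision: treat an evidence document as a positive training
# example whenever it contains the answer string, with no human labeling
# of the exact answer span.
positives = [doc for doc in triple["evidence"] if triple["answer"] in doc]
```

That weak matching is what lets the dataset pair 95K questions with roughly six evidence documents each without annotating answer spans by hand.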