Data Science newsletter – January 27, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for January 27, 2017


 
 
Data Science News



Tweet of the Week

Twitter, Waldo Jaquith




Nashville startup PredictGov joins Duke Law Tech Lab, eyes capital raises

Venture Nashville



VANDERBILT Ph.D. candidate John Nay has won admittance for his PredictGov lawtech startup in the inaugural class of the Duke Law Tech Lab at Duke University.

Nay told Venture Nashville he now contemplates a near-term $500K Angel raise and a subsequent Seed raise of perhaps $1.5MM.


MethodSpace website relaunches with new big data hub

SAGE Connection – Insight, Michael Todd



A year in the making, the revamped version of the MethodSpace website, complete with a new hub focused on the emerging world of computational social science, is now live.

In 2009, we launched MethodSpace as a dynamic online community for social and behavioral research methods, one that enables scholars and students to share experiences and solve problems on a global scale. It was a natural move for us, since research methods are in many ways the framework on which SAGE, and by extension scholarly knowledge, is built. Throughout our history, we have committed to publishing and innovating in research methods. Our first methods book was published in 1970, and since then we have published research methods through journal articles, textbooks and now a suite of born-digital products to support scholars as they develop their fields.


Help Google Develop Tools for Raspberry Pi

Raspberry Pi



Google is going to arrive in style in 2017. The tech titan has exciting plans for the maker community.

It intends to make a range of smart tools available this year. Google’s range of AI and machine learning technology could enable makers to build even more powerful projects.

To make this happen, Google needs help from the maker community. Raspberry Pi fans are the best makers around, and it’s their ideas that will give the tech company direction.


Data Science for Social Good collaboration nets WSDOT funding

University of Washington, eScience Institute



A collaboration between two eScience Institute Data Science for Social Good (DSSG) project leads has resulted in a financial award which allows them to continue their work. Mark Hallenbeck, director of the Washington State Transportation Center, and Anat Caspi, director of the Taskar Center for Accessible Technology, both eScience fellows, answered questions submitted by Robin Brooks, communications associate, by email for this piece.


UW spinout KenSci raises $8.5M for machine learning platform that predicts which patients will get sick

GeekWire, Clare McGrane



Healthcare is an expensive industry. Every year, the United States spends about $9,000 per capita on healthcare, higher than in many other industrial countries. And despite that extra spending, U.S. health outcomes aren’t any better than those of countries that spend less; in some instances, they’re worse.

One way to lower costs and increase quality in the healthcare system would be to predict which patients will get sick, and while predicting the future seems impossible, that’s exactly what data science and machine learning startup KenSci is hoping to do. Today the company announced that it has raised $8.5 million in a Series A investment round, led by Ignition Partners, to kick their program into high gear.


Cornell Data Science Launches Student-Led Training Course in Statistical Methods, Programming Languages

The Cornell Daily Sun, Jeanette Si



The Cornell Data Science project team will launch an unofficial student-led training course this semester — taught and developed entirely by Cornell students — to help students gain hands-on experience in the increasingly valuable skills of statistical methods and programming languages.

“We saw an opportunity to promote data science in the undergraduate community, and we saw it as our responsibility as the data science project team to help make it happen,” said Chase Thomas ’19, CDS’s operations lead.

The course will be open to students of all majors who are interested in data science, according to Thomas. With no prerequisites and only one recommended course — Computer Science 1110: Introduction to Computing Using Python — the course is designed to be accessible to students with no prior programming experience.


Amazon opens Pittsburgh office to tap local talent

Pittsburgh Business Times, Tim Schooley



Amazon introduced its new Pittsburgh office to the public Tuesday, demonstrating strong local ties with Carnegie Mellon University and the local tech community right from the start.

The ecommerce giant, whose market capitalization exceeds Walmart’s by more than $150 billion, held a reception at its new 15,000-square-foot office at the SouthSide Works, which now formally serves as its Pittsburgh office and brings the operations of two local companies it bought in 2015, Safaba Translation Systems LLC and Shoefitr, under one address.


Announcing the 2017 Facebook PhD Fellows

Facebook Research



Facebook is proud to announce the 2017 Facebook Fellowship winners and finalists. “This year we received over 800 applications from promising PhD students,” said Rebekkah Hogan, Fellowship Program Manager. “The 2017 Fellows represent some of the most talented young researchers in computer science and engineering disciplines from universities across the globe.”


RISELab Takes Flight at UC Berkeley

datanami, Alex Woodie



UC Berkeley yesterday officially launched the RISELab, the successor to the highly successful AMPLab that produced superstar open source technologies like Apache Spark and Apache Mesos. There’s no telling what, if any, RISELab projects will strike a chord like Spark and Mesos did. But one thing is for certain: while the analytical focus will remain, the timeframe for decision making is changing dramatically.

“Like much of the Big Data movement, the AMPLab focused mostly on offline data analysis problems, where minutes and hours could be devoted to extracting value from data,” writes Brett Israel of UC Berkeley’s media relations team in a UC Berkeley story about the new lab. “At the RISELab, however, researchers are looking to [make] decisions in milliseconds.”

The RISELab’s mantra, “Real-time Intelligence with Secure Execution,” (which also forms the basis for the RISE acronym) begins to tell the story of where this project is heading.


London university admits to monitoring student emails under pressure from Government anti-terror programme

The Independent (UK)



A top London university has admitted to spying on its staff and students as part of government efforts to prevent radicalisation on campus.

A notice on the King’s College London (KCL) email login page warns members that emails can be “monitored and recorded” under the Government’s controversial anti-terror strategy Prevent.


Startup uses Microsoft technology to translate text messages in real-time for world leaders at Davos

GeekWire, Dan Richman



The roughly 3,000 leaders from science, economics, politics and business who attended the World Economic Forum at Davos, Switzerland, this month may have understood each other a little better, thanks to technology from Microsoft and Layer, a San Francisco-based startup in which Microsoft has invested.


If You Want Life Insurance, Think Twice Before Getting A Genetic Test

Fast Company, Katie Jennings



Since 2008, with the passing of the Genetic Information Nondiscrimination Act (GINA), the federal government has barred health insurance companies from denying coverage to those with a gene mutation. But the law does not apply to life insurance companies, long-term care, or disability insurance. These companies can ask about health, family history of disease, or genetic information, and reject those that are deemed too risky.


The Atlas of Urban Expansion Shows How Cities Grow

Planetizen, Todd Litman



As of 2010, the world contained 4,231 cities with 100,000 or more people. The Monitoring Global Urban Expansion Program gathers and analyzes data on a representative sample of 200 cities. The Atlas of Urban Expansion presents the program’s preliminary results.

This project by New York University, UN-Habitat, the Lincoln Institute of Land Policy, and numerous collaborators is a comprehensive, ongoing research program to monitor quantitative and qualitative aspects of global urban expansion. The project used medium-resolution Landsat satellite imagery and census data to analyze how these cities grew between 1990 and 2014. Housing development and affordability surveys investigated how land use planning practices and development regulations affect urban fringe development patterns, home ownership patterns and housing affordability in these cities, based on data supplied by city-based researchers. The program has now completed its data collection phase and started evaluation and interpretation.


Artificial intelligence used to identify skin cancer

Stanford News



Universal access to health care was on the minds of computer scientists at Stanford when they set out to create an artificially intelligent diagnosis algorithm for skin cancer. They made a database of nearly 130,000 skin disease images and trained their algorithm to visually diagnose potential cancer. From the very first test, it performed with inspiring accuracy.

“We realized it was feasible, not just to do something well, but as well as a human dermatologist,” said Sebastian Thrun, an adjunct professor in the Stanford Artificial Intelligence Laboratory.


Better wisdom from crowds

MIT News



The wisdom of crowds is not always perfect. But two scholars at MIT’s Sloan Neuroeconomics Lab, along with a colleague at Princeton University, have found a way to make it better.

Their method, explained in a newly published paper, uses a technique the researchers call the “surprisingly popular” algorithm to better extract correct answers from large groups of people. As such, it could refine wisdom-of-crowds surveys, which are used in political and economic forecasting, as well as many other collective activities, from pricing artworks to grading scientific research proposals.

The new method is simple. For a given question, people are asked two things: What they think the right answer is, and what they think popular opinion will be. The variation between the two aggregate responses indicates the correct answer.

“In situations where there is enough information in the crowd to determine the correct answer to a question, that answer will be the one [that] most outperforms expectations,” says paper co-author Drazen Prelec, a professor at the MIT Sloan School of Management as well as the Department of Economics and the Department of Brain and Cognitive Sciences.
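The two-question idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the authors' implementation: each respondent is assumed to name a single answer they expect to be most popular (rather than estimating percentages), the function name `surprisingly_popular` is hypothetical, and the survey numbers are invented for the example.

```python
from collections import Counter


def surprisingly_popular(own_answers, predicted_popular):
    """Pick the answer whose actual vote share most exceeds its predicted share.

    own_answers[i]       -- what respondent i believes is correct
    predicted_popular[i] -- what respondent i expects most people to answer
    (Simplifying assumption: one predicted answer per respondent.)
    """
    if not own_answers or len(own_answers) != len(predicted_popular):
        raise ValueError("need one own answer and one prediction per respondent")

    n = len(own_answers)
    actual = Counter(own_answers)
    predicted = Counter(predicted_popular)

    # Sort the options so that ties are broken deterministically.
    options = sorted(set(actual) | set(predicted))

    # "Surprise" = actual share of votes minus predicted share of votes.
    return max(options, key=lambda o: (actual[o] - predicted[o]) / n)


# Invented survey: 6 of 10 people answer "yes", but 9 of 10 expect "yes"
# to be the popular answer, so "no" does better than expected.
own = ["yes"] * 6 + ["no"] * 4
expected = ["yes"] * 9 + ["no"] * 1

print(surprisingly_popular(own, expected))   # the surprisingly popular answer
print(Counter(own).most_common(1)[0][0])     # plain majority vote, for contrast
```

In this made-up survey the majority vote and the surprisingly popular answer disagree, which is the situation the method is meant to handle.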


We must urgently clarify data-sharing rules

Nature News & Comment, Jan-Eric Litton



Scientists have worked hard to ensure that Europe’s new data laws do not harm science, but one last push is needed, says Jan-Eric Litton.


The Best Data Scientists Get Out and Talk to People

Harvard Business Review, Thomas C. Redman



You can be a good data scientist by sitting at your computer. After all, the job description involves poring through huge quantities of often disparate data to find insights that may prove helpful in every aspect of a business, including marketing, logistics, and human resources. It also includes cleaning data, dealing with gaps, and sifting through incomplete or poor definitions.

But great data scientists know they must do more. They recognize that there are nuances and quality issues in the data that they can’t understand while sitting at their desks. They recognize that the world is filled with “soft data,” relevant sights, sounds, smells, tastes, and textures that are yet to be digitized — and hence are unavailable to those working at their computers. (Think of things like the electricity in the air at a political rally and the fear in the eyes of an executive faced with an unexpected threat.) They know they must understand the larger context, the real problems and opportunities, how decision makers decide, and how their predictions will be used.


Virtualitics Launches as First Platform to Merge Artificial Intelligence, Big Data and Virtual/Augmented Reality

Yahoo Finance, Business Wire



Virtualitics, LLC, a data analytics in virtual reality (VR) and augmented reality (AR) startup, today announced the launch of its new tool that combines powerful data visualization in VR/AR and artificial intelligence to provide insights and discover actionable knowledge hidden in big and complex data. The company also announced they closed a $3 million investment seed round from angel investors.

Virtualitics combines VR/AR with machine learning and natural language in a data exploration, collaborative environment suitable for both data scientists and non-expert users. The technology is the only one of its kind that can provide a simultaneous rendering of up to 10 dimensions, revealing multidimensional relationships present in the data, which may not be discoverable in any other way.


Ethics in Data Science

The Guardian has been unable to get clear answers on the data ethics board Google / Alphabet announced it would establish within DeepMind. It does exist, but nobody knows who is on it or what it discusses.

Writing in Nature, Jan-Eric Litton urges scientists to come together to develop a strategy for data sharing in advance of the European Commission’s impending ruling that will govern all EU data.

Researchers at Stony Brook and IBM have launched an app that allows users to get all the insight of recommender systems without letting personal data stray beyond the user’s phone or laptop.

Johan Ugander at Stanford is worried that learning data science via modules or skills-only courses may not involve training in the ethical use of data.

King’s College London warns students and faculty that their email correspondence is subject to monitoring by the UK government.

In the US, life insurance, disability insurance, and long-term care may be denied to people who have genes putting them at higher risk for disease. Only health insurance companies are legally prohibited from basing their decisions on genetics.


Truth, lies, and an ethics of personalization

Medium, Johan Ugander



I consider myself very knowledgeable about how online ad targeting and personalization work; it’s core to my research as a professor at Stanford. Nothing about the technology involved in political ad campaign targeting surprises me. What has surprised me recently is how it’s being used, and the lack of scruples on behalf of those using it in these ways.


Google, U-M to build digital tools for Flint water crisis

University of Michigan, Michigan Institute for Data Science, MIDAS



A partnership between Google and the University of Michigan’s Flint and Ann Arbor campuses aims to provide a smartphone app and other digital tools to Flint residents and officials to help them manage the ongoing water crisis.

The app and other tools will help predict where lead levels will be highest in the city’s water, and they’ll pull together information and resources to make the crisis easier to navigate for those affected. The project is made possible by a $150,000 grant from Google.


Chan Zuckerberg Initiative Acquires Meta

Facebook, Chan Zuckerberg Initiative, Meta



We are excited to share that the Chan Zuckerberg Initiative has agreed to acquire Meta, a company that has developed an AI that helps scientists read, understand and prioritize millions of scientific papers.


US Intelligence seeks a universal translator for text search in any language

Ars Technica, Sean Gallagher



The Intelligence Advanced Research Projects Activity (IARPA), the US Intelligence Community’s own science and technology research arm, has announced it is seeking contenders for a program to develop what amounts to the ultimate Google Translator. IARPA’s Machine Translation for English Retrieval of Information in Any Language (MATERIAL) program intends to provide researchers and analysts with a tool to search for documents in their field of concern in any of the more than 7,000 languages spoken worldwide.


Open Data Meets Digital Curation: An Investigation of Practices and Needs

International Journal of Digital Curation; Christopher Lee, Suzie Allard, Nancy McGovern, Alice Bishop



In the United States, research funded by the government produces a significant portion of data. US law mandates that these data should be freely available to the public through ‘public access’, which is defined as fully discoverable and usable by the public. The U.S. government executive branch supported the public access requirements by issuing an Executive Directive titled ‘Increasing Access to the Results of Federally Funded Scientific Research’ that required federal agencies with annual research and development expenditures of more than $100 million to create public access plans by 22 August 2013. The directive applied to 19 federal agencies, some with multiple divisions. Additional direction for this initiative was provided by the Executive Order ‘Making Open and Machine Readable the New Default for Government Information’ which was accompanied by a memorandum with specific guidelines for information management and instructions to find ways to reduce compliance costs through interagency cooperation.

In late 2013, the Institute of Museum and Library Services (IMLS) funded the Council on Library and Information Resources (CLIR) to conduct a project to help IMLS and its constituents understand the implications of the US federal public access mandate and how needs and gaps in digital curation can best be addressed. Our project has three research components: (1) a structured content analysis of federal agency plans supporting public access to data and publications, identifying both commonalities and differences among plans; (2) case studies (interviews and analysis of project deliverables) of seven projects previously funded by IMLS to identify lessons about skills, capabilities and institutional arrangements that can facilitate data curation activities; and (3) a gap analysis of continuing education and readiness assessment of the workforce. Research and cultural institutions urgently need to rethink the professional identities of those responsible for collecting, organizing, and preserving data for future use. This paper reports on a project to help inform further investments.


NYU to invest more than $500M in Brooklyn on science, engineering, emerging media

Brooklyn Daily Eagle



NYU President Andrew Hamilton is scheduled to outline the university’s more than half-a-billion-dollar investment in Brooklyn over the next decade to expand and upgrade science, technology, engineering and emerging media disciplines at an Association for a Better New York breakfast tomorrow, according to a press release from the school. The investment will provide space and support not only for the applied sciences, but also for new initiatives and approaches that rely on the fusion of science, technology and creativity that is a signature feature of what both NYU and Brooklyn’s burgeoning tech sector have to offer.


Apple Set to Join Amazon, Google, Facebook in AI Research Group

Bloomberg Technology, Alex Webb



Apple Inc. is set to join the Partnership on AI, an artificial intelligence research group that includes Amazon.com Inc., Alphabet Inc.’s Google, Facebook Inc. and Microsoft Corp.

Apple’s admission into the group could be announced as soon as this week, according to people familiar with the situation. Representatives at Apple and the Partnership on AI declined to comment.


Introducing the 51 Pegasi b Fellowship

Heising-Simons Foundation



The Heising-Simons Foundation is pleased to announce the inaugural cohort of 51 Pegasi b Fellows.

Named after the first exoplanet discovered in 1995, the 51 Pegasi b Fellowship recognizes exceptional postdoctoral scientists with great potential to advance scientific research in the field of planetary astronomy.

The inaugural 51 Pegasi b Fellows and their host institutions are:

  • Jason Dittmann – Massachusetts Institute of Technology
  • Katherine de Kleer – California Institute of Technology
  • Peter Gao – University of California at Berkeley
  • Songhu Wang – Yale University

    Data Visualization of the Week

    Twitter, Matthew C. Klein



     
    Events



    The data science of traffic safety. LA, Vision Zero and big data.



    Los Angeles, CA Tuesday, January 31, at 6:30 p.m., DataScience, Inc. (200 Corporate Pointe, Suite 200, Culver City) [free, rsvp required]

    Network of Mind 2017



    Sydney, Australia Network of Mind is a four-day interdisciplinary workshop on recent advances in theoretical and experimental neuroscience. January 31-February 3 [free]

    Fintech: How can government promote the good and protect against the bad



    Washington, DC Conversation with Vice Chairman of the House Financial Services Committee Patrick T. McHenry, and Sen. Jeff Merkley, followed by a panel of experts and regulators, Wednesday, February 8 at 9 a.m., Brookings Institution (1775 Massachusetts Ave NW) [free]

    DSI Launch Event & Distinguished Lecture



    Vancouver, BC, Canada DSI Launch Event & Distinguished Lecture by Dr. Robert Gentleman, March 2 at 4:30 p.m. [free]

    OpenVis Conf by Bocoup



    Boston, MA April 24-25 at the State Room (60 State St.) [$$$]
     
    Deadlines



    Keystone DH 2017 CFP

    Philadelphia, PA Digital Humanities conference is July 12-14. Deadline for submissions is March 1.

    Detection of Genome Editing

    The Intelligence Advanced Research Projects Activity is seeking information on potential tools and methods to detect organisms that have been modified using genome editing techniques. Responses due March 3.

    Data Institute 2017: Remote Sensing with Reproducible Workflows

    Boulder, CO Workshop is July 19-24 and costs $750. Deadline to Apply is March 10.

    IARPA N2N Challenge

    Develop the best autonomous nail-to-nail fingerprint capture device. Deadline for submissions is March 17.

    Students: Apply for a Yoseloff Scholarship to attend SABR 47 in New York City

    With generous funding from The Anthony A. Yoseloff Foundation, Inc., the Society for American Baseball Research will award up to four scholarships to high school or college students to attend SABR 47 on June 28-July 2, 2017, in New York City. Deadline for applications is April 1.
     
    Tools & Resources



    Development update: High speed Apache Parquet in Python with Apache Arrow

    Wes McKinney



    Over the last year, I have been working with the Apache Parquet community to build out parquet-cpp, a first class C++ Parquet file reader/writer implementation suitable for use in Python and other data applications. Uwe Korn and I have built the Python interface and integration with pandas within the Python codebase (pyarrow) in Apache Arrow.

    This blog is a follow up to my 2017 Roadmap post.


    Facilitating the discovery of public datasets

    Google Research Blog, Natasha Noy and Dan Brickley



    “We have recently published new guidelines to help data providers describe their datasets in a structured way, enabling Google and others to link this structured metadata with information describing locations, scientific publications, or even Knowledge Graph, facilitating data discovery for others. We hope that this metadata will help us improve the discovery and reuse of public datasets on the Web for everybody.”


    An Even Easier Introduction to CUDA

    NVIDIA, Parallel Forall blog



    This post is a super simple introduction to CUDA, the popular parallel computing platform and programming model from NVIDIA. I wrote a previous “Easy Introduction” to CUDA in 2013 that has been very popular over the years. But CUDA programming has gotten easier, and GPUs have gotten much faster, so it’s time for an updated (and even easier) introduction.


    First Quora Dataset Release: Question Pairs

    Quora, Data @ Quora blog



    “We are excited to announce the first in what we plan to be a series of public dataset releases. Our dataset releases will be oriented around various problems of relevance to Quora and will give researchers in diverse areas such as machine learning, natural language processing, network science, etc. the opportunity to try their hand at some of the challenges that arise in building a scalable online knowledge-sharing platform. Our first dataset is related to the problem of identifying duplicate questions.”


    One Dataset, Visualized 25 Ways

    FlowingData, Nathan Yau



    This is what happens when you let the data ramble.


    NIPS 2016 Program Highlights

    NIPS 2016



    slides and videos


    JuliaML and TensorFlow Tutorial

    Lyndon White



    This is a demonstration of using JuliaML and TensorFlow to train an LSTM network. It is based on Aymeric Damien’s LSTM tutorial in Python. All the explanations are my own, but the code is generally similar in intent. There are also some differences in terms of network-shape.


    Let’s Crowd-Fund the Data Stories Podcast

    Robert Kosara, Eager Eyes blog



    Enrico Bertini and Moritz Stefaner “are trying to crowd-fund their work rather than rely on advertising. If we all chip in a few dollars or euros per show, this will be easy to accomplish.”


    [1701.06538] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    arXiv, Computer Science > Learning; Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean



    “We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.”
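The "sparse combination of experts" in the abstract can be pictured with a toy sketch in plain Python. It is not the paper's code and leaves out the paper's refinements: the experts here are simple scalar functions rather than feed-forward networks, the gate is a fixed linear scoring of a scalar input rather than a trained network, and the names `top_k_gate`, `moe_layer` and `gate_weights` are hypothetical.

```python
import math


def top_k_gate(logits, k):
    """Softmax over the k largest logits; every other gate value is zero."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")

    # Indices of the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Numerically stable softmax restricted to the selected experts.
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def moe_layer(x, experts, gate_weights, k):
    """Combine only the selected experts, weighted by their gate values."""
    # Assumption: the gate is a plain linear score of a scalar input.
    logits = [w * x for w in gate_weights]
    gates = top_k_gate(logits, k)
    # Experts with a zero gate are never evaluated, which is what keeps
    # the computation cheap even when there are many experts.
    return sum(g * f(x) for g, f in zip(gates, experts) if g > 0.0)


experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
gate_weights = [0.1, 2.0, -1.0, 1.0]

print(top_k_gate([w * 1.0 for w in gate_weights], k=2))
print(moe_layer(1.0, experts, gate_weights, k=2))
```

With `k=2`, only two of the four toy experts run for a given input, so the cost per example depends on `k` rather than on the total number of experts.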

     
    Careers


    Full-time, non-tenured academic positions

    Behavioral Science Intervention Designer



    University of Virginia, Curry School of Education

    Machine learning in neuroimaging researcher



    National Institute of Mental Health (NIMH); Bethesda, MD

    Computational Biologist – gnomAD



    Broad Institute; Cambridge, MA

    Data Visualization Research Specialist



    Northeastern University; Boston, MA
    Postdocs

    Postdoctoral Fellowships



    National Autonomous University of Mexico (UNAM); Mexico City, Mexico
