Data Science newsletter – August 30, 2017

Newsletter features journalism, research papers, events, tools/software, and jobs for August 30, 2017

GROUP CURATION: N/A

 
 
Data Science News



Y Combinator takes machine intelligence startups to school and learns a thing or two

TechCrunch, John Mannes


from

[Daniel] Gross asserts that a Heroku-esque possibility remains for companies that are capable of building services that are easier to use. In true contrarian form, Gross argues that the actual machine learning prowess of each of these teams comes secondary to their ability to craft a product that developers actually like and would use by choice.

Moving past APIs and other developer services, perception and autonomy were easily the most populated spaces for startups within the YC AI track.


A Game You Can Control With Your Mind

The New York Times, Cade Metz


from

A number of companies are working on ways to control machines simply with a thought. But they are likely to be met with skepticism.


Athelas announces $3.7m funding led by Sequoia Capital

Medium, Athelas


from

Y Combinator alum Athelas, a deep learning-based biotech company, is unveiling a rapid blood diagnostics and immune monitoring platform that can be used at home by chemotherapy patients, as well as in oncology research by pharma companies. The company has also closed $3.7 million in funding led by Sequoia Capital, with participation from Initialized Capital and Joe Montana’s Liquid2. Angel investors include Color Genomics co-founder Elad Gil, James Hong, Stanford Deep Learning Professor and Salesforce Chief Scientist Richard Socher, and former Y Combinator COO Qasar Younis.


How Conservatives Manipulated the Mainstream Media to Give Us President Trump

Moyers & Company, Eric Alterman


from

The report, titled Partisanship, Propaganda, and Disinformation: Online Media and the 2016 US Presidential Election, deploys the device of a “media cloud” to help us visualize the manner in which media is actually consumed. Because people tend to get their news in a haphazard way these days — picking up stories from Facebook, Twitter, Instagram, local TV, talk radio, cable, network news, newsweeklies, daily newspapers, and the websites that may or may not be part of a daily diet — it doesn’t make sense to simply treat media consumption as a matter of statistics. Sure, many sources — like this one, for instance — are far more trustworthy when it comes to facts and evidence than many others, but most news consumers do not make this distinction. (A side note: I have long argued against the “wall” between “editorials” and “news,” because, for most people, it is a distinction without a difference, and provides endless fuel for accusations of “liberal bias” when — owing in significant measure to these same accusations — most media institutions bend over backward to be more than fair to conservative sources and, oftentimes, pseudo-realities.)


Irene Chen – 30 things I learned at MLHC 2017

Irene Chen


from

The first two:

  • Data are numerous but hard to access! Beth Israel Deaconess Medical Center handles 7 petabytes of patient data. And yet many of the papers presented used datasets of only thousands or even dozens of patients, due to data availability challenges or a focus on rare diseases.
  • FDA approval is hard but important. Although the initial process is arduous, minor updates (e.g. retraining deep learning models) only require notification, while new models need reapproval. One way of convincing the FDA is to show that the model’s accuracy falls within the variance of human experts.

    New NSF awards will bring together cross-disciplinary science communities to develop foundations of data science

    National Science Foundation


    from

    The National Science Foundation (NSF) today announced $17.7 million in funding for 12 Transdisciplinary Research in Principles of Data Science (TRIPODS) projects, which will bring together the statistics, mathematics and theoretical computer science communities to develop the foundations of data science. Conducted at 14 institutions in 11 states, these projects will promote long-term research and training activities in data science that transcend disciplinary boundaries.

    “Data is accelerating the pace of scientific discovery and innovation,” said Jim Kurose, NSF assistant director for Computer and Information Science and Engineering (CISE). “These new TRIPODS projects will help build the theoretical foundations of data science that will enable continued data-driven discovery and breakthroughs across all fields of science and engineering.”


    Artificial intelligence cyber attacks are coming – but what does that mean?

    The Conversation, Jeremy Straub


    from

    The next major cyberattack could involve artificial intelligence systems. It could even happen soon: At a recent cybersecurity conference, 62 industry professionals, out of the 100 questioned, said they thought the first AI-enhanced cyberattack could come in the next 12 months.

    This doesn’t mean robots will be marching down Main Street. Rather, artificial intelligence will make existing cyberattack efforts – things like identity theft, denial-of-service attacks and password cracking – more powerful and more efficient. This is dangerous enough – this type of hacking can steal money, cause emotional harm and even injure or kill people. Larger attacks can cut power to hundreds of thousands of people, shut down hospitals and even affect national security.

    As a scholar who has studied AI decision-making, I can tell you that interpreting human actions is still difficult for AIs and that humans don’t really trust AI systems to make major decisions. So, unlike in the movies, the capabilities AI could bring to cyberattacks – and cyberdefense – are not likely to immediately involve computers choosing targets and attacking them on their own. People will still have to create attack AI systems and launch them at particular targets. Nevertheless, adding AI to today’s cybercrime and cybersecurity world will escalate what is already a rapidly changing arms race between attackers and defenders.


    Towards an integrated science of language

    Nature Human Behaviour, Morten H. Christiansen & Nick Chater


    from

    It has long been assumed that grammar is a system of abstract rules, that the world’s languages follow universal patterns, and that we are born with a ‘language instinct’. But an alternative paradigm that focuses on how we learn and use language is emerging, overturning these assumptions and many more.


    Artificial intelligence: Big data and invisible patients

    MedCity News, Josh Baxt


    from

    If the barrier to precision medicine is data handling, then artificial intelligence (AI) may be the logical solution. Machine learning and deep learning are making inroads in a variety of industries, and seem poised to have a big impact in medicine, a process that is already in motion – and perhaps not a moment too soon.

    “Your chance in your lifetime of getting a false diagnosis, if you look at the data, is 100 percent,” said Thomas Wilckens, founder and CEO at InnVentis, to the audience at the recently concluded Precision Medicine Leadership Summit in San Diego. “There’s a lot to improve.”

    Wilckens moderated “Going Deep in the Fast Lane – the Rise of AI in Precision Medicine,” a panel that brought together experts from industry and academia to parse this evolving segment. In some cases, these technologies have already arrived, though admittedly in rare silos.


    FTC Blogs Help Define Reasonable Data Security, Attorneys Say

    Bloomberg BNA, Jimmy H. Koo


    from

    The FTC’s weekly data security investigation blogs provide important tips and information for companies regarding the regulator’s views on what constitute best practices, privacy attorneys told Bloomberg BNA.

    There is no data security compliance silver bullet revealed in the blogs, but the Federal Trade Commission posts offer more framing for companies seeking to meet the commission’s requirement that they employ reasonable data security to protect consumer data.

    Companies under the FTC’s jurisdiction—from internet giants Amazon.com Inc. and Facebook Inc. to smaller businesses, such as now-defunct medical testing laboratory LabMD Inc.—have struggled with what level of data security they must provide to convince the nation’s main data security and privacy enforcement agency that their efforts to protect personal data are reasonable.


    Email and Calendar Data Are Helping Firms Understand How Employees Work

    Harvard Business Review, Michael L. Tushman, Anna Kahn, Mary Elizabeth Porray and Andy Binns


    from

    Using data science to predict how people in companies are changing may sound futuristic. As we wrote recently, change management remains one of the few areas largely untouched by the data-driven revolution. But while we may never convert change management into a “hard science,” some firms are already benefiting from the potential that these data-driven techniques offer.

    One of the key enablers is the analysis of email traffic and calendar metadata. This tells us a lot about who is talking to whom, in what departments, what meetings are happening, about what, and for how long. These sorts of analyses are helping EY, where some of us work, partner with Microsoft Workplace Analytics to help clients predict the likelihood of retaining key talent following an acquisition and to develop strategies to maximize retention. Using email and calendar data, we can identify patterns around who is engaging with whom, which parts of the organization are under stress, and which individuals are most active in reaching across company boundaries.
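
For readers who want to try this kind of organizational network analysis on their own data, here is a minimal sketch. It is not EY’s or Microsoft Workplace Analytics’ method; the file name and column names are hypothetical stand-ins. It builds a who-emails-whom graph from message metadata and surfaces the people whose contacts span the most departments.

```python
# Illustrative sketch only: build a who-emails-whom graph from message metadata
# and flag likely "boundary spanners." The CSV file and its column names are
# hypothetical, not part of any vendor's actual tooling.
import pandas as pd
import networkx as nx

msgs = pd.read_csv("email_metadata.csv")  # assumed columns: sender, recipient, sender_dept, recipient_dept

G = nx.Graph()
for row in msgs.itertuples(index=False):
    # accumulate message counts as edge weights
    if G.has_edge(row.sender, row.recipient):
        G[row.sender][row.recipient]["weight"] += 1
    else:
        G.add_edge(row.sender, row.recipient, weight=1)
    G.nodes[row.sender]["dept"] = row.sender_dept
    G.nodes[row.recipient]["dept"] = row.recipient_dept

def dept_reach(g, person):
    """Number of distinct departments among a person's email contacts."""
    return len({g.nodes[n].get("dept") for n in g.neighbors(person)})

# People whose contacts span many departments are candidates for
# "reaching across company boundaries."
spanners = sorted(G.nodes, key=lambda p: dept_reach(G, p), reverse=True)[:10]
print(spanners)
```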


    How a bacterium can live on methanol

    ETH Zurich


    from

    Many chemists are currently researching how small carbon molecules, such as methane and methanol, can be used to generate larger molecules. The earth is naturally rich in methane, and artificial processes like the fermentation of biomass in biogas plants also produce it in abundance. Methanol can be generated from methane. Both are simple molecules containing only a single carbon atom. However, using them to produce larger molecules with several carbon atoms is complex.

    While challenging for chemists, bacteria learned long ago to build large molecules out of small ones: Some bacteria use methanol as a carbon source in order to create energy carriers and cellular building material. They live primarily on plant leaves and occur in large numbers on every leaf. The bacterium most extensively researched is called Methylobacterium extorquens. A team led by Julia Vorholt, Professor of Microbiology, has now identified all the genes required by this bacterium to live on methanol.


    High-Res Satellites Want to Track Human Activity From Space

    WIRED, Science, Sarah Scoles


    from

    Hopkinsville, Kentucky, is normally a mid-size town, home to 32,000 people and a big bowling ball manufacturer. But on August 21, its human density more than tripled, as around 100,000 people swarmed toward the total solar eclipse.

    Hundreds of miles above the crowd, high-resolution satellites stared down, snapping images of the sprawl.

    These satellites belong to a company called DigitalGlobe, and their cameras are sharp enough to capture a book on a coffee table. But at that high resolution, they can only image that book (or the Kentucky crowd) at most twice a day. And a lot can happen between brunch and dinner. So the Earth observation giant is building a new constellation of satellites to fill in the gaps in their chronology.


    KSU launches Analytics and Data Science Institute

    Cobb Business Journal


    from

    Kennesaw State University has launched an Analytics and Data Science Institute to facilitate and support advanced study and research in the area of data science and advanced analytics. The Institute will provide training and graduate study that is responsive to current societal needs.

    The new Institute, which will house KSU’s Ph.D. in Analytics and Data Science, will also include multiple research labs in collaboration with partners like Equifax and GE Power. It will occupy a newly designed, 10,000-square-foot space in KSU’s Town Point building in Kennesaw.


    Brown data research gets major boost

    Providence Journal, G. Wayne Miller


    from

    In a clear sign that Brown University’s new Data Science Initiative is on the right research track, the National Science Foundation has awarded it a $1.5-million grant to further develop “new tools for applying data to complex problems,” in the words of initiative head Jeffrey Brock, chair of the Mathematics Department.


    Seven New Faculty to Join CICS

    UMass Amherst, College of Information and Computer Sciences


    from

    “The addition of these seven faculty, after hiring six new faculty in 2016-2017, further cements CICS and UMass Amherst as research destinations of choice. We are thrilled to welcome these outstanding researchers and educators,” said James Allan, chair of the CICS faculty.


    Q&A: Institute’s New Director Leads Quest to Glean Wisdom From Data

    University of Virginia, UVA Today


    from

    For Philip Bourne, a leading data science researcher and the new director of the University of Virginia’s Data Science Institute, mining today’s massive data sets for truth and wisdom – and then sharing that insight with others – is an abiding passion. … Bourne recently offered his thoughts about the Data Science Institute, and data science generally.


    Movidius Beefs up HW Acceleration in AI Chip

    EE Times, Junko Yoshida


    from

    Announcements about so-called deep learning processors are becoming almost as frequent nowadays as tweets from the White House. As the technology industry’s appetite for neural networks grows, so does the demand for powerful, but very low-power inference engines adaptable to a variety of embedded systems.

    Against that backdrop, Movidius, a subsidiary of Intel, on Monday (Aug. 28) launched its Myriad X vision processing unit, a follow-up to the Myriad 2 after 18 months.

    Asked what separates Myriad X from other deep-learning chips announced in recent months, Remi El-Ouazzane, vice president and general manager of Movidius’ Intel New Technology Group, told us, “None of those are shipping. Myriad processors are.”


    Measuring Social Connectedness

    NBER Working Papers; Michael Bailey, Ruiqing (Rachel) Cao, Theresa Kuchler, Johannes Stroebel, Arlene Wong


    from

    We introduce a new measure of social connectedness between U.S. county-pairs, as well as between U.S. counties and foreign countries. Our measure, which we call the “Social Connectedness Index” (SCI), is based on the number of friendship links on Facebook, the world’s largest online social networking service. Within the U.S., social connectedness is strongly decreasing in geographic distance between counties: for the population of the average county, 62.8% of friends live within 100 miles. The populations of counties with more geographically dispersed social networks are generally richer, more educated, and have a higher life expectancy. Region-pairs that are more socially connected have higher trade flows, even after controlling for geographic distance and the similarity of regions along other economic and demographic measures. Higher social connectedness is also associated with more cross-county migration and patent citations. Social connectedness between U.S. counties and foreign countries is correlated with past migration patterns, with social connectedness decaying in the time since the primary migration wave from that country. Trade with foreign countries is also strongly related to social connectedness. These results suggest that the SCI captures an important role of social networks in facilitating both economic and social interactions. Our findings also highlight the potential for the SCI to mitigate the measurement challenges that pervade empirical research on the role of social interactions across the social sciences.
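
A toy illustration of the idea follows. The abstract does not give the paper’s exact formula, so scaling friendship counts by the product of the two counties’ user counts, and the numbers themselves, are assumptions made here purely for illustration.

```python
# Minimal sketch of a connectedness index in the spirit of the SCI described above.
# The normalization (links divided by the product of the two counties' user counts)
# and all numbers are illustrative assumptions, not the paper's actual data or formula.
links = {("A", "B"): 1200, ("A", "C"): 90, ("B", "C"): 400}  # friendship links between county pairs
users = {"A": 10_000, "B": 25_000, "C": 8_000}               # Facebook users per county

def connectedness(i, j):
    """Friendship links between counties i and j, scaled by the product of user counts."""
    pair = (i, j) if (i, j) in links else (j, i)
    return links[pair] / (users[i] * users[j])

for (i, j) in links:
    print(f"index {i}-{j}: {connectedness(i, j):.2e}")
```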

     
    Events



    The Information War: Fake News, Privacy and Big Data

    UW MS in Data Science


    from

    Seattle, WA “Join University of Washington experts in data science, social media and law to learn the latest techniques to detect fake news and BS data.” September 18, starting at 6 p.m., ImpactHub (220 Second Ave. S.). [free, registration required]


    Digital Agriculture: Data Carpentry Workshop

    Data Carpentry


    from

    Ames, IA September 12-13. The workshop will be limited to 40 participants. [registration required]

     
    Deadlines



    A CS Education Summit in Pittsburgh: Addressing the Challenges of Increasing Interest in Computing at the Undergraduate Level through Institutional Transformation

    Organized by Computing Research Association. “We are looking for a diverse community of participants based on geography, level of instruction, service to underrepresented communities, and demonstrated leadership in the CS education community.” Deadline to indicate your interest is Tuesday, September 5.

    Senior Capstone Project

    The Senior Capstone Project offers the opportunity for organizations to propose a project that our graduate students will work on as part of their curriculum for one semester. Here you will find information on the course along with a questionnaire to propose a project.

    NYCDH Graduate Student Project Award

    We are pleased to announce our fourth annual cross-institutional NYCDH Digital Humanities Graduate Student Project Award. We invite all graduate students attending an institution in New York City and the metropolitan area to apply by Tuesday, September 5, 2017.

    Technology in Journalism Award

    The Technology in Journalism Award recognizes individuals or organizations that develop, adapt or creatively apply specific tools or technologies in the gathering and reporting of impactful journalism of the highest quality. Deadline for entries is Friday, October 6.

    Call for submissions: Rescue your data – Scientific Data

    “Scientific Data is inviting submissions that release data underlying influential research papers published three or more years ago, for potential inclusion in a special collection to be launched in 2018. In particular, we are encouraging submissions that describe important datasets that were not practical to share online with the original publication.” Deadline for consideration is December 1.
     
    Tools & Resources



    Introducing KSQL: Open Source Streaming SQL for Apache Kafka

    Confluent, Neha Narkhede


    from

    “I’m really excited to announce KSQL, a streaming SQL engine for Apache Kafka™. KSQL lowers the entry bar to the world of stream processing, providing a simple and completely interactive SQL interface for processing data in Kafka. You no longer need to write code in a programming language such as Java or Python! KSQL is open-source (Apache 2.0 licensed), distributed, scalable, reliable, and real-time. It supports a wide range of powerful stream processing operations including aggregations, joins, windowing, sessionization, and much more.”
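
For a flavor of what that looks like in practice, here is a rough sketch. The statements are KSQL, wrapped in a Python string only to keep this issue’s examples in one language; the topic and column names are illustrative, and the exact syntax should be checked against the KSQL documentation.

```python
# Hedged sketch: the text below is KSQL (run at the ksql> prompt), not Python.
# Topic and column names are made up for illustration; check syntax against the docs.
ksql_example = """
-- Declare a stream over an existing Kafka topic of JSON page-view events
CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
  WITH (kafka_topic = 'pageviews', value_format = 'JSON');

-- Continuous windowed aggregation: views per page over 30-second tumbling windows
CREATE TABLE pageviews_per_page AS
  SELECT pageid, COUNT(*) AS views
  FROM pageviews
  WINDOW TUMBLING (SIZE 30 SECONDS)
  GROUP BY pageid;
"""
print(ksql_example)
```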


    How to Train a Simple Audio Recognition Network

    TensorFlow


    from

    “This tutorial will show you how to build a basic speech recognition network that recognizes ten different words. It’s important to know that real speech and audio recognition systems are much more complex, but like MNIST for images, it should give you a basic understanding of the techniques involved. Once you’ve completed this tutorial, you’ll have a model that tries to classify a one second audio clip as either silence, an unknown word, ‘yes’, ‘no’, ‘up’, ‘down’, ‘left’, ‘right’, ‘on’, ‘off’, ‘stop’, or ‘go’. You’ll also be able to take this model and run it in an Android application.”
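
To make the shape of the task concrete, here is an illustrative Keras sketch, not the tutorial’s code: a small convolutional net that maps a one-second spectrogram to one of the twelve labels. The input shape and layer sizes are arbitrary choices made for this sketch.

```python
# Illustrative only; the TensorFlow tutorial uses its own scripts and architecture.
import tensorflow as tf

NUM_CLASSES = 12                 # silence, unknown, yes, no, up, down, left, right, on, off, stop, go
SPECTROGRAM_SHAPE = (98, 40, 1)  # (time frames, frequency bins, channels); assumed, not from the tutorial

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, (8, 20), activation="relu", input_shape=SPECTROGRAM_SHAPE),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (4, 10), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(train_spectrograms, train_labels, epochs=10)  # spectrogram extraction omitted
```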


    Fashion-MNIST

    GitHub – zalandoresearch


    from

    “Fashion-MNIST is a dataset of Zalando’s article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.”
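
Because the files use the same compressed IDX layout as the original MNIST, a plain NumPy loader is enough to get started. A minimal sketch follows; the file names follow the MNIST convention used by the repository, and paths may need adjusting.

```python
# Hedged sketch: load Fashion-MNIST from the gzipped IDX files distributed by the repo.
import gzip
import numpy as np

def load_idx_images(path):
    with gzip.open(path, "rb") as f:
        data = np.frombuffer(f.read(), dtype=np.uint8)
    # 16-byte header: magic number, image count, rows, cols
    return data[16:].reshape(-1, 28, 28)

def load_idx_labels(path):
    with gzip.open(path, "rb") as f:
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data[8:]  # 8-byte header: magic number, label count

train_images = load_idx_images("train-images-idx3-ubyte.gz")
train_labels = load_idx_labels("train-labels-idx1-ubyte.gz")
print(train_images.shape, train_labels.shape)  # expected: (60000, 28, 28) (60000,)
```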


    How GitLab can help in research reproducibility

    GitLab, Vicky Steeves


    from

    “GitLab is a great platform for active, ongoing, collaborative research. It enables folks to work together easily and share that work in the open. This is especially poignant given the problems in sharing code in academia, across time and people.”

    “It’s no surprise that GitLab, a platform for collaborative coding and Git repository hosting, has features for reproducibility that researchers can leverage for their own and their communities’ benefit.”


    lolviz

    GitHub – parrt


    from

    “A simple Python data-structure visualization tool for Lists Of Lists, lists, dictionaries, and linked lists; primarily for use in Jupyter notebooks / presentations. It seems that I’m always trying to describe how data is laid out in memory to students. There are really great data structure visualization tools but I wanted something I could use directly via Python in Jupyter notebooks. The look and idea was inspired by the awesome Python tutor.”
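
A rough usage sketch in a Jupyter notebook cell, based on the project description; the function names below are an assumption about the API and should be double-checked against the README, and the system Graphviz package is needed alongside the Python package.

```python
# Assumed API: listviz() for flat lists, lolviz() for lists of lists.
# Run in a Jupyter cell so the returned graph renders inline.
from lolviz import listviz, lolviz

squad = ["parrt", "mary", "sam"]
table = [[1, 2, 3], [4, 5], [6]]

listviz(squad)   # draw the list as a row of boxed cells
lolviz(table)    # draw the outer list with pointers to each inner list
```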

     
    Careers


    Full-time positions outside academia

    Safety Researcher



    OpenAI; San Francisco, CA

    Division Director, Division of Computing and Communication Foundations, CISE



    National Science Foundation; Arlington, VA

    Tenured and tenure track faculty positions

    Assistant Professor in Syntax



    NYU, Department of Linguistics; New York, NY
