Modern data visualization software makes it easy for users to explore large datasets in search of interesting correlations and new discoveries. But that ease of use — the ability to ask question after question of a dataset with just a few mouse clicks — comes with a serious pitfall: it increases the likelihood of making false discoveries.
At issue is what statisticians refer to as “multiple hypothesis error.” The problem is essentially this: the more questions someone asks of a dataset, the more likely they are to stumble upon something that looks like a real discovery but is actually just a random fluctuation in the data.
A team of researchers from Brown University is working on software to help combat that problem. This week at the SIGMOD 2017 conference in Chicago, they presented a new system called QUDE, which adds real-time statistical safeguards to interactive data exploration systems to help reduce false discoveries.
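The risk is easy to demonstrate: even when a dataset contains nothing but noise, the chance of seeing at least one “significant” result grows rapidly with the number of questions asked. Here is a minimal sketch of the standard multiple-testing arithmetic (this illustrates the general problem, not QUDE’s specific method):

```python
# Probability of at least one false discovery when asking m independent
# questions of pure-noise data, each tested at significance level alpha.
def family_wise_error_rate(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

# The classic Bonferroni correction: shrink each test's threshold so the
# chance of any false discovery across all m tests stays near alpha.
def bonferroni_threshold(m, alpha=0.05):
    return alpha / m

print(family_wise_error_rate(1))    # one question: about a 5% chance of a fluke
print(family_wise_error_rate(100))  # a hundred questions: a fluke is near-certain
print(bonferroni_threshold(100))    # per-test threshold needed to compensate
```

Systems like QUDE aim to apply this kind of control automatically, in real time, as the analyst clicks through question after question.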
Knoxville News-Sentinel, USA Today Network, Rachel Ohm
A new doctoral program at the University of Tennessee Knoxville will look at data, the way it’s collected and how it can be used to make advancements in other disciplines.
The data science and engineering, or “Big Data,” Ph.D. program comes amid the explosion of data made possible by the digital age, and will be the third program of its type in the nation, according to experts at UT and Oak Ridge National Laboratory, which will partner with UT on the program.
We are in the middle of a major shift in computing that’s transitioning us from a mobile-first world into one that’s AI-first. AI will touch every industry and transform the products and services we use daily. Breakthroughs in machine learning have enabled dramatic improvements in the quality of Google Translate, made your photos easier to organize with Google Photos, and enabled improvements in Search, Maps, YouTube, and more. We’re also sharing the underlying technology with developers and researchers via open-source software such as TensorFlow, academic publications, and a full suite of Cloud machine learning services. Join this session to hear some of Alphabet’s top machine learning experts discuss their cutting-edge research and the opportunities they see ahead.
Imagine your favorite go-to recipe mutated to conform to the traditional methods and ingredients of any number of diverse regional food cultures. Consider, say, lasagne, but a sort of lasagne that’s instead a naturally occurring part of Japanese or Ethiopian cuisine. Not “fusion,” but something deeper—a whole rewriting of what a lasagne even is according to the culinary traditions of some other place.
It’s not necessarily an easy or natural thing to do, but a new machine learning algorithm developed by a team of French, American, and Japanese researchers offers an automated solution based on neural networks and large amounts of food data. The result, which is described in a paper published this month to the arXiv preprint server (via I Programmer), is a system that can take a given recipe and shift it into an alternative dietary style—sushi lasagne, say—as well as parse a recipe for its underlying style components.
“The application we’re thinking of is redubbing a video into another language,” says Joon Son Chung at University of Oxford, one of the creators of the system. In the future, the audio from news clips could be automatically translated into another language and the images updated to fit.
This isn’t the first system to automatically adjust images to new audio, but others have needed large amounts of video to work. They would pair up the way a person’s mouth moved when they made different sounds and then use that part of the image in edited footage.
Last week at the ACM CHI conference in Denver, we presented a new paper that describes how children are thinking through some of the implications of new forms of data collection and analysis. The paper is open access and online.
Over the last couple of years, we’ve worked on a large project to support children in doing — and not just learning about — data science. We built a system, Scratch Community Blocks, that allows the 18 million users of the Scratch online community to write their own computer programs — in Scratch of course — to analyze data about their own learning and social interactions. An example program that finds how many of one’s followers on Scratch are not from the United States is shown below.
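The actual programs are written as Scratch blocks; purely as a hypothetical illustration of what such an analysis computes, here is the same logic in Python (the data fields and usernames are invented for this sketch):

```python
# Count how many of a user's followers report a country other than the
# United States. The 'country' field is an assumption for illustration.
def count_non_us_followers(followers):
    """followers: list of dicts, each with a 'country' key."""
    return sum(1 for f in followers if f.get("country") != "United States")

followers = [
    {"username": "alice", "country": "United States"},
    {"username": "bruno", "country": "Brazil"},
    {"username": "chika", "country": "Japan"},
]
print(count_non_us_followers(followers))  # → 2
```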
The Washington Post, Philip N. Howard and Robert Gorwa
Facebook deployed a “cross functional team of engineers, analysts and data scientists” as part of a detailed investigation into possible foreign involvement in the U.S. election. They found fake groups, fake likes and comments, and automated posting across the network by unnamed malicious actors. The report’s authors claim that their investigation “does not contradict” the findings made in the U.S. Director of National Intelligence report published in January, which blamed Russia for a sweeping online influence campaign conducted in the lead-up to the election.
Essentially, this confirms what researchers have suspected for several years: Large numbers of fake accounts have been used to strategically disseminate political propaganda and mislead voters. These accounts draw everyday users into “astroturf” political groups disguised as legitimate grass-roots movements. Unfortunately, Facebook’s refusal to collaborate with scientists and share data has made it difficult to know how many voters are affected or where this election interference comes from.
Phone tracking technology is already used to locate those in need of aid in humanitarian crises, but the latest development could help further, for example by identifying vulnerable groups such as women who may have young children.
Mobile phones are one of the fastest growing technologies in the developing world with global penetration rates reaching 90 per cent. However, the fact that most phones in developing countries are pre-paid means that the data lacks key information about the person carrying the phone, including gender and other demographic data, which could be useful in a crisis.
Journal of the American Medical Association, M.J. Friedrich
Small but mighty, the Fogarty International Center has had an outsized impact on improving health around the world for the last half century. By providing funding to advance international health research and train health researchers from the United States and low- and middle-income countries, its efforts have benefited patients worldwide, including in the United States.
The Fogarty Center was established in 1968 in honor of Congressman John E. Fogarty (D-RI), who was an advocate for global health research. The smallest of the 27 institutes and centers that comprise the US National Institutes of Health (NIH), the Fogarty Center has relied on a mere fraction of the total NIH budget to train more than 6000 scientists to carry out over 500 research and training projects in more than 100 countries since its inception. The Center awards about $54 million through about 500 grants each year, with 80% of funds going to US institutions and 100% of Fogarty’s grants involving US scientists.
Long and/or fun reads
Christopher Burger, a PhD student at the Max Planck Institute for Intelligent Systems, wrote – and illustrated! – a scintillating explainer on what Google did to make AlphaGo. It’s a great way to introduce your students (or yourself) to machine learning concepts. Note: Ke Jie lost to the AI in the most recent heartbreaking match.
Hilarious app of the semester: Not Hotdog, originating from the HBO television show Silicon Valley. “What would you say if I told you there is a app on the market that tell you if you have a hotdog or not a hotdog. It is very good and I do not want to work on it any more. You can hire someone else.” A roomful of data scientists might subsequently discuss whether hot dogs are a subset of the category ‘sandwich.’
There are great teachers and weak teachers. Statistician William Sanders spent the majority of his career trying to measure how much value particular teachers add, given the trajectories of their students. Intervening in children’s lives is always controversial; this piece details how Sanders’ work affected teachers and teaching, and reveals just how important statistics can be for policy development.
“What the internet is missing,” Ev Williams [co-founder of Twitter] argues, “is an ethical framework, a new business model that will introduce a market correction to what the internet perceives as user demand for extremism.” Williams goes on to talk about how the internet is broken and whether there is a way to make publishing profitable.
Novelist Gary Shteyngart has a piece on robots in Seoul that is written with far more literary references and metaphorical turns of phrase than typical of science writing. Worth a lunch read.
In our Foursquare City Guide app, we use machine learning to rank tips. We spent some time manually extracting signals about tips, such as: What is the sentiment of the tip? Does it have a photo? How long ago was it written? How many upvotes has it received? We calculated these features, then had real users rank a set of tips so we could understand how each tip is valued, and applied that learning to all of our tips. That’s why City Guide tips aren’t ranked by date or popularity, but by how valuable they are to our users, making tips spot-on and relevant as often as possible.
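As a rough illustration of the kind of per-tip feature extraction described above (the field names and the precomputed sentiment score are placeholders for this sketch, not Foursquare’s actual API):

```python
from datetime import datetime, timezone

# Hypothetical per-tip feature extraction: turn a raw tip record into the
# numeric signals a ranking model could consume.
def tip_features(tip, now=None):
    now = now or datetime.now(timezone.utc)
    return {
        "sentiment": tip.get("sentiment", 0.0),        # e.g. from a sentiment model
        "has_photo": int(bool(tip.get("photo_url"))),  # 1 if a photo is attached
        "age_days": (now - tip["created_at"]).days,    # recency signal
        "upvotes": tip.get("upvotes", 0),              # community endorsement
    }

tip = {
    "text": "Get the lamb shawarma.",
    "sentiment": 0.9,
    "photo_url": "https://example.com/p.jpg",
    "created_at": datetime(2017, 1, 1, tzinfo=timezone.utc),
    "upvotes": 12,
}
features = tip_features(tip)
```

A model trained on human rankings would then learn how to weight these signals against each other.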
Increasing evidence suggests that a growing share of social media content is generated by autonomous entities known as social bots. Many social bots perform useful functions, but there is a growing record of malicious applications of social bots. We believe it is important to provide public datasets and tools that help identify social bots, since deception and detection technologies are in an arms race.
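For a flavor of what bot-detection signals look like, here is a toy heuristic; real detection tools like the ones the authors describe combine many more signals in supervised models trained on labeled accounts, so the thresholds and weights below are invented for illustration only:

```python
# Toy bot-likelihood score from three common red flags. All thresholds
# and weights are illustrative, not from any real detection system.
def bot_score(account):
    """account: dict with posts_per_day, followers, following, has_default_avatar."""
    score = 0.0
    if account["posts_per_day"] > 50:     # superhuman posting rate
        score += 0.4
    if account["following"] > 10 * max(account["followers"], 1):
        score += 0.3                      # follows far more than it is followed
    if account["has_default_avatar"]:     # never personalized the profile
        score += 0.3
    return score  # 0.0 (likely human) .. 1.0 (likely bot)
```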
Applications based on event streams have more demanding architectural qualities than ever, and traditional approaches to storing, querying and reacting to patterns are tearing at the seams. Business requirements mandate that our systems both record everything that’s ever happened, yet also summarize the entirety of that history with increasingly low latency. Reconciling these attributes and others into a new, unified architecture benefits from a change in perception of the problem.
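The tension the abstract describes (record everything that ever happened, yet summarize it quickly) is the heart of the append-only event-log pattern. A minimal sketch, with the bank-account domain and all names invented for illustration:

```python
# Append-only event log: state is never overwritten, only appended to.
log = []

def record(event):
    log.append(event)

def balance(account):
    # Derive the current summary by folding over the entire history.
    # Real systems keep incrementally-maintained views to cut latency.
    return sum(e["amount"] for e in log if e["account"] == account)

record({"account": "a", "amount": 100})
record({"account": "a", "amount": -30})
record({"account": "b", "amount": 10})
print(balance("a"))  # → 70
```

The architectural shift the talk points at is reconciling this full-history log with low-latency summaries in one unified design, rather than bolting the two together.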
Picasso is a free open-source DNN visualization tool that gives you partial occlusion and saliency maps with minimal fuss. “At Merantix, we work with a variety of neural network architectures; we developed Picasso to make it easy to see standard visualizations across our models in our various verticals.”
For a while now, Microsoft has provided a free Jupyter Notebook service on Microsoft Azure. At the moment it provides compute kernels for Python, R, and F#, with up to 4 GB of memory per session. Anyone with a Microsoft account can upload their own notebooks, share notebooks with others, and start computing or doing data science for free.
The University of Cambridge uses them for teaching, and they’ve also been used by the LIGO team (gravitational waves) for dissemination purposes.
This got me wondering. How much power does Microsoft provide for free within these notebooks?
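One way to probe that from inside a notebook session is with the Python standard library alone (the `sysconf` names are POSIX-specific, so this assumes the session runs on a Linux host):

```python
import os

# Inspect the resources visible to this session: CPU count from the
# standard library, physical memory via POSIX sysconf values.
cores = os.cpu_count()
mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")

print("CPU cores:", cores)
print("Total memory: %.1f GB" % (mem_bytes / 1024**3))
```

Note this reports what the host exposes to the process; containerized services may cap a session below the machine totals shown here.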