When big isn't better: How the flu bug bit Google

March 13, 2014
Ryan Kennedy is a political science professor at the University of Houston. Credit: University of Houston

Numbers and data can be critical tools in bringing complex issues into crisp focus. The understanding of diseases, for example, benefits from algorithms that help monitor their spread. But without context, a number may just be a number, or worse, misleading.

"The Parable of Google Flu: Traps in Big Data Analysis" is published in the journal Science, funded, in part, by a grant from the National Science Foundation. Specifically, the authors examine Google's data-aggregating tool Google Flu Trend (GFT), which was designed to provide real-time monitoring of cases around the world based on Google searches that matched terms for flu-related activity.

"Google Flu Trend is an amazing piece of engineering and a very useful tool, but it also illustrates where 'big data' analysis can go wrong," said Ryan Kennedy, University of Houston political science professor. He and co-researchers David Lazer (Northeastern University/Harvard University), Alex Vespignani (Northeastern University) and Gary King (Harvard University) detail new research about the problematic use of big data from aggregators such as Google.

Even with modifications to GFT over many years, the tool, which set out to improve response to flu outbreaks, has overestimated peak flu levels in the U.S. over the past two years.

"Many sources of 'big data' come from private companies, who, just like Google, are constantly changing their service in accordance with their business model," said Kennedy, who also teaches research methods and statistics for political scientists. "We need a better understanding of how this affects the data they produce; otherwise we run the risk of drawing incorrect conclusions and adopting improper policies."

GFT overestimated the prevalence of flu in the 2012-2013 season, as well as the actual levels of flu in 2011-2012, by more than 50 percent, according to the research. Additionally, from August 2011 to September 2013, GFT over-predicted the prevalence of flu in 100 out of 108 weeks.
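To put those figures in proportion, the arithmetic can be sketched in a few lines of Python. This is only an illustration, not the method used in the Science paper: the function names are hypothetical, and the predicted and observed values in the second example are invented to show what a 50 percent overestimate looks like. Only the 100-of-108-weeks count comes from the article.

```python
def share_of_weeks(weeks_over: int, weeks_total: int) -> float:
    """Percentage of weeks in which the prediction was too high."""
    if weeks_total <= 0:
        raise ValueError("weeks_total must be positive")
    return weeks_over / weeks_total * 100.0


def overestimate_pct(predicted: float, observed: float) -> float:
    """Relative overestimate of `predicted` against `observed`, in percent."""
    if observed <= 0:
        raise ValueError("observed must be positive")
    return (predicted - observed) / observed * 100.0


# Figure reported in the article: 100 of 108 weeks were over-predicted.
weekly_share = share_of_weeks(100, 108)
print(f"{weekly_share:.1f}% of weeks over-predicted")  # prints "92.6% of weeks over-predicted"

# Hypothetical values (NOT from the study): a prediction of 3.0 against
# an observed level of 2.0 is a 50 percent overestimate.
example = overestimate_pct(predicted=3.0, observed=2.0)
print(f"{example:.0f}% overestimate")  # prints "50% overestimate"
```

In other words, the tool ran high in roughly nine of every ten weeks over that period.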

The team also questions data collected from platforms such as Twitter and Facebook, for example to gauge polling trends and market popularity, because campaigns and companies can manipulate these platforms to ensure their products are trending.

Still, the article contends there is room for data from the Googles and Twitters of the Internet to combine with more traditional methodologies, in the name of creating a deeper and more accurate understanding of human behavior.

"Our analysis of Google Flu demonstrates that the best results come from combining information and techniques from both sources," Kennedy said. "Instead of talking about a ' revolution,' we should be discussing an 'all data revolution,' where new technologies and techniques allow us to do more and better analysis of all kinds."

More information: "The Parable of Google Flu: Traps in Big Data Analysis," by D. Lazer et al. Science, 2014.

1 comment

adam_russell_9615
Mar 13, 2014
Just because it was an overestimate does not say the algorithm is bad, only that it needs improvement. Even +50% is not bad for a simulation based on metadata.