
Data mining Google as an indicator of information dissemination on sexually transmitted diseases

Author: Diana Soto De Jess, University of Amsterdam


diana.soto.dj@gmail.com, dianasoto.com

1. INTRODUCTION
Searching for information online has become commonplace in today's always-connected culture, and health information is no exception. This creates a need to study what health information is available through search engines and how it is distributed. This research uses cross-spherical analysis to gather data on how often specific STDs are mentioned when the term STDs is queried in three different spheres: the Websphere, the Blogosphere, and the Newsphere. It places these findings on information dissemination within the frame of current public awareness of STDs and potential needs for future education campaigns.

4. RESULTS AND DISCUSSION


[Figure: STDs Prevalence in the Websphere. Pie chart; legend: HIV, AIDS, Hepatitis, Herpes, Syphilis, Chlamydia, Gonorrhea, PID, Bacterial Vaginosis, HPV, Trichomoniasis, Others.]

5. CONCLUSION
The results point to a disparity between the actual spread of STDs and the information about them available through search engines. This is reflected in the dominance of AIDS and HIV (AIDS in particular) in two of the three spheres evaluated, the Newsphere and the Blogosphere. It is striking that the spheres known for carrying current information are precisely the ones that lag behind the current incidence and prevalence of STDs. These information gaps have potential implications for public health practitioners deciding which areas need more thorough educational campaigns.

I. In the Blogosphere and the Newsphere, AIDS is the most mentioned STD, while in the Websphere it is HIV.
However, the current health picture shows a decline in AIDS cases and a rise in HIV cases. It is noteworthy, then, that AIDS dominates the information available in precisely the spheres defined by being up to date (News and Blogs), even though they do not seem to have caught up with current knowledge on these diseases.


2. BASIC CONCEPTS
CROSS-SPHERICAL ANALYSIS - A digital method for conducting comparative media analysis between spheres as demarcated by search engines; it enables tracking the status and handling of topics across media that manage sources differently.
SPHERE - A grouping of sources demarcated by a search engine; it reflects how the engine categorizes and values different sources and the information they offer according to characteristics such as temporality, popularity and reputation, among others.
WEBSPHERE - Offers a potentially unlimited number of sources and ranks them by popularity, through PageRank, rather than by reliability or freshness.
BLOGOSPHERE - Results are limited to blogs (a particular kind of source) and are ranked by how recent the information is (freshness) and by how frequently the blog is updated.
NEWSPHERE - Results are limited to a list of 4,500 newspapers predetermined by Google. The defining criteria are accountability, reliability and currency.

[Figure: STDs Prevalence in the Newsphere. Pie chart; slice values cannot be matched to labels in this copy.]

II. While the Websphere offers a diversity of information on different STDs, in the Newsphere and particularly in the Blogosphere AIDS and HIV predominate as practically the only STDs on which information is readily available.
The lack of information on STDs other than AIDS and HIV, in not one but two spheres, is worrying. Take, for example, the fact that in 2008, 1,210,523 cases of Chlamydia infection were reported to the CDC, the largest number of cases ever reported to the CDC for any condition (CDC, Sexually Transmitted). Yet Chlamydia is barely a blip in the Newsphere and practically nonexistent in the Blogosphere. There is a disparity between the reality of disease spread and the information available on it through a medium as popular as search engines.

[Figure: STDs prevalence in Google Blogs. Pie chart; legend: HIV, AIDS, Hepatitis, Herpes, Syphilis, Chlamydia, Gonorrhea, PID, Bacterial Vaginosis, HPV, Trichomoniasis, Others.]

Every search leaves data behind, data that says something about the concerns and interests of the user making the search. While one search does not give much valuable information, a million searches do. The capacity of Google, and of search engines in general, to find, analyze and categorize huge and constantly evolving quantities of information has opened up the possibility of conducting research on contemporary cultural trends and behaviors through these tools. Google data has been used successfully to predict the flu: by tracking certain search queries, Google data accurately predicted flu spread 7 to 10 days earlier than the CDC statistics. It has also been used as a metric for knowledge dissemination, with the number of times an article has been cited in Google Scholar signaling its dissemination among the scientific community. With over 1.8 billion users worldwide, the Internet has emerged as a medium that cannot be ignored. What people find online both shapes the information they receive and reflects their biases. Health professionals, and public health educators in particular, should be aware of what information is available and how it is distributed in this most important of media.


72% of Internet users in the US look for health information online, and 77% of those do so through search engines (Pew Research Center).


3. METHODS
The three selected spheres (the Websphere, the Blogosphere, and the Newsphere) were queried through Google.com (Google Web, Google Blogs and Google News respectively) using software provided by the Digital Methods Initiative. The term STDs was queried in all three spheres and the top hundred results of each were retained as the lists to be scraped. These results were cleaned to preserve only results related to STDs. A second list, containing known STDs, was compiled from the Centers for Disease Control and Prevention (CDC) webpage on STDs. The CDC lists HIV and AIDS as one unit (HIV/AIDS); however, since an HIV-positive person does not necessarily have AIDS and will not necessarily develop it, the two terms were queried separately for this research. Furthermore, querying HIV/AIDS seems less intuitive than querying the terms separately. Using the Digital Methods Initiative tools, each of the top 100 results for each sphere was queried for each of the specific STDs. This measured the number of mentions of each specific STD per sphere, giving the prevalence of each STD within each queried sphere. All the data was collected in December 2010.
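The counting step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Digital Methods Initiative software: the function names `count_mentions` and `prevalence` are hypothetical, the term list is an illustrative stand-in for the list taken from the CDC page, and the input is assumed to be the already-scraped text of one sphere's cleaned results.

```python
import re
from collections import Counter

# Illustrative term list (assumption); the poster's list came from the CDC page on STDs.
STD_TERMS = ["HIV", "AIDS", "Hepatitis", "Herpes", "Syphilis", "Chlamydia",
             "Gonorrhea", "PID", "Bacterial Vaginosis", "HPV", "Trichomoniasis"]


def count_mentions(pages, terms=STD_TERMS):
    """Count whole-word mentions of each term across the page texts of one sphere."""
    counts = Counter({term: 0 for term in terms})
    for text in pages:
        for term in terms:
            # Acronyms are matched case-sensitively so that "AIDS" does not
            # match the ordinary word "aids"; other terms ignore case.
            flags = 0 if term.isupper() else re.IGNORECASE
            pattern = r"\b" + re.escape(term) + r"\b"
            counts[term] += len(re.findall(pattern, text, flags=flags))
    return counts


def prevalence(counts):
    """Convert raw mention counts into each term's percentage share of all mentions."""
    total = sum(counts.values())
    if total == 0:
        return {term: 0.0 for term in counts}
    return {term: 100.0 * n / total for term, n in counts.items()}
```

Running `prevalence(count_mentions(pages))` once per sphere gives three sets of percentage shares that can be compared side by side. Whether the original tools counted every mention or only the number of results containing a term is not stated in the poster; this sketch counts every mention.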


III. Cross-spherical analysis offers the opportunity to identify gaps in what information on current public health issues is available and how accessible it is to the public. It may help in designing upcoming education campaigns to bridge current gaps in public awareness.
This may include prompting a shift in focus in News and Blogs from AIDS to HIV, one that contextualizes and updates the discussion on HIV. It also includes encouraging discussion of other STDs, which thus far are practically ignored in these spheres.

6. REFERENCES
CDC. Sexually Transmitted Disease Surveillance, 2008. Atlanta, GA: CDC, 2008. Print.
Rogers, Richard. The End of the Virtual: Digital Methods. Amsterdam: University of Amsterdam, 2009. Print.
Fox, Susana. Pew Internet: Health. Pew Research Center for Internet & Society. 20 February 2013. Web. 6 March 2013.

7. ACKNOWLEDGMENTS
This research was carried out using data mining tools developed by the Digital Methods Initiative, based in Amsterdam. The researcher also received specialized training in the use of these tools while participating in the Digital Methods Initiative Summer School.
