
Characterizing User-Generated Text Content Mining: a Systematic Mapping Study of the Portuguese Language

Ellen Souza¹, Dayvid Castro¹, Douglas Vitório¹, Ingryd Teles¹, Adriano L. I. Oliveira², Cristine Gusmão³

¹ MiningBR Research Group, Federal Rural University of Pernambuco (UFRPE),
Serra Talhada, PE, Brazil
eprs@uast.ufrpe.br, wellescastro@gmail.com, douglas.alisson17@gmail.com,
ingryd_vanessa_@hotmail.com
² Centro de Informática, Federal University of Pernambuco (CIn-UFPE),
Recife, PE, Brazil
{eprs,alio}@cin.ufpe.br
³ Programa de Pós-graduação em Engenharia Biomédica, Centro de Tecnologia e
Geociências - Federal University of Pernambuco (CTG-UFPE), Recife, PE, Brazil
cristinegusmao@gmail.com

Abstract. Unstructured data accounts for more than 80% of enterprise data and
is growing at an annual exponential rate of 60%. Text mining refers to the
process of discovering new, previously unknown and potentially useful
information from a variety of unstructured data, including user-generated text
content (UGTC). Given that Portuguese is one of the most common
languages in the world and is also the second most frequent language on
Twitter, the goal of this work is to plot the landscape of current studies that
relate the application of text mining to UGTC in the Portuguese language. The
systematic mapping review method was applied to search for, select, and extract
data from the included studies. Our manual and automated searches retrieved
6075 studies published up to 2014, of which 35 were included in the study. Text
classification accounts for 79% of all text mining tasks, with Naïve Bayes
as the main classifier and Twitter as the main data source.

Keywords: Text Mining, Text Classification, Opinion Mining, User-Generated
Content, Portuguese Language.

1 Introduction

The growth of social media and user-generated content (UGC) on the Internet
provides a huge quantity of information that allows discovering the experiences,
opinions, and feelings of users or customers [1]. The volume of data generated in
social media has grown from terabytes to petabytes.
According to [2], about 80% of corporate data are stored in an unstructured way,
mainly in text format, and are growing at an annual exponential rate of 60%. However,
unstructured texts cannot be simply processed by machines, which typically handle
text as simple sequences of character strings. Specific processing methods, techniques
and algorithms are required in order to extract knowledge from text [3].

© Springer International Publishing Switzerland 2016
Á. Rocha et al. (eds.), New Advances in Information Systems and Technologies,
Advances in Intelligent Systems and Computing 444,
DOI 10.1007/978-3-319-31232-3_96

Text mining or knowledge discovery from text (KDT) was mentioned for the first
time in 1995 by Feldman and Dagan as a machine-supported analysis of text. It is the
process of extracting knowledge from a large amount of unstructured data and is
also defined as an extension of data mining. However, in contrast to data mining, text
mining focuses on the extraction of knowledge from a large number of documents
written in natural language from various data sources, including UGC.
According to the Organization for Economic Co-operation and Development,
User-Generated or User-Created Content is defined as: i) content made publicly
available over the internet, ii) which reflects a certain amount of creative effort, and
iii) which is created outside of professional routines and practices. Types of UGC are:
text, novel and poetry; photo and images; music and audio; video and film; citizen
journalism; educational content; mobile and virtual content. In this article, we focus
on texts that are generated by users, that is, user-generated text content (UGTC).
Whereas data mining is largely language independent, text mining involves a
significant language component, justifying its study associated with one target
language. Most text mining tools focus on processing English documents [4], but
many other languages, including Spanish and Portuguese, have also been considered.
Portuguese is among the most spoken languages in the world, with
almost 270 million people¹ speaking some variant of the language, and research interest
in Portuguese processing is shared mainly between Portugal and Brazil [5]. The
growing interest in Portuguese is also related to the fact that the language is
the second most used on Twitter, which is one of the main sources of UGTC [6].
Thus, combining the guidelines for performing Systematic Mapping Studies [7] and Systematic
Reviews [8], the goal of this article is to characterize the current research
that reports the use of text mining for UGTC in the Portuguese language, driven by the
following general research question (RQ): What is the current state of text mining in
the Portuguese language for UGTC?
The automated and manual search procedures retrieved 6075 papers published up
to the year 2014, of which 35 were included in this study. The data² extracted from
the primary studies were systematically structured and analyzed to answer the
historical, descriptive and classificatory research questions presented below:
RQ1: what is the evolution in the number of publications up to year 2014?
RQ2: which individuals, organizations, and countries are the main contributors in the
research area?
RQ3: what are the adopted text mining tasks?
RQ4: what are the techniques, algorithms, methods and tools applied?
RQ5: what are the characteristics of UGTC data sources and how were they
evaluated?
The remainder of this article is structured as follows: Section 2 presents the related
work. Section 3 details the systematic mapping study protocol. In Section 4, a
comprehensive set of results is presented. Section 5 discusses the results, limitations
and threats to validity. Finally, Section 6 contains the conclusions and directions for
future work. Due to lack of space, the list of primary studies was not included in this
article; it is available online².

¹ Brazil (202.656.788), Mozambique (24.692.144), Angola (24.300.000), Portugal
(10.813.834), Guinea-Bissau (1.693.398), East Timor (1.201.542), Equatorial Guinea
(722.254), Macau (587.914), Cabo Verde (538.535) and São Tomé e Príncipe (190.428).
Data extracted from US/CIA - The World Factbook (July, 2014).
² Data are available at: http://bit.ly/1MX58hY

2 Related Work

Although we made an extensive search, we did not find any text mining
systematic mapping for the Portuguese language or, more specifically, for UGTC.
However, we found several language-independent text mining surveys [2, 4, 9], a
paper [5] describing the computational linguistics area in Brazil, a survey [10] of
automatic term extraction for Brazilian Portuguese, and a systematic review [11] of
user-generated content (UGC) applied to tourism and hospitality.
In [5], an overview of computational linguistics or natural language processing
(CL/NLP) in Brazil is presented. According to the authors, research in Brazil is varied
and deals mainly with Portuguese, English and Spanish processing. They also state
that research on text mining is mostly carried out not by computational linguistics
researchers but by researchers from the general artificial intelligence and
database areas. They estimate that Brazil has about 250 researchers in the CL/NLP area.
The largest CL/NLP research group in Brazil is the Interinstitutional Center for
Research and Development in Computational Linguistics (NILC), which includes
researchers mainly from the University of São Paulo, the Federal University of São Carlos
and the State University of São Paulo. The authors also state that the Brazilian
Symposium on Information and Human Language Technology (STIL) is the main
event in South America and that the International Conference on Computational
Processing of Portuguese Language (PROPOR) is the main conference with a focus on
the Portuguese language, giving equal space to research on text and speech processing.
In [10], a survey of the state of the art in automatic term extraction (ATE) for the
Brazilian Portuguese language is presented. According to the authors, there are still
several gaps to be filled, for instance, the lack of consensus regarding the formal
definition of the meaning of ‘term’. Such gaps are larger for Brazilian Portuguese
than for other languages, such as English, Spanish, and French. Examples
of gaps for Brazilian Portuguese include the lack of a baseline ATE system and the
lack of use of more sophisticated linguistic information, such as the WordNet and Wikipedia
knowledge bases.
In [11], a systematic review was conducted to examine how UGC data have been
used in empirical tourism and hospitality research. A total of 122 articles were systematically
surveyed. The main sources of UGC data were consumer review websites and blogs,
with Twitter classified as a blog. Text was the dominant UGC data type.

3 Review Method

Secondary studies review all the primary studies relating to a specific research
question with the aim of integrating and synthesizing evidence related to a specific
subject [8]. The systematic mapping study, also referred to as a scoping study, provides a
structure of the type of research reports and results that have been published by
categorizing them. It often gives a visual summary, the map, of its results [7].
Fig. 1 shows the adopted systematic mapping process. The first step comprises the
definition of the research protocol. The second, third, and fourth steps encompass the
identification, selection, and evaluation of primary studies in accordance with the
inclusion and exclusion criteria established in the review protocol. In the fifth step,
data from the included studies are extracted and synthesized in order to answer the
research questions.
We searched the literature looking for full papers (primary studies) that reported
text mining applications for UGTC in the Portuguese language. Primary studies that
met at least one of the following exclusion criteria were removed from the study: (i)
written in a language other than English or Portuguese; (ii) not available in online
scientific libraries; (iii) keynote speeches, workshop reports, books, theses, and
dissertations;

Fig. 1 Systematic Mapping Process based on [7]

3.3. Data Sources and Search Strategy

Automated and manual search processes were combined to achieve high coverage.
The automated search was constructed based on two search terms extracted from the
general research question presented in Section 1 (see Fig. 2). Synonyms for both
terms were extracted from the literature and, as we were also looking for primary studies
written in Portuguese, the Portuguese translations of the terms were also
included in the final query. This search retrieved studies on all kinds of text sources,
from which we selected only those dealing with text generated by users, that is, UGTC.

Fig. 2 Generic Search String
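
The construction of the generic search string described above (two search terms, each expanded with synonyms and Portuguese translations) can be sketched as follows. This is only an illustration: the synonym lists and the function name are hypothetical placeholders, not the authors' actual string from Fig. 2, and we assume the two terms correspond to "text mining" and "Portuguese".

```python
def build_search_string(term_groups):
    """OR the synonyms inside each group, then AND the groups together."""
    clauses = []
    for synonyms in term_groups:
        quoted = ['"{}"'.format(s) for s in synonyms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Hypothetical synonym lists (English terms plus assumed Portuguese
# translations); the real lists used in the study are not given in the text.
TEXT_MINING_TERMS = ["text mining", "mineracao de texto"]
LANGUAGE_TERMS = ["Portuguese", "portugues"]

query = build_search_string([TEXT_MINING_TERMS, LANGUAGE_TERMS])
print(query)
```

Each digital library has its own query syntax, so a string built this way would still need to be adapted per source.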


Primary studies published up to the year 2014 were analyzed using the same procedure
for both search strategies. Six researchers, divided into three groups, applied the
inclusion and exclusion criteria to all retrieved papers after reading the title,
abstract and keywords. For the 661 potentially relevant studies, the researchers
reapplied the inclusion and exclusion criteria after reading the full paper. This
resulted in a list of 203 studies, of which 35 relate to the use of text mining for
UGTC in Portuguese. Table 1 contains the details of the manual (M) and automated (A) data
sources.
Table 1. Manual and Automated Data Sources
Data Source                                         Type  Retrieved  Included  UGTC
                                                          Studies    Studies
International Conference on the Computational       M     217        22        1
  Processing of Portuguese (PROPOR)
Text Mining and Applications (TEMA) track of        M     34         6         1
  Portuguese Conference on Artificial Intelligence
Brazilian Workshop of Social Network Analysis and   M     99         11        8
  Mining (BRASNAM)
Brazilian Symposium on Information and Human        M     251        44        3
  Language Technology (STIL)
ACM Symposium on Document Engineering (DocEng)      M     273        1         -
Linguateca Database (www.linguateca.pt)             M     1312       30        1
Message Understanding Conferences (MUC)             M     159        -         -
Text Analysis Conference (TAC)                      M     322        -         -
Text REtrieval Conference (TREC)                    M     1715       -         -
Document Understanding Conference (DUC)             M     167        -         -
IEEE Xplore Digital Library                         A     306        19        6
ACM Digital Library                                 A     277        29        11
Science Direct                                      A     159        4         1
Scopus                                              A     552        21        2
Portal de Periódicos Capes                          A     229        15        1
SciELO Scientific Electronic Library Online         A     2          1         -
TOTAL                                                     6075       203       35
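
As a worked piece of arithmetic, the selection funnel reported above (6075 retrieved, 661 potentially relevant, 203 included, 35 on UGTC) can be expressed as shares of the retrieved set. The sketch below only reuses the counts stated in the text; the stage labels and the function name are our own.

```python
# Counts reported in the text and in the TOTAL row of Table 1.
STAGES = [
    ("retrieved", 6075),
    ("potentially relevant after title/abstract/keywords", 661),
    ("included after full-text reading", 203),
    ("UGTC studies", 35),
]


def retention_rates(stages):
    """Return (label, count, percentage of the first stage) for each stage."""
    total = stages[0][1]
    return [(label, count, 100.0 * count / total) for label, count in stages]


for label, count, share in retention_rates(STAGES):
    print(f"{label}: {count} ({share:.2f}%)")
```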

4 Results

In this section, we present the main findings of our review, organized according to the
five specific research questions.

4.1 RQ1: what is the evolution in the number of publications up to year 2014?

As shown in Fig. 3, the first three primary studies were published in 2009 and the
number of studies has grown over the years, despite the drop in 2010 and 2011.
Primary studies were classified according to the Portuguese language variant:
European Portuguese (from Portugal) represents 6%, while Brazilian Portuguese comprises 77%
of all studies. 6% make use of text written in both Brazilian and European Portuguese.
11% did not provide the Portuguese variant information. Table 2 lists the Portuguese
variant of the dataset used in each primary study.

Fig. 3 Temporal distribution of primary studies

Table 2. List of primary studies according to the Portuguese language variant
Variant    Primary Studies
Both       UGC09, UGC10
Brazilian  UGC01, UGC02, UGC03, UGC05, UGC07, UGC08, UGC11, UGC12, UGC13,
           UGC14, UGC15, UGC16, UGC17, UGC18, UGC20, UGC21, UGC23, UGC24,
           UGC25, UGC26, UGC27, UGC28, UGC29, UGC30, UGC31, UGC33, UGC34
European   UGC06, UGC32
N/A        UGC04, UGC19, UGC22, UGC35

4.2 RQ2: which individuals, organizations, and countries are the main contributors
in the research area?

As expected from Fig. 3, Brazil has a greater number of researchers in the field.
Renata Vieira from UNISINOS (Table 3) and UFMG (Table 4) appear as the main
author and the main organization, respectively. In addition to Brazil (BR) and
Portugal (PT), research interest in Portuguese processing is shared with other
countries such as the USA and Canada, as some primary studies (UGC04, UGC05, UGC10,
UGC12, UGC14, UGC24, UGC26, UGC35) propose multilanguage approaches.
Table 3. Number of articles published by main researchers
Quant.  Author            Institution   Quant.  Author               Institution
6       Renata Vieira     UNISINOS-BR   3       Larissa A. Freitas   PUCRS-BR
4       Wagner Meira Jr.  UFMG-BR       3       Eugénio de Oliveira  Univ. of Porto-PT
4       Marlo Souza       UFRGS-BR      3       Adriano Veloso       UFMG-BR
3       Karin Becker      UFRGS-BR      3       Luís Sarmento        Univ. of Porto-PT
Table 4. Number of researchers per organization
Quant.  Organization  Country   Quant.  Organization  Country
23      UFMG          Brazil    8       UFRGS         Brazil
15      PUCRS         Brazil    8       Ulisboa       Portugal
11      UFRJ          Brazil    8       USP           Brazil
9       UP            Portugal

4.3 RQ3: what are the adopted text mining tasks?

Four primary studies each addressed two different text mining tasks
(e.g. classification and information extraction), resulting in 39 text mining task
occurrences (Table 5). Text classification appears as the main task for UGTC in
the Portuguese language. Three primary studies (UGC02, UGC23, and UGC27) reported
the use of balanced classes, while eleven (UGC01, UGC05, UGC06, UGC13, UGC17,
UGC18, UGC21, UGC22, UGC28, UGC33, UGC35) used unbalanced classes.
The Opinion Mining subtask, also known as Sentiment Analysis, represents 62%
of all tasks. Two primary studies (UGC11, UGC12) also evaluated the variation of sentiment or
opinion over time, also known as Sentiment Drift. Eighteen papers reported
the use of lexical resources to perform sentiment analysis. The most used lexical
resources were SentiLex-PT, SentiWordNet, OpLexicon and Sentimeter-BR.
Table 5. Text Mining tasks and subtasks
Task                    %   Subtask                  %     Primary Studies
Classification          79  Language Identification  6.5   UGC04, UGC10
                            Opinion Mining           74    UGC01, UGC02, UGC03, UGC06, UGC11,
                                                           UGC12, UGC13, UGC15, UGC17, UGC18,
                                                           UGC19, UGC21, UGC22, UGC25, UGC26,
                                                           UGC27, UGC28, UGC29, UGC30, UGC32,
                                                           UGC33, UGC34, UGC35
                            Others                   19.5  UGC05, UGC08, UGC09, UGC14, UGC16,
                                                           UGC23
Information Extraction  13  -                        -     UGC07, UGC14, UGC15, UGC20, UGC24
Summarization           2   -                        -     UGC31
Topic Tracking          3   -                        -     UGC09
Visual Text Mining      3   -                        -     UGC31
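
To illustrate the lexicon-based sentiment analysis that eighteen papers reported, the sketch below scores a message by summing word polarities. The tiny lexicon is a hypothetical stand-in written for this example; it is not taken from SentiLex-PT, SentiWordNet, OpLexicon or Sentimeter-BR, and the primary studies may combine lexicons with classifiers in ways not shown here.

```python
import re

# Hypothetical toy polarity lexicon (+1 positive, -1 negative).
# NOT extracted from any of the lexical resources named in the text.
TOY_LEXICON = {
    "bom": 1, "excelente": 1, "gostei": 1,
    "ruim": -1, "odiei": -1, "lento": -1,
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def polarity(text, lexicon=TOY_LEXICON):
    """Sum the polarities of known words and map the sum to a label."""
    score = sum(lexicon.get(token, 0) for token in tokenize(text))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("Gostei muito, produto excelente"))
print(polarity("Produto ruim, odiei"))
```

A plain sum like this ignores negation and irony, which matters for the informal text discussed in Section 5.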

4.4 RQ4: what are the techniques, algorithms, methods and tools applied?

69% of all primary studies performed at least one type of Natural Language
Processing (NLP) (see Table 6). The main tools used for text preprocessing and NLP
were Python NLTK, LingPipe and Freeling. Two primary studies reported the use
of TreeTagger-PT for Part-Of-Speech (POS) tagging. For Named Entity
Recognition (NER), the CRF tagger, FS-NER and GeoNames were adopted. Table 7
presents the algorithms or methods used in the text analysis step. Naïve Bayes and
Weka appear as the most used classifier and the most used tool, respectively. Python and
Java were the most used programming languages in this step.

Table 6. List of adopted pre-processing techniques used in primary studies
NLP      %   Technique          Primary Studies
Applied  69  Stopword Removal   UGC01, UGC03, UGC09, UGC15, UGC16, UGC18,
                                UGC21, UGC25, UGC29, UGC33
             Filtering          UGC01, UGC04, UGC09, UGC14, UGC16, UGC18,
                                UGC25, UGC28
             Stemming           UGC01, UGC03, UGC09, UGC13, UGC18, UGC25, UGC33
             POS                UGC15, UGC19, UGC20, UGC25, UGC31, UGC32, UGC35
             NER                UGC06, UGC10, UGC13, UGC14, UGC24, UGC28
             Tokenization       UGC04, UGC10, UGC19, UGC31
             Sentence Splitter  UGC19, UGC22, UGC28, UGC31
             Lemmatization      UGC19, UGC20, UGC25, UGC35
             Chunk              UGC31
N/A      31  -                  UGC05, UGC07, UGC08, UGC11, UGC12, UGC17,
                                UGC23, UGC26, UGC27, UGC30, UGC34
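
Three of the techniques listed in Table 6 (filtering, tokenization and stopword removal) can be chained as in the following minimal sketch. It uses only the Python standard library; the stopword list is a short hypothetical sample, and the filtering rule (dropping URLs and @mentions) is an assumption about what "filtering" means for Twitter text, since the primary studies may define it differently.

```python
import re

# Hypothetical, deliberately short Portuguese stopword sample.
STOPWORDS = {"a", "o", "e", "de", "do", "da", "que", "um", "uma", "em", "para"}

# Assumed filtering rule: remove URLs and @mentions.
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")


def preprocess(text, stopwords=STOPWORDS):
    """Filter noise, tokenize into alphabetic words, and drop stopwords."""
    filtered = URL_OR_MENTION.sub(" ", text.lower())   # filtering
    tokens = re.findall(r"[^\W\d_]+", filtered)        # tokenization (letters only)
    return [t for t in tokens if t not in stopwords]   # stopword removal


print(preprocess("@maria O jogo de hoje http://t.co/abc foi um show"))
```

Stemming, lemmatization, POS tagging and NER are left out here because they need language resources such as those provided by the tools named in the text.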
Table 7. List of algorithms and methods used in primary studies
Algorithms/Methods  %   Primary Studies
Naive Bayes         43  UGC01, UGC02, UGC03, UGC04, UGC10, UGC16, UGC18,
                        UGC21, UGC25, UGC29, UGC30, UGC33, Multinomial Naive
                        Bayes {UGC12, UGC25, UGC26}
SVM                 31  UGC05, UGC09, UGC13, UGC21, UGC23, UGC32, UGC33,
                        SMO {UGC01, UGC28, UGC29, UGC30}
Decision Tree       14  UGC29, C4.5 {UGC30}, RF {UGC16, UGC21, UGC23}
Rule-Based          17  UGC10, UGC11, UGC12, UGC19, UGC20, UGC31
Pattern-Based       9   UGC06, UGC20, UGC22
N-grams             29  UGC02, UGC04, UGC08, UGC10, UGC23, UGC28, UGC29,
                        UGC30, UGC32, UGC33
Others              51  k-Nearest Neighbor {UGC09, UGC21}, Neural Network
                        {UGC33, UGC21}, Filtered Space Saving {UGC09}, Hoeffding
                        Adaptive Trees {UGC11}, Incremental Lazy Associative
                        Classifier {UGC11}, Latent Semantic Indexing {UGC15},
                        MapReduce paradigm {UGC33}, OneR classification algorithm
                        {UGC28}, Online Rule Extraction {UGC12}, Pareto-Efficient
                        Selective Sampling {UGC11}, Topic Fuzzy Fingerprints
                        {UGC09}, Zipping classifier {UGC04}, Genetic Algorithm
                        {UGC21}, Regular Expression {UGC03, UGC23, UGC28}
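
Since Naïve Bayes is the most used classifier in Table 7, a minimal multinomial Naïve Bayes with add-one (Laplace) smoothing is sketched below for readers unfamiliar with it. This is a textbook formulation written from scratch, not the implementation used by any primary study (most of which relied on tools such as Weka); the class name and the four training messages are invented for illustration.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Textbook multinomial Naive Bayes over token lists, add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # token counts per class
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)   # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue                            # skip unseen words
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data (already tokenized).
train_docs = [
    ["gostei", "produto", "excelente"],
    ["entrega", "rapida", "gostei"],
    ["produto", "ruim", "odiei"],
    ["entrega", "lenta", "ruim"],
]
train_labels = ["pos", "pos", "neg", "neg"]

model = MultinomialNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["gostei", "entrega"]))
```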

4.5 RQ5: what are the characteristics of UGTC data sources and how were they
evaluated?

A total of 46 data sources were employed among the 35 primary studies. Social
networks appear as the main sources of UGTC in Portuguese (Table 8). Twitter
represents more than 50% of all data sources. The text domain is varied, but Politics,
Sports and Technology attract the greatest interest. Two primary studies (UGC05, UGC10)
reported the use of publicly available datasets, both containing Twitter data. The precision,
recall and F-measure trio was used by almost half of the primary studies to evaluate
their results. Eight primary studies reported the adoption of cross-validation for
estimating classifier performance. Most primary studies (66%) built
their gold standard manually.
Table 8. UGTC Data Sources
Quantity  Data Source
25        Twitter
2         Booking.com, Buscapé, Portuguese newspapers, Tripadvisor, Folha de São Paulo
1         Apontador, Cinema com Rapadura, CinePlayer, e-bit, Emails, Facebook, Fórum,
          Google Play, MySpace, Omelete, Portuguese newspapers
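
The precision, recall and F-measure trio mentioned above can be computed for one class as in the sketch below. It uses the standard definitions with the balanced F-measure (F1); the label names and example data are invented, and the primary studies may report macro- or micro-averaged variants instead.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for the class `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented gold-standard and predicted labels for five messages.
gold = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]
p, r, f = precision_recall_f1(gold, predicted, positive="pos")
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```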

5 Discussion

We observed an increasing interest in opinion mining, partly due to its potential
applications, such as marketing, public relations and political campaigns. Portuguese
is spoken mainly in Portugal and Brazil, with Brazil having approximately 20 times
the population of Portugal. For a randomly chosen tweet in Portuguese, there is a 95%
chance that it originated in Brazil [12]. Facebook and Twitter are important sources of
UGTC; however, the former is less used in text classification, as its posts often contain
pictures and the analysis of the text by itself is not effective [13].
As most UGTC in Portuguese comes from social networks, more than 90% of
the text is short and informal, contains grammatical errors and spelling mistakes,
and is often ambiguous and ironic. Although 69% of the works reported the use of
NLP, none reported the use of word sense disambiguation. Therefore, the most
used term weighting scheme, TF-IDF (term frequency – inverse document
frequency), is considered less discriminative for text classification [14].
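
For reference, one common TF-IDF variant is sketched below: relative term frequency multiplied by the natural logarithm of the inverse document frequency, without smoothing. The primary studies do not state which variant they used, so this choice, the function name and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))          # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented toy corpus of three short, already tokenized messages.
docs = [
    ["gol", "brasil"],
    ["gol", "time"],
    ["brasil", "eleicao", "brasil"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In very short messages most terms occur once, so the term-frequency factor varies little between terms, which is one plausible reading of why the weighting is less discriminative for this kind of text.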
Even when good results are achieved, the datasets used are rarely published. This
makes it difficult to implement improvements and to compare which
technique performs better on a particular dataset. Moreover, fewer than 50% of the 35
primary studies provided the data needed to fully answer the five research questions. Important data for
comparison, such as text domain and type, class details and language variant, were not
available. We did not find studies reporting the use of the clustering task for
UGTC in Portuguese, nor a single tool covering all mining tasks.
There are some threats to validity that are worthy of note: (i) it is possible that
some relevant studies were not included during the search process. This threat
was mitigated by performing an extensive search, as well as by double-checking by
two researchers; (ii) as studies were classified based on personal judgment, it is
possible that some studies were incorrectly classified. To mitigate this
threat, the classification step was executed by more than one researcher; (iii) digital
databases do not have compatible search rules and show some instability when
presenting results. We mitigated this threat by having different researchers run the search
in several digital databases more than once.

6 Conclusion

This paper plots the landscape of current studies relating to the application of text
mining techniques for UGTC in the Portuguese language. The strength of this paper is
that it promotes growth in text mining research for the Portuguese language. We
think that the data reported in this paper may help researchers and practitioners to
discover what has been achieved and where the gaps are in this field.
The lack of some relevant data and of published datasets makes further analysis in the
research area difficult. This work is part of broader ongoing research, as shown in
the general research question (Section 1). We are mapping the use of text
mining techniques not only for UGTC in the Portuguese language, but for all kinds of text. To
increase coverage, we plan to apply snowballing techniques to the included primary studies.

Acknowledgment

Ellen Souza is supported by FACEPE (IBPG-0765-1-0311).

References

1. Marine-Roig, E., Anton Clavé, S.: Tourism analytics with massive user-generated content:
A case study of Barcelona. J. Destin. Mark. Manag. 1–11 (2015).
2. Delen, D., Crossland, M.D.: Seeding the survey and analysis of research literature with text
mining. Expert Syst. Appl. 34, 1707–1720 (2008).
3. Hotho, A., Nürnberger, A., Paaß, G.: A Brief Survey of Text Mining. (2005).
4. Tan, A.: Text Mining: The state of the art and the challenges. Proc. PAKDD
1999 Work. Knowl. Discovery from Adv. Databases. 65–70 (1999).
5. Pardo, T., Gasperin, C., Caseli, H., Nunes, M. das G. V.: Computational Linguistics in
Brazil: an overview. Proc. NAACL HLT 2010 Am. 1–7 (2010).
6. Poblete, B., Garcia, R., Mendoza, M., Jaimes, A.: Do All Birds Tweet the Same?
Characterizing Twitter Around the World. Society. 1025–1030 (2011).
7. Petersen, K., Feldt, R., Mujtaba, S., Mattsson, M.: Systematic Mapping Studies in Software
Engineering. (2007).
8. Kitchenham, B., Charters, S.: Guidelines for performing Systematic Literature Reviews in
Software Engineering. Tech. Rep. EBSE-2007-01, (2007).
9. Hotho, A., Nürnberger, A., Paaß, G.: A Brief Survey of Text Mining. Ldv Forum. (2005).
10. da Silva Conrado, M., Felippo, A., Salgueiro Pardo, T., Rezende, S.: A survey of automatic
term extraction for Brazilian Portuguese. J. Brazilian Comput. Soc. 20, 12 (2014).
11. Lu, W., Stepchenkova, S.: User-Generated Content as a Research Mode in Tourism and
Hospitality Applications: Topics, Methods, and Software. J. Hosp. Mark. Manag. (2015).
12. Laboreiro, G., Bošnjak, M., Sarmento, L., Rodrigues, E.M., Oliveira, E.: Determining
language variant in microblog messages. In: Proceedings of the 28th Annual ACM
Symposium on Applied Computing, p. 902. ACM Press, USA (2013).
13. Evangelista, T.R., Padilha, T.P.P.: Monitoramento de Posts Sobre Empresas de
E-Commerce em Redes Sociais Utilizando Análise de Sentimentos. (2013).
14. Takçı, H., Güngör, T.: A high performance centroid-based classification approach for
language identification. Pattern Recognit. Lett. 33, 2077–2084 (2012).
