Report

FACULTY OF COMPUTER AND MATHEMATICAL SCIENCES (FSKM)
BACHELOR OF SCIENCE (HONS) STATISTICS (CS241)

BUSINESS DATA ANALYTICS (ITS480)
CUSTOMERS’ SATISFACTION
EXPERIENCE OF GRAB APPLICATION
IN GOOGLE PLAY STORE
PREPARED BY:
QURRATU’AINI BINTI MUSA 2016331637

NOR AZIRA BINTI MD RADZI 2016589085
NUR FARAH AZREN BINTI MOHD NAZERI 2016598427
TENGKU NURLIYANA NABILA BINTI TENGKU MALIM BUSU 2016522977
GROUP:
CS241/5D
LECTURER’S NAME:
MADAM ROZIANIWATI BINTI YUSOF

TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION
1.1 Summary of Study
1.2 Problem Statement
1.3 Objectives
CHAPTER 2: BACKGROUND OF STUDY
2.1 Literature Review
2.2 Sentiment Analysis
2.2.1 Text Mining Pre-Processing Techniques
2.2.2 The Process in Sentiment Analysis
CHAPTER 3: METHODOLOGY
3.1 Introduction
3.2 Cross Industry Standard Process for Data Mining (CRISP-DM)
3.2.1 Business Understanding
3.2.2 Data Understanding
3.2.3 Data Preparation
3.2.4 Modelling
3.2.5 Evaluation
3.2.6 Deployment
CHAPTER 4: EXPERIMENT
4.1 Description of Testing Dataset
4.1.1 Web Scraping the Testing Dataset
4.2 Data Cleaning
4.3 Processes in Rapid Miner
4.3.1 Producing Training Model
4.3.2 Applying Testing Dataset into Each Training Model
4.4 Model Comparison
CONCLUSION
REFERENCES
CHAPTER 1
INTRODUCTION
1.1 SUMMARY OF STUDY
Nowadays, transportation is one of the most basic human needs. It is important because it
facilitates trade, exchange and travel; without effective transportation, regions are largely
isolated from each other. Effective, affordable transportation also plays a role in letting
people move to new areas. According to Indra and Ibrahim (2017), the demand for transport
services in Malaysia, especially private cars, has increased because they are an essential need
for individuals. The researchers also stated that the rise in interest in private cars was due
to Malaysia’s motorization rate, heavy traffic congestion, parking problems and inadequate
public transport infrastructure.
At present, Grab is one of the most preferred modes of transportation in society. Today, it
is present in eight countries across the region. Other countries have developed the Transport
Network Company (TNC) category, under which Uber and Grab Car are classified as new modes of
transportation (Paronda, 2017). Paronda (2017) also stated that Grab Car, a higher-end car
service similar to Uber Black, was launched in May 2014 in Malaysia, Singapore, Thailand and
the Philippines, and that it became the first TNC accredited by LTFRB in 2015. Digital News Asia
(2018) confirms that Grab’s larger mobility vision is to enable a multi-modal future that serves
people’s first-and-last-mile needs more seamlessly, beyond just ride-hailing. Above all, it
enables mixing and matching different transport options based on consumers’ travel preferences
and budget.
Home food delivery is one of the services offered by Grab. It works best for working
people. According to Digital News Asia (2018), having their favourite food delivered right to
their doorsteps brings convenience and joy to consumers in Malaysia, and Grab Food lets them
order and pay for their food seamlessly with Grab Pay while earning Grab Rewards points for
every order. Tariq (2018) wrote that, besides providing more income opportunities for local
merchants and delivery partners, the expansion of Grab Food in Malaysia will bring the best of
Malaysia’s kitchens to more consumers.
Unfortunately, apart from the positive reviews of Grab Car, consumers have also raised
negative feedback. Worapong (2015) stated that customers need a smartphone to book a ride,
which limits the reach of Uber and Grab Taxi because their services are mobile-application
based. Complaints from current customers, such as a shortage of drivers during peak hours,
uncertain service quality and unfair surge pricing, confirm that Grab still has faults in its
services. Therefore, this study aims to explore customers’ satisfaction with the Grab
application in the Google Play Store. It is also hoped that, through this research, the Grab
Car management can improve their services in order to raise satisfaction among consumers.
1.2 PROBLEM STATEMENT
Recently, competition among taxi services has intensified due to the presence of Grab Car.
Moreover, the opportunity and flexibility to provide transportation, delivery or other consumer
services through the Grab platform, and to earn more income or business revenue at one’s own
time and pace, is open to everyone who wants to leverage the opportunities of a digital
economy. Uber and Grab Taxi are known as new alternative taxi services that introduced new ways
of delivering service in order to resolve unsatisfactory services. However, even though Grab
provides many convenient facilities, its services still have limitations: current customers
complain about the shortage of drivers in peak hours and the uncertainty of service quality.
Therefore, this study will help to improve the situation by measuring customers’ satisfaction
with the Grab application in the Google Play Store.
1.3 OBJECTIVES
1) To analyze customers’ satisfaction with the Grab application in the Google Play Store.
2) To determine the best model to predict the sentiment of users of the Grab application in
the Google Play Store.
CHAPTER 2
BACKGROUND OF STUDY
2.1 LITERATURE REVIEW
According to Vernonchancom (2018), Grab has grown within six years from a start-up
attempting to solve Malaysia’s taxi woes into one of Southeast Asia’s brightest stars. Among
the many applications that users can install, Grab’s vision is to be Southeast Asia’s “Everyday
Superapp”. For consumers, it is one of the best applications because it gives access to many
services such as transport, food, logistics, shopping and financial services. Today, with a
refreshed Grab application, the vision of being the user’s priority has been accomplished. It
is easy to access Grab services such as transport, food, delivery services, mobile reloads,
and now grocery delivery (with Grab Fresh).
As studied by Bw Mark (2018), the concept of Grab was introduced by Anthony Tan together
with Tan Hooi Ling, the creators of Mytaxi. Grab, previously known as Grab Taxi, is a mobile
application that uses smartphone technology to provide ride-hailing and logistics services.
Grab became popular in Malaysia and then expanded into other countries in Southeast Asia
under the name Grab Taxi. In 2016, the company changed its name from Grab Taxi to Grab, where
the word Grab itself refers to all of its services such as taxis (GrabTaxi), private car services
(GrabCar), carpooling (GrabShare), and package delivery (GrabExpress). In addition, the Grab
application is available on iOS and Android, and users can use it to send their booking requests
to the drivers, who are known as partners. Through a booking, users can access the driver’s
information such as the driver’s name, plate number, photo, and phone number. Users can also
give feedback about their driver by rating them from 1 to 5 stars after the ride. The rating
serves as a reference for other passengers about the driver’s performance.
As stated by Kikuchi (2018), Grab nowadays offers users a variety of services, including
rides on motorbikes, shared rides in private cars, shuttle buses and even trial rides in
self-driving cars. These give mid- to low-income customers who need a ride the opportunity to
get one. Grab also lets users connect the application to their social media accounts such as
Facebook, Instagram and Twitter, which enables them to display their names, profile photos and
mutual friends to drivers and riders. This helps the application become well known as a user
network and makes the community aware of personal connections and trust in the application.
2.2 SENTIMENT ANALYSIS
In a world where people generate 2.5 quintillion bytes of data every day, sentiment
analysis has become a key tool for making sense of that data. This has allowed companies to get
key insights into all kinds of processes. Text analytics is the process of analyzing unstructured
text, extracting relevant information, and transforming it into useful business intelligence while
sentiment analysis determines if an expression is positive, negative, or neutral. In other words,
text analytics studies the face value of the words, including the grammar and the relationships
among the words. Simply put, text analytics gives the meaning while sentiment analysis gives
insight into the emotion behind the words.
Sentiment analysis, or buzz tracking, is well known as a specific form of text analysis.
Sentiment analysis, also known as opinion mining, is a field within Natural Language
Processing (NLP) that builds systems that try to identify and extract opinions within text.
Usually, besides identifying the opinion, these systems extract attributes of the expression,
for example:
a) Polarity: Whether the speaker expresses a positive or negative opinion.
b) Subject: The thing that is being talked about.
c) Opinion holder: The person or entity that expresses the opinion.
Currently, sentiment analysis is a topic of great interest and development since it has
many practical applications. Since publicly and privately available information over the internet is
constantly growing, many texts expressing opinions are available in review sites, forums, blogs,
and social media. With the help of sentiment analysis systems, this unstructured information
could be automatically transformed into structured data of public opinions about products,
services, brands, politics, or any topic that people can express opinions about. This data can be
very useful for commercial applications like marketing analysis, public relations, product
reviews, net promoter scoring, product feedback, and customer service.
2.2.1 Text Mining Pre-Processing Techniques
Data mining is used to find useful information in large amounts of data. Data mining
techniques are used to implement and solve different types of research problems. Research
areas related to data mining include text mining, web mining, image mining, sequential pattern
mining, spatial mining, medical mining, multimedia mining, structure mining and graph mining.
Text mining is the process of mining useful information from text documents. It is also called
knowledge discovery in text (KDT) or intelligent text analysis.
Text mining is a technique that extracts information from both structured and
unstructured data and finds patterns in it. Text mining techniques are used in various research
domains such as natural language processing, information retrieval, text classification and
text clustering. Figure 2.1 shows the steps in the text mining pre-processing technique.
Figure 2.1: The Text Mining Pre-Processing Technique
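Since the content of Figure 2.1 is not reproduced in the text, the following Python sketch
only illustrates pre-processing steps that are commonly used in text mining (tokenization,
stop-word removal and stemming). The stop-word list, the suffix rules and all function names
are assumptions made for illustration, not the exact steps of the figure.

```python
import re

# Assumed, illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "to", "of", "and", "in", "on", "for", "it", "this"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop very common words that carry little sentiment."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run tokenization, stop-word removal and stemming in sequence."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]
```

For instance, calling preprocess on a review such as "The drivers are waiting!" reduces it to
the two stems that carry its content.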
2.2.2 The Process in Sentiment Analysis
Step 1: Monitor social networks and review sites for mentions.
➢ To resolve a sentence into its component parts of speech and explain their syntactical
relationships, the sentence must be parsed to get the actual contents. The aim of this parsing
is to impose structure on semi-structured data such as HTML pages and RSS feeds.
➢ The structure must be sufficient to locate, within the raw text, the actual content of the
review, its title and the date of the review. The output is a collection of phrases and words
that speak of the product of interest.
Step 2: Collect the reviews.
➢ The relevant raw data are extracted and converted into a suitable document representation.
This represents the collection of phrases and words in a structured manner for downstream
analysis and records the number of times each term occurs.
➢ A corpus is a collection of documents. The corpus is archived so that it can be searched
for future reference and research.
➢ Reverse indexing provides a way of keeping track, for every possible feature, of the list
of all documents that contain that feature.
➢ Documents are often only relevant in the context of a corpus, or a specific collection of
documents. Hence, classifiers need to be trained on a specific set of documents. Any
changes to the corpus require retraining of the classifier.
➢ A corpus changes constantly over time: not only do new documents get added, but word
distributions can also change. This could reduce the effectiveness of classifiers and filters,
such as spam filters, if they are not retrained.
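The document representation and reverse indexing described in Step 2 can be sketched in
Python as follows. This is a minimal illustration that assumes reviews are already tokenized;
the function names and the three-review corpus are hypothetical.

```python
from collections import Counter, defaultdict


def term_counts(tokens):
    """Count how many times each term occurs in one document."""
    return Counter(tokens)


def build_inverted_index(corpus):
    """Map each term to the sorted list of document ids that contain it.

    `corpus` is a dict of {doc_id: list of tokens}.
    """
    index = defaultdict(set)
    for doc_id, tokens in corpus.items():
        for term in tokens:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical three-review corpus, already tokenized.
corpus = {
    1: ["driver", "late", "late"],
    2: ["driver", "friendly"],
    3: ["app", "crash"],
}
index = build_inverted_index(corpus)
```

Looking up a term in the index returns every review that mentions it, which is the lookup
that later search and retrieval steps rely on.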
Step 3: Sort the reviews.
➢ Once all reviews have been collected and represented, sort them by the subject of interest,
such as product or service. The reviews must be classified in order to sort them.
➢ Typically, this is done by topic tagging, which often involves having a team of human users
determine the classification of a review and tag it accordingly.
➢ There are a few rules for topic tagging:
i. If the product is mentioned in the title, then the review is likely to be about the
product.
ii. If the mentions are in the contents, the review may or may not be related to the product.
iii. A tweet is more likely to be about the product than a forum post, because a forum review
may be about a comparison of different products.
iv. More frequent mentions of the product may indicate that the review is relevant.
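As a rough illustration only, rules i, ii and iv can be turned into a simple relevance
score. The weights 2.0 and 0.5 and the function name are assumptions chosen for this sketch,
and rule iii (source type) is not modelled.

```python
def relevance_score(title, body, product):
    """Score how likely a review is about `product` (higher means more likely)."""
    product = product.lower()
    score = 0.0
    if product in title.lower():
        score += 2.0  # assumed weight for rule i (mention in the title)
    mentions = body.lower().count(product)
    score += 0.5 * mentions  # assumed weight for rules ii and iv (mentions in the contents)
    return score
```

A human tagger would still make the final decision; a score like this could only help to
order the reviews for tagging.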
Step 4: Determine the type of review (good or bad).
➢ To determine whether a review is good (positive) or bad (negative), commonly used
classifiers include Naïve Bayes and Support Vector Machine (SVM).
➢ Naïve Bayes: A family of probabilistic algorithms that use Bayes’ Theorem to predict the
category of a text.
➢ Support Vector Machines: A non-probabilistic model which uses a representation of text
examples as points in a multidimensional space.
➢ A major bottleneck of this step is the need for tagged training data. Two approaches for this
step are to identify good and bad reviews and to utilize a sentiment dictionary.
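To make the Naïve Bayes idea concrete, the sketch below implements a small multinomial
Naïve Bayes classifier with add-one smoothing in plain Python. It is an illustration under
assumptions, not the classifier used in this study: the four tagged reviews are hypothetical
and the function names are chosen for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Fit a multinomial Naive Bayes model from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest posterior log-probability."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # Log prior P(label).
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical tagged training reviews (already tokenized).
training = [
    (["good", "driver", "fast"], "positive"),
    (["friendly", "driver", "good"], "positive"),
    (["late", "driver", "rude"], "negative"),
    (["app", "crash", "late"], "negative"),
]
model = train_naive_bayes(training)
```

The need for the tagged `training` list in this sketch is exactly the bottleneck mentioned
above: without labelled reviews, the class and word counts cannot be estimated.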
Step 5: Marketing calls up and reads selected reviews in full, for greater insight.
➢ After sentiment/buzz has been analyzed, marketing and sales personnel would then want to
search and retrieve relevant reviews from the corpus. From the reviews, sales and marketing
personnel may gain insight about the product/service offered.
➢ Search by relevance is often made possible using Term Frequency-Inverse Document
Frequency (TF-IDF), a weight-based metric for identifying reviews/documents relevant to some
query terms.
i. Document frequency (DF): Provides information about how many documents contain a term.
ii. Inverse document frequency (IDF): Provides information about how rare the term is across
the document corpus and measures the relevance that DF does not provide.
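A minimal Python sketch of the TF-IDF weighting is given below. It assumes one common
variant, in which the term frequency is the count divided by the document length and the IDF
is log(N / DF); other variants exist, and the function name is chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return {doc_id: {term: weight}} for a corpus of {doc_id: list of tokens}."""
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in corpus.values():
        df.update(set(tokens))
    weights = {}
    for doc_id, tokens in corpus.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return weights
```

Under this variant, a term that appears in every document receives a weight of zero, while a
term that is rare across the corpus receives a higher weight.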
In conclusion, sentiment analysis systems allow companies to make sense of unstructured
text by automating business processes, getting actionable insights, and saving hours of manual
data processing; in other words, by making teams more efficient. The advantages of sentiment
analysis include the following:
i. Scalability: There is too much data to sort manually through thousands of tweets,
customer support conversations, or customer reviews. Sentiment analysis allows data to be
processed at scale in an efficient and cost-effective way.
ii. Real-time analysis: Sentiment analysis can identify critical information that allows
situational awareness during specific scenarios in real time, for example a PR crisis on
social media that is about to burst or an angry customer who is about to churn. A sentiment
analysis system can help to identify these kinds of situations so that action can be taken.
iii. Consistent criteria: People do not apply clear criteria when evaluating the sentiment of
a piece of text. It is estimated that different people only agree around 60-65% of the time
when judging the sentiment of a particular piece of text. It is a subjective task that is
heavily influenced by personal experiences, thoughts, and beliefs. By using a centralized
sentiment analysis system, companies can apply the same criteria to all their data. This helps
to reduce errors and improve data consistency.
CHAPTER 3
METHODOLOGY
3.1 INTRODUCTION
Cross Industry Standard Process for Data Mining (CRISP-DM) is a comprehensive data
mining methodology and process model that provides anyone from beginner to data mining
expert with a complete blueprint for conducting a data mining project. CRISP-DM breaks down
the life cycle of a data mining project into six phases which are business understanding, data
understanding, data preparation, modeling, evaluation and deployment. The life cycle of the
CRISP-DM reference model is shown in Figure 3.1.
Figure 3.1: Cross Industry Standard Process for Data Mining (CRISP-DM)
According to Chapman et al. (2000), the sequence of the phases is not fixed because moving
back and forth between different phases is always required. The outcome of each phase
determines which phase has to be performed next. The arrows indicate the most important and
frequent dependencies between phases. The outer circle symbolizes the cyclical nature of data
mining itself. Data mining is not over once a solution is deployed.
3.2 CROSS INDUSTRY STANDARD PROCESS FOR DATA MINING (CRISP-DM)
3.2.1 Business Understanding
The initial phase focuses on understanding the objectives and requirements from a business
perspective, then converting this knowledge into a data mining problem definition and a
preliminary plan designed to achieve the objectives. This phase may require multiple iterations
before an acceptable solution appears. There are four tasks to be followed: determine business
objectives, assess situation, determine data mining goals and produce project plan.
1. Determine Business Objectives
The data analyst needs to understand completely, from a business perspective, what the
customer really wants to accomplish. Frequently the customer has many competing objectives
and requirements that must be appropriately balanced. The analyst’s goal is to uncover important
factors at the beginning that can affect the outcome of the project. A possible consequence of
neglecting this step is to expend a great deal of effort producing the right answers to the wrong
questions.
2. Assess Situation
This task involves more detailed fact-finding about all of the resources, constraints,
assumptions, and other factors that should be considered in determining the data analysis goal
and project plan. The details are fleshed out in this task.
3. Determine Data Mining Goals
A business goal states objectives in business terminology. A data mining goal states
project objectives in technical terms. A data mining goal can be described as the intended outputs
of the project that enable the achievement of the business objectives.
4. Produce Project Plan
Describe the intended plan for achieving the data mining goals and thereby achieving the
business goals. The plan should specify the steps to be performed during the rest of the project,
including the initial selection of tools and techniques.
3.2.2 Data Understanding
Data is the raw material from which the solution will be built. Thus, it is important to
understand the strengths and limitations of the data because rarely is there an exact match with
the problem. For example, historical data often are collected for a different purpose. It is
common for a business problem to involve several data mining tasks, with the result of each task
helping to solve the problem. The tasks include collect initial data, describe data, explore
data and verify data quality.
1. Collect initial data
Acquire within the project the data (or access to the data) listed in the project resources. This
initial collection includes data loading if necessary for data understanding. For example, if the
data analyst intends to use a specific tool for data understanding, it is logical to load the data
into this tool.
2. Describe data
Examine the “gross” properties of the acquired data and report on the results. Describe the
data which has been acquired including: the format of the data, the quantity of the data, the
identities of the fields and any other surface features of the data which have been discovered.
3. Explore data
This task tackles the data mining questions that can be addressed using querying,
visualization and reporting. These analyses may address directly the data mining goals.
However, they may also contribute to or refine the data description and quality reports and feed
into the transformation and other data preparation needed for further analysis.
4. Verify data quality
Examine the quality of the data, addressing questions such as: Is the data complete (does it
cover all the cases required)? Is it correct or does it contain errors, and if there are errors,
how common are they? Are there missing values in the data? If so, how are they represented,
where do they occur and how common are they?
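The missing-value questions above can be checked with a short script. The sketch below
assumes records are held as Python dictionaries and treats None and blank strings as missing;
the function name and field names are hypothetical.

```python
def quality_report(records, required_fields):
    """Count missing values per field in a list of dict records."""
    missing = {field: 0 for field in required_fields}
    for record in records:
        for field in required_fields:
            value = record.get(field)
            # Treat absent keys, None and blank strings as missing values.
            if value is None or str(value).strip() == "":
                missing[field] += 1
    return {"rows": len(records), "missing": missing}
```

The per-field counts show where missing values occur and how common they are, which feeds
directly into the data quality report.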
3.2.3 Data Preparation
Often, data is not in the form that is required. Hence, conversion is necessary to achieve
a form that can help yield better results, for example converting data into tabular format,
removing or inferring missing values, and converting data to different types. There are five
tasks included in this phase:
1. Select data
Decide on the data to be used for analysis. Criteria include relevance to the data mining
goals, quality and technical constraints such as limits on data volume or data types. Note that
data selection covers selection of attributes (columns) as well as selection of records (rows) in a
table.
2. Clean data
Raise the data quality to the level required by the selected analysis techniques. This may
involve selection of clean subsets of the data, the insertion of suitable defaults or more ambitious
techniques such as the estimation of missing data by modeling.
3. Construct data
This task includes constructive data preparation operations such as the production of derived
attributes, entire new records or transformed values for existing attributes.
4. Integrate data
These are methods whereby information is combined from multiple tables or records to
create new records or values.
5. Format data
Formatting transformations refer to primarily syntactic modifications made to the data that do
not change its meaning, but might be required by the modeling tool.
3.2.4 Modelling
Modeling is the primary place where data mining techniques are applied to the data.
Typically, the output is some sort of model or pattern capturing regularities in the data. Thus, the
following tasks need to be implemented:
1. Select modeling technique
As the first step in modeling, select the actual modeling technique that is to be used. This
task refers to the specific modeling technique, e.g., decision tree building with C4.5 or neural
network generation with back propagation. If multiple techniques are applied, perform this task
separately for each technique.
2. Generate test design
Before building a model, a procedure or mechanism needs to be generated to test the model’s
quality and validity. For example, in supervised data mining tasks such as classification, it is
common to use error rates as quality measures for data mining models. Therefore, the data
analyst typically separates the dataset into a train set and a test set, builds the model on the
train set and estimates its quality on the separate test set.
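The test design described above can be sketched in Python as a reproducible train/test
split followed by an error-rate calculation. The 30% test fraction, the random seed and the
function names are assumptions made for this example.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows reproducibly and split them into train and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def error_rate(true_labels, predicted_labels):
    """Fraction of predictions that disagree with the true labels."""
    if not true_labels:
        raise ValueError("cannot compute an error rate on an empty set")
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)
```

The model is built only on the first returned list, and the error rate is computed only on
predictions for the second, so the estimate is not biased by data the model has already seen.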
3. Build model
Run the modeling tool on the prepared dataset to create one or more models.
4. Assess model
The data mining engineer interprets the models according to domain knowledge, the data mining
success criteria and the desired test design. This task overlaps with the subsequent evaluation
phase: the data mining engineer judges the success of the application of modeling and discovery
techniques more technically, and later contacts business analysts and domain experts in order to
discuss the data mining results in the business context. Moreover, this task only considers
models, whereas the evaluation phase also takes into account all other results that were
produced in the course of the project.
3.2.5 Evaluation
This phase aims to assess the data mining results and to gain confidence that the results
are valid and reliable. Stakeholders would like to know if the proposed model is going to do
more good than harm. Evaluating the results of data mining includes both quantitative and
qualitative assessments. The tasks included in the evaluation phase are:
1. Evaluate results
This step assesses the degree to which the model meets the business objectives and seeks to
determine if there is some business reason why this model is deficient. Another option of
evaluation is to test the model(s) on test applications in the real application if time and budget
constraints permit. Moreover, evaluation also assesses other data mining results generated.
Data mining results cover models which are necessarily related to the original business
objectives and all other findings which are not necessarily related to the original business
objectives but might also unveil additional challenges, information or hints for future directions.
2. Review process
At this point the resultant model hopefully appears to be satisfactory and to satisfy business
needs. It is now appropriate to do a more thorough review of the data mining engagement in
order to determine if there is any important factor or task that has somehow been overlooked.
This review also covers quality assurance issues.
3. Determine next steps
According to the assessment results and the process review, the project decides how to
proceed at this stage. The project needs to decide whether to finish this project and move on to
deployment if appropriate or whether to initiate further iterations or set up new data mining
projects. This task includes analyses of remaining resources and budget that influence the
decisions.
3.2.6 Deployment
Data mining results are put into real use in order to realize some return on investment.
This involves implementing the proposed model. The observation from this stage may require
iteration back to the business understanding phase. Improvements and refinements are also made
to the model. The tasks included in this phase are:
1. Plan deployment
In order to deploy the data mining result into the business, this task takes the evaluation
results and concludes a strategy for deployment. If a general procedure has been identified to
create the relevant model, this procedure is documented here for later deployment.
2. Plan monitoring and maintenance
Monitoring and maintenance are important issues if the data mining result becomes part of
the day-to-day business and its environment. A careful preparation of a maintenance strategy
helps to avoid unnecessarily long periods of incorrect usage of data mining results. In order to
monitor the deployment of the data mining result(s), the project needs a detailed plan on the
monitoring process. This plan takes into account the specific type of deployment.
3. Produce final report
At the end of the project, the project leader and his team write up a final report. Depending
on the deployment plan, this report may be only a summary of the project and its experiences (if
they have not already been documented as an ongoing activity) or it may be a final and
comprehensive presentation of the data mining result(s).
4. Review project
Assess what went right and what went wrong, what was done well and what needs to be
improved.
CHAPTER 4
EXPERIMENT
4.1 DESCRIPTION OF TESTING DATASET
As part of our project to produce a sentiment analysis model, and in order to gauge people’s
satisfaction, we decided to collect reviews from customers about an existing application in the
Google Play Store. The source of data for this study is the following link:
https://play.google.com/store/apps/details?id=com.grabtaxi.passenger&showAllReviews=true
Grab is Southeast Asia's #1 ride-hailing app, food delivery service, and cashless payment
solution all in one. With the new Grab app, you'll get the most convenient booking service for
private cars and taxis from the largest community of drivers in the region, food delivery from
your favorite restaurants to satisfy any craving, and cashless payments in-app and at merchants
across the city. The application has approximately 3 018 778 reviews in total from users and
non-users of the Grab application. However, for the web scraping of this page, we take only
200 review comments from the people who left comments on the Grab application.
The attributes used for web scraping are the user’s name, the review date and the review
text.
Users’ name
Review Date
Users’ Review
Figure 4.1: Attributes Details for Testing Dataset
4.1.1 Web Scraping the Testing Dataset
Using the Octoparse software downloaded from the internet, we extract the users’ review data
from the URL of our data source. Octoparse is one-click software for automatic data
extraction. It makes it possible to scrape web data quickly without coding and turns web pages
into structured data within a few clicks. The extracted data, which serves as the Testing
Dataset, was saved into an Excel file to be used later, after each Training Model has been
created.
(Link of Octoparse Software: https://www.octoparse.com/product)
Step 1: To start the data extraction, click +Task at Advanced Mode. The following figure will
appear. Paste the URL of the data source in the Website box and click Save URL.
Step 2: At the Scroll Down option, tick the box for ‘Scroll down to bottom of the page when
finished loading’. Set Scroll times to 5, since all 200 reviews are shown after scrolling down
5 times.
Step 3: First, click on the ‘username’ of the first review and click Select all. Then, click on the
next attributes needed, which are ‘date’ and ‘review’.
Last, after all 3 attributes have been selected, click on Extract data and the following figure
will be shown. Then, click Save and Run.
Step 4: Click on Local Extraction and wait until all 200 records have been completely extracted.
After that, export the data as an Excel file and save it as Grab Dataset.xlsx.
4.2 DATA CLEANING
The Testing Dataset (Grab Dataset.xlsx) extracted using Octoparse needs to be cleaned first.
The reviews are labelled as Text, and we label the target class for each collected review with
a polarity of positive, negative or neutral based on its meaning and perceived sentiment. Then,
the neutral reviews, as well as reviews that seem inappropriate or are not in English, are
removed from the dataset. Of the 200 reviews collected, the final Testing Dataset contains 166
reviews, of which 91 are positive and 75 are negative. Figure 4.2 shows part of the final
Testing Dataset, which is saved as Testing Grab Transport Dataset octoparse.xlsx.
Figure 4.2: Partial of the Testing Dataset
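The filtering step in this section was done by hand in Excel, but the same logic can be
sketched in Python. The sketch assumes each row carries the keys "Text" and "Polarity" and
that rows to be discarded (neutral, inappropriate or non-English) carry any label other than
positive or negative; these names and labels are assumptions. With the figures reported
above, 91 + 75 = 166 reviews are kept and 200 - 166 = 34 are removed.

```python
def clean_reviews(rows):
    """Keep only reviews labelled positive or negative and count each class.

    Each row is a dict with the assumed keys "Text" and "Polarity".
    """
    kept = [r for r in rows if r["Polarity"] in ("positive", "negative")]
    counts = {"positive": 0, "negative": 0}
    for r in kept:
        counts[r["Polarity"]] += 1
    return kept, counts
```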
4.3 PROCESSES IN RAPID MINER
4.3.1 Producing Training Model
Before applying the Testing Dataset, we will produce three different Training model with
different classifier to compare and find the model with the highest accuracy. In particular, the
Gradient Boosted Tress, Decision Tree and Support Vector Machine (SVM) classifiers were run
on this dataset.
i) Model 1: The Gradient Boosted Trees classifier obtains predictive results through gradually
improved estimations. Boosting is a flexible nonlinear regression procedure that helps
improve the accuracy of trees. While boosting trees increases their accuracy, it also
decreases speed and human interpretability. The gradient boosting method
generalizes tree boosting to minimize these issues.
ii) Model 2: The Decision Tree classifier generates a decision tree model, which can be used for
classification and regression. Each node of the tree applies a splitting rule. For classification,
the rule separates values belonging to different classes; for regression, it separates them so as
to reduce the error in an optimal way for the selected parameter criterion.
iii) Model 3: The Support Vector Machine (SVM) classifier is an effective traditional text
categorization framework. The main idea of SVM is to find the hyperplane, which is
represented as a vector, that separates document vectors in one class from document vectors
in other classes. SVM shows very good performance and higher accuracy in many studies
directed towards sentiment analysis in many languages. Work reported by Ye, Zhang and Law
(2009) showed that SVM did well with the English language compared to other classifiers.
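To make the hyperplane idea in Model 3 concrete, the following minimal Python sketch shows only the decision rule of a linear classifier: a document vector x is assigned to a class according to the sign of w·x + b. The vocabulary, weights and bias are hand-picked assumptions for illustration. They are not learned from the study's data, and the sketch does not perform SVM training, which would choose w and b so as to maximise the margin between the classes.

```python
def decision_value(weights, bias, features):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(weights, bias, features):
    """Map the sign of the decision value to a polarity label."""
    return "positive" if decision_value(weights, bias, features) >= 0 else "negative"


# Hypothetical vocabulary and hand-picked weights (assumptions, not trained values).
vocabulary = ["good", "fast", "bad", "late"]
weights = [1.0, 0.5, -1.0, -0.8]
bias = 0.0

doc_a = [1, 1, 0, 0]  # word counts for a review containing "good" and "fast"
doc_b = [0, 0, 1, 1]  # word counts for a review containing "bad" and "late"
print(classify(weights, bias, doc_a), classify(weights, bias, doc_b))
```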
Before producing the training model, we collect another set of reviews as the Training dataset.
The reviews are extracted from Twitter using the Search Twitter Operator. It is necessary to log in
to a Twitter account first in order to get the access token. In the Parameter option, as in Figure
4.3, click on the Twitter icon and then add a new connection for Twitter. After the access token
has been retrieved from the Twitter site, connect the out port to the result port and run the
process. The query ‘grab car’ is used, with the limit set to 900 reviews and the language set to
English only (en). Then, add the Write Excel Operator to the process to transfer the extracted data
into an Excel file, Twitter Grab Dataset.xlsx. The query returns 580 reviews.
Figure 4.3: Parameter Option for Search Twitter Operator
Figure 4.4: Process of Data Extraction from Twitter
Next, add the Read Excel Operator to the process and read the file Twitter Grab
Dataset.xlsx. Add the Analyze Sentiment Operator (Aylien) to extract the sentiment of each piece of
text, which provides valuable insight into the emotions and perspective expressed. Get an AYLIEN
Text Analysis key from developer.aylien.com/signup?source=rapidminer. The operator then
analyses whether the polarity of each tweet is positive, neutral or negative. After that, the data
is transferred into another Excel file named Sentiment Twitter Grab Dataset.xlsx. The complete
process is shown in Figure 4.5.
Figure 4.5: Process Analyze Sentiment of Review for Training Dataset
However, before creating the training model, data cleaning needs to be done. Reviews
with neutral polarity have to be removed from the dataset. The complete training dataset,
with 349 reviews, is then available and saved as Training Twitter Grab Dataset.xlsx. Part of the
dataset is shown in Figure 4.6 below.
Figure 4.6: Partial of the Training Dataset after Cleaning
First of all, make sure that all the operators in the process of Figure 4.4 are disabled.
The process of creating the Training Model can then proceed.
Step 1: Read the final training dataset, Training Twitter Grab Dataset.xlsx, using the Read Excel
Operator.
Step 2: Add the Select Attribute Operator to the process. Set the parameters as in Figure 4.7 and
select the attributes ‘Text’ and ‘polarity’.
Figure 4.7: Parameter Option for Select Attribute Operator
Step 3: Set the class label. Classification requires a class target. Therefore, we add the Set Role
Operator to the Process panel. Set the parameters as in Figure 4.8.
Figure 4.8: Parameter Option for Set Role Operator
Step 4: In Rapid Miner, text data is treated as the nominal data type by default, and text mining
cannot be done on nominal data. Therefore, we change the nominal data type to Text.
In the Operators panel, search for ‘nominal to text’ and drag the Nominal to Text Operator into the
Process panel. The parameter options are the same as in Step 2, except that the only selected
attribute is ‘Text’.
Step 5: Add the Process Documents from Data Operator to the process. This is done in order to
create a bag of words. Set the parameters as in Figure 4.9.
Figure 4.9: Parameter Option for Process Documents from Data Operator
Then, double-click the operator and a nested process panel will appear. Four operators are
added to this process, as in Figure 4.10. The Tokenize Operator splits the text of a
document into a sequence of tokens, which form the bag of words. The Stem (Lovins) Operator
stems English words using the Lovins stemming algorithm. The Filter Stopwords (English) Operator
removes English stop words such as ‘a’, ‘an’ or ‘the’ from a document. The Transform Cases
Operator transforms the case of the characters in a document; we set it to transform to lower case.
Then, execute the process.
Figure 4.10: Process in Process Documents from Data
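The effect of the four operators in Step 5 can be approximated in plain Python, as in the sketch below. It is a rough stand-in and not a reproduction of the RapidMiner operators. The stop-word list is a small illustrative subset, crude_stem is a hypothetical suffix stripper that is much simpler than the Lovins algorithm, and the tokenizer is assumed to split on non-letter characters. For simplicity the sketch lower-cases the tokens before filtering stop words, so the operator order differs from Figure 4.10.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "and", "to", "was"}


def tokenize(text):
    """Split text into tokens on any run of non-letter characters."""
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]


def crude_stem(token):
    """Strip a few common suffixes. This is NOT the Lovins algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, lower-case, remove stop words, then stem."""
    tokens = [tok.lower() for tok in tokenize(text)]
    tokens = [tok for tok in tokens if tok not in STOPWORDS]
    return [crude_stem(tok) for tok in tokens]


def bag_of_words(text):
    """Count how often each processed token occurs in the text."""
    return Counter(preprocess(text))


example = "The driver was waiting and the booking worked"
print(preprocess(example))
print(bag_of_words(example))
```

Each review is thereby reduced to a table of token counts, which is the kind of word vector the classifiers in the next step are trained on.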
Step 6: Go back to the main process panel. Add the Cross Validation Operator to check the accuracy
of the model. Applying cross validation also randomizes the data sampling. Double‐click the Cross
Validation operator; the Training and Testing panel sections will appear.
The Training panel section is where we change the classifier, as stated above, in order to create
three different Training Models, so that we can later obtain the Testing Model with the best accuracy.
In the Testing panel section, drag the Apply Model and Performance Operators onto the panel as in
Figure 4.11. Connect them as in Figure 4.12 and execute; the result will show the accuracy of the
model.
Figure 4.11: Cross Validation
Figure 4.12: Complete Process for Creating the Training Model
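The cross validation in Step 6 can be illustrated with a short, self-contained Python sketch. It assumes k = 10 folds and assumes that the ± figure reported with each accuracy is the standard deviation of the per-fold accuracies; the report does not state either setting. To keep the example independent of any library, a majority-class predictor stands in for the real classifiers, and the label list is hypothetical.

```python
import random
import statistics


def k_fold_indices(n_examples, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_label(labels):
    """Stand-in 'classifier': always predict the most common training label."""
    return max(sorted(set(labels)), key=labels.count)


def cross_validate(labels, k=10, seed=0):
    """Return the mean and standard deviation of accuracy over k held-out folds."""
    accuracies = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train = [labels[i] for i in range(len(labels)) if i not in held]
        predicted = majority_label(train)
        correct = sum(1 for i in held_out if labels[i] == predicted)
        accuracies.append(correct / len(held_out))
    return statistics.mean(accuracies), statistics.pstdev(accuracies)


# Hypothetical labels, not the study's training data.
labels = ["positive"] * 60 + ["negative"] * 40
mean_acc, spread = cross_validate(labels, k=10)
print(f"{100 * mean_acc:.2f}% ± {100 * spread:.2f}%")
```

Every example is held out exactly once, so the mean accuracy is an estimate of how the model behaves on data it was not trained on, and the spread indicates how much that estimate varies from fold to fold.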
The accuracy results for all three Training Models are shown below:
i) Model 1: Gradient Boosted Trees Classifiers
ii) Model 2: Decision Tree Classifiers
iii) Model 3: Support Vector Machine (SVM) Classifiers
4.3.2 Applying Testing Dataset into Each Training Model
Next, the Testing Dataset, named Testing Grab Transport Dataset octoparse.xlsx, is applied to
each of the three Training Models. To check the accuracy, work on one Training Model at a
time and disable the other Training Models. The first step is to read
the final Testing Dataset using the Read Excel Operator. Then, repeat Step 2 to Step 5 as
in producing the Training Model.
Step 6: Drag the Apply Model and Performance Operators onto the process panel. Then connect them as
in Figure 4.13 and execute; the results will show the accuracy of the Testing Model.
Figure 4.13: Complete Process for Producing the Testing Model
The accuracy results on the Testing Dataset for each Training Model are shown below:
i) Model 1: Gradient Boosted Trees Classifiers
ii) Model 2: Decision Tree Classifiers
iii) Model 3: Support Vector Machine (SVM) Classifiers
4.4 MODEL COMPARISON
After the Testing Dataset has been applied to all three Training Models, the accuracy
results are compared to determine the best model. True Positive is the number of
reviews that were correctly classified by the classifier as belonging to the current class. True
Negative is the number of reviews that were correctly classified by the classifier as not belonging
to the current class.
CLASSIFIER                      ACCURACY (Training Model)    ACCURACY (Testing Model)
Gradient Boosted Trees          58.34% ± 8.73%               70.48%
Decision Tree                   52.58% ± 4.58%               58.43%
Support Vector Machine (SVM)    60.63% ± 5.22%               60.24%
Table 4.1: Model Comparison
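As a worked check on the Testing Model column, accuracy is (True Positive + True Negative) divided by the number of reviews. The Python sketch below works backwards from the reported percentages, assuming they were computed on the 166 reviews of the final Testing Dataset. The resulting counts of correct predictions are inferred by rounding and are not reported in the study; the function names are illustrative.

```python
def accuracy(true_positive, true_negative, total):
    """Share of reviews classified correctly."""
    return (true_positive + true_negative) / total


def implied_correct(percent, total):
    """Nearest whole number of correct predictions for a reported accuracy."""
    return round(percent / 100 * total)


TEST_SIZE = 166  # reviews in the final Testing Dataset

# Testing accuracies from Table 4.1, in percent.
reported = {
    "Gradient Boosted Trees": 70.48,
    "Decision Tree": 58.43,
    "Support Vector Machine (SVM)": 60.24,
}

for name, percent in reported.items():
    n_correct = implied_correct(percent, TEST_SIZE)
    recomputed = round(100 * n_correct / TEST_SIZE, 2)
    print(f"{name}: {n_correct} of {TEST_SIZE} correct = {recomputed}%")
```

Under this assumption the three percentages correspond to 117, 97 and 100 correctly classified reviews, so the gap between Gradient Boosted Trees and SVM on the Testing Dataset amounts to 17 reviews.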
CONCLUSION
The objective of this study is to analyze customers’ satisfaction with the Grab application
in the Google Play Store. The polarity of the testing dataset shows
that, of the 200 reviews collected, 91 are positive, 75 are negative and the other 34 are neutral
or not related. This shows that more users are satisfied than dissatisfied with the Grab
application in the Google Play Store. However, there are still areas that need improvement, since
75 of the 200 reviews (37.5%) express dissatisfaction with the Grab application.
Based on Table 4.1, the best model is selected according to the accuracy of the Training
and Testing Models. The accuracy of a model must be more than 65% for it to be considered a
good model. This study has considered sentiment analysis in the English language regarding the
Customers’ Satisfaction Experience of Grab Application in Google Play Store. A dataset of 200
user reviews was collected using the Octoparse software and labelled as negative, positive or
neutral; the 166 positive and negative reviews were kept as the Testing Dataset. The Gradient
Boosted Trees, Decision Tree and Support Vector Machine (SVM) classifiers were compared on
accuracy. The Search Twitter Operator was used to extract data with the query ‘grab car’ in order
to obtain the training dataset. In cross validation on the training dataset, SVM achieved the
highest accuracy, 60.63% ± 5.22%, followed by Gradient Boosted Trees with 58.34% ± 8.73%.
The best testing accuracy was achieved by Gradient Boosted Trees, at 70.48%. In conclusion,
the best model for this study is Gradient Boosted Trees, since it is the only model whose
testing accuracy is greater than 65%.
Certainly, there are many ways in which this analysis can be improved. Firstly, the
dataset is rather small, and drawing solid conclusions would require much larger
datasets. Secondly, crowdsourcing is a useful tool to consider when large amounts of
data need to be labelled or annotated.
REFERENCES
Bw_mark. (2018, July 26). Business case study: Grabbing market shares. Retrieved May 10, 2019,
from https://www.bworldonline.com/business-case-study-grabbing-market-shares/
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., and Wirth, R.
(2000). CRISP-DM 1.0 step-by-step data mining guide. Technical report, CRISP-DM
Digital News Asia (2018). Grab aims to be Malaysians' everyday app. Retrieved May 1, 2019, from
https://www.digitalnewsasia.com/digital-economy/grab-aims-be-malaysians-everyday-app
Indra, B. and Ibrahim Bin Hamzah (2017). The Influence of Customer Satisfaction on ride-sharing
services in Malaysia. Retrieved May 1, 2019, from
http://www.ftms.edu.my/journals/pdf/IJABM/Nov2017/184-196.pdf
Kikuchi, T. (2018, February 09). Ride-hailing app Grab gears up for payment dominance. Retrieved
May 10, 2019, from
https://asia.nikkei.com/Business/Company-in-focus/Ride-hailing-app-Grab-gears-up-for-payment-dominance
Paronda, A. G. (2017). An Exploratory Study on Uber, GrabCar, and Conventional Taxis in
Metro Manila. Retrieved May 1, 2019, from
https://www.researchgate.net/publication/318959598_An_Exploratory_Study_on_Uber_GrabCar_and_Conventional_Taxis_in_Metro_Manila
Sentiment Analysis: Nearly Everything You Need to Know. (2018, June 20). Retrieved April 29,
2019, from https://monkeylearn.com/sentiment-analysis/
Tariq, Q. (2018, May 30). Uber Eats says goodbye, GrabFood says hello. Retrieved May 1, 2019,
from https://www.thestar.com.my/tech/tech-news/2018/05/28/grabfood-to-replace-ubereats/
Vernonchancom. (2018, July 10). Starting today the Grab app is no longer the app you'll recognize.
Retrieved May 11, 2019, from
https://vernonchan.com/new-grab-app-rolls-out-today-singapore-indonesia/
Vijayarani, S., Ilamathi, M., and Nithya. (1970, January 01). Preprocessing Techniques for Text
Mining-An Overview Dr. Retrieved April 29, 2019, from
https://www.semanticscholar.org/paper/Preprocessing-Techniques-for-Text-Mining-An-Dr-Vijayarani-Ilamathi/1fa11c4de09b86a05062127c68a7662e3ba53251
Worapong, A. (2015). The Study of Consumer Behavior and Selection Criteria On Alternative Taxi
Service in Bangkok. Retrieved May 1, 2019, from
http://ethesisarchive.library.tu.ac.th/thesis/2015/TU_2015_5702040949_3536_2209.pdf
Ye Q., Zhang Z., and Law R. (2009). "Sentiment classification of online reviews to travel
destinations by supervised machine learning approaches." Expert Systems with Applications,
Vol. 36, Issue 3, pp. 6527-6535

