2017-2018

COURSE CODE: CMSE11358


Principles of Data Analytics

EXAM NUMBER: B049517

MSc Business Analytics


Table of Contents
Abstract
Introduction
Data Gathering
Methods
Conclusion
Similar Papers
References
Abstract
Authors take different approaches to formulating research questions. Some research
questions focus attention on the relationship between theories and concepts; others aim to
open up an area so that new theories can be found.
In this paper, we explore two papers: Backpage and Bitcoin: Uncovering Human
Traffickers and Quick Access: Building a Smart Experience for Google Drive. In the first
paper, the authors explore ways to distinguish an ad posted by an independent sex worker
from one posted on behalf of a human trafficking victim. To this end, they build a machine
learning classifier that uses stylometry to decide whether two ads were written by the same
author. In the second paper, the authors develop a Quick Access feature for Google Drive
that surfaces the most relevant documents when a user visits the home screen.

In the age of big data, many methods are used to extract data, and the two papers extract
their data in different ways. The first paper uses scraping techniques. In contrast, the
second paper uses Google's Activity Service, which provides access to all recent activity in
Drive. It logs requests such as open, create, edit, delete, rename, comment and upload
events made to Drive by any user.
In this paper, we describe the common approach that the two papers follow to formulate
their research questions and discuss how they collected and analysed the relevant data.
Finally, we mention papers that tackle similar problems but use other analysis
methodologies.
Introduction
The most important part of any research project is the specification of the research
questions or hypotheses (i.e. what is to be studied) and of the research strategy. The
choice of research strategy depends on the nature of the problem domain and on the
formulation of the research questions.
In this paper, we show that the authors of two different papers use similar techniques to
formulate their research questions.

The first paper, Backpage and Bitcoin: Uncovering Human Traffickers, begins by introducing
the problem of human trafficking. The authors explain why the tool they have created is
needed: because of the way human traffickers post ads, it is almost impossible for a human
to tell whether an ad was posted by a human trafficker or by an independent sex worker. The
authors gather data with the help of scraping techniques. The aim of the paper was to
develop and demonstrate automatic techniques for clustering sex ads by owner. Two such
techniques were designed. The first was a machine learning stylometry classifier that
determines whether two ads were written by the same author or by different authors. The
second was a technique that links specific ads to publicly available Bitcoin transaction
information.

The second paper, Quick Access: Building a Smart Experience for Google Drive, formulates
its research question in a similar way. It first introduces Google Drive and then describes
the feature the authors have developed. As in the first paper, the authors explain why they
developed the tool and how it helps users. Their data gathering technique differs from that
of the first paper. The objective of the paper was to build a Quick Access feature that
surfaces the most relevant documents when a user visits the Google Drive home screen. The
authors use Google's Activity Service to gather user data. The Activity Service logs user
activity on Google Drive, including open, create, edit, delete, rename, comment and upload
events. They then apply neural networks and other machine learning techniques to the data so
that the system can “learn” from what the user does and surface the most relevant documents
when the user visits the home screen.

Data Gathering
The two papers gather data in different ways. The first paper, Backpage and Bitcoin:
Uncovering Human Traffickers, uses scraping methods to gather data. The authors scraped
1,164,663 ads posted from January 2008 to September 2014. They defined an author as an
entity tied to a set of hard identifiers that co-occur in any given ad. They processed all
the ads and linked them through the phone numbers and email addresses they contain. In the
end, they identified 336,315 authors in the dataset.
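The author definition above can be illustrated with a small sketch. The paper's exact linking procedure is not described here, so this is only one plausible reading: ads that share a phone number or email address, directly or through a chain of other ads, are placed in the same group using a union-find structure. The `identifiers` field, the function name and the sample ads are all hypothetical.

```python
from collections import defaultdict


def group_ads_by_author(ads):
    """Group ad indices whose hard identifiers overlap, directly or transitively."""
    parent = list(range(len(ads)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # identifier -> index of the first ad that used it
    for idx, ad in enumerate(ads):
        for identifier in ad["identifiers"]:
            if identifier in first_seen:
                union(first_seen[identifier], idx)
            else:
                first_seen[identifier] = idx

    groups = defaultdict(list)
    for idx in range(len(ads)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


# Made-up ads: each carries the phone numbers and emails found in its text.
ads = [
    {"identifiers": {"555-0101", "a@example.com"}},
    {"identifiers": {"555-0101"}},
    {"identifiers": {"b@example.com"}},
    {"identifiers": {"a@example.com", "555-0199"}},
]
authors = group_ads_by_author(ads)
print(authors)  # each inner list is one "author"
```

In this sketch an ad with no identifiers forms a group of its own; whether such ads should count as authors at all is a choice the sketch does not settle.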
Backpage offers free and premium ads. A premium ad is bumped to the top of a listings page
and, for adult ads, must be paid for in bitcoins. Each payment shows up in the Bitcoin
mempool (the pool of transactions waiting to be confirmed), and each ad carries a timestamp,
so a payment can be linked to an ad by matching the time of the transaction to the time of
the ad. To establish such links, the authors built a tool that takes snapshots of the state
of the network at fine granularity. Since Backpage used GoCoin to process Bitcoin payments,
they used two methods to identify transactions to GoCoin, namely Chainalysis labels and
GoCoin heuristics.
Chainalysis can discover wallet addresses used for making payments to GoCoin. Through its
subscriber-only API, the authors could check whether a transaction made a payment to GoCoin.
However, Chainalysis was unable to discover all GoCoin transactions. To account for these
false negatives, the authors developed heuristics to identify possible GoCoin transactions.
By analysing many GoCoin transactions that Chainalysis identified, they found that GoCoin
transactions have the following distinctive features: (i) a fresh wallet address appears in
exactly two transactions; (ii) the deposited bitcoin amount is always less than 1 BTC and
has between 3 and 4 decimal places; (iii) the bitcoins are aggregated along with other
bitcoins that follow feature (ii); and (iv) all these bitcoins are aggregated into a single
multi-signature wallet address that starts with the number 3. The authors label any wallet
address that meets all four conditions as GoCoin heuristic.
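The four conditions above can be written as a single predicate. The following is a minimal sketch and not the authors' code: the `WalletActivity` record and its field names are hypothetical, and it assumes that the relevant facts about each wallet address have already been extracted from the blockchain. It also assumes that condition (iii) requires at least one other deposit to be aggregated alongside.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class WalletActivity:
    """Hypothetical summary of what is known about one candidate wallet address."""
    tx_count: int                      # transactions the address appears in
    deposit_btc: Decimal               # amount deposited to the address
    co_aggregated_deposits_btc: list   # other deposits swept up in the same aggregation
    aggregation_address: str           # address the funds are aggregated into
    aggregation_is_multisig: bool      # whether that address is multi-signature


def decimal_places(amount):
    """Number of significant decimal places, ignoring trailing zeros."""
    return max(0, -amount.normalize().as_tuple().exponent)


def looks_like_small_deposit(amount):
    # Condition (ii): less than 1 BTC, with 3 or 4 decimal places.
    return Decimal(0) < amount < Decimal(1) and 3 <= decimal_places(amount) <= 4


def matches_gocoin_heuristic(wallet):
    others = wallet.co_aggregated_deposits_btc
    return (
        wallet.tx_count == 2                                          # (i)
        and looks_like_small_deposit(wallet.deposit_btc)              # (ii)
        and bool(others)                                              # (iii), assumed non-empty
        and all(looks_like_small_deposit(a) for a in others)          # (iii)
        and wallet.aggregation_is_multisig                            # (iv)
        and wallet.aggregation_address.startswith("3")                # (iv)
    )


# Made-up examples; the address string is a placeholder, not a real address.
candidate = WalletActivity(2, Decimal("0.0123"),
                           [Decimal("0.045"), Decimal("0.2001")],
                           "3ExampleAggregationAddress", True)
too_large = WalletActivity(2, Decimal("1.5"),
                           [Decimal("0.045")],
                           "3ExampleAggregationAddress", True)
print(matches_gocoin_heuristic(candidate), matches_gocoin_heuristic(too_large))
```

Amounts are held as `Decimal` rather than `float` so that counting decimal places is exact.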

In contrast, the authors of Quick Access: Building a Smart Experience for Google Drive
gather data in a different way. They collect data using Google's Activity Service, which
provides access to all recent activity in Drive. It gathers all requests made to the Drive
backend by any of the clients used to access Drive, including web, desktop and mobile
clients and the many third-party apps that work through the Drive API. The Activity Service
logs high-level user actions on documents such as create, open, edit, delete, rename,
comment and upload events. Additional sources of input for the document selection models
include user context such as recent or upcoming meetings on the calendar, Drive settings
(e.g. personal vs business account) and long-term usage statistics (e.g. total number of
collaborators, total files created since joining).

Methods
The two papers take different approaches to solving their respective problems. In the first
paper (Rebecca S. et al., 2017), the authors needed training, test and validation sets. To
build them, they removed ads that were exact duplicates of each other, keeping only one copy
in the final set. They also removed ads that had fewer than 50 words. After sampling and
resampling multiple times, they created three training/testing datasets, each consisting of
7,500 instances: 5,000 ‘different’ instances and 2,500 ‘same’ instances. They built a binary
classifier that takes a pair of ads as input and outputs ‘same’ if the ads were written by
the same author and ‘different’ otherwise. The authors experimented with different feature
sets for training. They used Writeprints Limited, a limited subset of the Writeprints
feature set that consists mainly of counts of characters, words, punctuation etc. This
feature set has been widely used for author attribution. The authors also used a Jaccard and
Structure model, which uses a variety of text-based features: word unigrams/bigrams, parts
of speech and proper names. They evaluated their tool with all six training/testing
combinations, training on one of the datasets and separately testing on each of the other
two, for all pairwise combinations of the three datasets. In all cases, the same vs.
different author classifier was effective, achieving an 89.54% true positive rate and a
1.13% false positive rate on average.
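To make the pairwise set-up concrete, the sketch below computes Jaccard similarity over word unigrams and bigrams for a pair of ads, and the true and false positive rates of a set of same/different predictions. It only illustrates the kind of feature and metric described above; it is not the authors' feature set or pipeline, and the tokeniser, function names and sample texts are assumptions made for the example.

```python
import re


def word_ngrams(text, n):
    """Set of word n-grams in a text, using a simple lowercase tokeniser."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pair_features(ad_a, ad_b):
    """Unigram and bigram Jaccard similarity for one pair of ads."""
    return [jaccard(word_ngrams(ad_a, n), word_ngrams(ad_b, n)) for n in (1, 2)]


def true_and_false_positive_rates(y_true, y_pred):
    """Rates for boolean labels, where True means 'same author'."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


features = pair_features("new in town call me today", "new in town text me now")
print(features)  # [0.5, 0.25]

# Made-up labels and predictions for five ad pairs.
rates = true_and_false_positive_rates(
    [True, True, False, False, False],
    [True, False, False, True, False],
)
print(rates)
```

In a full system, features like these would be computed for every ad pair and passed to a trained binary classifier; the sketch stops at the feature and metric level.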
In contrast, Quick Access: Building a Smart Experience for Google Drive uses different
methods for training and evaluation. In this paper, the data extraction pipeline is
responsible for scanning the Activity Service Bigtable and generating a scenario for each
open event. Each scenario represents a user visiting the Drive home screen and is an
implicit request for document suggestions. The authors generated their training, test and
validation sets from data collected from the Activity Service. Training data is stored in
protocol buffers for each scenario along with the recent activity from the Activity Service.
The data is split into training, test and validation sets by sampling at the user level, and
the scenarios are converted into a set of positive and negative examples with suitable
features. During data collection, the authors limited the candidates to documents with
activity within the last 60 days. A distributed neural network implementation was used to
train over the training data. Once the model has been trained, the evaluation pipeline
computes various custom metrics, such as top-k accuracy for various k. To score each
document, they used deep neural networks for three reasons: Google invests a significant
amount in this technology, neural networks are well known for reducing the effort needed for
this kind of task, and deep neural networks are known to be able to learn a competitive
model for problems such as this one. A simple architecture was employed: a stack of fully
connected layers of Rectified Linear Units (ReLU), which feeds into a logistic layer.
Feature engineering was also used, including making additional input data sources available
to the ML model, experimenting with different representations of the underlying data, and
encoding derived features in addition to the base features.
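The scoring architecture and the top-k accuracy metric described above can be sketched in a few lines of plain Python. This is a toy illustration only: the weights are set by hand rather than trained, the two input features and all names are hypothetical, and the real system is a distributed implementation with far more features.

```python
import math


def relu(values):
    return [max(0.0, v) for v in values]


def dense(values, weights, bias):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def score(features, hidden_layers, out_weights, out_bias):
    """Stack of ReLU layers feeding a logistic output in (0, 1)."""
    h = features
    for weights, bias in hidden_layers:
        h = relu(dense(h, weights, bias))
    logit = sum(w * v for w, v in zip(out_weights, h)) + out_bias
    return 1.0 / (1.0 + math.exp(-logit))


def top_k_accuracy(scenarios, k):
    """Fraction of scenarios whose opened document is among the k highest scores."""
    hits = 0
    for scores, opened_doc in scenarios:
        ranked = sorted(scores, key=scores.get, reverse=True)
        hits += opened_doc in ranked[:k]
    return hits / len(scenarios)


# Hand-set toy parameters: one hidden layer with two units.
hidden_layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])]
out_weights, out_bias = [1.0, 1.0], -1.0

# Hypothetical per-document features, e.g. [recent opens, days since last edit].
docs = {"doc_a": [3.0, 1.0], "doc_b": [0.0, 4.0], "doc_c": [1.0, 1.0]}
scores = {name: score(f, hidden_layers, out_weights, out_bias)
          for name, f in docs.items()}

# Two made-up scenarios: the same candidates, but a different document was opened.
scenarios = [(scores, "doc_a"), (scores, "doc_b")]
print(top_k_accuracy(scenarios, 1), top_k_accuracy(scenarios, 2))
```

Each scenario pairs the candidate scores with the document the user actually opened, which mirrors the idea that an open event is an implicit request for suggestions.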

Conclusion
We can see that the authors of both papers used similar techniques to formulate their
research questions. Both begin by introducing the problem and then explain why they chose
to develop their tool.
The authors of the two papers used different techniques to gather and analyse data. In the
Bitcoin paper, the authors gathered their data mainly from Backpage. They used scraping
techniques to gather information about the ads. They then used Chainalysis labels and GoCoin
heuristics to link the data together. They also used graph theory to link ads to Bitcoin
transactions. In contrast, the authors of the Quick Access paper gathered data using
Google's Activity Service, which logs user actions such as open, create and edit in Google
Drive.

Finally, we can see that the methods used in the two papers are different. In (Rebecca S.
et al., 2017), the authors first split their data into training, test and validation sets.
They then built a classifier that achieved a true positive rate of 89.54% and a false
positive rate of 1.13%. In contrast, the technology the Google authors used was considerably
more complex. A distributed neural network implementation is used to train over the training
data. To score each document, a deep neural network is used so that documents can be scored
in real time while the user is working with files in Google Drive. A simple architecture was
employed: a stack of fully connected layers of Rectified Linear Units (ReLU). In addition,
feature engineering was used.
Similar Papers
• Similar to Backpage and Bitcoin: Uncovering Human Traffickers (Rebecca S. et al.,
2017)
◦ Building High-level Features Using Large Scale Unsupervised Learning. In 2013
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 8595–8598.

The above paper also tackles a classification problem but uses unlabelled
data, as opposed to (Rebecca S. et al., 2017), which uses labelled data. The
learning approach is also different: the above paper uses unsupervised
learning, whereas (Rebecca S. et al., 2017) trains a binary classifier on
labelled pairs of ads.
• Similar to Quick Access: Building a Smart Experience for Google Drive (Sandeep Tata
et al., 2017)
◦ Jun Chen, Chaokun Wang, and Jianmin Wang. 2015. Will You "Reconsume"
the Near Past? Fast Prediction on Short-term Re-consumption Behaviors. In
29th AAAI Conference on Artificial Intelligence (AAAI). 23–29.

The above paper tackles a similar prediction problem but uses linear and
quadratic kernels to predict whether the user will perform a short-term
reconsumption at a specific time, as opposed to (Sandeep Tata et al., 2017),
where deep neural networks are used.
References
[1] http://www.idi.ntnu.no/grupper/su/publ/html/totland/ch012.htm

[2] Rebecca S., Danny Yuxing Huang, Periwinkle Doerfler (2017). Backpage and Bitcoin:
Uncovering Human Traffickers.

[3] Sandeep Tata, Alexandrin Popescul, Marc Najork, Mike Colagrosso, Julian Gibbons, Alan
Green, Alexandre Mah, Michael Smith, Divanshu Garg, Cayden Meyer, Reuben Kan (2017).
Quick Access: Building a Smart Experience for Google Drive.
