
Predicting Search Engine Switching in the WSCD 2013 Challenge

Qiang Yan
Institute of Automation, Chinese Academy of Sciences, Beijing, China
qyan@nlpr.ia.ac.cn

Xingxing Wang, Qiang Xu
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
{wxx, xuqiang}@cnic.cn

Dongying Kong
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
kongdongying@software.ict.ac.cn

Danny Bickson
Carnegie Mellon University, Pittsburgh, PA
bickson@cs.cmu.edu

Quan Yuan
Taobao Inc., Beijing, China
yuanquan.yq@taobao.com

Qing Yang
Institute of Automation, Chinese Academy of Sciences, Beijing, China
qyang@nlpr.ia.ac.cn

ABSTRACT
How to accurately predict search engine switching behavior is an important but challenging problem. This paper describes the solution of the GraphLab team, which achieved 4th place in the WSCD 2013 Search Engine Switch Detection contest sponsored by Yandex. There are three core steps in our solution: feature extraction, prediction, and model ensemble. First, we extract features related to result quality, user preference, and search behavior sequence patterns from user actions, query logs, and click-stream sequences. Second, models such as Online Bayesian Probit Regression (BPR), Online Bayesian Matrix Factorization (BMF), Support Vector Regression (SVR), Logistic Regression (LR), and the Factorization Machine model (FM) are trained on these features. Finally, we propose a two-step ensemble method that blends our individual models, based on the local and public test datasets, in order to fully exploit the data and obtain a more accurate result. Our final solution achieves 0.8439 AUC on the public leaderboard and 0.8432 AUC on the private test set.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Retrieval Models, Search Process

General Terms
Predictive analytics, Collaborative filtering

Keywords
Search engine switching, Predicting switch rate, Online Bayesian model

1. INTRODUCTION
The WSCD Switching Detection Challenge is a competition which requires building models to predict the probability of a switch from one search engine to another, given a user search session. The dataset, which includes search sessions extracted from Yandex query logs with user ids, queries, URL rankings, clicks, and search engine switching actions, consists of a training set and a test set. The training set contains 7,856,734 sessions in which users performed at least one switch within a period of 27 days. The following 3 days are considered the test period, and all sessions of users from the training set are included in the test dataset; all switch actions in the test sessions are removed so that participants can predict their presence. The test dataset contains 738,997 sessions.
Each session contains a set of actions gathered in chronological order and can be viewed as:

Session := (#User, #Day, QC+, #SwitchType)   (1)

QC := (#QueryID, #ListOfURLs, CU)   (2)

CU := (#ClickedURLID)   (3)

That is, given a user (#UserID) and a day (#Day), the user may issue one or more queries (#QueryID), and each query returns a list of 10 URLs (#ListOfURLs), ordered as they are shown to the user from top to bottom. The user may click zero or more URLs from the list (#ClickedURLID), and may switch at any point during the session. During each session, there are three possible ways to switch:

P: Switch on the search engine result page (SERP), either by clicking another search engine's URL on the result page returned in response to a navigational query seeking that engine, or by clicking one of the links to major search engines which Yandex provides at the bottom of its result pages.

B: Switch with the toolbar, i.e., switch to another search engine via the Yandex browser toolbar (when installed).

H: Both B and P, i.e., switch to another search engine both via the toolbar and via a search engine result page.
Performance in this competition is evaluated by the Area Under Curve (AUC) metric [9], calculated from the ranking of sessions provided by participants for the test period. AUC is a popular measure for evaluating predictors and represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. Teams were allowed to submit a prediction once every hour; the AUC on a part of the test dataset, called the public test set, was calculated and shown on the leaderboard. The remainder of the test dataset was used to determine the winners of the contest. However, the split between the subsets (including their sizes) is not disclosed, and only the last submission is considered for evaluation, on which the live public rating is also based. The full description of the contest and dataset can be found at http://switchdetect.yandex.ru/en/datasets.
The remainder of this paper is structured as follows. Section 2 introduces the architecture of our system, including local test dataset generation, feature extraction from the raw data, and instance generation. Section 3 discusses the individual models, based on the predictive features described in Section 2.3. Section 4 introduces a two-step ensemble model to blend all individual models. We conclude the paper in Section 5.

2. SYSTEM ARCHITECTURE

2.1 Overview of Our System
[Figure 1: Overview of Our System]

Our solution to this contest can be divided into three core steps: features, models, and ensemble, as shown in Figure 1. In the first step, we extract strong predictive features, which are critical to all machine learning techniques, from the raw session data, including features related to user preference, search result satisfaction, and query-click sequence patterns. In the second step, several online Bayesian models, regression models, and classification models are implemented, from two different perspectives: session based and query based. In the final step, we blend the predictors from the second step with a linear model separately, and the final result is obtained by a further ensemble based on information from the leaderboard. This general framework is similar to our solutions for the KDD Cup 2011 Yahoo! Music recommendation contest [17] and the KDD Cup 2012 CTR prediction contest [15].
2.2 Local Test Dataset
Our experience in earlier KDD Cups [15][17] shows that it is very important to sample from the training dataset a stable local test dataset, sharing the distribution of the real test dataset, for the modeling and ensemble steps. We tried several sampling methods in this competition:

Random: 10% of the training set, sampled randomly.

Days 26-27: The training dataset covers days 1 to 27 and the test dataset days 28 to 30, so the intuitive idea is to use the later days (26-27) of the training set for local testing.

Days 21-23: In consideration of day-of-week diversity, we also tried days 21-23, i.e., the same days of the previous week as the days in the test dataset.

Our experiments show that the random sampling and the Days 26-27 sampling are more stable.
In the modeling step, we train every model twice: first, we train on the local training dataset and predict on the local test dataset; then we retrain on the full training dataset with the same parameter settings and predict on the full test dataset. In the blending step, we aggregate all the predictions of the individual models on the local test dataset for the ensemble. The whole procedure is shown in Figure 2.

[Figure 2: Composition of the Dataset]

2.3 Feature Engineering
In this section, we introduce the features used in our individual models; each model may use a different set of features. Overall, the features can be grouped into user related features, query result dissatisfaction features, and query-click sequence pattern features.

2.3.1 User Related Features
User preference is the most important session-level signal: our experiments show that we can reach an AUC of 0.7291 on the leaderboard with only one feature, the UserID. The features related to user preference are listed in Table 1.
Table 1: User Related Features
Feature  | Description
User ID  | Unique identifier of a user
User SR  | Switch rate for each user
User SC  | Session count for each user
User QC  | Query count for each user
User CC  | Click count for each user
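To make Table 1 concrete, the following is a minimal sketch of how such per-user aggregates could be computed with pandas. The flat log layout and the column names (user_id, session_id, clicks, switched) are hypothetical illustrations, not the challenge's actual schema.

```python
import pandas as pd

# Hypothetical flat log: one row per query, with user/session ids,
# a click count, and whether the session contained a switch.
log = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2],
    "session_id": [10, 10, 11, 20, 21],
    "clicks":     [2, 0, 1, 3, 0],
    "switched":   [1, 1, 0, 0, 1],
})

# One row per session, labeled switched if any query row was.
sessions = log.groupby(["user_id", "session_id"])["switched"].max().reset_index()

user_feats = pd.DataFrame({
    # User SR: fraction of the user's sessions that contain a switch
    "user_sr": sessions.groupby("user_id")["switched"].mean(),
    # User SC: number of sessions per user
    "user_sc": sessions.groupby("user_id")["session_id"].nunique(),
    # User QC / User CC: query and click counts per user
    "user_qc": log.groupby("user_id").size(),
    "user_cc": log.groupby("user_id")["clicks"].sum(),
})
print(user_feats)
```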
2.3.2 Query Related Features
Guo et al. [5] studied the reasons behind search engine switching within a session and found that dissatisfaction (DSAT) with the results is the most important reason why users switch. In our system, we exploit Query ID, Query Count, Click Count, etc. We also observed that many users switch at the end of a session, so features such as Last Query Click Count and Last Query MaxClick Pos were constructed to mine this kind of information. The features related to query performance are listed in Table 2.
Table 2: Query Related Features
Feature                 | Description
Query ID                | Unique identifier of a query
Query Count             | Query count for each session
Duplicated Query Count  | Duplicated query count
Click Count             | Click count for each session
Last Query Click Count  | Click count of the last query
Last Query MaxClick Pos | Max click position of the last query
URL Click Count         | Click count of each URL
Pos Click Count         | Click count for each position
2.3.3 Query-click Sequence Pattern Features
[6] shows that the query-click sequence pattern is an important feature for switch prediction. Each session is represented as a character sequence of actions and the dwell times between these actions. The actions used in our system are issuing a query (Q), clicking a result link (C), and clicking back to a previous link (B); dwell times are split into Short (S), Middle (M), and Long (L), with segmentation points at 60 and 560, chosen by cross validation. For example, QSCLC represents a user who issues a query, clicks one URL after a short time, and then clicks another URL after a longer time. In order to fully exploit the sequence patterns, we tried last-M sequence features and N-gram sequence features, experimenting with different M and N; setting M to 11 and N to 5 and 6 leads to the best performance. All the features bound up with the query-click sequence pattern are listed in Table 3.

Table 3: Query-click Sequence Pattern Features
Feature   | Description
Last N SQ | Last 11 sequence pattern
5-gram SQ | 5-gram sequence pattern
6-gram SQ | 6-gram sequence pattern
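As an illustration of these sequence features, here is a minimal sketch that maps a dwell time to S/M/L using the cut points of 60 and 560 stated above and extracts character n-grams; the sample session string is made up.

```python
def dwell_symbol(dwell, cuts=(60, 560)):
    """Map a dwell time to Short/Middle/Long using the paper's cut points."""
    if dwell < cuts[0]:
        return "S"
    return "M" if dwell < cuts[1] else "L"

def ngrams(seq, n):
    """All character n-grams of an action/dwell sequence such as 'QSCLC'."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

# Hypothetical session: query, short dwell, click, long dwell, click.
session = "Q" + dwell_symbol(30) + "C" + dwell_symbol(600) + "C"
print(session)             # QSCLC
print(ngrams(session, 5))  # ['QSCLC']
print(session[-11:])       # last-M sequence feature with M = 11
```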

2.4 Instance Generation
In this section we describe how training and test instances are created from the Yandex logs, and how instances are labeled to improve prediction.

2.4.1 Session-Based vs. Query-Based Instance
Since the primary goal of the challenge is to predict the switch probability of a search session of one user, the intuitive approach to instance generation is session based: we extract one feature vector per session. However, previous work on switching [5] determined that the query performance of the search results is a very important factor for predicting switching, so an alternative approach is to generate one instance per query in the session, where the label of a query-based instance is the label of its session. To exploit the query sequence, we additionally add features related to the preceding and following queries. Our experiments show that a single model with query-based instance generation cannot improve the prediction performance, but we achieve a better AUC when we ensemble the query-based and session-based results.

2.4.2 Binary vs. 3-class
One approach is to treat switch and non-switch as binary labels for classification. Fortunately, the detailed type of switch (B, P, and H) is provided in the dataset, so, just as in one-vs-all multi-class classification, we can generate instances for each individual switch type. For example, for the toolbar switch (B), we take all sessions labeled B or H as positive instances and the other sessions as negative instances; we name this dataset B vs. all. We train models on the datasets for the different switch labels and then ensemble all the models together. The ensemble method is:

Pr_i(all) = Pr_i(B) + Pr_i(P) - Pr_i(B) Pr_i(P)   (4)

where Pr_i(B) is the prediction for the i-th instance based on the B vs. all dataset, Pr_i(P) denotes the prediction based on the P vs. all dataset, and Pr_i(all) is the ensemble of these two predictions for the i-th instance, as shown in Figure 3.

[Figure 3: User Switch Type in Dataset]

Our experiments show that blending the 3-class models significantly improves performance, and ensembling the binary-class and 3-class models also improves the final AUC.
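Equation (4) is the usual noisy-OR combination, i.e., the probability that a session has a B-switch or a P-switch if the two predictions are treated as independent. A one-line sketch:

```python
import numpy as np

def combine(pr_b, pr_p):
    """Eq. (4): Pr(all) = Pr(B) + Pr(P) - Pr(B) * Pr(P) (noisy-OR)."""
    pr_b, pr_p = np.asarray(pr_b), np.asarray(pr_p)
    return pr_b + pr_p - pr_b * pr_p

print(combine([0.2, 0.7], [0.5, 0.1]))  # [0.6  0.73]
```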
3. MODELS
We adopt several state-of-the-art approaches to predict the switch rate given a session or query, which can be categorized as online Bayesian models, regression models, and classification models. We introduce these individual models in this section. All the models except the online Bayesian models are session based. The overall performance of our models is listed in Table 4.

Table 4: AUC of each individual model
Model | AUC on Local Test | AUC on Leaderboard | Description
BPR   | 0.7651 | 0.8131 | 2-class query-based
BPR   | 0.7706 | 0.8214 | 2-class session-based
BPR   | 0.7812 | 0.8326 | 3-class session-based
BMF   | 0.7805 | 0.8078 | 2-class query-based
BMF   | 0.7814 | 0.8057 | 2-class session-based
BMF   | 0.7855 | 0.8271 | 3-class session-based
SVR   | 0.7789 | 0.8219 | 2-class
SVR   | 0.7899 | 0.8226 | 3-class
CRR   | 0.7721 | 0.8165 | 2-class
CRR   | 0.7745 | 0.8192 | 3-class
LR    | 0.7812 | 0.8234 | 3-class (L-BFGS)
LR    | 0.7827 | 0.8266 | 3-class L1 (SGD)
LR    | 0.7891 | 0.8298 | 2-class L2
LR    | 0.7910 | 0.8357 | 3-class L2
FM    | 0.7763 | 0.8121 | 2-class SGD
FM    | 0.7786 | 0.8179 | 3-class SGD
FM    | 0.7815 | 0.8187 | 2-class MCMC
FM    | 0.7861 | 0.8192 | 3-class MCMC

3.1 Online Bayesian Model

3.1.1 Online Bayesian Probit Regression
Probit regression is a linear model based on the probit link function, which maps discrete or real-valued input features to probabilities. In the Online Bayesian Probit Regression (BPR) [4] setting, priors are placed on the coefficients in the form of Gaussian distributions; the model maintains Gaussian beliefs over its weights and performs updates derived from approximate message passing. BPR performed well at predicting advertisement placement for the Bing search engine.

Probit regression is a generalised linear model with a probit link function:

p(y | x, w) = Φ(y w^T x / β)   (5)

Here Φ(t) is the cumulative distribution function of the standard normal distribution:

Φ(t) = ∫_{-∞}^{t} N(s; 0, 1) ds   (6)

Φ(t) maps the sum of the linear coefficients from (-∞, +∞) to (0, 1). x is the feature vector and w denotes the weight vector. For convenience, we group the feature vector x into several groups of vectors, and the weight vector is grouped accordingly: w = (w_1^T, w_2^T, ..., w_N^T)^T. y ∈ {-1, +1} denotes the label, and β controls the steepness of the inverse link function. Based on probit regression, a factorising Gaussian prior distribution over the weights of the model is postulated, and two latent variables s and t are introduced:

p(w) = ∏_{i=1}^{N} ∏_{j=1}^{M_i} N(w_{i,j}; μ_{i,j}, σ_{i,j}²)   (7)

s = w^T x,  p(s) = N(s; x^T μ, Σ_j x_j² σ_j²)   (8)

p(t | s) = N(t; s, β²)   (9)

p(y | t) = δ(y - sign(t))   (10)

Therefore, the joint density function p(y, t, s, w | x) factorises as

p(y | t) p(t | s) p(s | x, w) p(w)   (11)

[Figure 4: Factor graph model of Bayesian probit regression with message flow]

This distribution can be understood in terms of the following generative process, which is also reflected in the factor graph in Figure 4.

Factor N_{i,j}: sample weight w_{i,j} from the Gaussian prior p(w_{i,j}) = N(w_{i,j}; μ_{i,j}, σ_{i,j}²).

Factor g: calculate the score s for x as the inner product w^T x, such that p(s | x, w) = δ(s - w^T x).

Factor h: add zero-mean Gaussian noise to obtain t from s, such that p(t | s) = N(t; s, β²).

Factor q: determine y by thresholding the noisy score t at zero, such that p(y | t) = δ(y - sign(t)).

Here δ denotes the Dirac delta function, which has the properties

δ(x) = +∞ if x = 0, and 0 if x ≠ 0   (12)

∫_{-∞}^{+∞} δ(x) dx = 1   (13)

A belief propagation method can be used to calculate the posterior over the weights w in the factor graph of Figure 4. In order to maintain Gaussian beliefs, an approximation method, expectation propagation [11], is used. After the expectation propagation derivation, the posterior weights w̃ = (μ̃, σ̃²) can be calculated as follows:

μ̃_{i,j} = μ_{i,j} + y x_{i,j} (σ_{i,j}² / Σ) v(y x^T μ / Σ)   (14)

σ̃_{i,j}² = σ_{i,j}² [1 - x_{i,j}² (σ_{i,j}² / Σ²) w(y x^T μ / Σ)]   (15)

where:

Σ² = β² + Σ_j x_j² σ_j²   (16)

v(t) = N(t; 0, 1) / Φ(t),  w(t) = v(t) [v(t) + t]   (17)

For the first training pass, we initialize the parameters (μ, σ²) of the prior weight distribution to reflect any prior information, and the posterior parameters (μ̃, σ̃²) are calculated according to the equations above. After that, the previously obtained posterior is used as the prior for the next update, i.e., μ ← μ̃ and σ² ← σ̃², which yields an online learning algorithm.
Finally, we obtained an AUC of 0.8326 on the leaderboard with the online Bayesian probit regression model.

3.1.2 Online Bayesian Matrix Factorization
Matrix factorization (MF) is a common technique for collaborative filtering (CF). This competition task can be modeled as a recommendation problem solvable by Bayesian CF, because similar users may search similar queries. Therefore, we apply MF to exploit latent information from the data. We utilize this kind of model on top of the probit regression model [14] by using

x = {u, q, f},  u ∈ R^m, q ∈ R^n, f ∈ R^l   (18)

y(x) = w_f^T f + (s u)^T (t q)   (19)

s ∈ R^{m×k},  t ∈ R^{n×k},  w_f ∈ R^l

p_switch = Φ(y(x))   (20)

where u and q are the user features and query features, and f are additional features. w_f are the weights of the other features, and s and t represent the latent factors of the features u and q. Here Φ(t) = ∫_{-∞}^{t} N(s; 0, 1) ds is the standardized cumulative Gaussian density (probit function), which serves as the inverse link function mapping the output of the linear model from (-∞, +∞) to (0, 1).
In order to arrive at a Bayesian online learning algorithm, we postulate a factorising Gaussian prior distribution over the weights and factors:

p(w_f) = ∏_i N(w_{f,i}; μ_{f,i}, σ_{f,i}²)   (21)

p(s) = ∏_{i=1}^{N} ∏_{k=1}^{K} N(s_{i,k}; μ_{i,k}, σ_{i,k}²)   (22)

p(t) = ∏_{i=1}^{N} ∏_{k=1}^{K} N(t_{i,k}; μ_{i,k}, σ_{i,k}²)   (23)

We can then calculate the posterior distribution of the weights and factors. The factor graph model is shown in Figure 5, and we use the Assumed-Density Filtering algorithm to calculate the posterior for every record. First, we compute the sums for the linear Gaussian input models; all the messages labeled (1) in Figure 5 can be calculated in this way. Then, because messages (2) and (5) cannot be calculated directly, the EP approximation is used, which means that messages (2,3,4,5,6,7) must be iterated until the marginal distribution of the switch probability, p_switch, no longer changes. Finally, messages (8) are calculated, and the posterior distribution in each case is obtained by multiplying this message by the prior. The detailed inference can be found in [14].

[Figure 5: Factor graph model of online Bayesian matrix factorization regression with message flow]

Finally, we obtained an AUC of 0.8271 on the leaderboard with Online Bayesian Matrix Factorization.
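To make the online Bayesian updates of Section 3.1.1 concrete, the following is a minimal numpy sketch of the BPR update equations (14)-(17) for a single observation. The initialization values are purely illustrative, not the paper's actual settings.

```python
import numpy as np
from scipy.stats import norm

def bpr_update(mu, sigma2, x, y, beta=1.0):
    """One online update of Bayesian probit regression (eqs. 14-17).
    mu, sigma2: Gaussian beliefs over the weights; x: feature vector;
    y: label in {-1, +1}; beta: steepness of the probit link."""
    total = np.sqrt(beta ** 2 + np.sum(x ** 2 * sigma2))           # eq. (16)
    t = y * np.dot(x, mu) / total
    v = norm.pdf(t) / norm.cdf(t)                                  # eq. (17)
    wt = v * (v + t)
    mu = mu + y * x * (sigma2 / total) * v                         # eq. (14)
    sigma2 = sigma2 * (1.0 - x ** 2 * (sigma2 / total ** 2) * wt)  # eq. (15)
    return mu, sigma2

# Illustrative prior: zero means, unit variances, three features.
mu, sigma2 = np.zeros(3), np.ones(3)
mu, sigma2 = bpr_update(mu, sigma2, np.array([1.0, 0.0, 1.0]), y=+1)
print(mu, sigma2)
```

The posterior returned by one call becomes the prior of the next call, which is exactly the online scheme described above.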

3.2 Regression Model

3.2.1 Support Vector Regression
Support vector regression (SVR) [2] adapts support vector machines to regression by introducing an alternative loss function such as the ε-insensitive loss. SVR maps the data to a high dimensional space and employs a kernel function. The optimization target of the L2-regularized SVR we use is:

min_w (1/2) w^T w + C Σ_{i=1}^{N} ξ(w; x_i, y_i)   (24)

where C is the penalty parameter.
We utilized a large scale linear solver toolbox, LIBLINEAR [3], in our final solution. LIBLINEAR offers two complementary training methods for linear SVR: a Newton-type method that solves the primal form and a coordinate descent approach for the dual form. The detailed parameters for LIBLINEAR are: c = 0.6, e = 0.1.
L2-SVR achieves 0.7899 on the local test dataset and 0.8266 on the leaderboard with the blending of the 4-class classifications.
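We did not rerun the original solution, but an equivalent L2-regularized linear SVR can be sketched with scikit-learn's liblinear-backed LinearSVR, under the assumption that the paper's c and e map to the cost and ε-insensitivity parameters:

```python
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                            # placeholder features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(float)   # placeholder labels

# L2-regularized SVR solved in the primal, roughly mirroring the
# LIBLINEAR setup with cost c = 0.6 and epsilon e = 0.1 (our mapping).
svr = LinearSVR(C=0.6, epsilon=0.1,
                loss="squared_epsilon_insensitive", dual=False)
svr.fit(X, y)
scores = svr.predict(X)   # real-valued scores, ranked for AUC
```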
3.2.2 Combined Regression and Ranking
In this contest, prediction performance is evaluated by AUC, and [15] shows that minimizing the pairwise 0/1 loss is equivalent to directly optimizing AUC. Combined Regression and Ranking (CRR) [13] optimizes a regression-based and a rank-based objective simultaneously, which gives good performance on both regression and ranking metrics. Let L(w, D) denote the regression loss and L(w, P) the pairwise ranking loss. The CRR optimization problem is:

min_w α L(w, D) + (1 - α) L(w, P) + (λ/2) ||w||²   (25)

where:

L(w, D) = (1/|D|) Σ_{(x,y)∈D} l(y, f(w, x))   (26)

L(w, P) = (1/|P|) Σ_{(x^+,x^-)∈P} l(t(y^+ - y^-), f(w, x^+ - x^-))
where D is the collection of instances and P is the collection of pairwise instances, (x^+, y^+) is a positive instance while (x^-, y^-) is a negative instance, α is a tradeoff parameter that weighs the regression loss against the pairwise ranking loss, l(·, ·) is the loss function, and f(·, ·) is the prediction function associated with the loss. We use the suite of fast incremental algorithms for machine learning (sofia-ml) provided with [13]. We experimented with the different learner and loop types provided in sofia-ml and found that the logistic loss, the sgd-svm learner, and the combined-ROC loop type lead to the best AUC. The detailed parameter settings are: α = 0.75, λ = 0.00005, iterations = 400000000.
This model achieves an AUC of 0.7745 on the local test dataset and 0.8192 on the leaderboard with the combination of the 3-class results.
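Sofia-ml implements the combined objective internally; the following is a simplified sketch of a CRR-style SGD loop under our reading of eq. (25), alternating a regression step with probability α and a pairwise ranking step otherwise (logistic loss, labels in {-1, +1}). It is an illustration, not the sofia-ml implementation.

```python
import numpy as np

def crr_sgd(X, y, alpha=0.75, lam=5e-5, eta=0.1, steps=10000, seed=0):
    """Simplified CRR (eq. 25): with prob. alpha take a regression step on
    one instance, otherwise a ranking step on a random (positive, negative)
    pair. Logistic loss, labels y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    pos, neg = np.where(y > 0)[0], np.where(y < 0)[0]
    for _ in range(steps):
        if rng.random() < alpha:                        # regression step
            i = rng.integers(len(y))
            xi, yi = X[i], y[i]
        else:                                           # ranking step on a pair
            xi = X[rng.choice(pos)] - X[rng.choice(neg)]
            yi = 1.0
        # gradient of log(1 + exp(-y w.x)) plus L2 regularization
        grad = -yi * xi / (1.0 + np.exp(yi * (w @ xi))) + lam * w
        w -= eta * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))
w = crr_sgd(X, y)
```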
3.3 Classification Model

3.3.1 Logistic Regression
Logistic regression is a widely used classification model. For given x and w, it models the probability that y = 1 as:

p(y = 1 | x, w) = 1 / (1 + e^{-w^T x})   (27)

The log-likelihood (LL) for logistic regression is given by:

ℓ(w) = Σ_{i=1}^{N} log[ p(y_i = 1 | x_i)^{y_i} (1 - p(y_i = 1 | x_i))^{1-y_i} ]
     = Σ_{i=1}^{N} [ y_i log p(y_i = 1 | x_i) + (1 - y_i) log(1 - p(y_i = 1 | x_i)) ]   (28)

Regularization: to avoid overfitting, two types of regularization are added to the loss function.

L1-regularized loss function:

ℓ(w, x) = Σ_{i=1}^{N} (ŷ_i(w, x_i) - y_i)² + C Σ_{j=1}^{N} |w_j|   (29)

L2-regularized loss function:

ℓ(w, x) = Σ_{i=1}^{N} (ŷ_i(w, x_i) - y_i)² + C w^T w   (30)

where C > 0 is the penalty parameter.

Optimization: in our experiments, to avoid tuning the parameters of SGD, we chose the limited-memory BFGS (L-BFGS) method described in [18] and the solvers provided by LIBLINEAR [3] for L1- and L2-regularized logistic regression.

We include three logistic regression models in our final result: the first is L1-regularized logistic regression with L-BFGS, and the other two are L1- and L2-regularized logistic regression with LIBLINEAR. The detailed LIBLINEAR parameters are: c = 0.6, e = 0.01.
The L2-regularized logistic regression with LIBLINEAR on the 3-class dataset is our best individual model, giving 0.835672 on the public test set.
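The corresponding scikit-learn calls (which wrap liblinear) would look roughly as follows; we assume here that the paper's c maps to the cost parameter C and e to the stopping tolerance, which is one plausible reading of the LIBLINEAR flags:

```python
from sklearn.linear_model import LogisticRegression

# L1- and L2-regularized logistic regression via the liblinear solver,
# assuming c = 0.6 -> cost C and e = 0.01 -> stopping tolerance tol.
lr_l1 = LogisticRegression(penalty="l1", C=0.6, solver="liblinear", tol=0.01)
lr_l2 = LogisticRegression(penalty="l2", C=0.6, solver="liblinear", tol=0.01)
# Typical use: lr_l2.fit(X_train, y_train)
#              p = lr_l2.predict_proba(X_test)[:, 1]   # scores ranked for AUC
```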
3.3.2 Factorization Machine
The Factorization Machine (FM) [12] is a general class of models which combines the advantages of Support Vector Machines (SVMs) with factorization models. We use an FM of order d = 2:

ŷ(x) := w_0 + Σ_{i=1}^{N} w_i x_i + Σ_{i=1}^{N} Σ_{j=i+1}^{N} x_i x_j Σ_{k=1}^{K} v_{ik} v_{jk}   (31)

Its parameters are:

w_0 ∈ R,  w ∈ R^N,  v ∈ R^{N×K}   (32)

where x is the input vector and ŷ is the prediction of the target. w_0 is the global bias, w contains the per-feature biases, and v holds a latent factor vector for each feature; K is the dimensionality of the 2-d factorization.
To reduce the computational complexity, the FM model in (31) can be transformed into:

ŷ(x) = w_0 + Σ_{i=1}^{N} w_i x_i + (1/2) Σ_{k=1}^{K} [ (Σ_{i=1}^{N} v_{ik} x_i)² - Σ_{i=1}^{N} v_{ik}² x_i² ]

The optimization criterion is:

min_Θ Σ_{(x,y)∈D} L(y, ŷ(x)) + Σ_θ λ_θ ||θ||²,  subject to (32)

where Θ represents the parameters w_0, w, and v, and λ is the regularization parameter. We use the FM model for binary classification (y ∈ {+1, -1}), and the logistic loss is used in our final solution:

L(y, ŷ(x)) := -ln( 1 / (1 + e^{-y ŷ(x)}) ) = ln(1 + e^{-y ŷ(x)})   (33)

Finally, we explored both the SGD and the MCMC learning methods: SGD is fast but its parameters are not easy to tune, while MCMC is slower but has fewer parameters to tune. The detailed parameter settings for FM are listed in Table 5.

Table 5: Parameter Settings for FM
Method | Parameters
SGD    | learning rate = 0.002; regularization = 0.00005; size of latent factor = 30
MCMC   | size of latent factor = 50; iterations = 150

The FM model with SGD achieves 0.7786 on the local test dataset and 0.8179 on the leaderboard with 2-class classification, while the FM model with MCMC achieves 0.7861 on the local test dataset and 0.8191 on the leaderboard with the blending of the 3-class classifications.
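The transformed model above reduces prediction from O(N²K) to O(NK); a direct numpy transcription, checked against the naive double sum of eq. (31):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM prediction using the O(NK) reformulation:
    y = w0 + <w, x> + 0.5 * sum_k [ (V[:,k].x)^2 - (V[:,k]^2).(x^2) ]."""
    s = V.T @ x                     # per-factor sums, shape (K,)
    s2 = (V ** 2).T @ (x ** 2)
    return w0 + x @ w + 0.5 * np.sum(s ** 2 - s2)

rng = np.random.default_rng(0)
N, K = 6, 3
x, w0 = rng.normal(size=N), 0.1
w, V = rng.normal(size=N), rng.normal(size=(N, K))

# Sanity check against the naive O(N^2 K) double sum of eq. (31).
naive = w0 + x @ w + sum(x[i] * x[j] * (V[i] @ V[j])
                         for i in range(N) for j in range(i + 1, N))
assert np.isclose(fm_predict(x, w0, w, V), naive)
```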

4. ENSEMBLE
[1][15][17] show that a combination of different weak models can lead to significant performance improvements over the individual models. In this section, we introduce the two-step ensemble we adopt in our learning system: an ensemble based on the local test dataset and an ensemble based on the public test dataset. Since the outputs of the models may follow different distributions, we apply a normalization step [10] to each model before the ensemble: sort the predictions on the test dataset by value, and scale the scores into [0, 1] linearly.
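This normalization step fits in a few lines; the sketch below is one plausible reading (rank-based linear scaling) of the sort-and-scale description above:

```python
import numpy as np

def rank_normalize(pred):
    """Sort predictions by value and scale their ranks into [0, 1] linearly,
    so every model's output lives on the same scale before blending."""
    ranks = np.argsort(np.argsort(pred))   # rank of each prediction
    return ranks / (len(pred) - 1.0)

print(rank_normalize(np.array([0.9, 0.1, 0.5])))  # [1.  0.  0.5]
```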
4.1 Ensemble on Local Test Dataset
As described in Section 2, the ensemble on the local test dataset is built just like an individual model: we aggregate the predictions of all individual models on the local test dataset into instances, where the label of each instance is its real switch type (+1/-1), and then build an ensemble model that learns the optimal M-dimensional linear combination weights w, where M is the number of ensembled models. In our ensemble system we used ridge regression:

W = (X^T X + λI)^{-1} X^T y   (34)

where X denotes the feature matrix generated by aggregating all the predictions of the individual models, y is the vector of true values for the instances, I is the identity matrix, and λ is the regularization parameter, determined by cross validation [7]. The final prediction is obtained by:

p̂ = X̂ W   (35)

where X̂ aggregates the individual models' predictions on the test set. Since the outputs of the different models show different performance on the local test dataset, we ensemble them separately; the results on the public test set are shown in Table 6.

Table 6: Leaderboard AUC of ensemble models
No. | Ensembled Models | AUC on Leaderboard
E1  | BPR & BMF        | 0.8392
E2  | FM & LR          | 0.8416
E3  | SVR & CRR        | 0.8304
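Equation (34) is the closed-form ridge solution; a compact sketch of the blending step follows, with random placeholder data standing in for the real model predictions and λ for the cross-validated regularization parameter:

```python
import numpy as np

def ridge_blend(X, y, lam):
    """Eq. (34): W = (X^T X + lam * I)^{-1} X^T y, where each column of X
    is one individual model's (normalized) predictions on the local test set."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X_local = rng.random((100, 4))                       # 4 models, 100 sessions
y_local = rng.choice([-1.0, 1.0], size=100)          # true switch labels
W = ridge_blend(X_local, y_local, lam=0.1)

X_test = rng.random((10, 4))                         # same models on test set
p_hat = X_test @ W                                   # eq. (35)
```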
4.2 Ensemble on Public Test Dataset
In order to fully leverage the results of the individual models and the leaderboard information, a rank based ensemble method is introduced to combine the models above. Suppose that we have N instances in the test set and M individual models, and that all instances are ranked according to each model's predictions. The detailed ensemble method is:

Pr(i) = M / Σ_{j=1}^{M} (1 / p_j(i))   (36)

where p_j(i) denotes the prediction of instance i with model j, and Pr(i) is the ensemble result for instance i.
In addition, our experiments show that re-ensembling with the predictions of some individual models can gain some additional improvement. We ensemble the models E1, E2, and E3 with BPR, CRR, LR-L2, and LR-L1 to achieve the final result of 0.843930.
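Under our reading, p_j(i) in eq. (36) is instance i's rank under model j, so the combination is a harmonic mean of per-model ranks; a sketch with made-up scores:

```python
import numpy as np

def rank_ensemble(preds):
    """Eq. (36): harmonic mean of per-model ranks. `preds` has shape (M, N):
    one row of prediction scores per model; higher score = more likely switch."""
    M, N = preds.shape
    ranks = np.argsort(np.argsort(preds, axis=1), axis=1) + 1.0  # 1..N per model
    return M / np.sum(1.0 / ranks, axis=0)

preds = np.array([[0.9, 0.2, 0.5],
                  [0.8, 0.1, 0.7]])
print(rank_ensemble(preds))  # [3. 1. 2.]: higher = more confident switch
```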
5. CONCLUSIONS
In this paper, we presented our approach to the WSCD 2013 Switching Detection Challenge. Since the data is rather large, we exploited several individual models suitable for large-scale data, such as Online Bayesian Probit Regression (BPR), Support Vector Regression (SVR), and the Factorization Machine model (FM), and we proposed a two-step ensemble method to blend the models above. Our experimental results demonstrate the effectiveness of this solution. In the future, we would like to try other sequence learning models (e.g., Hidden Markov Models) to explore the sequence properties of the dataset; and, in order to improve the computational performance, other parallel machine learning frameworks like GraphLab [19] and GraphChi [8] could be used to implement the models above, especially the online Bayesian models, since these frameworks naturally support graphical models.

6. ACKNOWLEDGMENTS
We give great thanks to Yandex for holding this interesting contest. This work was supported by the Innovation Fund Project of the Computer Network Information Center of the Chinese Academy of Sciences, "Notology-based Scientists Resource Service Platform" (Project Code: CNIC CX 10003). Danny Bickson is supported by the 7th Framework IST programme of the European Union through the focused research project (STREP) on Longitudinal Analytics of Web Archive data (LAWA) under contract no. 258105, ARO MURI W911NF0810242, the ONR PECASE-N00014-10-1-0672, and the National Science Foundation grant IIS-0803333.

7. REFERENCES
[1] R. M. Bell and Y. Koren. Lessons from the Netflix Prize Challenge. SIGKDD Explorations, 9(2):75-79, 2007.
[2] B. E. Boser, I. Guyon, and V. Vapnik. A Training Algorithm for Optimal Margin Classifiers. In COLT, pages 144-152, 1992.
[3] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification. JMLR, 9:1871-1874, 2008.
[4] T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich. Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine. In ICML, pages 13-20, 2010.
[5] Q. Guo, R. W. White, Y. Zhang, B. Anderson, and S. T. Dumais. Why Searchers Switch: Understanding and Predicting Engine Switching Rationales. In SIGIR, pages 335-344, 2011.
[6] A. P. Heath and R. W. White. Defection Detection: Predicting Search Engine Switching. In WWW, pages 1173-1174, 2008.
[7] M. Jahrer, A. Töscher, and R. A. Legenstein. Combining Predictions for Accurate Recommender Systems. In KDD, pages 693-702, 2010.
[8] A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale Graph Computation on Just a PC. In OSDI '12, pages 31-46, Berkeley, CA, USA, 2012. USENIX Association.
[9] C. X. Ling, J. Huang, and H. Zhang. AUC: a Statistically Consistent and more Discriminating Measure than Accuracy. In IJCAI, pages 519-526, 2003.
[10] T. McKenzie, C.-S. Ferng, Y.-N. Chen, C.-L. Li, C.-H. Tsai, K.-W. Wu, Y.-H. Chang, C.-Y. Li, W.-S. Lin, S.-H. Yu, C.-Y. Lin, P.-W. Wang, C.-M. Ni, W.-L. Su, T.-T. Kuo, C.-T. Tsai, P.-L. Chen, R.-B. Chiu, K.-C. Chou, Y.-C. Chou, C.-C. Wang, C.-H. Wu, H.-T. Lin, C.-J. Lin, and S.-D. Lin. Novel Models and Ensemble Techniques to Discriminate Favorite Items from Unrated Ones for Personalized Music Recommendation. Journal of Machine Learning Research - Proceedings Track, 18:101-135, 2012.
[11] Y. Qi, A. H. Abdel-Gawad, and T. P. Minka. Variable Sigma Gaussian Processes: An Expectation Propagation Perspective. CoRR, abs/0910.0668, 2009.
[12] S. Rendle. Factorization Machines with libFM. ACM Trans. Intell. Syst. Technol., 3(3):57:1-57:22, May 2012.
[13] D. Sculley. Combined Regression and Ranking. In KDD, pages 979-988, 2010.
[14] D. H. Stern, R. Herbrich, and T. Graepel. Matchbox: Large Scale Online Bayesian Recommendations. In WWW, pages 111-120, 2009.
[15] X. Wang, S. Lin, D. Kong, L. Xu, Q. Yan, S. Lai, L. Wu, A. Chin, G. Zhu, H. Gao, et al. Click-Through Prediction for Sponsored Search Advertising with Hybrid Models.
[16] R. W. White and S. T. Dumais. Characterizing and Predicting Search Engine Switching Behavior. In CIKM, pages 87-96, 2009.
[17] Y. Wu, Q. Yan, D. Bickson, Y. Low, and Q. Yang. Efficient Multicore Collaborative Filtering. In ACM KDD Cup Workshop, 2011.
[18] Y. Xiao, Z. Wei, and Z. Wang. A Limited Memory BFGS-type Method for Large-scale Unconstrained Optimization. Computers & Mathematics with Applications, 56(4):1001-1009, 2008.
[19] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB, 2012.