
A Review of

Information Filtering
Part II: Collaborative Filtering

Chengxiang Zhai

Language Technologies Institute


School of Computer Science
Carnegie Mellon University
Outline

• A Conceptual Framework for Collaborative
Filtering (CF)
• Rating-based Methods (Breese et al. 98)
– Memory-based methods
– Model-based methods
• Preference-based Methods (Cohen et al. 99 &
Freund et al. 98)

• Summary & Research Directions


What is Collaborative Filtering (CF)?

• Making filtering decisions for an individual user


based on the judgments of other users
• Inferring an individual’s interests/preferences from
those of other, similar users
• General idea
– Given a user u, find similar users {u1, …, um}
– Predict u’s preferences based on the preferences of u1,
…, um
CF: Applications
• Recommender systems: books, CDs, videos, movies,
potentially anything!
• Can be combined with content-based filtering
• Example (commercial) systems
– GroupLens (Resnick et al. 94): Usenet news rating
– Amazon: book recommendation
– Firefly (purchased by Microsoft?): music
recommendation
– Alexa: web page recommendation
CF: Assumptions
• Users with a common interest will have similar
preferences
• Users with similar preferences probably share the
same interest
• Examples
– “interest is IR” => “read SIGIR papers”
– “read SIGIR papers” => “interest is IR”

• A sufficiently large number of user preferences is
available
CF: Intuitions

• User similarity
– If Jamie liked the paper, I’ll like the paper
– ? If Jamie liked the movie, I’ll like the movie
– Suppose Jamie and I viewed similar movies in the past
six months …
• Item similarity
– Since 90% of those who liked Star Wars also liked
Independence Day, and you liked Star Wars …
– You may also like Independence Day
Collaborative Filtering vs.
Content-based Filtering
• Basic filtering question: Will user U like item
X?
• Two different ways of answering it
– Look at what U likes => characterize X => content-based filtering

– Look at who likes X => characterize U => collaborative filtering

• Can be combined
Rating-based vs. Preference-based
• Rating-based: User’s preferences are encoded using
numerical ratings on items
– Complete ordering
– Absolute values can be meaningful
– But, values must be normalized to combine
• Preference-based: User’s preferences are represented by
a partial ordering of items
– Partial ordering
– Easier to exploit implicit preferences
A Formal Framework for Rating

Objects: O = {o1, o2, …, oj, …, on}
Users: U = {u1, u2, …, ui, …, um}
Unknown function f: U x O → R, with ratings Xij = f(ui, oj)

[Rating matrix: rows are users, columns are objects; some cells hold
known ratings (e.g., 3, 1.5, 2, 1), others are unknown (?)]

The task:
• Assume known f values for some (u,o) pairs
• Predict f values for other (u,o) pairs
• Essentially function approximation, like other learning problems
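To make the setup concrete, here is a minimal sketch in Python (the matrix values are illustrative, not from any real dataset): a user-by-object rating matrix in which NaN marks the unknown f values to be predicted.

```python
import numpy as np

# Rows = users u1..u4, columns = objects o1..o5; NaN = unknown rating.
X = np.array([
    [3.0,    1.5,    np.nan, np.nan, 2.0],
    [np.nan, 2.0,    4.0,    np.nan, np.nan],
    [2.0,    np.nan, np.nan, 5.0,    np.nan],
    [1.0,    np.nan, 3.0,    np.nan, np.nan],
])

# The CF task: approximate f(u, o) at the NaN cells from the observed cells.
known = ~np.isnan(X)
print(f"{known.sum()} known ratings, {(~known).sum()} to predict")
```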
Where are the intuitions?
• Similar users have similar preferences
– If u ≈ u’, then for all o’s, f(u,o) ≈ f(u’,o)
• Similar objects have similar user preferences
– If o ≈ o’, then for all u’s, f(u,o) ≈ f(u,o’)
• In general, f is “locally constant”
– If u ≈ u’ and o ≈ o’, then f(u,o) ≈ f(u’,o’)
– “Local smoothness” makes it possible to predict unknown values
by interpolation or extrapolation
• What does “local” mean?
Two Groups of Approaches
• Memory-based approaches
– f(u,o) = g(u)(o) ≈ g(u’)(o) if u ≈ u’
– Find “neighbors” of u and combine their g(u’)(o)’s

• Model-based approaches
– Assume structures/models: object clusters, user clusters, f’
defined on clusters
– f(u,o) = f’(cu, co)
– Estimation & Probabilistic inference
Memory-based Approaches
(Breese et al. 98)

• General ideas:
– Xij: rating of object j by user i
– ni: average rating of all objects by user i
– Normalized ratings: Vij = Xij - ni
– Memory-based prediction:
$\hat{v}_{aj} = K \sum_{i=1}^{m} w(a,i)\, v_{ij}$,  $\hat{x}_{aj} = \hat{v}_{aj} + n_a$,  $K = 1 \Big/ \sum_{i=1}^{m} w(a,i)$
• Specific approaches differ in w(a,i), the
distance/similarity between users a and i
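A minimal Python sketch of the prediction rule above, assuming the NaN-matrix convention from earlier and a caller-supplied similarity function w (e.g., one of the measures on the next slide); using absolute weights in the normalizer is an assumption here, made to keep K well defined when weights are negative:

```python
import numpy as np

def predict(X, a, j, w):
    """x_aj = n_a + K * sum_i w(a,i) * v_ij, with v_ij = x_ij - n_i."""
    n = np.nanmean(X, axis=1)                  # n_i: user i's mean rating
    num = denom = 0.0
    for i in range(X.shape[0]):
        if i == a or np.isnan(X[i, j]):
            continue                           # skip self and non-raters of j
        weight = w(X, a, i)
        num += weight * (X[i, j] - n[i])       # weighted normalized rating v_ij
        denom += abs(weight)                   # K = 1 / sum |w(a,i)|
    return n[a] + (num / denom if denom > 0 else 0.0)
```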


User Similarity Measures
• Pearson correlation coefficient (sum over commonly
rated items):
$w_p(a,i) = \dfrac{\sum_j (x_{aj}-n_a)(x_{ij}-n_i)}{\sqrt{\sum_j (x_{aj}-n_a)^2 \sum_j (x_{ij}-n_i)^2}}$
• Cosine measure:
$w_c(a,i) = \dfrac{\sum_{j=1}^{n} x_{aj}\, x_{ij}}{\sqrt{\sum_{j=1}^{n} x_{aj}^2}\, \sqrt{\sum_{j=1}^{n} x_{ij}^2}}$
• Many other possibilities!
Improving User Similarity
Measures (Breese et al. 98)

• Dealing with missing values: default ratings
• Inverse User Frequency (IUF): similar to IDF
• Case Amplification: use w(a,i)^p, e.g., p = 2.5
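A sketch combining the three adjustments around a cosine-style weight; the neutral default of 2.5, the exact IUF formula, and the way they are composed are illustrative assumptions here, not necessarily the precise variants of Breese et al.:

```python
import numpy as np

def adjusted_similarity(X, a, i, default=2.5, p=2.5):
    # Default ratings: fill missing entries with a neutral value.
    xa = np.where(np.isnan(X[a]), default, X[a])
    xi = np.where(np.isnan(X[i]), default, X[i])
    # Inverse User Frequency: iuf_j = log(m / m_j), m_j = #users who rated j.
    m = X.shape[0]
    m_j = np.maximum((~np.isnan(X)).sum(axis=0), 1)
    iuf = np.log(m / m_j)
    xa, xi = xa * iuf, xi * iuf
    denom = np.linalg.norm(xa) * np.linalg.norm(xi)
    w = float(xa @ xi) / denom if denom > 0 else 0.0
    # Case amplification: w' = w * |w|^(p-1) emphasizes strong weights.
    return w * abs(w) ** (p - 1)
```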
Model-based Approaches
(Breese et al. 98)

• General ideas
– Assume that data/ratings are explained by a probabilistic
model with parameters θ
– Estimate/learn the model parameters θ from the data
– Predict unknown ratings using $E_\theta[x_{k+1} \mid x_1, \ldots, x_k]$,
computed from the estimated model:
$E_\theta[x_{k+1} \mid x_1, \ldots, x_k] = \sum_r p(x_{k+1} = r \mid x_1, \ldots, x_k, \theta)\, r$
• Specific methods differ in the model used and in how
the model is estimated
Probabilistic Clustering
• Clustering users based on their ratings
– Assume ratings are observations of a
multinomial mixture model with parameters
p(C), p(xi|C)
– Model estimated using standard EM

• Predict ratings using $E[x_{k+1} \mid x_1, \ldots, x_k]$:
$E[x_{k+1} \mid x_1, \ldots, x_k] = \sum_r p(x_{k+1} = r \mid x_1, \ldots, x_k)\, r$
$p(x_{k+1} = r \mid x_1, \ldots, x_k) = \sum_c p(C = c \mid x_1, \ldots, x_k)\, p(x_{k+1} = r \mid C = c)$
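A minimal sketch of this prediction step, assuming EM has already produced prior[c] = p(C=c) and cond[c, i, r] = p(x_i = r | C=c) as a numpy array of shape [clusters, items, rating values], with ratings coded 0..R-1:

```python
import numpy as np

def expected_rating(prior, cond, observed, target):
    """E[x_target | observed], where observed = {item_index: rating_value}."""
    # Posterior over clusters: p(C=c | x_1..x_k) ∝ p(C=c) * Π_i p(x_i | C=c)
    post = prior.copy()
    for item, r in observed.items():
        post = post * cond[:, item, r]
    post = post / post.sum()
    # p(x_target = r | x_1..x_k) = Σ_c p(C=c | ...) * p(x_target = r | C=c)
    p_r = post @ cond[:, target, :]
    return float(p_r @ np.arange(cond.shape[2]))   # expectation Σ_r r * p(r)
```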
Bayesian Network
• Use BN to capture object/item dependency
– Each item/object is a node
– (Dependency) structure is learned from all data
– Model parameters: $p(x_{k+1} \mid pa(x_{k+1}))$, where
$pa(x_{k+1})$ denotes the parents/predictors of $x_{k+1}$
(represented as a decision tree)
• Predict ratings using $E[x_{k+1} \mid x_1, \ldots, x_k]$:
$E[x_{k+1} \mid x_1, \ldots, x_k] = \sum_r p(x_{k+1} = r \mid x_1, \ldots, x_k)\, r$
where $p(x_{k+1} = r \mid x_1, \ldots, x_k)$ is given by the decision tree at node $x_{k+1}$
Three-way Aspect Model
(Popescul et al. 2001)

• CF + content-based
• Generative model
• (u,d,w) as observations
• z as hidden variable
• Standard EM
• Essentially clustering the joint data
• Evaluation on ResearchIndex data
• Found it’s better to treat (u,w) as
observations
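For reference, the aspect-model factorization this slide refers to routes all three observed variables through the hidden variable z; this is the standard three-way aspect-model decomposition, though the exact parameterization in Popescul et al. 2001 may differ in details:

$p(u, d, w) = \sum_z p(z)\, p(u \mid z)\, p(d \mid z)\, p(w \mid z)$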
Evaluation Criteria (Breese et al. 98)
• Rating accuracy
– Average absolute deviation:
$S_a = \frac{1}{|P_a|} \sum_{j \in P_a} |\hat{x}_{aj} - x_{aj}|$
– $P_a$ = the set of items predicted for user a
• Ranking accuracy
– Expected utility:
$S_a = \sum_j \frac{\max(x_{aj} - d,\, 0)}{2^{(j-1)/(\alpha-1)}}$
– Exponentially decaying viewing probability
– α (halflife) = the rank where the viewing probability = 0.5
– d = the neutral rating
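Minimal sketches of both criteria; pred and truth map items to ratings, ranked is the system's ranking from best to worst, and the defaults d = 2.5 and α = 5 are illustrative values, not prescribed by the slide:

```python
import numpy as np

def avg_absolute_deviation(pred, truth):
    """S_a = (1/|P_a|) * Σ_{j in P_a} |x̂_aj - x_aj|."""
    return float(np.mean([abs(pred[j] - truth[j]) for j in pred]))

def expected_utility(ranked, truth, d=2.5, alpha=5):
    """S_a = Σ_j max(x_aj - d, 0) / 2^((j-1)/(alpha-1)), j = 1-based rank."""
    return sum(max(truth[item] - d, 0.0) / 2 ** ((j - 1) / (alpha - 1))
               for j, item in enumerate(ranked, start=1))
```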
Datasets
Results

- BN & CR+ are generally better than VSIM & BC
- BN is best with more training data
- VSIM is better with little training data
- Inverse User Frequency is effective
- Case amplification is mostly effective
Summary of
Rating-based Methods
• Effectiveness
– Both memory-based and model-based methods can
be effective
– The correlation method appears to be robust
– Bayesian network works well with plenty of training
data, but not very well with little training data
– The cosine similarity method works well with little
training data
Summary of
Rating-based Methods (cont.)
• Efficiency
– Memory-based methods are slower than model-
based methods at prediction time
– Learning can be extremely slow for model-based
methods
Preference-based Methods
(Cohen et al. 99, Freund et al. 98)

• Motivation
– Explicit ratings are not always available, but
implicit orderings/preferences might be available
– Only relative ratings are meaningful, even when
ratings are available
– Combining preferences has other applications, e.g.,
• Merging results from different search engines
A Formal Model of Preferences

• Instances: O={o1,…, on}


• Ranking function: R: (U x) O x O → [0,1]
– R(u,v)=1 means u is strongly preferred to v
– R(u,v)=0 means v is strongly preferred to u
– R(u,v)=0.5 means no preference

• Feedback: F = {(u,v)}, u is preferred to v


• Minimize loss:
$L(R, F) = \frac{1}{|F|} \sum_{(u,v) \in F} \big(1 - R(u,v)\big)$
$R^* = \arg\min_{R \in H} L(R, F)$, where H is the hypothesis space
The Hypothesis Space H

• Without constraints on H, the loss is minimized by
any R that agrees with F
• An appropriate constraint for collaborative filtering:
$R_a(u,v) = \sum_{i \in U \setminus \{a\}} w_i R_i(u,v)$, with $\sum_i w_i = 1$
• Compare this with the memory-based rating formula:
$\hat{v}_{aj} = K \sum_{i=1}^{m} w(a,i)\, v_{ij}$,  $\hat{x}_{aj} = \hat{v}_{aj} + n_a$,  $K = 1 \Big/ \sum_{i=1}^{m} w(a,i)$
The Hedge Algorithm for
Combining Preferences
• Iterative updating of w1, w2, …, wn

• Initialization: wi is uniform

• Updating, with β ∈ [0,1]:
$w_i^{t+1} = \dfrac{w_i^t\, \beta^{L(R_i^t, F^t)}}{Z_t}$
• L = 0 => weight stays the same
• L is large => weight is decreased
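A minimal sketch of one Hedge round, assuming losses[i] = L(R_i^t, F^t) has already been computed for each expert on the current feedback:

```python
import numpy as np

def hedge_round(w, losses, beta=0.5):
    """w_i^{t+1} = w_i^t * beta^{L(R_i^t, F^t)} / Z_t, with beta in [0, 1]."""
    w = w * beta ** np.asarray(losses)
    return w / w.sum()   # Z_t is just the normalizer

# An expert with zero loss keeps its weight (up to normalization);
# an expert with large loss has its weight shrunk geometrically.
```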
Some Theoretical Results

• The cumulative loss of Ra will not be much
worse than that of the best ranking
expert/feature
• Preferences Ra => ordering ρ => ranking R
$L(R, F) \le \mathrm{DISAGREE}(\rho, R_a)/|F| + L(R_a, F)$
• Need to find the ordering ρ that minimizes the disagreement
• General case: NP-complete
A Greedy Ordering Algorithm
• Use a weighted graph to represent the preferences R
• For each node v, compute its potential value, i.e.,
outgoing weights minus incoming weights:
$\pi(v) = \sum_{u \in O} R(v,u) - \sum_{u \in O} R(u,v)$
• Rank the node with the highest potential value
above all others
• Remove this node and its edges; repeat


• At least half of the optimal agreement is guaranteed
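A minimal Python sketch of the greedy algorithm above, assuming R is a dict mapping ordered node pairs (u, v) to preference weights R(u, v):

```python
def greedy_order(nodes, R):
    remaining, ranking = set(nodes), []
    while remaining:
        # Potential: π(v) = Σ_u R(v, u) - Σ_u R(u, v), over remaining nodes.
        def potential(v):
            out_w = sum(R.get((v, u), 0.0) for u in remaining)
            in_w = sum(R.get((u, v), 0.0) for u in remaining)
            return out_w - in_w
        best = max(remaining, key=potential)
        ranking.append(best)      # rank the highest-potential node next
        remaining.remove(best)    # removing it also removes its edges
    return ranking
```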
Improvement

• Identify all the strongly connected


components
• Rank the components consistently with the
edges between them
• Rank the nodes within a component using
the basic greedy algorithm
Evaluation of Ordering Algorithms

• Measure: “weight coverage”


• Datasets = randomly generated small graphs
• Observations
– The basic greedy algorithm works better than a
random permutation baseline
– Improved version is generally better, but the
improvement is insignificant for large graphs
Metasearch Experiments
• Task: Known item search
– Search for an ML researcher’s homepage
– Search for a university homepage

• Search expert = variant of query


• Learn to merge results of all search experts
• Feedback
– Complete: known item preferred to all others
– Click data: known item preferred to all items ranked above it
• Leave-one-out testing
Metasearch Results
• Measures: compare combined preferences with each
individual ranking function
– Sign test: to see which system tends to rank the
known relevant article higher
– #queries with the known relevant item ranked above rank k
– Average rank of the known relevant item
• The learned system beats each individual expert by
all measures (not surprising, why?)
Metasearch Results (cont.)
Direct Learning of an
Ordering Function
• Each expert is treated as a ranking feature $f_i: O \to \mathbb{R} \cup \{\perp\}$
($\perp$ = unranked, allowing partial rankings)
• Given preference feedback $\Phi: X \times X \to \mathbb{R}$
• Goal: learn H that minimizes the ranking loss
$\mathrm{rloss}_D(H) = \sum_{x_0, x_1} D(x_0, x_1)\, [\![ H(x_1) \le H(x_0) ]\!] = \Pr_{(x_0, x_1) \sim D}[\, H(x_1) \le H(x_0)\, ]$
• $D(x_0, x_1)$: a distribution over X x X (actually a uniform
distribution over pairs with feedback order), $D(x_0, x_1) = c \cdot \max\{0, \Phi(x_0, x_1)\}$
The RankBoost Algorithm
• Iterative updating of D(x0, x1)
• Initialization: D1 = D
• For t = 1, …, T:
– Train a weak learner using Dt
– Get a weak hypothesis ht: X → R
– Choose αt > 0
– Update:
$D_{t+1}(x_0, x_1) = \dfrac{D_t(x_0, x_1)\, e^{\alpha_t (h_t(x_0) - h_t(x_1))}}{Z_t}$
• Final hypothesis:
$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$
How to Choose αt and Design ht?
• Bound on the ranking loss:
$\mathrm{rloss}_D(H) \le \prod_{t=1}^{T} Z_t$
• Thus, we should choose αt to minimize the bound
• Three approaches:
– Numerical search
– Special case: h takes values in {0, 1}
– Approximate Z, then find an analytic solution
Efficient RankBoost for
Bipartite Feedback
• Bipartite feedback: every preference pair (x0, x1) has
x0 ∈ X0 and x1 ∈ X1 (essentially binary classification)
• The pair distribution factors into per-example weights:
$D_t(x_0, x_1) = v_t(x_0)\, v_t(x_1)$,  $Z_t = Z_t^0 Z_t^1$
$v_{t+1}(x_0) = \dfrac{v_t(x_0)\, e^{\alpha_t h_t(x_0)}}{Z_t^0}$,  $v_{t+1}(x_1) = \dfrac{v_t(x_1)\, e^{-\alpha_t h_t(x_1)}}{Z_t^1}$
• Complexity at each round: O(|X0||X1|) → O(|X0| + |X1|)
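A minimal sketch of the factored update: one weight per example instead of one per pair, which is what gives the O(|X0| + |X1|) cost per round:

```python
import math

def bipartite_round(v0, v1, h, alpha):
    """v0, v1: per-example weight dicts for X0 and X1; h: current weak hypothesis."""
    # v_{t+1}(x0) ∝ v_t(x0) * e^{+alpha*h(x0)};  v_{t+1}(x1) ∝ v_t(x1) * e^{-alpha*h(x1)}
    v0 = {x: w * math.exp(alpha * h(x)) for x, w in v0.items()}
    v1 = {x: w * math.exp(-alpha * h(x)) for x, w in v1.items()}
    z0, z1 = sum(v0.values()), sum(v1.values())    # Z_t = Z_t^0 * Z_t^1
    return ({x: w / z0 for x, w in v0.items()},
            {x: w / z1 for x, w in v1.items()})
```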


Evaluation of RankBoost

• Meta-search: same as in (Cohen et al. 99)


• Perfect feedback
• 4-fold cross validation
EachMovie Evaluation

[Table: # users, # movies per user, # feedback movies]
Performance Comparison
Cohen et al. 99 vs. Freund et al. 99
Summary
• CF is “easy”
– The user’s expectation is low
– Any recommendation is better than none
– Making it practically useful
• CF is “hard”
– Data sparseness
– Scalability
– Domain-dependent
Summary (cont.)
• CF as a Learning Task
– Rating-based formulation
• Learn f: U x O -> R
• Algorithms
– Instance-based/memory-based (k-nearest neighbors)
– Model-based (probabilistic clustering)

– Preference-based formulation
• Learn PREF: U x O x O -> R
• Algorithms
– General preference combination (Hedge), greedy ordering
– Efficient restricted preference combination (RankBoost)
Summary (cont.)

• Evaluation
– Rating-based methods
• Simple methods seem to be reasonably effective
• Advantage of sophisticated methods seems to be limited
– Preference-based methods
• More effective than rating-based methods according to
one evaluation
• Evaluation on meta-search is weak
Research Directions
• Exploiting complete information
– CF + content-based filtering + domain knowledge +
user model …
• More “localized” kernels for instance-based
methods
– Predicting movies needs different “neighbor users” than
predicting books
– Suggestion: use items similar to the target item as
features when finding neighbors
Research Directions (cont.)

• Modeling time
– There might be sequential patterns on the items a user
purchased (e.g., bread machine -> bread machine mix)
• Probabilistic model of preferences
– Making the preference function a probability function, e.g.,
P(A>B|U)
– Clustering items and users
– Minimizing preference disagreements
References
• Cohen, W.W., Schapire, R.E., and Singer, Y. (1999). "Learning to Order Things",
Journal of AI Research, Volume 10, pages 243-270.
• Freund, Y., Iyer, R., Schapire, R.E., and Singer, Y. (1999). "An Efficient Boosting
Algorithm for Combining Preferences", Machine Learning Journal.
• Breese, J.S., Heckerman, D., and Kadie, C. (1998). "Empirical Analysis of Predictive
Algorithms for Collaborative Filtering", In Proceedings of the 14th Conference on
Uncertainty in Artificial Intelligence, pp. 43-52.
• Popescul, A. and Ungar, L.H. (2001). "Probabilistic Models for Unified Collaborative
and Content-Based Recommendation in Sparse-Data Environments", UAI 2001.
• Good, N., Schafer, J.B., Konstan, J., Borchers, A., Sarwar, B., Herlocker, J., and
Riedl, J. (1999). "Combining Collaborative Filtering with Personal Agents for Better
Recommendations", In Proceedings of AAAI-99, pp. 439-446.
