Wikipedia Participation Challenge Solution

Keith T. Herring
User Name: Ernest Shackleton
Team: Ernest Shackleton
October 1, 2011
Abstract
This document describes my (Keith Herring) solution to the Wikipedia Participation Challenge. I can be reached at keith.herring@gmail.com or kherring@mit.edu. I appreciate the important contributions Wikipedia has made to the accessibility of information, so I hope this analysis will be useful to their cause. Thanks.
Contents
1 File List
2 Raw Data
2.1 Namespace Classifier Bug
3 Training Set Construction
4 Feature Construction
5 Sample Bias Correction
6 Model Learning/Training
6.1 Standard Random Forest Learning
6.2 Future Edits Learning Algorithm
7 Conclusions and Interpretation
File List
1. sample construction.py (author Keith Herring): A Python script for initializing the training samples from the data set of raw user edit histories. It initializes each sample as a time-limited record of a single user's edit history.
2. feature construction.py (author Keith Herring): A Python script for converting the raw edit history of a time-limited user sample into the 206-element feature vector that the edits predictor operates on.
3. random forest (GPL licensed): Matlab CMEX implementation of the Breiman-Cutler Random Forest Regression Algorithm.
4. edits learner.m (author Keith Herring): Matlab-implemented algorithm for learning a suite of weak-to-strong future-edit models.
5. ensemble optimizer.m (author Keith Herring): Matlab-implemented algorithm for finding the optimal model weights for the ensemble future edits predictor.
6. edits predictor.m (author Keith Herring): Matlab-implemented algorithm that predicts the future edits for a user as a function of its associated edit-history-derived 206-element feature vector.
7. models.mat (author Keith Herring): Matlab data file containing the 34 decision tree models in the final ensemble.
8. training statistics.mat (author Keith Herring): Matlab data file containing the training population means and standard deviations for the 206 features.
9. ensemble weights.mat (author Keith Herring): Matlab data file containing the weights for the 34 models in the final ensemble.
Raw Data
An interesting aspect of this challenge was that it involved public data. As such there was an opportunity to improve one's learning capability by obtaining additional data not included in the base data set (training.tsv). Given this setup, the first step of my solution was to write a web scraper for obtaining additional pre-Sept 1, 2010 (denoted in timestamp format 2010-09-01 in subsequent text) data for model training. More specifically, I wanted to obtain a larger, more representative sample of user edit histories and also additional fields not included in the original training set. In total I gathered pre-2010-09-01 edit histories for approximately 1 million Wikipedia editors. The following attributes were scraped for each user:
1. Blocked Timestamp: The timestamp at which a user was blocked. NULL if not blocked or blocked after Aug 31, 2010.
2. Pre-2010-09-01 Edits: For each pre-2010-09-01 user edit the following attributes were scraped:
(a) Edit ID
(b) Timestamp
(c) Article Title: The title of the article edited.
(d) Namespace: 0-5. All other namespaces were discarded, although an interesting extension would be to include the > 5 namespaces to test if they provide useful information on the editing volume over lower namespaces.
(e) New Flag: Whether or not the edit created a new article.
(f) Minor Flag: Whether or not the edit was marked as minor by the user.
(g) Comment Length: The length of the edit comment left by the user.
(h) Comment Auto Flag: Whether the comment was automatically generated. I defined this as comments that contained particular tags associated with several automated services, e.g. mw-redirect.
2.1 Namespace Classifier Bug
I noted during this scraping process a bug in the original training data. Specifically, articles whose title started with a namespace keyword were incorrectly classified as being in that namespace, because the regexp wasn't checking for the colon. As a result, some namespace 0-5 edits were treated as namespace > 5, and thus left out of the training set. I'm not sure if this bug was introduced during the construction of the archival dumps or the training set itself.
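The kind of bug described above can be illustrated with a minimal sketch. The regexp actually used to build the training data is not known to me, so both patterns below are hypothetical, and the namespace table is only a small subset of the standard MediaWiki prefixes chosen for illustration.

```python
import re

# Illustrative subset of MediaWiki namespace prefixes and their numeric ids.
NAMESPACE_IDS = {
    "Talk": 1, "User": 2, "User talk": 3,
    "Wikipedia": 4, "Wikipedia talk": 5,
    "File": 6, "Category": 14,
}

# Longest prefixes first so that "User talk" is tried before "User".
_PREFIXES = "|".join(
    re.escape(name) for name in sorted(NAMESPACE_IDS, key=len, reverse=True)
)

# Hypothetical buggy pattern: matches the keyword without requiring a colon.
BUGGY = re.compile(r"^(%s)" % _PREFIXES)
# Hypothetical fixed pattern: the keyword must be followed by a colon.
FIXED = re.compile(r"^(%s):" % _PREFIXES)


def classify(title, pattern):
    """Return the namespace id for a page title, defaulting to 0 (articles)."""
    match = pattern.match(title)
    return NAMESPACE_IDS[match.group(1)] if match else 0
```

With the colon-free pattern, an ordinary article such as "Category theory" is pushed into namespace 14 and would be discarded as > 5, while the colon-aware pattern keeps it in namespace 0.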
Training Set Construction
A single training sample can be constructed by considering a single user's edit history ending at any point before 2010-09-01. I used the following strategy for assembling a set of training samples from the raw data described above:
1. Start at an initial end date 153 days before 2010-09-01, i.e. April 1, 2010.
2. Create a sample from each user that has at least one edit during the year prior to the end date. This is to be consistent with the sampling strategy employed by Wikipedia in constructing the base training set.
3. Now move the end date back 30 days and repeat.
Repeating the above process for 32 offsets, I obtained a training set with approximately 10 million samples, i.e. time-limited user edit histories.
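The sampling steps above can be sketched as follows. This is not the contents of sample construction.py; the function names, the data layout (a dict of edit dates per user), and the reading of "32 offsets" as offsets 0 through 31 are my assumptions.

```python
from datetime import date, timedelta

CUTOFF = date(2010, 9, 1)   # end of the available data
PROJECTION_DAYS = 153       # length of the prediction window
STEP_DAYS = 30              # spacing between consecutive end dates
NUM_OFFSETS = 32            # assumed to mean offsets 0..31


def observation_end_dates():
    """End dates of the observation periods, latest first (April 1, 2010)."""
    first_end = CUTOFF - timedelta(days=PROJECTION_DAYS)
    return [first_end - timedelta(days=STEP_DAYS * k) for k in range(NUM_OFFSETS)]


def build_samples(edit_dates_by_user):
    """Build (user, end_date, history, future_edit_count) tuples.

    edit_dates_by_user maps a user id to a list of datetime.date objects,
    one per edit (hypothetical layout).
    """
    samples = []
    for end in observation_end_dates():
        year_start = end - timedelta(days=365)
        window_end = end + timedelta(days=PROJECTION_DAYS)
        for user, dates in edit_dates_by_user.items():
            history = [d for d in dates if d < end]
            # Keep only users with at least one edit in the year before `end`.
            if any(d >= year_start for d in history):
                future = sum(1 for d in dates if end <= d < window_end)
                samples.append((user, end, history, future))
    return samples
```

Each user can therefore contribute up to one sample per offset, which is how roughly 1 million editors can yield on the order of 10 million samples.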
Feature Construction
Given that a user's edit history is a multi-dimensional time series over continuous time, it is necessary for tractability to project it onto a lower-dimensional feature space, with the goal of retaining the relevant information in the time series with respect to future editing behavior. A priori my intuition was that many distinct features of a user's edit time series may play a role in influencing future edit behavior. My strategy then was to construct a large number of features to feed into my learning algorithm to ensure most information would be available to the learning process. My final solution operated on the following features, which I constructed from the raw edit data described above:
1. age: The number of days between a user's first edit and the end of the observation period (e.g. April 1, 2010 = X for offset 0, X - 30 days for offset 1, etc.). Both linear and log scale.
2. Edits X: The number of edits per day made by the user over the last X days of the observation period. Log and linear scale. Calculated for X = 1, 4, 7, 14, 28, 56, 112, 153, 365, 365*2, 365*4, 365*10 days.
3. Norm Edits X: The number of edits made by the user over the last min(age, X) days. This normalizes for the fact that some users were not editing over the entire X-day period.
4. Exponential Edits X: The number of edits made by the user weighted by a decaying exponential with a half-life of X days with respect to the end of the observation period. This feature weights more recent edits more heavily. Calculated for half-lives of 1, 2, 4, 8, 16, 32, 40, 50, 64, 80, 90, 110, 120, 128, 256, 512, 1024 days.
5. Norm Exponential Edits X: Same as above except normalized by the tail of the exponential before the user started editing.
6. Duration X: The number of days between the Xth most recent edit made by the user and the end of the observation period. For users with fewer than X edits, it is set to a large constant. Calculated for X = 1, 2, 3, 4, 5, 8, 10, 13, 16, 20, 25, 28, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144 edits.
7. 153 day time series: The number of edits made over 153-day intervals, starting at the final 153 days of the observation period and progressing backwards in 30-day increments for 15 increments.
8. Average Edit Time: The average edit timestamp (date and time) for the user.
9. Standard Deviation Edit Time: The standard deviation (days) around the mean edit time.
10. Sigma Edit Time: The number of standard deviations between the average edit time and the end of the observation period.
11. Unique Articles: The percent of articles edited by the user which were unique, i.e. what fraction were distinct articles. Also the total raw number of distinct articles, number per age, and time-windowed unique articles over the past 153 and 28 days.
12. Namespace: For each of namespaces 0-5, the percent of edits in that namespace. Also the total raw number, number per age, and time-windowed number over the past 153 and 28 days.
13. New Flag: The percent of edits that were new/original. Also the total raw number, number per age, and time-windowed number over the past 153 and 28 days.
14. Minor Flag: The percent of edits that were marked as minor. Also the total raw number, number per age, and time-windowed number over the past 153 and 28 days.
15. Comment: The percent of edits that had a comment. Also the total raw number, number per age, and time-windowed number over the past 153 and 28 days. Additionally the average comment length.
16. Auto Flag: The percent of edits that had auto-generated comments. Also the total raw number, number per age, and time-windowed number over the past 153 and 28 days.
17. Blocked Flag: Whether or not the user was blocked before the end of the observation period. April 1, 2010 or earlier for all training samples, August 31, 2010 or earlier for evaluation samples.
18. Blocked Days: The number of days before the end of the observation period the user was blocked, if they were.
19. Post Blocked Edits: The number of edits made by the user after they were blocked, but before the end of the observation period.
In total 206 features are used in the final model.
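Three of the time-based features above can be sketched as follows. This is not the code in feature construction.py: the input layout (a list of "days before the end of the observation period", one value per edit) and the value of `LARGE_DURATION` are my assumptions, since the text only says "a large constant".

```python
# Assumed placeholder; the source does not give the actual constant.
LARGE_DURATION = 1e6


def edits_per_day(days_ago, window):
    """Edits X: edits per day over the last `window` days of the period."""
    return sum(1 for d in days_ago if d < window) / window


def exponential_edits(days_ago, half_life):
    """Exponential Edits X: each edit weighted by 0.5 ** (days_ago / half_life)."""
    return sum(0.5 ** (d / half_life) for d in days_ago)


def duration(days_ago, x):
    """Duration X: days between the x-th most recent edit and the period end."""
    ordered = sorted(days_ago)  # most recent edit first
    return ordered[x - 1] if len(ordered) >= x else LARGE_DURATION
```

For example, a user with edits made 0 and 8 days before the end of the period has an exponentially weighted count of 1.5 at a half-life of 8 days: the older edit counts half as much as the newer one.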
Sample Bias Correction
After calculating the features for both the training and evaluation sets, the latter referring to the 44514 users provided by the competition, I compared the feature distributions over the two sets. This analysis revealed nontrivial differences between the two sample distributions, suggesting that the sample construction process I employed did not fully replicate the construction process used by the competition organizers. Figure 1 displays a comparison between the empirical distributions across the evaluation and training sets for two representative features. Each feature was binned into tens of value bins, such that the number of samples in each bin over each data set could be calculated. The x-axis of the figures represents the feature bin, where an increased index represents an increased feature value for the samples in that bin. The y-axis represents the number of evaluation set samples whose feature value is in the bin divided by the number of training set samples whose feature value is in the bin. If the sample construction mechanisms for the evaluation and training sets were exactly the same, then the evaluation-to-training ratio would be approximately constant as a function of feature value bin, for all features.
The starred curve corresponds to the empirically observed ratio for the original non-corrected training set as constructed according to the description above, i.e. random users selected such that they have at least one edit in the year leading up to the end of the observation period. Here we see that the distribution ratio varies significantly (a factor of 3-4) as a function of feature value. For example Figure 1b, which displays the total edits feature, shows that the training set has relatively more users who make a small number of edits over their observation period as compared to the evaluation set (about a 100-to-1 ratio), while it has relatively fewer users who make a larger number of edits (about a 25-to-1 ratio). This observation was consistent across features, e.g. the exponentially weighted edits feature in Figure 1a. This observation suggests that the evaluation set was not constructed by uniformly selecting from the universe of users, but rather with some bias towards more active users. It should be noted then that the models submitted in this competition, by myself and others, will be trained/learned with respect to this active-user-biased sample. If the Foundation wishes to have a model that operates on purely randomized users, with no such selection bias, then the chosen model(s) should be re-trained on a non-biased sample. However, it may be the case that there was motivation for this active-user-biased sampling strategy. I just want to point out that re-training may be a good idea depending on the Foundation's goals.
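The per-bin comparison described above can be sketched as follows. The bin edges and helper names are hypothetical; the source says only that each feature was split into tens of value bins.

```python
from bisect import bisect_right


def bin_counts(values, edges):
    """Count values per bin; bin i covers the half-open range [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_right(edges, v) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts


def evaluation_to_training_ratio(eval_values, train_values, edges):
    """Ratio of evaluation to training sample counts for each feature bin.

    Bins with no training samples yield None instead of dividing by zero.
    """
    eval_counts = bin_counts(eval_values, edges)
    train_counts = bin_counts(train_values, edges)
    return [e / t if t else None for e, t in zip(eval_counts, train_counts)]
```

If both sets were drawn by the same mechanism, the returned list would be roughly flat across bins; a ratio that rises with the feature value indicates the kind of active-user bias discussed here.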
Figure 1: Sample Correction. Both panels plot the evaluation-to-training ratio (y-axis) against the feature-value bin (x-axis, increasing feature value). Legend entries: 75%, 60%, 55%, 50%, 45%, 40%, 25%, 84%.
(a) Comparison between distribution of Exponential 50 feature across training and evaluation sets (x-axis: exponentially weighted edits, increasing).
(b) Comparison between distribution of Total Edits (Edits 3650) feature across training and evaluation sets (x-axis: total edits, increasing).

In order for the training set to be representative of the evaluation set population, this active-user bias had to be replicated. This can be done in many ways, e.g. unsupervised clustering in a subspace of the feature space. A more computationally tractable, yet generally less accurate, approach that I considered was to approach the bias from the output distribution of the training set. When the users were sampled uniformly I found that approximately 84 percent of them quit, i.e. had zero edits over their 153-day projection periods (note again these projection periods are all pre-Sept 1, 2010, since the latest (0 offset) begins April 1, 2010). The remaining training set output distribution (> 0 future edits) died off exponentially (on log scale) at a slower subsequent rate. One way to inject active-user bias into the training set then was to consider removing some fraction of the quitting users and observe how the feature distributions changed. The colored curves show the feature distribution comparison as a function of having quitting users make up X percent of the training population, for X at values between 25 and 75 percent. Here I found that removing quitting users from the training set such that they made up somewhere between 40 and 60 percent of the population caused the (at least one-dimensional) feature distributions to be approximately equivalent. The optimal percent X though did vary by up to 10-20 percent depending on the feature, centered roughly at X = 50 percent across features. I used this correction method to reduce the bias between the training and evaluation sets. After bias correction I was left with approximately 2.5 million training samples. Not surprisingly it wasn't completely accurate, as the observed model performance when applying cross-validation to the training set differed from the performance calculated by Kaggle on the evaluation set. This suggests that the two populations were still quite different, and that retraining my algorithm on the actual population of interest (be it fully randomized or the unknown competition method) will provide more representative results for that population.
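The correction described above amounts to discarding quitting users until they make up a chosen fraction of the training set. A minimal sketch, assuming each sample is a (features, future_edits) pair and that quitters are dropped uniformly at random (the source does not say how the removed users were chosen):

```python
import random


def downsample_quitters(samples, target_fraction, seed=0):
    """Drop quitters so they make up `target_fraction` of the result.

    samples: list of (features, future_edits) pairs (hypothetical layout).
    target_fraction must be in [0, 1). All active users are kept.
    """
    quitters = [s for s in samples if s[1] == 0]
    active = [s for s in samples if s[1] > 0]
    # Solve q / (q + a) = f for q, the number of quitters to keep.
    keep = round(target_fraction * len(active) / (1.0 - target_fraction))
    keep = min(keep, len(quitters))
    rng = random.Random(seed)
    return active + rng.sample(quitters, keep)
```

Starting from 84 percent quitters and targeting 50 percent, this keeps one quitter for every active user, so the training set shrinks to about a third of its original size.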
Model Learning/Training
Given the high-dimensional feature space of the problem, I decided to use a custom variant of the random forest regression algorithm for learning a future edits prediction model.
6.1 Standard Random Forest Learning
The standard random forest regression algorithm proceeds as follows:
1. Parameter Selection: Select some intuitive initial values for the random forest parameters. The main parameters of the random forest algorithm are as follows:
(a) Features per Node: The number of features to randomly choose from the feature universe at each node for making an optimal splitting decision at that node.
(b) Sample Size: The sample size (less than or equal to the fully available sample size) used to train each random decision tree.
(c) Maximum Leaf Size: Stopping condition on further node splitting. The lower this value, the more the tree overfits the data.
(d) T: The number of random decision trees to learn in the forest.
2. Train: Train T random decision trees using the selected parameter values. For each tree, record which samples were used to learn that tree (in-bag, IB) and which were left out (out-of-bag, OOB).
3. OOB Validation: For each tree make predictions on all of its associated OOB samples. Form the overall prediction as the mean prediction across all T trees.
4. Parameter Optimization: Repeat the above steps for new/tweaked parameter values until satisfied that a quasi-optimal set of parameters has been found.
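The in-bag/out-of-bag bookkeeping in steps 2 and 3 can be sketched with a toy example. To keep it short, each "tree" here is a one-split stump on a single feature, which is my stand-in and not the Breiman-Cutler tree learner; only the OOB aggregation logic is the point.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Toy stand-in for a decision tree: one split at the median of xs."""
    threshold = sorted(xs)[len(xs) // 2]
    left = [y for x, y in zip(xs, ys) if x < threshold] or ys
    right = [y for x, y in zip(xs, ys) if x >= threshold] or ys
    low, high = mean(left), mean(right)
    return lambda x: low if x < threshold else high


def oob_predictions(xs, ys, num_trees=50, sample_size=None, seed=0):
    """Mean prediction per sample over the trees for which it was out-of-bag."""
    rng = random.Random(seed)
    n = len(xs)
    sample_size = sample_size or n
    votes = [[] for _ in range(n)]
    for _ in range(num_trees):
        # Bootstrap: draw the in-bag indices with replacement.
        in_bag = [rng.randrange(n) for _ in range(sample_size)]
        tree = fit_stump([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        # Only samples the tree never saw contribute to validation.
        for i in set(range(n)) - set(in_bag):
            votes[i].append(tree(xs[i]))
    return [mean(v) if v else None for v in votes]
```

Comparing these OOB predictions with the true targets gives a validation error without a separate hold-out set, which is what makes the parameter search in step 4 cheap to repeat.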
6.2 Future Edits Learning Algorithm
The algorithm I wrote for learning future editing behavior proceeds as follows:
1. Random Parameter Random Trees: Rather than using the same parameter values across all trees in the forest, I instead constructed random decision trees one at a time. Each tree used random parameter values. In total approximately 3000 such trees were generated. The parameter values were randomly chosen for each tree independently as follows:
(a) Features per Node: Select a weak, medium, or strong learner with probabilities 65, 25, and 10 percent respectively, where:
i. A weak learner has numfeatures chosen uniformly at random between 1 and 10.
ii. A medium learner has numfeatures chosen uniformly at random between 1 and 50.
iii. A strong learner has numfeatures chosen uniformly at random between 1 and the total number of features (206).
(b) Sample Size: Randomly choose the sample size between 1 and 50 percent of the training data.
(c) Maximum Leaf Size: Randomly choose the maximum leaf size uniformly between the sample size divided by 50 (underfit extreme) and the sample size divided by 500000 (overfit extreme), with proper bounding between 1 and the sample size - 1.
I used an implementation of the Breiman-Cutler algorithm (GPL licensed) for building the individual trees.
2. Ensemble Optimization: Rather than weighting each tree/model equally, as is done in the standard algorithm, my algorithm takes into account the second-order performance statistics across models. Finding the optimal weights across models/trees can be formulated as the following quadratic optimization problem:
$$
w_{opt} = \arg\min_{w} \; w^{T} \Sigma\, w, \quad \text{s.t.} \quad \sum_{i=1}^{T} w_i = 1, \quad w_i \ge 0
$$
where $w_i$ represents the ensemble weight of the ith tree/model, and $\Sigma$ is the covariance matrix of the model/tree errors. Using $w_{opt}$ to weight the models in the ensemble takes into account the individual model performances and their correlation to one another, generally providing a better result.
3. Calculating the Error Covariance Matrix: When calculating the error covariance matrix it is important to only use the OOB samples for each model, otherwise you would be overfitting the data. Since the OOB samples vary by model, the covariance between any two models must be calculated on the intersection of the OOB samples from both models. To ensure that the resulting covariance matrix is positive definite, and hence the quadratic optimization problem above has a unique solution (follows from the matrix being hermitian), it is important that the number of intersecting OOB samples be statistically significant. This is why I restricted the sample size to be at most 50 percent of the entire training sample for each tree. This ensures on average 25 percent of the data will be available for calculating the covariances for the most extreme models (those using 50 percent).
4. Sequential Optimization: Since I learned thousands of models/trees, calculating the full error covariance matrix across all models is intractable. Instead I used the following heuristic for sequentially approximating the optimal weight vector $w_{opt}$:
(a) Let $w_{opt}^{i}$ denote the optimal weight vector at stage i of the algorithm, i.e. after having processed models $m_1, m_2, \ldots, m_i$, for $i = 1, 2, \ldots, T$.
(b) Let $S_i$ denote the subset of models (out of the first i) that have non-negligible weight in the corresponding weight vector $w_{opt}^{i}$. Here I used a threshold of $10^{-10}$ for classifying a weight as being negligible or not.
(c) At stage i initialize $S_i$ recursively as follows: $S_i = \{S_{i-1}, m_i\}$, i.e. add model i to the current set of non-negligibly weighted models.
(d) Next solve for $w_{opt}^{i}$ by using a quadratic solver for the quadratic optimization problem defined above using the error covariance matrix for the $|S_i|$ models in $S_i$. Again the individual covariances are calculated on the intersection of the OOB samples for each model.
(e) Throw out any models in $S_i$ that have negligible weight, i.e. below the $10^{-10}$ threshold.
(f) Increment i and repeat, stopping after i = T.
Using this heuristic I found that out of the approximately 3000 models, 34 had non-negligible weight, i.e. provided useful non-redundant prediction capability.
5. Final Model Execution: The final future edits predictor routine then operates on an input as follows:
(a) Calculate the prediction for each of the final models $m_1, m_2, \ldots, m_{34}$ on the associated 206-element feature vector of the user under consideration.
(b) Calculate the final ensemble prediction as a weighted sum of the individual model predictions, using the calculated optimal weight vector.
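Steps 2 through 4 above can be sketched in plain Python. This is not the code in ensemble optimizer.m: the quadratic solver here is a simple projected-gradient method that I substituted for whatever solver was actually used, the covariance is the population covariance on shared OOB samples, and the data layout (one dict per model mapping sample index to OOB error) is my assumption.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : sum(w) = 1, w >= 0}."""
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def solve_weights(cov, steps=2000):
    """Minimise w' cov w over the simplex by projected gradient descent."""
    n = len(cov)
    w = [1.0 / n] * n
    # The largest absolute row sum bounds the largest eigenvalue of cov.
    bound = max(sum(abs(c) for c in row) for row in cov) or 1.0
    step = 1.0 / (2.0 * bound)
    for _ in range(steps):
        grad = [2.0 * sum(cov[i][j] * w[j] for j in range(n)) for i in range(n)]
        w = project_to_simplex([w[i] - step * grad[i] for i in range(n)])
    return w


def error_covariance(err_a, err_b):
    """Covariance of two models' errors on their shared OOB samples.

    Assumes the two models share at least one OOB sample.
    """
    shared = sorted(set(err_a) & set(err_b))
    mean_a = sum(err_a[s] for s in shared) / len(shared)
    mean_b = sum(err_b[s] for s in shared) / len(shared)
    return sum((err_a[s] - mean_a) * (err_b[s] - mean_b) for s in shared) / len(shared)


def sequential_weights(errors, threshold=1e-10):
    """Add models one at a time, re-solve, and prune negligible weights."""
    kept, weights = [], []
    for m in range(len(errors)):
        kept.append(m)
        cov = [[error_covariance(errors[a], errors[b]) for b in kept] for a in kept]
        solved = solve_weights(cov)
        pairs = [(k, wk) for k, wk in zip(kept, solved) if wk > threshold]
        total = sum(wk for _, wk in pairs)
        kept = [k for k, _ in pairs]
        weights = [wk / total for _, wk in pairs]
    return dict(zip(kept, weights))
```

Because only the currently retained models enter each solve, the covariance matrix stays small even when thousands of candidate models are processed. A model whose errors are perfectly correlated with, and larger than, those of a retained model receives zero weight and is dropped.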
Conclusions and Interpretation
Finally, it is of most interest to understand which of the 206 features contained the most information with respect to predicting future editing behavior. A measure of a feature/variable's predictive capacity often used in decision trees is the permutation importance. The permutation importance of the kth feature is calculated on a given decision tree by first calculating the performance on the out-of-bag (OOB) samples. Next, performance is calculated on the same OOB samples with their kth feature value randomly permuted. The importance of that variable/feature with respect to the decision tree is then calculated as the difference between the two performances. A variable/feature with a large difference in performance signifies greater importance, since the random permutation of that feature caused a significant decrease in prediction performance. The ranking that follows lists the importances of the 206 features as averaged over the 34 trees in the ensemble predictor. Here we see that it was the timing and volume of a user's edits that played the most important role in predicting their future editing volume, i.e. exponentially weighted editing volumes. The attributes of the edits themselves, e.g. namespace, comments, etc., played a lesser role. It should be noted that this importance measure is an aggregate measure. Features such as whether the user was blocked before the end of the observation period were quite informative as expected; however, the percentage of blocked users is so small that its importance for the aggregate population prediction is also small.
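The permutation importance calculation described above can be sketched as follows. Mean squared error is used here as the performance measure purely for illustration (the source does not state which measure was used for this calculation), and `model` is any callable mapping a feature row to a prediction.

```python
import random


def mean_squared_error(model, rows, targets):
    """Average squared prediction error of `model` over the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, oob_rows, oob_targets, k, seed=0):
    """Increase in OOB error after shuffling feature k across the OOB samples.

    oob_rows is a list of feature lists; the originals are left unmodified.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(model, oob_rows, oob_targets)
    column = [row[k] for row in oob_rows]
    rng.shuffle(column)
    permuted = [row[:k] + [v] + row[k + 1:] for row, v in zip(oob_rows, column)]
    return mean_squared_error(model, permuted, oob_targets) - baseline
```

A feature the model never uses has an importance of exactly zero under this measure, since shuffling it leaves every prediction unchanged. Averaging the per-tree values over all trees in the ensemble gives an aggregate ranking of the kind reported here.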
Feature Number  Feature Name                 Rank  Importance (fraction of top)
69              edits exponential 80         1     1
68              norm edits exponential 64    2     0.98548
76              norm edits exponential 120   3     0.74835
97              duration 32                  4     0.49516
70              norm edits exponential 80    5     0.45831
67              edits exponential 64         6     0.40019
65              edits exponential 50         7     0.3745
59              edits exponential 16         8     0.34139
35              edits 365                    9     0.30208
72              norm edits exponential 90    10    0.29964
98              duration 64                  11    0.29766
64              norm edits exponential 40    12    0.29487
78              norm edits exponential 128   13    0.28544
57              edits exponential 8          14    0.28308
36              edits log 365                15    0.26396
24              edits log 56                 16    0.25545
145             sigma edit time              17    0.25296
75              edits exponential 120        18    0.20396
61              edits exponential 32         19    0.19681
66              norm edits exponential 50    20    0.19181
73              edits exponential 110        21    0.19048
62              norm edits exponential 32    22    0.17627
63              edits exponential 40         23    0.17168
79              edits exponential 256        24    0.16852
133             153days 11                   25    0.16272
147             total unique articles        26    0.16237
82              norm edits exponential 512   27    0.15909
95              duration 25                  28    0.15465
89              duration 5                   29    0.14233
74              norm edits exponential 110   30    0.11969
150             window153 unique articles    31    0.109
71              edits exponential 90         32    0.10739
15              edits 14                     33    0.099643
149             window28 unique articles     34    0.079767
20              edits log 28                 35    0.073614
85              duration 1                   36    0.069926
60              norm edits exponential 16    37    0.068615
47              edits 3650                   38    0.064891
111             153days 0                    39    0.063671
112             153days log 0                40    0.061862
91              duration 10                  41    0.055087
144             std edit time log            42    0.053374
100             duration 256                 43    0.051915
99              duration 128                 44    0.050096
77              edits exponential 128        45    0.046112
143             std edit time                46    0.045415
4               edits log 1                  47    0.039679
81              edits exponential 512        48    0.034697
53              edits exponential 2          49    0.034205
155             window153 namespace 0        50    0.034045
33              norm edits 153               51    0.033259
19              edits 28                     52    0.031825
86              duration 2                   53    0.028715
39              edits 730                    54    0.028537
25              norm edits 56                55    0.028188
16              edits log 14                 56    0.02702
203             blocked days                 57    0.024489
164             window28 namespace 2         58    0.024421
140             153days log 14               59    0.023819
90              duration 8                   60    0.021907
22              norm edits log 28            61    0.021671
92              duration 13                  62    0.020713
58              norm edits exponential 8     63    0.020418
48              edits log 3650               64    0.020053
204             blocked days log             65    0.019793
205             blocked flag                 66    0.01966
27              edits 112                    67    0.019489
34              norm edits log 153           68    0.018519
80              norm edits exponential 256   69    0.018364
87              duration 3                   70    0.018032
51              edits exponential 1          71    0.017924
30              norm edits log 112           72    0.017751
190             window153 minor              73    0.017081
52              norm edits exponential 1     74    0.016862
54              norm edits exponential 2     75    0.015046
29              norm edits 112               76    0.014862
93              duration 16                  77    0.014545
88              duration 4                   78    0.013943
122             153days log 5                79    0.012995
138             153days log 13               80    0.012546
21              norm edits 28                81    0.012461
55              edits exponential 4          82    0.012336
46              norm edits log 1460          83    0.012198
178             norm namespace 5             84    0.012053
96              duration 28                  85    0.011975
153             norm namespace 0             86    0.011591
84              norm edits exponential 1024  87    0.01137
17              norm edits 14                88    0.010469
56              norm edits exponential 4     89    0.010286
13              norm edits 7                 90    0.010156
148             norm unique articles         91    0.0093989
152             total namespace 0            92    0.0086399
168             norm namespace 3             93    0.0085046
146             percent unique articles      94    0.0083759
94              duration 20                  95    0.0083699
2               age log                      96    0.007332
14              norm edits log 7             97    0.0073051
11              edits 7                      98    0.0072385
114             153days log 1                99    0.0069303
44              edits log 1460               100   0.0067639
126             153days log 7                101   0.0065288
200             norm auto                    102   0.0064827
41              norm edits 730               103   0.0063822
50              norm edits log 3650          104   0.0063366
192             total comment                105   0.0062078
40              edits log 730                106   0.0061951
9               norm edits 4                 107   0.0061891
121             153days 5                    108   0.0061177
118             153days log 3                109   0.0060612
42              norm edits log 730           110   0.0060261
158             norm namespace 1             111   0.0059127
1               age                          112   0.005839
119             153days 4                    113   0.0058199
37              norm edits 365               114   0.0057204
183             norm new                     115   0.0055757
83              edits exponential 1024       116   0.005547
188             norm minor                   117   0.00541
193             norm comment                 118   0.005227
124             153days log 6                119   0.0052228
31              edits 153                    120   0.0051197
141             avg edit time                121   0.005059
173             norm namespace 4             122   0.0050337
123             153days 6                    123   0.0050324
113             153days 1                    124   0.0049575
163             norm namespace 2             125   0.0049218
131             153days 10                   126   0.0048792
154             window28 namespace 0         127   0.0047579
49              norm edits 3650              128   0.0046498
127             153days 8                    129   0.004594
38              norm edits log 365           130   0.0045254
196             avg comment                  131   0.0044674
28              edits log 112                132   0.0044465
125             153days 7                    133   0.0043117
116             153days log 2                134   0.0041826
45              norm edits 1460              135   0.0041289
117             153days 3                    136   0.0040325
128             153days log 8                137   0.0039511
23              edits 56                     138   0.003911
142             avg edit time log            139   0.0038954
120             153days log 4                140   0.0038617
26              norm edits log 56            141   0.0037719
43              edits 1460                   142   0.0036661
8               edits log 4                  143   0.0036119
197             avg nonzero comment          144   0.0035716
187             total minor                  145   0.0035051
12              edits log 7                  146   0.0034155
115             153days 2                    147   0.003268
135             153days 12                   148   0.0032389
136             153days log 12               149   0.0030958
191             percent comment              150   0.0028546
198             percent auto                 151   0.0026283
130             153days log 9                152   0.002595
181             percent new                  153   0.0024734
151             percent namespace 0          154   0.0024299
7               edits 4                      155   0.0024043
166             percent namespace 3          156   0.0023494
10              norm edits log 4             157   0.0022995
156             percent namespace 1          158   0.0021607
195             window153 comment            159   0.0021395
186             percent minor                160   0.0021182
137             153days 13                   161   0.0021109
161             percent namespace 2          162   0.0021044
172             total namespace 4            163   0.0020987
32              edits log 153                164   0.0020983
18              norm edits log 14            165   0.0020137
171             percent namespace 4          166   0.0020059
157             total namespace 1            167   0.0017939
194             window28 comment             168   0.0017928
182             total new                    169   0.0017241
199             total auto                   170   0.0016079
160             window153 namespace 1        171   0.0016045
139             153days 14                   172   0.0014878
162             total namespace 2            173   0.0014769
129             153days 9                    174   0.0014484
132             153days log 10               175   0.0014318
167             total namespace 3            176   0.0014271
134             153days log 11               177   0.0014268
101             duration 512                 178   0.0013424
6               norm edits log 1             179   0.0012506
165             window153 namespace 2        180   0.0011808
5               norm edits 1                 181   0.0011607
202             window153 auto               182   0.0010824
170             window153 namespace 3        183   0.0010579
176             percent namespace 5          184   0.00099416
3               edits 1                      185   0.00089403
159             window28 namespace 1         186   0.00088
185             window153 new                187   0.00084931
177             total namespace 5            188   0.00083623
189             window28 minor               189   0.00079185
175             window153 namespace 4        190   0.00072526
102             duration 1024                191   0.00066113
104             duration 4096                192   0.00063444
105             duration 8192                193   0.00050265
201             window28 auto                194   0.00048325
103             duration 2048                195   0.00037847
174             window28 namespace 4         196   0.00036964
184             window28 new                 197   0.00036793
180             window153 namespace 5        198   0.00036601
169             window28 namespace 3         199   0.00036482
206             post blocked edits           200   0.0003116
179             window28 namespace 5         201   0.00015836
107             duration 32768               202   0.00011703
106             duration 16384               203   0.0001093
109             duration 131072              204   8.6258e-05
110             duration 262144              205   9.7083e-06
108             duration 65536               206   2.7508e-06
