
In practice we often map our data to very high-dimensional spaces using kernels. Certain kernels, like the RBF kernel, even map data to an infinite-dimensional space. As you might expect, explicitly computing such a high-dimensional representation would be computationally daunting. The kernel trick lets us bypass this computation entirely: instead of mapping each point and taking dot products in the mapped space, we evaluate the kernel function directly on pairs of points in the original space. Kernel functions are almost never used without the kernel trick.
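
For intuition, here is a minimal sketch of the trick (a made-up toy example; it uses a degree-2 polynomial kernel rather than RBF, since the RBF feature map is infinite-dimensional): evaluating the kernel on the original 2-D vectors gives exactly the dot product of the explicitly mapped 3-D vectors.

import numpy as np

def phi(x):
    # explicit degree-2 feature map for 2-D input: [x1^2, x2^2, sqrt(2)*x1*x2]
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
explicit = phi(x) @ phi(z)   # dot product in the mapped space
trick = (x @ z) ** 2         # kernel k(x, z) = (x . z)^2, no mapping needed
print(explicit, trick)       # both are 121.0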

Recommender systems typically produce a list of recommendations in one of two ways: through collaborative or content-based filtering.[5] Collaborative filtering approaches build a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may have an interest in.[6] Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties.[7] These approaches are often combined (see Hybrid Recommender Systems).
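
To make the collaborative idea concrete, here is a minimal, hypothetical sketch (the ratings matrix is made-up toy data): items are scored for a user by their similarity, in terms of who rated them, to items that user has already rated.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([[5, 4, 0, 1],   # rows = users, columns = items, 0 = unrated
                    [4, 5, 0, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]])
item_sim = cosine_similarity(ratings.T)   # item-item similarity from co-ratings
user = ratings[0]                         # recommend for user 0
scores = item_sim @ user                  # weight each item by similarity to items the user rated
scores[user > 0] = -np.inf                # ignore items the user already rated
print(np.argmax(scores))                  # index of the top recommended item
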
from sklearn import datasets, metrics
import pickle

digits = datasets.load_digits()
iris = datasets.load_iris()
X, y = iris.data, iris.target

# general for preprocessing-type stuff:
# fit_transform(X) - fits to X, then transforms X
# fit(X) - just fits to X

metrics.classification_report(y_true, y_pred)  # precision, recall, F1 score for each class
metrics.confusion_matrix(y_true, y_pred)
metrics.roc_auc_score(y_true, y_scores)  # y_scores = predicted probabilities or decision values

# Model persistence
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
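
scikit-learn models can also be persisted with joblib, which is usually more efficient for estimators holding large numpy arrays; a minimal sketch (the filename is arbitrary, and the import location depends on the sklearn version):

from sklearn.externals import joblib   # in newer versions: import joblib
joblib.dump(clf, 'clf.pkl')            # write the fitted model to disk
clf2 = joblib.load('clf.pkl')          # load it back later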

# Missing-value imputation: fill NaN entries with the column mean
from sklearn.preprocessing import Imputer

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(X)
X_imputed = imp.transform(X)

# Standardization: zero mean, unit variance per feature (fit on train, apply to test)
scaler = sklearn.preprocessing.StandardScaler().fit(X_train)
>>> scaler.mean_
array([ 1. ..., 0. ..., 0.33...])
>>> scaler.std_
array([ 0.81..., 0.81..., 1.24...])
>>> scaler.transform(X_test)
array([[ 0. ..., -1.22..., 1.33...],
       [ 1.22..., 0. ..., -0.26...],
       [-1.22..., 1.22..., -1.06...]])

# Categorical encoding: assumes all elements of input are integers
enc = sklearn.preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])
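# Reading the output: the fitted columns take 2, 3, and 4 distinct values,
# so the encoding has 2 + 3 + 4 = 9 columns; [0, 1, 3] maps to [1,0 | 0,1,0 | 0,0,0,1]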

# DictVectorizer()
>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Francisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
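# Note: the string-valued 'city' feature is expanded into one boolean column per
# value, while the numeric 'temperature' feature is passed through unchanged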

# Tree-based feature importances (can be used for feature selection)
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> clf = ExtraTreesClassifier()
>>> X_new = clf.fit(X, y).transform(X)   # older sklearn API: keeps features above mean importance
>>> clf.feature_importances_
array([ 0.04..., 0.05..., 0.4..., 0.4...])
clf = sklearn.ensemble.RandomForestClassifier()   # RandomForestClassifier exposes feature_importances_ too
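
In newer scikit-learn versions the estimator's own transform is deprecated for feature selection; a minimal sketch of the SelectFromModel equivalent, assuming the fitted clf from above:

from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(clf, prefit=True)   # wrap an already-fitted estimator
X_new = sfm.transform(X)                  # keep features above the importance threshold (mean by default)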

# Remove features with low variance (default threshold=0.0)
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
>>> selector = VarianceThreshold()
>>> X_new = selector.fit_transform(X)   # drops all features with variance <= threshold
# Feature selection inside a pipeline: the l1-penalized LinearSVC keeps only
# features with non-zero weights, then a random forest trains on them
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
clf = Pipeline([
    ('feature_selection', LinearSVC(penalty="l1", dual=False)),
    ('classification', RandomForestClassifier())
])
clf.fit(X, y)

# Train/test split (newer sklearn: from sklearn.model_selection import train_test_split)
from sklearn.cross_validation import train_test_split

a_train, a_test, b_train, b_test = train_test_split(X, y, test_size=0.33, random_state=42)
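
For a quick estimate of generalization performance without a single fixed split, cross_val_score can be used as well; a minimal sketch (clf and cv=5 are just illustrative):

from sklearn.cross_validation import cross_val_score   # newer sklearn: sklearn.model_selection
scores = cross_val_score(clf, X, y, cv=5)   # score on each of 5 folds
print(scores.mean())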
