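Introduction to Machine Learning

School of Statistics, East China Normal University (华东师范大学统计学院)
Fujian Mobile Big Data and Artificial Intelligence Training
Python Machine Learning
Lecture 2: Introduction to Machine Learning
2018/7

Contents
1. Regression models
2. Unsupervised learning: k-means clustering
3. Supervised learning: KNN, Naive Bayes, LDA
4. From the perceptron to logistic regression and support vector machines
5. Evaluation metrics for classification problems
6. A simple reinforcement learning example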
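Regression models
Simple linear regression: Ordinary Least Squares
Gradient descent
Gradient descent is an optimization algorithm, often also called the method of steepest descent. It is one of the simplest and oldest methods for unconstrained optimization, and many effective algorithms are refinements of it. The search direction is the negative gradient. With a fixed learning rate, the steps shrink as the iterate approaches the optimum, because the gradient shrinks, so progress slows down.
Wikipedia's explanation:
If the objective function F(x) is defined and differentiable at a point a, then F(x) decreases fastest from a in the direction of the negative gradient −∇F(a),
where ∇ is the gradient operator:
$$\nabla = \left( \frac{\partial}{\partial x_1}, \frac{\partial}{\partial x_2}, \ldots, \frac{\partial}{\partial x_n} \right)^T$$
What is gradient descent?
Gradient descent can be used to solve the least squares problem; it is one of the older methods in optimization.
Starting from an initial point, it searches for a minimum along the negative gradient (the direction in which the function value decreases).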
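Loss function
$$J(w) = \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$
Gradient
$$\frac{\partial J}{\partial w_j} = -\sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}$$
Update rule
$$w := w - \eta \frac{\partial J}{\partial w}$$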
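The update rule can be checked on a toy problem. The sketch below is illustrative only: the synthetic data, the learning rate 0.01 and the 500 iterations are assumptions, not values from the lecture. It runs the rule w := w − η ∂J/∂w and compares the result with the closed-form least squares solution from `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2*x + 1 plus small noise
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=50)
X = np.column_stack([np.ones_like(x), x])  # x_0 = 1 carries the intercept

w = np.zeros(2)
eta = 0.01  # learning rate (assumed value)
for _ in range(500):
    y_hat = X @ w
    grad = -X.T @ (y - y_hat)  # dJ/dw_j = -sum_i (y_i - y_hat_i) * x_ij
    w = w - eta * grad         # w := w - eta * dJ/dw

# Closed-form least squares solution for comparison
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w, w_ols)
```

If the learning rate is too large for the data, the same loop diverges instead of converging, which is one reason the features are standardized before training later on.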
Implementation as a Python class
import numpy as np

class LinearRegressionGD(object):
    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta        # learning rate
        self.n_iter = n_iter  # number of iterations

    def fit(self, X, y):  # training function
        # self.w_ = np.zeros(1, 1 + X.shape[1])
        self.coef_ = np.zeros(shape=(1, X.shape[1]))  # coefficients to be trained, initialized to 0
        self.intercept_ = np.zeros(1)
        self.cost_ = []  # empty list that stores the loss of each iteration
        for i in range(self.n_iter):
            output = self.net_input(X)  # predicted y
            errors = y - output
            # update the coefficients by the update rule; exercise: why is this += rather than -= ?
            self.coef_ += self.eta * np.dot(errors.T, X)
            self.intercept_ += self.eta * errors.sum()  # bias update, equivalent to x_0 = 1
            cost = (errors**2).sum() / 2.0  # compute the loss
            self.cost_.append(cost)  # record the value of the loss function
        return self

    def net_input(self, X):  # predicted y, given X and the coefficients
        return np.dot(X, self.coef_.T) + self.intercept_

    def predict(self, X):
        return self.net_input(X)
Case study: the Boston housing data set
Variable  Definition
CRIM      per capita crime rate by town
ZN        proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS     proportion of non-retail business acres per town
CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX       nitric oxides concentration (parts per 10 million)
RM        average number of rooms per dwelling
AGE       proportion of owner-occupied units built prior to 1940
DIS       weighted distances to five Boston employment centres
RAD       index of accessibility to radial highways
TAX       full-value property-tax rate per $10,000
PTRATIO   pupil-teacher ratio by town
B         the proportion of blacks by town
LSTAT     % lower status of the population
MEDV      median value of owner-occupied homes
# read the data
import pandas as pd
df = pd.read_csv('./data/housing.csv')
print(df.head())
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \

## 0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0
## 1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0
## 2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0
## 3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0
## 4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0
##
## PTRATIO B LSTAT MEDV
## 0 15.3 396.90 4.98 24.0
## 1 17.8 396.90 9.14 21.6
## 2 17.8 392.83 4.03 34.7
## 3 18.7 394.63 2.94 33.4
## 4 18.7 396.90 5.33 36.2
The first step of a data analysis is exploratory data analysis (EDA): understanding the distribution of each variable and the relationships between variables.
import matplotlib.pyplot as plt

import seaborn as sns
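sns.set(style='whitegrid', context='notebook')  # set the style; restore with sns.reset_orig()
# MEDV is the target variable; for ease of demonstration, pick only 4 predictors
cols = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']
# scatterplot matrix: histograms of each variable on the diagonal,
# scatter plots of each pair of variables off the diagonal
sns.pairplot(df[cols], height=3)  # `height` was called `size` before seaborn 0.9
plt.show()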
Exploring the correlations between variables
# correlation matrix
import numpy as np
cm = np.corrcoef(df[cols].values.T)  # compute the correlation coefficients
sns.set(font_scale=1.5)
print(cm)
## [[ 1. 0.60379972 0.59087892 -0.61380827 -0.73766273]

## [ 0.60379972 1. 0.76365145 -0.39167585 -0.48372516]
## [ 0.59087892 0.76365145 1. -0.30218819 -0.42732077]
## [-0.61380827 -0.39167585 -0.30218819 1. 0.69535995]
## [-0.73766273 -0.48372516 -0.42732077 0.69535995 1. ]]
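# heat map of the correlation matrix
hm = sns.heatmap(cm, annot=True, square=True,fmt='.2f',annot_kws={'size': 11},yticklabels=cols,xt
plt.tight_layout()
We are interested in the variables most strongly correlated with MEDV: LSTAT has the largest absolute correlation (-0.74), followed by RM (0.70).
The scatterplot matrix shows, however, that MEDV is related nonlinearly to LSTAT and more nearly linearly to RM, so RM is used below to demonstrate simple linear regression.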
Simple linear regression with the Python class
# RM as the explanatory variable
X = df[['RM']].values
y = df[['MEDV']].values
# standardize the data
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
X_std = sc_x.fit_transform(X)
y_std = sc_y.fit_transform(y)
# linear regression model
lr = LinearRegressionGD()
lr.fit(X_std, y_std)  # feed in the data and train
# cost function
plt.plot(range(1, lr.n_iter+1), lr.cost_)
plt.ylabel('SSE')
plt.xlabel('Epoch')
plt.show()
Prediction with the linear regression model
# helper function for plotting
def lin_regplot(X, y, model):
    plt.scatter(X, y, c='lightblue')
    plt.plot(X, model.predict(X), color='red', linewidth=2)
    return None
# plot the fitted line
lin_regplot(X_std, y_std, lr)
plt.xlabel('Average number of rooms [RM] (standardized)')
plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
plt.show()
# slope and intercept of the fitted line
print('Slope: %.3f' % lr.coef_[0, 0])
## Slope: 0.695
print('Intercept: %.3f' % lr.intercept_[0])
## Intercept: -0.000
# predicted price when RM = 5
num_rooms_std = sc_x.transform([[5.0]])
price_std = lr.predict(num_rooms_std)
print("Price in $1000's: %.3f" % sc_y.inverse_transform(price_std)[0, 0])
## Price in $1000's: 10.840
Linear regression with scikit-learn
from sklearn.linear_model import LinearRegression
slr = LinearRegression()
slr.fit(X_std, y_std)
print('Slope: %.3f' % slr.coef_[0, 0])
## Slope: 0.695
print('Intercept: %.3f' % slr.intercept_[0])
lin_regplot(X_std, y_std, slr)
plt.xlabel('Average number of rooms [RM] (standardized)')
plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
plt.tight_layout()
# without standardization: regress directly on the raw data
slr.fit(X, y)
lin_regplot(X, y, slr)
plt.xlabel('Average number of rooms [RM]')
plt.ylabel('Price in $1000\'s [MEDV]')
plt.tight_layout()
slr = LinearRegression()
slr.fit(X, y)
print('Slope: %.3f' % slr.coef_[0, 0])
## Slope: 9.102
print('Intercept: %.3f' % slr.intercept_[0])
Once the standardization is undone, this result agrees closely with the one from the gradient descent class.
Question: when is standardization needed?
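The link between the standardized slope (0.695) and the raw slope (9.102) can be made explicit. The sketch below is a hedged illustration on synthetic stand-in data, because the housing file is not bundled here; the variable names and the generating coefficients are assumptions. It converts a fit on standardized data back to the original units using slope_raw = slope_std · sd(y) / sd(x).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for RM and MEDV (assumed values, not the real data)
rng = np.random.default_rng(0)
X = rng.normal(loc=6.0, scale=0.7, size=(100, 1))
y = 9.0 * X - 35.0 + rng.normal(scale=2.0, size=(100, 1))

sc_x, sc_y = StandardScaler(), StandardScaler()
X_std = sc_x.fit_transform(X)
y_std = sc_y.fit_transform(y)

raw = LinearRegression().fit(X, y)          # fit in original units
std = LinearRegression().fit(X_std, y_std)  # fit in standardized units

# Undo the standardization of the coefficients
slope_back = std.coef_[0, 0] * sc_y.scale_[0] / sc_x.scale_[0]
intercept_back = (sc_y.mean_[0] + sc_y.scale_[0] * std.intercept_[0]
                  - slope_back * sc_x.mean_[0])
print(slope_back, raw.coef_[0, 0])
print(intercept_back, raw.intercept_[0])
```

For ordinary least squares the two fits describe the same line, so standardization matters mainly for the optimizer (gradient descent with a fixed learning rate) rather than for the fitted model.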
Evaluating the performance of a linear regression model
from sklearn.model_selection import train_test_split
X = df.iloc[:, :-1].values
y = df['MEDV'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)  # split training/test data
slr.fit(X_train, y_train)
y_train_pred = slr.predict(X_train)  # predict directly from the fitted model
y_test_pred = slr.predict(X_test)
# residual plots are often used to check a regression model
plt.scatter(y_train_pred, y_train_pred - y_train, c='blue',marker='o', label='Training data')
plt.scatter(y_test_pred, y_test_pred - y_test, c='lightgreen',marker='s', label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc='upper left')
plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
plt.xlim([-10, 50])
plt.tight_layout()
If every prediction were exact, all residuals would be 0; that is the ideal case. In practice we want the errors to be randomly distributed.
In the residual plot, some errors lie far from the red line; outliers may be causing these large deviations.
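Another way to evaluate the model is the mean squared error (MSE), which is the average of the SSE:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$
R-squared is another important criterion: it is the proportion of the variance of y that the model explains. The higher it is, the better the fit.
$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{MSE}{Var(y)}$$
The quantities behind the coefficient of determination R-squared are:
SSE: sum of squares due to error
$$SSE = \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$
SSR: sum of squares of the regression
$$SSR = \sum_{i=1}^{n} \left( \hat{y}^{(i)} - \bar{y} \right)^2$$
SST: total sum of squares
$$SST = \sum_{i=1}^{n} \left( y^{(i)} - \bar{y} \right)^2$$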
Computing MSE and R-squared with scikit-learn
from sklearn.metrics import r2_score

from sklearn.metrics import mean_squared_error
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
## MSE train: 19.958, test: 27.196
print('R^2 train: %.3f, test: %.3f' % (

r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
## R^2 train: 0.765, test: 0.673
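To connect these scikit-learn calls with the definitions of SSE, SST and MSE, the following sketch computes the quantities by hand. The four-point arrays are made-up values used only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions (illustrative values only)
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

sse = np.sum((y_true - y_pred) ** 2)         # sum of squares due to error
sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
mse = sse / len(y_true)                      # mean squared error

r2_from_sums = 1.0 - sse / sst
r2_from_mse = 1.0 - mse / np.var(y_true)     # np.var divides by n, so Var(y) = SST / n

print(mse, mean_squared_error(y_true, y_pred))
print(r2_from_sums, r2_from_mse, r2_score(y_true, y_pred))
```

Both expressions for R-squared agree because dividing SSE and SST by the same n leaves their ratio unchanged.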
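Unsupervised learning: k-means clustering
The idea of the algorithm
k-means judges how close samples are by computing the distances between them; samples that are close to each other are placed in the same cluster.
First choose a value of k, the number of clusters we want. The choice of k strongly affects the result; it can be guided by the application, or made by comparing the clusterings obtained for different values of k.
Next choose the initial cluster centers (centroids). They are usually picked at random and then refined by repeated averaging.
Compute the distance from every point in the data set to each centroid and assign the point to the cluster of its nearest centroid. Then take the mean of each cluster as its new centroid. Repeat these two steps until convergence.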
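Characteristics of the algorithm
Advantages:
Simple in principle and easy to implement
Fairly fast
Scales reasonably well to large data sets
Drawbacks:
k must be specified by the user, and different values of k give markedly different results. One can first analyze the distribution of the data (for example its centroids and density) and then choose a suitable k.
Sensitive to the choice of the k initial centroids and prone to local minima; also sensitive to outliers.
Limited in scope: non-spherical clusters are hard to handle.
Convergence can be slow when the data set is large.
Applicable data type: numeric data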
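One common heuristic for comparing different values of k is to look at the within-cluster sum of squares (the "elbow" method). The sketch below is an illustration, not the lecture's own procedure: it uses scikit-learn's `KMeans` on the two petal features of Iris, and `n_init=10` restarts to reduce the sensitivity to the initial centroids mentioned above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data[:, 2:4]  # petal length and petal width

inertias = []
for k in range(1, 7):
    # n_init=10 runs k-means from 10 random initializations and keeps the best
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 2))
```

The inertia always drops as k grows, so the useful signal is where the drop flattens out rather than where the value is smallest.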
A basic implementation of k-means
from numpy import *
from sklearn import datasets
from sklearn.datasets import load_iris
from matplotlib import pyplot as plt

# Euclidean distance between two vectors
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))

# build the cluster centers: draw k random centroids
def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k,n)))
    # each centroid has n coordinates, and k centroids are needed in total
    for j in range(n):
        minJ = min(dataSet[:,j])
        maxJ = max(dataSet[:,j])
        rangeJ = float(maxJ - minJ)
        centroids[:,j] = minJ + rangeJ * random.rand(k, 1)
    return centroids
# k-means clustering
def kMeans(dataSet, k, distMeans=distEclud, createCent=randCent):
    m = shape(dataSet)[0]
    # clusterAssment records the assignment of each sample: column 0 holds the index
    # of its centroid, column 1 the squared distance to that centroid
    clusterAssment = mat(zeros((m,2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True  # tracks whether the clustering has converged
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # assign every data point to its nearest centroid
            minDist = inf; minIndex = -1
            for j in range(k):
                distJI = distMeans(centroids[j,:], dataSet[i,:])
                if distJI < minDist:
                    # centroid j is closer to point i, so assign i to j
                    minDist = distJI; minIndex = j
            # if any assignment changed, another pass is needed
            if clusterAssment[i,0] != minIndex: clusterChanged = True
            # store the assignment of point i
            clusterAssment[i,:] = minIndex,minDist**2
        print(centroids)
        for cent in range(k):  # recompute the centroids
            # all points whose first column equals cent
            ptsInClust = dataSet[nonzero(clusterAssment[:,0].A == cent)[0]]
            centroids[cent,:] = mean(ptsInClust, axis = 0)  # their mean is the new centroid
    return centroids, clusterAssment
def show(dataSet, k, centroids, clusterAssment):
    numSamples, dim = dataSet.shape
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    for i in range(numSamples):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i],markersize = 12)
# run the k-means implementation on test data
X=load_iris()
datMat=X.data[:, 2:4]
myCentroids,clustAssing = kMeans(datMat,3)
## [[ 3.96537164 0.31435365]
## [ 3.9369364 1.80994892]
## [ 1.40824723 1.16665528]]
## [[ 3.62857143 1. ]
## [ 5.00215054 1.72688172]
## [ 1.464 0.244 ]]
## [[ 3.90384615 1.19230769]
## [ 5.25810811 1.84594595]
## [ 1.464 0.244 ]]
## [[ 4.10512821 1.26666667]
## [ 5.41803279 1.93770492]
## [ 1.464 0.244 ]]
## [[ 4.19130435 1.30217391]
## [ 5.51481481 1.99444444]
## [ 1.464 0.244 ]]
## [[ 4.22083333 1.31041667]
## [ 5.53846154 2.01346154]
## [ 1.464 0.244 ]]
## [[ 4.25490196 1.33921569]
## [ 5.58367347 2.02653061]
## [ 1.464 0.244 ]]
## [[ 4.26923077 1.34230769]
print (myCentroids)
## [[ 4.26923077 1.34230769]
## [ 5.59583333 2.0375 ]
## [ 1.464 0.244 ]]
print (clustAssing)
## [[ 2. 0.006032 ]
## [ 2. 0.006032 ]
## [ 2. 0.028832 ]
## [ 2. 0.003232 ]
## [ 2. 0.006032 ]
## [ 2. 0.080032 ]
## [ 2. 0.007232 ]
## [ 2. 0.003232 ]
## [ 2. 0.006032 ]
## [ 2. 0.022032 ]
## [ 2. 0.003232 ]
## [ 2. 0.020432 ]
## [ 2. 0.024832 ]
## [ 2. 0.153232 ]
## [ 2. 0.071632 ]
## [ 2. 0.025632 ]
## [ 2. 0.051232 ]
## [ 2. 0.007232 ]
show(datMat, 3, myCentroids, clustAssing)
k-means with sklearn
from sklearn.cluster import KMeans
X=load_iris()
datMat=X.data[:, 2:4]
print(datMat.shape)  # inspect the shape of the data
# run k-means
estimator = KMeans(n_clusters=3)  # construct the clusterer
estimator.fit(datMat)  # cluster
label_pred = estimator.labels_  # get the cluster labels
# plot the k-means result
x0 = datMat[label_pred == 0]
x1 = datMat[label_pred == 1]
x2 = datMat[label_pred == 2]
plt.scatter(x0[:, 0], x0[:, 1], c = "red", marker='o',label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c = "green", marker='*',label='label1')
plt.scatter(x2[:, 0], x2[:, 1], c = "blue", marker='+',label='label2')
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend(loc=2)
plt.show()
Supervised learning: KNN, Naive Bayes, LDA
KNN with scikit-learn
from sklearn.neighbors import KNeighborsClassifier
# X_train_std, y_train, X_combined_std, y_combined and plot_decision_regions
# are defined in the scikit-learn perceptron example later in these notes
# use 5 neighbors; Minkowski distance with power parameter p:
# p=1 gives manhattan_distance, p=2 gives the Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined, classifier=knn, test_idx=range(105,150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
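From the perceptron to logistic regression and support vector machines
What is perceptron classification?
The perceptron is the simplest form of feedforward neural network and a binary linear classifier. It maps an input x (a real-valued vector) to an output value f(x) (a binary value):
$$f(x) = \begin{cases} +1 & \text{if } w \cdot x > b \\ -1 & \text{otherwise} \end{cases}$$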
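Learning algorithm
We first define some variables:
x(j) is the j-th component of the n-dimensional input vector
w(j) is the j-th component of the weight vector
f(x) is the output the neuron produces for the input x
To simplify further, we can set w(0) = −b and x(0) = 1.
The perceptron is trained by iterating over all training examples several times and updating the weights at each one.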
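Let D_m = {(x_1, y_1), …, (x_m, y_m)} be a training set of m training examples.
In each iteration the weight vector is updated as follows: for every pair (x, y) in D_m,
$$w(j) := w(j) + \alpha \, (y - f(x)) \, x(j), \quad j = 1, \ldots, n$$
where α is the learning rate, taking a value between 0 and 1.
Note that the weight vector therefore changes only when the output f(x) produced for a training example (x, y) differs from the expected output y.
The training set D_m is called linearly separable if there exist a positive constant γ and a weight vector w such that, for all i,
$$y_i \cdot (\langle w, x_i \rangle - b) > \gamma$$
If the training set is not linearly separable, however, the algorithm is not guaranteed to converge.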
Conditions for convergence:
1. The data are linearly separable
2. The learning rate is sufficiently small
Implementing the perceptron learning algorithm in Python
import numpy as np

class Perceptron(object):
    """Perceptron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Number of iterations

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting
    errors_ : list
        Number of misclassifications in each epoch
    """
    def __init__(self, eta, n_iter):
        self.eta = eta
        self.n_iter = n_iter  # number of epochs

    def fit(self, X, y):
        """Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training data
        y : array-like, shape = [n_samples]
            Target vector

        Returns
        -------
        self : object
        """
        # weights start at 0 and are updated repeatedly in the iterations below
        self.w_ = np.zeros(1 + X.shape[1])
        self.errors_ = []
        for _ in range(self.n_iter):
            errors = 0
            # update the weights once for every sample
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))  # (learning rate) * (error)
                self.w_[1:] += update * xi
                self.w_[0] += update
                errors += int(update != 0.0)
            self.errors_.append(errors)  # misclassifications in this epoch
        return self

    def net_input(self, X):
        """Calculate net input w*x"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.net_input(X) >= 0.0, 1, -1)
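Training a perceptron on the Iris data set
Only two species, Setosa and Versicolor, and two features, sepal length and petal length, are considered here.
The perceptron model can nevertheless handle multi-class classification; see one-vs-all.
Reading the data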
import pandas as pd
df = pd.read_csv('./data/iris.csv')
print(df.head())
## Sepal_Length Sepal_Width Petal_Length Petal_Width Species

## 0 5.1 3.5 1.4 0.2 setosa
## 1 4.9 3.0 1.4 0.2 setosa
## 2 4.7 3.2 1.3 0.2 setosa
## 3 4.6 3.1 1.5 0.2 setosa
## 4 5.0 3.6 1.4 0.2 setosa
Inspecting the data graphically
# visualize the two classes first
# select setosa and versicolor: 50 samples of each, with the class labels
# recoded as -1 and 1 to make plotting easier
y = df.iloc[0:100, 4].values
y = np.where(y == 'setosa', -1, 1)
# extract sepal length and petal length
X = df.iloc[0:100, [0, 2]].values
# draw the scatter plot
plt.scatter(X[:50, 0], X[:50, 1],color='red', marker='o', label='setosa')
plt.scatter(X[50:100, 0], X[50:100, 1],color='blue', marker='x', label='versicolor')
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.tight_layout()
Training the perceptron model
ppn = Perceptron(eta=0.1, n_iter=10)
ppn.fit(X, y)
ppn.errors_
# plot the number of errors per epoch and check whether it approaches 0 after a few rounds of updates
plt.plot(range(1, len(ppn.errors_) + 1), ppn.errors_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Number of misclassifications')
plt.tight_layout()
The plot shows that the number of errors does reach 0, so the algorithm converges and the classification works well.
A function for plotting the decision boundary
from matplotlib.colors import ListedColormap
# Colormap object generated from a list of colors.
def plot_decision_regions(X, y, classifier, resolution=0.02):
    # set up the marker generator and the color map with ListedColormap
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # determine the limits of the two axes
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1  # minimum - 1, maximum + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    # build a pair of grid arrays, flatten them, and predict on the grid
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),np.arange(x2_min, x2_max, resolu
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    # give each decision region its own color
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    # plot the sample points
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],alpha=0.8, c=cmap(idx),marker=markers[idx],
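Plotting the decision boundary
Using the trained perceptron, plot the decision boundary:
plot_decision_regions(X, y,classifier=ppn)
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.tight_layout()
Although the perceptron model performs well on the Iris example above, it does not necessarily perform well on other problems.
It can be proved mathematically that the perceptron learning rule converges on linearly separable data; on data that are not linearly separable it does not converge.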
Training a perceptron with scikit-learn
Loading and preprocessing the data
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
print('Class labels:', np.unique(y))
## Class labels: [0 1 2]
print(iris.target_names)
## ['setosa' 'versicolor' 'virginica']
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train) # standardize by mean & std
X_test_std = sc.transform(X_test)
from sklearn.linear_model import Perceptron
# sklearn ships a ready-made Perceptron class
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)  # max_iter was called n_iter in older scikit-learn releases
ppn.fit(X_train_std, y_train)
y_pred = ppn.predict(X_test_std)  # predict
print('Misclassified samples: %d' % (y_test != y_pred).sum())  # number of errors
## Misclassified samples: 4
from sklearn.metrics import accuracy_score
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))  # check the accuracy
## Accuracy: 0.91
Redefining the decision-boundary plotting function
from matplotlib.colors import ListedColormap
def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),np.arange(x2_min, x2_max, resolu
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    # plot all samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],alpha=0.8, c=cmap(idx),marker=markers[idx],
    # highlight test samples
    if test_idx:
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1], c='b', alpha=0.1, linewidth=1, marker='o', s=55,
Plotting the decision boundary
Using the standardized data
X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))
plot_decision_regions(X=X_combined_std, y=y_combined, classifier=ppn, test_idx=range(105,150))
plt.tight_layout()
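Logistic Regression
Logistic regression is also called log-odds regression (对数几率回归).
Let x_0, …, x_m be m + 1 mutually independent input features, and let
$$z(x) = w_0 x_0 + w_1 x_1 + \cdots + w_m x_m + b = w^T x + b$$
Writing the conditional probability as P(y = 1|x) = p, the logistic regression model is defined as
$$P(y = 1|x) = \phi(z) = \frac{1}{1 + e^{-z}}$$
It follows that
$$P(y = 0|x) = 1 - P(y = 1|x) = 1 - \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{z}}$$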
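Odds: the odds of an event are the ratio of the probability that it occurs to the probability that it does not. If p is the probability of the positive event, then
$$\frac{p}{1 - p} = \frac{P(y = 1|x)}{P(y = 0|x)} = e^{z}$$
Taking the logarithm of the odds defines the logit function (also called the log-odds function):
$$logit(p) = \ln \frac{p}{1 - p} = \ln \frac{P(y = 1|x)}{P(y = 0|x)} = \ln e^{z(x)} = w^T x + b$$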
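Seen as a function of z, the probability p is the inverse of the logit function; this inverse is called the sigmoid function:
$$\phi(z) = \frac{1}{1 + e^{-z}}$$
The output of the sigmoid function is the probability that a given sample belongs to class 1.
Properties of the model:
It models the class probability directly, so it also yields approximate probability predictions.
The sigmoid is differentiable to any order, and the negative log-likelihood built from it is a convex function of w.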
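The claim that the sigmoid inverts the logit is easy to confirm numerically. This is a small sketch; the helper name `logit` and the grid of z values are choices made here for illustration.

```python
import numpy as np

def sigmoid(z):
    # phi(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # log-odds: ln(p / (1 - p))
    return np.log(p / (1.0 - p))

z = np.linspace(-5, 5, 11)
p = sigmoid(z)

print(np.allclose(logit(p), z))            # logit undoes the sigmoid
print(np.allclose(p / (1 - p), np.exp(z))) # the odds equal e^z
```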
Plotting the sigmoid function:
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
z = np.arange(-7, 7, 0.1)  # plot from -7 to 7
phi_z = sigmoid(z)
plt.plot(z, phi_z)
plt.axvline(0.0, color='k')
plt.ylim(-0.1, 1.1)
plt.xlabel('z')
plt.ylabel(r'$\phi (z)$')
# y axis ticks and gridline
plt.yticks([0.0, 0.5, 1.0])
ax = plt.gca()
ax.yaxis.grid(True)
plt.tight_layout()
As z → ∞, ϕ(z) approaches 1; as z → −∞, it approaches 0.
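Loss function:
Suppose there are n observations with observed values y_1, …, y_n, and let p_i = P(y_i = 1|x_i). The probability of each observation is
$$P(y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$$
The parameters w are estimated by maximizing the log-likelihood ln L(w); equivalently, the logistic loss function is defined as the negative log-likelihood:
$$J(w) = -\ln L(w) = \sum_{i=1}^{n} \left( -y^{(i)} \ln \phi(z^{(i)}) - (1 - y^{(i)}) \ln\left(1 - \phi(z^{(i)})\right) \right)$$
For a single observation,
$$J(\phi(z), y; w) = \begin{cases} -\ln \phi(z) & \text{if } y = 1 \\ -\ln(1 - \phi(z)) & \text{if } y = 0 \end{cases}$$
The next step is to set the partial derivatives of ln L(w) to zero, which gives a system of m + 1 equations. These equations have no closed-form solution; since the loss is convex and differentiable, an iterative algorithm is used to approach the optimum step by step.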
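The iterative algorithm needs the gradient of the loss. Differentiating the negative log-likelihood gives ∂J/∂w_j = Σ_i (ϕ(z^(i)) − y^(i)) x_j^(i), a standard result that the slides do not spell out. The sketch below checks that formula against finite differences on random data; the data, the seed and the function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y):
    # negative log-likelihood J(w)
    phi = sigmoid(X @ w)
    return np.sum(-y * np.log(phi) - (1 - y) * np.log(1 - phi))

def grad(w, X, y):
    # analytic gradient: X^T (phi(z) - y)
    return X.T @ (sigmoid(X @ w) - y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])  # x_0 = 1 for the bias
y = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=3)

# central finite differences, one coordinate at a time
eps = 1e-6
num_grad = np.array([
    (cost(w + eps * e, X, y) - cost(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(num_grad)
print(grad(w, X, y))
```

A gradient descent step therefore adds η · Xᵀ(y − ϕ(z)) to the weights, which is the form of update used by the class implementation later in this lecture.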
Plotting the loss function
$$J(\phi(z), y; w) = \begin{cases} -\ln \phi(z) & \text{if } y = 1 \\ -\ln(1 - \phi(z)) & \text{if } y = 0 \end{cases}$$
def cost_1(z):
    return - np.log(sigmoid(z))
def cost_0(z):
    return - np.log(1 - sigmoid(z))
z = np.arange(-10, 10, 0.1)
phi_z = sigmoid(z)
c1 = [cost_1(x) for x in z]
plt.plot(phi_z, c1, label='J(w) if y=1')
c0 = [cost_0(x) for x in z]
plt.plot(phi_z, c0, linestyle='--', label='J(w) if y=0')
plt.ylim(0.0, 5.1)
plt.xlim([0, 1])
plt.xlabel(r'$\phi$(z)')
plt.ylabel('J(w)')
plt.legend(loc='best')
plt.tight_layout()
If the class is predicted correctly, the cost approaches 0 (by the definition of the loss function).
Implementing logistic regression in Python
class LogisticRegression(object):
    """LogisticRegression classifier.

    Parameters
    ------------
    eta : Learning rate (float between 0.0 and 1.0)
    n_iter : Passes over the training dataset.(int)

    Attributes
    -----------
    w_ : Weights after fitting.(1d-array)
    cost_ : Cost in every epoch.(list)
    """
    def __init__(self, eta, n_iter):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        """ Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object
        """
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            y_val = self.activation(X)
            errors = (y - y_val)
            neg_grad = X.T.dot(errors)
            self.w_[1:] += self.eta * neg_grad
            self.w_[0] += self.eta * errors.sum()
            self.cost_.append(self._logit_cost(y, self.activation(X)))
        return self

    def _logit_cost(self, y, y_val):
        logit = -y.dot(np.log(y_val)) - ((1 - y).dot(np.log(1 - y_val)))
        return logit

    def _sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """ Activate the logistic neuron"""
        z = self.net_input(X)
        return self._sigmoid(z)

    def predict_proba(self, X):
        """
        Predict class probabilities for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]

        Returns
        ----------
        Class 1 probability : float
        """
        return self.activation(X)

    def predict(self, X):
        """
        Predict class labels for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]

        Returns
        ----------
        class : int
            Predicted class label.
        """
        # equivalent to np.where(self.activation(X) >= 0.5, 1, 0)
        return np.where(self.net_input(X) >= 0.0, 1, 0)
# X, y are the setosa/versicolor subset (sepal length, petal length) from the perceptron example
# standardize the features before training
X_std = np.copy(X)
X_std[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std()
X_std[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std()
y[y == -1] = 0  # encode the negative label as 0
lr = LogisticRegression(n_iter=500, eta=0.02).fit(X_std, y)
plt.plot(range(1, len(lr.cost_) + 1), np.log10(lr.cost_))
plt.xlabel('Epochs')
plt.ylabel('Cost')
plt.title('Logistic Regression - Learning rate 0.02')
plt.tight_layout()
plt.show()
plot_decision_regions(X_std, y, classifier=lr)
plt.title('Logistic Regression - Gradient Descent')
plt.xlabel('sepal length [standardized]')
plt.ylabel('petal length [standardized]')
plt.tight_layout()
Logistic regression with scikit-learn
# use Logistic Regression
from sklearn.linear_model import LogisticRegression
# What is the C parameter? It is the inverse of the regularization strength:
# smaller values mean stronger regularization
lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(X_train, y_train)
lr.score(X_train, y_train)
lr.coef_
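The role of C can be seen by fitting the same model with several values and comparing the size of the coefficients. The sketch below is illustrative: the three C values and `max_iter=1000` are assumptions, and it relies on scikit-learn's default L2 penalty, under which a smaller C shrinks the weights more strongly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data[:, [2, 3]])  # petal length and width
y = iris.target

norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = np.linalg.norm(clf.coef_)  # overall size of the weight matrix

for C, norm in norms.items():
    print(C, round(norm, 3))
```

A large value such as C=1000.0 therefore means the model is fitted with almost no regularization.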
Maximum-margin classification with support vector machines
Perceptron: minimizes the number of misclassifications
Support vector machine: maximizes the margin
# train an SVC (support vector classifier)
from sklearn.svm import SVC
svm = SVC(kernel='linear', C=1.0, random_state=0)
svm.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined,
                      classifier=svm, test_idx=range(105,150))
plt.tight_layout()
Solving nonlinear problems with a kernel SVM
# create a simple data set that follows the XOR rule
np.random.seed(0)
X_xor = np.random.randn(200, 2)
# True when exactly one coordinate is positive: roughly 100 points per class
y_xor = np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 1, -1)
plt.scatter(X_xor[y_xor==1, 0], X_xor[y_xor==1, 1], c='b', marker='x', label='1')
plt.scatter(X_xor[y_xor==-1, 0], X_xor[y_xor==-1, 1], c='r', marker='s', label='-1')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.legend(loc='best')
plt.tight_layout()
$$\phi(x_1, x_2) = (z_1, z_2, z_3) = (x_1, x_2, x_1^2 + x_2^2)$$
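The effect of the mapping ϕ(x1, x2) = (x1, x2, x1² + x2²) can be demonstrated on data that are not linearly separable in two dimensions. The sketch below uses scikit-learn's `make_circles` (two concentric rings) as an assumed example data set, not one from the lecture, and compares a linear SVC before and after the mapping.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=0)

# Explicit feature map phi(x1, x2) = (x1, x2, x1^2 + x2^2)
Z = np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])

acc_2d = SVC(kernel='linear', C=1.0).fit(X, y).score(X, y)  # training accuracy in 2-D
acc_3d = SVC(kernel='linear', C=1.0).fit(Z, y).score(Z, y)  # training accuracy after mapping
print(acc_2d, acc_3d)
```

In the mapped space the third coordinate is the squared radius, so a plane at constant height separates the inner ring from the outer one.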
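When an SVM is trained by quadratic programming, the dot products x^{(i)T} x^{(j)} have to be replaced by \phi(x^{(i)})^T \phi(x^{(j)}). To achieve this, we define a kernel function.
RBF: radial basis function kernel, also called the Gaussian kernel
$$k(x^{(i)}, x^{(j)}) = \exp\left( -\frac{\| x^{(i)} - x^{(j)} \|^2}{2\sigma^2} \right)$$
which simplifies to
$$k(x^{(i)}, x^{(j)}) = \exp\left( -\gamma \| x^{(i)} - x^{(j)} \|^2 \right)$$
where \gamma = \frac{1}{2\sigma^2} is a parameter that can be tuned.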
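The kernel formula can be evaluated directly and compared with scikit-learn's `rbf_kernel`. In this sketch the three sample points and σ = 1.5 are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0]])
sigma = 1.5                       # assumed bandwidth
gamma = 1.0 / (2.0 * sigma ** 2)  # gamma = 1 / (2 sigma^2)

# squared Euclidean distance between every pair of rows
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_manual = np.exp(-gamma * sq_dists)

K_sklearn = rbf_kernel(X, gamma=gamma)
print(np.allclose(K_manual, K_sklearn))
```

A larger gamma (smaller σ) makes the kernel value fall off faster with distance, so each training point influences a smaller neighborhood.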
# use the SVM kernel method to project the data into a higher-dimensional space where they become linearly separable
svm = SVC(kernel='rbf', random_state=0, gamma=0.1, C=10.0)
svm.fit(X_xor, y_xor)
plot_decision_regions(X_xor, y_xor,
                      classifier=svm)
plt.tight_layout()
# smaller gamma
svm = SVC(kernel='rbf', random_state=0, gamma=0.2, C=1.0)
plt.tight_layout()
# very large gamma: tight decision boundary, overfitting
svm = SVC(kernel='rbf', random_state=0, gamma=10, C=1)
plt.tight_layout()
Evaluation metrics for classification problems
Evaluation metrics in scikit-learn
The sklearn.metrics module provides a wide range of evaluation metrics that can be called directly.
Some are restricted to binary classification:
matthews_corrcoef(y_true, y_pred)            Compute the Matthews correlation coefficient (MCC) for binary classes
precision_recall_curve(y_true, probas_pred)  Compute precision-recall pairs for different probability thresholds
roc_curve(y_true, y_score)                   Compute Receiver operating characteristic (ROC)
Some also suit multi-class tasks:
confusion_matrix(y_true, y_pred)             Compute confusion matrix to evaluate the accuracy of a classification
hinge_loss(y_true, pred_decision)            Average hinge loss (non-regularized)
Some also suit multi-label tasks:
accuracy_score(y_true, y_pred)               Accuracy classification score.
classification_report(y_true, y_pred)        Build a text report showing the main classification metrics
f1_score(y_true, y_pred)                     Compute the F1 score, also known as balanced F-score or F-measure
Others suit binary and multi-label tasks:
average_precision_score(y_true, y_score)     Compute average precision (AP) from prediction scores
roc_auc_score(y_true, y_score)               Compute Area Under the Curve (AUC) from prediction scores
Building a data set and fitting an SVM
X, y = datasets.make_classification(n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train) # standardize by mean & std
X_test_std = sc.transform(X_test)
model = SVC(probability=True, random_state=0)
model.fit(X_train_std, y_train);
Metric examples
The default score for classification tasks in sklearn is accuracy (the fraction of labels predicted correctly):
$$accuracy(y, \hat{y}) = \frac{1}{n} \sum_{i=0}^{n-1} 1(\hat{y}_i = y_i)$$
where 1(x) is the indicator function
model.score(X_test_std, y_test)
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test_std)
accuracy_score(y_test, y_pred)
Confusion matrix
from sklearn.metrics import confusion_matrix
y_test_pred = model.predict(X_test_std)
confmat = confusion_matrix(y_test, y_test_pred)
print(confmat)
## [[15 3]
## [ 2 10]]
plt.matshow(confusion_matrix(y_test, y_test_pred), cmap=plt.cm.Blues)
plt.colorbar()
plt.xlabel("Predicted label")
plt.ylabel("True label");
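Precision, recall and F-measures
Precision: of the samples predicted to be class A, how many really are class A
Recall: how many of the true positives are found by the prediction
F1-score: combines precision and recall as their harmonic mean
We write TP, FP, TN, FN, FPR and TPR for "true positive", "false positive", "true negative", "false negative", "false positive rate" and "true positive rate":
$$PRE = \frac{TP}{TP + FP} \tag{1}$$
$$REC = TPR = \frac{TP}{FN + TP} \tag{2}$$
$$F_1 = 2 \, \frac{PRE \times REC}{PRE + REC} \tag{3}$$
$$F_\beta = (1 + \beta^2) \, \frac{PRE \times REC}{\beta^2 \, PRE + REC} \tag{4}$$
$$FPR = \frac{FP}{FP + TN} \tag{5}$$
$$TPR = \frac{TP}{FN + TP} \tag{6}$$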
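Exercise:
80 normal customers + 20 defaulters
Prediction: 50 people default (including all 20 true defaulters)
Default is the positive class
Question: what are the accuracy, precision and recall?
TP = 20, FP = 30, FN = 0, TN = 50
accuracy: (20 + 50) / 100 = 0.7
precision: 20 / 50 = 0.4, not high
recall: 20 / 20 = 1
What if everyone is predicted to be normal?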
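The numbers in this exercise, and the follow-up question, can be worked out with a few lines of code. The helper `binary_metrics` below is a hypothetical name introduced for this sketch; it returns NaN when a ratio has a zero denominator.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from the four confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) > 0 else float('nan')
    recall = tp / (tp + fn) if (tp + fn) > 0 else float('nan')
    return accuracy, precision, recall

# 50 predicted defaulters, containing all 20 true defaulters
flag_50 = binary_metrics(tp=20, fp=30, fn=0, tn=50)
# everyone predicted normal: no positives are predicted at all
all_normal = binary_metrics(tp=0, fp=0, fn=20, tn=80)

print(flag_50)     # (0.7, 0.4, 1.0)
print(all_normal)  # accuracy 0.8, precision undefined (nan), recall 0.0
```

Predicting everyone as normal raises the accuracy to 0.8 while finding none of the defaulters, which shows why accuracy alone is misleading on imbalanced data.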
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

print('Precision: %.3f' % precision_score(y_true=y_test, y_pred=y_test_pred))
## Precision: 0.769
print('Recall: %.3f' % recall_score(y_true=y_test, y_pred=y_test_pred))
## Recall: 0.833
print('F1: %.3f' % f1_score(y_true=y_test, y_pred=y_test_pred))
## F1: 0.800
print('F_beta2: %.3f' % fbeta_score(y_true=y_test, y_pred=y_test_pred, beta=2))
## F_beta2: 0.820
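These scores follow directly from the confusion matrix printed earlier, [[15, 3], [2, 10]]. The sketch below recomputes them by hand with formulas (1) to (4); it assumes scikit-learn's layout, in which rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix reported above: rows = true label, columns = predicted label
confmat = np.array([[15, 3],
                    [2, 10]])
tn, fp, fn, tp = confmat.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
beta = 2
f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

print('%.3f %.3f %.3f %.3f' % (precision, recall, f1, f_beta))
# 0.769 0.833 0.800 0.820
```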
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))
##              precision    recall  f1-score   support
##
##           0       0.88      0.83      0.86        18
##           1       0.77      0.83      0.80        12
##
## avg / total       0.84      0.83      0.83        30
In practice these metrics are especially useful in two situations:
1. Imbalanced data
2. Asymmetric losses (cost-sensitive problems)
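ROC and AUC
The Receiver Operating Characteristic (ROC) curve is commonly used to judge the quality of a binary classifier.
Consider four points and one line in the ROC plot:
First point, (0,1): FPR = 0 and TPR = 1, which means FN (false negative) = 0 and FP (false positive) = 0. This is a perfect classifier: it classifies every sample correctly.
Second point, (1,0): FPR = 1 and TPR = 0. By the same reasoning this is the worst possible classifier, because it manages to avoid every correct answer.
Third point, (0,0): FPR = TPR = 0, that is, FP (false positive) = TP (true positive) = 0. This classifier predicts every sample as negative.
Fourth point, (1,1): the classifier predicts every sample as positive.
Now consider the points on the dashed line y = x. Points on this diagonal correspond to classifiers that guess at random; (0.5, 0.5), for example, is a classifier that randomly guesses half of the samples to be positive and the other half to be negative.
Therefore a good classifier has a curve near the upper-left corner: the closer the ROC curve is to the upper-left corner, the better the classifier performs.
From the ROC curve we can compute the AUC, the Area Under the Curve.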
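Area Under Curve
AUC is a general-purpose summary metric for binary classification: the area under the ROC curve. It usually lies between 0.5 and 1.
AUC has a useful property: when the proportion of positive and negative samples in the test set changes, the ROC curve stays the same. It is therefore well suited as an evaluation metric for data such as imbalanced data.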
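This invariance can be illustrated with scikit-learn's `roc_auc_score`. The sketch below uses a four-sample toy example with assumed scores, then makes the data imbalanced by repeating the negative samples; because TPR and FPR are each computed within one class, the AUC does not move.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (assumed values for illustration)
y = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
auc_balanced = roc_auc_score(y, scores)

# Add 9 extra copies of every negative sample: 20 negatives vs 2 positives
neg = y == 0
y_imb = np.concatenate([y, np.repeat(y[neg], 9)])
scores_imb = np.concatenate([scores, np.repeat(scores[neg], 9)])
auc_imbalanced = roc_auc_score(y_imb, scores_imb)

print(auc_balanced, auc_imbalanced)
```

Precision, by contrast, would change under the same resampling, because it mixes counts from both classes.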
def ROC_c(true_labels, predicted_probs, n_points=100, pos_class=1):
    thr = np.linspace(0,1,n_points)
    tpr = np.zeros(n_points)
    fpr = np.zeros(n_points)
    pos = true_labels == pos_class
    neg = np.logical_not(pos)
    n_pos = np.count_nonzero(pos)
    n_neg = np.count_nonzero(neg)
    for i,t in enumerate(thr):
        tpr[i] = np.count_nonzero(np.logical_and(predicted_probs >= t, pos)) / float(n_pos)
        fpr[i] = np.count_nonzero(np.logical_and(predicted_probs >= t, neg)) / float(n_neg)
    return fpr, tpr, thr
Data example
df_imputed = pd.read_csv('data/df_imputed')
df_imputed.head()
features = ['revolving_utilization_of_unsecured_lines','age','number_of_time30-59_days_past_due_n
y = df_imputed.serious_dlqin2yrs
X = pd.get_dummies(df_imputed[features], columns = ['income_bins', 'age_bin'])
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
train_X, test_X, train_y, test_y = train_test_split(X, y ,train_size=0.7,random_state=1)
# predictions from random guessing
preds = np.random.rand(len(test_y))
fpr, tpr, thr = ROC_c(test_y, preds)
plt.plot(fpr, tpr);
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(train_X,train_y)
preds = clf.predict_proba(test_X)[:,1]
fpr, tpr, thr = ROC_c(test_y, preds)
plt.plot(fpr, tpr);
Computing ROC and AUC with sklearn
from sklearn import metrics
fpr, tpr, thr = metrics.roc_curve(test_y, preds)
roc_auc = metrics.auc(fpr, tpr)
print(roc_auc)
## 0.707889321472
plt.plot(fpr, tpr, lw=1)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
A simple reinforcement learning example
