1. Introduction
Human capability comes from experience + innovation.
Question: can a program improve itself from experience, so as to meet new requirements or changes in its environment?
Machine learning studies how to improve a system's performance by computational means, using experience.
Experience → comes from data
The way experience is used → the model
The computational means → the method for producing a model from data → the learning algorithm
Component | Description |
---|---|
data | Different applications have different data. |
model (function) | Define the model according to the specific problem. |
loss (prediction) | Use a loss function to evaluate the model. |
Most of the machine learning encountered so far deals with structured data; one piece of structured data is called a record or a sample, and a collection of records is a dataset.
Principle of supervised learning
There is some unknown law relating y to X; we guess it takes the following form
$$y=f(X)=\omega X+b$$
This formula stands for a whole family of concrete formulas; using some known pairs (X0, y0) we gradually adjust it to approximate the true relationship (a small fitting sketch follows the three steps below).
1. define a set of functions
2. goodness of function
3. pick the best function
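As a minimal sketch of these three steps (the tiny 1-D dataset below is made up for illustration): the function set is $f(x)=wx+b$ indexed by $(w, b)$, goodness is measured by squared loss, and the best function is picked by least squares.

```python
import numpy as np

# Hypothetical (X0, y0) pairs; any small dataset would do.
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 7.1])

# Step 1: the set of candidate functions f(x) = w*x + b, indexed by (w, b).
def f(x, w, b):
    return w * x + b

# Step 2: "goodness" of a candidate, measured here by mean squared error.
def loss(w, b):
    return np.mean((f(X, w, b) - y) ** 2)

# Step 3: pick the best function; here via the least-squares fit.
w_best, b_best = np.polyfit(X, y, deg=1)
print(w_best, b_best, loss(w_best, b_best))
```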
Classification vs. regression
Classification: predicts discrete values
Regression: predicts continuous values
Unsupervised learning
Clustering
Occam's razor
If multiple hypotheses are consistent with the observations, choose the simplest one.
The goal of machine learning is for the learned model to generalize well to new samples.
2. Model Evaluation and Selection
Empirical error and overfitting
Error rate: the proportion of misclassified samples among all samples
Accuracy: the proportion of correctly classified samples among all samples
Error: the difference between the learner's predicted output and the true output of the sample
Training error / empirical error: the error on the training set
Generalization error: the error on new samples
Evaluation methods
Hold-out
Split the dataset directly into two mutually exclusive sets
Usually repeat the random split several times and average the results; roughly 2/3 to 4/5 of the samples are used for training
Cross-validation
Partition the dataset into k mutually exclusive subsets of similar size; use k-1 of them as the training set and the remaining one as the test set
k-fold cross-validation
10 times 10-fold cross-validation
Leave-one-out
With n samples, this is cross-validation with n folds.
Bootstrapping
Bootstrap sampling
Out-of-bag estimate
Bootstrapping is useful when the dataset is small and hard to split effectively into training and test sets, but it changes the distribution of the initial dataset and introduces estimation bias.
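A minimal sketch of bootstrap sampling and the out-of-bag idea (the dataset size n below is an arbitrary assumption): draw n indices with replacement; the samples that never appear (about 1/e ≈ 36.8% for large n) are available for the out-of-bag estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # assumed dataset size

# Bootstrap sample: draw n indices uniformly with replacement.
boot_idx = rng.integers(0, n, size=n)

# Out-of-bag samples: original indices that were never drawn.
oob = ~np.isin(np.arange(n), boot_idx)
print(f"out-of-bag fraction: {oob.mean():.3f}")  # close to 1/e ≈ 0.368
```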
Parameter tuning and the final model
Parameter tuning
Validation set
Performance measures
The most common performance measure for regression problems is the mean squared error.
Precision, recall, and F1
 | Predicted positive | Predicted negative |
---|---|---|
Actual positive | True positive (TP) | False negative (FN) |
Actual negative | False positive (FP) | True negative (TN) |
$$P = { TP \over TP + FP}.$$
$$R= { TP \over TP+FN}.$$
Break-Even Point (BEP): the value at which precision equals recall.
F1
$$F1 = { {2 \times P \times R} \over {P + R} } = { {2 \times TP} \over { \text{total samples} + TP - TN } }$$
When P and R are weighted differently (a small numeric check of P, R, and F1 follows these notes):
$$F_\beta=\frac{(1 + \beta^2) \times P \times R}{(\beta^2 \times P) + R}$$
- When $\beta = 1$, this reduces to F1
- When $\beta > 1$, recall matters more
- When $\beta < 1$, precision matters more
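A quick numeric check of these definitions (the confusion-matrix counts below are made up for illustration):

```python
# Hypothetical confusion-matrix counts.
TP, FN, FP, TN = 40, 10, 20, 30

precision = TP / (TP + FP)    # 40 / 60 ≈ 0.667
recall = TP / (TP + FN)       # 40 / 50 = 0.8
f1 = 2 * precision * recall / (precision + recall)

# Equivalent form: 2*TP / (total samples + TP - TN)
total = TP + FN + FP + TN
f1_alt = 2 * TP / (total + TP - TN)
print(precision, recall, f1, f1_alt)   # f1 == f1_alt ≈ 0.727
```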
ROC and AUC
ROC = Receiver Operating Characteristic
True positive rate
$$ TPR = {TP \over TP + FN}$$
False positive rate
$$ FPR = {FP \over TN + FP}$$
AUC = Area Under ROC Curve
The area under the ROC curve; larger is better.
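A minimal sketch of computing ROC points and the AUC with scikit-learn (the labels, scores, and the use of sklearn are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and classifier scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # TPR/FPR at each threshold
print("AUC =", roc_auc_score(y_true, y_score))     # area under the ROC curve
```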
Cost matrix
Because different kinds of misclassification may incur different costs, a cost matrix can be introduced; the optimization objective then becomes minimizing the total cost.
Linear model
Basic form
$$f(x)=\omega_1x_1+\omega_2x_2+\cdots+\omega_dx_d+b$$
Vector form
$$f(x)=\omega^Tx+b$$
Review Outline
Basic Concepts: everything is examined
- Basic Concepts about Machine Learning
Q1. What is Machine Learning?
A1. Machine Learning is composed of three parts:
- Data
- Model (function)
- Loss (prediction)

Component | Description |
---|---|
data | Different applications have different data. |
model (function) | Define the model according to the specific problem. |
loss (prediction) | Use a loss function to evaluate the model. |
Supervised learning is the machine learning task of inferring a function from labeled training data.
Linear Regression
Simple linear regression describes the linear relationship between a predictor variable, plotted on the x-axis, and a response variable, plotted on the y-axis.
Closed-form solution
Many matrices are not invertible.
Inverting a large matrix requires a huge amount of memory.
The inverse takes $O(m^3)$ to compute.
So Gradient Descent is used instead.
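For context, a minimal sketch of the closed-form (normal-equation) solution these issues refer to; the data is made up, and np.linalg.pinv is used to sidestep non-invertibility:

```python
import numpy as np

# Hypothetical design matrix (first column of ones = bias term) and targets.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 7.1])

# Normal equation: w = (X^T X)^{-1} X^T y.
# pinv is used because X^T X may be singular; the inversion itself
# still costs O(m^3) in the number of features m.
w_closed = np.linalg.pinv(X.T @ X) @ X.T @ y
print(w_closed)  # [intercept, slope]
```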
- Gradient Descent
Learning rate η has a large impact on convergence
Too large η → oscillatory and may even diverge
Too small η → too slow to converge
Adaptive learning rate (for example):
Set a larger learning rate at the beginning
Use a relatively smaller learning rate in later epochs
Decrease the learning rate: $\eta_{t+1} = \eta_t / (t+1)$
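A minimal gradient-descent sketch with a decaying learning rate (the data and initial values are made up, and the simpler schedule $\eta_t = \eta_0/(t+1)$ is used as a stand-in for the decay rule above):

```python
import numpy as np

# Hypothetical 1-D data for y ≈ w*x + b.
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 7.1])

w, b, eta0 = 0.0, 0.0, 0.1
for t in range(200):
    eta = eta0 / (t + 1)             # decaying learning rate
    err = (w * X + b) - y
    w -= eta * 2 * np.mean(err * X)  # d(MSE)/dw
    b -= eta * 2 * np.mean(err)      # d(MSE)/db
print(w, b)                          # approaches the least-squares fit (w ≈ 2, b ≈ 1)
```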
Linear Classification: the dual SVM and its derivation are not examined
- Linear Classification
$$f(x_i)\ \begin{cases} \ge 0 & y_i = +1 \\ < 0 & y_i = -1 \end{cases}$$
i.e. $y_i f(x_i) > 0$ for a correct classification
- Support Vector Machine
What is a good decision boundary?
Maximum margin solution: most stable under perturbations of the inputs
Select two parallel hyperplanes that separate the two classes of data and make the distance between them as large as possible
The region bounded by these two hyperplanes is called the "margin"
In general there is a trade-off between the margin and the number of mistakes on the training data
Soft margin
Slack variables
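A minimal numeric sketch of the classification rule and the soft-margin slack (labels, scores, and the hinge form max(0, 1 - y·f(x)) for the slack are illustrative assumptions):

```python
import numpy as np

# Hypothetical labels in {-1, +1} and scores f(x_i) = w^T x_i + b from some linear model.
y = np.array([+1, +1, -1, -1, +1])
f = np.array([2.3, 0.4, -1.7, 0.2, -0.5])

correct = y * f > 0                 # y_i * f(x_i) > 0 means a correct classification
slack = np.maximum(0, 1 - y * f)    # slack xi_i = 0 only when the sample is outside the margin
print(correct)                      # [ True  True  True False False]
print(slack)                        # [0.   0.6  0.   1.2  1.5]
```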
- Stochastic Gradient Descent
True gradient descent is a batch algorithm, slow but sure
Information is redundant
Sufficient samples means we can afford more frequent, noisy updates
Never-ending stream means we should not wait for all data
Tracking non-stationary data means that the target is moving
Better for large datasets and often faster convergence
Hard to reach high accuracy
Best classical methods cannot handle stochastic approximation
Theoretical definitions of convergence are not as well defined
Benefits of SGD
- Gradient is easy to calculate (instantaneous)
- Less prone to local minima
- Small memory footprint
- Get to a reasonable solution quickly
- Works for non-stationary environments as well as online settings
- Can be used for more complex models and error surfaces
Is convergence necessary?
Non-stationary: convergence may not be required
Stationary: learning rate should decrease with time
Robbins-Monro sequence is adequate: $\eta_t = 1/t$
Mini-batch Stochastic Gradient Descent
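A minimal mini-batch SGD sketch for linear regression (the data, batch size, and learning rate are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                         # hypothetical features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

w = np.zeros(3)
eta, batch_size = 0.05, 16
for epoch in range(20):
    idx = rng.permutation(len(X))                     # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]             # one mini-batch
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient on the batch
        w -= eta * grad
print(w)  # should be close to [1.0, -2.0, 0.5]
```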
SGD Recommendations
- Randomly shuffle training examples
Although theory says you should randomly pick examples, it is easier to make a pass through your training set sequentially
Shuffling before each iteration eliminates the effect of order
- Monitor both training cost and validation error
Set aside samples for a decent validation set
Compute the objective on the training set and validation set (expensive but better than overfitting or wasting computation)
- Check gradient using finite differences
If the computation is slightly incorrect, it can yield an erratic and slow algorithm
Verify your code by slightly perturbing the parameters and inspecting the differences between the two gradients (see the sketch after this list)
- Experiment with the learning rates using a small sample of the training set
SGD convergence rates are independent of sample size
Use traditional optimization algorithms as a reference point
- Leverage sparsity of the training examples
For very high-dimensional vectors with few non-zero coefficients, you only need to update the weight coefficients corresponding to the nonzero pattern in x
- Use learning rates of the form $\eta_t = \eta_0/(1 + \eta_0 \lambda t)$
Allows you to start from reasonable learning rates determined by testing on a small sample
Works well in most situations if the initial point is slightly smaller than the best value observed in the training sample
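As referenced in the gradient-check recommendation above, a minimal finite-difference check of an MSE gradient (data, loss, and epsilon are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = rng.normal(size=3)

def loss(w):
    return np.mean((X @ w - y) ** 2)

analytic = 2 * X.T @ (X @ w - y) / len(y)   # hand-derived gradient

eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):                     # central finite differences, one coordinate at a time
    e = np.zeros_like(w)
    e[i] = eps
    numeric[i] = (loss(w + e) - loss(w - e)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be tiny (~1e-8)
```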
Logistic Regression and Softmax Regression: both are examined
Logistic Regression
Not yet understood.
Softmax Regression
Not yet understood.
Multiclass classification: hierarchical classification is not examined
Skipping this for now.
Cross-Validation: everything is examined
Overfitting, Underfitting and Cross-Validation
Problem of Model Validation
Q1. Why We All Need Validation
Business Reasons
Need to choose the best model.
Measure accuracy/power of the selected model.
Good to measure the ROI of the modeling project.
Statistical Reasons
Model building techniques are inherently designed to minimize "loss" or "bias".
To an extent, a model will always fit "noise" as well as "signal".
If you just fit a bunch of models on a given dataset and choose the "best" one, it will likely be overly "optimistic".
Underfitting and Overfitting
Underfitting: high bias
Model cannot capture the underlying trend of the data
Overfitting: high variance
The model is too complex; it ends up fitting the noise or outliers of the data rather than the true pattern
Bias-Variance Tradeoff
Error on the dataset used to fit the model can be misleading
Doesn't predict future performance.
Too much complexity can diminish model's accuracy on future data.
Sometimes called the Bias-Variance Tradeoff.
Complex model:
Low "bias":
The model fit is good on the training data.
The model value is close to the data's expected value.
High "variance":
Model more likely to make a wrong prediction.
How do we know if we are underfitting or overfitting?
If by increasing capacity we decrease generalization error, then we are underfitting, otherwise we are overfitting.
– If the error in representing the training set is relatively large and the generalization error is large, then underfitting;
Need to increase capacity (complexity of models).
– If the error in representing the training set is relatively small and the generalization error is large, then overfitting;
Need to decrease capacity or increase the training set.
– Many features and relatively small training set.
– If you have chosen a large capacity to complement the many features, then you might overfit the data:
need to decrease the capacity.
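A minimal sketch of this diagnosis: compare training and validation error as capacity grows (polynomial degree stands in for capacity; the dataset and the scikit-learn pipeline are illustrative assumptions):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.2 * rng.normal(size=200)     # hypothetical noisy data
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 3, 15):                               # increasing capacity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    tr_err = mean_squared_error(y_tr, model.predict(X_tr))
    va_err = mean_squared_error(y_va, model.predict(X_va))
    # Both errors high -> underfitting; low training error but high validation error -> overfitting.
    print(degree, round(tr_err, 3), round(va_err, 3))
```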
Cross-Validation (important)
Training Data, Validation Data, Testing Data
Performance Report
Parameter Tuning
K-Fold Cross-Validation
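A minimal k-fold cross-validation sketch with scikit-learn (the classifier, data, and k = 5 are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # hypothetical labels

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])   # train on k-1 folds
    scores.append(clf.score(X[test_idx], y[test_idx]))           # evaluate on the held-out fold
print(np.mean(scores))                                           # average accuracy over the folds
```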
PCA: only pages 1-16 are examined, i.e. nothing from "Another Point of View" onward is examined
Ensemble Methods (AdaBoost): GBDT is not examined
Ensemble Methods (Decision Tree and Random Forest): random forest is not examined
Clustering: hierarchical agglomerative clustering is not examined
- Clustering
- K-Means Clustering
- Hierarchical Agglomerative Clustering (not examined)
- Conclusion
Recommendation systems: only the problem definitions are examined: what Collaborative Filtering is, the typical recommendation approaches with their pros and cons, and the concept of cold-start recommendation. The methods themselves are not examined; they are covered in the course project.
A recommender system applies statistical and knowledge-discovery techniques to the problem of making product recommendations.
Advantages of recommender systems:
Improve conversion rate: help customers find a product they want to buy (and thus increase sales).
Cross-selling: suggest additional products.
Improve customer loyalty: create a value-added relationship.
Q1. What is Collaborative Filtering?
Make automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many other users (collaboration).
A1. Collaborative filtering uses the behavior of some users to predict what other users will do. Neighborhood-based collaborative filtering comes in two flavors: user-based collaborative filtering (UserCF) and item-based collaborative filtering (ItemCF).
User-based: use the similarity between users to find the nearest neighbors; when a recommendation is needed, take the items with the highest recommendation scores among those neighbors.
Item-based: make recommendations for an item based on the relationships between items.
Q2. What are the typical recommendation approaches and their pros and cons?
Memory-based CF: utilize the entire user-item database to generate a prediction.
- User-based CF: find similar users to predict ratings (see the sketch after this list).
- Item-based CF: use similar items to predict ratings.
Model-based CF: build a model from the rating data (matrix factorization, etc.) and use this model to predict missing ratings.
User-based similarity is more dynamic.
Precomputing user neighbourhoods may lead to poor predictions.
Item-based similarity is static.
Precomputing item neighbourhoods may provide accurate results.
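A minimal user-based CF sketch (the toy rating matrix, cosine similarity, and treating 0 as "not rated" are illustrative assumptions): predict a missing rating as the similarity-weighted average of the ratings of similar users.

```python
import numpy as np

# Hypothetical user x item rating matrix (0 = not rated).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 5, 4]], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

target, item = 1, 2                       # predict user 1's rating for item 2
num, den = 0.0, 0.0
for u in range(len(R)):
    if u == target or R[u, item] == 0:    # skip the target and users who did not rate the item
        continue
    s = cosine(R[target], R[u])           # user-user similarity
    num += s * R[u, item]
    den += s
print(num / den)                          # similarity-weighted average rating
```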
Memory-Based CF
Pros:
Require minimal knowledge engineering efforts.
Users and products are symbols without any internal structure or characteristics.
Produce good-enough results in most cases.
Cons:
Require a large number of explicit and reliable ratings.
Highly time-consuming to compute similarity with millions of users & items.
Model-based CF
Models are learned from the underlying data rather than heuristics.
Models of user ratings (or purchases):
Matrix Factorization is the most widely used algorithm.
Comparison between SGD and ALS
ALS is easier to parallelize than SGD.
ALS converges faster than SGD.
SGD has lower storage complexity than ALS.
(ALS needs to store the rating matrix R)
SGD has lower computational complexity than ALS.
(ALS needs to compute matrix-vector multiplications)
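A minimal matrix-factorization-by-SGD sketch (the rating matrix, latent dimension, learning rate, and regularization are illustrative assumptions): learn user factors P and item factors Q so that the product P Qᵀ approximates the observed ratings, then read predictions for the missing entries from P Qᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 5, 4]], dtype=float)      # 0 = missing rating

k, eta, lam = 2, 0.01, 0.1                     # latent dim, learning rate, regularization
P = 0.1 * rng.normal(size=(R.shape[0], k))     # user factors
Q = 0.1 * rng.normal(size=(R.shape[1], k))     # item factors

obs = [(u, i) for u in range(R.shape[0]) for i in range(R.shape[1]) if R[u, i] > 0]
for epoch in range(2000):
    for u, i in obs:                           # SGD over observed ratings only
        err = R[u, i] - P[u] @ Q[i]
        P[u] += eta * (err * Q[i] - lam * P[u])
        Q[i] += eta * (err * P[u] - lam * Q[i])

print(np.round(P @ Q.T, 1))                    # predicted ratings, including the missing entries
```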
Three main approaches to building recommender systems: content-based recommendation, collaborative filtering, and latent factor model (LFM) recommendation.
- Content-based recommendation
- Collaborative filtering
- Latent factor model (LFM) recommendation
Q3. The concept of cold-start recommendation
The cold-start problem is how to design a personalized recommender system without a large amount of user data, such that users are satisfied with the recommendations and willing to use the system.
1) User cold start: how to make personalized recommendations for a new user.
2) Item cold start: how to recommend a new item to the users who may be interested in it; this is especially important on time-sensitive sites such as news sites.
3) System cold start: how to design personalized recommendation for a newly launched site, so that users experience personalized recommendations as soon as the site goes live; there are no users yet, only some item information.
Please cite the source when reposting. Feel free to verify the references cited in this article and to point out anything incorrect or unclear, either in the comments below or by email to 3213359017@qq.com.