Machine Learning

Coursera

Posted by PZ on March 10, 2020

Accuracy

When you work with a class-imbalanced dataset (for example, one where the numbers of positive and negative labels differ significantly), accuracy alone does not tell the full story, so you also need precision and recall.


Precision and Recall

Precision: of the examples identified as positive, what proportion is actually positive?

Precision = TP / (TP + FP)

Recall: of all the actual positive examples, what proportion was identified correctly?

Recall = TP / (TP + FN)

ROC AUC


The area under the ROC curve (AUC) represents the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example.

AUC is useful for two reasons:

  • AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
  • AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of which classification threshold is chosen.

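A minimal sketch of computing these metrics with scikit-learn; the imbalanced dataset and the classifier are made up for illustration:

```python
# Hypothetical example: evaluating a binary classifier on an imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# 95% negatives / 5% positives, so accuracy alone is misleading.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)                  # uses the default 0.5 threshold
y_score = clf.predict_proba(X_test)[:, 1]     # scores, used for AUC (threshold-free)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_score))
```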

L1 and L2 Regularization

L1 regularization can drive unnecessary weights exactly to 0, which reduces RAM usage and computation. L2 regularization does not drive weights to 0, so the model stays dense and the computational cost remains high.
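A minimal sketch, using sklearn's Lasso (L1) and Ridge (L2) on synthetic data as stand-ins, showing that L1 drives some weights exactly to zero while L2 only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
# Only the first 3 features actually matter.
y = X[:, 0] * 3 + X[:, 1] * 2 - X[:, 2] + rng.randn(100) * 0.1

l1 = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
l2 = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print("L1 zero weights:", np.sum(l1.coef_ == 0))   # many exact zeros -> sparse model
print("L2 zero weights:", np.sum(l2.coef_ == 0))   # typically none, just small values
```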

"Nonlinear" means you cannot accurately predict the label with a model of the form b + w1x1 + w2x2; that is, the "decision surface" is not a line. Earlier, we saw one possible way to model nonlinear problems.

Learning Curves

Training an algorithm on very few data points (such as 1, 2, or 3) will easily give 0 error, because we can always find a quadratic curve that passes exactly through that many points. Hence:

  • As the training set gets larger, the training error for a quadratic function increases.
  • The error value will plateau out after a certain m (training set size), as sketched below.
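A minimal sketch of plotting these curves with scikit-learn's learning_curve; the model and dataset are placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring="neg_mean_squared_error")

# Convert the negated MSE back to an error and average over the CV folds.
plt.plot(sizes, -train_scores.mean(axis=1), label="training error")
plt.plot(sizes, -val_scores.mean(axis=1), label="validation error")
plt.xlabel("training set size m"); plt.ylabel("error"); plt.legend(); plt.show()
```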


Experiencing high bias:

If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.

(High bias: the training and cross-validation errors are both high and close to each other; adding data does not lower them much.)

Experiencing high variance:

If a learning algorithm is suffering from high variance, getting more training data is likely to help.

(High variance: the training error is low and the cross-validation error is much higher; the gap shrinks as more data is added.)

Dealing with high bias or high variance

  • Getting more training examples: Fixes high variance
  • Trying smaller sets of features: Fixes high variance
  • Adding features: Fixes high bias
  • Adding polynomial features: Fixes high bias
  • Decreasing λ: Fixes high bias
  • Increasing λ: Fixes high variance.

ML-Clustering-Unsupervised Learning: Introduction

Using these variables we can define our cost function:

J(c(1), …, c(m), μ1, …, μK) = (1/m) Σi ||x(i) − μc(i)||², the average squared distance from each example to the centroid of the cluster it is assigned to.

With k-means, it is not possible for the cost function to sometimes increase; it should always decrease. However, k-means can get stuck in local optima. To decrease the chance of this happening, **you can run the algorithm with many different random initializations**. In cases where K < 10 it is strongly recommended to run a loop of random initializations.

Choosing the Number of Clusters:

The elbow method: plot the cost J against the number of clusters K. The cost should decrease as the number of clusters increases and then flatten out; choose K at the point where the curve starts to flatten (the "elbow").

Note: J will always decrease as K is increased. The one exception is if k-means gets stuck at a bad local optimum.
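A minimal sketch (scikit-learn's KMeans on synthetic blobs): n_init controls the number of random initializations, and plotting the final cost J (inertia_) against K gives the elbow curve:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

costs = []
for k in range(1, 11):
    # n_init=10 reruns k-means with 10 random initializations and keeps the
    # lowest-cost solution, reducing the chance of a bad local optimum.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    costs.append(km.inertia_)   # inertia_ is the k-means cost J

plt.plot(range(1, 11), costs, marker="o")
plt.xlabel("number of clusters K"); plt.ylabel("cost J"); plt.show()
```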

Learning with Large Datasets

Stochastic Gradient Descent

This algorithm will only try to fit one training example at a time. 

Mini-Batch Gradient Descent

Instead of using all m examples as in batch gradient descent, and instead of using only 1 example as in stochastic gradient descent, we will use some in-between number of examples b.
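A minimal NumPy sketch of mini-batch gradient descent for linear regression; the batch size b and learning rate alpha are illustrative values:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, b=32, epochs=50):
    """Mini-batch gradient descent for linear regression (squared-error cost)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = np.random.permutation(m)           # shuffle before each pass
        for start in range(0, m, b):
            batch = idx[start:start + b]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ theta - yb) / len(batch)   # gradient on the mini-batch
            theta -= alpha * grad
    return theta

# b = 1 recovers stochastic gradient descent; b = m recovers batch gradient descent.
```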

Map Reduce and Data Parallelism

Split the training set across several machines, have each machine compute the partial sum of the gradient over its subset, and combine the partial sums on a central server to perform the update.

Regression and Hyper-parameter

Naive Cross-Validation


Hyper-Parameter Tuning

Use cross-validation to determine the best hyper-parameter. For example, with polynomial regression try different degrees k: for each k, fit the best θ on the training data, then use cross-validation to find which fitted model has the lowest validation loss; the corresponding k is the one to keep.
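A minimal sketch of this degree-selection loop with scikit-learn (synthetic data; cross_val_score re-fits θ inside each fold):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.randn(100) * 0.1

best_k, best_score = None, -np.inf
for k in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree=k), LinearRegression())
    # 5-fold CV score for this degree; theta is re-fit inside each fold.
    score = cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()
    if score > best_score:
        best_k, best_score = k, score

print("best degree k:", best_k)
```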

k-fold cross-validation

Split the training data into k folds; train on k−1 folds and validate on the held-out fold, rotating so that each fold is used for validation once, then average the k validation errors.

Derivative of Vector


KNN

k is the hyper-parameter to tune, e.g. via cross-validation.


Efficient KNN:

First select landmarks (similar to cluster centers); then find the nearest landmark, and finally search for the nearest point only within that landmark's cluster.
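A minimal NumPy sketch of this two-stage idea, using k-means centers as the landmarks (an illustrative construction, not a standard library API):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_index(X, n_landmarks=10):
    """Cluster the training points; each cluster center acts as a landmark."""
    km = KMeans(n_clusters=n_landmarks, n_init=10, random_state=0).fit(X)
    members = [np.where(km.labels_ == c)[0] for c in range(n_landmarks)]
    return km, members

def nearest_neighbor(x, X, km, members):
    """Stage 1: find the nearest landmark. Stage 2: search only inside its cluster."""
    c = int(np.argmin(np.linalg.norm(km.cluster_centers_ - x, axis=1)))
    idx = members[c]
    d = np.linalg.norm(X[idx] - x, axis=1)
    return idx[np.argmin(d)]          # index of the (approximate) nearest neighbor

# Usage sketch:
# km, members = build_index(X_train)
# j = nearest_neighbor(x_query, X_train, km, members)
```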


SVM

You need to specify C and the choice of kernel function.

One-vs-all: train one classifier per class to obtain θ(1), …, θ(n); then predict the class i with the largest value of (θ(i))ᵀx + b.
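A minimal scikit-learn sketch: SVC requires choosing C and the kernel, and OneVsRestClassifier trains one classifier per class; prediction takes the class with the largest decision value (the dataset is a placeholder):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C and the kernel (here an RBF/Gaussian kernel) are the main choices to specify.
clf = OneVsRestClassifier(SVC(C=1.0, kernel="rbf", gamma="scale")).fit(X, y)

scores = clf.decision_function(X[:1])    # one score per class for this example
print(scores, "->", clf.predict(X[:1]))  # predicted class = argmax of the scores
```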


Decision Tree and Random Forest:

Decision Tree

Decision-tree learning is a top-down, recursive method. The basic idea is to use information entropy as the measure and build a tree in which the entropy drops as quickly as possible, reaching zero at the leaf nodes, at which point every instance in a leaf belongs to the same class.

Bootstrap

The name "bootstrapping" comes from the phrase "pull yourself up by your own bootstraps", i.e. relying only on your own resources; it is a resampling-with-replacement method. From the limited data at hand it generates different replicas to simulate different datasets. Given a dataset D of size N, for t = 1, 2, …, T: sample with replacement to obtain a dataset Dt of size N′, and train a classifier gt on Dt (ID3, C4.5, CART, SVM, logistic regression, etc.). Repeating this gives T classifiers, combined by uniform voting/averaging, G = Uniform(gt). Purpose: voting/averaging reduces variance. A single decision tree is the opposite: it tends to increase variance and is sensitive to the data.
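A minimal NumPy sketch of the procedure, with a decision tree as the placeholder base classifier, N′ = N, and the assumption that X, y are NumPy arrays with integer class labels:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_fit(X, y, T=10, seed=0):
    """Train T classifiers g_t, each on a with-replacement resample D_t of D."""
    rng = np.random.RandomState(seed)
    N = len(X)
    models = []
    for _ in range(T):
        idx = rng.randint(0, N, size=N)     # N' = N indices drawn with replacement
        models.append(DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx]))
    return models

def predict_vote(models, X):
    """G = Uniform(g_t): majority vote across the T models (integer labels assumed)."""
    votes = np.array([m.predict(X) for m in models])          # shape (T, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```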


Pros and cons of decision trees

Pros: decision trees classify the training data very well. Cons: they may not classify unseen test data well; generalization is weak, i.e. overfitting can occur. Remedies: pruning, random forests.

Random Forest

Random forests address the overfitting problem of decision trees. A random forest uses the training set to grow many (very deep) decision trees. At prediction time each tree produces a result, and the results are combined by (weighted) voting, which mitigates overfitting. For example, if you train 3 trees and 2 of them predict A while 1 predicts B, the final result is A.

Random Forest classifiers work around that limitation by creating a whole bunch of decision trees (hence “forest”) — each trained on random subsets of training samples (drawn with replacement) and features (drawn without replacement) — and have the decision trees work together to make a more accurate classification.

Information entropy: H = −Σ p·log(p). Information gain is biased toward attributes with many possible values, which leads to overly fine-grained splits. Gain ratio: information gain / IV, where IV (intrinsic value) grows with the number of possible values of the attribute. Gini index: Gini(D) = 1 − Σ p², the probability that two randomly drawn samples have different class labels.

A random forest injects randomness in two ways (see the sketch below):

  1. Different data subsets, sampled with replacement.
  2. Different feature subsets, possibly overlapping; only a portion of the features is considered.
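A minimal scikit-learn sketch of the random-forest idea above; RandomForestClassifier applies both kinds of randomness internally, and the dataset is just a placeholder:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=100,        # number of (deep) trees that vote on the prediction
    max_features="sqrt",     # random feature subset considered at each split
    bootstrap=True,          # each tree sees a with-replacement resample of the data
    random_state=0)

print(cross_val_score(rf, X, y, cv=5).mean())
```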

Ensemble Learning


If the dataset is large, split it into several smaller datasets, train multiple models, and combine them. If the dataset is small, use bootstrap sampling to obtain multiple datasets, train a model on each, and then combine them.

Bagging

Bootstrap aggregating: sampling with replacement (the bootstrap method).

In bagging, bootstrap sampling (with replacement) from the full dataset produces N datasets; a model is learned on each dataset, and the final prediction is obtained from the outputs of the N models: by voting for classification problems and by averaging for regression problems.

Boosting

The idea: first train weak learner 1 on the training set with initial weights. Based on weak learner 1's error rate, update the training-sample weights so that the examples it got wrong receive higher weight and therefore more attention from weak learner 2. Then train weak learner 2 on the re-weighted training set, and repeat until the number of weak learners reaches a preset number T. Finally, combine the T weak learners with an ensemble strategy to obtain a strong learner.
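A minimal scikit-learn sketch of this re-weighting idea using AdaBoost, one common boosting implementation; by default the weak learners are depth-1 decision stumps and n_estimators plays the role of T (the dataset is a placeholder):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each round re-weights the training examples so the next weak learner focuses on
# the examples the previous ones got wrong; the T weak learners are then combined.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)

print(cross_val_score(boost, X, y, cv=5).mean())
```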

Details

Matrix multiplication

Associative. (A∗B)∗C=A∗(B∗C)

Inverse

The inverse of a matrix A is denoted A⁻¹. Multiplying a matrix by its inverse gives the identity matrix.

A non-square matrix does not have an inverse. We can compute inverses in Octave with the pinv(A) function [1] and in MATLAB with the inv(A) function. Matrices that don't have an inverse are called singular or degenerate.

If |A| = 0, the matrix is singular and has no inverse.
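The same operations in NumPy: inv for a square non-singular matrix, pinv for the pseudo-inverse, which still exists when the matrix is singular or non-square:

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

print(A @ np.linalg.inv(A))        # ~ identity matrix (A is non-singular)

S = np.array([[1.0, 2.0],
              [2.0, 4.0]])         # det(S) = 0 -> singular, inv(S) would fail
print(np.linalg.pinv(S))           # the pseudo-inverse still exists
```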

Gradient Descent


θ:=θ−α∇J(θ)

Diagnostics

A typical rule of thumb when running diagnostics is:

  • More training examples fixes high variance but not high bias.
  • Fewer features fixes high variance but not high bias.
  • Additional features fixes high bias but not high variance.
  • The addition of polynomial and interaction features fixes high bias but not high variance.
  • When using gradient descent, decreasing lambda can fix high bias and increasing lambda can fix high variance (lambda is the regularization parameter).
  • When using neural networks, small neural networks are more prone to under-fitting and big neural networks are prone to over-fitting. Cross-validation of network size is a way to choose alternatives.

SVM vs Logistic

If n (the number of features) is large relative to m (the number of training examples), use logistic regression or an SVM without a kernel (the "linear kernel").

If n is small and m is intermediate, then use SVM with a Gaussian Kernel

If n is small and m is large, then manually create/add more features, then use logistic regression or SVM without a kernel.

In the first case, we don’t have enough examples to need a complicated polynomial hypothesis. In the second example, we have enough examples that we may need a complex non-linear hypothesis. In the last case, we want to increase our features so that logistic regression becomes applicable.

Principal Component Analysis (PCA)

Find a direction (vector) or lower-dimensional surface to project the data onto, e.g. reducing 3D data to 2D.


Projection error: the distance from each point to the line (or plane) it is projected onto.

Linear regression minimizes the vertical errors to the fitted line, whereas PCA minimizes the perpendicular projection errors.


Mean Normalization and Feature Scaling

SVD: singular value decomposition

Reduce data from n-dimensions to k-dimensions


z = U_reduceᵀ x; U is n×n, and taking its first k column vectors gives U_reduce (n×k), which reduces the data from n dimensions to k dimensions.


Covariance matrix

The covariance matrix captures the relationships between dimensions. After the features are mean-normalized (and scaled so their variances are comparable), compute the covariance matrix; the larger an entry's value, the more strongly correlated the corresponding pair of features.

The SVD decomposition used for PCA.
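A minimal NumPy sketch tying the steps together: mean normalization, the covariance matrix Sigma = (1/m) XᵀX, SVD, and the projection z = U_reduceᵀ x:

```python
import numpy as np

def pca(X, k):
    """Reduce X (m x n) to k dimensions via SVD of the covariance matrix."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    X_norm = X - mu                      # mean normalization (add feature scaling if needed)
    Sigma = (X_norm.T @ X_norm) / m      # n x n covariance matrix
    U, S, Vt = np.linalg.svd(Sigma)      # columns of U are the principal directions
    U_reduce = U[:, :k]                  # keep the first k column vectors (n x k)
    Z = X_norm @ U_reduce                # z = U_reduce^T x for every example
    return Z, U_reduce, mu

# Usage sketch: Z, U_reduce, mu = pca(X, k=2); reconstruct with Z @ U_reduce.T + mu
```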