- Overview of machine learning
- Supervised Learning
- Decision Tree
- Model selection
- Hyperparameters and Model selection
- Cross Validation
- Multivariate Gaussians
- Review: Variance and covariance
- Multivariate Gaussians
- PCA/ SVD
- PCA
- Eigenvector computation
- Low-rank Approximation
- Application: Latent Semantic Analysis
- Application: Matrix Completion
- Kernel machines and neural networks
- Kernel machines
- An ML algorithm takes training data as input and returns a predictor.
- Predictor: a function that takes a feature vector as input and returns a prediction.
- Classifier: a predictor that returns values from a discrete set
- Feature engineering: deciding what goes into the feature vector
- Feature extraction: transforming “raw input” into a feature vector
- Model selection: choosing a predictor template (model, hypothesis class)
- Training as optimization
- Objective function: real-valued function of predictor parameters to optimize (max or min)
- Optimization: find specific setting of parameters to optimize the objective
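A minimal sketch of this "training as optimization" view, assuming a toy 1-D linear predictor and a squared-error objective; the data, step size, and iteration count below are illustrative choices, not from the notes:

```python
import numpy as np

# Sketch: fit f(x) = w*x + b by minimizing mean squared error (the objective)
# with gradient descent (the optimization). Toy data, not from the notes.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=100)   # noisy linear data

w, b = 0.0, 0.0          # parameters of the predictor
lr = 0.1                 # step size (a hyperparameter of the ML algorithm)

for _ in range(500):
    err = (w * x + b) - y
    # Gradients of the mean-squared-error objective w.r.t. w and b.
    grad_w = 2 * np.mean(err * x)
    grad_b = 2 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # should end up close to (2.0, 1.0)
```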
Statistical models of data
- Labeled example in training data: a random variable
- Labeled example encountered in future: random variable with same distribution as above, but independent of training data
- Assumption: labeled examples in training data are independent & identically distributed (IID)
Challenges in supervised learning
- Feature engineering and model selection are critical and rely on domain expertise
- Procuring useful training data
- Under-fitting (failure of optimization)
- Over-fitting (failure of generalization)
- And so on
Model selection
Hyperparameters and Model selection
- Many ML algorithms have hyperparameters: parameters of the ML algorithm itself (not the parameters of the predictor learned by the ML algorithm)
- The data-driven way to set hyperparameters: Model selection (hyperparameter tuning)
- Evaluate different ML algorithms with different hyperparameters by using test data
- Cross validation: model selection method that simulates training/testing using only training data
- K-fold cross validation, leave-one-out cross validation, etc. (a minimal K-fold sketch follows below)
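A minimal sketch of K-fold cross validation for picking one hyperparameter, here a ridge-regression penalty `lam`; the synthetic data, candidate values, and helper names (`ridge_fit`, `cv_mse`) are illustrative assumptions, not from the notes:

```python
import numpy as np

# K-fold cross validation: simulate train/test splits using only training data,
# then pick the hyperparameter with the best average validation error.
rng = np.random.default_rng(0)
n, d = 120, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

def ridge_fit(X, y, lam):
    # Closed-form ridge regression: (X^T X + lam*I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, k=5):
    # Split indices into k folds; train on k-1 folds, validate on the held-out one.
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[trn], y[trn], lam)
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errs)

candidates = [0.01, 0.1, 1.0, 10.0]
best = min(candidates, key=lambda lam: cv_mse(X, y, lam))
print("selected lam:", best)
```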
Multivariate Gaussians
Review: Variance and covariance
- Covariance matrix: for $\vec{X} = (X_1, X_2, \dots, X_d)$, $cov(\vec{X})$ is the $d \times d$ matrix with entries $cov(X_i, X_j)$
- Linear functions: for $\vec{\alpha} \in \mathbb{R}^d$, $\vec{\beta} \in \mathbb{R}^d$:
  $E(\vec{\alpha} \cdot \vec{X}) = \vec{\alpha} \cdot E(\vec{X})$
  $var(\vec{\alpha} \cdot \vec{X}) = \vec{\alpha}^T cov(\vec{X}) \vec{\alpha}$
  $cov(\vec{\alpha} \cdot \vec{X}, \vec{\beta} \cdot \vec{X}) = \vec{\alpha}^T cov(\vec{X}) \vec{\beta}$
- Linear transformations: for $A \in \mathbb{R}^{d \times k}$:
  $E(A^T \vec{X}) = A^T E(\vec{X})$
  $cov(A^T \vec{X}) = A^T cov(\vec{X}) A$
- Isotropic and anisotropic random vectors
  $\vec{X}$ is isotropic if $var(X_i) = 1$ for all $i$ and $cov(X_i, X_j) = 0$ for all $i \neq j$, i.e. $cov(\vec{X}) = I_d$
  Note: isotropy of $\vec{X}$ does not mean $X_1, \dots, X_d$ are independent (numerical illustration below).
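A small numerical illustration of the note above: an isotropic pair that is not independent. The particular construction (a standard normal times an independent random sign) is a standard textbook-style example, not taken from the notes:

```python
import numpy as np

# X = (X1, X2) with cov(X) = I_2, yet X1 and X2 are clearly dependent.
rng = np.random.default_rng(0)
n = 200_000
x1 = rng.normal(size=n)
r = rng.choice([-1.0, 1.0], size=n)    # random sign, independent of x1
x2 = r * x1                             # same magnitude as x1, random sign

X = np.stack([x1, x2], axis=1)
print(np.cov(X, rowvar=False))          # approximately the 2x2 identity: isotropic
# But |X1| == |X2| always, so the coordinates are dependent:
print(np.corrcoef(x1**2, x2**2)[0, 1])  # exactly 1, not 0
```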
Multivariate Gaussians
- Standard multivariate Gaussian:
  $Z_1, \dots, Z_d \sim N(0, 1)$ independently
  $\vec{Z} \sim N(\vec{0}, I_d)$
  $E(\vec{Z}) = \vec{0}$
  $cov(\vec{Z}) = I_d$
- Affine functions 1:
  Random variable: $X := \mu + \vec{\alpha} \cdot \vec{Z} = \mu + \sum_i \alpha_i Z_i$
  Mean: $\mu$
  Variance: $\sigma^2 := \|\vec{\alpha}\|_2^2$
  Shorthand: $X \sim N(\mu, \sigma^2)$
- Affine functions 2:
  Random variables: $X_1$, $X_2$
- General multivariate Gaussian distributions
  $\vec{X} := \vec{\mu} + A^T \vec{Z}$
  Mean: $\vec{\mu}$, Covariance: $\Sigma := A^T A$
  Shorthand: $\vec{X} \sim N(\vec{\mu}, \Sigma)$
- Marginalization
  Let $\vec{X} = (X_1, X_2, \dots, X_d) \sim N(\vec{\mu}, \Sigma)$.
  What is the marginal distribution of $(X_1, X_2, \dots, X_k)$ for some $k < d$? It is again Gaussian, with mean the corresponding sub-vector of $\vec{\mu}$ and covariance the corresponding $k \times k$ sub-block of $\Sigma$ (sampling sketch below).
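A sampling sketch of the construction $\vec{X} = \vec{\mu} + A^T \vec{Z}$ and of the marginalization claim; the dimensions, $\vec{\mu}$, and $A$ are arbitrary illustrative choices:

```python
import numpy as np

# Build X = mu + A^T Z with Z ~ N(0, I_d); check mean, covariance, and the
# marginal of the first k coordinates against mu[:k] and Sigma[:k, :k].
rng = np.random.default_rng(0)
d, k, n = 4, 2, 200_000
mu = np.array([1.0, -2.0, 0.5, 3.0])
A = rng.normal(size=(d, d))
Sigma = A.T @ A

Z = rng.normal(size=(n, d))          # rows are iid N(0, I_d) draws
X = mu + Z @ A                       # row-wise version of X = mu + A^T Z

print(np.abs(X.mean(axis=0) - mu).max())                          # small sampling error
print(np.abs(np.cov(X, rowvar=False) - Sigma).max())              # small sampling error
print(np.abs(np.cov(X[:, :k], rowvar=False) - Sigma[:k, :k]).max())  # marginal matches sub-block
```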
- Conditioning
- Estimation
- Empirical (sample) covariance: replace the expectations in
  $cov(\vec{X}) = E[(\vec{X} - E(\vec{X}))(\vec{X} - E(\vec{X}))^T]$
  with averages over the sample (sketch below).
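A minimal sketch of the empirical covariance; it uses the $1/n$ convention (whether the notes use $1/n$ or $1/(n-1)$ is not specified here):

```python
import numpy as np

# Empirical covariance: subtract the sample mean, then average outer products.
def empirical_cov(X):
    """X: (n, d) data matrix; returns the d x d sample covariance (1/n convention)."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.diag([1.0, 2.0, 3.0])
print(empirical_cov(X))                     # close to diag(1, 4, 9)
print(np.cov(X, rowvar=False, bias=True))   # same computation via numpy (1/n convention)
```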
- More properties of the covariance matrix
  - Symmetric
  - $d$ real eigenvalues $\lambda_1, \dots, \lambda_d$
  - corresponding eigenvectors $\vec{v}_1, \vec{v}_2, \dots, \vec{v}_d$ that are orthonormal
  - all eigenvalues are non-negative:
    $\vec{v}_i^T cov(\vec{X}) \vec{v}_i = \vec{v}_i^T (\lambda_i \vec{v}_i) = \lambda_i$
    $\vec{v}_i^T cov(\vec{X}) \vec{v}_i = var(\vec{v}_i \cdot \vec{X}) \geq 0$
  - Eigenvalue decomposition of $cov(\vec{X})$:
    $cov(\vec{X}) = \sum_i \lambda_i \vec{v}_i \vec{v}_i^T$
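A quick numerical check of these properties on an example covariance matrix, built as $A^T A$ so it is a valid covariance; `numpy.linalg.eigh` is used for the symmetric eigendecomposition:

```python
import numpy as np

# Check: symmetric, non-negative eigenvalues, orthonormal eigenvectors,
# and cov(X) = sum_i lambda_i v_i v_i^T.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma = A.T @ A                         # symmetric positive semidefinite example

lams, V = np.linalg.eigh(Sigma)         # eigh: eigendecomposition for symmetric matrices
print(np.all(lams >= -1e-10))           # non-negative eigenvalues
print(np.allclose(V.T @ V, np.eye(4)))  # orthonormal eigenvectors (columns of V)
recon = sum(lam * np.outer(v, v) for lam, v in zip(lams, V.T))
print(np.allclose(recon, Sigma))        # eigenvalue decomposition reconstructs Sigma
```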
- Data approximation
  Note: there are two ways to understand PCA. One interpretation is that the sample points should be close to the projection subspace (line); the other is that the projections of the sample points onto that subspace (line) should be spread out as much as possible. The lecturer uses the first interpretation: make the projection "mimic" the original data as well as possible.
  https://www.cnblogs.com/pinard/p/6239403.html
- Special case: Coordinate subspaces
- Special case: k = 1
  Note: proved via the Pythagorean theorem: making the distance between the original data and its projection small is equivalent to making the projection as large as possible.
- Maximum variance linear function of a random vector
  Answer: the top eigenvalue $\lambda_1$ of $cov(\vec{X})$ (over unit-norm linear functions $\vec{\alpha} \cdot \vec{X}$).
- Optimality of top eigenvector
  $var(\vec{\alpha} \cdot \vec{X}) = \sum_i \lambda_i (\vec{\alpha} \cdot \vec{v}_i)^2$
  To put all the "weight" on $\lambda_1$, we need $(\vec{\alpha} \cdot \vec{v}_1)^2 = 1$; achieved by $\vec{\alpha} = \pm \vec{v}_1$ (numerical check below).
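A numerical check of this optimality claim on an example covariance matrix; the matrix and the number of random trial directions are arbitrary choices:

```python
import numpy as np

# Among unit vectors alpha, var(alpha . X) = alpha^T cov(X) alpha is largest
# at alpha = +/- v_1, the top eigenvector, where it equals lambda_1.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Sigma = A.T @ A

lams, V = np.linalg.eigh(Sigma)
lam_top, v_top = lams[-1], V[:, -1]     # eigh sorts eigenvalues in ascending order

def var_along(alpha):
    alpha = alpha / np.linalg.norm(alpha)
    return alpha @ Sigma @ alpha

print(var_along(v_top), lam_top)        # equal (up to floating point)
# No random unit vector beats the top eigenvector:
trials = [var_along(rng.normal(size=5)) for _ in range(10_000)]
print(max(trials) <= lam_top + 1e-9)    # True
```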
- General case: $k \geq 1$
- Optimality of top $k$ eigenvectors
  What is the maximum variance "accounted for" by a $k$-dimensional subspace?
  Answer: the sum of the $k$ largest eigenvalues of $cov(\vec{X})$.
  Achieved by any orthonormal basis for the span of the $k$ corresponding eigenvectors (the top $k$ eigenvectors).
- New features from PCA
  Properties of the eigenfeatures $\vec{v}_i \cdot \vec{X}$ for $i = 1, 2, \dots, k$:
  $var(\vec{v}_i \cdot \vec{X}) = \lambda_i$ for all $i$
  $cov(\vec{v}_i \cdot \vec{X}, \vec{v}_j \cdot \vec{X}) = 0$ for all $i \neq j$
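A sampling check of these two eigenfeature properties; the example covariance and sample size are arbitrary illustrative choices:

```python
import numpy as np

# If Y_i = v_i . X, then var(Y_i) = lambda_i and cov(Y_i, Y_j) = 0 for i != j,
# i.e. the covariance of the eigenfeature vector V^T X is (approximately) diagonal.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma = A.T @ A
lams, V = np.linalg.eigh(Sigma)

X = rng.normal(size=(200_000, 4)) @ A   # samples with cov(X) ~= A^T A = Sigma
Y = X @ V                               # columns of Y are the eigenfeatures v_i . X
C = np.cov(Y, rowvar=False)
print(np.round(C, 2))                   # approximately diag(lambda_1, ..., lambda_d)
print(np.round(lams, 2))
```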
- PCA and beyond
Dimension reduction
PCR (principal component regression)
Whitening
- Postscript: Un-centered PCA
Eigenvector computation
- Simple approach to top eigenvector computation
- Power iteration (minimal sketch below)
- Some issues with power iteration
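A minimal power-iteration sketch for the top eigenvector of a symmetric PSD matrix (e.g. a covariance matrix); the fixed iteration count and random start are simplifying assumptions, with no convergence test, shifts, or deflation:

```python
import numpy as np

# Power iteration: repeatedly multiply by the matrix and renormalize;
# the iterate converges (up to sign) to the top eigenvector when the
# top eigenvalue is strictly larger than the rest.
def power_iteration(S, num_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.normal(size=S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = S @ v                       # multiply by the matrix ...
        v /= np.linalg.norm(v)          # ... and renormalize
    return v, v @ S @ v                 # eigenvector estimate, Rayleigh quotient

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
Sigma = A.T @ A
v_est, lam_est = power_iteration(Sigma)
lams, V = np.linalg.eigh(Sigma)
print(lam_est, lams[-1])                # should match the top eigenvalue
print(abs(v_est @ V[:, -1]))            # should be close to 1 (up to sign)
```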
Low-rank Approximation
- Two optimization methods
- Singular value decomposition
- Truncated SVD (see the sketch after this list)
- Connection to PCA
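A truncated-SVD sketch: keep the top $k$ singular values/vectors of a noisy low-rank matrix to get a rank-$k$ approximation; the matrix sizes, rank, and noise level are illustrative assumptions:

```python
import numpy as np

# Build a matrix that is (nearly) rank 8, then approximate it with the
# truncated SVD: keep only the k largest singular values and vectors.
rng = np.random.default_rng(0)
M = rng.normal(size=(50, 8)) @ rng.normal(size=(8, 40))   # rank <= 8 matrix
M = M + 0.01 * rng.normal(size=M.shape)                   # plus a little noise

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 8
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]               # rank-k approximation

print(np.linalg.norm(M - M_k) / np.linalg.norm(M))        # small relative error
print(s[:k + 2].round(3))   # singular values drop sharply after the first k
```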
Application: Matrix Completion
- Use a low-rank matrix to generate (approximate) a big matrix
Kernel machines and neural networks
Kernel machines
- Motivation
- Why polynomials?
  With little information about the target function, one can still hope to get good approximations using polynomials of high enough degree.
  Caveat: in the standard IID model, OLS with feature expansion may have bad MSE unless the sample size is much larger than the dimension of the feature expansion.
- Trick to compute dot products quickly (see the kernel sketch after this list)
- Living in the span
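A sketch of the "compute dot products quickly" trick for the degree-2 polynomial kernel $(1 + \vec{x} \cdot \vec{y})^2$; the explicit feature map `phi2` (with its $\sqrt{2}$ scalings) is included only to verify the identity, and the choice of this particular kernel is an assumption, since the notes do not fix one:

```python
import numpy as np
from itertools import combinations_with_replacement
from math import sqrt

# Kernel trick for degree-2 polynomial features: (1 + x.y)^2 equals the dot
# product of suitably scaled degree-<=2 feature expansions, but costs only
# one d-dimensional dot product instead of building O(d^2) features.
def phi2(x):
    # Features: 1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j (i < j)
    feats = [1.0]
    feats += [sqrt(2) * xi for xi in x]
    feats += [x[i] * x[j] if i == j else sqrt(2) * x[i] * x[j]
              for i, j in combinations_with_replacement(range(len(x)), 2)]
    return np.array(feats)

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
print(phi2(x) @ phi2(y))         # explicit expansion: O(d^2) features
print((1 + x @ y) ** 2)          # kernel evaluation: same value, O(d) work
```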