Review of 4721 Machine Learning

Contents
  • Overview of machine learning
    • Supervised Learning
  • Decision Tree
  • Model selection
    • Hyperparameters and Model selection
    • Cross Validation
  • Multivariate Gaussians
    • Review: Variance and covariance
    • Multivariate Gaussians
  • PCA/ SVD
    • PCA
    • Eigenvector computation
    • Low-rank Approximation
    • Application: Latent Semantic Analysis
    • Application: Matrix Completion
  • Kernel machines and neural networks
    • Kernel machines

Overview of machine learning

Supervised Learning
  • An ML algorithm takes training data as input and returns a predictor.
    • Predictor: a function that takes a feature vector as input and returns a prediction.
    • Classifier: a predictor that returns values from a discrete set
  • Feature engineering: deciding what goes into the feature vector
  • Feature extraction: transforming “raw input” into a feature vector
  • Model selection: choosing a predictor template (model, hypothesis class)
  • Training as optimization
    • Objective function: real-valued function of predictor parameters to optimize (max or min)
    • Optimization: find specific setting of parameters to optimize the objective
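
To make "training as optimization" concrete, here is a minimal sketch (not from the lecture; the name `train_linear_predictor` and the settings are made up): a linear predictor fit by minimizing a mean-squared-error objective with gradient descent.

```python
import numpy as np

def train_linear_predictor(X, y, lr=0.01, steps=1000):
    """Minimize the objective J(w) = mean((X @ w - y)**2) with gradient descent."""
    n, d = X.shape
    w = np.zeros(d)                           # parameters of the predictor
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # gradient of the objective at w
        w -= lr * grad                        # step toward a lower objective value
    return lambda x: x @ w                    # predictor: feature vector -> prediction

# Usage: predictor = train_linear_predictor(X_train, y_train); yhat = predictor(x_new)
```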

Statistical models of data

  • A labeled example in the training data: a random variable
  • A labeled example encountered in the future: a random variable with the same distribution as above, but independent of the training data
  • Assumption: the labeled examples in the training data are independent and identically distributed (IID)

Challenges in Supervised learning

  • Feature engineering and model selection are critical and rely on domain expertise
  • Procuring useful training data
  • Under-fitting (failure of optimization)
  • Over-fitting (failure of generalization)
  • And so on
Decision Tree

xxx

Model selection

Hyperparameters and Model selection
  • Many ML algorithms have hyperparameters: parameters of the ML algorithm itself (not the parameters of the predictor learned by the ML algorithm)
  • The data-driven way to set hyperparameters: model selection (hyperparameter tuning)
  • Evaluate different ML algorithms / hyperparameter settings using held-out test data
Cross Validation
  • Cross validation: a model selection method that simulates training/testing using only the training data
  • Variants: $K$-fold cross validation, leave-one-out cross validation
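
A minimal sketch of $K$-fold cross validation (illustrative only; `train` and `error` stand in for whichever ML algorithm and loss are being evaluated):

```python
import numpy as np

def k_fold_cv_error(X, y, train, error, k=5, seed=0):
    """Average held-out error of `train` over k folds of the training data."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]                                   # held-out fold
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        predictor = train(X[trn], y[trn])                # fit on the other k-1 folds
        errs.append(error(predictor, X[val], y[val]))    # evaluate on the held-out fold
    return float(np.mean(errs))
```

Leave-one-out cross validation is the special case `k = n`.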
Multivariate Gaussians

Review: Variance and covariance
  • Covariance matrix: $\vec{X} = (X_1, X_2, \dots, X_d)$, $cov(\vec{X})$
  • Linear functions: $\vec{\alpha}, \vec{\beta} \in R^d$

$E(\vec{\alpha}\cdot\vec{X}) = \vec{\alpha}\cdot E(\vec{X})$

$var(\vec{\alpha}\cdot\vec{X}) = \vec{\alpha}^T cov(\vec{X})\,\vec{\alpha}$

$cov(\vec{\alpha}\cdot\vec{X},\ \vec{\beta}\cdot\vec{X}) = \vec{\alpha}^T cov(\vec{X})\,\vec{\beta}$

  • Linear transformations: $A \in R^{d \times k}$

$E(A^T \vec{X}) = A^T E(\vec{X})$

$cov(A^T \vec{X}) = A^T cov(\vec{X})\, A$

  • Isotropic and anisotropic random vectors

$var(X_i) = 1$ for all $i$, and $cov(X_i, X_j) = 0$ for all $i \neq j$

$cov(\vec{X}) = I_d$

Note: isotropy of $\vec{X}$ does not mean that $X_1, \dots, X_d$ are independent.

Multivariate Gaussians
  • Standard multivariate Gaussian:

$Z_1, \dots, Z_d \sim N(0, 1)$

$\vec{Z} \sim N(\vec{0}, I_d)$

$E(\vec{Z}) = \vec{0}$

$cov(\vec{Z}) = I_d$

PDF: $p(\vec{z}) = (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\lVert\vec{z}\rVert_2^2\right)$
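
As a quick illustrative check (not part of the notes), sampling $\vec{Z} \sim N(\vec{0}, I_d)$ with NumPy and confirming that the empirical mean and covariance are close to $\vec{0}$ and $I_d$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 100_000
Z = rng.standard_normal((n, d))        # n draws of Z ~ N(0, I_d), one per row

print(Z.mean(axis=0))                  # approximately the zero vector
print(np.cov(Z, rowvar=False))         # approximately the identity I_d
```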

  • Affine functions 1:

Random variable: $X := \mu + \vec{\alpha} \cdot \vec{Z} = \mu + \sum_i \alpha_i Z_i$

Mean: $\mu$

Variance: $\sigma^2 := \lVert\vec{\alpha}\rVert_2^2$

Shorthand: $X \sim N(\mu, \sigma^2)$

PDF: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

  • Affine functions 2:

Random variables: $X_1$, $X_2$

PDF

  • General multivariate Gaussian distributions

$\vec{X} := \vec{\mu} + A^T \vec{Z}$

Mean: $\vec{\mu}$, Covariance: $\Sigma := A^T A$

Shorthand: $\vec{X} \sim N(\vec{\mu}, \Sigma)$

PDF (when $\Sigma$ is invertible): $p(\vec{x}) = \frac{1}{(2\pi)^{d/2}\lvert\Sigma\rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}(\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu})\right)$
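
A sketch of how this definition yields a sampler for $N(\vec{\mu}, \Sigma)$: take any $A$ with $A^T A = \Sigma$ (here a Cholesky factor, chosen purely for illustration) and transform standard Gaussian draws. The specific numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

L = np.linalg.cholesky(Sigma)          # Sigma = L @ L.T, so A = L.T satisfies A.T @ A = Sigma
A = L.T

n = 100_000
Z = rng.standard_normal((n, 2))        # rows are draws of Z ~ N(0, I_2)
X = mu + Z @ A                         # rows are draws of X = mu + A^T Z ~ N(mu, Sigma)

print(X.mean(axis=0))                  # approximately mu
print(np.cov(X, rowvar=False))         # approximately Sigma
```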

  • Marginalization

Let $\vec{X} = (X_1, X_2, \dots, X_d) \sim N(\vec{\mu}, \Sigma)$

What is the marginal distribution of $(X_1, X_2, \dots, X_k)$ for some $k < d$? It is again Gaussian: $N(\vec{\mu}_{1:k}, \Sigma_{1:k,1:k})$, i.e., keep the first $k$ entries of $\vec{\mu}$ and the top-left $k \times k$ block of $\Sigma$.

  • Conditioning

  • Estimation

PCA/ SVD PCA
  • Empirical (sample) covariance: the covariance matrix with expectations replaced by averages over the training data, $\widehat{cov}(\vec{X}) = \frac{1}{n}\sum_{i=1}^{n} (\vec{x}_i - \bar{x})(\vec{x}_i - \bar{x})^T$ where $\bar{x} = \frac{1}{n}\sum_i \vec{x}_i$, analogous to

$cov(\vec{X}) = E[(\vec{X} - E(\vec{X}))(\vec{X} - E(\vec{X}))^T]$

  • More properties of the covariance matrix

    • Symmetric

    • $d$ real eigenvalues $\lambda_1, \dots, \lambda_d$

    • corresponding eigenvectors $\vec{v}_1, \vec{v}_2, \dots, \vec{v}_d$ that are orthonormal

    • all eigenvalues are non-negative:

      • $\vec{v}_i^T cov(\vec{X})\vec{v}_i = \vec{v}_i^T(\lambda_i \vec{v}_i) = \lambda_i \lVert\vec{v}_i\rVert^2 = \lambda_i$
      • $\vec{v}_i^T cov(\vec{X})\vec{v}_i = var(\vec{v}_i \cdot \vec{X}) \geq 0$
    • Eigenvalue decomposition of $cov(\vec{X})$

      • $cov(\vec{X}) = \sum_{i=1}^{d} \lambda_i \vec{v}_i \vec{v}_i^T$
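
An illustrative numerical check of the properties just listed, using the empirical covariance of synthetic data (`numpy.linalg.eigh` returns the real eigenvalues and orthonormal eigenvectors of a symmetric matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))  # toy 4-d data

S = np.cov(X, rowvar=False)                  # empirical covariance, symmetric 4x4
lam, V = np.linalg.eigh(S)                   # eigenvalues (ascending) and orthonormal eigenvectors

print(np.all(lam >= -1e-12))                 # all eigenvalues are non-negative
print(np.allclose(V.T @ V, np.eye(4)))       # eigenvectors are orthonormal
print(np.allclose(S, (V * lam) @ V.T))       # S = sum_i lambda_i v_i v_i^T
```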
  • Data approximation

Note: PCA can be understood in two ways: (1) the sample points should be as close as possible to their projections onto the chosen subspace (line), or (2) the projections of the sample points onto that subspace (line) should be spread out as much as possible. The lecture uses the first view: the projection should "approximate" the original data as well as possible.
https://www.cnblogs.com/pinard/p/6239403.html

  • Special case: Coordinate subspaces

  • Special case: k = 1

Note: proved via the Pythagorean theorem; making the original data and its projection as "close" as possible is equivalent to making the projection as "large" (high-variance) as possible.

  • Maximum variance linear function of a random vector

The maximum of $var(\vec{\alpha} \cdot \vec{X})$ over unit vectors $\vec{\alpha}$ is $\lambda_1$, the top eigenvalue of $cov(\vec{X})$.

  • Optimality of top eigenvector

$var(\vec{\alpha} \cdot \vec{X}) = \sum_i \lambda_i (\vec{\alpha} \cdot \vec{v}_i)^2$

To put all the "weight" on $\lambda_1$, we need $(\vec{\alpha} \cdot \vec{v}_1)^2 = 1$; this is achieved by $\vec{\alpha} = \pm\vec{v}_1$.

  • General case: k >= 1

  • Optimality of top k eigenvectors

What is the maximum variance "accounted for" by a $k$-dimensional subspace?

Answer: the sum of the $k$ largest eigenvalues of $cov(\vec{X})$.

Achieved by any orthonormal basis for the span of the $k$ corresponding eigenvectors (the top $k$ eigenvectors).

  • New features from PCA

Properties of the eigenfeatures $\vec{v}_i \cdot \vec{X}$ for $i = 1, 2, \dots, k$:

$var(\vec{v}_i \cdot \vec{X}) = \lambda_i$ for all $i$
$cov(\vec{v}_i \cdot \vec{X},\ \vec{v}_j \cdot \vec{X}) = 0$ for all $i \neq j$
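
A sketch of computing the top-$k$ eigenfeatures on centered toy data and checking the two properties above (assumes one example per row; the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 5))   # toy data, one row per example

Xc = X - X.mean(axis=0)                        # center the data
S = np.cov(Xc, rowvar=False)                   # empirical covariance
lam, V = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]                  # sort eigenvalues in decreasing order
lam, V = lam[order], V[:, order]

k = 2
F = Xc @ V[:, :k]                              # eigenfeatures v_i . x for the top k eigenvectors

C = np.cov(F, rowvar=False)
print(np.allclose(np.diag(C), lam[:k]))                       # var(v_i . X) = lambda_i
print(np.allclose(C - np.diag(np.diag(C)), 0, atol=1e-10))    # cov(v_i . X, v_j . X) = 0 for i != j
```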

  • PCA and beyond

Dimension reduction
PCR (principal component regression)
Whitening

  • Postscript: Un-centered PCA
Eigenvector computation
  • Simple approach to top eigenvector computation

  • Power iteration

  • Some issue with Power iteration
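
A minimal sketch of power iteration for a symmetric PSD matrix such as a covariance matrix (one known issue: convergence is slow when the top two eigenvalues are close, and a random start can be nearly orthogonal to the top eigenvector):

```python
import numpy as np

def power_iteration(S, iters=200, seed=0):
    """Approximate the top eigenvector/eigenvalue of a symmetric PSD matrix S."""
    d = S.shape[0]
    v = np.random.default_rng(seed).standard_normal(d)
    v /= np.linalg.norm(v)                 # random unit-norm starting vector
    for _ in range(iters):
        v = S @ v                          # multiply by S ...
        v /= np.linalg.norm(v)             # ... and renormalize
    return v, float(v @ S @ v)             # eigenvector estimate and Rayleigh quotient

# Usage: v1, lam1 = power_iteration(np.cov(X, rowvar=False))
```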

Low-rank Approximation
  • Low-rank Approximation

  • Two optimization methods

  • Singular value decomposition

  • Truncated SVD

  • Connection to PCA
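
A sketch of rank-$k$ approximation via truncated SVD (optimal in Frobenius/spectral norm by Eckart-Young). For centered data, the right singular vectors coincide with the top eigenvectors of the empirical covariance, which is the connection to PCA. The example data below is made up.

```python
import numpy as np

def truncated_svd_approx(M, k):
    """Best rank-k approximation of M in Frobenius/spectral norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 8)) @ rng.standard_normal((8, 30))   # a rank-8 matrix

M2 = truncated_svd_approx(M, 2)
print(np.linalg.matrix_rank(M2))                       # 2
print(np.linalg.norm(M - truncated_svd_approx(M, 8)))  # ~0: the rank-8 approximation is exact here
```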

Application: Latent Semantic Analysis

Topic

Application: Matrix Completion

Use a low-rank approximation to fill in the missing entries of a large, partially observed matrix.
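
One simple illustrative way to exploit the low-rank assumption for matrix completion: fill the missing entries with zeros, then alternate a truncated SVD with restoring the observed entries. This is a sketch of the "hard impute" idea, not necessarily the method presented in the course.

```python
import numpy as np

def complete_matrix(M_obs, mask, k, iters=100):
    """Fill in missing entries of a matrix assumed to be (approximately) rank k.

    M_obs: matrix with arbitrary values where mask is False.
    mask:  boolean matrix, True where the entry is observed.
    """
    M = np.where(mask, M_obs, 0.0)                          # initialize missing entries with 0
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        low_rank = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # project onto rank-k matrices
        M = np.where(mask, M_obs, low_rank)                 # keep observed entries, impute the rest
    return M
```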

Kernel machines and neural networks

Kernel machines
  • Motivation

  • Why polynomials?

With little information about the target function, we can still hope to get good approximations using polynomials of high enough degree.

Caveat: in the standard IID model, OLS with polynomial feature expansion may have bad MSE unless the sample size is much larger than the dimension of the feature expansion.

  • Trick to compute dot products quickly

  • Living in the span
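
As an illustration of the dot-product trick (an assumed example, not from the slides): for the degree-2 polynomial expansion, the kernel value $(1 + \vec{x}\cdot\vec{z})^2$ equals the dot product of explicit $O(d^2)$-dimensional feature expansions, but costs only $O(d)$ to compute.

```python
import numpy as np
from itertools import combinations

def phi2(x):
    """Explicit degree-2 polynomial expansion whose dot products match (1 + x.z)**2."""
    d = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * xi for xi in x]                                      # linear terms
    feats += [xi * xi for xi in x]                                              # squared terms
    feats += [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(d), 2)]   # cross terms
    return np.array(feats)

rng = np.random.default_rng(0)
x, z = rng.standard_normal(5), rng.standard_normal(5)

print(phi2(x) @ phi2(z))        # dot product in the O(d^2)-dimensional expanded space
print((1.0 + x @ z) ** 2)       # same value, computed with a single O(d) dot product
```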
