- Overview of machine learning
- Supervised Learning
- Decision Tree
- Model selection
- Hyperparameters and Model selection
- Cross Validation
- Multivariate Gaussians
- Review: Variance and covariance
- Multivariate Gaussians
- PCA/ SVD
- PCA
- Eigenvector computation
- Low-rank Approximation
- Application: Latent Semantic Analysis
- Application: Matrix Completion
- Kernel machines and neural networks
- Kernel machines
- An ML algorithm takes training data as input and returns a predictor.
- Predictor: a function that takes a feature vector as input and returns a prediction.
- Classifier: a predictor that returns values from a discrete set
- Feature engineering: deciding what goes into the feature vector
- Feature extraction: transforming “raw input” into a feature vector
- Model selection: choosing a predictor template (model, hypothesis class)
- Training as optimization
- Objective function: real-valued function of predictor parameters to optimize (max or min)
- Optimization: find specific setting of parameters to optimize the objective
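A minimal sketch of this "training as optimization" view, assuming a toy 1-D linear predictor and a squared-error objective; the data, step size, and iteration count below are illustrative choices, not from the notes:

```python
import numpy as np

# Sketch: fit f(x) = w*x + b by minimizing mean squared error (the objective)
# with gradient descent (the optimization). Toy data, not from the notes.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=100)   # noisy linear data

w, b = 0.0, 0.0          # parameters of the predictor
lr = 0.1                 # step size (a hyperparameter of the ML algorithm)

for _ in range(500):
    err = (w * x + b) - y
    # Gradients of the mean-squared-error objective w.r.t. w and b.
    grad_w = 2 * np.mean(err * x)
    grad_b = 2 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # should end up close to (2.0, 1.0)
```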
Statistical models of data
- Labeled example in training data: a random variable
- Labeled example encountered in future: random variable with same distribution as above, but independent of training data
- Assumption: labeled examples in training data are independent & identically distributed (IID)
Challenges in supervised learning
- Feature engineering and model selection are critical and rely on domain expertise
- Procuring useful training data
- Under-fitting (failure of optimization)
- Over-fitting (failure of generalization)
- And so on
Model selection
Hyperparameters and Model selection
- Many ML algorithms have hyperparameters: parameters of the ML algorithm itself (not the parameters of the predictor learned by the ML algorithm)
- The data-driven way to set hyperparameters: Model selection (hyperparameter tuning)
- Evaluate different ML algorithms with different hyperparameters by using test data
- Cross validation: model selection method that simulates training/testing using only training data
- K-fold cross validation, leave-one-out cross validation, etc. (a minimal K-fold sketch follows below)
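A minimal sketch of K-fold cross validation for picking one hyperparameter, here a ridge-regression penalty `lam`; the synthetic data, candidate values, and helper names (`ridge_fit`, `cv_mse`) are illustrative assumptions, not from the notes:

```python
import numpy as np

# K-fold cross validation: simulate train/test splits using only training data,
# then pick the hyperparameter with the best average validation error.
rng = np.random.default_rng(0)
n, d = 120, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

def ridge_fit(X, y, lam):
    # Closed-form ridge regression: (X^T X + lam*I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, k=5):
    # Split indices into k folds; train on k-1 folds, validate on the held-out one.
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[trn], y[trn], lam)
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errs)

candidates = [0.01, 0.1, 1.0, 10.0]
best = min(candidates, key=lambda lam: cv_mse(X, y, lam))
print("selected lam:", best)
```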
Multivariate Gaussians
Review: Variance and covariance
- Covariance matrix: for $\vec{X} = (X_1, X_2, \dots, X_d)$, $cov(\vec{X})$ is the $d \times d$ matrix with entries $cov(X_i, X_j)$
- Linear functions: for $\vec{\alpha} \in \mathbb{R}^d$, $\vec{\beta} \in \mathbb{R}^d$:
  $E(\vec{\alpha} \cdot \vec{X}) = \vec{\alpha} \cdot E(\vec{X})$
  $var(\vec{\alpha} \cdot \vec{X}) = \vec{\alpha}^T cov(\vec{X}) \vec{\alpha}$
  $cov(\vec{\alpha} \cdot \vec{X}, \vec{\beta} \cdot \vec{X}) = \vec{\alpha}^T cov(\vec{X}) \vec{\beta}$
- Linear transformations: for $A \in \mathbb{R}^{d \times k}$:
  $E(A^T \vec{X}) = A^T E(\vec{X})$
  $cov(A^T \vec{X}) = A^T cov(\vec{X}) A$
- Isotropic and anisotropic random vectors
  $\vec{X}$ is isotropic if $var(X_i) = 1$ for all $i$ and $cov(X_i, X_j) = 0$ for all $i \neq j$, i.e. $cov(\vec{X}) = I_d$
  Note: isotropy of $\vec{X}$ does not mean $X_1, \dots, X_d$ are independent (numerical illustration below).
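A small numerical illustration of the note above: an isotropic pair that is not independent. The particular construction (a standard normal times an independent random sign) is a standard textbook-style example, not taken from the notes:

```python
import numpy as np

# X = (X1, X2) with cov(X) = I_2, yet X1 and X2 are clearly dependent.
rng = np.random.default_rng(0)
n = 200_000
x1 = rng.normal(size=n)
r = rng.choice([-1.0, 1.0], size=n)    # random sign, independent of x1
x2 = r * x1                             # same magnitude as x1, random sign

X = np.stack([x1, x2], axis=1)
print(np.cov(X, rowvar=False))          # approximately the 2x2 identity: isotropic
# But |X1| == |X2| always, so the coordinates are dependent:
print(np.corrcoef(x1**2, x2**2)[0, 1])  # exactly 1, not 0
```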
Multivariate Gaussians
- Standard multivariate Gaussian:
  $Z_1, \dots, Z_d \sim N(0, 1)$ independently
  $\vec{Z} \sim N(\vec{0}, I_d)$
  $E(\vec{Z}) = \vec{0}$
  $cov(\vec{Z}) = I_d$
- Affine functions 1:
  Random variable: $X := \mu + \vec{\alpha} \cdot \vec{Z} = \mu + \sum_i \alpha_i Z_i$
  Mean: $\mu$
  Variance: $\sigma^2 := \|\vec{\alpha}\|_2^2$
  Shorthand: $X \sim N(\mu, \sigma^2)$
- Affine functions 2:
  Random variables: $X_1$, $X_2$
- General multivariate Gaussian distributions
  $\vec{X} := \vec{\mu} + A^T \vec{Z}$
  Mean: $\vec{\mu}$, Covariance: $\Sigma := A^T A$
  Shorthand: $\vec{X} \sim N(\vec{\mu}, \Sigma)$
- Marginalization
  Let $\vec{X} = (X_1, X_2, \dots, X_d) \sim N(\vec{\mu}, \Sigma)$.
  What is the marginal distribution of $(X_1, X_2, \dots, X_k)$ for some $k < d$? It is again Gaussian, with mean the corresponding sub-vector of $\vec{\mu}$ and covariance the corresponding $k \times k$ sub-block of $\Sigma$ (sampling sketch below).
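A sampling sketch of the construction $\vec{X} = \vec{\mu} + A^T \vec{Z}$ and of the marginalization claim; the dimensions, $\vec{\mu}$, and $A$ are arbitrary illustrative choices:

```python
import numpy as np

# Build X = mu + A^T Z with Z ~ N(0, I_d); check mean, covariance, and the
# marginal of the first k coordinates against mu[:k] and Sigma[:k, :k].
rng = np.random.default_rng(0)
d, k, n = 4, 2, 200_000
mu = np.array([1.0, -2.0, 0.5, 3.0])
A = rng.normal(size=(d, d))
Sigma = A.T @ A

Z = rng.normal(size=(n, d))          # rows are iid N(0, I_d) draws
X = mu + Z @ A                       # row-wise version of X = mu + A^T Z

print(np.abs(X.mean(axis=0) - mu).max())                          # small sampling error
print(np.abs(np.cov(X, rowvar=False) - Sigma).max())              # small sampling error
print(np.abs(np.cov(X[:, :k], rowvar=False) - Sigma[:k, :k]).max())  # marginal matches sub-block
```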
- Conditioning
- Estimation
- Empirical (sample) covariance: replace the expectations in
  $cov(\vec{X}) = E[(\vec{X} - E(\vec{X}))(\vec{X} - E(\vec{X}))^T]$
  with averages over the sample (sketch below).
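A minimal sketch of the empirical covariance; it uses the $1/n$ convention (whether the notes use $1/n$ or $1/(n-1)$ is not specified here):

```python
import numpy as np

# Empirical covariance: subtract the sample mean, then average outer products.
def empirical_cov(X):
    """X: (n, d) data matrix; returns the d x d sample covariance (1/n convention)."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.diag([1.0, 2.0, 3.0])
print(empirical_cov(X))                     # close to diag(1, 4, 9)
print(np.cov(X, rowvar=False, bias=True))   # same computation via numpy (1/n convention)
```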
- More properties of the covariance matrix
  - Symmetric
  - $d$ real eigenvalues $\lambda_1, \dots, \lambda_d$
  - corresponding eigenvectors $\vec{v}_1, \vec{v}_2, \dots, \vec{v}_d$ that are orthonormal
  - all eigenvalues are non-negative:
    $\vec{v}_i^T cov(\vec{X}) \vec{v}_i = \vec{v}_i^T (\lambda_i \vec{v}_i) = \lambda_i$
    $\vec{v}_i^T cov(\vec{X}) \vec{v}_i = var(\vec{v}_i \cdot \vec{X}) \geq 0$
  - Eigenvalue decomposition of $cov(\vec{X})$:
    $cov(\vec{X}) = \sum_i \lambda_i \vec{v}_i \vec{v}_i^T$
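A quick numerical check of these properties on an example covariance matrix, built as $A^T A$ so it is a valid covariance; `numpy.linalg.eigh` is used for the symmetric eigendecomposition:

```python
import numpy as np

# Check: symmetric, non-negative eigenvalues, orthonormal eigenvectors,
# and cov(X) = sum_i lambda_i v_i v_i^T.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma = A.T @ A                         # symmetric positive semidefinite example

lams, V = np.linalg.eigh(Sigma)         # eigh: eigendecomposition for symmetric matrices
print(np.all(lams >= -1e-10))           # non-negative eigenvalues
print(np.allclose(V.T @ V, np.eye(4)))  # orthonormal eigenvectors (columns of V)
recon = sum(lam * np.outer(v, v) for lam, v in zip(lams, V.T))
print(np.allclose(recon, Sigma))        # eigenvalue decomposition reconstructs Sigma
```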
- Data approximation
  Note: there are two ways to understand PCA. One interpretation is that the sample points should be close to the projection subspace (line); the other is that the projections of the sample points onto that subspace (line) should be spread out as much as possible. The lecturer uses the first interpretation: make the projection "mimic" the original data as well as possible.
  https://www.cnblogs.com/pinard/p/6239403.html
- Special case: Coordinate subspaces
- Special case: k = 1
  Note: proved via the Pythagorean theorem: making the distance between the original data and its projection small is equivalent to making the projection as large as possible.
- Maximum variance linear function of a random vector
  Answer: the top eigenvalue $\lambda_1$ of $cov(\vec{X})$ (over unit-norm linear functions $\vec{\alpha} \cdot \vec{X}$).
- Optimality of top eigenvector
  $var(\vec{\alpha} \cdot \vec{X}) = \sum_i \lambda_i (\vec{\alpha} \cdot \vec{v}_i)^2$
  To put all the "weight" on $\lambda_1$, we need $(\vec{\alpha} \cdot \vec{v}_1)^2 = 1$; achieved by $\vec{\alpha} = \pm \vec{v}_1$ (numerical check below).
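A numerical check of this optimality claim on an example covariance matrix; the matrix and the number of random trial directions are arbitrary choices:

```python
import numpy as np

# Among unit vectors alpha, var(alpha . X) = alpha^T cov(X) alpha is largest
# at alpha = +/- v_1, the top eigenvector, where it equals lambda_1.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Sigma = A.T @ A

lams, V = np.linalg.eigh(Sigma)
lam_top, v_top = lams[-1], V[:, -1]     # eigh sorts eigenvalues in ascending order

def var_along(alpha):
    alpha = alpha / np.linalg.norm(alpha)
    return alpha @ Sigma @ alpha

print(var_along(v_top), lam_top)        # equal (up to floating point)
# No random unit vector beats the top eigenvector:
trials = [var_along(rng.normal(size=5)) for _ in range(10_000)]
print(max(trials) <= lam_top + 1e-9)    # True
```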
- General case: $k \geq 1$
- Optimality of top $k$ eigenvectors
  What is the maximum variance "accounted for" by a $k$-dimensional subspace?
  Answer: the sum of the $k$ largest eigenvalues of $cov(\vec{X})$.
  Achieved by any orthonormal basis for the span of the $k$ corresponding eigenvectors (the top $k$ eigenvectors).
- New features from PCA
  Properties of the eigenfeatures $\vec{v}_i \cdot \vec{X}$ for $i = 1, 2, \dots, k$:
  $var(\vec{v}_i \cdot \vec{X}) = \lambda_i$ for all $i$
  $cov(\vec{v}_i \cdot \vec{X}, \vec{v}_j \cdot \vec{X}) = 0$ for all $i \neq j$
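A sampling check of these two eigenfeature properties; the example covariance and sample size are arbitrary illustrative choices:

```python
import numpy as np

# If Y_i = v_i . X, then var(Y_i) = lambda_i and cov(Y_i, Y_j) = 0 for i != j,
# i.e. the covariance of the eigenfeature vector V^T X is (approximately) diagonal.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma = A.T @ A
lams, V = np.linalg.eigh(Sigma)

X = rng.normal(size=(200_000, 4)) @ A   # samples with cov(X) ~= A^T A = Sigma
Y = X @ V                               # columns of Y are the eigenfeatures v_i . X
C = np.cov(Y, rowvar=False)
print(np.round(C, 2))                   # approximately diag(lambda_1, ..., lambda_d)
print(np.round(lams, 2))
```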
- PCA and beyond
Dimension reduction
PCR (principal component regression)
Whitening
- Postscript: Un-centered PCA
Eigenvector computation
- Simple approach to top eigenvector computation
- Power iteration (minimal sketch below)
- Some issues with power iteration
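A minimal power-iteration sketch for the top eigenvector of a symmetric PSD matrix (e.g. a covariance matrix); the fixed iteration count and random start are simplifying assumptions, with no convergence test, shifts, or deflation:

```python
import numpy as np

# Power iteration: repeatedly multiply by the matrix and renormalize;
# the iterate converges (up to sign) to the top eigenvector when the
# top eigenvalue is strictly larger than the rest.
def power_iteration(S, num_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.normal(size=S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = S @ v                       # multiply by the matrix ...
        v /= np.linalg.norm(v)          # ... and renormalize
    return v, v @ S @ v                 # eigenvector estimate, Rayleigh quotient

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
Sigma = A.T @ A
v_est, lam_est = power_iteration(Sigma)
lams, V = np.linalg.eigh(Sigma)
print(lam_est, lams[-1])                # should match the top eigenvalue
print(abs(v_est @ V[:, -1]))            # should be close to 1 (up to sign)
```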
Low-rank Approximation
- Two optimization methods
- Singular value decomposition
- Truncated SVD (see the sketch after this list)
- Connection to PCA
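A truncated-SVD sketch: keep the top $k$ singular values/vectors of a noisy low-rank matrix to get a rank-$k$ approximation; the matrix sizes, rank, and noise level are illustrative assumptions:

```python
import numpy as np

# Build a matrix that is (nearly) rank 8, then approximate it with the
# truncated SVD: keep only the k largest singular values and vectors.
rng = np.random.default_rng(0)
M = rng.normal(size=(50, 8)) @ rng.normal(size=(8, 40))   # rank <= 8 matrix
M = M + 0.01 * rng.normal(size=M.shape)                   # plus a little noise

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 8
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]               # rank-k approximation

print(np.linalg.norm(M - M_k) / np.linalg.norm(M))        # small relative error
print(s[:k + 2].round(3))   # singular values drop sharply after the first k
```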
Application: Matrix Completion
- Use a low-rank matrix to generate (approximate) a big matrix
Kernel machines and neural networks
Kernel machines
- Motivation
- Why polynomials?
  With little information about the target function, one can still hope to get good approximations using polynomials of high enough degree.
  Caveat: in the standard IID model, OLS with feature expansion may have bad MSE unless the sample size is much larger than the dimension of the feature expansion.
- Trick to compute dot products quickly (see the kernel sketch after this list)
- Living in the span
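A sketch of the "compute dot products quickly" trick for the degree-2 polynomial kernel $(1 + \vec{x} \cdot \vec{y})^2$; the explicit feature map `phi2` (with its $\sqrt{2}$ scalings) is included only to verify the identity, and the choice of this particular kernel is an assumption, since the notes do not fix one:

```python
import numpy as np
from itertools import combinations_with_replacement
from math import sqrt

# Kernel trick for degree-2 polynomial features: (1 + x.y)^2 equals the dot
# product of suitably scaled degree-<=2 feature expansions, but costs only
# one d-dimensional dot product instead of building O(d^2) features.
def phi2(x):
    # Features: 1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j (i < j)
    feats = [1.0]
    feats += [sqrt(2) * xi for xi in x]
    feats += [x[i] * x[j] if i == j else sqrt(2) * x[i] * x[j]
              for i, j in combinations_with_replacement(range(len(x)), 2)]
    return np.array(feats)

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
print(phi2(x) @ phi2(y))         # explicit expansion: O(d^2) features
print((1 + x @ y) ** 2)          # kernel evaluation: same value, O(d) work
```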