title: The Category-theoretic Perspective of Statistical Learning for Amateurs
author: Congwei Song
description: A presentation at BIMSA


The Category-theoretic Perspective of Statistical Learning for Amateurs

Congwei Song
Email: william@bimsa.cn
Research Interest: Machine Learning, Wavelet Analysis

Abstract Statistical learning is a fascinating field that has long been the mainstream of machine learning/artificial intelligence. It has produced a large number of results that are widely applied to real-world problems, and it continues to raise new research topics and stimulate new research. This report summarizes some classic statistical learning models and well-known algorithms, especially for amateurs, and provides a category-theoretic perspective for understanding statistical learning models. The goal is to attract researchers from other fields, including pure mathematics, to participate in research related to statistical learning.

Keywords Statistical Learning, Statistics, Category Theory, Variational Models, Neural Networks, Deep Learning

Abbreviations
distr.: distribution(s)
var.: variable(s)
cat.: category(ies)
rv: random variable(s)

Notations

  • $P(X)$: distr. of the target rv $X$
  • $P(X|\theta)$: parametric distr.

Introduction

  • Probability theory/Statistics
  • Category theory
  • Statistical Learning
  • Classical models of statistical learning
  • Advanced models
  • Misc

Probability theory

Definition (Probability Model)
The probability model is a probability measure space, denoted by $(\Omega, \mathcal{A}, P)$; or $(\mathcal{X}, P(X))$, as its pushforward by the rv $X$, where $\mathcal{X}$ is the sample space of $X$.

$X\sim P$: the distr. of $X$ is $P$, or draw $X$ from $P$


Statistics

Definition (Statistical Model)
The statistical model is a family/set of probability models, denoted by $(\Omega, \mathcal{A}, \{P_\lambda\})$ (with a common ambient space) or $(\mathcal{X}, \{P_\lambda\})$ (with a common sample space, namely the range of a target rv $X$), denoted by $P(X)$ for short.

Parameterized version: $(\mathcal{X}, \{P(X|\theta)\}, \theta\in\Theta)$, where $\Theta$ is the parameter space, denoted by $P(X|\theta)$ for short.

Example
$N(\mu,\sigma^2)$, $\mathrm{Cat}(p)$
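
To make the definition concrete, here is a minimal Python sketch (an illustration, not from the talk) of a parametric family $\{P(X|\theta)\}$ realized with scipy.stats; the Gaussian family and the names `model` and `theta` are assumptions for the example.

```python
# A minimal sketch: a statistical model as a map theta |-> P(X|theta),
# here the Gaussian family N(mu, sigma^2) via scipy.stats.
from scipy import stats

def model(theta):
    """Map a parameter theta = (mu, sigma) to a probability model P(X|theta)."""
    mu, sigma = theta
    return stats.norm(loc=mu, scale=sigma)

P = model((0.0, 1.0))   # one member of the family: N(0, 1)
X = P.rvs(size=5)       # "X ~ P": draw a sample of size 5
print(P.pdf(0.0), X)
```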


Definition (Bayesian Model)
The Bayesian model is a statistical model with a prior distr. of the parameters, written as
$(M_\theta, p(\theta))$, where $M_\theta$ is a given statistical model.

Definition (Bayesian Hierarchical Model)
$(M_\theta, p(\theta|\alpha), p(\alpha))$
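
As a hedged illustration (all three levels are assumed Gaussian purely for concreteness), ancestral sampling from such a hierarchical model proceeds top-down:

```python
# A minimal sketch: ancestral sampling from a Bayesian hierarchical model
# (M_theta, p(theta|alpha), p(alpha)); the Gaussian choices are illustrative.
import numpy as np

rng = np.random.default_rng(0)

alpha = rng.normal(0.0, 1.0)         # alpha ~ p(alpha), the hyperprior
theta = rng.normal(alpha, 1.0)       # theta ~ p(theta|alpha), the prior
x = rng.normal(theta, 1.0, size=10)  # X ~ P(X|theta), the statistical model
print(alpha, theta, x.mean())
```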


Category theory of Statistical Models

  • $\mathcal{Prob}$: cat. of all probability models
  • $\mathcal{Prob}_X$: sub-cat. of $\mathcal{Prob}$ with the form $(\mathcal{X}, P(X))$
  • $\mathcal{Prob}_{Y|x}$: sub-cat. of $\mathcal{Prob}$ taking the form $(\mathcal{Y}, P(Y|x))$ with the conditional var. $x$
  • $\mathcal{Stat}$: cat. of the statistical models
  • $\mathcal{Stat}_X$: cat. of the statistical models of the target rv $X$
  • $\mathcal{Stat}_{Y|x}$: cat. of the statistical models of the target rv $Y|x$ with conditional var. $x$
  • $\mathcal{Bayes}$: cat. of the Bayesian models

$\mathcal{Stat}$ is regarded as a sub-cat. of $\mathcal{Bayes}$ via the flat prior. The Bayesian model gives the joint distr. $P(x,\theta)$; therefore the cat. of the Bayesian models is a sub-cat. of $\mathcal{Prob}$.


Estimator

Definition (Statistical model with an estimator)
model with estimator: $(M_\theta, \hat\theta(X))$, where $\hat\theta: \mathcal{X}^N\to \Theta$ and $X$ is a sample of size $N$.

In most cases, we use the MLE; the estimator is then implied by the model.
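
For instance, a minimal sketch of $(M_\theta, \hat\theta(X))$ for the Gaussian family, where the MLE has a closed form; the true parameters and sample size are illustrative:

```python
# A minimal sketch of a model with an estimator: the Gaussian MLE
# hat{theta} : X^N -> Theta, with Theta the (mu, sigma) plane.
import numpy as np

def theta_hat(X):
    """Closed-form MLE of (mu, sigma) for N(mu, sigma^2)."""
    return X.mean(), X.std()   # X.std() uses ddof=0, i.e. the MLE

rng = np.random.default_rng(0)
X = rng.normal(2.0, 3.0, size=1000)  # a sample of size N = 1000
print(theta_hat(X))                  # approximately (2.0, 3.0)
```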


Statistical Learning

supervised learning (discriminative form):
$(\mathcal{X}, \mathcal{Y}, P(Y|X))$, where $X$ is the input (conditional var.) and $Y$ is the output

Supervised learning based on a sample $X=\{x_i\}$ is identified with the statistical model
$(\mathcal{Y}^N, P(Y|X)=\prod_i P(y_i|x_i))$,
where the sample $X$ is fixed, called the design var. (design matrix, if it forms a matrix), and $Y$ is a sample point (a sample of size 1).
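
A small sketch of this identification, assuming a linear-Gaussian model: with the design fixed, maximizing $\prod_i P(y_i|x_i)$ under $P(y|x)=N(wx+b, \sigma^2)$ is exactly ordinary least squares.

```python
# A minimal sketch: supervised learning as the conditional statistical model
# P(y|x) = N(w*x + b, sigma^2), fitted by least squares (the Gaussian MLE).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)              # the fixed design variable
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 100)   # Y drawn from P(Y|X)

A = np.column_stack([x, np.ones_like(x)])
w, b = np.linalg.lstsq(A, y, rcond=None)[0]   # MLE under the Gaussian model
print(w, b)   # approximately 3.0, 1.0
```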

I claim: statistical learning == conditionalized statistics (model)


Facts in statistical learning are also facts in statistics

Bias–variance decomposition in statistics: $\text{Error} = \text{Bias}^2 + \text{Variance}$; in statistical learning: $\text{Error} = \text{Bias}^2 + \text{Variance}$, conditioned on the input variable.
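
A quick Monte Carlo check of the decomposition, using the (biased) MLE of $\sigma^2$ as the estimator; the identity $\text{MSE} = \text{Bias}^2 + \text{Variance}$ holds exactly for the simulated values, so the two printed numbers agree up to floating-point error.

```python
# A minimal sketch verifying E[(hat{theta} - theta)^2] = Bias^2 + Variance
# for the MLE of sigma^2 (which is biased); all numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
sigma2, N, trials = 4.0, 10, 20_000
est = np.array([rng.normal(0.0, 2.0, N).var() for _ in range(trials)])

mse = np.mean((est - sigma2) ** 2)
bias2 = (est.mean() - sigma2) ** 2
var = est.var()
print(mse, bias2 + var)   # the two agree
```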


Classical Models in Statistical Learning (SL)

  • Supervised Learning
    • Regression: $P(Y|X)$, Linear Regression, Ridge/LASSO Regression ($X$ is the conditional rv)
    • Classification: $P(Y|X)$, Logistic Regression/LDA(/QDA)/Naive Bayes Classifier
  • Unsupervised Learning
    • Clustering: $P(X,Z)$, where $Z$ is unobservable, K-means/GMM
    • Dimension Reduction: $P(X,Z)$, PCA/ICA/MNF
    • Latent Variable Models: $P(X,Z)$, mixture models (GMM), pLSA (see the sampling sketch after this list)
    • Hidden Markov Model: $P^*(X_{1:T}, Z_{1:T})$, where $(X_t,Z_t)\sim P$
  • Others
    • Time sequence: $P^*(X_{1:T})$, ARMA
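
As promised in the latent-variable item above, a minimal sketch (an assumed example, not from the talk) of ancestral sampling from $P(X,Z)$ with a two-component Gaussian mixture:

```python
# A minimal sketch of a latent-variable model P(X, Z): ancestral sampling
# Z ~ Cat(p), X | Z ~ N(mu_Z, 1); all numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.3, 0.7])     # mixing weights of Cat(p)
mu = np.array([-2.0, 2.0])   # component means

Z = rng.choice(2, size=1000, p=p)  # the unobservable latent variable
X = rng.normal(mu[Z], 1.0)         # the observed variable X given Z
print(X.mean())                    # approximately p @ mu = 0.8
```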

Learner

A learner is an estimator for a statistical model:
$(M_\theta, \hat\theta(X))$,
where $M_\theta$ is a statistical model.

One can define a latent model as $(P(X,Z), \hat\theta(X))$ for unsupervised learning, and $(P(X,Y), \hat\theta(X,Y))$ for supervised learning.


Beginners’ Magic Cube

The classical models are all categories, and we have a diagram relating them. I’d like to call it “the beginners’ magic cube”, since it looks like a cube and beginners in SL should learn these models first.

(Figure: the beginners’ magic cube)


Probabilistic Graphical Models

Another way to describe statistical (learning) models.

  • Bayesian Network: directed acyclic graph
  • Markov Network (Random Field): undirected graph

Methods

Methods as Functors:

  • Kernel trick: $X\to \phi(X)$ (see the sketch after this list)
  • Localization/Smoothing: $\sum_i K(x_0,x_i)\,l(x_i,y_i)$
  • Hierarchical model: $P(X,Z_1,\cdots,Z_n,Y)$, where $X\to Z_1\to\cdots\to Z_n\to Y$ usually forms a Markov chain and $Z_1,\cdots,Z_n$ are hidden
  • Variational trick: $(P(X,Z), Q(Z|X))$, where $Q(Z|X)$ is the variational distr.
  • Neural network (NN): $f(x)\to \mathrm{Net}(x)$
  • Stochastic method/Monte Carlo method: importance sampling/MCMC
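
As referenced in the kernel-trick item, a minimal sketch assuming an explicit degree-2 feature map $\phi$: a problem nonlinear in $x$ becomes linear, and solvable by least squares, in $\phi(x)$.

```python
# A minimal sketch of the kernel trick X -> phi(X) with an explicit
# polynomial feature map; a full kernel method would use K(x, x') instead.
import numpy as np

def phi(x):
    """Degree-2 feature map phi(x) = (x, x^2)."""
    return np.column_stack([x, x ** 2])

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = x ** 2 + rng.normal(0, 0.1, 200)   # nonlinear in x, linear in phi(x)

A = np.column_stack([phi(x), np.ones_like(x)])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
print(coef)   # approximately (0, 1, 0)
```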

Advanced Models I

Neural models: models equipped with neural networks, i.e., neural networks applied within statistical models.

  • Neural models: MLP, embedding an NN into regression models
  • RNN/LSTM: a conditional HMM implemented by an NN
  • Neural autoencoder: NLPCA, embedding an NN into the autoencoder
  • Probabilistic neural autoencoder: Variational Autoencoder (VAE; the stochastic perturbation affects the outputs of the layers)
  • Stochastic NN: Dropout (the stochastic perturbation affects the weights of the layers)
  • Normalizing Flow: reparameterization, as a non-stochastic hierarchical VAE
  • Hierarchical VAE: Diffusion Model/Consistency Model

Beginners’ Star

(Figure: the beginners’ star)


Creating an advanced model I

Take VAE as an example

$$P(X)\sim N(\mu,\sigma^2) \to P(X|Z)\sim N(f(Z),\sigma^2),\ P(Z)\sim N(0,1)$$
$$\to (P(X,Z),\ Q(Z|X=x)\sim N(g(x),h(x)))$$
$$\to (P(X,Z),\ Q(Z|X=x):\ Z = g(x)+\xi h(x)),\ \xi\sim N(0,1)$$

Written in the style of a composition of functors (informally):
$\mathrm{VAE}(f,g,h) = \mathrm{Rep}\circ \mathrm{Var}\circ\mathrm{LVM}(P(X))$, regarding the functions $f,g,h$ as parameters.

The implementation of VAE by the following NN (with a regularizing term):
$$y \sim f(g(x)+h(x)\xi),$$
through self-supervised learning with data $\{(x_i,x_i)\}$, where $f,g,h$ are all neural layers and $\xi\sim N(0,1)$ is the perturbation variable of the hidden layer $g$. When $\xi\to 0$, $Q$ degenerates, and the VAE degenerates to an ordinary NN $f(g(x))$.
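
A hedged PyTorch sketch of this implementation: `f`, `g`, `h` are single linear layers chosen purely for illustration (here `h` outputs the log of the scale, for positivity), and the regularizing term is the usual KL divergence of $Q(Z|x)$ from $N(0,1)$.

```python
# A minimal sketch of y ~ f(g(x) + h(x)*xi) with the KL regularizer;
# layer sizes and the use of log-std are illustrative assumptions.
import torch
import torch.nn as nn

d_x, d_z = 8, 2
g = nn.Linear(d_x, d_z)   # encoder mean g(x)
h = nn.Linear(d_x, d_z)   # encoder log-scale; exp(h(x)) plays the role of h(x)
f = nn.Linear(d_z, d_x)   # decoder f(z)

def vae_loss(x):
    mu, log_std = g(x), h(x)
    xi = torch.randn_like(mu)             # xi ~ N(0, 1)
    z = mu + xi * log_std.exp()           # reparameterization z = g(x) + h(x)*xi
    recon = ((f(z) - x) ** 2).sum(dim=1)  # self-supervised target (x_i, x_i)
    kl = 0.5 * (mu ** 2 + (2 * log_std).exp() - 2 * log_std - 1).sum(dim=1)
    return (recon + kl).mean()

x = torch.randn(16, d_x)
print(vae_loss(x))   # differentiable; train with any torch optimizer
```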


Creating an advanced model II

Take RNN as an example

$$P(Y)\to \cdots \to P^*(Y,Z|X) \to y_t=\mathrm{Net}(x_t,z_{t-1}),\ z_t=\mathrm{Net}(x_t,z_{t-1})$$

say $\mathrm{RNN}(w) = \mathrm{NN}\circ \mathrm{TS}\circ \mathrm{Condi}\circ \mathrm{LVM}(P)$.
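
A minimal numpy sketch of the recurrence, with a single tanh layer standing in for $\mathrm{Net}$; all shapes and weights are illustrative.

```python
# A minimal sketch of y_t = Net(x_t, z_{t-1}), z_t = Net(x_t, z_{t-1}).
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, T = 3, 4, 10
Wx = rng.normal(size=(d_z, d_x))   # input-to-hidden weights
Wz = rng.normal(size=(d_z, d_z))   # hidden-to-hidden weights
Wy = rng.normal(size=(1, d_z))     # hidden-to-output weights

z = np.zeros(d_z)                  # initial hidden state z_0
for t in range(T):
    x_t = rng.normal(size=d_x)     # input at step t
    z = np.tanh(Wx @ x_t + Wz @ z) # hidden state z_t = Net(x_t, z_{t-1})
    y_t = Wy @ z                   # output y_t
    print(t, float(y_t[0]))
```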

Homework

  1. What about $Z_k\sim B(p_k)$ (multivariate Bernoulli distr.)?
  2. Create a time-sequence version of VAE.

Advanced Models II

  • Ensemble Learning
  • Transfer Learning
  • Incremental Learning (Continual Learning, Online Learning)
  • Life-Long Learning

Possible definition of Transfer Learning

  • $(P(X|\theta_1), P(X|\theta_2))$, $\theta_1,\theta_2\sim P(\theta|\alpha)$
  • $(P(X|\theta_1,\theta_0), P(X|\theta_2,\theta_0))$
  • $(P(\phi(X)|\theta), P(\phi(X)|\theta))$
    with sample $X_1$ from the source domain and sample $X_2$ from the target domain

Misc.

  • Reinforcement Learning: stochastic learning,
    alternating evaluation and sampling-estimation:
    $\theta \to v \to \theta \to v \to \cdots$ (see the sketch after this list)
  • BiLSTM/BiLM/ELMo
  • Transformer/Self-attention
  • Models based on unnormalized distr. (energy-based models)
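
As referenced in the reinforcement-learning item, one concrete instance of the alternation $\theta\to v\to\theta\to\cdots$ is tabular policy iteration, alternating exact policy evaluation with greedy improvement; the toy 2-state, 2-action MDP below is entirely illustrative.

```python
# A minimal sketch of evaluation/improvement alternation (policy iteration).
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']: transition probs
              [[0.8, 0.2], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # R[s, a]: expected rewards
gamma = 0.9
pi = np.array([0, 0])                     # initial policy ("theta")

for _ in range(10):
    P_pi = P[np.arange(2), pi]            # transitions under pi
    r_pi = R[np.arange(2), pi]            # rewards under pi
    v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)  # evaluation: theta -> v
    pi = np.argmax(R + gamma * P @ v, axis=1)            # improvement: v -> theta
print(pi, v)
```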

Inspired by BiLSTM/BiLM: Tied Model
$(P(X|\theta_1,\theta_0), P(X|\theta_2,\theta_0))$ with the same sample.

Tied likelihood: $P(X|\theta_1,\theta_0)\,P(X|\theta_2,\theta_0)$
(a sort of pseudo-likelihood; a product of experts without the normalizing coefficient)


Future Work


References

  • Peter McCullagh. What is a statistical model? The Annals of Statistics, 2002, 30(5): 1225–1310.
  • Jared Culbertson and Kirk Sturtz. Bayesian Machine Learning via Category Theory, 2013.
  • Categories for AI. https://www.youtube.com/watch?v=4poHENv4kR0
  • Kenneth A. Lloyd, Jr. A Category-Theoretic Approach to Agent-based Modeling and Simulation, 2010.
  • Dan Shiebler, Bruno Gavranovic, Paul Wilson. Category Theory in Machine Learning, 2021.

Link: https://pan.baidu.com/s/1GdPiVGG3GIKVS4nWqlBm-w?pwd=1111 (extraction code: 1111)

