19. Mutual Information
Common strategy: find $W$ that makes the transformed variables as independent as possible. Mutual information is a good independence measure:
$$I(x,y) = \iint p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy$$
$x$ and $y$ are mutually independent $\Leftrightarrow I(x,y) = 0$.
$p(x,y)$: joint distribution of $(x,y)$; $p(x),\,p(y)$: marginal distributions of $x$ and $y$.
20. Our Proposal
Squared-loss Mutual Information (SMI):
$$I_s(x,y) = \frac{1}{2}\iint \left(\frac{p(x,y)}{p(x)\,p(y)} - 1\right)^{2} p(x)\,p(y)\,dx\,dy$$
$x$ and $y$ are mutually independent $\Leftrightarrow I_s(x,y) = 0$.
We propose a non-parametric estimator of $I_s$: thanks to the squared loss, an analytic solution is available, and the gradient of $I_s$ w.r.t. $W$ is also analytically available $\Rightarrow$ gradient-descent methods for ICA and dimension reduction.
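To make the definition concrete, here is a tiny numeric check (our own sketch, not from the slides; smi_discrete is a hypothetical helper): for a discrete joint table, the SMI is zero exactly when p(x, y) = p(x) p(y).

```python
# Minimal numeric check of the SMI definition for a discrete joint distribution.
import numpy as np

def smi_discrete(p_xy):
    """Squared-loss mutual information for a discrete joint table p(x, y)."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    ratio = p_xy / (p_x * p_y)                      # density ratio p(x,y) / (p(x) p(y))
    return 0.5 * np.sum((ratio - 1.0) ** 2 * p_x * p_y)

independent = np.outer([0.3, 0.7], [0.4, 0.6])      # p(x, y) = p(x) p(y)
dependent   = np.array([[0.4, 0.1], [0.1, 0.4]])
print(smi_discrete(independent))  # 0.0
print(smi_discrete(dependent))    # > 0
```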
21. Estimation Method
Estimate the density ratio $g^*(x,y) = \dfrac{p(x,y)}{p(x)\,p(y)}$.
By Legendre-Fenchel convex duality [Nguyen et al. 08], define
$$J(g) = \mathbb{E}_{p(x,y)}\!\left[g(x,y)\right] - \frac{1}{2}\,\mathbb{E}_{p(x)p(y)}\!\left[g(x,y)^{2}\right],$$
then we can write $I_s = \sup_g J(g) - \frac{1}{2}$, where the sup is taken over all measurable functions $g$; the optimal function is the density ratio $g^*$.
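As a quick sanity check (not on the original slide), plugging the density ratio $g^*$ into $J$ indeed recovers $I_s + \tfrac{1}{2}$:
$$
\begin{aligned}
J(g^*) &= \iint \frac{p(x,y)}{p(x)p(y)}\,p(x,y)\,dx\,dy - \frac{1}{2}\iint \left(\frac{p(x,y)}{p(x)p(y)}\right)^{2} p(x)p(y)\,dx\,dy \\
&= \frac{1}{2}\iint \left(\frac{p(x,y)}{p(x)p(y)}\right)^{2} p(x)p(y)\,dx\,dy \\
&= \frac{1}{2}\iint \left(\frac{p(x,y)}{p(x)p(y)} - 1\right)^{2} p(x)p(y)\,dx\,dy + \iint p(x,y)\,dx\,dy - \frac{1}{2} \\
&= I_s + \frac{1}{2}.
\end{aligned}
$$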
22. Empirical Approximation
The problem is reduced to solving $\sup_g J(g)$.
Assume we have $n$ paired samples $\{(x_i, y_i)\}_{i=1}^{n}$. The objective function is empirically approximated by a V-statistic (decoupling):
$$\hat{J}(g) = \frac{1}{n}\sum_{i=1}^{n} g(x_i, y_i) - \frac{1}{2n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n} g(x_i, y_j)^{2}.$$
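A minimal sketch of this V-statistic approximation, assuming paired samples stored as NumPy arrays and an arbitrary callable g; the names here are illustrative, not the authors' code.

```python
import numpy as np

def empirical_objective(g, x, y):
    """Decoupled V-statistic J_hat(g); the SMI estimate is max_g J_hat(g) - 1/2."""
    n = len(x)
    term1 = np.mean([g(x[i], y[i]) for i in range(n)])                         # approximates E_{p(x,y)}[g]
    term2 = np.mean([g(x[i], y[j]) ** 2 for i in range(n) for j in range(n)])  # approximates E_{p(x)p(y)}[g^2]
    return term1 - 0.5 * term2
```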
23. Linear Model for g
We use a linear model
$$g_{\alpha}(x,y) = \sum_{l=1}^{b} \alpha_l\,\varphi_l(x,y) = \alpha^{\top}\varphi(x,y),$$
where $\varphi_l$ is a basis function, e.g., a Gaussian kernel, and an $\ell_2$ penalty term on $\alpha$ is added for regularization.
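Under the linear model with an $\ell_2$ penalty $\tfrac{\lambda}{2}\alpha^{\top}\alpha$ (our convention here), the maximizer of the penalized empirical objective has the closed form $\hat{\alpha} = (\hat{H} + \lambda I)^{-1}\hat{h}$, with $\hat{h} = \frac{1}{n}\sum_i \varphi(x_i,y_i)$ and $\hat{H} = \frac{1}{n^2}\sum_{i,j}\varphi(x_i,y_j)\varphi(x_i,y_j)^{\top}$. A minimal NumPy sketch (fit_alpha, phi, and lam are illustrative names, not the authors' code):

```python
import numpy as np

def fit_alpha(phi, x, y, lam=0.1):
    """Analytic solution for the linear model g(x, y) = alpha^T phi(x, y)."""
    n = len(x)
    Phi_paired = np.array([phi(x[i], y[i]) for i in range(n)])                       # shape (n, b)
    Phi_cross  = np.array([phi(x[i], y[j]) for i in range(n) for j in range(n)])     # shape (n^2, b)
    h_hat = Phi_paired.mean(axis=0)                      # (1/n)   sum_i   phi(x_i, y_i)
    H_hat = Phi_cross.T @ Phi_cross / (n * n)            # (1/n^2) sum_i,j phi(x_i, y_j) phi(x_i, y_j)^T
    alpha = np.linalg.solve(H_hat + lam * np.eye(len(h_hat)), h_hat)
    smi_hat = h_hat @ alpha - 0.5 * alpha @ H_hat @ alpha - 0.5   # plug-in SMI estimate
    return alpha, smi_hat
```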
25. Gaussian Kernel
We use a Gaussian kernel for the basis functions:
$$\varphi_l(x,y) = \exp\!\left(-\frac{\|x - u_l\|^{2} + \|y - v_l\|^{2}}{2\sigma^{2}}\right),$$
where the center points $(u_l, v_l)$ are randomly chosen from the sample points $\{(x_i, y_i)\}_{i=1}^{n}$. Linear combinations of Gaussian kernels span a broad function class $\Rightarrow$ distribution-free.
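A sketch of this basis construction (make_gaussian_basis is our illustrative name; x and y are assumed to be 2-D arrays of shape (n, d_x) and (n, d_y)):

```python
import numpy as np

def make_gaussian_basis(x, y, b=100, sigma=1.0, seed=0):
    """Gaussian basis phi_l(x, y) = exp(-(||x - u_l||^2 + ||y - v_l||^2) / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x), size=min(b, len(x)), replace=False)   # random center indices
    U, V = x[idx], y[idx]                                          # centers (u_l, v_l) from the samples
    def phi(xi, yi):
        d2 = np.sum((U - xi) ** 2, axis=1) + np.sum((V - yi) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * sigma ** 2))                    # vector of b basis values
    return phi
```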
26. Model Selection
Now we have two tuning parameters: the regularization parameter $\lambda$ and the Gaussian width $\sigma$. Since the SMI estimator is formulated as an optimization problem, cross-validation is applicable $\Rightarrow$ model selection is available.
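A sketch of how such cross-validation over $(\sigma, \lambda)$ could look, scoring the held-out objective $\tfrac{1}{2}\hat{\alpha}^{\top}\hat{H}\hat{\alpha} - \hat{h}^{\top}\hat{\alpha}$ (smaller is better); it reuses make_gaussian_basis and fit_alpha from the sketches above and is not the authors' CV procedure.

```python
import numpy as np

def cv_select(x, y, sigmas, lams, n_folds=5):
    n = len(x)
    folds = np.array_split(np.random.permutation(n), n_folds)
    best, best_score = None, np.inf
    for sigma in sigmas:
        for lam in lams:
            scores = []
            for te in folds:
                tr = np.setdiff1d(np.arange(n), te)
                phi = make_gaussian_basis(x[tr], y[tr], sigma=sigma)   # basis from training folds
                alpha, _ = fit_alpha(phi, x[tr], y[tr], lam=lam)       # alpha from training folds
                Pp = np.array([phi(x[i], y[i]) for i in te])           # held-out paired samples
                Pc = np.array([phi(x[i], y[j]) for i in te for j in te])
                h, H = Pp.mean(axis=0), Pc.T @ Pc / len(te) ** 2
                scores.append(0.5 * alpha @ H @ alpha - h @ alpha)     # held-out negative objective
            score = float(np.mean(scores))
            if score < best_score:
                best, best_score = (sigma, lam), score
    return best   # (sigma, lambda) with the best held-out score
```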
27. Asymptotic Analysis
Two convergence theorems, depending on the choice of the regularization parameter $\lambda_n$ and the complexity of the model (large: complex, small: simple), under a bracketing entropy condition.
Nonparametric case: the convergence rate depends on the decay of $\lambda_n$ and the model complexity.
Parametric case: the rate involves matrices analogous to the Fisher information matrix.
29. ICA
Mixed signal (observation): $x = A s$, where $s$ is the original signal ($d$-dimensional, components independent of each other) and $A$ is the mixing matrix ($d \times d$).
Goal: estimate the demixing matrix $W$ ($d \times d$) so that the estimated (demixed) signal $y = W x$ recovers $s$; ideally $W A = I$ (up to permutation and scaling of the components).
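A toy illustration of the model (not the proposed SMI-based ICA algorithm): mix two independent Laplace sources with a random A and check that W = A^{-1} demixes them.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))          # original independent signals (d = 2)
A = rng.normal(size=(2, 2))              # mixing matrix
x = A @ s                                # observed mixed signals
W = np.linalg.inv(A)                     # ideal demixing matrix (W A = I)
y = W @ x                                # estimated (demixed) signals
print(np.allclose(y, s))                 # True: the sources are recovered
```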
30. Supervised Dimension Reduction
Input $x \in \mathbb{R}^{d}$, output $y$. Goal: a "good" low-dimensional representation $z = W x$ ($W$: $m \times d$, $m < d$) $\rightarrow$ sufficient dimension reduction (SDR): $y$ and $x$ are conditionally independent given $W x$.
A natural choice of $W$: maximize the dependence between $z = W x$ and $y$,
$$W^{*} = \arg\max_{W} I_s(W x, y).$$
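A schematic sketch of this idea (not the authors' implementation, which exploits the analytic gradient of $I_s$ w.r.t. $W$): maximize the estimated SMI between z = Wx and y over row-orthonormal W by gradient ascent with a crude numerical gradient and QR re-orthonormalization. It builds on the make_gaussian_basis and fit_alpha sketches above; smi_estimate and sdr are illustrative names, and y is assumed to be a 2-D array.

```python
import numpy as np

def smi_estimate(z, y, sigma=1.0, lam=0.1):
    phi = make_gaussian_basis(z, y, sigma=sigma)          # from the earlier sketch
    _, smi_hat = fit_alpha(phi, z, y, lam=lam)            # from the earlier sketch
    return smi_hat

def sdr(x, y, m, n_iter=50, step=0.1, eps=1e-3):
    d = x.shape[1]
    W, _ = np.linalg.qr(np.random.randn(d, m))
    W = W.T                                               # (m, d), rows orthonormal
    for _ in range(n_iter):
        base = smi_estimate(x @ W.T, y)
        grad = np.zeros_like(W)
        for i in range(W.size):                           # crude numerical gradient of the SMI estimate
            Wp = W.copy(); Wp.flat[i] += eps
            grad.flat[i] = (smi_estimate(x @ Wp.T, y) - base) / eps
        Q, _ = np.linalg.qr((W + step * grad).T)          # re-orthonormalize the rows
        W = Q.T
    return W
```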
31. Artificial Data Set
We compared our method with KDR (Kernel Dimension Reduction), HSIC (Hilbert-Schmidt Independence Criterion), SIR (Sliced Inverse Regression), and SAVE (Sliced Average Variance Estimation).
Performance measure: the error between the estimated and the true low-dimensional subspaces.
We used the median-distance heuristic for the Gaussian width of KDR and HSIC.
33. Result
Mean and standard deviation over 50 trials; the best and comparable methods are determined by a one-sided t-test at the 1% significance level. Our method performs well.
34. UCI Data Set
We choose 200 samples and train an SVM on the low-dimensional representation. Classification error over 20 trials; one-sided t-test at the 1% significance level.
38. Sparse Learning
Given $n$ samples $\{(x_i, y_i)\}_{i=1}^{n}$ and a convex loss $\ell$ (hinge, squared, logistic), $L_1$-regularization yields sparse solutions.
Lasso [Tibshirani: JRSS 1996]:
$$\min_{w}\ \frac{1}{n}\sum_{i=1}^{n} \ell(w^{\top}x_i, y_i) + \lambda \|w\|_1.$$
Group Lasso [Yuan & Lin: JRSS 2006]:
$$\min_{w}\ \frac{1}{n}\sum_{i=1}^{n} \ell(w^{\top}x_i, y_i) + \lambda \sum_{I} \|w_I\|,$$
where $I$ ranges over subsets (groups) of indices.
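A minimal sketch of the Lasso with squared loss, solved by proximal gradient descent (ISTA); the soft-thresholding step is what produces exactly-zero, i.e. sparse, coefficients. lasso_ista and soft_threshold are illustrative names, not tied to any particular library.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam=0.1, n_iter=500):
    """Minimize (1/(2n)) ||X w - y||^2 + lam * ||w||_1 by ISTA."""
    n, d = X.shape
    w = np.zeros(d)
    L = np.linalg.norm(X, 2) ** 2 / n              # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n               # gradient of the squared loss
        w = soft_threshold(w - grad / L, lam / L)  # gradient step + proximal (shrinkage) step
    return w

X = np.random.randn(100, 20)
w_true = np.zeros(20); w_true[:3] = [2.0, -1.5, 1.0]
print(lasso_ista(X, X @ w_true, lam=0.05)[:5])     # first three entries nonzero, the rest ~0
```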
40. Reproducing Kernel Hilbert Space (RKHS)
$\mathcal{H}$: a Hilbert space of real-valued functions on $\mathcal{X}$; $\phi: \mathcal{X} \to \mathcal{H}$ is a map to the Hilbert space such that
$$f(x) = \langle f, \phi(x)\rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H} \quad \text{(reproducing property)},$$
with reproducing kernel $k(x, x') = \langle \phi(x), \phi(x')\rangle_{\mathcal{H}}$.
Representer theorem: the minimizer of a regularized empirical risk over $\mathcal{H}$ can be written as $f = \sum_{i=1}^{n} \alpha_i\, k(\cdot, x_i)$.
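A small sketch illustrating the representer theorem via kernel ridge regression with a Gaussian kernel: the regularized least-squares solution is the finite kernel expansion $f(x) = \sum_i \alpha_i k(x, x_i)$ with $\alpha = (K + n\lambda I)^{-1} y$. The names gauss_kernel and kernel_ridge are illustrative.

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_ridge(X, y, lam=1e-2, sigma=1.0):
    """Minimize (1/n) sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2 over the RKHS."""
    n = len(X)
    K = gauss_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)    # expansion coefficients
    return lambda Xnew: gauss_kernel(Xnew, X, sigma) @ alpha

X = np.random.randn(50, 1)
y = np.sin(X[:, 0])
f = kernel_ridge(X, y)
print(f(X)[:3], y[:3])                                     # fitted values vs. targets
```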
41. Moore-Aronszajn Theorem
There is a one-to-one correspondence between positive (semi-)definite symmetric kernels $k$ and RKHSs with reproducing kernel $k$.