19. Mutual Information
Common strategy: find $W$ that makes the transformed variables as independent as possible. Mutual information is a good independence measure:
$$I(x,y) = \iint p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy$$
$x$ and $y$ are mutually independent $\Leftrightarrow I(x,y) = 0$.
$p(x,y)$: joint distribution of $(x,y)$; $p(x),\,p(y)$: marginal distributions of $x$ and $y$.
20. Our Proposal
Squared-loss Mutual Information (SMI):
$$I_s(x,y) = \frac{1}{2}\iint \left(\frac{p(x,y)}{p(x)\,p(y)} - 1\right)^{2} p(x)\,p(y)\,dx\,dy$$
$x$ and $y$ are mutually independent $\Leftrightarrow I_s(x,y) = 0$.
We propose a non-parametric estimator of $I_s$: thanks to the squared loss, an analytic solution is available, and the gradient of $I_s$ w.r.t. $W$ is also analytically available $\Rightarrow$ gradient-descent methods for ICA and dimension reduction.
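To make the definition concrete, here is a tiny numeric check (our own sketch, not from the slides; smi_discrete is a hypothetical helper): for a discrete joint table, the SMI is zero exactly when p(x, y) = p(x) p(y).

```python
# Minimal numeric check of the SMI definition for a discrete joint distribution.
import numpy as np

def smi_discrete(p_xy):
    """Squared-loss mutual information for a discrete joint table p(x, y)."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    ratio = p_xy / (p_x * p_y)                      # density ratio p(x,y) / (p(x) p(y))
    return 0.5 * np.sum((ratio - 1.0) ** 2 * p_x * p_y)

independent = np.outer([0.3, 0.7], [0.4, 0.6])      # p(x, y) = p(x) p(y)
dependent   = np.array([[0.4, 0.1], [0.1, 0.4]])
print(smi_discrete(independent))  # 0.0
print(smi_discrete(dependent))    # > 0
```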
21. Estimation Method
Estimate the density ratio $g^*(x,y) = \dfrac{p(x,y)}{p(x)\,p(y)}$.
By Legendre-Fenchel convex duality [Nguyen et al. 08], define
$$J(g) = \mathbb{E}_{p(x,y)}\!\left[g(x,y)\right] - \frac{1}{2}\,\mathbb{E}_{p(x)p(y)}\!\left[g(x,y)^{2}\right],$$
then we can write $I_s = \sup_g J(g) - \frac{1}{2}$, where the sup is taken over all measurable functions $g$; the optimal function is the density ratio $g^*$.
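As a quick sanity check (not on the original slide), plugging the density ratio $g^*$ into $J$ indeed recovers $I_s + \tfrac{1}{2}$:
$$
\begin{aligned}
J(g^*) &= \iint \frac{p(x,y)}{p(x)p(y)}\,p(x,y)\,dx\,dy - \frac{1}{2}\iint \left(\frac{p(x,y)}{p(x)p(y)}\right)^{2} p(x)p(y)\,dx\,dy \\
&= \frac{1}{2}\iint \left(\frac{p(x,y)}{p(x)p(y)}\right)^{2} p(x)p(y)\,dx\,dy \\
&= \frac{1}{2}\iint \left(\frac{p(x,y)}{p(x)p(y)} - 1\right)^{2} p(x)p(y)\,dx\,dy + \iint p(x,y)\,dx\,dy - \frac{1}{2} \\
&= I_s + \frac{1}{2}.
\end{aligned}
$$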
22. Empirical Approximation
The problem is reduced to solving $\sup_g J(g)$.
Assume we have $n$ paired samples $\{(x_i, y_i)\}_{i=1}^{n}$. The objective function is empirically approximated by a V-statistic (decoupling):
$$\hat{J}(g) = \frac{1}{n}\sum_{i=1}^{n} g(x_i, y_i) - \frac{1}{2n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n} g(x_i, y_j)^{2}.$$
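A minimal sketch of this V-statistic approximation, assuming paired samples stored as NumPy arrays and an arbitrary callable g; the names here are illustrative, not the authors' code.

```python
import numpy as np

def empirical_objective(g, x, y):
    """Decoupled V-statistic J_hat(g); the SMI estimate is max_g J_hat(g) - 1/2."""
    n = len(x)
    term1 = np.mean([g(x[i], y[i]) for i in range(n)])                         # approximates E_{p(x,y)}[g]
    term2 = np.mean([g(x[i], y[j]) ** 2 for i in range(n) for j in range(n)])  # approximates E_{p(x)p(y)}[g^2]
    return term1 - 0.5 * term2
```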
23. Linear Model for g
We use a linear model
$$g_{\alpha}(x,y) = \sum_{l=1}^{b} \alpha_l\,\varphi_l(x,y) = \alpha^{\top}\varphi(x,y),$$
where $\varphi_l$ is a basis function, e.g., a Gaussian kernel, and an $\ell_2$ penalty term on $\alpha$ is added for regularization.
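Under the linear model with an $\ell_2$ penalty $\tfrac{\lambda}{2}\alpha^{\top}\alpha$ (our convention here), the maximizer of the penalized empirical objective has the closed form $\hat{\alpha} = (\hat{H} + \lambda I)^{-1}\hat{h}$, with $\hat{h} = \frac{1}{n}\sum_i \varphi(x_i,y_i)$ and $\hat{H} = \frac{1}{n^2}\sum_{i,j}\varphi(x_i,y_j)\varphi(x_i,y_j)^{\top}$. A minimal NumPy sketch (fit_alpha, phi, and lam are illustrative names, not the authors' code):

```python
import numpy as np

def fit_alpha(phi, x, y, lam=0.1):
    """Analytic solution for the linear model g(x, y) = alpha^T phi(x, y)."""
    n = len(x)
    Phi_paired = np.array([phi(x[i], y[i]) for i in range(n)])                       # shape (n, b)
    Phi_cross  = np.array([phi(x[i], y[j]) for i in range(n) for j in range(n)])     # shape (n^2, b)
    h_hat = Phi_paired.mean(axis=0)                      # (1/n)   sum_i   phi(x_i, y_i)
    H_hat = Phi_cross.T @ Phi_cross / (n * n)            # (1/n^2) sum_i,j phi(x_i, y_j) phi(x_i, y_j)^T
    alpha = np.linalg.solve(H_hat + lam * np.eye(len(h_hat)), h_hat)
    smi_hat = h_hat @ alpha - 0.5 * alpha @ H_hat @ alpha - 0.5   # plug-in SMI estimate
    return alpha, smi_hat
```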
25. Gaussian Kernel
We use a Gaussian kernel for the basis functions:
$$\varphi_l(x,y) = \exp\!\left(-\frac{\|x - u_l\|^{2} + \|y - v_l\|^{2}}{2\sigma^{2}}\right),$$
where the center points $(u_l, v_l)$ are randomly chosen from the sample points $\{(x_i, y_i)\}_{i=1}^{n}$. Linear combinations of Gaussian kernels span a broad function class $\Rightarrow$ distribution-free.
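A sketch of this basis construction (make_gaussian_basis is our illustrative name; x and y are assumed to be 2-D arrays of shape (n, d_x) and (n, d_y)):

```python
import numpy as np

def make_gaussian_basis(x, y, b=100, sigma=1.0, seed=0):
    """Gaussian basis phi_l(x, y) = exp(-(||x - u_l||^2 + ||y - v_l||^2) / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x), size=min(b, len(x)), replace=False)   # random center indices
    U, V = x[idx], y[idx]                                          # centers (u_l, v_l) from the samples
    def phi(xi, yi):
        d2 = np.sum((U - xi) ** 2, axis=1) + np.sum((V - yi) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * sigma ** 2))                    # vector of b basis values
    return phi
```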
26. Model Selection
Now we have two tuning parameters: the regularization parameter $\lambda$ and the Gaussian width $\sigma$. Since the SMI estimator is formulated as an optimization problem, cross-validation is applicable $\Rightarrow$ model selection is available.
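A sketch of how such cross-validation over $(\sigma, \lambda)$ could look, scoring the held-out objective $\tfrac{1}{2}\hat{\alpha}^{\top}\hat{H}\hat{\alpha} - \hat{h}^{\top}\hat{\alpha}$ (smaller is better); it reuses make_gaussian_basis and fit_alpha from the sketches above and is not the authors' CV procedure.

```python
import numpy as np

def cv_select(x, y, sigmas, lams, n_folds=5):
    n = len(x)
    folds = np.array_split(np.random.permutation(n), n_folds)
    best, best_score = None, np.inf
    for sigma in sigmas:
        for lam in lams:
            scores = []
            for te in folds:
                tr = np.setdiff1d(np.arange(n), te)
                phi = make_gaussian_basis(x[tr], y[tr], sigma=sigma)   # basis from training folds
                alpha, _ = fit_alpha(phi, x[tr], y[tr], lam=lam)       # alpha from training folds
                Pp = np.array([phi(x[i], y[i]) for i in te])           # held-out paired samples
                Pc = np.array([phi(x[i], y[j]) for i in te for j in te])
                h, H = Pp.mean(axis=0), Pc.T @ Pc / len(te) ** 2
                scores.append(0.5 * alpha @ H @ alpha - h @ alpha)     # held-out negative objective
            score = float(np.mean(scores))
            if score < best_score:
                best, best_score = (sigma, lam), score
    return best   # (sigma, lambda) with the best held-out score
```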
27. Asymptotic Analysis
Two convergence theorems, depending on the choice of the regularization parameter $\lambda_n$ and the complexity of the model (large: complex, small: simple), under a bracketing entropy condition.
Nonparametric case: the convergence rate depends on the decay of $\lambda_n$ and the model complexity.
Parametric case: the rate involves matrices analogous to the Fisher information matrix.
29. ICA
Mixed signal (observation): $x = A s$, where $s$ is the original signal ($d$-dimensional, components independent of each other) and $A$ is the mixing matrix ($d \times d$).
Goal: estimate the demixing matrix $W$ ($d \times d$) so that the estimated (demixed) signal $y = W x$ recovers $s$; ideally $W A = I$ (up to permutation and scaling of the components).
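A toy illustration of the model (not the proposed SMI-based ICA algorithm): mix two independent Laplace sources with a random A and check that W = A^{-1} demixes them.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))          # original independent signals (d = 2)
A = rng.normal(size=(2, 2))              # mixing matrix
x = A @ s                                # observed mixed signals
W = np.linalg.inv(A)                     # ideal demixing matrix (W A = I)
y = W @ x                                # estimated (demixed) signals
print(np.allclose(y, s))                 # True: the sources are recovered
```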
30. Supervised Dimension Reduction
Input $x \in \mathbb{R}^{d}$, output $y$. Goal: a "good" low-dimensional representation $z = W x$ ($W$: $m \times d$, $m < d$) $\rightarrow$ sufficient dimension reduction (SDR): $y$ and $x$ are conditionally independent given $W x$.
A natural choice of $W$: maximize the dependence between $z = W x$ and $y$,
$$W^{*} = \arg\max_{W} I_s(W x, y).$$
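A schematic sketch of this idea (not the authors' implementation, which exploits the analytic gradient of $I_s$ w.r.t. $W$): maximize the estimated SMI between z = Wx and y over row-orthonormal W by gradient ascent with a crude numerical gradient and QR re-orthonormalization. It builds on the make_gaussian_basis and fit_alpha sketches above; smi_estimate and sdr are illustrative names, and y is assumed to be a 2-D array.

```python
import numpy as np

def smi_estimate(z, y, sigma=1.0, lam=0.1):
    phi = make_gaussian_basis(z, y, sigma=sigma)          # from the earlier sketch
    _, smi_hat = fit_alpha(phi, z, y, lam=lam)            # from the earlier sketch
    return smi_hat

def sdr(x, y, m, n_iter=50, step=0.1, eps=1e-3):
    d = x.shape[1]
    W, _ = np.linalg.qr(np.random.randn(d, m))
    W = W.T                                               # (m, d), rows orthonormal
    for _ in range(n_iter):
        base = smi_estimate(x @ W.T, y)
        grad = np.zeros_like(W)
        for i in range(W.size):                           # crude numerical gradient of the SMI estimate
            Wp = W.copy(); Wp.flat[i] += eps
            grad.flat[i] = (smi_estimate(x @ Wp.T, y) - base) / eps
        Q, _ = np.linalg.qr((W + step * grad).T)          # re-orthonormalize the rows
        W = Q.T
    return W
```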
31. Artificial Data Set
We compared our method with KDR (Kernel Dimension Reduction), HSIC (Hilbert-Schmidt Independence Criterion), SIR (Sliced Inverse Regression), and SAVE (Sliced Average Variance Estimation).
Performance measure: the error between the estimated and the true low-dimensional subspaces.
We used the median-distance heuristic for the Gaussian width of KDR and HSIC.
33. Result
Mean and standard deviation over 50 trials; the best and comparable methods are determined by a one-sided t-test at the 1% significance level. Our method performs well.
34. UCI Data Set
We choose 200 samples and train an SVM on the low-dimensional representation. Classification error over 20 trials; one-sided t-test at the 1% significance level.
38. Sparse Learning
Given $n$ samples $\{(x_i, y_i)\}_{i=1}^{n}$ and a convex loss $\ell$ (hinge, squared, logistic), $L_1$-regularization yields sparse solutions.
Lasso [Tibshirani: JRSS 1996]:
$$\min_{w}\ \frac{1}{n}\sum_{i=1}^{n} \ell(w^{\top}x_i, y_i) + \lambda \|w\|_1.$$
Group Lasso [Yuan & Lin: JRSS 2006]:
$$\min_{w}\ \frac{1}{n}\sum_{i=1}^{n} \ell(w^{\top}x_i, y_i) + \lambda \sum_{I} \|w_I\|,$$
where $I$ ranges over subsets (groups) of indices.
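A minimal sketch of the Lasso with squared loss, solved by proximal gradient descent (ISTA); the soft-thresholding step is what produces exactly-zero, i.e. sparse, coefficients. lasso_ista and soft_threshold are illustrative names, not tied to any particular library.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam=0.1, n_iter=500):
    """Minimize (1/(2n)) ||X w - y||^2 + lam * ||w||_1 by ISTA."""
    n, d = X.shape
    w = np.zeros(d)
    L = np.linalg.norm(X, 2) ** 2 / n              # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n               # gradient of the squared loss
        w = soft_threshold(w - grad / L, lam / L)  # gradient step + proximal (shrinkage) step
    return w

X = np.random.randn(100, 20)
w_true = np.zeros(20); w_true[:3] = [2.0, -1.5, 1.0]
print(lasso_ista(X, X @ w_true, lam=0.05)[:5])     # first three entries nonzero, the rest ~0
```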
40. Reproducing Kernel Hilbert Space (RKHS)
$\mathcal{H}$: a Hilbert space of real-valued functions on $\mathcal{X}$; $\phi: \mathcal{X} \to \mathcal{H}$ is a map to the Hilbert space such that
$$f(x) = \langle f, \phi(x)\rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H} \quad \text{(reproducing property)},$$
with reproducing kernel $k(x, x') = \langle \phi(x), \phi(x')\rangle_{\mathcal{H}}$.
Representer theorem: the minimizer of a regularized empirical risk over $\mathcal{H}$ can be written as $f = \sum_{i=1}^{n} \alpha_i\, k(\cdot, x_i)$.
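A small sketch illustrating the representer theorem via kernel ridge regression with a Gaussian kernel: the regularized least-squares solution is the finite kernel expansion $f(x) = \sum_i \alpha_i k(x, x_i)$ with $\alpha = (K + n\lambda I)^{-1} y$. The names gauss_kernel and kernel_ridge are illustrative.

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_ridge(X, y, lam=1e-2, sigma=1.0):
    """Minimize (1/n) sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2 over the RKHS."""
    n = len(X)
    K = gauss_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)    # expansion coefficients
    return lambda Xnew: gauss_kernel(Xnew, X, sigma) @ alpha

X = np.random.randn(50, 1)
y = np.sin(X[:, 0])
f = kernel_ridge(X, y)
print(f(X)[:3], y[:3])                                     # fitted values vs. targets
```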
41. Moore-Aronszajn Theorem
There is a one-to-one correspondence between positive (semi-)definite symmetric kernels $k$ and RKHSs with reproducing kernel $k$.