On top of yesterday's paper, here is the list of accepted papers at NIPS. A few are already on the web or on arXiv; here is a subsample that caught our attention (most of which have been featured here before).
Preconditioned Spectral Descent for Deep Learning by
Carlson, David;
Collins, Edo;
Hsieh, Ya-Ping;
Carin, Lawrence;
Cevher, Volkan
(supplemental material is
here.)
Optimizing objective functions in deep learning is a
notoriously difficult task. Classical algorithms, including variants of
gradient descent and quasi-Newton methods, can be interpreted as
approximations to the objective function in Euclidean norms. However, it
has recently been shown that approximations via non-Euclidean norms
can significantly improve optimization performance. In this paper, we
provide evidence that neither of the above two methods entirely captures
the ``geometry'' of the objective functions in deep learning, while a
combination of the two does. We theoretically formalize our arguments
and derive novel second-order non-Euclidean algorithms. We implement
our algorithms on Restricted Boltzmann Machines, Feedforward Neural
Nets, and Convolutional Neural Nets, and demonstrate improvements in
both computational efficiency and model fit.
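For readers who want a feel for what a non-Euclidean step looks like, here is a small sketch (my illustration, not the authors' preconditioned algorithm): a steepest-descent step with respect to the Schatten-∞ (spectral) norm replaces the gradient G = U S V^T by its nuclear norm times U V^T, so every singular direction of the update gets equal weight.

# Sketch only: one steepest-descent step w.r.t. the Schatten-infinity (spectral)
# norm. This illustrates the "non-Euclidean" ingredient; it is not the exact
# preconditioned algorithm of the paper.
import numpy as np

def spectral_descent_step(W, grad, lr=0.01):
    # replace the Euclidean gradient by nuclear_norm(grad) * U @ Vt, with grad = U S Vt
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    return W - lr * s.sum() * (U @ Vt)

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 10))      # a weight matrix
G = rng.standard_normal((20, 10))      # a stand-in gradient
W_new = spectral_descent_step(W, G)

The point of the paper is that mixing this kind of step with classical Euclidean/quasi-Newton information captures the geometry better than either ingredient alone.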
Tensorizing Neural Networks (presentation slides) by Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, Dmitry Vetrov
Sum-of-Squares Lower Bounds for Sparse PCA
Tengyu Ma,
Avi Wigderson
This paper establishes a statistical versus computational trade-off for
solving a basic high-dimensional machine learning problem via a basic convex
relaxation method. Specifically, we consider the {\em Sparse Principal
Component Analysis} (Sparse PCA) problem, and the family of {\em
Sum-of-Squares} (SoS, aka Lasserre/Parrilo) convex relaxations. It was well
known that in large dimension p, a planted k-sparse unit vector can {\em in
principle} be detected using only n ≈ k log p (Gaussian or Bernoulli)
samples, but all {\em efficient} (polynomial time) algorithms known require n ≈ k^2 log p samples. It was also known that this quadratic gap cannot
be improved by the most basic {\em semi-definite} (SDP, aka spectral)
relaxation, equivalent to degree-2 SoS algorithms. Here we prove that
degree-4 SoS algorithms also cannot improve this quadratic gap. This average-case
lower bound adds to the small collection of hardness results in machine
learning for this powerful family of convex relaxation algorithms. Moreover,
our design of moments (or "pseudo-expectations") for this lower bound is quite
different from previous lower bounds. Establishing lower bounds for higher
degree SoS algorithms remains a challenging problem.
Natural Neural Networks
Guillaume Desjardins,
Karen Simonyan,
Razvan Pascanu,
Koray Kavukcuoglu
We introduce Natural Neural Networks, a novel family of algorithms that speed
up convergence by adapting their internal representation during training to
improve conditioning of the Fisher matrix. In particular, we show a specific
example that employs a simple and efficient reparametrization of the neural
network weights by implicitly whitening the representation obtained at each
layer, while preserving the feed-forward computation of the network. Such
networks can be trained efficiently via the proposed Projected Natural Gradient
Descent algorithm (PRONG), which amortizes the cost of these reparametrizations
over many parameter updates and is closely related to the Mirror Descent online
learning algorithm. We highlight the benefits of our method on both
unsupervised and supervised learning tasks, and showcase its scalability by
training on the large-scale ImageNet Challenge dataset.
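The whitening reparametrization at the heart of the abstract is easy to sketch; the toy below (my reading of the idea, not the full PRONG algorithm with its amortization) rewrites a linear layer so that it acts on whitened inputs while computing exactly the same outputs.

# Sketch of the whitening reparametrization idea: rewrite y = W x + b so that
# it operates on the whitened representation c = U (x - mu) while preserving
# the feed-forward computation. (My reading of the abstract, not PRONG itself.)
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 32)) @ rng.standard_normal((32, 32))  # correlated inputs
W = rng.standard_normal((16, 32))
b = rng.standard_normal(16)

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False) + 1e-5 * np.eye(32)
evals, evecs = np.linalg.eigh(cov)
U = evecs @ np.diag(evals ** -0.5) @ evecs.T      # whitening matrix (cov^{-1/2})

V = W @ np.linalg.inv(U)                          # new weights acting on whitened inputs
d = b + W @ mu                                    # new bias absorbing the mean shift

y_old = X @ W.T + b
y_new = (X - mu) @ U.T @ V.T + d                  # same function, whitened representation
print(np.allclose(y_old, y_new))                  # True up to numerical error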
Approximating Sparse PCA from Incomplete Data
Abhisek Kundu,
Petros Drineas,
Malik Magdon-Ismail
We study how well one can recover sparse principal components of a data
matrix using a sketch formed from a few of its elements. We show that for a
wide class of optimization problems, if the sketch is close (in the spectral
norm) to the original data matrix, then one can recover a near optimal solution
to the optimization problem by using the sketch. In particular, we use this
approach to obtain sparse principal components and show that for m data
points in n dimensions, O(ϵ^{-2} k̃ max{m,n})
elements give an ϵ-additive approximation to the sparse PCA
problem (k̃ is the stable rank of the data matrix). We demonstrate
our algorithms extensively on image, text, biological and financial data. The
results show that not only are we able to recover the sparse PCAs from the
incomplete data, but by using our sparse sketch, the running time drops by a
factor of five or more.
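As a rough illustration of the pipeline (my own simplifications; the paper's sampling distribution and sparse PCA solver differ), one can keep a random subset of entries rescaled to be unbiased and then run a truncated power method on the sketch:

# Rough sketch: build an unbiased element-wise sketch of the data matrix and
# extract a k-sparse leading component from it with a truncated power method.
import numpy as np

def sparse_pc_truncated_power(A, k, iters=200, seed=0):
    # k-sparse leading direction of A^T A via a truncated power method
    v = np.random.default_rng(seed).standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A.T @ (A @ v)
        w[np.argsort(np.abs(w))[:-k]] = 0.0     # keep only the k largest entries
        v = w / np.linalg.norm(w)
    return v

rng = np.random.default_rng(1)
m, n, k, keep = 500, 200, 10, 0.3
spike = np.zeros(n); spike[:k] = 1 / np.sqrt(k)
A = 3 * rng.standard_normal((m, 1)) @ spike[None, :] + 0.5 * rng.standard_normal((m, n))

mask = rng.random((m, n)) < keep
sketch = np.where(mask, A / keep, 0.0)          # unbiased element-wise sketch of A

v_full = sparse_pc_truncated_power(A, k)
v_sketch = sparse_pc_truncated_power(sketch, k)
print("alignment between the two solutions:", round(abs(v_full @ v_sketch), 3))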
Training Restricted Boltzmann Machines via the Thouless-Anderson-Palmer Free Energy
Marylou Gabrié,
Eric W. Tramel,
Florent Krzakala
Restricted Boltzmann machines are undirected neural networks which have been
shown to be effective in many applications, including serving as
initializations for training deep multi-layer neural networks. One of the main
reasons for their success is the existence of efficient and practical
stochastic algorithms, such as contrastive divergence, for unsupervised
training. We propose an alternative deterministic iterative procedure based on
an improved mean field method from statistical physics known as the
Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides
performance equal to, and sometimes superior to, persistent contrastive
divergence, while also providing a clear and easy to evaluate objective
function. We believe that this strategy can be easily generalized to other
models as well as to more accurate higher-order approximations, paving the way
for systematic improvements in training Boltzmann machines with hidden units.
Robust Regression via Hard Thresholding
Kush Bhatia,
Prateek Jain,
Purushottam Kar
We study the problem of Robust Least Squares Regression (RLSR) where several
response variables can be adversarially corrupted. More specifically, for a
data matrix X \in R^{p x n} and an underlying model w*, the response vector is
generated as y = X'w* + b, where b \in R^n is the corruption vector supported
over at most C·n coordinates. Existing exact recovery results for RLSR focus
solely on L1-penalty based convex formulations and impose relatively strict
model assumptions such as requiring the corruptions b to be selected
independently of X.
In this work, we study a simple hard-thresholding algorithm called TORRENT
which, under mild conditions on X, can recover w* exactly even if b corrupts
the response variables in an adversarial manner, i.e. both the support and
entries of b are selected adversarially after observing X and w*. Our results
hold under deterministic assumptions which are satisfied if X is sampled from
any sub-Gaussian distribution. Finally, unlike existing results that apply only
to a fixed w*, generated independently of X, our results are universal and hold
for any w* \in R^p.
Next, we propose gradient descent-based extensions of TORRENT that can scale
efficiently to large-scale problems, such as high-dimensional sparse recovery,
and prove similar recovery guarantees for these extensions. Empirically, we find
that TORRENT, and more so its extensions, offer significantly faster recovery
than the state-of-the-art L1 solvers. For instance, even on moderate-sized
datasets (with p = 50K) with around 40% corrupted responses, a variant of our
proposed method called TORRENT-HYB is more than 20x faster than the best L1
solver.
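A minimal version of the hard-thresholding idea is easy to write down (my reading of the fully-corrective variant; the corruptions below are large random ones, whereas the paper handles adversarial ones):

# Minimal sketch of the hard-thresholding idea: alternately (i) keep the
# fraction of points with the smallest residuals and (ii) refit least squares
# on those points only.
import numpy as np

def torrent_like(X, y, corruption_frac=0.4, iters=20):
    n = len(y)
    keep = int((1 - corruption_frac) * n)
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(iters):
        S = np.argsort(np.abs(y - X @ w))[:keep]      # active set: smallest residuals
        w = np.linalg.lstsq(X[S], y[S], rcond=None)[0]
    return w

rng = np.random.default_rng(0)
n, p = 2000, 50
X = rng.standard_normal((n, p))
w_star = rng.standard_normal(p)
y = X @ w_star + 0.01 * rng.standard_normal(n)
bad = rng.choice(n, size=int(0.4 * n), replace=False)
y[bad] += 10 * rng.standard_normal(len(bad))          # large corruptions on 40% of the responses

print("plain least squares error:", np.linalg.norm(np.linalg.lstsq(X, y, rcond=None)[0] - w_star))
print("hard-thresholding error  :", np.linalg.norm(torrent_like(X, y) - w_star))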
Sparse PCA via Bipartite Matchings
Megasthenis Asteris,
Dimitris Papailiopoulos,
Anastasios Kyrillidis,
Alexandros G. Dimakis
We consider the following multi-component sparse PCA problem: given a set of
data points, we seek to extract a small number of sparse components with
disjoint supports that jointly capture the maximum possible variance. These
components can be computed one by one, repeatedly solving the single-component
problem and deflating the input data matrix, but as we show this greedy
procedure is suboptimal. We present a novel algorithm for sparse PCA that
jointly optimizes multiple disjoint components. The extracted features capture
variance that lies within a multiplicative factor arbitrarily close to 1 from
the optimal. Our algorithm is combinatorial and computes the desired components
by solving multiple instances of the bipartite maximum weight matching problem.
Its complexity grows as a low order polynomial in the ambient dimension of the
input data matrix, but exponentially in its rank. However, it can be
effectively applied on a low-dimensional sketch of the data; this allows us to
obtain polynomial-time approximation guarantees via spectral bounds. We
evaluate our algorithm on real data-sets and empirically demonstrate that in
many cases it outperforms existing, deflation-based approaches.
Solving Random Quadratic Systems of Equations Is Nearly as Easy as Solving Linear Systems
We consider the fundamental problem of solving quadratic systems of equations
in n variables, where y_i = |⟨a_i, x⟩|^2, i = 1, …, m, and x ∈ R^n is
unknown. We propose a novel method which, starting with an initial guess
computed by means of a spectral method, proceeds by minimizing a nonconvex
functional as in the Wirtinger flow approach. There are several key
distinguishing features, most notably, a distinct objective functional and
novel update rules, which operate in an adaptive fashion and drop terms bearing
too much influence on the search direction. These careful selection rules
provide a tighter initial guess, better descent directions, and thus enhanced
practical performance. On the theoretical side, we prove that for certain
unstructured models of quadratic systems, our algorithms return the correct
solution in linear time, i.e. in time proportional to reading the data
{a_i} and {y_i} as soon as the ratio m/n between the
number of equations and unknowns exceeds a fixed numerical constant. We extend
the theory to deal with noisy systems in which we only have y_i ≈ |⟨a_i, x⟩|^2 and prove that our
algorithms achieve a statistical accuracy, which is nearly un-improvable. We
complement our theoretical study with numerical examples showing that solving
random quadratic systems is both computationally and statistically not much
harder than solving linear systems of the same size---hence the title of this
paper. For instance, we demonstrate empirically that the computational cost of
our algorithm is about four times that of solving a least-squares problem of
the same size.
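A loose sketch of the two ingredients, a spectral initialization followed by gradient descent on a nonconvex least-squares functional with a crude truncation rule, is given below; the paper's specific objective and careful selection rules, which are what give the linear-time guarantee, are not reproduced here.

# Loose sketch: spectral initialization, then gradient descent on
# (1/4m) sum_i (<a_i,z>^2 - y_i)^2 with a crude truncation of large residuals.
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 800
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x) ** 2                                  # quadratic measurements y_i = <a_i, x>^2

# spectral initialization: top eigenvector of (1/m) sum_i y_i a_i a_i^T
Y = (A.T * y) @ A / m
z = np.linalg.eigh(Y)[1][:, -1] * np.sqrt(y.mean())   # scaled to the measurement energy

for _ in range(300):
    Az = A @ z
    res = Az ** 2 - y
    keep = np.abs(res) <= 5 * np.mean(np.abs(res))    # drop terms with outsized influence
    grad = (A[keep].T @ (res[keep] * Az[keep])) / m
    z = z - (0.1 / y.mean()) * grad

err = min(np.linalg.norm(z - x), np.linalg.norm(z + x)) / np.linalg.norm(x)
print("relative error (up to global sign):", err)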
Locally Non-linear Embeddings for Extreme Multi-label Learning
Kush Bhatia,
Himanshu Jain,
Purushottam Kar,
Prateek Jain,
Manik Varma
The objective in extreme multi-label learning is to train a classifier that
can automatically tag a novel data point with the most relevant subset of
labels from an extremely large label set. Embedding based approaches make
training and prediction tractable by assuming that the training label matrix is
low-rank and hence the effective number of labels can be reduced by projecting
the high dimensional label vectors onto a low dimensional linear subspace.
Still, leading embedding approaches have been unable to deliver high prediction
accuracies or scale to large problems as the low rank assumption is violated in
most real world applications.
This paper develops the X-One classifier to address both limitations. The
main technical contribution in X-One is a formulation for learning a small
ensemble of local distance preserving embeddings which can accurately predict
infrequently occurring (tail) labels. This allows X-One to break free of the
traditional low-rank assumption and boost classification accuracy by learning
embeddings which preserve pairwise distances between only the nearest label
vectors.
We conducted extensive experiments on several real-world as well as benchmark
data sets and compared our method against state-of-the-art methods for extreme
multi-label classification. Experiments reveal that X-One can make
significantly more accurate predictions than the state-of-the-art methods,
including both embedding-based methods (by as much as 35%) and tree-based methods (by as much as
6%). X-One can also scale efficiently to data sets with a million labels, which
are beyond the pale of leading embedding methods.
Efficient Compressive Phase Retrieval with Constrained Sensing Vectors
Sohail Bahmani,
Justin Romberg
We propose a new approach to the problem of compressive phase retrieval in
which the goal is to reconstruct a sparse vector from the magnitude of a number
of its linear measurements. The proposed framework relies on constrained
sensing vectors and a two-stage reconstruction method that consists of two
standard convex optimization programs that are solved sequentially.
Various methods for compressive phase retrieval have been proposed in recent
years, but none have come with strong efficiency and computational complexity
guarantees. The main obstacle has been that there are no straightforward convex
relaxations for the type of structure in the target. Given a set of
underdetermined measurements, there is a standard framework for recovering a
sparse matrix, and a standard framework for recovering a low-rank matrix.
However, a general, efficient method for recovering a matrix which is jointly
sparse and low-rank has remained elusive.
In this paper, we show that if the sensing vectors are chosen at random from
an incoherent subspace, then the low-rank and sparse structures of the target
signal can be effectively decoupled. We show that a recovery algorithm that
consists of a low-rank recovery stage followed by a sparse recovery stage will
produce an accurate estimate of the target when the number of measurements is
O(k log(d/k)), where k and d denote the sparsity level
and the dimension of the input signal. We also evaluate the algorithm through
numerical simulation.
A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements
Qinqing Zheng,
John Lafferty
We propose a simple, scalable, and fast gradient descent algorithm to
optimize a nonconvex objective for the rank minimization problem and a closely
related family of semidefinite programs. With O(r^2 κ^2 n log n) random
measurements of a positive semidefinite n×n matrix of rank r and
condition number κ, our method is guaranteed to converge linearly to the
global optimum.
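A small sketch of factored gradient descent for this problem is below; the initialization and step size are ad hoc choices for the demo, not the paper's prescriptions.

# Sketch of factored gradient descent for recovering a PSD matrix M = U U^T
# from random linear measurements y_i = <A_i, M>. Initialization and step size
# here are ad hoc; the paper specifies both and proves linear convergence.
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 30, 2, 1000
U_true = rng.standard_normal((n, r))
M_true = U_true @ U_true.T
A = rng.standard_normal((m, n, n))
A = (A + A.transpose(0, 2, 1)) / 2              # symmetric measurement matrices
y = np.einsum('kij,ij->k', A, M_true)           # y_i = <A_i, M>

# spectral-style initialization from (1/m) sum_i y_i A_i
S = np.einsum('k,kij->ij', y, A) / m
evals, evecs = np.linalg.eigh(S)
U = evecs[:, -r:] * np.sqrt(np.maximum(evals[-r:], 1e-8))

lr = 0.25 / np.linalg.norm(M_true)              # ad hoc step size for the demo
for _ in range(1000):
    res = np.einsum('kij,ij->k', A, U @ U.T) - y
    U = U - lr * (2 / m) * np.einsum('k,kij->ij', res, A) @ U

print("relative error:", np.linalg.norm(U @ U.T - M_true) / np.linalg.norm(M_true))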
On the limitation of spectral methods: From the Gaussian hidden clique problem to rank one perturbations of Gaussian tensors
Andrea Montanari,
Daniel Reichman,
Ofer Zeitouni
We consider the following detection problem: given a realization of a
symmetric matrix X of dimension n, distinguish between the
hypothesis that all upper triangular variables are i.i.d. Gaussian variables
with mean 0 and variance 1, and the hypothesis where X is the sum
of such a matrix and an independent rank-one perturbation.
This setup applies to the situation where under the alternative, there is a
planted principal submatrix B of size L for which all upper
triangular variables are i.i.d. Gaussians with mean 1 and variance 1,
whereas all other upper triangular elements of X not in
B are i.i.d. Gaussian variables with mean 0 and variance 1. We
refer to this as the `Gaussian hidden clique problem.'
When L = (1+ϵ)√n (ϵ > 0), it is possible to solve this
detection problem with probability 1 − o_n(1) by computing the spectrum of
X and considering the largest eigenvalue of X. We
prove that this condition is tight in the following sense: when
L < (1−ϵ)√n, no algorithm that examines only the eigenvalues of
X can detect the existence of a hidden Gaussian clique, with error
probability vanishing as n→∞.
We prove this result as an immediate consequence of a more general result on
rank-one perturbations of k-dimensional Gaussian tensors. In this context we
establish a lower bound on the critical signal-to-noise ratio below which a
rank-one signal cannot be detected.
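The spectral test the abstract refers to is simple to try numerically; here is a toy version (my construction of the planted instance, simplified):

# Toy version of the spectral test: compare the largest eigenvalue of X/sqrt(n)
# with and without a planted mean-1 principal submatrix of size L ~ 1.5*sqrt(n).
import numpy as np

def sym_gaussian(n, rng):
    G = rng.standard_normal((n, n))
    return np.triu(G) + np.triu(G, 1).T           # symmetric, i.i.d. upper triangle

rng = np.random.default_rng(0)
n = 2000
L = int(1.5 * np.sqrt(n))                         # planted block size, above the sqrt(n) threshold

X_null = sym_gaussian(n, rng)
X_alt = sym_gaussian(n, rng)
idx = rng.choice(n, size=L, replace=False)
X_alt[np.ix_(idx, idx)] += 1.0                    # shift the planted block's mean to 1

lam_null = np.linalg.eigvalsh(X_null)[-1] / np.sqrt(n)
lam_alt = np.linalg.eigvalsh(X_alt)[-1] / np.sqrt(n)
print("null   :", round(lam_null, 3))             # close to 2
print("planted:", round(lam_alt, 3))              # pushed above 2 when L exceeds sqrt(n)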
Optimized Projections for Compressed Sensing via Direct Mutual Coherence Minimization
Zhouchen Lin,
Canyi Lu,
Huan Li
Compressed Sensing (CS) is a novel technique for simultaneous signal sampling
and compression based on the existence of a sparse representation of signal and
a projected dictionary $PD$, where $P\in\mathbb{R}^{m\times d}$ is the
projection matrix and $D\in\mathbb{R}^{d\times n}$ is the dictionary. To
exactly recover the signal with a small number of measurements $m$, the
projected dictionary $PD$ is expected to be of low mutual coherence. Several
previous methods attempt to find a projection $P$ such that the mutual
coherence of $PD$ is as low as possible. However, they do not minimize
the mutual coherence directly, and thus their methods are far from optimal. Also,
the solvers they use lack convergence guarantees, and thus there is no
guarantee on the quality of their obtained solutions. This work aims to address
these issues. We propose to find an optimal projection by minimizing the mutual
coherence of $PD$ directly. This leads to a nonconvex nonsmooth minimization
problem. We then approximate it by smoothing and solve it by alternating
minimization. We further prove the convergence of our algorithm. To the best of
our knowledge, this is the first work which directly minimizes the mutual
coherence of the projected dictionary with a convergence guarantee. Numerical
experiments demonstrate that the proposed method can recover sparse signals
better than existing methods.
Completing Low-Rank Matrices with Corrupted Samples from Few Coefficients in General Basis
Hongyang Zhang,
Zhouchen Lin,
Chao Zhang
Subspace recovery from corrupted and missing data is crucial for various
applications in signal processing and information theory. To complete missing
values and detect column corruptions, existing robust Matrix Completion (MC)
methods mostly concentrate on recovering a low-rank matrix from few corrupted
coefficients w.r.t. the standard basis, which, however, does not apply to more
general basis, e.g., the Fourier basis. In this paper, we prove that the range
space of an $m\times n$ matrix with rank $r$ can be exactly recovered from few
coefficients w.r.t. general basis, though the rank $r$ and the number of
corrupted samples are both as high as $O(\min\{m,n\}/\log^3 (m+n))$. Thus our
results cover previous work as special cases, and robust MC can recover the
intrinsic matrix with a higher rank. Moreover, we suggest a universal choice of
the regularization parameter, which is $\lambda=1/\sqrt{\log n}$. By our
$\ell_{2,1}$ filtering algorithm, which has theoretical guarantees, we can
further reduce the computational cost of our model. As an application, we also
find that the solutions to extended robust Low-Rank Representation and to our
extended robust MC are mutually expressible, so both our theory and algorithm
can be immediately applied to the subspace clustering problem with missing
values. Experiments verify our theories.
Optimal Rates for Random Fourier Features
Bharath K. Sriperumbudur,
Zoltan Szabo
Kernel methods represent one of the most powerful tools in machine learning
to tackle problems expressed in terms of function values and derivatives due to
their capability to represent and model complex relations. While these methods
show good versatility, they are computationally intensive and have poor
scalability to large data as they require operations on Gram matrices. In order
to mitigate this serious computational limitation, recently randomized
constructions have been proposed in the literature, which allow the application
of fast linear algorithms. Random Fourier features (RFF) are among the most
popular and widely applied constructions: they provide an easily computable,
low-dimensional feature representation for shift-invariant kernels. Despite the
popularity of RFFs, very little is understood theoretically about their
approximation quality. In this paper, we provide the first detailed theoretical
analysis about the approximation quality of RFFs by establishing optimal (in
terms of the RFF dimension) performance guarantees in uniform and L^r (1 ≤ r < ∞) norms. We also propose an RFF approximation to derivatives of a
kernel with a theoretical study on its approximation quality.
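For context, the standard RFF construction for the Gaussian kernel (textbook material, not specific to this paper) is only a few lines; the paper quantifies how fast z(x)^T z(y) converges to k(x, y) as the feature dimension D grows.

# Standard random Fourier features for the Gaussian kernel, included for
# context: z(x)^T z(y) approximates exp(-gamma * ||x - y||^2).
import numpy as np

def rff(X, D, gamma, rng):
    d = X.shape[1]
    W = rng.standard_normal((d, D)) * np.sqrt(2 * gamma)   # frequencies from the kernel's spectral density
    b = rng.uniform(0, 2 * np.pi, D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
gamma = 0.5
K_exact = np.exp(-gamma * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))

for D in (50, 500, 5000):
    Z = rff(X, D, gamma, rng)
    err = np.max(np.abs(Z @ Z.T - K_exact))
    print(f"D = {D:5d}   max |approx - exact| = {err:.3f}")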
Learning with Group Invariant Features: A Kernel Perspective
Youssef Mroueh,
Stephen Voinea,
Tomaso Poggio
We analyze in this paper a random feature map based on a recently introduced
theory of invariance (I-theory). More specifically, a group invariant signal
signature is obtained through cumulative distributions of group transformed
random projections. Our analysis bridges invariant feature learning with kernel
methods, as we show that this feature map defines an expected Haar integration
kernel that is invariant to the specified group action. We show how this
non-linear random feature map approximates this group invariant kernel
uniformly on a set of N points. Moreover, we show that it defines a function
space that is dense in the equivalent Invariant Reproducing Kernel Hilbert
Space. Finally, we quantify error rates of the convergence of the empirical
risk minimization, as well as the reduction in the sample complexity of a
learning algorithm using such an invariant representation for signal
classification, in a classical supervised learning setting.
Optimal linear estimation under unknown nonlinear transform
Xinyang Yi,
Zhaoran Wang,
Constantine Caramanis,
Han Liu
Linear regression studies the problem of estimating a model parameter
β^* ∈ R^p, from n observations
{(y_i, x_i)}_{i=1}^n from the linear model y_i = ⟨x_i, β^*⟩ + ϵ_i. We consider a significant
generalization in which the relationship between ⟨x_i, β^*⟩ and y_i is noisy, quantized to a single bit, potentially nonlinear,
noninvertible, as well as unknown. This model is known as the single-index
model in statistics, and, among other things, it represents a significant
generalization of one-bit compressed sensing. We propose a novel spectral-based
estimation procedure and show that we can recover β^* in settings (i.e.,
classes of link function f) where previous algorithms fail. In general, our
algorithm requires only very mild restrictions on the (unknown) functional
relationship between y_i and ⟨x_i, β^*⟩. We also
consider the high dimensional setting where β^* is sparse, and introduce
a two-stage nonconvex framework that addresses estimation challenges in high
dimensional regimes where p ≫ n. For a broad class of link functions
between ⟨x_i, β^*⟩ and y_i, we establish minimax
lower bounds that demonstrate the optimality of our estimators in both the
classical and high dimensional regimes.
Deeply Learning the Messages in Message Passing Inference
Guosheng Lin,
Chunhua Shen,
Ian Reid,
Anton van den Hengel
Deep structured output learning shows great promise in tasks like semantic
image segmentation. We proffer a new, efficient deep structured model learning
scheme, in which we show how deep Convolutional Neural Networks (CNNs) can be
used to estimate the messages in message passing inference for structured
prediction with Conditional Random Fields (CRFs). With such CNN message
estimators, we obviate the need to learn or evaluate potential functions for
message calculation. This confers significant efficiency for learning, since
otherwise when performing structured learning for a CRF with CNN potentials it
is necessary to undertake expensive inference for every stochastic gradient
iteration. The network output dimension for message estimation is the same as
the number of classes, in contrast to the network output for general CNN
potential functions in CRFs, which is exponential in the order of the
potentials. Hence CNN message learning has fewer network parameters and is more
scalable in cases where a large number of classes is involved. We apply our
method to semantic image segmentation on the PASCAL VOC 2012 dataset. We
achieve an intersection-over-union score of 73.4 on its test set, which is the
best reported result for methods using the VOC training images alone. This
impressive performance demonstrates the effectiveness and usefulness of our CNN
message learning method.
Fast Label Embeddings via Randomized Linear Algebra
Paul Mineiro,
Nikos Karampatziakis
Many modern multiclass and multilabel problems are characterized by
increasingly large output spaces. For these problems, label embeddings have
been shown to be a useful primitive that can improve computational and
statistical efficiency. In this work we utilize a correspondence between rank
constrained estimation and low dimensional label embeddings that uncovers a
fast label embedding algorithm which works in both the multiclass and
multilabel settings. The result is a randomized algorithm whose running time is
exponentially faster than naive algorithms. We demonstrate our techniques on
two large-scale public datasets, from the Large Scale Hierarchical Text
Challenge and the Open Directory Project, where we obtain state of the art
results.
Fast Randomized Kernel Methods With Statistical Guarantees
Ahmed El Alaoui,
Michael W. Mahoney
One approach to improving the running time of kernel-based machine learning
methods is to build a small sketch of the input and use it in lieu of the full
kernel matrix in the machine learning task of interest. Here, we describe a
version of this approach that comes with running time guarantees as well as
improved guarantees on its statistical performance. By extending the notion of
\emph{statistical leverage scores} to the setting of kernel ridge regression,
our main statistical result is to identify an importance sampling distribution
that reduces the size of the sketch (i.e., the required number of columns to be
sampled) to the \emph{effective dimensionality} of the problem. This quantity
is often much smaller than previous bounds that depend on the \emph{maximal
degrees of freedom}. Our main algorithmic result is to present a fast algorithm
to compute approximations to these scores. This algorithm runs in time that is
linear in the number of samples---more precisely, the running time is
O(np^2), where the parameter p depends only on the trace of the kernel
matrix and the regularization parameter---and it can be applied to the matrix
of feature vectors, without having to form the full kernel matrix. This is
obtained via a variant of length-squared sampling that we adapt to the kernel
setting in a way that is of independent interest. Lastly, we provide empirical
results illustrating our theory, and we discuss how this new notion of the
statistical leverage of a data point captures in a fine way the difficulty of
the original statistical learning problem.
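For concreteness, the statistical object in question, the λ-ridge leverage scores and the effective dimension, can be computed exactly as below; doing it this way costs as much as the full kernel solve, and the paper's fast approximation of these scores is not reproduced here.

# Exact ridge leverage scores and effective dimension for kernel ridge
# regression, plus a Nystrom-style column sketch sampled from those scores.
# (The paper's contribution is a fast approximation of the scores.)
import numpy as np

rng = np.random.default_rng(0)
n, lam = 500, 1e-2
X = rng.standard_normal((n, 3))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq)                                  # Gaussian kernel matrix

scores = np.diag(K @ np.linalg.inv(K + n * lam * np.eye(n)))   # ridge leverage scores
d_eff = scores.sum()                                   # effective dimensionality
print("effective dimension:", round(d_eff, 1), "out of n =", n)

# sample columns proportionally to the scores to build a small sketch of K
s = int(np.ceil(3 * d_eff))
cols = rng.choice(n, size=s, replace=True, p=scores / scores.sum())
C, Wmat = K[:, cols], K[np.ix_(cols, cols)]
K_approx = C @ np.linalg.pinv(Wmat) @ C.T
print("relative sketch error:", np.linalg.norm(K - K_approx) / np.linalg.norm(K))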
Randomized Dual Coordinate Ascent with Arbitrary Sampling
Zheng Qu,
Peter Richtárik,
Tong Zhang
We study the problem of minimizing the average of a large number of smooth
convex functions penalized with a strongly convex regularizer. We propose and
analyze a novel primal-dual method (Quartz) which at every iteration samples
and updates a random subset of the dual variables, chosen according to an
arbitrary distribution. In contrast to typical analysis, we directly bound the
decrease of the primal-dual error (in expectation), without the need to first
analyze the dual error. Depending on the choice of the sampling, we obtain
efficient serial, parallel and distributed variants of the method. In the
serial case, our bounds match the best known bounds for SDCA (both with uniform
and importance sampling). With standard mini-batching, our bounds predict
initial data-independent speedup as well as additional data-driven speedup
which depends on spectral and sparsity properties of the data. We calculate
theoretical speedup factors and find that they are excellent predictors of
actual speedup in practice. Moreover, we illustrate that it is possible to
design an efficient mini-batch importance sampling. The distributed variant of
Quartz is the first distributed SDCA-like method with an analysis for
non-separable data.
Fast and Guaranteed Tensor Decomposition via Sketching
Yining Wang,
Hsiao-Yu Tung,
Alexander Smola,
Animashree Anandkumar
Tensor CANDECOMP/PARAFAC (CP) decomposition has wide applications in
statistical learning of latent variable models and in data mining. In this
paper, we propose fast and randomized tensor CP decomposition algorithms based
on sketching. We build on the idea of count sketches, but introduce many novel
ideas which are unique to tensors. We develop novel methods for randomized
computation of tensor contractions via FFTs, without explicitly forming the
tensors. Such tensor contractions are encountered in decomposition methods such
as tensor power iterations and alternating least squares. We also design novel
colliding hashes for symmetric tensors to further save time in computing the
sketches. We then combine these sketching ideas with existing whitening and
tensor power iterative techniques to obtain the fastest algorithm on both
sparse and dense tensors. The quality of approximation under our method does
not depend on properties such as sparsity, uniformity of elements, etc. We
apply the method for topic modeling and obtain competitive results.
Learning both Weights and Connections for Efficient Neural Networks
Song Han,
Jeff Pool,
John Tran,
William J. Dally
Neural networks are both computationally intensive and memory intensive,
making them difficult to deploy on embedded systems. Also, conventional
networks fix the architecture before training starts; as a result, training
cannot improve the architecture. To address these limitations, we describe a
method to reduce the storage and computation required by neural networks by an
order of magnitude without affecting their accuracy, by learning only the
important connections. Our method prunes redundant connections using a
three-step method. First, we train the network to learn which connections are
important. Next, we prune the unimportant connections. Finally, we retrain the
network to fine tune the weights of the remaining connections. On the ImageNet
dataset, our method reduced the number of parameters of AlexNet by a factor of
9x, from 61 million to 6.7 million, without incurring accuracy loss. Similar
experiments with VGG16 found that the network as a whole can be reduced by 13x,
again with no loss of accuracy.
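The train/prune/retrain loop is easy to sketch on a toy problem; below, a single linear layer stands in for the paper's CNNs.

# Tiny sketch of the three-step method on one linear layer: train, zero out
# the smallest-magnitude weights, then keep training with the mask fixed.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 100
w_true = np.zeros(d); w_true[:10] = rng.standard_normal(10)     # few genuinely useful connections
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)

def train(w, mask, epochs=300, lr=0.1):
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / n
        w = (w - lr * grad) * mask             # pruned weights stay exactly zero
    return w

w_dense = train(np.zeros(d), np.ones(d))       # 1) learn which connections matter
threshold = np.quantile(np.abs(w_dense), 0.9)  # 2) prune the smallest 90% of the weights
mask = (np.abs(w_dense) > threshold).astype(float)
w_pruned = train(w_dense * mask, mask)         # 3) retrain the remaining connections

print("connections kept:", int(mask.sum()), "of", d)
print("parameter error, dense :", round(np.linalg.norm(w_dense - w_true), 3))
print("parameter error, pruned:", round(np.linalg.norm(w_pruned - w_true), 3))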
Matrix Completion from Fewer Entries: Spectral Detectability and Rank Estimation
Alaa Saade,
Florent Krzakala,
Lenka Zdeborová
The completion of low rank matrices from few entries is a task with many
practical applications. We consider here two aspects of this problem:
detectability, i.e. the ability to estimate the rank r reliably from the
fewest possible random entries, and performance in achieving small
reconstruction error. We propose a spectral algorithm for these two tasks
called MaCBetH (for Matrix Completion with the Bethe Hessian). The rank is
estimated as the number of negative eigenvalues of the Bethe Hessian matrix,
and the corresponding eigenvectors are used as initial condition for the
minimization of the discrepancy between the estimated matrix and the revealed
entries. We analyze the performance in a random matrix setting using results
from the statistical mechanics of the Hopfield neural network, and show in
particular that MaCBetH efficiently detects the rank r of a large n×m
matrix from C(r)√(rnm) entries, where C(r) is a constant close to 1.
We also evaluate the corresponding root-mean-square error empirically and show
that MaCBetH compares favorably to other existing approaches.
Beyond Convexity: Stochastic Quasi-Convex Optimization
Elad Hazan,
Kfir Y. Levy,
Shai Shalev-Shwartz
Stochastic convex optimization is a basic and well studied primitive in
machine learning. It is well known that convex and Lipschitz functions can be
minimized efficiently using Stochastic Gradient Descent (SGD). The Normalized
Gradient Descent (NGD) algorithm, is an adaptation of Gradient Descent, which
updates according to the direction of the gradients, rather than the gradients
themselves. In this paper we analyze a stochastic version of NGD and prove its
convergence to a global minimum for a wider class of functions: we require the
functions to be quasi-convex and locally-Lipschitz. Quasi-convexity broadens
the concept of unimodality to multiple dimensions and allows for certain types of
saddle points, which are a known hurdle for first-order optimization methods
such as gradient descent. Locally-Lipschitz functions are only required to be
Lipschitz in a small region around the optimum. This assumption circumvents
gradient explosion, which is another known hurdle for gradient descent
variants. Interestingly, unlike the vanilla SGD algorithm, the stochastic
normalized gradient descent algorithm provably requires a minimal minibatch
size.
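The algorithm itself is a one-line change to SGD: normalize each minibatch gradient before stepping. A toy comparison on a quasi-convex objective that is nearly flat far from its minimum (my own example) is below.

# SGD vs. stochastic normalized gradient descent (SNGD) on a quasi-convex toy
# loss that plateaus far from the optimum, where plain SGD barely moves.
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([3.0, -2.0])

def minibatch_grad(w, batch=32):
    # per-sample gradient of the quasi-convex loss ||u|| / (1 + ||u||), u = w - w_star + noise
    noise = 0.1 * rng.standard_normal((batch, 2))
    diffs = w - w_star + noise
    d = np.linalg.norm(diffs, axis=1, keepdims=True)
    return np.mean(diffs / (d * (1 + d) ** 2), axis=0)

def run(normalized, steps=2000, lr=0.05):
    w = np.array([40.0, 40.0])
    for _ in range(steps):
        g = minibatch_grad(w)
        if normalized:
            g = g / (np.linalg.norm(g) + 1e-12)   # the only change relative to SGD
        w = w - lr * g
    return np.linalg.norm(w - w_star)

print("SGD  final distance to the optimum:", round(run(False), 3))
print("SNGD final distance to the optimum:", round(run(True), 3))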
Training Very Deep Networks
Rupesh Kumar Srivastava,
Klaus Greff,
Jürgen Schmidhuber
Theoretical and empirical evidence indicates that the depth of neural
networks is crucial for their success. However, training becomes more difficult
as depth increases, and training of very deep networks remains an open problem.
Here we introduce a new architecture designed to overcome this. Our so-called
highway networks allow unimpeded information flow across many layers on
information highways. They are inspired by Long Short-Term Memory recurrent
networks and use adaptive gating units to regulate the information flow. Even
with hundreds of layers, highway networks can be trained directly through
simple gradient descent. This enables the study of extremely deep and efficient
architectures.
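A minimal feed-forward sketch of the building block is below (forward pass only; the negative transform-gate bias makes each layer start close to carrying its input, in the spirit of the initialization the authors describe).

# Minimal highway layer: a transform gate T decides, per unit, how much of the
# transformed signal H(x) versus the raw input x to pass through.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    H = np.tanh(x @ W_h + b_h)          # the usual nonlinear transform
    T = sigmoid(x @ W_t + b_t)          # transform gate in (0, 1)
    return T * H + (1.0 - T) * x        # carry gate is 1 - T

rng = np.random.default_rng(0)
d, depth = 64, 50
x = rng.standard_normal((8, d))
h = x
for _ in range(depth):                  # stack many layers
    W_h = 0.05 * rng.standard_normal((d, d))
    W_t = 0.05 * rng.standard_normal((d, d))
    b_t = -4.0 * np.ones(d)             # negative gate bias: layers start close to the identity
    h = highway_layer(h, W_h, np.zeros(d), W_t, b_t)
print("input/output correlation after", depth, "layers:",
      round(np.corrcoef(x.ravel(), h.ravel())[0, 1], 3))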
Path-SGD: Path-Normalized Optimization in Deep Neural Networks
Behnam Neyshabur,
Ruslan Salakhutdinov,
Nathan Srebro
We revisit the choice of SGD for training deep neural networks by
reconsidering the appropriate geometry in which to optimize the weights. We
argue for a geometry invariant to rescaling of weights that does not affect the
output of the network, and suggest Path-SGD, which is an approximate steepest
descent method with respect to a path-wise regularizer related to max-norm
regularization. Path-SGD is easy and efficient to implement and leads to
empirical gains over SGD and AdaGrad.
Spectral Representations for Convolutional Neural Networks
Oren Rippel,
Jasper Snoek,
Ryan P. Adams
Discrete Fourier transforms provide a significant speedup in the computation
of convolutions in deep learning. In this work, we demonstrate that, beyond its
advantages for efficient computation, the spectral domain also provides a
powerful representation in which to model and train convolutional neural
networks (CNNs).
We employ spectral representations to introduce a number of innovations to
CNN design. First, we propose spectral pooling, which performs dimensionality
reduction by truncating the representation in the frequency domain. This
approach preserves considerably more information per parameter than other
pooling strategies and enables flexibility in the choice of pooling output
dimensionality. This representation also enables a new form of stochastic
regularization by randomized modification of resolution. We show that these
methods achieve competitive results on classification and approximation tasks,
without using any dropout or max-pooling.
Finally, we demonstrate the effectiveness of complex-coefficient spectral
parameterization of convolutional filters. While this leaves the underlying
model unchanged, it results in a representation that greatly facilitates
optimization. We observe on a variety of popular CNN configurations that this
leads to significantly faster convergence during training.
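Spectral pooling itself is only a few lines of FFT code; a minimal single-channel NumPy sketch (no learning, even-sized inputs) follows.

# Minimal sketch of spectral pooling on one channel: go to the frequency
# domain, keep only the centered low-frequency block of the desired output
# size, and come back.
import numpy as np

def spectral_pool(img, out_h, out_w):
    F = np.fft.fftshift(np.fft.fft2(img))            # center the low frequencies
    h, w = F.shape
    top, left = (h - out_h) // 2, (w - out_w) // 2
    F_crop = F[top:top + out_h, left:left + out_w]   # truncate high frequencies
    pooled = np.fft.ifft2(np.fft.ifftshift(F_crop))
    return np.real(pooled) * (out_h * out_w) / (h * w)   # rescale for the size change

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
small = spectral_pool(img, 16, 16)
print(img.shape, "->", small.shape)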
Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms
Christopher De Sa,
Ce Zhang,
Kunle Olukotun,
Christopher Ré
Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of
machine learning problems. Researchers and industry have developed several
techniques to optimize SGD's runtime performance, including asynchronous
execution and reduced precision. Our main result is a martingale-based analysis
that enables us to capture the rich noise models that may arise from such
techniques. Specifically, we use our new analysis in three ways: (1) we derive
convergence rates for the convex case (Hogwild!) with relaxed assumptions on
the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for
non-convex matrix problems including matrix completion; and (3) we design and
analyze an asynchronous SGD algorithm, called Buckwild!, that uses
lower-precision arithmetic. We show experimentally that our algorithms run
efficiently for a variety of problems on modern hardware.
Fast, Provable Algorithms for Isotonic Regression in all ℓp-norms
Rasmus Kyng,
Anup Rao,
Sushant Sachdeva
Given a directed acyclic graph G, and a set of values y on the vertices,
the Isotonic Regression of y is a vector x that respects the partial order
described by G, and minimizes ||x−y||, for a specified norm. This paper
gives improved algorithms for computing the Isotonic Regression for all
weighted ℓp-norms with rigorous performance guarantees. Our algorithms
are quite practical, and their variants can be implemented to run fast in
practice.
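For intuition, the special case where G is a chain and the norm is ℓ2 is solved exactly by the classical pool-adjacent-violators algorithm, sketched below; the paper's contribution is the general-DAG, weighted ℓp case.

# Pool-adjacent-violators (PAVA) for isotonic regression on a chain in the
# l2 norm, included for intuition only.
import numpy as np

def pava(y):
    """Return the nondecreasing x minimizing ||x - y||_2 on a chain."""
    # each block stores (mean, weight); adjacent blocks are merged while out of order
    means, weights = [], []
    for v in y:
        means.append(float(v)); weights.append(1.0)
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2 = means.pop(), weights.pop()
            m1, w1 = means.pop(), weights.pop()
            means.append((m1 * w1 + m2 * w2) / (w1 + w2))
            weights.append(w1 + w2)
    return np.concatenate([np.full(int(w), m) for m, w in zip(means, weights)])

y = np.array([1.0, 3.0, 2.0, 2.0, 5.0, 4.0, 6.0])
print(pava(y))     # blocks that violate the order are averaged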
Efficient and Parsimonious Agnostic Active Learning
Tzu-Kuo Huang,
Alekh Agarwal,
Daniel J. Hsu,
John Langford,
Robert E. Schapire
We develop a new active learning algorithm for the streaming setting
satisfying three important properties: 1) It provably works for any classifier
representation and classification problem including those with severe noise. 2)
It is efficiently implementable with an ERM oracle. 3) It is more aggressive
than all previous approaches satisfying 1 and 2. To do this we create an
algorithm based on a newly defined optimization problem and analyze it. We also
conduct the first experimental analysis of all efficient agnostic active
learning algorithms, discovering that this one is typically better across a
wide variety of datasets and label complexities.
Grammar as a Foreign Language
Oriol Vinyals,
Lukasz Kaiser,
Terry Koo,
Slav Petrov,
Ilya Sutskever,
Geoffrey Hinton
Syntactic constituency parsing is a fundamental problem in natural language
processing and has been the subject of intensive research and engineering for
decades. As a result, the most accurate parsers are domain specific, complex,
and inefficient. In this paper we show that the domain agnostic
attention-enhanced sequence-to-sequence model achieves state-of-the-art results
on the most widely used syntactic constituency parsing dataset, when trained on
a large synthetic corpus that was annotated using existing parsers. It also
matches the performance of standard parsers when trained only on a small
human-annotated dataset, which shows that this model is highly data-efficient,
in contrast to sequence-to-sequence models without the attention mechanism. Our
parser is also fast, processing over a hundred sentences per second with an
unoptimized CPU implementation.
Regularization-free estimation in trace regression with symmetric positive semidefinite matrices
Martin Slawski,
Ping Li,
Matthias Hein
Over the past few years, trace regression models have received considerable
attention in the context of matrix completion, quantum state tomography, and
compressed sensing. Regularization-based approaches to estimating the
underlying matrix that promote low-rankedness, notably nuclear norm
regularization, have enjoyed great popularity. In the present paper, we argue
that such regularization may no longer be necessary if the underlying matrix is
symmetric positive semidefinite (\textsf{spd}) and the design satisfies certain
conditions. In this situation, simple least squares estimation subject to an
\textsf{spd} constraint may perform as well as regularization-based approaches
with a proper choice of the regularization parameter, which entails knowledge
of the noise level and/or tuning. By contrast, constrained least squares
estimation comes without any tuning parameter and may hence be preferred due to
its simplicity.
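A minimal sketch of the constrained estimator (with my own demo choices of design and step size): projected gradient descent on the least-squares objective, where the projection onto the positive semidefinite cone simply clips negative eigenvalues, and no regularization parameter appears anywhere.

# Constrained least squares for trace regression with an spd constraint,
# via projected gradient descent. Design and step size are demo choices.
import numpy as np

def psd_project(M):
    M = (M + M.T) / 2
    evals, evecs = np.linalg.eigh(M)
    return evecs @ np.diag(np.maximum(evals, 0)) @ evecs.T

rng = np.random.default_rng(0)
d, r, m = 20, 2, 800
U = rng.standard_normal((d, r))
B_true = U @ U.T                                   # spd target of rank r
A = rng.standard_normal((m, d, d))
A = (A + A.transpose(0, 2, 1)) / 2
y = np.einsum('kij,ij->k', A, B_true) + 0.1 * rng.standard_normal(m)

B = np.zeros((d, d))
lr = 0.25 / m
for _ in range(500):
    res = np.einsum('kij,ij->k', A, B) - y
    grad = np.einsum('k,kij->ij', res, A)
    B = psd_project(B - lr * grad)                 # gradient step + projection, no tuning parameter

print("relative error:", np.linalg.norm(B - B_true) / np.linalg.norm(B_true))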
Winner-Take-All Autoencoders
Alireza Makhzani,
Brendan Frey
In this paper, we propose a winner-take-all method for learning hierarchical
sparse representations in an unsupervised fashion. We first introduce
fully-connected winner-take-all autoencoders which use mini-batch statistics to
directly enforce a lifetime sparsity in the activations of the hidden units. We
then propose the convolutional winner-take-all autoencoder which combines the
benefits of convolutional architectures and autoencoders for learning
shift-invariant sparse representations. We describe a way to train
convolutional autoencoders layer by layer, where in addition to lifetime
sparsity, a spatial sparsity within each feature map is achieved using
winner-take-all activation functions. We will show that winner-take-all
autoencoders can be used to learn deep sparse representations from the
MNIST, CIFAR-10, ImageNet, Street View House Numbers and Toronto Face datasets,
and achieve competitive classification performance.
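The lifetime-sparsity mechanism is easy to isolate: for each hidden unit, keep only its k strongest activations across the minibatch and zero the rest before decoding. Below is a tiny fully-connected toy with manual gradients (the paper's convolutional variant and training details are richer).

# Tiny fully-connected sketch of lifetime sparsity in an autoencoder:
# per hidden unit, keep the top-k activations across the batch, then
# reconstruct from the sparsified code. Manual gradients, plain SGD.
import numpy as np

rng = np.random.default_rng(0)
n, d, h, k = 256, 64, 128, 8                 # batch, input dim, hidden units, winners per unit

def lifetime_sparsify(H, k):
    thresh = np.sort(H, axis=0)[-k, :]       # k-th largest activation per unit (column)
    return H * (H >= thresh)

X = rng.standard_normal((n, d))
W1 = 0.1 * rng.standard_normal((d, h)); b1 = np.zeros(h)
W2 = 0.1 * rng.standard_normal((h, d)); b2 = np.zeros(d)
lr = 0.01

for step in range(500):
    Z = X @ W1 + b1
    H = np.maximum(Z, 0)                     # ReLU encoder
    Hs = lifetime_sparsify(H, k)             # winner-take-all across the batch
    Xhat = Hs @ W2 + b2
    err = Xhat - X

    dXhat = 2 * err / n
    dW2, db2 = Hs.T @ dXhat, dXhat.sum(0)
    dHs = dXhat @ W2.T
    dZ = dHs * (Hs > 0)                      # gradient flows only through the winners
    dW1, db1 = X.T @ dZ, dZ.sum(0)

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("reconstruction MSE:", round(np.mean(err ** 2), 4))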
The LASSO with Non-linear Measurements is Equivalent to One With Linear Measurements
Christos Thrampoulidis,
Ehsan Abbasi,
Babak Hassibi
Consider estimating an unknown, but structured, signal x_0 ∈ R^n from m
measurements y_i = g_i(a_i^T x_0), where the a_i's are the rows of a known
measurement matrix A, and g is a (potentially unknown) nonlinear and
random link-function. Such measurement functions could arise in applications
where the measurement device has nonlinearities and uncertainties. It could
also arise by design; e.g., g_i(x) = sign(x + z_i) corresponds to noisy
1-bit quantized measurements. Motivated by the classical work of Brillinger,
and more recent work of Plan and Vershynin, we estimate x_0 via solving the
Generalized-LASSO for some regularization parameter λ > 0 and some
(typically non-smooth) convex structure-inducing regularizer function. While
this approach seems to naively ignore the nonlinear function g, both
Brillinger (in the non-constrained case) and Plan and Vershynin have shown
that, when the entries of A are i.i.d. standard normal, this is a good estimator
of x_0 up to a constant of proportionality μ, which only depends on g.
In this work, we considerably strengthen these results by obtaining explicit
expressions for the squared error, for the \emph{regularized} LASSO, that are
asymptotically \emph{precise} when m and n grow large. A main result is
that the estimation performance of the Generalized LASSO with non-linear
measurements is \emph{asymptotically the same} as one whose measurements are
linear, y_i = μ a_i^T x_0 + σ z_i, with μ = E[γ g(γ)] and
σ^2 = E[(g(γ) − μγ)^2], where γ is standard normal. To the
best of our knowledge, the derived expressions on the estimation performance
are the first-known precise results in this context. One interesting
consequence of our result is that the optimal quantizer of the measurements
that minimizes the estimation error of the LASSO is the celebrated Lloyd-Max
quantizer.
Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.