Showing posts with label sketching. Show all posts

Tuesday, April 27, 2021

Randomized Algorithms for Scientific Computing (RASC)

** Nuit Blanche is now on Twitter: @NuitBlog **

At LightOn, we build photonic hardware that performs random projections and it is nice to find a source of materials on the subject in one document. Here is a report comprehensively presenting how randomized algorithms are key to the future of computing:


Randomized Algorithms for Scientific Computing (RASC) by Aydin Buluc, Tamara G. Kolda, Stefan M. Wild, Mihai Anitescu, Anthony DeGennaro, John Jakeman, Chandrika Kamath, Ramakrishnan (Ramki) Kannan, Miles E. Lopes, Per-Gunnar Martinsson, Kary Myers, Jelani Nelson, Juan M. Restrepo, C. Seshadhri, Draguna Vrabie, Brendt Wohlberg, Stephen J. Wright, Chao Yang, Peter Zwart

Randomized algorithms have propelled advances in artificial intelligence and represent a foundational research area in advancing AI for Science. Future advancements in DOE Office of Science priority areas such as climate science, astrophysics, fusion, advanced materials, combustion, and quantum computing all require randomized algorithms for surmounting challenges of complexity, robustness, and scalability. This report summarizes the outcomes of the workshop "Randomized Algorithms for Scientific Computing (RASC)," held virtually across four days in December 2020 and January 2021.
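For readers new to the topic, here is a minimal NumPy illustration (mine, not taken from the report) of the primitive behind much of this line of work: a Gaussian random projection compresses high-dimensional vectors into a much shorter representation while approximately preserving pairwise distances, in the spirit of the Johnson-Lindenstrauss lemma. All dimensions and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 1000, 10_000, 256          # n points in d dimensions, projected down to k
X = rng.standard_normal((n, d))

# Gaussian random projection: i.i.d. N(0, 1/k) entries, so squared distances
# are preserved in expectation (Johnson-Lindenstrauss style embedding).
P = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ P                            # compressed data, n x k

# Check the distortion on one random pair of points.
i, j = rng.choice(n, size=2, replace=False)
orig = np.linalg.norm(X[i] - X[j])
proj = np.linalg.norm(Y[i] - Y[j])
print(f"original distance {orig:.2f}, projected distance {proj:.2f}")
```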

 
Follow @NuitBlog or join the CompressiveSensing Reddit, the Facebook page, the Compressive Sensing group on LinkedIn  or the Advanced Matrix Factorization group on LinkedIn

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email.

Other links:
Paris Machine Learning: Meetup.com || @Archives || LinkedIn || Facebook || @ParisMLGroup
About LightOn: Newsletter || @LightOnIO || on LinkedIn || on CrunchBase || our Blog
About myself: LightOn || Google Scholar || LinkedIn || @IgorCarron || Homepage || ArXiv

Thursday, December 21, 2017

Program / Streaming: The Future of Random Projections: a mini-workshop






Florent Krzakala and I are organizing a mini-workshop on The Future of Random Projections tomorrow.

Streaming will be live here:



Random Projections have proven useful in many areas ranging from signal processing to machine learning. In this informal mini-workshop, we aim to bring together researchers from different areas to discuss this exciting topic and its future use in the area of big data, heavy computations, and Deep Learning. 
 
Where and When:

Friday, December 22, 9:30am to 12:30pm (Paris time)
Amphitheatre A, IPGG, 6 rue Jean Calvin

the program:
  • Romain Couillet (Centrale Supelec), A Random Matrix Approach to Random Feature Maps
  • In this talk, I will discuss how advanced tools from random matrix theory allow us to better understand and improve the large dimensional statistics of many standard machine learning methods, and in particular non-linear random feature maps. We will notably show that the performance of extreme learning machines (that can be seen as mere ridge-linear regression of non-linear RFMs) is easily understood, particularly so when the input data arise from a mixture model.
    Learning parameters from voluminous data can be prohibitive in terms of memory and computational requirements. An increasingly popular approach is to first compress the database into a representation called a linear sketch, that satisfies all the mentioned requirements, then learn the desired information using only this sketch, which can be significantly faster than using the full data if the sketch is small.
    In this talk, we introduce a generic methodology to fit a mixture of probability distributions on the data, using only a sketch of the database. The sketch is defined by combining two notions from the reproducing kernel literature, namely kernel mean embedding and Random Features expansions. It is seen to correspond to linear measurements of the underlying probability distribution of the data, and the estimation problem is analyzed under the lens of Compressive Sensing (CS). We extend CS results to our infinite-dimensional framework, give generic conditions for successful estimation and apply this analysis to many problems, with a focus on mixture models estimation. We base our method on the construction of random sketching operators, using kernel mean embeddings and random features, such that some Restricted Isometry Property (RIP) condition holds in the Banach space of finite signed measures, with high probability, for a number of random features that only depends on the complexity of the problem. We also describe a flexible heuristic greedy algorithm to estimate mixture models from a sketch, and apply it on synthetic and real data.

  • Gael Varoquaux (INRIA), Recursive nearest agglomeration (ReNA): fast clustering for approximation of structured signals

  • In this work, we revisit fast dimension reduction approaches, as with random projections and random sampling. Our goal is to summarize the data to decrease computational costs and memory footprint of subsequent analysis. Such dimension reduction can be very efficient when the signals of interest have a strong structure, such as with images. We focus on this setting and investigate feature clustering schemes for data reductions that capture this structure. An impediment to fast dimension reduction is that good clustering comes with large algorithmic costs. We address it by contributing a linear-time agglomerative clustering scheme, Recursive Nearest Agglomeration (ReNA). Unlike existing fast agglomerative schemes, it avoids the creation of giant clusters. We empirically validate that it approximates the data as well as traditional variance-minimizing clustering schemes that have a quadratic complexity. In addition, we analyze signal approximation with feature clustering and show that it can remove noise, improving subsequent analysis steps. As a consequence, data reduction by clustering features with ReNA yields very fast and accurate models, making it possible to process large datasets on a budget. Our theoretical analysis is backed by extensive experiments on publicly-available data that illustrate the computation efficiency and the denoising properties of the resulting dimension reduction scheme.
  • Arthur Mensch (INRIA), Stochastic Subsampling for Factorizing Huge Matrices
  • We present a matrix-factorization algorithm that scales to input matrices with both a huge number of rows and columns. Learned factors may be sparse or dense and/or nonnegative, which makes our algorithm suitable for dictionary learning, sparse component analysis, and nonnegative matrix factorization. Our algorithm streams matrix columns while subsampling them to iteratively learn the matrix factors. At each iteration, the row dimension of a new sample is reduced by subsampling, resulting in lower time complexity compared to a simple streaming algorithm. Our method comes with convergence guarantees to reach a stationary point of the matrix-factorization problem. We demonstrate its efficiency on massive functional magnetic resonance imaging data (2 TB), and on patches extracted from hyperspectral images (103 GB). For both problems, which involve different penalties on rows and columns, we obtain significant speed-ups compared to state-of-the-art algorithms.
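As a companion to Romain Couillet's abstract above, here is a toy sketch (my own, with arbitrary dimensions and data, not material from any of the talks) of the random-feature pipeline it refers to: a fixed random first layer followed by a nonlinearity, with ridge regression on top, i.e. an extreme learning machine.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-class data in d dimensions.
d, n = 50, 2000
X = np.vstack([rng.normal(-0.5, 1.0, (n // 2, d)),
               rng.normal(+0.5, 1.0, (n // 2, d))])
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])

# Non-linear random feature map: fixed random first layer + ReLU.
D = 500                                        # number of random features
W = rng.standard_normal((d, D)) / np.sqrt(d)
Phi = np.maximum(X @ W, 0.0)

# Ridge regression on the random features ("extreme learning machine").
lam = 1e-2
beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

accuracy = (np.sign(Phi @ beta) == y).mean()
print(f"training accuracy: {accuracy:.3f}")
```

Only the output layer is trained; the random first layer is never updated, which is what makes the random-matrix analysis of such models tractable.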

Tuesday, October 24, 2017

Thesis: Sketching for Large-Scale Learning of Mixture Models by Nicolas Keriven

Congratulations Dr. Keriven ! We've featured some of his work before... 
but now we have the whole thesis, and Nicolas tells me there is more good stuff in it. Woohoo !



Automatic learning processes are becoming ubiquitous in many domains of science. However, nowadays databases commonly comprise millions or billions of elements, which challenge traditional learning methods. Furthermore, modern database architectures involve new difficulties: data may be seen once then discarded (a situation usually referred to as a data stream), often databases are not stored in one single location but distributed across several storage places, and it is undesirable to gather the whole database in one place for the sake of privacy and robustness to malicious attacks. It has thus become necessary to derive learning procedures that are amenable to very large databases, and to distributed and streaming computing.

A popular idea is to define an intermediary compressed representation of a database, which is fast to compute, adapted to streaming and distributed computing through update and merge mechanisms, preserves data privacy, and such that the desired learning task can be performed using only this compressed representation, with a computational complexity that is greatly reduced compared to using the full database. A popular class of such representations is called linear sketches: the whole database is compressed into a single fixed-size vector called a sketch, such that the sketch of the union of two databases is the sum of their sketches. Because of this property it is obvious that linear sketches are particularly convenient for streaming, distributed and parallel computing.

In [BGP13; BGP15], Bourrier et al. introduced a learning method based on a linear sketch formed by a random sampling of the empirical characteristic function of a collection of multidimensional vectors. They showed empirically that it was possible to fit a Gaussian Mixture Model (GMM) with fixed identity covariance on the original data, using only its sketch. However, the method was restricted to GMMs with identity covariance, and theoretical justifications were still an open question. Extending this method to other models and providing a theoretical analysis of the approach is the main purpose of this thesis work.

To do so, we develop an original framework based on several different sets of mathematical tools. The expression of the sketching operator is formalized by combining kernel mean embedding, which allows one to define tunable Hilbertian metrics on the set of probability distributions, with Random Feature expansions, that approximate the infinite-dimensional mapping associated with a kernel function by a finite-dimensional mapping designed randomly. Using this mathematical framework, we analyze the sketching method under the lens of Compressive Sensing, which states that any signal that is in some sense less complex than the ambient dimension can be successfully compressed and estimated. We adapt classic proofs for finite-dimensional settings to our generalized infinite-dimensional framework. We provide guarantees for many problems, including that of estimating mixtures of multivariate elliptic α-stable distributions from a sketch, for which no estimator was known. We particularly extend the framework and relate it to more traditional learning in two cases: first when recovering centroids from a sketch for the k-means or k-medians problem, and for GMM estimation with known covariance.
We introduce a flexible heuristic greedy algorithm coined Compressive Learning - Orthogonal Matching Pursuit with Replacement (CL-OMPR) that can estimate any parametric mixture model from any sketch in a very wide variety of situations. Experiments are performed on real and synthetic data for three models. First, mixtures of Diracs, for which our approach is shown to be more efficient and more stable than k-means on large databases; second, GMMs with unknown diagonal covariances, where the proposed approach is seen to be faster and lighter than classic Expectation Maximization (EM). And, finally, mixtures of multivariate elliptic α-stable distributions, where our approach is the first viable algorithm of which we are aware that can perform this task.


Résumé (translated from the French): Modern databases are very large, sometimes split and distributed across several storage sites, or arriving as data streams: this raises major new challenges for statistical learning methods. One recent approach able to cope with these situations consists in first compressing the data into a structure called a linear sketch, and then performing the learning task using only this sketch, which is extremely fast when the sketch is small. In this thesis, we define such a method to estimate a mixture model of probability distributions from the data, using only a sketch of them. This sketch is defined by drawing on several notions from the field of kernel methods: kernel mean embeddings and random kernel approximations. Defined in this way, the sketch corresponds to linear measurements of the probability distribution underlying the data. We therefore analyze the problem using tools from the field of compressive sensing, in which a signal is measured randomly without loss of information, under certain conditions. We extend some compressive sensing results to the infinite-dimensional setting, give generic conditions guaranteeing the success of our mixture-model estimation method, and apply them to several problems, notably that of estimating mixtures of multivariate stable distributions, for which no estimator previously existed. Our analysis is based on the construction of randomly built sketching operators that satisfy a Restricted Isometry Property in the Banach space of finite signed measures with high probability. In a second part, we introduce a greedy algorithm capable of heuristically estimating a mixture model from a linear sketch. This algorithm is applied on simulated and real data to three problems: the estimation of significant centers in the data, for which the sketching method is observed to be significantly faster than a classical k-means algorithm; the estimation of Gaussian mixtures, for which it is faster than an Expectation-Maximization algorithm; and finally the estimation of mixtures of multivariate stable distributions, for which, to our knowledge, no algorithm capable of performing such a task previously existed.
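To make the linear-sketch idea above concrete, here is a small NumPy illustration (mine, with an arbitrary Gaussian frequency matrix rather than one tuned to a kernel) of the kind of sketch used in the thesis: random samples of the (unnormalized) empirical characteristic function. Since the sketch is a sum of per-sample contributions, the sketch of the union of two databases is the sum of their sketches, which is what makes it suitable for streaming and distributed computing.

```python
import numpy as np

rng = np.random.default_rng(2)

d, m = 10, 128                        # data dimension, sketch size (fixed, independent of n)
Omega = rng.standard_normal((m, d))   # random frequencies (a real design would match a kernel)

def sketch(X):
    """Unnormalized linear sketch: sum over samples of exp(i * Omega @ x).
    Dividing by len(X) gives the empirical characteristic function sampled
    at the random frequencies, i.e. a random-feature kernel mean embedding."""
    return np.exp(1j * X @ Omega.T).sum(axis=0)

# Two shards of a distributed database.
X1 = rng.normal(0.0, 1.0, (5000, d))
X2 = rng.normal(3.0, 1.0, (3000, d))

# Linearity: the sketch of the union equals the sum of the shard sketches.
s_union = sketch(np.vstack([X1, X2]))
s_merged = sketch(X1) + sketch(X2)
print("max merge discrepancy:", np.abs(s_union - s_merged).max())  # floating-point rounding only
```

Note that the sketch has 128 entries regardless of whether the database holds thousands or billions of samples; the learning algorithm (CL-OMPR in the thesis) then works only with this vector.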

Monday, May 22, 2017

Short and Deep: Sketching and Neural Networks

This is a revised version of an earlier preprint.
Short and Deep: Sketching and Neural Networks by Amit Daniely, Nevena Lazic, Yoram Singer, Kunal Talwar
Data-independent methods for dimensionality reduction such as random projections, sketches, and feature hashing have become increasingly popular in recent years. These methods often seek to reduce dimensionality while preserving the hypothesis class, resulting in inherent lower bounds on the size of projected data. For example, preserving linear separability requires Ω(1/γ²) dimensions, where γ is the margin, and in the case of polynomial functions, the number of required dimensions has an exponential dependence on the polynomial degree. Despite these limitations, we show that the dimensionality can be reduced further while maintaining performance guarantees, using improper learning with a slightly larger hypothesis class. In particular, we show that any sparse polynomial function of a sparse binary vector can be computed from a compact sketch by a single-layer neural network, where the sketch size has a logarithmic dependence on the polynomial degree. A practical consequence is that networks trained on sketched data are compact, and therefore suitable for settings with memory and power constraints. We empirically show that our approach leads to networks with fewer parameters than related methods such as feature hashing, at equal or better performance.
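The abstract compares sketching with feature hashing; for reference, here is a bare-bones illustration of the hashing trick (purely mine, not the paper's construction): the active indices of a sparse binary vector are mapped into a short dense vector through a random bucket assignment and random signs.

```python
import numpy as np

def hash_features(active_indices, sketch_dim, n_features=1_000_000, seed=0):
    """Hashing trick: map the active indices of a sparse binary vector of
    length n_features into a dense sketch of length sketch_dim, using a
    random bucket assignment and random +/-1 signs (names are illustrative)."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, sketch_dim, size=n_features)
    signs = rng.choice([-1.0, 1.0], size=n_features)
    z = np.zeros(sketch_dim)
    for i in active_indices:
        z[buckets[i]] += signs[i]
    return z

# A sparse binary input: 4 active features out of a million.
active = [17, 9_812, 424_242, 555_001]
z = hash_features(active, sketch_dim=64)
print("nonzero sketch coordinates:", z.nonzero()[0])
```

The paper's point is that a sketch considerably shorter than what such data-independent maps require can still support learning, provided one allows a slightly richer (improper) hypothesis class such as a single-layer network on the sketch.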






Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Friday, April 28, 2017

Thesis: Randomized Algorithms for Large-Scale Data Analysis by Farhad Pourkamali-Anaraki

Image 1

Stephen just sent me the following:

Hi Igor, 
It's a pleasure to write to you again and announce the graduation of my PhD student Farhad Pourkamali-Anaraki.

It contains a lot of good things, some published, some not. In particular (see attached image 1) he has great work on a 1-pass algorithm for K-means that seems to be one of the only 1-pass algorithms to accurately estimate cluster centers (implementation at https://github.com/stephenbeckr/SparsifiedKMeans), and also has very recent work on efficient variations of the Nystrom method for approximating kernel matrices that seems to give the high accuracy of the clustered Nystrom method at a fraction of the computational cost (see image 2).
Best,
Stephen

Image 2




Thanks Stephen, but I think the following paper also does 1-pass K-means (Keriven N., Tremblay N., Traonmilin Y., Gribonval R., "Compressive K-means" and its implementation SketchMLbox: A MATLAB toolbox for large-scale mixture learning), even though the construction seems different. Both of these implementations will be added to the Advanced Matrix Factorization Jungle page.

Anyway, congratulations Dr. Pourkamali-Anaraki !
Randomized Algorithms for Large-Scale Data Analysis by Farhad Pourkamali-Anaraki. The abstract reads:

Massive high-dimensional data sets are ubiquitous in all scientific disciplines. Extracting meaningful information from these data sets will bring future advances in fields of science and engineering. However, the complexity and high-dimensionality of modern data sets pose unique computational and statistical challenges. The computational requirements of analyzing large-scale data exceed the capacity of traditional data analytic tools. The challenges surrounding large high-dimensional data are felt not just in processing power, but also in memory access, storage requirements, and communication costs. For example, modern data sets are often too large to fit into the main memory of a single workstation and thus data points are processed sequentially without a chance to store the full data. Therefore, there is an urgent need for the development of scalable learning tools and efficient optimization algorithms in today’s high-dimensional data regimes.

A powerful approach to tackle these challenges is centered around preprocessing high-dimensional data sets via a dimensionality reduction technique that preserves the underlying geometry and structure of the data. This approach stems from the observation that high-dimensional data sets often have intrinsic dimension which is significantly smaller than the ambient dimension. Therefore, information-preserving dimensionality reduction methods are valuable tools for reducing the memory and computational requirements of data analytic tasks on large-scale data sets.

Recently, randomized dimension reduction has received a lot of attention in several fields, including signal processing, machine learning, and numerical linear algebra. These methods use random sampling or random projection to construct low-dimensional representations of the data, known as sketches or compressive measurements. These randomized methods are effective in modern data settings since they provide a non-adaptive data-independent mapping of high-dimensional data into a low-dimensional space. However, such methods require strong theoretical guarantees to ensure that the key properties of original data are preserved under a randomized mapping.

This dissertation focuses on the design and analysis of efficient data analytic tasks using randomized dimensionality reduction techniques. Specifically, four efficient signal processing and machine learning algorithms for large high-dimensional data sets are proposed: covariance estimation and principal component analysis, dictionary learning, clustering, and low-rank approximation of positive semidefinite kernel matrices. These techniques are valuable tools to extract important information and patterns from massive data sets. Moreover, an efficient data sparsification framework is introduced that does not require incoherence and distributional assumptions on the data. A main feature of the proposed compression scheme is that it requires only one pass over the data due to the randomized preconditioning transformation, which makes it applicable to streaming and distributed data settings.

The main contribution of this dissertation is threefold: (1) strong theoretical guarantees are provided to ensure that the proposed randomized methods preserve the key properties and structure of high-dimensional data; (2) tradeoffs between accuracy and memory/computation savings are characterized for a large class of data sets as well as dimensionality reduction methods, including random linear maps and random sampling; (3) extensive numerical experiments are presented to demonstrate the performance and benefits of our proposed methods compared to prior works.
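One of the building blocks listed above, low-rank approximation of positive semidefinite kernel matrices, can be illustrated with a plain Nyström approximation based on uniform column sampling; this is a simplified stand-in for the more refined variants studied in the thesis, with arbitrary toy data and bandwidth.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data and a full RBF kernel matrix K (n x n) for reference.
n, d = 1500, 20
X = rng.standard_normal((n, d))
sq_norms = (X ** 2).sum(axis=1)
sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T, 0.0)
K = np.exp(-sq_dists / (2 * d))          # bandwidth chosen arbitrarily

# Basic Nystrom approximation: sample m landmark columns uniformly at random,
# then K ~= C W^+ C^T, where C holds the sampled columns and W their intersection block.
m = 100
idx = rng.choice(n, size=m, replace=False)
C = K[:, idx]
W = K[np.ix_(idx, idx)]
K_nystrom = C @ np.linalg.pinv(W) @ C.T

err = np.linalg.norm(K - K_nystrom) / np.linalg.norm(K)
print(f"relative Frobenius error with {m} landmarks: {err:.3f}")
```

In practice one never forms the full n x n kernel matrix; only the n x m block C and the m x m block W are needed, which is the source of the memory and computation savings.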

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Monday, January 16, 2017

Thesis: Privacy-aware and Scalable Recommender Systems using Sketching Techniques by Raghavendran Balu


Congratulations Dr. Balu !

Privacy-aware and Scalable Recommender Systems using Sketching Techniques by Raghavendran Balu


In this thesis, we aim to study and evaluate the privacy and scalability properties of recommender systems using sketching techniques and propose scalable privacy-preserving personalization mechanisms. Hence, the thesis is at the intersection of three different topics: recommender systems, differential privacy and sketching techniques. On the privacy aspects, we are interested in both new privacy-preserving mechanisms and the evaluation of such mechanisms. We observe that the primary parameter ε in differential privacy is a control parameter, and we are motivated to find techniques that can assess the privacy guarantees. We are also interested in proposing new mechanisms that are privacy preserving and get along well with the evaluation metrics. On the scalability aspects, we aim to solve the challenges arising in user modeling and item retrieval. User modeling with evolving data poses difficulties in storage and in adapting to new data that need to be addressed. Also, addressing the retrieval aspects finds applications in various domains other than recommender systems. We evaluate the impact of our contributions through extensive experiments conducted on benchmark real datasets and, through the results, we surmise that our contributions very well address the privacy and scalability challenges.
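For readers unfamiliar with the differential-privacy parameter ε mentioned above, here is a textbook illustration (not specific to the thesis) of how ε trades privacy for accuracy in the simplest mechanism, a Laplace-noised counting query.

```python
import numpy as np

rng = np.random.default_rng(4)

def private_count(data, predicate, epsilon):
    """epsilon-differentially private count: a counting query changes by at
    most 1 when one record changes (sensitivity 1), so adding Laplace noise
    of scale 1/epsilon satisfies epsilon-differential privacy."""
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

ratings = rng.integers(1, 6, size=10_000)        # toy recommender-style data
for eps in (0.01, 0.1, 1.0):
    noisy = private_count(ratings, lambda r: r == 5, epsilon=eps)
    print(f"epsilon={eps:<5}  noisy count of 5-star ratings: {noisy:9.1f}")
```

Smaller ε means more noise and stronger privacy; part of the thesis is about assessing what guarantee a given ε actually provides in a recommender setting.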






Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !

Tuesday, January 03, 2017

Videos and Slides: 6th MMDS Workshop on Algorithms for Modern Massive Data Sets


 The 6th MMDS Workshop on Algorithms for Modern Massive Data Sets was held June 21–24, 2016, in Berkeley, CA. It is an event that occurs every two years. Unlike the ML conferences, the KDD conferences or the Simons workshops, this meeting finds its voice in gathering folks around the issue of dealing with very large data. Video recordings of all the talks may be found on their YouTube channel. Download the full MMDS 2016 program here. The links to the videos and slides of most talks are listed below; previous MMDS meetings with slides and videos can be found with the MMDS tag. Enjoy!












Tue, June 21: Data Analysis and Statistical Data Analysis
  • 08:00-09:45 Breakfast and registration
  • 09:45-10:00 Welcome and opening remarks (Organizers)
  • 10:00-11:00 Meaningful Visual Exploration of Massive Data (Peter Wang)
  • 11:00-11:30 Scalable Collective Inference from Richly Structured Data (Lise Getoor) [video]
  • 11:30-12:00 A Framework for Processing Large Graphs in Shared Memory (Julian Shun) [video]
  • 12:00-02:00 Lunch
  • 02:00-02:30 Minimax optimal subsampling for large sample linear regression (Aarti Singh) [video]
  • 02:30-03:00 Randomized Low-Rank Approximation and PCA: Beyond Sketching (Cameron Musco) [video]
  • 03:00-03:30 Restricted Strong Convexity Implies Weak Submodularity (Alex Dimakis) [video]
  • 03:30-04:00 Coffee break
  • 04:00-04:30 The Stability Principle for Information Extraction from Data (Bin Yu) [video]
  • 04:30-05:00 New Results in Non-Convex Optimization for Large Scale Machine Learning (Constantine Caramanis) [video]
  • 05:00-05:30 The Union of Intersections Method (Kristofer Bouchard) [video]
  • 05:30-06:00 Head, Torso and Tail - Performance for modeling real data (Alex Smola) [video]
  • 06:00-08:00 Dinner Reception

Wed, June 22: Industrial and Scientific Applications
  • 09:00-10:00 New Methods for Designing and Analyzing Large Scale Randomized Experiment (Jasjeet Sekhon) [video]
  • 10:00-10:30 Cooperative Computing for Autonomous Data Centers Storing Social Network Data (Jonathan Berry) [video]
  • 10:30-11:00 Coffee break
  • 11:00-11:30 Is manifold learning for toy data only? (Marina Meila) [video]
  • 11:30-12:00 Exploring Galaxy Evolution through Manifold Learning (Jake VanderPlas)
  • 12:00-02:00 Lunch
  • 02:00-02:30 Fast, flexible, and interpretable regression modeling (Daniela Witten) [video]
  • 02:30-03:00 Randomized Composable Core-sets for Distributed Computation (Vahab Mirrokni)
  • 03:00-03:30 Local graph clustering algorithms: an optimization perspective (Kimon Fountoulakis) [video]
  • 03:30-04:00 Coffee break
  • 04:00-04:30 Using Principal Component Analysis to Estimate a High Dimensional Factor Model with High-Frequency Data (Dacheng Xiu) [video]
  • 04:30-05:00 Identifying Broad and Narrow Financial Risk Factors with Convex Optimization: Part 1 (Lisa Goldberg) [video]
  • 05:00-05:30 Identifying Broad and Narrow Financial Risk Factors with Convex Optimization: Part 2 (Alex Shkolnik)
  • 05:30-06:00 Learning about business cycle conditions from four terabytes of data (Serena Ng) [video]

Thu, June 23: Novel Algorithmic Methods
  • 09:00-10:00 Top 10 Data Analytics Problems in Science (Prabhat) [video]
  • 10:00-10:30 Low-rank matrix factorizations at scale: Spark for scientific data analytics (Alex Gittens)
  • 10:30-11:00 Coffee break
  • 11:00-11:30 Structure & Dynamics from Random Observations (Abbas Ourmazd) [video]
  • 11:30-12:00 Stochastic Integration via Error-Correcting Codes (Dimitris Achlioptas)
  • 12:30-02:00 Lunch
  • 02:00-02:30 Why Deep Learning Works: Perspectives from Theoretical Chemistry (Charles Martin)
  • 02:30-03:00 A theory of multineuronal dimensionality, dynamics and measurement (Surya Ganguli) [video]
  • 03:00-03:30 Sub-sampled Newton Methods: Uniform and Non-Uniform Sampling (Fred Roosta) [video]
  • 03:30-04:00 Coffee break
  • 04:00-04:30 In-core computation of geometric centralities with HyperBall: A hundred billion nodes and beyond (Sebastiano Vigna) [video]
  • 04:30-05:00 Higher-order clustering of networks (David Gleich)
  • 05:00-05:30 Mining Tools for Large-Scale Networks (Charalampos Tsourakakis) [video]
  • 05:30-06:00 Building Scalable Predictive Modeling Platform for Healthcare Applications (Jimeng Sun) [video]
  • 06:00-08:00 Dinner reception and poster session

Fri, June 24: Novel Matrix and Graph Methods
  • 09:00-10:00 Scalable interaction with data: where artificial intelligence meets visualization (Christopher White)
  • 10:00-10:30 Ameliorating the Annotation Bottleneck (Christopher Re)
  • 10:30-11:00 Coffee break
  • 11:00-11:30 Homophily and transitivity in dynamic network formation (Bryan Graham)
  • 11:30-12:00 Systemwide Commonalities in Market Liquidity (Mark Flood)
  • 12:30-02:00 Lunch
  • 02:00-02:30 Train faster, generalize better: Stability of stochastic gradient descent (Moritz Hardt)
  • 02:30-03:00 Extracting governing equations from highly corrupted data (Rachel Ward)
  • 03:00-03:30 Nonparametric Network Smoothing (Cosma Shalizi)
  • 03:30-04:00 Coffee break
  • 04:00-04:30 PCA from noisy linearly reduced measurements (Amit Singer and Joakim Anden) [video]
  • 04:30-05:00 PCA with Model Misspecification (Robert Anderson) [video]
  • 05:00-05:30 Fast Graphlet Decomposition (Ted Willke and Nesreen Ahmed) [video]




 
Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Wednesday, December 28, 2016

DataSketches : Sketches Library from Yahoo! - implementation -

 
 
While talking to Ravi, I realized I had not mentioned the following library before. Edo was behind the release of the Sketches Library from Yahoo! as open source. This is a Java software library of stochastic streaming algorithms.

from the Datasketches page:

The Business Challenge: Analyzing Big Data Quickly.
In the analysis of big data there are often problem queries that don’t scale because they require huge compute resources and time to generate exact results. Examples include count distinct, quantiles, most frequent items, joins, matrix computations, and graph analysis.
If approximate results are acceptable, there is a class of specialized algorithms, called streaming algorithms or sketches, that can produce results orders of magnitude faster and with mathematically proven error bounds. For interactive queries there may not be other viable alternatives, and in the case of real-time analysis, sketches are the only known solution.
For any system that needs to extract useful information from big data these sketches are a required toolkit that should be tightly integrated into their analysis capabilities. This technology has helped Yahoo successfully reduce data processing times from days to hours or minutes on a number of its internal platforms.
This site is dedicated to providing key sketch algorithms of production quality. Contributions are welcome from those in the big data community interested in further development of this science and art.
 

In particular in the analytics section:

Built-in Theta Sketch set operators (Union, Intersection, Difference) produce sketches as a result (and not just a number) enabling full set expressions of cardinality, such as ((A ∪ B) ∩ (C ∪ D)) \ (E ∪ F). This capability along with predictable and superior accuracy (compared with Include/Exclude approaches) enable unprecedented analysis capabilities for fast queries.
 
This last paragraph echoes the presentation by Sam Bessalah on Stream Mining via Abstract Algebra (ppt version) at the last meetup of season 1 of the Paris Machine Learning meetup (Europe Wide Machine Learning Meetup) with Andrew Ng. Sam's abstract was (I recall distinctly having about 200 people listening studiously to monoids after two and a half hours of other presentations):
 
A quick introduction into some common algorithms and data structures to handle data in streaming fashion, like bloom filters, hyperloglog or min hashes. Then in a second part how abstract algebra with monoids, groups or semi groups help us reason and build scalable analytical systems beyond stream processing.
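To connect the two excerpts above, here is a tiny count-min sketch in Python (an illustration of the general idea, unrelated to the actual implementation in the Yahoo! library): each sketch is a small fixed-size array of counters, and two sketches built with the same hash functions merge by element-wise addition, which is exactly the monoid structure Sam's talk refers to.

```python
import numpy as np

class CountMinSketch:
    """Minimal count-min sketch: approximate item frequencies from a stream
    using a small fixed-size table; estimates never underestimate."""
    def __init__(self, width=1024, depth=4, seed=0):
        rng = np.random.default_rng(seed)
        self.width = width
        self.salts = rng.integers(1, 2**31, size=depth)   # one hash per row
        self.table = np.zeros((depth, width), dtype=np.int64)

    def _cols(self, item):
        return [hash((int(s), item)) % self.width for s in self.salts]

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row, col] += count

    def query(self, item):
        return min(self.table[row, col] for row, col in enumerate(self._cols(item)))

    def merge(self, other):
        """Monoid operation: sketches built with the same seed (hence the same
        hash functions) merge by element-wise addition of their tables."""
        self.table += other.table
        return self

# Two machines sketch their own streams, then only the sketches are exchanged.
a, b = CountMinSketch(), CountMinSketch()
a.add("x", 5000); a.add("y", 123)
b.add("x", 3000)
print(a.merge(b).query("x"))   # ~8000, never an underestimate
```

Because the merge is associative with an obvious identity (an all-zero table), such sketches compose cleanly in distributed and streaming pipelines, which is the point of reasoning about them as monoids.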
 
 
 
Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

SketchMLbox: A MATLAB toolbox for large-scale mixture learning - implementation - ( Compressive K-means , Sketching for Large-Scale Learning of Mixture Models )

 
 
 
Here is a new sketching toolbox: SketchMLbox:

SketchMLbox

A MATLAB toolbox for large-scale mixture learning


Purpose :

The SketchMLbox is a Matlab toolbox for fitting mixture models to large databases using sketching techniques.
The database is first compressed into a vector called sketch, then a mixture model (e.g. a Gaussian Mixture Model) is estimated from this sketch using greedy algorithms typical of sparse recovery.
The size of the sketch does not depend on the number of elements in the database, but rather on the complexity of the problem at hand [2,3]. Its computation can be massively parallelized and distributed over several units. It can also be maintained in an online setting at low cost.

Mixtures of Diracs ("K-means") and Gaussian Mixture Models with diagonal covariance are currently available; the toolbox is structured so that new mixture models can be easily implemented.

Details can be found in the following papers:
[1] Keriven N., Bourrier A., Gribonval R., Pérez P., "Sketching for Large-Scale Learning of Mixture Models", ICASSP 2016.
[2] Keriven N., Bourrier A., Gribonval R., Pérez P., "Sketching for Large-Scale Learning of Mixture Models", arXiv:1606.02838 (extended version)
[3] Keriven N., Tremblay N., Traonmilin Y., Gribonval R., "Compressive K-means", arXiv:1610.08738
 
 
The attendant papers are:

Compressive K-means by Nicolas Keriven, Nicolas Tremblay, Yann Traonmilin, Rémi Gribonval
The Lloyd-Max algorithm is a classical approach to perform K-means clustering. Unfortunately, its cost becomes prohibitive as the training dataset grows large. We propose a compressive version of K-means (CKM), that estimates cluster centers from a sketch, i.e. from a drastically compressed representation of the training dataset. We demonstrate empirically that CKM performs similarly to Lloyd-Max, for a sketch size proportional to the number of centroids times the ambient dimension, and independent of the size of the original dataset. Given the sketch, the computational complexity of CKM is also independent of the size of the dataset. Unlike Lloyd-Max which requires several replicates, we further demonstrate that CKM is almost insensitive to initialization. For a large dataset of 10^7 data points, we show that CKM can run two orders of magnitude faster than five replicates of Lloyd-Max, with similar clustering performance on artificial data. Finally, CKM achieves lower classification errors on handwritten digits classification.

Sketching for Large-Scale Learning of Mixture Models by Nicolas Keriven, Anthony Bourrier, Rémi Gribonval, Patrick Pérez

Learning parameters from voluminous data can be prohibitive in terms of memory and computational requirements. We propose a "compressive learning" framework where we estimate model parameters from a sketch of the training data. This sketch is a collection of generalized moments of the underlying probability distribution of the data. It can be computed in a single pass on the training set, and is easily computable on streams or distributed datasets. The proposed framework shares similarities with compressive sensing, which aims at drastically reducing the dimension of high-dimensional signals while preserving the ability to reconstruct them. To perform the estimation task, we derive an iterative algorithm analogous to sparse reconstruction algorithms in the context of linear inverse problems. We exemplify our framework with the compressive estimation of a Gaussian Mixture Model (GMM), providing heuristics on the choice of the sketching procedure and theoretical guarantees of reconstruction. We experimentally show on synthetic data that the proposed algorithm yields results comparable to the classical Expectation-Maximization (EM) technique while requiring significantly less memory and fewer computations when the number of database elements is large. We further demonstrate the potential of the approach on real large-scale data (over 10^8 training samples) for the task of model-based speaker verification. Finally, we draw some connections between the proposed framework and approximate Hilbert space embedding of probability distributions using random features. We show that the proposed sketching operator can be seen as an innovative method to design translation-invariant kernels adapted to the analysis of GMMs. We also use this theoretical framework to derive information preservation guarantees, in the spirit of infinite-dimensional compressive sensing.
 h/t Ravi
 
Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Monday, December 26, 2016

Thesis: Sketch and Project: Randomized Iterative Methods for Linear Systems and Inverting Matrices - implementation -

Congratulations Dr. Gower !

On top of his thesis, Robert has a few implementations connected to it; here is an excerpt:

InvRand - download, paper

Inverse Random is a suite of randomized methods for inverting positive definite matrices, implemented in MATLAB.

RandomLinearLab - download, paper

Random Linear Lab is a lab for testing and comparing randomized methods for solving linear systems, all implemented in MATLAB.

Here is the thesis, with the arXiv abstract, the real abstract and the lay person's abstract !
Sketch and Project: Randomized Iterative Methods for Linear Systems and Inverting Matrices by Robert M. Gower

Probabilistic ideas and tools have recently begun to permeate into several fields where they had traditionally not played a major role, including fields such as numerical linear algebra and optimization. One of the key ways in which these ideas influence these fields is via the development and analysis of randomized algorithms for solving standard and new problems of these fields. Such methods are typically easier to analyze, and often lead to faster and/or more scalable and versatile methods in practice.
This thesis explores the design and analysis of new randomized iterative methods for solving linear systems and inverting matrices. The methods are based on a novel sketch-and-project framework. By sketching we mean, to start with a difficult problem and then randomly generate a simple problem that contains all the solutions of the original problem. After sketching the problem, we calculate the next iterate by projecting our current iterate onto the solution space of the sketched problem.

 Here is the real abstract of the thesis:
Probabilistic ideas and tools have recently begun to permeate into several fields where they had traditionally not played a major role, including fields such as numerical linear algebra and optimization. One of the key ways in which these ideas influence these fields is via the development and analysis of randomized algorithms for solving standard and new problems of these fields. Such methods are typically easier to analyze, and often lead to faster and/or more scalable and versatile methods in practice.
This thesis explores the design and analysis of new randomized iterative methods for solving linear systems and inverting matrices. The methods are based on a novel sketch-and-project framework. By sketching we mean, to start with a difficult problem and then randomly generate a simple problem that contains all the solutions of the original problem. After sketching the problem, we calculate the next iterate by projecting our current iterate onto the solution space of the sketched problem.
The starting point for this thesis is the development of an archetype randomized method for solving linear systems. Our method has six different but equivalent interpretations: sketch-and-project, constrain-and-approximate, random intersect, random linear solve, random update and random fixed point. By varying its two parameters – a positive definite matrix (defining geometry), and a random matrix (sampled in an i.i.d. fashion in each iteration) – we recover a comprehensive array of well known algorithms as special cases, including the randomized Kaczmarz method, randomized Newton method, randomized coordinate descent method and random Gaussian pursuit. We also naturally obtain variants of all these methods using blocks and importance sampling. However, our method allows for a much wider selection of these two parameters, which leads to a number of new specific methods. We prove exponential convergence of the expected norm of the error in a single theorem, from which existing complexity results for known variants can be obtained. However, we also give an exact formula for the evolution of the expected iterates, which allows us to give lower bounds on the convergence rate.
We then extend our problem to that of finding the projection of a given vector onto the solution space of a linear system. For this we develop a new randomized iterative algorithm: stochastic dual ascent (SDA). The method is dual in nature, and iteratively solves the dual of the projection problem. The dual problem is a non-strongly concave quadratic maximization problem without constraints. In each iteration of SDA, a dual variable is updated by a carefully chosen point in a subspace spanned by the columns of a random matrix drawn independently from a fixed distribution. The distribution plays the role of a parameter of the method. Our complexity results hold for a wide family of distributions of random matrices, which opens the possibility to fine-tune the stochasticity of the method to particular applications. We prove that primal iterates associated with the dual process converge to the projection exponentially fast in expectation, and give a formula and an insightful lower bound for the convergence rate.
We also prove that the same rate applies to dual function values, primal function values and the duality gap. Unlike traditional iterative methods, SDA converges under virtually no additional assumptions on the system (e.g., rank, diagonal dominance) beyond consistency. In fact, our lower bound improves as the rank of the system matrix drops. By mapping our dual algorithm to a primal process, we uncover that the SDA method is the dual method with respect to the sketch-and-project method from the previous chapter. Thus our new more general convergence results for SDA carry over to the sketch-and-project method and all its specializations (randomized Kaczmarz, randomized coordinate descent...etc). When our method specializes to a known algorithm, we either recover the best known rates, or improve upon them. Finally, we show that the framework can be applied to the distributed average consensus problem to obtain an array of new algorithms. The randomized gossip algorithm arises as a special case.
In the final chapter, we extend our method for solving linear systems to inverting matrices, and develop a family of methods with specialized variants that maintain symmetry or positive definiteness of the iterates. All the methods in the family converge globally and exponentially, with explicit rates. In special cases, we obtain stochastic block variants of several quasi-Newton updates, including bad Broyden (BB), good Broyden (GB), Powell-symmetric-Broyden (PSB), Davidon-Fletcher-Powell (DFP) and Broyden-Fletcher-Goldfarb-Shanno (BFGS). Ours are the first stochastic versions of these updates shown to converge to an inverse of a fixed matrix.
Through a dual viewpoint we uncover a fundamental link between quasi-Newton updates and approximate inverse preconditioning. Further, we develop an adaptive variant of the randomized block BFGS (AdaRBFGS), where we modify the distribution underlying the stochasticity of the method throughout the iterative process to achieve faster convergence. By inverting several matrices from varied applications, we demonstrate that AdaRBFGS is highly competitive when compared to the well established Newton-Schulz and approximate preconditioning methods. In particular, on large-scale problems our method outperforms the standard methods by orders of magnitude. The development of efficient methods for estimating the inverse of very large matrices is a much needed tool for preconditioning and variable metric methods in the big data era.
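As a concrete instance of the sketch-and-project framework, the randomized Kaczmarz method mentioned earlier in the abstract corresponds to sketching with a single randomly chosen row at each iteration; here is a bare-bones version (my own illustration, not code from the thesis), with rows sampled proportionally to their squared norms.

```python
import numpy as np

def randomized_kaczmarz(A, b, iters=20_000, seed=0):
    """Sketch-and-project with single-row sketches: at each step, project the
    current iterate onto the solution set of one randomly chosen equation,
    sampling rows with probability proportional to their squared norms."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    probs = (A ** 2).sum(axis=1) / (A ** 2).sum()
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.choice(m, p=probs)
        x += (b[i] - A[i] @ x) / (A[i] @ A[i]) * A[i]
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((500, 100))
x_true = rng.standard_normal(100)
b = A @ x_true                      # consistent linear system
x = randomized_kaczmarz(A, b)
print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

Swapping the single-row sketch for blocks of rows, coordinate sketches, or Gaussian sketches yields the other special cases the abstract lists (block Kaczmarz, randomized coordinate descent, Gaussian pursuit), all within the same projection template.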
 There is even a Lay Summary
Lay Summary
This thesis explores the design and analysis of methods (algorithms) for solving two common problems: solving linear systems of equations and inverting matrices. Many engineering and quantitative tasks require the solution of one of these two problems. In particular, the need to solve linear systems of equations is ubiquitous in essentially all quantitative areas of human endeavour, including industry and science. Specifically, linear systems are a central problem in numerical linear algebra, and play an important role in computer science, mathematical computing, optimization, signal processing, engineering, numerical analysis, computer vision, machine learning, and many other fields. This thesis proposes new methods for solving large dimensional linear systems and inverting large matrices that use tools and ideas from probability.
The advent of large dimensional linear systems of equations, based on big data sets, poses a challenge. On these large linear systems, the traditional methods for solving linear systems can take an exorbitant amount of time. To address this issue we propose a new class of randomized methods that are capable of quickly obtaining approximate solutions. This thesis lays the foundational work of this new class of randomized methods for solving linear systems and inverting matrices. The main contributions are providing a framework to design and analyze new and existing methods for solving linear systems. In particular, our framework unites many existing methods. For inverting matrices we also provide a framework for designing and analysing methods, but moreover, using this framework we design a highly competitive method for computing an approximate inverse of truly large scale positive definite matrices. Our new method often outperforms previously known methods by several orders of magnitude on large scale matrices.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Saturday, December 24, 2016

Saturday Morning Videos: Learning, Algorithm Design and Beyond Worst-Case Analysis workshop, Simons Institute, Berkeley

 
All the videos of the Learning, Algorithm Design and Beyond Worst-Case Analysis workshop at the Simons Institute at Berkeley are available by following the links for each talk.

Learning, Algorithm Design and Beyond Worst-Case Analysis

Nov. 14 – Nov. 18, 2016

  • Click on the titles of individual talks for abstract, slides and archived video (available approximately one week after the conclusion of the workshop).
Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.
