Showing posts with label MappingMLtoHardware. Show all posts

Friday, June 09, 2017

Training Quantized Nets: A Deeper Understanding

Ah, here is some insight, Christoph, Tom et al.!



Currently, deep neural networks are deployed on low-power embedded devices by first training a full-precision model using powerful computing hardware, and then deriving a corresponding low-precision model for efficient inference on such systems. However, training models directly with coarsely quantized weights is a key step towards learning on embedded platforms that have limited computing resources, memory capacity, and power consumption. Numerous recent publications have studied methods for training quantized networks, but these studies have mostly been empirical. In this work, we investigate training methods for quantized neural networks from a theoretical viewpoint. We first explore accuracy guarantees for training methods under convexity assumptions. We then look at the behavior of algorithms for non-convex problems, and we show that training algorithms that exploit high-precision representations have an important annealing property that purely quantized training methods lack, which explains many of the observed empirical differences between these types of algorithms.
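To make that annealing property concrete, here is a minimal NumPy sketch on a toy least-squares problem (my illustration, not the paper's setup): keeping a full-precision weight buffer and quantizing only for the forward pass lets tiny gradients accumulate, whereas rounding the update itself can stall completely.

import numpy as np

rng = np.random.default_rng(0)

def quantize(w, delta=0.1):
    # round to the nearest multiple of delta (a uniform quantizer)
    return delta * np.round(w / delta)

X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10)

lr = 1e-3
w_buf = np.zeros(10)   # high-precision buffer, quantized forward pass
w_q = np.zeros(10)     # purely quantized: weights live on the grid

for _ in range(2000):
    g = X.T @ (X @ quantize(w_buf) - y) / len(y)
    w_buf -= lr * g                      # tiny gradients still accumulate here

    g = X.T @ (X @ w_q - y) / len(y)
    w_q = quantize(w_q - lr * g)         # tiny steps get rounded away

print("quantized-forward loss (buffer):", np.mean((X @ quantize(w_buf) - y) ** 2))
print("purely quantized loss:          ", np.mean((X @ w_q - y) ** 2))

With these step sizes every purely quantized update rounds back to the grid point it started from, which is exactly the stalling behavior the analysis predicts.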






Tuesday, April 18, 2017

MLHardware: Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent / WRPN: Training and Inference using Wide Reduced-Precision Networks

(Personal message: I will be at ICLR next week; let's grab some coffee if you are there.)


As ML becomes more and more important, the hardware architecture on which it runs needs to change as well. These changes in turn depend on a number of trade-offs. Today, we have two such studies: one on quantization issues in neural networks, and another on the influence of low precision on Stochastic Gradient Descent (something we have already seen for gradient descent).

For computer vision applications, prior works have shown the efficacy of reducing the numeric precision of model parameters (network weights) in deep neural networks but also that reducing the precision of activations hurts model accuracy much more than reducing the precision of model parameters. We study schemes to train networks from scratch using reduced-precision activations without hurting the model accuracy. We reduce the precision of activation maps (along with model parameters) using a novel quantization scheme and increase the number of filter maps in a layer, and find that this scheme compensates or surpasses the accuracy of the baseline full-precision network. As a result, one can significantly reduce the dynamic memory footprint, memory bandwidth, computational energy and speed up the training and inference process with appropriate hardware support. We call our scheme WRPN - wide reduced-precision networks. We report results using our proposed schemes and show that our results are better than previously reported accuracies on ILSVRC-12 dataset while being computationally less expensive compared to previously reported reduced-precision networks.
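The two ingredients of WRPN are easy to sketch: quantize the activations to a few bits and compensate by widening each layer. Here is a toy NumPy version; the exact quantizer and widening factor are my assumptions, not necessarily the paper's:

import numpy as np

rng = np.random.default_rng(0)

def quantize_activations(x, k):
    # k-bit uniform quantization on [0, 1]; WRPN's exact scheme may differ
    levels = 2.0 ** k - 1.0
    return np.round(np.clip(x, 0.0, 1.0) * levels) / levels

base_filters, widen, k_act = 64, 2, 4

x = rng.random((8, 32))                                # toy input batch
W = rng.standard_normal((32, widen * base_filters))    # widened layer

h = np.maximum(x @ W, 0.0)                             # ReLU activations
h_q = quantize_activations(h / (h.max() + 1e-8), k_act)
print(h_q.shape, "->", np.unique(h_q).size, "distinct activation levels")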


Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called BUCKWILD! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11×. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems.
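As a rough illustration of one point in that design space, here is a toy low-precision SGD loop with 8-bit saturating weights and stochastic (unbiased) rounding; the fixed-point format and step sizes are mine, not the paper's:

import numpy as np

rng = np.random.default_rng(0)
SCALE = 64                     # fixed point: real value = int8 / SCALE

def stochastic_round(x):
    # unbiased rounding: round up with probability equal to the fraction
    f = np.floor(x)
    return f + (rng.random(x.shape) < (x - f))

# toy regression with 8-bit weights and saturating arithmetic
X = rng.standard_normal((256, 8))
y = X @ rng.standard_normal(8)
w = np.zeros(8, dtype=np.int8)

lr = 0.05
for _ in range(500):
    w_real = w.astype(np.float64) / SCALE
    g = X.T @ (X @ w_real - y) / len(y)
    step = stochastic_round(-lr * g * SCALE)
    w = np.clip(w.astype(np.int32) + step, -128, 127).astype(np.int8)

print("loss:", np.mean((X @ (w / SCALE) - y) ** 2))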




Thursday, April 13, 2017

In-Datacenter Performance Analysis of a Tensor Processing Unit™

Norm Jouppi mentioned it in a Google blog entry a few days ago. He and his team have released some information about the TPU. This is very interesting because of how algorithms are changing hardware. Sure enough, we know that NVIDIA has shifted its architecture to follow the successful development of deep neural networks (and continues to do so), but when Google announced the TPU last year, one could only wonder what they would be doing better than a chip maker. One analyst wrote about this question in "Does Google's TPU Investment Make Sense Going Forward?", but I believe he misses the point entirely.

That point was made abundantly clear in a Wired article:
It’s not used to train the neural network beforehand. But as Jouppi explains, even that still saves the company quite a bit. It didn’t have to build, say, an extra 15 data centers.
The reason Google did not wait for NVIDIA's architecture to change or for Moore's law to kick in (that is, using OPM to do the work) is mostly that this hardware technology effort spares them more than a few bucks. It all comes down to the fact that we are all collectively not fast enough to make sense of data.

Here is Norm et al.'s paper: In-Datacenter Performance Analysis of a Tensor Processing Unit™ by Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC—called a Tensor Processing Unit (TPU)—deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU’s deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters’ NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU’s GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
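The arithmetic heart of the chip, 8-bit multiplies feeding wider accumulators, is easy to emulate. Here is a small NumPy sketch of int8 matrix multiplication with int32 accumulation; the symmetric quantizer is illustrative, not the TPU's exact scheme:

import numpy as np

def quantize_int8(x):
    # symmetric quantization to int8 (an illustrative scheme)
    scale = np.max(np.abs(x)) / 127.0
    return np.round(x / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 64)).astype(np.float32)   # activations
W = rng.standard_normal((64, 3)).astype(np.float32)   # weights

Aq, sa = quantize_int8(A)
Wq, sw = quantize_int8(W)

# 8-bit multiplies, 32-bit accumulation, as in a systolic MAC array
acc = Aq.astype(np.int32) @ Wq.astype(np.int32)
approx = acc.astype(np.float32) * sa * sw

print("max abs error vs float32 matmul:", np.max(np.abs(approx - A @ W)))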








Tuesday, April 04, 2017

It's all about convolutions: YodaNN, TrueNorth computing, FPGAs and a comparison between HOG and CNNs on Hardware

After yesterday's Compressive Sensing Hardware, let us look at the recent hardware needed to speed up Deep Learning algorithms, where convolutions are a bottleneck. The last paper is also a welcome addition for people wondering about the link between architectures based on DL and those based on computer vision features. Enjoy!




YodaNN: An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration by Renzo Andri, Lukas Cavigelli, Davide Rossi, Luca Benini


Convolutional neural networks (CNNs) have revolutionized the world of computer vision over the last few years, pushing image classification beyond human accuracy. The computational effort of today's CNNs requires power-hungry parallel processors or GP-GPUs. Recent developments in CNN accelerators for system-on-chip integration have reduced energy consumption significantly. Unfortunately, even these highly optimized devices are above the power envelope imposed by mobile and deeply embedded applications and face hard limitations caused by CNN weight I/O and storage. This prevents the adoption of CNNs in future ultra-low power Internet of Things end-nodes for near-sensor analytics. Recent algorithmic and theoretical advancements enable competitive classification accuracy even when limiting CNNs to binary (+1/-1) weights during training. These new findings bring major optimization opportunities in the arithmetic core by removing the need for expensive multiplications, as well as reducing I/O bandwidth and storage. In this work, we present an accelerator optimized for binary-weight CNNs that achieves 1510 GOp/s at 1.2 V on a core area of only 1.33 MGE (Million Gate Equivalent) or 0.19 mm² and with a power dissipation of 895 µW in UMC 65 nm technology at 0.6 V. Our accelerator significantly outperforms the state-of-the-art in terms of energy and area efficiency, achieving 61.2 TOp/s/W @ 0.6 V and 1135 GOp/s/MGE @ 1.2 V, respectively.
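The reason binary weights remove multiplications is worth seeing in code: with weights in {-1, +1}, a convolution collapses to signed accumulation plus one per-filter scale. A toy 1-D sketch (the XNOR-Net-style scaling is my choice of illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(32)           # input signal (full precision)
w = rng.standard_normal(5)            # real-valued filter to binarize

alpha = np.mean(np.abs(w))            # per-filter scale (XNOR-Net style)
b = np.sign(w)                        # binary weights in {-1, +1}

# convolution without multiplications: just signed accumulation
out = np.empty(len(x) - len(b) + 1)
for i in range(len(out)):
    window = x[i:i + len(b)]
    out[i] = alpha * (window[b > 0].sum() - window[b < 0].sum())

# matches a multiply-based correlation with the scaled binary filter
ref = np.convolve(x, (alpha * b)[::-1], mode="valid")
print("matches multiply-based binary conv:", np.allclose(out, ref))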




Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing by Steven K. Esser et al.

Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on spiking neurons, low precision synapses, and a scalable communication network. Here, we demonstrate that neuromorphic computing, despite its novel architectural primitives, can implement deep convolution networks that (i) approach state-of-the-art classification accuracy across eight standard datasets encompassing vision and speech, (ii) perform inference while preserving the hardware’s underlying energy-efficiency and high throughput, running on the aforementioned datasets at between 1,200 and 2,600 frames/s and using between 25 and 275 mW (effectively superior to 6,000 frames/s per Watt), and (iii) can be specified and trained using backpropagation with the same ease-of-use as contemporary deep learning. This approach allows the algorithmic power of deep learning to be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer.


FPGA-based hardware accelerators for convolutional neural networks (CNNs) have obtained great attentions due to their higher energy efficiency than GPUs. However, it is challenging for FPGA-based solutions to achieve a higher throughput than GPU counterparts. In this paper, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy efficiency when a CNN is trained with binary constraints on weights and activations. Specifically, we propose an optimized accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipelines stages. Experiment results show that the proposed architecture is 8.3x faster and 75x more energy-efficient than a Titan X GPU for processing online individual requests (in small batch size). For processing static data (in large batch size), the proposed solution is on a par with a Titan X GPU in terms of throughput while delivering 9.5x higher energy efficiency.
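When weights and activations are both binary, the inner product itself reduces to XNOR plus a population count, which is exactly what maps so well onto FPGA fabric. A bit-level sketch (the encoding and word size are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n = 64
a_bits = rng.integers(0, 2, n)        # activation bits: 1 encodes +1, 0 encodes -1
w_bits = rng.integers(0, 2, n)        # weight bits, same encoding

# pack into machine words and use XNOR + popcount
a_word = int("".join(map(str, a_bits)), 2)
w_word = int("".join(map(str, w_bits)), 2)
xnor = ~(a_word ^ w_word) & ((1 << n) - 1)
dot_fast = 2 * bin(xnor).count("1") - n   # agreements minus disagreements

# reference dot product in +/-1 arithmetic
a = 2 * a_bits - 1
w = 2 * w_bits - 1
print(dot_fast, int(a @ w))               # the two should match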

Towards Closing the Energy Gap Between HOG and CNN Features for Embedded Vision by Amr Suleiman, Yu-Hsin Chen, Joel Emer, Vivienne Sze

Computer vision enables a wide range of applications in robotics/drones, self-driving cars, smart Internet of Things, and portable/wearable electronics. For many of these applications, local embedded processing is preferred due to privacy and/or latency concerns. Accordingly, energy-efficient embedded vision hardware delivering real-time and robust performance is crucial. While deep learning is gaining popularity in several computer vision algorithms, a significant energy consumption difference exists compared to traditional hand-crafted approaches. In this paper, we provide an in-depth analysis of the computation, energy and accuracy trade-offs between learned features such as deep Convolutional Neural Networks (CNN) and hand-crafted features such as Histogram of Oriented Gradients (HOG). This analysis is supported by measurements from two chips that implement these algorithms. Our goal is to understand the source of the energy discrepancy between the two approaches and to provide insight about the potential areas where CNNs can be improved and eventually approach the energy-efficiency of HOG while maintaining its outstanding performance accuracy.

 

Tuesday, January 10, 2017

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference / Scaling Binarized Neural Networks on Reconfigurable Logic

Michaela just provided me with some of the latest results on optimizing Binarized Neural Networks on FPGAs. This is quite interesting and impressive.





Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 {\mu}s latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 {\mu}s latency on the CIFAR-10 and SVHN datasets with respectively 80.1% and 94.9% accuracy. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks.


Scaling Binarized Neural Networks on Reconfigurable Logic by Nicholas J. Fraser, Yaman Umuroglu, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre and Kees Vissers

Binarized neural networks (BNNs) are gaining interest in the deep learning community due to their significantly lower computational and memory cost. They are particularly well suited to reconfigurable logic devices, which contain an abundance of fine-grained compute resources and can result in smaller, lower power implementations, or conversely in higher classification rates. Towards this end, the FINN framework was recently proposed for building fast and flexible field-programmable gate array (FPGA) accelerators for BNNs. FINN utilized a novel set of optimizations that enable efficient mapping of BNNs to hardware and implemented fully connected, non-padded convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. However, FINN was not evaluated on larger topologies due to the size of the chosen FPGA, and exhibited decreased accuracy due to lack of padding. In this paper, we improve upon FINN to show how padding can be employed on BNNs while still maintaining a 1-bit datapath and high accuracy. Based on this technique, we demonstrate numerous experiments to illustrate flexibility and scalability of the approach. In particular, we show that a large BNN requiring 1.2 billion operations per frame running on an ADM-PCIE-8K5 platform can classify images at 12 kFPS with 671 µs latency while drawing less than 41 W board power and classifying CIFAR-10 images at 88.7% accuracy. Our implementation of this network achieves 14.8 trillion operations per second. We believe this is the fastest classification rate reported to date on this benchmark at this level of accuracy.






Tuesday, December 20, 2016

Efficient Methods for Deep Neural Networks

Here are some of the papers that were presented as posters or orally at the 1st International Workshop on Efficient Methods for Deep Neural Networks at NIPS 2016. Most of them address how deep learning algorithms can be optimized to fit on silicon architectures.



Efficient Stochastic Inference of Bitwise Deep Neural Networks
Sebastian Vogel, Christoph Schorn, Andre Guntoro, Gerd Ascheid
Recently published methods enable training of bitwise neural networks which allow reduced representation of down to a single bit per weight. We present a method that exploits ensemble decisions based on multiple stochastically sampled network models to increase performance figures of bitwise neural networks in terms of classification accuracy at inference. Our experiments with the CIFAR-10 and GTSRB datasets show that the performance of such network ensembles surpasses the performance of the high-precision base model. With this technique we achieve 5.81% best classification error on CIFAR-10 test set using bitwise networks. Concerning inference on embedded systems we evaluate these bitwise networks using a hardware efficient stochastic rounding procedure. Our work contributes to efficient embedded bitwise neural networks.
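The ensemble idea is straightforward to sketch: sample several bitwise models by stochastically rounding the same high-precision weights, then average their predictions. A toy logistic-model version; the hard-sigmoid binarization probability is a common convention, and the details here are my assumptions:

import numpy as np

rng = np.random.default_rng(0)

def stochastic_binarize(w):
    # sample w_b in {-1, +1} with E[w_b] = clip(w, -1, 1)
    p = np.clip((w + 1.0) / 2.0, 0.0, 1.0)
    return np.where(rng.random(w.shape) < p, 1.0, -1.0)

w = rng.uniform(-1, 1, 16)               # trained high-precision weights
x = rng.standard_normal((100, 16))       # toy inputs

def predict(weights):
    return 1.0 / (1.0 + np.exp(-(x @ weights)))

# ensemble of stochastically sampled bitwise networks
samples = [predict(stochastic_binarize(w)) for _ in range(32)]
p_ensemble = np.mean(samples, axis=0)

p_det = predict(np.sign(w))              # single deterministic bitwise model
print("mean |ensemble - deterministic|:", np.mean(np.abs(p_ensemble - p_det)))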


PVANet: Lightweight Deep Neural Networks for Real-time Object Detection
Sanghoon Hong, Byungseok Roh, Kye-Hyeon Kim, Yeongjae Cheon, Minje Park
In object detection, reducing computational cost is as important as improving accuracy for most practical usages. This paper proposes a novel network structure, which is an order of magnitude lighter than other state-of-the-art networks while maintaining the accuracy. Based on the basic principle of more layers with less channels, this new deep neural network minimizes its redundancy by adopting recent innovations including C.ReLU and Inception structure. We also show that this network can be trained efficiently to achieve solid results on well-known object detection benchmarks: 84.9% and 84.2% mAP on VOC2007 and VOC2012 while the required compute is less than 10% of the recent ResNet-101.
 Code and models are at: https://github.com/sanghoon/pva-faster-rcnn


ESE: Efficient Speech Recognition Engine with Compressed LSTM on FPGA
Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, William J. Dally
Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built larger and larger models. Such large model is both computation intensive and memory intensive. Deploying such bulky model results in high power consumption and leads to high total cost of ownership (TCO) of a data center. In order to speedup the prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of the prediction accuracy. The pruned model is friendly for parallel processing. Next, we propose scheduler that encodes and partitions the compressed model to each PE for parallelism, and schedule the complicated LSTM data flow. Finally, we design the hardware architecture, named Efficient Speech Recognition Engine (ESE) that works directly on the compressed model. Implemented on Xilinx XCKU060 FPGA running at 200MHz, ESE has a performance of 282 GOPS working directly on the compressed LSTM network, corresponding to 2.52 TOPS on the uncompressed one, and processes a full LSTM for speech recognition with a power dissipation of 41 Watts. Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations. It achieves 40x and 11.5x higher energy efficiency compared with the CPU and GPU respectively.
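The load-balancing idea behind the pruning step can be sketched in a few lines: prune within equal-size weight groups, one per processing element, so every PE ends up with the same number of nonzeros (toy magnitudes and a hypothetical PE count):

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal(1024)           # flattened LSTM weights (stand-in)
n_pe, keep = 8, 0.1                     # 8 PEs, keep 10% of weights

groups = W.reshape(n_pe, -1)            # one group per processing element
k = int(keep * groups.shape[1])
mask = np.zeros_like(groups, dtype=bool)
for g in range(n_pe):                   # prune per group rather than globally
    idx = np.argsort(-np.abs(groups[g]))[:k]
    mask[g, idx] = True                 # every PE gets exactly k nonzeros

print("nonzeros per PE:", mask.sum(axis=1))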
 
Compacting Neural Network Classifiers via Dropout Training
Yotaro Kubo, George Tucker, Simon Wiesler
We introduce dropout compaction, a novel method for training feed-forward neural networks which realizes the performance gains of training a large model with dropout regularization, yet extracts a compact neural network for run-time efficiency. In the proposed method, we introduce a sparsity-inducing prior on the per unit dropout retention probability so that the optimizer can effectively prune hidden units during training. By changing the prior hyperparameters, we can control the size of the resulting network. We performed a systematic comparison of dropout compaction and competing methods on several real-world speech recognition tasks and found that dropout compaction achieved comparable accuracy with fewer than 50% of the hidden units, translating to a 2.5x speedup in run-time.

Efficient Convolutional Neural Network with Binary Quantization Layer
Mahdyar Ravanbakhsh, Hossein Mousavi, Moin Nabi, Lucio Marcenaro, Carlo Regazzoni

In this paper we introduce a novel method for segmentation that can benefit from general semantics of Convolutional Neural Network (CNN). Our segmentation proposes visually and semantically coherent image segments. We use binary encoding of CNN features to overcome the difficulty of the clustering on the high-dimensional CNN feature space. These binary encoding can be embedded into the CNN as an extra layer at the end of the network. This results in real-time segmentation. To the best of our knowledge our method is the first attempt on general semantic image segmentation using CNN. All the previous papers were limited to few number of category of the images (e.g. PASCAL VOC). Experiments show that our segmentation algorithm outperform the state-of-the-art non-semantic segmentation methods by a large margin.



Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, Jan Kautz
(Submitted on 19 Nov 2016)
We propose a new framework for pruning convolutional kernels in neural networks to enable efficient inference, focusing on transfer learning where large and potentially unwieldy pretrained networks are adapted to specialized tasks. We interleave greedy criteria-based pruning with fine-tuning by backpropagation - a computationally efficient procedure that maintains good generalization in the pruned network. We propose a new criterion based on an efficient first-order Taylor expansion to approximate the absolute change in training cost induced by pruning a network component. After normalization, the proposed criterion scales appropriately across all layers of a deep CNN, eliminating the need for per-layer sensitivity analysis. The proposed criterion demonstrates superior performance compared to other criteria, such as the norm of kernel weights or average feature map activation.
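The criterion itself fits in a few lines: the absolute value of the spatial mean of activation times gradient, averaged over the batch and L2-normalized across the layer. A NumPy sketch with made-up shapes:

import numpy as np

def taylor_criterion(activations, gradients):
    # |spatial mean of a * dC/da| per example, then averaged over the batch
    per_example = np.abs((activations * gradients).mean(axis=(2, 3)))
    scores = per_example.mean(axis=0)
    # L2 normalization across the layer, avoiding per-layer sensitivity tuning
    return scores / (np.linalg.norm(scores) + 1e-12)

rng = np.random.default_rng(0)
act = rng.standard_normal((8, 32, 14, 14))    # (batch, channels, h, w)
grad = rng.standard_normal((8, 32, 14, 14))   # dCost/dActivation from backprop
scores = taylor_criterion(act, grad)
print("prune first:", np.argsort(scores)[:4])  # least expected change in cost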

Quantized neural network design under weight capacity constraint
Sungho Shin, Kyuyeon Hwang, Wonyong Sung

The complexity of deep neural network algorithms for hardware implementation can be lowered either by scaling the number of units or reducing the word-length of weights. Both approaches, however, can accompany the performance degradation although many types of research are conducted to relieve this problem. Thus, it is an important question which one, between the network size scaling and the weight quantization, is more effective for hardware optimization. For this study, the performances of fully-connected deep neural networks (FCDNNs) and convolutional neural networks (CNNs) are evaluated while changing the network complexity and the word-length of weights. Based on these experiments, we present the effective compression ratio (ECR) to guide the trade-off between the network size and the precision of weights when the hardware resource is limited.

Efficient Convolutional Auto-Encoding via Random Convexification and Frequency-Domain Minimization
Meshia Cédric Oveneke, Mitchel Aliosha-Perez, Yong Zhao, Dongmei Jiang, Hichem Sahli

The omnipresence of deep learning architectures such as deep convolutional neural networks (CNN)s is fueled by the synergistic combination of ever-increasing labeled datasets and specialized hardware. Despite the indisputable success, the reliance on huge amounts of labeled data and specialized hardware can be a limiting factor when approaching new applications. To help alleviating these limitations, we propose an efficient learning strategy for layer-wise unsupervised training of deep CNNs on conventional hardware in acceptable time. Our proposed strategy consists of randomly convexifying the reconstruction contractive auto-encoding (RCAE) learning objective and solving the resulting large-scale convex minimization problem in the frequency domain via coordinate descent (CD). The main advantages of our proposed learning strategy are: (1) single tunable optimization parameter; (2) fast and guaranteed convergence; (3) possibilities for full parallelization. Numerical experiments show that our proposed learning strategy scales (in the worst case) linearly with image size, number of filters and filter size.

Parallelizing Word2Vec in Multi-Core and Many-Core Architectures
Shihao Ji, Nadathur Satish, Sheng Li, Pradeep Dubey
Word2vec is a widely used algorithm for extracting low-dimensional vector representations of words. State-of-the-art algorithms including those by Mikolov et al. have been parallelized for multi-core CPU architectures, but are based on vector-vector operations with "Hogwild" updates that are memory-bandwidth intensive and do not efficiently use computational resources. In this paper, we propose "HogBatch" by improving reuse of various data structures in the algorithm through the use of minibatching and negative sample sharing, hence allowing us to express the problem using matrix multiply operations. We also explore different techniques to distribute word2vec computation across nodes in a compute cluster, and demonstrate good strong scalability up to 32 nodes. The new algorithm is particularly suitable for modern multi-core/many-core architectures, especially Intel's latest Knights Landing processors, and allows us to scale up the computation near linearly across cores and nodes, and process hundreds of millions of words per second, which is the fastest word2vec implementation to the best of our knowledge.
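The core trick, sharing negative samples so that many vector-vector products collapse into one matrix multiply, can be shown directly (toy vocabulary and dimensions; a full HogBatch also minibatches over many center words):

import numpy as np

rng = np.random.default_rng(0)
vocab, dim, n_ctx, n_neg = 1000, 32, 8, 16
W_in = rng.standard_normal((vocab, dim)) * 0.1    # input (context) vectors
W_out = rng.standard_normal((vocab, dim)) * 0.1   # output vectors

center = 42
ctx_ids = rng.integers(0, vocab, n_ctx)
neg_ids = rng.integers(0, vocab, n_neg)   # shared across the whole minibatch

# instead of n_ctx * (1 + n_neg) vector-vector dot products ("Hogwild" style),
# one matrix multiply covers every (context, sample) pair
inputs = W_in[ctx_ids]                                  # (n_ctx, dim)
outputs = W_out[np.concatenate(([center], neg_ids))]    # (1 + n_neg, dim)
scores = inputs @ outputs.T                             # one GEMM
print(scores.shape)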
 

Saturday, December 17, 2016

Saturday Morning Video: Deep Compression, DSD Training and EIE: Deep Neural Network Model Compression, Regularization and Hardware Acceleration by Song Han

Here are some videos of Song Han on the topic of Mapping Deep Learning to Hardware:

  

Deep Compression, DSD Training and EIE: Deep Neural Network Model Compression, Regularization and Hardware Acceleration 

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on mobile phones and embedded systems with limited hardware resources. To address this limitation, this talk first introduces “Deep Compression” that can compress the deep neural networks by 10x-49x without loss of prediction accuracy[1][2][5]. Then this talk will describe DSD, the "Dense-Sparse-Dense" training method that regularizes CNN/RNN/LSTMs to improve the prediction accuracy of a wide range of neural networks given the same model size[3]. Finally this talk will discuss EIE, the "Efficient Inference Engine" that works directly on the deep-compressed DNN model and accelerates the inference, taking advantage of weight sparsity, activation sparsity and weight sharing, which is 13x faster and 3000x more energy efficient than a TitanX GPU[4]. References: [1] Han et al. Learning both Weights and Connections for Efficient Neural Networks (NIPS'15) [2] Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding (ICLR'16, best paper award) [3] Han et al. DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training (submitted to NIPS'16) [4] Han et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network (ISCA’16) [5] Iandola, Han et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and less than 0.5MB model size (submitted to ECCV'16)
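As a pocket illustration of the first two stages of Deep Compression (magnitude pruning, then weight sharing via a tiny 1-D k-means), here is a NumPy sketch; the threshold and cluster count are mine, and the Huffman coding stage is omitted:

import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)

# stage 1: magnitude pruning
mask = np.abs(w) > 0.5                 # hypothetical threshold
w_pruned = w * mask

# stage 2: trained quantization via k-means weight sharing
k = 16
centers = np.linspace(w_pruned[mask].min(), w_pruned[mask].max(), k)
for _ in range(20):
    assign = np.argmin(np.abs(w_pruned[mask, None] - centers[None, :]), axis=1)
    for c in range(k):
        sel = assign == c
        if sel.any():
            centers[c] = w_pruned[mask][sel].mean()

w_shared = np.zeros_like(w)
w_shared[mask] = centers[assign]       # each surviving weight stores a 4-bit index
print("nonzeros:", mask.sum(), "distinct values:", np.unique(w_shared[mask]).size)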
Here are two earlier presentations on the same topic:

and the attendant slides:
 



EIE: Efficient Inference Engine on Compressed Deep Neural Network



Monday, November 07, 2016

ICLR 2017: Lighter Networks




So the ICLR 2017 conference has garnered about 500 submissions, which are now in the open review process. Here are a few within the generic theme of "How do we change the architecture of Deep Learning models so that they can better fit other metrics such as lighter architectures?", or more succinctly, Mapping ML to Hardware. I went through the submissions and looked at the titles and some abstracts, so I am surely missing a few (kind feedback is welcome). Anyway, tonight will be a reading Nuit Blanche for sure.


Image credit: NASA/JPL-Caltech/Space Science Institute. Raw Cassini image W00101892.jpg of Saturn, taken on 2016-11-04 17:39 (UTC) and received on Earth on 2016-11-06 00:39 (UTC) using the MT3 and CL2 filters; the image has not yet been validated or calibrated.




Friday, November 04, 2016

Ternary Weight Decomposition and Binary Activation Encoding for Fast and Compact Neural Network / Sparsely Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural Networks

All the ICLR 2017 submissions are under open review, here are two papers submitted to the conference and related to the generic theme of mapping ML algorithms to Hardware: 
 
 

Ternary Weight Decomposition and Binary Activation Encoding for Fast and Compact Neural Network by Mitsuru Ambai, Takuya Matsumoto, Takayoshi Yamashita and Hironobu Fujiyoshi

This paper aims to reduce test-time computational load of a deep neural network. Unlike previous methods which factorize a weight matrix into multiple real-valued matrices, our method factorizes both weights and activations into integer and non-integer components. In our method, the real-valued weight matrix is approximated by a multiplication of a ternary matrix and a real-valued coefficient matrix. Since the ternary matrix consists of three integer values, {-1, 0, 1}, it only consumes 2 bits per element. At test-time, an activation vector that passed from a previous layer is also transformed into a weighted sum of binary vectors, {-1, 1}, which enables fast feed-forward propagation based on simple logical operations: AND, XOR, and bit count. This makes it easier to deploy a deep network on low-power CPUs or to design specialized hardware.
In our experiments, we tested our method on three different networks: a CNN for handwritten digits, the VGG-16 model for ImageNet classification, and VGG-Face for large-scale face recognition. In particular, when we applied our method to the three fully connected layers in VGG-16, a 15x acceleration and memory compression up to 5.2% were achieved with only a 1.43% increase in the top-5 error. Our experiments also revealed that compressing convolutional layers can accelerate inference of the entire network in exchange for a slight increase in error.
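A toy version of the weight factorization W ≈ C·T with ternary T can be built by alternating least squares for C with an exhaustive ternary update of each column of T (feasible here only because the inner dimension is tiny; the paper's actual optimization is more refined):

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
m, n, k = 8, 16, 4
W = rng.standard_normal((m, n))

T = rng.integers(-1, 2, (k, n)).astype(float)          # ternary factor
candidates = np.array(list(product([-1.0, 0.0, 1.0], repeat=k))).T  # (k, 3^k)

for _ in range(20):
    C = np.linalg.lstsq(T.T, W.T, rcond=None)[0].T     # fix T, solve for C
    approx_all = C @ candidates                        # (m, 3^k)
    for j in range(n):                                 # fix C, best ternary column
        errs = ((approx_all - W[:, [j]]) ** 2).sum(axis=0)
        T[:, j] = candidates[:, np.argmin(errs)]

C = np.linalg.lstsq(T.T, W.T, rcond=None)[0].T
print("relative error:", np.linalg.norm(W - C @ T) / np.linalg.norm(W))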

Sparsely Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural Networks by Arash Ardakani, Carlo Condo and Warren J. Gross

Recently deep neural networks have received considerable attention due to their ability to extract and represent high-level abstractions in data sets. Deep neural networks such as fully-connected and convolutional neural networks have shown excellent performance on a wide range of recognition and classification tasks. However, their hardware implementations currently suffer from large silicon area and high power consumption due to their high degree of complexity. The power/energy consumption of neural networks is dominated by memory accesses, the majority of which occur in fully-connected networks. In fact, they contain most of the deep neural network parameters. In this paper, we propose sparsely-connected networks, by showing that the number of connections in fully-connected networks can be reduced by up to 90% while improving the accuracy performance on three popular datasets (MNIST, CIFAR10 and SVHN). We then propose an efficient hardware architecture based on linear-feedback shift registers to reduce the memory requirements of the proposed sparsely-connected networks. The proposed architecture can save up to 90% of memory compared to the conventional implementations of fully-connected neural networks. Moreover, implementation results show up to 84% reduction in the energy consumption of a single neuron of the proposed sparsely-connected networks compared to a single neuron of fully-connected neural networks.
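The LFSR trick is what makes the sparsity essentially free: the pseudo-random connection mask is regenerated on the fly from a seed instead of being stored. A small sketch (the tap positions and the keep rule are illustrative):

import numpy as np

def lfsr_bits(seed, taps=(16, 14, 13, 11)):
    # 16-bit Fibonacci LFSR; these taps are one standard maximal-length choice
    state = seed & 0xFFFF
    while True:
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & 0xFFFF
        yield fb

def mask_from_lfsr(seed, shape, bits_per_connection=3):
    gen = lfsr_bits(seed)
    m = np.empty(shape, dtype=bool)
    for i in range(shape[0]):
        for j in range(shape[1]):
            bits = [next(gen) for _ in range(bits_per_connection)]
            m[i, j] = not any(bits)     # keep roughly 1/8 of the connections
    return m

mask = mask_from_lfsr(0xACE1, (32, 64))
print("connection density:", mask.mean())
# nothing is stored: the same seed regenerates the mask at inference time
assert np.array_equal(mask, mask_from_lfsr(0xACE1, (32, 64)))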


Sunday, October 23, 2016

Sunday Morning Insight: We're the Barbarians



In a recent blog entry (Predicting the Future: The Steamrollers and Machine Learning), I pointed out the current limits on the use of silicon for computing. Even though the predictions show the substantial impact of computing on power generation, there is only a scattered set of initiatives or technology developments looking into this issue.

This was reinforced when we, at LightOn, recently filled out a form to join Optics Valley, a non-profit group representing the interests of the optics industry here in France. Many of our answers fell into the "Other" category. That feeling was very much reinforced last night when I watched the IEEE Rebooting Computing video that features a set of initiatives aiming at solving this exact problem. But if you watch the short video, you'll probably notice that our technology also falls into the "Other" category.


 
Rome, errr... Silicon Valley needs a solution, and we're the Barbarians...

 

Wednesday, October 19, 2016

Random Projections for Scaling Machine Learning in Hardware

Continuing our Mapping ML to Hardware series, here is a way to produce random projections that differs from the way we do it at LightOn.
Random projections have recently emerged as a powerful technique for large scale dimensionality reduction in machine learning applications. Crucially, the randomness can be extracted from sparse probability distributions, enabling hardware implementations with little overhead. In this paper, we describe a Field-Programmable Gate Array (FPGA) implementation alongside a Kernel Adaptive Filter (KAF) that is capable of reducing computational resources by introducing a controlled error term, achieving higher modelling capacity for given hardware resources. Empirical results involving classification, regression and novelty detection show that a 40% net increase in available resources and improvements in prediction accuracy is achievable for projections which halve the input vector length, enabling us to scale-up hardware implementations of KAF learning algorithms by at least a factor of 2. Execution time of our random projection core is shown to be an order of magnitude lower than a single core central processing unit (CPU) and the system-level implementation on a FPGA-based network card achieves a 29x speedup over the CPU. 
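For readers who want the flavor of such hardware-friendly projections, here is a sketch of an Achlioptas-style sparse random matrix with entries in {-1, 0, +1}, halving the input length while approximately preserving distances (the parameters are mine):

import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 256                       # halve the input vector length
s = 3                                 # sparsity: only ~1/s entries nonzero

# +/- sqrt(s/k) each with probability 1/(2s), zero otherwise
R = rng.choice([1.0, 0.0, -1.0], size=(k, d),
               p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)]) * np.sqrt(s / k)

x = rng.standard_normal((100, d))
y = x @ R.T                           # projected data, e.g. for a kernel filter

# distances are approximately preserved (Johnson-Lindenstrauss)
orig = np.linalg.norm(x[0] - x[1])
proj = np.linalg.norm(y[0] - y[1])
print(f"original distance {orig:.2f}, projected {proj:.2f}")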




Tuesday, August 30, 2016

Functional Hashing for Compressing Neural Networks

Among the different ways of mapping Machine Learning to hardware (especially on mobile platforms), here is a new approach for compressing redundant deep learning architectures.


Functional Hashing for Compressing Neural Networks by Lei Shi, Shikun Feng, Zhifan Zhu

As the complexity of deep neural networks (DNNs) trend to grow to absorb the increasing sizes of data, memory and energy consumption has been receiving more and more attentions for industrial applications, especially on mobile devices. This paper presents a novel structure based on functional hashing to compress DNNs, namely FunHashNN. For each entry in a deep net, FunHashNN uses multiple low-cost hash functions to fetch values in the compression space, and then employs a small reconstruction network to recover that entry. The reconstruction network is plugged into the whole network and trained jointly. FunHashNN includes the recently proposed HashedNets as a degenerated case, and benefits from larger value capacity and less reconstruction loss. We further discuss extensions with dual space hashing and multi-hops. On several benchmark datasets, FunHashNN demonstrates high compression ratios with little loss on prediction accuracy.
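The mechanism FunHashNN builds on is easy to sketch: each virtual weight is fetched from a small shared array through low-cost hashes of its indices, and FunHashNN feeds several such fetches to a tiny trained reconstruction network. In this toy version the reconstruction net is replaced by a plain average:

import numpy as np

rng = np.random.default_rng(0)
shared = rng.standard_normal(64)             # small compression space

def hashed_weight(layer, i, j, k_hashes=3):
    # fetch several values via low-cost hashes; FunHashNN would feed these
    # to a small trained reconstruction net, here just averaged
    vals = [shared[hash((layer, i, j, h)) % len(shared)]
            for h in range(k_hashes)]
    return np.mean(vals)

# materialize a 'virtual' 16x32 weight matrix from the 64 shared values
W = np.array([[hashed_weight(0, i, j) for j in range(32)] for i in range(16)])
print(W.shape, "virtual weights from", len(shared), "stored values")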

and earlier from a different group:


Compressing Convolutional Neural Networks by Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, Yixin Chen

Convolutional neural networks (CNN) are increasingly used in many areas of computer vision. They are particularly attractive because of their ability to "absorb" great quantities of labeled data through millions of parameters. However, as model sizes increase, so do the storage and memory requirements of the classifiers. We present a novel network architecture, Frequency-Sensitive Hashed Nets (FreshNets), which exploits inherent redundancy in both convolutional layers and fully-connected layers of a deep learning model, leading to dramatic savings in memory and storage consumption. Based on the key observation that the weights of learned convolutional filters are typically smooth and low-frequency, we first convert filter weights to the frequency domain with a discrete cosine transform (DCT) and use a low-cost hash function to randomly group frequency parameters into hash buckets. All parameters assigned the same hash bucket share a single value learned with standard back-propagation. To further reduce model size we allocate fewer hash buckets to high-frequency components, which are generally less important. We evaluate FreshNets on eight data sets, and show that it leads to drastically better compressed performance than several relevant baselines.
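A toy version of the FreshNets pipeline (DCT of the filter, hash of each frequency coordinate into a shared parameter pool, inverse DCT at run time) might look like this; I allocate buckets uniformly, whereas the paper gives fewer buckets to high-frequency components:

import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
shared = rng.standard_normal(16)            # shared, trainable parameter pool

filt = rng.standard_normal((5, 5))          # stand-in for a learned 5x5 filter
F = dctn(filt, norm="ortho")                # smooth weights -> low-freq energy

# hash each frequency coordinate into the pool; tied entries share one value
idx = np.array([[hash((i, j)) % len(shared) for j in range(5)]
                for i in range(5)])
filt_hat = idctn(shared[idx], norm="ortho") # decompressed filter at run time
print(filt_hat.shape, "filter rebuilt from", len(shared), "shared values")

In training, the pool `shared` would be learned by backpropagating through this mapping.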








 

Monday, August 29, 2016

Densely Connected Convolutional Networks

If neural networks are an instance of an iteration of a solver or a dynamical system, they seldom use more than the previous iterate at each iteration. There is an ongoing area of research that seems to be getting good (CIFAR) results by "remembering" some of these past iterates. From the paper:

Many recent publications address this or related problems. ResNets (He et al., 2015b) and Highway Networks (Srivastava et al., 2015) bypass signal from one layer to the next via identity connections. Stochastic Depth (Huang et al., 2016) shortens ResNets by randomly dropping layers during training to allow better information and gradient flow. Recently, Larsson et al. (2016) introduced FractalNets , which repeatedly combine several parallel layer sequences with different number of convolutional blocks to obtain a large nominal depth, while maintaining many short paths in the network. Although these different approaches vary in network topology and training procedure, we observe a key characteristic shared by all of them: they create short paths from earlier layers near the input to those later layers near the output. In this paper we propose an architecture that distills this insight into a simple and clean connectivity pattern. The idea is straight-forward, yet compelling: to ensure maximum information flow between layers in the network, we connect all layers directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own featuremaps to all subsequent layers





Densely Connected Convolutional Networks by Gao Huang, Zhuang Liu, Kilian Q. Weinberger
Recent work has shown that convolutional networks can be substantially deeper, more accurate and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper we embrace this observation and introduce the Dense Convolutional Network (DenseNet), where each layer is directly connected to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections, one between each layer and its subsequent layer (treating the input as layer 0), our network has L(L+1)/2 direct connections. For each layer, the feature maps of all preceding layers are treated as separate inputs whereas its own feature maps are passed on as inputs to all subsequent layers. Our proposed connectivity pattern has several compelling advantages: it alleviates the vanishing gradient problem and strengthens feature propagation; despite the increase in connections, it encourages feature reuse and leads to a substantial reduction of parameters; its models tend to generalize surprisingly well. We evaluate our proposed architecture on five highly competitive object recognition benchmark tasks. The DenseNet obtains significant improvements over the state-of-the-art on all five of them (e.g., yielding 3.74% test error on CIFAR-10, 19.25% on CIFAR-100 and 1.59% on SVHN).
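The connectivity pattern is only a few lines once convolutions are replaced by a fully connected stand-in: each layer consumes the concatenation of all preceding feature maps and contributes a fixed number of new ones (the growth rate and sizes are my toy choices):

import numpy as np

rng = np.random.default_rng(0)

def dense_block(x, num_layers=4, growth=8):
    # each 'layer' sees the concatenation of all preceding outputs and
    # contributes `growth` new features (toy fully connected stand-in)
    features = [x]
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=1)        # all earlier feature maps
        W = rng.standard_normal((inp.shape[1], growth)) * 0.1
        features.append(np.maximum(inp @ W, 0.0))     # new feature maps
    return np.concatenate(features, axis=1)

x = rng.standard_normal((2, 16))
out = dense_block(x)
print(out.shape)   # 16 + 4*8 = 48 features: the L(L+1)/2-style connectivity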


 An implementation is available on GitHub: https://github.com/liuzhuang13/DenseNet

see also the comments on Reddit.
 
 
