Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not
use or distribute these slides for commercial purposes. You may make copies of these
slides and use or distribute them for educational purposes as long as you
cite DeepLearning.AI as the source of the slides.
For the rest of the details of the license, see
https://creativecommons.org/licenses/by-sa/2.0/legalcode
Model Resource
Management Techniques
Welcome
Dimensionality Reduction
Dimensionality Effect on
Performance
High-dimensional data
Before … when it was all about data mining
● Domain experts selected features
● Designed feature transforms
● Small number of more relevant features were enough
Now … data science is about integrating everything
● Data generation and storage is less of a problem
● Squeeze out the best from data
● More high-dimensional data having more features
A note about neural networks
● Yes, neural networks will perform a kind of automatic feature selection
● However, that’s not as efficient as a well-designed dataset and model
○ Much of the model can be largely “shut off” to ignore unwanted
features
○ Even those unused parts of the model still consume space and compute resources
○ Unwanted features can still introduce unwanted noise
○ Each feature requires infrastructure to collect, store, and manage
High-dimensional spaces
Word embedding - An example
[Figure: word embedding example. Input sentences such as “I want to search for blood pressure result history” and “Show blood pressure result for patient” are tokenized at the input layer, where each vocabulary word (i, want, to, search, for, blood, pressure, result, history, show, patient, ..., LAST) is mapped to an integer index (1, 2, 3, ..., 11, ..., 20). The embedding layer looks up each token index (e.g., the sequence 6 7 8 5 11) as a row of the automatically learned embedding weight matrix, whose entries (shown as “?”) are the learned vectors.]
Initialization and loading the dataset
import tensorflow as tf
from tensorflow import keras
import numpy as np
from keras.datasets import reuters
from keras.preprocessing import sequence
num_words = 1000
(reuters_train_x, reuters_train_y), (reuters_test_x, reuters_test_y) = \
    tf.keras.datasets.reuters.load_data(num_words=num_words)
n_labels = np.unique(reuters_train_y).shape[0]
Further preprocessing
reuters_train_y = tf.keras.utils.to_categorical(reuters_train_y, 46)
reuters_test_y = tf.keras.utils.to_categorical(reuters_test_y, 46)
reuters_train_x = tf.keras.preprocessing.sequence.pad_sequences(
    reuters_train_x, maxlen=20)
reuters_test_x = tf.keras.preprocessing.sequence.pad_sequences(
    reuters_test_x, maxlen=20)
Using all dimensions
from tensorflow.keras import layers
model = tf.keras.Sequential(
[
layers.Embedding(num_words, 1000, input_length=20),
layers.Flatten(),
layers.Dense(256),
layers.Dropout(0.25),
layers.Activation('relu'),
layers.Dense(46),
layers.Activation('softmax')
])
Model compilation and training
model.compile(loss="categorical_crossentropy", optimizer="rmsprop",
metrics=['accuracy'])
model_1 = model.fit(reuters_train_x, reuters_train_y,
validation_data=(reuters_test_x , reuters_test_y),
batch_size=128, epochs=20, verbose=0)
Example with a higher number of dimensions
[Figure: model accuracy and model loss vs. epoch (train and validation curves) for the model using 1000 embedding dimensions.]
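These accuracy/loss curves come from the History object returned by fit; a minimal plotting sketch (assuming matplotlib and the model_1 history from the previous slide):

import matplotlib.pyplot as plt

# model_1 is the History object returned by model.fit() above
for metric in ['accuracy', 'loss']:
    plt.figure()
    plt.plot(model_1.history[metric], label='train')
    plt.plot(model_1.history['val_' + metric], label='validation')
    plt.title('model ' + metric)
    plt.xlabel('epoch')
    plt.ylabel(metric)
    plt.legend()
plt.show()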
Word embeddings: 6 dimensions
from tensorflow.keras import layers
model = tf.keras.Sequential(
[
layers.Embedding(num_words, 6, input_length=20),
layers.Flatten(),
layers.Dense(256),
layers.Dropout(0.25),
layers.Activation('relu'),
layers.Dense(46),
layers.Activation('softmax')
])
Word embeddings: fourth root of the size of the vocab
[Figure: model accuracy and model loss vs. epoch (train and validation curves) for the 6-dimensional embedding model.]
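A common rule of thumb (assumed here; the slides only name it) sets the embedding dimension near the fourth root of the vocabulary size, which is where the 6 above comes from:

# Rule-of-thumb embedding size: vocabulary_size ** 0.25
embedding_dim = int(round(num_words ** 0.25))  # 1000 ** 0.25 ≈ 5.6 → 6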
Dimensionality Reduction
Curse of Dimensionality
Many ML methods use a distance measure (e.g., Euclidean distance): k-Nearest Neighbours, Support Vector Machines, recommendation systems, and more.
● More dimensions → more features
● Risk of overfitting our models
● Distances grow more and more alike
● No clear distinction between clustered objects
● Concentration phenomenon for Euclidean distance
Why is high-dimensional data a problem?
Curse of dimensionality
“As we add more dimensions we also increase the processing power we
need to train the model and make predictions, as well as the amount of
training data required”
Badreesh Shetty
Why are more features bad?
● Redundant / irrelevant features
● More noise added than signal
● Hard to interpret and visualize
● Hard to store and process data
The performance of algorithms ~ the number of dimensions
[Figure: classifier performance (y-axis) vs. number of dimensions (x-axis), peaking at an optimal dimensionality (# of features).]
[Figure: with 5 possible values per feature, a 1-D feature space has 5 cells (1 ... 5), a 2-D space has 25 cells ((1, 1) ... (5, 5)), and so on.]
Adding dimensions increases feature space volume
Curse of dimensionality in the distance function
● New dimensions add non-negative terms to the sum
● Distance increases with the number of dimensions
● For a given number of examples, the feature space becomes
increasingly sparse
Euclidean distance: d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
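A quick way to see this concentration is to compare the nearest and farthest pairwise distances as dimensionality grows; a small NumPy sketch (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.random((200, d))                   # 200 random points in d dimensions
    diffs = X[:, None, :] - X[None, :, :]      # pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists = dists[np.triu_indices(200, k=1)]   # keep each pair once
    # The relative spread (max - min) / min shrinks as d grows: distances concentrate
    print(d, (dists.max() - dists.min()) / dists.min())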
Increasing sparsity with higher dimensions
[Figure: the same points (“x”) plotted in 1-D (Dimension 1), 2-D (Dimensions 1-2), and 3-D (Dimensions 1-3); as dimensions are added, the points spread out and the space becomes increasingly sparse.]
The more features, the larger the hypothesis space
The smaller the hypothesis space
● the easier it is to find the correct hypothesis
● the fewer examples you need
The Hughes effect
Dimensionality Reduction
Curse of Dimensionality:
An example
How dimensionality impacts models in other ways
● Runtime and system memory
requirements
● Solutions take longer to reach global
optima
● More dimensions raise the likelihood
of correlated features
More features require more training data
● More features aren’t better if they don’t add predictive information
● Number of training instances needed increases exponentially with each
added feature
● Reduces real-world usefulness of models
Model #1 (missing a single feature)
[Model diagram: the heart-disease inputs (sex, cp, fbs, restecg, exang, slope, ca, age, trestbps, chol, thalach, oldpeak) pass through CategoryEncoding or Normalization layers, are concatenated, and then fed through Dense → Dropout → Dense.]
Model #2 (adds a new feature)
[Model diagram: the same architecture as Model #1, with one addition - a new thal input passes through a StringLookup and CategoryEncoding layer before the concatenation. A new string categorical feature is added!]
from tensorflow.python.keras.utils.layer_utils import count_params
# Number of training parameters in Model #1
>>> count_params(model_1.trainable_variables)
833
# Number of training parameters in Model #2 (with an added feature)
>>> count_params(model_2.trainable_variables)
1057
Comparing the two models’ trainable variables
What do ML models need?
● No hard and fast rule on how many features are required
● The number of features to use varies depending on the problem
● Prefer uncorrelated data containing information to
produce correct results
Manual Dimensionality
Reduction
Dimensionality Reduction
Increasing predictive performance
● Features must have information to produce correct results
● Derive features from inherent features
● Extract and recombine to create new features
Combining features
● Number of features grows very quickly
● Reduce dimensionality
[Figure: feature explosion. Initial features by data type: images (pixels, contours, textures, etc.), audio (samples, spectrograms, etc.), financial series (ticks, trends, reversals, etc.), genomics (DNA, marker sequences, genes, etc.), text (words, grammatical classes and relations, etc.).]
Why reduce dimensionality? Storage, computational cost, consistency, and visualization.
Major techniques for dimensionality reduction: feature engineering and feature selection.
Need for manually crafting features
Certainly provides food for thought.
Engineer features:
● Tabular - aggregate, combine, decompose
● Text - extract context indicators
● Image - prescribe filters for relevant structures
Come up with ideas to construct “better” features, devise features to reduce dimensionality, select the right features to maximize predictiveness, and evaluate models using the chosen features - it’s an iterative process.
Feature Engineering
Manual Dimensionality
Reduction: case study
Dimensionality Reduction
CSV_COLUMNS = [
'fare_amount',
'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
'dropoff_longitude', 'dropoff_latitude',
'passenger_count', 'key',
]
LABEL_COLUMN = 'fare_amount'
STRING_COLS = ['pickup_datetime']
NUMERIC_COLS = ['pickup_longitude', 'pickup_latitude',
'dropoff_longitude', 'dropoff_latitude',
'passenger_count']
DEFAULTS = [[0.0], ['na'], [0.0], [0.0], [0.0], [0.0], [0.0], ['na']]
DAYS = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
Taxi Fare dataset
[Model diagram: inputs dropoff_longitude, passenger_count, pickup_latitude, pickup_longitude, dropoff_latitude → dense_features (DenseFeatures) → h1 (Dense) → h2 (Dense) → fare (Dense).]
Build the model in Keras
from tensorflow.keras import layers, models
from tensorflow.keras.metrics import RootMeanSquaredError as RMSE
dnn_inputs = layers.DenseFeatures(feature_columns.values())(inputs)
h1 = layers.Dense(32, activation='relu', name='h1')(dnn_inputs)
h2 = layers.Dense(8, activation='relu', name='h2')(h1)
output = layers.Dense(1, activation='linear', name='fare')(h2)
model = models.Model(inputs, output)
model.compile(optimizer='adam', loss='mse',
metrics=[RMSE(name='rmse'), 'mse'])
Build a baseline model using raw features
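The snippet above assumes that inputs and feature_columns already exist; a minimal sketch of how they might be built for the raw numeric columns (the column names come from the dataset definition above, the rest is assumed):

from tensorflow import feature_column as fc
from tensorflow.keras import layers

inputs = {
    colname: layers.Input(name=colname, shape=(), dtype='float32')
    for colname in NUMERIC_COLS
}
feature_columns = {
    colname: fc.numeric_column(colname)
    for colname in NUMERIC_COLS
}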
Train the model
[Figure: baseline model RMSE vs. epoch (train and validation), decreasing from roughly 125 toward 100 over the first few epochs.]
Increasing model performance with Feature Engineering
● Carefully craft features for the data types
○ Temporal (pickup date & time)
○ Geographical (latitude and longitude)
import datetime

def parse_datetime(s):
if type(s) is not str:
s = s.numpy().decode('utf-8')
return datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S %Z")
def get_dayofweek(s):
ts = parse_datetime(s)
return DAYS[ts.weekday()]
@tf.function
def dayofweek(ts_in):
return tf.map_fn(
lambda s: tf.py_function(get_dayofweek, inp=[s],
Tout=tf.string),
ts_in)
Handling temporal features
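One possible way (assumed, mirroring the Lambda pattern used for the other features below) to plug dayofweek into the model is as a Lambda layer over the pickup_datetime input, followed by a vocabulary column over DAYS:

# Illustrative only: wrap the tf.py_function-based dayofweek in a Lambda layer
transformed['day_of_week'] = layers.Lambda(
    dayofweek, name='day_of_week')(inputs['pickup_datetime'])

feature_columns['day_of_week'] = fc.indicator_column(
    fc.categorical_column_with_vocabulary_list('day_of_week', vocabulary_list=DAYS))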
def euclidean(params):
lon1, lat1, lon2, lat2 = params
londiff = lon2 - lon1
latdiff = lat2 - lat1
return tf.sqrt(londiff * londiff + latdiff * latdiff)
Geolocational features
def scale_longitude(lon_column):
    # Shifts/scales NYC-area longitudes (roughly -78 to -70) into [0, 1]
    return (lon_column + 78)/8.
Scaling latitude and longitude
def scale_latitude(lat_column):
    # Shifts/scales NYC-area latitudes (roughly 37 to 45) into [0, 1]
    return (lat_column - 37)/8.
def transform(inputs, numeric_cols, string_cols, nbuckets):
...
feature_columns = {
colname: tf.feature_column.numeric_column(colname)
for colname in numeric_cols
}
Preparing the transformations
for lon_col in ['pickup_longitude', 'dropoff_longitude']:
transformed[lon_col] = layers.Lambda(scale_longitude,
...)(inputs[lon_col])
for lat_col in ['pickup_latitude', 'dropoff_latitude']:
transformed[lat_col] = layers.Lambda(
scale_latitude,
...)(inputs[lat_col])
...
def transform(inputs, numeric_cols, string_cols, nbuckets):
...
transformed['euclidean'] = layers.Lambda(
euclidean,
name='euclidean')([inputs['pickup_longitude'],
inputs['pickup_latitude'],
inputs['dropoff_longitude'],
inputs['dropoff_latitude']])
feature_columns['euclidean'] = fc.numeric_column('euclidean')
...
Computing the Euclidean distance
def transform(inputs, numeric_cols, string_cols, nbuckets):
...
latbuckets = np.linspace(0, 1, nbuckets).tolist()
lonbuckets = ... # Similarly for longitude
b_plat = fc.bucketized_column(
feature_columns['pickup_latitude'], latbuckets)
b_dlat = # Bucketize 'dropoff_latitude'
b_plon = # Bucketize 'pickup_longitude'
b_dlon = # Bucketize 'dropoff_longitude'
Bucketizing and feature crossing
ploc = fc.crossed_column([b_plat, b_plon], nbuckets * nbuckets)
dloc = # Feature cross 'b_dlat' and 'b_dlon'
pd_pair = fc.crossed_column([ploc, dloc], nbuckets ** 4)
feature_columns['pickup_and_dropoff'] = fc.embedding_column(pd_pair,
100)
Bucketizing and feature crossing
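A possible completion of the lines elided above, following the same pattern (a sketch, not necessarily the course's exact code):

# Bucketize the remaining scaled coordinates
b_dlat = fc.bucketized_column(feature_columns['dropoff_latitude'], latbuckets)
b_plon = fc.bucketized_column(feature_columns['pickup_longitude'], lonbuckets)
b_dlon = fc.bucketized_column(feature_columns['dropoff_longitude'], lonbuckets)

# Feature cross dropoff latitude x longitude, mirroring ploc
dloc = fc.crossed_column([b_dlat, b_dlon], nbuckets * nbuckets)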
[Model diagram: the same inputs now pass through scale_pickup_latitude, scale_pickup_longitude, scale_dropoff_latitude, scale_dropoff_longitude, and euclidean Lambda layers (passenger_count is fed in directly) before DenseFeatures → h1 (Dense) → h2 (Dense) → fare (Dense).]
Build a model with the engineered features
Train the new feature engineered model
[Figure: improved model RMSE vs. epoch (train and validation), falling from about 100 toward 30-40 over 8 epochs, compared with the baseline model RMSE, which stays around 100-125.]
Algorithmic Dimensionality
Reduction
Dimensionality Reduction
Linear dimensionality reduction
● Linearly project n-dimensional data onto a k-dimensional subspace
(k < n, often k << n)
● There are infinitely many k-dimensional subspaces we can project the
data onto
● Which one should we choose?
[Figure: features f1, f2, f3, ..., fn-1, fn are reduced via dimensionality reduction to a single output value ranging from 0 (dark) to 1 (bright).]
Projecting onto a line
Best k-dimensional subspace for projection
Classification: maximize separation among classes
Example: Linear discriminant analysis (LDA)
Regression: maximize correlation between projected data and response variable
Example: Partial least squares (PLS)
Unsupervised: retain as much data variance as possible
Example: Principal component analysis (PCA)
Principal Component Analysis
Dimensionality Reduction
Principal component analysis (PCA)
● PCA is a minimization of the
orthogonal distance
● Widely used method for unsupervised
& linear dimensionality reduction
● Accounts for variance of data in as
few dimensions as possible using
linear projections
Principal components (PCs)
● PCs maximize the variance of
projections
● PCs are orthogonal
● Gives the best axis to project
● Goal of PCA: Minimize total squared
reconstruction error
[Figure: 2-D data cloud with the 1st and 2nd principal vectors drawn as orthogonal directions of maximum variance.]
PCA Algorithm - First Principal Component
Step 1: Find a line such that, when the data is projected onto it, the projection has maximum variance.
PCA Algorithm - Second Principal Component
Step 2: Find a second line, orthogonal to the first, that has maximum projected variance.
PCA Algorithm
Step 3: Repeat until we have k orthogonal lines.
# Assumes PrincipalComponentAnalysis from mlxtend (which exposes the e_vals_ / e_vecs_
# attributes used below); X is the Iris feature matrix
from mlxtend.feature_extraction import PrincipalComponentAnalysis
pca = PrincipalComponentAnalysis(n_components=2)
pca.fit(X)
X_pca = pca.transform(X)
Applying PCA on Iris
tot = sum(pca.e_vals_)
var_exp = [(i / tot) * 100 for i in sorted(pca.e_vals_, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
Plot the explained variance
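A short matplotlib sketch (assumed, not from the slides) to visualize these values:

import matplotlib.pyplot as plt

plt.bar(range(len(var_exp)), var_exp, label='individual explained variance')
plt.step(range(len(cum_var_exp)), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.xlabel('principal component index')
plt.ylabel('explained variance (%)')
plt.legend()
plt.show()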
loadings = pca.e_vecs_ * np.sqrt(pca.e_vals_)
PCA factor loadings
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets
# Load the data
digits = datasets.load_digits()
# Standardize the feature matrix
X = StandardScaler().fit_transform(digits.data)
PCA in scikit-learn
# Create a PCA that will retain 99% of the variance
pca = PCA(n_components=0.99, whiten=True)
# Conduct PCA
X_pca = pca.fit_transform(X)
PCA in scikit-learn
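A small follow-up check (assumed) to see how many components were kept and how much variance they explain:

print("Original number of features:", X.shape[1])
print("Number of components kept:", pca.n_components_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())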
When to use PCA?
Strengths
● A versatile technique
● Fast and simple
● Offers several variations and extensions (e.g., kernel/sparse PCA)
Weaknesses
● Resulting components are not directly interpretable (each is a linear combination of the original features)
● Requires setting threshold for cumulative explained variance
Other Techniques
Dimensionality Reduction
More dimensionality reduction algorithms
Unsupervised
● Latent Semantic Indexing/Analysis (LSI and LSA) (SVD)
● Independent Component Analysis (ICA)
Matrix
Factorization
● Non-Negative Matrix Factorization (NMF)
Latent
Methods
● Latent Dirichlet Allocation (LDA)
Singular value decomposition (SVD)
● SVD decomposes non-square matrices
● Useful for sparse matrices as produced by TF-IDF
● Removes redundant features from the dataset
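For example, a latent-semantic-analysis style reduction of a TF-IDF matrix could use scikit-learn's TruncatedSVD (the toy corpus below is assumed):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "dogs chase cats", "the dog sat"]  # toy corpus
tfidf = TfidfVectorizer().fit_transform(docs)   # sparse, non-square TF-IDF matrix
svd = TruncatedSVD(n_components=2)              # SVD-based reduction (LSA)
reduced = svd.fit_transform(tfidf)
print(reduced.shape)                            # (3, 2)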
Independent Components Analysis (ICA)
● PCA seeks directions in feature space that minimize reconstruction
error
● ICA seeks directions that are most statistically independent
● ICA addresses higher order dependence
How does ICA work?
● Assume there exist N independent source signals: S(t) = [s₁(t), s₂(t), …, s_N(t)]
● The observed signals are linear combinations of the sources: Y(t) = A S(t)
○ Both A and S are unknown
○ A is the mixing matrix
● Goal of ICA: recover the original signals S(t) from Y(t)
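A minimal sketch of this recovery with scikit-learn's FastICA (the toy source signals are assumed):

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # independent sources s1(t), s2(t)
A = np.array([[1.0, 0.5], [0.5, 2.0]])             # unknown mixing matrix
Y = S @ A.T                                        # observed mixtures Y(t) = A S(t)

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(Y)                       # recovered sources (up to scale and order)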
                                  PCA   ICA
Removes correlations               ✓     ✓
Removes higher-order dependence          ✓
All components treated fairly?           ✓
Orthogonality                      ✓
Comparing PCA and ICA
Non-negative Matrix Factorization (NMF)
● NMF models are interpretable and easier to understand
● NMF requires the sample features to be non-negative
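A small scikit-learn sketch (assumed example data) factorizing a non-negative matrix into additive, interpretable parts:

import numpy as np
from sklearn.decomposition import NMF

X = np.random.rand(100, 20)                # NMF requires non-negative features
nmf = NMF(n_components=5, init='nndsvda', random_state=0)
W = nmf.fit_transform(X)                   # sample-to-component weights
H = nmf.components_                        # component-to-feature weights; X ≈ W @ H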
Mobile, IoT, and Similar Use
Cases
Quantization & Pruning
Trends in adoption of smart devices
Factors driving this trend
● Demand to move ML capability from the cloud to on-device
● Cost-effectiveness
● Compliance with privacy regulations
Online ML inference
● To generate real-time predictions you can:
○ Host the model on a server
○ Embed the model in the device
● Is it faster on a server, or on-device?
● Mobile processing limitations?
Mobile inference: inference on the cloud/server
[Diagram: the device sends a classification request to a model hosted on a server and receives prediction results.]
Pros
● Lots of compute capacity
● Scalable hardware
● Model complexity handled by the server
● Easy to add new features and update the model
● Low latency and batch prediction
Cons
● Timely inference is needed, and the network round trip adds latency
Mobile inference: on-device inference
Pros
● Improved speed and performance
● No dependence on network connectivity
● No to-and-fro communication with a server needed
Cons
● Less compute capacity
● Tight resource constraints
Model deployment options
(compared across: on-device inference, on-device personalization, on-device training, cloud-based web service, pretrained models, custom models)
ML Kit    ✓ ✓ ✓ ✓ ✓
Core ML   ✓ ✓ ✓ ✓ ✓
*         ✓ ✓ ✓ ✓ ✓
* Also supported in TFX
Benefits and Process of
Quantization
Quantization & Pruning
Quantization
Why quantize neural networks?
● Neural networks have many parameters and take up space
● Shrinking model file size
● Reduce computational resources
● Make models run faster and use less power with low-precision arithmetic
[Figure: MobileNets latency vs. accuracy trade-off - Top-1 accuracy (40-70%) vs. runtime in ms on a Pixel 2 big core (Snapdragon 835), comparing float and 8-bit quantized models.]
Benefits of quantization
● Faster compute
● Low memory bandwidth
● Low power
● Integer operations supported across CPU/DSP/NPUs
[Figure: quantization maps the float32 range (about -3e38 to 3e38, using the data's actual min and max) onto the int8 range -127 to 127, centered at 0.]
The quantization process
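Conceptually, quantization picks a scale and zero point so that the observed [min, max] of the float values covers the int8 range; a NumPy sketch of such an affine mapping (illustrative, not the exact TF Lite scheme):

import numpy as np

def quantize_int8(x):
    # Map float values in [x.min(), x.max()] onto int8 values in [-127, 127]
    scale = (x.max() - x.min()) / 254.0
    zero_point = int(round(-127 - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, -127, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    # Approximate reconstruction of the original float values
    return (q.astype(np.float32) - zero_point) * scale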
What parts of the model are affected?
● Static values (parameters)
● Dynamic values (activations)
● Computation (transformations)
Trade-offs
● Optimizations impact model accuracy
○ Difficult to predict ahead of time
● In rare cases, models may actually gain some accuracy
● Undefined effects on ML interpretability
Choose the best model for the task
Post Training Quantization
Quantization & Pruning
● Reduced-precision representation
● Incurs a small loss in model accuracy
● Jointly optimizes the model for size and latency
Post-training quantization
Technique Benefits
Dynamic range quantization 4x smaller, 2x-3x speedup
Full integer quantization 4x smaller, 3x+ speedup
float16 quantization 2x smaller, GPU acceleration
Post-training quantization
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
Post training quantization
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
[Diagram: a TensorFlow (tf.Keras) SavedModel plus calibration data is fed to the TF Lite Converter, which produces an INT8 TF Lite model.]
Post-training integer quantization
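To produce the fully integer model shown in this diagram, the converter also needs a representative (calibration) dataset; a sketch where saved_model_dir and calibration_samples are assumed placeholders:

import tensorflow as tf

def representative_dataset():
    for sample in calibration_samples:          # assumed iterable of input examples
        yield [tf.cast(sample, tf.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8        # optional: integer inputs/outputs
converter.inference_output_type = tf.int8
tflite_int8_model = converter.convert()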
Model accuracy
● Small accuracy loss incurred (mostly for smaller networks)
● Use the benchmarking tools to evaluate model accuracy
● If the accuracy drop is not within acceptable limits, consider using
quantization-aware training
Quantization Aware
Training
Quantization & Pruning
Quantization-aware training (QAT)
● Inserts fake quantization (FQ) nodes in the forward pass
● Rewrites the graph to emulate quantized inference
● Reduces the loss of accuracy due to quantization
● Resulting model contains all data to be quantized according to spec
[Diagram: TensorFlow (tf.Keras) model → apply QAT and train the model → convert & quantize with TF Lite → INT8 TF Lite model.]
Quantization-aware training (QAT)
Adding the quantization emulation operations
[Diagram: before - input plus weights and biases feed a conv layer followed by ReLU6 to produce the output. After - a weight quantization (Wt quant) node is inserted on the weights and an activation quantization (Act quant) node after ReLU6.]
import tensorflow_model_optimization as tfmot
model = tf.keras.Sequential([
...
])
QAT on entire model
# Quantize the entire model.
quantized_model = tfmot.quantization.keras.quantize_model(model)
# Continue with training as usual.
quantized_model.compile(...)
quantized_model.fit(...)
import tensorflow_model_optimization as tfmot
quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer
model = tf.keras.Sequential([
...
# Only annotated layers will be quantized.
quantize_annotate_layer(Conv2D()),
quantize_annotate_layer(ReLU()),
Dense(),
...
])
# Quantize the model.
quantized_model = tfmot.quantization.keras.quantize_apply(model)
Quantize part(s) of a model
quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer
quantize_annotate_model = tfmot.quantization.keras.quantize_annotate_model
quantize_scope = tfmot.quantization.keras.quantize_scope
Quantize custom Keras layer
model = quantize_annotate_model(tf.keras.Sequential([
quantize_annotate_layer(CustomLayer(20, input_shape=(20,)),
DefaultDenseQuantizeConfig()),
tf.keras.layers.Flatten()
]))
# `quantize_apply` requires mentioning `DefaultDenseQuantizeConfig`
# with `quantize_scope`
with quantize_scope(
{'DefaultDenseQuantizeConfig': DefaultDenseQuantizeConfig,
'CustomLayer': CustomLayer}):
# Use `quantize_apply` to actually make the model quantization aware.
quant_aware_model = tfmot.quantization.keras.quantize_apply(model)
Quantize custom Keras layer
Model Optimization Results - Accuracy
Model                Top-1 Accuracy   Top-1 Accuracy              Top-1 Accuracy
                     (Original)       (Post-Training Quantized)   (Quantization-Aware Training)
Mobilenet-v1-1-224   0.709            0.657                       0.70
Mobilenet-v2-1-224   0.719            0.637                       0.709
Inception_v3         0.78             0.772                       0.775
Resnet_v2_101        0.770            0.768                       N/A
Model                Latency (ms)   Latency (ms)                Latency (ms)
                     (Original)     (Post-Training Quantized)   (Quantization-Aware Training)
Mobilenet-v1-1-224   124            112                         64
Mobilenet-v2-1-224   89             98                          54
Inception_v3         1130           845                         543
Resnet_v2_101        3973           2868                        N/A
Model Optimization Results - Latency
Model                Size (Original) (MB)   Size (Optimized) (MB)
Mobilenet-v1-1-224   16.9                   4.3
Mobilenet-v2-1-224   14                     3.6
Inception_v3         95.7                   23.9
Resnet_v2_101        178.3                  44.9
Model Optimization Results
Pruning
Quantization & Pruning
[Figure: connection pruning - a densely connected network before pruning vs. a sparse network after pruning.]
Connection pruning
Model sparsity: larger models mean more memory and less efficiency; sparse models mean less memory and more efficiency.
[Figure: pruning synapses and pruning neurons - the network before and after pruning.]
Origins of weight pruning
The Lottery Ticket Hypothesis
Finding Sparse Neural Networks
“A randomly-initialized, dense neural network contains a subnetwork that is
initialized such that — when trained in isolation — it can match the test
accuracy of the original network after training for at most the same number
of iterations”
Jonathan Frankle and Michael Carbin
Pruning research is evolving
● The new method didn’t perform well at large scale
● The new method failed to identify the randomly initialized winners
● It’s an active area of research
Tensors with no sparsity (left), sparsity in blocks of 1x1 (center), and sparsity
in blocks of 1x2 (right)
Eliminate connections based on their magnitude
Example of sparsity ramp-up function with a schedule to start pruning from step 0
until step 100, and a final target sparsity of 90%.
Apply sparsity with a pruning routine
Animation of pruning applied to a tensor
Black cells indicate where the non-zero weights exist
Sparsity increases with training
What’s special about pruning?
● Better storage and/or transmission
● Gain speedups in CPU and some ML accelerators
● Can be used in tandem with quantization to get additional benefits
● Unlock performance improvements
Pruning with TF Model Optimization Toolkit
[Diagram: TensorFlow (tf.Keras) model → sparsify and train the model → convert & quantize with TF Lite → INT8 TF Lite model.]
import tensorflow_model_optimization as tfmot
model = build_your_model()
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
initial_sparsity=0.50, final_sparsity=0.80,
begin_step=2000, end_step=4000)
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
model,
pruning_schedule=pruning_schedule)
...
model_for_pruning.fit(...)
Pruning with Keras
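Two details the snippet leaves out (both part of the TF Model Optimization Toolkit API): pruning needs the UpdatePruningStep callback during training, and strip_pruning removes the pruning wrappers before export. A sketch, with compile settings and training data (x_train, y_train) assumed:

model_for_pruning.compile(optimizer='adam',
                          loss='sparse_categorical_crossentropy',
                          metrics=['accuracy'])
model_for_pruning.fit(x_train, y_train, epochs=5,
                      callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before saving or converting the model
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)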
Model              Non-sparse Top-1 acc.   Sparse acc.   Sparsity
Inception V3       78.1%                   78.0%         50%
                                           76.1%         75%
                                           74.6%         87.5%
Mobilenet V1 224   71.04%                  70.84%        50%

Model        Non-sparse BLEU   Sparse BLEU   Sparsity
GNMT EN-DE   26.77             26.86         80%
                               26.52         85%
                               26.19         90%
GNMT DE-EN   29.47             29.50         80%
                               29.24         85%
                               28.81         90%

Results across different models & tasks