2. CNN
• A Convolutional Neural Network (ConvNet or CNN) is a class of
deep neural networks used primarily for analyzing visual data such
as images or videos, where spatial patterns in the data play a crucial role.
• It is an advanced version of the artificial neural network (ANN),
designed primarily to extract features from grid-like (matrix)
data.
• CNNs are widely used in computer vision applications.
• CNNs consist of multiple layers, such as the input layer, convolutional
layers, pooling layers, and fully connected layers.
4. Core Components and Layers of a ConvNet:
Input Layer:
• This layer receives the raw input data, typically an image represented as a multi-dimensional array (height × width × color
channels).
• Example: for an image of dimension 32 × 32 × 3, this layer holds the raw pixel values with width 32, height 32, and depth 3.
Convolutional Layer:
• This is the fundamental building block of a ConvNet.
• It applies a set of learnable filters (also called kernels) to the input volume. The filters/kernels are small matrices, usually of
shape 2×2, 3×3, or 5×5.
• Each filter performs a convolution operation, sliding across the input and computing the dot product between the filter's weights
and the corresponding input region.
• This process extracts features like edges, textures, and patterns, generating feature maps that represent the presence of these
features at different locations.
Activation Function (e.g., ReLU):
• Typically applied after the convolutional layer.
• It introduces non-linearity into the model, allowing it to learn more complex relationships in the data.
• Rectified Linear Unit (ReLU) is a common choice, setting all negative values in the feature map to zero and keeping positive values
unchanged.
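The ReLU rule described above can be sketched in a few lines of plain Python. The function name `relu` and the sample `feature_map` values are made up for illustration; real frameworks apply the same element-wise rule to whole tensors.

```python
def relu(feature_map):
    """Apply ReLU element-wise to a 2D feature map given as a list of rows."""
    # Negative values become 0; zero and positive values pass through unchanged.
    return [[max(0, value) for value in row] for row in feature_map]


# Hypothetical feature map containing positive and negative activations.
feature_map = [
    [3, -1, 0],
    [-2, 5, -4],
    [1, -3, 2],
]

activated = relu(feature_map)
print(activated)  # [[3, 0, 0], [0, 5, 0], [1, 0, 2]]
```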
Pooling Layer:
• Its purpose is to reduce the spatial dimensions (height and width) of the feature maps, thereby decreasing computational
complexity and the number of parameters.
• Common pooling operations include Max Pooling (selecting the maximum value within a defined window) and Average Pooling
(calculating the average value within a window).
5. Fully Connected Layer (FC Layer):
• Located at the end of the network, after several convolutional and pooling layers.
• The flattened output from the preceding layers is fed into this layer.
• Each neuron in a fully connected layer is connected to every neuron in the previous layer, allowing the network
to learn high-level representations and make predictions based on the extracted features.
Output Layer:
• The final layer of the ConvNet.
• It typically uses an activation function like Softmax for classification tasks, producing a probability
distribution over the possible classes.
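As a rough illustration of how Softmax turns raw class scores into a probability distribution, here is a minimal plain-Python sketch. The scores in `logits` are invented for the example, and subtracting the maximum score is a common numerical-stability trick rather than something the slides specify.

```python
import math


def softmax(scores):
    """Convert raw class scores (logits) into probabilities that sum to 1."""
    # Subtracting the largest score avoids overflow in exp() and does not
    # change the resulting probabilities.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three classes.
logits = [2.0, 1.0, 0.1]
probabilities = softmax(logits)

# The predicted class is the index with the highest probability.
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```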
Additional Components:
Dropout Layer:
A regularization technique often used in fully connected layers to prevent overfitting by randomly deactivating a
fraction of neurons during training.
Batch Normalization:
A technique to normalize the activations of a layer, which can improve training stability and speed.
This hierarchical structure allows ConvNets to automatically learn and extract increasingly complex features from
raw visual data, making them highly effective for tasks such as image classification, object detection, and
segmentation.
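The dropout component listed above can be sketched as follows. This is only an illustration: the function name `dropout` and the sample activations are hypothetical, and the rescaling of surviving values by 1 / (1 - rate) is the common "inverted dropout" convention, which is an assumption here and not something the slides state.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` (training time only).

    Surviving activations are scaled by 1 / (1 - rate) so that the expected
    value of each activation stays the same (inverted dropout, an assumption).
    """
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
activations = [1.0] * 10
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)  # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

At inference time dropout is normally switched off, so the activations pass through unchanged.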
6. Detailed description of each layer
1. Input Layer
Purpose: Accepts raw image data (e.g., 224×224×3 for a color
image).
Data shape: (Height × Width × Channels).
Example: For grayscale: 28×28×1, for RGB: 32×32×3.
2. Convolutional Layers
Core operation: Applies a set of learnable filters (kernels) to
extract local features.
Key parameters:
Number of filters (e.g., 32, 64, 128…)
Kernel size (e.g., 3×3, 5×5)
Stride (step size of the filter movement, e.g., 1, 2, or 3)
Padding ("same" to preserve dimensions, "valid" to reduce)
Activation: Usually ReLU for non-linearity.
Outcome: Feature maps (spatial representation of learned
features).
3. Pooling Layers
Purpose: Reduce spatial dimensions to lower computational
cost and control overfitting.
Types:
Max Pooling: Keeps the largest value in each window.
Average Pooling: Takes the mean of values in the window.
Typical size: 2×2 with stride 2.
4. Dropout Layers (optional)
Purpose: Randomly "turns off" a fraction of neurons during
training to prevent overfitting.
Typical rate: 0.25–0.5.
5. Fully Connected (Dense) Layers
Purpose: Flatten the feature maps into a vector (nD to 1D) and
learn high-level representations.
Activation: Often ReLU in the hidden dense layers, ending with
Softmax (for classification).
6. Output Layer
Purpose: Produces final predictions.
Activation:
Softmax for multi-class classification.
Sigmoid for binary classification.
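To see how the layers above change the data shape, the following sketch traces a 32×32×3 input through a small stack. The stack itself (two convolution + pooling stages with 32 and 64 filters) is a hypothetical example, not an architecture from the slides, and the "same" padding rule assumes an odd kernel size.

```python
def conv_shape(h, w, filters, kernel, stride=1, padding="same"):
    """Output shape (H, W, C) of a convolutional layer.

    Assumes an odd kernel size, so "same" padding adds (kernel - 1) // 2
    zeros on each side; "valid" adds none.
    """
    pad = (kernel - 1) // 2 if padding == "same" else 0
    out_h = (h + 2 * pad - kernel) // stride + 1
    out_w = (w + 2 * pad - kernel) // stride + 1
    return out_h, out_w, filters  # one output channel per filter


def pool_shape(h, w, c, size=2, stride=2):
    """Output shape of a pooling layer without padding; channels are unchanged."""
    return (h - size) // stride + 1, (w - size) // stride + 1, c


shape = (32, 32, 3)                                            # input layer
shape = conv_shape(shape[0], shape[1], filters=32, kernel=3)   # (32, 32, 32)
shape = pool_shape(*shape)                                     # (16, 16, 32)
shape = conv_shape(shape[0], shape[1], filters=64, kernel=3)   # (16, 16, 64)
shape = pool_shape(*shape)                                     # (8, 8, 64)

flattened = shape[0] * shape[1] * shape[2]  # length of the vector fed to the dense layers
print(shape, flattened)  # (8, 8, 64) 4096
```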
11. • Convolution: The term convolution refers to the mathematical combination of two functions to produce a third function.
• Pooling: The objective of pooling is to down-sample an input representation (image, hidden-layer output matrix, etc.), reducing its dimensions and allowing for assumptions to be made about features contained in the sub-regions created.
[Figures: Convolution; Pooling]
12. Fully Connected Layers: Fully connected layers (FCLs) in a neural
network are layers in which every input from one layer is connected to
every activation unit of the next layer.
13. 1. What is a Convolutional Layer?
A Convolutional Layer in a CNN applies small, trainable filters (also called kernels) over an input
image (or feature map) to detect features such as edges, textures, shapes, etc.
2. Key Components
a) Filter (Kernel)
• A small matrix of weights, e.g., 3×3 or 5×5.
• Scans across the image, multiplying values element-wise and summing them up.
• Each filter detects a specific pattern (e.g., vertical edges, curves).
• One convolutional layer typically has many filters (e.g., 32, 64…).
b) Stride
• How many pixels the filter moves at each step.
Stride = 1 → Filter moves one pixel at a time (more overlap, bigger output).
Stride = 2 → Filter moves two pixels at a time (less overlap, smaller output).
c) Padding
Decides what happens at the image borders:
Valid padding → No padding, output shrinks.
Same padding → Pads with zeros so output has the same spatial size as input.
14. 3. How the Process Works
Example Input
A 5×5 grayscale image:
1 1 1 0 0
0 1 1 1 0
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0
Filter (Kernel)
A 3×3 filter:
1 0 1
0 1 0
1 0 1
Step-by-Step (Stride = 1, No Padding)
Place the filter on the top-left corner of the image.
Multiply each filter value with the overlapping image pixel.
Sum the results → This becomes one pixel in the output.
Slide the filter right by stride steps → Repeat until the end of the row.
Move down by stride steps and repeat for the next row.
15. First Position Calculation
Filter on top-left corner:
Image patch       Filter        Multiply & Sum
1 1 1             1 0 1         (1×1)+(1×0)+(1×1) +
0 1 1      ×      0 1 0    =    (0×0)+(1×1)+(1×0) +
0 0 1             1 0 1         (0×1)+(0×0)+(1×1)
= 1 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 = 4
So, the first output pixel = 4.
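The sliding procedure can be checked with a short plain-Python sketch that applies the 3×3 filter to the 5×5 image at every position. The function name `convolve2d` is made up for this example, and it assumes a square image and filter with no padding.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and sum the element-wise products.

    As in CNN libraries, the kernel is not flipped, so strictly speaking
    this is cross-correlation. Assumes square inputs.
    """
    n, f = len(image), len(kernel)
    out_size = (n - f) // stride + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0
            for a in range(f):
                for b in range(f):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

The top-left value 4 matches the hand calculation for the first position.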
16. Output Size Formula
If:
n = input size
f = filter size
p = padding
s = stride
Then output size = ⌊(n + 2p − f) / s⌋ + 1
Example:
Input = 5×5, Filter = 3×3, Stride = 1, Padding = 0:
Output size = (5 + 2×0 − 3) / 1 + 1 = 3
So the output is 3×3.
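The output-size rule can be written as a one-line function, using the standard formula ⌊(n + 2p − f) / s⌋ + 1, where p is the padding added on each side. The function name `output_size` is an illustrative choice.

```python
def output_size(n, f, p=0, s=1):
    """Spatial output size for input size n, filter size f, padding p, stride s."""
    # Integer division implements the floor in the formula.
    return (n + 2 * p - f) // s + 1


print(output_size(5, 3, p=0, s=1))   # 3  (the 5×5 image with a 3×3 filter)
print(output_size(32, 3, p=1, s=1))  # 32 ("same" padding for a 3×3 filter)
print(output_size(5, 2, p=0, s=2))   # 2  (2×2 window, stride 2, no padding)
```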
4. Multiple Filters
If the layer has 32 filters, it produces 32 different feature maps (one per filter), stacked
together as the output.
18. What is a Pooling Layer?
Purpose: Reduce spatial dimensions of feature maps while keeping important
features.
Benefit: Fewer parameters → faster computation → less overfitting.
How: Applies a small window (e.g., 2×2) and replaces it with a single value.
Use the following 5×5 feature map (e.g., the output of a convolution) for the pooling demonstrations:
4 2 1 0 3
3 1 2 3 4
1 0 1 2 3
0 1 3 4 2
2 3 0 1 1
Max Pooling
Operation: Take the maximum value in each window.
Pooling size: 2×2, Stride: 1, Padding: None
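19. Step-by-step:
A 2×2 window with stride 1 fits at 4 positions along each row and column of the 5×5 input, so the output is 4×4.
First Output Row (input rows 1–2)
(1,1) →
4 2
3 1
Max = 4
(1,2) →
2 1
1 2
Max = 2
(1,3) →
1 0
2 3
Max = 3
(1,4) →
0 3
3 4
Max = 4
Output row → [4, 2, 3, 4]
Second Output Row (input rows 2–3)
(2,1) →
3 1
1 0
Max = 3
(2,2) →
1 2
0 1
Max = 2
(2,3) →
2 3
1 2
Max = 3
(2,4) →
3 4
2 3
Max = 4
Output row → [3, 2, 3, 4]
Third Output Row (input rows 3–4)
(3,1) →
1 0
0 1
Max = 1
(3,2) →
0 1
1 3
Max = 3
(3,3) →
1 2
3 4
Max = 4
(3,4) →
2 3
4 2
Max = 4
Output row → [1, 3, 4, 4]
Fourth Output Row (input rows 4–5)
(4,1) →
0 1
2 3
Max = 3
(4,2) →
1 3
3 0
Max = 3
(4,3) →
3 4
0 1
Max = 4
(4,4) →
4 2
1 1
Max = 4
Output row → [3, 3, 4, 4]
Final Max Pooling Output (Stride = 1)
4 2 3 4
3 2 3 4
1 3 4 4
3 3 4 4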
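The stride-1 max pooling walk-through can be verified with a small plain-Python sketch. The function name `max_pool2d` is illustrative. Note that a 2×2 window with stride 1 fits at 4 positions along each row and column of a 5×5 input, so the sketch produces a 4×4 output.

```python
def max_pool2d(feature_map, size=2, stride=1):
    """Max pooling without padding on a 2D feature map (list of rows)."""
    rows, cols = len(feature_map), len(feature_map[0])
    output = []
    for i in range(0, rows - size + 1, stride):
        out_row = []
        for j in range(0, cols - size + 1, stride):
            # Collect the values inside the current window and keep the largest.
            window = [feature_map[i + a][j + b] for a in range(size) for b in range(size)]
            out_row.append(max(window))
        output.append(out_row)
    return output


feature_map = [
    [4, 2, 1, 0, 3],
    [3, 1, 2, 3, 4],
    [1, 0, 1, 2, 3],
    [0, 1, 3, 4, 2],
    [2, 3, 0, 1, 1],
]

pooled = max_pool2d(feature_map, size=2, stride=1)
for row in pooled:
    print(row)
# [4, 2, 3, 4]
# [3, 2, 3, 4]
# [1, 3, 4, 4]
# [3, 3, 4, 4]
```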
20. Max Pooling with padding
• Max pooling with padding and stride = 2 on the same 5×5 feature map, step by step.
• Step 1: Add Padding
• If we use 2×2 pooling with stride = 2 and want to cover all regions including the borders, we can add
zero padding around the matrix.
• Here we pad 1 row on the bottom and 1 column on the right so that the pooling windows fit exactly.
• Padded matrix:
4 2 1 0 3 0
3 1 2 3 4 0
1 0 1 2 3 0
0 1 3 4 2 0
2 3 0 1 1 0
0 0 0 0 0 0
• Size → 6×6
21. Max Pooling without padding
• Pool size: 2×2
• Stride: 2
• Padding: None
• Applied to the same 5×5 feature map.
22. Step 2: Apply Max Pooling (2×2, stride=2)
Row 1 of Output
Window 1 (rows 1–2, cols 1–2):
4 2
3 1
Max = 4
Window 2 (rows 1–2, cols 3–4):
1 0
2 3
Max = 3
(No more columns for another 2×2 window.)
Row 2 of Output
Window 1 (rows 3–4, cols 1–2):
1 0
0 1
Max = 1
Window 2 (rows 3–4, cols 3–4):
1 2
3 4
Max = 4
(No more columns for another window.)
Final Max Pooling Output (stride=2, no padding)
4 3
1 4
So, compared to stride=2 with padding:
The output shrinks from 3×3 to 2×2.
The bottom and right edges are ignored since no padding is added.
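The difference between the padded and unpadded stride-2 cases can be reproduced with the sketch below. The helper names `pad_bottom_right` and `max_pool2d` are hypothetical. Padding with zeros is safe here only because all values in this feature map are non-negative; with negative activations, a very small padding value would be the safer assumption.

```python
def pad_bottom_right(feature_map, amount=1, value=0):
    """Append `amount` columns on the right and `amount` rows at the bottom."""
    padded = [row + [value] * amount for row in feature_map]
    width = len(padded[0])
    padded += [[value] * width for _ in range(amount)]
    return padded


def max_pool2d(feature_map, size=2, stride=2):
    """Max pooling over every window that fits entirely inside the map."""
    row_starts = range(0, len(feature_map) - size + 1, stride)
    col_starts = range(0, len(feature_map[0]) - size + 1, stride)
    return [
        [max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
         for j in col_starts]
        for i in row_starts
    ]


feature_map = [
    [4, 2, 1, 0, 3],
    [3, 1, 2, 3, 4],
    [1, 0, 1, 2, 3],
    [0, 1, 3, 4, 2],
    [2, 3, 0, 1, 1],
]

no_pad = max_pool2d(feature_map)                       # last row/column dropped
with_pad = max_pool2d(pad_bottom_right(feature_map))   # 6×6 input, edges covered

print(no_pad)    # [[4, 3], [1, 4]]
print(with_pad)  # [[4, 3, 4], [1, 4, 3], [3, 1, 1]]
```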
Assignment:
Try all four cases:
stride=1 no padding
stride=2 no padding
stride=1 with padding
stride=2 with padding
Put the results into one comparison chart to see how the output size
and values change across the cases.
23. Average Pooling and Global Pooling
Average Pooling
Operation: Take the average of all values in
each window.
Example: Pool size = 2×2, Stride = 2.
First window:
4 2
3 1
Average = (4+2+3+1) / 4 = 2.5
Average pooling output (no padding, so the last row and column are not covered):
2.5 1.5
0.5 2.5
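Average pooling with a 2×2 window and stride 2 can be sketched in the same way as max pooling, replacing the maximum with the mean. The function name `avg_pool2d` is illustrative. Without padding, only two windows fit along each dimension of the 5×5 map, so this sketch yields a 2×2 result.

```python
def avg_pool2d(feature_map, size=2, stride=2):
    """Average pooling without padding on a 2D feature map (list of rows)."""
    rows, cols = len(feature_map), len(feature_map[0])
    output = []
    for i in range(0, rows - size + 1, stride):
        out_row = []
        for j in range(0, cols - size + 1, stride):
            window = [feature_map[i + a][j + b] for a in range(size) for b in range(size)]
            out_row.append(sum(window) / len(window))
        output.append(out_row)
    return output


feature_map = [
    [4, 2, 1, 0, 3],
    [3, 1, 2, 3, 4],
    [1, 0, 1, 2, 3],
    [0, 1, 3, 4, 2],
    [2, 3, 0, 1, 1],
]

averaged = avg_pool2d(feature_map)
print(averaged)  # [[2.5, 1.5], [0.5, 2.5]]
```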
Global Pooling
Global Max Pooling: Takes the maximum from
the whole feature map (reduces entire map to
1 value per channel).
From our example → max = 4.
Global Average Pooling: Takes the mean of all
values in the feature map.
From our example → sum of all values / total count = 47 / 25 = 1.88.
This is usually applied just before the output
layer in place of a Flatten → Dense layer.
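Both global pooling variants reduce a whole feature map to a single number, which a short sketch makes concrete. The variable names are illustrative; with several channels, the same reduction would be applied to each channel separately.

```python
feature_map = [
    [4, 2, 1, 0, 3],
    [3, 1, 2, 3, 4],
    [1, 0, 1, 2, 3],
    [0, 1, 3, 4, 2],
    [2, 3, 0, 1, 1],
]

# Read all 25 values of the single-channel map into one list.
values = [v for row in feature_map for v in row]

global_max = max(values)                # largest value in the map
global_avg = sum(values) / len(values)  # 47 / 25

print(global_max)  # 4
print(global_avg)  # 1.88
```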
24. What happens between pooling layers and fully connected (dense)
layers?
How does the H × W × Channels output from pooling become the 1D vector that a
fully connected layer expects?
Suppose after the last pooling layer, your output shape is:
4 × 4 × 64
Height = 4
Width = 4
Channels (feature maps) = 64
This is not yet suitable for a fully connected layer, because a
Dense layer expects a 1D vector.
Flattening the Output
Use a Flatten operation to convert the 3D tensor into a 1D vector.
Example:
Before Flatten: 4 × 4 × 64
Number of elements = 4 × 4 × 64 = 1024
After Flatten: [x₁, x₂, x₃, ..., x₁₀₂₄] → Shape: (1024,)
In frameworks:
Keras/TensorFlow: Flatten() layer
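Conceptually, flattening just reads the tensor out in a fixed order. The plain-Python sketch below illustrates the idea on a 4 × 4 × 64 nested list; it is not the Keras implementation, and the tensor contents (a running index) are invented so that each value is easy to track.

```python
height, width, channels = 4, 4, 64

# Hypothetical pooled output: tensor[h][w][c] holds a unique running index.
tensor = [
    [[(h * width + w) * channels + c for c in range(channels)] for w in range(width)]
    for h in range(height)
]

# Flatten in row-major order: height first, then width, then channels.
flat = [value for plane in tensor for row in plane for value in row]

print(len(flat))           # 1024
print(flat[:3], flat[-1])  # [0, 1, 2] 1023
```

No values are changed or dropped; only the shape goes from (4, 4, 64) to (1024,).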
Feeding to Fully Connected Layer
Now the 1D vector becomes the input to the dense layer.
Each element of the vector is connected to every neuron in the FC
layer.
If the first Dense layer has 128 neurons:
Input size = 1024 (from flattening)
Weight matrix size = (1024 × 128)
Bias size = (128)
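The sizes above can be checked with a minimal dense-layer sketch in plain Python. The function name `dense_forward`, the random weight range, and the all-ones input are assumptions made for illustration; a real layer would use trained weights and usually an activation such as ReLU afterwards.

```python
import random


def dense_forward(inputs, weights, biases):
    """Compute outputs[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [
        sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]


n_inputs, n_neurons = 1024, 128

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
weights = [[rng.uniform(-0.01, 0.01) for _ in range(n_neurons)] for _ in range(n_inputs)]
biases = [0.0] * n_neurons
inputs = [1.0] * n_inputs  # stands in for the flattened 4 × 4 × 64 vector

outputs = dense_forward(inputs, weights, biases)

# Trainable parameters: one weight per (input, neuron) pair plus one bias per neuron.
n_params = n_inputs * n_neurons + n_neurons
print(len(outputs), n_params)  # 128 131200
```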
• Why Flatten Is Needed
• Convolutional and pooling layers maintain spatial structure
(H, W, Channels).
• Fully connected layers treat inputs as simple feature lists —
no spatial layout.
• Flatten bridges the gap by reshaping without losing the
learned feature values.
• Pooling → Matrix
Flatten → Vector
Vector → FC layer for final decision-making.
25. Some common architectures of CNN
• LeNet-5
• AlexNet
• VGG 16
• Inception (GoogLeNet)
• ResNet
• DenseNet