AE-ViT: Token Enhancement for Vision Transformers via
CNN-based Autoencoder Ensembles.
Heriniaina Andry RABOANARY, Roland RABOANARY, and Nirina Maurice HASINA TAHIRIDIMBISOA
Equipe d'accueil PNA-PHE, EDPA - Faculté des Sciences, Université d'Antananarivo, Antananarivo, Madagascar
andry.raboanary@gmail.com, r raboanary@yahoo.fr, nirina@aims.ac.za
Abstract. While Vision Transformers (ViTs) have revolutionized computer vision with their exceptional
results, they struggle to balance processing speed with visual detail preservation. This tension becomes
particularly evident when implementing larger patch sizes. Although larger patches reduce computational
costs, they lead to significant information loss during the tokenization process. We present AE-ViT, a novel
architecture that leverages an ensemble of autoencoders to address this issue by introducing specialized
latent tokens that integrate seamlessly with standard patch tokens, enabling ViTs to capture both global
and fine-grained features.
Our experiments on CIFAR-100 show that AE-ViT achieves a 23.67% relative accuracy improvement over
the baseline ViT when using 16×16 patches, effectively recovering fine-grained details typically lost with
larger patches. Notably, AE-ViT maintains competitive performance (60.64%) even at 32×32 patches. We
further validate our method on CIFAR-10, confirming consistent benefits and adaptability across different
datasets.
Ablation studies on ensemble size and integration strategy underscore the robustness of AE-ViT, while
computational analysis shows that its efficiency scales favorably with increasing patch size. Overall, these
findings suggest that AE-ViT provides a practical solution to the patch-size dilemma in ViTs by striking
a balance between accuracy and computational cost, all within a simple, end-to-end trainable design.
Keywords: Vision Transformers, Convolutional Neural Networks, Autoencoders, Hybrid
Architecture, Image Classification, Latent Representation
1 Introduction
Vision Transformers (ViTs) [9] have emerged as a powerful alternative to Convolutional
Neural Networks (CNNs) in computer vision tasks. Adapting the transformer architecture
from natural language processing [26], ViTs process images by de-
composing them into non-overlapping patches and employing self-attention mechanisms
to model relationships across the entire image. This architectural paradigm has achieved
exceptional performance across diverse computer vision tasks [14], presenting a compelling
challenge to the historically dominant CNNs.
However, ViTs face a critical trade-off in patch size selection. The computational com-
plexity of self-attention operations grows quadratically with the number of tokens, mak-
ing smaller patches (e.g., 8×8 pixels) computationally expensive despite their rich feature
representation. Conversely, larger patches (e.g., 16×16 pixels) offer better computational
efficiency but sacrifice fine-grained spatial information [19]. This information loss becomes
particularly evident in tasks requiring detailed feature analysis [4], where the tokenization
process with large patches might lose crucial local patterns.
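To make the trade-off concrete, the short sketch below (an illustrative aid, not code from the paper) counts patch tokens for a 64×64 input and shows how the quadratic self-attention cost grows as patches shrink; the image and patch sizes are assumptions chosen to mirror the experiments reported later.

```python
# Illustrative only: token count and relative self-attention cost
# for a 64x64 image at several assumed patch sizes.
def num_tokens(image_size: int, patch_size: int) -> int:
    """Non-overlapping patch tokens plus one class token."""
    return (image_size // patch_size) ** 2 + 1

for p in (8, 16, 32):
    n = num_tokens(64, p)
    # Self-attention cost scales as O(n^2) in the number of tokens.
    print(f"patch {p:2d}x{p:<2d}: {n:3d} tokens, relative attention cost {n * n}")
```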
Several approaches have been proposed to address this trade-off. Hierarchical de-
signs [27] progressively merge tokens to balance computational cost and feature granularity.
Efficient attention mechanisms [20] reduce complexity through local attention windows.
Hybrid architectures [10] combine CNNs and transformers to leverage their complementary strengths. However, these solutions often introduce significant architectural complex-
ity [12], requiring careful design choices and sophisticated training strategies that may
limit their practical adoption.
In this paper, we introduce AE-ViT, a novel approach that addresses the patch size
dilemma through an ensemble of autoencoders. Our method complements the standard
patch-based tokenization with learned latent representations from convolutional autoen-
coders [1]. These autoencoders, operating at a finer scale than the patch tokens, capture
and preserve local features that would otherwise be lost with large patches. By integrating
these latent representations into the transformer’s token sequence, we enable the model
to simultaneously leverage both the computational benefits of large patches and the fine-
grained feature detection capabilities of CNNs [18]. This hybrid architecture builds upon
the proven strengths of both convolutional [17] and transformer architectures [24], creating
a synergy that effectively compensates for the limitations of large patch tokenization.
Our key contributions can be summarized as follows: (1) We propose a novel hybrid
architecture that leverages an ensemble of autoencoders to compensate for information
loss in large-patch Vision Transformers; (2) We introduce an efficient method for integrat-
ing autoencoder latent representations with transformer tokens, maintaining architectural
simplicity while significantly improving performance; (3) We demonstrate the effectiveness
of our approach through extensive experiments on CIFAR-100 [15], achieving a 23.67%
relative accuracy improvement over the baseline ViT when using 16×16 patches, without
introducing significant computational overhead [21]. We also evaluate our method on
CIFAR-10 [16] to validate its generalization to other datasets. Our results show that AE-ViT
provides a practical solution to the patch size dilemma, particularly valuable in scenarios
where computational efficiency is crucial [9].
2 Background and Related Work
2.1 Vision Transformers
Originally introduced by Dosovitskiy et al. [9], Vision Transformers (ViTs) adapt the
transformer architecture [26] for image processing tasks by treating images as sequences of
patches. This approach employs self-attention mechanisms to capture global dependencies,
yielding remarkable performance across numerous vision applications. The success of the
original ViT has led to multiple improvements, such as DeiT [24], which proposes efficient
training techniques, and CaiT [25], which enhances feature representation by increasing
model depth.
2.2 Efficiency in Vision Transformers
Despite their success, ViTs face computational challenges stemming from self-attention
complexity, which grows quadratically with the number of tokens. Smaller patches yield
more tokens and capture finer details, but with significantly higher computational cost.
To address this, the Swin Transformer [20] introduces local attention windows to reduce
overall complexity, and PVT [27] employs a progressive shrinking pyramid to reduce the
token count in deeper layers. Another approach, TNT [13], processes patch-level and pixel-
level tokens in a nested structure, emphasizing multi-scale feature representation.
2.3 Hybrid CNN–Transformer Approaches
Beyond pure transformer architectures, a growing body of research integrates convolu-
tional neural networks (CNNs) into ViTs to capitalize on their complementary strengths.
ConViT [7] introduces soft convolutional inductive biases to refine local feature model-
ing, while LeViT [10] interleaves convolutional blocks for efficient early-stage processing.
CvT [28] demonstrates the effectiveness of convolutional token embeddings, and early
convolution layers [29] have been shown to be crucial for robust vision transformer per-
formance. Meanwhile, CoAtNet [6] unifies depthwise convolutions and attention layers
for scalable performance across different input sizes, and MobileViT [22] incorporates
lightweight CNN blocks into ViTs to facilitate mobile-friendly inference. In parallel, Guo
et al. [11] highlight that CNN-based approaches can significantly reduce the computational
costs of vision transformers. These hybrid efforts underscore the synergy between local re-
ceptive fields (CNN-like) and global self-attention (transformer-like), paving the way for
more efficient and effective models.
2.4 Autoencoders in Vision Tasks
Autoencoder architectures [1] have shown remarkable success in learning compact, mean-
ingful representations, from dimensionality reduction to feature learning [2]. Their ability
to preserve essential information while reducing spatial dimensionality is particularly use-
ful in tasks requiring fine-grained detail [8]. By reconstructing input data, autoencoders
can capture local and global structure, making them relevant for scenarios where large
image patches risk losing crucial spatial information.
2.5 Our Approach
Our work bridges these lines of research by introducing an ensemble of CNN-based au-
toencoders to enhance transformer-based vision models. While previous methods have
incorporated CNN blocks within a transformer (Section 2.3) or tackled patch-level effi-
ciency (Section 2.2), our method uniquely leverages autoencoder-based latent representa-
tions to compensate for information loss in large-patch ViTs. Instead of modifying the
patch embedding or internal attention blocks, we integrate a learnable latent “AE token”
that complements the standard patch tokens. This design aims to reconcile the efficiency
benefits of large patches with the need to preserve fine-grained details, offering a novel
solution to the efficiency–accuracy trade-off in modern vision transformers.
3 Method
3.1 Overview
Fig. 1 presents an overview of our ensemble architecture. For clarity, we show the case of
four autoencoders. The ensemble's latent token is plugged into the transformer as an
additional token.
3.2 Autoencoder Ensemble Design
Each autoencoder in our ensemble follows a convolutional architecture optimized for cap-
turing fine-grained features. The encoder pathway consists of three downsampling blocks
that progressively reduce spatial dimensions from 64×64 to 8×8 while increasing the fea-
ture channels. Specifically, we use strided convolutions with a kernel size of 4×4 and stride
Fig. 1. Overview of the ensemble architecture (for four autoencoders).
2, followed by batch normalization and ReLU activation. The feature dimensions evolve
as follows:
$3 \xrightarrow{\text{conv}} 32 \xrightarrow{\text{conv}} 64 \xrightarrow{\text{conv}} 128 \qquad (1)$
The final feature map is flattened and projected to a latent space of dimension 256
through a fully connected layer. This results in a compression ratio of:
$\text{compression ratio} = \dfrac{64 \times 64 \times 3}{256} = 48{:}1 \qquad (2)$
The decoder mirrors this structure with transposed convolutions to progressively re-
construct the spatial dimensions:
$128 \xrightarrow{\text{convT}} 64 \xrightarrow{\text{convT}} 32 \xrightarrow{\text{convT}} 3 \qquad (3)$
To prevent overfitting and encourage robust feature learning, we employ L1 and L2
regularization on the latent space:
$\mathcal{L}_{\text{reg}} = \lambda_1 \lVert z \rVert_1 + \lambda_2 \lVert z \rVert_2^2 \qquad (4)$
where z represents the latent vector and λ1 = λ2 = 0.01 are regularization coefficients.
The total loss for each autoencoder is:
$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{reg}} \qquad (5)$
where Lrecon is the mean squared error between the input and reconstructed images.
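A minimal PyTorch sketch of one ensemble member, written to match the dimensions stated above (64×64×3 input, three strided 4×4 convolutions, a 256-dimensional latent space, and a mirrored decoder), is shown below; the padding values and exact layer ordering are assumptions, as the paper does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAutoencoder(nn.Module):
    """Sketch of one ensemble member: 64x64x3 -> 256-d latent -> 64x64x3."""
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        # Encoder: 3 -> 32 -> 64 -> 128 channels, spatial 64 -> 32 -> 16 -> 8
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
        )
        self.to_latent = nn.Linear(128 * 8 * 8, latent_dim)
        self.from_latent = nn.Linear(latent_dim, 128 * 8 * 8)
        # Decoder mirrors the encoder with transposed convolutions.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.to_latent(self.encoder(x).flatten(1))                  # latent vector z
        x_hat = self.decoder(self.from_latent(z).view(-1, 128, 8, 8))   # reconstruction
        return x_hat, z

def autoencoder_loss(x, x_hat, z, lam1=0.01, lam2=0.01):
    """Reconstruction loss plus L1/L2 regularization of the latent code (Eqs. 4-5)."""
    recon = F.mse_loss(x_hat, x)
    reg = lam1 * z.abs().sum(dim=1).mean() + lam2 * (z ** 2).sum(dim=1).mean()
    return recon + reg
```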
3.3 Latent Token Integration
The main innovation of our approach lies in how we integrate the autoencoder ensemble’s
latent representations with the Vision Transformer’s patch tokens. Given an input image
$X \in \mathbb{R}^{H \times W \times 3}$, the process follows three parallel paths that merge in the transformer:
Patch Tokenization The input image is divided into non-overlapping patches of size
$s_p \times s_p$, resulting in $N = (H/s_p) \times (W/s_p)$ patches, where $s_p$ denotes the patch size in
pixels. Each patch is linearly projected to dimension $D$ through a learnable embedding
matrix $E \in \mathbb{R}^{(s_p \cdot s_p \cdot 3) \times D}$:
$t_{\text{patch}} = [x_p^1 E;\; x_p^2 E;\; \ldots;\; x_p^N E] \in \mathbb{R}^{N \times D} \qquad (6)$
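In code, this patch projection is often realized as a strided convolution whose kernel and stride equal the patch size; the sketch below is one such assumed implementation (the embedding dimension D = 192 is an illustrative choice), mathematically equivalent to flattening each patch and multiplying by E.

```python
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Non-overlapping patch tokenization: (B, 3, H, W) -> (B, N, D)."""
    def __init__(self, patch_size: int = 16, dim: int = 192):
        super().__init__()
        # A stride-p convolution with a pxp kernel applies the same linear map E
        # to every non-overlapping patch.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, D, H/p, W/p) -> (B, N, D)
```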
Latent Encoding Simultaneously, our ensemble of autoencoders (five in this configuration)
processes the input image, each producing a latent vector $z_i \in \mathbb{R}^{256}$. These latent vectors are concatenated:
$z_{\text{concat}} = [z_1; z_2; z_3; z_4; z_5] \in \mathbb{R}^{1280} \qquad (7)$
This concatenated representation is then projected to the transformer's embedding
dimension $D$ through a learnable projection $W_{\text{proj}} \in \mathbb{R}^{1280 \times D}$:
$t_{\text{latent}} = z_{\text{concat}} W_{\text{proj}} \in \mathbb{R}^{1 \times D} \qquad (8)$
Token Sequence Formation The final sequence presented to the transformer concate-
nates the class token tcls, patch tokens, and latent token:
$T = [t_{\text{cls}}; t_{\text{patch}}; t_{\text{latent}}] \in \mathbb{R}^{(N+2) \times D} \qquad (9)$
Position embeddings P are added to this sequence to maintain positional information:
$T_{\text{final}} = T + P \qquad (10)$
For the final classification, we utilize both the class token and the latent token, con-
catenating their representations after the transformer processing:
$y = \text{MLP}([h_{\text{cls}}; h_{\text{latent}}]) \qquad (11)$
where $h_{\text{cls}}$ and $h_{\text{latent}}$ are the transformed representations of the respective tokens.
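The sketch below illustrates how Eqs. (7)–(11) could be wired together; the transformer encoder `vit_blocks`, the embedding dimension, and the class names are assumptions, and the autoencoders are taken to return (reconstruction, latent) pairs as in the earlier sketch.

```python
import torch
import torch.nn as nn

class AEViT(nn.Module):
    """Sketch of latent-token integration (Eqs. 7-11); all names are illustrative."""
    def __init__(self, autoencoders, patch_embed, vit_blocks,
                 dim=192, num_patches=16, num_classes=100):
        super().__init__()
        self.autoencoders = nn.ModuleList(autoencoders)    # frozen, pretrained ensemble
        self.patch_embed = patch_embed                     # e.g. the PatchEmbedding sketch above
        self.blocks = vit_blocks                           # a standard transformer encoder
        self.proj = nn.Linear(256 * len(autoencoders), dim)          # W_proj
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                  nn.Linear(dim, num_classes))       # MLP over [h_cls; h_latent]

    def forward(self, x):
        patch_tokens = self.patch_embed(x)                 # (B, N, D)
        with torch.no_grad():                              # ensemble is frozen in phase 2
            z = torch.cat([ae(x)[1] for ae in self.autoencoders], dim=1)   # (B, 256*K)
        t_latent = self.proj(z).unsqueeze(1)               # (B, 1, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, patch_tokens, t_latent], dim=1) + self.pos_embed
        h = self.blocks(tokens)
        return self.head(torch.cat([h[:, 0], h[:, -1]], dim=1))  # class + latent tokens -> logits
```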
3.4 Training Strategy
Our training process follows a two-phase approach designed to maximize the complemen-
tary strengths of both the autoencoder ensemble and the Vision Transformer.
Phase 1: Autoencoder Ensemble Pretraining We train each autoencoder indepen-
dently on different random subsets of the training data. Each autoencoder sees different
parts of the training data, randomly sampled with a fixed seed to ensure reproducibil-
ity. The objective function for each autoencoder combines reconstruction loss with latent
space regularization:
$\mathcal{L}_{\text{AE}} = \mathcal{L}_{\text{recon}} + \lambda_1 \lVert z \rVert_1 + \lambda_2 \lVert z \rVert_2^2 \qquad (12)$
where λ1 = λ2 = 0.01. This phase runs for 5 to 20 epochs, depending on the experimental
conditions, using the Adam optimizer with a learning rate of 10−3.
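A sketch of this pretraining phase, reusing the `ConvAutoencoder` and `autoencoder_loss` outlined in Section 3.2, might look as follows; the subset fraction and epoch count are assumptions.

```python
import torch
from torch.utils.data import DataLoader, Subset

def pretrain_ensemble(dataset, num_aes=4, subset_frac=0.5, epochs=20, device="cuda"):
    """Phase 1 (sketch): each autoencoder trains on its own random data subset."""
    gen = torch.Generator().manual_seed(0)                 # fixed seed for reproducibility
    ensemble = []
    for _ in range(num_aes):
        idx = torch.randperm(len(dataset), generator=gen)[: int(subset_frac * len(dataset))]
        loader = DataLoader(Subset(dataset, idx.tolist()), batch_size=64, shuffle=True)
        ae = ConvAutoencoder().to(device)
        opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
        for _ in range(epochs):
            for x, _ in loader:
                x = x.to(device)
                x_hat, z = ae(x)
                loss = autoencoder_loss(x, x_hat, z)       # Eq. (12)
                opt.zero_grad(); loss.backward(); opt.step()
        ensemble.append(ae.eval())
    return ensemble
```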
Phase 2: End-to-End Training After pretraining, we freeze the autoencoder param-
eters and train the complete AE-ViT architecture end-to-end. The loss function for this
phase is:
$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda_{\text{wd}} \lVert \theta \rVert_2^2 \qquad (13)$
where LCE is the cross-entropy loss with label smoothing (α = 0.1), and λwd = 0.05 is
the weight decay coefficient. We employ the AdamW optimizer with the following specifics:
– Learning rate: 10−3
– Batch size: 64
– Training epochs: 100
– No early stopping; the model from the final epoch is used for evaluation
– Cosine learning rate scheduling
– Gradient clipping at norm 1.0
The complete training process can be formalized as:
$\theta^{*} = \arg\min_{\theta}\; \mathbb{E}_{(x,y) \sim \mathcal{D}}\!\left[\mathcal{L}_{\text{total}}(\text{AE-ViT}_{\theta}(x),\, y)\right] \qquad (14)$
where D represents the training dataset and θ are the model parameters.
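The recipe above can be summarized in the following sketch, where `AEViT` stands for the full model with its frozen ensemble exposed as `model.autoencoders` (an assumed attribute name, matching the earlier sketch); the explicit weight-decay term of Eq. (13) is realized through AdamW's decoupled weight decay.

```python
import torch
import torch.nn as nn

def train_ae_vit(model, train_loader, epochs=100, device="cuda"):
    """Phase 2 (sketch): end-to-end training with the autoencoder ensemble frozen."""
    for p in model.autoencoders.parameters():
        p.requires_grad = False                            # freeze the pretrained ensemble
    model.to(device).train()
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing alpha = 0.1
    opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad),
                            lr=1e-3, weight_decay=0.05)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            loss = criterion(model(x), y)
            opt.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
            opt.step()
        sched.step()
    return model   # no early stopping: the final-epoch model is used for evaluation
```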
4 Experiments
4.1 Experimental Setup
We conduct extensive experiments on two standard benchmarks: CIFAR-100 [15], which
consists of 50,000 training images and 10,000 test images across 100 classes, and
CIFAR-10 [16]. All images are resized to 64×64 pixels to accommodate our patch-based
architecture. We use standard data augmentation, including random horizontal flips and
AutoAugment [5]. Training uses the AdamW optimizer with learning rate 10−3 and weight
decay 0.05 for 100 epochs with cosine scheduling. Implementation details are provided in
our publicly available code.
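A possible torchvision data pipeline matching this description is sketched below; the normalization statistics are assumed values, not taken from the paper.

```python
import torchvision.transforms as T
from torchvision.datasets import CIFAR100

train_tf = T.Compose([
    T.Resize(64),                                  # upsample 32x32 CIFAR images to 64x64
    T.RandomHorizontalFlip(),
    T.AutoAugment(T.AutoAugmentPolicy.CIFAR10),    # AutoAugment policy [5]
    T.ToTensor(),
    T.Normalize((0.5071, 0.4865, 0.4409),          # CIFAR-100 statistics (assumed values)
                (0.2673, 0.2564, 0.2762)),
])
train_set = CIFAR100(root="./data", train=True, download=True, transform=train_tf)
```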
4.2 Main Results
Performance on CIFAR-100. Table 1 presents our main results with various patch
sizes:
Table 1. Classification accuracy (%) on CIFAR-100
Model Patch Size FLOPs (M) Accuracy
ViT 8×8 287.6 61.23
ViT 16×16 78.4 49.94
ViT 32×32 19.6 35.86
AE-ViT (Ours) 16×16 187.0 61.76
AE-ViT (Ours) 32×32 128.4 60.64
With 16×16 patches, our approach not only outperforms the baseline ViT, achieving a
23.67% relative accuracy improvement, but also surpasses the more computationally
intensive 8×8 patch configuration. Most notably, with 32×32 patches, AE-ViT maintains
reasonable performance (60.64%) while the baseline ViT degrades severely (35.86%), a
remarkable relative improvement of 69%.
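For reference, the two headline relative improvements follow directly from Table 1:

$\dfrac{61.76 - 49.94}{49.94} \approx 23.67\%, \qquad \dfrac{60.64 - 35.86}{35.86} \approx 69.1\%$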
Scaling to CIFAR-10. To validate the generality of our approach, we conduct additional
experiments on CIFAR-10. Results in Table 2 show that AE-ViT maintains its effectiveness:
Table 2. Classification accuracy (%) on CIFAR-10
Model Patch Size Accuracy
ViT 8×8 82.91
ViT 16×16 76.37
AE-ViT (Ours) 16×16 84.35
AE-ViT again outperforms both baselines, surpassing even the 8×8 configuration while using 16×16 patches, confirming that the approach transfers to CIFAR-10.
4.3 Ablation Studies
Impact of Ensemble Size. We conduct a systematic study of ensemble size impact:
Table 3. Impact of number of autoencoders (CIFAR-100)
#AEs Test Acc. Train Acc. FLOPs (M)
1 59.11 86.28 106
2 60.73 87.15 133
3 61.00 88.38 160
4 61.76 88.90 187
5 61.75 89.31 214
6 61.27 89.08 241
The results reveal an optimal configuration at 4 autoencoders, beyond which perfor-
mance plateaus or slightly degrades. This suggests that while ensemble diversity is benefi-
cial, there exists a sweet spot balancing performance and computational cost. However,
more experiments under varied conditions (datasets, patch sizes, transformer
hyperparameters) are needed before this configuration can be generalized.
Cross-dataset Training. We explore enhancing the ensemble with an autoencoder
trained on CIFAR-10. This configuration achieves 61.94% accuracy on CIFAR-100, sug-
gesting potential benefits from cross-dataset knowledge transfer, albeit with diminishing
returns compared to the computational overhead.
Conclusion. These results indicate that adding more than 4 autoencoders is not beneficial
for the AE-ViT architecture. The use of 4 autoencoders represents an optimal balance
between accuracy, generalization, and computational efficiency.
4.4 Efficiency Analysis
While our primary results demonstrate AE-ViT’s superior accuracy over the baseline ViT
with 16×16 patches, a more insightful comparison emerges when we consider the baseline
ViT with 8×8 patches, which achieves similar performance levels. This comparison is
particularly interesting as it addresses the trade-off between accuracy and computational
efficiency.
Table 4 presents a detailed comparison:
Table 4. Efficiency comparison between ViT (8×8) and AE-ViT (16×16)
Model FLOPs (M) #Tokens Test Acc. (%)
ViT (8×8) 287.6 65 61.23
AE-ViT (16×16) 187.0 18 61.76
Relative Difference -34.9% -72.3% +0.87%
4.5 Comparison with State-of-the-Art
While our primary focus is on addressing the patch size dilemma in Vision Transformers,
it is insightful to position AE-ViT within the broader context of modern architectures.
Table 5 presents a comparison with Swin Transformer [20], a leading hierarchical vision
transformer:
Table 5. Comparison with Swin Transformer on CIFAR-100
Model Accuracy (%) Params (M) FLOPs (M)
Swin-T [3] 78.41 28.3 4500
AE-ViT (4 AEs) 61.76 12.4 187
While Swin Transformer achieves higher accuracy, AE-ViT offers significantly better
computational efficiency.
This efficiency gap becomes even more significant when considering higher resolution
images. For example, with 1080p images (1920×1080):
Table 6. Theoretical scaling to 1080p images
Model Memory FLOPs (G)
Swin-T O(HW) 284.4
AE-ViT O(HW/P²) 11.8
This scaling advantage stems from our efficient use of large patches (16×16) com-
bined with the fixed-cost autoencoder ensemble. While Swin Transformer’s computational
requirements grow quadratically with image size, AE-ViT maintains better efficiency, mak-
ing it particularly suitable for high-resolution applications where computational resources
are constrained.
Returning to the comparison with the 8×8 baseline (Table 4), AE-ViT achieves comparable
(in fact slightly better) accuracy while requiring about 35% fewer FLOPs. This efficiency gain stems primarily from two factors:
1) Token Efficiency: AE-ViT processes only 18 tokens (16 patch tokens + 1 CLS token
+ 1 latent token) compared to 65 tokens in the 8×8 ViT, resulting in a 72.3% reduction
in the self-attention computational load.
2) Computational Distribution: While AE-ViT introduces additional computation through
its autoencoder ensemble (136.3M FLOPs), this is more than offset by the reduced trans-
former complexity (78.4M FLOPs vs 287.6M FLOPs).
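Both figures can be checked directly from the tokenization of Section 3.3 and the FLOP counts in Table 4; the snippet below is purely illustrative.

```python
# Illustrative check of Table 4 for 64x64 inputs.
tokens_vit8 = (64 // 8) ** 2 + 1        # 64 patch tokens + class token = 65
tokens_aevit16 = (64 // 16) ** 2 + 2    # 16 patch tokens + class + latent token = 18
print(f"token reduction: {1 - tokens_aevit16 / tokens_vit8:.1%}")   # ~72.3%
print(f"FLOP reduction:  {1 - 187.0 / 287.6:.1%}")                  # ~35%
```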
The memory footprint also favors AE-ViT, as the attention mechanism’s quadratic
memory scaling with respect to the number of tokens (O(n²)) makes the reduced token
count particularly significant. This demonstrates that our approach not only bridges the
performance gap of larger patches but does so in a computationally efficient manner.
5 Discussion
Our experimental results demonstrate several key findings about AE-ViT and provide
insights into the trade-offs between patch size, computational efficiency, and model per-
formance.
5.1 Optimal Ensemble Configuration
The systematic study of autoencoder ensemble size reveals a clear optimization pattern.
Starting from a single autoencoder (59.11%), we observe significant improvements with
each additional autoencoder up to four (61.76%), followed by diminishing returns with
five autoencoders (61.75%) and performance degradation with six (61.27%). This pattern
suggests that:
– The ensemble approach is fundamentally sound, with even two autoencoders outperforming a single autoencoder by 1.62 percentage points
– Four autoencoders represent an optimal balance between performance and complexity
– Additional autoencoders beyond four may introduce unnecessary redundancy or noise
5.2 Scaling with Patch Size
Perhaps the most striking result is AE-ViT’s ability to maintain performance with larger
patch sizes:
– With 16×16 patches, AE-ViT (61.76%) matches, and slightly exceeds, the accuracy of standard ViT with 8×8 patches (61.23%) while using 35% fewer FLOPs
– With 32×32 patches, AE-ViT (60.64%) demonstrates remarkable resilience compared
to the baseline (35.86%), achieving a 69% relative improvement
This scaling behavior suggests that our autoencoder ensemble effectively compensates for
the information loss in larger patches, potentially offering a pathway to processing high-
resolution images efficiently.
5.3 Cross-Dataset Insights
Our experiments with cross-dataset training, where we incorporate an autoencoder trained
on CIFAR-10 into the ensemble, yield several insights:
– The mixed ensemble (61.94%) slightly outperforms the pure CIFAR-100 ensemble
(61.76%)
– The improvement, while modest, suggests potential benefits from diverse training data
– The approach maintains effectiveness across datasets, as demonstrated by strong per-
formance on CIFAR-10 (84.35%)
5.4 Computational Efficiency
Comparing AE-ViT with state-of-the-art models like Swin Transformer reveals an inter-
esting efficiency-accuracy trade-off:
– While Swin-T achieves higher accuracy (78.41%), it requires 24× more FLOPs
– AE-ViT’s efficiency advantage grows with image resolution due to fixed autoencoder
costs
– The parameter count remains modest (12.4M vs 28.3M for Swin-T)
5.5 Limitations
Our approach, while effective, has several notable limitations:
– Current architecture depends on a fixed number of autoencoders, lacking adaptability
to varying computational constraints
– Performance on high-resolution images remains untested, particularly for resolutions
beyond 64×64
– The approach has only been validated on classification tasks, not on more complex
vision tasks
– Limited to static image processing, without consideration for temporal features
– Generalization capabilities have only been tested on relatively small datasets (CIFAR-
100, CIFAR-10)
5.6 Future Work
Several promising directions for future research emerge:
Architecture Improvements
– Develop dynamic ensemble selection mechanisms that adapt to specific domains and
computational constraints
– Explore alternative autoencoder architectures for enhanced feature extraction
– Investigate integration with state-of-the-art transformer variants
– Explore the cross-dataset training described in Section 4.3 for further knowledge transfer
– Use various types of autoencoders to increase the robustness of transformers, aiming to improve on the results reported in [23]
Scaling and Performance
– Validate performance on high-resolution images (1080p, 4K)
– Test scalability with larger datasets (ImageNet, Places365)
– Optimize implementation for specific hardware accelerators (GPUs, TPUs)
Extended Applications
– Adapt the architecture for dense prediction tasks (segmentation, detection)
– Extend to video processing by incorporating temporal information
– Explore transfer learning for specialized domains (medical imaging, satellite imagery)
6 Conclusion
In this paper, we introduced AE-ViT, a novel approach that effectively addresses the
patch-size dilemma in Vision Transformers through an ensemble of autoencoders. Our
method achieves a 23.67% relative improvement over the baseline ViT on CIFAR-100
with 16×16 patches, while using significantly fewer computational resources than models
that rely on smaller patches. Through extensive experimentation, we identified an optimal
configuration of four autoencoders that balances performance and efficiency.
The effectiveness of AE-ViT is particularly evident in its ability to maintain strong
performance with large patches, achieving 60.64% accuracy with 32×32 patches compared
to the baseline’s 35.86%. This capability, combined with its strong showing on CIFAR-10
(84.35%) and efficient scaling properties, demonstrates the potential of our approach for
high-resolution image processing applications. Most importantly, AE-ViT offers a practical
solution for scenarios requiring efficient vision processing, providing a favorable trade-off
between accuracy and computational cost. While more computationally intensive models
like Swin Transformer achieve higher absolute accuracy, AE-ViT’s efficiency (24× fewer
FLOPs) makes it particularly attractive for resource-constrained settings or real-time pro-
cessing of high-resolution images.
In line with our findings, promising avenues for further research include exploring dy-
namic ensemble selection mechanisms to adapt AE-ViT to specific domains, developing
more sophisticated autoencoder architectures for enhanced feature extraction, and scal-
ing to larger datasets like ImageNet or higher resolutions (1080p and beyond). Extending
AE-ViT to dense prediction tasks (e.g., segmentation or detection) and time-series data
(e.g., video) would also be valuable, as would investigating transfer learning for special-
ized domains such as medical imaging or satellite imagery. These extensions may reinforce
AE-ViT’s robustness, accelerate its adoption in real-world scenarios, and deepen our un-
derstanding of hybrid autoencoder–transformer models.
7 Acknowledgments
We thank the anonymous reviewers for their valuable feedback. We also thank ISPM
(https://ispm-edu.com/) for funding our research.
References
1. Baldi, P.: Autoencoders, unsupervised learning, and deep architectures. Proceedings of ICML Work-
shop on Unsupervised and Transfer Learning (2012)
2. Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. IEEE
Transactions on Pattern Analysis and Machine Intelligence 35(8), 1798–1828 (2013)
3. Chen, C., Derakhshani, M.M., Liu, Z., Fu, J., Shi, Q., Xu, X., Yuan, L.: On the vision transformer
scaling: Parameter scaling laws and improved training strategies. arXiv preprint arXiv:2204.08476
(2022)
4. Chen, C.F., Fan, Q., Panda, R.: Crossvit: Cross-attention multi-scale vision transformer for image
classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
(2021)
5. Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: Autoaugment: Learning augmentation
strategies from data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). pp. 113–123 (2019). https://doi.org/10.1109/CVPR.2019.00020
6. Dai, Z., Liu, H., Le, Q.V., Tan, M.: CoAtNet: Marrying Convolution and Attention for All Data Sizes.
In: Advances in Neural Information Processing Systems. vol. 34, pp. 3965–3977 (2021)
7. d’Ascoli, S., Touvron, H., Leavitt, M., Morcos, A., Biroli, G., Sagun, L.: Convit: Improving vision
transformers with soft convolutional inductive biases. In: Proceedings of the International Conference
on Machine Learning (ICML) (2021)
8. Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B.: Attention in atten-
tion: Modeling context correlation for efficient vision transformers. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision (ICCV) (2021)
9. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M.,
Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image
recognition at scale. In: International Conference on Learning Representations (ICLR) (2021)
10. Graham, B., El-Nouby, A., Touvron, H., Stock, P., Joulin, A., Jégou, H., Douze, M.: Levit: A vision
transformer in convnet’s clothing for faster inference. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision (ICCV) (2021)
11. Guo, J., Han, K., Wu, H., Tang, Y., Chen, X., Wang, Y., Xu, C.: Cmt: Convolutional neural networks
meet vision transformers (2022), https://arxiv.org/abs/2107.06263
12. Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., et al.: A
survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
13. Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. In: Advances in
Neural Information Processing Systems (NeurIPS) (2021)
14. Khan, S., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S., Shah, M.: Transformers in vision: A survey.
ACM Computing Surveys (2021)
15. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. rep., University
of Toronto (2009)
16. Krizhevsky, A., Nair, V., Hinton, G.: CIFAR-10 (Canadian Institute for Advanced Research). http://www.cs.toronto.edu/~kriz/cifar.html (2009), dataset
17. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural
networks. Advances in Neural Information Processing Systems 25 (2012)
18. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
19. Liu, K., Zhang, W., Tang, K., Li, Y., Cheng, J., Liu, Q.: A survey of vision transformer. IEEE
Transactions on Pattern Analysis and Machine Intelligence (2022)
20. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical
vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision (ICCV) (2021)
21. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Efficient transformers: A survey.
ACM Computing Surveys (2021)
22. Mehta, S., Rastegari, M.: MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision
Transformer. In: International Conference on Learning Representations (ICLR) (2022), https://openreview.net/forum?id=7RkGY0FtwrY
23. Raboanary, H.A., Raboanary, R., Tahiridimbisoa, N.H.M.: Robustness assessment of neural
network architectures to geometric transformations: A comparative study with data augmen-
tation. In: 2023 3rd International Conference on Electrical, Computer, Communications and
Mechatronics Engineering (ICECCME). IEEE, Tenerife, Canary Islands, Spain (Jul 2023).
https://doi.org/10.1109/ICECCME57830.2023.10253075
24. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient
image transformers & distillation through attention. In: Proceedings of the International Conference
on Machine Learning (ICML) (2021)
25. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image trans-
formers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
(2021)
26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems.
vol. 30 (2017)
27. Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision
transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
28. Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., Zhang, L.: Cvt: Introducing convolutions to
vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision
(ICCV) (2021)
29. Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., Girshick, R.: Early convolutions help trans-
formers see better. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)
International Journal of Artificial Intelligence and Applications (IJAIA), Vol.16, No.1, January 2025
58

More Related Content

PDF
How is a Vision Transformer (ViT) model built and implemented?
PDF
leewayhertz.com-HOW IS A VISION TRANSFORMER MODEL ViT BUILT AND IMPLEMENTED.pdf
PDF
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
PPTX
Presentation vision transformersppt.pptx
PPTX
SLIDES OF LECTURE ABOUT TRANSFORMERS FOR VISION TASKS
PPTX
Classification of xray images using vision transformers
PDF
“How Transformers Are Changing the Nature of Deep Learning Models,” a Present...
PDF
unlocking-the-future-an-introduction-to-vision-transformers-202410100758143pD...
How is a Vision Transformer (ViT) model built and implemented?
leewayhertz.com-HOW IS A VISION TRANSFORMER MODEL ViT BUILT AND IMPLEMENTED.pdf
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
Presentation vision transformersppt.pptx
SLIDES OF LECTURE ABOUT TRANSFORMERS FOR VISION TASKS
Classification of xray images using vision transformers
“How Transformers Are Changing the Nature of Deep Learning Models,” a Present...
unlocking-the-future-an-introduction-to-vision-transformers-202410100758143pD...

Similar to AE-ViT: Token Enhancement for Vision Transformers via CNN-Based Autoencoder Ensembles (20)

PDF
Transformer models for FER
PDF
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
PDF
BriefHistoryTransformerstransformers.pdf
PDF
[DSC Europe 23] Ivan Biliskov - Seeing Through the Lens of Transformers: A Ne...
PDF
CV18_Vision Transformers.pdf
PDF
ViT (Vision Transformer) Review [CDM]
PDF
“Implementing Transformer Neural Networks for Visual Perception on Embedded D...
PDF
A Survey on Vision Transformer.pdf
PPTX
State of transformers in Computer Vision
PPTX
Transformer in Vision
PPTX
ViT.pptx
PPTX
Transformers in vision and its challenges and comparision with CNN
PPTX
Enhancing Vision Models for Fine-Grained Classification
PPTX
Deep Learning for Image Processing on 16 June 2025 MITS.pptx
PDF
Visual Transformers
PPTX
vision transformer siêu cấp vip ro vũ trụ
PDF
Smadav Pro 15.2.2 Crack Plus Serial Key Free Download
PDF
Transformer based approaches for visual representation learning
PPTX
tech_seminar_ppt on vision transformers.pptx
PDF
[Paper] Multiscale Vision Transformers(MVit)
Transformer models for FER
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
BriefHistoryTransformerstransformers.pdf
[DSC Europe 23] Ivan Biliskov - Seeing Through the Lens of Transformers: A Ne...
CV18_Vision Transformers.pdf
ViT (Vision Transformer) Review [CDM]
“Implementing Transformer Neural Networks for Visual Perception on Embedded D...
A Survey on Vision Transformer.pdf
State of transformers in Computer Vision
Transformer in Vision
ViT.pptx
Transformers in vision and its challenges and comparision with CNN
Enhancing Vision Models for Fine-Grained Classification
Deep Learning for Image Processing on 16 June 2025 MITS.pptx
Visual Transformers
vision transformer siêu cấp vip ro vũ trụ
Smadav Pro 15.2.2 Crack Plus Serial Key Free Download
Transformer based approaches for visual representation learning
tech_seminar_ppt on vision transformers.pptx
[Paper] Multiscale Vision Transformers(MVit)
Ad

More from gerogepatton (20)

PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
PDF
Performance Evaluation of Block-Sized Algorithms for Majority Vote in Facial ...
PDF
3rd International Conference on Artificial Intelligence and IoT (AIIoT 2025)
PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
PDF
3rd International Conference on Artificial Intelligence and IoT (AIIoT 2025)
PDF
Augmented and Synthetic Data in Artificial Intelligence
PDF
3rd International Conference on AI, Data Mining and Data Science (AIDD 2025)
PDF
July 2025 - Top 10 Read Articles in Artificial Intelligence and Applications ...
PDF
6th International Conference on Natural Language Processing and Computational...
PDF
From Insight to Impact: The Evolution of Data-Driven Decision Making in the A...
PDF
6th International Conference on Artificial Intelligence and Machine Learning ...
PDF
3rd International Conference on Artificial Intelligence and IoT (AIIoT 2025)
PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
PDF
AI-Driven Vulnerability Analysis in Smart Contracts: Trends, Challenges and F...
PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
PDF
6th International Conference on Artificial Intelligence and Machine Learning ...
PDF
A Thorough Introduction to Multimodal Machine Translation
PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
PDF
6th International Conference on Advanced Machine Learning (AMLA 2025)
PDF
OWE-CVD: An Optimized Weighted Ensemble for Heart Disease Prediction
International Journal of Artificial Intelligence & Applications (IJAIA)
Performance Evaluation of Block-Sized Algorithms for Majority Vote in Facial ...
3rd International Conference on Artificial Intelligence and IoT (AIIoT 2025)
International Journal of Artificial Intelligence & Applications (IJAIA)
3rd International Conference on Artificial Intelligence and IoT (AIIoT 2025)
Augmented and Synthetic Data in Artificial Intelligence
3rd International Conference on AI, Data Mining and Data Science (AIDD 2025)
July 2025 - Top 10 Read Articles in Artificial Intelligence and Applications ...
6th International Conference on Natural Language Processing and Computational...
From Insight to Impact: The Evolution of Data-Driven Decision Making in the A...
6th International Conference on Artificial Intelligence and Machine Learning ...
3rd International Conference on Artificial Intelligence and IoT (AIIoT 2025)
International Journal of Artificial Intelligence & Applications (IJAIA)
AI-Driven Vulnerability Analysis in Smart Contracts: Trends, Challenges and F...
International Journal of Artificial Intelligence & Applications (IJAIA)
6th International Conference on Artificial Intelligence and Machine Learning ...
A Thorough Introduction to Multimodal Machine Translation
International Journal of Artificial Intelligence & Applications (IJAIA)
6th International Conference on Advanced Machine Learning (AMLA 2025)
OWE-CVD: An Optimized Weighted Ensemble for Heart Disease Prediction
Ad

Recently uploaded (20)

PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PDF
PPT on Performance Review to get promotions
PPT
Introduction, IoT Design Methodology, Case Study on IoT System for Weather Mo...
PDF
composite construction of structures.pdf
PDF
Well-logging-methods_new................
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPTX
CH1 Production IntroductoryConcepts.pptx
PPT
introduction to datamining and warehousing
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
Safety Seminar civil to be ensured for safe working.
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PDF
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PPTX
Geodesy 1.pptx...............................................
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PDF
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
Foundation to blockchain - A guide to Blockchain Tech
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PPT on Performance Review to get promotions
Introduction, IoT Design Methodology, Case Study on IoT System for Weather Mo...
composite construction of structures.pdf
Well-logging-methods_new................
Embodied AI: Ushering in the Next Era of Intelligent Systems
CH1 Production IntroductoryConcepts.pptx
introduction to datamining and warehousing
CYBER-CRIMES AND SECURITY A guide to understanding
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Safety Seminar civil to be ensured for safe working.
Automation-in-Manufacturing-Chapter-Introduction.pdf
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
Geodesy 1.pptx...............................................
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
UNIT-1 - COAL BASED THERMAL POWER PLANTS
Foundation to blockchain - A guide to Blockchain Tech

AE-ViT: Token Enhancement for Vision Transformers via CNN-Based Autoencoder Ensembles

  • 1. AE-ViT: Token Enhancement for Vision Transformers via CNN-based Autoencoder Ensembles. Heriniaina Andry RABOANARY1,2, Roland RABOANARY1,3, and Nirina Maurice HASINA TAHIRIDIMBISOA1,4 1 Equipe d’accueil PNA-PHE, EDPA - Faculté des Sciences, Université d’Antananarivo, Antananarivo, Madagascar 2 andry.raboanary@gmail.com 3 r raboanary@yahoo.fr 4 nirina@aims.ac.za Abstract. While Vision Transformers (ViTs) have revolutionized computer vision with their exceptional results, they struggle to balance processing speed with visual detail preservation. This tension becomes particularly evident when implementing larger patch sizes. Although larger patches reduce computational costs, they lead to significant information loss during the tokenization process. We present AE-ViT, a novel architecture that leverages an ensemble of autoencoders to address this issue by introducing specialized latent tokens that integrate seamlessly with standard patch tokens, enabling ViTs to capture both global and fine-grained features. Our experiments on CIFAR-100 show that AE-ViT achieves a 23.67% relative accuracy improvement over the baseline ViT when using 16×16 patches, effectively recovering fine-grained details typically lost with larger patches. Notably, AE-ViT maintains relevant performance (60.64%) even at 32×32 patches. We further validate our method on CIFAR-10, confirming consistent benefits and adaptability across different datasets. Ablation studies on ensemble size and integration strategy underscore the robustness of AE-ViT, while computational analysis shows that its efficiency scales favorably with increasing patch size. Overall, these findings suggest that AE-ViT provides a practical solution to the patch-size dilemma in ViTs by striking a balance between accuracy and computational cost, all within a simple, end-to-end trainable design. Keywords: Vision Transformers, Convolutional Neural Networks, Autoencoders, Hybrid Architecture, Image Classification, latent representation 1 Introduction Vision Transformers (ViTs) [9] have emerged as a powerful alternative to Convolutional Neural Networks (CNNs) in computer vision tasks. Adapting the transformer architecture from natural language processing [26], Vision Transformers (ViTs) process images by de- composing them into non-overlapping patches and employing self-attention mechanisms to model relationships across the entire image. This architectural paradigm has achieved exceptional performance across diverse computer vision tasks [14], presenting a compelling challenge to the historically dominant Convolutional Neural Networks (CNNs). However, ViTs face a critical trade-off in patch size selection. The computational com- plexity of self-attention operations grows quadratically with the number of tokens, mak- ing smaller patches (e.g., 8×8 pixels) computationally expensive despite their rich feature representation. Conversely, larger patches (e.g., 16×16 pixels) offer better computational efficiency but sacrifice fine-grained spatial information [19]. This information loss becomes particularly evident in tasks requiring detailed feature analysis [4], where the tokenization process with large patches might lose crucial local patterns. Several approaches have been proposed to address this trade-off. Hierarchical de- signs [27] progressively merge tokens to balance computational cost and feature granularity. Efficient attention mechanisms [20] reduce complexity through local attention windows. 
International Journal of Artificial Intelligence and Applications (IJAIA), Vol.16, No.1, January 2025 47 DOI:10.5121/ijaia.2025.16104
  • 2. tary strengths. However, these solutions often introduce significant architectural complex- ity [12], requiring careful design choices and sophisticated training strategies that may limit their practical adoption. In this paper, we introduce AE-ViT, a novel approach that addresses the patch size dilemma through an ensemble of autoencoders. Our method complements the standard patch-based tokenization with learned latent representations from convolutional autoen- coders [1]. These autoencoders, operating at a finer scale than the patch tokens, capture and preserve local features that would otherwise be lost with large patches. By integrating these latent representations into the transformer’s token sequence, we enable the model to simultaneously leverage both the computational benefits of large patches and the fine- grained feature detection capabilities of CNNs [18]. This hybrid architecture builds upon the proven strengths of both convolutional [17] and transformer architectures [24], creating a synergy that effectively compensates for the limitations of large patch tokenization. Our key contributions can be summarized as follows: (1) We propose a novel hybrid architecture that leverages an ensemble of autoencoders to compensate for information loss in large-patch Vision Transformers; (2) We introduce an efficient method for integrat- ing autoencoder latent representations with transformer tokens, maintaining architectural simplicity while significantly improving performance; (3) We demonstrate the effectiveness of our approach through extensive experiments on CIFAR-100 [15], achieving a 23.67% relative accuracy improvement over the baseline ViT when using 16×16 patches, without introducing significant computational overhead [21]. We also tested our system on the CIFAR-10 [16] to validate the scaling over other datasets. Our results show that AE-ViT provides a practical solution to the patch size dilemma, particularly valuable in scenarios where computational efficiency is crucial [9]. 2 Background and Related Work 2.1 Vision Transformers Originally introduced by Dosovitskiy et al. [9], Vision Transformers (ViTs) adapt the transformer architecture [26] for image processing tasks by treating images as sequences of patches. This approach employs self-attention mechanisms to capture global dependencies, yielding remarkable performance across numerous vision applications. The success of the original ViT has led to multiple improvements, such as DeiT [24], which proposes efficient training techniques, and CaiT [25], which enhances feature representation by increasing model depth. 2.2 Efficiency in Vision Transformers Despite their success, ViTs face computational challenges stemming from self-attention complexity, which grows quadratically with the number of tokens. Smaller patches yield more tokens and capture finer details, but with significantly higher computational cost. To address this, the Swin Transformer [20] introduces local attention windows to reduce overall complexity, and PVT [27] employs a progressive shrinking pyramid to reduce the token count in deeper layers. Another approach, TNT [13], processes patch-level and pixel- level tokens in a nested structure, emphasizing multi-scale feature representation. Hybrid architectures [10] combine CNNs and transformers to leverage their complemen- International Journal of Artificial Intelligence and Applications (IJAIA), Vol.16, No.1, January 2025 48
  • 3. 2.3 Hybrid CNN–Transformer Approaches Beyond pure transformer architectures, a growing body of research integrates convolu- tional neural networks (CNNs) into ViTs to capitalize on their complementary strengths. ConViT [7] introduces soft convolutional inductive biases to refine local feature model- ing, while LeViT [10] interleaves convolutional blocks for efficient early-stage processing. CvT [28] demonstrates the effectiveness of convolutional token embeddings, and early convolution layers [29] have been shown to be crucial for robust vision transformer per- formance. Meanwhile, CoAtNet [6] unifies depthwise convolutions and attention layers for scalable performance across different input sizes, and MobileViT [22] incorporates lightweight CNN blocks into ViTs to facilitate mobile-friendly inference. In parallel, Guo et al. [11] highlight that CNN-based approaches can significantly reduce the computational costs of vision transformers. These hybrid efforts underscore the synergy between local re- ceptive fields (CNN-like) and global self-attention (transformer-like), paving the way for more efficient and effective models. 2.4 Autoencoders in Vision Tasks Autoencoder architectures [1] have shown remarkable success in learning compact, mean- ingful representations, from dimensionality reduction to feature learning [2]. Their ability to preserve essential information while reducing spatial dimensionality is particularly use- ful in tasks requiring fine-grained detail [8]. By reconstructing input data, autoencoders can capture local and global structure, making them relevant for scenarios where large image patches risk losing crucial spatial information. 2.5 Our Approach Our work bridges these lines of research by introducing an ensemble of CNN-based au- toencoders to enhance transformer-based vision models. While previous methods have incorporated CNN blocks within a transformer (Section 2.3) or tackled patch-level effi- ciency (Section 2.2), our method uniquely leverages autoencoder-based latent representa- tions to compensate for information loss in large-patch ViTs. Instead of modifying the patch embedding or internal attention blocks, we integrate a learnable latent “AE token” that complements the standard patch tokens. This design aims to reconcile the efficiency benefits of large patches with the need to preserve fine-grained details, offering a novel solution to the efficiency–accuracy trade-off in modern vision transformers. 3 Method 3.1 Overview The Fig. 1 presents an overview of the architecture of our ensemble. For clarity, we chose to present the case of 4 autoencoders. The latent token is “plugged” as an additional token for the transformer. 3.2 Autoencoder Ensemble Design Each autoencoder in our ensemble follows a convolutional architecture optimized for cap- turing fine-grained features. The encoder pathway consists of three downsampling blocks that progressively reduce spatial dimensions from 64×64 to 8×8 while increasing the fea- ture channels. Specifically, we use strided convolutions with a kernel size of 4×4 and stride International Journal of Artificial Intelligence and Applications (IJAIA), Vol.16, No.1, January 2025 49
  • 4. Fig. 1. Overview of the ensemble architecture (for four autoencoders). 2, followed by batch normalization and ReLU activation. The feature dimensions evolve as follows: 3 conv − − − → 32 conv − − − → 64 conv − − − → 128 (1) The final feature map is flattened and projected to a latent space of dimension 256 through a fully connected layer. This results in a compression ratio of: compression ratio = 64 × 64 × 3 256 = 48 : 1 (2) The decoder mirrors this structure with transposed convolutions to progressively re- construct the spatial dimensions: 128 convT − − − → 64 convT − − − → 32 convT − − − → 3 (3) To prevent overfitting and encourage robust feature learning, we employ L1 and L2 regularization on the latent space: Lreg = λ1∥z∥1 + λ2∥z∥2 2 (4) where z represents the latent vector and λ1 = λ2 = 0.01 are regularization coefficients. The total loss for each autoencoder is: Ltotal = Lrecon + Lreg (5) where Lrecon is the mean squared error between the input and reconstructed images. 3.3 Latent Token Integration The main innovation of our approach lies in how we integrate the autoencoder ensemble’s latent representations with the Vision Transformer’s patch tokens. Given an input image X ∈ RH×W×3, the process follows three parallel paths that merge in the transformer: International Journal of Artificial Intelligence and Applications (IJAIA), Vol.16, No.1, January 2025 50
  • 5. Patch Tokenization The input image is divided into non-overlapping patches of size sp × sp, resulting in N = (H/sp) × (W/sp) patches. Where sp defines the patch size in pixels. Each patch is linearly projected to dimension D through a learnable embedding matrix E ∈ R(sp×sp×3)×D: tpatch = [x1 pE; x2 pE; ...; xN p E] ∈ RN×D (6) Latent Encoding Simultaneously, our ensemble of five autoencoders processes the input image, each producing a latent vector zi ∈ R256. These latents are concatenated: zconcat = [z1; z2; z3; z4; z5] ∈ R1280 (7) This concatenated representation is then projected to the transformer’s embedding dimension D through a learnable projection Wproj ∈ R1280×D: tlatent = zconcatWproj ∈ R1×D (8) Token Sequence Formation The final sequence presented to the transformer concate- nates the class token tcls, patch tokens, and latent token: T = [tcls; tpatch; tlatent] ∈ R(N+2)×D (9) Position embeddings P are added to this sequence to maintain positional information: Tfinal = T + P (10) For the final classification, we utilize both the class token and the latent token, con- catenating their representations after the transformer processing: y = MLP([hcls; hlatent]) (11) where hcls and hlatent are the transformed representations of the respective tokens. 3.4 Training Strategy Our training process follows a two-phase approach designed to maximize the complemen- tary strengths of both the autoencoder ensemble and the Vision Transformer. Phase 1: Autoencoder Ensemble Pretraining We train each autoencoder indepen- dently on different random subsets of the training data. Each autoencoder sees different parts of the training data, randomly sampled with a fixed seed to ensure reproducibil- ity. The objective function for each autoencoder combines reconstruction loss with latent space regularization: LAE = Lrecon + λ1∥z∥1 + λ2∥z∥2 2 (12) where λ1 = λ2 = 0.01. This phase runs for 5 to 205 epochs using the Adam optimizer with a learning rate of 10−3. 5 Depending on the experience conditions. International Journal of Artificial Intelligence and Applications (IJAIA), Vol.16, No.1, January 2025 51
  • 6. Phase 2: End-to-End Training After pretraining, we freeze the autoencoder param- eters and train the complete AE-ViT architecture end-to-end. The loss function for this phase is: Ltotal = LCE + λwd∥θ∥2 2 (13) where LCE is the cross-entropy loss with label smoothing (α = 0.1), and λwd = 0.05 is the weight decay coefficient. We employ the AdamW optimizer with the following specifics: – Learning rate: 10−3 – Batch size: 64 – Training epochs: 100 – No early stopping, use the last epoch for test – Cosine learning rate scheduling – Gradient clipping at norm 1.0 The complete training process can be formalized as: θ∗ = arg min θ E(x,y)∼D[Ltotal(AE-ViTθ(x), y)] (14) where D represents the training dataset and θ are the model parameters. 4 Experiments 4.1 Experimental Setup We conduct extensive experiments on two standard benchmarks: CIFAR-100 [15] and CIFAR-10 [16]. Images are resized to 64×64 pixels. We use standard data augmentation including random horizontal flips and AutoAugment. Training uses AdamW optimizer with learning rate 1e-3 and weight decay 0.05 for 100 epochs with cosine scheduling. We conduct our experiments on CIFAR-100, which consists of 50,000 training images and 10,000 test images in 100 classes. All images are resized to 64×64 pixels to accommo- date our patch-based architecture. We use standard data augmentation techniques includ- ing random horizontal flips and AutoAugment [5]. Implementation details are provided in our publicly available code. 4.2 Main Results Performance on CIFAR-100. Table 1 presents our main results with various patch sizes: Table 1. Classification accuracy (%) on CIFAR-100 Model Patch Size FLOPs (M) Accuracy ViT 8×8 287.6 61.23 ViT 16×16 78.4 49.94 ViT 32×32 19.6 35.86 AE-ViT (Ours) 16×16 187.0 61.76 AE-ViT (Ours) 32×32 128.4 60.64 With 16×16 patches, our approach not only outperforms the baseline ViT but also achieves better accuracy than the more computationally intensive 8×8 patch configuration, International Journal of Artificial Intelligence and Applications (IJAIA), Vol.16, No.1, January 2025 52
4 Experiments

4.1 Experimental Setup

We conduct experiments on two standard benchmarks: CIFAR-100 [15] and CIFAR-10 [16]. CIFAR-100 consists of 50,000 training images and 10,000 test images spread over 100 classes. All images are resized to 64×64 pixels to accommodate our patch-based architecture, and we apply standard data augmentation, including random horizontal flips and AutoAugment [5]. Training uses the AdamW optimizer with a learning rate of 10^-3 and weight decay of 0.05 for 100 epochs with cosine scheduling. Implementation details are provided in our publicly available code.

4.2 Main Results

Performance on CIFAR-100. Table 1 presents our main results for various patch sizes.

Table 1. Classification accuracy (%) on CIFAR-100

Model           Patch Size   FLOPs (M)   Accuracy
ViT             8×8          287.6       61.23
ViT             16×16        78.4        49.94
ViT             32×32        19.6        35.86
AE-ViT (Ours)   16×16        187.0       61.76
AE-ViT (Ours)   32×32        128.4       60.64

With 16×16 patches, our approach not only outperforms the baseline ViT but also exceeds the accuracy of the more computationally intensive 8×8 configuration, yielding a 23.67% relative accuracy improvement over the baseline ViT with 16×16 patches. Most notably, with 32×32 patches, AE-ViT maintains reasonable performance (60.64%) while the baseline ViT degrades severely (35.86%), a remarkable relative improvement of 69%.
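For reference, these relative improvements follow directly from Table 1, computed as (Acc_AE-ViT − Acc_ViT) / Acc_ViT at a fixed patch size:

(61.76 − 49.94) / 49.94 ≈ 0.2367 for 16×16 patches, and (60.64 − 35.86) / 35.86 ≈ 0.69 for 32×32 patches.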
Scaling to CIFAR-10. To validate the generality of our approach, we conduct additional experiments on CIFAR-10. The results in Table 2 show that AE-ViT remains effective:

Table 2. Classification accuracy (%) on CIFAR-10

Model           Patch Size   Accuracy
ViT             8×8          82.91
ViT             16×16        76.37
AE-ViT (Ours)   16×16        84.35

The architecture therefore transfers to the CIFAR-10 dataset as well.

4.3 Ablation Studies

Impact of Ensemble Size. We conduct a systematic study of the impact of ensemble size:

Table 3. Impact of the number of autoencoders (CIFAR-100)

#AEs   Test Acc.   Train Acc.   FLOPs (M)
1      59.11       86.28        106
2      60.73       87.15        133
3      61.00       88.38        160
4      61.76       88.90        187
5      61.75       89.31        214
6      61.27       89.08        241

The results reveal an optimal configuration at four autoencoders, beyond which performance plateaus or slightly degrades. This suggests that while ensemble diversity is beneficial, there is a sweet spot balancing performance and computational cost. However, further experiments under varied conditions (dataset, patch size, transformer hyperparameters) are needed before this configuration can be generalized.

Cross-dataset Training. We also explore enhancing the ensemble with an autoencoder trained on CIFAR-10. This configuration achieves 61.94% accuracy on CIFAR-100, suggesting potential benefits from cross-dataset knowledge transfer, albeit with diminishing returns relative to the computational overhead.

Conclusion. These results indicate that adding more than four autoencoders is not beneficial for the AE-ViT architecture: four autoencoders represent an optimal balance between accuracy, generalization, and computational efficiency.

4.4 Efficiency Analysis

While our primary results demonstrate AE-ViT's superior accuracy over the baseline ViT with 16×16 patches, a more insightful comparison emerges when we consider the baseline ViT with 8×8 patches, which reaches a similar accuracy level. This comparison is particularly interesting because it directly addresses the trade-off between accuracy and computational efficiency. Table 4 presents a detailed comparison.

Table 4. Efficiency comparison between ViT (8×8) and AE-ViT (16×16)

Model                 FLOPs (M)   #Tokens   Test Acc. (%)
ViT (8×8)             287.6       65        61.23
AE-ViT (16×16)        187.0       18        61.76
Relative Difference   -34.9%      -72.3%    +0.87%

The results reveal that AE-ViT achieves comparable (in fact slightly better) accuracy while requiring about 35% fewer FLOPs. This efficiency gain stems primarily from two factors:
1) Token Efficiency: AE-ViT processes only 18 tokens (16 patch tokens + 1 CLS token + 1 latent token) compared to 65 tokens in the 8×8 ViT, a 72.3% reduction in the self-attention computational load.
2) Computational Distribution: While AE-ViT introduces additional computation through its autoencoder ensemble (136.3M FLOPs), this is more than offset by the reduced transformer complexity (78.4M FLOPs vs. 287.6M FLOPs).

The memory footprint also favors AE-ViT: because the attention mechanism's memory scales quadratically with the number of tokens (O(n^2)), the reduced token count is particularly significant. This demonstrates that our approach not only bridges the performance gap of larger patches but does so in a computationally efficient manner.

4.5 Comparison with State-of-the-Art

While our primary focus is on addressing the patch-size dilemma in Vision Transformers, it is insightful to position AE-ViT within the broader context of modern architectures. Table 5 presents a comparison with the Swin Transformer [20], a leading hierarchical vision transformer:

Table 5. Comparison with Swin Transformer on CIFAR-100

Model            Accuracy (%)   Params (M)   FLOPs (M)
Swin-T [3]       78.41          28.3         4500
AE-ViT (4 AEs)   61.76          12.4         187

While the Swin Transformer achieves higher accuracy, AE-ViT offers significantly better computational efficiency. This efficiency gap becomes even more pronounced at higher image resolutions; Table 6 gives a theoretical comparison for 1080p images (1920×1080).

Table 6. Theoretical scaling to 1080p images

Model    Memory      FLOPs (G)
Swin-T   O(HW)       284.4
AE-ViT   O(HW/P^2)   11.8

This scaling advantage stems from our efficient use of large patches (16×16) combined with the fixed cost of the autoencoder ensemble. While the Swin Transformer's computational requirements grow quadratically with image size, AE-ViT maintains better efficiency, making it particularly suitable for high-resolution applications where computational resources are constrained.
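To make the token-count reasoning behind Tables 4 and 6 explicit, the short Python sketch below counts the tokens a ViT processes at a given resolution and patch size, along with the resulting quadratic number of pairwise attention interactions. It is illustrative only: it reproduces the token counts of Table 4 (65 vs. 18 at 64×64) but not the exact FLOPs figures, which additionally depend on embedding width, depth, and the autoencoder cost; the helper names are ours.

def vit_tokens(h, w, patch, extra_tokens=1):
    """(H/P)*(W/P) patch tokens plus extras (CLS, and the AE-ViT latent token).
    Non-divisible sizes would be padded or resized in practice."""
    return (h // patch) * (w // patch) + extra_tokens

def attention_pairs(n_tokens):
    """Self-attention forms an n x n score matrix, so its cost and memory grow as n^2."""
    return n_tokens ** 2

# 64x64 inputs (cf. Table 4): 65 vs. 18 tokens, a ~72.3% reduction.
base = vit_tokens(64, 64, patch=8, extra_tokens=1)    # 8x8 ViT: 64 patches + CLS = 65
ours = vit_tokens(64, 64, patch=16, extra_tokens=2)   # AE-ViT: 16 patches + CLS + latent = 18
print(base, ours, round(1 - ours / base, 3))          # 65 18 0.723

# 1080p inputs (cf. Table 6): the token count, and hence the dominant attention cost,
# grows as HW/P^2 for AE-ViT, while the autoencoder cost stays fixed per image.
hd = vit_tokens(1920, 1080, patch=16, extra_tokens=2)
print(hd, attention_pairs(hd))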
5 Discussion

Our experimental results demonstrate several key findings about AE-ViT and provide insight into the trade-offs between patch size, computational efficiency, and model performance.

5.1 Optimal Ensemble Configuration

The systematic study of autoencoder ensemble size reveals a clear optimization pattern. Starting from a single autoencoder (59.11%), we observe significant improvements with each additional autoencoder up to four (61.76%), followed by diminishing returns with five (61.75%) and a slight degradation with six (61.27%). This pattern suggests that:

– The ensemble approach is fundamentally sound, with even two autoencoders outperforming a single autoencoder by 1.62 percentage points
– Four autoencoders represent an optimal balance between performance and complexity
– Additional autoencoders beyond four may introduce unnecessary redundancy or noise

5.2 Scaling with Patch Size

Perhaps the most striking result is AE-ViT's ability to maintain performance with larger patch sizes:

– With 16×16 patches, AE-ViT (61.76%) matches the performance of the standard ViT with 8×8 patches (61.23%) while using 35% fewer FLOPs
– With 32×32 patches, AE-ViT (60.64%) demonstrates remarkable resilience compared to the baseline (35.86%), a 69% relative improvement

This scaling behavior suggests that our autoencoder ensemble effectively compensates for the information loss incurred by larger patches, potentially offering a pathway to processing high-resolution images efficiently.

5.3 Cross-Dataset Insights

Our experiments with cross-dataset training, in which an autoencoder trained on CIFAR-10 is incorporated into the ensemble, yield several insights:

– The mixed ensemble (61.94%) slightly outperforms the pure CIFAR-100 ensemble (61.76%)
– The improvement, while modest, suggests potential benefits from diverse training data
– The approach maintains its effectiveness across datasets, as demonstrated by the strong performance on CIFAR-10 (84.35%)
5.4 Computational Efficiency

Comparing AE-ViT with state-of-the-art models such as the Swin Transformer reveals an interesting efficiency-accuracy trade-off:

– While Swin-T achieves higher accuracy (78.41%), it requires roughly 24× more FLOPs
– AE-ViT's efficiency advantage grows with image resolution owing to the fixed autoencoder cost
– The parameter count remains modest (12.4M vs. 28.3M for Swin-T)

5.5 Limitations

Our approach, while effective, has several notable limitations:

– The current architecture depends on a fixed number of autoencoders, lacking adaptability to varying computational constraints
– Performance on high-resolution images remains untested, particularly for resolutions beyond 64×64
– The approach has only been validated on classification, not on more complex vision tasks
– It is limited to static image processing, without consideration of temporal features
– Generalization has only been tested on relatively small datasets (CIFAR-100, CIFAR-10)

5.6 Future Work

Several promising directions for future research emerge:

Architecture Improvements
– Develop dynamic ensemble selection mechanisms that adapt to specific domains and computational constraints
– Explore alternative autoencoder architectures for enhanced feature extraction
– Investigate integration with state-of-the-art transformer variants
– Extend the cross-dataset learning of Section 4.3 for further knowledge transfer
– Use different types of autoencoders to increase the robustness of the transformer, aiming to improve on the results reported in [23]

Scaling and Performance
– Validate performance on high-resolution images (1080p, 4K)
– Test scalability on larger datasets (ImageNet, Places365)
– Optimize the implementation for specific hardware accelerators (GPUs, TPUs)

Extended Applications
– Adapt the architecture to dense prediction tasks (segmentation, detection)
– Extend to video processing by incorporating temporal information
– Explore transfer learning for specialized domains (medical imaging, satellite imagery)
6 Conclusion

In this paper, we introduced AE-ViT, a novel approach that effectively addresses the patch-size dilemma in Vision Transformers through an ensemble of autoencoders. Our method achieves a 23.67% relative improvement over the baseline ViT on CIFAR-100 with 16×16 patches, while using significantly fewer computational resources than models that rely on smaller patches. Through extensive experimentation, we identified an optimal configuration of four autoencoders that balances performance and efficiency.

The effectiveness of AE-ViT is particularly evident in its ability to maintain strong performance with large patches, achieving 60.64% accuracy with 32×32 patches compared to the baseline's 35.86%. This capability, combined with its strong showing on CIFAR-10 (84.35%) and its efficient scaling properties, demonstrates the potential of our approach for high-resolution image processing applications. Most importantly, AE-ViT offers a practical solution for scenarios requiring efficient vision processing, providing a favorable trade-off between accuracy and computational cost. While more computationally intensive models such as the Swin Transformer achieve higher absolute accuracy, AE-ViT's efficiency (24× fewer FLOPs) makes it particularly attractive for resource-constrained settings or real-time processing of high-resolution images.

In line with our findings, promising avenues for further research include exploring dynamic ensemble selection mechanisms to adapt AE-ViT to specific domains, developing more sophisticated autoencoder architectures for enhanced feature extraction, and scaling to larger datasets such as ImageNet or to higher resolutions (1080p and beyond). Extending AE-ViT to dense prediction tasks (e.g., segmentation or detection) and to time-series data (e.g., video) would also be valuable, as would investigating transfer learning for specialized domains such as medical imaging or satellite imagery. These extensions may reinforce AE-ViT's robustness, accelerate its adoption in real-world scenarios, and deepen our understanding of hybrid autoencoder-transformer models.

7 Acknowledgments

We thank the anonymous reviewers for their valuable feedback. We also thank ISPM (https://ispm-edu.com/) for funding our research.

References

1. Baldi, P.: Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning (2012)
2. Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8), 1798–1828 (2013)
3. Chen, C., Derakhshani, M.M., Liu, Z., Fu, J., Shi, Q., Xu, X., Yuan, L.: On the vision transformer scaling: Parameter scaling laws and improved training strategies. arXiv preprint arXiv:2204.08476 (2022)
4. Chen, C.F., Fan, Q., Panda, R.: CrossViT: Cross-attention multi-scale vision transformer for image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
5. Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: AutoAugment: Learning augmentation strategies from data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 113–123 (2019). https://doi.org/10.1109/CVPR.2019.00020
6. Dai, Z., Liu, H., Le, Q.V., Tan, M.: CoAtNet: Marrying convolution and attention for all data sizes. In: Advances in Neural Information Processing Systems, vol. 34, pp. 3965–3977 (2021)
7. d'Ascoli, S., Touvron, H., Leavitt, M., Morcos, A., Biroli, G., Sagun, L.: ConViT: Improving vision transformers with soft convolutional inductive biases. In: Proceedings of the International Conference on Machine Learning (ICML) (2021)
8. Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B.: Attention in attention: Modeling context correlation for efficient vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
9. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021)
10. Graham, B., El-Nouby, A., Touvron, H., Stock, P., Joulin, A., Jégou, H., Douze, M.: LeViT: A vision transformer in ConvNet's clothing for faster inference. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
11. Guo, J., Han, K., Wu, H., Tang, Y., Chen, X., Wang, Y., Xu, C.: CMT: Convolutional neural networks meet vision transformers (2022). https://arxiv.org/abs/2107.06263
12. Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., et al.: A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
13. Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)
14. Khan, S., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S., Shah, M.: Transformers in vision: A survey. ACM Computing Surveys (2021)
15. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. rep., University of Toronto (2009)
16. Krizhevsky, A., Nair, V., Hinton, G.: CIFAR-10 (Canadian Institute for Advanced Research). http://www.cs.toronto.edu/~kriz/cifar.html (2009), dataset
17. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012)
18. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
19. Liu, K., Zhang, W., Tang, K., Li, Y., Cheng, J., Liu, Q.: A survey of vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
20. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
21. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Efficient transformers: A survey. ACM Computing Surveys (2021)
22. Mehta, S., Rastegari, M.: MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. In: International Conference on Learning Representations (ICLR) (2022). https://openreview.net/forum?id=7RkGY0FtwrY
23. Raboanary, H.A., Raboanary, R., Tahiridimbisoa, N.H.M.: Robustness assessment of neural network architectures to geometric transformations: A comparative study with data augmentation. In: 2023 3rd International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME). IEEE, Tenerife, Canary Islands, Spain (Jul 2023). https://doi.org/10.1109/ICECCME57830.2023.10253075
24. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: Proceedings of the International Conference on Machine Learning (ICML) (2021)
25. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
27. Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
28. Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., Zhang, L.: CvT: Introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
29. Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., Girshick, R.: Early convolutions help transformers see better. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)