AWS EC2 M6i instances featuring 3rd Gen Intel Xeon Scalable processors offered better BERT machine learning performance

AWS EC2 M6i instances featuring 3rd
Gen Intel Xeon
Scalable processors offered better BERT machine
learning performance
vs. M5n instances with 2nd
Gen Intel Xeon Scalable processors and
M6a instances with 3rd
Gen AMD EPYC processors
Many machine learning workloads involve sorting, analyzing, and making relationships
between images, but how can organizations quickly make sense of large amounts of text?
Bidirectional Encoder Representations from Transformers (BERT) is a machine learning
framework for natural language processing (NLP). To analyze text, BERT looks at all the words
around a given word to put it in the correct context. This allows applications such as search
engines to predict sentences, answer questions, or generate conversational responses.
Using Intel optimization for TensorFlow and ZenDNN integrated with TensorFlow, we
compared the BERT machine learning performance of three types of Amazon Web Services
(AWS) EC2 series instances: M6i instances with 3rd
Gen Intel®
Xeon®
Scalable processors
featuring Intel DL Boost with Vector Neural Network Instructions, M5n instances with 2nd
Gen
Intel Xeon Scalable processors, and M6a instances with 3rd
Gen AMD EPYC™
processors.
In tests at multiple instance sizes, AWS M6i instances offered up to 45 percent better BERT
performance on a benchmark from the Intel Model Zoo than the M5n instances with previous-
gen processors and up to 6.4 times the BERT performance compared to M6a instances with
3rd
Gen AMD EPYC processors. This means that organizations running similar BERT workloads
in the cloud could get better performance per instance by choosing M6i instances featuring
3rd
Gen Intel Xeon Scalable processors.
Up to 5.2x the queries
per second
vs. M6a instances
per second
vs. M6a instances
per second
vs. M6a instances
4
vCPUs
8
vCPUs
16
vCPUs
Gen Intel Xeon Scalable processors offered better BERT machine learning performance June 2022
A Principled Technologies report: Hands-on testing. Real-world results.

Figure 1: Key specifications for each instance size we tested. Source: Principled Technologies.
How we tested
We purchased three sets of instances from three general-purpose AWS EC2 series:
• M6i instances featuring 3rd
Gen Intel Xeon Platinum 8375C processors (Ice Lake)
• M5n instances featuring 2nd
Gen Intel Xeon Platinum 8259CL processors (Cascade Lake)
• M6a instances featuring 3rd
Gen AMD EPYC 7R13 processors (Milan)
We ran each instance in the US East 1 region.
Figure 1 shows the specifications for the instances that we chose. To show how businesses of various sizes with
different machine learning demands can benefit from choosing M6i instances, we tested instances with 4 vCPUs,
8 vCPUs, and 16 vCPUs. To account for different types of datasets, we ran tests using a small batch size of 1
and a large batch size of 32—where batch size is the number of samples that go through the neural network at
a time. In this report, we present the comparisons between M6i and M5n instances first, and then present the
comparisons between M6i and M6a instances. (Note: For additional test results on even larger instances, see the
science behind the report.)
4
vCPUs
8
vCPUs
16
vCPUs
Testing BERT performance in the cloud
The BERT framework, which was trained on text from the English language Wikipedia with over 2.5 million
words, works by turning text into numbers to sort, analyze, and make predictions about that text.1
Depending
on the dataset on which an organization needs to run BERT machine leaning, the size of the AWS instances they
choose will vary. To account for these different needs, we tested using two batch sizes across three different
instance sizes. We used a BERT benchmark from Intel Model Zoo, which offers a range of machine learning
models and tools. At the time of our testing, AMD EPYC processors did not support INT8 precision for BERT,
so we present FP32 precision results for M6i instances as well for comparison. In all three, the M6i instances
enabled by 3rd
Gen Intel Xeon Scalable processors outperformed both the previous-gen M5n instances and the
current-gen M6a instances.
June 2022 | 2
Gen Intel Xeon Scalable processors offered better BERT machine learning performance

Why choose M6i instances with 3rd
Gen Intel Xeon Scalable processors?
New M6i instances with 3rd
Gen Intel Xeon Scalable processors offer the following:4
• All-core turbo frequency of up to 3.5 GHz
• Always-on memory encryption with Intel Total Memory Encryption (TME)
• Intel DL Boost with Vector Neural Network Instructions (VNNI) that accelerate INT8 performance
• Intel Advanced Vector Extensions 512 (Intel AVX-512) instructions for demanding machine
learning workloads
• Support for up to 128 vCPUs and 512 GB of memory per instance
• Up to 50Gbps networking
About 3rd
Generation Intel Xeon Scalable processors
According to Intel, 3rd
Generation Intel Xeon Scalable processors are “[o]ptimized for cloud, enterprise, HPC,
network, security, and IoT workloads with 8 to 40 powerful cores and a wide range of frequency, feature, and
power levels.”2
Intel continues to offer many models from the Platinum, Gold, Silver, and Bronze processor lines
that they “designed through decades of innovation for the most common workload requirements.3
For more information, visit http://intel.com/xeonscalable.
June 2022 | 3

Instances with 4 vCPUs: M6i vs. M5n
First, we compared BERT performance on smaller instances, looking at the relative amount of text the instance
types analyzed on 4vCPU configurations. As Figure 2 shows, M6i instances enabled by 3rd
Gen Intel Xeon
Scalable processors analyzed up to 18 percent more examples per second than the M5n instances with 2nd
Gen
Intel Xeon Scalable processors.
Figure 2: Relative BERT performance for M6i and M5n instances using 4 vCPUs. Higher numbers are better.
Source: Principled Technologies.
When we doubled the instance size to 8 vCPUs, M6i instances delivered a similar performance increase over
previous-gen M5n instances. Figure 3 compares the relative amount of text the instance types analyzed on
8vCPU configurations. The M6i instances enabled by 3rd
Gen Intel Xeon Scalable processors analyzed up to 11
percent more examples per second than the M5n instances with 2nd
Gen Intel Xeon Scalable processors.
Relative BERT performance of m6i.xlarge vs. m5n.xlarge
Larger is better
0
0.20
0.40
0.60
0.80
1.00
1.40
1.20
1.00
Relative
throughput
(examples/sec)
M6i (INT8) M5n (INT8) M6i (INT8)
Batch size: 1 Batch size: 32
1.13
1.00
1.18
M5n (INT8)
1.60
Relative BERT performance of m6i.2xlarge vs. m5n.2xlarge
Larger is better
0
0.20
0.40
0.60
0.80
1.00
1.40
1.20
1.00
Relative
throughput
(examples/sec)
1.11
1.00
1.11
M5n (INT8)
1.60
up to 11%
better throughput
up to 18%
better throughput
4
vCPUs
8
vCPUs
June 2022 | 4

As Figure 4 shows, M6i instances offered the greatest relative BERT performance increase over previous-gen
M5n instances using larger 16vCPU configurations. The M6i instances enabled by 3rd
Gen Intel Xeon Scalable
processors analyzed up to 45 percent more examples per second than the M5n instances with 2nd
Gen Intel Xeon
Scalable processors. By improving textual data analysis throughput by 45 percent, organizations could reduce
the number of instances they need to purchase and manage when they select the M6i instance type.
Relative BERT performance of m6i.4xlarge vs. m5n.4xlarge
Larger is better
0
0.20
0.40
0.60
0.80
1.00
1.40
1.20
1.00
Relative
throughput
(examples/sec)
1.21
1.00
1.45
M5n (INT8)
1.60
up to 45%
better throughput
16
vCPUs
June 2022 | 5

Instances with 4 vCPUs: M6i vs. M6a
After comparing BERT performance of M6i instances against that of instances based on previous-gen processors,
we compared those three sizes of M6i instances against M6a instances with AMD EPYC processors. Figure 5
compares the relative amount of text these instance types analyzed on 4vCPU configurations. The M6i instances
enabled by 3rd
Gen Intel Xeon Scalable processors with INT8 precision analyzed data 5.29 times as fast as the
M6a instances with 3rd
Gen AMD EPYC processors using FP32 precision. Note: At the time of testing, INT8
precision—which can improve performance for these types of machine learning—was not available for
BERT workloads on AMD EPYC processors. Using FP32 precision, M6i instances improved performance over
M6a instances by as much as 68 percent.
Figure 5: Relative BERT performance for M6i and M6a instances using 4 vCPUs. Higher numbers are better.
When we increased the instance sizes to 8 vCPUs, performance increases were similar to the 4vCPU
configurations. Figure 6 compares the relative amount of text the instance types analyzed on 8vCPU
configurations. The M6i instances enabled by 3rd
Gen Intel Xeon Scalable processors analyzed data up to 5.10
times as fast as the M6a instances with 3rd
Gen AMD EPYC processors.
Relative BERT performance of m6i.xlarge vs. m6a.xlarge
Larger is better
0
1.00
2.00
3.00
4.00
5.00
7.00
6.00
1.68
1.00
Relative
throughput
(examples/sec)
M6i (INT8) M6i (FP32) M6a (FP32) M6i (INT8) M6i (FP32) M6a (FP32)
4.24
1.00
1.57
5.29
Relative BERT performance of m6i.2xlarge vs. m6a.2xlarge
Larger is better
0
1.00
2.00
3.00
4.00
5.00
7.00
6.00
1.64
1.00
Relative
throughput
(examples/sec)
4.44
1.00
1.68
5.10
up to 5.10x
the throughput
up to 5.29x
the throughput
4
vCPUs
8
vCPUs
June 2022 | 6

The biggest relative difference in BERT performance occurred in our 16vCPU comparison of M6i and M6a
configurations. Figure 7 compares the relative examples per second the instance types analyzed on 16vCPU
configurations. The M6i instances enabled by 3rd
Gen Intel Xeon Scalable processors analyzed data up to 6.40
times as fast as the M6a instances with 3rd
Gen AMD EPYC processors. These results show that for these types
of BERT workloads, selecting M6i instances that offer INT8 precision over M6a instances that don’t could allow
organizations to complete textual analysis workloads using fewer cloud instances.
Relative BERT performance of m6i.4xlarge vs. m6a.4xlarge
Larger is better
0
1.00
2.00
3.00
4.00
5.00
7.00
6.00
1.81
1.00
Relative
throughput
(examples/sec)
6.37
1.00
2.24
6.40
up to 6.40x
the throughput
16
vCPUs
June 2022 | 7

Scaling BERT workloads
Another consideration for assessing BERT performance is to see how the throughput scales as you increase
the size of the instance. Theoretically, performance could double as you double the vCPU count, which would
be perfect linear scaling. While resource allocation makes this unlikely in the real world, the closer an instance
approaches this ideal, the better.
As Figure 8 shows, using results from our batch size: 1 tests, the M6i instance with 3rd
Gen Intel Xeon Scalable
processors had better BERT performance scaling from 8 vCPUs to 16 vCPUs compared to the M6a instance with
AMD EPYC processors, though slightly worse scaling from 4 vCPUs to 8 vCPUs.
Figure 8: How BERT performance scaled across instance sizes, compared to results from the 4vCPU tests with batch size 1.
Higher numbers are better. Source: Principled Technologies.
Figure 9 makes the same comparison, but uses results from our batch size: 32 testing. Again, the M6i
instance with 3rd
Gen Intel Xeon Scalable processors scaled more linearly from 4 to 16 vCPUs compared to
the M6a instance.
Figure 9: How BERT performance scaled across instance sizes, compared to results from the 4vCPU tests with batch size
32. Higher numbers are better. Source: Principled Technologies.
Relative BERT performance scaling compared to 4vCPUs with batch size: 1
Larger is better
0
0.50
1.00
1.50
2.00
2.50
3.50
3.00
Relative
throughput
(examples/sec)
4 vCPUs 8 vCPUs
1.85
1.00
3.40
1.86 1.91
3.16
4.00
4.50
M6i (INT8) M6i (INT32) M6a (FP32)
16vCPUs
3.83
1.00 1.00
Relative BERT performance scaling compared to 4vCPUs with batch size: 32
Larger is better
Relative
throughput
(examples/sec)
0
0.50
1.00
1.50
2.00
2.50
3.50
3.00
4 vCPUs 8 vCPUs
1.86
1.00
1.90
3.52
2.52
1.78
4.00
4.50
M6i (INT8) M6i (INT32) M6a (FP32)
16vCPUs
3.79
1.00 1.00
By selecting M6i instances that offer more linear, predictable performance scaling, organizations could more
reliably fix their cloud operating budgets as textual analysis workloads continue to grow.
June 2022 | 8

Conclusion
Organizations analyzing textual data using NLP through the BERT framework must decide which type of instance
can deliver the BERT performance they need. In our tests, we found that across instance sizes, AWS M6i
instances with 3rd
Gen Intel Xeon Scalable processors outperformed both M5n instances with 2nd
Gen Intel Xeon
Scalable processors and M6a instances with 3rd
Gen AMD EPYC processors for BERT machine learning. Plus, the
M6i instances offered more predictable scaling at 16vCPUs. These performance increases could help you get
quicker insight from textual data to better satisfy consumers and increase revenues.
1. TechTarget, “BERT language model,” accessed December 16, 2021,
https://www.techtarget.com/searchenterpriseai/definition/BERT-language-model.
2. Intel, “3rd Gen Intel®
Xeon®
Scalable Processors,” accessed December 14, 2021,
https://www.intel.com/content/www/us/en/products/docs/processors/xeon/3rd-gen-xeon-scalable-processors-brief.html.
3. Intel, “3rd Gen Intel®
Xeon®
Scalable Processors.”
4. Amazon, Amazon EC2 M6i Instances, accessed December 14, 2021,
https://aws.amazon.com/ec2/instance-types/m6i/.
Principled Technologies is a registered trademark of Principled Technologies, Inc.
All other product names are the trademarks of their respective owners.
For additional information, review the science behind this report.
Principled
Technologies®
Facts matter.®
Principled
Technologies®
Facts matter.®
This project was commissioned by Intel.
Read the science behind this report at https://facts.pt/ZymIIA3
June 2022 | 9

AWS EC2 M6i instances featuring 3rd Gen Intel Xeon Scalable processors offered better BERT machine learning performance

More Related Content

Similar to AWS EC2 M6i instances featuring 3rd Gen Intel Xeon Scalable processors offered better BERT machine learning performance (20)

More from Principled Technologies (20)

Recently uploaded (20)

AWS EC2 M6i instances featuring 3rd Gen Intel Xeon Scalable processors offered better BERT machine learning performance