AWS Database Migration Service (AWS DMS) is a cloud service that makes it
easy to migrate relational databases, data warehouses, NoSQL databases, and
other types of data stores. You can use AWS DMS to migrate your data into the
AWS Cloud, between on-premises instances (through an AWS Cloud setup), or
between combinations of cloud and on-premises setups. With AWS DMS, you can
perform one-time migrations, and you can replicate ongoing changes to keep
sources and targets in sync.
You can migrate data to Amazon S3 using AWS DMS from any of the supported
database sources. When using Amazon S3 as a target in an AWS DMS task, both
full load and change data capture (CDC) data are written in comma-separated value
(.csv) format by default.
The comma-separated value (.csv) format is the default storage format for
Amazon S3 target objects. For more compact storage and faster queries, you can
instead use Apache Parquet (.parquet) as the storage format. Apache Parquet is
an open-source file storage format originally designed for Hadoop.
Amazon Athena supports a wide variety of data formats such as CSV, TSV, JSON, and
plain text files, and also supports open-source columnar formats such as Apache ORC
and Apache Parquet. Athena also supports compressed data in Snappy, Zlib, LZO,
and GZIP formats. By compressing, partitioning, and using columnar formats, you
can improve performance and reduce your costs.
Parquet and ORC file formats both support predicate pushdown (also called
predicate filtering). Parquet and ORC both have blocks of data that represent
column values. Each block holds statistics for the block, such as max/min values.
When a query is being executed, these statistics determine whether the block
should be read or skipped.
Athena charges you by the amount of data scanned per query. You can save on
costs and get better performance if you partition the data, compress data, or
convert it to columnar formats such as Apache Parquet.
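As a quick illustration of that storage-format point, here is a minimal Python sketch (file names are placeholders; it assumes pandas with pyarrow installed) that converts a .csv extract into Snappy-compressed Parquet before Athena queries it:

```python
import pandas as pd

# Read the raw CSV extract (placeholder file name).
df = pd.read_csv("dms_full_load_extract.csv")

# Write a columnar, Snappy-compressed Parquet file.
# Athena scans far fewer bytes against this layout, which lowers query cost.
df.to_parquet("dms_full_load_extract.parquet", engine="pyarrow", compression="snappy")
```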
Amazon Augmented AI (Amazon A2I) enables you to build the workflows that are
required for human review of machine learning predictions. Amazon Textract is
directly integrated with Amazon A2I so that you can easily get low-confidence
results from Amazon Textract’s AnalyzeDocument API operation reviewed by
humans.
Root-mean-square error (RMSE) is incorrect because this metric is not suitable
for evaluating classification models. This performance metric is mostly used for
regression models.
Amazon Kinesis Data Firehose is a fully managed service for delivering real-
time streaming data to destinations such as Amazon Simple Storage Service
(Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES),
Splunk, and any custom HTTP endpoint or HTTP endpoints owned by supported
third-party service providers, including Datadog, MongoDB, and New Relic.
Kinesis Data Firehose can convert incoming JSON records into Apache Parquet, but it
does not support converting CSV records directly into Parquet.
The Amazon Kinesis Client Library (KCL) already abstracts common tasks like creating
streams, resharding, and putting and getting records, so you can focus solely on writing
your record-processing logic.
Amazon Kinesis Data Analytics
Allows you to process and analyze streaming data using standard SQL.
Makes it easy to collect, process, and analyze real-time streaming data.
Kinesis can ingest real-time data such as video, audio, application logs,
website clickstreams, and IoT telemetry data for machine learning,
analytics, and other applications.
You can configure Amazon Kinesis Data Analytics applications to transform data before it
is processed by your SQL code. This feature allows you to use AWS Lambda to convert
formats, enrich data, filter data, and more. Once the data is transformed by your
function, Kinesis Analytics sends the data to your application’s SQL code for real-
time analytics.
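A minimal sketch of such a Lambda preprocessor is shown below. It assumes the Kinesis Data Analytics preprocessing contract, in which each record carries a base64-encoded data payload and must be returned with its recordId and a result of Ok, Dropped, or ProcessingFailed; the enrichment logic itself is purely illustrative.

```python
import base64
import json

def lambda_handler(event, context):
    """Transform records before Kinesis Data Analytics runs its SQL over them."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Illustrative enrichment: add a derived field before analysis.
            payload["is_high_value"] = payload.get("amount", 0) > 100
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(json.dumps(payload).encode()).decode(),
            })
        except Exception:
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```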
Cheat Sheet: https://tutorialsdojo.com/amazon-kinesis/
The Amazon Kinesis Data Analytics RANDOM_CUT_FOREST function detects
anomalies in your data stream.
The HOTSPOTS function, in contrast, only detects relatively dense regions in your data.
Amazon SageMaker requires more operational management than Amazon Kinesis
Data Analytics.
Kinesis Data Streams can’t be used to transform data on the fly and store the
output data to Amazon S3.
AWS Glue is a serverless data integration service that makes it easy to discover,
prepare, and combine data for analytics, machine learning, and application
development. AWS Glue provides all of the capabilities needed for data integration
so that you can start analyzing your data and putting it to use in minutes instead of
months.
A job is the business logic that performs the extract, transform, and load (ETL)
work in AWS Glue. When you start a job, AWS Glue runs a script that extracts data
from sources, transforms the data, and loads it into targets. You can create jobs in
the ETL section of the AWS Glue console.
CloudWatch
CloudWatch Alarms only allow you to watch CloudWatch metrics and to receive
notifications when the metrics fall outside of the levels that you configure. Since
we need an event-based solution whenever a crawler run completes, we must use
CloudWatch Events.
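For example, an event rule along the following lines could react to crawler completion. This is a hedged boto3 sketch: the "Glue Crawler State Change" detail type, the state value, and the target Lambda ARN are assumptions to verify against the Glue event documentation and your own account.

```python
import json
import boto3

events = boto3.client("events")

# Rule that fires when a Glue crawler run completes successfully.
# Detail-type and state values should be checked against the Glue event docs.
events.put_rule(
    Name="glue-crawler-succeeded",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Crawler State Change"],
        "detail": {"state": ["Succeeded"]},
    }),
)

# Route matching events to a downstream target (hypothetical Lambda ARN).
events.put_targets(
    Rule="glue-crawler-succeeded",
    Targets=[{
        "Id": "notify",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:on-crawler-done",
    }],
)
```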
Amazon Rekognition Content Moderation enables you to streamline or automate
your image and video moderation workflows using machine learning. Using fully
managed image and video moderation APIs, you can proactively detect
inappropriate, unwanted, or offensive content containing nudity, suggestiveness,
violence, and other such categories.
Polly
Amazon Polly is a text-to-speech service.
With Amazon Polly’s custom lexicons or vocabularies, you can modify the
pronunciation of particular words, such as company names, acronyms,
foreign words, and neologisms (e.g., “ROTFL”, “C’est la vie” when spoken
in a non-French voice). To customize these pronunciations, you upload
an XML file with lexical entries.
A viseme speech mark is a feature used to synchronize speech with
facial animation (lip-syncing) or to highlight written words as they’re
spoken (a short boto3 sketch follows these Polly notes).
The option that says: Convert the scripts into Speech Synthesis
Markup Language (SSML) and use the pronunciation tag is incorrect
because Amazon Polly does not support this SSML tag.
The option that says: Convert the documents into Speech Synthesis
Markup Language (SSML) and use the emphasis tag to guide the
pronunciation is incorrect as this type of tag is simply used to
emphasize words by changing the speaking rate and volume of the
speech.
Amazon Lex is incorrect. This is just a service that you can use for building
conversational interfaces or chatbots.
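Tying back to the viseme note above, here is a minimal boto3 sketch that requests viseme speech marks from Amazon Polly; the voice and sample text are arbitrary.

```python
import boto3

polly = boto3.client("polly")

# Speech marks are returned as newline-delimited JSON, so OutputFormat must be "json".
response = polly.synthesize_speech(
    Text="Hello from Amazon Polly",
    VoiceId="Joanna",
    OutputFormat="json",
    SpeechMarkTypes=["viseme"],
)

# Each line describes a viseme and the time (in ms) at which it occurs,
# which is what drives lip-syncing in an animation pipeline.
for line in response["AudioStream"].read().decode().splitlines():
    print(line)
```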
Comprehend:
Custom entity recognition extends the capability of Amazon
Comprehend by enabling you to identify new entity types not supported
as one of the preset generic entity types. This means that in addition to
identifying entity types such as LOCATION, DATE, PERSON, and so on,
you can analyze documents and extract entities like product codes or
business-specific entities that fit your particular needs.
Use regular expressions to determine the entities is incorrect.
Although this is possible, it isn’t as effective as creating a Custom Entity
Recognition model.
The option that says: Use Topic Modelling to determine entities is
incorrect because this is specifically used for determining themes/topics
from a collection of documents. Take note that we only need to identify
entities from a list of words.
The option that says: Create a list for each product and use string
matching to determine their entities is incorrect. Like regular
expressions, it would be difficult to match all possible patterns with string
matching. This would produce less accurate results than when using a
Custom Entity Recognition model.
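As a point of reference, the sketch below calls Amazon Comprehend's built-in entity detection with boto3; a trained custom entity recognizer is invoked in a similar way but against the endpoint you create for it. The sample text is illustrative.

```python
import boto3

comprehend = boto3.client("comprehend")

# Built-in entity types (PERSON, LOCATION, DATE, ...) need only text and a language code.
response = comprehend.detect_entities(
    Text="Order PX-1138 was shipped from Seattle on March 3rd.",
    LanguageCode="en",
)

for entity in response["Entities"]:
    # A custom entity recognizer would surface business-specific types
    # such as product codes instead of the generic ones printed here.
    print(entity["Type"], entity["Text"], round(entity["Score"], 2))
```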
Algorithms
The k-NN algorithm is used for classification or regression problems. It has the
issue of giving more importance to features with larger numerical values.
small k -> low bias, high variance
large k -> high bias, low variance
If the line chart looks like an “arm”, then the “elbow” (the point of
inflection on the curve) is the best value of k. The “arm” can be either
up or down, but if there is a strong inflection point, it is a good
indication that the underlying model fits best at that point.
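The elbow heuristic described above is most often applied when choosing k for k-means clustering; a minimal scikit-learn sketch (synthetic data, illustrative k range) looks like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Inertia (within-cluster sum of squares) for each candidate k;
# plotting these values produces the "arm", and the bend is the elbow.
inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

print(np.round(inertias, 1))
```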
Normalization of numeric variables can help the learning process if there
are very large range differences between numeric variables because
variables with the highest magnitude could dominate the ML model, no
matter if the feature is informative with respect to the target or not.
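A short scikit-learn sketch of that idea: scaling the features before a distance-based model such as k-NN prevents the largest-magnitude variable from dominating. The data is synthetic and purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X[:, 0] *= 1_000  # exaggerate one feature's magnitude

# Without scaling, the inflated feature dominates the distance computation.
raw = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("raw   :", cross_val_score(raw, X, y, cv=5).mean())
print("scaled:", cross_val_score(scaled, X, y, cv=5).mean())
```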
The Amazon SageMaker BlazingText algorithm provides highly optimized
implementations of the Word2vec and text classification algorithms.
ARIMA and ETS are both valid forecasting methods; however, Amazon
SageMaker DeepAR will most likely outperform them in terms of
forecast accuracy.
Label encoding is incorrect because this type of encoding will only
convert categorical data into integer labels (e.g. 0,1,2,3,4) and not into a
vector of binary values (e.g. [1,0,0], [0,1,0]).
Target encoding is incorrect as this type of encoding is achieved by
replacing categorical variables with just one new numerical variable and
replacing each category of the categorical variable with its corresponding
probability of the target. This won’t convert categorical variables into
binary values.
Tokenization is incorrect because this method is commonly used in
Natural Language Processing (NLP) where you split a string into a list of
words that have a semantic meaning.
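The contrast between those encodings is easy to see in code. A small pandas/scikit-learn sketch with a toy category column shows integer labels versus a vector of binary values (one-hot encoding):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Label encoding: each category becomes a single integer (0, 1, 2, ...).
print(LabelEncoder().fit_transform(colors))

# One-hot encoding: each category becomes a vector of binary values
# (e.g. [1, 0, 0]), which is what the requirement above describes.
print(pd.get_dummies(colors).astype(int))
```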
Perform t- distributed Stochastic Neighbor Embedding (t-SNE) on
image data to reduce highly correlated features is incorrect because
this algorithm is just used to preprocess a dataset that contains highly
correlated variables.
Matrix multiplication is incorrect. It may reduce the overall size of the
dataset, but it won’t help at all in reducing highly correlated features. In
Linear Algebra, matrix multiplication is a binary operation that produces a
matrix from two matrices or in other words, it multiplies two matrices that
are usually in array form. In Machine Learning, matrix multiplication is a
compute-intensive operation used to process sparse or scattered data
produced by the training model.
Factorization Machines with a binary_classifier predictor type is
incorrect as this will not output numerical values needed to accomplish
the task.
Principal component analysis (PCA) is incorrect because PCA is just
used to reduce the dimensionality (number of features) within a dataset.
Logistic Regression is incorrect as this type of regression only predicts
a binary output such as “0” or “1”.
Feature Engineering:
The Multiple Imputations by Chained Equations (MICE) algorithm is a
robust, informative method of dealing with missing data in your datasets.
This procedure imputes or ‘fills in’ the missing data in a dataset through
an iterative series of predictive models. Each specified variable in the
dataset is imputed in each iteration using the other variables in the
dataset. These iterations will be run continuously until convergence has
been met. In general, MICE is a better imputation method than naive
approaches (filling missing values with 0, dropping columns).
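scikit-learn's IterativeImputer implements this chained-equations idea and can stand in as a minimal sketch of MICE-style imputation, shown here on toy data with missing values:

```python
import numpy as np
# IterativeImputer is still behind an experimental flag in scikit-learn.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 6.0, 9.0],
    [7.0, 8.0, 12.0],
])

# Each column with missing values is modeled from the other columns,
# iterating until the imputations stabilize.
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```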
Principal Component Analysis (PCA) is a popular technique used by
data scientists primarily for dimensionality reduction in numerous
applications ranging from stock market prediction to medical image
classification. Other uses of PCA include de-noising and feature
extraction. PCA is also used as an exploratory data analysis tool.
The t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-
linear dimensionality reduction algorithm used for exploring high-
dimensional data. PCA and t-SNE are both valid dimensionality reduction
techniques that you can use.
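A compact scikit-learn sketch of both reducers on the same dataset (the digits data is used purely for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

# PCA: linear projection onto the directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: non-linear embedding that preserves local neighborhood structure.
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

print(X_pca.shape, X_tsne.shape)
```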
Metrics
A correlation coefficient tells you how strong, or how weak, the
relationship is between two sets of data. Also called the cross-correlation
coefficient, Pearson correlation coefficient (PCC), or the Pearson
product-moment correlation coefficient (PPMCC).
residuals for regression problems. A residual for an observation in the
evaluation data is the difference between the true target and the
predicted target. Residuals represent the portion of the target that the
model is unable to predict.
In Amazon ML, the macro average F1-measure is used to evaluate the
predictive success of a multiclass classifier.
A confusion matrix is mainly used for evaluating the model’s
performance. It won’t help you identify overestimation/underestimation.
A correlation matrix shows the correlation coefficient between variables,
so you can get an idea of how close the predicted values are to the
true values. This won’t help you gain insight into the underestimation/
overestimation of the target value.
Root Mean Square Error (RMSE) is incorrect because this is specifically
used for measuring the accuracy of the ML model. RMSE is a distance
measure between the predicted numeric target and the actual numeric
answer (ground truth). The smaller the value of the RMSE, the better
the predictive accuracy of the model.
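To tie these metric notes together, here is a brief scikit-learn/NumPy sketch with made-up predictions:

```python
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error

# Regression example: residuals, RMSE, and the correlation coefficient.
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])
residuals = y_true - y_pred                       # the part the model failed to predict
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
corr = np.corrcoef(y_true, y_pred)[0, 1]          # Pearson correlation coefficient
print(residuals, rmse, corr)

# Multiclass example: the macro average F1 treats every class equally.
labels_true = [0, 1, 2, 2, 1, 0]
labels_pred = [0, 2, 2, 2, 1, 0]
print(f1_score(labels_true, labels_pred, average="macro"))
```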
Amazon Forecast provides several filling methods to help handle missing values
in your datasets. During backtesting, Forecast assumes the filled values (barring
NaNs) to be true values and uses them in evaluating metrics. Forecast supports
the following filling methods:
Middle filling – Fills any missing values between the item start and item end
date of a data set.
Back filling – Fills any missing values between the last recorded data point and
the global end date of a dataset.
Future filling (related time series only) – Fills any missing values between the
global end date and the end of the forecast horizon.
Amazon SageMaker Pipe mode is necessary to stream the dataset directly from Amazon S3.
Most Amazon SageMaker algorithms work best when you use the optimized
protobuf recordIO data format for training.
Using this format allows you to take advantage of Pipe mode. In Pipe mode, your
training job streams data directly from Amazon Simple Storage Service (Amazon
S3). Streaming can provide faster start times for training jobs and better
throughput.
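A hedged sketch of that workflow with the SageMaker Python SDK is shown below. It assumes the SDK's write_numpy_to_dense_tensor helper, and the image URI, role, bucket, and key are placeholders, so treat it as an outline rather than a drop-in script.

```python
import io

import boto3
import numpy as np
from sagemaker.amazon.common import write_numpy_to_dense_tensor
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# 1) Serialize a NumPy training set into protobuf RecordIO (assumed SDK helper).
features = np.random.rand(1000, 10).astype("float32")
labels = np.random.randint(0, 2, size=1000).astype("float32")
buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)

# 2) Stage the RecordIO file in S3 (bucket and key are placeholders).
boto3.client("s3").upload_fileobj(buf, "my-training-bucket", "train/data.rec")

# 3) Train with Pipe mode so records stream straight from S3 to the container.
estimator = Estimator(
    image_uri="<built-in-algorithm-image-uri>",  # placeholder
    role="<execution-role-arn>",                 # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",
)
estimator.fit({
    "train": TrainingInput(
        "s3://my-training-bucket/train/data.rec",
        content_type="application/x-recordio-protobuf",
    )
})
```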
DeepAR forecasting algorithm is a supervised learning algorithm for forecasting
scalar (one-dimensional) time series using recurrent neural networks (RNN).
Training machine learning models requires providing the training datasets to the
training job. When using Amazon S3 as the training data source in File input mode,
all training data have to be downloaded from Amazon S3 to the EBS volumes
attached to the training instances at the start of the training job. A distributed file
system such as Amazon FSx for Lustre or EFS can speed up machine learning
training by eliminating the need for this download step.
If your training data is already in Amazon S3, and your needs do not dictate faster
training times for your training jobs, you can get started with Amazon SageMaker
with no need for data movement. However, if you need faster startup and training
times, we recommend that you take advantage of the Amazon FSx for Lustre file
system, which is natively integrated with Amazon S3.
Amazon FSx for Lustre speeds up your training jobs by serving your Amazon S3
data to Amazon SageMaker at high speeds. The first time you run a training job,
Amazon FSx for Lustre automatically copies data from Amazon S3 and makes it
available to Amazon SageMaker. Additionally, the same Amazon FSx file system
can be used for subsequent iterations of training jobs on Amazon SageMaker,
preventing repeated downloads of common Amazon S3 objects. Because of this,
Amazon FSx has the most benefit to training jobs that have training sets in
Amazon S3 and in workflows where training jobs must be run several times using
different training algorithms or parameters to see which gives the best result.
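Below is a hedged sketch of pointing a training job at an FSx for Lustre file system with the SageMaker Python SDK; the FileSystemInput parameters shown here, along with the IDs, paths, and image URI, are assumptions to verify against the SDK documentation.

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import FileSystemInput

# File system channel instead of an S3 channel (IDs and paths are placeholders).
train_fs = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",
    file_system_type="FSxLustre",
    directory_path="/fsx/training-data",
    file_system_access_mode="ro",
)

estimator = Estimator(
    image_uri="<training-image-uri>",   # placeholder
    role="<execution-role-arn>",        # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    subnets=["subnet-0123456789abcdef0"],          # FSx requires VPC networking
    security_group_ids=["sg-0123456789abcdef0"],
)
estimator.fit({"train": train_fs})
```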
In Amazon SageMaker, if you plan to use GPU devices for model training, make
sure that your containers are nvidia-docker compatible. Only the CUDA toolkit
should be included on containers; don’t bundle NVIDIA drivers with the image.
Amazon SageMaker supports local training using the pre-built TensorFlow and
MXNet containers. Amazon SageMaker allows you to pull the containers from
SageMaker-specific environments into your working environment.
Amazon SageMaker Autopilot is a feature of Amazon SageMaker that allows you to
automatically train and tune machine learning models with minimal setup and no
machine learning expertise required. Autopilot automatically generates pipelines,
trains, and tunes the best ML models for classification or regression tasks on
tabular data while allowing you to maintain full control and visibility.
Amazon Redshift ML makes it easy for data analysts and database developers to
create, train, and apply machine learning models using familiar SQL commands in
Amazon Redshift data warehouses. With Redshift ML, you can take advantage
of Amazon SageMaker without learning new tools or languages. Simply use SQL
statements to create and train Amazon SageMaker machine learning models using
your Redshift data and then use these models to make predictions.
Redshift Streaming ingestion is a feature for consuming data directly from
streaming sources such as Amazon Kinesis Data Streams and Amazon Managed
Streaming for Apache Kafka. It cannot be used to send inference requests to a
SageMaker instance.