AWS Database Migration Service (AWS DMS) is a cloud service that makes it
easy to migrate relational databases, data warehouses, NoSQL databases, and
other types of data stores. You can use AWS DMS to migrate your data into the
AWS Cloud, between on-premises instances (through an AWS Cloud setup), or
between combinations of cloud and on-premises setups. With AWS DMS, you can
perform one-time migrations, and you can replicate ongoing changes to keep
sources and targets in sync.
You can migrate data to Amazon S3 using AWS DMS from any of the supported
database sources. When using Amazon S3 as a target in an AWS DMS task, both
full load and change data capture (CDC) data are written in comma-separated value
(.csv) format by default.
The comma-separated value (.csv) format is the default storage format for
Amazon S3 target objects. For more compact storage and faster queries, you can
instead use Apache Parquet (.parquet) as the storage format. Apache Parquet is
an open-source file storage format originally designed for Hadoop.
Amazon Athena supports a wide variety of data formats such as CSV, TSV, JSON, and
plain text files, and also supports open-source columnar formats such as Apache ORC
and Apache Parquet. Athena also supports compressed data in Snappy, Zlib, LZO,
and GZIP formats. By compressing, partitioning, and using columnar formats, you
can improve performance and reduce your costs.
Parquet and ORC file formats both support predicate pushdown (also called
predicate filtering). Parquet and ORC both have blocks of data that represent
column values. Each block holds statistics for the block, such as max/min values.
When a query is being executed, these statistics determine whether the block
should be read or skipped.
Athena charges you by the amount of data scanned per query. You can save on
costs and get better performance if you partition the data, compress data, or
convert it to columnar formats such as Apache Parquet.
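As a quick illustration of that storage-format point, here is a minimal Python sketch (file names are placeholders; it assumes pandas with pyarrow installed) that converts a .csv extract into Snappy-compressed Parquet before Athena queries it:

```python
import pandas as pd

# Read the raw CSV extract (placeholder file name).
df = pd.read_csv("dms_full_load_extract.csv")

# Write a columnar, Snappy-compressed Parquet file.
# Athena scans far fewer bytes against this layout, which lowers query cost.
df.to_parquet("dms_full_load_extract.parquet", engine="pyarrow", compression="snappy")
```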
Amazon Augmented AI (Amazon A2I) enables you to build the workflows that are
required for human review of machine learning predictions. Amazon Textract is
directly integrated with Amazon A2I so that you can easily get low-confidence
results from Amazon Textract’s AnalyzeDocument API operation reviewed by
humans.
Root-mean-square error (RMSE) is incorrect because this metric is not suitable
for evaluating classification models. This performance metric is mostly used for
regression models.
Amazon Kinesis Data Firehose is a fully managed service for delivering real-
time streaming data to destinations such as Amazon Simple Storage Service
(Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES),
Splunk, and any custom HTTP endpoint or HTTP endpoints owned by supported
third-party service providers, including Datadog, MongoDB, and New Relic.
Kinesis Data Firehose can convert incoming JSON records into Apache Parquet, but it
does not support converting CSV records directly into Parquet.
The Amazon Kinesis Client Library (KCL) already abstracts common tasks like creating
streams, resharding, and putting and getting records, so you can focus solely on writing
your record-processing logic.
Amazon Kinesis Data Analytics
Allows you to process and analyze streaming data using standard SQL.
Makes it easy to collect, process, and analyze real-time streaming data.
Kinesis can ingest real-time data such as video, audio, application logs,
website clickstreams, and IoT telemetry data for machine learning,
analytics, and other applications.
You can configure Amazon Kinesis Data Analytics applications to transform data before it
is processed by your SQL code. This feature allows you to use AWS Lambda to convert
formats, enrich data, filter data, and more. Once the data is transformed by your
function, Kinesis Analytics sends the data to your application’s SQL code for real-
time analytics.
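A minimal sketch of such a Lambda preprocessor is shown below. It assumes the Kinesis Data Analytics preprocessing contract, in which each record carries a base64-encoded data payload and must be returned with its recordId and a result of Ok, Dropped, or ProcessingFailed; the enrichment logic itself is purely illustrative.

```python
import base64
import json

def lambda_handler(event, context):
    """Transform records before Kinesis Data Analytics runs its SQL over them."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Illustrative enrichment: add a derived field before analysis.
            payload["is_high_value"] = payload.get("amount", 0) > 100
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(json.dumps(payload).encode()).decode(),
            })
        except Exception:
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```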
Cheat Sheet: https://tutorialsdojo.com/amazon-kinesis/
The Amazon Kinesis Data Analytics RANDOM_CUT_FOREST function detects
anomalies in your data stream.
The HOTSPOTS function, in contrast, only detects relatively dense regions in your data.
Amazon SageMaker requires more operational management than Amazon Kinesis
Data Analytics.
Kinesis Data Streams can’t be used to transform data on the fly and store the
output data to Amazon S3.
AWS Glue is a serverless data integration service that makes it easy to discover,
prepare, and combine data for analytics, machine learning, and application
development. AWS Glue provides all of the capabilities needed for data integration
so that you can start analyzing your data and putting it to use in minutes instead of
months.
A job is the business logic that performs the extract, transform, and load (ETL)
work in AWS Glue. When you start a job, AWS Glue runs a script that extracts data
from sources, transforms the data, and loads it into targets. You can create jobs in
the ETL section of the AWS Glue console.
CloudWatch
CloudWatch Alarms only allow you to watch CloudWatch metrics and to receive
notifications when the metrics fall outside of the levels that you configure. Since
we need an event-based solution whenever a crawler run completes, we must use
CloudWatch Events.
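For example, an event rule along the following lines could react to crawler completion. This is a hedged boto3 sketch: the "Glue Crawler State Change" detail type, the state value, and the target Lambda ARN are assumptions to verify against the Glue event documentation and your own account.

```python
import json
import boto3

events = boto3.client("events")

# Rule that fires when a Glue crawler run completes successfully.
# Detail-type and state values should be checked against the Glue event docs.
events.put_rule(
    Name="glue-crawler-succeeded",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Crawler State Change"],
        "detail": {"state": ["Succeeded"]},
    }),
)

# Route matching events to a downstream target (hypothetical Lambda ARN).
events.put_targets(
    Rule="glue-crawler-succeeded",
    Targets=[{
        "Id": "notify",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:on-crawler-done",
    }],
)
```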
Amazon Rekognition Content Moderation enables you to streamline or automate
your image and video moderation workflows using machine learning. Using fully
managed image and video moderation APIs, you can proactively detect
inappropriate, unwanted, or offensive content containing nudity, suggestiveness,
violence, and other such categories.
Polly
Amazon Polly is a text-to-speech service.
With Amazon Polly’s custom lexicons or vocabularies, you can modify the
pronunciation of particular words, such as company names, acronyms,
foreign words, and neologisms (e.g., “ROTFL”, “C’est la vie” when spoken
in a non-French voice). To customize these pronunciations, you upload
an XML file with lexical entries.
A viseme speech mark is a feature used to synchronize speech with
facial animation (lip-syncing) or to highlight written words as they’re
spoken (a short boto3 sketch follows these Polly notes).
The option that says: Convert the scripts into Speech Synthesis
Markup Language (SSML) and use the pronunciation tag is incorrect
because Amazon Polly does not support this SSML tag.
The option that says: Convert the documents into Speech Synthesis
Markup Language (SSML) and use the emphasis tag to guide the
pronunciation is incorrect as this type of tag is simply used to
emphasize words by changing the speaking rate and volume of the
speech.
Amazon Lex is incorrect. This is just a service that you can use for building
conversational interfaces or chatbots.
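Tying back to the viseme note above, here is a minimal boto3 sketch that requests viseme speech marks from Amazon Polly; the voice and sample text are arbitrary.

```python
import boto3

polly = boto3.client("polly")

# Speech marks are returned as newline-delimited JSON, so OutputFormat must be "json".
response = polly.synthesize_speech(
    Text="Hello from Amazon Polly",
    VoiceId="Joanna",
    OutputFormat="json",
    SpeechMarkTypes=["viseme"],
)

# Each line describes a viseme and the time (in ms) at which it occurs,
# which is what drives lip-syncing in an animation pipeline.
for line in response["AudioStream"].read().decode().splitlines():
    print(line)
```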
Comprehend:
Custom entity recognition extends the capability of Amazon
Comprehend by enabling you to identify new entity types not supported
as one of the preset generic entity types. This means that in addition to
identifying entity types such as LOCATION, DATE, PERSON, and so on,
you can analyze documents and extract entities like product codes or
business-specific entities that fit your particular needs.
Use regular expressions to determine the entities is incorrect.
Although this is possible, it isn’t as effective as creating a Custom Entity
Recognition model.
The option that says: Use Topic Modelling to determine entities is
incorrect because this is specifically used for determining themes/topics
from a collection of documents. Take note that we only need to identify
entities from a list of words.
The option that says: Create a list for each product and use string
matching to determine their entities is incorrect. Like regular
expressions, it would be difficult to match all possible patterns with string
matching. This would produce less accurate results than when using a
Custom Entity Recognition model.
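As a point of reference, the sketch below calls Amazon Comprehend's built-in entity detection with boto3; a trained custom entity recognizer is invoked in a similar way but against the endpoint you create for it. The sample text is illustrative.

```python
import boto3

comprehend = boto3.client("comprehend")

# Built-in entity types (PERSON, LOCATION, DATE, ...) need only text and a language code.
response = comprehend.detect_entities(
    Text="Order PX-1138 was shipped from Seattle on March 3rd.",
    LanguageCode="en",
)

for entity in response["Entities"]:
    # A custom entity recognizer would surface business-specific types
    # such as product codes instead of the generic ones printed here.
    print(entity["Type"], entity["Text"], round(entity["Score"], 2))
```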
Algorithms
The k-NN algorithm is used for classification or regression problems. It has the
issue of giving more importance to features with larger numerical values.
small k -> low bias, high variance
large k -> high bias, low variance
If the line chart looks like an “arm”, then the “elbow” (the point of
inflection on the curve) is the best value of k. The “arm” can be either
up or down, but if there is a strong inflection point, it is a good
indication that the underlying model fits best at that point.
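The elbow heuristic described above is most often applied when choosing k for k-means clustering; a minimal scikit-learn sketch (synthetic data, illustrative k range) looks like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Inertia (within-cluster sum of squares) for each candidate k;
# plotting these values produces the "arm", and the bend is the elbow.
inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

print(np.round(inertias, 1))
```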
Normalization of numeric variables can help the learning process if there
are very large range differences between numeric variables because
variables with the highest magnitude could dominate the ML model, no
matter if the feature is informative with respect to the target or not.
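A short scikit-learn sketch of that idea: scaling the features before a distance-based model such as k-NN prevents the largest-magnitude variable from dominating. The data is synthetic and purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X[:, 0] *= 1_000  # exaggerate one feature's magnitude

# Without scaling, the inflated feature dominates the distance computation.
raw = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("raw   :", cross_val_score(raw, X, y, cv=5).mean())
print("scaled:", cross_val_score(scaled, X, y, cv=5).mean())
```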
The Amazon SageMaker BlazingText algorithm provides highly optimized
implementations of the Word2vec and text classification algorithms.
ARIMA and ETS are both valid forecasting methods; however, Amazon
SageMaker DeepAR will most likely outperform them in terms of
forecast accuracy.
Label encoding is incorrect because this type of encoding will only
convert categorical data into integer labels (e.g. 0,1,2,3,4) and not into a
vector of binary values (e.g. [1,0,0], [0,1,0]).
Target encoding is incorrect as this type of encoding is achieved by
replacing categorical variables with just one new numerical variable and
replacing each category of the categorical variable with its corresponding
probability of the target. This won’t convert categorical variables into
binary values.
Tokenization is incorrect because this method is commonly used in
Natural Language Processing (NLP) where you split a string into a list of
words that have a semantic meaning.
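The contrast between those encodings is easy to see in code. A small pandas/scikit-learn sketch with a toy category column shows integer labels versus a vector of binary values (one-hot encoding):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Label encoding: each category becomes a single integer (0, 1, 2, ...).
print(LabelEncoder().fit_transform(colors))

# One-hot encoding: each category becomes a vector of binary values
# (e.g. [1, 0, 0]), which is what the requirement above describes.
print(pd.get_dummies(colors).astype(int))
```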
Perform t- distributed Stochastic Neighbor Embedding (t-SNE) on
image data to reduce highly correlated features is incorrect because
this algorithm is just used to preprocess a dataset that contains highly
correlated variables.
Matrix multiplication is incorrect. It may reduce the overall size of the
dataset, but it won’t help at all in reducing highly correlated features. In
Linear Algebra, matrix multiplication is a binary operation that produces a
matrix from two matrices or in other words, it multiplies two matrices that
are usually in array form. In Machine Learning, matrix multiplication is a
compute-intensive operation used to process sparse or scattered data
produced by the training model.
Factorization Machines with a binary_classifier predictor type is
incorrect as this will not output numerical values needed to accomplish
the task.
Principal component analysis (PCA) is incorrect because PCA is just
used to reduce the dimensionality (number of features) within a dataset.
Logistic Regression is incorrect as this type of regression only predicts
a binary output such as “0” or “1”.
Feature Engineering:
The Multiple Imputations by Chained Equations (MICE) algorithm is a
robust, informative method of dealing with missing data in your datasets.
This procedure imputes or ‘fills in’ the missing data in a dataset through
an iterative series of predictive models. Each specified variable in the
dataset is imputed in each iteration using the other variables in the
dataset. These iterations will be run continuously until convergence has
been met. In general, MICE is a better imputation method than naive
approaches (filling missing values with 0, dropping columns).
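scikit-learn's IterativeImputer implements this chained-equations idea and can stand in as a minimal sketch of MICE-style imputation, shown here on toy data with missing values:

```python
import numpy as np
# IterativeImputer is still behind an experimental flag in scikit-learn.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 6.0, 9.0],
    [7.0, 8.0, 12.0],
])

# Each column with missing values is modeled from the other columns,
# iterating until the imputations stabilize.
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```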
Principal Component Analysis (PCA) is a popular technique used by
data scientists primarily for dimensionality reduction in numerous
applications ranging from stock market prediction to medical image
classification. Other uses of PCA include de-noising and feature
extraction. PCA is also used as an exploratory data analysis tool.
The t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-
linear dimensionality reduction algorithm used for exploring high-
dimensional data. PCA and t-SNE are both valid dimensionality reduction
techniques that you can use.
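A compact scikit-learn sketch of both reducers on the same dataset (the digits data is used purely for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

# PCA: linear projection onto the directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: non-linear embedding that preserves local neighborhood structure.
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

print(X_pca.shape, X_tsne.shape)
```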
Metrics
A correlation coefficient tells you how strong, or how weak, the
relationship is between two sets of data. Also called the cross-correlation
coefficient, Pearson correlation coefficient (PCC), or the Pearson
product-moment correlation coefficient (PPMCC).
residuals for regression problems. A residual for an observation in the
evaluation data is the difference between the true target and the
predicted target. Residuals represent the portion of the target that the
model is unable to predict.
In Amazon ML, the macro average F1-measure is used to evaluate the
predictive success of a multiclass classifier.
A confusion matrix is mainly used for evaluating the model’s
performance. It won’t help you identify overestimation/underestimation.
A correlation matrix shows the correlation coefficient between variables,
so you can get an idea of how close the predicted values are to the
true values. This won’t help you gain insight into the underestimation/
overestimation of the target value.
Root Mean Square Error (RMSE) is incorrect because this is specifically
used for measuring the accuracy of the ML model. RMSE is a distance
measure between the predicted numeric target and the actual numeric
answer (ground truth). The smaller the value of the RMSE, the better
the predictive accuracy of the model.
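To tie these metric notes together, here is a brief scikit-learn/NumPy sketch with made-up predictions:

```python
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error

# Regression example: residuals, RMSE, and the correlation coefficient.
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])
residuals = y_true - y_pred                       # the part the model failed to predict
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
corr = np.corrcoef(y_true, y_pred)[0, 1]          # Pearson correlation coefficient
print(residuals, rmse, corr)

# Multiclass example: the macro average F1 treats every class equally.
labels_true = [0, 1, 2, 2, 1, 0]
labels_pred = [0, 2, 2, 2, 1, 0]
print(f1_score(labels_true, labels_pred, average="macro"))
```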
Amazon Forecast provides several filling methods to help handle missing values
in your datasets. During backtesting, Forecast assumes the filled values (barring
NaNs) to be true values and uses them in evaluating metrics. Forecast supports
the following filling methods:
Middle filling – Fills any missing values between the item start and item end
date of a data set.
Back filling – Fills any missing values between the last recorded data point and
the global end date of a dataset.
Future filling (related time series only) – Fills any missing values between the
global end date and the end of the forecast horizon.
Amazon SageMaker Pipe mode is necessary to stream the dataset directly from Amazon S3.
Most Amazon SageMaker algorithms work best when you use the optimized
protobuf recordIO data format for training.
Using this format allows you to take advantage of Pipe mode. In Pipe mode, your
training job streams data directly from Amazon Simple Storage Service (Amazon
S3). Streaming can provide faster start times for training jobs and better
throughput.
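A hedged sketch of that workflow with the SageMaker Python SDK is shown below. It assumes the SDK's write_numpy_to_dense_tensor helper, and the image URI, role, bucket, and key are placeholders, so treat it as an outline rather than a drop-in script.

```python
import io

import boto3
import numpy as np
from sagemaker.amazon.common import write_numpy_to_dense_tensor
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# 1) Serialize a NumPy training set into protobuf RecordIO (assumed SDK helper).
features = np.random.rand(1000, 10).astype("float32")
labels = np.random.randint(0, 2, size=1000).astype("float32")
buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)

# 2) Stage the RecordIO file in S3 (bucket and key are placeholders).
boto3.client("s3").upload_fileobj(buf, "my-training-bucket", "train/data.rec")

# 3) Train with Pipe mode so records stream straight from S3 to the container.
estimator = Estimator(
    image_uri="<built-in-algorithm-image-uri>",  # placeholder
    role="<execution-role-arn>",                 # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",
)
estimator.fit({
    "train": TrainingInput(
        "s3://my-training-bucket/train/data.rec",
        content_type="application/x-recordio-protobuf",
    )
})
```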
DeepAR forecasting algorithm is a supervised learning algorithm for forecasting
scalar (one-dimensional) time series using recurrent neural networks (RNN).
Training machine learning models requires providing the training datasets to the
training job. When using Amazon S3 as the training data source in File input mode,
all training data have to be downloaded from Amazon S3 to the EBS volumes
attached to the training instances at the start of the training job. A distributed file
system such as Amazon FSx for Lustre or EFS can speed up machine learning
training by eliminating the need for this download step.
If your training data is already in Amazon S3, and your needs do not dictate faster
training times for your training jobs, you can get started with Amazon SageMaker
with no need for data movement. However, if you need faster startup and training
times, we recommend that you take advantage of the Amazon FSx for Lustre file
system, which is natively integrated with Amazon S3.
Amazon FSx for Lustre speeds up your training jobs by serving your Amazon S3
data to Amazon SageMaker at high speeds. The first time you run a training job,
Amazon FSx for Lustre automatically copies data from Amazon S3 and makes it
available to Amazon SageMaker. Additionally, the same Amazon FSx file system
can be used for subsequent iterations of training jobs on Amazon SageMaker,
preventing repeated downloads of common Amazon S3 objects. Because of this,
Amazon FSx has the most benefit to training jobs that have training sets in
Amazon S3 and in workflows where training jobs must be run several times using
different training algorithms or parameters to see which gives the best result.
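Below is a hedged sketch of pointing a training job at an FSx for Lustre file system with the SageMaker Python SDK; the FileSystemInput parameters shown here, along with the IDs, paths, and image URI, are assumptions to verify against the SDK documentation.

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import FileSystemInput

# File system channel instead of an S3 channel (IDs and paths are placeholders).
train_fs = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",
    file_system_type="FSxLustre",
    directory_path="/fsx/training-data",
    file_system_access_mode="ro",
)

estimator = Estimator(
    image_uri="<training-image-uri>",   # placeholder
    role="<execution-role-arn>",        # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    subnets=["subnet-0123456789abcdef0"],          # FSx requires VPC networking
    security_group_ids=["sg-0123456789abcdef0"],
)
estimator.fit({"train": train_fs})
```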
In Amazon SageMaker, if you plan to use GPU devices for model training, make
sure that your containers are nvidia-docker compatible. Only the CUDA toolkit
should be included on containers; don’t bundle NVIDIA drivers with the image.
Amazon SageMaker supports local training using the pre-built TensorFlow and
MXNet containers. Amazon SageMaker allows you to pull the containers from
SageMaker-specific environments into your working environment.
Amazon SageMaker Autopilot is a feature of Amazon SageMaker that allows you to
automatically train and tune machine learning models with minimal setup and no
machine learning expertise required. Autopilot automatically generates pipelines,
trains, and tunes the best ML models for classification or regression tasks on
tabular data while allowing you to maintain full control and visibility.
Amazon Redshift ML makes it easy for data analysts and database developers to
create, train, and apply machine learning models using familiar SQL commands in
Amazon Redshift data warehouses. With Redshift ML, you can take advantage
of Amazon SageMaker without learning new tools or languages. Simply use SQL
statements to create and train Amazon SageMaker machine learning models using
your Redshift data and then use these models to make predictions.
Redshift Streaming ingestion is a feature for consuming data directly from
streaming sources such as Amazon Kinesis Data Streams and Amazon Managed
Streaming for Apache Kafka. It cannot be used to send inference requests to a
SageMaker instance.