AWS ML Exam Notes - Important
Amazon Athena is an interactive query service that makes it easy to analyze data stored in Amazon S3 (or any data lake) using standard SQL. Athena is serverless. Athena can also be used for extract, transform and load (ETL) jobs for data processing.
The best solution to support both ad-hoc querying of data via SQL and sending that same data to an ML pipeline is Amazon Athena plus AWS Glue: Athena handles the ad-hoc queries and Glue handles the ETL.
RDS, S3, and DynamoDB all have the ability to take snapshots.
Both AWS DeepLens and Step Functions have AWS Lambda embedded as part of their service.
A data lake can store structured and unstructured data, can be used for analytics and ML, and can work on the data in place without data movement. It is also low-cost storage.
Descriptive statistics are a tool for identifying the central tendency and also the
measures of variability.
Box plots, histograms and density plots are all used to show shape and
distribution of data sets.
Amazon SageMaker is designed to work with Amazon S3 data and allows for easy data visualization because it includes common Python libraries.
A validation set is a third split that can reduce overfitting. It is used after the model is trained: you select the model that performs best on the validation set, and then double-check it on the test set.
Amazon SageMaker XGBoost can train on data in either CSV or LibSVM format. The label should be in the first column, and the data should not have a header row.
First, we will convert our categorical features into numeric features, then split the
data into training, validation and test sets.
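A minimal sketch of the two notes above, using a hypothetical toy DataFrame: one-hot encode the categorical feature, split into training/validation/test sets, and write CSV files in the layout SageMaker XGBoost expects (label first, no header).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: "label" is the target, "color" is a categorical feature.
df = pd.DataFrame({
    "label": [0, 1, 0, 1, 1, 0],
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "size":  [1.2, 3.4, 0.7, 2.2, 3.1, 0.9],
})

# Convert categorical features into numeric (one-hot) features.
df = pd.get_dummies(df, columns=["color"])

# Split into training, validation, and test sets (60/20/20).
train, rest = train_test_split(df, test_size=0.4, random_state=42)
validation, test = train_test_split(rest, test_size=0.5, random_state=42)

# SageMaker XGBoost CSV format: label in the first column, no header row.
cols = ["label"] + [c for c in train.columns if c != "label"]
train[cols].to_csv("train.csv", index=False, header=False)
validation[cols].to_csv("validation.csv", index=False, header=False)
```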
Early stopping is a simple technique for preventing neural networks from training too long and learning patterns in the training data that don't generalize.
Dropout regularization forces the learning to be spread out amongst the artificial
neurons, further preventing overfitting. Removing layers, rather than adding
them, might also help prevent an overly complex model from being created - as
would using fewer features, not more.
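A hedged Keras sketch of both ideas on synthetic data: a Dropout layer spreads learning across the artificial neurons, and an EarlyStopping callback halts training once validation loss stops improving.

```python
import numpy as np
from tensorflow.keras import layers, models, callbacks

# Synthetic binary-classification data, just for illustration.
X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype("float32")

model = models.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                    # dropout regularization
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop training when validation loss stops improving (early stopping).
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```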
SageMaker Neo is designed for compiling models using TensorFlow and other
frameworks to edge devices such as Nvidia Jetson. The low latency requirement
requires an edge solution, where the classification is being done within the
vehicle itself and not over the air. Rekognition (which doesn't have an "edge
mode," but does integrate with DeepLens) can't handle the very specific
classification task of identifying different street signs and what they mean.
With Pipe input mode in Amazon SageMaker, your dataset is streamed directly to
your training instances instead of being downloaded first. This means that your
training jobs start sooner, finish quicker, and need less disk space. Amazon
SageMaker algorithms have been engineered to be fast and highly scalable.
With Pipe input mode, your data is fed on-the-fly into the algorithm container
without involving any disk I/O. This approach shortens the lengthy download
process and dramatically reduces startup time. It also offers generally better read
throughput than File input mode. This is because your data is fetched from
Amazon S3 by a highly optimized multi-threaded background process. It also
allows you to train on datasets that are much larger than the 16 TB Amazon
Elastic Block Store (EBS) volume size limit.
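A minimal sketch of enabling Pipe mode with the SageMaker Python SDK; the image URI, role, and S3 paths are placeholders.

```python
from sagemaker.estimator import Estimator

# Placeholder image URI, role ARN, and S3 paths.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",   # stream data from S3 instead of downloading it first
    output_path="s3://example-bucket/output/",
)
estimator.fit({"train": "s3://example-bucket/train/"})
```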
SMOTE is an oversampling technique that generates synthetic samples from the
minority class. It is used to obtain a synthetically class-balanced or nearly class-
balanced training set, which is then used to train the classifier.
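A small sketch with the imbalanced-learn library (not an AWS service), using a synthetic imbalanced dataset:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Imbalanced toy dataset: roughly 5% minority class.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))   # e.g. {0: 950, 1: 50}

# Generate synthetic minority-class samples until the classes are balanced.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))   # classes now balanced, e.g. {0: 950, 1: 950}
```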
Many developers want to implement the famous Amazon model that was used to
power the "People who bought this also bought these items" feature on
Amazon.com. This model is based on a method called Collaborative Filtering. It
takes items such as movies, books, and products that were rated highly by a set of
users and recommends them to other users who also gave them high ratings.
This method works well in domains where explicit ratings or implicit user actions
can be gathered and analyzed.
You can use Amazon S3 bucket policies to control access to buckets from specific
virtual private cloud (VPC) endpoints, or specific VPCs.
A VPC endpoint for Amazon S3 is a logical entity within a VPC that allows
connectivity only to Amazon S3. The VPC endpoint routes requests to Amazon S3
and routes responses back to the VPC.
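A hedged boto3 sketch of such a bucket policy; the bucket name and VPC endpoint ID are hypothetical.

```python
import json
import boto3

# Hypothetical bucket and VPC endpoint ID.
bucket = "example-bucket"
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyAccessUnlessFromVpce",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        # Only requests arriving through this VPC endpoint are allowed.
        "Condition": {"StringNotEquals": {"aws:sourceVpce": "vpce-1a2b3c4d"}},
    }],
}
boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```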
If you plan to use GPU devices for model training, make sure that your containers
are nvidia-docker compatible. Only the CUDA toolkit should be included on
containers; don't bundle NVIDIA drivers with the image.
A residual plot shows whether the target value is being overestimated or underestimated.
A positive residual indicates that the model is underestimating the target (the
actual target is larger than the predicted target). A negative residual indicates an
overestimation (the actual target is smaller than the predicted target).
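A tiny numeric illustration of the sign convention (residual = actual - predicted):

```python
import numpy as np

y_true = np.array([10.0, 12.0, 8.0])
y_pred = np.array([ 9.0, 13.0, 8.0])
residuals = y_true - y_pred   # [ 1., -1.,  0.]
# +1 -> the model underestimated the target; -1 -> it overestimated the target.
```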
https://docs.aws.amazon.com/machine-learning/latest/dg/regression-model-
insights.html
MinMaxScaler preserves the shape of the original distribution. It doesn’t
meaningfully change the information embedded in the original data.
Note that MinMaxScaler doesn’t reduce the importance of outliers.
The default range for the feature returned by MinMaxScaler is 0 to 1.
RobustScaler transforms the feature vector by subtracting the median and then
dividing by the interquartile range (the 75th percentile value minus the 25th percentile value).
Use RobustScaler if you want to reduce the effects of outliers, relative to
MinMaxScaler.
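A short scikit-learn comparison on a toy feature with one outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]])   # 100 is an outlier

# MinMaxScaler: everything mapped into [0, 1]; the outlier squashes the rest near 0.
print(MinMaxScaler().fit_transform(X).ravel())

# RobustScaler: (x - median) / IQR; the non-outlier points keep a usable spread.
print(RobustScaler().fit_transform(X).ravel())
```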
AUC is scale-invariant. It measures how well predictions are ranked, rather than
their absolute values. AUC is classification-threshold-invariant. It measures the
quality of the model's predictions irrespective of what classification threshold is
chosen.
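A quick scikit-learn check of scale invariance: rescaling the scores leaves AUC unchanged because only the ranking of predictions matters.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc_score(y_true, scores))                    # 0.75
print(roc_auc_score(y_true, [s * 10 for s in scores]))  # still 0.75: scale-invariant
```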
Athena performs much more efficiently and at lower cost when using columnar formats such as Parquet or ORC, and Kinesis Data Firehose can convert JSON data to Parquet or ORC format on the fly.
An Amazon Kinesis Data Streams producer is an application that puts user data
records into a Kinesis data stream (also called data ingestion). The Kinesis
Producer Library (KPL) simplifies producer application development, allowing
developers to achieve high write throughput to a Kinesis data stream.
How can you most effectively load data from Hadoop cluster into your SageMaker
model for training?
Use the SageMaker Spark library, which lets you easily train models using data frames in your Spark cluster.
Using k-fold cross validation will randomly split your data. By sequentially
splitting the data you preserve the time element of your observations.
In order to get proper generalization from your data, you need to randomize it.
The OneHotEncoder transformer supports the following options for dropping one of the categories per feature: None, 'first', or an array specifying which category to drop.
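A small scikit-learn sketch of the drop="first" option:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["green"], ["blue"]])

# drop="first" removes one category per feature to avoid redundant columns.
enc = OneHotEncoder(drop="first")
print(enc.fit_transform(X).toarray())
print(enc.categories_)
```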
Kinesis Data Analytics works really well for near-real-time processing, and RCF (Random Cut Forest) for anomaly detection.
The Random Forest algorithm is well known to increase prediction accuracy and prevent the overfitting that occurs with a single decision tree.
The main difference between ROC curves and precision-recall (PR) curves is that
the number of true-negative results is not used for making a PR curve.
The Time Series Cross Validation technique is the correct choice for cross
validating a time series dataset. Time series cross validation uses forward
chaining where the origin of the forecast moves forward in time. Day n is training
data and day n+1 is test data.
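A minimal scikit-learn illustration of forward chaining with TimeSeriesSplit: each training window precedes its test window, preserving the time order.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)   # 8 sequential observations

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)
# train: [0 1]         test: [2 3]
# train: [0 1 2 3]     test: [4 5]
# train: [0 1 2 3 4 5] test: [6 7]
```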
Low learning rate in image classification algorithm will make the model learn
more slowly and be less sensitive to outliers.
When using k-fold for cross-validation the variance of the estimate is reduced as
you increase k.
If you have relatively equal error rates for all k-fold rounds it is an indication that
you have properly randomized your test data, therefore reducing the chance of
bias.
In the Linear Learner algorithm, for binary classification, the model produces a score denoting the strength of the prediction AND a predicted_label denoting the predicted class (for example, complete or not complete).
For binary classification, predicted_label is 0 or 1, and score is a single
floating point number that indicates how strongly the algorithm believes
that the label should be 1.
For multiclass classification, the predicted_class will be an integer
from 0 to num_classes-1, and score will be a list of one floating point
number per class.
To interpret the score in classification problems, you have to consider the loss
function used. If the loss hyperparameter value is logistic for binary classification
or softmax_loss for multiclass classification, then the score can be interpreted as
the probability of the corresponding class. These are the losses used by the linear learner when the loss hyperparameter is set to its default value, auto. But if the loss is set to hinge_loss, then the score cannot be interpreted as a probability. This is
because hinge loss corresponds to a Support Vector Classifier, which does not
produce probability estimates.
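A hedged sketch of invoking a deployed linear learner endpoint and reading score/predicted_label from the JSON response; the endpoint name and feature values are placeholders.

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint; the body is a single CSV record of features.
response = runtime.invoke_endpoint(
    EndpointName="linear-learner-endpoint",
    ContentType="text/csv",
    Body="0.5,1.2,3.4",
)
result = json.loads(response["Body"].read())
# Binary classification response shape: {"predictions": [{"score": 0.91, "predicted_label": 1}]}
for pred in result["predictions"]:
    print(pred["predicted_label"], pred["score"])
```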
How would you best use AWS Glue to build the data schema needed to classify
the data?
Use Glue crawlers to crawl your data (the best way to build the schema for your data is to use a Glue crawler that leverages one or more classifiers).
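A hedged boto3 sketch of creating and starting a crawler; the names, role ARN, database, and S3 path are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Placeholder crawler name, role ARN, database, and S3 path.
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/sales/"}]},
)
glue.start_crawler(Name="sales-data-crawler")   # infers the schema and writes it to the Data Catalog
```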
Amazon Kinesis Data Analytics is a very efficient service for taking streams from Kinesis Data Streams and transforming them with SQL or Flink.
A scatter chart shows multiple distributions, i.e., two or three measures for a dimension.
A histogram is an accurate representation of the distribution of numerical data. It
is an estimate of the probability distribution of a continuous variable.
Use line charts to compare changes in measured values over a period of time.
The default Lambda timeout value is 3 seconds. For many Kinesis Data Firehose implementations, 3 seconds is not enough time to execute the transformation function.
Kinesis Data Firehose supports Amazon S3 server-side encryption with AWS Key
Management Service (AWS KMS) for encrypting delivered data in Amazon S3.
In Kinesis Data Firehose, you are required to create an IAM role when creating a delivery stream.
Use AWS Glue for data preprocessing and save the data in Amazon S3 in Parquet format.
Heatmaps show relationships between two variables, but are not enough to check for overall distribution or skewness in the data.
A scatterplot can help check for outliers, but it won't show the skewness of the data.
Box plots and histograms are good for checking outliers and the overall distribution and skewness of a feature.
De-register the endpoint as a scalable target, then update the endpoint using a new endpoint configuration with the latest model's Amazon S3 path, then finally register the endpoint as a scalable target again.
Using a new endpoint configuration ONLY will not have Auto Scaling enabled.
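A hedged boto3 sketch of that sequence; the endpoint, variant, config names, and capacity values are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")
aas = boto3.client("application-autoscaling")

# Placeholder endpoint/variant identifiers.
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

# 1. De-register the variant as a scalable target.
aas.deregister_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
)

# 2. Update the endpoint with a new endpoint config pointing at the latest model.
sm.update_endpoint(EndpointName="my-endpoint", EndpointConfigName="my-endpoint-config-v2")

# 3. Register the variant as a scalable target again (re-attach scaling policies afterwards).
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)
```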
VolumeKmsKeyId in an Amazon SageMaker training job encrypts data on the training job instance storage, not on Amazon S3.
Amazon SageMaker Ground Truth manages sending your data objects to workers
to be labeled. Labeling each data object is a task. Workers complete each task
until the entire labeling job is complete. Ground Truth divides the total number of
tasks into smaller batches that are sent to workers. A new batch is sent to
workers when the previous one is finished.
Ground Truth provides two features that help improve the accuracy of your data
labels and reduce the total cost of labeling your data:
Annotation consolidation helps to improve the accuracy of your data
object labels. It combines the results of multiple workers' annotation tasks
into one high-fidelity label.
Automated data labeling uses machine learning to label portions of your
data automatically without having to send them to human workers.
IoT Core collects data from each shared bike; IoT Analytics retrieves messages from the shared bikes as they stream data, enriches the streaming data with your external data sources, and sends the streaming data to your K-Means ML inference endpoint; QuickSight is then used to create your visualization.
IoT Greengrass is a service that you use to run local ML inference capabilities on
connected devices.
The main advantage of random search is that all jobs can be run in parallel. In
contrast, Bayesian optimization, the default tuning method, is a sequential
algorithm that learns from past training jobs as the tuning job progresses. This greatly limits the level of parallelism. The disadvantage of random search is that it typically requires running considerably more training jobs to reach a comparable model quality.
So the Bayesian optimization approach to hyperparameter tuning results in fewer tuning job runs than the random search method.
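A hedged SageMaker Python SDK sketch; "estimator" is assumed to be an already-configured Estimator, and the metric name, ranges, and S3 paths are placeholders. Switching strategy to "Random" lets all jobs run in parallel.

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# "estimator" is assumed to be an existing SageMaker Estimator.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={"eta": ContinuousParameter(0.01, 0.3)},
    strategy="Bayesian",      # default; set to "Random" for fully parallel jobs
    max_jobs=20,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://example-bucket/train/", "validation": "s3://example-bucket/validation/"})
```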
Data scientists and developers can now quickly and easily access, monitor, and
visualize metrics that are computed while training machine learning models on
Amazon SageMaker. You can now specify the metrics you want to track by using
the AWS Management Console for Amazon SageMaker or by using the Amazon
SageMaker Python SDK APIs. After the model training starts, Amazon SageMaker
will automatically monitor and stream the specified metrics in real time to the
Amazon CloudWatch console for visualizing time-series curves, such as loss
curves and accuracy curves. You can also access the metrics programmatically
using Amazon SageMaker Python SDK APIs.
You can use the regex patterns that you see next to each metric to quickly parse
and filter the metric values from your Amazon CloudWatch Log files created by
Amazon SageMaker.
When you configure a Kinesis data stream as the data source of a Kinesis Data
Firehose delivery stream, Kinesis Data Firehose no longer stores the data at rest.
Instead, the data is stored in the data stream.
When you send data from your data producers to your data stream, Kinesis Data
Streams encrypts your data using an AWS Key Management Service (AWS KMS)
key before storing the data at rest. When your Kinesis Data Firehose delivery
stream reads the data from your data stream, Kinesis Data Streams first decrypts
the data and then sends it to Kinesis Data Firehose. Kinesis Data Firehose buffers
the data in memory based on the buffering hints that you specify. It then delivers
it to your destinations without storing the unencrypted data at rest.
You can use the Amazon SageMaker model tracking capability to search key
model attributes such as hyperparameter values, the algorithm used, and tags
associated with your team’s models. This SageMaker capability allows you to
manage your team’s experiments at the scale of up to thousands of model
experiments.
Use a customer-managed KMS key in case your project requires encryption for regulatory compliance reasons.
Kinesis Firehose can invoke Lambda functions to transform incoming source data
and deliver it to Amazon S3. Common transformation functions include
transforming Apache Log and Syslog formats to standardized JSON and/or CSV
formats. The JSON and CSV formats can then be directly queried using Amazon
Athena.
Lake Formation then helps you collect and catalog data from databases and
object storage, move the data into your new Amazon S3 data lake, clean and
classify your data using machine learning algorithms, and secure access to your
sensitive data.
When using AWS Glue FindMatches ML Transform, the labeling file must be
encoded as UTF-8 without BOM (Byte Order Mark)
For a relationship between two variables, you could use the scatter chart. For a
relationship between 3 variables a bubble chart is the best choice.
Factorization Machines solve discrete recommendation problems.
A pairs plot is used to show the relationship between pairs of features as well as
the distribution of each variable in relation to the others.
A covariance matrix shows the degree to which pairs of features vary together.
Entropy represents the measure of randomness in your feature.
MAE (Mean Absolute Error) is a good metric for regression when outliers are present.
If your XGBoost model has high accuracy on its training set but poor accuracy on its validation set, that suggests overfitting. The subsample parameter directly addresses overfitting, but other parameters such as eta, gamma, lambda, and alpha may also have an effect (see the sketch after the hyperparameter list below). Refer to
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
XGBoost is a CPU-only algorithm and won't benefit from the GPUs of a P3 or P2 instance. It is also memory-bound, making M4 a better choice than C4. It can be parallelized.
XGBoost Hyperparameters;
- num_class; number of classes (Required if objective is set to multi:softmax
or multi:softprob)
- num_round; The number of rounds to run the training (Required)
- alpha; L1 Regularization, increasing this value makes models more
conservative
- eta; step size, prevent overfitting
- eval_metric; rmse for regression & error for classification & map for ranking
- gamma; min loss reduction
- lambda; L2 regularization
- subsample; prevents overfitting
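The sketch referenced above: a hedged example of setting these hyperparameters on the built-in SageMaker XGBoost estimator to push back against overfitting; the role, bucket, and values are placeholders, not recommendations.

```python
from sagemaker import image_uris
from sagemaker.estimator import Estimator

# Placeholder role and bucket; retrieve the built-in XGBoost container image.
xgb = Estimator(
    image_uri=image_uris.retrieve("xgboost", region="us-east-1", version="1.5-1"),
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/output/",
)
xgb.set_hyperparameters(
    objective="binary:logistic",
    num_round=100,     # required
    eta=0.1,           # smaller step size
    subsample=0.8,     # train each round on a sample of the rows
    gamma=1,           # minimum loss reduction required to split
    alpha=1,           # L1 regularization
)
xgb.set_hyperparameters(**{"lambda": 1})   # L2 regularization ("lambda" is a Python keyword)
xgb.fit({"train": "s3://example-bucket/train/", "validation": "s3://example-bucket/validation/"})
```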
An Ordinal Encoder transform is a better choice for filling missing values in a feature that has ordinal values, like a rating H(High)>M(Medium)>L(Low)>N(No) or size data L(Large)>M(Medium)>S(Small).
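A small scikit-learn sketch that preserves the ordering when encoding, by supplying the category order explicitly:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array([["H"], ["M"], ["L"], ["N"]])

# Explicit category order so the encoding reflects N < L < M < H.
enc = OrdinalEncoder(categories=[["N", "L", "M", "H"]])
print(enc.fit_transform(X).ravel())   # [3. 2. 1. 0.]
```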
You can use various AWS services to transform or preprocess records prior to
running inference. At a minimum, you need to convert the data for the following:
Inference request serialization (handled by you)
Inference request deserialization (handled by the algorithm)
Inference response serialization (handled by the algorithm)
Inference response deserialization (handled by you)
When using a custom algorithm, you need to ensure that the desired metrics are emitted to stdout. You also need to include the metric definition and a regex for extracting each metric from the stdout output when defining the training job.
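A hedged sketch of supplying metric definitions for a custom container with the SageMaker Python SDK; the image, role, bucket, and the log format the regexes match are all placeholders.

```python
from sagemaker.estimator import Estimator

# Placeholder container, role, and bucket. The regexes must match what the
# algorithm prints to stdout, e.g. "train_loss=0.123; val_accuracy=0.91".
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/output/",
    metric_definitions=[
        {"Name": "train:loss", "Regex": "train_loss=([0-9\\.]+)"},
        {"Name": "validation:accuracy", "Regex": "val_accuracy=([0-9\\.]+)"},
    ],
)
estimator.fit({"train": "s3://example-bucket/train/"})
```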