
Welcome to our Model Evaluation video series, led by Guillermo Germade, AI/ML & Backend Developer at Azumo.
In this 7-part series, we take a deep dive into one of the most critical phases of the machine learning workflow: evaluating models. From train/test splits to precision, recall, F1 scores, and ROC AUC curves, Guillermo explains how to measure whether models are performing effectively, not just from a technical perspective but also in terms of business impact.
These videos equip you with practical tools to understand ML model evaluation metrics and apply them to real-world challenges like fraud detection, medical diagnostics, and spam filtering. By the end of the series, you’ll know how to determine if your model is “good enough” for your specific use case.
Introduction to Model Evaluation
Guillermo begins this part of the series by emphasizing the importance of machine learning model evaluation for both practitioners and stakeholders. Building on the previous session’s overview of AI, machine learning, and the variety of model types, this section focuses specifically on evaluating models.
The focus here is on supervised learning models, which include both classification (categorical outcomes) and regression (quantitative predictions) models. Guillermo explains that these types of models have well-established ML model evaluation metrics and methods that make it easier to judge their effectiveness.
The session begins with a high-level overview of the evaluation process, followed by a deep dive into classification models, including common use cases and key evaluation metrics.
Next, the series transitions to regression models, discussing the tools and metrics available to analyze continuous outcomes. By the end of this session, viewers will gain a clear understanding of how to apply machine learning model evaluation principles across both classification and regression tasks. This will help them make data-driven decisions and ensure that their models deliver real-world value.
What Is Model Evaluation?
Here, Guillermo starts by breaking down machine learning model evaluation into its core purpose: assessing how well a model can predict outcomes on new, unseen data. He explains that all machine learning models learn patterns from a training dataset, which contains historical data.
Different models trained on the same dataset can perform very differently. Some factors, such as model type, tuning parameters, and the mathematical structure of the model, influence how effectively it captures patterns in the data. Guillermo highlights that in practice, data scientists often treat models as a “black box” for evaluation purposes. Instead of trying to predict which model will work best theoretically, they train multiple models on the same data and then use ML model evaluation metrics to compare their performance.
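As a rough sketch of that black-box comparison approach (the dataset, the two models, and the accuracy metric below are assumptions chosen for illustration, not the exact ones used in the series), it can be as simple as training a few scikit-learn models on the same data and comparing a held-out score:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed toy dataset standing in for historical, labeled data.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Treat each model as a black box: train it, then compare held-out accuracy.
for model in (LogisticRegression(max_iter=1_000), DecisionTreeClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```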
Model evaluation is typically one of the final steps in the data science workflow. After training several models, practitioners use machine learning model evaluation metrics to measure performance and decide which model is best for the project’s technical and business goals.
By focusing on evaluation rather than theory, project teams can make data-driven decisions. They ensure that selected models deliver both accuracy and practical value across business use cases. This approach is important for robust AI systems in domains ranging from fraud detection to medical diagnostics.
The Data Science Process & Why Evaluation Matters
Guillermo explains that understanding where machine learning model evaluation fits in the broader data science process is essential for both technical and non-technical stakeholders. He uses the example of fraud detection in banking to illustrate the workflow.
The workflow begins with business understanding: defining the problem the model should solve, in this case detecting fraudulent transactions. Next comes data understanding: determining whether historical transaction data is available and sufficiently labeled (e.g., fraudulent vs. legitimate). If so, a supervised learning approach can be applied. Then, data preparation ensures that the dataset is clean, structured, and free from noise. Guillermo emphasizes that a significant portion of the data science process is often dedicated to cleaning and organizing data.
After preparing the data, modeling involves training multiple models on the dataset. The focus of this section, however, is evaluation: once models are trained, their performance is assessed using ML model evaluation metrics. This is the stage where project managers and stakeholders can meaningfully engage, because evaluation bridges the gap between technical model performance and practical business requirements.
By understanding this process, stakeholders can see how ML model evaluation connects each stage of the data science workflow, turning raw data into actionable insights and trustworthy AI solutions.
Understanding Train/Test Split
This lesson introduces the train/test split, the foundation of model evaluation. Guillermo explains why separating data into training and testing sets prevents inflated results and common mistakes.
Using a credit card transaction dataset as an example, Guillermo explains that each row represents a transaction, while columns contain explanatory features. The label indicates whether the transaction was fraudulent. Typically, about 70–80% of the data is used for training and the remaining 20–30% for testing.
Guillermo emphasizes that skipping this step and training on the entire dataset is a common mistake. If the model is evaluated on the same data it was trained on, performance metrics will appear unrealistically high. He calls it a form of “cheating” that hides the model’s true predictive power. The train/test split ensures that evaluation results are reliable and meaningful.
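A minimal sketch of this step with scikit-learn might look like the following (the synthetic data and the stratify option are assumptions added for illustration; the video’s exact setup may differ):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for a labeled transaction table: each row is a transaction,
# each column an explanatory feature; y marks fraud (1) vs. legitimate (0).
X, y = make_classification(n_samples=1_000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,    # hold out ~30% of the data purely for evaluation
    stratify=y,       # keep the fraud ratio the same in both sets
    random_state=42,  # fixed seed so the split is reproducible
)
```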
Decoding the Confusion Matrix: Simple Guide to AI Model Accuracy
Here, Guillermo explains how machine learning model evaluation applies to classification models, which predict categorical labels rather than continuous values.
Once a model is trained, the first step in evaluating its performance is comparing its predictions against the actual labels. One of the most visual and intuitive tools for this is the confusion matrix, which summarizes the outcomes of classification predictions into four categories: true positives, true negatives, false positives, and false negatives.
Guillermo emphasizes that understanding these outcomes is key to applying ML model evaluation metrics effectively. He then introduces two core metrics derived from the confusion matrix: precision (the share of flagged positives that are truly positive) and recall (the share of actual positives the model catches).
He also covers accuracy, another common metric. While simple, accuracy can be misleading on imbalanced datasets: a model that labels every transaction as legitimate can still score high accuracy while catching no fraud. For this reason, machine learning model evaluation typically emphasizes precision, recall, and their balance through metrics like the F1 score.
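To make those definitions concrete, here is a small hand-made example (the labels are illustrative, not taken from the video) showing how the four confusion matrix cells turn into accuracy, precision, and recall:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = fraud, 0 = legitimate (illustrative values only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # of everything flagged as fraud, how much really was
recall = tp / (tp + fn)                     # of all real fraud, how much the model caught

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```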
By understanding the confusion matrix and these derived metrics, project teams can evaluate classification models more effectively and make informed trade-offs. They will also ensure their AI systems align with business priorities.
F1 Score and ROC AUC Curve Explained
Guillermo explains two more machine learning model evaluation metrics: F1 score and ROC AUC curve.
The F1 score combines precision and recall into a single number: their harmonic mean. It reflects a balance between correctly identifying positives and avoiding false positives. For example, if a model has 70% precision and 70% recall, its F1 score is also 70%. The F1 score is useful when you need to balance false positives and false negatives.
The ROC AUC curve shows how well a model can separate the two classes. It plots the true positive rate against the false positive rate across different decision thresholds. A random model produces a diagonal line, while a better model curves toward the top-left. The larger the area under the curve (AUC), the better the model is at distinguishing positives from negatives.
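Continuing the toy labels from the confusion matrix example above, a short sketch of both metrics in scikit-learn (the probability scores are made up for illustration):

```python
from sklearn.metrics import f1_score, roc_auc_score

# Toy labels and scores (illustrative values, not from the video).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]                    # hard class predictions
y_score = [0.9, 0.4, 0.8, 0.2, 0.1, 0.6, 0.3, 0.7]   # predicted fraud probabilities

# F1 is the harmonic mean of precision and recall; with 0.75 each it is also 0.75.
print("F1:", f1_score(y_true, y_pred))

# ROC AUC is computed from the ranking of the scores, not from hard predictions.
# 1.0 means perfect separation, 0.5 is no better than random guessing.
print("ROC AUC:", roc_auc_score(y_true, y_score))
```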
By using the F1 score and ROC AUC alongside precision, recall, and accuracy, you can build a full picture of ML model evaluation. These metrics help you choose models that perform well and match business goals.
Hands-On Model Evaluation Demo & Closing Remarks
In the final part, Guillermo shows a practical example of machine learning model evaluation using Python and Scikit-learn. He creates a synthetic dataset of 1,000 credit card transactions with 20 features, labeled as fraud or not fraud.
The data is split 70/30 for training and testing. He trains a logistic regression model and makes predictions on the test set. Then he calculates ML model evaluation metrics: accuracy, precision, recall, and the confusion matrix. In this example, the model correctly identified 128 fraudulent transactions and 127 normal ones. It missed 27 frauds and flagged 18 normal transactions incorrectly. Accuracy was 85%, precision 88%, and recall was 83%.
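A minimal sketch of that kind of demo is shown below; the synthetic data generator, class balance, and model settings are assumptions, so the resulting numbers will not match the video’s exactly:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 1,000 credit card transactions with 20 features,
# labeled fraud (1) or not fraud (0).
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# 70/30 train/test split, as in the demo.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Train a logistic regression model and predict on the held-out test set.
model = LogisticRegression(max_iter=1_000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
```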
Guillermo explains that lower recall means some frauds were missed, highlighting why choosing the right metric is important. Data scientists can tune models or try others to improve performance based on business priorities.
These evaluation steps apply to many domains, such as fraud detection, medical diagnostics, spam filtering, and more. Regression models use similar principles but different metrics, like mean squared error, for continuous predictions.
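For regression, the analogous sketch is just as short, using scikit-learn’s mean_squared_error with made-up continuous values:

```python
from sklearn.metrics import mean_squared_error

# Toy continuous targets and predictions (illustrative values only).
y_true = [100.0, 150.0, 200.0]
y_pred = [110.0, 140.0, 195.0]

# MSE averages the squared differences; here (100 + 100 + 25) / 3 = 75.
print("MSE:", mean_squared_error(y_true, y_pred))
```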
This demo shows how ML model evaluation helps connect model performance to real-world decisions. The next session will cover regression evaluation and explore different model types.
To Sum Up
This series shows how to evaluate machine learning models. First, you learn to test models on unseen data and use ML model evaluation metrics like accuracy, precision, recall, F1 score, and ROC AUC.
Next, evaluation helps you pick the best model, tune it, and understand trade-offs. These principles apply to classification tasks like fraud detection, spam filtering, or medical diagnosis, as well as regression tasks like predicting sales or weather.
Demos then show how to split data, train models, make predictions, and calculate metrics. These steps provide a clear link between model performance, business goals, and real-world decisions.
Overall, by following these principles, you can ensure your AI models are accurate, reliable, and useful across different applications.