Welcome back, fellow data adventurers! If you've been journeying with us, you've already navigated the treacherous terrains of preprocessing tasks in AutoML and unlocked the enigmatic secrets of SHAP values and feature importance. These articles are part of my series where I explain the theory behind Qlik AutoML. I hope you are ready for the third one, because you just got it.
When I was writing it, I kept thinking of a trilogy. I always like to use a theme or analogy in my articles, and to mark this trilogy, let's use analogies straight from the movies. So grab your popcorn, sit back, silence your cell phone, relax and enjoy. You are about to embark on a cinematic exploration of machine learning metrics, featuring guest appearances from Pulp Fiction, Forrest Gump, Back to the Future, and more.
Setting the Scene: The Importance of Model Evaluation
Imagine you're Marty McFly, and you've just traveled back to 1955. You need precise calculations to ensure you can return to 1985. Similarly, in machine learning, selecting the right model is crucial for accurate predictions. Qlik AutoML evaluates multiple models using various metrics to find the one that best fits your data's narrative.
Qlik AutoML does not know in advance which algorithm will produce the best model for a given problem, just like you don't know if a movie will be a blockbuster until you actually make it. What you can do is make sure you have the right script, director, and cast. Qlik AutoML is the same: you have to make sure the data you provide is validated and prepared for your use case, but when it comes to the algorithm, you actually need to try different options.
Alright, alright, alright (read it in a Matthew McConaughey Texan accent), you have trained 8 models, now what? How do you know which one to pick, which one to deploy and use? You measure the performance and select the best one? But how? That is what this article is about.
Binary Classification: The Godfather of Metrics
In the world of binary classification, we're dealing with problems that have two outcomes: like deciding whether or not to accept the protection of Don Corleone and call him "Godfather". Let's understand the metrics Qlik AutoML uses to evaluate these models.
1. F1 Score (Default)
What is it?
The F1 Score is the harmonic mean of precision and recall (no worries, I will explain these two inglourious basterds later). It balances the two metrics, providing a single score that reflects both the false positives and false negatives in your model.
How is it calculated?
The formula for the F1 Score is:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Why is it important?
Think of the F1 Score as the consigliere to The Godfather. It provides balanced advice by considering both precision (the accuracy of positive predictions) and recall (the ability to find all positive instances). In situations with imbalanced classes, where one class significantly outnumbers the other, the F1 Score becomes invaluable.
By the way, do you know what they call the F1 Score in France? Royale with Cheese. No, I am joking, that is how they call the Quarter Pounder sandwich. The F1 Score they call just the same, only they say "Le F1 Score".
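Qlik AutoML calculates all of this for you behind the scenes, but if you want to see the harmonic mean with your own eyes, here is a minimal scikit-learn sketch on made-up labels (the toy arrays below are pure fiction, like most sequels):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical ground truth and model predictions (1 = positive class)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```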
2. Precision
What is it?
Precision measures the proportion of true positive predictions among all positive predictions made. It is like asking: "if we say something is positive, how sure are we?" If an app recommends me movies because it thinks I would like them, how many of those do I actually like?
How is it calculated?
Precision = True Positives / (True Positives + False Positives)
Why is it important?
Precision is crucial when the cost of a false positive is high. For instance, in email spam detection, marking a legitimate email as spam (a false positive) could mean missing an important message from your boss—much like Forrest Gump missing a letter that could change his life. Sometimes it is worse: if the jury finds a defendant guilty, how sure are they that the defendant is actually guilty? We don't want to send innocent people to jail.
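To make the "if we say it is spam, how sure are we" idea concrete, here is a small sketch (the spam labels are invented) that counts the ingredients of precision by hand from a confusion matrix:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical spam detector output: 1 = spam, 0 = legitimate email
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 0, 0, 1, 0]

# For binary labels, confusion_matrix().ravel() returns TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
print(f"Of every email flagged as spam, {precision:.0%} really was spam")
```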
3. Recall
What is it?
Recall measures the proportion of true positive predictions among all actual positive cases. Are you having a déjà vu? Haven't I said the same for precision? No worries, it is not a glitch in the Matrix. There is one different phrase that makes the whole point. For precision I said "positive predictions"; here I said "actual positive cases". How different? Precision tells us how trustworthy the model's positive predictions are. Recall, however, tells us how many of the actual positive cases are in fact predicted as such.
Imagine a patient is diagnosed with a disease. Is it better to have a higher precision or a higher recall? It depends on the disease and the consequences of the diagnosis and treatment. If it is a disease that is expensive to treat or comes with a lot of stigma, I'd better be sure; in that case, a higher precision is better. On the other hand, if it is a rapidly evolving condition where I need to start treatment quickly and it is OK to treat some healthy people, at least until we have a confirmation, I would suggest going with recall.
How is it calculated?
Recall = True Positives / (True Positives + False Negatives)
Why is it important?
Recall is vital when missing a positive case is costly. In medical diagnoses, failing to detect a disease (a false negative) could have severe consequences. It's like Doc Brown not realizing the importance of the lightning strike in Back to the Future—missing it would prevent Marty from returning home.
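Sticking with the medical example (the labels below are invented), here is a quick sketch of recall as "how many of the truly sick patients did we actually catch":

```python
from sklearn.metrics import recall_score

# Hypothetical screening results: 1 = patient has the disease
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]  # one sick patient slips through

recall = recall_score(y_true, y_pred)  # TP / (TP + FN)
print(f"The model caught {recall:.0%} of the actual positive cases")
```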
4. Accuracy
What is it?
Accuracy measures the proportion of true results (both true positives and true negatives) among the total number of cases examined. It is the most basic question: "how well did the model do?"
How is it calculated?
Accuracy = (True Positives + True Negatives) / Total Number of Predictions
Why is it important?
While accuracy seems straightforward, it can be misleading in imbalanced datasets. If 90% of your data belongs to one class, predicting that class every time yields 90% accuracy. It's like Forrest Gump's box of chocolates—you never know what you're gonna get, so a single metric might not tell the whole story.
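Here is that 90% trap as a tiny sketch (the class split is made up): a lazy "model" that always predicts the majority class still looks great on accuracy, while the F1 Score tells the real story.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced dataset: 90 negatives, only 10 positives
y_true = [0] * 90 + [1] * 10
y_lazy = [0] * 100  # a "model" that always predicts the majority class

print(accuracy_score(y_true, y_lazy))             # 0.9 -- looks impressive
print(f1_score(y_true, y_lazy, zero_division=0))  # 0.0 -- the real story
```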
5. AUC (Area Under the ROC Curve)
What is it?
AUC, or Area Under the Receiver Operating Characteristic (ROC) Curve, measures the ability of a model to distinguish between classes (Positive vs. Negative). It evaluates how well a binary classification model can separate positive and negative cases across different decision thresholds.
The ROC curve itself is a graphical plot that illustrates the trade-off between two key metrics:
True Positive Rate (Recall): The proportion of actual positives correctly identified.
False Positive Rate: The proportion of actual negatives that were incorrectly classified as positives.
The AUC value ranges from 0 to 1:
1.0 (Perfect): The model makes no errors in separating the classes.
0.5 (Random Guess): The model is no better than random guessing.
< 0.5 (Worse than Random): The model’s predictions are systematically wrong.
What is the ROC Curve?
The Receiver Operating Characteristic (ROC) Curve is created by varying the decision threshold of a classifier and plotting:
True Positive Rate (y-axis): Proportion of correctly identified positives out of all actual positives.
False Positive Rate (x-axis): Proportion of incorrectly identified positives out of all actual negatives.
At one extreme of the threshold, all instances are classified as positive, leading to a high false positive rate and high true positive rate. At the other extreme, all instances are classified as negative, leading to a low false positive rate but also a low true positive rate. The curve connects these points.
A perfect model will have a ROC curve that hugs the top-left corner, meaning it achieves high recall with minimal false positives. By the way, a perfect model does not exist; it is only a theoretical utopia.
Why is it Important?
AUC is a robust metric for evaluating binary classification models, especially in cases of imbalanced datasets. It focuses on the model's ranking of predictions rather than specific thresholds.
Threshold Independence: AUC evaluates performance across all possible thresholds, offering a comprehensive view.
Interpretability: A higher AUC means the model is better at distinguishing between classes (e.g., predicting 0s as 0s and 1s as 1s).
Think of the AUC like the Voight-Kampff test in Blade Runner: it’s designed to differentiate humans from replicants. A model with a high AUC is like a flawless version of the test, consistently distinguishing humans from artificial beings without fail. If the AUC is low, it’s like a faulty test that can’t tell the difference—leading to chaos in the dystopian world of replicants and humans!
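Here is a sketch of the AUC computed from predicted probabilities rather than hard labels (the probabilities are invented). Notice that roc_auc_score only cares about how well the scores rank positives above negatives, not about any single threshold:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical ground truth and predicted probabilities of the positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.10, 0.35, 0.80, 0.65, 0.20, 0.90, 0.55, 0.40]

auc = roc_auc_score(y_true, y_prob)
fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # points along the ROC curve

print(f"AUC: {auc:.2f}")
print(list(zip(fpr.round(2), tpr.round(2))))  # (FPR, TPR) pairs across thresholds
```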
Choosing a Different Model
Even though in a binary classification Qlik AutoML recommends a model based on the F1 Score, you might have reasons to prefer another model. Perhaps precision is more critical for your application. In that case:
Review Metrics: Examine all the metrics provided for each model.
Consider Your Priorities: Align the metrics with what's most important for your project.
Select and Deploy: Choose the model that best fits your needs and deploy it.
It's like Tarantino casting or writing a script based on what's best for the movie, not just what's popular or easier. In Qlik AutoML, check the details chart under the Compare tab for all of the metrics, but select with wisdom, Frodo.
Multi-Class Classification: The Ensemble Cast
When your problem involves more than two classes, it’s like navigating the infinite parallel realities of The Butterfly Effect or Everything Everywhere All at Once. Each class represents a unique timeline or outcome, and the challenge lies in understanding how they all coexist and interact. Qlik AutoML, much like a skilled time traveler or multiverse explorer, uses specialized metrics to evaluate these multi-class models, ensuring the best possible alignment between your data's various realities.
1. F1 Macro (Default)
What is it?
F1 Macro calculates the F1 Score independently for each class and then takes the average (unweighted mean). It treats all classes equally, regardless of their support (the number of actual occurrences).
How is it calculated?
F1 Macro = (F1 of class 1 + F1 of class 2 + … + F1 of class N) / N, where N is the number of classes
Why is it important?
F1 Macro is beneficial when all classes are equally important. It’s like the ensemble cast in Inglourious Basterds, where each character plays a pivotal role in the mission. From Lt. Aldo Raine’s leadership to the explosive finale orchestrated by Shosanna, every character's contributions are treated with equal importance, just as F1 Macro ensures that all classes are equally represented when evaluating a model.
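A minimal sketch of that "unweighted mean", using invented three-class labels: compute one F1 per class, average them by hand, and confirm it matches scikit-learn's average='macro':

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical three-class problem (classes 0, 1, 2)
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

per_class = f1_score(y_true, y_pred, average=None)  # one F1 per class
print(per_class, np.mean(per_class))                # manual unweighted mean
print(f1_score(y_true, y_pred, average="macro"))    # same number
```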
2. F1 Micro
What is it?
F1 Micro takes a different approach compared to F1 Macro by aggregating the contributions of all classes to compute a single, weighted average F1 Score. Instead of treating each class equally, it calculates global counts of true positives, false negatives, and false positives across all classes, combining them into one comprehensive metric. This means larger classes with more samples have a bigger impact on the final score, as F1 Micro emphasizes overall performance across the entire dataset rather than individual class performance. This makes it particularly useful when class distributions are imbalanced, as it reflects how well the model performs on the dataset as a whole, rather than giving each class the same weight.
How is it calculated?
It's the same as calculating precision, recall, and F1 Score over the entire dataset without considering class labels.
Why is it important?
F1 Micro is useful when you have imbalanced class distributions and care about the overall performance of the model as a whole. It’s like the Avengers in The Avengers—not every hero gets the same screen time, but their combined efforts determine the outcome of the battle. Similarly, F1 Micro focuses on the collective contributions of all classes, giving more weight to larger ones, ensuring the overall performance reflects the true impact of the dataset's structure.
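The pooling idea as a sketch, reusing the same invented labels. A handy sanity check: for single-label multi-class problems, micro-averaged F1 works out to exactly the same number as plain accuracy:

```python
from sklearn.metrics import f1_score, accuracy_score

# Hypothetical three-class labels (same style as the macro example)
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

micro = f1_score(y_true, y_pred, average="micro")  # pooled TP/FP/FN across classes
print(micro)                                       # for single-label problems this...
print(accuracy_score(y_true, y_pred))              # ...equals plain accuracy
```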
3. F1 Weighted
What is it?
F1 Weighted provides a more nuanced evaluation by calculating the F1 Score for each class individually and then weighting those scores according to the number of instances in each class. This approach ensures that classes with more samples have a proportionally larger influence on the overall F1 score, while smaller classes still contribute to the final result without being disproportionately emphasized. It's particularly effective in scenarios where class distributions are imbalanced, as it accounts for the significance of each class based on its representation in the dataset.
How is it calculated?
F1 Weighted = sum over all classes of (class share of samples × class F1)
Why is it important?
F1 Weighted is essential when dealing with class imbalance because it ensures the performance of each class is represented proportionally to its size in the dataset. It reminds me of Amadeus, where Mozart takes center stage due to his brilliance, but Antonio Salieri's role, though less prominent, is still crucial to the story's depth and complexity. Just as the movie gives more focus to Mozart because of his outsized talent and impact, F1 Weighted gives larger classes more influence on the overall score, while still acknowledging the contributions of smaller classes. This metric provides a fair yet realistic assessment, capturing the dynamics of the dataset much like Amadeus captures the interplay of genius and rivalry.
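And a sketch of the weighting itself, on invented labels where class 0 dominates, so the three averages come out visibly different:

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced three-class labels: class 0 has most of the samples
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# macro treats the rare classes as equals, micro pools everything,
# weighted scales each class's F1 by its share of the samples
for avg in ("macro", "micro", "weighted"):
    print(avg, round(f1_score(y_true, y_pred, average=avg), 2))
```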
4. Accuracy
What is it?
Same as in binary classification, accuracy measures the proportion of correct predictions over total predictions.
Why is it important?
In multi-class problems with balanced datasets, accuracy can be a good indicator of model performance. However, it may not capture the nuances in class-wise performance, like when executives judge a movie solely by its box office earnings without considering critical acclaim or cultural impact.
Regression: Back to the Future of Continuous Predictions
Regression problems involve predicting continuous outcomes, like forecasting stock prices or estimating the time until the next Back to the Future reboot. In other words, regression is when you predict a number, not how you can remember past lives or talk to ghosts while you do pottery.
1. R² Score (Default)
What is it?
The R² Score, also known as the coefficient of determination, is a statistical measure that indicates how well a regression model's predicted values align with the actual observed values. It represents the proportion of the variance in the dependent variable (the target) that can be explained by the independent variables (the predictors).
Wait, what? No worries, let me Dumb and Dumber it for you. In essence, R² gives you a score that tops out at +1 (and can even go negative) indicating how much of the change in the output Y (hipster data scientists call this the dependent variable) can be explained by a change in the input X (the same folks who say 'the book was better than the movie' call this the independent variable).
If X moves up 50%, how much does Y move? Example: how much of the increase in people drowning in swimming pools (Y) can be explained by the number of movies released by Nicolas Cage (X) in the same year?
You can interpret the scores like this:
R² = 1: This indicates a perfect fit, meaning the regression model explains 100% of the variance in the target variable. The predicted values align exactly with the actual values; every bit of variation in Y is accounted for by the model.
R² = 0: This means the model does no better than simply predicting the mean of the target variable for all instances. In other words, the independent variables provide no useful predictive information.
R² < 0: A negative R² suggests that the model performs worse than a simple horizontal line at the mean of the target variable. This typically indicates a poorly fitted model.
How is it calculated?
R² = 1 − (Sum of Squared Residuals / Total Sum of Squares) = 1 − Σ(actual − predicted)² / Σ(actual − mean of actuals)²
Too crazy a formula? Let's simplify it like they do in current movies:
R² = 1 − (unexplained variance / total variance)
Why is it important?
An R² Score of 1 indicates perfect prediction, while 0 indicates the model does no better than the mean of the data. Negative values suggest the model is performing worse than a horizontal line. It's similar to Doc Brown's calculations: precise predictions ensure successful time travel. If your dataset contains high variability or noise, R² might not provide meaningful insights.
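A sketch of R² on invented numbers, computed both by hand (one minus unexplained variance over total variance) and with scikit-learn, just to see the two agree:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values from a regression model
y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred = np.array([11.0, 11.5, 14.5, 15.0, 18.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variance (residuals)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variance around the mean
print(1 - ss_res / ss_tot)                      # manual R²
print(r2_score(y_true, y_pred))                 # same number
```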
2. RMSE (Root Mean Squared Error)
What is it?
RMSE measures the square root of the average squared differences between predicted and actual values. It gives higher weight to larger errors. Worse to understand than the ending of Inception? Let me rephrase it. Imagine you are predicting the total points the Boston Celtics will score in every game next season. For game 1 you might say 120, but they scored 110; for game 2 you might say 115, but they scored 125; and so on. The residual is the difference between the actual value and the one you predicted.
How do you know how well you predicted the entire season? How about the average of the residuals? Not quite yet. The problem with a simple average is that positive residuals (like the game 2 prediction) will cancel out the negative residuals (like game 1). So, let's eliminate the sign. How? RMSE squares each difference. Since the square of a number is always positive, you eliminate the canceling-out problem. However, you create another one: imagine your predictions were off by an average of 10 points per game. Because you square it, the error shows up as roughly 100 squared points. How to fix this last problem? Just take the square root! Easy!
How is it calculated?
RMSE = √( average of (actual − predicted)² )
Why is it important?
RMSE is useful for understanding the model's prediction errors in the same units as the target variable. Large errors are penalized more severely, which is critical when significant deviations are unacceptable—like miscalculating the gigawatts needed for the DeLorean's flux capacitor.
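To follow the Celtics walkthrough step by step (square, average, then take the root), here is a sketch using the two games from the text plus a third, invented one:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual    = np.array([110, 125, 118])  # games 1 and 2 from the text, plus a made-up game 3
predicted = np.array([120, 115, 121])

residuals = actual - predicted           # [-10, 10, -3]: a plain average would mislead
rmse = np.sqrt(np.mean(residuals ** 2))  # square, average, then take the square root
print(rmse)
print(np.sqrt(mean_squared_error(actual, predicted)))  # same number via scikit-learn
```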
3. MAE (Mean Absolute Error)
What is it?
MAE measures the average magnitude of the errors without considering their direction. It provides a linear score, which means all the individual differences are weighted equally. MAE and RMSE are very similar. They both eliminate the canceling effect that positive and negative residuals have, but instead of squaring and then square-rooting the differences, MAE simply takes the absolute value of the differences. In math terms, that is represented by the pipe characters. No, not Mario, but this pipe: |.
How is it calculated?
MAE = average of |actual − predicted|
Why is it important?
MAE is straightforward to interpret and useful when all errors are equally important. It's like Forrest Gump's simple yet profound wisdom—"Life is like a box of chocolates; you never know what you're gonna get," but you can at least measure how far you are from the expected sweetness.
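Same invented Celtics numbers, but with absolute values instead of squares:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

actual    = np.array([110, 125, 118])
predicted = np.array([120, 115, 121])

mae = np.mean(np.abs(actual - predicted))      # |residual|, then a plain average
print(mae)                                     # roughly 7.67 points per game
print(mean_absolute_error(actual, predicted))  # same number via scikit-learn
```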
4. MSE (Mean Squared Error)
What is it?
MSE measures the average of the squares of the errors; it's the squared difference between the estimated values and the actual values. It is similar to RMSE but without the R. In other words, it squares the differences and keeps them like that. The results are in squared units; just accept it and live with it.
How is it calculated?
MSE = average of (actual − predicted)²
Why is it important?
MSE is valuable for mathematical reasons, especially in calculus, but it's harder to interpret because the errors are squared, and the units are squared as well.
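And to close the loop on the same invented numbers: MSE is simply the RMSE before the square root, so its units are "squared points":

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual    = np.array([110, 125, 118])
predicted = np.array([120, 115, 121])

mse = mean_squared_error(actual, predicted)  # average of the squared residuals
print(mse)                                   # about 69.67 "squared points"
print(np.sqrt(mse))                          # taking the root gives the RMSE back
```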
Choosing Your Destiny: Selecting a Different Model
Even though Qlik AutoML recommends a model based on default metrics (F1 Score for classification and R² for regression), you hold the power to choose differently. Perhaps you're more interested in minimizing large errors (RMSE) or prioritizing precision over recall.
How to Select a Different Model
Analyze All Metrics: Study each model's performance metrics provided by Qlik AutoML.
Align with Business Goals: Consider which metric aligns best with your project's objectives.
Make an Informed Decision: Choose the model that excels in the metric most relevant to your needs.
Deploy with Confidence: Implement the model, knowing it best serves your purpose.
The Final Cut: Wrapping Up
As we reach the end of this feature presentation, remember that the metrics you choose are as critical as casting the right actors in a blockbuster film. Each metric provides a different lens through which to evaluate your models, and understanding them allows you to make informed decisions.
Just as Forrest Gump lived through pivotal moments in history, your journey through Qlik AutoML's evaluation metrics equips you to navigate the complex landscape of machine learning with confidence. So, let me make you an offer you cannot refuse: whether you're predicting the next big stock surge or classifying emails, you're now armed with the knowledge to select the best model for your needs.