Bias Detection in Machine Learning and Qlik Predict
- Igor Alcantara
- Mar 31
- 15 min read
When your model learns more than it should
There is a quiet assumption in most machine learning projects. If the model is accurate, then it must be good. That assumption works well, until the model starts making decisions about people, money, risk, or opportunity. Then accuracy alone is not enough. A model can be very accurate and still systematically treat groups differently.
This is the ninth article in our series "The Theory behind Qlik Predict".
So far, we have talked about models, features, evaluation metrics, and how to build systems that can predict the future with a reasonable degree of confidence. Most of that work focused on one central idea, how well the model performs.
Now it is time to talk about something just as important, and often ignored: Fairness.
Bias detection is a relatively new capability in Qlik Predict, and it changes the conversation in a meaningful way. Instead of only asking whether a model is accurate, we can now ask whether it behaves consistently across different groups.
Because a model can be technically correct and still be wrong in ways that matter.
This article shifts the focus from performance to responsibility. Not in an abstract sense, but in a measurable, practical way. We are going to unpack what bias really means in machine learning, how it appears in both data and models, and how Qlik Predict allows you to detect, analyze, and act on it before it becomes a problem in production.

What Bias Actually Is, Beyond the Buzzword
Bias in machine learning is often misunderstood because the word carries a lot of baggage.
In statistics, bias simply means a systematic deviation from the truth. In machine learning, it takes a more practical meaning:
A model is biased when its predictions systematically favor or disadvantage certain groups.
That “systematically” part matters.
One wrong prediction is not bias. A pattern of wrong predictions that consistently affects the same group is.
Bias can show up in different ways:
A loan model approves applications more often for one group than another
A hiring model ranks certain profiles lower regardless of qualifications
A healthcare model underestimates risk for a specific population
In all cases, the model is not random. It is consistent. And consistency is what makes it dangerous.
Bias can originate in two places:
The data
The model itself
Both need to be understood separately.
Data Bias, When Reality Is Already Skewed
Data bias is the most fundamental form of bias. It exists before any algorithm is involved.
It happens when the training data does not represent groups equally or fairly.
This can happen for several reasons:
Some groups are underrepresented
Historical data reflects past inequalities
Certain outcomes are more frequently recorded for specific groups
A classic example is hiring data. If historically a company hired mostly one type of candidate, the dataset will reflect that pattern. A model trained on this data will learn that pattern and reproduce it.
Not because it is correct. Because it is common.
Why Data Bias Is So Difficult
Data bias is tricky because it often looks legitimate.
The data is real. The distributions are correct. The patterns are statistically valid.
But they may not be fair. A model trained on biased data is like a student learning from a biased textbook. The learning is accurate relative to the source, but flawed relative to reality.
Measuring Data Bias in Qlik Predict
Qlik Predict approaches data bias with a simple idea:
Measure how balanced the data is across groups, and how outcomes behave within those groups.
Two metrics do most of the work here.
Representation Rate Parity Ratio
This metric answers a basic question:
How evenly are different groups represented in the dataset?
Imagine a feature like “Marital Status” with groups such as Married, Single, Divorced. If 90 percent of the dataset belongs to one group, the model has very little exposure to the others.
The representation rate parity ratio compares the size of each group relative to the largest group.
A value close to 1 means groups are similarly represented
A value below 0.8 signals imbalance
Think of it as a fairness check on exposure. If the model rarely sees a group during training, it will not learn reliable patterns for it. That group becomes an afterthought in predictions, a rare exception the model does not have enough knowledge to handle.

Another way to look at this metric is through the lens of learning opportunity. Machine learning models do not generalize equally across all groups, they generalize based on what they see frequently. When one group dominates the dataset, the model becomes highly specialized in recognizing patterns for that group, while treating others as noise or edge cases. This is not a flaw in the algorithm, it is a direct consequence of exposure.
This imbalance often leads to unstable behavior for underrepresented groups. Predictions for those groups tend to have higher variance, lower confidence, and more errors. In practice, this means the model might perform very well overall, yet fail quietly for smaller segments of the population. Those failures are easy to miss if you only look at aggregate metrics.
It also highlights an important limitation of accuracy as a performance measure. A model trained on heavily imbalanced data can achieve high accuracy simply by optimizing for the majority group. Think of a condition that only affects 1 percent of the population. If the model simply predicts that nobody has the condition, it will be right 99 percent of the time. However, that does not make it a good model, since people who do have the condition will never be identified.
The representation rate parity ratio helps reveal this imbalance early, before it turns into a model that is technically impressive but practically unreliable for anyone outside the dominant group.
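To make the idea concrete, here is a minimal sketch in Python of how such a ratio could be computed, assuming the definition is simply each group's count divided by the count of the largest group. The column name and data are hypothetical, and this is an illustration of the concept rather than Qlik Predict's internal implementation.

```python
import pandas as pd

# Hypothetical training data with a feature we want to audit for representation.
df = pd.DataFrame({
    "marital_status": ["Married"] * 90 + ["Single"] * 7 + ["Divorced"] * 3
})

# How often each group appears in the dataset.
group_counts = df["marital_status"].value_counts()

# Assumed definition: each group's count relative to the largest group's count.
representation_parity = group_counts / group_counts.max()

print(representation_parity)
# Married     1.000
# Single      0.078   -> well below 0.8, this group is underrepresented
# Divorced    0.033
```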
Conditional Distribution Parity Ratio
This metric goes one level deeper. It asks:
Within each group, how does the target variable behave?
In other words, even if groups are equally represented, do they have similar outcome distributions?
For example:
Do different groups have similar approval rates?
Do different groups have similar default rates?
The conditional distribution parity ratio compares these relationships across groups.
Values near 1 indicate similar behavior
Values below 0.8 indicate imbalance
This metric is important because representation alone can be misleading. A dataset can look balanced on the surface but still encode very different relationships between features and outcomes.
What makes this metric particularly valuable is that it captures hidden structure in the data. Two groups can appear equally represented, yet the way the target variable interacts with those groups can be completely different. One group might consistently have higher positive outcomes, while another group shows a wider spread or systematically lower results. From a modeling perspective, this creates very different learning signals, even though the dataset looks balanced at first glance.
Another way to think about it is that this metric evaluates fairness in relationships, not just counts. Representation tells you how often a group appears. Conditional distribution tells you what happens when it appears. If the outcomes differ significantly, the model will learn those differences and may reinforce them in its predictions. This is often where subtle bias begins to take shape, not from lack of data, but from uneven patterns embedded within it.
It also helps explain why simply balancing a dataset is not always enough. You can oversample or rebalance groups to achieve equal representation, but if the underlying outcome distributions remain skewed, the problem persists. The model will still pick up on those differences and act on them. That is why looking at conditional distributions provides a more realistic view of fairness, closer to how the model actually experiences the data during training.
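Qlik Predict calculates this automatically, but as an illustration, here is one simple way such a ratio could be sketched for a binary target: compare each group's positive outcome rate in the training labels against the group with the highest rate. The exact formula used by Qlik Predict may differ, and the data below is hypothetical.

```python
import pandas as pd

# Hypothetical dataset: an observed binary target for two equally sized groups.
df = pd.DataFrame({
    "group":  ["A"] * 100 + ["B"] * 100,
    "target": [1] * 60 + [0] * 40 + [1] * 30 + [0] * 70,
})

# Positive outcome rate of the target within each group. Note these are the
# labels in the data, not model predictions, so this is a data bias check.
positive_rate = df.groupby("group")["target"].mean()

# Assumed definition: each group's rate relative to the group with the highest rate.
conditional_parity = positive_rate / positive_rate.max()

print(conditional_parity)
# A    1.0
# B    0.5   -> well below 0.8, the groups are equally represented but behave very differently
```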
Measuring Model Bias in Qlik Predict
Here is where things become more nuanced. Different types of models require different fairness checks.
This distinction exists because fairness is not a single, universal concept. What it means for a classification model to be fair is fundamentally different from what it means for a regression or time series model. In classification, fairness is tied to decisions, who gets approved, who gets selected, who gets flagged. In regression, fairness is tied to precision, how accurate those predictions are across different groups.
That is why a one size fits all metric would not work. A model could appear fair under one definition while being clearly biased under another. For example, equal approval rates across groups might look fair on the surface, but if the model is systematically less accurate for one group, the problem is still there, just hidden in a different dimension.
Qlik Predict approaches this by aligning bias metrics with the nature of the problem being solved. Instead of forcing a generic fairness score, it evaluates bias in the same space where the model operates, outcomes for classification, errors for regression, and patterns over time for time series. This makes the analysis more meaningful, because it reflects how the model is actually used in practice, not just how it behaves in theory.
Classification Models, Who Gets the Positive Outcome
In classification problems, bias is about how outcomes are distributed across groups.
For example:
Who gets approved
Who gets hired
Who is flagged as high risk
Qlik Predict evaluates this using three key metrics.
Disparate Impact (DI)
Disparate Impact looks at selection rates. It asks:
How often does each group receive a favorable outcome compared to the most favored group?
The metric is calculated as a ratio:
Selection rate of group A divided by selection rate of the most favored group.
Ideal value is 1, meaning equal selection rates
Values below 0.8 indicate potential unfairness
Interpretation is straightforward. If one group is selected far less often than another, the model is not treating them equally. This metric is widely used in regulatory contexts because it directly reflects outcome disparity.
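As a sketch, the calculation looks like this in Python. The data and column names are hypothetical; the point is the ratio of selection rates against the most favored group.

```python
import pandas as pd

# Hypothetical model predictions: 1 means the favorable outcome (e.g. approved).
df = pd.DataFrame({
    "group":      ["A"] * 50 + ["B"] * 50,
    "prediction": [1] * 35 + [0] * 15 + [1] * 20 + [0] * 30,
})

# Selection rate: the share of each group that receives the favorable outcome.
selection_rate = df.groupby("group")["prediction"].mean()

# Disparate Impact: each group's selection rate divided by the most favored group's.
disparate_impact = selection_rate / selection_rate.max()

print(disparate_impact)
# A    1.000
# B    0.571   -> below 0.8, a potential fairness problem
```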
Statistical Parity Difference (SPD)
Statistical Parity Difference measures the gap in positive outcome rates between groups.
Instead of a ratio, it looks at absolute difference.
For example:
If one group has a 60 percent approval rate and another has 40 percent, the difference is 0.2.
Ideal value is 0
Values above 0.2 signal unfairness
SPD is useful because it shows how large the disparity is in absolute terms. While DI tells you relative fairness, SPD tells you how big the gap actually is.
One of the strengths of SPD is its simplicity. It translates fairness into something very concrete, a direct gap between groups. There is no normalization, no scaling against a reference group, just a clear difference that can be interpreted immediately. This makes it particularly useful when communicating results to stakeholders who are not deep into the technical details but still need to understand the impact.
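Using the 60 percent versus 40 percent example above, the calculation is as direct as the definition suggests:

```python
# Hypothetical approval rates for two groups, matching the example in the text.
rate_group_a = 0.60
rate_group_b = 0.40

# Statistical Parity Difference: the absolute gap in positive outcome rates.
spd = abs(rate_group_a - rate_group_b)

print(round(spd, 2))
# 0.2   -> right at the level where the gap starts to signal unfairness
```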
Equal Opportunity Difference (EOD)
Equal Opportunity Difference focuses on true positives. It asks:
Among the cases where the outcome should be positive, does the model treat groups equally?
In other words:
If two groups both qualify for a positive outcome, does the model correctly identify them at the same rate?
Ideal value is 0
Values above 0.1 indicate unfairness
This metric is particularly important in sensitive applications. In healthcare, missing a true positive for one group more than another is not just unfair, it can be harmful.
This metric is especially relevant when the cost of missing a correct positive outcome is high. In many real-world scenarios, false negatives carry more serious consequences than false positives. If a model consistently fails to identify qualified individuals from a specific group, it is not just a performance issue, it is a systemic disadvantage. EOD brings that issue into focus by isolating how well the model performs where it matters most, on cases that truly deserve a positive outcome.
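Here is a small, hypothetical sketch of how the gap in true positive rates can be measured. It only looks at the cases where the actual outcome is positive, which is what distinguishes EOD from the previous two metrics.

```python
import pandas as pd

# Hypothetical ground truth and predictions for two groups.
df = pd.DataFrame({
    "group":      ["A"] * 6 + ["B"] * 6,
    "actual":     [1, 1, 1, 1, 0, 0,  1, 1, 1, 1, 0, 0],
    "prediction": [1, 1, 1, 0, 0, 0,  1, 0, 0, 0, 0, 1],
})

# True positive rate per group: among the cases that truly deserve the positive
# outcome, how often does the model identify them?
positives = df[df["actual"] == 1]
tpr = positives.groupby("group")["prediction"].mean()

# Equal Opportunity Difference: the gap in true positive rates between groups.
eod = abs(tpr["A"] - tpr["B"])

print(tpr["A"], tpr["B"], eod)
# 0.75 0.25 0.5   -> far above 0.1, the model misses qualified cases in group B
```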
Regression and Time Series Models, Fairness in Errors
Not all problems are classification. In regression and time series, there is no “positive outcome”. There is prediction error. So bias is measured differently.
The key question becomes:
Does the model make larger errors for some groups than others?
Qlik Predict evaluates this using parity ratios on error metrics. This shift in perspective is important because it reframes fairness as consistency of performance. In regression and time series problems, every prediction carries some level of error, that is expected. What is not acceptable is when that error is unevenly distributed across groups. If one group consistently receives less accurate predictions, the model is effectively less useful, or even misleading, for that segment.
It also highlights a common blind spot in model evaluation. Aggregate error metrics can look excellent while masking disparities at the group level. A model might achieve low overall error by performing extremely well on the majority group, while underperforming on smaller or less represented groups. Parity ratios force you to look beyond the average and examine whether the model is delivering comparable reliability for everyone it is meant to serve.
Error Parity Ratios (MAE, MSE, RMSE, etc.)
These metrics compare error levels across groups. Take MAE as an example:
Calculate the average error for each group
Compare them using a ratio
If one group consistently has higher errors, the model is less reliable for that group.
Ideal range is between 0.8 and 1
Values above 1.25 signal unfairness
This is subtle but critical. A model can have excellent overall accuracy while performing poorly for specific groups. Without these metrics, you would never know. The metrics for Error Parity Ratios in Qlik Predict are:
MAE
MSE
RMSE
R²
MASE
MAPE
SMAPE
Some of these were discussed in a previous article here. However, let me explain them one more time.
MAE (Mean Absolute Error)
MAE measures the average magnitude of errors, without considering their direction. In simple terms, it tells you how far off the predictions are from the actual values, on average.
For bias detection, MAE parity compares this average error across groups. If one group has a significantly higher MAE, it means the model is consistently less precise for that group. This is often one of the most intuitive metrics because it is expressed in the same units as the target variable, making it easy to interpret.
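As an illustration, here is how an MAE parity check could be sketched, assuming each group's MAE is compared against the overall MAE as the reference; the reference Qlik Predict uses may differ, and the data is hypothetical. The same pattern extends to MSE, RMSE, and the other metrics listed above.

```python
import pandas as pd

# Hypothetical regression results: actual versus predicted values per group.
df = pd.DataFrame({
    "group":     ["A", "A", "A", "B", "B", "B"],
    "actual":    [100, 200, 300, 100, 200, 300],
    "predicted": [110, 190, 310, 130, 240, 250],
})

# Absolute error per row, then Mean Absolute Error per group and overall.
df["abs_error"] = (df["actual"] - df["predicted"]).abs()
mae_per_group = df.groupby("group")["abs_error"].mean()
overall_mae = df["abs_error"].mean()

# Parity ratio: each group's MAE relative to the overall MAE.
mae_parity = mae_per_group / overall_mae

print(mae_parity)
# A    0.4
# B    1.6   -> both outside the 0.8 to 1.25 band, the model is far less reliable for B
```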
MSE (Mean Squared Error)
MSE takes a different approach by squaring the errors before averaging them. This gives more weight to larger errors. From a bias perspective, this is useful because it highlights whether certain groups are experiencing more extreme mistakes. A group might have a similar MAE to others but a much higher MSE, indicating that when the model is wrong, it is very wrong. This can be particularly important in financial or risk-related applications where large errors carry higher consequences.
RMSE (Root Mean Squared Error)
RMSE is simply the square root of MSE. It keeps the same sensitivity to large errors, but expresses the result in the same units as the target variable, which makes it much easier to interpret and communicate. For bias detection, RMSE parity compares this error across groups, so you can see where the model's larger mistakes are concentrated without leaving the scale of the original variable.
R² (Coefficient of Determination)
R² measures how well the model explains the variance in the data. Instead of focusing on error magnitude, it evaluates how much of the outcome variability is captured by the model. In bias analysis, the focus is on the gap in R² between groups. If one group has a significantly lower R², it means the model does not capture the underlying patterns for that group as well as it does for others. The model is effectively less “understanding” of that segment.
MASE (Mean Absolute Scaled Error)
MASE is commonly used in time series forecasting. It compares the model’s error against a simple baseline, typically a naive forecast. This makes it scale-independent and useful when comparing across different datasets or groups. In terms of bias, MASE parity reveals whether the model performs better than a simple baseline consistently across groups. If one group has a higher MASE, it means the model struggles to outperform even basic assumptions for that segment.
MAPE (Mean Absolute Percentage Error)
MAPE expresses error as a percentage of the actual value. This makes it particularly useful when dealing with variables that vary in scale, since it normalizes the error relative to the magnitude of the target. For bias detection, MAPE parity shows whether the model is proportionally less accurate for certain groups. A higher MAPE for one group means that, relative to their actual values, predictions are consistently further off.
SMAPE (Symmetric Mean Absolute Percentage Error)
SMAPE is a variation of MAPE designed to address some of its limitations, especially when dealing with very small or zero values. It balances the percentage error by considering both actual and predicted values in the denominator. This makes it more stable in edge cases.
In bias analysis, SMAPE parity helps ensure that percentage-based error comparisons remain fair and consistent across groups, even when the data includes extreme or near-zero values.
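A small numerical sketch makes the difference visible. SMAPE has a few variants; this one uses a common formulation with the average of actual and predicted values in the denominator, and the numbers are hypothetical.

```python
import numpy as np

actual    = np.array([10.0, 200.0, 0.5])
predicted = np.array([12.0, 180.0, 1.5])

# MAPE: absolute error as a percentage of the actual value.
# The near-zero actual value (0.5) inflates the percentage dramatically.
mape = np.mean(np.abs(actual - predicted) / np.abs(actual)) * 100

# SMAPE: averages actual and predicted in the denominator, which keeps the
# percentage bounded when actual values are very small.
smape = np.mean(np.abs(actual - predicted) /
                ((np.abs(actual) + np.abs(predicted)) / 2)) * 100

print(round(mape, 1), round(smape, 1))
# 76.7 42.9   -> same errors, but SMAPE reacts far less violently to the near-zero case
```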
Favored and Harmed Groups, Making Bias Visible
One of the most practical aspects of Qlik Predict is how it translates metrics into insights.
It identifies:
Favored groups, those receiving better outcomes
Harmed groups, those receiving worse outcomes
This is important because metrics alone can feel abstract.
Seeing which groups are affected turns bias detection into something tangible. It moves the conversation from numbers to impact.
And that shift is critical. A metric like Statistical Parity Difference or Disparate Impact can tell you that something is off, but it does not immediately tell you who is affected. Once you label favored and harmed groups, the problem becomes concrete. You can point to it. You can explain it. You can challenge it.
It also changes how stakeholders engage with the model. Instead of discussing thresholds and ratios, the conversation becomes about real segments, real customers, real people. That tends to raise better questions and, in many cases, more accountability.

Another important aspect is that these labels are not permanent truths. A group can be favored in one model and harmed in another. Even within the same model, different features can produce different bias patterns. This reinforces the idea that bias is contextual, and understanding that context is what allows you to address it properly.
From Detection to Action
Bias detection is not the goal. Action is.
Once bias is identified, several paths are available:
Remove or adjust problematic features
Improve data collection
Rebalance the dataset
Redefine the problem itself
That last option deserves attention.
Sometimes the issue is not the data or the model. It is the question being asked.
A biased question leads to a biased model, no matter how well it is implemented.
Taking action requires judgment. Not every detected bias requires the same response. Some features may need to be removed entirely, especially if they act as proxies for sensitive attributes. Others might require better representation in the data, which is often a longer and more complex effort.
Improving data collection is one of the most effective, and most difficult, solutions. It involves going back to the source and ensuring that the dataset reflects a broader and more balanced reality. This is not something you fix with a parameter, it requires process changes.
Rebalancing techniques, such as sampling or weighting, can help, but they should be applied carefully. They adjust the data the model sees, but they do not change the underlying reality. Used correctly, they can improve fairness. Used blindly, they can introduce new distortions.
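To illustrate the weighting idea, here is a minimal sketch, outside of Qlik Predict, that assigns each row a weight inversely proportional to the size of its group, so underrepresented groups contribute more during training. This is a general technique and a hypothetical example, not a Qlik Predict setting, and it only changes what the model sees, not the underlying reality.

```python
import pandas as pd

# Hypothetical training data where one group dominates.
df = pd.DataFrame({"group": ["A"] * 90 + ["B"] * 10})

# Weight each row inversely to its group's size (a common "balanced" scheme).
group_sizes = df["group"].map(df["group"].value_counts())
df["sample_weight"] = len(df) / (df["group"].nunique() * group_sizes)

print(df.groupby("group")["sample_weight"].first())
# A    0.556   (down-weighted)
# B    5.000   (up-weighted)
```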
Ask Better Questions
And then there is the question itself. Sometimes a model is biased because the objective is inherently biased. If the target reflects a historical decision that was unfair, the model will learn that unfairness perfectly. In those cases, the right solution is not to tweak the model, but to rethink the objective.
A simple example makes this very clear. Imagine the original question is:
“Which customers are most likely to default on a loan based on their ZIP code?”
At first glance, it sounds reasonable. It uses available data, it targets a clear business outcome, and it can produce a predictive model.
The problem is that ZIP code often acts as a proxy for socioeconomic status, race, or other sensitive attributes. The model may learn patterns that reflect historical inequalities tied to geography, not actual individual risk. In practice, this question pushes the model toward reinforcing those patterns.
A better way to frame the same problem would be:
“Which customers are most likely to default on a loan based on their financial behavior and credit history, excluding location-based proxies?”
This reframed question shifts the focus from indirect, potentially sensitive indicators to variables that are more directly related to the outcome. It does not guarantee a perfectly fair model, but it removes a major source of embedded bias and forces the model to rely on more relevant signals.
The difference between the two is subtle in wording, but significant in impact. One invites bias into the model. The other actively reduces the chances of it.

Bias Is Not a Bug, It Is a Property
One final thought.
Bias is not something you fix once and forget. It evolves.
As new data arrives, as behaviors change, as models are retrained, bias can reappear in new forms. That is why bias detection should be treated like performance monitoring.
Continuous, systematic, and part of the lifecycle.
Models live in dynamic environments. Customer behavior shifts, economic conditions change, new segments emerge. As those changes happen, the balance between groups can shift as well. A model that was fair at training time can become biased months later without any change in code.
This makes bias a property of the system, not a defect. It is something that needs to be observed and managed over time, just like accuracy, latency, or data drift.
It also means that governance matters. Monitoring bias, setting thresholds, and defining acceptable ranges should be part of the operational process. Not as an afterthought, but as a standard practice.

The Question That Matters
Machine learning has become very good at optimizing metrics.
Accuracy, precision, recall, RMSE.
But none of those answer the most important question:
Who does the model fail, and how often?
Bias detection gives you the tools to answer that question.
And once you start asking it, you will never look at a “good model” the same way again.
Because a model that performs well on average can still fail consistently for specific groups. And those failures are often invisible unless you look for them explicitly.
In the end, this is what separates a technically sound model from a responsible one. Not just how well it performs, but how evenly that performance is distributed.
And that is a much harder question to answer, but a far more important one.



