
Curse of Dimensionality: Raiders of the Lost Accuracy

  • Writer: Igor Alcantara
  • Jul 27
  • 12 min read


Real photo of Igor Alcantara in a regular ML project

There is a common trend in the world of data that is a bit dangerous. A person watches a couple of YouTube videos, learns how to use a few no-code tools, AutoML or RAG style, and claims "I am a data scientist", or learns a few prompts and says they are an "AI Expert". What bothers me is not the title itself. As I explored in a previous article, titles only matter if you do not see beyond the words. What concerns me is when your lack of knowledge blinds you from understanding what you don't know, because that can derail your whole AI strategy.


Research indicates that only 20% of AI projects are actually implemented and used. This number is alarming. I list three main reasons for that:


  1. Lack of maturity: The company is just not ready for AI yet. They have some homework to do before implementing their first models or assistants. Check your maturity here.

  2. Bad Data: It is the classic Garbage In, Garbage Out. If your data quality is bad or not even measured (go back to the previous item), your AI will be bad.

  3. Professional Gap: You assign the wrong professional to the task or assume "anyone can do it, just click this button and the tool does it for you".


There are many traps and things can go wrong. Sometimes you even get a great result in training, and you think: "I don't need a data scientist, I trained this model and got 95% accuracy!" Oh boy (or girl), you could not be more wrong. You're like the villain in the classic Indiana Jones and the Last Crusade, who reaches for the wrong chalice believing it to be the Holy Grail. You "chose poorly". One of these traps is the topic of this article: the Curse of Dimensionality.


It begins, like any great tale of danger and discovery, with a data scientist, a fresh dataset, and the gleam of predictive glory just beyond reach. You crack your knuckles, open Qlik Predict, and begin your (gradient) descent. The data glows with promise. So many features! So much potential! Like a mysterious idol on a pedestal, the dataset beckons. One more variable couldn’t hurt, right?


Except… it does.


If you think adding more columns to a Machine Learning model cannot cause any harm, this article is for you. Picture yourself as Indiana Jones, poised at the mouth of a forgotten crypt, scanning for traps before making your move. Every feature you add is another step deeper into the temple. The floor shifts. The air thickens. With each dimension, the walls close in. In data science, this perilous journey mirrors the struggle against the Curse of Dimensionality. The deeper you venture into higher dimensions, the greater the risk to your model’s accuracy, stability, and interpretability.


To survive this expedition and escape with the treasure of robust, actionable insight, you’ll need more than courage. You’ll need to understand the mathematical booby traps, craft better feature maps, and guard your sample size like it’s the Holy Grail. Qlik Predict just might be your whip and torch. Brace yourself, we're about to go on an adventure, and I hope you're not afraid of snakes... or math.


What Is the Curse of Dimensionality?


The term "curse of dimensionality" was coined by Richard Bellman in the 1960s while studying dynamic programming. It describes how data becomes increasingly sparse and problematic as the number of dimensions (features) increases.


In simple terms, imagine you’re building a model to predict where a student will sit in a classroom. Not just any classroom, this is Dr. Indiana Jones’ Intro to Artifact Recovery, so seating choice could mean the difference between a safe lecture and dodging a giant rolling rock.


You start with a few useful features:

  • Age

  • Gender

  • Height

  • Sociability score (from “lone archaeologist” to “group project enthusiast”)


You collect data, run your model in Qlik Predict, and it performs well. Maybe students who are more extroverted tend to sit toward the middle, while the quieter ones drift to the edges. Patterns emerge, and the model gives you reliable predictions.


So far, so good.


Then curiosity kicks in. You start adding more features:

  • Hair color

  • Shirt brand

  • Whether they wear glasses

  • Favorite snack

  • Social media usage

  • What car they drive

  • Favorite Data Voyagers podcast episode

  • Number of archaeology documentaries watched last month

  • Whether they’ve ever been chased by a boulder (you never know)


Your feature list grows and grows. Every variable seems like it might help. But instead of improving your model, you notice its performance begins to decline. It’s more fragile. Less consistent. You’ve just activated the curse of dimensionality.


Here’s why.


As you add more features, you’re slicing your data into smaller and smaller subgroups. Let’s say you want to predict where a student will sit based on:


  • Gender = male

  • Glasses = yes

  • Hair color = black

  • Shirt brand = Target

  • Height = 5'6"

  • Car = GWM Haval


How many people in your dataset match all those exact features? Probably just one. And now your model is trying to learn a pattern from that one student.

He sat in the front-left corner that day. Is that where everyone like him will sit? Is that representative? You have no idea. You added so many filters that your data became too thin, too fragmented to be trustworthy.


This is the curse: the more dimensions (features) you add, the less data you have per unique combination. You may start with a classroom of 300 students, but once you break them down by 25 features, it’s like you have hundreds of micro-groups with only one or two people in each.


Your model is no longer finding patterns. It’s memorizing exceptions. It’s learning from noise. It’s overfitting. And in real-world deployment, it falls apart.


The curse of dimensionality is when your dataset becomes too wide and not deep enough. A diet might be in order. You know more about each student, but you can no longer group them meaningfully. You’ve lost the forest for the very precisely measured trees.
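
To make this concrete, here is a minimal sketch (a synthetic pandas DataFrame with hypothetical columns, not real classroom data) showing how the average group size collapses as you condition on more and more features:

```python
# Minimal sketch: how adding features fragments a dataset into tiny groups.
# The DataFrame and its columns are hypothetical stand-ins for the classroom example.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300  # a classroom of 300 students

students = pd.DataFrame({
    "gender": rng.choice(["male", "female"], n),
    "glasses": rng.choice(["yes", "no"], n),
    "hair_color": rng.choice(["black", "brown", "blond", "red"], n),
    "shirt_brand": rng.choice(["Target", "Nike", "Adidas", "Other"], n),
    "height_in": rng.integers(60, 76, n),
    "car": rng.choice(["GWM Haval", "Toyota", "Honda", "None"], n),
})

# Average group size as we condition on more and more features.
for k in range(1, len(students.columns) + 1):
    cols = list(students.columns[:k])
    sizes = students.groupby(cols).size()
    print(f"{k} feature(s): {len(sizes)} groups, "
          f"average {sizes.mean():.1f} students per group")
```

By the time all six features are in play, most "groups" hold one or two students, which is exactly the fragmentation described above.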


High-Dimensional Spaces


We now understand that as the number of features grows, the volume of the data space increases so rapidly that the available data becomes sparse. This sparsity makes it difficult for any model to generalize reliably. It is time to explore this in more detail, and for that we need to understand high-dimensional spaces.


High-dimensional spaces refer to data environments where each observation (or data point) is described by a large number of features or variables. If you think of each feature as a separate dimension, then a dataset with just 3 features exists in 3-dimensional space, something we can still mentally picture. In that world, each data point (or row) can be described by a 3D coordinate. But once you move beyond 3, 10, or 100 dimensions, things quickly spiral into abstract mathematical terrain that no human brain evolved to intuitively grasp.


In a high-dimensional space, everything becomes strange:


  • Each additional feature adds a new axis to the data space. A dataset with 100 features lives in a 100-dimensional space. Visualizing or reasoning about this space in geometric terms becomes nearly impossible.


  • Volume grows exponentially. The number of possible positions a point can occupy increases so fast that data becomes incredibly sparse. Even if you have thousands of rows, they occupy only a tiny fraction of the full space.


  • Distances become less meaningful. As dimensionality increases, the distance between data points tends to converge, meaning that your nearest neighbor is almost as far away as your farthest. This undermines many machine learning techniques that rely on distance, such as k-nearest neighbors or clustering.


  • All points are far from the center. In high-dimensional spaces, most data points lie close to the edges, not near the center. It’s as if gravity flips and pushes everything toward the walls.


One of the strangest and most counterintuitive effects of high-dimensional space is what happens to distance, a concept so fundamental to geometry and modeling that we almost take it for granted.


In low dimensions, distance helps us group similar things together and separate dissimilar ones. That’s why many machine learning algorithms (like k-nearest neighbors, clustering, SVM, or tree splits) rely on distances to make decisions.


But as dimensionality increases, something very weird happens:


All distances between points start to look the same.


This isn’t just philosophical. It’s mathematical.


Imagine you randomly sample n points inside a d-dimensional unit hypercube (each side is from 0 to 1). For each point, you compute:

ratio = dist_min / dist_max

where dist_min is the distance from that point to its nearest neighbor and dist_max is the distance to its farthest neighbor.

You’d expect this ratio to be small. Your closest neighbor should be much closer than your farthest one, right? But here’s the catch:


  • In low dimensions (e.g., 2D or 3D), this ratio is significantly less than 1, as expected.

  • As d → ∞, the ratio approaches 1.


That means:


Your closest neighbor becomes almost as far away as your most distant one.


In other words, there’s no longer a meaningful distinction between “close” and “far.” Everything is just... far.
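
You don't have to take my word for it. Here is a small simulation sketch (random points in a unit hypercube, NumPy only) that estimates the nearest-to-farthest distance ratio for a random query point as the number of dimensions grows:

```python
# Minimal sketch: the nearest/farthest distance ratio drifts toward 1
# as dimensionality grows (points sampled in a d-dimensional unit hypercube).
import numpy as np

rng = np.random.default_rng(0)
n = 500  # number of random points

for d in [2, 10, 100, 1000]:
    points = rng.random((n, d))                    # n points in the unit hypercube
    query = rng.random(d)                          # one random query point
    dists = np.linalg.norm(points - query, axis=1)
    ratio = dists.min() / dists.max()
    print(f"d={d:>4}: nearest/farthest distance ratio = {ratio:.3f}")
```

In two dimensions the ratio is tiny; by a thousand dimensions it is uncomfortably close to 1, and "nearest neighbor" stops meaning much.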


Sample Size: How Much Data Is Enough?


If you want to survive the Temple of Dimensionality, you’ll need more than clever algorithms, a cool hat, and a whip. You’ll need enough data, and much more than you might think. One solution is simply to increase your sample size, assuming you do not want to reduce the number of dimensions, which I will discuss soon.


In low-dimensional spaces, a few hundred rows may be enough to train a good model. But as your feature count increases, your data needs grow exponentially. Why? Because you're not just learning from raw features, you're learning from the space formed by their combinations.


A classic rule of thumb in machine learning is:

𝑛 ≥ 10 × 𝑝

Where:


  • 𝑛: number of observations (rows of data)

  • 𝑝: number of features (columns or dimensions)


So, for 100 features, you'd want at least 1,000 rows. But this guideline assumes:


  • The features are all informative

  • There’s no noise

  • You’re using a linear or relatively simple model

  • You’re not modeling complex interactions


That’s a generous assumption. In the real world, especially with non-linear models like decision trees, ensembles, and neural networks, you may need:

𝑛 ≫ 10 × 𝑝

This is because these models are flexible enough to model subtle patterns, or to hallucinate them when the data is insufficient to ground those patterns in reality.
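
As a rough illustration of that hallucination risk, here is a sketch (synthetic data, with a scikit-learn random forest standing in for any flexible model) where the features are pure noise, the rows number far fewer than ten per feature, and the training score still looks heroic:

```python
# Minimal sketch: with many features and few rows, a flexible model can
# "learn" pure noise -- near-perfect training accuracy, chance-level test accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n, p = 200, 500                      # far fewer rows than 10 * p
X = rng.random((n, p))               # random features: no real signal
y = rng.integers(0, 2, n)            # random labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)
model = RandomForestClassifier(random_state=7).fit(X_tr, y_tr)

print("train accuracy:", model.score(X_tr, y_tr))   # close to 1.0
print("test accuracy: ", model.score(X_te, y_te))   # hovers around 0.5
```

That gap between the training score and the test score is the "95% accuracy" trap from the introduction, reproduced on demand.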


Feature Interactions and the Exponential Explosion


Features don’t exist in isolation, they interact. In most real-world scenarios, the relationship between two or more features can be just as important as the individual features themselves. This is known as a feature interaction.


For example:

  • A student’s sociability might not predict their seating choice on its own.

  • Their sociability combined with height, or sociability combined with age and gender, might hold the real predictive power.


This is where the curse of dimensionality gets more complex: when you start considering all possible combinations of features, your feature space doesn’t just grow, it explodes.


Pairwise Interactions: The First Wave of Chaos


Let’s say you have p original features. If you want to include all two-way (pairwise) interactions, you’ll need to consider every possible pairing of two features. Mathematically, the number of possible 2-feature combinations is given by the binomial coefficient:

C(𝑝, 2) = 𝑝 (𝑝 − 1) / 2

So:

  • For p=10, that’s 45 new interaction terms.

  • For p=100, you now have 4,950 new features on top of the original 100.


This means a model that originally had 100 features now has 5,050 if you include all pairwise interactions. And that is only considering how pairs of features interact. If you consider interactions in triplets, the number of combinations exceeds 160,000. Imagine a machine learning model in patient care, where each medication can interact with every other medication and with blood pressure, glucose, and heart rate, each of which affects many other things. You can end up with combinations in the millions from just a few dozen columns.
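
The counting itself is easy to verify. A quick sketch with Python's math.comb reproduces the numbers above:

```python
# Minimal sketch: counting interaction terms as the feature count grows.
from math import comb

for p in [10, 100]:
    pairs = comb(p, 2)      # two-way interactions: p * (p - 1) / 2
    triples = comb(p, 3)    # three-way interactions
    print(f"p={p}: {pairs:,} pairwise terms, {triples:,} three-way terms, "
          f"{p + pairs:,} features total with all pairwise interactions")
```

For p = 100 that prints 4,950 pairwise terms and 161,700 three-way terms, which is where the "over 160,000" figure comes from.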


From a mathematical standpoint, the model’s complexity grows combinatorially, not linearly. That causes several issues:


  1. Computation Overload: Training time increases dramatically. Memory usage spikes. Even with efficient hardware, you’re now running up against processing limits.

  2. Sparse Coverage: With that many combinations, your dataset won’t have enough examples to represent each interaction meaningfully. You might have a million rows, but each interaction is now supported by only a handful of examples, or sometimes just one.

  3. Overfitting: The model begins to memorize spurious patterns in rare combinations, especially if you're using tree-based models or neural nets. The model may perform well on training data but crash when faced with new inputs.

  4. Interpretability Collapse: Try explaining to a stakeholder that the prediction depends on the interaction between "customer tenure," "browser type," and "average cart size in the past 3.2 weeks." It’s not just confusing, it’s non-actionable.


Escaping the Curse: How to Beat the Dimensional Trap


Even Indiana Jones wouldn’t dare walk into a booby-trapped data temple without a plan, and neither should you. If your model is starting to look like it belongs in a museum (and not in production), it’s time to talk strategy. The Curse of Dimensionality can feel like a mathematical minefield, but there are clear, practical ways to fight back.


The key idea is this: don’t let your model get overwhelmed by too many features. The more variables you have, the more data you need. But if you can reduce the number of dimensions, or at least make them more meaningful, you can simplify the journey, avoid traps, and unlock better predictions. Here are the main tools to help you escape:


1. Feature Selection: Keep Only What Matters


Sometimes, the best move is to get rid of what you don’t need. Feature selection is the process of choosing only the most important variables and dropping the rest. Think of it like cleaning out your bag before a long trek. Do you really need to carry that broken compass?


There are several ways to do this:


  • Look at which features are highly correlated (too similar) and keep only one.

  • Use basic statistics to measure which features actually relate to your target.

  • Let your modeling tool (like Qlik Predict) show you which features have the biggest impact.


This doesn’t just make your model faster, it makes it more stable and more accurate.
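
As a rough sketch of the first two ideas (the 0.9 correlation threshold, the column handling, and the choice of mutual information as the scoring statistic are illustrative assumptions, not a prescription):

```python
# Minimal sketch of two common feature-selection moves, assuming a numeric
# pandas DataFrame X of features and a target series y (names are hypothetical).
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    # Keep only one feature from every pair that is nearly a duplicate.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return X.drop(columns=to_drop)

def keep_most_informative(X: pd.DataFrame, y: pd.Series, k: int = 10) -> pd.DataFrame:
    # Score each feature against the target and keep only the top k.
    selector = SelectKBest(mutual_info_classif, k=min(k, X.shape[1])).fit(X, y)
    return X.loc[:, selector.get_support()]
```

The third option, letting Qlik Predict surface feature importance for you, needs no code at all.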


2. Feature Extraction: Combine Signals into Simpler Forms


Instead of picking features, what if you could create new ones that are smarter, cleaner versions of the originals? That’s what feature extraction does. It builds compressed versions of your data, often combining multiple variables into one that captures the most useful patterns. That involves doing the work of a scientist: researching the field of study and reading papers on the subject to understand what the most important predictors of your target are. For example, instead of using the Total Cholesterol value, you might use a Cholesterol Range (Low, Normal, High, etc.). When predicting cardiovascular disease, whether your cholesterol is controlled matters more than the exact number.


Another method is to combine multiple features into one. Mathematically speaking, the most common technique for this task is PCA (Principal Component Analysis). PCA transforms your data into a new set of uncorrelated variables that capture the most variance. You lose some interpretability, but you gain clarity and reduce noise.


In a research project I worked on last year, we needed to understand the impact of dozens of features on the alarming number of suicides by firearms in Ohio. There were so many factors that could contribute that the task was too complex to handle with simple charts and tables. As part of the process, I applied PCA, reduced the features to a few grouped components, and got a great model out of it.
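
For those who prefer code to archaeology, here is a minimal PCA sketch with scikit-learn; the synthetic data and the 95% variance target are illustrative choices, not a recipe:

```python
# Minimal sketch: compressing many correlated features with PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
latent = rng.random((500, 5))                                    # 5 hidden signals
X = latent @ rng.random((5, 40)) + 0.01 * rng.random((500, 40))  # 40 correlated features

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
pca = PCA(n_components=0.95)                   # keep enough components for 95% of variance
X_reduced = pca.fit_transform(X_scaled)

print(f"{X.shape[1]} original features -> {X_reduced.shape[1]} components")
```

Because the 40 features here are driven by only 5 hidden signals, PCA collapses them down to a handful of components while keeping almost all of the information.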


3. Regularization: Keep Your Model in Check


When a model tries too hard to fit every little detail in your data, it ends up memorizing noise. Regularization is like setting ground rules for your model: “Don’t get carried away.”

By adding a small penalty for complex behavior, you encourage your model to keep things simple. This reduces the risk of overfitting, which is especially dangerous in high-dimensional spaces. The basic idea is to add a cost for using a feature in the model, so that a new dimension only earns its place if its benefit outweighs the noise it creates.


Common types of regularization are:


  • L1 (Lasso): Removes unnecessary features by setting their weights to zero.

  • L2 (Ridge): Shrinks feature weights without removing them entirely.


Both methods help your model focus on the core signals, not the clutter. The good news is that Qlik Predict comes with this functionality out of the box.
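
If you want to see the difference between the two penalties, here is a small sketch with scikit-learn's Lasso and Ridge on synthetic data where only two of fifty features carry signal (the alpha values are illustrative, not tuned):

```python
# Minimal sketch: L1 (Lasso) zeroes out weak features, L2 (Ridge) only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(n)   # only 2 features matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso non-zero weights:", np.sum(lasso.coef_ != 0))   # typically just a few
print("Ridge non-zero weights:", np.sum(ridge.coef_ != 0))   # all 50, just smaller
```

Lasso effectively does feature selection for you, while Ridge keeps every feature but reins in its influence.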


4. Clean Data, Smart Start


Before you do anything fancy, make sure your dataset is clean. That means:


  • Handling missing values properly

  • Normalizing numeric features (so they’re on the same scale)

  • Encoding categories the right way

  • Removing obvious outliers


A model trained on messy data will struggle, no matter how advanced your algorithms are. Qlik Predict helps automate these steps, but understanding why they matter helps you make better decisions later.
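
As a sketch of what those steps can look like in code (the column names are hypothetical placeholders), a scikit-learn preprocessing pipeline covers all four:

```python
# Minimal sketch of a preprocessing pipeline covering the cleaning steps above.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "height", "sociability_score"]     # hypothetical columns
categorical_cols = ["gender", "shirt_brand"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # handle missing values
    ("scale", StandardScaler()),                    # put features on the same scale
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),   # encode categories safely
])

preprocess = ColumnTransformer([
    ("numeric", numeric_steps, numeric_cols),
    ("categorical", categorical_steps, categorical_cols),
])
# X_clean = preprocess.fit_transform(raw_dataframe)
```

Outlier handling is more context-dependent, which is why it stays a judgment call rather than a pipeline default.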


So What’s the Goal?


The ultimate mission is this: reduce the number of dimensions to just what you need. Fewer features = fewer blind alleys = more generalizable, trustworthy predictions.

Like any good explorer, your job isn’t to collect every shiny object. It’s to find the few that matter, understand their value, and build a model that can tell that story clearly and consistently.


So, as you emerge from the data jungle, hat a little scratched, whip well-used, maybe a little wiser, you realize something important: building a predictive model isn't about collecting every artifact you can find. It's about knowing which ones actually matter. The Curse of Dimensionality doesn’t announce itself with dramatic music or collapsing temples. It sneaks in through good intentions and unchecked assumptions. But with the right tools, the right professional, a bit of experience, and the humility to question your own understanding, you can navigate this peril. Qlik Predict can guide you, but the judgment call is yours. Choose features wisely. Respect your sample size. And remember: in the temple of machine learning, it’s not the golden chalice that brings success. It’s the humble, well-chosen one that leads to lasting insight.



 
 
 
