We take a lot of things for granted. For those new to Machine Learning who are lucky enough to start with tools like Qlik AutoML, there is a lot happening behind the scenes that veterans (aka classical practitioners) like me had to program in R and Python in the past, just like the old Aztecs did (well, not really, but I like the joke). Don't get me wrong - this is amazing. We can now focus more on the analytical and scientific parts of the work rather than typing long sequences of code.
Nevertheless, I believe it is important to understand what's happening behind the scenes. It helps you not only appreciate what you have "for free" but, more importantly, it makes you a better analyst. Understanding what Qlik AutoML does even before training the Machine Learning models will help you know which tasks to ask your Data Engineer to perform when preparing data for Machine Learning and which tasks will be handled automatically.
That is the reason I am starting this series of articles, "Qlik AutoML Behind the Scenes Explained". My first article (this one) is about the Preprocessing Tasks of AutoML: the ones performed before the training of the models, the ones you see in the image below. Qlik AutoML executes everything using Python and Scikit-Learn, but do not worry: you will not need to read code to follow along (I will sprinkle in a few optional snippets for the curious). I will discuss each concept in a way that every human can understand.
Oh, and don't be scared by the size of this article. I placed some shortcuts for those who just want to know the basics. For them, it is only a 3-minute read.
Trust me, it will be fun!
I wrote this article 3 times. I was never happy. It was too technical, too boring. So, I decided to wear my Marketing hat and rewrite it in a more clickbait, I mean, interesting style. There is also a color code involved: each of the 5 preprocessing tasks will be explained twice. First, in green (because we love this color), comes a very simple explanation; then, in gray, a more detailed but still fun approach.
Well, since this is a whole new article, how about we also come up with a new title?
Spoiler alert: I also considered writing this in classic BuzzFeed Qlik Bait style: "10 Things You Didn't Know About Qlik AutoML (Number 8 Will Make Your Neural Networks Tingle!)" but decided to keep it real instead. So, here we go:
5 Mind-Blowing Things About Qlik AutoML That Will Make You Say "Wait, What?!"
Warning: This article contains dad jokes about data science. Proceed with caution! 🤓
Hey data voyagers, Qlik community, data enthusiasts and ML curious cats! Ever looked at machine learning and thought, "This is harder to understand than my teenager's text messages"? Well, you're in for a treat! We're about to dive into the world of Qlik AutoML, and I promise you won't need a PhD in Computer Science or Statistics to follow along. (Though if you have one, that's cool too – we don't judge!)
What's Qlik AutoML, Anyway?
Imagine if you could build a complex machine learning model as easily as making a sandwich. Okay, maybe not that easy (I've seen some pretty complicated sandwich builds), but you get the idea! Qlik AutoML is like having a super-smart data science assistant who does all the heavy lifting while you sit back and enjoy your coffee. No coding required – just point, click, and let the magic happen!
Now, let's explore the five things you always wanted to know about Qlik AutoML but were afraid to ask (because let's face it, sometimes tech stuff can be scarier than a database without backups or developing in Production! 😱).
NOTE FOR BEGINNERS: From this point on, if you want only the simple explanation, read the text in green and ignore the rest.
For a shortcut to each section, use this index. Click on each title to skip to that section.
1. The Great Missing Data Adventure: Imputation of Nulls
AKA: "Help! My Data Has More Holes Than Swiss Cheese!" 🧀
Think of your dataset like a puzzle with some missing pieces. Instead of throwing away the whole puzzle (which would be a waste!), Qlik AutoML cleverly fills in those gaps. For numbers, it looks at all the other pieces and uses the average value - like if you're missing someone's age, and everyone else is between 30-40, it'll make an educated guess around 35. For categories (like favorite colors or cities), it uses the category "other" for those missing rows. It's like having a really smart friend who's great at filling in the blanks in a way that makes sense!
The Juicy Details (Don't Worry, It's Fun!):
Imagine you're planning a huge party (stay with me here, this analogy is going somewhere). Your guest list has some gaps because some people forgot to RSVP. You could:
A) Cancel the party (Delete rows with missing data - booooring!)
B) Make up random names (Bad idea, unless you want chaos)
C) Make educated guesses based on who usually comes to your parties (Now we're talking!)
Qlik AutoML goes with option C, but with way more math and less party anxiety. Here's how:
For numbers (like age or salary):
It calculates the average of what we know (mean imputation)
Example: If most of your colleagues are around 35, that missing age is probably not 95
Fun fact: This is like guessing someone's age at a party, but with actual math instead of awkward compliments!
For categories (like favorite ice cream flavor), the solution is much simpler: Qlik AutoML simply fills any missing value as "other" and moves on.
However, not all columns with missing data can be saved. If more than 50% of a column's values are missing, the amount of unknown is so large that it would be irresponsible of Qlik AutoML to make any inference. In those situations, the column is excluded from the model. You simply need a better column - or, to keep the party analogy going, if more than 50% of your friends did not RSVP, maybe you need new friends.
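For the code-curious (totally optional!), here's a minimal Python/pandas sketch of those three rules: mean for numbers, "other" for categories, and dropping anything more than half empty. The party data is made up, and this is my reconstruction of the behavior described above, not Qlik's actual source code:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up party data: the guest list with missing RSVPs.
df = pd.DataFrame({
    "age": [34, None, 38, 31, None, 36],
    "city": ["Lisbon", None, "Porto", "Lisbon", "Porto", None],
    "mostly_empty": [None, None, None, None, 1.0, None],
})

# Rule 3: drop any column that is more than 50% missing.
df = df.loc[:, df.isna().mean() <= 0.5]

# Rule 1: numeric columns get mean imputation.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])

# Rule 2: categorical columns get the literal category "other".
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = df[cat_cols].fillna("other")

print(df)
```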
Pro Tip: This is way better than the old-school method of just ignoring missing data, which is like pretending that friend who never RSVPs doesn't exist. We all have that friend, don't we? 😉
2. The Category Whisperer: Encoding Categorical Features
AKA: "Converting Cat-egories Because Your Model Can't Read" 😺
Computers are like extremely literal-minded friends who only understand numbers, not words. When you tell them about categories like "red," "blue," and "green," they get confused. So Qlik AutoML acts as a translator, converting these words into numbers in a way that makes sense. It's similar to how we might assign players numbers in sports - but instead of just giving each color a simple number (which might accidentally suggest that "red" is somehow "better" than "blue"), it creates a special numerical code that treats each category fairly and distinctly. This way, the computer can work with the data while preserving the true meaning of each category.
The Deep Dive (Hold onto Your Hats!):
Imagine you're trying to explain colors to someone who can only understand numbers. Tricky, right? That's exactly what we need to do with machine learning models. Qlik AutoML uses two clever methods to handle this translation: one-hot encoding and impact encoding. It's like having two different translation services, each perfect for different situations!
One-Hot Encoding: The VIP Treatment
One-hot encoding is like giving each category its own personal spotlight. Let's break it down with a real example:
Suppose you're analyzing ice cream flavors at your shop:
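To make it concrete, here's a tiny pandas sketch of what one-hot encoding does to some made-up flavor orders (optional reading, promise!):

```python
import pandas as pd

# Hypothetical orders at the ice cream shop.
orders = pd.DataFrame({"flavor": ["vanilla", "chocolate", "strawberry", "vanilla"]})

# One-hot encoding: one 0/1 column per flavor, so no flavor outranks another.
encoded = pd.get_dummies(orders, columns=["flavor"], dtype=int)
print(encoded)
#    flavor_chocolate  flavor_strawberry  flavor_vanilla
# 0                 0                  0               1
# 1                 1                  0               0
# 2                 0                  1               0
# 3                 0                  0               1
```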
Qlik AutoML uses this method when:
Your dataset has 100 or fewer columns (let's call these "boutique datasets" 😉)
The categorical feature has 13 or fewer unique values
Why 13? It's like having a dinner party – once you get more than 13 guests, you need a different strategy for seating arrangements! With more categories, one-hot encoding would create too many new columns, making your model as confused as a chameleon in a bag of Skittles or like a smart data analyst left with nothing more than just Power BI.
Impact Encoding: The Smooth Operator
Now, what happens when you have way more categories? Like trying to encode all possible last names in your customer database? This is where impact encoding swoops in to save the day!
Impact encoding is like having a really smart summarizer. Instead of creating a new column for each category, it looks at how each category value relates to your target variable and assigns a numerical value based on that relationship. It's basically giving each category a "reputation score" based on its past performance!
Qlik AutoML automatically switches to impact encoding when:
Your categorical feature has more than 13 unique values (even in smaller datasets)
Your dataset has more than 100 columns (regardless of unique values)
Here's a real-world example:
Each city gets a number that represents its "impact" on purchases, rather than getting its own column. Neat, right? Why purchases? Because that is the Target variable selected - it is what you are predicting. In other words, Impact Encoding replaces the category text with a number showing how much that value impacts the prediction column.
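If you want to see the idea in code, here's a deliberately naive sketch of impact (a.k.a. target) encoding, using made-up cities and a made-up "purchased" target. Qlik doesn't publish its exact formula, and real implementations typically add smoothing so rare categories don't get wild scores:

```python
import pandas as pd

# Hypothetical customers: many possible cities, one target column ("purchased").
df = pd.DataFrame({
    "city": ["Lisbon", "Porto", "Lisbon", "Faro", "Porto", "Lisbon"],
    "purchased": [1, 0, 1, 0, 1, 1],
})

# Naive impact encoding: each city becomes the average target value
# observed for that city -- its "reputation score" for purchases.
city_impact = df.groupby("city")["purchased"].mean()
df["city_encoded"] = df["city"].map(city_impact)
print(df)
#      city  purchased  city_encoded
# 0  Lisbon          1           1.0
# 1   Porto          0           0.5
# 3    Faro          0           0.0  (etc.)
```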
The Decision-Making Process
Qlik AutoML is like a wise chef who knows exactly which cooking method to use for different ingredients. Here's its decision tree:
First Check: How big is your dataset?
More than 100 columns? → Impact encoding for everything!
100 or fewer columns? → Proceed to step 2
For Each Categorical Column:
13 or fewer unique values? → One-hot encoding
More than 13 unique values? → Impact encoding
This automatic selection ensures your data is encoded efficiently without creating unnecessary complexity. It's like having a smart GPS that automatically picks the best route based on traffic conditions!
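And if you like your decision trees as actual code, the whole routing logic fits in one tiny (hypothetical) function, with the thresholds taken straight from this article:

```python
def choose_encoding(total_columns: int, unique_values: int) -> str:
    """The encoding decision tree described above (thresholds from this article)."""
    if total_columns > 100:
        return "impact"      # wide dataset: impact encoding for everything
    if unique_values <= 13:
        return "one-hot"     # boutique dataset, few categories
    return "impact"          # too many categories for one-hot

print(choose_encoding(50, 5))     # one-hot
print(choose_encoding(50, 40))    # impact
print(choose_encoding(150, 5))    # impact
```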
Why This Matters
Understanding these encoding methods helps you:
Predict how your data will be transformed
Know when to potentially combine categories to stay under the 13-value threshold
Understand why your model might perform differently with different categorical features
Pro Tip: When preparing your data for Qlik AutoML, consider whether you really need all those unique categories. Sometimes, grouping similar categories together (like grouping rarely-used values into an "Other" category) can help you stay under the 13-value threshold and take advantage of one-hot encoding's benefits!
3. The Great Equalizer: Feature Scaling
AKA: "Making Sure Your Data Doesn't Skip Leg Day" 💪
Imagine trying to compare the height of a house measured in feet with someone's weight measured in pounds - it's like comparing apples and oranges! Feature scaling is like converting everything to the same unit of measurement. Qlik AutoML automatically adjusts all your numbers to be on a similar scale, so when the model looks at your data, it's comparing apples to apples.
This ensures that just because one number is bigger (like a salary in thousands), it doesn't overshadow smaller but equally important numbers (like years of experience). You don't want the Machine Learning model to consider Salary more important than Years of Experience just because it is a bigger number. It's like making sure everyone in a conversation gets an equal chance to speak, regardless of how loud their voice naturally is!
The Nerdy (But Nice!) Details:
Picture this: You're comparing the weight of a feather to the weight of your laptop. In their natural state, it's like having a sumo wrestler compete against a ballet dancer - not exactly a fair comparison! Feature scaling is like putting everyone in the same weight class.
Here's how Qlik AutoML does it:
First, it takes all your numbers and puts them on a common scale centered on zero (with most values landing between -1 and 1)
The formula looks scary (X_scaled = (X - μ) / σ), but it's really just:
Take your number
Subtract the average
Divide by how spread out the numbers are
Voilà! Scaled feature!
For reference: X is the original number, μ is the mean (average) of the column, and σ is the Standard Deviation. So we could rewrite it as:
Scaled Value = (Original Value - Mean) / Standard Deviation
Real-World Example Time! Let's say you're analyzing both salary and age:
Salary: $30,000 to $200,000
Age: 18 to 65
Without scaling, your model might think salary is waaaaay more important just because the numbers are bigger. That's like saying Godzilla is a better movie than The Godfather just because Godzilla is bigger!
Let's Scale Age.
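Here's a quick worked example with six made-up ages (the exact numbers don't matter, the math is the same - and it's precisely what scikit-learn's StandardScaler computes):

```python
import numpy as np

# Made-up ages between 18 and 65.
ages = np.array([18, 25, 35, 42, 50, 65], dtype=float)

# The formula from above: subtract the mean, divide by the standard deviation.
mu, sigma = ages.mean(), ages.std()
scaled = (ages - mu) / sigma
print(scaled.round(2))
# [-1.36 -0.91 -0.27  0.18  0.69  1.66]
```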
Notice something cool? The scaled values typically fall between -3 and 3, with:
0 meaning "average"
Positive numbers meaning "above average"
Negative numbers meaning "below average"
After scaling, your data has some nice properties:
The mean becomes 0 (like setting the "center point")
About 68% of values fall between -1 and 1
About 95% of values fall between -2 and 2
About 99.7% of values fall between -3 and 3
It's like putting all your features on the same playing field, regardless of their original size!
Why This is A-mazing
Fair Competition:
Before scaling: Income ($50,000) dominated Age (35)
After scaling: Both features might be something like 0.7 and 0.8
Better Model Performance:
Many algorithms work better with scaled features
Gradient descent (a common optimization technique) converges faster
Prevents numerical instability (no more computer crying over big numbers!)
Easier Interpretation:
You can quickly see if a value is above or below average
The magnitude tells you how unusual a value is
Works consistently across all features
4. Automatic Holdout of Training Data
AKA: "The Art of Keeping Secrets from Your Model" 🕵️♂️
This is kind of like having a practice test and a final exam for your machine learning model. Qlik AutoML automatically sets aside a portion of your data (i.e. keeping some exam questions secret) to test how well the model has learned. Think of it as teaching someone to cook - you show them how to make a dish multiple times (training data), but the real test comes when they have to cook something using the same techniques but with slightly different ingredients (test data). This helps ensure your model isn't just memorizing the data but actually learning patterns it can use on new information.
The 'Secrets and Stats' Detailed Explanation:
Imagine you're teaching someone to cook. If you only test their skills using the exact recipes they practiced with, how would you know if they can actually cook, or if they've just memorized those specific recipes? That's exactly why we need holdout data!
How Qlik AutoML Plays Hide and Seek with Data
Qlik AutoML offers two flavors of holdout selection, each perfect for different situations:
1. The Default Method: Random Selection
Think of this as randomly picking students from a class to answer pop quiz questions. Here's how it works:
Training Dataset (1000 records)
↓
Random Selection Process
↓
800 records for Training (80%)
200 records for Holdout (20%)
Why Random Selection?
Ensures a representative sample
Maintains the same distribution of patterns
Works great for most use cases
2. The Time-Based Method: Chronological Selection
This is like testing a weather prediction model with tomorrow's weather, not yesterday's. Perfect for when timing matters!
Training Dataset (Jan-Dec 2023)
↓
Sort by Date
↓
Training: Jan-Oct 2023
Holdout: Nov-Dec 2023
When to Use Time-Based Holdout:
Forecasting future sales
Predicting stock prices
Anticipating seasonal trends
Any situation where time patterns matter!
The Magic Behind the Scenes
Let's peek under the hood at how Qlik AutoML handles this:
For Random Holdout:
Step 1: Shuffle the entire dataset
Step 2: Calculate holdout size (typically 20%)
Step 3: Split data into training and holdout sets
Step 4: Keep holdout data locked away
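In scikit-learn terms, the random flavor is essentially a train_test_split. A toy sketch with fake data (my approximation of the behavior described above, not Qlik's actual pipeline):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 records with four made-up features and a binary target.
X = np.random.rand(1000, 4)
y = np.random.randint(0, 2, size=1000)

# Shuffle and split: 800 records for training, 200 locked away as holdout.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.20, shuffle=True, random_state=42
)
print(len(X_train), len(X_hold))  # 800 200
```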
For Time-Based Holdout:
Step 1: Identify the date column
Step 2: Sort all data chronologically
Step 3: Reserve most recent chunk for holdout
Step 4: Use older data for training
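And a matching sketch for the time-based flavor, using a made-up year of daily sales:

```python
import pandas as pd

# Hypothetical daily records for 2023.
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", "2023-12-31", freq="D"),
    "sales": range(365),
})

# Sort chronologically, then reserve the most recent 20% as holdout.
df = df.sort_values("date")
cutoff = int(len(df) * 0.80)
train, holdout = df.iloc[:cutoff], df.iloc[cutoff:]
print(train["date"].max())    # training ends in late October
print(holdout["date"].min())  # holdout starts right after
```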
Why This is Cool
Prevents Overfitting:
Like preventing a student from memorizing test answers
Ensures model learns patterns, not specific cases
Real-world performance validation
Honest Performance Metrics:
No "peeking at the answers"
True indication of model capability
Reliable performance estimates
Future-Proofing:
Especially with time-based holdout
Tests model on truly "unseen" scenarios
Better prediction of real-world performance
5. Five-Fold Cross-Validation
AKA: "Playing Musical Chairs with Your Data" 🎵
Instead of just testing your model once, imagine testing it five different ways to make sure it really knows its stuff. It's something like teaching someone a new language and testing them with different native speakers to ensure they can understand various accents and speaking styles.
Qlik AutoML automatically divides your data into five parts and plays this game of "musical chairs" where each part gets a turn being the test data while the other four parts are used for training. This gives you a much better idea of how well your model will perform in the real world, where data might look a bit different than what it was trained on. It's like getting five different second opinions before making an important decision!
Your Original Dataset (100%)
↓
Shuffle Like a Vegas Dealer 🎲
↓
Holdout (20%) │ Training Data (80%)
↓
Five-Fold Magic Happens!
The "Rinse and Repeat Until Perfect" Details:
Imagine you're teaching a robot to play cards. You wouldn't want it to memorize just one deck order, right? That's exactly the kind of memorization default cross-validation protects against!
Wait! Did I say Default? Does it mean there is another method? Yes! Just like the 4th item in this article (Automatic Holdout of Training Data), there is the default way and a Time-Based one. Let's start with the default and break it down:
The Initial Split:
Your Dataset (100%)
↓ Shuffle like a Vegas pro
├── Holdout (20%) → "The Final Exam"
└── Training (80%) → "The Practice Sessions"
The Five-Fold Dance:
Training Data
↓ Split into 5 equal pieces
Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5
The Training Rounds:
Round 1: Test on Fold 1, Train on Folds 2-5
Round 2: Test on Fold 2, Train on Folds 1,3-5
Round 3: Test on Fold 3, Train on Folds 1-2,4-5
Round 4: Test on Fold 4, Train on Folds 1-3,5
Round 5: Test on Fold 5, Train on Folds 1-4
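For the curious, scikit-learn's KFold does exactly this musical-chairs routine. A toy sketch (fake data; my reconstruction, not Qlik's actual pipeline code):

```python
import numpy as np
from sklearn.model_selection import KFold

# The 80% training slice from the diagram, as toy data.
X = np.arange(100).reshape(-1, 1)
y = np.random.randint(0, 2, size=100)

# Five shuffled folds; each fold gets one turn as the test set.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for round_no, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Round {round_no}: train on {len(train_idx)} rows, test on {len(test_idx)} rows")
```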
This is perfect for:
General prediction tasks
When time doesn't matter
When patterns are time-independent
Customer segmentation
Image classification
Product recommendations
Time-Based Cross-Validation: The Chronological Chronicle
Now, imagine you're teaching someone to predict weather patterns. You wouldn't use next week's weather to predict yesterday's, would you? Welcome to time-based cross-validation!
How It Works:
The Time-Sorted Split:
Your Dataset (100%)
↓ Sort by date/time
├── Training (80%) → "The Historical Chronicles"
└── Holdout (20%) → "The Future Test"
(Most recent data)
The Growing Window Approach:
Timeline: [Oldest] → → → → [Newest]
Round 1: Train[Fold 1] → Test[Fold 2]
Round 2: Train[Fold 1,2] → Test[Fold 3]
Round 3: Train[Fold 1,2,3] → Test[Fold 4]
Round 4: Train[Fold 1,2,3,4] → Test[Fold 5]
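scikit-learn ships this growing-window behavior as TimeSeriesSplit. A toy sketch with made-up, pre-sorted data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 100 records already sorted oldest -> newest.
X = np.arange(100).reshape(-1, 1)

# Growing window: each round trains on everything before its test fold.
tscv = TimeSeriesSplit(n_splits=4)
for round_no, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"Round {round_no}: train rows 0-{train_idx[-1]}, "
          f"test rows {test_idx[0]}-{test_idx[-1]}")
```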
This is perfect for:
Sales forecasting
Stock price prediction
Seasonal trend analysis
Demand forecasting
Website traffic prediction
Inventory management
Conclusion: You Made It! 🎉
(And Your Brain Cells Are Still Intact!)
Congratulations! You've just taken a deep (or shallow) dive into the preprocessing magic of Qlik AutoML, and you're still here to tell the tale! Let's quickly recap what we've learned without the fancy tech jargon, because we've had enough of that for one day. This is the same list I placed at the beginning of this article, with links to each section, but now each item comes with a short, dad-joke-style recap.
Missing Data Handling: We learned how Qlik AutoML fills in the blanks better than your friend trying to explain why they missed your birthday party.
Categorical Encoding: Turning words into numbers, because apparently computers never learned to read like the rest of us.
Feature Scaling: Making sure all your numbers play nice together, like a kindergarten teacher managing recess.
Automatic Holdout: Keeping some data secret, like that emergency chocolate stash you swear you don't have.
Cross-Validation: Testing your model five different ways, because trust issues are real in the machine learning world!
The Bottom Line
Qlik AutoML handles all these preprocessing steps automatically, saving you from writing enough code to make your keyboard cry. It's like having a really smart friend who does your homework, except it's totally legal and encouraged!
And remember: If anyone asks you what you learned today, just say "I now understand the intricate preprocessing paradigms of automated machine learning systems." If they look impressed, nod wisely. If they ask for details, just show them this article.
Final Dad Joke (Because Why Not?)
Q: Why did the data scientist break up with their preprocessing pipeline?
A: Because it had too many missing values in their relationship!
Ba dum tss 🥁
P.S. If you made it this far and actually understood everything, congratulations! You've officially made it to the small group of people that I really like. If you didn't understand everything, congratulations! You're a normal human being. I also like you because you read this long text. Either way, you're awesome!