We take a lot of things for granted. For those new to Machine Learning who are lucky enough to start with tools like Qlik AutoML, there is a lot happening behind the scenes that veterans (aka classical practitioners) like me had to program in R and Python in the past, just like the old Aztecs did (well, not really, but I like the joke). Don't get me wrong - this is amazing. We can now focus more on the analytical and scientific parts of the work rather than typing long sequences of code.
Nevertheless, I believe it is important to understand what's happening behind the scenes. It helps you not only appreciate what you have "for free" but, more importantly, it makes you a better analyst. Understanding what Qlik AutoML does even before training the Machine Learning models will help you know which tasks to ask your Data Engineer to perform when preparing data for Machine Learning and which tasks will be handled automatically.
That is the reason I am starting this series of articles, "Qlik AutoML Behind the Scenes Explained". My first article (this one) is about the Preprocessing Tasks of AutoML: the ones performed before the training of the models, the ones you see in the image below. Qlik AutoML executes everything using Python and Scikit-Learn, but do not worry: you will not need to read code to follow along (I will sprinkle in a few optional snippets for the curious). I will discuss each concept in a way that every human can understand.
Oh, and don't be scared by the size of this article. I placed some shortcuts for those who just want to know the basics. For them, it is only a 3-minute read.
Trust me, it will be fun!
I wrote this article 3 times. I was never happy. It was too technical, too boring. So, I decided to wear my Marketing hat and rewrite it in a more clickbait, I mean, interesting style. There is also a color code involved: each of the 5 preprocessing tasks will be explained twice. First, in green (because we love this color), comes a very simple explanation; then, in gray, a more detailed but still fun approach.
Well, since this is a whole new article, how about we also come up with a new title?
Spoiler alert: I also considered writing this in classic BuzzFeed Qlik Bait style: "10 Things You Didn't Know About Qlik AutoML (Number 8 Will Make Your Neural Networks Tingle!)" but decided to keep it real instead. So, here we go:
5 Mind-Blowing Things About Qlik AutoML That Will Make You Say "Wait, What?!"
Warning: This article contains dad jokes about data science. Proceed with caution! 🤓
Hey data voyagers, Qlik community, data enthusiasts and ML curious cats! Ever looked at machine learning and thought, "This is harder to understand than my teenager's text messages"? Well, you're in for a treat! We're about to dive into the world of Qlik AutoML, and I promise you won't need a PhD in Computer Science or Statistics to follow along. (Though if you have one, that's cool too – we don't judge!)
What's Qlik AutoML, Anyway?
Imagine if you could build a complex machine learning model as easily as making a sandwich. Okay, maybe not that easy (I've seen some pretty complicated sandwich builds), but you get the idea! Qlik AutoML is like having a super-smart data science assistant who does all the heavy lifting while you sit back and enjoy your coffee. No coding required – just point, click, and let the magic happen!
Now, let's explore the five things you always wanted to know about Qlik AutoML but were afraid to ask (because let's face it, sometimes tech stuff can be scarier than a database without backups or developing in Production! 😱).
NOTE FOR BEGINNERS: From this point on, if you want only the simple explanation, read the text in green and ignore the rest.
For a shortcut to each section, use this index. Click on each title to skip to that section.
1. The Great Missing Data Adventure: Imputation of Nulls
AKA: "Help! My Data Has More Holes Than Swiss Cheese!" 🧀
Think of your dataset like a puzzle with some missing pieces. Instead of throwing away the whole puzzle (which would be a waste!), Qlik AutoML cleverly fills in those gaps. For numbers, it looks at all the other pieces and uses the average value - like if you're missing someone's age, and everyone else is between 30-40, it'll make an educated guess around 35. For categories (like favorite colors or cities), it uses the category "other" for those missing rows. It's like having a really smart friend who's great at filling in the blanks in a way that makes sense!
The Juicy Details (Don't Worry, It's Fun!):
Imagine you're planning a huge party (stay with me here, this analogy is going somewhere). Your guest list has some gaps because some people forgot to RSVP. You could:
A) Cancel the party (Delete rows with missing data - booooring!)
B) Make up random names (Bad idea, unless you want chaos)
C) Make educated guesses based on who usually comes to your parties (Now we're talking!)
Qlik AutoML goes with option C, but with way more math and less party anxiety. Here's how:
For numbers (like age or salary):
It calculates the average of what we know (mean imputation)
Example: If most of your colleagues are around 35, that missing age is probably not 95
Fun fact: This is like guessing someone's age at a party, but with actual math instead of awkward compliments!
For categories (like favorite ice cream flavor), the solution is much simpler: Qlik AutoML simply fills any missing value as "other" and moves on.
However, not all columns with missing data can be saved. If more than 50% of a column's values are missing, the amount of unknown is so large that it would be irresponsible of Qlik AutoML to make any inference. In those situations, the column is excluded from the model. You simply need a better column - or, to keep the party analogy going, if more than 50% of your friends did not RSVP, maybe you need new friends.
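For the code-curious (totally optional!), here's a minimal Python/pandas sketch of those three rules: mean for numbers, "other" for categories, and dropping anything more than half empty. The party data is made up, and this is my reconstruction of the behavior described above, not Qlik's actual source code:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up party data: the guest list with missing RSVPs.
df = pd.DataFrame({
    "age": [34, None, 38, 31, None, 36],
    "city": ["Lisbon", None, "Porto", "Lisbon", "Porto", None],
    "mostly_empty": [None, None, None, None, 1.0, None],
})

# Rule 3: drop any column that is more than 50% missing.
df = df.loc[:, df.isna().mean() <= 0.5]

# Rule 1: numeric columns get mean imputation.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])

# Rule 2: categorical columns get the literal category "other".
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = df[cat_cols].fillna("other")

print(df)
```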
Pro Tip: This is way better than the old-school method of just ignoring missing data, which is like pretending that friend who never RSVPs doesn't exist. We all have that friend, don't we? 😉
2. The Category Whisperer: Encoding Categorical Features
AKA: "Converting Cat-egories Because Your Model Can't Read" 😺
Computers are like extremely literal-minded friends who only understand numbers, not words. When you tell them about categories like "red," "blue," and "green," they get confused. So Qlik AutoML acts as a translator, converting these words into numbers in a way that makes sense. It's similar to how we might assign players numbers in sports - but instead of just giving each color a simple number (which might accidentally suggest that "red" is somehow "better" than "blue"), it creates a special numerical code that treats each category fairly and distinctly. This way, the computer can work with the data while preserving the true meaning of each category.
The Deep Dive (Hold onto Your Hats!):
Imagine you're trying to explain colors to someone who can only understand numbers. Tricky, right? That's exactly what we need to do with machine learning models. Qlik AutoML uses two clever methods to handle this translation: one-hot encoding and impact encoding. It's like having two different translation services, each perfect for different situations!
One-Hot Encoding: The VIP Treatment
One-hot encoding is like giving each category its own personal spotlight. Let's break it down with a real example:
Suppose you're analyzing ice cream flavors at your shop:
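To make it concrete, here's a tiny pandas sketch of what one-hot encoding does to some made-up flavor orders (optional reading, promise!):

```python
import pandas as pd

# Hypothetical orders at the ice cream shop.
orders = pd.DataFrame({"flavor": ["vanilla", "chocolate", "strawberry", "vanilla"]})

# One-hot encoding: one 0/1 column per flavor, so no flavor outranks another.
encoded = pd.get_dummies(orders, columns=["flavor"], dtype=int)
print(encoded)
#    flavor_chocolate  flavor_strawberry  flavor_vanilla
# 0                 0                  0               1
# 1                 1                  0               0
# 2                 0                  1               0
# 3                 0                  0               1
```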
Qlik AutoML uses this method when:
Your dataset has 100 or fewer columns (let's call these "boutique datasets" 😉)
The categorical feature has 13 or fewer unique values
Why 13? It's like having a dinner party – once you get more than 13 guests, you need a different strategy for seating arrangements! With more categories, one-hot encoding would create too many new columns, making your model as confused as a chameleon in a bag of Skittles or like a smart data analyst left with nothing more than just Power BI.
Impact Encoding: The Smooth Operator
Now, what happens when you have way more categories? Like trying to encode all possible last names in your customer database? This is where impact encoding swoops in to save the day!
Impact encoding is like having a really smart summarizer. Instead of creating a new column for each category, it looks at how each category value relates to your target variable and assigns a numerical value based on that relationship. It's basically giving each category a "reputation score" based on its past performance!
Qlik AutoML automatically switches to impact encoding when:
Your categorical feature has more than 13 unique values (even in smaller datasets)
Your dataset has more than 100 columns (regardless of unique values)
Here's a real-world example:
Each city gets a number that represents its "impact" on purchases, rather than getting its own column. Neat, right? Why purchases? Because that is the Target variable selected - it is what you are predicting. In other words, Impact Encoding replaces the category text with a number showing how much that value impacts the prediction column.
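If you want to see the idea in code, here's a deliberately naive sketch of impact (a.k.a. target) encoding, using made-up cities and a made-up "purchased" target. Qlik doesn't publish its exact formula, and real implementations typically add smoothing so rare categories don't get wild scores:

```python
import pandas as pd

# Hypothetical customers: many possible cities, one target column ("purchased").
df = pd.DataFrame({
    "city": ["Lisbon", "Porto", "Lisbon", "Faro", "Porto", "Lisbon"],
    "purchased": [1, 0, 1, 0, 1, 1],
})

# Naive impact encoding: each city becomes the average target value
# observed for that city -- its "reputation score" for purchases.
city_impact = df.groupby("city")["purchased"].mean()
df["city_encoded"] = df["city"].map(city_impact)
print(df)
#      city  purchased  city_encoded
# 0  Lisbon          1           1.0
# 1   Porto          0           0.5
# 3    Faro          0           0.0  (etc.)
```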
The Decision-Making Process
Qlik AutoML is like a wise chef who knows exactly which cooking method to use for different ingredients. Here's its decision tree:
First Check: How big is your dataset?
More than 100 columns? → Impact encoding for everything!
100 or fewer columns? → Proceed to step 2
For Each Categorical Column:
13 or fewer unique values? → One-hot encoding
More than 13 unique values? → Impact encoding
This automatic selection ensures your data is encoded efficiently without creating unnecessary complexity. It's like having a smart GPS that automatically picks the best route based on traffic conditions!
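And if you like your decision trees as actual code, the whole routing logic fits in one tiny (hypothetical) function, with the thresholds taken straight from this article:

```python
def choose_encoding(total_columns: int, unique_values: int) -> str:
    """The encoding decision tree described above (thresholds from this article)."""
    if total_columns > 100:
        return "impact"      # wide dataset: impact encoding for everything
    if unique_values <= 13:
        return "one-hot"     # boutique dataset, few categories
    return "impact"          # too many categories for one-hot

print(choose_encoding(50, 5))     # one-hot
print(choose_encoding(50, 40))    # impact
print(choose_encoding(150, 5))    # impact
```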
Why This Matters
Understanding these encoding methods helps you:
Predict how your data will be transformed
Know when to potentially combine categories to stay under the 13-value threshold
Understand why your model might perform differently with different categorical features
Pro Tip: When preparing your data for Qlik AutoML, consider whether you really need all those unique categories. Sometimes, grouping similar categories together (like grouping rarely-used values into an "Other" category) can help you stay under the 13-value threshold and take advantage of one-hot encoding's benefits!
3. The Great Equalizer: Feature Scaling
AKA: "Making Sure Your Data Doesn't Skip Leg Day" 💪
Imagine trying to compare the height of a house measured in feet with someone's weight measured in pounds - it's like comparing apples and oranges! Feature scaling is like converting everything to the same unit of measurement. Qlik AutoML automatically adjusts all your numbers to be on a similar scale, so when the model looks at your data, it's comparing apples to apples.
This ensures that just because one number is bigger (like a salary in thousands), it doesn't overshadow smaller but equally important numbers (like years of experience). You don't want the Machine Learning model to consider Salary more important than Years of Experience just because it is a bigger number. It's like making sure everyone in a conversation gets an equal chance to speak, regardless of how loud their voice naturally is!
The Nerdy (But Nice!) Details:
Picture this: You're comparing the weight of a feather to the weight of your laptop. In their natural state, it's like having a sumo wrestler compete against a ballet dancer - not exactly a fair comparison! Feature scaling is like putting everyone in the same weight class.
Here's how Qlik AutoML does it:
First, it takes all your numbers and puts them on a common scale centered on zero (with most values landing between -1 and 1)
The formula looks scary (X_scaled = (X - μ) / σ), but it's really just:
Take your number
Subtract the average
Divide by how spread out the numbers are
Voilà! Scaled feature!
For reference: X is the original number, μ is the mean (average) of the column, and σ is the Standard Deviation. So we could rewrite it as:
Scaled Value = (Original Value - Mean) / Standard Deviation
Real-World Example Time! Let's say you're analyzing both salary and age:
Salary: $30,000 to $200,000
Age: 18 to 65
Without scaling, your model might think salary is waaaaay more important just because the numbers are bigger. That's like saying Godzilla is a better movie than The Godfather just because Godzilla is bigger!
Let's Scale Age.
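Here's a quick worked example with six made-up ages (the exact numbers don't matter, the math is the same - and it's precisely what scikit-learn's StandardScaler computes):

```python
import numpy as np

# Made-up ages between 18 and 65.
ages = np.array([18, 25, 35, 42, 50, 65], dtype=float)

# The formula from above: subtract the mean, divide by the standard deviation.
mu, sigma = ages.mean(), ages.std()
scaled = (ages - mu) / sigma
print(scaled.round(2))
# [-1.36 -0.91 -0.27  0.18  0.69  1.66]
```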
Notice something cool? The scaled values typically fall between -3 and 3, with:
0 meaning "average"
Positive numbers meaning "above average"
Negative numbers meaning "below average"
After scaling, your data has some nice properties:
The mean becomes 0 (like setting the "center point")
About 68% of values fall between -1 and 1
About 95% of values fall between -2 and 2
About 99.7% of values fall between -3 and 3
It's like putting all your features on the same playing field, regardless of their original size!
Why This is A-mazing
Fair Competition:
Before scaling: Income ($50,000) dominated Age (35)
After scaling: Both features might be something like 0.7 and 0.8
Better Model Performance:
Many algorithms work better with scaled features
Gradient descent (a common optimization technique) converges faster
Prevents numerical instability (no more computer crying over big numbers!)
Easier Interpretation:
You can quickly see if a value is above or below average
The magnitude tells you how unusual a value is
Works consistently across all features
4. Automatic Holdout of Training Data
AKA: "The Art of Keeping Secrets from Your Model" 🕵️♂️
This is kind of like having a practice test and a final exam for your machine learning model. Qlik AutoML automatically sets aside a portion of your data (i.e. keeping some exam questions secret) to test how well the model has learned. Think of it as teaching someone to cook - you show them how to make a dish multiple times (training data), but the real test comes when they have to cook something using the same techniques but with slightly different ingredients (test data). This helps ensure your model isn't just memorizing the data but actually learning patterns it can use on new information.
The 'Secrets and Stats' Detailed Explanation:
Imagine you're teaching someone to cook. If you only test their skills using the exact recipes they practiced with, how would you know if they can actually cook, or if they've just memorized those specific recipes? That's exactly why we need holdout data!
How Qlik AutoML Plays Hide and Seek with Data
Qlik AutoML offers two flavors of holdout selection, each perfect for different situations:
1. The Default Method: Random Selection
Think of this as randomly picking students from a class to answer pop quiz questions. Here's how it works:
Training Dataset (1000 records)
↓
Random Selection Process
↓
800 records for Training (80%)
200 records for Holdout (20%)
Why Random Selection?
Ensures a representative sample
Maintains the same distribution of patterns
Works great for most use cases
2. The Time-Based Method: Chronological Selection
This is like testing a weather prediction model with tomorrow's weather, not yesterday's. Perfect for when timing matters!
Training Dataset (Jan-Dec 2023)
↓
Sort by Date
↓
Training: Jan-Oct 2023
Holdout: Nov-Dec 2023
When to Use Time-Based Holdout:
Forecasting future sales
Predicting stock prices
Anticipating seasonal trends
Any situation where time patterns matter!
The Magic Behind the Scenes
Let's peek under the hood at how Qlik AutoML handles this:
For Random Holdout:
Step 1: Shuffle the entire dataset
Step 2: Calculate holdout size (typically 20%)
Step 3: Split data into training and holdout sets
Step 4: Keep holdout data locked away
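In scikit-learn terms, the random flavor is essentially a train_test_split. A toy sketch with fake data (my approximation of the behavior described above, not Qlik's actual pipeline):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 records with four made-up features and a binary target.
X = np.random.rand(1000, 4)
y = np.random.randint(0, 2, size=1000)

# Shuffle and split: 800 records for training, 200 locked away as holdout.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.20, shuffle=True, random_state=42
)
print(len(X_train), len(X_hold))  # 800 200
```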
For Time-Based Holdout:
Step 1: Identify the date column
Step 2: Sort all data chronologically
Step 3: Reserve most recent chunk for holdout
Step 4: Use older data for training
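And a matching sketch for the time-based flavor, using a made-up year of daily sales:

```python
import pandas as pd

# Hypothetical daily records for 2023.
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", "2023-12-31", freq="D"),
    "sales": range(365),
})

# Sort chronologically, then reserve the most recent 20% as holdout.
df = df.sort_values("date")
cutoff = int(len(df) * 0.80)
train, holdout = df.iloc[:cutoff], df.iloc[cutoff:]
print(train["date"].max())    # training ends in late October
print(holdout["date"].min())  # holdout starts right after
```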
Why This is Cool
Prevents Overfitting:
Like preventing a student from memorizing test answers
Ensures model learns patterns, not specific cases
Real-world performance validation
Honest Performance Metrics:
No "peeking at the answers"
True indication of model capability
Reliable performance estimates
Future-Proofing:
Especially with time-based holdout
Tests model on truly "unseen" scenarios
Better prediction of real-world performance
5. Five-Fold Cross-Validation
AKA: "Playing Musical Chairs with Your Data" 🎵
Instead of just testing your model once, imagine testing it five different ways to make sure it really knows its stuff. It's something like teaching someone a new language and testing them with different native speakers to ensure they can understand various accents and speaking styles.
Qlik AutoML automatically divides your data into five parts and plays this game of "musical chairs" where each part gets a turn being the test data while the other four parts are used for training. This gives you a much better idea of how well your model will perform in the real world, where data might look a bit different than what it was trained on. It's like getting five different second opinions before making an important decision!
Your Original Dataset (100%)
↓
Shuffle Like a Vegas Dealer 🎲
↓
Holdout (20%) │ Training Data (80%)
↓
Five-Fold Magic Happens!
The "Rinse and Repeat Until Perfect" Details:
Imagine you're teaching a robot to play cards. You wouldn't want it to memorize just one deck order, right? That's exactly the kind of memorization default cross-validation protects against!
Wait! Did I say Default? Does it mean there is another method? Yes! Just like the 4th item in this article (Automatic Holdout of Training Data), there is the default way and a Time-Based one. Let's start with the default and break it down:
The Initial Split:
Your Dataset (100%)
↓ Shuffle like a Vegas pro
├── Holdout (20%) → "The Final Exam"
└── Training (80%) → "The Practice Sessions"
The Five-Fold Dance:
Training Data
↓ Split into 5 equal pieces
Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5
The Training Rounds:
Round 1: Test on Fold 1, Train on Folds 2-5
Round 2: Test on Fold 2, Train on Folds 1,3-5
Round 3: Test on Fold 3, Train on Folds 1-2,4-5
Round 4: Test on Fold 4, Train on Folds 1-3,5
Round 5: Test on Fold 5, Train on Folds 1-4
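For the curious, scikit-learn's KFold does exactly this musical-chairs routine. A toy sketch (fake data; my reconstruction, not Qlik's actual pipeline code):

```python
import numpy as np
from sklearn.model_selection import KFold

# The 80% training slice from the diagram, as toy data.
X = np.arange(100).reshape(-1, 1)
y = np.random.randint(0, 2, size=100)

# Five shuffled folds; each fold gets one turn as the test set.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for round_no, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Round {round_no}: train on {len(train_idx)} rows, test on {len(test_idx)} rows")
```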
This is perfect for:
General prediction tasks
When time doesn't matter
When patterns are time-independent
Customer segmentation
Image classification
Product recommendations
Time-Based Cross-Validation: The Chronological Chronicle
Now, imagine you're teaching someone to predict weather patterns. You wouldn't use next week's weather to predict yesterday's, would you? Welcome to time-based cross-validation!
How It Works:
The Time-Sorted Split:
Your Dataset (100%)
↓ Sort by date/time
├── Training (80%) → "The Historical Chronicles"
└── Holdout (20%) → "The Future Test"
(Most recent data)
The Growing Window Approach:
Timeline: [Oldest] → → → → [Newest]
Round 1: Train[Fold 1] → Test[Fold 2]
Round 2: Train[Fold 1,2] → Test[Fold 3]
Round 3: Train[Fold 1,2,3] → Test[Fold 4]
Round 4: Train[Fold 1,2,3,4] → Test[Fold 5]
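scikit-learn ships this growing-window behavior as TimeSeriesSplit. A toy sketch with made-up, pre-sorted data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 100 records already sorted oldest -> newest.
X = np.arange(100).reshape(-1, 1)

# Growing window: each round trains on everything before its test fold.
tscv = TimeSeriesSplit(n_splits=4)
for round_no, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"Round {round_no}: train rows 0-{train_idx[-1]}, "
          f"test rows {test_idx[0]}-{test_idx[-1]}")
```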
This is perfect for:
Sales forecasting
Stock price prediction
Seasonal trend analysis
Demand forecasting
Website traffic prediction
Inventory management
Conclusion: You Made It! 🎉
(And Your Brain Cells Are Still Intact!)
Congratulations! You've just taken a deep (or shallow) dive into the preprocessing magic of Qlik AutoML, and you're still here to tell the tale! Let's quickly recap what we've learned without the fancy tech jargon, because we've had enough of that for one day. This is the same list I placed at the beginning of this article, with links to each section, but now each item comes with a short, dad-joke-style recap.
Missing Data Handling: We learned how Qlik AutoML fills in the blanks better than your friend trying to explain why they missed your birthday party.
Categorical Encoding: Turning words into numbers, because apparently computers never learned to read like the rest of us.
Feature Scaling: Making sure all your numbers play nice together, like a kindergarten teacher managing recess.
Automatic Holdout: Keeping some data secret, like that emergency chocolate stash you swear you don't have.
Cross-Validation: Testing your model five different ways, because trust issues are real in the machine learning world!
The Bottom Line
Qlik AutoML handles all these preprocessing steps automatically, saving you from writing enough code to make your keyboard cry. It's like having a really smart friend who does your homework, except it's totally legal and encouraged!
And remember: If anyone asks you what you learned today, just say "I now understand the intricate preprocessing paradigms of automated machine learning systems." If they look impressed, nod wisely. If they ask for details, just show them this article.
Final Dad Joke (Because Why Not?)
Q: Why did the data scientist break up with their preprocessing pipeline?
A: Because it had too many missing values in their relationship!
Ba dum tss 🥁
P.S. If you made it this far and actually understood everything, congratulations! You've officially made it to the small group of people that I really like. If you didn't understand everything, congratulations! You're a normal human being. I also like you because you read this long text. Either way, you're awesome!