![](https://static.wixstatic.com/media/a0d431_566b8ffe06fd4f6fb1f09c91954d85e9~mv2.webp/v1/fill/w_980,h_560,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/a0d431_566b8ffe06fd4f6fb1f09c91954d85e9~mv2.webp)
It was late. The rain tapped against my window like a typewriter stuck on one letter, and I was deep in the trenches of another case. This one? A real doozy. I had to teach Qlik Answers how to crack open images and pull out their secrets—text buried beneath poor resolution, bad angles, you name it.
I'd cracked cases before. Excel files? Easy. CSV? Child’s play. Audio, video? Done. Even got Qlik Answers talking in code, switching between Python and R like a seasoned double agent. But this? This was different. Text hidden inside images, some fresh off the FBI Vault. Specifically, their files on Undercover Operations. I had a hunch there was treasure buried in those pages, but first I had to get past the static.
For this research and article, I used Python to extract the text from the images and do all the forensic and transformation work I needed to solve this case. My inputs were images of scanned documents from the FBI Vault, and my outputs are text files with the extracted information. These are then automatically loaded into a Qlik Cloud space, where my Knowledge Base reindexes daily on a scheduled task.
There is already plenty of content in the Qlik Community and on YouTube about using Qlik Answers. What I want to show here are the steps for extracting and preparing the data. But worry not: it won't be too technical, and you won't feel bored or overwhelmed. So, let me put on my detective (or data detective) hat, and let's crack the case.
Find a summary flowchart explaining the whole process at the end of this article.
First Glance: The Raw Evidence
You know how it goes in this line of work. Not every lead’s clean, but every now and then, you catch a break. The first image I fed into the system? Clean as a whistle. No skew, no noise. I ran it through the usual process, and sure enough, the text came out crisp, clear—just like it was meant to.
![](https://static.wixstatic.com/media/a0d431_c150ba1afa0741148ac86496466d46d0~mv2.png/v1/fill/w_596,h_838,al_c,q_90,enc_auto/a0d431_c150ba1afa0741148ac86496466d46d0~mv2.png)
It was a smooth start: most of the text was perfectly transcribed. There were a few mistakes (like Guidelin instead of Guidelines), but nothing to worry about. When needed, I can automatically submit the output to an advanced LLM for grammar correction. It was a great start, but I wasn't getting too comfortable. The real challenges were just around the corner.
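For the technically curious, that first pass is nothing fancy. Here's a minimal sketch, assuming Tesseract via the pytesseract wrapper as the free OCR engine (the file names are placeholders, not the actual case files):

```python
from PIL import Image
import pytesseract  # pip install pytesseract (also requires the Tesseract binary)

# Open a clean, well-aligned scan and pull the text straight out.
# 'exhibit_a.png' is a placeholder name for one of the scanned pages.
image = Image.open('exhibit_a.png')
text = pytesseract.image_to_string(image)

# Save the result as a plain text file, ready for the Knowledge Base
with open('exhibit_a.txt', 'w', encoding='utf-8') as f:
    f.write(text)
```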
The Skew Angle: Adjusting the Lens
Not everything was as smooth as those first pages. Initially, my code returned no errors, but a handful of pages produced either no text or just a few sparse words. That was making my case much harder than I had anticipated. Check the next piece of evidence. Have you noticed anything?
![](https://static.wixstatic.com/media/a0d431_a01a4905af3c48bbb728535796245e32~mv2.png/v1/fill/w_686,h_860,al_c,q_90,enc_auto/a0d431_a01a4905af3c48bbb728535796245e32~mv2.png)
When I looked closer, I knew what the problem was. The documents were skewed, leaning like they'd had one too many drinks (maybe a few Qlik Fusion). If I was going to get any text out of this, I needed to sober them up.
I turned to Python for some help—my trusty sidekick. First step? Figuring out the skew angle. It’s like when you’re following a suspect; you’ve got to keep a straight line. My code scanned the image for lines, found the ones it could trust, and measured how off-kilter the whole thing was.
In case you don't want any technical details, skip this paragraph. If you're still here, let's go! When the text in an image is skewed, most free OCR libraries can't read it well, so I first had to detect the skew angle of the text. I converted the image to grayscale, applied a Gaussian blur to remove noise, ran Canny edge detection, and finally used the Hough Line Transform to detect lines in the image. For each line, I extracted its angle. With all the angles in hand, I calculated the median, and that gave me the overall skew angle. Easy, right?
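For those still wearing their technical hats, here's a minimal sketch of that detection step in OpenCV. Treat the thresholds and kernel sizes as illustrative assumptions, not sacred numbers:

```python
import cv2
import numpy as np

def detect_skew_angle(image_path):
    """Estimate the skew angle (in degrees) of scanned text using Hough lines."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)   # remove noise
    edges = cv2.Canny(blurred, 50, 150)           # detect edges
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=100, maxLineGap=10)
    if lines is None:
        return 0.0
    angles = []
    for line in lines:
        x1, y1, x2, y2 = line[0]
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        if abs(angle) < 45:   # keep near-horizontal lines (text baselines)
            angles.append(angle)
    return float(np.median(angles)) if angles else 0.0
```

The median wins over the mean here because a few rogue lines, like table borders or stray marks, would otherwise drag the estimate off course.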
Some images were just a few degrees off; others looked like they'd been flipped by a tornado. I wasn't about to let that stop me. With the angle in hand, I rotated the image back into place. Luckily, OpenCV, Python's computer vision library, has the right tool for the job. After applying it, I was again successful.
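In code, the fix is short: build a rotation matrix around the page center and warp the image back. A sketch:

```python
import cv2

def deskew(image, angle):
    """Rotate the image by the detected skew angle to straighten the text."""
    h, w = image.shape[:2]
    center = (w // 2, h // 2)
    matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
    # BORDER_REPLICATE keeps the corners from turning black after rotation,
    # which would otherwise confuse the OCR
    return cv2.warpAffine(image, matrix, (w, h),
                          flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```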
Not bad, huh? At least now it looked like a document again.
When Colors Get in the Way
Then came the tricky ones—documents where the background wasn’t exactly white. These weren’t your run-of-the-mill papers; no, they had a murky tint, like they'd been sitting under a flickering light for decades. The text extraction? Forget it. The usual method couldn’t make heads or tails of it. Every attempt came back empty-handed, leaving the words lost in the shadows of the discolored page. Check the next evidence for reference.
![](https://static.wixstatic.com/media/a0d431_90d5a4fcf2dd409999507dc4823a3bd0~mv2.png/v1/fill/w_693,h_861,al_c,q_90,enc_auto/a0d431_90d5a4fcf2dd409999507dc4823a3bd0~mv2.png)
Oh no! I definitely did not expect that, but could I solve it? Elementary, my dear Dalton. So, I rolled up my sleeves and threw in a background color detection algorithm. It wasn’t pretty, but it got the job done. I processed those images a little differently, stripping out the noise, and focusing on separating the text from the muddled background. And wouldn’t you know it? The text surfaced like a confession under pressure—clear and readable. Turns out, sometimes you’ve got to get your hands a little dirty to get results.
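My exact recipe stays in the case file, but the gist looks roughly like this sketch: estimate the dominant background shade from the histogram, then binarize the page so the OCR sees black ink on white paper. Otsu's thresholding stands in here for the separation step, an assumption on my part rather than the precise algorithm I used:

```python
import cv2
import numpy as np

def normalize_background(image):
    """Separate dark text from a tinted or murky background."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Estimate the background as the most common gray level on the page
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256])
    background_level = int(np.argmax(hist))
    # If the page is dark overall, invert so the text stays dark on light
    if background_level < 128:
        gray = cv2.bitwise_not(gray)
    # Otsu picks the threshold that best splits ink from background
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```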
The Cleanup: Sharpening the Focus
Skew fixed and colors figured out, I hit another wall: some images were just plain bad. Low quality, grainy, the kind of thing you'd expect from a camera stuck in someone's shoe. But I had a trick up my sleeve. I enhanced the images, brought out the contrast, and sharpened them until they looked presentable.
![](https://static.wixstatic.com/media/a0d431_f6cce9e077bf4a948d34862d31a036e0~mv2.png/v1/fill/w_636,h_810,al_c,q_90,enc_auto/a0d431_f6cce9e077bf4a948d34862d31a036e0~mv2.png)
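The enhancement pass can be done with OpenCV too. Here's a sketch using CLAHE for the contrast boost and an unsharp mask for the sharpening; the parameters are assumptions to tune per batch, not the exact values from my case:

```python
import cv2

def enhance(image):
    """Boost contrast and sharpness of a grainy scan before OCR."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # CLAHE equalizes the histogram in local tiles, lifting faded print
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    contrasted = clahe.apply(gray)
    # Unsharp masking: subtract a blurred copy to crisp up letter edges
    blurred = cv2.GaussianBlur(contrasted, (0, 0), sigmaX=3)
    return cv2.addWeighted(contrasted, 1.5, blurred, -0.5, 0)
```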
Like I said, sometimes the usual OCR wasn’t enough. For those bad apples, I brought in the heavy hitter: EasyOCR. It's got a knack for handling the tough cases, reading through the noise and pulling out the text. I ran it through, and what do you know—it worked.
Sure, it wasn’t perfect, but it got the job done. Every letter, every word—finally, Qlik Answers had its next clue.
Cleaning Up the Clues
Sometimes, the algorithms don’t get it all—some letters slip through the cracks, or worse, get twisted into something they shouldn’t be. The text we pull from those images is good, real good, but not perfect. So before filing it away, I ran it through a spell check model. Little things like “othervise” and “guidelin”? Cleaned up in a flash. Just a couple lines of code and the rough edges were smoothed out. Now, take a look at the next piece of evidence—a corrected version of Exhibit B, polished and ready for action.
The End of the Line: Uploading to Qlik Answers
Once I had the text in hand, the next step was to feed it into Qlik Answers’ Knowledge Base. The system took it, no questions asked. Just like that, a new case was solved. Now, any time a question came up related to those FBI documents, Qlik Answers would know where to look—pulling the answers right out of the images. It wasn’t perfect yet, but progress was made.
If all this sounds like a labyrinth of processes, don’t sweat it—the hard work’s already in the rearview. Now, it’s just a matter of setting things up and reaping the rewards. The possibilities for this approach are endless, but let’s start with the most obvious: those hundreds of thousands of old, dusty scanned documents sitting forgotten on some neglected drive. They’re packed with knowledge, just waiting to be unlocked. With this in place, you can finally bring them back to life—letting Qlik Answers dig through the archives so you can have some sharp, informed conversations with your data.
Next up? I’ve got bigger fish to fry. The road ahead involves reading text out of charts, reading heavily redacted documents like the UFO Files, even pulling handwritten scrawls off paper. But that’s a story for another day. For now, I’ll keep my head down and my code sharp—after all, there’s always another case waiting in the shadows.
![](https://static.wixstatic.com/media/a0d431_cc86fcfe1f394c1caf9182e3d5bed2b2~mv2.png/v1/fill/w_980,h_735,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/a0d431_cc86fcfe1f394c1caf9182e3d5bed2b2~mv2.png)