Making Qlik Answers Smarter: The Art of Text Parsing and Data Prep

Igor Alcantara
Nov 19, 2024

Updated: Oct 13

If you follow my work here at Data Voyagers and on my LinkedIn page, you probably already know how excited I am about Qlik Answers. Experimenting with this tool has been a mix of curiosity, creativity, and a little bit of chaos, but all with purpose. Call it controlled chaos, like the kind you have at home on Thanksgiving Day, or any day if you're a parent. I’ve been testing the limits of what the tool can handle, finding clever ways to make the “impossible” possible, and figuring out how to squeeze the most performance and accuracy out of it.

While breaking boundaries and pushing limits is exciting, this article focuses on something just as important: making Qlik Answers sharper, faster, and smarter. Because let’s be real—if your assistant can consistently deliver accurate, lightning-fast answers, you’ve got something you can brag about at a party (well, maybe at a data nerd party or perhaps at a Qlik Meetup).

To test this, I worked with 37 PDF files of varying sizes, topics, and complexities. Not a massive dataset, but enough to provide meaningful insights. For this article, I’m focusing on one file of medium complexity: a medical paper filled with tables, columns, titles, subtitles, images, and more. You can download the paper from the following link.

Each file was tested in three different setups: first in its original form with no modifications, then as a semantically parsed JSON file, and finally as a Markdown file. The Markdown approach is similar to the one I used in a previous article to introduce structured data into Qlik Answers. Let’s break down how each scenario performed.

Option 1: The Lazy Genius—Stick to the Original PDF

Using the original PDF is the simplest, most straightforward way to load documents into Qlik Answers. There’s no need for advanced tools, data conversions, or extra effort—just upload the file and let Qlik Answers do the heavy lifting. This approach is efficient in terms of page count, as it produces the fewest indexed pages. It’s ideal for straightforward documents where the structure, such as headers, tables, and images, is well-maintained and doesn’t require additional semantic enhancements.

However, simplicity comes with limitations. Without further parsing or tagging, the assistant may struggle to interpret context or relationships between sections, especially if the document includes complex formatting. Nested tables, unconventional layouts, or non-standard fonts might result in errors or misinterpretations. Despite these limitations, sticking to the original PDF can be a reliable choice when working with simple, cleanly formatted documents or when keeping the indexed pages low is a top priority.

Option 2: Nerd Out—Advanced Parsing

When dealing with complex, multi-layered documents, using a parsing tool like Aryn.ai offers significant advantages. This method goes beyond basic text extraction by breaking the document into semantically tagged components, such as headers, paragraphs, tables, and images. Aryn.ai generates a structured JSON file, giving Qlik Answers a much clearer view of the document’s organization and the relationships between sections. For this task, I used Python to call the Aryn.ai APIs and do all the necessary work in a more consistent, automated way.

While JSON parsing doesn’t directly improve the accuracy of responses, it provides a deeper semantic understanding that can make the assistant’s output more contextually relevant. This makes it especially effective for handling complex documents with mixed content, such as research papers or technical guides that combine dense text with visual data like charts and tables. However, the downside is significant: the JSON parsing process results in a dramatic increase in indexed pages—over 850 in some cases—which can overwhelm Qlik Answers and degrade performance. Yes, you read that right: 6 pages in PDF created 853 pages in JSON.

Knowledge Base with Txt JSON file indexed

Filtering out unnecessary data from the JSON file requires extra effort and technical expertise. While this method isn’t practical for all scenarios, it’s a powerful option when document structure and semantic clarity are critical to achieving your goals. Oh, and by the way, how could I add JSON files to a Knowledge Base if JSON is not accepted? Simple: I just saved it as a TXT. In Brazil we call that "Gambiarra", and I think this is beautiful.
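To make the filtering step more concrete, here is a minimal Python sketch of what it could look like. It is not the script used in these tests. It assumes the parser returns a list of elements, each with a `type` and a `text_representation` field plus layout metadata such as `bbox`; those field names, the type names in `KEEP_TYPES`, and the helper names are assumptions for illustration, so check them against your parser's actual output.

```python
import json
from pathlib import Path

# Assumed element types worth keeping; adjust to what your parser emits.
KEEP_TYPES = {"Title", "Section-header", "Text", "List-item", "Table", "Caption"}


def slim_elements(elements, keep_types=KEEP_TYPES):
    """Keep only meaningful element types and drop layout metadata."""
    slimmed = []
    for el in elements:
        text = (el.get("text_representation") or "").strip()
        if el.get("type") in keep_types and text:
            slimmed.append({"type": el["type"], "text": text})
    return slimmed


def save_as_txt(elements, out_path):
    """Write the slimmed JSON to disk with a .txt extension."""
    payload = json.dumps(elements, ensure_ascii=False, indent=2)
    Path(out_path).write_text(payload, encoding="utf-8")


# Made-up sample elements, only to show the shape of the data.
sample = [
    {"type": "Section-header", "bbox": [0.1, 0.1, 0.9, 0.15], "text_representation": "Methods"},
    {"type": "Page-footer", "bbox": [0.1, 0.95, 0.9, 0.98], "text_representation": "Page 3"},
    {"type": "Text", "bbox": [0.1, 0.2, 0.9, 0.4], "text_representation": "  Patients were enrolled.  "},
    {"type": "Image", "bbox": [0.1, 0.5, 0.9, 0.8], "text_representation": None},
]

slimmed = slim_elements(sample)
```

Calling `save_as_txt(slimmed, "paper.txt")` would then produce the TXT file to upload. Dropping bounding boxes and page furniture like headers and footers is one plausible way to shrink the file before indexing; how much it reduces the indexed page count would need to be measured.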

Option 3: The JEDI Approach—Convert to Markdown

Converting PDFs to Markdown offers the best of both worlds, striking a balance between structure, accuracy, and efficiency. Markdown simplifies the document by retaining key elements such as headings, tables, and lists in a clean, lightweight format that’s easy for Qlik Answers to process. It captures the semantics of the document, which LLM platforms love.

This approach produced more indexed pages than the original PDF but far fewer than the JSON option. The result is a manageable file size that maintains enough structure to improve the assistant’s ability to interpret the content accurately. In summary, we get almost as much semantics as a parsed JSON for a fraction of the complexity. It's all about balance, my padawan.

With Markdown I even went one step further: I extracted all the tables inside the PDF and saved them separately in two formats, as Markdown files, one per table (see next image), and as CSV. Why? Well, let's say I can now extract tabular data from documents and add it to the data models of my Qlik apps. Super cool, but let's save this for a future article.

Knowledge Base with Markdown file indexed
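For the table export described above, a small Python sketch shows how one table could be written in both formats. It assumes each table has already been extracted as a list of rows with the header first; the function names and the sample values are made up for illustration and are not taken from the paper.

```python
import csv
import io


def table_to_markdown(rows):
    """Render rows (first row is the header) as a Markdown pipe table."""
    def fmt(row):
        # Escape pipes so cell content cannot break the column layout.
        cells = [str(c).replace("|", "\\|").strip() for c in row]
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(r) for r in body)
    return "\n".join(lines)


def table_to_csv(rows):
    """Render the same rows as CSV text using the standard csv module."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up sample table, only to show the two outputs.
rows = [
    ["Group", "Patients", "Response rate"],
    ["Treatment", 120, "64%"],
    ["Placebo", 118, "31%"],
]

markdown_table = table_to_markdown(rows)
csv_table = table_to_csv(rows)
```

The Markdown version is the one that goes into the Knowledge Base, while the CSV version is the one a Qlik load script could read into a data model.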

One of Markdown’s biggest advantages is its flexibility: the format is easy to edit, allowing you to remove unnecessary sections, clean up the data, and highlight critical information before uploading. There are multiple ways to programmatically convert a PDF document to Markdown. I used Python for this task, and my program has two basic steps: first the parsing, just like in the JSON solution; then I focus on the content and semantics of the parsed output, extract only what is meaningful, and format and save it as a Markdown file. Simple and efficient.
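The second of those two steps can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual program: it takes already-parsed elements as dictionaries with `type` and `text` keys, and the type names in `PREFIXES` and `SKIP_TYPES` are hypothetical stand-ins for whatever your parser produces.

```python
# Assumed mapping from parser element types to Markdown prefixes.
PREFIXES = {"Title": "# ", "Section-header": "## ", "List-item": "- "}

# Assumed element types that carry no meaning for the assistant.
SKIP_TYPES = {"Page-header", "Page-footer", "Image"}


def elements_to_markdown(elements):
    """Turn parsed elements into Markdown, skipping page furniture."""
    blocks = []
    for el in elements:
        kind = el.get("type")
        text = (el.get("text") or "").strip()
        if kind in SKIP_TYPES or not text:
            continue
        # Unknown types fall through as plain paragraphs.
        blocks.append(PREFIXES.get(kind, "") + text)
    return "\n\n".join(blocks) + "\n"


# Made-up parsed elements, only to show the conversion.
elements = [
    {"type": "Title", "text": "Sample Study"},
    {"type": "Page-header", "text": "Journal of Examples"},
    {"type": "Section-header", "text": "Methods"},
    {"type": "Text", "text": "Patients were enrolled at two sites."},
    {"type": "List-item", "text": "Site A"},
    {"type": "List-item", "text": "Site B"},
]

markdown = elements_to_markdown(elements)
```

Because the output is plain text, it stays easy to open in an editor afterwards and trim or annotate by hand before uploading.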

While Markdown may not handle intricate layouts like layered tables or detailed graphics as effectively as JSON, it’s a practical choice for most scenarios. It combines enough semantic structure to improve performance without bogging down the system, making it the ideal option for users looking for a balance between precision and simplicity.

Why Is Markdown Better?

When preparing documents for tools like Qlik Answers, Markdown emerges as a standout format for ensuring both structure and context are preserved. Unlike plain text, which flattens information and strips away valuable context, Markdown offers a richer, more organized representation of the original document. This is especially valuable in Retrieval-Augmented Generation (RAG) scenarios like Qlik Answers, where the language model relies on well-structured context to generate accurate and relevant outputs. Here’s why Markdown outshines other formats.

Preserving Structure

Markdown preserves the inherent structure of a document, such as headings, lists, and tables, which plain text often loses. When Qlik Answers—or any LLM—processes Markdown, it can leverage this structure to understand relationships between different sections of the text. For example, a table in Markdown stays neatly formatted and distinguishable from surrounding paragraphs, unlike plain text, where the boundaries between table rows and subsequent text can blur. By maintaining this structural integrity, Markdown ensures that Qlik Answers interprets data more effectively.

Enhanced Semantics

One of Markdown's greatest strengths is its ability to include semantic cues directly in the text. Headers (#, ##), bold (**), italics (*), and even code blocks (```) provide signals about the text's importance and type. These markers allow language models to identify and prioritize key sections of the text, such as distinguishing a subsection from a main header or recognizing a table as a discrete entity. Most modern LLMs, including those used in advanced RAG systems, are trained to understand Markdown formatting. This added layer of meaning helps Qlik Answers produce more accurate responses.

Cleaner Representation

Markdown offers a cleaner, more readable format for documents, especially those containing tabular or mixed data. Tables, for example, remain well-organized in Markdown, avoiding the blending of table data with surrounding text—a common issue in plain text formats. This clarity ensures that Qlik Answers processes tabular data correctly, reducing errors and convoluted outputs. Additionally, Markdown’s simplicity makes it easy to include links, alt-text, or placeholders for images, providing contextual information that might otherwise be lost.

Context Preservation

Preserving context is crucial for systems like Qlik Answers, where relevance and accuracy depend on how well the input data retains its meaning. Markdown ensures that the relationship between text elements is maintained, allowing the language model to draw meaningful connections. For example, a table followed by a descriptive paragraph in Markdown retains its sequence and context, which helps the assistant understand how the two relate. Including alt-text for images or descriptions for figures further enriches the context, something that plain text can’t achieve.

Optimized for Chunking in RAG Systems

Markdown is especially effective in RAG-based systems, where large documents are broken into smaller, manageable chunks for processing. Because Markdown retains structure and context, it ensures that each chunk remains self-contained and meaningful, even when processed independently. This is critical for systems with limited context windows, as the model can only handle a certain amount of data at a time. By feeding well-structured Markdown chunks into Qlik Answers, you increase the likelihood of generating better, more accurate outputs.
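To illustrate why structure helps chunking, here is a minimal Python sketch of heading-based splitting. Qlik Answers does its own chunking internally and the source does not describe how, so this is not its algorithm. It only shows how headings give a splitter natural boundaries and let each chunk carry its section path. The function name and the sample document are made up.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown):
    """Split Markdown at headings; each chunk records its heading path."""
    chunks = []
    path = []  # stack of (level, title) for the current section
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(t for _, t in path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any section at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


# Made-up document, only to show the resulting chunks.
doc = (
    "# Study\n\nIntro text.\n\n"
    "## Methods\n\nTwo sites.\n\n"
    "## Results\n\n| A | B |\n| --- | --- |\n| 1 | 2 |\n"
)

chunks = chunk_markdown(doc)
```

Each chunk keeps its table or paragraph intact and knows which section it came from, for example `Study > Results`. This simple version would also split on `#` lines inside fenced code blocks, so a real splitter would need to track fences.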

Final Thoughts

Making Qlik Answers work at its best isn’t just about loading documents—it’s about loading them right. Whether you’re keeping it simple with PDFs, getting fancy with JSON, or finding a happy medium with Markdown, the right approach can turn your assistant into an accuracy machine. And when you nail it, you might just have something worth bragging about at your next (data nerd) party. In summary, this is what my tests found:

Factor	PDF	JSON Parsing	Markdown
Indexed Page Count	✅ Lowest	🚫 Highest	⚖️ Moderate
Parsing Accuracy	⚖️ Moderate	✅ High	✅ High
Structural Complexity	🚫 Limited	✅ Handles complex layouts	⚖️ Handles moderate layouts
Ease of Use	✅ Easiest	🚫 Requires preprocessing	⚖️ Requires some setup
Ideal Use Case	Simple documents	Complex, detailed documents	Balanced needs

I hope you enjoyed this article and found value in the experiments I’m conducting with Qlik Answers. Exploring new ways to optimize tools like this is not only exciting but also a chance to push boundaries and share knowledge with the community. If you’d like to see more content like this, please show your support by sharing this article with your network, leaving a comment, or even suggesting topics for future experiments. Your engagement keeps the ideas flowing and the experiments going—May the Force be with you!