Document-informed large language models: Blog posts vs real applications

I always enjoy reading about advances in AI on various blogs, and lately those have been dominated by the application of large language models (LLMs) such as ChatGPT: incredible progress in the models themselves, in running your own LLM, and in grounding LLMs in external knowledge bases. Titles like 'How to fine-tune ChatGPT with your own data' and 'How to set up document question answering with LLMs in a couple of hours' make it seem like anyone can take a premade ChatGPT plugin, or host an open-source LLM, and set up their own knowledge-grounded chat system quite easily.

And I've learned that this is both true and quite far from the truth. It has been quite easy to set up a system that passes information to ChatGPT (or another LLM) and answers my questions based on the context I've provided. Open-source libraries provide a lot of the functionality, and when you combine them with a vector store, the basic code structure is easy to set up.
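
To illustrate how little code that basic pattern takes, here is a minimal sketch of answering a question grounded in a provided context. It assumes the official OpenAI Python client and an API key in the environment; the `context` and `question` strings are placeholders, and in a real system the context would come from a retrieval step.

```python
# Minimal sketch: answer a question grounded in a provided context string.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment;
# `context` and `question` are placeholders for illustration.
from openai import OpenAI

client = OpenAI()

def answer_with_context(question: str, context: str) -> str:
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

context = "The tender must be submitted before 1 June, 12:00 CET."
print(answer_with_context("What is the submission deadline?", context))
```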

But what those blogs never seem to mention is that while this is quite easy to do for small, well-structured collections of information (documents of a couple of pages), it becomes increasingly harder with larger knowledge bases (hundreds of pages of information to sift through). For our product, where we have to answer questions based on hundreds of pages of tender documents that are often not structured as neatly as we'd like, this poses quite a challenge.

There seem to be a couple of approaches:

- Throwing all information at your LLM and letting it decide what to use

- Using dense retrieval models in a vector database

Letting your LLM decide which information to use

Grounding large language models in external knowledge requires including that knowledge in the prompt. If we don't know exactly which information will answer our question, can't we just throw all of the information at the LLM and let it figure it out? Well... this works when grounding in a relatively small knowledge base, such as a one-page document, but it is fundamentally constrained by the context size of the model. While context sizes are increasing (32k tokens with GPT-4), they're nowhere near the scale of a knowledge base relevant to real-world problems.
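
To make that constraint concrete, a quick token count (here with the tiktoken library, purely as an illustration; `pages` is a placeholder) shows how quickly a real document set exceeds even a 32k-token window:

```python
# Sketch: check whether extracted pages fit in a 32k-token context window.
# Uses tiktoken for GPT-4 tokenization; `pages` stands in for the text
# extracted from hundreds of pages of tender documents.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
pages = ["...text of page 1...", "...text of page 2..."]  # hundreds in practice

total_tokens = sum(len(enc.encode(page)) for page in pages)
context_limit = 32_000  # GPT-4 32k context

print(f"{total_tokens} tokens in the knowledge base, limit {context_limit}")
if total_tokens > context_limit:
    print("The knowledge base does not fit in a single prompt.")
```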

One approach is to make multiple calls to the large language model, letting it summarize or create denser representations of the knowledge base. This works, but it multiplies the number of calls, the number of processed tokens, and the computation cost and time, and it does not scale well with knowledge base (document) size.
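
One way to picture that cost: a map-reduce style summarization makes one call per chunk plus at least one call to combine the partial summaries, so calls and processed tokens grow with document size. A rough sketch, reusing the `client` from the earlier snippet (`chunks` is a placeholder):

```python
# Sketch of map-reduce summarization: summarize each chunk separately,
# then summarize the summaries. The number of LLM calls grows linearly
# with the number of chunks.
def summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Summarize the following text concisely:\n\n{text}",
        }],
    )
    return response.choices[0].message.content

chunks = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]  # placeholder
partial_summaries = [summarize(c) for c in chunks]         # one call per chunk
final_summary = summarize("\n\n".join(partial_summaries))  # plus a combine step
```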

Using dense retrieval models in a vector database

Dense retrieval models, and more specifically bi-encoders, encode a text into an n-dimensional vector (an embedding) and allow for easy computation of the distance between vectors. This distance approximates the similarity between texts and can be used for 'semantic search'. This has proven to work quite well as a first pass to retrieve relevant texts for a given query. Many blogs suggest encoding each page and calculating the similarity between the pages and the query to get a result. While this does, in theory, give good results on specific (in-domain) benchmarks, it is often not the full story for a real-world application.
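
A minimal version of that first-pass semantic search, using a pretrained bi-encoder from the sentence-transformers library (the model name and example texts are illustrative):

```python
# Sketch: semantic search with a bi-encoder. Texts and the query are encoded
# into vectors; cosine similarity ranks the texts by approximate relevance.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

pages = [
    "The contracting authority requires ISO 9001 certification.",
    "Invoices are paid within 30 days of receipt.",
    "Site visits can be scheduled in week 23.",
]
query = "Which certifications are required?"

page_embeddings = model.encode(pages, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, page_embeddings)[0]
best = int(scores.argmax())
print(f"Best match (score {scores[best].item():.2f}): {pages[best]}")
```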

In practice, a few challenges are easily overlooked:

- Pretrained sentence embedding models have often been trained on short passages of text and are unable to encode entire pages; they work better at the sentence level (see the chunking sketch after this list). This is especially true for multilingual models.

- A lot of pretrained models are trained on a symmetric similarity task: calculating the similarity between two sentences. This does not work as well for short query sentences paired with longer reference texts (document pages or paragraphs).

- Pretrained sentence embedding models have often been trained on general-domain texts (like tweets or internet forums), so they perform sub-par on the domain-specific texts often encountered in a professional setting.

- Data in documents is often not as well structured as the training texts. There can be a lot of noise in automatically extracted documents (headers, footers, bad OCR, and many more problems).
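
None of these issues has a silver-bullet fix, but a common mitigation for the first two is to split pages into small, possibly overlapping passages before embedding them (after stripping obvious noise such as headers and footers). A simplified word-based chunker as a sketch; the chunk size and overlap are arbitrary illustrative values:

```python
# Sketch: split extracted page text into small overlapping word-based chunks,
# so each chunk is closer to the short passages embedding models handle well.
# Chunk size and overlap are illustrative values, not tuned ones.
def chunk_page(text: str, chunk_size: int = 80, overlap: int = 20) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(words), 1), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

page = "...raw text extracted from a tender PDF, headers and footers removed..."
for passage in chunk_page(page):
    print(passage)
```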

Takeaways

Of course there are solutions to all these problems, and it is definitely possible to build a good information-grounded chat assistant. But there is a large gap between the catchy blog posts demonstrating a toy problem and applying this technology to a real-world use case.

At Brainial we're currently working on solving these challenges to integrate the power of large language models within our Tender Assist. We're aiming to provide free-form question answering and informed proposal (re)writing features. Contact us for more information.
