Note: Although this post is published at the end of 2024, the original prototype was created in May 2023. During the writing process, I updated the code to fix a few bugs and changed the LLM from GPT-3.5-turbo to GPT-4o. A lot of tools have been created between the original implementation and now, so the updated API is essentially a copy-paste from a template, in contrast to the original May 2023 version, which was significantly more complex to set up.
An Idea
Could LLMs be used to find and summarize relevant parts of complex documentation?
This idea came to me while dealing with Microsoft security tools and their documentation in a Security Operations Center role. Finding relevant documentation on Microsoft Learn has consistently been challenging, partly because of the lack of standardization across documentation for different sub-products managed by different teams within Microsoft. As a result, Microsoft Defender product documentation varies in structure, and it often takes searching through scattered articles to find the information needed.
Digging deeper
There are four options for solving this problem:
- Train a custom model.
- Fine-tune an existing model.
- Feed the documentation directly into a generic LLM along with the question.
- Use Retrieval-Augmented Generation (RAG) to find relevant documentation for a question and use the results to generate a summary.
Options 1 and 2 are the most costly and the most difficult to execute efficiently with limited resources. Microsoft Defender documentation currently includes about 1,600 articles. Fine-tuning a model would require generating thousands of training examples, and generating that data would likely be the most resource-intensive part of the project; it is not feasible for this prototype. It is probably not feasible in general either: the documentation is updated frequently, so training data generation and fine-tuning would have to be repeated constantly to stay up to date.
With option 3, we would input all the documentation along with our question into the LLM. This requires weighing the amount of data against the model’s input token limits. Currently, Claude 3.5 has the largest input token limit, at 200,000 tokens. Given that each article averages around 1,500 words, which is about 2,048 tokens, covering all 1,600 Defender articles would result in over 3 million tokens. If we can limit a question to a specific product, this could significantly reduce the required tokens, and additional preprocessing could minimize token usage further. However, a single response using the full 200,000-token input window would still cost tens of cents per query, which might not be feasible.
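For a rough sense of scale, the arithmetic looks like this. The article count and token average come from the estimates above; the per-token price is an illustrative assumption, not a quoted rate:

```typescript
// Back-of-the-envelope estimate; all inputs are rough assumptions.
const articles = 1_600;         // approximate number of Defender articles
const tokensPerArticle = 2_048; // ~1,500 words at roughly 0.75 words per token

const totalTokens = articles * tokensPerArticle;
console.log(totalTokens); // ~3.3 million tokens, far beyond a 200k context window

// Assuming an input price in the ballpark of $2.50 per million tokens,
// a single prompt that fills a 200,000-token window costs roughly:
const costPerQuery = (200_000 / 1_000_000) * 2.5;
console.log(costPerQuery); // ~$0.50, i.e. "tens of cents" per query
```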
The solution
Retrieval-Augmented Generation (option 4) works by creating embeddings of the documentation, storing that data in a database, and then querying this database to retrieve content relevant to a search—similar to how a traditional search engine functions. The retrieved content is then fed into the LLM, which interprets and summarizes the information. The main advantage of this approach is that it doesn’t require training a model on this data specifically, while still allowing us to trace the source of the retrieved content.
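To make the flow concrete, here is a minimal sketch of the retrieval-and-summarize step using LangChain.js, which the prototype is built on. The model name, the document count of 4, the prompt wording, and the `metadata.source` field are illustrative assumptions rather than the prototype’s exact values, and the sketch assumes a vector store that has already been populated (sketched later in this post):

```typescript
import { ChatOpenAI } from "@langchain/openai";
import type { VectorStore } from "@langchain/core/vectorstores";

// Retrieve the most relevant documentation chunks and let the LLM summarize them.
export async function answer(question: string, store: VectorStore): Promise<string> {
  // 1. Similarity search against the embedded documentation.
  const docs = await store.similaritySearch(question, 4);

  // 2. Stuff the retrieved chunks into the prompt, keeping their sources visible.
  const context = docs
    .map((d) => `Source: ${d.metadata.source}\n${d.pageContent}`)
    .join("\n\n---\n\n");

  // Older LangChain.js versions use `modelName` instead of `model`.
  const llm = new ChatOpenAI({ model: "gpt-4o", temperature: 0 });
  const response = await llm.invoke(
    `Answer the question using only the documentation below and cite your sources.\n\n` +
      `Documentation:\n${context}\n\nQuestion: ${question}`
  );

  return response.content as string;
}
```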
Preparing the data
We could start developing scrapers and scrape the documentation from learn.microsoft.com. However, all the Defender documentation can be found in its own GitHub repository.
As we can see, all the documentation is stored as Markdown files, with metadata in Front Matter format.
Example file
---
title: Overview of the permanent site license and how to choose one for Microsoft Defender for IoT in the Defender portal
description: Learn about a permanent site license, how to upgrade and the different options available for Microsoft Defender for IoT in the Defender portal.
ms.service: defender-for-iot
author: limwainstein
ms.author: lwainstein
ms.localizationpriority: medium
ms.date: 08/01/2024
ms.topic: overview
---
# The site-based license model
Our site-based license model streamlines your licensing needs by covering entire sites instead of individual devices. With this model, you can purchase annual licenses for your operational sites where Operational Technology (OT) devices are deployed. This ensures comprehensive security coverage for all devices within each site.
[!INCLUDE [defender-iot-preview](../includes//defender-for-iot-defender-public-preview.md)]
...
To make this usable in our RAG system, we need to:
- Parse the metadata. Metadata can provide context when retrieving data, allowing us to display information like the title, date, and original source of the results.
- Parse all the Markdown files. To prevent disorganized results, we split each file on its sub-titles so we can retrieve relevant subsections of large articles, and we remove large code blocks (a sketch of this preprocessing follows below).
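A minimal sketch of that preprocessing, assuming gray-matter for the Front Matter and a simple split on H2 headings; the exact splitting and cleanup rules in the prototype may differ:

```typescript
import { readFile } from "node:fs/promises";
import matter from "gray-matter";
import { Document } from "@langchain/core/documents";

// Turn one docs article into one Document per subsection, carrying its metadata along.
export async function parseArticle(path: string): Promise<Document[]> {
  const raw = await readFile(path, "utf-8");
  const { data, content } = matter(raw); // `data` holds the Front Matter fields

  // Drop fenced code blocks so they don't dominate the embeddings.
  const cleaned = content.replace(/```[\s\S]*?```/g, "");

  // Split on "## " sub-titles; the chunk before the first H2 keeps the intro.
  const sections = cleaned.split(/\n(?=## )/);

  return sections.map(
    (section) =>
      new Document({
        pageContent: section.trim(),
        metadata: {
          title: data.title,
          date: data["ms.date"],
          service: data["ms.service"],
          source: path,
        },
      })
  );
}
```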
Defining the architecture
- For the vector store, we use Supabase. There is no specific reason to use Supabase’s vector store other than wanting to try a Postgres-based vector database; Supabase is very developer-friendly and easy to set up.
- For embeddings, we use all-MiniLM-L6-v2, a general-purpose embeddings model. This model is very lightweight (~22MB) and can easily run locally. It’s important to note that while we use GPT-4o as the LLM, there is no requirement to use OpenAI embeddings specifically.
- To simplify building the full-stack application that retrieves results, parses them, and returns context to the user, we use Langchain.js. The application itself is based on Next.js. A sketch of wiring the vector store and embeddings together follows below.
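Here is a minimal sketch of the indexing side, assuming the LangChain.js community packages as of late 2024 (import paths and option names vary between versions), a Supabase project with the pgvector `documents` table and `match_documents` function from LangChain’s Supabase guide, and `SUPABASE_URL` / `SUPABASE_SERVICE_KEY` environment variables:

```typescript
import { createClient } from "@supabase/supabase-js";
import { SupabaseVectorStore } from "@langchain/community/vectorstores/supabase";
import { HuggingFaceTransformersEmbeddings } from "@langchain/community/embeddings/hf_transformers";
import type { Document } from "@langchain/core/documents";

const client = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!);

// all-MiniLM-L6-v2 runs locally via transformers.js; no OpenAI embeddings required.
// (Older LangChain.js versions name this option `modelName`.)
const embeddings = new HuggingFaceTransformersEmbeddings({
  model: "Xenova/all-MiniLM-L6-v2",
});

// Embed the parsed subsections and store them, with their metadata, in Supabase.
export async function indexSections(sections: Document[]) {
  await SupabaseVectorStore.fromDocuments(sections, embeddings, {
    client,
    tableName: "documents",
    queryName: "match_documents",
  });
}
```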
Final results
Note that the GIF is sped up for demonstration purposes.
The final result is a simple web application where users can input a question and receive a summary of the most relevant Microsoft Defender product documentation found in the vector store. The application retrieves the most relevant documents from the database, feeds them into the LLM, and returns the summarized results to the user, along with a link to the source of the information.
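The application layer itself can stay thin: a Next.js App Router handler only needs to accept the question and call the retrieval-and-summarize function sketched earlier. The module paths below are hypothetical, and returning the source links is omitted for brevity:

```typescript
// app/api/ask/route.ts
import { NextResponse } from "next/server";
import { answer } from "@/lib/answer";      // hypothetical module containing the earlier sketch
import { vectorStore } from "@/lib/store";  // hypothetical pre-initialized SupabaseVectorStore

export async function POST(request: Request) {
  const { question } = await request.json();
  const summary = await answer(question, vectorStore);
  return NextResponse.json({ summary });
}
```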
All the source code for this prototype can be found at:
https://github.com/markusleh/ask-docs
Note: although the repository links to a publicly hosted version of the application, the application will certainly not be live for long. The Vercel-hosted app also has some trouble with the context being provided in a header, which results in some queries failing; I was not inclined to troubleshoot this further.
Beyond the prototype
This prototype is an extremely simple implementation of a RAG (Retrieval-Augmented Generation) system. RAG can be thought of as an alternative to building or fine-tuning a custom model. Essentially, any generic LLM can interpret user-provided questions and summarize retrieved documents; as long as the vector search is sufficiently accurate and the queries are straightforward, this approach produces adequate results for general use cases.
In this example, the prototype was built with the understanding that the source material may be complex even for humans and might require domain expertise to interpret. As such, the RAG method may not give the model enough context to reason about more abstract topics such as licensing.