Large language models are used widely across the industry these days. Still, many people are skeptical about their capabilities, as the models are quite prone to hallucination. For that reason, in this article I decided to use an LLM in a case where there are no incorrect answers: generating artistic text. However, even in art, which is highly subjective, there are still some quality gates (or shall I say personal preferences?).

So how do I trick a model into generating text that suits my artistic taste? Prompt engineering can only get you so far, so I decided to provide the LLM with data that represents my view of fine art. For this purpose, I decided to try out the Retrieval-Augmented Generation (RAG) architecture.

You can take a look at the entire code in this repository.

Enter RAG

RAG is a technique that combines a large language model with a vector database. The idea is that the model will be able to retrieve relevant information from the database and use it to generate a response. One of the most prominent use cases is letting an LLM operate over a knowledge base, thus creating an expert system.

So how does it work?

It consists of the following stages (a minimal code sketch follows the list):

  1. Indexing. The data we want to use as our knowledge base is converted into LLM embeddings: numerical representations in the form of vectors.

  2. Retrieval. Given a user query, a document retriever is first called to select the most relevant documents that will be used to augment the query.

  3. Augmentation. The retrieved information is fed into the LLM by augmenting the user’s original query via prompt engineering.

  4. Generation. Finally, the LLM generates output based on both the query and the retrieved documents.
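
To make the four stages concrete, here is a minimal, self-contained sketch of the flow in F#. Everything in it (the embed and generate functions, the cosine similarity, the top-3 cut-off) is purely illustrative and is not the Kernel Memory API used below; it only shows where each stage sits in the pipeline.

// Schematic RAG pipeline; `embed` and `generate` stand in for real model calls.
let answerWithRag (embed: string -> float[])
                  (generate: string -> string)
                  (documents: string list)
                  (query: string) =
    // 1. Indexing: turn every document into an embedding vector.
    let index = documents |> List.map (fun doc -> doc, embed doc)
    // Cosine similarity between two embedding vectors.
    let cosine (a: float[]) (b: float[]) =
        let dot = Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b
        let norm (v: float[]) = v |> Array.sumBy (fun x -> x * x) |> sqrt
        dot / (norm a * norm b)
    // 2. Retrieval: pick the documents closest to the query embedding.
    let queryVec = embed query
    let topDocs =
        index
        |> List.sortByDescending (fun (_, vec) -> cosine queryVec vec)
        |> List.truncate 3
        |> List.map fst
    // 3. Augmentation: prepend the retrieved documents to the original query.
    let augmentedPrompt =
        sprintf "Context:\n%s\n\nQuestion: %s" (String.concat "\n---\n" topDocs) query
    // 4. Generation: the model answers the augmented prompt.
    generate augmentedPrompt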

Indexing data

To run an LLM locally, I used Ollama. I won’t dive deep into its setup, as you can find quite detailed info on their site.

As I don’t expect a big dataset for my task, I’ll embed the data of interest in memory.

open System
open System.IO
open FSharp.Control // TaskSeq (from the FSharp.Control.TaskSeq package)
open Microsoft.KernelMemory.AI.Ollama
open Microsoft.KernelMemory
open Microsoft.SemanticKernel

let ollamaUri = "http://localhost:11434/"
let chatModelName = "llama3.1"
let embedderModelName = "all-minilm"

// Kernel Memory talks to the local Ollama instance: the chat model generates text,
// while all-minilm produces the embeddings.
let kernelMemoryConfig =
    let config = new OllamaConfig()
    config.Endpoint <- ollamaUri
    config.TextModel <- new OllamaModelConfig(chatModelName)
    config.EmbeddingModel <- new OllamaModelConfig(embedderModelName, 2048)
    config

let memory =
    (new KernelMemoryBuilder())
        .WithOllamaTextGeneration(kernelMemoryConfig)
        .WithOllamaTextEmbeddingGeneration(kernelMemoryConfig)
        .Build()

Now, once the memory kernel is set up, I can embed the data.

let importPreparedData =
    async {
        // Embed every poem file from the seed dataset into Kernel Memory.
        let files = Directory.GetFiles(poemsPath, "*", SearchOption.AllDirectories)
        for file in files do
            let text = File.ReadAllText(file)
            let! importResult = memory.ImportTextAsync(text) |> Async.AwaitTask
            Console.WriteLine(importResult)
    }

Generating the actual text looks like this:

let generateText =
    async {
        // importData (defined below) seeds the memory with prepared and synthetic poems.
        do! importData
        // Stream the answer, logging the relevant-source count for each chunk,
        // and accumulate the chunks into the final poem.
        let! finalPoem =
            memory.AskStreamingAsync(prompt)
            |> TaskSeq.fold
                (fun acc answer ->
                    Console.WriteLine(answer.RelevantSources.Count)
                    Console.WriteLine(answer.Result)
                    acc + answer.Result)
                ""
            |> Async.AwaitTask
        return Utils.truncateText finalPoem 100
    }

Enriching the dataset

As often happens in machine learning, results depend on the quality and quantity of the data provided. This case is no exception. With a small dataset, the result looks clunky and often repetitive, as the model generates text based solely on what’s embedded in memory.

Honestly, as this domain is quite new to me, this was a big revelation: I expected the LLM to use not only the data provided but also its own generative capabilities.

So how do I add more variety? By generating synthetic data and embedding it alongside the original dataset.

// A plain Semantic Kernel instance that talks to Ollama directly, without memory retrieval.
let rawLlama =
    Kernel.CreateBuilder().AddOllamaTextGeneration(chatModelName, new Uri(ollamaUri)).Build()

let importSyntheticData =
    async {
        // Ask the raw model for synthetic poems and embed them alongside the prepared data.
        for _ in 1..syntheticDataCount do
            let! builtInPoem = rawLlama.InvokePromptAsync(syntheticPrompt) |> Async.AwaitTask
            let! importResult = memory.ImportTextAsync(builtInPoem.ToString()) |> Async.AwaitTask
            Console.WriteLine(importResult)
    }

let importData =
    async {
        do! importSyntheticData
        do! importPreparedData
    }
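
With both import workflows defined, the whole flow can be driven from a single entry point. This is just a sketch of how the pieces could be wired together; it assumes only the generateText workflow shown above.

[<EntryPoint>]
let main _ =
    // Runs the imports (prepared + synthetic) and then generation, printing the truncated poem.
    let poem = generateText |> Async.RunSynchronously
    Console.WriteLine(poem)
    0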

Prompt engineering

Even with a good enough dataset at your disposal, the experience with an LLM still depends greatly on how well the question is formulated.

I found that a good metric to control this is the relevant-source count returned together with the LLM’s response. Maximizing this value allowed me to find the prompt that utilized most of the data I provided.
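
In code, such a tuning loop can look roughly like the sketch below. The pickBestPrompt helper and the candidate prompts are hypothetical; memory.AskAsync is Kernel Memory’s non-streaming ask call, which returns the answer together with its relevant sources.

// Hypothetical helper: score candidate prompts by how many relevant sources they pull in.
let pickBestPrompt (candidates: string list) =
    async {
        let! scored =
            candidates
            |> List.map (fun candidate ->
                async {
                    let! answer = memory.AskAsync(candidate) |> Async.AwaitTask
                    return candidate, answer.RelevantSources.Count
                })
            |> Async.Sequential
        // Keep the prompt that touched the largest part of the embedded dataset.
        return scored |> Array.maxBy snd |> fst
    }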

Initially, I used one prompt both for embedding synthetic data and for generating the final output, but eventually I found that it’s better to use a different prompt for each task. When generating synthetic data, I use the most restrictive prompt I have, in order to guide the LLM towards something on par with the seeded data.

The prompt for generating the final output, on the other hand, is quite relaxed: since I already applied all the restrictions I wanted during the synthetic-data phase, now I want to squeeze the most out of the data I have.
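
For illustration, the split between the two prompts looks something like this. Both texts are made-up placeholders standing in for the actual prompts in the repository.

// Restrictive prompt: keeps synthetic poems close in style to the seeded dataset.
let syntheticPrompt =
    "Write a short poem in the style of the examples: free verse, melancholic imagery, no rhyme, no titles, no explanations, at most twelve lines."

// Relaxed prompt: lets the final generation lean on whatever is embedded in memory.
let prompt =
    "Using the poems you know, write a new poem about memory and loss."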

Conclusion

While subject to fierce debate from a philosophical standpoint, from a technical one employing LLMs for generative tasks is a perfect use case. In this article, we used a couple of techniques to guide the model towards the output we strive for, without spending huge effort on training our own model.

To conclude, I’ll leave you with a fragment generated with my approach. I think it is vague and enigmatic enough to pass as a modern poem.

On black earth, Crimean wormwood blooms,
As tears flow from the icon of the Mother of God.
The vizier of Solomon appears on city shop windows,
Like a guardian spirit abandoned.