-
RAG in AC SDK
After evaluating potential approaches, only one viable solution emerged: introduce separate Instance and Session classes tailored for Embedding and Reranking models. An alternative idea involved enhancing the existing LLM Instance and Session classes instead.

Now we need to consider what we should do with vector databases. It's obvious that we should wrap them in our own classes. We solve that problem similarly to how we do it with the LLMs: create a base abstract class which will be inherited and implemented by specific database providers.

std::string model = "text-embedding-model.gguf";
EmbeddingInstance einst(model, EmbeddingInstanceOptions{});
EmbeddingSession es = einst.createSession(...);
ChromaVectorDatabase vd(ChromaOptions{}, model);

// documentURLs is a list of URLs to the documents which contain
// the text we need for the vector database
for (const auto& documentURL : documentURLs) {
    std::vector<std::string> chunks = getDocumentChunks(documentURL);
    std::vector<std::vector<float>> vectors = es.vectorize(chunks);
    // getMetadata stands in for a helper producing per-chunk metadata
    vd.insert(documentURL, getMetadata(chunks), vectors);
}

std::string query = "...";
// in the options we can set the number of top-K results we want to retrieve
std::vector<std::string> results = vd.query(query, QueryOptions{});

LLMInstance linst("text-generation-model.gguf", LLMInstanceOptions{});
LLMSession ls = linst.createSession(...);
ls.setInitialPrompt(results);
std::string result;
for (int i = 0; i < 100; i++) {
    result += ls.getToken();
}
std::cout << result << std::endl;
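As a rough illustration of the base abstract class mentioned above, here is a minimal sketch; the name VectorDatabase and the exact signatures are assumptions derived from the usage example, not a settled API:

struct ChromaOptions { /* provider-specific settings */ };
struct QueryOptions {
    int topK = 5; // number of top-K results to retrieve
};

// Hypothetical base class for vector database providers
class VectorDatabase {
public:
    virtual ~VectorDatabase() = default;
    // store a document's chunks together with their metadata and embeddings
    virtual void insert(const std::string& documentURL,
                        const std::vector<std::string>& metadata,
                        const std::vector<std::vector<float>>& vectors) = 0;
    // embed the query internally and return the best-matching chunks
    virtual std::vector<std::string> query(const std::string& query,
                                           const QueryOptions& options) = 0;
};

// ChromaVectorDatabase from the example above would be one concrete provider
class ChromaVectorDatabase : public VectorDatabase {
public:
    ChromaVectorDatabase(ChromaOptions options, std::string embeddingModel);
    void insert(const std::string& documentURL,
                const std::vector<std::string>& metadata,
                const std::vector<std::vector<float>>& vectors) override;
    std::vector<std::string> query(const std::string& query,
                                   const QueryOptions& options) override;
};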
-
Vector stores/database integration
Some of the most known VSs are listed below.
VSs are usually designed to be used through a server for scalability and ease of integration.
- FAISS - The only one which seems easy to integrate without a server.
- Milvus - Has a C++ Client SDK, so we can use it directly.
- Qdrant - Since it's written in Rust, it will be easy to make a C library for the client and integrate it.
- Lance - Same as Qdrant.
- Redis - It seems like a more enterprise-oriented VS, but since it's open source and written in C we might have to check it too.
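To make the Qdrant/Lance point concrete: a Rust client can be exposed through a C-compatible shim and consumed from C++. The sketch below is purely illustrative; none of these symbols exist in the Qdrant client today:

// Hypothetical C interface that a thin Rust cdylib wrapping the Qdrant
// client could export; all names here are made up for illustration.
#include <cstddef>
#include <cstdint>

extern "C" {
    typedef struct qdrant_client qdrant_client; // opaque handle

    qdrant_client* qdrant_connect(const char* url);
    void qdrant_disconnect(qdrant_client* client);

    // upsert a single point: id, embedding vector and its length, JSON payload
    int qdrant_upsert(qdrant_client* client, const char* collection,
                      uint64_t id, const float* vector, size_t dim,
                      const char* payloadJson);

    // search: fills outIds with up to topK ids, returns the number found
    int qdrant_search(qdrant_client* client, const char* collection,
                      const float* vector, size_t dim,
                      uint64_t* outIds, size_t topK);
}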
-
VS API
Since we're about to have different models with a big variety of data types (text, audio, video, etc.), we need to consider making the VSs more generic for developers. The following API is basic, still a WIP, and about to be enhanced with more functionality. For now we'll treat it as a sync API, as we do for AI models.

#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <vector>

using EmbeddingVector = std::vector<float>;
using Score = float;

// The embedding instance should be implemented for all models that are meant for embedding
class EmbeddingInstance {
public:
    virtual ~EmbeddingInstance() = default;
    virtual EmbeddingVector getEmbedding(const std::string& query) = 0;
};

// ========
// The following can be user-defined structs which are wrapped in Dicts
// Example structure of a text document chunk
struct DocumentRecord {
    std::vector<std::string> metadata;
    std::string text;
    EmbeddingVector embedding;
};

// Example structure of an image
struct ImageRecord {
    std::vector<std::string> metadata;
    std::vector<uint8_t> bytes; // raw image bytes
    EmbeddingVector embedding;
};
// ==========

struct VectorStoreOptions {
    uint16_t top; // sets the maximum number of returned results
    // search filter applied before the vector search;
    // the predicate signature is an assumption
    std::optional<std::function<bool(const ac::Dict&)>> filter;
};

class VectorStore {
public:
    virtual ~VectorStore() = default;
    // Add/remove records to the store
    // Requires a vector of records and their ids
    virtual void addRecords(ac::Dict records) = 0;
    virtual void removeRecords(ac::Dict ids) = 0;
    // Returns a record by id
    virtual ac::Dict get(ac::Dict param) = 0;
    // Returns a list of ids and scores
    // Requires a vector and a top K
    virtual ac::Dict search(ac::Dict params) = 0;
};
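A hypothetical usage sketch of this API; it assumes ac::Dict can be built from an initializer list the way a JSON-style dictionary can, and the factory function is made up:

// Hypothetical usage of the VectorStore interface above; createVectorStore
// and the Dict construction syntax are assumptions.
std::unique_ptr<VectorStore> store = createVectorStore(/*...*/);

// embedder is some concrete EmbeddingInstance implementation
// wrap a DocumentRecord-like structure in a Dict and add it
ac::Dict records = {
    {"records", {{
        {"id", 1},
        {"metadata", {"source.txt"}},
        {"text", "AC SDK supports local inference."},
        {"embedding", embedder.getEmbedding("AC SDK supports local inference.")}
    }}}
};
store->addRecords(records);

// search: a query vector plus a top K, as the interface comments require
ac::Dict params = {
    {"vector", embedder.getEmbedding("what does the SDK support?")},
    {"topK", 5}
};
ac::Dict hits = store->search(params); // list of ids and scores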
-
Design: Vector Stores vs Inference libraries
The decision to make inference libraries plugins was made because of a constraint we had: they share common libraries, which results in ODR violations. We don't have this problem with Vector Stores, which is why we can keep them as plain libraries and include them in the applications that need them. So to summarize what we'll need for the VS:
-
What is RAG
To support RAG (Retrieval-Augmented Generation) applications with AC we need the following components:
- LLM
- Embedding models
- Vector database
- Reranking models
Now we'll go through each of them and see what exactly we have to support.
LLM
The LLM is the generative component of a RAG application. It's responsible for text prediction (token generation). It might also be used to extract context, or to rephrase or enhance the query in order to improve retrieval.
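For example, the LLMSession API sketched in the first comment could be reused to rewrite the user query before retrieval. A hypothetical sketch; the prompt wording and the token cap are arbitrary:

// Hypothetical query rewriting reusing the LLM classes proposed earlier;
// names and signatures follow that sketch, not an existing API.
LLMInstance linst("text-generation-model.gguf", LLMInstanceOptions{});
LLMSession ls = linst.createSession(...);
ls.setInitialPrompt({"Rewrite this search query to be more specific: " + userQuery});

std::string rewrittenQuery;
for (int i = 0; i < 32; i++) { // cap the generated length
    rewrittenQuery += ls.getToken();
}
// rewrittenQuery is then embedded and sent to the vector database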
Embedding models
The embedding models are used for vectorization of the text. They convert the text into a vector representation, which is then used to compare the similarity between the query and the documents (Semantic Similarity Search).
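The comparison itself is typically a cosine similarity between the two embedding vectors; a minimal self-contained sketch:

#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two embedding vectors of equal length:
// 1 means same direction (very similar), 0 means orthogonal (unrelated).
float cosineSimilarity(const std::vector<float>& a, const std::vector<float>& b) {
    float dot = 0.f, normA = 0.f, normB = 0.f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}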
Vector Database
The vector database is the essential component for efficiently storing and searching information based on semantic similarity. It stores the vector representations of the documents, which are compared against the query vector during retrieval. Some vector databases are capable of multi-modal data handling, i.e. they support different types of data. Along with the vector embedding, the vector database can store additional information about the document: metadata, which is used for filtering.
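Metadata filtering maps naturally onto the VectorStoreOptions::filter field from the VS API comment above. A hypothetical example, assuming ac::Dict offers JSON-style contains/value accessors:

// Hypothetical pre-search filter: only records whose metadata has
// a "lang" field equal to "en" take part in the vector search.
VectorStoreOptions opts;
opts.top = 5;
opts.filter = [](const ac::Dict& record) {
    return record.contains("metadata")
        && record["metadata"].value("lang", "") == "en";
};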
Reranking
The reranking models are used to enhance the quality and relevance of the retrieved results before they are passed to the language model for response generation. They can reduce noise and prioritize results that directly answer the query.
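Hooked into the flow from the first comment, reranking would sit between retrieval and generation. A hypothetical sketch with RerankingInstance/RerankingSession classes mirroring the proposed Embedding ones (none of these exist yet, and the score method is an assumption):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical reranking step between retrieval and generation.
RerankingInstance rinst("reranking-model.gguf", RerankingInstanceOptions{});
RerankingSession rs = rinst.createSession(...);

// results is the list of chunks returned by the vector database
std::vector<std::pair<float, std::string>> scored;
for (const auto& doc : results) {
    scored.emplace_back(rs.score(query, doc), doc);
}
// highest-scoring documents first; pass the top few to the LLM
std::sort(scored.begin(), scored.end(),
          [](const auto& a, const auto& b) { return a.first > b.first; });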