ChromaDB Vector Database: A guide on extending the knowledge of LLMs with RAG
Introduction
In recent times, Large Language Models (LLMs) have gained significant prominence. Numerous companies have developed their own robust LLMs with extensive parameter counts, showcasing impressive capabilities. However, despite their power, these models are limited by their training data; they only possess knowledge within their training scope. When confronted with information beyond their training, they may generate inaccurate or hallucinated results.
In situations where information evolves continuously, constantly retraining these models with new data becomes impractical. This challenge leads us to leverage a concept known as RAG (Retrieval Augmented Generation). RAG acts as a bridge, enabling us to link our LLM models with external sources of information. To employ RAG effectively, we rely on a vector database. This database serves as a repository for storing diverse information, which can be accessed and integrated into the model whenever a query is initiated.
Our aim with this article is to shed light on the pivotal role vector databases play and the reasons behind their increasing popularity in expanding the capabilities of LLM systems. Additionally, through a practical session, we will construct our own vector database and engage with it. This hands-on experience will deepen our understanding of their critical importance in the current landscape.
Theory
Before proceeding to the practical session, some questions need to be answered to further clarify what many of the terms we have been using really mean.
What are databases and why do we need them?
Databases play a crucial role in data storage. Traditionally, there were two primary types: SQL and NoSQL databases. SQL databases, such as MySQL and PostgreSQL, manage structured data organized in tables with relationships established through primary and foreign keys. NoSQL databases, such as MongoDB and Cassandra, handle unstructured data like documents.
Now, let’s delve into vector databases. These specialized databases excel at storing vast quantities of vector data while enabling swift retrieval. In a vector database, each piece of text inserted is stored alongside its vector embedding, which is used for rapid information retrieval. This is particularly advantageous for Large Language Models (LLMs), where minimizing latency is crucial. Moreover, vector databases can perform similarity searches: finding documents similar to a given document or query, which makes them suitable for a wide range of applications.
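To make the idea of similarity search concrete, here is a deliberately naive sketch in Python of what a vector database does at its core: store (text, vector) pairs and rank the stored vectors by cosine similarity to a query vector. This is illustrative only; production vector databases such as Chroma replace the brute-force loop with approximate nearest-neighbour indexes (e.g. HNSW).
# A naive sketch of the core idea behind a vector database:
# store (text, vector) pairs and rank them by cosine similarity
# at query time. Real systems use approximate nearest-neighbour
# indexes (e.g. HNSW) instead of this brute-force loop.
import numpy as np

store = []  # list of (text, vector) pairs

def add(text, vector):
    store.append((text, np.asarray(vector, dtype=float)))

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def query(query_vector, n_results=2):
    query_vector = np.asarray(query_vector, dtype=float)
    scored = [(cosine_similarity(query_vector, v), text) for text, v in store]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:n_results]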
What do LLMs entail?
Understanding Large Language Models (LLMs) involves dissecting each term they encompass:
- Large: Signifies substantial size or quantity.
- Language: Refers to the means of communication, often conveyed in textual form.
- Models: Denotes machine learning algorithms or mathematical functions designed to execute one or multiple tasks.
In simpler terms, LLMs embody extensive machine learning algorithms specialized in language-related tasks, such as text comprehension, generation, and sentiment analysis. These tasks involve interpreting, processing, and producing textual content resembling human-like language patterns.
In a more technical context, LLMs represent advanced artificial intelligence systems meticulously crafted to comprehend, process, and generate human-like textual content based on input data. Their training involves extensive textual data, enabling them to decode intricate patterns, context, and linguistic structures. Consequently, LLMs can execute diverse tasks, including language translation, text completion, question answering, and more.
What does RAG entail?
According to IBM, RAG stands for an AI framework utilized to retrieve factual information from an external knowledge base. This process aims to anchor large language models (LLMs) with the latest and most precise information, offering users a glimpse into the generative process of LLMs.
For example:
Imagine an LLM used by a news organization to generate news articles. As news is continually evolving, the LLM may encounter rapidly changing information, such as breaking news stories or updated statistics. Employing RAG, the AI model accesses an external database, swiftly retrieving the latest data to ensure the articles it generates remain accurate and up-to-date. This integration of external knowledge through RAG enables the LLM to adapt to new information in real time, ensuring the content it produces remains reliable despite the constantly changing landscape of news.
Essentially, this entails integrating an external repository of knowledge into our AI model. This integration ensures that our model furnishes accurate information, especially considering the dynamic nature of much of this knowledge that tends to evolve over time.
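The RAG flow itself is straightforward to express in code. The sketch below is a hypothetical outline, not a real library: embed, vector_db.query, and llm.generate are placeholder names standing in for whatever embedding model, vector database, and LLM you actually use.
# A hypothetical outline of the RAG flow. `embed`, `vector_db`, and
# `llm` are placeholders for your actual embedding model, vector
# database, and LLM.
def answer_with_rag(question, embed, vector_db, llm, n_results=3):
    # 1. Embed the user's question.
    query_vector = embed(question)
    # 2. Retrieve the documents most similar to the question.
    documents = vector_db.query(query_vector, n_results=n_results)
    # 3. Ground the LLM by placing the retrieved text in the prompt.
    context = "\n\n".join(documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)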
Practical
In this practical session, we will be building our own database and working with texts and their corresponding embeddings. But before we proceed, we need to know what embeddings are.
What are embeddings?
According to Turing, word embedding in NLP is an important term that is used for representing words for text analysis in the form of real-valued vectors. It is an advancement in NLP that has improved the ability of computers to understand text-based content in a better way. It is considered one of the most significant breakthroughs of deep learning for solving challenging natural language processing problems.
In this approach, words and documents are represented in the form of numeric vectors allowing similar words to have similar vector representations. The extracted features are fed into a machine learning model so as to work with text data and preserve the semantic and syntactic information. This information once received in its converted form is used by NLP algorithms that easily digest these learned representations and process textual information. [2]
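To see this in practice, the short snippet below uses the sentence-transformers library (the same library behind the embedding model we use later in this tutorial) to embed a few sentences and compare them. The example sentences are our own; semantically similar sentences should score noticeably higher than unrelated ones.
# Embedding sentences with sentence-transformers and comparing them.
# Similar sentences get similar vectors, so their cosine similarity
# is higher than that of unrelated sentences.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A kitten is resting on the rug.",
    "The stock market fell sharply today.",
]
embeddings = model.encode(sentences)  # shape: (3, 384)

print(util.cos_sim(embeddings[0], embeddings[1]))  # high: similar meaning
print(util.cos_sim(embeddings[0], embeddings[2]))  # low: unrelated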
Building your own vector database
The ChromaDB Database
1. Introduction to ChromaDB
Chroma is the open-source embedding database. Chroma makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs [3]. Chroma gives you the tools to store embeddings and their metadata, embed documents and queries, and search embeddings.
2. Why use ChromaDB?
Unlike many vector databases, Chroma can run in in-memory mode or in client/server mode (the latter in alpha), i.e. we can have a ChromaDB database running locally on our device, or we can run an instance of the database in the cloud and connect to it from our local machine.
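Both modes are exposed through similar client interfaces. As a quick sketch (the host and port below are assumptions; point them at wherever your Chroma server actually runs):
import chromadb

# In-memory mode: the database lives inside your Python process.
in_memory_client = chromadb.Client()

# Client/server mode: connect to a Chroma server running elsewhere.
# The host and port here are examples, not values you must use.
remote_client = chromadb.HttpClient(host="localhost", port=8000)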
3. Building A ChromaDB From Scratch
This tutorial is going to contain 3 parts:
- Creating the database instance, collection and getting our embedding model
- Loading data into our vector database
- Fetching information from our vector database
Before proceeding, we need to install the required packages for our code to run. Run the following commands to install them.
pip install chromadb transformers sentence_transformers InstructorEmbedding
pip install torch
Now we go ahead and import our required packages and load our Chroma client and embedding model.
# Importing relevant packages for this project
import chromadb as db #This helps us work with the vectors database
from chromadb.utils import embedding_functions # This helps us fetch our embedding model
# Loading our embedding model to memory (all-MiniLM-L6-v2)
"""
Note we have other embedding models available to us in the chromadb
ecosystem. We could even decide to use the OpenAIEmbedding model. Note that
using this OpenAIEmdedding model will require an API key issues to you. In this
Tutorial, we have decide to use a completely free embedding model.
"""
embedding_model = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
# Creating our ChromaDB client.
chroma_client = db.PersistentClient(path="/Path/to/save/database")
# Remember to specify the path where you would like to save your database
"""
... henceforth, everything regarding loading and fetching information
will be done through through this client we just created
"""
Now we go straight to loading data into our database. But first we need to create something called a collection.
What are collections?
Collections are where you’ll store your embeddings, documents, and any additional metadata.
You can create as many collections as you need. Whenever we want to load data into a collection, we use the client to pick the collection we want, then we write data into it. In our case, we don’t have any collection yet, so we need to create one before we can load data into it.
# Creating a database collection
"""
The create_collection method helps us create a new collection that we can
store data. When creating this collection, there are some attributes we need
to pass tho this method to customize its behavious
1. We need to specify and embedding function that embeds any text entering
the model. ChromaDB will use the default if not passed.
2. We can also a metadata that states how this database should compute
similarities between object in the vectors space computed. l2 is the default.
"""
chroma_client.create_collection(
name='Collection1',
metadata={"hnsw:space": "cosine"}, # l2 is the default
embedding_function=embedding_model
)
# Getting the collection. Remember to pass the embedding function again,
# since get_collection does not remember the one used at creation time.
collection_one = chroma_client.get_collection(name='Collection1', embedding_function=embedding_model)
# Here is the data we are loading into our database
"""
About the data:
1. The data is about the article gotten from medium website
2. The content of these articles has been loaded into the datas
3. The title of these article in present in the metadatas
4. We have created an id for each document to serve as the unique identifier for
each article.
"""
datas = [
'Data analysis is the process of inspecting and exploring data generated by a particular population to find the information needed to make decisions and draw conclusions. With the use of data in decision making, most businesses today need data analysts. So, if you want to know about the best books to learn data analysis, this article is for you. In this article, I will introduce you to some of the best books to learn data analysis.',
'The performance of a machine learning algorithm on a particular dataset often depends on whether the features of the dataset satisfies the assumptions of that machine learning algorithm. Not all machine learning algorithms have assumptions that differentiate them from each other. So, in this article, I will take you through the assumptions of machine learning algorithms.',
'The K-Means Clustering is a clustering algorithm capable of clustering an unlabeled dataset quickly and efficiently in just a very few iterations. In this article, I will take you through the K-Means clustering in machine learning using Python.',
'Many machine learning algorithms can be used to solve complex problems that require a large amount of data with a large number of features, but deep learning can outperform all algorithms. So to understand where we can use deep learning techniques, in this article, I will introduce you to the applications of deep learning.',
'A scatter plot is one of the most useful ways to analyze the relationship between two features. You must have used a scatter plot before if you are learning data science but have you ever tried to create an animated scatter plot using Python? If you want to learn how to visualize an animated scatter plot, this article is for you. In this article, I will take you through a tutorial on visualizing animated scatter plot using Python.'
]
metadatas = [{'title': 'Best Books to Learn Data Analysis'},
{'title': 'Assumptions of Machine Learning Algorithms'},
{'title': 'K-Means Clustering in Machine Learning'},
{'title': 'Applications of Deep Learning'},
{'title': 'Animated Scatter Plot using Python'}]
ids = ['1', '2', '3', '4', '5']
# Loading data into this database
collection_one.add(
documents=datas,
metadatas=metadatas,
ids=ids
)
Our data has now been loaded into the ‘Collection1’ collection in our database, so we can fetch information from it.
NB: Here is where a major advantage of vector databases over other types of databases becomes apparent. When fetching information from the database, we do not need an exact match of what we are looking for; we can perform a similarity search, which fetches the documents nearest to the text we are querying with. This is possible because every document in the database has its corresponding embedding stored as well: the embeddings are first used to compute similarities, and then the text behind each matching embedding is returned.
"""
In the below query, we are looking for articles that have something releated to
the phrase 'For data visualization, what is the role of matplotlib'.
We want to just fetch just 2 results.
"""
collection_one.query(
query_texts='For data visualization, what is the role of matplotlib',
n_results=2,
)
Result:
"""
In this result, lets focus especially on the [distances]. We see that we have
[0.48327672481536865, 0.7423484921455383] for ids [5, 1] respactively.
The higher the score, the closer the text is to our query text.
In this result we can see that text 1 is closer to the query text than text 5
"""
'-------- RESULT -------'
{'ids': [['5', '1']],
'distances': [[0.48327672481536865, 0.7423484921455383]],
'metadatas': [[{'title': 'Animated Scatter Plot using Python'},
{'title': 'Best Books to Learn Data Analysis'}]],
'embeddings': None,
'documents': [['A scatter plot is one of the most useful ways to analyze the relationship between two features. You must have used a scatter plot before if you are learning data science but have you ever tried to create an animated scatter plot using Python? If you want to learn how to visualize an animated scatter plot, this article is for you. In this article, I will take you through a tutorial on visualizing animated scatter plot using Python.',
'Data analysis is the process of inspecting and exploring data generated by a particular population to find the information needed to make decisions and draw conclusions. With the use of data in decision making, most businesses today need data analysts. So, if you want to know about the best books to learn data analysis, this article is for you. In this article, I will introduce you to some of the best books to learn data analysis.']],
'uris': None,
'data': None}
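Note that each field in the result is a list of lists, with one inner list per query text, so to work with the matches programmatically you index into the first element. A small sketch:
# Each field holds one inner list per query text, so we index with [0]
# to get the matches for our single query.
results = collection_one.query(
    query_texts='For data visualization, what is the role of matplotlib',
    n_results=2,
)

for doc_id, distance, meta in zip(
    results['ids'][0], results['distances'][0], results['metadatas'][0]
):
    print(f"id={doc_id}  distance={distance:.4f}  title={meta['title']}")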
A Bonus For You :)
To test whether you truly understand the concepts demonstrated in this article, create a movie recommendation system with any dataset of your choice. Create a new collection in your database and set its properties. Load your data into the collection along with its metadata. Then fetch information from this database. Voila! You have yourself a movie recommender system. A starting skeleton is sketched below.
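Here is a hypothetical skeleton of the same pattern. The collection name, plot summaries, and titles are placeholders; substitute your own dataset.
# A hypothetical skeleton for the movie recommender exercise. The plot
# summaries and titles below are placeholders; load your own dataset.
movies = chroma_client.create_collection(
    name='Movies',
    metadata={"hnsw:space": "cosine"},
    embedding_function=embedding_model
)

movies.add(
    documents=['A hacker discovers that reality is a simulation.',
               'A young wizard attends a school of magic.'],
    metadatas=[{'title': 'Placeholder Movie A'}, {'title': 'Placeholder Movie B'}],
    ids=['1', '2']
)

# Describe the kind of movie you want and fetch the nearest plots.
movies.query(query_texts='science fiction about simulated worlds', n_results=1)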
References
- [1] IBM Research, Retrieval-Augmented Generation: https://research.ibm.com/blog/retrieval-augmented-generation-RAG
- [2] Turing, A Guide on Word Embeddings in NLP: https://www.turing.com/kb/guide-on-word-embeddings-in-nlp
- [3] Chroma Documentation: https://docs.trychroma.com/
Socials
- https://www.linkedin.com/in/amusaoluwatomisin/
- https://github.com/oluwatomsin
- https://medium.com/@amusatomisin65