Vector Search with Large Language Models

A tutorial on Chromadb and Azure Cognitive Search

Suman Gautam
8 min read · Aug 22, 2023

As the quantity of documents increases day by day, our need to analyze such large volumes of data is becoming overwhelming. Fortunately, the explosion of large language models in recent months has provided tangible benefits for handling such data. One frequent use case is finding relevant documents. For example, in academic fields we have hundreds of thousands of research papers, conference papers, and abstracts. It is beyond our reach to read every document out there, which means much valuable information slips out of the hands of researchers across the globe. We can leverage large language models to refine our search for relevant documents. One such tool is a vector database, which uses embeddings to find closely related documents.

A simple illustration of vector space [source: deeplearning.ai]

So, why are embeddings so good at finding closely related documents? When we supply an input query or prompt as text, the embedding of that text gets mapped into the vector space close to related text. Below is an excellent example from deeplearning.ai that shows the intuition behind it.

Mapping of input query to the vector space [source: deeplearning.ai]
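To make "closeness" concrete before we touch any libraries, here is a minimal sketch of the cosine similarity measure that both tools below rely on. The toy 3-dimensional vectors are purely illustrative; the real embeddings used later have 1536 dimensions.

import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the magnitudes;
    # 1.0 means the vectors point in exactly the same direction.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings": the first two point in similar directions,
# so they score much higher than the unrelated third vector.
query_vec = [0.9, 0.1, 0.0]
related_doc = [0.8, 0.2, 0.1]
unrelated_doc = [0.0, 0.1, 0.9]

print(cosine_similarity(query_vec, related_doc))    # ~0.98
print(cosine_similarity(query_vec, unrelated_doc))  # ~0.01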

With this intuition in mind, I am going to conduct a simple exercise to find similar articles in a corpus of scientific articles. In this case, I used ~1000 abstracts to generate a vector database and used one of the articles as a query to retrieve relevant articles.

Key findings:

  • The similarity search returned a number of duplicated articles as the most similar results (unsurprisingly); a simple score-threshold filter, sketched later in this post, can drop these.
  • A closer look at the text revealed key words and context present in both the input and the output results.

Below, I will walk you through the steps I took to conduct this experiment. If you are working in an Azure environment, this might be especially beneficial to you.

import os 
import json
import pandas as pd

import openai
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.azuresearch import AzureSearch
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dotenv import load_dotenv
load_dotenv()

# openai keys
openai.api_type = os.getenv("OPENAI_API_TYPE")
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.api_base = os.getenv("OPENAI_API_BASE")
openai.api_version = os.getenv("OPENAI_API_VERSION")

openai_deployment_name = '<your deployment name>'
openai_model_name = 'gpt-35-turbo'

openai_embedding_deployment_name = '<your deployment name>'
openai_embedding_model_name = 'text-embedding-ada-002'
embedding_encoding = "cl100k_base"
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191

# Create an OpenAI embedding instance
embeddings = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    deployment=openai_embedding_deployment_name,
    chunk_size=1  # here it means batch size
)
test_embeddings = embeddings.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)
print(len(test_embeddings), len(test_embeddings[0]))

Output: 5, 1536
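The embedding_encoding and max_tokens variables defined above are there to guard against over-long inputs. As a minimal sketch (assuming the tiktoken package is installed and that input_text is the list of abstract strings used in the next step), you could verify document lengths before embedding:

import tiktoken

# cl100k_base is the encoding used by text-embedding-ada-002
encoding = tiktoken.get_encoding(embedding_encoding)

def num_tokens(text):
    return len(encoding.encode(text))

# Flag any abstract that exceeds the embedding model's token budget
too_long = [t for t in input_text if num_tokens(t) > max_tokens]
print(f"{len(too_long)} documents exceed {max_tokens} tokens")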

Steps to create Chromadb:

# We may need to split documents into smaller chunks.
# input_text is the list of abstract strings gathered from the corpus.
text_splitter = RecursiveCharacterTextSplitter(separators=["."], chunk_size=10000, chunk_overlap=200)
docs = text_splitter.create_documents(input_text)

# Define location to store the Chroma DB
dir = './chroma_db/'

vectordb = Chroma.from_documents(documents=docs,
                                 embedding=embeddings,
                                 persist_directory=dir)
vectordb.persist()
vectordb = None

# Load the persisted db from disk (same directory we persisted to above)
dir = './chroma_db/'
vectordb = Chroma(persist_directory=dir,
                  embedding_function=embeddings)

Inspecting Chromadb:

temp = vectordb.get(include=['embeddings', 'documents', 'metadatas'])
temp['embeddings'][0][:10]

Output:
[-0.014461057762887768, -0.01953867518217372, 0.0031938221017889477, -0.023492448847831967,
-0.018103402736713643, 0.02425070563738725, 0.0033732311574714573, 0.0033867714240234225,
-0.021041651055869488, -0.03165725728551473]

temp['documents'][0]

Output:
"Rice (Oryza sativa) germination and seedling establishment, particularly in increasingly saline soils, are critical to ensure successful crop yields. Seed vigor, which determines germination and seedling growth, is a complex trait affected by exogenous (environmental) and endogenous (hormonal) factors."

Run a similarity search:

query = '<your query here>'

vectordb.similarity_search(query, k=5)

**************
Example search result:
Document(page_content='Understanding the molecular mechanisms of environmental salinity stress tolerance and acclimation strategies by photosynthetic organisms facilitates accelerating the genetic improvement of tolerant economically important crops....'
**************

We can also generate a similarity score for each of the retrieved documents.
vectordb.similarity_search_with_score(query, k=5)

**************
Example search result:
Score: 0.24827224016189575
Content: 'While the importance of cell type specificity in plant adaptive responses is widely accepted, only a limited number of studies have addressed this issue at the functional level...'
**************
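One practical note on the duplicates mentioned in the key findings: exact duplicates of the query text come back with a near-zero distance, so a simple threshold filter is one way to drop them. This is only a sketch; the threshold value is illustrative and would need tuning for your corpus.

# Chroma returns (document, distance) pairs; smaller distance = more similar.
# Near-zero distances are (almost) exact duplicates of the query text.
DUPLICATE_THRESHOLD = 0.05  # illustrative value; tune for your corpus

results = vectordb.similarity_search_with_score(query, k=10)
deduped = [(doc, score) for doc, score in results if score > DUPLICATE_THRESHOLD]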

Getting a vector similarity search working with Chromadb was fairly easy. I also wanted to scale it up in an Azure environment, and Azure recently announced vector search capability through its Cognitive Search service. So, I spent some time figuring out the quirks of Azure to see how this works. It is currently in beta, so there was a bit of a learning curve for me, but to make your learning smoother, I decided to write up some of the procedure. The folks at Azure have GitHub tutorials with LangChain integration. I started with those and immediately ran into library issues, as they have recently upgraded the SDK for vector search. So, I had to go back to the native Azure SDK to make it work. Thanks to one of the folks at Azure who gave me a direction.

The general steps for the Azure-based similarity search procedure are:

  1. Setting up the following services in your Azure environment: Azure OpenAI, Azure Cognitive Search, Azure Storage, Azure ML Studio
  2. Creating embeddings -> this has to be done outside of Azure Cognitive Search
  3. Creating an index
  4. Pushing documents/embeddings into the Azure Search index
  5. Performing the search

Since I had already created vector embeddings using Chromadb, I decided to reuse them in Azure Cognitive Search.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.models import Vector
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SimpleField,
    SearchableField,
    VectorSearch,
    SearchFieldDataType,
    HnswVectorSearchAlgorithmConfiguration,
)

# Cognitive search keys
os.environ["AZURE_COGNITIVE_SEARCH_ENDPOINT"] = '<***>'
os.environ["AZURE_COGNITIVE_SEARCH_API_KEY"] = '<***>'
os.environ["AZURE_COGNITIVE_SEARCH_INDEX_NAME"] = 'my-index'
credential = AzureKeyCredential(os.environ["AZURE_COGNITIVE_SEARCH_API_KEY"])

According to the Azure documentation, the dataset must be composed of JSON documents that map to the index schema.

All I have to do is convert the embeddings into JSON format that can then be uploaded into the Azure index. Before you convert to JSON, you will probably need to have the data as a list of dictionaries, something like this:

[
    {
        "id": "1",
        "title": "<***>",
        "content": "<****>",
        ...
    },
    {
        "id": "2",
        "title": "<***>",
        "content": "<****>"
    },
    ...
]
# We need to convert to a list of dictionaries before the json dump
df = pd.DataFrame(
    vectordb.get(include=['embeddings', 'documents', 'metadatas'])
)
records = df.to_dict(orient='records')
with open('./json/vectordb.json', 'w') as file:
    json.dump(records, file)

Once you have your documents/embeddings in the JSON format, just load the file:

# Upload some documents to the index
with open('./json/vectordb.json', 'r') as file:
    documents = json.load(file)

Creating the index schema was the trickiest part for me. We need to pay attention to the 'type' of each field. For example, if your 'id' field is defined as 'SearchFieldDataType.String' but it is numeric in your dataframe/JSON, the search will fail. Another parameter to pay attention to is the vector dimension; if it is wrong, it is easy to fix, as the error will point to it clearly. In any case, set the dimension equal to the length of the embeddings you created earlier (1536 here).
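For instance, since the index below declares 'id' as a string field, a hedged one-liner can coerce the ids in the records we loaded from JSON before uploading (this assumes each record has an 'id' key, as in the JSON layout shown earlier):

# 'id' is declared as SearchFieldDataType.String in the index, so make sure
# every record carries a string id; a numeric id would make the upload fail.
documents = [{**doc, "id": str(doc["id"])} for doc in documents]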

# Steps for creating an index for Azure Cognitive Search
index_name: str = "my-index"

# Invoke SearchIndexClient
index_client = SearchIndexClient(os.environ["AZURE_COGNITIVE_SEARCH_ENDPOINT"],
                                 credential)

id_field = SimpleField(name="id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True)
content_field = SearchableField(name="content", type=SearchFieldDataType.String, filterable=True)
metadata_field = SimpleField(name="metadata", type=SearchFieldDataType.String)
content_vector = SearchField(name="content_vector",
                             type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                             searchable=True,
                             vector_search_dimensions=1536,  # must match the embedding length
                             vector_search_configuration="my-vector-config",
                             )

vector_search = VectorSearch(
    algorithm_configurations=[
        HnswVectorSearchAlgorithmConfiguration(
            name="my-vector-config",
            kind="hnsw",
            parameters={
                "m": 4,
                "efConstruction": 400,
                "efSearch": 1000,
                "metric": "cosine"
            }
        )
    ]
)

index = SearchIndex(name=index_name,
                    fields=[
                        id_field,
                        content_field,
                        metadata_field,
                        content_vector],
                    vector_search=vector_search
                    )

# Create the index (or update it if it already exists)
result = index_client.create_or_update_index(index)
print(f'{result.name} created')

Now we can start uploading the documents and embeddings into the vector store.

# Invoke a search client
index_name: str = "my-index"
search_client = SearchClient(
    endpoint=os.environ["AZURE_COGNITIVE_SEARCH_ENDPOINT"],
    index_name=index_name,
    credential=credential)

Pushing data to an index:

  • The dataset must be composed of JSON documents that map to your index schema
  • Upload documents individually or in batches of up to 1,000 documents or 16 MB per batch, whichever limit comes first
# We will upload in chunks of 1000
chunk_size = 1000
for i in range(0, len(documents), chunk_size):
    start = i
    end = i + chunk_size
    docs_chunk = documents[start:end]

    print(f"Uploading docs from {start} to {end}")

    # result = search_client.upload_documents(docs_chunk)
    result = search_client.merge_or_upload_documents(docs_chunk)

    print(f"Uploaded {len(docs_chunk)} documents")

Now, let's perform a vector similarity search. To do that, our input query also needs to be vectorized. Let's create a simple function for that:

def generate_embeddings(text):
    response = openai.Embedding.create(
        input=text, engine=openai_embedding_deployment_name)
    embeddings = response['data'][0]['embedding']
    return embeddings

Azure Cognitive Search finds similar documents using cosine similarity (as configured in the index above). There is also a way to change the similarity measure, but it is a little complicated and I might try that later. The simplest form of the search looks like this:

# Perform a vector similarity search
query = "<your prompt here>"

vector = Vector(value=generate_embeddings(query),
                k=10,  # number of documents to return
                fields="content_vector")

results = search_client.search(
    search_text=None,
    vectors=[vector],
    select=["id", "content", "content_vector"],
)

for result in results:
    print(f"ID: {result['id']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['content']}\n")

Result:

Input Query: 
"Salinity is major abiotic stress limiting plant growth worldwide.
Plant adaptation to salinity stress involves diverse physiological and
metabolic pathways. In this study, we assessed the effects of foliar
application of zinc oxide nanoparticles (ZnONPs) and Moringa leaf extract
(MLE) on salt tolerance in faba beans (cultivar, Sakha 4).
Morphological, chemical, and biochemical parameters of plants grown under
saline condition (50 and 100 mM NaCl) were assessed 60 days after sowing."

Score: 0.8992457
Content: Understanding the molecular mechanisms of environmental
salinity stress tolerance and acclimation strategies by photosynthetic
organisms facilitates accelerating the genetic improvement of tolerant
economically important crops. In this study, we have chosen the marine algae
Dunaliella (D.) salina, a high-potential and unique organism that
shows superior tolerance against abiotic stresses, especially hypersaline
conditions.....

The search method is quite robust. There were a few duplicates of the input text in the dataset, and the query listed them first; I have omitted them here. As we can see, the model looked for terms related to 'stress' and 'salinity' to identify similar documents. We can also compare this with the similarity search result from Chromadb.

The similarity search with Chromadb returned:

Score: 0.22408774495124817 
Content: Understanding the molecular mechanisms of environmental salinity
stress tolerance and acclimation strategies by photosynthetic organisms
facilitates accelerating the genetic improvement of tolerant economically
important crops. In this study, we have chosen the marine algae
Dunaliella (D.) salina, a high-potential and unique organism that shows
superior tolerance against abiotic stresses, especially hypersaline conditions.
....

As expected, Chromadb produced similar results, since both approaches use cosine similarity. The main difference is how scoring works: in Chromadb, a smaller number means more similar (it is a distance), whereas Azure Cognitive Search works in the opposite manner (a larger score means more similar).
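To put the two scores on a roughly comparable footing, here is a small sketch. It assumes Chromadb's score is a cosine distance (1 minus cosine similarity), which depends on how the collection's metric is configured; even then, the converted value won't match Azure's exactly, since the scoring pipelines differ.

chroma_distance = 0.22408774495124817  # smaller = more similar
azure_score = 0.8992457                # larger = more similar

# If the Chroma collection uses cosine distance, then 1 - distance
# puts it on the same "larger = more similar" scale as Azure.
chroma_similarity = 1 - chroma_distance
print(f"Chroma as similarity: {chroma_similarity:.3f} vs Azure: {azure_score:.3f}")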

Vector similarity search is just one of the many potential use cases of large language models for quickly identifying relevant documents. In this short article, we also compared the open-source Chromadb with the enterprise Azure Cognitive Search. As the technology advances, this same method could feed into more complicated downstream tasks as requirements demand.

Thank you for reading!
