Tuesday, September 17, 2024

Using PostgreSQL as a vector database in RAG

  1. We install the Postgres extension called pgvector, which allows tables to have columns of type vector, where vector is a set of floats. In this example we use a 768-dimensional vector, i.e. a vector of length 768.
  2. We create a table that will store the articles for our knowledge base: the text of each article, the title of the article, and a vector embedding of the article text. We name the table articles and the columns title, text, and embedding.
  3. We extract the content at four Wikipedia URLs and separate the title and content for each.
  4. We clean each article body, divide the text into chunks of 500 characters, and use an embedding model to create a 768-dimensional vector from each chunk. The vector is a numerical representation (a list of floats) of the meaning of the chunk of text.
  5. We store the title, a chunk from the body, and the embedding vector for the chunk in a row of the database. For each article, there are as many vectors as there are chunks.
  6. We index the vector column for similarity search in Part 2.

import psycopg2
from sentence_transformers import SentenceTransformer
import requests
from bs4 import BeautifulSoup
import re
import ollama

# Your connection params here

MY_DB_HOST = 'localhost'
MY_DB_PORT = 5432
MY_DB_NAME = 'nitin'
MY_DB_USER = 'nitin'
MY_DB_PASSWORD = ''

# Set up the database connection

conn = psycopg2.connect(
    host=MY_DB_HOST,
    port=MY_DB_PORT,
    dbname=MY_DB_NAME,
    user=MY_DB_USER,
    password=MY_DB_PASSWORD
)
cur = conn.cursor()

# Create the articles table with the pgvector extension

# If the pgvector extension is not installed on your machine, it will need to be installed.
# See https://github.com/pgvector/pgvector or use a cloud instance with pgvector installed.
# First create the pgvector extension, then a table with a 768-dimension vector column for embeddings.
# Note that the title and full text of the article are also stored with the embedding.
# This allows vector similarity search on the embedding column, returning matched text
# along with matched embeddings depending on what is needed.
# After this SQL command is executed we will have
#   a) a pgvector extension installed, if it did not already exist
#   b) an empty table with a column of type vector along with two columns,
#      one to store the title of the article and one to store a chunk of text.

# Postgres does not put a limit on the number of dimensions in pgvector embeddings.
# It is worth experimenting with larger lengths, but note they need to match the length of embeddings
# created by the model you use. Embeddings of ~1k, 2k, or more dimensions are common among embeddings APIs.

cur.execute('''
    CREATE EXTENSION IF NOT EXISTS vector;
    DROP TABLE IF EXISTS articles;
    CREATE TABLE articles (
        id SERIAL PRIMARY KEY,
        title TEXT,
        text TEXT,
        embedding VECTOR(768)
    );
''')
conn.commit()
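# Optional sanity check (one possible approach, not required for the demo):
# confirm that pgvector is registered before inserting any embeddings.
cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
print("pgvector version:", cur.fetchone()[0])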

# Below are the sources of content for creating embeddings to be inserted in our demo vector db.
# Feel free to add your own links, but note that sources other than Wikipedia may
# have different junk characters and may require different pre-processing.
# As a start, try other Wikipedia pages, then expand to other sources.

urls = [
'https://en.wikipedia.org/wiki/Pentax_K-x',
'https://en.wikipedia.org/wiki/2008_Tour_de_France',
'https://en.wikipedia.org/wiki/Onalaska,_Texas',
'https://en.wikipedia.org/wiki/List_of_extinct_dog_breeds'
]

# Fetch the HTML at a given link and extract only the text, separating title and content.
# We will use this text to extract content from Wikipedia articles to answer queries.

def extract_title_and_content(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:  # success
            # Create a BeautifulSoup object to parse the HTML content
            soup = BeautifulSoup(response.content, 'html.parser')
            # Extract the title of the page
            title = soup.title.string.strip() if soup.title else ""
            # Extract the text content from the page
            content = soup.get_text(separator=" ")
            return {"title": title, "text": content}
        else:
            print(f"Failed to retrieve content from {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error occurred while retrieving content from {url}: {str(e)}")
        return None

# Create the embedding model

# This is the model we use to generate embeddings, i.e. to encode text chunks into numeric vectors of floats.
# Sentence Transformers (sbert.net) is a collection of transformer models designed for creating embeddings
# from sentences. These are trained on data sets used for different applications. We use one tuned for Q&A,
# hence the 'qa' in the name. There are other embedding models, some tuned for speed, some for breadth, etc.
# The site sbert.net is worth studying for picking the right model for other uses. It is also worth looking
# at the embedding models of providers like OpenAI, Cohere, etc. to learn the differences, but note that
# the use of an online model involves a potential loss of privacy.

embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
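# Optional sanity check (illustrative): the model's output dimension must match the
# VECTOR(768) column declared above. multi-qa-mpnet-base-dot-v1 produces 768-dim vectors.
assert embedding_model.get_sentence_embedding_dimension() == 768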

articles = []
embeddings = []

# Extract title, content from each URL and store them in the list.
for url in urls:
    article = extract_title_and_content(url)
    if article:
        articles.append(article)

for article in articles:
    raw_text = article["text"]
    # Pre-processing: Replace large chunks of white space with a single space, eliminate junk characters.
    # This will vary with each source and will need custom cleanup.
    text = re.sub(r'\s+', ' ', raw_text)
    text = text.replace("]", "").replace("[", "")

    # Chunk into 500-character chunks; this is a typical length, and could be lower if the total size of the article is small.
    chunks = [text[i:i + 500] for i in range(0, len(text), 500)]
    for chunk in chunks:
        # This is where we invoke our model to generate a list of floats.
        # The embedding model returns a numpy ndarray of floats.
        # Psycopg coerces the list into a vector for insertion.
        embedding = embedding_model.encode([chunk])[0].tolist()
        cur.execute('''
            INSERT INTO articles (title, text, embedding)
            VALUES (%s, %s, %s); ''', (article["title"], chunk, embedding)
        )
        embeddings.append(embedding)

conn.commit()
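# Optional sanity check (illustrative): confirm how many chunks were stored per article.
cur.execute("SELECT title, count(*) FROM articles GROUP BY title;")
for title, n_chunks in cur.fetchall():
    print(f"{title}: {n_chunks} chunks")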

# Create an index

# pgvector allows different indexes for similarity search.
# See the docs in the README at https://github.com/pgvector/pgvector for detailed explanations.
# Here we use 'hnsw', an index based on the Hierarchical Navigable Small Worlds model.
# HNSW is a pattern seen in network models of language, hence this is one of the indexes
# that is expected to work well for language embeddings. For this small demo it will probably not
# make much of a difference which index you use, and the others are also worth trying.
# The parameters provided in the 'USING' clause are 'embedding vector_cosine_ops'.
# The first, 'embedding' in this case, needs to match the name of the column that holds embeddings.
# The second, 'vector_cosine_ops', is the operation used for similarity search, i.e. cosine similarity.
# The same README on GitHub gives other choices, but for most common uses it makes little difference,
# hence cosine similarity is used as our default.

cur.execute('''
    CREATE INDEX ON articles USING hnsw (embedding vector_cosine_ops);
''')
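# A possible alternative, commented out: pgvector also supports an IVFFlat index, which
# builds faster and uses less memory than HNSW, at some cost in query speed and recall.
# The 'lists' value below is an illustrative guess, not a tuned setting.
# cur.execute('''
#     CREATE INDEX ON articles USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
# ''')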

conn.commit()
cur.close()
conn.close()

# End of file

Part 2. Retrieve context from the vector database and query the LLM

In Part 2 we ask a natural language question of our knowledge base, using similarity search to find a context and using an LLM (in this case Meta's Llama 3) to generate an answer to the question in the provided context. The steps:

  1. We encode our natural language query as a vector using the same embedding model we used to encode the chunks of text we extracted from the Wikipedia pages.
  2. We perform a similarity search on this vector using a SQL query. Similarity, or specifically cosine similarity, is a way to find the vectors in our database that are nearest to the query vector. Once we find the nearest vectors, we can use them to retrieve the corresponding text that is stored with each vector. That is the context for our query to the LLM.
  3. We append this context to our natural language query text, explicitly telling the LLM that the provided text is to be taken as the context for answering the query.
  4. We use a programmatic wrapper around Ollama to pass the natural language query and contextual text to the LLM's request API and fetch the response. We submit three queries, and we receive the answer in context for each query. A sketch of these steps and an example screenshot for the first query are shown below.
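What follows is a minimal sketch of these four steps, not the article's exact listing. It reuses the connection parameters and embedding model from Part 1; the sample question, the prompt wording, the LIMIT of 5, and the use of pgvector's <=> cosine distance operator with the ollama Python client are illustrative assumptions.

import psycopg2
import ollama
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')

conn = psycopg2.connect(host='localhost', port=5432, dbname='nitin', user='nitin', password='')
cur = conn.cursor()

question = "What kind of sensor does the Pentax K-x have?"

# Step 1: encode the question with the same model used for the stored chunks.
query_embedding = embedding_model.encode([question])[0].tolist()

# Step 2: cosine similarity search. <=> is pgvector's cosine distance operator,
# so ordering ascending returns the nearest chunks first.
cur.execute('''
    SELECT title, text FROM articles
    ORDER BY embedding <=> %s::vector
    LIMIT 5;
''', (query_embedding,))
context = "\n\n".join(text for _, text in cur.fetchall())

# Step 3: append the retrieved context to the question.
prompt = (f"Answer the question using only the context below.\n\n"
          f"Context:\n{context}\n\nQuestion: {question}")

# Step 4: pass the prompt to Llama 3 via the Ollama wrapper and print the answer.
response = ollama.chat(model='llama3', messages=[{'role': 'user', 'content': prompt}])
print(response['message']['content'])

cur.close()
conn.close()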
[Screenshot: Postgres RAG query example]
