Ingest content and generate embeddings using OpenAI and python

What are embeddings?

  • OpenAI embeddings are vector representations of text produced by OpenAI's embedding models. These high-dimensional vectors capture semantic and syntactic information, enabling tasks such as semantic search, clustering, and classification. Because the underlying Transformer architecture is context-aware, these embeddings are more powerful and flexible than traditional static word embeddings.
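As a toy illustration of how "semantic closeness" is measured between embedding vectors, here is cosine similarity on tiny hand-made vectors (real OpenAI embeddings have hundreds or thousands of dimensions; the values below are made up for the sketch):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for three words.
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.15, 0.05]
car = [0.0, 0.2, 0.95]

print(cosine_similarity(cat, kitten))  # close to 1.0: similar meaning
print(cosine_similarity(cat, car))     # much lower: unrelated
```

Vectors pointing in nearly the same direction score close to 1.0, which is what makes similarity search over embeddings possible.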

Steps

  1. Read the file
  2. Split the file into smaller chunks (every LLM has a token limit)
  3. Use FAISS to create an index
  4. Use pickle to serialize the data and store it on disk

1. Read File

from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
import faiss
import pickle

loader = PyPDFLoader("file.pdf")

2. Split file into smaller chunks

pages = loader.load_and_split()
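`load_and_split` handles the chunking for you. Conceptually, the splitting step looks like this sketch (a simple fixed-size character splitter with overlap; the `chunk_text` helper and the sizes are illustrative, not LangChain's actual implementation or defaults):

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    # Split text into windows of chunk_size characters, overlapping
    # by `overlap` so content cut at a boundary still appears whole
    # in at least one chunk.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

doc = "x" * 2500
chunks = chunk_text(doc)
print(len(chunks))     # → 3 chunks for a 2500-character document
print(len(chunks[0]))  # → 1000, each chunk stays under the limit
```

Overlap between chunks is a common trick so that a sentence split at a chunk boundary can still be embedded in full.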

3. Use FAISS to create an index

You need to export your OpenAI API key (the OPENAI_API_KEY environment variable), otherwise you’ll get an authentication error at this step.

store = FAISS.from_documents(pages, OpenAIEmbeddings())
# Save the raw FAISS index separately; it cannot be pickled directly
faiss.write_index(store.index, "docs.index")
# Detach the index so the rest of the store can be pickled in the next step
store.index = None
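Under the hood, a FAISS index answers nearest-neighbour queries over the stored vectors. A brute-force sketch of that idea in plain Python (FAISS itself uses optimized and often approximate algorithms, but a flat index computes exactly this L2 search):

```python
import math

def nearest(query, vectors):
    # Return the position of the stored vector closest to `query`
    # by Euclidean (L2) distance.
    def l2(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(vectors)), key=lambda i: l2(query, vectors[i]))

stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(nearest([0.9, 1.1], stored))  # → 1, the vector [1.0, 1.0]
```

At query time, the question is embedded with the same model and the index returns the chunks whose vectors are closest to it.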

4. Use pickle to serialize the data and store it on disk

with open("faiss_store_pdf.pkl", "wb") as f:
    pickle.dump(store, f)
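To load the store back later, reverse the two save steps: unpickle the object, then reattach the index with `faiss.read_index("docs.index")`. The pickle round trip itself, sketched on a plain dictionary standing in for the store (an in-memory buffer replaces the file so the sketch is self-contained):

```python
import io
import pickle

# Stand-in for the vector store object: the real one holds the chunk
# texts and metadata, while the FAISS index is saved separately.
store_data = {"texts": ["chunk one", "chunk two"],
              "metadata": [{"page": 1}, {"page": 2}]}

buffer = io.BytesIO()          # in-memory stand-in for the .pkl file
pickle.dump(store_data, buffer)

buffer.seek(0)
restored = pickle.load(buffer)
print(restored == store_data)  # → True: full round trip
```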

You can also use other storage backends for your embeddings.

After the data is ingested, we can move on to the part where we search for similarities using GPT and embeddings.
