Ingest content and generate embeddings using OpenAI and Python

What are embeddings?

  • OpenAI embeddings are vector representations of text (words, sentences, or whole documents) produced by OpenAI's embedding models, which are built on the same Transformer architecture as GPT-3 and GPT-4. These high-dimensional vectors capture semantic and syntactic information, so texts with similar meaning end up close together in vector space. Because the Transformer takes the surrounding context into account, the embeddings are context-dependent, which makes them more powerful and flexible than traditional static word embeddings.
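To make the idea of "close together in vector space" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are made up for illustration and are not real OpenAI embeddings, which have far more dimensions; `cosine_similarity` and `toy_embeddings` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" (assumption: values chosen by hand).
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

cat_vs_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
cat_vs_invoice = cosine_similarity(toy_embeddings["cat"], toy_embeddings["invoice"])

# Related words score close to 1, unrelated words close to 0.
print(f"cat vs kitten:  {cat_vs_kitten:.3f}")
print(f"cat vs invoice: {cat_vs_invoice:.3f}")
```

A vector index such as Faiss performs this kind of comparison efficiently over many vectors at once.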


Ingestion takes four steps:

  1. Read the file
  2. Split the file into smaller chunks (every LLM limits how much text it accepts at once)
  3. Use Faiss to create an index
  4. Use Pickle to serialize and store the data on disk

1. Read File

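# PyPDFLoader reads PDF files (it needs the pypdf package installed)
from langchain.document_loaders import PyPDFLoader
# FAISS is LangChain's wrapper around the Faiss vector index
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
import faiss
import pickle

# Point the loader at the PDF to ingest
loader = PyPDFLoader("file.pdf")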

2. Split file into smaller chunks

# Load the PDF and split it into chunks, returned as a list of LangChain documents
pages = loader.load_and_split()
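To illustrate what splitting into chunks means, here is a minimal sketch of a fixed-size splitter with overlap. It is not LangChain's implementation; `split_into_chunks`, `chunk_size`, and `overlap` are hypothetical names, and the sizes are counted in characters rather than tokens to keep the example simple.

```python
def split_into_chunks(text, chunk_size=1000, overlap=100):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a chunk boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_into_chunks("abcdefghij", chunk_size=4, overlap=1))
```

The overlap trades some duplicated text for a lower chance of losing context at chunk boundaries.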

3. Use Faiss to create an index

You need to export your OpenAI API key as the OPENAI_API_KEY environment variable; otherwise this step fails with an error.
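OpenAIEmbeddings reads the key from the OPENAI_API_KEY environment variable. A small check like the one below can fail early with a clearer message; this is only a sketch, and `require_openai_key` is a hypothetical helper, not part of LangChain or the OpenAI library.

```python
import os


def require_openai_key(env=None):
    """Return the OpenAI API key, or raise if it is not set.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before creating embeddings."
        )
    return key
```

Calling `require_openai_key()` once before building the index surfaces a missing key before any document is processed.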

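# Embed every chunk with OpenAI and build a FAISS vector store from the results
store = FAISS.from_documents(pages, OpenAIEmbeddings())
# Save the raw Faiss index to its own file
faiss.write_index(store.index, "docs.index")
# Detach the index (already saved above) so the rest of the store can be pickled
store.index = None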

4. Use Pickle to serialize and store the data on disk

with open("faiss_store_pdf.pkl", "wb") as f:
    pickle.dump(store, f)
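To use the saved data later, the index file and the pickled store have to be put back together. Here is a minimal sketch, assuming the file names used above; `load_store` and its `read_index` parameter are hypothetical names introduced for illustration. By default it uses `faiss.read_index`, the counterpart of `faiss.write_index`.

```python
import pickle


def load_store(index_path="docs.index", store_path="faiss_store_pdf.pkl",
               read_index=None):
    """Rebuild the vector store from the index file and the pickle file."""
    if read_index is None:
        # Imported lazily so the function can be tested without Faiss installed.
        import faiss
        read_index = faiss.read_index

    index = read_index(index_path)
    # Only unpickle files you created yourself: pickle can execute code on load.
    with open(store_path, "rb") as f:
        store = pickle.load(f)
    # Reattach the index that was detached before pickling.
    store.index = index
    return store
```

Unpickling the store requires the same libraries that were installed when it was saved, LangChain in this case.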

You can also store your embeddings elsewhere, for example in a dedicated vector database.

After the data is ingested, we can move on to the part where we search and find similarities using GPT and embeddings.

