What are embeddings?
- OpenAI embeddings are vector representations of text produced by OpenAI's embedding models (for example, text-embedding-ada-002), which share the Transformer architecture of models like GPT-3 and GPT-4. These high-dimensional vectors capture semantic and syntactic information, enabling tasks such as semantic search, clustering, and classification. Because the Transformer produces context-dependent representations, these embeddings are more powerful and flexible than traditional static word embeddings such as word2vec or GloVe.
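As a quick illustration, here is a minimal sketch of what an embedding looks like in practice, assuming the OPENAI_API_KEY environment variable is set; the sample sentence is arbitrary:

from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
# embed_query returns one high-dimensional vector (a list of floats) for the input text
vector = embeddings.embed_query("Cats are small domesticated carnivores.")
print(len(vector))  # e.g. 1536 dimensions for text-embedding-ada-002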
Steps
- Read file
- Split the file into smaller chunks (every LLM has a token limit)
- Use FAISS to create an index
- Use pickle to serialize the data and store it on disk
1. Read File
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
import faiss
import pickle

# load the PDF; each page becomes a Document
loader = PyPDFLoader("file.pdf")
2. Split the file into smaller chunks
pages = loader.load_and_split()  # load the PDF and split it into chunks using the default text splitter
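load_and_split() uses LangChain's default text splitter. If you want explicit control over chunk size and overlap, a sketch like the following should work (the chunk_size and chunk_overlap values are illustrative, not recommendations):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# illustrative values; tune them to your documents and your model's token limit
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = loader.load()                     # one Document per page
pages = splitter.split_documents(docs)   # smaller, overlapping chunks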
3. Use FAISS to create an index
You need to export your OpenAI API key (OPENAI_API_KEY), otherwise you'll get an authentication error at this step.
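You can export the variable in your shell before running the script, or set it from Python. A minimal sketch, where the key value is only a placeholder:

import os

# placeholder value; use your real key or export OPENAI_API_KEY in the shell instead
os.environ["OPENAI_API_KEY"] = "sk-..."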
# embed every chunk with OpenAI and build a FAISS index over the vectors
store = FAISS.from_documents(pages, OpenAIEmbeddings())
# save the raw FAISS index to disk, then detach it so the store can be pickled
faiss.write_index(store.index, "docs.index")
store.index = None
4. Use pickle to serialize and store the data on disk
# persist the store (docstore and metadata, without the raw index) to disk
with open("faiss_store_pdf.pkl", "wb") as f:
    pickle.dump(store, f)
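To use the store later, you reverse the process: read the raw index back with faiss.read_index, unpickle the store, and reattach the index. A minimal sketch, assuming the two files created above:

import faiss
import pickle

# load the raw FAISS index and the pickled store, then reattach the index
index = faiss.read_index("docs.index")
with open("faiss_store_pdf.pkl", "rb") as f:
    store = pickle.load(f)
store.index = index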
You can also store your embeddings elsewhere, for example in a dedicated vector database.
After the data is ingested, we can move on to the part where we search for similar chunks and answer questions using GPT and the embeddings.
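As a preview of that step, once the store is loaded and its index reattached (as in the sketch above), you can query it directly; the question text here is just an example:

# find the chunks most similar to the question (k is the number of results)
results = store.similarity_search("What is this document about?", k=3)
for doc in results:
    print(doc.page_content[:200])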