Ingest content and generate embeddings using OpenAI and Python

04 Jan, 2023

What are embeddings?

OpenAI embeddings are vector representations of text generated by Transformer-based models from the same family as GPT-3 and GPT-4. These high-dimensional vectors capture semantic and syntactic information, so texts with similar meaning end up close together in vector space, which enables NLP tasks such as search, clustering, and classification. Because the Transformer architecture takes the surrounding context into account, the resulting embeddings are context-dependent, which makes them more powerful and flexible than traditional static word embeddings.
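To make "close together in vector space" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embedding vectors. The three vectors are made-up toy values, not real OpenAI embeddings, and `cosine_similarity` is a helper written for this illustration only.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, invented for illustration.
# Real embeddings have far more dimensions.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

# Related texts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))
```

A score near 1 means the vectors point in almost the same direction, a score near 0 means they are unrelated.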

Steps

Read the file
Split the file into smaller chunks (every LLM has a limit on input size)
Use Faiss to create an index
Use Pickle to serialize and store the data to the disk

1. Read File

from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
import faiss
import pickle

# Point the loader at the PDF; nothing is read until load_and_split() is called
loader = PyPDFLoader("file.pdf")

2. Split file into smaller chunks

pages = loader.load_and_split()
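To show what chunking means in practice, here is a simplified sketch of fixed-size splitting with overlap. It is not LangChain's splitter: `split_text`, `chunk_size`, and `overlap` are names chosen for this illustration, and the real implementation is more careful about where it cuts.

```python
def split_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence
    cut at a boundary still appears whole in one of the chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Smaller chunks give more precise matches at search time, while larger ones keep more context together, so the right size depends on the content.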

3. Use Faiss to create an index

You need to export your OpenAI API key as the OPENAI_API_KEY environment variable, otherwise you'll get an error at this step.
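If you prefer to fail early with a clear message, a small check like the following can run before the embeddings are created. This is a sketch: `require_api_key` is a hypothetical helper, not part of LangChain or the OpenAI library.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a readable error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before running the script."
        )
    return key
```

Calling `require_api_key()` at the top of the script stops it before any request is made when the key is missing.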

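# Embed every chunk with OpenAI and build a FAISS vector store from the result
store = FAISS.from_documents(pages, OpenAIEmbeddings())
# Save the raw FAISS index to its own file
faiss.write_index(store.index, "docs.index")
# Detach the index so the rest of the store can be pickled in the next step
store.index = None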

4. Use Pickle to serialize and store the data to the disk

with open("faiss_store_pdf.pkl", "wb") as f:
    pickle.dump(store, f)
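Loading the data back is the reverse of the two save steps: unpickle the store, then reattach the index read from its own file. The sketch below assumes the file names used above. `load_store` and its `read_index` parameter are names made up for this example, and the parameter exists only so the function can be tried without faiss installed.

```python
import pickle


def load_store(index_path, store_path, read_index=None):
    """Rebuild the vector store from the index file and the pickle file."""
    if read_index is None:
        # Imported lazily so the function can be exercised without faiss
        import faiss
        read_index = faiss.read_index

    # Only unpickle files you created yourself: pickle can execute code on load
    with open(store_path, "rb") as f:
        store = pickle.load(f)

    # Reattach the index that was detached before pickling
    store.index = read_index(index_path)
    return store
```

With the files from this post, the call would be `store = load_store("docs.index", "faiss_store_pdf.pkl")`.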

You can also store your embeddings somewhere other than the local disk.

Once the data is ingested, we can move on to the part where we search and find similarities using GPT and embeddings.

#markdown #python #langchain #openai
