This is the second of two articles.
In the first article, we discussed three challenges for developers when building GPT applications with an open-source stack, such as LangChain. Let’s now use LangChain for a practical example where we want to store and analyze PDF documents.
We’ll obtain a PDF document, divide it into smaller parts, save the document text and its vector representations (embeddings*) in a database system, and then query it. We’ll also use a GPT to help answer a question.
*In a GPT, an embedding is simply a numerical representation of a word or phrase. Vectors represent the semantic meaning of words and phrases in a way that a machine-learning model can understand.
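To make this concrete, here is a minimal sketch of how two embedding vectors can be compared for semantic similarity. The three-dimensional vectors below are made up for illustration; real OpenAI embeddings have many more dimensions:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors, divided by the product of their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" -- purely illustrative values.
cat = [0.8, 0.1, 0.1]
kitten = [0.7, 0.2, 0.1]
car = [0.1, 0.9, 0.0]

print(cosine_similarity(cat, kitten))  # high score: similar meaning
print(cosine_similarity(cat, car))     # lower score: different meaning
```

Words with similar meanings end up with similar vectors, so a numerical comparison of vectors becomes a comparison of meaning.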
Create a SingleStoreDB Cloud Account
First, sign up for a free SingleStoreDB Cloud account. Once logged in, select CLOUD > Create new workspace group from the left-hand navigation pane. Next, choose Create Workspace and work through the wizard. Here are the recommended settings for this example:
Create Workspace Group
Workspace Group Name: LangChain Demo Group
Cloud Provider: AWS
Region: US East 1 (N. Virginia)
Click Next.
Create Workspace
Workspace Name: langchain-demo
Size: S-00
Click Create Workspace.
Once the workspace is created and available, from the left-hand navigation pane, select DEVELOP > SQL Editor to create a new database, as follows:
CREATE DATABASE IF NOT EXISTS pdf_db;
Create a Notebook
From the left-hand navigation pane, select DEVELOP > Notebooks. In the top right of the web page, select New Notebook > New Notebook, as shown in Figure 1 below.
We’ll call the notebook langchain_demo. Select a Blank notebook template from the available options.
We’ll also select the Connection and Database using the drop-down menus above the notebook, as shown in Figure 2.
Fill out the Notebook
First, we’ll install some libraries:
!pip install langchain --quiet
!pip install openai --quiet
!pip install pdf2image --quiet
!pip install tabulate --quiet
!pip install tiktoken --quiet
!pip install unstructured --quiet
Next, we’ll read in a PDF document. This is an article by Neal Leavitt titled “Whatever Happened to Object-Oriented Databases?” OODBs were an emerging technology during the late 1980s and early 1990s. We’ll add leavcom.com to the firewall by selecting the Edit Firewall option in the top right. Once the address has been added to the firewall, we’ll read the PDF file:
from langchain.document_loaders import OnlinePDFLoader

loader = OnlinePDFLoader("http://leavcom.com/pdf/DBpdf.pdf")
data = loader.load()
We can use LangChain’s OnlinePDFLoader, which makes reading a PDF file easier.
Next, we’ll get some information on the document:
from langchain.text_splitter import RecursiveCharacterTextSplitter

print(f"You have {len(data)} document(s) in your data")
print(f"There are {len(data[0].page_content)} characters in your document")
The output should be:
You have 1 document(s) in your data
There are 13040 characters in your document
We’ll now split the document into pages containing 2,000 characters each, giving us seven pages:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 2000, chunk_overlap = 0)
texts = text_splitter.split_documents(data)

print(f"You have {len(texts)} pages")
Next, we’ll create a table to store the text and embeddings. We can do this directly using the %%sql magic command:
%%sql
USE pdf_db;

DROP TABLE IF EXISTS pdf_docs;
CREATE TABLE IF NOT EXISTS pdf_docs (
    id INT PRIMARY KEY,
    text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    embedding BLOB
);
To use Python code to connect to our database, we can use the built-in connection_url, as follows:
from sqlalchemy import *

db_connection = create_engine(connection_url)
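The connection_url variable is predefined inside a SingleStoreDB notebook. If you were running this code elsewhere, you could build an equivalent SQLAlchemy URL yourself; SingleStoreDB speaks the MySQL wire protocol, so a MySQL-style URL works. The host and credentials below are placeholders, not real values:

```python
# Placeholder values -- substitute the details from your own workspace.
user = "admin"
password = "your-password"
host = "svc-xxxx.svc.singlestore.com"  # hypothetical workspace hostname
port = 3306
database = "pdf_db"

# MySQL-style SQLAlchemy URL, usable with create_engine().
connection_url = f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}"
print(connection_url)
```

This would also require the pymysql driver to be installed.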
We’ll set our OpenAI API key:
import openai

openai.api_key = "OpenAI API Key"
and use LangChain’s OpenAIEmbeddings:
from langchain.embeddings import OpenAIEmbeddings

embedder = OpenAIEmbeddings(openai_api_key = openai.api_key)
Now we’re ready to obtain the vector embeddings and store them in the database system:
db_connection.execute("TRUNCATE TABLE pdf_docs")

for i, doc in enumerate(texts):
    text_content = doc.page_content
    embedding = embedder.embed_documents([text_content])[0]
    stmt = """
        INSERT INTO pdf_docs (
            id,
            text,
            embedding
        )
        VALUES (
            %s,
            %s,
            JSON_ARRAY_PACK_F32(%s)
        )
    """
    db_connection.execute(stmt, (i+1, text_content, str(embedding)))
We truncate the table to ensure that we start with an empty table. Then we iterate through the pages of text, obtain the embeddings from OpenAI, and store the text and embeddings in the database table.
We can now ask a question, as follows:
query_text = "Will object-oriented databases be commercially successful?"
query_embedding = embedder.embed_documents([query_text])[0]

stmt = """
    SELECT text, DOT_PRODUCT_F32(JSON_ARRAY_PACK_F32(%s), embedding) AS score
    FROM pdf_docs
    ORDER BY score DESC
    LIMIT 1
"""

results = db_connection.execute(stmt, str(query_embedding))

for row in results:
    print(row[0])
Here we convert the question into vector embeddings, perform a DOT_PRODUCT operation and return only the highest-scoring value.
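The ranking that the SQL query performs can be mimicked in plain Python, which may make the scoring step clearer. The documents and vectors below are made-up toy values, not real embeddings:

```python
# Stored "documents": (text, embedding) pairs with toy 3-d vectors.
docs = [
    ("OODBs target niche markets", [0.9, 0.1, 0.0]),
    ("Relational databases dominate", [0.2, 0.8, 0.1]),
    ("CAD tools use object storage", [0.7, 0.3, 0.1]),
]

query_embedding = [0.8, 0.2, 0.1]

def dot(a, b):
    # Plain-Python stand-in for scoring with DOT_PRODUCT_F32.
    return sum(x * y for x, y in zip(a, b))

# Equivalent of ORDER BY score DESC LIMIT 1: keep the best-scoring row.
best = max(docs, key=lambda d: dot(query_embedding, d[1]))
print(best[0])  # prints "OODBs target niche markets"
```

The database does the same thing, but at scale and without pulling every embedding back into Python.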
Finally, we can use a GPT to provide an answer, based on the earlier question:
prompt = f"The user asked: {query_text}. The most relevant text from the document is: {row[0]}"

response = openai.ChatCompletion.create(
    model = "gpt-3.5-turbo",
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
)

print(response['choices'][0]['message']['content'])
Here is some example output:
Based on the information provided in the document, it seems that object-oriented databases are not expected to be commercially successful in the near future. While they are gaining some popularity in niche markets such as CAD and telecommunications, relational databases continue to dominate the market and are expected to do so for the foreseeable future. IDC predicts that the growth rate for relational databases will be significantly higher than that of OO databases by 2004. However, OO databases still have their place in certain niche markets.
Summary
In this example, we saw the benefits of LangChain in the application development process. We also saw how easily we can convert documents from one format to another, store the content in a database system, generate vector embeddings and ask questions about the data stored in the database system. We also have the full power of SQL available if we’re interested in performing additional query operations on the data.
I’ll host a workshop on June 22 and will walk through building a ChatGPT application using LangChain. I hope you can join. Sign up here.