Query Your PostgreSQL Database with LangChain and Llama 3.1: Exploring LLMs - 1
One of the first use cases for large language models (LLMs) is creating chatbots that can answer questions based on a custom database. Imagine a dashboard where instead of custom graphs and analytics tools, you can launch a chatbot to query your SQL database and provide context to the end user directly.
In this tutorial, we will create a simple implementation that takes a question and generates a valid SQL query to return the correct results from your database.
How does it work?
We'll be using a PostgreSQL database along with Llama 3.1 as our LLM to query the database. This tutorial assumes you already have an LLM up and running; if that's not the case, please follow this tutorial to set up Llama 3.1.
LangChain provides several key functionalities that significantly simplify integrating natural language queries with database operations. We'll be using it for language model integration, database connectivity and abstraction, creating a query chain, and natural language to SQL translation.
Step 1: Set up your environment
I'm executing my Python scripts inside a conda environment. This isn't a hard requirement, but it keeps your coding environment cleaner.
If you don't have miniconda/conda installed, you can follow the process mentioned here.
Once you are done setting it up, create and activate a new conda env by running the following commands:
conda create --name llm-chatbot python=3.8
conda activate llm-chatbot
Great! Now that your conda env is set up, let's install the necessary dependencies.
Step 2: Installing dependencies
pip install langchain-community
pip install psycopg2
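Note that psycopg2 is compiled against the PostgreSQL client libraries. If that build fails on your machine, the prebuilt psycopg2-binary package is a commonly used drop-in alternative for development setups:
pip install psycopg2-binary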
Step 3: Start the implementation!
At this point, you can create a new Python file or notebook and import the local LLM.
from langchain_community.llms import Ollama
llm = Ollama(model = "llama3.1")
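Before going any further, it's worth a quick sanity check that the local model actually responds (this assumes the Ollama server is running and the llama3.1 model has already been pulled):
# Quick sanity check that the local Llama 3.1 model is reachable
print(llm.invoke("Reply with the single word: ready"))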
A few more important imports we'll need are for the SQL database connection and for creating something called "chains" (we'll discuss these later):
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain
Great, now we need to connect to our database.
Note that this step assumes you already have a local database set up; if that's not the case, you can use any SQL database instance.
The database I'm using looks like this:
Table 1: Employees
EmployeeID | FirstName | LastName | DepartmentID | Salary
1          | John      | Doe      | 101          | 60000
2          | Jane      | Smith    | 102          | 65000
3          | Emily     | Johnson  | 101          | 70000
4          | Michael   | Brown    | 103          | 55000
5          | Sarah     | Davis    | 102          | 72000
Table 2: Departments
DepartmentID | DepartmentName  | Location
101          | Human Resources | Delhi
102          | IT              | San Francisco
103          | Marketing       | Chicago
db = SQLDatabase.from_uri("postgresql://user-name:@localhost:5432/postgres")
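The connection string follows the standard SQLAlchemy URI format. If your database requires a password, the call would look something like this (the credentials and database name below are placeholders, not values from my setup):
# General format: postgresql://<username>:<password>@<host>:<port>/<database>
db = SQLDatabase.from_uri("postgresql://my_user:my_password@localhost:5432/company_db")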
Now, this function should return the available tables in your database (Departments and Employees).
db.get_usable_table_names()
#['Departments', 'Employees']
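You can also print the schema snippet LangChain will hand to the model; it's a quick way to confirm the connection and tables look right:
# Show the dialect and the CREATE TABLE statements (plus sample rows) LangChain passes to the LLM
print(db.dialect)
print(db.get_table_info())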
Now our next step would be creating a chain.
In LangChain, a "chain" is a sequence of processing steps where each step's output becomes the input for the next step. The idea is to create a pipeline that integrates multiple components like language models, templates, and data retrieval systems to handle complex natural language processing tasks. This modular approach allows for flexible and customizable workflows.
We'll be creating a very simple chain with just a single link, like this:
chain = create_sql_query_chain(llm = llm, db = db)
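If you're curious what create_sql_query_chain actually sends to the model, recent LangChain versions let you inspect the underlying prompt template (this is optional, and the exact method may vary between versions):
# Print the prompt template the chain fills in and sends to the LLM
chain.get_prompts()[0].pretty_print()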
Now that our chain is created, it's time to invoke it and ask questions! Before doing that, note that although we have given the LLM access to the database, it still doesn't know the context of the tables or what the data means. Providing this context is not strictly necessary, but it helps improve the accuracy of the generated queries.
We'll do this by passing some additional information about the tables, embedded into our question, like this:
database_description = (
"The database consists of two tables: `public.employees_table` and `public.departments_table`. This is a PostgreSQL database, so you need to use postgres-related queries.\n\n"
"The `public.employees_table` table records details about the employees in a company. It includes the following columns:\n"
"- `EmployeeID`: A unique identifier for each employee.\n"
"- `FirstName`: The first name of the employee.\n"
"- `LastName`: The last name of the employee.\n"
"- `DepartmentID`: A foreign key that links the employee to a department in the `public.departments_table` table.\n"
"- `Salary`: The salary of the employee.\n\n"
"The `public.departments_table` table contains information about the various departments in the company. It includes:\n"
"- `DepartmentID`: A unique identifier for each department.\n"
"- `DepartmentName`: The name of the department.\n"
"- `Location`: The location of the department.\n\n"
"The `DepartmentID` column in the `public.employees_table` table establishes a relationship between the employees and their respective departments in the `public.departments_table` table. This foreign key relationship allows us to join these tables to retrieve detailed information about employees and their departments."
)
response = chain.invoke({"question": database_description + " How many employees have salary above 70k?"})
And that's it!
If you print the response, you should see that it contains an SQL query. Since the query is usually accompanied by some additional text, we'll use a small parsing helper to recover just the SQL query.
import re

def extract_sql_query(response):
    # Define the regular expression pattern to match the SQL query
    pattern = re.compile(r'SQLQuery:\s*(.*)')
    # Search for the pattern in the response
    match = pattern.search(response)
    if match:
        # Extract and return the matched SQL query
        return match.group(1).strip()
    else:
        return None
sql_query = extract_sql_query(response)
Now if we print sql_query, it should contain the final cleaned-up query that we can execute to get the result:
SELECT COUNT(*) FROM public.employees_table WHERE "Salary" > 70000
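Depending on the model and prompt, the response may instead wrap the query in a markdown code fence rather than the SQLQuery: prefix. A small fallback for that case (a sketch, not something I needed for this run) could look like this:
# Fallback: if no "SQLQuery:" line was found, try a fenced ```sql block
if sql_query is None:
    fenced = re.search(r"```sql\s*(.*?)```", response, re.DOTALL)
    if fenced:
        sql_query = fenced.group(1).strip()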
To execute the query, you can do the following.
result = db.run(sql_query)
print(result)
#1
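To tie the steps together, here is a small helper (ask_database is just an illustrative name) that reuses the chain, table description, parser, and database connection defined above to answer a question end to end:
def ask_database(question):
    """Generate SQL for a natural-language question, extract it, and run it."""
    response = chain.invoke({"question": database_description + " " + question})
    sql_query = extract_sql_query(response)
    if sql_query is None:
        raise ValueError("Could not find an SQL query in the model response")
    return db.run(sql_query)

print(ask_database("Which department is located in Chicago?"))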
Conclusion
You have now created a simple tool that queries a PostgreSQL database using natural language. This implementation demonstrates the potential of integrating LLMs with databases to provide a more intuitive user experience. While this is a basic query tool, it opens up many exciting possibilities for future enhancements.
In the next steps, we will explore adding memory to this implementation, enabling it to retain context and build upon previous interactions. Additionally, we'll incorporate features such as replying in natural language, drawing graphs to visualize query results, and calling functions using LLMs to perform more complex operations. These advancements will transform our chatbot from a straightforward query tool into a powerful interactive assistant capable of sophisticated data analysis and user interaction.