Vectors, the output format of neural network models, encode information compactly and play a pivotal role in AI applications such as knowledge bases, semantic search, and Retrieval-Augmented Generation (RAG).
LanceDB is an open-source, serverless vector database that is embedded directly into your application. In this guide, we will walk you through how to set up LanceDB locally within minutes and use the Python client library to store and search vectors.
This notebook takes the in-memory pandas DataFrame from Chapter 8’s rag_first_principles.ipynb and swaps it for a persistent vector database. The chunking and cosine-similarity ideas are identical; only the storage layer changes.
Install Dependencies¶
First, we will install the required libraries: lancedb, sentence-transformers, pandas, and pydantic.
!pip install lancedb sentence-transformers pandas pydantic
Set Up Database and Embedding Model¶
To create a local LanceDB vector database, connect to a local directory; it plays the role of a .db file, except that the database is a folder. We will also define our embedding model using the sentence-transformers registry within LanceDB.
Importing Libraries¶
We import LanceDB along with pandas, PyArrow, and the LanceDB embeddings registry. The Pydantic integration lets us define a typed schema that LanceDB uses to validate and auto-embed data.
import lancedb
import pandas as pd
import pyarrow as pa
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
import os
from google.colab import userdata
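To make "validate and auto-embed" more concrete, here is a conceptual sketch in plain Python. It is not LanceDB's implementation: toy_embed, PaperRow, and add_rows are hypothetical names, and the character-bucket embedding is only a stand-in for a real model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def toy_embed(text: str, dim: int = 4) -> List[float]:
    """Hypothetical stand-in for an embedding model: buckets character codes."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


@dataclass
class PaperRow:
    Title: str
    Abstract: str                                       # plays the role of the source field
    vector: List[float] = field(default_factory=list)   # filled in on insert


def add_rows(table: List[PaperRow], raw_rows: List[dict],
             embed: Callable[[str], List[float]] = toy_embed) -> None:
    for raw in raw_rows:
        # Validation step: reject rows whose text fields have the wrong type.
        for name in ("Title", "Abstract"):
            if not isinstance(raw.get(name), str):
                raise TypeError(f"{name} must be a string, got {raw.get(name)!r}")
        # Auto-embed step: derive the vector from the source field.
        table.append(PaperRow(raw["Title"], raw["Abstract"], embed(raw["Abstract"])))


table: List[PaperRow] = []
add_rows(table, [{"Title": "Toy paper", "Abstract": "Blockchain adoption in firms"}])
print(len(table[0].vector))  # 4
```

The caller supplies only text; the vector column is derived at insert time, which is the behavior the typed schema is meant to give us.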
Connecting to the Database and Selecting an Embedding Model¶
LanceDB stores data in a local directory, so there is no server to manage. We select the all-MiniLM-L6-v2 sentence-transformer as our embedding model; it produces 384-dimensional vectors and offers a good balance of speed and quality.
# Connects to a local directory (like a .db file but for folders)
db = lancedb.connect("./lancedb_is_research")
# Select the embedding model
registry = get_registry().get("sentence-transformers")
model = registry.create(name="all-MiniLM-L6-v2")
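As a rough sense of scale, the arithmetic below estimates the raw size of the stored vectors. It is a back-of-the-envelope sketch that assumes each dimension is stored as a 32-bit float and ignores the text columns and any file-format overhead; the row count of 5,000 is a hypothetical figure, not the size of the dataset used here.

```python
DIM = 384               # dimensions produced by all-MiniLM-L6-v2
BYTES_PER_FLOAT32 = 4   # assumption: vectors are stored as 32-bit floats


def vector_storage_bytes(n_rows: int, dim: int = DIM,
                         bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Raw bytes needed for n_rows vectors, excluding metadata and overhead."""
    return n_rows * dim * bytes_per_value


print(vector_storage_bytes(1))                       # 1536
print(f"{vector_storage_bytes(5000) / 1e6:.2f} MB")  # 7.68 MB
```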
Setting the Hugging Face Token¶
While optional today, providing a Hugging Face token avoids rate limits and ensures reliable access to sentence-transformer model weights.
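If the notebook might also run outside Colab, or without the secret configured, the token lookup can be made tolerant of a missing token. The following is a minimal sketch under that assumption; resolve_hf_token is a hypothetical helper, and it treats any failure to read the Colab secret as "no token".

```python
import os


def resolve_hf_token(env_var="HF_TOKEN", colab_secret="HUGGINGFACE_TOKEN"):
    """Return a Hugging Face token if one is available, else None."""
    token = os.environ.get(env_var)
    if token:
        return token
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return None
    try:
        return userdata.get(colab_secret)
    except Exception:
        # Missing secret or access not granted: continue without a token.
        return None


token = resolve_hf_token()
if token:
    os.environ["HF_TOKEN"] = token
```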
# Setting HUGGINGFACE_TOKEN is optional (as of now in Colab) but recommended, because it is used whenever sentence-transformers models are loaded.
os.environ["HF_TOKEN"] = userdata.get('HUGGINGFACE_TOKEN')
Define the Schema¶
In LanceDB, the schema specifies how the data is structured and which column is embedded. Here, we tell LanceDB to automatically embed the ‘Abstract’ column and store the result in the vector field.
class Papers(LanceModel):
    id: int
    Year: int
    Title: str
    Abstract: str = model.SourceField()
    URL: str
    JournalFN: str
    # The vector field is automatically populated by the model above
    vector: Vector(model.ndims()) = model.VectorField()
Prepare and Load Data¶
Load our existing dataset, ISResearch.csv, into a pandas DataFrame. Then, create the table and add the data. LanceDB handles the embedding generation automatically during the insertion process.
Loading and Previewing the Data¶
We load the ISResearch dataset directly from GitHub. Each row contains a paper’s title, abstract, publication year, journal, and URL. The abstracts will be embedded and stored as vectors.
# Load the CSV directly from GitHub (no local copy of ISResearch.csv is needed)
df = pd.read_csv("https://raw.githubusercontent.com/KarAnalytics/datasets/refs/heads/master/ISResearch.csv")
df.head()
Creating the Table and Inserting Data¶
When we call table.add(df), LanceDB automatically generates embedding vectors for every abstract using the model we registered in the schema. This single call handles both embedding and storage, which keeps the pipeline simple.
# Create the table and add the data
# LanceDB handles the embedding generation during the 'add' process
table = db.create_table("ISResearch", schema=Papers, mode="overwrite")
table.add(df)
print(f"Migration Complete. Total papers in DB: {len(table)}")
Semantic Search¶
Now we can perform semantic searches. LanceDB embeds the query automatically and returns a pandas DataFrame for easy viewing.
Running a Semantic Search¶
We pass a natural-language query and LanceDB embeds it with the same model, then finds the closest vectors. Because this notebook never builds a vector index, the search scans all stored vectors; approximate nearest-neighbor indexes are an option for larger tables. The results come back as a pandas DataFrame sorted by distance, making downstream analysis straightforward.
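The search step can be pictured as a brute-force scan: score every stored vector against the query vector and keep the closest few. The sketch below uses toy 3-dimensional unit vectors and hypothetical titles, and assumes an L2-style distance for _distance; LanceDB's own search path may differ. For unit-length vectors the squared L2 distance equals 2 - 2 * cosine similarity, so the ranking matches a cosine-similarity ranking.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(query_vec, rows, limit=5):
    """Score every row, then return the `limit` rows with the smallest distance."""
    scored = [{"Title": title, "_distance": squared_l2(query_vec, vec)}
              for title, vec in rows]
    return sorted(scored, key=lambda r: r["_distance"])[:limit]


# Toy data: titles and vectors are made up for illustration.
rows = [
    ("Paper A (toy)", normalize([0.9, 0.1, 0.0])),
    ("Paper B (toy)", normalize([0.1, 0.9, 0.1])),
    ("Paper C (toy)", normalize([0.7, 0.3, 0.2])),
]
query_vec = normalize([1.0, 0.0, 0.0])

results = search(query_vec, rows, limit=2)
print([r["Title"] for r in results])  # ['Paper A (toy)', 'Paper C (toy)']
```

A smaller _distance means a closer match, which is why the top result is the first row.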
query = "Which papers mention blockchain or decentralized systems?"
# We search using raw text; LanceDB embeds the query automatically
results = table.search(query).limit(5).to_pandas()
print("\n--- Top 5 Semantic Search Results ---")
print(results[['Title', 'Year', '_distance']])
# Optional: To show the URL of the top result
print(f"\nTop Match URL: {results.iloc[0]['URL']}")
Key takeaways¶
LanceDB is serverless and embedded -- it stores vectors in a local directory, so there is no database server to run or manage.
Pydantic schemas with SourceField and VectorField let LanceDB auto-embed a text column on insert, collapsing embedding and storage into a single table.add() call.
Raw-text queries are embedded with the same registered model at search time, so you never have to hand-embed queries or maintain a separate inference path.
Results as pandas DataFrames (with a _distance column) make it trivial to sort, filter, and join semantic search output with the rest of an analytics workflow.
all-MiniLM-L6-v2 produces 384-dim vectors and hits a sweet spot of speed versus quality for demos over a few thousand documents.
Run the code¶
To run this notebook, copy the URL below into your browser’s address bar. The link opens the notebook directly in Google Colab. (If your PDF viewer makes the URL clickable and lands on a broken page, copy the full text manually -- the viewer may have truncated the link at a line break.)
Estimated run time: ~3 minutes
https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/LanceDB_vectorDB.ipynb