LanceDB: A Local Vector Database

University of Kansas School of Business

Vectors, the output format of neural network models, encode information compactly and play a pivotal role in AI applications such as knowledge bases, semantic search, and Retrieval-Augmented Generation (RAG).

LanceDB is an open-source, serverless vector database that is embedded directly into your application. In this guide, we will walk you through how to set up LanceDB locally within minutes and use the Python client library to store and search vectors.

This notebook takes the in-memory pandas DataFrame from Chapter 8’s rag_first_principles.ipynb and swaps it for a persistent vector database. The chunking and cosine-similarity ideas are identical; only the storage layer changes.

Install Dependencies

First, we will install the required libraries: lancedb, sentence-transformers, pandas, and pydantic.

!pip install lancedb sentence-transformers pandas pydantic

Set Up Database and Embedding Model

To create a local LanceDB vector database, connect to a local directory, which acts like a .db file except that it is a folder. We will also define our embedding model through the sentence-transformers entry in LanceDB's embedding registry.

Importing Libraries

We import LanceDB along with pandas, PyArrow, and the LanceDB embeddings registry. The Pydantic integration lets us define a typed schema that LanceDB uses to validate and auto-embed data. We also import os and Colab's userdata module, which are used later to set the Hugging Face token.

import lancedb
import pandas as pd
import pyarrow as pa
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
import os
from google.colab import userdata

Connecting to the Database and Selecting an Embedding Model

LanceDB stores data in a local directory, so there is no server to manage. We select the all-MiniLM-L6-v2 sentence-transformer as our embedding model; it produces 384-dimensional vectors and offers a good balance of speed and quality.

# Connects to a local directory (like a .db file but for folders)
db = lancedb.connect("./lancedb_is_research")

# Select the embedding model
registry = get_registry().get("sentence-transformers")
model = registry.create(name="all-MiniLM-L6-v2")
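To get a feel for what 384 dimensions means for storage, the arithmetic below estimates the raw size of the vector column. It is a back-of-the-envelope sketch that assumes float32 values (4 bytes each) and ignores text columns, metadata, and file-format overhead. The row count of 5,000 is a hypothetical stand-in, not the size of the dataset used in this notebook.

```python
# Rough size of the raw vector column, assuming float32 (4 bytes per value).
# The row count used below is hypothetical, not the real dataset size.

def raw_vector_bytes(num_rows: int, dims: int, bytes_per_value: int = 4) -> int:
    """Return the bytes needed to store num_rows vectors of length dims."""
    return num_rows * dims * bytes_per_value


size = raw_vector_bytes(num_rows=5_000, dims=384)
print(f"{size:,} bytes = {size / 1024**2:.2f} MiB")
```

At this scale the vectors fit comfortably in memory, which is one reason a small model is a reasonable choice for a demo.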

Setting the Hugging Face Token

While optional today, providing a Hugging Face token avoids rate limits and ensures reliable access to sentence-transformer model weights.

# Setting HUGGINGFACE_TOKEN is optional in Colab for now, but recommended:
# the token is read whenever sentence-transformers models are loaded.
os.environ["HF_TOKEN"] = userdata.get('HUGGINGFACE_TOKEN')
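Because the token is optional, it can help to make the lookup tolerant of a missing secret. The sketch below is one way to do that; set_hf_token is a hypothetical helper (not part of Colab or LanceDB) that takes any lookup function, so in Colab you would pass userdata.get. It assumes the lookup raises an exception when the secret is unavailable.

```python
import os
from typing import Callable


def set_hf_token(get_secret: Callable[[str], str],
                 secret_name: str = "HUGGINGFACE_TOKEN") -> bool:
    """Copy a secret into HF_TOKEN if it exists; return True on success."""
    try:
        token = get_secret(secret_name)
    except Exception:
        # Missing secret or denied access: continue without a token.
        token = None
    if not token:
        return False
    os.environ["HF_TOKEN"] = token
    return True


# Stand-in lookup with no secrets defined, for illustration only.
# In Colab, call set_hf_token(userdata.get) instead.
print(set_hf_token({}.__getitem__))
```

The helper returns False instead of stopping the notebook, so the remaining cells still run with anonymous access.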

Define the Schema

In LanceDB, we define the schema to specify how the data should be structured and embedded. Here, we tell LanceDB to automatically embed the ‘Abstract’ column.

The schema mirrors the ISResearch dataset, which is loaded from GitHub in the next section. Each row contains a paper’s title, abstract, publication year, journal, and URL. The Abstract column is marked as the source field, so the abstracts are embedded and stored as vectors.

class Papers(LanceModel):
    id: int
    Year: int
    Title: str
    Abstract: str = model.SourceField()
    URL: str
    JournalFN: str
    # The vector field is automatically populated by the model above
    vector: Vector(model.ndims()) = model.VectorField()

Prepare and Load Data

Load the existing dataset, ISResearch.csv, into a pandas DataFrame, then create the table and add the data. When table.add(df) runs, LanceDB generates an embedding vector for every abstract using the model bound in the schema, so that one call handles both embedding and storage.

We start by reading the CSV and previewing the first rows.

# Load the CSV directly from GitHub (or point read_csv at a local copy of ISResearch.csv)
df = pd.read_csv("https://raw.githubusercontent.com/KarAnalytics/datasets/refs/heads/master/ISResearch.csv")

df.head()
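Before inserting, it can be worth confirming that the loaded columns line up with the fields declared in the Papers schema. The check below is a minimal sketch: missing_columns is a hypothetical helper, the required names are copied from the schema above, and the example header is made up for illustration. Whether the CSV actually contains every one of these columns is not verified here.

```python
# Column names taken from the Papers schema; 'vector' is excluded because
# LanceDB fills it in during insertion.
REQUIRED_COLUMNS = ["id", "Year", "Title", "Abstract", "URL", "JournalFN"]


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `available`."""
    present = set(available)
    return [name for name in required if name not in present]


# Hypothetical header with one column missing, for illustration only.
example_header = ["id", "Year", "Title", "Abstract", "URL"]
print(missing_columns(example_header))
```

In the notebook, calling missing_columns(df.columns) and getting an empty list back is a quick sanity check before the data is added to the table.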

With the DataFrame loaded, we create the table from the Papers schema and insert the rows. Passing mode="overwrite" replaces any existing table with the same name, so the cell can be re-run safely.

# Create the table and add the data
# LanceDB handles the embedding generation during the 'add' process
table = db.create_table("ISResearch", schema=Papers, mode="overwrite")
table.add(df)

print(f"Migration Complete. Total papers in DB: {len(table)}")

Now we can perform semantic searches. We pass a natural-language query, and LanceDB embeds it with the same model and finds the closest stored vectors. Because no vector index has been created on this table, the search scans every stored vector rather than using an approximate nearest-neighbor index. The results come back as a pandas DataFrame sorted by distance, making downstream analysis straightforward.

query = "Which papers mention blockchain or decentralized systems?"

# We search using raw text; LanceDB embeds the query automatically
results = table.search(query).limit(5).to_pandas()

print("\n--- Top 5 Semantic Search Results ---")
print(results[['Title', 'Year', '_distance']])

# Optional: To show the URL of the top result
print(f"\nTop Match URL: {results.iloc[0]['URL']}")
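To connect the _distance column back to the cosine-similarity idea from the earlier chapter, the sketch below runs a nearest-neighbor search by hand on toy data. It assumes an L2-type distance and unit-length embeddings; under those assumptions the squared L2 distance equals 2 - 2 × cosine similarity, so ranking by smallest distance matches ranking by highest cosine similarity. The vectors and titles are made up, and the helper functions are illustrative, not LanceDB internals.

```python
import heapq
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    # For unit vectors the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def top_k(query, rows, k):
    """Exhaustive search: score every row, keep the k smallest distances."""
    scored = ((squared_l2(query, vec), title) for title, vec in rows)
    return heapq.nsmallest(k, scored)


# Toy 3-dimensional "embeddings" with made-up titles (not real data).
rows = [
    ("paper A", normalize([1.0, 0.1, 0.0])),
    ("paper B", normalize([0.0, 1.0, 0.2])),
    ("paper C", normalize([0.9, 0.3, 0.1])),
]
query = normalize([1.0, 0.2, 0.0])

for dist, title in top_k(query, rows, k=2):
    sim = cosine(query, dict(rows)[title])
    # For unit vectors: squared L2 distance = 2 - 2 * cosine similarity.
    print(f"{title}: distance={dist:.4f}, 2 - 2*cos={2 - 2 * sim:.4f}")
```

The two printed numbers agree on each row, which is why a smaller _distance can be read as a closer semantic match when these assumptions hold.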

Key takeaways

  • LanceDB is serverless and embedded -- it stores vectors in a local directory, so there is no database server to run or manage.

  • Pydantic schemas with SourceField and VectorField let LanceDB auto-embed a text column on insert, collapsing embedding and storage into a single table.add() call.

  • Raw-text queries are embedded with the same registered model at search time, so you never have to hand-embed queries or maintain a separate inference path.

  • Results as pandas DataFrames (with a _distance column) make it trivial to sort, filter, and join semantic search output with the rest of an analytics workflow.

  • all-MiniLM-L6-v2 produces 384-dim vectors and hits a sweet spot of speed versus quality for demos over a few thousand documents.


Run the code

To run this notebook, copy the URL below into your browser’s address bar. The link opens the notebook directly in Google Colab. (If your PDF viewer makes the URL clickable and lands on a broken page, copy the full text manually -- the viewer may have truncated the link at a line break.)

Estimated run time: ~3 minutes

https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/LanceDB_vectorDB.ipynb