CockroachDB vector store

AsyncCockroachDBVectorStore is an implementation of a LangChain vector store using CockroachDB’s distributed SQL database with native vector support. This notebook goes over how to use the AsyncCockroachDBVectorStore API. The code lives in the integration package: langchain-cockroachdb.

Overview

CockroachDB is a distributed SQL database that provides:

Native vector support with the VECTOR data type (v24.2+)
Distributed C-SPANN indexes for approximate nearest neighbor (ANN) search (v25.2+)
SERIALIZABLE isolation by default for transaction correctness
Horizontal scalability with automatic sharding and replication
PostgreSQL wire-compatible for easy adoption

Key advantages for vector workloads

Distributed vector indexes: C-SPANN indexes automatically shard across your cluster
Multi-tenancy support: Prefix columns in indexes for efficient tenant isolation
Strong consistency: SERIALIZABLE transactions prevent data anomalies
High availability: Automatic failover with no data loss

Setup

Install

Install the integration library, langchain-cockroachdb.

pip install -qU langchain-cockroachdb

CockroachDB cluster

You need a CockroachDB cluster with vector support (v24.2+). Choose one option:

Option 1: CockroachDB Cloud (Recommended)

Sign up at cockroachlabs.cloud
Create a free cluster
Get your connection string from the cluster details page

Option 2: Docker (Development)

docker run -d \
  --name cockroachdb \
  -p 26257:26257 \
  -p 8080:8080 \
  cockroachdb/cockroach:latest \
  start-single-node --insecure

Option 3: Local binary

Download from cockroachlabs.com/docs/releases

cockroach start-single-node --insecure --listen-addr=localhost:26257

Set your connection values

# For CockroachDB Cloud
CONNECTION_STRING = "cockroachdb://user:password@host:26257/database?sslmode=verify-full"

# For local insecure cluster
CONNECTION_STRING = "cockroachdb://root@localhost:26257/defaultdb?sslmode=disable"

TABLE_NAME = "langchain_vectors"
VECTOR_DIMENSION = 1536  # Depends on your embedding model
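
Before handing a connection string to the engine, it can help to confirm that it parses the way you expect. The sketch below uses only the Python standard library; describe_connection is a hypothetical helper name, not part of langchain-cockroachdb.

```python
from urllib.parse import parse_qs, urlsplit


def describe_connection(url: str) -> dict:
    """Split a connection URL into its parts for a quick sanity check."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        # parse_qs returns lists; take the first sslmode value if present.
        "sslmode": query.get("sslmode", [None])[0],
    }


info = describe_connection(
    "cockroachdb://root@localhost:26257/defaultdb?sslmode=disable"
)
print(info)
```

A check like this makes it easy to catch a missing database name or an unintended sslmode=disable on a non-local host before any connection is opened.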

Initialization

Create a connection engine

The CockroachDBEngine manages a connection pool to your cluster:

from langchain_cockroachdb import CockroachDBEngine

engine = CockroachDBEngine.from_connection_string(
    url=CONNECTION_STRING,
    pool_size=10,        # Connection pool size
    max_overflow=20,     # Additional connections allowed
    pool_pre_ping=True,  # Health check connections
)

Initialize a table

Create a table with the proper schema for vector storage:

await engine.ainit_vectorstore_table(
    table_name=TABLE_NAME,
    vector_dimension=VECTOR_DIMENSION,
)

Optional: Specify a schema name

await engine.ainit_vectorstore_table(
    table_name=TABLE_NAME,
    vector_dimension=VECTOR_DIMENSION,
    schema="my_schema",  # Default: "public"
)
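
The snippets on this page use top-level await, which works in a notebook. In a plain Python script the same calls need to run inside an event loop. A minimal sketch of that pattern follows; init_table is a hypothetical stand-in for the real engine call so that the example runs without a cluster.

```python
import asyncio


async def init_table(table_name: str, vector_dimension: int) -> str:
    # Hypothetical stand-in for `await engine.ainit_vectorstore_table(...)`.
    await asyncio.sleep(0)
    return f"{table_name}({vector_dimension})"


async def main() -> str:
    # Place the `await` calls from this page inside a coroutine like this one.
    return await init_table("langchain_vectors", 1536)


# asyncio.run creates the event loop, runs main() to completion, and closes the loop.
result = asyncio.run(main())
print(result)
```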

Create an embedding instance

Use any LangChain embeddings model.

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
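
The table's vector_dimension has to match the length of the vectors your embedding model produces (text-embedding-3-small returns 1536 dimensions by default). A small guard along these lines can surface a mismatch early; check_dimension is a hypothetical helper, and the probe vector stands in for a real embedding call, which would need API credentials.

```python
def check_dimension(vector: list[float], expected: int) -> list[float]:
    """Raise early if an embedding's length differs from the table's VECTOR size."""
    if len(vector) != expected:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, table expects {expected}"
        )
    return vector


# Stand-in for `embeddings.embed_query("probe")`.
probe = [0.0] * 1536
checked = check_dimension(probe, expected=1536)
print(len(checked))
```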

Initialize the vector store

from langchain_cockroachdb import AsyncCockroachDBVectorStore

vectorstore = AsyncCockroachDBVectorStore(
    engine=engine,
    embeddings=embeddings,
    collection_name=TABLE_NAME,
)

Manage vector store

Add documents

Add documents with metadata:

import uuid
from langchain_core.documents import Document

docs = [
    Document(
        id=str(uuid.uuid4()),
        page_content="CockroachDB is a distributed SQL database",
        metadata={"source": "docs", "category": "database"},
    ),
    Document(
        id=str(uuid.uuid4()),
        page_content="Vector search enables semantic similarity",
        metadata={"source": "docs", "category": "features"},
    ),
]

ids = await vectorstore.aadd_documents(docs)
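
The example above assigns random uuid4 IDs, so ingesting the same content twice yields different IDs. If you prefer IDs that are stable across runs, one option is to derive them from the content with uuid5. This is only a sketch: the namespace URL and the stable_id helper are hypothetical, and how the store treats an ID that already exists is not covered on this page, so verify that behavior before relying on it for deduplication.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/langchain-docs")


def stable_id(source: str, content: str) -> str:
    """Derive the same UUID every time from a document's source and text."""
    return str(uuid.uuid5(NAMESPACE, f"{source}\n{content}"))


first = stable_id("docs", "CockroachDB is a distributed SQL database")
second = stable_id("docs", "CockroachDB is a distributed SQL database")
other = stable_id("docs", "Vector search enables semantic similarity")
print(first == second, first == other)
```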

Add texts

Add text directly without structuring as documents:

texts = ["First text", "Second text", "Third text"]
metadatas = [{"idx": i} for i in range(len(texts))]
ids = [str(uuid.uuid4()) for _ in texts]

ids = await vectorstore.aadd_texts(texts, metadatas=metadatas, ids=ids)

Performance note: CockroachDB’s vector indexes work best with small insert batches. The default batch_size=100 is tuned for vector inserts; inserting much larger batches of VECTOR values can degrade performance.
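
If you manage batching yourself, a small chunking helper keeps each insert at or below a chosen size. The sketch below is plain Python; batched is a hypothetical helper name, and the size of 100 simply mirrors the default mentioned above.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 100) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"text {i}" for i in range(250)]
batch_lengths = [len(batch) for batch in batched(texts)]
print(batch_lengths)
```

Each yielded batch could then be passed to a separate aadd_texts call.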

Delete documents

Delete documents by ID:

await vectorstore.adelete([ids[0], ids[1]])

Query vector store

Similarity search

Search for similar documents using natural language:

query = "distributed database"
docs = await vectorstore.asimilarity_search(query, k=5)

for doc in docs:
    print(f"{doc.page_content[:50]}...")

Similarity search with scores

Get relevance scores with results:

docs_with_scores = await vectorstore.asimilarity_search_with_score(query, k=5)

for doc, score in docs_with_scores:
    print(f"Score: {score:.4f} - {doc.page_content[:50]}...")

Search by vector

Search using a pre-computed embedding vector:

query_vector = await embeddings.aembed_query(query)
docs = await vectorstore.asimilarity_search_by_vector(query_vector, k=5)

Maximum marginal relevance (MMR) search

Retrieve diverse results that balance relevance and diversity:

docs = await vectorstore.amax_marginal_relevance_search(
    query,
    k=5,           # Number of results to return
    fetch_k=20,    # Number of candidates to consider
    lambda_mult=0.5,  # 0 = max diversity, 1 = max relevance
)
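
To make the role of lambda_mult concrete, here is a minimal pure-Python sketch of the general MMR selection rule: at each step, pick the candidate that maximizes lambda_mult * relevance - (1 - lambda_mult) * redundancy. It illustrates the technique only and is not the library's implementation; the vectors are made-up toy data.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine_similarity(query_vec, candidate_vecs[i])
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine_similarity(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 1.0, 0.0]
candidates = [
    [1.0, 0.0, 0.0],   # index 0: relevant
    [1.0, 0.05, 0.0],  # index 1: most relevant, nearly identical to index 0
    [0.0, 1.0, 0.3],   # index 2: slightly less relevant, very different from the others
]

relevance_only = mmr_select(query_vec, candidates, k=2, lambda_mult=1.0)
balanced = mmr_select(query_vec, candidates, k=2, lambda_mult=0.5)
print(relevance_only, balanced)
```

With lambda_mult=1.0 the two near-duplicates are returned; with lambda_mult=0.5 the second pick switches to the dissimilar candidate.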

Vector indexes

Speed up similarity search with CockroachDB’s C-SPANN vector indexes (requires v25.2+).

What is C-SPANN?

C-SPANN (CockroachDB Space Partition Approximate Nearest Neighbor) is a distributed vector index that:

Automatically shards across your cluster nodes
Provides sub-second query performance at scale
Supports cosine, Euclidean (L2), and inner product distances
Works with prefix columns for multi-tenant architectures

Create a vector index

from langchain_cockroachdb import CSPANNIndex, DistanceStrategy

# Create a cosine distance index (most common)
index = CSPANNIndex(
    distance_strategy=DistanceStrategy.COSINE,
    name="my_vector_index",
)

await vectorstore.aapply_vector_index(index)

Distance strategies

Choose the distance metric that matches your use case:

# Cosine similarity (most common for text embeddings)
CSPANNIndex(distance_strategy=DistanceStrategy.COSINE)

# Euclidean distance (L2)
CSPANNIndex(distance_strategy=DistanceStrategy.EUCLIDEAN)

# Inner product (for normalized vectors)
CSPANNIndex(distance_strategy=DistanceStrategy.INNER_PRODUCT)
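
For reference, the three metrics can be computed by hand as follows. This is a plain-Python sketch of the standard definitions, independent of the library, using two made-up vectors.

```python
import math


def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_distance(u, v):
    norms = math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    return 1.0 - inner_product(u, v) / norms


def normalize(u):
    length = math.sqrt(inner_product(u, u))
    return [a / length for a in u]


u, v = [3.0, 4.0], [4.0, 3.0]
print(l2_distance(u, v), inner_product(u, v), cosine_distance(u, v))

# For unit-length vectors the inner product equals the cosine similarity,
# which is why inner product is a natural fit for normalized embeddings.
unit_ip = inner_product(normalize(u), normalize(v))
print(unit_ip)
```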

Tune index parameters

Adjust partition sizes for performance:

index = CSPANNIndex(
    distance_strategy=DistanceStrategy.COSINE,
    min_partition_size=16,   # Minimum vectors per partition
    max_partition_size=128,  # Maximum vectors per partition
)

await vectorstore.aapply_vector_index(index)

Query-time tuning

Adjust search parameters at query time:

from langchain_cockroachdb import CSPANNQueryOptions

# Increase beam size for better recall (slower)
query_options = CSPANNQueryOptions(beam_size=200)  # Default: 100

docs = await vectorstore.asimilarity_search(
    query,
    k=10,
    query_options=query_options,
)
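
The recall versus speed trade-off can be illustrated with a toy partitioned search. The sketch assumes that beam size controls how many partitions are probed per query; it is a deliberately simplified model with made-up data and does not reflect how C-SPANN is actually implemented.

```python
import math

# Two hand-built partitions of 2-D vectors (toy data).
partitions = {
    "p0": [(-0.5, 0.0), (2.5, 0.0)],
    "p1": [(3.0, 0.0), (5.0, 0.0)],
}


def centroid(vectors):
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))


def search(query, k, beam_size):
    """Probe the `beam_size` partitions with the nearest centroids, then rank their vectors."""
    ranked = sorted(
        partitions,
        key=lambda name: math.dist(query, centroid(partitions[name])),
    )
    candidates = [vec for name in ranked[:beam_size] for vec in partitions[name]]
    return sorted(candidates, key=lambda vec: math.dist(query, vec))[:k]


query_point = (2.7, 0.0)
narrow = search(query_point, k=1, beam_size=1)  # probes only p1
wide = search(query_point, k=1, beam_size=2)    # probes p1 and p0
print(narrow, wide)
```

The narrow search misses the true nearest neighbor because it sits in a partition whose centroid is farther from the query; the wider beam finds it at the cost of scanning more vectors.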

Drop an index

Remove a vector index:

index = CSPANNIndex(name="my_vector_index")
await vectorstore.adrop_vector_index(index)

Metadata filtering

Filter similarity searches using metadata fields.

Supported operators

Operator	Meaning	Example
`$eq`	Equality	`{"category": "news"}`
`$ne`	Not equal	`{"category": {"$ne": "spam"}}`
`$gt`	Greater than	`{"year": {"$gt": 2020}}`
`$gte`	Greater than or equal	`{"rating": {"$gte": 4.0}}`
`$lt`	Less than	`{"year": {"$lt": 2023}}`
`$lte`	Less than or equal	`{"rating": {"$lte": 3.0}}`
`$in`	In list	`{"category": {"$in": ["news", "blog"]}}`
`$nin`	Not in list	`{"source": {"$nin": ["spam", "test"]}}`
`$between`	Between values	`{"year": {"$between": [2020, 2023]}}`
`$like`	Pattern match	`{"source": {"$like": "wiki%"}}`
`$ilike`	Case-insensitive pattern match	`{"category": {"$ilike": "%NEWS%"}}`
`$and`	Logical AND	`{"$and": [{...}, {...}]}`
`$or`	Logical OR	`{"$or": [{...}, {...}]}`

Filter examples

# Simple equality
docs = await vectorstore.asimilarity_search(
    query,
    filter={"category": "news"},
)

# Numeric comparison
docs = await vectorstore.asimilarity_search(
    query,
    filter={"year": {"$gte": 2020}},
)

# Complex filters
docs = await vectorstore.asimilarity_search(
    query,
    filter={
        "$and": [
            {"category": {"$in": ["news", "blog"]}},
            {"year": {"$gte": 2020}},
            {"rating": {"$gt": 3.5}},
        ]
    },
)
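
To reason about what a filter will match, it can help to evaluate the same filter shape against a metadata dict in plain Python. The evaluator below is an illustration of the operator semantics only, not the library's implementation. It assumes that $between is inclusive, that $like and $ilike follow SQL wildcard rules, and that multiple top-level keys combine with AND; matches is a hypothetical helper name.

```python
import re


def _like(value: str, pattern: str, ignore_case: bool = False) -> bool:
    # Translate SQL LIKE wildcards: % matches any run of characters, _ exactly one.
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch) for ch in pattern
    )
    flags = (re.IGNORECASE | re.DOTALL) if ignore_case else re.DOTALL
    return re.fullmatch(regex, value, flags) is not None


OPERATORS = {
    "$eq": lambda v, arg: v == arg,
    "$ne": lambda v, arg: v != arg,
    "$gt": lambda v, arg: v is not None and v > arg,
    "$gte": lambda v, arg: v is not None and v >= arg,
    "$lt": lambda v, arg: v is not None and v < arg,
    "$lte": lambda v, arg: v is not None and v <= arg,
    "$in": lambda v, arg: v in arg,
    "$nin": lambda v, arg: v not in arg,
    "$between": lambda v, arg: v is not None and arg[0] <= v <= arg[1],
    "$like": lambda v, arg: isinstance(v, str) and _like(v, arg),
    "$ilike": lambda v, arg: isinstance(v, str) and _like(v, arg, ignore_case=True),
}


def matches(metadata: dict, flt: dict) -> bool:
    """Return True if `metadata` satisfies every condition in `flt`."""
    for key, condition in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            # A bare value such as {"category": "news"} is shorthand for $eq.
            if not isinstance(condition, dict):
                condition = {"$eq": condition}
            value = metadata.get(key)
            for op, arg in condition.items():
                if not OPERATORS[op](value, arg):
                    return False
    return True


meta = {"category": "news", "year": 2021, "rating": 4.2, "source": "wikipedia"}
complex_filter = {
    "$and": [
        {"category": {"$in": ["news", "blog"]}},
        {"year": {"$gte": 2020}},
        {"rating": {"$gt": 3.5}},
    ]
}
print(matches(meta, complex_filter))
```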

Sync interface

All async methods have sync equivalents using the sync wrapper:

from langchain_cockroachdb import CockroachDBVectorStore

# Create sync vectorstore
vectorstore = CockroachDBVectorStore(
    engine=engine,
    embeddings=embeddings,
    collection_name=TABLE_NAME,
)

# Use sync methods (add documents first, then search and index)
ids = vectorstore.add_documents(docs)
docs = vectorstore.similarity_search(query, k=5)
vectorstore.apply_vector_index(index)

Usage for retrieval-augmented generation (RAG)

For implementing RAG with CockroachDB as your vector store, see the LangChain RAG tutorial. The CockroachDB vector store can be used in place of any other vector store in those patterns.

Cleanup

⚠️ This operation cannot be undone

Drop the vector store table:

await engine.adrop_table(TABLE_NAME)

API reference

For detailed documentation of all features and configurations:

Additional resources
