CocoIndex builds a data indexing platform for AI applications

An ETL framework for indexing data for AI use cases such as RAG, with real-time incremental updates and support for custom logic that snaps together like LEGO bricks.

Introduction

CocoIndex is an open source data indexing engine designed to prepare high-quality data for AI applications such as semantic search, retrieval-augmented generation (RAG), and embedding-based knowledge graphs. It supports custom transformation logic and incremental updates, which keep the index fresh and consistent with the source data.

Key features

Dataflow programming model: CocoIndex provides a data-driven programming model in which users define the indexing process by declaring data flows and transformation logic. This is similar to how data and formulas relate in a spreadsheet, which makes flows easy to understand and maintain.
Custom transformation logic: Users can plug in their own logic for steps such as chunking, embedding, and vector storage to meet specific business needs. For example, they can define their own chunking strategy or choose a different embedding model.
Incremental updates: CocoIndex tracks state so that, when the source data or the transformation logic changes, it recomputes only the affected parts instead of rebuilding the whole index.
Python SDK: The core of CocoIndex is implemented in Rust and exposed through Python bindings, balancing performance with ease of use. Users build and manage indexing flows in familiar Python syntax.
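
The incremental update idea can be illustrated with a toy sketch in plain Python. This is not CocoIndex's actual state management: the `IncrementalIndex` class and its content-hash fingerprints are hypothetical, chosen only to show how unchanged inputs can be skipped.

```python
import hashlib
from typing import Callable, Dict, List


class IncrementalIndex:
    """Toy index that reprocesses a document only when its content changes."""

    def __init__(self, transform: Callable[[str], str]) -> None:
        self.transform = transform
        self.fingerprints: Dict[str, str] = {}  # doc id -> content hash
        self.outputs: Dict[str, str] = {}       # doc id -> transformed result

    def update(self, docs: Dict[str, str]) -> List[str]:
        """Sync the index with `docs` and return the ids that were recomputed."""
        recomputed = []
        for doc_id, content in docs.items():
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if self.fingerprints.get(doc_id) != digest:
                # New or changed document: run the transformation again.
                self.outputs[doc_id] = self.transform(content)
                self.fingerprints[doc_id] = digest
                recomputed.append(doc_id)
        # Drop entries whose source document no longer exists.
        for doc_id in list(self.fingerprints):
            if doc_id not in docs:
                del self.fingerprints[doc_id]
                del self.outputs[doc_id]
        return recomputed


index = IncrementalIndex(transform=str.upper)
first = index.update({"a.md": "hello", "b.md": "world"})
second = index.update({"a.md": "hello", "b.md": "world!"})
print(first, second)
```

The second call recomputes only the document whose content changed. This sketch fingerprints source content only; an engine that also reacts to changes in transformation logic would need to fingerprint that logic as well.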

Quick start

Install the CocoIndex Python library:
```
pip install cocoindex
```

Set up a Postgres database with the pgvector extension:

Make sure Docker Compose is installed, and then run the following command to launch a Postgres database that contains the pgvector extension:
```
docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d
```
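
Before running a flow, it can help to confirm that the database container is accepting connections. The sketch below uses only the Python standard library; the host `localhost` and port `5432` are assumptions (the Postgres default), so adjust them if the compose file maps a different port. `is_port_open` is a hypothetical helper, not part of CocoIndex.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed defaults: Postgres listening on localhost:5432.
    print("Postgres port reachable:", is_port_open("localhost", 5432))
```

An open port only shows that something is listening; it does not prove that Postgres has finished starting up or that pgvector is enabled.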


Define the indexing flow:

Use CocoIndex's flow decorator and flow builder to define an indexing flow for text embedding:

```
import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source that reads files from a local directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for the rows to be exported
    doc_embeddings = data_scope.add_collector()

    # Process each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(language="markdown", chunk_size=300, chunk_overlap=100))

        # Process each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the embedding and its metadata for indexing
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export the collected data to vector storage
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
```



The flow above reads documents from a local directory, splits each one into chunks, embeds each chunk, and stores the results in the Postgres database.
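
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sliding-window chunker. It is not the algorithm behind `SplitRecursively`; `chunk_text` is a hypothetical helper that works on raw characters and only shows how overlap makes consecutive chunks share text.

```python
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[Tuple[int, str]]:
    """Split text into fixed-size character windows that overlap.

    Returns (start_offset, chunk) pairs; each window starts
    chunk_size - chunk_overlap characters after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


sample = "x" * 700
chunks = chunk_text(sample, chunk_size=300, chunk_overlap=100)
offsets = [start for start, _ in chunks]
print(offsets)
```

With a size of 300 and an overlap of 100, each window begins 200 characters after the previous one, so text near a chunk boundary appears in both neighbouring chunks and is less likely to lose its context.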

Use cases

CocoIndex is suitable for the following AI application scenarios:

Semantic search: Build indexes of text embeddings to support efficient semantic search.
Retrieval-augmented generation (RAG): Supply generative models with high-quality retrieved data to improve the accuracy and relevance of their output.
Knowledge graph construction: Build a knowledge graph by parsing and indexing structured data to support complex queries and reasoning.
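
Semantic search over an embedding index comes down to ranking stored vectors by their similarity to a query vector. The sketch below does this in plain Python with cosine similarity, the metric named in the flow definition. The three-dimensional vectors, the keys, and the `top_k` helper are made up for illustration; in a real deployment the vector store would normally do the ranking.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k index entries most similar to the query, best first."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical chunk keys mapped to toy embeddings.
toy_index = {
    "intro.md#0": [1.0, 0.0, 0.0],
    "setup.md#0": [0.0, 1.0, 0.0],
    "setup.md#1": [0.7, 0.7, 0.0],
}
results = top_k([1.0, 0.1, 0.0], toy_index, k=2)
print([key for key, _ in results])
```

A linear scan like this is fine for a handful of vectors; larger collections rely on a vector index to avoid comparing the query against every stored embedding.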

Community and Contribution

CocoIndex is an open source project under the Apache 2.0 license. We welcome community contributions, including code improvements, documentation updates, bug reports, and feature requests. You can get involved through:

GitHub: Submit issues or pull requests in our GitHub repository.
Discord: Join our Discord community to talk with other developers.
Social media: Follow us on Twitter and LinkedIn for the latest news.

With CocoIndex, you can focus on your business logic and leave the complexity of data indexing to us, so that you can build high-quality AI applications quickly.

GitHub: https://github.com/cocoindex-io/cocoindex

YouTube: