An ETL framework for indexing data for AI use cases such as RAG, with real-time incremental updates and support for custom logic that snaps together like LEGO blocks.
Introduction
CocoIndex is an open source data indexing engine that prepares high-quality data for AI applications such as semantic search, retrieval-augmented generation (RAG), and embedding-based knowledge graphs. It supports custom transformation logic and incremental updates, so the index stays fresh and consistent with the source data.
Key features
-
Data flow programming model: CocoIndex provides a dataflow programming model in which users define the indexing process by declaring data flows and transformation logic. This is similar to data and formulas in a spreadsheet, which makes flows easy to understand and maintain.
-
Custom transformation logic: Users can plug in their own logic for steps such as chunking, embedding, and vector storage to meet specific business needs. For example, users can define their own chunking strategy or choose a different embedding model.
-
Incremental updates: CocoIndex manages state so that, when source data or transformation logic changes, only the affected parts are recomputed. This avoids rebuilding the whole index and improves efficiency.
-
Python SDK: The core of CocoIndex is implemented in Rust and exposed through Python bindings, balancing performance with ease of use. Users can build and manage the indexing process in familiar Python syntax.
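The incremental update idea from the feature list can be illustrated with a small, self-contained sketch. This is not CocoIndex's implementation or API; the `IncrementalIndex` class, the `fingerprint` helper, and the `logic_version` tag are hypothetical names used only to show one common approach: store a hash of each input together with a version of the transformation logic, and recompute an entry only when that hash changes.

```python
import hashlib
from typing import Callable, Dict, Tuple


def fingerprint(text: str, logic_version: str) -> str:
    """Hash the source content together with a version tag for the transformation logic."""
    return hashlib.sha256(f"{logic_version}\x00{text}".encode("utf-8")).hexdigest()


class IncrementalIndex:
    """Toy index (hypothetical) that recomputes an entry only when its input or the logic changes."""

    def __init__(self, transform: Callable[[str], str], logic_version: str) -> None:
        self.transform = transform
        self.logic_version = logic_version
        # key -> (fingerprint, transformed value)
        self.entries: Dict[str, Tuple[str, str]] = {}

    def update(self, sources: Dict[str, str]) -> int:
        """Sync the index with `sources` and return how many entries were recomputed."""
        recomputed = 0
        # Drop entries whose source no longer exists.
        for key in list(self.entries):
            if key not in sources:
                del self.entries[key]
        for key, text in sources.items():
            fp = fingerprint(text, self.logic_version)
            cached = self.entries.get(key)
            if cached is not None and cached[0] == fp:
                continue  # unchanged input and logic: reuse the stored result
            self.entries[key] = (fp, self.transform(text))
            recomputed += 1
        return recomputed


index = IncrementalIndex(transform=str.upper, logic_version="v1")
docs = {"a.md": "hello", "b.md": "world"}
first = index.update(docs)    # both documents are new
docs["b.md"] = "world!"
second = index.update(docs)   # only b.md changed
third = index.update(docs)    # nothing changed
print(first, second, third)   # prints: 2 1 0
```

Changing `logic_version` changes every fingerprint, so the next `update` call recomputes all entries. That mirrors the case where the transformation logic, rather than the source data, has changed.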
Quick start
-
Install the CocoIndex Python library:
pip install cocoindex
-
Set up a Postgres database with the pgvector extension:
Make sure Docker Compose is installed, then run the following command to start a Postgres database that includes the pgvector extension:
docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d
-
Define the indexing process:
Use the CocoIndex flow decorator and flow builder to define an indexing process for text embedding:
import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector
    doc_embeddings = data_scope.add_collector()

    # Process each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(language="markdown", chunk_size=300, chunk_overlap=100))

        # Process each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the embedding and metadata for indexing
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export the collected data to vector storage
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
The flow above covers the complete pipeline: reading documents from local files, splitting them into chunks, embedding each chunk, and storing the results in the Postgres database.
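To make the `chunk_size=300` and `chunk_overlap=100` parameters in the flow more concrete, here is a deliberately simplified sketch of overlapping chunking. It is not how `SplitRecursively` works internally (a recursive splitter would also look for natural boundaries in the text); the `split_with_overlap` function is a hypothetical helper that uses a plain sliding window and measures sizes in characters, which is an assumption made for simplicity.

```python
from typing import List, Tuple


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[Tuple[int, str]]:
    """Return (start_offset, chunk_text) pairs; consecutive chunks share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[Tuple[int, str]] = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


text = "x" * 700
chunks = split_with_overlap(text, chunk_size=300, chunk_overlap=100)
offsets = [start for start, _ in chunks]
lengths = [len(chunk) for _, chunk in chunks]
print(offsets, lengths)  # prints: [0, 200, 400] [300, 300, 300]
```

Each chunk starts 200 characters after the previous one, so neighbouring chunks share 100 characters. The overlap keeps a sentence that straddles a chunk boundary fully visible in at least one chunk, at the cost of storing some text twice.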
Use cases
CocoIndex is suited to the following AI use cases:
-
Semantic search: Build an index of text embeddings to enable efficient semantic search.
-
Retrieval-augmented generation (RAG): Supply generative models with high-quality retrieved data to improve the accuracy and relevance of their output.
-
Knowledge graph construction: Build a knowledge graph by parsing and indexing structured data to support complex queries and reasoning.
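For the semantic search case, the core operation is ranking stored embeddings by cosine similarity to a query embedding, the same metric chosen for the vector index in the quick start. The sketch below shows that ranking in plain Python over a tiny in-memory index. The three-dimensional vectors, the keys, and the `top_k` helper are made up for illustration; in a real setup the vectors would come from an embedding model and the ranking would normally be done by the vector store.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k entries most similar to `query`, best match first."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings keyed by "filename#chunk".
toy_index = {
    "intro.md#0": [1.0, 0.0, 0.0],
    "setup.md#0": [0.0, 1.0, 0.0],
    "setup.md#1": [0.6, 0.8, 0.0],
}
results = top_k([0.0, 1.0, 0.0], toy_index, k=2)
print([key for key, _ in results])  # prints: ['setup.md#0', 'setup.md#1']
```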
Community and Contribution
CocoIndex is an open source project released under the Apache 2.0 license. We welcome community contributions, including code improvements, documentation updates, bug reports, and feature requests. You can take part in the community through:
-
GitHub: Open issues or submit pull requests in our GitHub repository.
-
Discord: Join our Discord community to talk with other developers.
-
Social media: Follow us on Twitter and LinkedIn for the latest news.
With CocoIndex, you can focus on your business logic and leave the complexity of data indexing to us, so you can build high-quality AI applications quickly.
GitHub: https://github.com/cocoindex-io/cocoindex
YouTube: