Splitter - semantic
Overview
The semantic splitter is an implementation of the Document Transformer interface that splits long documents based on the semantic similarity between adjacent fragments. It follows the Eino: Document Transformer Guide.
How It Works
- First split the document into initial fragments using basic separators (newline, period, etc.)
- Generate an embedding vector for each fragment
- Compute cosine similarity between adjacent fragments
- Decide split points by comparing each adjacent-pair similarity against a threshold taken at the configured percentile
- Merge fragments smaller than the minimum size
Usage
Initialization
Initialize via NewSplitter with configuration:
splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
	Embedding: embedder, // required: embedder to generate vectors
	BufferSize: 2, // optional: context buffer size
	MinChunkSize: 100, // optional: minimum chunk size
	Separators: []string{"\n", ".", "?", "!"}, // optional: separator list
	Percentile: 0.9, // optional: split threshold percentile
	LenFunc: nil, // optional: custom length func
})
Parameters:
- Embedding: required embedder instance
- BufferSize: include more context for similarity computation
- MinChunkSize: merge fragments smaller than this size
- Separators: ordered list used for initial split
- Percentile: 0–1; higher means fewer splits
- LenFunc: custom length function, default len()
Complete Example
package main

import (
	"context"

	"github.com/cloudwego/eino-ext/components/document/transformer/splitter/semantic"
	"github.com/cloudwego/eino/components/embedding"
	"github.com/cloudwego/eino/schema"
)

func main() {
	ctx := context.Background()

	// Placeholder: SomeEmbeddingImpl stands in for a concrete embedder
	// implementation (e.g. an OpenAI embedding component); substitute your own.
	embedder := &embedding.SomeEmbeddingImpl{}

	splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
		Embedding: embedder,
		BufferSize: 2,
		MinChunkSize: 100,
		Separators: []string{"\n", ".", "?", "!"},
		Percentile: 0.9,
	})
	if err != nil {
		panic(err)
	}

	docs := []*schema.Document{{
		ID: "doc1",
		Content: `This is the first paragraph with important info.
This is the second paragraph, semantically related to the first.
This is the third paragraph, the topic has changed.
This is the fourth paragraph, continuing the new topic.`,
	}}

	results, err := splitter.Transform(ctx, docs)
	if err != nil {
		panic(err)
	}

	// Print each resulting fragment.
	for i, doc := range results {
		println("fragment", i+1, ":", doc.Content)
	}
}
Advanced Usage
Custom length function:
splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
	Embedding: embedder,
	LenFunc: func(s string) int { return len([]rune(s)) }, // count Unicode code points (runes) instead of bytes
})
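To see why a custom length function matters for non-ASCII text, the standalone sketch below compares the byte count returned by the built-in len() with a rune count. The helper names byteLen and runeLen are illustrative only and are not part of the semantic package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteLen mirrors the built-in len(): it counts bytes.
func byteLen(s string) int { return len(s) }

// runeLen counts Unicode code points without allocating a []rune.
func runeLen(s string) int { return utf8.RuneCountInString(s) }

func main() {
	s := "语义分割" // four Chinese characters, three bytes each in UTF-8
	fmt.Println(byteLen(s), runeLen(s)) // prints "12 4"
}
```

A function with the same signature as runeLen could be passed as LenFunc; it returns the same value as len([]rune(s)) but avoids the intermediate slice.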
Adjust granularity:
splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
	Embedding: embedder,
	Percentile: 0.95, // fewer split points
	MinChunkSize: 200, // avoid too-small fragments
})
Optimize semantic judgment:
splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
	Embedding: embedder,
	BufferSize: 10, // more context
	Separators: []string{"\n\n", "\n", "。", "!", "?", ","}, // custom priority
})
Last modified December 11, 2025: feat(eino): sync zh documents (#1474) (958594401a)