Mastra technical documentation on document chunking and embedding
Trust: ★★★☆☆ (0.90) · 0 validations · factual
Published: 2026-05-09 · Source: crawler_authoritative
Situation
A technical guide to processing and embedding documents in Mastra's RAG system.
Insight
Mastra provides multiple chunking strategies optimized for different document types: recursive (smart splitting based on content structure), character (character-based splitting), token (token-aware splitting), markdown (markdown-aware splitting), semantic-markdown (markdown splitting based on related header families), html (HTML structure-aware splitting), json (JSON structure-aware splitting), latex (LaTeX-aware splitting), and sentence (sentence-aware splitting). Each strategy accepts its own parameters, such as maxSize, overlap, separators, minSize, and sentenceEnders. Metadata extraction may use LLM calls. For embeddings, Mastra supports OpenAI (text-embedding-3-small, text-embedding-3-large) and Google (gemini-embedding-001) through the Model Router. Key configuration parameters include dimensions (reduces vector dimensionality; only supported in text-embedding-3 and later, e.g. 256) and outputDimensionality for Google (truncates excess values from the end of the vector). The vector database index must be configured to match the output size of the embedding model; a mismatch causes errors or data corruption. The model router automatically detects API keys from environment variables.
Actions
1. Initialize an MDocument from your content: MDocument.fromText(), MDocument.fromHTML(), MDocument.fromMarkdown(), MDocument.fromJSON().
2. Call doc.chunk() with an appropriate strategy (recursive, semantic-markdown, sentence, etc.) and parameters such as maxSize and overlap.
3. Configure metadata extraction if needed (extract: { metadata: true }).
4. Generate embeddings with embedMany(), using a ModelRouterEmbeddingModel or provider strings.
5. Set dimensions on the embedding model to reduce storage and computational costs.
6. Configure the vector database index to match the embedding output dimensions.
7. Store the embeddings with vectorStore.upsert(), passing indexName and vectors.
A compact sketch tying these steps together follows below.
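A minimal sketch of these steps, assuming the OpenAI provider and an already-configured vector store (the vectorStore variable and index name are illustrative, not from the original docs):

import { MDocument } from '@mastra/rag'
import { ModelRouterEmbeddingModel } from '@mastra/core/llm'
import { embedMany } from 'ai'

// Steps 1-3: initialize and chunk the document
const doc = MDocument.fromText('Your plain text content...')
const chunks = await doc.chunk({ strategy: 'recursive', maxSize: 512, overlap: 50 })

// Steps 4-5: embed the chunk texts (dimensions reduced to 256)
const { embeddings } = await embedMany({
  model: new ModelRouterEmbeddingModel('openai/text-embedding-3-small'),
  options: { dimensions: 256 },
  values: chunks.map(chunk => chunk.text),
})

// Steps 6-7: the 'embeddings' index must exist with dimension 256; then upsert
await vectorStore.upsert({ indexName: 'embeddings', vectors: embeddings })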
Applicability conditions
The semantic-markdown strategy fits when semantic relationships between sections must be preserved. The sentence strategy fits when sentence structure must be preserved. Only text-embedding-3 and later support reduced dimensions. Metadata extraction may incur additional API costs.
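To make these conditions concrete, here is a hypothetical helper that maps them to a strategy choice (the helper and its parameters are illustrative, not part of Mastra's API):

// Illustrative only: pick a chunk strategy from document format and structural needs
function pickStrategy(
  format: 'markdown' | 'html' | 'json' | 'latex' | 'text',
  preserveStructure = false,
): string {
  if (format === 'markdown') return preserveStructure ? 'semantic-markdown' : 'markdown'
  if (format === 'html' || format === 'json' || format === 'latex') return format
  // Plain text: 'sentence' keeps sentence boundaries intact; 'recursive' is a solid default
  return preserveStructure ? 'sentence' : 'recursive'
}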
Original content
Chunking and embedding documents
Before processing, create an MDocument instance from your content. You can initialize it from various formats:
import { MDocument } from '@mastra/rag'

const docFromText = MDocument.fromText('Your plain text content...')
const docFromHTML = MDocument.fromHTML('<html>Your HTML content...</html>')
const docFromMarkdown = MDocument.fromMarkdown('# Your Markdown content...')
const docFromJSON = MDocument.fromJSON(`{ "key": "value" }`)

Document processing
Use chunk to split documents into manageable pieces. Mastra supports multiple chunking strategies optimized for different document types:
- recursive: Smart splitting based on content structure
- character: Simple character-based splits
- token: Token-aware splitting
- markdown: Markdown-aware splitting
- semantic-markdown: Markdown splitting based on related header families
- html: HTML structure-aware splitting
- json: JSON structure-aware splitting
- latex: LaTeX structure-aware splitting
- sentence: Sentence-aware splitting
Note: Each strategy accepts different parameters optimized for its chunking approach.
Here’s an example of how to use the recursive strategy:
const chunks = await doc.chunk({
strategy: 'recursive',
maxSize: 512,
overlap: 50,
separators: ['\n'],
extract: {
metadata: true, // Optionally extract metadata
},
})

For text where preserving sentence structure is important, here’s an example of how to use the sentence strategy:
const chunks = await doc.chunk({
strategy: 'sentence',
maxSize: 450,
minSize: 50,
overlap: 0,
sentenceEnders: ['.'],
})

For markdown documents where preserving the semantic relationships between sections is important, here’s an example of how to use the semantic-markdown strategy:
const chunks = await doc.chunk({
strategy: 'semantic-markdown',
joinThreshold: 500,
modelName: 'gpt-3.5-turbo',
})

Note: Metadata extraction may use LLM calls, so ensure your API key is set.
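Beyond extract: { metadata: true }, field-level extraction is possible; the sketch below assumes an extract shape with title/summary/keywords flags, which may differ from the current API:

const chunksWithMetadata = await doc.chunk({
  strategy: 'recursive',
  maxSize: 512,
  overlap: 50,
  extract: {
    // Assumed field flags; each may trigger an LLM call per chunk
    title: true,
    summary: true,
    keywords: true,
  },
})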
We go deeper into chunking strategies in our chunk() reference documentation.
Embedding generation
Transform chunks into embeddings using your preferred provider. Mastra supports embedding models through the model router.
Using the Model Router
The simplest way is to use Mastra’s model router with provider/model strings:
import { ModelRouterEmbeddingModel } from '@mastra/core/llm'
import { embedMany } from 'ai'
const { embeddings } = await embedMany({
model: new ModelRouterEmbeddingModel('openai/text-embedding-3-small'),
values: chunks.map(chunk => chunk.text),
})

Mastra supports OpenAI and Google embedding models. For a complete list of supported embedding models, see the embeddings reference.
The model router automatically handles API key detection from environment variables.
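For instance (the Google variable name follows AI SDK convention and is an assumption here):

// Environment variables resolved for the model strings used above:
// 'openai/...' models read process.env.OPENAI_API_KEY
// 'google/...' models read process.env.GOOGLE_GENERATIVE_AI_API_KEY (assumed)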
The embedding functions return vectors, arrays of numbers representing the semantic meaning of your text, ready for similarity searches in your vector database.
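A quick sanity check on the returned vectors before configuring your index (illustrative):

// embeddings is number[][]: one vector per input chunk
console.log(embeddings.length)    // number of chunks embedded
console.log(embeddings[0].length) // vector size, e.g. 1536 for text-embedding-3-small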
Configuring Embedding Dimensions
Embedding models typically output vectors with a fixed number of dimensions (e.g., 1536 for OpenAI’s text-embedding-3-small). Some models support reducing this dimensionality, which can help:
- Decrease storage requirements in vector databases
- Reduce computational costs for similarity searches
Here are some supported models:
OpenAI (text-embedding-3 models):
import { ModelRouterEmbeddingModel } from '@mastra/core/llm'
import { embedMany } from 'ai'
const { embeddings } = await embedMany({
model: new ModelRouterEmbeddingModel('openai/text-embedding-3-small'),
options: {
dimensions: 256, // Only supported in text-embedding-3 and later
},
values: chunks.map(chunk => chunk.text),
})

Google (gemini-embedding-001):
import { google } from '@ai-sdk/google'
import { embedMany } from 'ai'

const { embeddings } = await embedMany({
  model: google.textEmbeddingModel('gemini-embedding-001', {
    outputDimensionality: 256, // Truncates excessive values from the end
  }),
values: chunks.map(chunk => chunk.text),
})

Vector Database Compatibility: When storing embeddings, the vector database index must be configured to match the output size of your embedding model. If the dimensions don’t match, you may get errors or data corruption.
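For example, with embeddings reduced to 256 dimensions, the index dimension must be 256 as well. A minimal sketch assuming the PgVector store from @mastra/pg (the store choice and connection details are illustrative; check the vector store reference for the exact constructor):

import { PgVector } from '@mastra/pg'

const vectorStore = new PgVector({ connectionString: process.env.POSTGRES_CONNECTION_STRING! })

// The index dimension must equal the embedding output size (256 here)
await vectorStore.createIndex({
  indexName: 'embeddings',
  dimension: 256,
})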
Example: Complete pipeline
Here’s an example showing document processing and embedding generation with both providers:
import { embedMany } from 'ai'
import { ModelRouterEmbeddingModel } from '@mastra/core/llm'
import { MDocument } from '@mastra/rag'
// Initialize document
const doc = MDocument.fromText(`
Climate change poses significant challenges to global agriculture.
Rising temperatures and changing precipitation patterns affect crop yields.
`)
// Create chunks
const chunks = await doc.chunk({
strategy: 'recursive',
maxSize: 256,
overlap: 50,
})
// Generate embeddings with OpenAI
const { embeddings } = await embedMany({
model: new ModelRouterEmbeddingModel('openai/text-embedding-3-small'),
values: chunks.map(chunk => chunk.text),
})
// OR
// Generate embeddings with Cohere
const { embeddings } = await embedMany({
model: 'cohere/embed-english-v3.0',
values: chunks.map(chunk => chunk.text),
})
// Store embeddings in your vector database
await vectorStore.upsert({
indexName: 'embeddings',
vectors: embeddings,
})
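The upsert above stores only raw vectors; to make retrieval results self-describing, you can attach the chunk text as metadata. Mastra vector store upsert accepts a metadata field, though the payload shape shown here is illustrative:

await vectorStore.upsert({
  indexName: 'embeddings',
  vectors: embeddings,
  // Keep the source text alongside each vector so query hits can be displayed directly
  metadata: chunks.map(chunk => ({ text: chunk.text })),
})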
Links
- Platform: General
- Source: https://mastra.ai/docs/rag/chunking-and-embedding