Optimize Document Ingestion for Better Search
Learn chunking strategies and configuration techniques to improve search performance for your documents
When uploading documents to GoodMem, how you configure the ingestion process significantly impacts search performance. This guide shows you how to optimize document processing for better retrieval results.
Content Input Methods
The API accepts document content through two mutually exclusive fields. You must set exactly one of them - not both, not neither.
Note: If you're using the CLI, you don't need to worry about this distinction - the CLI automatically handles file encoding when you use the --file parameter.
Plain Text Content (originalContent)
For text-based documents, provide content directly as a string:
```json
{
  "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
  "originalContent": "This is plain text content that will be processed directly.",
  "contentType": "text/plain"
}
```
Use for: Text files, markdown, JSON, code files, or any content you can represent as a string.
Binary Content (originalContentB64)
For binary files like PDFs, images, or Word documents, encode the file as base64:
```json
{
  "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
  "originalContentB64": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFI+Pg==",
  "contentType": "application/pdf"
}
```
Use for: PDFs, images (PNG, JPG), Word documents, or any binary file format.
Decision Guide
- Text content you can read/edit? → Use `originalContent`
- Binary file or need to upload a file? → Use `originalContentB64` (base64-encode the file first)
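If you are calling the API from a script, the encoding step is small. A minimal sketch in Python (the `%PDF-1.4` bytes stand in for real file content):

```python
import base64

def encode_for_upload(data: bytes) -> str:
    """Base64-encode raw bytes for the originalContentB64 field."""
    return base64.b64encode(data).decode("ascii")

# In practice you would read the file first:
#   encode_for_upload(open("yourfile.pdf", "rb").read())
print(encode_for_upload(b"%PDF-1.4"))  # JVBERi0xLjQ=
```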
Comparison Examples
Here are both methods with an identical chunking configuration, so the only difference is the content field:
Plain Text Document:
```bash
curl -X POST "https://api.goodmem.ai/v1/memories" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
    "originalContent": "This is my document content with paragraphs.\n\nSecond paragraph here.",
    "contentType": "text/plain",
    "chunkingConfig": {
      "recursive": {
        "chunkSize": 512,
        "chunkOverlap": 64
      }
    }
  }'
```
Binary PDF Document:
```bash
curl -X POST "https://api.goodmem.ai/v1/memories" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
    "originalContentB64": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFI+Pg==",
    "contentType": "application/pdf",
    "chunkingConfig": {
      "recursive": {
        "chunkSize": 512,
        "chunkOverlap": 64
      }
    }
  }'
```
Note: In practice, you'd base64-encode your actual PDF file content. Use `base64 yourfile.pdf` to generate the base64 string.
Understanding Chunking Strategies
Chunking breaks large documents into smaller, focused segments that can be individually embedded and searched. This is crucial because:
- Embedder models have token limits (typically 512-8192 tokens)
- Large text blocks produce generic embeddings that match poorly with specific queries
- Right-sized chunks balance precision and context - too large and they become generic, too small and they lack enough context for meaningful embeddings
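The mechanics can be illustrated with a simplified character-window chunker (a sketch of sliding windows with overlap, not GoodMem's actual implementation, which also respects separators):

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` chars."""
    stride = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), stride)]

doc = "x" * 1200
print(len(chunk_text(doc, 512, 64)))    # 3 chunks, each sharing 64 chars with its neighbor
print(len(chunk_text(doc, 1024, 128)))  # 2 larger, more context-heavy chunks
```

Halving the chunk size roughly doubles the chunk count, trading context for more precise, query-specific embeddings.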
Current Defaults
- API (REST and gRPC): Defaults to no chunking today (this may change in future versions)
- CLI: Uses recursive chunking by default with battle-tested parameters
The CLI's default configuration has proven effective across diverse document types, so we recommend starting there.
Recommended Starting Point: CLI Defaults
The GoodMem CLI uses recursive chunking with these proven settings:
- Strategy: Recursive (tries paragraphs → lines → words → characters)
- Chunk Size: 512 characters
- Overlap: 64 characters
- Separator Retention: Keep at end
CLI Example
```bash
# CLI automatically applies optimal chunking
goodmem memory create \
  --space-id f77a8555-0232-4c01-a33e-4f0ca072905e \
  --file document.pdf
```
Equivalent REST API Configuration
To replicate CLI behavior in API requests (note that CLI automatically handles file encoding, while the API requires manual content handling for binary files):
```bash
curl -X POST "https://api.goodmem.ai/v1/memories" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
    "originalContentB64": "<base64-encoded-content>",
    "contentType": "application/pdf",
    "chunkingConfig": {
      "recursive": {
        "chunkSize": 512,
        "chunkOverlap": 64,
        "separators": ["\\n\\n", "\\n", " ", ""],
        "keepStrategy": "KEEP_END",
        "separatorIsRegex": false,
        "lengthMeasurement": "CHARACTER_COUNT"
      }
    }
  }'
```
Document-Specific Optimizations
Text Documents and PDFs
For most text-heavy documents, use larger chunks to preserve context:
```json
{
  "chunkingConfig": {
    "recursive": {
      "chunkSize": 1024,
      "chunkOverlap": 128,
      "separators": ["\\n\\n", "\\n", " ", ""],
      "keepStrategy": "KEEP_END",
      "separatorIsRegex": false,
      "lengthMeasurement": "CHARACTER_COUNT"
    }
  }
}
```
Code Documentation
For technical documentation with clear section breaks:
```json
{
  "chunkingConfig": {
    "recursive": {
      "chunkSize": 1500,
      "chunkOverlap": 200,
      "separators": ["\\n## ", "\\n### ", "\\n\\n", "\\n"],
      "keepStrategy": "KEEP_START",
      "separatorIsRegex": false,
      "lengthMeasurement": "CHARACTER_COUNT"
    }
  }
}
```
Small Files
For documents under 500 characters, disable chunking entirely:
```json
{
  "chunkingConfig": {
    "none": {}
  }
}
```
Token Limits and Embedder Considerations
Different embedders have different token limits:
| Embedder | Typical Token Limit | Recommended Chunk Size |
|---|---|---|
| OpenAI text-embedding-3-small | 8192 tokens | 1000-1500 characters |
| Snowflake Arctic | 512 tokens | 400-600 characters |
| Sentence Transformers | 512 tokens | 400-600 characters |
Character to Token Ratio: Roughly 4 characters = 1 token for English text.
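With that ratio you can sanity-check a chunk size against a token limit. A rough heuristic (the 0.8 safety margin is an assumption to absorb tokenizer variance, not a GoodMem setting):

```python
def max_chunk_chars(token_limit: int, chars_per_token: float = 4.0,
                    safety_margin: float = 0.8) -> int:
    """Upper bound on chunkSize (in characters) that should fit a token limit."""
    return int(token_limit * chars_per_token * safety_margin)

print(max_chunk_chars(512))   # 1638 - a hard ceiling for 512-token embedders
print(max_chunk_chars(8192))  # 26214
```

The table's recommended sizes sit well below these ceilings because mid-sized chunks usually embed more distinctly, not because larger chunks would be rejected.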
Checking Your Embedder
```bash
# List your embedders to check token limits
goodmem embedder list
```
Chunking Strategy Reference
Recursive Chunking
- Best for: General documents, mixed content types, most PDFs
- How it works: Tries separators in order of preference
- Default separators: `["\\n\\n", "\\n", " ", ""]` (paragraphs → lines → words → characters)
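The cascade can be sketched in a few lines (illustrative only - the real splitter also merges small pieces back toward chunkSize and re-attaches separators per keepStrategy):

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split text with the first separator; re-split any oversized piece with the next."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # Out of separators: fall back to a hard character split.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    out: list[str] = []
    for piece in pieces:
        out.extend(recursive_split(piece, rest, chunk_size))
    return out

text = "First paragraph.\n\nA much longer second paragraph here."
print(recursive_split(text, ["\n\n", "\n", " ", ""], 24))
# ['First paragraph.', 'A', 'much', 'longer', 'second', 'paragraph', 'here.']
```

Note how the short first paragraph survives intact while the oversized one falls through to word-level splits.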
Sentence Chunking
- Best for: Experimentation with semantic boundary preservation
- How it works: Respects sentence boundaries using language detection
- Advantages: Preserves complete thoughts and semantic coherence
No Chunking
- Best for: Short documents, single concepts, structured data
- Trade-offs: Preserves full context but may produce generic embeddings
Troubleshooting Search Performance
Content Input Errors
If you receive errors during document upload:
"Either originalContent or originalContentB64 must be provided"
- You forgot to include either field
- Both fields are empty or contain only whitespace
"Provide either originalContent or originalContentB64, but not both"
- You included both fields in the same request
- Remove one field: use `originalContent` for text, `originalContentB64` for binary
Base64 decoding errors
- Your `originalContentB64` contains invalid base64 data
- Ensure you're encoding the file correctly: `base64 yourfile.pdf`
- Check for extra whitespace or line breaks in the base64 string
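You can catch malformed base64 locally before making the request. A small Python check (`validate=True` makes the decoder reject whitespace and other non-alphabet characters):

```python
import base64

def is_valid_b64(s: str) -> bool:
    """True only if s is strict, padding-correct base64 with no stray characters."""
    try:
        base64.b64decode(s, validate=True)
        return True
    except ValueError:  # binascii.Error is a ValueError subclass
        return False

print(is_valid_b64("JVBERi0xLjQ="))    # True
print(is_valid_b64("JVBERi0xLjQ=\n"))  # False: trailing newline
```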
No Search Results
If your queries return empty results:
- Check content input method - ensure you used the correct field for your content type
- Check if chunking is configured (especially when using the API directly)
- Verify document processing status with `goodmem memory get <memory-id>` and look for `Status: COMPLETED`
- Test with CLI defaults to establish a baseline: `goodmem memory create --space-id <space-id> --file <your-file>`
Poor Result Relevance
If results seem unrelated:
- Try smaller chunk sizes (512-1024 characters)
- Switch to sentence chunking for better semantic boundaries
- Increase chunk overlap (20-25% of chunk size)
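These rules of thumb translate directly into config values. A quick sketch (the 2,500 characters per page figure is an assumption for a text-heavy PDF, not a GoodMem constant):

```python
import math

def overlap_for(chunk_size: int, fraction: float = 0.2) -> int:
    """Suggested chunkOverlap at 20% of chunkSize."""
    return int(chunk_size * fraction)

def estimated_chunks(doc_chars: int, chunk_size: int, overlap: int) -> int:
    """Approximate chunk count for overlapping windows over a document."""
    stride = chunk_size - overlap
    return max(1, math.ceil((doc_chars - overlap) / stride))

print(overlap_for(1024))                       # 204
print(estimated_chunks(60 * 2500, 1024, 204))  # 183 chunks for a ~60-page PDF
```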
Example: Large PDF Workflow
Here's a complete workflow for a 60-page PDF:
1. Upload with Optimized Chunking
```bash
curl -X POST "https://api.goodmem.ai/v1/memories" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
    "originalContentB64": "<base64-pdf>",
    "contentType": "application/pdf",
    "chunkingConfig": {
      "recursive": {
        "chunkSize": 1500,
        "chunkOverlap": 200,
        "separators": ["\\n\\n", "\\n", " ", ""],
        "keepStrategy": "KEEP_END",
        "separatorIsRegex": false,
        "lengthMeasurement": "CHARACTER_COUNT"
      }
    }
  }'
```
2. Verify Processing
```bash
goodmem memory get <memory-id>
```
3. Test Retrieval
```bash
goodmem memory retrieve \
  --space-id f77a8555-0232-4c01-a33e-4f0ca072905e \
  "your specific question here"
```
Best Practices Summary
- Start with CLI defaults - they work well across diverse content types
- Match chunk size to your embedder's token limits
- Use sentence chunking for academic/structured content
- Increase chunk size for longer documents (1024-1500 characters)
- Add 20-25% overlap to preserve context across chunk boundaries
- Disable chunking only for very small files (< 500 characters)
- Test and iterate - different content types may need different strategies
Next Steps
- Post Processors Reference - Optimize search result processing
- CLI Reference - Complete command documentation
- CLI Memory Create - Advanced CLI chunking options
- API Reference - REST and gRPC specifications
- ChunkingConfig Proto - Complete protocol buffer specification