Optimize Document Ingestion for Better Search
Learn chunking strategies and configuration techniques to improve search performance for your documents
When uploading documents to GoodMem, how you configure the ingestion process significantly impacts search performance. This guide shows you how to optimize document processing for better retrieval results.
Content Input Methods
The API offers three ways to supply document content to GoodMem. For JSON requests, include exactly one of originalContent or originalContentB64. For multipart requests, omit both and send the file as a binary part.
Note: If you're using the CLI, you don't need to worry about this distinction - the CLI automatically handles file encoding when you use the --file parameter.
Plain Text Content (originalContent)
For text-based documents, provide content directly as a string:
{
"spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
"originalContent": "This is plain text content that will be processed directly.",
"contentType": "text/plain"
}

Use for: Text files, markdown, JSON, code files, or any content you can represent as a string.
Binary Content (originalContentB64)
For binary files like PDFs, images, or Word documents, encode the file as base64 when using JSON requests:
{
"spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
"originalContentB64": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFI+Pg==",
"contentType": "application/pdf"
}

Use for: PDFs, images (PNG, JPG), Word documents, or any binary file format when you need a JSON-only request.
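When scripting JSON uploads, the base64 step is easy to get wrong by hand. Here is a minimal Python sketch of building the request body for a binary file; the function name is ours, but the field names (spaceId, originalContentB64, contentType) come from the API described above:

```python
import base64
import json

def build_binary_memory_request(space_id, file_path, content_type):
    """Build a CreateMemoryRequest JSON body for a binary file.

    Reads the file, base64-encodes its bytes, and returns the JSON
    string you would POST to /v1/memories.
    """
    with open(file_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({
        "spaceId": space_id,
        "originalContentB64": encoded,
        "contentType": content_type,
    })
```

Because b64encode works on the raw bytes, the same helper works for any binary format, not just PDFs.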
Direct File Upload (multipart/form-data)
For binary files when you want to avoid base64, send a multipart request with a JSON request part and a binary file part.
- The request part contains all CreateMemoryRequest fields except originalContent and originalContentB64.
- Exactly one file part is required for single create.
- fileField is optional when there is only one file; use it when you want to name the file part explicitly or when multiple files are present.
- contentType is optional if the file part includes a Content-Type header; if both are missing, the request fails.
- If metadata.filename is missing, it is populated from the uploaded filename.
curl -X POST "https://api.goodmem.ai/v1/memories" \
-H "x-api-key: <your-api-key>" \
-F '[email protected];type=application/pdf' \
-F 'request={"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","metadata":{"source":"upload"},"chunkingConfig":{"recursive":{"chunkSize":512,"chunkOverlap":64}}};type=application/json'

http -f POST "https://api.goodmem.ai/v1/memories" \
x-api-key:"<your-api-key>" \
[email protected];type=application/pdf \
request:='{"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","metadata":{"source":"upload"},"chunkingConfig":{"recursive":{"chunkSize":512,"chunkOverlap":64}}}'

For batch uploads, send a requests JSON array and one file part per request. When multiple files are included, each request must include fileField to map to the corresponding file part name. If there is only one request and one file, fileField is optional.
curl -X POST "https://api.goodmem.ai/v1/memories:batchCreate" \
-H "x-api-key: <your-api-key>" \
-F '[email protected];type=application/pdf' \
-F '[email protected];type=application/pdf' \
-F 'requests=[{"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","fileField":"file0"},{"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","fileField":"file1"}];type=application/json'

http -f POST "https://api.goodmem.ai/v1/memories:batchCreate" \
x-api-key:"<your-api-key>" \
[email protected];type=application/pdf \
[email protected];type=application/pdf \
requests:='[{"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","fileField":"file0"},{"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","fileField":"file1"}]'

Decision Guide
- Text content you can read/edit? → Use originalContent
- Binary file and using JSON? → Use originalContentB64 (base64-encode the file first)
- Want to upload a file directly? → Use multipart/form-data
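The decision guide can be expressed as a tiny helper. This is purely illustrative (the function name and its string labels are ours, not part of the API):

```python
def choose_input_method(is_binary, json_only):
    """Pick a content input method following the decision guide."""
    if not is_binary:
        return "originalContent"        # readable text goes in directly
    if json_only:
        return "originalContentB64"     # base64-encode binary for JSON-only requests
    return "multipart/form-data"        # upload the file directly, no base64
```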
Comparison Examples
Here are all three methods with identical chunking configuration; the only difference between them is how the content is supplied:
Plain Text Document:
curl -X POST "https://api.goodmem.ai/v1/memories" \
-H "x-api-key: <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
"originalContent": "This is my document content with paragraphs.\n\nSecond paragraph here.",
"contentType": "text/plain",
"chunkingConfig": {
"recursive": {
"chunkSize": 512,
"chunkOverlap": 64
}
}
}'

http POST "https://api.goodmem.ai/v1/memories" \
x-api-key:"<your-api-key>" \
spaceId="f77a8555-0232-4c01-a33e-4f0ca072905e" \
originalContent="This is my document content with paragraphs.\n\nSecond paragraph here." \
contentType="text/plain" \
chunkingConfig:='{"recursive":{"chunkSize":512,"chunkOverlap":64}}'

Binary PDF Document:
curl -X POST "https://api.goodmem.ai/v1/memories" \
-H "x-api-key: <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
"originalContentB64": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFI+Pg==",
"contentType": "application/pdf",
"chunkingConfig": {
"recursive": {
"chunkSize": 512,
"chunkOverlap": 64
}
}
}'

http POST "https://api.goodmem.ai/v1/memories" \
x-api-key:"<your-api-key>" \
spaceId="f77a8555-0232-4c01-a33e-4f0ca072905e" \
originalContentB64="JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFI+Pg==" \
contentType="application/pdf" \
chunkingConfig:='{"recursive":{"chunkSize":512,"chunkOverlap":64}}'

Binary PDF Document (Multipart Upload, no base64):
curl -X POST "https://api.goodmem.ai/v1/memories" \
-H "x-api-key: <your-api-key>" \
-F '[email protected];type=application/pdf' \
-F 'request={"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","chunkingConfig":{"recursive":{"chunkSize":512,"chunkOverlap":64}}};type=application/json'

http -f POST "https://api.goodmem.ai/v1/memories" \
x-api-key:"<your-api-key>" \
[email protected];type=application/pdf \
request:='{"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","chunkingConfig":{"recursive":{"chunkSize":512,"chunkOverlap":64}}}'

Note: In practice, you'd base64-encode your actual PDF file content for JSON uploads. Use base64 yourfile.pdf to generate the base64 string.
Understanding Chunking Strategies
Chunking breaks large documents into smaller, focused segments that can be individually embedded and searched. This is crucial because:
- Embedder models have token limits (typically 512-8192 tokens)
- Large text blocks produce generic embeddings that match poorly with specific queries
- Right-sized chunks balance precision and context - too large and they become generic, too small and they lack enough context for meaningful embeddings
Current Defaults
- API (REST and gRPC): Defaults to no chunking today (this may change in future versions)
- CLI: Uses recursive chunking by default with battle-tested parameters
The CLI's default configuration has proven effective across diverse document types, so we recommend starting there.
Recommended Starting Point: CLI Defaults
The GoodMem CLI uses recursive chunking with these proven settings:
- Strategy: Recursive (tries paragraphs → sentences → words → characters)
- Chunk Size: 512 characters
- Overlap: 64 characters
- Separator Retention: Keep at end
CLI Example
# CLI automatically applies optimal chunking
goodmem memory create \
--space-id f77a8555-0232-4c01-a33e-4f0ca072905e \
--file document.pdf

Equivalent REST API Configuration
To replicate CLI behavior in API requests (note that CLI automatically handles file encoding, while the API requires manual content handling for binary files unless you use multipart uploads):
curl -X POST "https://api.goodmem.ai/v1/memories" \
-H "x-api-key: <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
"originalContentB64": "<base64-encoded-content>",
"contentType": "application/pdf",
"chunkingConfig": {
"recursive": {
"chunkSize": 512,
"chunkOverlap": 64,
"separators": ["\\n\\n", "\\n", " ", ""],
"keepStrategy": "KEEP_END",
"separatorIsRegex": false,
"lengthMeasurement": "CHARACTER_COUNT"
}
}
}'

http POST "https://api.goodmem.ai/v1/memories" \
x-api-key:"<your-api-key>" \
spaceId="f77a8555-0232-4c01-a33e-4f0ca072905e" \
originalContentB64="<base64-encoded-content>" \
contentType="application/pdf" \
chunkingConfig:='{"recursive":{"chunkSize":512,"chunkOverlap":64,"separators":["\\n\\n","\\n"," ",""],"keepStrategy":"KEEP_END","separatorIsRegex":false,"lengthMeasurement":"CHARACTER_COUNT"}}'

Document-Specific Optimizations
Text Documents and PDFs
For most text-heavy documents, use larger chunks to preserve context:
{
"chunkingConfig": {
"recursive": {
"chunkSize": 1024,
"chunkOverlap": 128,
"separators": ["\\n\\n", "\\n", " ", ""],
"keepStrategy": "KEEP_END",
"separatorIsRegex": false,
"lengthMeasurement": "CHARACTER_COUNT"
}
}
}

Code Documentation
For technical documentation with clear section breaks:
{
"chunkingConfig": {
"recursive": {
"chunkSize": 1500,
"chunkOverlap": 200,
"separators": ["\\n## ", "\\n### ", "\\n\\n", "\\n"],
"keepStrategy": "KEEP_START",
"separatorIsRegex": false,
"lengthMeasurement": "CHARACTER_COUNT"
}
}
}

Small Files
For documents under 500 characters, disable chunking entirely:
{
"chunkingConfig": {
"none": {}
}
}

Token Limits and Embedder Considerations
Different embedders have different token limits:
| Embedder | Typical Token Limit | Recommended Chunk Size |
|---|---|---|
| OpenAI text-embedding-3-small | 8192 tokens | 1000-1500 characters |
| Snowflake Arctic | 512 tokens | 400-600 characters |
| Sentence Transformers | 512 tokens | 400-600 characters |
Character to Token Ratio: Roughly 4 characters = 1 token for English text.
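Using that rough 4-characters-per-token rule, you can sanity-check whether a chunk size will stay under your embedder's limit. A sketch (the function names are ours, and the heuristic is approximate and English-centric):

```python
def estimated_tokens(chunk_size_chars, chars_per_token=4):
    """Rough token estimate for a chunk, using ~4 chars per English token."""
    return chunk_size_chars / chars_per_token

def fits_embedder(chunk_size_chars, token_limit, chars_per_token=4):
    """True if a chunk of this size should stay under the embedder's token limit."""
    return estimated_tokens(chunk_size_chars, chars_per_token) <= token_limit
```

For example, a 2,048-character chunk is roughly 512 tokens, right at the limit of a 512-token embedder, which is why the table recommends staying well below that at 400-600 characters.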
Checking Your Embedder
# List your embedders to check token limits
goodmem embedder list

Chunking Strategy Reference
Recursive Chunking
- Best for: General documents, mixed content types, most PDFs
- How it works: Tries separators in order of preference
- Default separators: ["\\n\\n", "\\n", " ", ""] (paragraphs → sentences → words → characters)
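To build intuition for how separator preference works, here is a simplified Python sketch. It is not GoodMem's implementation: it ignores chunk overlap and keepStrategy, and only models the "try separators in order, fall back to finer ones" idea.

```python
def recursive_split(text, chunk_size=512, separators=("\n\n", "\n", " ", "")):
    """Simplified separator-preference splitter (illustration only).

    Splits on the first separator; any piece still exceeding chunk_size
    is re-split with the remaining, finer separators. The empty-string
    separator is a hard character-level fallback.
    """
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard split on character boundaries.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, buf = [], ""
    for part in text.split(sep):
        candidate = buf + sep + part if buf else part
        if len(candidate) <= chunk_size:
            buf = candidate                # keep accumulating into this chunk
        else:
            if buf:
                chunks.append(buf)
            if len(part) > chunk_size:
                # This piece alone is too big: descend to finer separators.
                chunks.extend(recursive_split(part, chunk_size, rest))
                buf = ""
            else:
                buf = part
    if buf:
        chunks.append(buf)
    return chunks
```

On a document with paragraph breaks, this keeps whole paragraphs together until a single paragraph exceeds the chunk size, then falls back to lines, words, and finally raw characters.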
Sentence Chunking
- Best for: Experimentation with semantic boundary preservation
- How it works: Respects sentence boundaries using language detection
- Advantages: Preserves complete thoughts and semantic coherence
No Chunking
- Best for: Short documents, single concepts, structured data
- Trade-offs: Preserves full context but may produce generic embeddings
Troubleshooting Search Performance
Content Input Errors
If you receive errors during document upload:
For application/json requests:
- "Either originalContent or originalContentB64 must be provided": You forgot to include either field, or both fields are empty.
- "Provide either originalContent or originalContentB64, but not both": You included both fields in the same request.
- Base64 decoding errors: Your originalContentB64 contains invalid base64 data. Ensure you're encoding the file correctly: base64 yourfile.pdf. Check for extra whitespace or line breaks in the base64 string.
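You can catch malformed base64 locally before sending a request. A sketch using Python's standard library (validate=True rejects any character outside the base64 alphabet, including the embedded newlines that some encoders insert):

```python
import base64
import binascii

def is_valid_b64(s):
    """Return True if s is clean base64 with no stray whitespace or line breaks."""
    try:
        base64.b64decode(s, validate=True)
        return True
    except binascii.Error:
        return False
```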
For multipart/form-data requests:
- "Missing multipart field: request" / "requests": The JSON part is missing or was sent as a file part. Send it as a form field.
- "File part is required": The binary file part is missing.
- "fileField is required when multiple files are uploaded": Add fileField to each request when uploading multiple files.
- "No file part named '...'": Ensure fileField matches the file part name.
- "originalContent is not allowed for multipart uploads": Remove originalContent from the JSON part.
- "originalContentB64 is not allowed for multipart uploads": Remove originalContentB64 from the JSON part.
No Search Results
If your queries return empty results:
- Check content input method - ensure you used the correct field for your content type
- Check if chunking is configured (especially when using the API directly)
- Verify document processing status:

goodmem memory get <memory-id>

Look for Status: COMPLETED
- Test with CLI defaults to establish a baseline:

goodmem memory create --space-id <space-id> --file <your-file>
Poor Result Relevance
If results seem unrelated:
- Try smaller chunk sizes (512-1024 characters)
- Switch to sentence chunking for better semantic boundaries
- Increase chunk overlap (20-25% of chunk size)
Example: Large PDF Workflow
Here's a complete workflow for a 60-page PDF:
1. Upload with Optimized Chunking
curl -X POST "https://api.goodmem.ai/v1/memories" \
-H "x-api-key: <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
"originalContentB64": "<base64-pdf>",
"contentType": "application/pdf",
"chunkingConfig": {
"recursive": {
"chunkSize": 1500,
"chunkOverlap": 200,
"separators": ["\\n\\n", "\\n", " ", ""],
"keepStrategy": "KEEP_END",
"separatorIsRegex": false,
"lengthMeasurement": "CHARACTER_COUNT"
}
}
}'

2. Verify Processing
goodmem memory get <memory-id>

3. Test Retrieval
goodmem memory retrieve \
--space-id f77a8555-0232-4c01-a33e-4f0ca072905e \
"your specific question here"

Best Practices Summary
- Start with CLI defaults - they work well across diverse content types
- Match chunk size to your embedder's token limits
- Use sentence chunking for academic/structured content
- Increase chunk size for longer documents (1024-1500 characters)
- Add 20-25% overlap to preserve context across chunk boundaries
- Disable chunking only for very small files (< 500 characters)
- Test and iterate - different content types may need different strategies
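These heuristics can be combined into a small starting-point calculator. Everything here is a rule-of-thumb sketch: the function names are ours, the thresholds mirror the bullets and the embedder table in this guide, and the returned dict follows the chunkingConfig shape used throughout.

```python
def recommended_chunk_size(token_limit):
    """Midpoint of the recommended ranges from the embedder table above."""
    return 500 if token_limit <= 512 else 1200   # 400-600 vs 1000-1500 chars

def suggest_chunking_config(doc_length_chars, embedder_token_limit=512):
    """Suggest a starting chunkingConfig from the best-practice heuristics."""
    if doc_length_chars < 500:
        return {"none": {}}                      # very small files: skip chunking
    chunk_size = recommended_chunk_size(embedder_token_limit)
    overlap = chunk_size // 4                    # ~25% overlap across boundaries
    return {"recursive": {"chunkSize": chunk_size, "chunkOverlap": overlap}}
```

Treat the output as a baseline to test and iterate from, not a final answer; different content types may still need different strategies.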
Next Steps
- Post Processors Reference - Optimize search result processing
- CLI Reference - Complete command documentation
- CLI Memory Create - Advanced CLI chunking options
- API Reference - REST and gRPC specifications
- ChunkingConfig Reference - Complete protocol buffer specification