Optimize Document Ingestion for Better Search

Learn chunking strategies and configuration techniques to improve search performance for your documents

When uploading documents to GoodMem, how you configure the ingestion process significantly impacts search performance. This guide shows you how to optimize document processing for better retrieval results.

Content Input Methods

The API provides two mutually exclusive ways to supply document content to GoodMem. Every request must include exactly one of these fields - not both, not neither.

Note: If you're using the CLI, you don't need to worry about this distinction - the CLI automatically handles file encoding when you use the --file parameter.

Plain Text Content (originalContent)

For text-based documents, provide content directly as a string:

{
  "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
  "originalContent": "This is plain text content that will be processed directly.",
  "contentType": "text/plain"
}

Use for: Text files, markdown, JSON, code files, or any content you can represent as a string.

Binary Content (originalContentB64)

For binary files like PDFs, images, or Word documents, encode the file as base64:

{
  "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
  "originalContentB64": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFI+Pg==",
  "contentType": "application/pdf"
}

Use for: PDFs, images (PNG, JPG), Word documents, or any binary file format.

Decision Guide

  • Text content you can read/edit? → Use originalContent
  • Binary file or need to upload a file? → Use originalContentB64 (base64-encode the file first)
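This decision can be sketched in a few lines of Python (a hypothetical helper, not part of any GoodMem SDK; the field names match the API requests shown in this guide):

```python
import base64

def build_payload(space_id, content, content_type):
    """Build a memory-creation body, choosing the correct content field.

    Text goes into originalContent as-is; bytes are base64-encoded
    into originalContentB64. Exactly one of the two fields is set.
    """
    payload = {"spaceId": space_id, "contentType": content_type}
    if isinstance(content, bytes):
        payload["originalContentB64"] = base64.b64encode(content).decode("ascii")
    else:
        payload["originalContent"] = content
    return payload

# Text input: goes straight into originalContent
text_body = build_payload("f77a8555-0232-4c01-a33e-4f0ca072905e",
                          "Hello world", "text/plain")

# Binary input: base64-encoded into originalContentB64
pdf_body = build_payload("f77a8555-0232-4c01-a33e-4f0ca072905e",
                         b"%PDF-1.4", "application/pdf")
```

Because the helper branches on the input type, it can never set both fields in one request.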

Comparison Examples

Here are both methods with identical chunking configuration; the only difference is the content field:

Plain Text Document:

curl -X POST "https://api.goodmem.ai/v1/memories" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
  "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
  "originalContent": "This is my document content with paragraphs.\n\nSecond paragraph here.",
  "contentType": "text/plain",
  "chunkingConfig": {
    "recursive": {
      "chunkSize": 512,
      "chunkOverlap": 64
    }
  }
}'

Binary PDF Document:

curl -X POST "https://api.goodmem.ai/v1/memories" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
  "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
  "originalContentB64": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFI+Pg==",
  "contentType": "application/pdf",
  "chunkingConfig": {
    "recursive": {
      "chunkSize": 512,
      "chunkOverlap": 64
    }
  }
}'

Note: In practice, you'd base64-encode your actual PDF file content. Use base64 yourfile.pdf to generate the base64 string.

Understanding Chunking Strategies

Chunking breaks large documents into smaller, focused segments that can be individually embedded and searched. This is crucial because:

  • Embedders have token limits (typically 512-8192 tokens)
  • Large text blocks produce generic embeddings that match poorly with specific queries
  • Right-sized chunks balance precision and context - too large and they become generic, too small and they lack enough context for meaningful embeddings

Current Defaults

  • API (REST and gRPC): Defaults to no chunking today (this may change in future versions)
  • CLI: Uses recursive chunking by default with battle-tested parameters

The CLI's default configuration has proven effective across diverse document types, so we recommend starting there.

The GoodMem CLI applies recursive chunking with these settings:

  • Strategy: Recursive (tries paragraphs → sentences → words → characters)
  • Chunk Size: 512 characters
  • Overlap: 64 characters
  • Separator Retention: Keep at end
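The recursive strategy can be illustrated with a short sketch (an illustration only, not GoodMem's actual implementation; it omits overlap and separator retention for brevity): split on the first separator, then re-split any piece that is still too large using the next separator in the list.

```python
def recursive_chunks(text, chunk_size=512, separators=("\n\n", "\n", " ", "")):
    """Split text with the first separator; re-split any piece that is
    still too large using the next separator in the list."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard split on character boundaries
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_chunks(piece, chunk_size, rest))
    return chunks

# Every resulting piece fits within chunk_size
pieces = recursive_chunks("a" * 10 + "\n\n" + "b" * 20, chunk_size=15)
```

The cascade is why recursive chunking handles mixed content well: tidy paragraphs split cleanly, while a single oversized run of text still gets broken down to fit.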

CLI Example

# CLI automatically applies optimal chunking
goodmem memory create \
  --space-id f77a8555-0232-4c01-a33e-4f0ca072905e \
  --file document.pdf

Equivalent REST API Configuration

To replicate the CLI's behavior in an API request (the CLI encodes files automatically; with the API you must base64-encode binary files yourself):

curl -X POST "https://api.goodmem.ai/v1/memories" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
  "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
  "originalContentB64": "<base64-encoded-content>",
  "contentType": "application/pdf",
  "chunkingConfig": {
    "recursive": {
      "chunkSize": 512,
      "chunkOverlap": 64,
      "separators": ["\n\n", "\n", " ", ""],
      "keepStrategy": "KEEP_END",
      "separatorIsRegex": false,
      "lengthMeasurement": "CHARACTER_COUNT"
    }
  }
}'

Document-Specific Optimizations

Text Documents and PDFs

For most text-heavy documents, use larger chunks to preserve context:

{
  "chunkingConfig": {
    "recursive": {
      "chunkSize": 1024,
      "chunkOverlap": 128,
      "separators": ["\n\n", "\n", " ", ""],
      "keepStrategy": "KEEP_END",
      "separatorIsRegex": false,
      "lengthMeasurement": "CHARACTER_COUNT"
    }
  }
}

Code Documentation

For technical documentation with clear section breaks:

{
  "chunkingConfig": {
    "recursive": {
      "chunkSize": 1500,
      "chunkOverlap": 200,
      "separators": ["\n## ", "\n### ", "\n\n", "\n"],
      "keepStrategy": "KEEP_START",
      "separatorIsRegex": false,
      "lengthMeasurement": "CHARACTER_COUNT"
    }
  }
}

Small Files

For documents under 500 characters, disable chunking entirely:

{
  "chunkingConfig": {
    "none": {}
  }
}
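One way to apply this rule programmatically (a hypothetical helper; the 500-character threshold and the recursive defaults come from this guide):

```python
def choose_chunking_config(text, small_file_limit=500):
    """Skip chunking for small documents; otherwise fall back to the
    CLI-style recursive defaults described in this guide."""
    if len(text) < small_file_limit:
        return {"none": {}}
    return {"recursive": {"chunkSize": 512, "chunkOverlap": 64}}

config = choose_chunking_config("a short note")  # -> {"none": {}}
```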

Token Limits and Embedder Considerations

Different embedders have different token limits:

  Embedder                         Typical Token Limit   Recommended Chunk Size
  OpenAI text-embedding-3-small    8192 tokens           1000-1500 characters
  Snowflake Arctic                 512 tokens            400-600 characters
  Sentence Transformers            512 tokens            400-600 characters

Character to Token Ratio: Roughly 4 characters = 1 token for English text.
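You can sanity-check a chunk size against this heuristic (a rough estimate only; real tokenizers vary by model and language):

```python
def estimated_tokens(char_count, chars_per_token=4):
    """Rough token estimate for English text (~4 characters per token)."""
    return char_count // chars_per_token

# 600-character chunks stay comfortably under a 512-token limit:
# estimated_tokens(600) -> 150 tokens
```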

Checking Your Embedder

# List your embedders to check token limits
goodmem embedder list

Chunking Strategy Reference

Recursive Chunking

  • Best for: General documents, mixed content types, most PDFs
  • How it works: Tries separators in order of preference
  • Default separators: ["\n\n", "\n", " ", ""] (paragraphs → sentences → words → characters)

Sentence Chunking

  • Best for: Experimentation with semantic boundary preservation
  • How it works: Respects sentence boundaries using language detection
  • Advantages: Preserves complete thoughts and semantic coherence

No Chunking

  • Best for: Short documents, single concepts, structured data
  • Trade-offs: Preserves full context but may produce generic embeddings

Troubleshooting Search Performance

Content Input Errors

If you receive errors during document upload:

"Either originalContent or originalContentB64 must be provided"

  • You forgot to include either field
  • Both fields are empty or contain only whitespace

"Provide either originalContent or originalContentB64, but not both"

  • You included both fields in the same request
  • Remove one field - use originalContent for text, originalContentB64 for binary

Base64 decoding errors

  • Your originalContentB64 contains invalid base64 data
  • Ensure you're encoding the file correctly: base64 yourfile.pdf
  • Check for extra whitespace or line breaks in the base64 string
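Before sending a request, you can verify the string decodes cleanly using Python's standard library (a quick local check, unrelated to GoodMem itself):

```python
import base64

def is_valid_b64(s):
    """Return True only if s is strict base64: no stray whitespace,
    line breaks, or non-alphabet characters."""
    try:
        base64.b64decode(s, validate=True)
        return True
    except ValueError:
        return False

print(is_valid_b64("JVBERi0xLjQ="))    # True
print(is_valid_b64("JVBERi0xLjQ=\n"))  # False: trailing newline
```

Passing validate=True makes the decoder reject exactly the kind of stray whitespace and line breaks described above, instead of silently ignoring them.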

No Search Results

If your queries return empty results:

  1. Check content input method - ensure you used the correct field for your content type

  2. Check if chunking is configured (especially when using the API directly)

  3. Verify document processing status:

    goodmem memory get <memory-id>

    Look for Status: COMPLETED

  4. Test with CLI defaults to establish a baseline:

    goodmem memory create --space-id <space-id> --file <your-file>

Poor Result Relevance

If results seem unrelated:

  1. Try smaller chunk sizes (512-1024 characters)
  2. Switch to sentence chunking for better semantic boundaries
  3. Increase chunk overlap (20-25% of chunk size)

Example: Large PDF Workflow

Here's a complete workflow for a 60-page PDF:

1. Upload with Optimized Chunking

curl -X POST "https://api.goodmem.ai/v1/memories" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
  "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
  "originalContentB64": "<base64-pdf>",
  "contentType": "application/pdf",
  "chunkingConfig": {
    "recursive": {
      "chunkSize": 1500,
      "chunkOverlap": 200,
      "separators": ["\n\n", "\n", " ", ""],
      "keepStrategy": "KEEP_END",
      "separatorIsRegex": false,
      "lengthMeasurement": "CHARACTER_COUNT"
    }
  }
}'

2. Verify Processing

goodmem memory get <memory-id>

3. Test Retrieval

goodmem memory retrieve \
  --space-id f77a8555-0232-4c01-a33e-4f0ca072905e \
  "your specific question here"

Best Practices Summary

  1. Start with CLI defaults - they work well across diverse content types
  2. Match chunk size to your embedder's token limits
  3. Use sentence chunking for academic/structured content
  4. Increase chunk size for longer documents (1024-1500 characters)
  5. Add 20-25% overlap to preserve context across chunk boundaries
  6. Disable chunking only for very small files (< 500 characters)
  7. Test and iterate - different content types may need different strategies

Next Steps