Optimize Document Ingestion for Better Search

Learn chunking strategies and configuration techniques to improve search performance for your documents

How you configure ingestion when you upload documents to GoodMem has a significant effect on search performance. This guide shows how to choose a content input method and chunking settings that give better retrieval results.

Content Input Methods

The API accepts document content in three ways. For JSON requests, you must set exactly one of originalContent or originalContentB64. For multipart requests, omit both and send the file as a binary part.

Note: If you're using the CLI, you don't need to worry about this distinction - the CLI automatically handles file encoding when you use the --file parameter.

Plain Text Content (originalContent)

For text-based documents, provide content directly as a string:

{
  "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
  "originalContent": "This is plain text content that will be processed directly.",
  "contentType": "text/plain"
}

Use for: Text files, markdown, JSON, code files, or any content you can represent as a string.

Binary Content (originalContentB64)

For binary files like PDFs, images, or Word documents, encode the file as base64 when using JSON requests:

{
  "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
  "originalContentB64": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFI+Pg==",
  "contentType": "application/pdf"
}

Use for: PDFs, images (PNG, JPG), Word documents, or any binary file format when you need a JSON-only request.
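
As a minimal sketch of preparing such a request body in Python: the helper name build_b64_request is hypothetical, and only the field names spaceId, originalContentB64, and contentType come from the examples above. It relies on base64.b64encode, which returns a single line with no wrapping.

```python
import base64
import json


def build_b64_request(space_id: str, data: bytes, content_type: str) -> dict:
    """Return a JSON-serializable body carrying `data` as base64 text.

    Hypothetical helper for illustration; not part of any GoodMem SDK.
    """
    # b64encode returns bytes with no line breaks; decode to str for JSON.
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "spaceId": space_id,
        "originalContentB64": encoded,
        "contentType": content_type,
    }


if __name__ == "__main__":
    body = build_b64_request(
        "f77a8555-0232-4c01-a33e-4f0ca072905e",
        b"%PDF-1.4\n",  # stand-in bytes; read a real file with open(path, "rb")
        "application/pdf",
    )
    print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized with json.dumps and sent as the body of a JSON request.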

Direct File Upload (multipart/form-data)

For binary files when you want to avoid base64, send a multipart request with a JSON request part and a binary file part.

  • The request part contains all CreateMemoryRequest fields except originalContent and originalContentB64.
  • A single-memory create request requires exactly one file part. fileField is optional when there is only one file; use it when you want to name the file part explicitly or when multiple files are present.
  • contentType is optional if the file part includes a Content-Type header; if both are missing, the request fails.
  • If metadata.filename is missing, it is populated from the uploaded filename.

curl -X POST "https://api.goodmem.ai/v1/memories" \
  -H "x-api-key: <your-api-key>" \
  -F 'file=@document.pdf;type=application/pdf' \
  -F 'request={"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","metadata":{"source":"upload"},"chunkingConfig":{"recursive":{"chunkSize":512,"chunkOverlap":64}}};type=application/json'

http -f POST "https://api.goodmem.ai/v1/memories" \
  x-api-key:"<your-api-key>" \
  'file@document.pdf;type=application/pdf' \
  request:='{"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","metadata":{"source":"upload"},"chunkingConfig":{"recursive":{"chunkSize":512,"chunkOverlap":64}}}'

For batch uploads, send a requests JSON array and one file part per request. When multiple files are included, each request must include fileField to map to the corresponding file part name. If there is only one request and one file, fileField is optional.

curl -X POST "https://api.goodmem.ai/v1/memories:batchCreate" \
  -H "x-api-key: <your-api-key>" \
  -F 'file0=@<first-file>.pdf;type=application/pdf' \
  -F 'file1=@<second-file>.pdf;type=application/pdf' \
  -F 'requests=[{"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","fileField":"file0"},{"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","fileField":"file1"}];type=application/json'

http -f POST "https://api.goodmem.ai/v1/memories:batchCreate" \
  x-api-key:"<your-api-key>" \
  'file0@<first-file>.pdf;type=application/pdf' \
  'file1@<second-file>.pdf;type=application/pdf' \
  requests:='[{"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","fileField":"file0"},{"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","fileField":"file1"}]'
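
The mapping between file parts and requests can be sketched in Python as follows. This only builds the data structures and sends nothing; the helper name plan_batch_upload is hypothetical, and the file0, file1 naming simply follows the examples above.

```python
def plan_batch_upload(
    space_id: str, paths: list[str], content_type: str
) -> tuple[dict[str, str], list[dict]]:
    """Pair each file with a multipart part name and a request that names it.

    Hypothetical helper for illustration; it does not perform any upload.
    """
    parts: dict[str, str] = {}   # multipart part name -> local file path
    requests: list[dict] = []    # entries for the `requests` JSON array
    for index, path in enumerate(paths):
        part_name = f"file{index}"
        parts[part_name] = path
        requests.append({
            "spaceId": space_id,
            "contentType": content_type,
            "fileField": part_name,  # must match the part name exactly
        })
    return parts, requests


if __name__ == "__main__":
    parts, requests = plan_batch_upload(
        "f77a8555-0232-4c01-a33e-4f0ca072905e",
        ["report.pdf", "appendix.pdf"],  # example file names
        "application/pdf",
    )
    print(parts)
    print(requests)
```

Generating the part name and the fileField value from the same variable avoids the mismatch that produces a "No file part named '...'" error.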

Decision Guide

  • Text content you can read/edit? → Use originalContent
  • Binary file and using JSON? → Use originalContentB64 (base64-encode the file first)
  • Want to upload a file directly? → Use multipart/form-data
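
The decision guide above can be expressed as a small Python function. This is an illustrative sketch; the name choose_input_method and the json_only flag are assumptions, not part of the GoodMem API.

```python
from typing import Union


def choose_input_method(content: Union[str, bytes], json_only: bool = True) -> str:
    """Return the field or request type suggested by the decision guide."""
    if isinstance(content, str):
        # Text you can read or edit goes in as a plain string.
        return "originalContent"
    if json_only:
        # Binary data in a JSON request must be base64-encoded first.
        return "originalContentB64"
    # Binary data sent as a file part, with no base64 step.
    return "multipart/form-data"


if __name__ == "__main__":
    print(choose_input_method("# Notes\nSome markdown"))
    print(choose_input_method(b"%PDF-1.4"))
    print(choose_input_method(b"%PDF-1.4", json_only=False))
```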

Comparison Examples

The three examples below use the same chunking configuration, so the only difference between them is how the content is supplied:

Plain Text Document:

curl -X POST "https://api.goodmem.ai/v1/memories" \
  -H "x-api-key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
  "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
  "originalContent": "This is my document content with paragraphs.\n\nSecond paragraph here.",
  "contentType": "text/plain",
  "chunkingConfig": {
    "recursive": {
      "chunkSize": 512,
      "chunkOverlap": 64
    }
  }
}'

http POST "https://api.goodmem.ai/v1/memories" \
  x-api-key:"<your-api-key>" \
  spaceId="f77a8555-0232-4c01-a33e-4f0ca072905e" \
  originalContent=$'This is my document content with paragraphs.\n\nSecond paragraph here.' \
  contentType="text/plain" \
  chunkingConfig:='{"recursive":{"chunkSize":512,"chunkOverlap":64}}'

Binary PDF Document:

curl -X POST "https://api.goodmem.ai/v1/memories" \
  -H "x-api-key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
  "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
  "originalContentB64": "JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFI+Pg==",
  "contentType": "application/pdf",
  "chunkingConfig": {
    "recursive": {
      "chunkSize": 512,
      "chunkOverlap": 64
    }
  }
}'

http POST "https://api.goodmem.ai/v1/memories" \
  x-api-key:"<your-api-key>" \
  spaceId="f77a8555-0232-4c01-a33e-4f0ca072905e" \
  originalContentB64="JVBERi0xLjQKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFI+Pg==" \
  contentType="application/pdf" \
  chunkingConfig:='{"recursive":{"chunkSize":512,"chunkOverlap":64}}'

Binary PDF Document (Multipart Upload, no base64):

curl -X POST "https://api.goodmem.ai/v1/memories" \
  -H "x-api-key: <your-api-key>" \
  -F 'file=@document.pdf;type=application/pdf' \
  -F 'request={"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","chunkingConfig":{"recursive":{"chunkSize":512,"chunkOverlap":64}}};type=application/json'

http -f POST "https://api.goodmem.ai/v1/memories" \
  x-api-key:"<your-api-key>" \
  'file@document.pdf;type=application/pdf' \
  request:='{"spaceId":"f77a8555-0232-4c01-a33e-4f0ca072905e","contentType":"application/pdf","chunkingConfig":{"recursive":{"chunkSize":512,"chunkOverlap":64}}}'

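Note: The base64 string in the JSON examples is a short sample. For real uploads, base64-encode your actual PDF, for example with base64 yourfile.pdf. GNU base64 wraps its output at 76 characters by default, so use base64 -w 0 yourfile.pdf to get a single line that can be pasted into a JSON string.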

Understanding Chunking Strategies

Chunking breaks large documents into smaller, focused segments that can be individually embedded and searched. This is crucial because:

  • Embedder models have token limits (typically 512-8192 tokens)
  • Large text blocks produce generic embeddings that match poorly with specific queries
  • Right-sized chunks balance precision and context - too large and they become generic, too small and they lack enough context for meaningful embeddings

Current Defaults

  • API (REST and gRPC): Defaults to no chunking today (this may change in future versions)
  • CLI: Uses recursive chunking by default

The CLI's default configuration has proven effective across diverse document types, so we recommend starting there.

The GoodMem CLI uses recursive chunking with these settings:

  • Strategy: Recursive (tries paragraphs → lines → words → characters)
  • Chunk Size: 512 characters
  • Overlap: 64 characters
  • Separator Retention: Keep at end

CLI Example

# The CLI applies its default chunking settings automatically
goodmem memory create \
  --space-id f77a8555-0232-4c01-a33e-4f0ca072905e \
  --file document.pdf

Equivalent REST API Configuration

To replicate the CLI's behavior in API requests, pass the same settings explicitly in chunkingConfig. The CLI encodes files for you; with the API you must base64-encode binary files yourself unless you use a multipart upload:

curl -X POST "https://api.goodmem.ai/v1/memories" \
  -H "x-api-key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
  "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
  "originalContentB64": "<base64-encoded-content>",
  "contentType": "application/pdf",
  "chunkingConfig": {
    "recursive": {
      "chunkSize": 512,
      "chunkOverlap": 64,
      "separators": ["\\n\\n", "\\n", " ", ""],
      "keepStrategy": "KEEP_END",
      "separatorIsRegex": false,
      "lengthMeasurement": "CHARACTER_COUNT"
    }
  }
}'

http POST "https://api.goodmem.ai/v1/memories" \
  x-api-key:"<your-api-key>" \
  spaceId="f77a8555-0232-4c01-a33e-4f0ca072905e" \
  originalContentB64="<base64-encoded-content>" \
  contentType="application/pdf" \
  chunkingConfig:='{"recursive":{"chunkSize":512,"chunkOverlap":64,"separators":["\\n\\n","\\n"," ",""],"keepStrategy":"KEEP_END","separatorIsRegex":false,"lengthMeasurement":"CHARACTER_COUNT"}}'

Document-Specific Optimizations

Text Documents and PDFs

For most text-heavy documents, use larger chunks to preserve context:

{
  "chunkingConfig": {
    "recursive": {
      "chunkSize": 1024,
      "chunkOverlap": 128,
      "separators": ["\\n\\n", "\\n", " ", ""],
      "keepStrategy": "KEEP_END",
      "separatorIsRegex": false,
      "lengthMeasurement": "CHARACTER_COUNT"
    }
  }
}

Code Documentation

For technical documentation with clear section breaks:

{
  "chunkingConfig": {
    "recursive": {
      "chunkSize": 1500,
      "chunkOverlap": 200,
      "separators": ["\\n## ", "\\n### ", "\\n\\n", "\\n"],
      "keepStrategy": "KEEP_START",
      "separatorIsRegex": false,
      "lengthMeasurement": "CHARACTER_COUNT"
    }
  }
}

Small Files

For documents under 500 characters, disable chunking entirely:

{
  "chunkingConfig": {
    "none": {}
  }
}
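
The three cases above can be folded into one helper. The sketch below is illustrative: the function name and the kind labels "text" and "code_docs" are assumptions, while the sizes, overlaps, and the 500-character threshold are the values from this section. It returns only chunkSize and chunkOverlap; merge in separators and keepStrategy from the examples above as needed.

```python
def pick_chunking_config(doc_chars: int, kind: str = "text") -> dict:
    """Choose a chunkingConfig following the guidance in this section.

    `kind` is a hypothetical label: "code_docs" for technical documentation
    with clear section breaks, anything else for general text and PDFs.
    """
    if doc_chars < 500:
        # Small files: keep the document as a single unit.
        return {"chunkingConfig": {"none": {}}}
    if kind == "code_docs":
        size, overlap = 1500, 200
    else:
        size, overlap = 1024, 128
    return {
        "chunkingConfig": {
            "recursive": {"chunkSize": size, "chunkOverlap": overlap}
        }
    }


if __name__ == "__main__":
    print(pick_chunking_config(320))
    print(pick_chunking_config(48_000))
    print(pick_chunking_config(48_000, kind="code_docs"))
```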

Token Limits and Embedder Considerations

Different embedders have different token limits:

| Embedder | Typical Token Limit | Recommended Chunk Size |
| --- | --- | --- |
| OpenAI text-embedding-3-small | 8192 tokens | 1000-1500 characters |
| Snowflake Arctic | 512 tokens | 400-600 characters |
| Sentence Transformers | 512 tokens | 400-600 characters |

Character to Token Ratio: Roughly 4 characters = 1 token for English text.
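
A quick way to sanity-check a chunk size against an embedder limit is to apply the 4-characters-per-token rule of thumb. The Python sketch below does only that arithmetic; the function names are hypothetical, and a real tokenizer can give noticeably different counts, especially for code or non-English text.

```python
import math

CHARS_PER_TOKEN = 4  # rough ratio for English text, as stated above


def estimate_tokens(char_count: int) -> int:
    """Estimate the token count of a chunk from its character count."""
    return math.ceil(char_count / CHARS_PER_TOKEN)


def fits_embedder(chunk_chars: int, token_limit: int) -> bool:
    """Return True if the estimated token count is within the limit."""
    return estimate_tokens(chunk_chars) <= token_limit


if __name__ == "__main__":
    for chars in (512, 1024, 1500):
        print(chars, "characters is roughly", estimate_tokens(chars), "tokens")
```

By this estimate a 512-character chunk is about 128 tokens, so the recommended sizes in the table leave a wide margin below each embedder's limit.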

Checking Your Embedder

# List your embedders to check token limits
goodmem embedder list

Chunking Strategy Reference

Recursive Chunking

  • Best for: General documents, mixed content types, most PDFs
  • How it works: Tries separators in order of preference
  • Default separators: ["\\n\\n", "\\n", " ", ""] (paragraphs → lines → words → characters)
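
To make "tries separators in order of preference" concrete, here is a simplified Python sketch of the general recursive-splitting idea. It is not GoodMem's implementation: it ignores chunkOverlap, measures length in characters, and keeps each separator at the end of the preceding piece, which is an assumption about what a keep-at-end strategy means.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Coarser separators are tried first; a piece that is still too large is
    split again with the next, finer separator. Overlap is not handled.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    # Re-attach the separator to the end of every piece except the last.
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if len(current) + len(piece) <= chunk_size:
            current += piece  # still fits: keep packing the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # Too large on its own: recurse with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph is a bit longer.\n\nThird."
    for chunk in recursive_split(sample, 40, ["\n\n", "\n", " ", ""]):
        print(repr(chunk))
```

Because separators are kept, concatenating the chunks reproduces the input exactly, which makes the sketch easy to check.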

Sentence Chunking

  • Best for: Experimentation with semantic boundary preservation
  • How it works: Respects sentence boundaries using language detection
  • Advantages: Preserves complete thoughts and semantic coherence

No Chunking

  • Best for: Short documents, single concepts, structured data
  • Trade-offs: Preserves full context but may produce generic embeddings

Troubleshooting Search Performance

Content Input Errors

If you receive errors during document upload:

For application/json requests:

  • "Either originalContent or originalContentB64 must be provided": You forgot to include either field, or both fields are empty.
  • "Provide either originalContent or originalContentB64, but not both": You included both fields in the same request.
  • Base64 decoding errors: Your originalContentB64 contains invalid base64 data. Ensure you're encoding the file correctly: base64 yourfile.pdf. Check for extra whitespace or line breaks in the base64 string.
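
These JSON-request errors can be caught before sending. The Python sketch below mirrors the three rules above on the client side; the function name and the wording of the returned messages are hypothetical and are not the server's own messages.

```python
import base64


def check_json_content_fields(request: dict) -> list[str]:
    """Return a list of problems with the content fields of a JSON request."""
    problems: list[str] = []
    text = request.get("originalContent")
    b64 = request.get("originalContentB64")

    if not text and not b64:
        problems.append("set originalContent or originalContentB64")
    elif text and b64:
        problems.append("set only one of originalContent and originalContentB64")

    if b64:
        try:
            # validate=True rejects whitespace, line breaks, and other
            # characters outside the base64 alphabet.
            base64.b64decode(b64, validate=True)
        except ValueError:  # binascii.Error is a subclass of ValueError
            problems.append("originalContentB64 is not valid base64")
    return problems


if __name__ == "__main__":
    print(check_json_content_fields({"originalContentB64": "aGk=\n"}))
```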

For multipart/form-data requests:

  • "Missing multipart field: request" / "requests": The JSON part is missing or was sent as a file part. Send it as a form field.
  • "File part is required": The binary file part is missing.
  • "fileField is required when multiple files are uploaded": Add fileField to each request when uploading multiple files.
  • "No file part named '...'": Ensure fileField matches the file part name.
  • "originalContent is not allowed for multipart uploads": Remove originalContent from the JSON part.
  • "originalContentB64 is not allowed for multipart uploads": Remove originalContentB64 from the JSON part.

No Search Results

If your queries return empty results:

  1. Check content input method - ensure you used the correct field for your content type

  2. Check if chunking is configured (especially when using the API directly)

  3. Verify document processing status:

    goodmem memory get <memory-id>

    Look for Status: COMPLETED

  4. Test with CLI defaults to establish a baseline:

    goodmem memory create --space-id <space-id> --file <your-file>

Poor Result Relevance

If results seem unrelated:

  1. Try smaller chunk sizes (512-1024 characters)
  2. Switch to sentence chunking for better semantic boundaries
  3. Increase chunk overlap (20-25% of chunk size)
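
To see what an overlap setting means in practice, the Python sketch below uses plain fixed-size windows, a simpler technique than recursive chunking, to show where each chunk would start. The function names are hypothetical.

```python
def overlap_for(chunk_size: int, ratio: float) -> int:
    """Return the overlap in characters for a given fraction of chunk size."""
    return round(chunk_size * ratio)


def window_starts(text_len: int, chunk_size: int, chunk_overlap: int) -> list[int]:
    """Start offsets of fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if text_len <= 0:
        return []
    step = chunk_size - chunk_overlap  # each window advances by this much
    starts = []
    start = 0
    while True:
        starts.append(start)
        if start + chunk_size >= text_len:
            break  # this window already reaches the end of the text
        start += step
    return starts


if __name__ == "__main__":
    print(64 / 512)                      # overlap fraction of the CLI defaults
    print(overlap_for(1024, 0.25))       # overlap for 25% of a 1024 chunk
    print(window_starts(1000, 512, 64))  # window offsets for 1000 characters
```

A larger overlap shortens the step between windows, so the same document yields more chunks and more embeddings to store and search.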

Example: Large PDF Workflow

Here's a complete workflow for a 60-page PDF:

1. Upload with Optimized Chunking

curl -X POST "https://api.goodmem.ai/v1/memories" \
  -H "x-api-key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
  "spaceId": "f77a8555-0232-4c01-a33e-4f0ca072905e",
  "originalContentB64": "<base64-pdf>",
  "contentType": "application/pdf",
  "chunkingConfig": {
    "recursive": {
      "chunkSize": 1500,
      "chunkOverlap": 200,
      "separators": ["\\n\\n", "\\n", " ", ""],
      "keepStrategy": "KEEP_END",
      "separatorIsRegex": false,
      "lengthMeasurement": "CHARACTER_COUNT"
    }
  }
}'

2. Verify Processing

goodmem memory get <memory-id>

3. Test Retrieval

goodmem memory retrieve \
  --space-id f77a8555-0232-4c01-a33e-4f0ca072905e \
  "your specific question here"

Best Practices Summary

  1. Start with CLI defaults - they work well across diverse content types
  2. Match chunk size to your embedder's token limits
  3. Use sentence chunking for academic/structured content
  4. Increase chunk size for longer documents (1024-1500 characters)
  5. Raise overlap toward 20-25% of the chunk size when context is lost across chunk boundaries (the CLI default of 64 on 512 is 12.5%)
  6. Disable chunking only for very small files (< 500 characters)
  7. Test and iterate - different content types may need different strategies
