Work with PDF Page Images

Extract, monitor, list, and fetch PDF page images for previews and document viewers.

GoodMem can extract per-page images for PDF memories. This is useful when you want:

  • Page previews in a document viewer
  • Thumbnail strips in the console or a custom UI
  • A visual fallback when text extraction is incomplete
  • A way to link retrieved chunks back to the pages they came from

This guide covers the full REST-facing workflow:

  1. Request page-image extraction at ingest time
  2. Check whether page images are ready
  3. List available page-image renditions
  4. Fetch one page image
  5. Understand how retrieved chunks point back to their source pages

Before You Start

  • GoodMem server running (default REST address: https://localhost:8080)
  • API key with permission to create and read memories
  • A PDF file to upload

For PDF rendering, GoodMem selects a suitable page-rendering engine automatically in the default configuration. For server-side tuning details, see Server Runtime Footprint.

If you are calling the REST API directly, set:

export GOODMEM_REST_URL="https://localhost:8080"
export GOODMEM_API_KEY="gm_your_key"

For local development with the default self-signed certificate, add -k to curl and --verify=no to HTTPie.
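
To avoid repeating the header and TLS flag in every call, the settings above can be wrapped in a small shell function. This is only a sketch: gm_curl and GOODMEM_INSECURE are hypothetical names introduced here for illustration, not part of GoodMem.

```shell
# Hypothetical helper (not part of GoodMem): call a REST path with the API key header.
# GOODMEM_INSECURE is an assumed variable name; set it to 1 only for local
# development with the default self-signed certificate.
gm_curl() {
  local path="$1"
  shift
  if [ "${GOODMEM_INSECURE:-0}" = "1" ]; then
    set -- -k "$@"   # prepend -k to skip certificate verification
  fi
  curl -sS -H "x-api-key: $GOODMEM_API_KEY" "$@" "$GOODMEM_REST_URL$path"
}
```

With this in place, a call such as gm_curl "/v1/memories/$MEMORY_ID" behaves like the longer curl commands in this guide. Extra curl options can be passed after the path.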

Request Page Images at Ingest Time

Page-image extraction is controlled per memory.

  • REST and SDKs: set extractPageImages: true
  • CLI: enabled by default for eligible file uploads; use --no-extract-page-images to opt out

CLI:

goodmem memory create \
  --space-id "$SPACE_ID" \
  --file document.pdf

cURL:

curl -sS -k -X POST "$GOODMEM_REST_URL/v1/memories" \
  -H "x-api-key: $GOODMEM_API_KEY" \
  -F 'file=@document.pdf;type=application/pdf' \
  -F 'request={"spaceId":"'"$SPACE_ID"'","contentType":"application/pdf","extractPageImages":true};type=application/json'

HTTPie:

http --verify=no -f POST "$GOODMEM_REST_URL/v1/memories" \
  x-api-key:"$GOODMEM_API_KEY" \
  'file@document.pdf;type=application/pdf' \
  request:='{"spaceId":"'"$SPACE_ID"'","contentType":"application/pdf","extractPageImages":true}'

The response returns the created memory. Save its memoryId; you will use it for the page-image and memory-status calls below.

Check Whether Page Images Are Ready

Page-image extraction is tracked separately from the memory's main processing status. A memory can finish chunking and embedding while page images are still processing, and an image-only PDF can produce page images even if text extraction fails.

Check the memory:

cURL:

curl -sS -k "$GOODMEM_REST_URL/v1/memories/$MEMORY_ID" \
  -H "x-api-key: $GOODMEM_API_KEY" | jq

HTTPie:

http --verify=no GET "$GOODMEM_REST_URL/v1/memories/$MEMORY_ID" \
  x-api-key:"$GOODMEM_API_KEY"

Look at these fields on the memory:

  • pageImageStatus
  • pageImageCount

Expected status values:

  • PENDING
  • PROCESSING
  • COMPLETED
  • FAILED

Treat pageImageStatus == COMPLETED and pageImageCount > 0 as the signal that page images are ready to fetch.
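
The readiness check can be turned into a polling loop. The following is a minimal sketch, assuming curl and jq are installed and the environment variables from this guide are set. The function names, the attempt budget, the delay, and the return codes are illustrative choices, not GoodMem behavior.

```shell
# Print "<pageImageStatus> <pageImageCount>" for one memory.
# Assumes curl, jq, GOODMEM_REST_URL, and GOODMEM_API_KEY are available.
fetch_page_image_status() {
  curl -sS -k "$GOODMEM_REST_URL/v1/memories/$1" \
    -H "x-api-key: $GOODMEM_API_KEY" |
    jq -r '"\(.pageImageStatus) \(.pageImageCount // 0)"'
}

# Poll until page images are ready.
# Returns 0 when ready, 1 when extraction failed or produced no images,
# 2 when the attempt budget runs out.
wait_for_page_images() {
  local memory_id="$1" max_attempts="${2:-30}" delay_seconds="${3:-2}"
  local attempt=1 result status count
  while [ "$attempt" -le "$max_attempts" ]; do
    result=$(fetch_page_image_status "$memory_id") || result=""
    status=${result%% *}   # text before the first space
    count=${result#* }     # text after the first space
    case "$count" in
      ''|*[!0-9]*) count=0 ;;   # treat anything non-numeric as zero
    esac
    case "$status" in
      COMPLETED)
        if [ "$count" -gt 0 ]; then return 0; else return 1; fi ;;
      FAILED)
        return 1 ;;
    esac
    sleep "$delay_seconds"
    attempt=$(( attempt + 1 ))
  done
  return 2
}
```

For example, wait_for_page_images "$MEMORY_ID" 60 1 checks once per second for up to 60 attempts. PENDING and PROCESSING both fall through to the next attempt.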

pageImageCount counts stored page-image renditions, not logical PDF pages. If every page has exactly one rendition, the two numbers usually match. If some pages have multiple renditions, pageImageCount is higher than the page count a reader sees.

List Available Page Images

Use the page-image listing endpoint to discover which renditions exist for a memory.

Endpoint:

  • GET /v1/memories/{id}/pages

Basic request:

cURL:

curl -sS -k "$GOODMEM_REST_URL/v1/memories/$MEMORY_ID/pages" \
  -H "x-api-key: $GOODMEM_API_KEY" | jq

HTTPie:

http --verify=no GET "$GOODMEM_REST_URL/v1/memories/$MEMORY_ID/pages" \
  x-api-key:"$GOODMEM_API_KEY"

The response looks like this:

{
  "pageImages": [
    {
      "memoryId": "550e8400-e29b-41d4-a716-446655440000",
      "pageIndex": 0,
      "dpi": 150,
      "contentType": "image/png",
      "imageContentLength": 281233,
      "imageContentSha256": "2d711642b726b04401627ca9fbac32f5c8530fb1903cc4db02258717921a4881",
      "createdAt": 1714762260000,
      "updatedAt": 1714762260000
    }
  ],
  "nextToken": "..."
}

Notes:

  • pageIndex is 0-based
  • one logical page can have more than one rendition
  • renditions are distinguished by dpi and contentType
  • nextToken is opaque; if present, pass it back unchanged
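
Following nextToken until it disappears can be sketched as a loop. This assumes curl and jq are installed and that the token is sent back through the nextToken query parameter. All three function names are hypothetical.

```shell
# Fetch one page of the listing as JSON. $1 = memory ID, $2 = nextToken (may be empty).
# curl -G appends --data-urlencode values to the URL as a query string,
# so the opaque token is passed back unchanged but safely encoded.
gm_list_pages_page() {
  if [ -n "${2:-}" ]; then
    curl -sS -k -G "$GOODMEM_REST_URL/v1/memories/$1/pages" \
      -H "x-api-key: $GOODMEM_API_KEY" \
      --data-urlencode "nextToken=$2"
  else
    curl -sS -k "$GOODMEM_REST_URL/v1/memories/$1/pages" \
      -H "x-api-key: $GOODMEM_API_KEY"
  fi
}

# Read one JSON response on stdin and print its nextToken, or nothing if absent.
gm_next_token() {
  jq -r '.nextToken // empty'
}

# Print every response of the listing, one JSON document per request.
list_all_page_images() {
  local memory_id="$1" token="" page
  while :; do
    page=$(gm_list_pages_page "$memory_id" "$token") || return 1
    printf '%s\n' "$page"
    token=$(printf '%s' "$page" | gm_next_token)
    [ -n "$token" ] || break   # no token means this was the last page
  done
}
```

Piping the output through jq -c '.pageImages[]' yields one rendition per line, which is convenient for building a thumbnail strip.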

Filter the List

Supported query parameters:

  • startPageIndex
  • endPageIndex
  • dpi
  • contentType
  • maxResults
  • nextToken

Snake_case aliases are also accepted:

  • start_page_index
  • end_page_index
  • content_type
  • max_results
  • next_token

Example: list only the renditions for pageIndex 2 (the third page):

curl -sS -k \
  "$GOODMEM_REST_URL/v1/memories/$MEMORY_ID/pages?startPageIndex=2&endPageIndex=2" \
  -H "x-api-key: $GOODMEM_API_KEY" | jq
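
Several filters can presumably be sent in one request. The sketch below assumes the parameters combine and that the index range is inclusive, as the single-page example suggests. The function name list_page_images_filtered is hypothetical.

```shell
# List renditions for a page range, restricted to one dpi and content type.
# Assumption: the documented query parameters can be combined in one call.
# curl -G turns each --data-urlencode pair into part of the query string.
list_page_images_filtered() {
  local memory_id="$1" start_index="$2" end_index="$3" dpi="$4" content_type="$5"
  curl -sS -k -G "$GOODMEM_REST_URL/v1/memories/$memory_id/pages" \
    -H "x-api-key: $GOODMEM_API_KEY" \
    --data-urlencode "startPageIndex=$start_index" \
    --data-urlencode "endPageIndex=$end_index" \
    --data-urlencode "dpi=$dpi" \
    --data-urlencode "contentType=$content_type"
}
```

For example, list_page_images_filtered "$MEMORY_ID" 0 9 150 image/png asks for the 150 dpi PNG renditions of the first ten pages.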

Fetch One Page Image

Use this endpoint to download one page image as raw binary content:

  • GET /v1/memories/{id}/pages/{pageIndex}/image

In the common case, you can omit rendition hints entirely:

curl -sS -k \
  "$GOODMEM_REST_URL/v1/memories/$MEMORY_ID/pages/2/image" \
  -H "x-api-key: $GOODMEM_API_KEY" \
  -o page-2.png

The server returns the unique rendition for that page if exactly one exists, so you usually do not need to specify dpi or contentType. If GoodMem stores multiple renditions for the same page, the server may ask you to specify them explicitly.

HTTP Behavior

The image endpoint supports normal binary-download behavior:

  • GET for the image bytes
  • HEAD for headers only
  • Range requests
  • ETag and Digest headers when available

This is useful for browser caching and document-viewer prefetching.
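
As a sketch of the HEAD and Range behavior with curl: -I sends a HEAD request and -r sets the Range header. The two function names are hypothetical, and the exact headers returned depend on the server.

```shell
# HEAD request: print only the response headers for one page image.
# $1 = memory ID, $2 = pageIndex
page_image_headers() {
  curl -sS -k -I "$GOODMEM_REST_URL/v1/memories/$1/pages/$2/image" \
    -H "x-api-key: $GOODMEM_API_KEY"
}

# Range request: download only the first N bytes of one page image.
# $1 = memory ID, $2 = pageIndex, $3 = byte count, $4 = output file
page_image_first_bytes() {
  local last_byte=$(( $3 - 1 ))   # byte ranges are inclusive and start at 0
  curl -sS -k -r "0-$last_byte" \
    "$GOODMEM_REST_URL/v1/memories/$1/pages/$2/image" \
    -H "x-api-key: $GOODMEM_API_KEY" \
    -o "$4"
}
```

A viewer could call page_image_headers first to read the content length and ETag, then decide whether a cached copy is still usable before downloading the full image.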

Page Indices Are 0-Based

GoodMem uses 0-based page indices everywhere in the page-image APIs and chunk metadata.

Examples:

  • the first page in the PDF is pageIndex = 0
  • “page 3” in a human-facing UI is pageIndex = 2

If your UI shows human page numbers, convert them at the edge and keep the API calls 0-based.
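
The conversion at the edge is a single subtraction or addition. A minimal sketch with two hypothetical helper names that also reject input that is not a plain decimal number:

```shell
# Convert a 1-based page number from the UI to a 0-based pageIndex.
page_number_to_index() {
  case "${1:-}" in
    ''|0*|*[!0-9]*)
      echo "invalid page number: ${1:-}" >&2
      return 1 ;;
  esac
  echo $(( $1 - 1 ))
}

# Convert a 0-based pageIndex back to a 1-based page number for display.
page_index_to_number() {
  case "${1:-}" in
    ''|0?*|*[!0-9]*)
      echo "invalid pageIndex: ${1:-}" >&2
      return 1 ;;
  esac
  echo $(( $1 + 1 ))
}
```

Leading zeros are rejected on purpose, because shell arithmetic would otherwise read a value such as 08 as an invalid octal number.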

How Retrieved Chunks Point Back to Pages

Page images are stored per page, but retrieved chunks can span one or more pages. GoodMem exposes page attribution in chunk metadata, not as first-class chunk fields.

When available, the metadata keys are:

  • source_page_start_index
  • source_page_end_index
  • source_page_count

Example retrieved chunk metadata:

{
  "metadata": {
    "source_page_start_index": 4,
    "source_page_end_index": 5,
    "source_page_count": 2
  }
}

Interpretation:

  • the chunk starts on pageIndex 4 (the fifth page of the PDF)
  • it ends on pageIndex 5 (the sixth page)
  • it spans 2 pages in total

These fields are optional. If GoodMem cannot infer page spans for a chunk, the keys are simply absent.
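
To link a chunk back to its page images, the start and end indices can be expanded into one image path per page. This sketch treats the range as inclusive, as the example metadata implies (indices 4 to 5, count 2). The function name chunk_page_image_paths is hypothetical.

```shell
# Print the page-image path for every page a chunk spans.
# $1 = memory ID, $2 = source_page_start_index, $3 = source_page_end_index
# Prints nothing when the start index is missing, matching the case where
# the metadata keys are absent.
chunk_page_image_paths() {
  local memory_id="$1" start_index="${2:-}" end_index="${3:-}"
  [ -n "$start_index" ] || return 0
  [ -n "$end_index" ] || end_index="$start_index"   # single-page chunk
  local page_index="$start_index"
  while [ "$page_index" -le "$end_index" ]; do
    printf '/v1/memories/%s/pages/%s/image\n' "$memory_id" "$page_index"
    page_index=$(( page_index + 1 ))
  done
}
```

The indices themselves can be read from a retrieved chunk with a jq filter such as .metadata.source_page_start_index // empty, which yields an empty string when the key is absent.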

Common Patterns

Build a Viewer

  1. Fetch the memory and wait for pageImageStatus == COMPLETED
  2. List page metadata with GET /v1/memories/{id}/pages
  3. Render each visible page with GET /v1/memories/{id}/pages/{pageIndex}/image
  4. Use chunk metadata to highlight which pages a retrieval result came from
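
For a viewer that prefetches pages, step 3 can be sketched as a download loop. It assumes one PNG rendition per page, as in the listing example earlier in this guide. The function name and the page-N.png file naming are illustrative.

```shell
# Read pageIndex values (one per line) on stdin and download each page image.
# $1 = memory ID, $2 = output directory
# Assumes exactly one PNG rendition exists per page.
download_page_images() {
  local memory_id="$1" out_dir="$2" page_index
  mkdir -p "$out_dir" || return 1
  while read -r page_index; do
    [ -n "$page_index" ] || continue   # skip blank lines
    # </dev/null keeps curl from consuming the list of indices on stdin
    curl -sS -k "$GOODMEM_REST_URL/v1/memories/$memory_id/pages/$page_index/image" \
      -H "x-api-key: $GOODMEM_API_KEY" \
      -o "$out_dir/page-$page_index.png" </dev/null || return 1
  done
}
```

The indices can come from the listing endpoint, for example by piping its JSON through jq -r '.pageImages[].pageIndex' and sort -un before handing it to download_page_images.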

Handle Image-Only PDFs

Some PDFs do not yield usable extracted text. GoodMem can still render page images for them.

That means you may see:

  • processingStatus = FAILED
  • pageImageStatus = COMPLETED

This is expected for certain image-only or scan-heavy PDFs.
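
A client can branch on the two statuses independently. The sketch below uses only the status values named in this guide. The function name page_view_mode and the mode labels it prints are hypothetical.

```shell
# Decide what a viewer can show from the two independent statuses.
# $1 = processingStatus, $2 = pageImageStatus
page_view_mode() {
  local processing_status="${1:-}" page_image_status="${2:-}"
  case "$page_image_status" in
    COMPLETED)
      if [ "$processing_status" = "FAILED" ]; then
        echo "images-only"   # scan-heavy PDF: show pages, no text results
      else
        echo "images"
      fi ;;
    FAILED)
      echo "no-images" ;;
    *)
      echo "waiting" ;;      # PENDING or PROCESSING
  esac
}
```

In the image-only case, a UI might hide text search for that document while still rendering the page previews.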

Next Steps