Work with PDF Page Images
Extract, monitor, list, and fetch PDF page images for previews and document viewers.
GoodMem can extract per-page images for PDF memories. This is useful when you want:
- Page previews in a document viewer
- Thumbnail strips in the console or a custom UI
- A visual fallback when text extraction is incomplete
- A way to link retrieved chunks back to the pages they came from
This guide covers the full REST-facing workflow:
- Request page-image extraction at ingest time
- Check whether page images are ready
- List available page-image renditions
- Fetch one page image
- Understand how retrieved chunks point back to their source pages
Before You Start
- GoodMem server running (default REST address: https://localhost:8080)
- API key with permission to create and read memories
- A PDF file to upload
For PDF rendering, GoodMem selects a suitable page-rendering engine automatically in the default configuration. For server-side tuning details, see Server Runtime Footprint.
If you are calling the REST API directly, set:
export GOODMEM_REST_URL="https://localhost:8080"
export GOODMEM_API_KEY="gm_your_key"
For local development with the default self-signed certificate, add -k to curl and --verify=no to HTTPie.
Request Page Images at Ingest Time
Page-image extraction is controlled per memory.
- REST and SDKs: set extractPageImages: true
- CLI: enabled by default for eligible file uploads; use --no-extract-page-images to opt out
goodmem memory create \
  --space-id "$SPACE_ID" \
  --file document.pdf

curl -sS -k -X POST "$GOODMEM_REST_URL/v1/memories" \
  -H "x-api-key: $GOODMEM_API_KEY" \
  -F 'file=@document.pdf;type=application/pdf' \
  -F 'request={"spaceId":"'"$SPACE_ID"'","contentType":"application/pdf","extractPageImages":true};type=application/json'

http --verify=no -f POST "$GOODMEM_REST_URL/v1/memories" \
  x-api-key:"$GOODMEM_API_KEY" \
  'file@document.pdf;type=application/pdf' \
  request:='{"spaceId":"'"$SPACE_ID"'","contentType":"application/pdf","extractPageImages":true}'

The response returns the created memory. Save its memoryId; you will use it for the page-image and memory-status calls below.
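If you are not using curl or HTTPie, the same multipart upload can be assembled by hand. The following is a minimal sketch using only the Python standard library; the function name `build_create_memory_body` is hypothetical, and the part names (`file`, `request`) and JSON fields are taken from the curl example above rather than from a verified schema.

```python
import json
import uuid


def build_create_memory_body(space_id: str, pdf_bytes: bytes,
                             filename: str = "document.pdf"):
    """Return (content_type_header, body) for a multipart POST to /v1/memories.

    Hypothetical helper: mirrors the two form parts used in the curl example.
    """
    boundary = uuid.uuid4().hex
    request_json = json.dumps({
        "spaceId": space_id,
        "contentType": "application/pdf",
        "extractPageImages": True,  # ask for per-page images at ingest time
    })
    file_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode() + pdf_bytes + b"\r\n"
    request_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="request"\r\n'
        "Content-Type: application/json\r\n\r\n"
    ).encode() + request_json.encode() + b"\r\n"
    closing = f"--{boundary}--\r\n".encode()
    return f"multipart/form-data; boundary={boundary}", file_part + request_part + closing
```

Send the returned body with the returned value as the Content-Type header, plus the x-api-key header, using whichever HTTP client you prefer.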
Check Whether Page Images Are Ready
Page-image extraction is tracked separately from the memory's main processing status. A memory can finish chunking and embedding while page images are still processing, and an image-only PDF can produce page images even if text extraction fails.
Check the memory:
curl -sS -k "$GOODMEM_REST_URL/v1/memories/$MEMORY_ID" \
  -H "x-api-key: $GOODMEM_API_KEY" | jq

http --verify=no GET "$GOODMEM_REST_URL/v1/memories/$MEMORY_ID" \
  x-api-key:"$GOODMEM_API_KEY"

Look at these fields on the memory:
- pageImageStatus
- pageImageCount
Expected status values:
- PENDING
- PROCESSING
- COMPLETED
- FAILED
Treat pageImageStatus == COMPLETED and pageImageCount > 0 as the signal that page images are
ready to fetch.
pageImageCount counts stored page-image renditions, not logical PDF pages. If every
page has exactly one rendition, the two numbers usually match. If some pages have multiple
renditions, pageImageCount will be higher than the human-visible page count.
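The readiness rule above can be wrapped in a small polling loop. This is a sketch, not an SDK feature: `page_images_ready` and `wait_for_page_images` are hypothetical names, and `fetch_memory` is a callable you supply that returns the parsed JSON of GET /v1/memories/{id}.

```python
import time


def page_images_ready(memory: dict) -> bool:
    """Apply the readiness signal: COMPLETED status and at least one rendition."""
    return (memory.get("pageImageStatus") == "COMPLETED"
            and (memory.get("pageImageCount") or 0) > 0)


def wait_for_page_images(fetch_memory, timeout_s: float = 60.0,
                         interval_s: float = 2.0, sleep=time.sleep) -> dict:
    """Poll until page images are ready, extraction fails, or the timeout passes.

    fetch_memory: caller-supplied function returning the memory as a dict.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        memory = fetch_memory()
        if page_images_ready(memory):
            return memory
        if memory.get("pageImageStatus") == "FAILED":
            raise RuntimeError("page-image extraction failed")
        if time.monotonic() >= deadline:
            raise TimeoutError("page images not ready before timeout")
        sleep(interval_s)
```

Because page-image status is tracked separately, this loop deliberately ignores the memory's main processing status.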
List Available Page Images
Use the page-image listing endpoint to discover which renditions exist for a memory.
Endpoint:
GET /v1/memories/{id}/pages
Basic request:
curl -sS -k "$GOODMEM_REST_URL/v1/memories/$MEMORY_ID/pages" \
  -H "x-api-key: $GOODMEM_API_KEY" | jq

http --verify=no GET "$GOODMEM_REST_URL/v1/memories/$MEMORY_ID/pages" \
  x-api-key:"$GOODMEM_API_KEY"

The response looks like this:
{
"pageImages": [
{
"memoryId": "550e8400-e29b-41d4-a716-446655440000",
"pageIndex": 0,
"dpi": 150,
"contentType": "image/png",
"imageContentLength": 281233,
"imageContentSha256": "2d711642b726b04401627ca9fbac32f5c8530fb1903cc4db02258717921a4881",
"createdAt": 1714762260000,
"updatedAt": 1714762260000
}
],
"nextToken": "..."
}
Notes:
- pageIndex is 0-based
- one logical page can have more than one rendition
- renditions are distinguished by dpi and contentType
- nextToken is opaque; if present, pass it back unchanged
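The notes above suggest two small client-side helpers: one that follows nextToken until the listing is exhausted, and one that groups renditions by page. The sketch below assumes a caller-supplied `fetch_page(token)` function that returns one parsed listing response; both helper names are hypothetical.

```python
def iter_page_images(fetch_page):
    """Yield every page-image entry, following nextToken across responses.

    fetch_page: caller-supplied function taking a token (or None for the
    first request) and returning the parsed listing response as a dict.
    """
    token = None
    while True:
        response = fetch_page(token)
        yield from response.get("pageImages", [])
        token = response.get("nextToken")  # opaque; passed back unchanged
        if not token:
            break


def group_by_page(page_images):
    """Map each 0-based pageIndex to the list of renditions stored for it."""
    grouped = {}
    for image in page_images:
        grouped.setdefault(image["pageIndex"], []).append(image)
    return grouped
```

Grouping makes it easy to spot pages with more than one rendition, which differ by dpi and contentType.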
Filter the List
Supported query parameters:
- startPageIndex
- endPageIndex
- dpi
- contentType
- maxResults
- nextToken
Snake_case aliases are also accepted:
- start_page_index
- end_page_index
- content_type
- max_results
- next_token
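When building listing URLs in code, the query parameters above can be assembled with the standard library so that values such as image/png are escaped correctly. This is a sketch; `pages_url` is a hypothetical helper and performs no validation of parameter names.

```python
from urllib.parse import urlencode


def pages_url(base_url: str, memory_id: str, **filters) -> str:
    """Build the page-image listing URL, skipping filters set to None."""
    params = {name: value for name, value in filters.items() if value is not None}
    url = f"{base_url.rstrip('/')}/v1/memories/{memory_id}/pages"
    return f"{url}?{urlencode(params)}" if params else url
```

For example, `pages_url(base, memory_id, startPageIndex=2, endPageIndex=2)` produces the same query string as the curl example in this section.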
Example: list only the renditions for pageIndex 2 (the third page of the PDF):
curl -sS -k \
  "$GOODMEM_REST_URL/v1/memories/$MEMORY_ID/pages?startPageIndex=2&endPageIndex=2" \
  -H "x-api-key: $GOODMEM_API_KEY" | jq
Fetch One Page Image
Use this endpoint to download one page image as raw binary content:
GET /v1/memories/{id}/pages/{pageIndex}/image
In the common case, you can omit rendition hints entirely:
curl -sS -k \
"$GOODMEM_REST_URL/v1/memories/$MEMORY_ID/pages/2/image" \
-H "x-api-key: $GOODMEM_API_KEY" \
  -o page-2.png
The server returns the unique rendition for that page if exactly one exists. If GoodMem ever stores
multiple renditions for the same page, the server may ask you to specify dpi and contentType explicitly.
HTTP Behavior
The image endpoint supports normal binary-download behavior:
- GET for the image bytes
- HEAD for headers only
- Range requests
- ETag and Digest headers when available
This is useful for browser caching and document-viewer prefetching.
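As an illustration of these behaviors, the sketch below builds (but does not send) GET, HEAD, and ranged requests for one page image using Python's standard library. `image_request` is a hypothetical helper; the Range header uses the standard HTTP `bytes=start-end` form, and the exact headers the server returns are not verified here.

```python
import urllib.request


def image_request(base_url: str, memory_id: str, page_index: int, api_key: str,
                  method: str = "GET", byte_range=None) -> urllib.request.Request:
    """Build a request for one page image.

    method: "GET" for the image bytes or "HEAD" for headers only.
    byte_range: optional (start, end) tuple, inclusive, for a Range request.
    """
    url = f"{base_url.rstrip('/')}/v1/memories/{memory_id}/pages/{page_index}/image"
    headers = {"x-api-key": api_key}
    if byte_range is not None:
        start, end = byte_range
        headers["Range"] = f"bytes={start}-{end}"
    return urllib.request.Request(url, headers=headers, method=method)
```

Pass the result to `urllib.request.urlopen`. With the default self-signed certificate in local development you would also need to supply an SSL context that skips verification, which is the equivalent of curl's -k.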
Page Indices Are 0-Based
GoodMem uses 0-based page indices everywhere in the page-image APIs and chunk metadata.
Examples:
- the first page in the PDF is pageIndex = 0
- “page 3” in a human-facing UI is pageIndex = 2
If your UI shows human page numbers, convert them at the edge and keep the API calls 0-based.
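One way to keep the conversion at the edge is a pair of tiny helpers that the UI layer calls and nothing else does. The function names here are hypothetical.

```python
def to_page_index(human_page: int) -> int:
    """Convert a 1-based page number shown to users into a 0-based pageIndex."""
    if human_page < 1:
        raise ValueError("human page numbers start at 1")
    return human_page - 1


def to_human_page(page_index: int) -> int:
    """Convert a 0-based pageIndex from the API into a 1-based page number."""
    if page_index < 0:
        raise ValueError("pageIndex cannot be negative")
    return page_index + 1
```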
How Retrieved Chunks Point Back to Pages
Page images are stored per page, but retrieved chunks can span one or more pages. GoodMem exposes
page attribution in chunk metadata, not as first-class chunk fields.
When available, the metadata keys are:
- source_page_start_index
- source_page_end_index
- source_page_count
Example retrieved chunk metadata:
{
"metadata": {
"source_page_start_index": 4,
"source_page_end_index": 5,
"source_page_count": 2
}
}
Interpretation:
- the chunk starts on page 4
- ends on page 5
- spans 2 pages total
These fields are optional. If GoodMem cannot infer page spans for a chunk, the keys are simply absent.
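A client can turn these keys into the list of page indices to fetch. The sketch below treats the end index as inclusive, which matches the example (start 4, end 5, count 2) but is otherwise an assumption; `chunk_page_indices` is a hypothetical name.

```python
def chunk_page_indices(metadata: dict):
    """Return the 0-based page indices a chunk spans, or None if unknown.

    Assumes source_page_end_index is inclusive, as in the example above.
    """
    start = metadata.get("source_page_start_index")
    end = metadata.get("source_page_end_index")
    if start is None or end is None:
        return None  # page attribution was not inferred for this chunk
    return list(range(int(start), int(end) + 1))
```

Each returned index can be used directly in GET /v1/memories/{id}/pages/{pageIndex}/image, since both the metadata and the endpoint are 0-based.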
Common Patterns
Build a Viewer
- Fetch the memory and wait for pageImageStatus == COMPLETED
- List page metadata with GET /v1/memories/{id}/pages
- Render each visible page with GET /v1/memories/{id}/pages/{pageIndex}/image
- Use chunk metadata to highlight which pages a retrieval result came from
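The last two steps can be combined into a simple view model that a front end renders. This is an illustrative sketch only: `build_viewer_pages` and the shape of the returned dictionaries are hypothetical, the inputs are the listing entries and chunk metadata dictionaries already parsed from JSON, and the end index is assumed to be inclusive.

```python
def build_viewer_pages(memory_id: str, page_images: list, chunk_metadatas: list) -> list:
    """Return one entry per logical page, flagged if any retrieved chunk touches it."""
    highlighted = set()
    for metadata in chunk_metadatas:
        start = metadata.get("source_page_start_index")
        end = metadata.get("source_page_end_index")
        if start is not None and end is not None:
            highlighted.update(range(int(start), int(end) + 1))

    # Several renditions can share a pageIndex, so deduplicate before rendering.
    page_indices = sorted({image["pageIndex"] for image in page_images})
    return [
        {
            "pageIndex": index,
            "label": f"Page {index + 1}",  # human-facing number, converted at the edge
            "imagePath": f"/v1/memories/{memory_id}/pages/{index}/image",
            "highlighted": index in highlighted,
        }
        for index in page_indices
    ]
```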
Handle Image-Only PDFs
Some PDFs do not yield usable extracted text. GoodMem can still render page images for them.
That means you may see:
- processingStatus = FAILED
- pageImageStatus = COMPLETED
This is expected for certain image-only or scan-heavy PDFs.
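A client can detect this combination and fall back to an image-only view instead of treating the memory as unusable. A minimal sketch, with a hypothetical function name and using only the two status fields shown above:

```python
def is_image_only_result(memory: dict) -> bool:
    """True when text processing failed but page images were still produced."""
    return (memory.get("processingStatus") == "FAILED"
            and memory.get("pageImageStatus") == "COMPLETED")
```

When this returns True, the viewer can still render pages, but there will be no retrieved chunks to link back to them.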
Next Steps
- Optimize Document Ingestion for chunking strategy guidance
- Building a Basic RAG Agent for end-to-end ingestion and retrieval
- API Reference for the generated REST and gRPC surface