Hybrid Search Pipeline Guidelines
A comprehensive guide for implementing and optimizing intelligent search systems that combine multiple AI models to find the most relevant information
What is Hybrid Search?
Hybrid search is an advanced approach to information retrieval that combines different types of AI models (called "embedders") to find the most relevant content for a user's query. Think of it like having multiple specialists working together - one expert at understanding exact word matches, another expert at understanding meaning and context.
Why Use Hybrid Search?
Traditional search systems rely on a single approach, but different types of content and queries benefit from different search strategies:
- Dense embedders (like MiniLM): Excel at understanding semantic meaning and context. Great for finding conceptually similar content even when exact words don't match.
- Sparse embedders (like SPLADE): Excel at finding exact keyword matches and rare terms. Great for precise, technical queries.
By combining both approaches with optimal weights, hybrid search can achieve significantly better results than either approach alone.
Overview
GoodMem is an intelligent memory storage and retrieval platform that supports multiple AI embedding models. This documentation provides a complete step-by-step pipeline for:
- Setting up your hybrid search system with multiple AI models
- Preparing your data for testing and evaluation
- Evaluating performance across different configurations
- Optimizing the balance between different AI models
- Validating improvements with statistical rigor
The pipeline consists of four main scripts that work together, progressing from basic setup to advanced optimization.
Step-by-Step Implementation Guide
Step 0: Setting Up Your AI Models (Embedders) and Search Space
Before we can test hybrid search, we need to set up the AI models that will power our system. Think of this as assembling your team of search specialists.
Choosing Your AI Models
For this demonstration, we'll use two complementary AI models that represent different search strategies:
Model 1 - MiniLM (Dense Embedder)
- What it does: Understands meaning and context, great for conceptual similarity
- Best for: "Find articles about renewable energy" (will find solar, wind, hydro content)
- Technical details: MiniLM on HuggingFace
Model 2 - SPLADE (Sparse Embedder)
- What it does: Excels at exact keyword matching and rare term recognition
- Best for: "Find documents containing 'photovoltaic cells'" (precise technical terms)
- Technical details: SPLADE on HuggingFace
Setting Up the Models
First, you'll need to run these AI models in Docker containers (think of containers as isolated environments for each model). Then register them with GoodMem using these commands:
# Register the MiniLM model (semantic understanding specialist)
goodmem embedder create --display-name "miniLM" --provider-type TEI --endpoint-url "http://localhost:8000" --model-identifier "all-MiniLM-L6-v2" --dimensionality 384
# Register the SPLADE model (keyword matching specialist)
goodmem embedder create --display-name "SPLADE Embedder" --provider-type TEI --endpoint-url "http://localhost:8081" --model-identifier "splade-cocondenser-ensembledistil" --dimensionality 30522 --distribution-type SPARSECreating Your Search Space
After registering both AI models, create a "search space" - this is where your content will be stored and where both models can work together:
# Create a hybrid search space that combines both models
goodmem space create --name "Dual miniLM + Splade" --embedder-id miniLM-id --embedder-id splade-id
What just happened? You now have a search system that can leverage both semantic understanding (MiniLM) and precise keyword matching (SPLADE). When someone searches, both models will contribute their expertise, and the system will combine their results.
Step 1: Loading Your Test Data
Script: insert_squad_sentences_goodmem.py
Now that your hybrid search system is set up, you need content to search through. This script loads a well-known research dataset to test how well your system works.
What is the SQuAD Dataset?
The Stanford Question Answering Dataset (SQuAD) is a collection of reading comprehension questions based on Wikipedia articles. It's perfect for testing search systems because:
- Real-world content: Actual Wikipedia articles covering diverse topics
- Known answers: Each question has a verified correct answer
- Standardized testing: Widely used in AI research for comparing systems
How the Script Works
This script takes each sentence from the SQuAD dataset and stores it in your GoodMem search space. Here's what happens:
- Content Processing: Every sentence becomes a searchable "memory"
- Ground Truth Creation: The script tracks which sentences should be found for each question
- Metadata Storage: Additional context (article titles, paragraph info) is preserved
- Progress Tracking: You can resume if the process is interrupted
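To make the transformation concrete, here is a minimal sketch (not the script's actual code) of how one SQuAD paragraph could be split into sentence-level memories carrying the metadata fields shown later in this step. The regex-based split_into_sentences helper and the example paragraph are simplifications for illustration.

import re

def split_into_sentences(text: str) -> list[str]:
    # Naive sentence splitter used only for illustration; the real script
    # may use a more robust tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def paragraph_to_memories(article_title: str, article_index: int,
                          paragraph_index: int, paragraph_text: str) -> list[dict]:
    """Turn one SQuAD paragraph into sentence-level memory records."""
    memories = []
    paragraph_id = f"{article_index}_{paragraph_index}"
    for i, sentence in enumerate(split_into_sentences(paragraph_text), start=1):
        memories.append({
            "content": sentence,
            "metadata": {
                "source": "squad_1.1",
                "type": "sentence",
                "article_title": article_title,
                "article_index": article_index,
                "paragraph_id": paragraph_id,
                "sentence_id": f"{paragraph_id}_{i}",
                "paragraph_context": paragraph_text,
                "test_purpose": "squad_retrieval_evaluation",
            },
        })
    return memories

# Example with a made-up paragraph:
records = paragraph_to_memories("Solar Energy", 42, 3,
                                "Photovoltaic cells convert sunlight into electricity. "
                                "They are widely used in solar panels.")
print(len(records), records[0]["metadata"]["sentence_id"])  # 2 42_3_1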
Prerequisites
Before running the script:
- Download the SQuAD dataset
- Set your GOODMEM_API_KEY and GOODMEM_SERVER_URL in the script
- Ensure your search space is created and active
Running the Data Preparation
Usage Examples:
# Test the script without actually inserting data (recommended first step)
python insert_squad_sentences_goodmem.py --space-id SPACE-ID --dry-run
# Insert a smaller sample for initial testing (faster, good for development)
python insert_squad_sentences_goodmem.py --space-id SPACE-ID --limit 10000
# Insert the complete dataset (takes longer, needed for final evaluation)
python insert_squad_sentences_goodmem.py --space-id SPACE-ID
What You Get
After the script completes, you'll have several important files:
Generated Files:
- squad_1.1_ground_truth_{SPACE-ID}.json: The "answer key" - which sentences should be found for each question
- squad_1.1_sentence_memory_mapping_{SPACE-ID}.json: Maps original sentence IDs to your search system's memory IDs
- squad_1.1_memories_progress_{SPACE-ID}.json: Allows resuming if the process is interrupted
How Each Memory is Structured:
{
"content": "The actual sentence text that will be searched",
"metadata": {
"source": "squad_1.1",
"type": "sentence",
"article_title": "Solar Energy",
"article_index": 42,
"paragraph_id": "42_3",
"sentence_id": "42_3_1",
"paragraph_context": "Full paragraph for additional context",
"test_purpose": "squad_retrieval_evaluation"
}
}
Step 2: Testing Your Search Performance
Script: evaluate_squad_fast.py
Now comes the exciting part - testing how well your hybrid search system actually works! This script runs thousands of search queries and measures performance.
Understanding Hybrid Search Weighting
Your system has two AI models working together, but how much should you trust each one? This is controlled by weights:
- Weight = 1.0: Full trust in this model's results
- Weight = 0.5: Partial trust (this model's scores count half as much in the combined ranking)
- Weight = 0.0: Ignore this model completely
Example Scenarios:
- MiniLM: 1.0, SPLADE: 1.0 - Trust both models equally (default)
- MiniLM: 0.9, SPLADE: 0.1 - Heavily favor semantic understanding over keyword matching
- MiniLM: 0.3, SPLADE: 0.7 - Favor keyword matching for technical queries
How Hybrid Search Combines Results
When you search for "renewable energy systems":
- MiniLM finds semantically similar content (solar panels, wind turbines, etc.)
- SPLADE finds exact keyword matches ("renewable energy systems")
- Hybrid system combines both results using your chosen weights
- Final ranking shows the best overall results
The mathematical formula: For each result, the final score = (MiniLM_score × MiniLM_weight) + (SPLADE_score × SPLADE_weight)
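To illustrate how that formula plays out, here is a small sketch that merges two embedders' results with a weighted sum. The embedder names and score values are made up for the example; the real combination happens inside GoodMem at query time.

def combine_scores(score_lists: dict[str, dict[str, float]],
                   weights: dict[str, float]) -> list[tuple[str, float]]:
    """Weighted sum of per-embedder scores, highest combined score first."""
    combined: dict[str, float] = {}
    for embedder, scores in score_lists.items():
        w = weights.get(embedder, 0.0)
        for doc_id, score in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + w * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical per-embedder similarity scores for three documents:
scores = {
    "minilm": {"doc_a": 0.82, "doc_b": 0.74, "doc_c": 0.31},
    "splade": {"doc_a": 0.10, "doc_b": 0.55, "doc_c": 0.90},
}
print(combine_scores(scores, {"minilm": 1.0, "splade": 1.0}))  # trust both equally
print(combine_scores(scores, {"minilm": 0.9, "splade": 0.1}))  # favor semantic matching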
Running Performance Tests
Usage Examples:
# Test with default settings (both models equally weighted)
python evaluate_squad_fast.py --space-id SPACE-ID
# Test a custom weighting (favor MiniLM over SPLADE)
python evaluate_squad_fast.py --space-id SPACE-ID --custom-weights '{"embedder-1": 1.5, "embedder-2": 0.8}'
# Speed up evaluation for development (use fewer questions and more parallel processing)
python evaluate_squad_fast.py --space-id SPACE-ID --limit 1000 --top-k 5 --batch-size 200 --threads 8
Understanding Your Results
Generated Files:
- squad_fast_eval_results_{WEIGHTS}.json: Complete performance report with all metrics
- missingTerms_{WEIGHTS}.json: Questions where the system failed to find the correct answer
- embedder_info.json: Technical configuration details for your models
Key Performance Metrics:
- MRR (Mean Reciprocal Rank): Overall search quality score (0.0 to 1.0, higher is better)
  - 1.0 = Perfect (always finds the correct answer as the #1 result)
  - 0.5 = Good (correct answer typically in the top 2-3 results)
  - 0.1 = Poor (correct answer rarely in the top results)
- Recall@N: Success rate for finding correct answers in the top-N results
  - Recall@1: What % of questions get the right answer as the #1 result?
  - Recall@5: What % of questions get the right answer in the top 5 results?
  - Recall@10: What % of questions get the right answer in the top 10 results?
- Coverage: What % of questions returned any results at all (should be near 100%)
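These metrics are easy to sanity-check by hand. The sketch below computes MRR and Recall@N from a list of per-question ranks (the 1-based position of the first correct sentence, or None if it was never returned); the ranks shown are made-up examples, not output from the script.

# Hypothetical ranks for five questions (None = correct answer never retrieved).
ranks = [1, 3, None, 2, 1]

def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank, counting misses as 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

def recall_at(ranks, n):
    """Fraction of questions whose correct answer appears within the top n results."""
    return sum(1 for r in ranks if r is not None and r <= n) / len(ranks)

print(f"MRR      = {mrr(ranks):.3f}")          # (1 + 1/3 + 0 + 1/2 + 1) / 5 = 0.567
print(f"Recall@1 = {recall_at(ranks, 1):.2f}")  # 2/5 = 0.40
print(f"Recall@5 = {recall_at(ranks, 5):.2f}")  # 4/5 = 0.80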
Step 3: Automatic Optimization
Script: optimize_embedder_weights.py
Manually testing different weight combinations is time-consuming. This script automatically finds the best weights for your specific use case and data.
What the Optimizer Does
Instead of guessing which weights work best, this script:
- Tests Many Combinations: Automatically tries hundreds of different weight combinations
- Measures Performance: Runs searches with each combination and measures MRR scores
- Finds the Best: Identifies the weight combination that gives the highest performance
- Validates Results: Uses statistical tests to confirm the improvements are real, not just luck
The Two-Phase Optimization Strategy
Phase 1 - Coarse Search: Tests major weight ratios quickly
- Example: 90%/10%, 80%/20%, 70%/30%, etc.
- Goal: Find the promising region (e.g., "MiniLM should dominate")
Phase 2 - Fine Search: Refines around the best area from Phase 1
- Example: If 90%/10% was best, test 89%/11%, 90%/10%, 91%/9%, etc.
- Goal: Find the exact optimal balance
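The two-phase idea can be sketched in a few lines. In the sketch below, evaluate stands in for a full MRR evaluation at a given MiniLM weight (with the SPLADE weight set to 1 minus it); the toy objective is a placeholder, not the optimizer's real code.

def grid(lo, hi, step):
    # Inclusive range of floats, rounded to avoid accumulation error.
    n = int(round((hi - lo) / step))
    return [round(lo + i * step, 6) for i in range(n + 1)]

def coarse_to_fine(evaluate, coarse_step=0.10, fine_step=0.01, fine_range=0.05):
    """Two-phase 1-D search over the MiniLM weight (SPLADE implicitly gets 1 - w)."""
    # Phase 1: coarse sweep across the whole [0, 1] range.
    best_w = max(grid(0.0, 1.0, coarse_step), key=evaluate)
    # Phase 2: fine sweep in a narrow window around the coarse winner.
    lo, hi = max(0.0, best_w - fine_range), min(1.0, best_w + fine_range)
    best_w = max(grid(lo, hi, fine_step), key=evaluate)
    return best_w, evaluate(best_w)

def toy_mrr(w):
    # Toy objective with a peak near w = 0.9; stands in for a measured MRR curve.
    return 0.6 - (w - 0.9) ** 2

w, score = coarse_to_fine(toy_mrr)
print(f"MiniLM weight ~ {w:.2f}, SPLADE weight ~ {1 - w:.2f}, score ~ {score:.3f}")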
Statistical Validation
The optimizer doesn't just find the best weights - it proves they're actually better:
- Compares against baselines: How much better is hybrid vs. using just one model?
- Statistical significance: Are improvements real or just random variation?
- Confidence intervals: How confident can we be in the results?
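One standard way to check that a difference in MRR is not just noise is a paired bootstrap over per-question reciprocal ranks. The sketch below illustrates that general idea; the script's actual statistical tests may differ in detail, and the example values are made up.

import random

def paired_bootstrap(rr_hybrid, rr_baseline, iterations=10_000, seed=0):
    """Estimate how often a resampled question set favors the hybrid system.

    rr_hybrid / rr_baseline: per-question reciprocal ranks (0.0 for misses),
    aligned so that index i refers to the same question in both lists.
    """
    rng = random.Random(seed)
    n = len(rr_hybrid)
    wins = 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]  # resample questions with replacement
        delta = sum(rr_hybrid[i] - rr_baseline[i] for i in idx) / n
        if delta > 0:
            wins += 1
    return wins / iterations  # fraction of resamples where hybrid beats the baseline

# Tiny made-up example: hybrid slightly better on two questions, tied elsewhere.
hybrid   = [1.0, 0.5, 1.0, 0.33, 0.0, 1.0]
baseline = [1.0, 0.5, 0.5, 0.25, 0.0, 1.0]
print(f"hybrid better in {paired_bootstrap(hybrid, baseline):.1%} of resamples")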
Running the Optimizer
Common Usage Patterns:
# Quick optimization for development (fast, good enough for testing)
python optimize_embedder_weights.py --space-id SPACE-ID --sample-size 1000 --coarse-step 0.10 --max-evaluations 20
# Standard optimization (balance of speed and accuracy)
python optimize_embedder_weights.py --space-id SPACE-ID
# High-precision optimization (slower, but finds the absolute best weights)
python optimize_embedder_weights.py --space-id SPACE-ID --coarse-step 0.02 --fine-step 0.001 --fine-range 0.03 --max-evaluations 100
# Compare two specific weight combinations statistically
python optimize_embedder_weights.py --space-id SPACE-ID --compare-weights '{"emb1": 1.0, "emb2": 0.0}' '{"emb1": 0.7, "emb2": 0.3}'
Parameter Guide:
- --sample-size 1000: Use 1,000 questions for optimization (faster) vs. the full dataset (slower, more accurate)
- --coarse-step 0.10: Test every 10% in Phase 1 (0.1, 0.2, 0.3...) vs. every 5% (0.05, 0.10, 0.15...)
- --fine-step 0.001: Test every 0.1% in Phase 2 (very precise) vs. 0.005 (faster, less precise)
- --max-evaluations 20: Stop after 20 tests (quick results) vs. 100 (thorough search)
What You Get From Optimization
The optimizer provides:
- Optimal weights: The exact weight combination that maximizes performance (e.g., "MiniLM: 0.95, SPLADE: 0.05")
- Performance improvement: How much better the optimal weights are vs. baseline
- Statistical proof: Mathematical confidence that the improvements are real
- Recommendations: Clear guidance on which weights to use in production
Example Output:
🏆 OPTIMIZATION RESULTS
Best MRR: 0.647 (6.2% improvement over best single model)
Optimal weights: MiniLM=0.95, SPLADE=0.05
📊 STATISTICAL VALIDATION
vs MiniLM alone: +3.1% improvement (p < 0.001, highly significant)
vs SPLADE alone: +18.4% improvement (p < 0.001, highly significant)
Confidence: 95% sure the improvement is between +2.1% and +4.2%
✅ RECOMMENDATION: Use MiniLM=0.95, SPLADE=0.05 in production
Step 4: Understanding Search Failures
Script: analyze_missing_gt_similarity.py
Even with optimized weights, your search system won't be perfect. This script helps you understand why certain questions fail to find the correct answers.
When to Use This Tool
Use this analysis when:
- Your system's performance plateaus and you want to understand the remaining failures
- You need to explain to stakeholders why certain types of queries don't work well
- You want to identify patterns in failures to improve your content or system
What It Analyzes
For each failed search, the script shows:
- The question that failed (e.g., "What year was solar energy first discovered?")
- What the system found (the top results it returned)
- What it should have found (the correct answer from your dataset)
- Why it failed (similarity scores, content comparison)
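If you want to reproduce this kind of similarity comparison by hand, outside the script, a small sketch using the sentence-transformers library (the same MiniLM model registered earlier) looks like this. The question and sentences are made-up examples, and the comparison covers only the dense side of the system.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical failed query, the ground-truth sentence it missed, and its top result.
question   = "What year was solar energy first discovered?"
missed_gt  = "The photovoltaic effect was first observed by Edmond Becquerel in 1839."
top_result = "Solar panels convert sunlight directly into electricity."

# Embed all three texts and compare cosine similarities against the question.
embeddings = model.encode([question, missed_gt, top_result], normalize_embeddings=True)
print(f"question vs. missed ground truth: {util.cos_sim(embeddings[0], embeddings[1]).item():.3f}")
print(f"question vs. top returned result: {util.cos_sim(embeddings[0], embeddings[2]).item():.3f}")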
Types of Failure Patterns
Common failure patterns you might discover:
- Vocabulary mismatch: Question uses different words than the answer text
- Context dependence: Answer requires understanding multiple sentences
- Ambiguity: Multiple plausible answers exist for the same question
- Model limitations: Neither AI model handles this type of content well
Usage Examples
# Basic usage
python analyze_missing_gt_similarity.py --space-id SPACE-ID --missing-file missingTerms.json --limit 10
# Deep dive with full content display (good for understanding specific failures)
python analyze_missing_gt_similarity.py --space-id SPACE-ID --missing-file missingTerms.json --limit 5 --verbose
Best Practices and Pro Tips
Getting Started Successfully
🚀 Start Small, Scale Up
- Begin with --limit 1000 to test your pipeline quickly
- Use --dry-run to verify data loading before full insertion
- Test optimization with --sample-size 1000 before running on the full dataset
📊 Always Establish Baselines
- Test each AI model individually first (MiniLM alone, SPLADE alone)
- This gives you a baseline to measure hybrid improvements against
- Document your baseline performance for comparison
Optimization Strategy
⏱️ Time vs. Accuracy Trade-offs
- Development: Use --coarse-step 0.10 --max-evaluations 20 for quick results
- Production: Use --coarse-step 0.02 --fine-step 0.001 for precision
- Validation: Always use the full dataset for final statistical testing
🎯 Understanding Your Domain
- Technical content: Sparse models (SPLADE) often perform better
- Conversational content: Dense models (MiniLM) often perform better
- Mixed content: Hybrid approaches typically win
Common Findings and Expectations
📈 Typical Performance Patterns
- Hybrid search usually improves MRR by 3-8% over the best single model
- Optimal weights are often extreme (e.g., 90%/10% rather than 50%/50%)
- Dense embedders typically dominate in the optimal combination
⚠️ When Hybrid Might Not Help
- Very domain-specific content where one model type clearly dominates
- Small datasets where statistical differences aren't meaningful
- Cases where the improvement is too small to justify the added complexity
Real-World Results: What to Expect
Here are actual performance results from running this pipeline on the SQuAD dataset:
| Search Configuration | MRR | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|---|
| MiniLM only | 0.61 | 52% | 73% | 78% |
| SPLADE only | 0.42 | 37% | 47% | 50% |
| Optimized Hybrid (MiniLM=1.0, SPLADE=0.005) | 0.64 | 56% | 74% | 78% |
Key Insights:
- Hybrid wins: 5% better MRR than the best single model
- MiniLM dominates: Optimal weight heavily favors the dense embedder
- Minimal SPLADE: Just 0.5% SPLADE weight provides the boost
- Consistent improvement: Better performance across all recall metrics
Why does this work? MiniLM handles most queries well with semantic understanding, while SPLADE provides a small boost for technical terms and exact matches that MiniLM misses.
Complete Workflow: Putting It All Together
Here's your step-by-step checklist for implementing hybrid search:
Development Phase (Quick Testing)
# 1. Load test data (small sample for speed)
python insert_squad_sentences_goodmem.py --space-id SPACE-ID --limit 5000
# 2. Test individual models
python evaluate_squad_fast.py --space-id SPACE-ID --limit 1000
# 3. Quick optimization
python optimize_embedder_weights.py --space-id SPACE-ID --sample-size 1000 --max-evaluations 20
Production Phase (Full Evaluation)
# 1. Load complete dataset
python insert_squad_sentences_goodmem.py --space-id SPACE-ID
# 2. Comprehensive optimization
python optimize_embedder_weights.py --space-id SPACE-ID
# 3. Final validation with optimal weights
python evaluate_squad_fast.py --space-id SPACE-ID --custom-weights '{"embedder-1": 0.95, "embedder-2": 0.05}'
# 4. Understand remaining failures
python analyze_missing_gt_similarity.py --space-id SPACE-ID --missing-file missingTerms.json --verbose
Next Steps
After completing this pipeline, you'll have:
- ✅ A functioning hybrid search system
- ✅ Statistically validated optimal weights
- ✅ Performance metrics and confidence intervals
- ✅ Understanding of system limitations and failures
Deploy with confidence - you now have the data to prove your hybrid search system works better than single-model alternatives!