Hybrid Search Pipeline Guidelines
A comprehensive guide for implementing and optimizing intelligent search systems that combine multiple AI models to find the most relevant information
What is Hybrid Search?
Hybrid search is an advanced approach to information retrieval that combines different types of AI models (called "embedders") to find the most relevant content for a user's query. Think of it like having multiple specialists working together - one expert at understanding exact word matches, another expert at understanding meaning and context.
Why Use Hybrid Search?
Traditional search systems rely on a single approach, but different types of content and queries benefit from different search strategies:
- Dense embedders (like MiniLM): Excel at understanding semantic meaning and context. Great for finding conceptually similar content even when exact words don't match.
- Sparse embedders (like SPLADE): Excel at finding exact keyword matches and rare terms. Great for precise, technical queries.
By combining both approaches with optimal weights, hybrid search can achieve significantly better results than either approach alone.
Overview
GoodMem is an intelligent memory storage and retrieval platform that supports multiple AI embedding models. This documentation provides a complete step-by-step pipeline for:
- Setting up your hybrid search system with multiple AI models
- Preparing your data for testing and evaluation
- Evaluating performance across different configurations
- Optimizing the balance between different AI models
- Validating improvements with statistical rigor
The pipeline consists of four main scripts that work together, progressing from basic setup to advanced optimization.
Step-by-Step Implementation Guide
Step 0: Setting Up Your AI Models (Embedders) and Search Space
Before we can test hybrid search, we need to set up the AI models that will power our system. Think of this as assembling your team of search specialists.
Choosing Your AI Models
For this demonstration, we'll use two complementary AI models that represent different search strategies:
Model 1 - MiniLM (Dense Embedder)
- What it does: Understands meaning and context, great for conceptual similarity
- Best for: "Find articles about renewable energy" (will find solar, wind, hydro content)
- Technical details: MiniLM on HuggingFace
Model 2 - SPLADE (Sparse Embedder)
- What it does: Excels at exact keyword matching and rare term recognition
- Best for: "Find documents containing 'photovoltaic cells'" (precise technical terms)
- Technical details: SPLADE on HuggingFace
Setting Up the Models
First, you'll need to run these AI models in Docker containers (think of containers as isolated environments for each model). Then register them with GoodMem using these commands:
# Register the MiniLM model (semantic understanding specialist)
goodmem embedder create --display-name "miniLM" --provider-type TEI --endpoint-url "http://localhost:8000" --model-identifier "all-MiniLM-L6-v2" --dimensionality 384
# Register the SPLADE model (keyword matching specialist)
goodmem embedder create --display-name "SPLADE Embedder" --provider-type TEI --endpoint-url "http://localhost:8081" --model-identifier "splade-cocondenser-ensembledistil" --dimensionality 30522 --distribution-type SPARSECreating Your Search Space
After registering both AI models, create a "search space" - this is where your content will be stored and where both models can work together:
# Create a hybrid search space that combines both models
goodmem space create --name "Dual miniLM + Splade" --embedder-id miniLM-id --embedder-id splade-id
What just happened? You now have a search system that can leverage both semantic understanding (MiniLM) and precise keyword matching (SPLADE). When someone searches, both models will contribute their expertise, and the system will combine their results.
Step 1: Loading Your Test Data
Script: insert_squad_sentences_goodmem.py
Now that your hybrid search system is set up, you need content to search through. This script loads a well-known research dataset to test how well your system works.
What is the SQuAD Dataset?
The Stanford Question Answering Dataset (SQuAD) is a collection of reading comprehension questions based on Wikipedia articles. It's perfect for testing search systems because:
- Real-world content: Actual Wikipedia articles covering diverse topics
- Known answers: Each question has a verified correct answer
- Standardized testing: Widely used in AI research for comparing systems
How the Script Works
This script takes each sentence from the SQuAD dataset and stores it in your GoodMem search space. Here's what happens:
- Content Processing: Every sentence becomes a searchable "memory"
- Ground Truth Creation: The script tracks which sentences should be found for each question
- Metadata Storage: Additional context (article titles, paragraph info) is preserved
- Progress Tracking: You can resume if the process is interrupted
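To make the transformation concrete, here is a minimal sketch (not the script's actual code) of how one SQuAD paragraph could be split into sentence-level memories carrying the metadata fields shown later in this step. The regex-based split_into_sentences helper and the example paragraph are simplifications for illustration.

import re

def split_into_sentences(text: str) -> list[str]:
    # Naive sentence splitter used only for illustration; the real script
    # may use a more robust tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def paragraph_to_memories(article_title: str, article_index: int,
                          paragraph_index: int, paragraph_text: str) -> list[dict]:
    """Turn one SQuAD paragraph into sentence-level memory records."""
    memories = []
    paragraph_id = f"{article_index}_{paragraph_index}"
    for i, sentence in enumerate(split_into_sentences(paragraph_text), start=1):
        memories.append({
            "content": sentence,
            "metadata": {
                "source": "squad_1.1",
                "type": "sentence",
                "article_title": article_title,
                "article_index": article_index,
                "paragraph_id": paragraph_id,
                "sentence_id": f"{paragraph_id}_{i}",
                "paragraph_context": paragraph_text,
                "test_purpose": "squad_retrieval_evaluation",
            },
        })
    return memories

# Example with a made-up paragraph:
records = paragraph_to_memories("Solar Energy", 42, 3,
                                "Photovoltaic cells convert sunlight into electricity. "
                                "They are widely used in solar panels.")
print(len(records), records[0]["metadata"]["sentence_id"])  # 2 42_3_1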
Prerequisites
Before running the script:
- Download the SQuAD dataset
- Set your GOODMEM_API_KEY and GOODMEM_SERVER_URL in the script
- Ensure your search space is created and active
Running the Data Preparation
Usage Examples:
# Test the script without actually inserting data (recommended first step)
python insert_squad_sentences_goodmem.py --space-id SPACE-ID --dry-run
# Insert a smaller sample for initial testing (faster, good for development)
python insert_squad_sentences_goodmem.py --space-id SPACE-ID --limit 10000
# Insert the complete dataset (takes longer, needed for final evaluation)
python insert_squad_sentences_goodmem.py --space-id SPACE-ID
What You Get
After the script completes, you'll have several important files:
Generated Files:
- squad_1.1_ground_truth_{SPACE-ID}.json: The "answer key" - which sentences should be found for each question
- squad_1.1_sentence_memory_mapping_{SPACE-ID}.json: Maps original sentence IDs to your search system's memory IDs
- squad_1.1_memories_progress_{SPACE-ID}.json: Allows resuming if the process is interrupted
How Each Memory is Structured:
{
"content": "The actual sentence text that will be searched",
"metadata": {
"source": "squad_1.1",
"type": "sentence",
"article_title": "Solar Energy",
"article_index": 42,
"paragraph_id": "42_3",
"sentence_id": "42_3_1",
"paragraph_context": "Full paragraph for additional context",
"test_purpose": "squad_retrieval_evaluation"
}
}
Step 2: Testing Your Search Performance
Script: evaluate_squad_fast.py
Now comes the exciting part - testing how well your hybrid search system actually works! This script runs thousands of search queries and measures performance.
Understanding Hybrid Search Weighting
Your system has two AI models working together, but how much should you trust each one? This is controlled by weights:
- Weight = 1.0: Full trust in this model's results
- Weight = 0.5: Partial trust (this model's scores count half as much in the combined ranking)
- Weight = 0.0: Ignore this model completely
Example Scenarios:
- MiniLM: 1.0, SPLADE: 1.0 - Trust both models equally (default)
- MiniLM: 0.9, SPLADE: 0.1 - Heavily favor semantic understanding over keyword matching
- MiniLM: 0.3, SPLADE: 0.7 - Favor keyword matching for technical queries
How Hybrid Search Combines Results
When you search for "renewable energy systems":
- MiniLM finds semantically similar content (solar panels, wind turbines, etc.)
- SPLADE finds exact keyword matches ("renewable energy systems")
- Hybrid system combines both results using your chosen weights
- Final ranking shows the best overall results
The mathematical formula: For each result, the final score = (MiniLM_score × MiniLM_weight) + (SPLADE_score × SPLADE_weight)
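To illustrate how that formula plays out, here is a small sketch that merges two embedders' results with a weighted sum. The embedder names and score values are made up for the example; the real combination happens inside GoodMem at query time.

def combine_scores(score_lists: dict[str, dict[str, float]],
                   weights: dict[str, float]) -> list[tuple[str, float]]:
    """Weighted sum of per-embedder scores, highest combined score first."""
    combined: dict[str, float] = {}
    for embedder, scores in score_lists.items():
        w = weights.get(embedder, 0.0)
        for doc_id, score in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + w * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical per-embedder similarity scores for three documents:
scores = {
    "minilm": {"doc_a": 0.82, "doc_b": 0.74, "doc_c": 0.31},
    "splade": {"doc_a": 0.10, "doc_b": 0.55, "doc_c": 0.90},
}
print(combine_scores(scores, {"minilm": 1.0, "splade": 1.0}))  # trust both equally
print(combine_scores(scores, {"minilm": 0.9, "splade": 0.1}))  # favor semantic matching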
Running Performance Tests
Usage Examples:
# Test with default settings (both models equally weighted)
python evaluate_squad_fast.py --space-id SPACE-ID
# Test a custom weighting (favor MiniLM over SPLADE)
python evaluate_squad_fast.py --space-id SPACE-ID --custom-weights '{"embedder-1": 1.5, "embedder-2": 0.8}'
# Speed up evaluation for development (use fewer questions and more parallel processing)
python evaluate_squad_fast.py --space-id SPACE-ID --limit 1000 --top-k 5 --batch-size 200 --threads 8
Understanding Your Results
Generated Files:
- squad_fast_eval_results_{WEIGHTS}.json: Complete performance report with all metrics
- missingTerms_{WEIGHTS}.json: Questions where the system failed to find the correct answer
- embedder_info.json: Technical configuration details for your models
Key Performance Metrics:
- MRR (Mean Reciprocal Rank): Overall search quality score (0.0 to 1.0, higher is better)
  - 1.0 = Perfect (always finds the correct answer as the #1 result)
  - 0.5 = Good (correct answer typically in the top 2-3 results)
  - 0.1 = Poor (correct answer rarely in the top results)
- Recall@N: Success rate for finding correct answers in the top-N results
  - Recall@1: What % of questions get the right answer as the #1 result?
  - Recall@5: What % of questions get the right answer in the top 5 results?
  - Recall@10: What % of questions get the right answer in the top 10 results?
- Coverage: What % of questions returned any results at all (should be near 100%)
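These metrics are easy to sanity-check by hand. The sketch below computes MRR and Recall@N from a list of per-question ranks (the 1-based position of the first correct sentence, or None if it was never returned); the ranks shown are made-up examples, not output from the script.

# Hypothetical ranks for five questions (None = correct answer never retrieved).
ranks = [1, 3, None, 2, 1]

def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank, counting misses as 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

def recall_at(ranks, n):
    """Fraction of questions whose correct answer appears within the top n results."""
    return sum(1 for r in ranks if r is not None and r <= n) / len(ranks)

print(f"MRR      = {mrr(ranks):.3f}")          # (1 + 1/3 + 0 + 1/2 + 1) / 5 = 0.567
print(f"Recall@1 = {recall_at(ranks, 1):.2f}")  # 2/5 = 0.40
print(f"Recall@5 = {recall_at(ranks, 5):.2f}")  # 4/5 = 0.80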
Step 3: Automatic Optimization
Script: optimize_embedder_weights.py
Manually testing different weight combinations is time-consuming. This script automatically finds the best weights for your specific use case and data.
What the Optimizer Does
Instead of guessing which weights work best, this script:
- Tests Many Combinations: Automatically tries hundreds of different weight combinations
- Measures Performance: Runs searches with each combination and measures MRR scores
- Finds the Best: Identifies the weight combination that gives the highest performance
- Validates Results: Uses statistical tests to confirm the improvements are real, not just luck
The Two-Phase Optimization Strategy
Phase 1 - Coarse Search: Tests major weight ratios quickly
- Example: 90%/10%, 80%/20%, 70%/30%, etc.
- Goal: Find the promising region (e.g., "MiniLM should dominate")
Phase 2 - Fine Search: Refines around the best area from Phase 1
- Example: If 90%/10% was best, test 89%/11%, 90%/10%, 91%/9%, etc.
- Goal: Find the exact optimal balance
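The two-phase idea can be sketched in a few lines. In the sketch below, evaluate stands in for a full MRR evaluation at a given MiniLM weight (with the SPLADE weight set to 1 minus it); the toy objective is a placeholder, not the optimizer's real code.

def grid(lo, hi, step):
    # Inclusive range of floats, rounded to avoid accumulation error.
    n = int(round((hi - lo) / step))
    return [round(lo + i * step, 6) for i in range(n + 1)]

def coarse_to_fine(evaluate, coarse_step=0.10, fine_step=0.01, fine_range=0.05):
    """Two-phase 1-D search over the MiniLM weight (SPLADE implicitly gets 1 - w)."""
    # Phase 1: coarse sweep across the whole [0, 1] range.
    best_w = max(grid(0.0, 1.0, coarse_step), key=evaluate)
    # Phase 2: fine sweep in a narrow window around the coarse winner.
    lo, hi = max(0.0, best_w - fine_range), min(1.0, best_w + fine_range)
    best_w = max(grid(lo, hi, fine_step), key=evaluate)
    return best_w, evaluate(best_w)

def toy_mrr(w):
    # Toy objective with a peak near w = 0.9; stands in for a measured MRR curve.
    return 0.6 - (w - 0.9) ** 2

w, score = coarse_to_fine(toy_mrr)
print(f"MiniLM weight ~ {w:.2f}, SPLADE weight ~ {1 - w:.2f}, score ~ {score:.3f}")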
Statistical Validation
The optimizer doesn't just find the best weights - it proves they're actually better:
- Compares against baselines: How much better is hybrid vs. using just one model?
- Statistical significance: Are improvements real or just random variation?
- Confidence intervals: How confident can we be in the results?
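One standard way to check that a difference in MRR is not just noise is a paired bootstrap over per-question reciprocal ranks. The sketch below illustrates that general idea; the script's actual statistical tests may differ in detail, and the example values are made up.

import random

def paired_bootstrap(rr_hybrid, rr_baseline, iterations=10_000, seed=0):
    """Estimate how often a resampled question set favors the hybrid system.

    rr_hybrid / rr_baseline: per-question reciprocal ranks (0.0 for misses),
    aligned so that index i refers to the same question in both lists.
    """
    rng = random.Random(seed)
    n = len(rr_hybrid)
    wins = 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]  # resample questions with replacement
        delta = sum(rr_hybrid[i] - rr_baseline[i] for i in idx) / n
        if delta > 0:
            wins += 1
    return wins / iterations  # fraction of resamples where hybrid beats the baseline

# Tiny made-up example: hybrid slightly better on two questions, tied elsewhere.
hybrid   = [1.0, 0.5, 1.0, 0.33, 0.0, 1.0]
baseline = [1.0, 0.5, 0.5, 0.25, 0.0, 1.0]
print(f"hybrid better in {paired_bootstrap(hybrid, baseline):.1%} of resamples")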
Running the Optimizer
Common Usage Patterns:
# Quick optimization for development (fast, good enough for testing)
python optimize_embedder_weights.py --space-id SPACE-ID --sample-size 1000 --coarse-step 0.10 --max-evaluations 20
# Standard optimization (balance of speed and accuracy)
python optimize_embedder_weights.py --space-id SPACE-ID
# High-precision optimization (slower, but finds the absolute best weights)
python optimize_embedder_weights.py --space-id SPACE-ID --coarse-step 0.02 --fine-step 0.001 --fine-range 0.03 --max-evaluations 100
# Compare two specific weight combinations statistically
python optimize_embedder_weights.py --space-id SPACE-ID --compare-weights '{"emb1": 1.0, "emb2": 0.0}' '{"emb1": 0.7, "emb2": 0.3}'
Parameter Guide:
- --sample-size 1000: Use 1,000 questions for optimization (faster) vs. the full dataset (slower, more accurate)
- --coarse-step 0.10: Test every 10% in Phase 1 (0.1, 0.2, 0.3...) vs. every 5% (0.05, 0.10, 0.15...)
- --fine-step 0.001: Test every 0.1% in Phase 2 (very precise) vs. 0.005 (faster, less precise)
- --max-evaluations 20: Stop after 20 tests (quick results) vs. 100 (thorough search)
What You Get From Optimization
The optimizer provides:
- Optimal weights: The exact weight combination that maximizes performance (e.g., "MiniLM: 0.95, SPLADE: 0.05")
- Performance improvement: How much better the optimal weights are vs. baseline
- Statistical proof: Mathematical confidence that the improvements are real
- Recommendations: Clear guidance on which weights to use in production
Example Output:
🏆 OPTIMIZATION RESULTS
Best MRR: 0.647 (6.2% improvement over best single model)
Optimal weights: MiniLM=0.95, SPLADE=0.05
📊 STATISTICAL VALIDATION
vs MiniLM alone: +3.1% improvement (p < 0.001, highly significant)
vs SPLADE alone: +18.4% improvement (p < 0.001, highly significant)
Confidence: 95% sure the improvement is between +2.1% and +4.2%
✅ RECOMMENDATION: Use MiniLM=0.95, SPLADE=0.05 in production
Step 4: Understanding Search Failures
Script: analyze_missing_gt_similarity.py
Even with optimized weights, your search system won't be perfect. This script helps you understand why certain questions fail to find the correct answers.
When to Use This Tool
Use this analysis when:
- Your system's performance plateaus and you want to understand the remaining failures
- You need to explain to stakeholders why certain types of queries don't work well
- You want to identify patterns in failures to improve your content or system
What It Analyzes
For each failed search, the script shows:
- The question that failed (e.g., "What year was solar energy first discovered?")
- What the system found (the top results it returned)
- What it should have found (the correct answer from your dataset)
- Why it failed (similarity scores, content comparison)
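If you want to reproduce this kind of similarity comparison by hand, outside the script, a small sketch using the sentence-transformers library (the same MiniLM model registered earlier) looks like this. The question and sentences are made-up examples, and the comparison covers only the dense side of the system.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical failed query, the ground-truth sentence it missed, and its top result.
question   = "What year was solar energy first discovered?"
missed_gt  = "The photovoltaic effect was first observed by Edmond Becquerel in 1839."
top_result = "Solar panels convert sunlight directly into electricity."

# Embed all three texts and compare cosine similarities against the question.
embeddings = model.encode([question, missed_gt, top_result], normalize_embeddings=True)
print(f"question vs. missed ground truth: {util.cos_sim(embeddings[0], embeddings[1]).item():.3f}")
print(f"question vs. top returned result: {util.cos_sim(embeddings[0], embeddings[2]).item():.3f}")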
Types of Failure Patterns
Common failure patterns you might discover:
- Vocabulary mismatch: Question uses different words than the answer text
- Context dependence: Answer requires understanding multiple sentences
- Ambiguity: Multiple plausible answers exist for the same question
- Model limitations: Neither AI model handles this type of content well
Usage Examples
# Basic usage
python analyze_missing_gt_similarity.py --space-id SPACE-ID --missing-file missingTerms.json --limit 10
# Deep dive with full content display (good for understanding specific failures)
python analyze_missing_gt_similarity.py --space-id SPACE-ID --missing-file missingTerms.json --limit 5 --verbose
Best Practices and Pro Tips
Getting Started Successfully
🚀 Start Small, Scale Up
- Begin with --limit 1000 to test your pipeline quickly
- Use --dry-run to verify data loading before full insertion
- Test optimization with --sample-size 1000 before running on the full dataset
📊 Always Establish Baselines
- Test each AI model individually first (MiniLM alone, SPLADE alone)
- This gives you a baseline to measure hybrid improvements against
- Document your baseline performance for comparison
Optimization Strategy
⏱️ Time vs. Accuracy Trade-offs
- Development: Use --coarse-step 0.10 --max-evaluations 20 for quick results
- Production: Use --coarse-step 0.02 --fine-step 0.001 for precision
- Validation: Always use the full dataset for final statistical testing
🎯 Understanding Your Domain
- Technical content: Sparse models (SPLADE) often perform better
- Conversational content: Dense models (MiniLM) often perform better
- Mixed content: Hybrid approaches typically win
Common Findings and Expectations
📈 Typical Performance Patterns
- Hybrid search usually improves MRR by 3-8% over the best single model
- Optimal weights are often extreme (e.g., 90%/10% rather than 50%/50%)
- Dense embedders typically dominate in the optimal combination
⚠️ When Hybrid Might Not Help
- Very domain-specific content where one model type clearly dominates
- Small datasets where statistical differences aren't meaningful
- Cases where the improvement is too small to justify the added complexity
Real-World Results: What to Expect
Here are actual performance results from running this pipeline on the SQuAD dataset:
| Search Configuration | MRR | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|---|
| MiniLM only | 0.61 | 52% | 73% | 78% |
| SPLADE only | 0.42 | 37% | 47% | 50% |
| Optimized Hybrid (MiniLM=1.0, SPLADE=0.005) | 0.64 | 56% | 74% | 78% |
Key Insights:
- Hybrid wins: 5% better MRR than the best single model
- MiniLM dominates: Optimal weight heavily favors the dense embedder
- Minimal SPLADE: Just 0.5% SPLADE weight provides the boost
- Consistent improvement: Better performance across all recall metrics
Why does this work? MiniLM handles most queries well with semantic understanding, while SPLADE provides a small boost for technical terms and exact matches that MiniLM misses.
Complete Workflow: Putting It All Together
Here's your step-by-step checklist for implementing hybrid search:
Development Phase (Quick Testing)
# 1. Load test data (small sample for speed)
python insert_squad_sentences_goodmem.py --space-id SPACE-ID --limit 5000
# 2. Test individual models
python evaluate_squad_fast.py --space-id SPACE-ID --limit 1000
# 3. Quick optimization
python optimize_embedder_weights.py --space-id SPACE-ID --sample-size 1000 --max-evaluations 20
Production Phase (Full Evaluation)
# 1. Load complete dataset
python insert_squad_sentences_goodmem.py --space-id SPACE-ID
# 2. Comprehensive optimization
python optimize_embedder_weights.py --space-id SPACE-ID
# 3. Final validation with optimal weights
python evaluate_squad_fast.py --space-id SPACE-ID --custom-weights '{"embedder-1": 0.95, "embedder-2": 0.05}'
# 4. Understand remaining failures
python analyze_missing_gt_similarity.py --space-id SPACE-ID --missing-file missingTerms.json --verbose
Next Steps
After completing this pipeline, you'll have:
- ✅ A functioning hybrid search system
- ✅ Statistically validated optimal weights
- ✅ Performance metrics and confidence intervals
- ✅ Understanding of system limitations and failures
Deploy with confidence - you now have the data to prove your hybrid search system works better than single-model alternatives!