Search¶
cast2md provides hybrid search across episode metadata and transcript content, combining full-text keyword matching with semantic (meaning-based) search.
Overview¶
The search page at /search provides a unified interface. Behind the scenes, three search systems work together:
Query → Generate embedding (~20ms)
│
├── Episode title/description FTS
├── Transcript segment tsvector search
└── pgvector HNSW similarity search
│
Reciprocal Rank Fusion → Combined results
Search Modes¶
| Mode | Description | Best For |
|---|---|---|
| Hybrid (default) | Combines keyword and semantic search | General use |
| Keyword | PostgreSQL full-text search only | Exact terms, names, numbers |
| Semantic | Vector similarity search only | Conceptual queries, multilingual |
Result Types¶
Search returns two categories of results:
Episode Matches¶
- Matched episode title or description
- Shows "title" badge
- Links directly to episode (no timestamp)
Transcript Matches¶
- Matched content within transcripts
- Shows badge: "keyword", "semantic", or "both"
- Links to episode at the matched timestamp
How It Works¶
Full-Text Search (Keyword)¶
Uses PostgreSQL tsvector and tsquery for fast keyword matching:
- Episode titles and descriptions are indexed in
episode_searchtable - Transcript segments are indexed with tsvector columns
- Supports stemming, ranking, and phrase matching
Semantic Search¶
Uses sentence-transformers embeddings stored in pgvector:
- Model:
paraphrase-multilingual-MiniLM-L12-v2 - Dimensions: 384
- Languages: 50+ including German
- Index: HNSW (Hierarchical Navigable Small World) for fast approximate nearest neighbor
- Distance: Cosine similarity (
<=>operator)
The model understands semantic similarity -- for example, searching for "kaltbaden" (cold bathing) also finds content about "eisbaden" (ice bathing).
Reciprocal Rank Fusion (RRF)¶
In hybrid mode, results from keyword and semantic search are combined using RRF:
- Each search method produces a ranked list
- RRF assigns scores based on rank position:
1 / (k + rank) - Scores are summed across methods
- Results are re-ranked by combined score
This balances precision (keyword) with recall (semantic).
Segment Merging¶
Transcripts can have word-level timestamps where each word is a separate segment. The system automatically merges these into phrases during indexing:
- Phrase boundaries: punctuation, pauses (>1.5s), or max 200 characters
- Applies to: both FTS indexing and embedding generation
- Improves: search quality (fewer noisy single-word results) and readability
Embedding Generation¶
Embeddings are generated in the background:
- When a new transcript is saved, an
EMBEDjob is queued - The embedding worker processes these jobs
- Transcript segments are merged into phrases, then embedded
- Embeddings are stored in PostgreSQL with pgvector
Startup Behavior¶
- Embeddings: persisted in PostgreSQL (survive restarts)
- Model loading: ~3 seconds on first semantic query after restart
Reindexing¶
# Reindex FTS only
cast2md reindex-transcripts
# Reindex FTS and regenerate all embeddings
cast2md reindex-transcripts --embeddings
# Backfill missing embeddings only
cast2md backfill-embeddings
API¶
Search Endpoint¶
| Parameter | Default | Description |
|---|---|---|
q |
(required) | Search query |
mode |
hybrid |
hybrid, semantic, or keyword |
limit |
20 |
Maximum results |
feed_id |
(all) | Limit to specific feed |
Embedding Stats¶
Returns count of episodes with/without embeddings.
MCP Tool¶
Embedding Model¶
The default model paraphrase-multilingual-MiniLM-L12-v2 was chosen for:
- Multilingual support -- 50+ languages including German
- Small size -- ~470 MB, 384 dimensions
- Good quality -- strong semantic understanding across languages
- Fast inference -- ~20ms per query on CPU
pgvector Configuration¶
- Extension:
vector(installed automatically) - Column:
vector(384)matching embedding dimension - Index: HNSW for fast approximate nearest neighbor search
- Distance metric: Cosine distance
Key Files¶
| File | Purpose |
|---|---|
search/embeddings.py |
Embedding generation, model config |
search/parser.py |
Segment merging (merge_word_level_segments()) |
search/repository.py |
hybrid_search(), index_episode_embeddings() |
api/search.py |
/api/search/semantic endpoint |
mcp/tools.py |
semantic_search MCP tool |
worker/manager.py |
Embedding worker (processes EMBED jobs) |