Transcript Sources¶
cast2md uses a transcript-first workflow that prioritizes external transcript sources before falling back to Whisper transcription. This minimizes storage requirements (no audio download needed) and processing time.
Provider Priority¶
| Priority | Provider | Source | Description |
|---|---|---|---|
| 1 | Podcast 2.0 | RSS <podcast:transcript> tags | Publisher-provided, authoritative |
| 2 | Pocket Casts | Pocket Casts API | Auto-generated transcripts |
| 3 | Whisper | Local transcription | Last resort, requires audio download |
Phase 1: Feed Discovery¶
When a feed is added or refreshed, discover_new_episodes() runs:
RSS Parsing¶
The parser extracts Podcast 2.0 transcript info from each episode entry:
- transcript_url -- URL from <podcast:transcript> tag
- transcript_type -- MIME type (e.g., text/vtt, application/srt)
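As an illustration of this extraction step, here is a minimal sketch using only the standard library. It is not cast2md's parser: the function name `extract_transcript` and the `PREFERRED_TYPES` ordering are assumptions made for the example, since the text does not say how an episode with several transcript tags is resolved. The namespace URI and the `url` and `type` attributes follow the Podcasting 2.0 namespace.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

# Namespace URI bound to the "podcast" prefix in Podcasting 2.0 feeds.
PODCAST_NS = "https://podcastindex.org/namespace/1.0"

# Hypothetical preference order for episodes that list several transcripts.
PREFERRED_TYPES = ["text/vtt", "application/srt", "application/json", "text/plain"]


def _rank(mime: Optional[str]) -> int:
    """Lower rank wins; unknown or missing types sort last."""
    return PREFERRED_TYPES.index(mime) if mime in PREFERRED_TYPES else len(PREFERRED_TYPES)


def extract_transcript(item: ET.Element) -> Tuple[Optional[str], Optional[str]]:
    """Return (transcript_url, transcript_type) for one RSS <item>."""
    candidates = [
        (tag.get("url"), tag.get("type"))
        for tag in item.findall(f"{{{PODCAST_NS}}}transcript")
        if tag.get("url")  # ignore tags without a URL
    ]
    if not candidates:
        return None, None
    return min(candidates, key=lambda c: _rank(c[1]))


SAMPLE_FEED = """<rss xmlns:podcast="https://podcastindex.org/namespace/1.0" version="2.0">
  <channel>
    <item>
      <title>Episode 1</title>
      <podcast:transcript url="https://example.com/ep1.srt" type="application/srt"/>
      <podcast:transcript url="https://example.com/ep1.vtt" type="text/vtt"/>
    </item>
    <item>
      <title>Episode 2</title>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_FEED)
results = [extract_transcript(item) for item in root.iter("item")]
```

In this sample, the first episode resolves to the VTT entry and the second yields `(None, None)`, which is the case the Pocket Casts check is meant to cover.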
Pocket Casts Upfront Check¶
After creating episodes, the system checks Pocket Casts for episodes that don't have Podcast 2.0 transcripts:
- Search for show via POST podcast-api.pocketcasts.com/discover/search
  - Search by feed title, match result by author name
  - Cache pocketcasts_uuid on feed for future lookups
- Get episodes via GET podcast-api.pocketcasts.com/mobile/show_notes/full/{uuid}
  - Returns JSON with all episodes
  - Each episode may have pocket_casts_transcripts[] array with VTT URLs
- Match episodes by title similarity + published date within 24 hours
  - If match found with transcript URL, store in pocketcasts_transcript_url
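The matching step (title similarity plus a 24-hour date window) could look roughly like the sketch below. Everything specific here is an assumption for illustration: the `0.85` similarity threshold, the normalization rules, the helper names, and the flattened candidate dictionaries (with a hypothetical `transcript_url` key) are not taken from cast2md or from the Pocket Casts response format.

```python
import re
from datetime import datetime, timedelta, timezone
from difflib import SequenceMatcher
from typing import Optional

TITLE_THRESHOLD = 0.85              # hypothetical similarity cutoff
MAX_DATE_GAP = timedelta(hours=24)  # the 24-hour window from the text


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", title.lower()).split())


def titles_match(a: str, b: str) -> bool:
    ratio = SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()
    return ratio >= TITLE_THRESHOLD


def find_transcript_url(title: str, published: datetime, candidates: list) -> Optional[str]:
    """Return the transcript URL of the first candidate matching title and date."""
    for cand in candidates:
        if abs(cand["published"] - published) > MAX_DATE_GAP:
            continue  # published too far apart to be the same episode
        if titles_match(title, cand["title"]):
            return cand.get("transcript_url")
    return None


published = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
candidates = [
    # Same title, but a month older: rejected by the date window.
    {"title": "Ep. 42: The Future of Audio!",
     "published": datetime(2024, 4, 1, 12, 0, tzinfo=timezone.utc),
     "transcript_url": "https://example.com/old.vtt"},
    # Slightly different punctuation, published six hours later: accepted.
    {"title": "Ep 42 - The Future of Audio",
     "published": datetime(2024, 5, 1, 18, 0, tzinfo=timezone.utc),
     "transcript_url": "https://example.com/ep42.vtt"},
]
match = find_transcript_url("Ep. 42: The Future of Audio!", published, candidates)
miss = find_transcript_url("Completely different", published, candidates)
```

Requiring both conditions keeps re-released or similarly named episodes from being confused with each other, at the cost of missing episodes whose titles differ substantially between the feed and Pocket Casts.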
Phase 2: Transcript Download¶
When a TRANSCRIPT_DOWNLOAD job runs, the provider chain is invoked:
_providers = [
    Podcast20Provider(),    # Check transcript_url first
    PocketCastsProvider(),  # Fallback to pocketcasts_transcript_url
]
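To make the chain concrete, here is a self-contained sketch of how such a list might be walked. The `can_provide` and `fetch` method names and the `TranscriptResult(content=..., source=...)` shape follow the provider interface described on this page; the signature of `try_fetch_transcript`, the stub provider classes, and the simplified `Episode` type are assumptions for the example, and the stubs perform no network requests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:  # simplified stand-in for the real episode model
    transcript_url: Optional[str] = None
    pocketcasts_transcript_url: Optional[str] = None


@dataclass
class TranscriptResult:
    content: str
    source: str


class StubPodcast20Provider:
    def can_provide(self, episode, feed) -> bool:
        return bool(episode.transcript_url)

    def fetch(self, episode, feed) -> Optional[TranscriptResult]:
        # A real provider would download and parse episode.transcript_url.
        return TranscriptResult(content="(publisher transcript)", source="podcast2.0:vtt")


class StubPocketCastsProvider:
    def can_provide(self, episode, feed) -> bool:
        return True  # always tried as a fallback

    def fetch(self, episode, feed) -> Optional[TranscriptResult]:
        if episode.pocketcasts_transcript_url is None:
            return None  # the real provider would fall back to an API search here
        return TranscriptResult(content="(Pocket Casts transcript)", source="pocketcasts")


def try_fetch_transcript(episode, feed, providers) -> Optional[TranscriptResult]:
    """Return the first successful result, or None so the caller can fall back to Whisper."""
    for provider in providers:
        if not provider.can_provide(episode, feed):
            continue
        result = provider.fetch(episode, feed)
        if result is not None:
            return result
    return None


providers = [StubPodcast20Provider(), StubPocketCastsProvider()]
from_rss = try_fetch_transcript(Episode(transcript_url="https://example.com/a.vtt"), None, providers)
from_pc = try_fetch_transcript(Episode(pocketcasts_transcript_url="https://example.com/b.vtt"), None, providers)
nothing = try_fetch_transcript(Episode(), None, providers)
```

Because list order is the priority order, a publisher transcript wins whenever it exists, and a `None` result signals that the episode needs audio for local transcription.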
Podcast20Provider¶
- Checks: episode.transcript_url exists
- Downloads: From the URL in the RSS feed
- Parses: Based on MIME type (VTT, SRT, JSON, plain text)
- Source tag: podcast2.0:vtt, podcast2.0:srt, podcast2.0:json, or podcast2.0:text
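One plausible way to derive the source tag from the MIME type is a small lookup table, sketched below. The four formats and the tag names come from this page; the exact MIME strings in `MIME_TO_FORMAT`, the handling of parameters such as `charset`, and the function name `source_tag_for` are assumptions rather than cast2md's actual behavior.

```python
from typing import Optional

# Assumed MIME-to-format mapping; real feeds may use other strings for the same formats.
MIME_TO_FORMAT = {
    "text/vtt": "vtt",
    "application/srt": "srt",
    "application/json": "json",
    "text/plain": "text",
}


def source_tag_for(transcript_type: Optional[str]) -> Optional[str]:
    """Map a transcript MIME type to a podcast2.0:* source tag, or None if unsupported."""
    if not transcript_type:
        return None
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = transcript_type.split(";", 1)[0].strip().lower()
    fmt = MIME_TO_FORMAT.get(mime)
    return f"podcast2.0:{fmt}" if fmt else None


tags = [source_tag_for(t) for t in ("text/vtt", "Application/SRT; charset=utf-8", "text/html", None)]
```

Returning `None` for unrecognized types lets the caller skip this provider and continue down the chain instead of guessing a parser.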
PocketCastsProvider¶
- Always tries as a fallback
- If episode.pocketcasts_transcript_url exists (from upfront discovery), downloads directly
- Otherwise, searches the Pocket Casts API (slower path)
- Source tag: pocketcasts
Data Model¶
Episode Fields¶
| Field | Source | Description |
|---|---|---|
| transcript_url | RSS parsing | Podcast 2.0 <podcast:transcript> URL |
| transcript_type | RSS parsing | MIME type of Podcast 2.0 transcript |
| pocketcasts_transcript_url | Upfront discovery | Pocket Casts VTT URL |
| transcript_source | After download | Final source: whisper, podcast2.0:vtt, pocketcasts, etc. |
| transcript_path | After download | Local path to saved transcript |
| transcript_model | After transcription | Whisper model used (e.g., large-v3-turbo, parakeet-tdt-0.6b-v3) |
Feed Fields¶
| Field | Description |
|---|---|
| pocketcasts_uuid | Cached Pocket Casts show UUID (avoids repeated API searches) |
Transcript Source Tags¶
When an episode reaches completed, the transcript_source column records how it was obtained:
| Source | Description |
|---|---|
| whisper | Transcribed locally using Whisper |
| podcast2.0:vtt | Downloaded from publisher (WebVTT format) |
| podcast2.0:srt | Downloaded from publisher (SRT format) |
| podcast2.0:json | Downloaded from publisher (JSON format) |
| podcast2.0:text | Downloaded from publisher (plain text) |
| pocketcasts | Auto-generated by Pocket Casts |
| NULL | Legacy episodes (before this feature) |
403 Handling (Pocket Casts)¶
Pocket Casts sometimes returns transcript URLs before the files exist on S3, in which case the download fails with HTTP 403. The system handles this with age-based retry logic:
| Episode Age | Behavior | Rationale |
|---|---|---|
| < 7 days | Status: awaiting_transcript, retry every 24h | Transcript may be generated soon |
| >= 7 days | Status: needs_audio, no retry | Transcript won't appear |
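The table can be read as a single decision made when a download returns 403. The sketch below is an assumption-laden illustration: it measures episode age from the publication date, and the function name `handle_403` and its return shape are invented for the example. Only the 7-day cutoff, the 24-hour interval, and the two status names come from the table.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

RETRY_WINDOW = timedelta(days=7)      # episodes younger than this are retried
RETRY_INTERVAL = timedelta(hours=24)  # spacing between retries


def handle_403(published: datetime, now: datetime) -> Tuple[str, Optional[datetime]]:
    """Return (status, next_transcript_retry_at) after a 403 from Pocket Casts."""
    if now - published < RETRY_WINDOW:
        return "awaiting_transcript", now + RETRY_INTERVAL
    return "needs_audio", None


now = datetime(2024, 5, 10, 12, 0, tzinfo=timezone.utc)
fresh = handle_403(now - timedelta(days=2), now)
stale = handle_403(now - timedelta(days=7), now)
```

An episode that is exactly 7 days old falls on the `needs_audio` side, matching the `>= 7 days` row.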
A scheduler job runs hourly to:
- Find episodes with awaiting_transcript status and next_transcript_retry_at <= now
- Re-queue TRANSCRIPT_DOWNLOAD jobs
- Transition aged-out episodes to needs_audio
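A single pass of such a scheduler job might be sketched as follows. This is an in-memory illustration only: the real job presumably queries the database, and the `Episode` dataclass, the `run_retry_pass` name, and the choice to age out an episode instead of giving it one last retry are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

MAX_AGE = timedelta(days=7)  # cutoff from the age table


@dataclass
class Episode:  # simplified stand-in for the real episode model
    id: int
    published: datetime
    status: str
    next_transcript_retry_at: Optional[datetime] = None


def run_retry_pass(episodes: List[Episode], now: datetime) -> List[int]:
    """Return IDs that need a new TRANSCRIPT_DOWNLOAD job; age out old episodes in place."""
    requeue = []
    for ep in episodes:
        if ep.status != "awaiting_transcript":
            continue
        if now - ep.published >= MAX_AGE:
            # Too old: stop waiting and mark the episode for audio download.
            ep.status = "needs_audio"
            ep.next_transcript_retry_at = None
        elif ep.next_transcript_retry_at is not None and ep.next_transcript_retry_at <= now:
            requeue.append(ep.id)
    return requeue


now = datetime(2024, 5, 10, 12, 0, tzinfo=timezone.utc)
episodes = [
    Episode(1, now - timedelta(days=2), "awaiting_transcript", now - timedelta(hours=1)),  # due
    Episode(2, now - timedelta(days=2), "awaiting_transcript", now + timedelta(hours=5)),  # not yet due
    Episode(3, now - timedelta(days=8), "awaiting_transcript", now - timedelta(hours=1)),  # aged out
    Episode(4, now - timedelta(days=1), "completed"),                                      # ignored
]
requeued = run_retry_pass(episodes, now)
```

Running hourly while retrying every 24 hours works because each episode carries its own `next_transcript_retry_at`; the hourly cadence only bounds how late a retry can start.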
Adding New Providers¶
- Create src/cast2md/transcription/providers/newprovider.py
- Implement the TranscriptProvider base class:

      class NewProvider(TranscriptProvider):
          @property
          def source_id(self) -> str:
              return "newprovider"

          def can_provide(self, episode, feed) -> bool:
              return True  # Your check logic

          def fetch(self, episode, feed) -> TranscriptResult | None:
              # Download and return transcript
              return TranscriptResult(content=markdown, source=self.source_id)

- Register in providers/__init__.py:
Key Files¶
| File | Purpose |
|---|---|
| clients/pocketcasts.py | Pocket Casts API client |
| transcription/providers/base.py | TranscriptProvider abstract base class |
| transcription/providers/podcast20.py | Podcasting 2.0 implementation |
| transcription/providers/pocketcasts.py | Pocket Casts implementation |
| transcription/providers/__init__.py | Provider registry and try_fetch_transcript() |
| transcription/formats.py | VTT/SRT/JSON/text parsers |
| feed/discovery.py | Episode discovery and Pocket Casts upfront check |
| worker/manager.py | Transcript download job handler |
Known Limitations¶
- Pocket Casts search matching -- relies on author name matching, which may fail for some podcasts
- Rate limiting -- Pocket Casts API calls are spaced at least 500 ms apart, which slows discovery for feeds with many episodes
- Episode matching -- title normalization may not handle all edge cases (special characters, truncation)
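For reference, a fixed 500 ms gap between requests can be enforced with a small throttle like the one sketched here. The class name `MinIntervalThrottle` and its injectable `clock` and `sleep` parameters are assumptions made so the example can run without real delays; the source does not show how cast2md's client implements the limit.

```python
import time


class MinIntervalThrottle:
    """Ensure consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval=0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self) -> float:
        """Block until the interval has elapsed; return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


# Demonstration with a fake clock so the example finishes instantly.
fake_time = [100.0]


def fake_clock() -> float:
    return fake_time[0]


def fake_sleep(seconds: float) -> None:
    fake_time[0] += seconds


throttle = MinIntervalThrottle(0.5, clock=fake_clock, sleep=fake_sleep)
first = throttle.wait()   # nothing to wait for on the first call
fake_time[0] += 0.2       # only 0.2 s pass before the next request
second = throttle.wait()  # sleeps for the remaining time
fake_time[0] += 1.0       # more than the interval passes
third = throttle.wait()   # no sleep needed
```

Calling `throttle.wait()` immediately before each API request would give the spacing described above.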