Transcript Sources¶

cast2md uses a transcript-first workflow: it tries external transcript sources before falling back to Whisper transcription. When an external transcript is available, no audio download is needed, which reduces both storage requirements and processing time.

Provider Priority¶

Priority	Provider	Source	Description
1	Podcast 2.0	RSS `<podcast:transcript>` tags	Publisher-provided, authoritative
2	Pocket Casts	Pocket Casts API	Auto-generated transcripts
3	Whisper	Local transcription	Last resort, requires audio download

Phase 1: Feed Discovery¶

When a feed is added or refreshed, discover_new_episodes() runs:

RSS Parsing¶

The parser extracts Podcast 2.0 transcript info from each episode entry:

transcript_url -- URL from <podcast:transcript> tag
transcript_type -- MIME type (e.g., text/vtt, application/srt)
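
As an illustration of the extraction step, here is a minimal sketch using only the standard library. It is not the parser cast2md ships; the function name extract_transcripts and the sample feed are made up for the example, and it keeps only the first transcript tag of each item even though an episode may list several.

```python
import xml.etree.ElementTree as ET

# Namespace URI used by the Podcast Namespace (podcast: prefix).
PODCAST_NS = "https://podcastindex.org/namespace/1.0"

# Hypothetical sample feed, for illustration only.
SAMPLE_RSS = """<rss version="2.0" xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 1</title>
      <podcast:transcript url="https://example.com/ep1.vtt" type="text/vtt" />
    </item>
    <item>
      <title>Episode 2</title>
    </item>
  </channel>
</rss>"""


def extract_transcripts(rss_xml: str) -> list:
    """Return one dict per <item> with its first transcript URL and MIME type."""
    root = ET.fromstring(rss_xml)
    episodes = []
    for item in root.iter("item"):
        # Namespaced tags are addressed as "{namespace-uri}localname".
        tag = item.find(f"{{{PODCAST_NS}}}transcript")
        episodes.append({
            "title": item.findtext("title"),
            "transcript_url": tag.get("url") if tag is not None else None,
            "transcript_type": tag.get("type") if tag is not None else None,
        })
    return episodes


for episode in extract_transcripts(SAMPLE_RSS):
    print(episode)
```

Episodes without a transcript tag come back with both fields set to None, which is the case the Pocket Casts check is meant to cover.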

Pocket Casts Upfront Check¶

After creating episodes, the system checks Pocket Casts for episodes that don't have Podcast 2.0 transcripts:

1. Search for the show via POST podcast-api.pocketcasts.com/discover/search
   - Search by feed title and match the result by author name
   - Cache pocketcasts_uuid on the feed for future lookups
2. Get episodes via GET podcast-api.pocketcasts.com/mobile/show_notes/full/{uuid}
   - Returns JSON with all episodes
   - Each episode may have a pocket_casts_transcripts[] array with VTT URLs
3. Match episodes by title similarity and a published date within 24 hours
   - If a match with a transcript URL is found, store it in pocketcasts_transcript_url
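
The episode-matching step could look roughly like the sketch below. Only the rule itself (title similarity plus a 24-hour window on the published date) comes from the description above. Everything else is an assumption: the EpisodeInfo type, the normalize() helper, the use of difflib, and the 0.85 similarity threshold are all chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from difflib import SequenceMatcher


@dataclass
class EpisodeInfo:
    """Hypothetical minimal episode record used only in this sketch."""
    title: str
    published: datetime


def normalize(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in title.lower())
    return " ".join(cleaned.split())


def is_match(local: EpisodeInfo, remote: EpisodeInfo,
             min_ratio: float = 0.85,  # assumed threshold
             max_gap: timedelta = timedelta(hours=24)) -> bool:
    """True if titles are similar and publish dates are within max_gap."""
    if abs(local.published - remote.published) > max_gap:
        return False
    ratio = SequenceMatcher(
        None, normalize(local.title), normalize(remote.title)
    ).ratio()
    return ratio >= min_ratio


local = EpisodeInfo("Ep. 12: The Future of AI!", datetime(2024, 5, 1, 8, 0))
remote = EpisodeInfo("Ep 12 - The Future of AI", datetime(2024, 5, 1, 20, 0))
print(is_match(local, remote))
```

Checking the date window first is cheap and rules out most candidates before the more expensive string comparison runs.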

Phase 2: Transcript Download¶

When a TRANSCRIPT_DOWNLOAD job runs, the provider chain is invoked:

_providers = [
    Podcast20Provider(),    # Check transcript_url first
    PocketCastsProvider(),  # Fallback to pocketcasts_transcript_url
]
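
The chain can be pictured as a loop that returns the first successful result. The sketch below is an assumption about how such a loop might be written, not the actual body of try_fetch_transcript(); in particular, passing the provider list as an argument and the StubProvider class are illustrative choices, and TranscriptResult is redefined here only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranscriptResult:
    """Stand-in for the project's result type (fields assumed)."""
    content: str
    source: str


class StubProvider:
    """Hypothetical provider that returns canned content or nothing."""

    def __init__(self, source_id: str, content: Optional[str]):
        self.source_id = source_id
        self._content = content

    def can_provide(self, episode, feed) -> bool:
        return self._content is not None

    def fetch(self, episode, feed) -> Optional[TranscriptResult]:
        if self._content is None:
            return None
        return TranscriptResult(content=self._content, source=self.source_id)


def try_fetch_transcript(providers, episode, feed) -> Optional[TranscriptResult]:
    """Walk providers in priority order and return the first result."""
    for provider in providers:
        if not provider.can_provide(episode, feed):
            continue
        result = provider.fetch(episode, feed)
        if result is not None:
            return result
    return None  # No external transcript: the caller falls back to Whisper.


chain = [StubProvider("podcast2.0:vtt", None), StubProvider("pocketcasts", "hello")]
result = try_fetch_transcript(chain, episode=None, feed=None)
print(result.source if result else "no transcript")
```

Because order in the list is the priority, adding or reordering providers changes behavior without touching the loop.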

Podcast20Provider¶

Checks: episode.transcript_url exists
Downloads: From the URL in the RSS feed
Parses: Based on MIME type (VTT, SRT, JSON, plain text)
Source tag: podcast2.0:vtt, podcast2.0:srt, podcast2.0:json, or podcast2.0:text
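
One way to derive the source tag from the MIME type is a small lookup table, sketched below. The four tags come from the list above; the exact set of MIME types accepted and the fallback to text for unknown types are assumptions made for this example.

```python
from typing import Optional

# Assumed mapping from MIME type to format suffix.
MIME_TO_FORMAT = {
    "text/vtt": "vtt",
    "application/srt": "srt",
    "application/x-subrip": "srt",
    "application/json": "json",
    "text/plain": "text",
}


def source_tag(transcript_type: Optional[str]) -> str:
    """Build a podcast2.0:<format> tag from a transcript MIME type."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = (transcript_type or "").split(";")[0].strip().lower()
    # Unknown or missing types are treated as plain text (assumption).
    return f"podcast2.0:{MIME_TO_FORMAT.get(mime, 'text')}"


print(source_tag("text/vtt"))
print(source_tag("Application/SRT; charset=utf-8"))
```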

PocketCastsProvider¶

Always attempted as a fallback after Podcast20Provider
If episode.pocketcasts_transcript_url exists (from upfront discovery), downloads directly
Otherwise, searches the Pocket Casts API (slower path)
Source tag: pocketcasts

Data Model¶

Episode Fields¶

Field	Source	Description
`transcript_url`	RSS parsing	Podcast 2.0 `<podcast:transcript>` URL
`transcript_type`	RSS parsing	MIME type of Podcast 2.0 transcript
`pocketcasts_transcript_url`	Upfront discovery	Pocket Casts VTT URL
`transcript_source`	After download	Final source: `whisper`, `podcast2.0:vtt`, `pocketcasts`, etc.
`transcript_path`	After download	Local path to saved transcript
`transcript_model`	After transcription	Local transcription model used (e.g., `large-v3-turbo`, `parakeet-tdt-0.6b-v3`)

Feed Fields¶

Field	Description
`pocketcasts_uuid`	Cached Pocket Casts show UUID (avoids repeated API searches)

Transcript Source Tags¶

When an episode reaches the completed status, the transcript_source column records how its transcript was obtained:

Source	Description
`whisper`	Transcribed locally using Whisper
`podcast2.0:vtt`	Downloaded from publisher (WebVTT format)
`podcast2.0:srt`	Downloaded from publisher (SRT format)
`podcast2.0:json`	Downloaded from publisher (JSON format)
`podcast2.0:text`	Downloaded from publisher (plain text)
`pocketcasts`	Auto-generated by Pocket Casts
`NULL`	Legacy episodes (before this feature)

403 Handling (Pocket Casts)¶

Pocket Casts sometimes returns transcript URLs before the files exist on S3, so the download fails with HTTP 403. The system handles this with age-based retry logic:

Episode Age	Behavior	Rationale
< 7 days	Status: `awaiting_transcript`, retry every 24h	Transcript may be generated soon
>= 7 days	Status: `needs_audio`, no retry	Transcript won't appear

A scheduler job runs hourly to:

Find episodes with awaiting_transcript status and next_transcript_retry_at <= now
Re-queue TRANSCRIPT_DOWNLOAD jobs
Transition aged-out episodes to needs_audio
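
The age-based decision can be sketched as a pure function. The status names, the 7-day cutoff, and the 24-hour retry interval come from the table above. The function name on_transcript_403 is hypothetical, and measuring episode age from the published date is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

MAX_AGE = timedelta(days=7)           # cutoff from the table above
RETRY_INTERVAL = timedelta(hours=24)  # retry spacing from the table above


def on_transcript_403(published_at: datetime,
                      now: datetime) -> Tuple[str, Optional[datetime]]:
    """Return (new_status, next_transcript_retry_at) after a 403 response."""
    if now - published_at < MAX_AGE:
        # Recent episode: the transcript may still be generated.
        return "awaiting_transcript", now + RETRY_INTERVAL
    # Old episode: stop retrying and fall back to audio transcription.
    return "needs_audio", None


now = datetime(2024, 6, 10, tzinfo=timezone.utc)
print(on_transcript_403(now - timedelta(days=2), now))
print(on_transcript_403(now - timedelta(days=10), now))
```

Keeping the decision free of database and clock access makes it easy to test; the hourly scheduler job would only need to supply the two timestamps.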

Adding New Providers¶

Create src/cast2md/transcription/providers/newprovider.py

Implement the TranscriptProvider base class:

class NewProvider(TranscriptProvider):
    @property
    def source_id(self) -> str:
        return "newprovider"

    def can_provide(self, episode, feed) -> bool:
        return True  # Your check logic

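    def fetch(self, episode, feed) -> TranscriptResult | None:
        # Download the transcript and convert it to Markdown.
        # Return None if no transcript is available.
        markdown = ...  # Placeholder: your download and conversion logic
        return TranscriptResult(content=markdown, source=self.source_id)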

Register in providers/__init__.py:

_providers = [
    Podcast20Provider(),
    PocketCastsProvider(),
    NewProvider(),  # Add here
]

Key Files¶

File	Purpose
`clients/pocketcasts.py`	Pocket Casts API client
`transcription/providers/base.py`	`TranscriptProvider` abstract base class
`transcription/providers/podcast20.py`	Podcast 2.0 implementation
`transcription/providers/pocketcasts.py`	Pocket Casts implementation
`transcription/providers/__init__.py`	Provider registry and `try_fetch_transcript()`
`transcription/formats.py`	VTT/SRT/JSON/text parsers
`feed/discovery.py`	Episode discovery and Pocket Casts upfront check
`worker/manager.py`	Transcript download job handler

Known Limitations¶

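Pocket Casts search matching -- relies on author name matching, which may fail for some podcasts (for example, when the feed and Pocket Casts list the author differently)
Rate limiting -- Pocket Casts API calls are throttled to at least 500 ms between requests, which slows lookups across many feeds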
Episode matching -- title normalization may not handle all edge cases (special characters, truncation)
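
For reference, the 500 ms spacing between Pocket Casts requests could be enforced with a small throttle like the one below. This is a sketch, not the code in clients/pocketcasts.py; the Throttle class and its wait() method are hypothetical names.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between calls to wait()."""

    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self) -> None:
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(min_interval=0.5)
for name in ("search", "show_notes"):
    throttle.wait()  # Call before each API request.
    print("requesting", name)
```

The first call returns immediately; only subsequent calls sleep, and only for the part of the interval that has not already elapsed.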