Transcript Sources

cast2md uses a transcript-first workflow: it tries external transcript sources first and falls back to Whisper transcription only when none is available. When an external transcript exists, no audio download is needed, which saves both storage and processing time.

Provider Priority

Priority | Provider     | Source                        | Description
1        | Podcast 2.0  | RSS <podcast:transcript> tags | Publisher-provided, authoritative
2        | Pocket Casts | Pocket Casts API              | Auto-generated transcripts
3        | Whisper      | Local transcription           | Last resort, requires audio download

Phase 1: Feed Discovery

When a feed is added or refreshed, discover_new_episodes() runs:

RSS Parsing

The parser extracts Podcast 2.0 transcript info from each episode entry:

  • transcript_url -- URL from <podcast:transcript> tag
  • transcript_type -- MIME type (e.g., text/vtt, application/srt)
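
The extraction step can be illustrated with the standard library alone. This is a minimal sketch, not cast2md's actual parser: the function name `extract_transcripts` and the sample feed are made up for illustration, and it keeps only the first `<podcast:transcript>` tag per episode even though a feed may list several.

```python
import xml.etree.ElementTree as ET

# Namespace URI used by Podcast 2.0 feeds for the "podcast:" prefix.
PODCAST_NS = {"podcast": "https://podcastindex.org/namespace/1.0"}

# Hypothetical feed used only for this example.
SAMPLE_FEED = """<rss version="2.0" xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 1</title>
      <podcast:transcript url="https://example.com/ep1.vtt" type="text/vtt"/>
    </item>
    <item>
      <title>Episode 2</title>
    </item>
  </channel>
</rss>
"""


def extract_transcripts(feed_xml: str) -> list[dict]:
    """Return one dict per <item> with transcript_url / transcript_type (or None)."""
    root = ET.fromstring(feed_xml)
    episodes = []
    for item in root.iter("item"):
        # Assumption: take the first transcript tag if several are present.
        tag = item.find("podcast:transcript", PODCAST_NS)
        episodes.append({
            "title": item.findtext("title"),
            "transcript_url": tag.get("url") if tag is not None else None,
            "transcript_type": tag.get("type") if tag is not None else None,
        })
    return episodes


for ep in extract_transcripts(SAMPLE_FEED):
    print(ep)
```

Episodes whose `transcript_url` comes back as `None` are the ones the Pocket Casts upfront check would look at next.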

Pocket Casts Upfront Check

After creating episodes, the system checks Pocket Casts for episodes that don't have Podcast 2.0 transcripts:

  1. Search for show via POST podcast-api.pocketcasts.com/discover/search
    • Search by feed title, match result by author name
    • Cache pocketcasts_uuid on feed for future lookups
  2. Get episodes via GET podcast-api.pocketcasts.com/mobile/show_notes/full/{uuid}
    • Returns JSON with all episodes
    • Each episode may have pocket_casts_transcripts[] array with VTT URLs
  3. Match episodes by title similarity + published date within 24 hours
    • If match found with transcript URL, store in pocketcasts_transcript_url
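
Step 3 can be sketched roughly as follows. The normalization rules, the `Candidate` type, the `find_match` helper, and the 0.85 similarity threshold are all assumptions chosen for illustration; only the "title similarity plus published date within 24 hours" rule comes from the description above.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical shape of one episode returned by the show notes endpoint."""
    title: str
    published: datetime
    transcript_url: Optional[str] = None


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(cleaned.split())


def find_match(title: str, published: datetime, candidates: list[Candidate],
               min_ratio: float = 0.85,
               max_gap: timedelta = timedelta(hours=24)) -> Optional[Candidate]:
    """Return the most similar candidate published within max_gap, if any."""
    wanted = normalize_title(title)
    best, best_ratio = None, 0.0
    for cand in candidates:
        # Date check first: it is cheap and rules out most candidates.
        if abs(cand.published - published) > max_gap:
            continue
        ratio = SequenceMatcher(None, wanted, normalize_title(cand.title)).ratio()
        if ratio >= min_ratio and ratio > best_ratio:
            best, best_ratio = cand, ratio
    return best
```

A caller would store `match.transcript_url` in `pocketcasts_transcript_url` only when a match is found and its URL is not `None`.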

Phase 2: Transcript Download

When a TRANSCRIPT_DOWNLOAD job runs, the provider chain is invoked:

_providers = [
    Podcast20Provider(),    # Check transcript_url first
    PocketCastsProvider(),  # Fallback to pocketcasts_transcript_url
]
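
How the chain is walked can be sketched as below. `try_fetch_transcript()` is the name listed under Key Files, but its real signature is not shown in this document, so the parameters here are an assumption, and `StubProvider` is a stand-in invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranscriptResult:
    content: str
    source: str


class StubProvider:
    """Stand-in provider for illustration; returns a canned result."""

    def __init__(self, source_id: str, available: bool, content: Optional[str]):
        self.source_id = source_id
        self._available = available
        self._content = content

    def can_provide(self, episode, feed) -> bool:
        return self._available

    def fetch(self, episode, feed) -> Optional[TranscriptResult]:
        if self._content is None:
            return None
        return TranscriptResult(content=self._content, source=self.source_id)


def try_fetch_transcript(episode, feed, providers) -> Optional[TranscriptResult]:
    """Return the first transcript any provider yields, in list order."""
    for provider in providers:
        if not provider.can_provide(episode, feed):
            continue
        result = provider.fetch(episode, feed)
        if result is not None:
            return result
    return None  # No external transcript: the caller falls back to Whisper.


providers = [
    StubProvider("podcast2.0:vtt", available=False, content=None),
    StubProvider("pocketcasts", available=True, content="WEBVTT"),
]
result = try_fetch_transcript(episode=None, feed=None, providers=providers)
print(result.source if result else "no external transcript")
```

Because the list order is the priority order, a publisher-provided transcript is always preferred when both sources exist.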

Podcast20Provider

  • Checks: episode.transcript_url exists
  • Downloads: From the URL in the RSS feed
  • Parses: Based on MIME type (VTT, SRT, JSON, plain text)
  • Source tag: podcast2.0:vtt, podcast2.0:srt, podcast2.0:json, or podcast2.0:text
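
One way to derive the source tag from `transcript_type` is a small lookup table. This is a sketch under assumptions: the exact MIME types cast2md recognizes are not listed here beyond `text/vtt` and `application/srt`, and falling back to `text` for unknown types is a guess, not documented behavior.

```python
from typing import Optional

# Assumed mapping; only text/vtt and application/srt appear in this document.
MIME_TO_FORMAT = {
    "text/vtt": "vtt",
    "application/srt": "srt",
    "application/x-subrip": "srt",
    "application/json": "json",
    "text/plain": "text",
}


def source_tag(transcript_type: Optional[str]) -> str:
    """Build a podcast2.0:<format> tag from a MIME type string."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = (transcript_type or "").split(";")[0].strip().lower()
    # Assumption: unknown or missing types are treated as plain text.
    return f"podcast2.0:{MIME_TO_FORMAT.get(mime, 'text')}"


print(source_tag("text/vtt"))
```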

PocketCastsProvider

  • Always attempted as a fallback when Podcast20Provider does not return a transcript
  • If episode.pocketcasts_transcript_url exists (from upfront discovery), downloads directly
  • Otherwise, searches the Pocket Casts API (slower path)
  • Source tag: pocketcasts

Data Model

Episode Fields

Field                      | Source              | Description
transcript_url             | RSS parsing         | Podcast 2.0 <podcast:transcript> URL
transcript_type            | RSS parsing         | MIME type of Podcast 2.0 transcript
pocketcasts_transcript_url | Upfront discovery   | Pocket Casts VTT URL
transcript_source          | After download      | Final source: whisper, podcast2.0:vtt, pocketcasts, etc.
transcript_path            | After download      | Local path to saved transcript
transcript_model           | After transcription | Transcription model used (e.g., large-v3-turbo, parakeet-tdt-0.6b-v3)

Feed Fields

Field            | Description
pocketcasts_uuid | Cached Pocket Casts show UUID (avoids repeated API searches)

Transcript Source Tags

When an episode reaches the completed status, the transcript_source column records how its transcript was obtained:

Source          | Description
whisper         | Transcribed locally using Whisper
podcast2.0:vtt  | Downloaded from publisher (WebVTT format)
podcast2.0:srt  | Downloaded from publisher (SRT format)
podcast2.0:json | Downloaded from publisher (JSON format)
podcast2.0:text | Downloaded from publisher (plain text)
pocketcasts     | Auto-generated by Pocket Casts
NULL            | Legacy episodes (before this feature)

403 Handling (Pocket Casts)

Pocket Casts sometimes returns transcript URLs before the files exist on S3, so the download fails with HTTP 403. The system handles this with age-based retry logic:

Episode Age | Behavior                                     | Rationale
< 7 days    | Status: awaiting_transcript, retry every 24h | Transcript may be generated soon
>= 7 days   | Status: needs_audio, no retry                | Transcript won't appear

A scheduler job runs hourly to:

  1. Find episodes with awaiting_transcript status and next_transcript_retry_at <= now
  2. Re-queue TRANSCRIPT_DOWNLOAD jobs
  3. Transition aged-out episodes to needs_audio
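
The age rule and the hourly pass can be sketched together. This is an illustration only: episodes are plain dicts here, the helper names `on_transcript_403` and `run_retry_pass` are hypothetical, and checking the age limit before the retry time is an assumed ordering. The status names, the 7-day cutoff, the 24-hour interval, and `next_transcript_retry_at` come from the description above.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=7)           # Cutoff from the table above
RETRY_INTERVAL = timedelta(hours=24)  # Retry spacing from the table above


def on_transcript_403(ep: dict, now: datetime) -> None:
    """Apply the age-based rule when a transcript download returns 403."""
    if now - ep["published"] < MAX_AGE:
        ep["status"] = "awaiting_transcript"
        ep["next_transcript_retry_at"] = now + RETRY_INTERVAL
    else:
        ep["status"] = "needs_audio"
        ep["next_transcript_retry_at"] = None


def run_retry_pass(episodes: list[dict], now: datetime) -> list[int]:
    """One scheduler pass: age out old episodes, return ids to re-queue."""
    due = []
    for ep in episodes:
        if ep["status"] != "awaiting_transcript":
            continue
        if now - ep["published"] >= MAX_AGE:
            # Too old: stop waiting and fall back to audio download.
            ep["status"] = "needs_audio"
            ep["next_transcript_retry_at"] = None
        elif ep["next_transcript_retry_at"] <= now:
            due.append(ep["id"])  # Caller re-queues TRANSCRIPT_DOWNLOAD.
    return due
```

In a real deployment the loop would be a database query rather than a scan over in-memory dicts; the branching is the part this sketch is meant to show.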

Adding New Providers

  1. Create src/cast2md/transcription/providers/newprovider.py
  2. Implement the TranscriptProvider base class:

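    # Assumes TranscriptResult is defined alongside TranscriptProvider in base.py
    from .base import TranscriptProvider, TranscriptResult

    class NewProvider(TranscriptProvider):
        @property
        def source_id(self) -> str:
            # Identifier passed as the source of the returned TranscriptResult
            return "newprovider"

        def can_provide(self, episode, feed) -> bool:
            return True  # Your check logic

        def fetch(self, episode, feed) -> TranscriptResult | None:
            # Download the transcript and convert it to Markdown
            markdown = ...  # Placeholder: your download and conversion logic
            return TranscriptResult(content=markdown, source=self.source_id)
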
  3. Register in providers/__init__.py:

    _providers = [
        Podcast20Provider(),
        PocketCastsProvider(),
        NewProvider(),  # Add here
    ]
    

Key Files

File                                   | Purpose
clients/pocketcasts.py                 | Pocket Casts API client
transcription/providers/base.py        | TranscriptProvider abstract base class
transcription/providers/podcast20.py   | Podcasting 2.0 implementation
transcription/providers/pocketcasts.py | Pocket Casts implementation
transcription/providers/__init__.py    | Provider registry and try_fetch_transcript()
transcription/formats.py               | VTT/SRT/JSON/text parsers
feed/discovery.py                      | Episode discovery and Pocket Casts upfront check
worker/manager.py                      | Transcript download job handler

Known Limitations

  1. Pocket Casts search matching -- relies on author name matching, which may fail for some podcasts
  2. Rate limiting -- Pocket Casts API calls are throttled to at most one request every 500 ms, which slows discovery for feeds with many episodes
  3. Episode matching -- title normalization may not handle all edge cases (special characters, truncation)
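
The 500 ms spacing from item 2 can be pictured as a minimum-interval throttle. The class below is a sketch, not the client's actual code: `MinIntervalThrottle` is a made-up name, and the injectable `clock` and `sleep` parameters exist only to make the behavior easy to test.

```python
import time


class MinIntervalThrottle:
    """Block so that consecutive calls are at least min_interval seconds apart."""

    def __init__(self, min_interval: float = 0.5,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # Time of the previous call, None before the first.

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = MinIntervalThrottle(0.5)
throttle.wait()  # First call returns immediately.
throttle.wait()  # Second call sleeps for the rest of the 500 ms window.
```

Calling `wait()` before each API request keeps the spacing without the caller tracking timestamps.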