---
name: Captions From Transcript
description: Produce an accurate, properly timed caption track (SRT or WebVTT) from a video's audio — transcribing or aligning to the voiceover script, timing cues to speech, and enforcing line-length and reading-speed rules so captions are readable and in sync. Use when someone says "generate captions", "make subtitles from the audio", "transcribe and caption this", "export an SRT or VTT", "the captions are out of sync", or "the subtitles flash by too fast to read". Do NOT use to burn-in, style, reframe, or position an existing caption track for a platform (9:16, brand styling, safe areas) — that is social-video-formatter; do NOT use to animate text word-by-word as a motion graphic — that is kinetic-typography; do NOT use to write the spoken script in the first place — that is narration-script.
---
# Captions From Transcript
Captions are not optional polish — they are accessibility, they hold viewers watching muted, and for tutorials they reinforce exact UI terms. This skill produces a **clean, accurately timed caption file**. Styling and burn-in happen downstream.
## Source the most accurate text first
Accuracy comes from where you start:
- **If a narration script exists, use it as ground truth.** It is already correct on technical terms and UI labels. Align it to the audio rather than re-transcribing from scratch.
- **Otherwise transcribe the audio** (Whisper, a platform's auto-caption, or any ASR), then **correct it against the video**. ASR reliably mangles product names, code, and acronyms — fix every one to match what is on screen exactly.
Never ship raw ASR output. The errors are always in the highest-value words.
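When a script exists, one way to find the words ASR got wrong is to diff the two texts word by word. The sketch below is a minimal illustration using only Python's standard `difflib`; the function names `tokenize` and `asr_mismatches` are hypothetical, and the comparison is case-insensitive by assumption, so capitalization of UI labels still needs a manual check.

```python
import difflib
import re


def tokenize(text):
    """Split text into words, keeping internal dots, slashes, hyphens and apostrophes."""
    return re.findall(r"\w+(?:[.'/-]\w+)*", text)


def asr_mismatches(script_text, asr_text):
    """Return (asr_words, script_words) pairs wherever the ASR output departs from the script.

    The script is treated as ground truth. Matching ignores case.
    """
    script_words = tokenize(script_text)
    asr_words = tokenize(asr_text)
    matcher = difflib.SequenceMatcher(
        None,
        [w.lower() for w in asr_words],
        [w.lower() for w in script_words],
        autojunk=False,
    )
    mismatches = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            # "replace", "delete" and "insert" all mark a spot to review by hand.
            mismatches.append(
                (" ".join(asr_words[a1:a2]), " ".join(script_words[b1:b2]))
            )
    return mismatches


if __name__ == "__main__":
    script = "Open the kubectl config and click Save Changes."
    asr = "open the cube cuddle config and click save changes"
    for heard, intended in asr_mismatches(script, asr):
        print(f"ASR heard {heard!r}, script says {intended!r}")
```

The output is a review list, not an automatic fix: each flagged pair still needs to be checked against the audio, since the speaker may have departed from the script.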
## Time the cues to speech
- Each cue appears as its line is spoken and clears when it ends — align to speech, not to arbitrary intervals.
- Minimum ~1 second on screen (even for a short cue), maximum ~7 seconds. Split anything longer.
- No gaps mid-sentence; small gaps between sentences are fine and aid readability.
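The duration limits above can be sketched as a small post-processing pass. This is an illustration under stated assumptions, not a definitive implementation: the `Cue` type, the function names, and the exact 1.0 s and 7.0 s thresholds are chosen here for the example, and long cues are split evenly by word count because no word-level timestamps are assumed. It covers only the minimum and maximum duration rules; it does not close mid-sentence gaps.

```python
import math
from dataclasses import dataclass

MIN_DURATION = 1.0  # seconds; assumed value for the "~1 second" minimum
MAX_DURATION = 7.0  # seconds; assumed value for the "~7 seconds" maximum


@dataclass
class Cue:
    start: float  # seconds from the start of the video
    end: float
    text: str


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def split_long(cue):
    """Split a cue longer than MAX_DURATION into equal-length pieces by word count.

    Crude by design: with word-level timestamps, split at the real word times instead.
    A cue with fewer words than required pieces can still exceed the maximum.
    """
    duration = cue.end - cue.start
    words = cue.text.split()
    if duration <= MAX_DURATION or not words:
        return [cue]
    parts = math.ceil(duration / MAX_DURATION)
    per_part = math.ceil(len(words) / parts)
    chunks = [words[i:i + per_part] for i in range(0, len(words), per_part)]
    step = duration / len(chunks)
    return [
        Cue(cue.start + k * step, cue.start + (k + 1) * step, " ".join(chunk))
        for k, chunk in enumerate(chunks)
    ]


def enforce_durations(cues):
    """Extend too-short cues (never past the next cue's start) and split too-long ones."""
    fixed = []
    for i, cue in enumerate(cues):
        next_start = cues[i + 1].start if i + 1 < len(cues) else math.inf
        end = cue.end
        if end - cue.start < MIN_DURATION:
            end = max(end, min(cue.start + MIN_DURATION, next_start))
        fixed.extend(split_long(Cue(cue.start, end, cue.text)))
    return fixed


def to_srt(cues):
    """Serialize cues as SRT: index, time range, text, blank line."""
    blocks = [
        f"{n}\n{srt_timestamp(c.start)} --> {srt_timestamp(c.end)}\n{c.text}\n"
        for n, c in enumerate(cues, start=1)
    ]
    return "\n".join(blocks)


if __name__ == "__main__":
    raw = [Cue(0.0, 0.4, "Hi."), Cue(2.0, 12.0, "one two three four")]
    print(to_srt(enforce_durations(raw)))
```

Extending a short cue only up to the next cue's start keeps cues from overlapping, at the cost of occasionally leaving a cue under the minimum when speech is dense.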