convertToCaptions() v4.0.131

warning

This API requires a version of Whisper.cpp newer than the stable release in order to support tokenLevelTimestamps. The downside is that this version may crash unexpectedly. If you prefer a stable version of Whisper.cpp and can forgo tokenLevelTimestamps support, use an older version (1.0.54 or earlier).

An opinionated function that converts the output of transcribe() into easily digestible captions.
It can also combine words whose timestamps are close together.
This is useful for TikTok/Reel-style videos that animate captions word by word.

transcribe.mjs
tsx
import path from "path";
import { transcribe, convertToCaptions } from "@remotion/install-whisper-cpp";
 
const { transcription } = await transcribe({
  inputPath: "/path/to/audio.wav",
  whisperPath: path.join(process.cwd(), "whisper.cpp"),
  model: "medium.en",
  tokenLevelTimestamps: true,
});
 
const { captions } = convertToCaptions({
  transcription,
  combineTokensWithinMilliseconds: 200,
});
 
for (const line of captions) {
  console.log(line.text, line.startInSeconds);
}

Options

`transcription`

The transcription object that you retrieved from transcribe().
The tokenLevelTimestamps option must have been set to true.

`combineTokensWithinMilliseconds`

Combine words that are close to each other.
If words are not combined, each one may be displayed only very briefly when captions are rendered word by word.
Set this to 0 to disable combining.
Recommendation: 200.
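
To make the effect of the threshold concrete, here is a minimal sketch of one way such combining could work. It is not the library's implementation: the `TimedWord` type and the `combineCloseWords` function are hypothetical names used only for illustration, and the sketch assumes that "close" means a word starts within the threshold of the previous word's start.

```typescript
// Hypothetical word shape for illustration; not a type exported by the library.
type TimedWord = {
  text: string;
  startInSeconds: number;
};

// Assumption: a word is merged into the current group when it starts less than
// `thresholdMs` after the previous word started. A threshold of 0 disables merging.
function combineCloseWords(
  words: TimedWord[],
  thresholdMs: number
): TimedWord[] {
  const combined: TimedWord[] = [];
  let lastStart = -Infinity; // start time of the most recently seen word

  for (const word of words) {
    const previous = combined[combined.length - 1];
    const gapMs = (word.startInSeconds - lastStart) * 1000;

    if (previous && thresholdMs > 0 && gapMs < thresholdMs) {
      // Keep the group's original start time and append the text.
      previous.text = previous.text + " " + word.text;
    } else {
      combined.push({ text: word.text, startInSeconds: word.startInSeconds });
    }

    lastStart = word.startInSeconds;
  }

  return combined;
}
```

Because each word is compared with the word directly before it, a rapid run of words collapses into a single caption. Comparing against the start of the group instead would cap how long a combined caption can grow.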

Return value

An object of the following shape:

ts
type Caption = {
  text: string;
  startInSeconds: number;
};
 
type ReturnValue = {
  captions: Caption[];
};
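
As a small illustration of working with this shape, the sketch below turns a list of captions into readable log lines. The `Caption` type is redeclared locally so the snippet is self-contained, and `formatTimestamp` and `describeCaptions` are hypothetical helpers, not part of the package.

```typescript
// Local copy of the documented shape so the sketch is self-contained.
type Caption = {
  text: string;
  startInSeconds: number;
};

// Left-pads a non-negative integer with zeros to the given width.
function pad(value: number, width: number): string {
  let result = String(value);
  while (result.length < width) {
    result = "0" + result;
  }
  return result;
}

// Hypothetical helper: formats a time in seconds as mm:ss.mmm.
function formatTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const minutes = Math.floor(totalMs / 60000);
  const wholeSeconds = Math.floor((totalMs % 60000) / 1000);
  const ms = totalMs % 1000;
  return pad(minutes, 2) + ":" + pad(wholeSeconds, 2) + "." + pad(ms, 3);
}

// Hypothetical helper: one log line per caption, with surrounding whitespace trimmed.
function describeCaptions(captions: Caption[]): string[] {
  return captions.map(
    (caption) =>
      "[" + formatTimestamp(caption.startInSeconds) + "] " + caption.text.trim()
  );
}
```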

Suggested usage

This shows how, given a data structure produced by convertToCaptions(), word-by-word captions can be rendered in a Remotion project.
See our TikTok template for a full reference implementation.

note

@remotion/install-whisper-cpp is a Node.js API and cannot be imported in frontend code.
Only the TypeScript type is imported in this example.

tsx
import React from "react";
import type { Caption } from "@remotion/install-whisper-cpp";
import { Sequence, useVideoConfig } from "remotion";

// Placeholder component: replace it with your own styled caption.
const Subtitle: React.FC<{ text: string }> = ({ text }) => {
  return <div>{text}</div>;
};

const Captions: React.FC<{
  subtitles: Caption[];
}> = ({ subtitles }) => {
  const { fps } = useVideoConfig();

  return (
    <>
      {subtitles.map((subtitle, index) => {
        const nextSubtitle = subtitles[index + 1] ?? null;
        const subtitleStartFrame = subtitle.startInSeconds * fps;
        // Show each caption until the next one starts, but for at most one second.
        const subtitleEndFrame = Math.min(
          nextSubtitle ? nextSubtitle.startInSeconds * fps : Infinity,
          subtitleStartFrame + fps,
        );

        // The key belongs on the outermost element returned from map().
        return (
          <Sequence
            key={index}
            from={subtitleStartFrame}
            durationInFrames={subtitleEndFrame - subtitleStartFrame}
          >
            <Subtitle text={subtitle.text} />
          </Sequence>
        );
      })}
    </>
  );
};
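
The timing logic of the component can also be extracted into a plain function, which makes it easier to test outside of React. The following is a sketch under stated assumptions rather than part of the package: `toFrameRanges` and `FrameRange` are hypothetical names, frame values are rounded to whole frames, and captions that would end up with a non-positive duration (for example, two captions with the same start time) are skipped.

```typescript
// Local copy of the documented shape so the sketch is self-contained.
type Caption = {
  text: string;
  startInSeconds: number;
};

// Hypothetical shape: the values one would pass to <Sequence>.
type FrameRange = {
  text: string;
  from: number;
  durationInFrames: number;
};

// Each caption is shown until the next one starts, but for at most one second.
function toFrameRanges(captions: Caption[], fps: number): FrameRange[] {
  const ranges: FrameRange[] = [];

  captions.forEach((caption, index) => {
    const next = captions[index + 1];
    const from = Math.round(caption.startInSeconds * fps);
    const end = Math.min(
      next ? Math.round(next.startInSeconds * fps) : Infinity,
      from + Math.round(fps)
    );
    const durationInFrames = end - from;

    // Assumption: zero-length or negative ranges are not worth rendering.
    if (durationInFrames > 0) {
      ranges.push({ text: caption.text, from, durationInFrames });
    }
  });

  return ranges;
}
```

Inside a component, the result could be mapped to one `<Sequence>` per range, with `fps` taken from `useVideoConfig()` as in the example above.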
