I created a Python script with Claude Code to automatically split audio files at silent intervals

How to detect silent sections with ffmpeg's silencedetect filter and automatically split audio files with a Python script. Boundaries are selected by silence duration, so a file can be split into a specified number of segments. The design process with Claude Code is also covered.
2026.04.27

Do you have a long audio file that you want to split into multiple files? I occasionally do.

I knew that ffmpeg could process audio, but I was intimidated by the multitude of options and couldn't get started. So I first asked Claude Code (an AI coding agent) to "create a Python script that automatically splits audio files at silent sections."

Design Process with Claude Code

Determining How to Detect Silence

The first suggestion was ffmpeg's silencedetect filter. The basic framework was established: "By specifying parameters like silencedetect=noise=...dB:d=..., the start and end times of silent sections are output to stderr. These can be parsed after receiving them via Python's subprocess to identify split points." I initially got stuck on the stderr output, but Claude explained that this is how ffmpeg filter logs work by design.

Defining "Silence"

After establishing the framework, we needed to define "silence" concretely. This involves two axes: what dB level counts as silence (the volume threshold) and how long the quiet must last to count (the minimum duration). Testing with actual audio files, we settled on around -30dB / 1.0 seconds as the baseline. However, we hit cases where what should have been a single break was detected as multiple short silences due to environmental noise or breathing. This led to the suggestion that "it would be good to merge silence periods that are close together," which became the --merge-gap parameter.

Changing the Silence Selection Algorithm

As we progressed with the design, the requirement "I just want to split into N parts" became clear. Initially, we planned to use "all detected silences as boundaries," but to achieve exactly N parts, we needed to select "which silences to use as boundaries." Claude suggested an algorithm based on the idea that "longer silences are more likely to be clear breaks," which sorts silences by length in descending order and selects the top N-1. This seemed simple and intuitive, so we proceeded with this approach.

Refining the Details

Finally, we refined the details. To avoid unnatural, abrupt cuts when segments are sliced at exact timestamps, we added padding of several tens of milliseconds before and after each segment. For ffmpeg's -ss option, there's a tradeoff between placing it before -i (fast keyframe seek) and after -i (accurate but slower). Claude advised that "for audio files, quality degradation is negligible, and for extracting many segments, the speed difference becomes significant, so fast seek is better," so we adopted the fast-seek approach.


In this article, I'll explain what the resulting script does. The full script text is included at the end of this blog.

What is ffmpeg?

ffmpeg is an open-source command-line tool for converting, cutting, encoding, and filtering audio and video files. It works on Linux, macOS, and Windows, and its wide format support and rich filters make it widely used inside video editing software and streaming tools.

However, it has so many options that it can be difficult to know where to start. So, I tried using it with Claude Code's support.

https://ffmpeg.org/

What I Created

I created a Python script that works as a CLI with the following input and output:

Input

  • Audio files such as .mp3 or .m4a

Output

  • Sequentially numbered segment files like original_filename-001.mp3, original_filename-002.mp3, ...
  • original_filename-manifest.tsv recording segment boundary information

Here's how it looks when executed:

# Split at all silence periods (check boundaries only)
python split_audio.py input.mp3 --dry-run
detected 11 silences → merged to 11 → selected 9 boundaries → 10 segments
  silence: 0.000s .. 1.108s  (1.108s)
  001: 1.108s .. 2.066s  (0.958s)
  silence: 2.066s .. 3.215s  (1.149s)
  002: 3.215s .. 5.535s  (2.320s)
  silence: 5.535s .. 7.184s  (1.649s)
  003: 7.184s .. 10.921s  (3.737s)
  silence: 10.921s .. 12.695s  (1.774s)
  004: 12.695s .. 17.957s  (5.262s)
  silence: 17.957s .. 19.997s  (2.040s)
  005: 19.997s .. 24.857s  (4.860s)
  silence: 24.857s .. 26.890s  (2.033s)
  006: 26.890s .. 29.614s  (2.725s)
  silence: 29.614s .. 31.740s  (2.126s)
  007: 31.740s .. 35.275s  (3.535s)
  silence: 35.275s .. 37.479s  (2.204s)
  008: 37.479s .. 43.639s  (6.159s)
  silence: 43.639s .. 45.354s  (1.716s)
  009: 45.354s .. 48.156s  (2.802s)
  silence: 48.156s .. 49.957s  (1.801s)
  010: 49.957s .. 54.792s  (4.835s)
  silence: 54.792s .. 57.353s  (2.561s)

# Split into 10 parts and output to out/
python split_audio.py input.mp3 --target-count 10 --output-dir out/

If you omit --target-count, all detected silent sections are used as boundaries. If specified, silent sections are selected to create exactly N segments.

Here's the overall process: detect silences with ffmpeg → merge nearby silences → build a timeline of silence and content → select boundary silences → build padded segments → extract each segment with ffmpeg.

Mechanism: ffmpeg's silencedetect

silencedetect is an ffmpeg audio filter that detects "silence" as periods where the volume stays below a specified level for a certain duration.

ffmpeg -i input.mp3 -af "silencedetect=noise=-30dB:d=1.0" -f null -
  • noise=-30dB — Volume threshold below which audio is considered silence
  • d=1.0 — Minimum duration in seconds for a quiet stretch to count as silence

Increasing d will miss shorter breaks, while decreasing it will pick up more noise. The script uses 1.0 seconds as the default value.
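
For example, lowering d to 0.5 seconds picks up breaks as short as half a second:

ffmpeg -i input.mp3 -af "silencedetect=noise=-30dB:d=0.5" -f null -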

Detection results are output to stderr:

[silencedetect @ 0x...] silence_start: 0
[silencedetect @ 0x...] silence_end: 1.148005 | silence_duration: 1.148005
[silencedetect @ 0x...] silence_start: 2.025556
[silencedetect @ 0x...] silence_end: 3.254853 | silence_duration: 1.229297

In the Python program, we call ffmpeg through subprocess, parse only the silence_end and silence_duration lines with regular expressions, and calculate the start time with start = end - duration.
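
Here's the corresponding excerpt from the script:

split_audio.py (excerpt)
SILENCE_RE = re.compile(r"silence_end:\s*([\d.]+)\s*\|\s*silence_duration:\s*([\d.]+)")

# ffmpeg writes filter logs to stderr, not stdout
silences: list[Silence] = []
for line in result.stderr.splitlines():
    m = SILENCE_RE.search(line)
    if m:
        end = float(m.group(1))
        duration = float(m.group(2))
        # Don't use the silence_start lines; derive start as end - duration
        silences.append(Silence(start=end - duration, end=end, duration=duration))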

Design Points

Merging Nearby Silences

Breathing or environmental noise can cause what should be a single break to be detected as multiple short silences tens of milliseconds apart. Silences within --merge-gap seconds of each other are merged into one.

split_audio.py
def merge_nearby(silences: list[Silence], merge_gap: float) -> list[Silence]:
    """Merge silence periods that are within merge_gap seconds of each other."""
    merged: list[Silence] = []
    for si in silences:
        if merged and si.start - merged[-1].end <= merge_gap:
            prev = merged[-1]
            new_end = max(prev.end, si.end)
            merged[-1] = Silence(start=prev.start, end=new_end, duration=new_end - prev.start)
        else:
            merged.append(si)
    return merged
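
For example, with the default --merge-gap of 0.1 seconds, two detections 50 ms apart (hypothetical values) become a single silence:

a = Silence(start=1.00, end=2.00, duration=1.00)
b = Silence(start=2.05, end=2.80, duration=0.75)
merge_nearby([a, b], merge_gap=0.1)
# -> [Silence(start=1.0, end=2.8, duration=1.8)]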

Building a Timeline

From the merged silence list, we create a timeline that alternates [Silence, Content, Silence, Content, ..., Silence]. Content represents the sections between silences.

split_audio.py
def build_timeline(silences: list[Silence], total: float) -> list[Silence | Content]:
    timeline: list[Silence | Content] = []
    pos = 0.0
    for si in silences:
        if si.start > pos:
            timeline.append(Content(start=pos, end=si.start))
        timeline.append(si)
        pos = si.end
    # Ignore tiny segments after the last silence (detection errors where ffmpeg's silence_end is before total)
    if total - pos > 0.1:
        timeline.append(Content(start=pos, end=total))
    return timeline
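
A small illustration with hypothetical values — a silence at the very start and one in the middle of a 6-second file:

silences = [Silence(start=0.0, end=1.0, duration=1.0),
            Silence(start=3.0, end=4.0, duration=1.0)]
build_timeline(silences, total=6.0)
# -> [Silence(0.0..1.0), Content(1.0..3.0), Silence(3.0..4.0), Content(4.0..6.0)]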

Selecting Top Silences by Duration

select_boundaries considers only middle silences as boundary candidates, excluding those starting at the beginning (0.0 seconds) and ending at the end (total seconds). With --dry-run, head and tail silences are also displayed so you can see where content actually begins.

When --target-count N is specified, N-1 boundaries are needed to create N segments. Since longer silences are more likely to be clear breaks, the top N-1 by length are selected.

split_audio.py
# Note: This is simplified conceptual code
def select_boundaries(timeline: list[Silence | Content], target_count: int | None, total: float) -> list[Silence]:
    # Exclude Silences starting at the beginning (0.0) and ending at the end (total)
    # Use the same 0.1s threshold as build_timeline uses to skip tiny end Content
    eps = 0.1
    inner = [b for b in timeline if isinstance(b, Silence) and b.start > eps and total - b.end > eps]
    if target_count is None:
        # When target_count is not specified, use all inner Silences as boundaries
        return inner
    needed = target_count - 1
    ...
    # Longer silences are more likely to be clear breaks, so select the top ones by length
    selected = sorted(inner, key=lambda si: si.duration, reverse=True)[:needed]
    return sorted(selected, key=lambda si: si.start)
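
A worked example with hypothetical values — three inner silences of 1.2 s, 0.5 s, and 2.0 s and --target-count 3:

inner = [Silence(start=5.0, end=6.2, duration=1.2),
         Silence(start=10.0, end=10.5, duration=0.5),
         Silence(start=15.0, end=17.0, duration=2.0)]
# needed = 3 - 1 = 2; the top 2 by duration are the 2.0 s and 1.2 s silences
# result (re-sorted by start): [Silence(5.0..6.2), Silence(15.0..17.0)]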

Adding Padding for Natural Cuts

Cutting segments at the exact timestamps can create abrupt starts and ends, so each segment's start is shifted earlier and its end shifted later by padding seconds (default 40 ms).

split_audio.py
# Note: This is simplified conceptual code
def build_segments(timeline: list[Silence | Content], boundaries: list[Silence], total: float, padding: float) -> list[Segment]:
    first, last = timeline[0], timeline[-1]
    content_start = first.end if isinstance(first, Silence) else first.start
    content_end = last.start if isinstance(last, Silence) else last.end
    pairs = list(zip(
        [content_start] + [b.end for b in boundaries],
        [b.start for b in boundaries] + [content_end],
        strict=True,
    ))
    for i, (start, end) in enumerate(pairs, 1):
        padded_start = max(0.0, start - padding)
        padded_end = min(total, end + padding)
        ...

content_start / content_end are determined automatically by examining the timeline's edges: if the timeline begins with Silence, then content_start = first.end; if it begins with Content (no silence at the beginning), then content_start = first.start = 0.0.
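
With hypothetical numbers — content running 1.0 s .. 15.0 s and boundary silences at 5.0–6.0 s and 10.0–11.0 s:

# pairs = [(1.0, 5.0), (6.0, 10.0), (11.0, 15.0)]
# with the default 0.04 s padding, segment 1 becomes 0.96 s .. 5.04 s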

Extraction Implementation

The actual extraction uses ffmpeg's -ss / -t options:

split_audio.py
command = [
    "ffmpeg",
    "-hide_banner",
    "-loglevel", "error",
    "-y",
    # Placing -ss before -i enables fast keyframe seeking
    # Placing it after increases precision but is slower as it decodes all frames from the beginning
    "-ss", f"{segment.start:.6f}",
    "-i", str(source),
    "-t", f"{segment.end - segment.start:.6f}",
    "-vn",  # Exclude video streams
]

Placing -ss before -i enables fast seeking by jumping to keyframes in the input file, significantly improving speed when extracting many segments. For audio files, the precision loss is hardly noticeable.
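
For comparison, the slower accurate-seek form would place -ss after -i (not what the script does; timestamps borrowed from segment 004 above):

ffmpeg -i input.mp3 -ss 12.695 -t 5.262 -vn output.mp3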

The output format is determined by the file extension: m4a uses AAC encoding (128 kbps), while mp3 uses MP3 encoding (96 kbps) with resampling to 44100 Hz.
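
The corresponding branch in write_segment:

split_audio.py (excerpt)
if extension == "m4a":
    command += ["-c:a", "aac", "-b:a", "128k"]
elif extension == "mp3":
    command += ["-af", "aresample=44100", "-c:a", "libmp3lame", "-b:a", "96k"]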

Testing

First, check the split result with --dry-run. To create 10 segments, we need 9 boundaries; of the 11 detected silences, the head and tail are excluded as boundary candidates, and the remaining 9 are selected:

python split_audio.py input.mp3 --target-count 10 --dry-run
detected 11 silences → merged to 11 → selected 9 boundaries → 10 segments
  silence: 0.000s .. 1.108s  (1.108s)
  001: 1.108s .. 2.066s  (0.958s)
  silence: 2.066s .. 3.215s  (1.149s)
  002: 3.215s .. 5.535s  (2.320s)
  silence: 5.535s .. 7.184s  (1.649s)
  003: 7.184s .. 10.921s  (3.737s)
  silence: 10.921s .. 12.695s  (1.774s)
  004: 12.695s .. 17.957s  (5.262s)
  silence: 17.957s .. 19.997s  (2.040s)
  005: 19.997s .. 24.857s  (4.860s)
  silence: 24.857s .. 26.890s  (2.033s)
  006: 26.890s .. 29.614s  (2.725s)
  silence: 29.614s .. 31.740s  (2.126s)
  007: 31.740s .. 35.275s  (3.535s)
  silence: 35.275s .. 37.479s  (2.204s)
  008: 37.479s .. 43.639s  (6.159s)
  silence: 43.639s .. 45.354s  (1.716s)
  009: 45.354s .. 48.156s  (2.802s)
  silence: 48.156s .. 49.957s  (1.801s)
  010: 49.957s .. 54.792s  (4.835s)
  silence: 54.792s .. 57.353s  (2.561s)

The display shows silences at the beginning/end and between segments. If the boundaries look good, proceed with actual extraction:

python split_audio.py input.mp3 --target-count 10 --output-dir out/
detected 11 silences → merged to 11 → selected 9 boundaries → 10 segments
wrote 10 files to out

This generates input-001.mp3 through input-010.mp3 and input-manifest.tsv in the output directory:

file	start	end	duration
input-001	1.108	2.066	0.958
input-002	3.215	5.535	2.320
input-003	7.184	10.921	3.737
input-004	12.695	17.957	5.262
input-005	19.997	24.857	4.860
input-006	26.890	29.614	2.725
input-007	31.740	35.275	3.535
input-008	37.479	43.639	6.159
input-009	45.354	48.156	2.802
input-010	49.957	54.792	4.835
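
The manifest is plain TSV, so it can be read back programmatically. A small sketch using the out/ directory from the example above:

import csv

with open("out/input-manifest.tsv", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        print(row["file"], row["start"], row["end"], row["duration"])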

If there's too much environmental noise causing false detections, lowering --noise-db (e.g., -45) will make only quieter sections count as silence.
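
For example, to re-check boundaries with a stricter threshold:

python split_audio.py input.mp3 --noise-db -45 --dry-run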

Conclusion

By combining ffmpeg's silencedetect with Python's subprocess, I created a simple audio splitting script.

I wasn't very familiar with ffmpeg, but bouncing ideas off Claude Code led to interesting discoveries about how ffmpeg can be used.

However, final parameter tuning still requires a human. Depending on the characteristics of the target file, adjustments to --noise-db and --min-silence may be necessary. You'll likely need to first check boundaries with --dry-run, then run the actual extraction, listen to the output, and adjust if it doesn't match your expectations.

Claude Code was very helpful in this process, with the AI doing the hands-on work while the human made specification decisions based on the requirements.

Additionally, combining with find + xargs commands enables batch processing of multiple files.
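
A minimal sketch, assuming the target .mp3 files sit in the current directory (-maxdepth 1 keeps find from re-splitting files already written to out/):

find . -maxdepth 1 -name '*.mp3' -print0 | xargs -0 -I{} python split_audio.py {} --output-dir out/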

I hope this blog helps someone.

Complete Source Code
split_audio.py
#!/usr/bin/env python3
"""CLI tool to segment audio files at silent intervals.

Dependencies: ffmpeg / ffprobe (must be in PATH)
"""

import argparse
import csv
import re
import subprocess
from dataclasses import dataclass
from pathlib import Path

SILENCE_RE = re.compile(r"silence_end:\s*([\d.]+)\s*\|\s*silence_duration:\s*([\d.]+)")

@dataclass(frozen=True)
class Silence:
    """Detected silence interval."""
    start: float
    end: float
    duration: float

@dataclass(frozen=True)
class Content:
    """Content interval between Silence and Silence."""
    start: float
    end: float

@dataclass(frozen=True)
class Segment:
    """Segment to be extracted (with padding applied)."""
    index: int
    start: float
    end: float

def probe_duration(path: Path) -> float:
    """Returns the duration of an audio file in seconds using ffprobe."""
    result = subprocess.run(
        [
            "ffprobe",
            "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            str(path),
        ],
        text=True,
        capture_output=True,
        check=True,
    )
    return float(result.stdout.strip())

def detect_silences(path: Path, noise_db: float, min_silence: float) -> list[Silence]:
    """Detects and returns silent intervals using ffmpeg silencedetect filter."""
    result = subprocess.run(
        [
            "ffmpeg",
            "-hide_banner",
            "-nostats",
            "-i", str(path),
            "-af", f"silencedetect=noise={noise_db}dB:d={min_silence}",
            "-f", "null", "-",
        ],
        text=True,
        capture_output=True,
        check=True,
    )
    # ffmpeg outputs filter logs to stderr, not stdout
    silences: list[Silence] = []
    for line in result.stderr.splitlines():
        m = SILENCE_RE.search(line)
        if m:
            end = float(m.group(1))
            duration = float(m.group(2))
            # Don't use silence_start line, calculate start from end - duration
            silences.append(Silence(start=end - duration, end=end, duration=duration))
    return silences

def merge_nearby(silences: list[Silence], merge_gap: float) -> list[Silence]:
    """Merges consecutive silent intervals that are within merge_gap seconds."""
    merged: list[Silence] = []
    for si in silences:
        if merged and si.start - merged[-1].end <= merge_gap:
            prev = merged[-1]
            new_end = max(prev.end, si.end)
            merged[-1] = Silence(start=prev.start, end=new_end, duration=new_end - prev.start)
        else:
            merged.append(si)
    return merged

def build_timeline(silences: list[Silence], total: float) -> list[Silence | Content]:
    """Builds a timeline with alternating Silence and Content from Silence list."""
    timeline: list[Silence | Content] = []
    pos = 0.0
    for si in silences:
        if si.start > pos:
            timeline.append(Content(start=pos, end=si.start))
        timeline.append(si)
        pos = si.end
    # Ignore tiny intervals at the end (detection error where ffmpeg's silence_end is slightly before total)
    if total - pos > 0.1:
        timeline.append(Content(start=pos, end=total))
    return timeline

def select_boundaries(timeline: list[Silence | Content], target_count: int | None, total: float) -> list[Silence]:
    """Selects boundary Silences from timeline. Excludes Silences at beginning and end."""
    # Use same 0.1s threshold as build_timeline uses for ignoring tiny intervals
    eps = 0.1
    inner = [b for b in timeline if isinstance(b, Silence) and b.start > eps and total - b.end > eps]
    if target_count is None:
        # When target_count is not specified, use all inner Silences as boundaries
        return inner
    needed = target_count - 1
    if needed < 0:
        raise ValueError(f"target-count must be >= 1, got {target_count}")
    if len(inner) < needed:
        raise ValueError(
            f"need {needed} boundaries for {target_count} segments, "
            f"but only {len(inner)} silences detected"
        )
    # Longer silences are more likely to be clear breaks, so select by duration in descending order
    selected = sorted(inner, key=lambda si: si.duration, reverse=True)[:needed]
    return sorted(selected, key=lambda si: si.start)

def build_segments(timeline: list[Silence | Content], boundaries: list[Silence], total: float, padding: float) -> list[Segment]:
    """Generates Segments from timeline and boundary Silences, adding padding seconds before and after."""
    first, last = timeline[0], timeline[-1]
    content_start = first.end if isinstance(first, Silence) else first.start
    content_end = last.start if isinstance(last, Silence) else last.end
    pairs = list(zip(
        [content_start] + [b.end for b in boundaries],
        [b.start for b in boundaries] + [content_end],
        strict=True,
    ))
    segments: list[Segment] = []
    for i, (start, end) in enumerate(pairs, 1):
        padded_start = max(0.0, start - padding)
        padded_end = min(total, end + padding)
        if padded_end <= padded_start:
            raise ValueError(f"invalid segment {i}: {start:.3f}..{end:.3f}")
        segments.append(Segment(index=i, start=padded_start, end=padded_end))
    return segments

def write_segment(
    segment: Segment,
    source: Path,
    output_dir: Path,
    prefix: str,
    extension: str,
) -> None:
    """Extracts segment using ffmpeg and saves it to a file."""
    output = output_dir / f"{prefix}{segment.index:03d}.{extension}"
    command = [
        "ffmpeg",
        "-hide_banner",
        "-loglevel", "error",
        "-y",
        # Placing -ss before -i enables fast keyframe seeking
        # Placing it after -i increases accuracy but is slower as it decodes all frames from the beginning
        "-ss", f"{segment.start:.6f}",
        "-i", str(source),
        "-t", f"{segment.end - segment.start:.6f}",
        "-vn",  # Exclude video streams
    ]
    if extension == "m4a":
        command += ["-c:a", "aac", "-b:a", "128k"]
    elif extension == "mp3":
        command += ["-af", "aresample=44100", "-c:a", "libmp3lame", "-b:a", "96k"]
    else:
        raise ValueError(f"unsupported extension: {extension}")
    command.append(str(output))
    subprocess.run(command, check=True)

def write_manifest(segments: list[Segment], output_dir: Path, prefix: str) -> None:
    """Outputs segment boundary information to a TSV file."""
    tsv_path = output_dir / f"{prefix.rstrip('-_')}-manifest.tsv"
    with tsv_path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["file", "start", "end", "duration"])
        for s in segments:
            writer.writerow(
                [
                    f"{prefix}{s.index:03d}",
                    f"{s.start:.3f}",
                    f"{s.end:.3f}",
                    f"{s.end - s.start:.3f}",
                ]
            )

def main() -> None:
    parser = argparse.ArgumentParser(description="Split audio files at silent intervals")
    parser.add_argument("input", help="Input audio file")
    parser.add_argument("--output-dir", default="out")
    parser.add_argument("--target-count", type=int, default=None)
    parser.add_argument("--prefix", default=None, help="Output file prefix (uses original filename if omitted)")
    parser.add_argument("--extension", choices=["mp3", "m4a"], default="mp3")
    parser.add_argument("--noise-db", type=float, default=-30.0)
    parser.add_argument("--min-silence", type=float, default=1.0)
    parser.add_argument("--merge-gap", type=float, default=0.1)
    parser.add_argument("--padding", type=float, default=0.04)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    source = Path(args.input)
    output_dir = Path(args.output_dir)
    prefix = args.prefix if args.prefix is not None else f"{source.stem}-"

    try:
        total = probe_duration(source)
        raw_silences = detect_silences(source, args.noise_db, args.min_silence)
        silences = merge_nearby(raw_silences, args.merge_gap)
        timeline = build_timeline(silences, total)
        boundaries = select_boundaries(timeline, args.target_count, total)
        segments = build_segments(timeline, boundaries, total, args.padding)
    except (ValueError, subprocess.CalledProcessError) as e:
        parser.exit(1, f"error: {e}\n")

    print(f"detected {len(raw_silences)} silences → merged to {len(silences)} → selected {len(boundaries)} boundaries → {len(segments)} segments")

    if args.dry_run:
        if isinstance(timeline[0], Silence):
            print(f"  silence: 0.000s .. {segments[0].start:.3f}s  ({segments[0].start:.3f}s)")
        for i, s in enumerate(segments):
            print(f"  {s.index:03d}: {s.start:.3f}s .. {s.end:.3f}s  ({s.end - s.start:.3f}s)")
            if i < len(segments) - 1:
                silence_start = s.end
                silence_end = segments[i + 1].start
                print(f"  silence: {silence_start:.3f}s .. {silence_end:.3f}s  ({silence_end - silence_start:.3f}s)")
        if isinstance(timeline[-1], Silence):
            print(f"  silence: {segments[-1].end:.3f}s .. {total:.3f}s  ({total - segments[-1].end:.3f}s)")
        return

    output_dir.mkdir(parents=True, exist_ok=True)
    write_manifest(segments, output_dir, prefix)
    for s in segments:
        write_segment(s, source, output_dir, prefix, args.extension)
    print(f"wrote {len(segments)} files to {output_dir}")

if __name__ == "__main__":
    main()
