
I created a Python script with Claude Code to automatically split audio files at silent intervals
Do you have a long audio file that you want to split into multiple files? I occasionally do.
I knew that ffmpeg could process audio, but I was intimidated by the multitude of options and couldn't get started. So I first asked Claude Code (AI coding agent) to "create a Python script that automatically splits audio files at silent sections."
Design Process with Claude Code
Determining How to Detect Silence
The first suggestion was ffmpeg's silencedetect filter. The basic framework was established: "By specifying parameters like silencedetect=noise=...dB:d=..., the start and end times of silent sections are output to stderr. These can be parsed after receiving them via Python's subprocess to identify split points." I initially got stuck on the stderr output, but Claude explained that this is how ffmpeg filter logs work by design.
Defining "Silence"
After establishing the framework, we needed to define "silence" concretely. This involves two axes: what dB level counts as silence (volume threshold), and how long the quiet stretch must last to count (minimum duration). Testing with actual audio files, we settled on around -30dB / 1.0 seconds as the baseline. However, we hit cases where what should have been a single break was detected as multiple short silences due to environmental noise or breathing. This led to the suggestion "it would be good to merge silence periods that are close together," which became the --merge-gap parameter.
Changing the Silence Selection Algorithm
As we progressed with the design, the requirement "I just want to split into N parts" became clear. Initially, we planned to use "all detected silences as boundaries," but to achieve exactly N parts, we needed to select "which silences to use as boundaries." Claude suggested an algorithm based on the idea that "longer silences are more likely to be clear breaks," which sorts silences by length in descending order and selects the top N-1. This seemed simple and intuitive, so we proceeded with this approach.
Refining the Details
Finally, we refined the details. To avoid unnaturally abrupt cuts when splitting segments at exact timestamps, we added padding of several tens of milliseconds before and after each segment. For ffmpeg's -ss option, there's a tradeoff between placing it before -i (fast keyframe seek) and after -i (precise but slower). Claude advised that "for audio files, quality degradation is negligible, and when extracting many segments the speed difference becomes significant, so fast seek is better," so we adopted the fast-seek approach.
In this article, I'll explain what the resulting script does. The full script text is included at the end of this blog.
What is ffmpeg?
ffmpeg is an open-source command-line tool for converting, cutting, encoding, and filtering audio and video files. It works on Linux, macOS, and Windows, and its wide format support and rich filters make it widely used inside video editing software and streaming tools.
However, it has so many options that it can be difficult to know where to start. So, I tried using it with Claude Code's support.
What I Created
I created a Python script that works as a CLI with the following input and output:
Input
- Audio files such as .mp3 or .m4a
Output
- Sequentially numbered segment files like original_filename-001.mp3, original_filename-002.mp3, ...
- original_filename-manifest.tsv recording segment boundary information
Here's how it looks when executed:
# Split at all silence periods (check boundaries only)
python split_audio.py input.mp3 --dry-run
detected 11 silences → merged to 11 → selected 9 boundaries → 10 segments
silence: 0.000s .. 1.108s (1.108s)
001: 1.108s .. 2.066s (0.958s)
silence: 2.066s .. 3.215s (1.149s)
002: 3.215s .. 5.535s (2.320s)
silence: 5.535s .. 7.184s (1.649s)
003: 7.184s .. 10.921s (3.737s)
silence: 10.921s .. 12.695s (1.774s)
004: 12.695s .. 17.957s (5.262s)
silence: 17.957s .. 19.997s (2.040s)
005: 19.997s .. 24.857s (4.860s)
silence: 24.857s .. 26.890s (2.033s)
006: 26.890s .. 29.614s (2.725s)
silence: 29.614s .. 31.740s (2.126s)
007: 31.740s .. 35.275s (3.535s)
silence: 35.275s .. 37.479s (2.204s)
008: 37.479s .. 43.639s (6.159s)
silence: 43.639s .. 45.354s (1.716s)
009: 45.354s .. 48.156s (2.802s)
silence: 48.156s .. 49.957s (1.801s)
010: 49.957s .. 54.792s (4.835s)
silence: 54.792s .. 57.353s (2.561s)
# Split into 10 parts and output to out/
python split_audio.py input.mp3 --target-count 10 --output-dir out/
If you omit --target-count, all detected silent sections are used as boundaries. If specified, silent sections are selected to create exactly N segments.
The following sections explain the overall process.
Mechanism: ffmpeg's silencedetect
silencedetect is an ffmpeg audio filter that detects "silence" as periods where the volume stays below a specified level for a certain duration.
ffmpeg -i input.mp3 -af "silencedetect=noise=-30dB:d=1.0" -f null -
- noise=-30dB — volume threshold below which audio is considered silence
- d=1.0 — minimum duration in seconds for a quiet stretch to count as silence
Increasing d will miss shorter breaks, while decreasing it will pick up more noise. The script uses 1.0 seconds as the default value.
Detection results are output to stderr:
[silencedetect @ 0x...] silence_start: 0
[silencedetect @ 0x...] silence_end: 1.148005 | silence_duration: 1.148005
[silencedetect @ 0x...] silence_start: 2.025556
[silencedetect @ 0x...] silence_end: 3.254853 | silence_duration: 1.229297
In the Python program, we call ffmpeg through subprocess, parse only the silence_end and silence_duration lines with regular expressions, and calculate the start time with start = end - duration.
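As a minimal sketch of this parsing step (the stderr excerpt below is fabricated for illustration; in the script, result.stderr from subprocess.run supplies these lines):

```python
import re

# Same pattern idea as the script: capture silence_end and silence_duration
SILENCE_RE = re.compile(r"silence_end:\s*([\d.]+)\s*\|\s*silence_duration:\s*([\d.]+)")

# Fabricated stderr excerpt in the format silencedetect emits
stderr = """\
[silencedetect @ 0x55d] silence_start: 2.025556
[silencedetect @ 0x55d] silence_end: 3.254853 | silence_duration: 1.229297
"""

silences = []
for line in stderr.splitlines():
    m = SILENCE_RE.search(line)
    if m:
        end, duration = float(m.group(1)), float(m.group(2))
        # start is derived (end - duration) rather than parsed from the silence_start line
        silences.append((end - duration, end, duration))

print(silences)  # one (start, end, duration) tuple
```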
- Reference: FFmpeg Filters Documentation — silencedetect
Design Points
Merging Nearby Silences
Breathing or environmental noise can cause what should be a single break to be detected as multiple short silences tens of milliseconds apart. Silences within --merge-gap seconds of each other are merged into one.
def merge_nearby(silences: list[Silence], merge_gap: float) -> list[Silence]:
    """Merge silence periods that are within merge_gap seconds of each other."""
    merged: list[Silence] = []
    for si in silences:
        if merged and si.start - merged[-1].end <= merge_gap:
            prev = merged[-1]
            new_end = max(prev.end, si.end)
            merged[-1] = Silence(start=prev.start, end=new_end, duration=new_end - prev.start)
        else:
            merged.append(si)
    return merged
Building a Timeline
From the merged silence list, we create a timeline that alternates [Silence, Content, Silence, Content, ..., Silence]. Content represents the sections between silences.
def build_timeline(silences: list[Silence], total: float) -> list[Silence | Content]:
    timeline: list[Silence | Content] = []
    pos = 0.0
    for si in silences:
        if si.start > pos:
            timeline.append(Content(start=pos, end=si.start))
        timeline.append(si)
        pos = si.end
    # Ignore tiny segments after the last silence
    # (detection errors where ffmpeg's silence_end is before total)
    if total - pos > 0.1:
        timeline.append(Content(start=pos, end=total))
    return timeline
Selecting Top Silences by Duration
select_boundaries considers only middle silences as boundary candidates, excluding those starting at the beginning (0.0 seconds) and ending at the end (total seconds). With --dry-run, head and tail silences are also displayed so you can see where content actually begins.
When --target-count N is specified, N-1 boundaries are needed to create N segments. Since longer silences are more likely to be clear breaks, the top N-1 by length are selected.
# Note: This is simplified conceptual code
def select_boundaries(timeline: list[Silence | Content], target_count: int | None, total: float) -> list[Silence]:
    # Exclude Silences starting at the beginning (0.0) and ending at the end (total)
    # Use the same 0.1s threshold as build_timeline uses to skip tiny end Content
    eps = 0.1
    inner = [b for b in timeline if isinstance(b, Silence) and b.start > eps and total - b.end > eps]
    if target_count is None:
        # When target_count is not specified, use all inner Silences as boundaries
        return inner
    needed = target_count - 1
    ...
    # Longer silences are more likely to be clear breaks, so select the top ones by length
    selected = sorted(inner, key=lambda si: si.duration, reverse=True)[:needed]
    return sorted(selected, key=lambda si: si.start)
Adding Padding for Natural Cuts
Cutting segments exactly can create abrupt starts and ends. Each segment's start is shifted earlier by padding seconds and its end is shifted later (default 40ms).
# Note: This is simplified conceptual code
def build_segments(timeline: list[Silence | Content], boundaries: list[Silence], total: float, padding: float) -> list[Segment]:
    first, last = timeline[0], timeline[-1]
    content_start = first.end if isinstance(first, Silence) else first.start
    content_end = last.start if isinstance(last, Silence) else last.end
    pairs = list(zip(
        [content_start] + [b.end for b in boundaries],
        [b.start for b in boundaries] + [content_end],
        strict=True,
    ))
    for i, (start, end) in enumerate(pairs, 1):
        padded_start = max(0.0, start - padding)
        padded_end = min(total, end + padding)
        ...
content_start / content_end are determined automatically by examining the timeline's edges. If the timeline begins with a Silence, then content_start = first.end; if it begins with Content (no silence at the start), then content_start = first.start, which is 0.0.
Extraction Implementation
The actual extraction uses ffmpeg's -ss / -t options:
command = [
    "ffmpeg",
    "-hide_banner",
    "-loglevel", "error",
    "-y",
    # Placing -ss before -i enables fast keyframe seeking
    # Placing it after increases precision but is slower as it decodes all frames from the beginning
    "-ss", f"{segment.start:.6f}",
    "-i", str(source),
    "-t", f"{segment.end - segment.start:.6f}",
    "-vn",  # Exclude video streams
]
Placing -ss before -i enables fast seeking by jumping to keyframes in the input file, significantly improving speed when extracting many segments. For audio files, the precision loss is hardly noticeable.
The output format is determined by the file extension. m4a uses AAC encoding (128k), while mp3 uses MP3 encoding (96k) + 44100Hz resampling.
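That codec choice can be sketched as a small helper that returns the encoding arguments appended to the ffmpeg command (the argument values mirror the ones the script uses):

```python
def codec_args(extension: str) -> list[str]:
    """Return ffmpeg encoding arguments for the output extension."""
    if extension == "m4a":
        return ["-c:a", "aac", "-b:a", "128k"]
    if extension == "mp3":
        # Resample to 44100 Hz before encoding with libmp3lame
        return ["-af", "aresample=44100", "-c:a", "libmp3lame", "-b:a", "96k"]
    raise ValueError(f"unsupported extension: {extension}")

print(codec_args("mp3"))
```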
- Reference: FFmpeg Documentation — Main options
Testing
First, check the split result with --dry-run. To create 10 segments, we need 9 boundaries, so the top 9 out of 11 silences are selected:
python split_audio.py input.mp3 --target-count 10 --dry-run
detected 11 silences → merged to 11 → selected 9 boundaries → 10 segments
silence: 0.000s .. 1.108s (1.108s)
001: 1.108s .. 2.066s (0.958s)
silence: 2.066s .. 3.215s (1.149s)
002: 3.215s .. 5.535s (2.320s)
silence: 5.535s .. 7.184s (1.649s)
003: 7.184s .. 10.921s (3.737s)
silence: 10.921s .. 12.695s (1.774s)
004: 12.695s .. 17.957s (5.262s)
silence: 17.957s .. 19.997s (2.040s)
005: 19.997s .. 24.857s (4.860s)
silence: 24.857s .. 26.890s (2.033s)
006: 26.890s .. 29.614s (2.725s)
silence: 29.614s .. 31.740s (2.126s)
007: 31.740s .. 35.275s (3.535s)
silence: 35.275s .. 37.479s (2.204s)
008: 37.479s .. 43.639s (6.159s)
silence: 43.639s .. 45.354s (1.716s)
009: 45.354s .. 48.156s (2.802s)
silence: 48.156s .. 49.957s (1.801s)
010: 49.957s .. 54.792s (4.835s)
silence: 54.792s .. 57.353s (2.561s)
The display shows silences at the beginning/end and between segments. If the boundaries look good, proceed with actual extraction:
python split_audio.py input.mp3 --target-count 10 --output-dir out/
detected 11 silences → merged to 11 → selected 9 boundaries → 10 segments
wrote 10 files to out
This generates input-001.mp3 through input-010.mp3 and input-manifest.tsv in the output directory:
file start end duration
input-001 1.108 2.066 0.958
input-002 3.215 5.535 2.320
input-003 7.184 10.921 3.737
input-004 12.695 17.957 5.262
input-005 19.997 24.857 4.860
input-006 26.890 29.614 2.725
input-007 31.740 35.275 3.535
input-008 37.479 43.639 6.159
input-009 45.354 48.156 2.802
input-010 49.957 54.792 4.835
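Since the manifest is plain TSV, it's easy to consume programmatically. A minimal sketch using two rows from the output above (inlined here instead of read from a file):

```python
import csv
import io

# Two rows copied from the manifest above (tab-separated)
manifest = (
    "file\tstart\tend\tduration\n"
    "input-001\t1.108\t2.066\t0.958\n"
    "input-002\t3.215\t5.535\t2.320\n"
)

rows = list(csv.DictReader(io.StringIO(manifest), delimiter="\t"))
for row in rows:
    # Each duration should equal end - start (to millisecond precision)
    assert abs(float(row["end"]) - float(row["start"]) - float(row["duration"])) < 1e-3

print(len(rows))  # 2
```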
If there's too much environmental noise causing false detections, lowering --noise-db (e.g., -45) will make only quieter sections count as silence.
Conclusion
By combining ffmpeg's silencedetect with Python's subprocess, I created a simple audio splitting script.
I wasn't very familiar with ffmpeg, but bouncing ideas off Claude Code led to interesting discoveries about how ffmpeg can be used.
However, final parameter adjustments still require human intervention. Depending on the characteristics of the target file, adjustments to --noise-db and --min-silence may be necessary. You'll likely need to first check boundaries with --dry-run, then run the actual generation, listen to the output, and modify if it doesn't match your expectations.
Claude Code was very helpful in this process, with the AI doing the actual work while humans make specification decisions based on requirements.
Additionally, combining with find + xargs commands enables batch processing of multiple files.
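One way to sketch that batch processing (hypothetical filenames; echo is left in as a dry-run placeholder so you can inspect the generated commands, then drop it to actually run the script):

```shell
# Batch-split every .mp3 in the current directory.
# Remove `echo` to execute instead of just printing the commands.
find . -maxdepth 1 -name '*.mp3' -print0 \
  | xargs -0 -I{} echo python split_audio.py {} --output-dir out/
```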
I hope this blog helps someone.
Complete Source Code
#!/usr/bin/env python3
"""CLI tool to segment audio files at silent intervals.

Dependencies: ffmpeg / ffprobe (must be in PATH)
"""
import argparse
import csv
import re
import subprocess
from dataclasses import dataclass
from pathlib import Path

SILENCE_RE = re.compile(r"silence_end:\s*([\d.]+)\s*\|\s*silence_duration:\s*([\d.]+)")


@dataclass(frozen=True)
class Silence:
    """Detected silence interval."""
    start: float
    end: float
    duration: float


@dataclass(frozen=True)
class Content:
    """Content interval between two Silences."""
    start: float
    end: float


@dataclass(frozen=True)
class Segment:
    """Segment to be extracted (with padding applied)."""
    index: int
    start: float
    end: float


def probe_duration(path: Path) -> float:
    """Returns the duration of an audio file in seconds using ffprobe."""
    result = subprocess.run(
        [
            "ffprobe",
            "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            str(path),
        ],
        text=True,
        capture_output=True,
        check=True,
    )
    return float(result.stdout.strip())


def detect_silences(path: Path, noise_db: float, min_silence: float) -> list[Silence]:
    """Detects and returns silent intervals using ffmpeg silencedetect filter."""
    result = subprocess.run(
        [
            "ffmpeg",
            "-hide_banner",
            "-nostats",
            "-i", str(path),
            "-af", f"silencedetect=noise={noise_db}dB:d={min_silence}",
            "-f", "null", "-",
        ],
        text=True,
        capture_output=True,
        check=True,
    )
    # ffmpeg outputs filter logs to stderr, not stdout
    silences: list[Silence] = []
    for line in result.stderr.splitlines():
        m = SILENCE_RE.search(line)
        if m:
            end = float(m.group(1))
            duration = float(m.group(2))
            # Don't use silence_start line, calculate start from end - duration
            silences.append(Silence(start=end - duration, end=end, duration=duration))
    return silences


def merge_nearby(silences: list[Silence], merge_gap: float) -> list[Silence]:
    """Merges consecutive silent intervals that are within merge_gap seconds."""
    merged: list[Silence] = []
    for si in silences:
        if merged and si.start - merged[-1].end <= merge_gap:
            prev = merged[-1]
            new_end = max(prev.end, si.end)
            merged[-1] = Silence(start=prev.start, end=new_end, duration=new_end - prev.start)
        else:
            merged.append(si)
    return merged


def build_timeline(silences: list[Silence], total: float) -> list[Silence | Content]:
    """Builds a timeline with alternating Silence and Content from Silence list."""
    timeline: list[Silence | Content] = []
    pos = 0.0
    for si in silences:
        if si.start > pos:
            timeline.append(Content(start=pos, end=si.start))
        timeline.append(si)
        pos = si.end
    # Ignore tiny intervals at the end
    # (detection error where ffmpeg's silence_end is slightly before total)
    if total - pos > 0.1:
        timeline.append(Content(start=pos, end=total))
    return timeline


def select_boundaries(timeline: list[Silence | Content], target_count: int | None, total: float) -> list[Silence]:
    """Selects boundary Silences from timeline. Excludes Silences at beginning and end."""
    # Use same 0.1s threshold as build_timeline uses for ignoring tiny intervals
    eps = 0.1
    inner = [b for b in timeline if isinstance(b, Silence) and b.start > eps and total - b.end > eps]
    if target_count is None:
        # When target_count is not specified, use all inner Silences as boundaries
        return inner
    needed = target_count - 1
    if needed < 0:
        raise ValueError(f"target-count must be >= 1, got {target_count}")
    if len(inner) < needed:
        raise ValueError(
            f"need {needed} boundaries for {target_count} segments, "
            f"but only {len(inner)} silences detected"
        )
    # Longer silences are more likely to be clear breaks, so select by duration in descending order
    selected = sorted(inner, key=lambda si: si.duration, reverse=True)[:needed]
    return sorted(selected, key=lambda si: si.start)


def build_segments(timeline: list[Silence | Content], boundaries: list[Silence], total: float, padding: float) -> list[Segment]:
    """Generates Segments from timeline and boundary Silences, adding padding seconds before and after."""
    first, last = timeline[0], timeline[-1]
    content_start = first.end if isinstance(first, Silence) else first.start
    content_end = last.start if isinstance(last, Silence) else last.end
    pairs = list(zip(
        [content_start] + [b.end for b in boundaries],
        [b.start for b in boundaries] + [content_end],
        strict=True,
    ))
    segments: list[Segment] = []
    for i, (start, end) in enumerate(pairs, 1):
        padded_start = max(0.0, start - padding)
        padded_end = min(total, end + padding)
        if padded_end <= padded_start:
            raise ValueError(f"invalid segment {i}: {start:.3f}..{end:.3f}")
        segments.append(Segment(index=i, start=padded_start, end=padded_end))
    return segments


def write_segment(
    segment: Segment,
    source: Path,
    output_dir: Path,
    prefix: str,
    extension: str,
) -> None:
    """Extracts segment using ffmpeg and saves it to a file."""
    output = output_dir / f"{prefix}{segment.index:03d}.{extension}"
    command = [
        "ffmpeg",
        "-hide_banner",
        "-loglevel", "error",
        "-y",
        # Placing -ss before -i enables fast keyframe seeking
        # Placing it after -i increases accuracy but is slower as it decodes all frames from the beginning
        "-ss", f"{segment.start:.6f}",
        "-i", str(source),
        "-t", f"{segment.end - segment.start:.6f}",
        "-vn",  # Exclude video streams
    ]
    if extension == "m4a":
        command += ["-c:a", "aac", "-b:a", "128k"]
    elif extension == "mp3":
        command += ["-af", "aresample=44100", "-c:a", "libmp3lame", "-b:a", "96k"]
    else:
        raise ValueError(f"unsupported extension: {extension}")
    command.append(str(output))
    subprocess.run(command, check=True)


def write_manifest(segments: list[Segment], output_dir: Path, prefix: str) -> None:
    """Outputs segment boundary information to a TSV file."""
    tsv_path = output_dir / f"{prefix.rstrip('-_')}-manifest.tsv"
    with tsv_path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["file", "start", "end", "duration"])
        for s in segments:
            writer.writerow(
                [
                    f"{prefix}{s.index:03d}",
                    f"{s.start:.3f}",
                    f"{s.end:.3f}",
                    f"{s.end - s.start:.3f}",
                ]
            )


def main() -> None:
    parser = argparse.ArgumentParser(description="Split audio files at silent intervals")
    parser.add_argument("input", help="Input audio file")
    parser.add_argument("--output-dir", default="out")
    parser.add_argument("--target-count", type=int, default=None)
    parser.add_argument("--prefix", default=None, help="Output file prefix (uses original filename if omitted)")
    parser.add_argument("--extension", choices=["mp3", "m4a"], default="mp3")
    parser.add_argument("--noise-db", type=float, default=-30.0)
    parser.add_argument("--min-silence", type=float, default=1.0)
    parser.add_argument("--merge-gap", type=float, default=0.1)
    parser.add_argument("--padding", type=float, default=0.04)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    source = Path(args.input)
    output_dir = Path(args.output_dir)
    prefix = args.prefix if args.prefix is not None else f"{source.stem}-"

    try:
        total = probe_duration(source)
        raw_silences = detect_silences(source, args.noise_db, args.min_silence)
        silences = merge_nearby(raw_silences, args.merge_gap)
        timeline = build_timeline(silences, total)
        boundaries = select_boundaries(timeline, args.target_count, total)
        segments = build_segments(timeline, boundaries, total, args.padding)
    except (ValueError, subprocess.CalledProcessError) as e:
        parser.exit(1, f"error: {e}\n")

    print(f"detected {len(raw_silences)} silences → merged to {len(silences)} → selected {len(boundaries)} boundaries → {len(segments)} segments")

    if args.dry_run:
        if isinstance(timeline[0], Silence):
            print(f" silence: 0.000s .. {segments[0].start:.3f}s ({segments[0].start:.3f}s)")
        for i, s in enumerate(segments):
            print(f" {s.index:03d}: {s.start:.3f}s .. {s.end:.3f}s ({s.end - s.start:.3f}s)")
            if i < len(segments) - 1:
                silence_start = s.end
                silence_end = segments[i + 1].start
                print(f" silence: {silence_start:.3f}s .. {silence_end:.3f}s ({silence_end - silence_start:.3f}s)")
        if isinstance(timeline[-1], Silence):
            print(f" silence: {segments[-1].end:.3f}s .. {total:.3f}s ({total - segments[-1].end:.3f}s)")
        return

    output_dir.mkdir(parents=True, exist_ok=True)
    write_manifest(segments, output_dir, prefix)
    for s in segments:
        write_segment(s, source, output_dir, prefix, args.extension)
    print(f"wrote {len(segments)} files to {output_dir}")


if __name__ == "__main__":
    main()