Overview
A production-grade pipeline combining Whisper for transcription with Pyannote for speaker diarization, processing 1,000+ hours of audio at an 80% cost reduction versus manual transcription.
The Problem
Manually transcribing multi-speaker meetings is time-consuming (4 hours of work per 1 hour of audio), is expensive ($1.50/min), and requires specialized skills. Automated solutions often struggle with overlapping speech and speaker identification.
The Solution
Developed an end-to-end pipeline that first performs speaker diarization to identify 'who spoke when', then transcribes each speaker segment separately using Whisper. Post-processing aligns timestamps and formats output as structured JSON or SRT subtitles.
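As a minimal sketch of the structured JSON output described above (the field names here are illustrative assumptions, not the production schema), each speaker turn becomes one entry:

```python
import json

# One entry per speaker turn; field names are illustrative
transcript = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.2,
     "text": "Good morning, everyone."},
    {"speaker": "SPEAKER_01", "start": 4.5, "end": 7.1,
     "text": "Morning. Shall we start with the agenda?"},
]

print(json.dumps(transcript, indent=2))
```

Keeping start/end times per turn is what makes the archive searchable and lets the same data be rendered as either JSON or subtitles.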
Project Gallery
Technical Architecture
Four-stage pipeline: audio preprocessing, speaker diarization, speech-to-text, and alignment
Audio Preprocessing
Noise reduction, normalization, and VAD (Voice Activity Detection)
Pyannote Diarization
Identifies speaker segments and assigns speaker IDs
Whisper Transcription
Transcribes each speaker segment with timestamps
Alignment & Formatting
Merges diarization and transcription, formats output
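The VAD step in the preprocessing stage can be illustrated with a toy energy-threshold detector. Production pipelines typically use a trained VAD model, so the frame length and threshold below are assumptions for illustration only:

```python
import math

def frame_energies(samples, frame_len=400):
    """RMS energy per non-overlapping frame (400 samples = 25 ms at 16 kHz)."""
    energies = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energies.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return energies

def detect_speech(samples, frame_len=400, threshold=0.02):
    """Mark each frame as speech (True) or silence (False) by its energy."""
    return [e >= threshold for e in frame_energies(samples, frame_len)]

# 0.5 s of silence followed by 0.5 s of a loud tone, at 16 kHz
silence = [0.0] * 8000
tone = [0.3 * math.sin(2 * math.pi * 220 * n / 16000) for n in range(8000)]
flags = detect_speech(silence + tone)
```

Dropping silent frames before diarization and transcription cuts compute and reduces spurious speaker turns.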
Methodology
- Audio preprocessing: Noise reduction with noisereduce, normalization to -20 dB
- Diarization: Pyannote 3.0 with custom speaker embedding model
- Transcription: Whisper-large-v3 with language detection
- Post-processing: Speaker label alignment, punctuation restoration
- Quality assurance: Confidence scoring and manual review flagging
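The normalization step above can be sketched as RMS normalization to a target level, assuming "-20 dB" means -20 dBFS of RMS level on a float waveform in [-1, 1] (the noise-reduction step via noisereduce is omitted here):

```python
import math

def normalize_to_dbfs(samples, target_dbfs=-20.0):
    """Scale a float waveform (values in [-1, 1]) to a target RMS level in dBFS."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return list(samples)  # silent input: nothing to scale
    target_rms = 10 ** (target_dbfs / 20)  # -20 dBFS -> RMS of 0.1
    gain = target_rms / rms
    return [s * gain for s in samples]

# Quiet 440 Hz sine at 16 kHz, boosted to -20 dBFS
audio = [0.01 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
normalized = normalize_to_dbfs(audio)
```

Normalizing to a consistent level keeps both the VAD thresholds and the downstream models operating on comparable inputs across recordings.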
Results & Impact
Key Impact
- Processed 1000+ hours of meeting audio
- Reduced transcription costs from $90K to $18K annually
- Enabled searchable meeting archives
- Improved accessibility with automated captions
- Deployed for 5+ enterprise clients
Challenges & Solutions
Overlapping Speech
Implemented overlap detection and separate transcription of overlapped segments
Accent Variability
Fine-tuned Whisper on client-specific accent data
Speaker Confusion
Added speaker verification using voice embeddings
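The speaker-verification fix can be illustrated with a cosine-similarity check between voice embeddings. The threshold and toy vectors below are assumptions; real embeddings are high-dimensional outputs of an embedding model:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def same_speaker(emb_a, emb_b, threshold=0.75):
    """Accept a speaker label only if the embeddings are similar enough."""
    return cosine_similarity(emb_a, emb_b) >= threshold

# Toy 4-dim embeddings; a real model produces much higher-dimensional ones
alice_ref = [0.9, 0.1, 0.3, 0.2]
alice_new = [0.85, 0.15, 0.32, 0.18]
bob_new = [0.1, 0.9, 0.2, 0.7]
```

Verifying each diarized segment against a reference embedding catches cases where the diarizer swaps or merges speaker labels mid-meeting.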
Key Implementation
Transcription Pipeline
import whisper
from pyannote.audio import Pipeline


class TranscriptionPipeline:
    def __init__(self):
        self.diarization = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.0"
        )
        self.whisper = whisper.load_model("large-v3")

    def process(self, audio_path):
        # Step 1: Speaker diarization ("who spoke when")
        diarization = self.diarization(audio_path)

        # Step 2: Extract per-speaker audio segments
        segments = []
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            segments.append({
                'start': turn.start,
                'end': turn.end,
                'speaker': speaker,
                'audio': self.extract_segment(audio_path, turn),
            })

        # Step 3: Transcribe each segment with Whisper
        for segment in segments:
            result = self.whisper.transcribe(segment['audio'])
            segment['text'] = result['text']
            # Whisper returns per-segment avg_logprob rather than a single
            # confidence value; average it as a confidence proxy
            logprobs = [s['avg_logprob'] for s in result['segments']]
            segment['confidence'] = (
                sum(logprobs) / len(logprobs) if logprobs else None
            )

        # Step 4: Merge speaker labels and text into the final transcript
        return self.format_transcript(segments)
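A minimal sketch of what the formatting step could look like for the SRT output mentioned earlier. The segment fields match those built in process, but this helper itself is an illustrative assumption, not the production implementation:

```python
def to_srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def format_srt(segments):
    """Render diarized segments as SRT cues, prefixed with the speaker label."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}"
        )
    return "\n\n".join(blocks)

example = [{'start': 0.0, 'end': 2.5, 'speaker': 'SPEAKER_00', 'text': 'Hello.'}]
```

Embedding the speaker label in the cue text is what turns plain captions into the accessible, speaker-attributed transcripts listed under impact.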