The Content Localization Blueprint is a comprehensive solution for translating and dubbing audio and video content using AI services. It uses a microservices architecture that orchestrates three main AI services:
- Speech-to-Speech (S2S): Translates audio from one language to another with voice preservation
- Speaker Detection (ASD NIM): Identifies who needs to be dubbed in the video
- LipSync: Synchronizes lip movements with translated audio
The blueprint supports multiple client types to accommodate different deployment scenarios and use-case requirements.
Note: To get access to the LipSync feature of the Content Localization Blueprint, please request to join our NVIDIA AI for Media Private Access Program.
The diagram below shows the controller-centric end-to-end architecture used to orchestrate S2S, ASD, and LipSync workflows.
- You may not directly or indirectly use this Content Localization Blueprint to alter the name, likeness, image, or voice of any person in violation of applicable law or regulation or without the person’s express consent.
- The Content-Localization Blueprint is shared as reference and is provided "as is". Security in the production environment is the responsibility of the end users deploying it. When deploying in a production environment, have security experts review any potential risks and threats; define the trust boundaries; implement logging and monitoring capabilities; secure the communication channels; integrate AuthN & AuthZ with appropriate access controls; keep the deployment up to date; and ensure the containers/source code are secure and free of known vulnerabilities.
GOVERNING TERMS: The blueprint software is governed by the Apache License 2.0, and enables use of separate open source and proprietary software, models and services governed by their respective licenses, including those below.
- Active Speaker Detection NIM
- LipSync NIM
- RIVA ASR NIM
- RIVA Magpie-TTS-Zeroshot
- Eleven Labs API service
- Camb.ai service
Sample Assets: Use of the assets is governed by the NVIDIA Sample Data License.
Link to relevant licenses:
- ThirdPartyLicenses.md — Third-party open-source software licenses
- NIMLICENSES.md — NVIDIA NIM container licenses (LipSync, ASD, RIVA ASR, RIVA TTS)
Please see CONTRIBUTING.md for contribution guidelines.
- Code of Conduct — Standards for community participation
- Security — Reporting security vulnerabilities
- Prerequisites
- Development Environment Setup
- Running the Services
- Client Applications
- Configuration
- Development
- Testing
- Documentation
- Troubleshooting
- Utility Scripts
- License
- Additional Resources
Before setting up the development environment, ensure you have the following installed:
- Operating System: Linux (Ubuntu 22.04 or 24.04 recommended)
- Python: 3.12 or higher
- Git: With Git LFS enabled
- NVIDIA GPU: With CUDA-capable drivers installed
- CUDA Toolkit: CUDA 12.x
- TensorRT: Compatible with your CUDA version
- Docker: Docker Engine 24.x or higher with Docker Compose and NVIDIA Docker runtime.
- Node.js/npm: Node 24 (Optional, only for UI development and pre-commit hooks)
```bash
# Verify NVIDIA driver installation
nvidia-smi

# Check CUDA version
nvcc --version
```

Note: If `nvcc` is not found, ensure CUDA is properly installed and added to your `PATH` and `LD_LIBRARY_PATH` (typically `/usr/local/cuda/bin` and `/usr/local/cuda/lib64`, respectively).
Note: This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
These packages are required:

```bash
sudo apt-get update
sudo apt-get install -y \
    curl \
    wget \
    git-lfs
```

You need the following credentials set as environment variables (or in a `.env` file).
Get your NGC API keys from: https://ngc.nvidia.com/setup/api-key
NIM container keys — `docker-compose.yml` maps these component-specific keys to `NGC_API_KEY` inside each NIM container. Set the keys for the services you intend to run:

| Variable | Used by | When required |
|---|---|---|
| `LIPSYNC_API_KEY` | LipSync NIM container | When using LipSync |
| `ASD_API_KEY` | ASD NIM container | When using Active Speaker Detection |
| `AST_API_KEY` | RIVA ASR NIM container | When using the RIVA backend |
| `TTS_API_KEY` | RIVA TTS NIM container | When using the RIVA backend |
Third-party API keys — required only when using the corresponding S2S backend:
| Variable | When required |
|---|---|
| `ELEVENLABS_API_KEY` | When `S2S_SERVICE=EL_DUBBING` |
| `CAMB_API_KEY` | When using CAMB.AI dubbing scripts |
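To avoid a failed startup, you can sanity-check that the keys for your chosen profile are set before launching. A minimal sketch (variable names from the tables above; adjust the list to the services you actually run):

```python
import os

# Keys to check; pick the subset that matches your compose profile
# (names from the tables above).
required = ["LIPSYNC_API_KEY", "ASD_API_KEY", "ELEVENLABS_API_KEY"]

missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All required API keys are set.")
```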
Follow these steps to set up your local development environment:
Create a .env file in the project root with your credentials:
```bash
# .env file
# NIM container keys (mapped to NGC_API_KEY inside each container by docker-compose.yml)
LIPSYNC_API_KEY=your_ngc_api_key_here
ASD_API_KEY=your_ngc_api_key_here
AST_API_KEY=your_ngc_api_key_here  # RIVA backend only
TTS_API_KEY=your_ngc_api_key_here  # RIVA backend only

# Third-party S2S backend keys
ELEVENLABS_API_KEY=your_11labs_api_key_here
# CAMB_API_KEY=your_camb_api_key_here

PYTHONPATH=path-to-root-of-repository
```

Install uv, the fast Python package manager:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Add uv to your PATH (or restart your shell):

```bash
export PATH="$HOME/.local/bin:$PATH"
```

Create and activate a Python 3.12 virtual environment:
```bash
uv venv --python 3.12
source .venv/bin/activate
```

Install the project dependencies and commonly used extras:
```bash
# Install core dependencies with non-GPU extras (test, lint, docs)
uv pip install -r pyproject.toml --extra test --extra lint --extra docs
```

Install GPU extras only on hosts with CUDA Toolkit headers available (`cuda.h`):
```bash
# Optional: install GPU extras (requires CUDA Toolkit development headers)
uv pip install -r pyproject.toml --extra gpu
```

Generate Python code from the gRPC/protobuf definitions:
```bash
# Download gRPC health check proto
wget -O protos/health.proto https://raw.githubusercontent.com/grpc/grpc/master/src/proto/grpc/health/v1/health.proto

# Generate Python protobuf files
bash ./protos/generate_protos.sh
```

Install linting and pre-commit hooks:
```bash
# Install development tools
uv tool install pre-commit
uv tool install ruff

# Set up pre-commit hooks
pre-commit install
```

To update pre-commit hook versions (optional):

```bash
pre-commit autoupdate
```

To run pre-commit on all files manually:

```bash
pre-commit run --all-files
```

Add the project root, src, client, and generated protobuf files to your PYTHONPATH:
```bash
export PYTHONPATH="${PYTHONPATH}:${PWD}:${PWD}/src:${PWD}/client:${PWD}/protos/generated"
```

Add this line to your shell profile (.bashrc or .zshrc) to make it permanent:

```bash
echo 'export PYTHONPATH="${PYTHONPATH}:'"${PWD}"':'"${PWD}"'/src:'"${PWD}"'/client:'"${PWD}"'/protos/generated"' >> ~/.bashrc
```

Create directories for test outputs and builds:
```bash
mkdir -p build
mkdir -p outputs
chmod -R 777 .
```

For first-time deployment, use the deploy scripts to verify each service individually. This approach downloads models and verifies that each service starts correctly before deploying the full stack.
Deploy the RIVA ASR service with the Canary model:
```bash
./scripts/deploy_asr_canary.sh
```

This will:
- Download the Canary 1B ASR model to `volumes/models/ast-canary/`
- Start the RIVA ASR container on ports 8003 (HTTP) and 50053 (gRPC)
- Verify the service is running correctly
Note: Model download may take several minutes depending on your internet connection. Press Ctrl+C to stop the service once verified.
Deploy the zero-shot TTS service:
```bash
./scripts/deploy_tts_zeroshot.sh
```

This requires the `NGC_API_KEY` environment variable and will create the necessary cache directories automatically.

This will download the Magpie Zero-Shot model to `volumes/models/tts-zeroshot/`.
Deploy the LipSync service:
```bash
./scripts/deploy_lipsync.sh
```

This will:
- Download the LipSync models to `volumes/models/lipsync/`
- Start the LipSync container on ports 8000 (HTTP) and 8001 (gRPC)
- Verify the service is running correctly

Note: This requires the `NGC_API_KEY` environment variable. Press Ctrl+C to stop the service once verified.
- Each deploy script runs in interactive mode (`-it`) and will occupy your terminal
- Run each script in a separate terminal, or stop it (Ctrl+C) before running the next
- These scripts are for verification only - use docker compose for production deployments
If all steps completed successfully, you're ready to run the full service stack!
To verify your setup:
- All deploy scripts should have started successfully without errors
- Model files should be present in `volumes/models/` subdirectories
- TensorRT engines should be built in `volumes/models/asd/`
You can now proceed to Running the Services to launch the full stack with docker compose.
Start the full Content Localization stack:
```bash
# Default profile: S2S (ElevenLabs/CambAI) + ASD + LipSync + Controller + Demo App
docker compose --profile demo-app-third-party-s2s \
  --env-file configs/elevenlabs.env \
  --env-file .env \
  up --build
```

Profiles provide different service combinations for various use cases. Use the `--profile` flag to select which services to run:
| Profile | S2S | ASR (RIVA) | TTS (RIVA) | ASD | LipSync | Controller | Demo App | Description |
|---|---|---|---|---|---|---|---|---|
| `default` | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | All services (for testing) |
| `third-party-s2s` | ✓ | - | - | - | - | - | - | S2S only with ElevenLabs/CambAI |
| `riva` | ✓ | ✓ | ✓ | - | - | - | - | S2S with RIVA ASR/TTS |
| `lipsync` | - | - | - | - | ✓ | - | - | LipSync only |
| `third-party-s2s-lipsync` | ✓ | - | - | - | ✓ | - | - | S2S (ElevenLabs/CambAI) + LipSync |
| `riva-lipsync` | ✓ | ✓ | ✓ | - | ✓ | - | - | S2S (RIVA) + LipSync |
| `third-party-s2s-asd-lipsync` | ✓ | - | - | ✓ | ✓ | - | - | Full pipeline with ElevenLabs/CambAI |
| `riva-asd-lipsync` | ✓ | ✓ | ✓ | ✓ | ✓ | - | - | Full pipeline with RIVA |
| `asd` | - | - | - | ✓ | - | - | - | Active Speaker Detection only |
| `controller-third-party-s2s` | ✓ | - | - | ✓ | ✓ | ✓ | - | Orchestrated pipeline (ElevenLabs/CambAI) |
| `controller-riva` | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - | Orchestrated pipeline (RIVA) |
| `demo-app` | ✓ | - | - | - | - | - | ✓ | S2S + Web Demo App |
| `demo-app-third-party-s2s` | ✓ | - | - | ✓ | ✓ | ✓ | ✓ | Full stack with Web Demo (ElevenLabs/CambAI) |
| `demo-app-riva` | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Full stack with Web Demo (RIVA) |
```bash
# ElevenLabs with full pipeline and demo app
docker compose --profile demo-app-third-party-s2s \
  --env-file configs/elevenlabs.env \
  --env-file .env \
  up --build

# RIVA with full pipeline and demo app
docker compose --profile demo-app-riva \
  --env-file configs/riva.env \
  --env-file .env \
  up --build

# CambAI S2S + ASD + LipSync (no controller, no demo)
docker compose --profile third-party-s2s-asd-lipsync \
  --env-file configs/elevenlabs.env \
  --env-file .env \
  up --build

# RIVA S2S + ASD + LipSync (no controller, no demo)
docker compose --profile riva-asd-lipsync \
  --env-file configs/riva.env \
  --env-file .env \
  up --build

# Controller orchestration with ElevenLabs (similar for CambAI)
docker compose --profile controller-third-party-s2s \
  --env-file configs/elevenlabs.env \
  --env-file .env \
  up --build

# Controller orchestration with RIVA
docker compose --profile controller-riva \
  --env-file configs/riva.env \
  --env-file .env \
  up --build

# Basic S2S with ElevenLabs only (similar for CambAI)
docker compose --profile third-party-s2s \
  --env-file configs/elevenlabs.env \
  --env-file .env \
  up --build

# Basic S2S with RIVA ASR/TTS
docker compose --profile riva \
  --env-file configs/riva.env \
  --env-file .env \
  up --build
```

- For Development/Testing: Use `demo-app-third-party-s2s` or `demo-app-riva` for the full stack with web interface
- For Production with ElevenLabs/CambAI: Use `controller-third-party-s2s` for orchestrated processing
- For Production with RIVA: Use `controller-riva` for orchestrated processing
- For Service Testing: Use individual profiles like `third-party-s2s`, `riva`, `lipsync`, or `asd`
```bash
# Stop all services
docker compose down

# Stop and remove volumes (clean state)
docker compose down -v
```

View logs in real-time as services are running:
```bash
# View logs from all services (follow mode)
docker compose logs -f

# View logs from specific service
docker compose logs -f s2s
docker compose logs -f asd
docker compose logs -f controller
docker compose logs -f lipsync

# View logs from multiple services
docker compose logs -f s2s controller

# View last 100 lines of logs
docker compose logs --tail=100
```

For debugging or sharing, use the log copy script to save logs to local files:
```bash
# Copy logs from all services to ./logs/ directory
./scripts/copy_docker_logs.sh

# Copy logs from a specific service only
./scripts/copy_docker_logs.sh s2s
./scripts/copy_docker_logs.sh ast
./scripts/copy_docker_logs.sh tts
./scripts/copy_docker_logs.sh lipsync
./scripts/copy_docker_logs.sh asd
./scripts/copy_docker_logs.sh controller

# View help
./scripts/copy_docker_logs.sh --help
```

This creates log files in `./logs/`:

- `./logs/s2s.log` - Speech-to-Speech service logs
- `./logs/ast.log` - ASR (RIVA) service logs
- `./logs/tts.log` - TTS (RIVA) service logs
- `./logs/lipsync.log` - LipSync service logs
- `./logs/asd.log` - Active Speaker Detection logs
- `./logs/controller.log` - Controller orchestration logs
Benefits of copying logs:
- Persist logs even after containers are stopped
- Easy to share with team members for debugging
- Can be archived or uploaded to issue trackers
- Includes line counts and helpful status messages
This section describes the various client implementations for interacting with the Content-Localization services. Each client is designed for specific use cases and can be used independently or together.
| Client | File | Services | Use Case | Key Features | Usage |
|---|---|---|---|---|---|
| Controller | `client/controller/app.py` | Controller (S2S + ASD + LipSync orchestration) | Streamlined end-to-end content localization with single service communication | • Audio and video input processing<br>• Optional diarization input for speaker-aware dubbing<br>• Optional background audio pass-through to LipSync<br>• Pre-translated audio to bypass S2S (`--translated-audio`)<br>• Complete pipeline orchestration through Controller service<br>• Single gRPC streaming connection for entire workflow<br>• Simplified error handling and monitoring | `python client/controller/app.py [options]` |
| Direct | `client/direct/app.py` | S2S + ASD + LipSync (direct communication) | Full control over service pipeline with direct service coordination | • Direct communication with each service<br>• Custom pipeline orchestration (S2S → LipSync → ASD)<br>• Pre-translated audio to bypass S2S (`--translated-audio`)<br>• Complete control over service interactions; easy to test and debug services individually and in combination | `python client/direct/app.py [options]` |
| S2S | `client/s2s/app.py` | S2S only | Audio translation between languages | • Audio input/output processing<br>• Real-time streaming with configurable chunks<br>• Built-in latency analysis and performance monitoring<br>• Support for WAV (RIVA) and MP3 (ElevenLabs/CambAI) formats; ensure the output file name matches the desired format | `python client/s2s/app.py [options]` |
| LipSync | `client/lipsync/app.py` | LipSync only | Synchronize lip movements with audio | • Video and audio input processing<br>• Speaker info support from file<br>• Optional background audio mixing<br>• Multiple output formats and encoding options<br>• Streaming support with performance optimization | `python client/lipsync/app.py [options]` |
| ASD | `client/asd/app.py` | ASD only | Detect active speakers in video | • Video processing for speaker detection<br>• Multi-format diarization input (flat, RIVA, ElevenLabs/CambAI) with auto-detection<br>• Speaker info generation in CSV format<br>• Real-time speaker detection<br>• Configurable chunk sizes for optimal performance | `python client/asd/app.py [options]` |
| Demo Web App | `client/demos` | Controller (full pipeline) | User-friendly interface for testing and demonstrations | • Web-based UI for file upload and processing<br>• Real-time progress monitoring<br>• Output preview and download<br>• Accessible via browser at http://localhost:3000 | • Service is spun up from docker compose<br>• Open browser to http://localhost:3000 for the UI |
| Use Case | Recommended Client | Description |
|---|---|---|
| Complete Pipeline (Simplified) | `controller/` | Single service orchestration with minimal configuration. Ideal for production workflows. |
| Complete Pipeline (Full Control) | `direct/` | Direct service communication with complete pipeline control. Best for development and detailed monitoring. |
| Audio Translation Only | `s2s/` | Just speech-to-speech translation with performance analysis. |
| Lip Sync Only | `lipsync/` | Just lip synchronization with advanced encoding options. |
| Speaker Detection Only | `asd/` | Just active speaker detection and speaker info generation. |
| Web Interface | `demos/` | Browser-based interface for demonstrations. |
Purpose: Shared audio and video processing utilities
- Files:
  - `audio.py` - Audio source and sink simulators
  - `video.py` - Video source and sink simulators
  - `file.py` - Generic file source simulator for non-WAV formats (e.g., MP3)
  - `base.py` - Base classes for file simulators
- Use Case: Used by all clients for standardized file I/O operations
- Features:
- WAV/MP3 audio processing
- MP4 video processing with streaming support
- Chunk-based processing for large files (see the sketch after this list)
- Format validation and error handling
- Latency analysis and performance monitoring
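The chunk-based pattern these simulators rely on can be illustrated with a minimal sketch. This is not the project's `audio.py`; the function name, chunk size, and file path are illustrative:

```python
import wave

def iter_wav_chunks(path: str, chunk_secs: float = 1.0):
    """Yield raw PCM chunks from a WAV file, chunk_secs of audio at a time."""
    with wave.open(path, "rb") as wav:
        frames_per_chunk = int(wav.getframerate() * chunk_secs)
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:  # end of file
                break
            yield data

# Stream a file in 0.5 s chunks, as a client might when feeding a gRPC stream
for chunk in iter_wav_chunks("assets/sample.wav", chunk_secs=0.5):
    print(f"read {len(chunk)} bytes")
```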
The following diagrams illustrate the different client architectures.
Mermaid sources are available in docs/source/uml_mermaid/.
Direct Client Architecture
Source: docs/source/uml_mermaid/direct_client_architecture.mmd
Benefits:
- Full control over pipeline
- Custom orchestration
- Development flexibility
- Service-specific tuning
S2S Client Architecture
Source: docs/source/uml_mermaid/s2s_client_architecture.mmd
Benefits:
- Audio translation only
- Real-time streaming with configurable chunks
- Built-in latency analysis and performance monitoring
- Support for WAV and MP3 formats (RIVA and ElevenLabs/CambAI)
LipSync Client Architecture
Source: docs/source/uml_mermaid/lipsync_client_architecture.mmd
Benefits:
- Video and audio input processing
- Speaker info support from file or ASD NIM
- Multiple output formats and encoding options
- Streaming support with performance optimization
ASD Client Architecture
Source: docs/source/uml_mermaid/asd_client_architecture.mmd
Benefits:
- Video processing for speaker detection
- Speaker info generation in CSV format
- Real-time speaker detection
- Configurable chunk sizes for optimal performance
Source: docs/source/uml_mermaid/client_service_flow.mmd
Detailed sequence diagram showing:
- Health checks for all services
- Controller service orchestration
- Audio processing pipeline (Controller → S2S → RIVA ASR/TTS)
- Video processing pipeline (Controller → ASD → GPU/CPU fallback)
- LipSync processing coordination
- Error handling and fallback mechanisms
Source: docs/source/uml_mermaid/ui_demo_app_architecture.mmd
Benefits:
- User-friendly web interface for content localization workflows
- Upload and preview audio/video content with localization results
- Real-time output streaming to UI
📖 Complete Documentation: See Demo Web Application Guide for detailed setup, configuration, and usage instructions.
Before running any client:

1. Activate Virtual Environment:

   ```bash
   source .venv/bin/activate
   ```

2. Ensure Services are Running:
   - Controller Service (default: `localhost:50056`) - for controller client
   - S2S Service (default: `localhost:50050`) - for direct client and S2S client
   - LipSync Service (default: `localhost:50054`) - for direct client and LipSync client
   - ASD NIM Service (default: `localhost:50055`) - for direct client and ASD client

3. Required Input Files:
   - Audio files (WAV/MP3 format)
   - Video files (MP4 format, streamable preferred)
   - Speaker info files (CSV format, optional for LipSync)
```bash
# Controller client (orchestrated pipeline): start the stack, then run the client
docker compose --profile controller-third-party-s2s --env-file configs/elevenlabs.env --env-file .env up --build
python client/controller/app.py

# Direct client (direct service communication): start the services, then run the client
docker compose --profile third-party-s2s-asd-lipsync --env-file configs/elevenlabs.env --env-file .env up --build
python client/direct/app.py

# S2S only
python client/s2s/app.py

# LipSync only
python client/lipsync/app.py

# ASD only
python client/asd/app.py
```

Each client supports various command-line options for customization:
- Server endpoints: Configure service addresses and ports
- Input/output paths: Specify file locations
- Chunk sizes: Adjust streaming parameters for optimal performance
- Audio/video formats: Configure processing options
- Performance settings: Tune for latency vs. throughput
Run `python client/[client_name]/app.py --help` for detailed options.
Clients generate various output files:
- Audio: Translated audio (WAV format)
- Video: Lip-synchronized video (MP4 format)
- Speaker Info: Per-frame speaker metadata (CSV format); see the sketch after this list
- Logs: Processing logs and diagnostics
- Performance: Latency analysis plots and metrics
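For a quick look at the per-frame speaker info, something like the following works. The column name used here (`speaker_id`) is hypothetical; check the header of a generated file for the actual names:

```python
import csv
from collections import Counter

# Hypothetical sketch: count how many frames each speaker is active in.
# The "speaker_id" column name is an assumption; inspect your CSV header first.
with open("outputs/speaker_info.csv", newline="") as f:
    reader = csv.DictReader(f)
    counts = Counter(row.get("speaker_id", "unknown") for row in reader)

for speaker, frames in counts.most_common():
    print(f"{speaker}: {frames} frames")
```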
- Module Import Errors: Ensure virtual environment is activated
- Service Connection Errors: Verify services are running and accessible
- File Format Issues: Check input file formats and convert if necessary
- Memory Issues: Reduce chunk sizes for large files
- Network Issues: Check network connectivity and firewall settings
- Controller Client: Check Controller service health and configuration
- Direct Client: Verify all three services (S2S, LipSync, ASD) are running
- S2S Client: Check audio format and S2S service logs
- LipSync Client: Verify video format and LipSync service configuration
- ASD Client: Check video format and ASD NIM logs
```bash
# Process multiple files with controller client
for audio_file in audio/*.wav; do
  for video_file in video/*.mp4; do
    python client/controller/app.py \
      --input-audio "$audio_file" \
      --input-mp4 "$video_file" \
      --output-mp4 "outputs/$(basename "$audio_file" .wav)_$(basename "$video_file" .mp4)_output.mp4"
  done
done
```

```bash
# Test different chunk sizes with S2S client
for chunk_size in 0.5 1.0 2.0 5.0; do
  python client/s2s/latency_analysis.py \
    --chunk-size-audio-secs "$chunk_size" \
    --output-plot "s2s_latency_${chunk_size}s.png"
done
```

```bash
# Check all services are running
python -c "
import grpc
services = [
    ('localhost:50050', 'S2S'),
    ('localhost:50054', 'LipSync'),
    ('localhost:50055', 'ASD'),
    ('localhost:50056', 'Controller')
]
for addr, name in services:
    try:
        channel = grpc.insecure_channel(addr)
        grpc.channel_ready_future(channel).result(timeout=1)
        print(f'✓ {name} service is running on {addr}')
    except Exception:
        print(f'✗ {name} service is not accessible on {addr}')
"
```

The controller service supports various configuration options:
Basic Configuration:

- `CONTROLLER_GRPC_API_PORT`: gRPC service port (default: `50056`)
- `CONTROLLER_MAX_CONCURRENCY`: Maximum concurrent requests (default: `1`)
- `CONTROLLER_LOG_LEVEL`: Logging level (default: `INFO`)

Service Endpoints:

- `S2S_SERVER`: Speech-to-Speech service endpoint
- `ASD_SERVER`: Active Speaker Detection service endpoint (optional)
- `LIPSYNC_SERVER`: LipSync service endpoint

Controller Processing:

- ASD bypass is a per-request option available via `bypass_asd=True` in `ContentLocalizationConfig` (LipSync then uses internal face detection); see the sketch after this list
- `CONTROLLER_INTERMEDIATE_AUDIO_FORMAT`: S2S output format used by the controller (MP3/WAV)
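A minimal sketch of setting the ASD bypass from a client. Only `ContentLocalizationConfig` and `bypass_asd` are named in this README; the generated module path and the request wiring are assumptions:

```python
# Hypothetical sketch: the module path below is an assumption; regenerate
# protobuf code with protos/generate_protos.sh first.
from protos.generated import controller_pb2

config = controller_pb2.ContentLocalizationConfig(
    bypass_asd=True,  # skip ASD; LipSync uses its internal face detection
)
# The config is then sent as part of the controller's gRPC streaming request.
```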
Debug Configuration:

- `CONTROLLER_DEBUG_PORT`: VS Code debug port (default: `5678`)
- `CONTROLLER_VS_CODE_DEBUG`: Enable VS Code debugging (default: `0`)

Enabling Profiling and Metric Tracker:

- `CONTROLLER_PROFILER`: Enable profiling framework (default: `0`)
- `CONTROLLER_PROFILER_TYPE`: Select profiler type, yappi or cprofiler (default: `cprofiler`)
- `CONTROLLER_METRIC_TRACKER`: Enable metric tracker (default: `0`)
S2S Service:

- `S2S_GRPC_API_PORT`: gRPC service port (default: `50050`)
- `S2S_LOG_LEVEL`: Logging level (default: `INFO`)
ASD Service:

- `ASD_GRPC_API_PORT`: gRPC service port (default: `50055`)
- `ASD_LOG_LEVEL`: Logging level (default: `INFO`)
- `ASD_MODEL_PATH`: Path to ASD TensorRT models
All timeout values are in seconds. See the full reference in the Sphinx Configuration docs.
Shared (all services):

- `HEALTH_CHECK_TIMEOUT`: HTTP and gRPC health-check timeout (default: `5.0`)
- `BUFFER_POLL_TIMEOUT`: Buffer iterator poll cadence (default: `0.1`); see the sketch below
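As an illustration of what the poll cadence controls, a generic buffer iterator might look like this minimal sketch (not the project's implementation; the queue type and end-of-stream sentinel are illustrative):

```python
import queue

def iter_buffer(buf: queue.Queue, poll_timeout: float = 0.1):
    """Drain items from a buffer, waking every poll_timeout seconds so the
    loop can notice new data (or, in a real service, a shutdown signal)."""
    while True:
        try:
            item = buf.get(timeout=poll_timeout)
        except queue.Empty:
            continue  # nothing yet; poll again
        if item is None:  # illustrative end-of-stream sentinel
            return
        yield item
```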
Controller:

- `CONTROLLER_CONFIG_POLL_TIMEOUT`: Wait for per-request config messages (default: `5.0`)
- `CONTROLLER_CLEANUP_TIMEOUT`: Thread cleanup timeout (default: `10.0`)
S2S:

- `S2S_CLEANUP_TIMEOUT`: Sub-pipeline thread cleanup timeout (default: `1.0`)
- `S2S_EL_DUBBING_POLL_INTERVAL`: ElevenLabs dubbing status poll interval (default: `10`)
- `S2S_EL_DUBBING_MAX_ATTEMPTS`: Max dubbing poll attempts (default: `120`)
- `S2S_EL_KEEPALIVE_INTERVAL`: Keepalive ping interval during dubbing (default: `1`)
Configuration files are located in the `configs/` directory:

- `configs/elevenlabs.env`: ElevenLabs S2S configuration
- `configs/camb.env`: CambAI S2S configuration
- `configs/riva.env`: RIVA S2S configuration
```
.
├── client/ # Client applications
│ ├── asd/ # Active Speaker Detection client (app.py, args.py, config.py)
│ ├── controller/ # Controller orchestration client (app.py, args.py, config.py)
│ ├── demos/ # Web demo application
│ ├── direct/ # Direct processing client (app.py, args.py)
│ ├── lipsync/ # LipSync client (app.py, args.py, config.py)
│ └── s2s/ # Speech-to-Speech client (app.py, args.py, config.py)
├── configs/ # Service configuration files
├── dockerfiles/ # Dockerfiles for each service
├── docs/ # Sphinx documentation
├── protos/ # gRPC/Protobuf definitions
├── scripts/ # Utility and standalone scripts
│ ├── deploy_*.sh # Service deployment scripts
│ ├── el_diarize.py # ElevenLabs diarization generation
│ ├── riva_parakeet_diarize.py # RIVA Parakeet diarization generation
│ ├── el_s2s_infer.py # ElevenLabs standalone dubbing
│ └── camb_s2s_infer.py # CAMB standalone dubbing
├── src/ # Service implementation
│ ├── common/ # Shared utilities
│ ├── controller_service/ # Controller service code
│ ├── docker_entrypoints/ # Docker container entrypoints
│ ├── profiler/ # Profiling and metrics tracking
│ └── s2s_service/ # S2S service code
├── tests/ # Unit and integration tests
└── volumes/                     # Persistent data (models, cache, outputs)
```
gRPC protobuf definitions are in the `protos/` folder. To regenerate:

```bash
cd protos
bash generate_protos.sh
cd ..
```

The project uses ruff for code formatting and linting:
```bash
# Check formatting
ruff format --check src/ tests/ client/

# Auto-format code
ruff format src/ tests/ client/

# Run linter
ruff check src/ tests/ client/

# Auto-fix linting issues
ruff check --fix src/ tests/ client/
```

Pre-commit hooks run automatically on each commit:
```bash
# Run manually on all files
pre-commit run --all-files

# Skip hooks (not recommended)
git commit --no-verify
```

```bash
# Run all tests
pytest

# Run with coverage report
pytest --cov=src --cov-report=html --cov-report=term-missing

# Run specific test file
pytest tests/test_s2s_service.py

# Run tests matching pattern
pytest -k "test_asd"
```

Coverage reports are generated in `build/coverage/`:
```bash
# Generate HTML coverage report
pytest --cov-report html:build/coverage

# Open coverage report
open build/coverage/index.html      # macOS
xdg-open build/coverage/index.html  # Linux
```

Comprehensive end-to-end functional tests are available to validate the complete pipeline with all clients and services. These tests run actual clients with sample inputs and verify outputs.
Test Coverage:
- Controller Client (orchestrated pipeline)
- Direct Client (direct service communication)
- S2S Client (audio translation with latency analysis)
- LipSync Client (lip synchronization)
- ASD Client (active speaker detection)
Quick Start:
```bash
# Run all functional tests
python -m pytest functional_tests/ -v

# Run specific client tests
python -m pytest functional_tests/test_controller_client.py -v
python -m pytest functional_tests/test_s2s_client.py -v
```

Prerequisites:
- All services running (S2S, ASD, LipSync, Controller)
- Sample input files in `assets/`
- Python environment with dependencies
For detailed functional testing documentation, test configuration, and troubleshooting, see `functional_tests/README.md`.
Comprehensive Sphinx-based documentation is available covering architecture, service modes, client types, and API references.
Build Documentation:
```bash
# Build HTML documentation
bash docs/build_docs.sh

# or
cd docs
make html
cd ..
```

View Generated Documentation:
```bash
open build/html/index.html      # macOS
xdg-open build/html/index.html  # Linux
```

Output Locations:

- HTML: `build/docs/html/index.html`
- PDF: `build/docs/pdf/index.pdf`
- EPUB: `build/docs/epub/index.epub`
For documentation structure and maintenance guidelines, see `docs/README.md`.
Mermaid diagrams for system and client architecture are available in:
docs/source/uml_mermaid/
This directory includes diagrams for:
- System architecture overview
- Client architectures (Controller, Direct, Individual)
- Service mode comparisons
- Client-service flow diagrams
- Demo app workflow
Issue: `nvidia-smi` not found

```bash
# Install NVIDIA drivers
sudo ubuntu-drivers autoinstall
sudo reboot
```

Issue: Docker permission denied

```bash
# Add user to docker group
sudo usermod -aG docker $USER
newgrp docker
```

Issue: RIVA models fail to download
- Verify your `NGC_API_KEY` is correct
- Check network connectivity to NVIDIA NGC
- Try pulling the container/model using the deploy scripts `scripts/deploy_asr_canary.sh` or `scripts/deploy_tts_zeroshot.sh`
Issue: TensorRT engine build fails
- Verify CUDA and TensorRT versions match
- Ensure sufficient disk space (>10GB free)
- Check GPU compute capability compatibility
The scripts/ directory contains various utility scripts to help with development, deployment, and testing:
| Script | Purpose | Usage |
|---|---|---|
| `deploy_asr_canary.sh` | Deploy RIVA ASR with Canary model | `./scripts/deploy_asr_canary.sh` |
| `deploy_asr_parakeet.sh` | Deploy RIVA ASR with Parakeet model | `./scripts/deploy_asr_parakeet.sh` |
| `deploy_tts_multilingual.sh` | Deploy multilingual TTS service | `./scripts/deploy_tts_multilingual.sh` |
| `deploy_tts_zeroshot.sh` | Deploy zero-shot TTS service | `./scripts/deploy_tts_zeroshot.sh` |
| `deploy_lipsync.sh` | Deploy LipSync service | `./scripts/deploy_lipsync.sh` |
| `deploy_asd.sh` | Deploy ASD NIM service | `./scripts/deploy_asd.sh` |
These scripts download models and start individual services for verification before full deployment.
| Script | Purpose | Usage |
|---|---|---|
| `setup_env.sh` | Setup complete development environment | `./scripts/setup_env.sh` |
| `copy_docker_logs.sh` | Copy Docker container logs to files | `./scripts/copy_docker_logs.sh [service]` |
| Script | Purpose | Usage |
|---|---|---|
| `convert_to_streamable_mp4.sh` | Convert videos to streamable MP4 format | `./scripts/convert_to_streamable_mp4.sh input.mp4` |
| `extract_audio_from_videos.sh` | Extract audio from video files | `./scripts/extract_audio_from_videos.sh <input_dir> <output_dir> [sample_rate] [channels] [format]` |
Converts video files to MP4 format suitable for streaming with the `faststart` flag.
Usage:
```bash
./scripts/convert_to_streamable_mp4.sh input.mp4
# Output: input-fs.mp4
```

Features:

- Automatically installs ffmpeg if not present
- Copies video/audio streams without re-encoding
- Adds the `faststart` flag for progressive download
- Supports various input formats (avi, mkv, mp4, etc.)
Batch extract audio from all video files in a directory.
Usage:
```bash
# Basic usage with defaults (16kHz, mono, WAV)
./scripts/extract_audio_from_videos.sh videos/ audio/

# Custom parameters (44.1kHz, stereo, MP3)
./scripts/extract_audio_from_videos.sh videos/ audio/ 44100 2 mp3
```

Arguments:

- `input_dir` - Directory containing video files (required)
- `output_dir` - Directory to save audio files (required)
- `sample_rate` - Sample rate in Hz (default: 16000)
- `channels` - Audio channels: 1=mono, 2=stereo (default: 1)
- `format` - Output format: wav, mp3, flac (default: wav)
Features:
- Processes multiple video formats (mp4, avi, mkv, mov, webm)
- Configurable sample rate and channels
- Progress tracking and error reporting
- Creates output directory if needed
Complete automated setup of the development environment. This script:
- Installs system packages (build tools, ffmpeg, etc.)
- Installs Python 3.12
- Installs `uv` package manager
- Creates virtual environment and installs dependencies from `pyproject.toml`
- Generates gRPC/protobuf Python code
- Optionally installs dev tools (pre-commit, ruff) with `--dev`
- Optionally installs Docker and NVIDIA GPU drivers
Usage:
```bash
./scripts/setup_env.sh [--no-docker] [--no-gpu] [--dev] [--docs]
```

Options:

- `--no-docker` — Skip Docker and NVIDIA Container Toolkit installation
- `--no-gpu` — Skip NVIDIA GPU driver and CUDA toolkit installation
- `--dev` — Install development dependencies (lint, pre-commit)
- `--docs` — Install documentation build dependencies
Requirements:
- Internet connection for package downloads
- Ubuntu 22.04 or 24.04 recommended
Copy logs from Docker containers to local files for debugging and sharing.
Usage:
```bash
# Copy all service logs
./scripts/copy_docker_logs.sh

# Copy specific service logs
./scripts/copy_docker_logs.sh s2s
./scripts/copy_docker_logs.sh controller
```

Output: Logs saved to the `./logs/` directory with filenames like `s2s.log`, `controller.log`, etc.
These scripts generate diarization data (speaker segmentation) from audio files, producing JSON files that can be passed to the ASD client or Controller client via `--diarization-file`.
| Script | Purpose | Usage |
|---|---|---|
| `el_diarize.py` | Generate diarization using ElevenLabs Scribe STT | `ELEVENLABS_API_KEY=<key> python scripts/el_diarize.py --input-file audio.wav` |
| `riva_parakeet_diarize.py` | Generate diarization using RIVA Parakeet ASR NIM | `python scripts/riva_parakeet_diarize.py --input-file audio.wav --server localhost:50053` |
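The two scripts emit different native JSON formats. A format-agnostic way to sanity-check a generated file before passing it via `--diarization-file`:

```python
import json

# Peek at a generated diarization file without assuming its schema.
with open("diarization.json") as f:
    data = json.load(f)

if isinstance(data, dict):
    print("top-level keys:", sorted(data.keys()))
else:
    print(f"top-level: list of {len(data)} items")
```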
Generate diarization data using the ElevenLabs Speech-to-Text (Scribe) API. Outputs native ElevenLabs STT JSON format.
Usage:
```bash
ELEVENLABS_API_KEY=<key> python scripts/el_diarize.py \
  --input-file audio.wav \
  --output-file diarization.json
```

Arguments:

- `--input-file` - Path to audio file (WAV, MP3, etc.) (required)
- `--output-file` - Path to output JSON file (default: `diarization.json`)
- `--language-code` - Language code (default: auto-detect)
- `--max-speakers` - Maximum number of speakers (default: model default)
- `--model-id` - Scribe model ID (default: `scribe_v2`)

Requirements:

- `ELEVENLABS_API_KEY` environment variable
Generate diarization data using RIVA Parakeet ASR NIM. Outputs native RIVA offline_recognize JSON format.
Usage:
```bash
python scripts/riva_parakeet_diarize.py \
  --input-file audio.wav \
  --output-file diarization.json \
  --server localhost:50053
```

Arguments:

- `--input-file` - Path to audio file (WAV) (required)
- `--output-file` - Path to output JSON file (default: `diarization.json`)
- `--server` - RIVA ASR server address (default: `localhost:50053`)
- `--language-code` - Language code (default: `en-US`)
- `--max-speakers` - Maximum number of speakers (default: `4`)

Requirements:

- Running RIVA Parakeet ASR NIM (deploy with `./scripts/deploy_asr_parakeet.sh`)
These scripts perform end-to-end dubbing outside of the gRPC service pipeline, using cloud dubbing APIs directly.
| Script | Purpose | Usage |
|---|---|---|
| `el_s2s_infer.py` | ElevenLabs end-to-end dubbing | `ELEVENLABS_API_KEY=<key> python scripts/el_s2s_infer.py --input-file video.mp4 --source-language-code en --target-language-code es -o output.wav` |
| `camb_s2s_infer.py` | CAMB end-to-end dubbing (URL-based) | `CAMB_API_KEY=<key> python scripts/camb_s2s_infer.py --input-url <url> --source-language 1 --target-language 54 -o output.mp3` |
| `invoke_11labs_e2e.sh` | Wrapper for ElevenLabs E2E dubbing | `./scripts/invoke_11labs_e2e.sh` |
| `invoke_camb_e2e.sh` | Wrapper for CAMB E2E dubbing | `./scripts/invoke_camb_e2e.sh` |
Invoke ElevenLabs end-to-end dubbing for local media files. Extracts audio from video, submits a dubbing request, and downloads the translated audio.
Usage:
```bash
ELEVENLABS_API_KEY=<key> python scripts/el_s2s_infer.py \
  --input-file video.mp4 \
  --source-language-code en \
  --target-language-code es \
  --output-file output.wav
```

Requirements:

- `ELEVENLABS_API_KEY` environment variable
- `ffmpeg` installed (for video-to-audio extraction)
Invoke CAMB end-to-end dubbing for URL-based media. Submits a dubbing request, polls for completion, and downloads the translated audio.
CAMB.AI uses integer language IDs (e.g. 1 = English, 54 = Spanish). To get the full mapping, query the CambAI API or see the source languages and target languages docs.
Usage:
```bash
CAMB_API_KEY=<key> python scripts/camb_s2s_infer.py \
  --input-url "https://example.com/media.mp3" \
  --source-language 1 \
  --target-language 54 \
  --output-file output.mp3
```

Requirements:

- `CAMB_API_KEY` environment variable
Deploy the Active Speaker Detection (ASD) NIM container for standalone testing.
Usage:
```bash
./scripts/deploy_asd.sh
```

Features:

- Deploys ASD NIM container with GPU support
- Configures ports: HTTP (`ASD_NIM_HTTP_API_PORT`, default 8005) and gRPC (`ASD_GRPC_API_PORT`, default 50055)
- Mounts model cache at `volumes/models/asd/`
- Requires the `ASD_API_KEY` environment variable
The controller service includes built-in profiling and metrics tracking capabilities to analyze performance bottlenecks and monitor system behavior.
Profile controller service execution to identify performance bottlenecks and optimize code paths.
Step 1: Enable Profiling and Metrics
```bash
# Add to .env file or export
export CONTROLLER_PROFILER=1
export CONTROLLER_METRIC_TRACKER=1
```

Step 2: Run Service

```bash
docker compose --env-file .env --env-file configs/elevenlabs.env --profile controller-third-party-s2s up --build
```

Step 3: Send Request

In a new terminal window, activate the virtual environment as described in the Prerequisites section, then run the controller client to send a request:

```bash
python3 client/controller/app.py
```

The following file structure will be generated:
```
volumes/profiler/
├── YYYY-MM-DD_HH-MM-SS/infer_<uuid>/
│   ├── profile_overall.prof   # For SnakeViz
│   ├── profile_thread_N.prof  # Per-thread (yappi only)
│   └── profile_trace.json     # For Chrome Tracing
├── raw_data_<timestamp>/
│   ├── lipsync_request.csv    # Per-metric timestamps
│   └── ...
└── metrics_<timestamp>        # Aggregated statistics
```
Step 4: Visualize Results
Profiling (SnakeViz or Chrome):

```bash
# Option 1: SnakeViz (hierarchical view)
snakeviz volumes/profiler/YYYY-MM-DD_HH-MM-SS/infer_<uuid>/profile_overall.prof
```

Your default browser will open with an interactive visualization of the pstats data.

```bash
# Option 2: Chrome Tracing (timeline view)
# 1. Open chrome://tracing
# 2. Load volumes/profiler/YYYY-MM-DD_HH-MM-SS/infer_<uuid>/profile_trace.json
```

Metrics (`plot_metrics.py`):

```bash
# Generate plots for all metrics
python3 client/utilities/plot_metrics.py volumes/profiler/raw_data_2025-10-29_09-23-02/ -o outputs/metrics_plots/
```

This will generate `outputs/metrics_plots/<metric_name>.png` (per-metric timeline) and `outputs/metrics_plots/metric_comparison.png` (a combined timeline of all events).
For detailed profiling documentation, advanced configuration, and troubleshooting, see the Profiling Guide.
