Disentangled Latent Representations for Audio Style Transfer

Neural audio style transfer between piano and violin using a GAN-based encoder-decoder architecture built on disentangled latent representations of the input in the complex domain.

Overview

This project implements a deep learning system that transfers style characteristics between different musical instruments while preserving the musical content. The system combines STFT and CQT representations (using both real and imaginary parts), two transformer encoders (one for style, one for content), an autoregressive decoder, and a discriminator for adversarial training, complemented by contrastive and reconstruction losses.

For further details, read the full paper: Distentangled Latent Representations for Audio Style Transfer.pdf

Key Features

  • Operating in the full complex domain: many audio models discard phase information, which is essential for faithful reconstruction (see the sketch after this list)
  • Different encoders for different roles: dedicated content and style encoders, with constraints and losses that enforce the separation
  • Transformer-based Architecture: uses self-attention mechanisms for temporal modeling
  • Multi-scale Losses: combines time-domain, frequency-domain, and perceptual losses (reconstruction, contrastive, adversarial, disentanglement)
  • Dynamic Sequence Handling: supports variable-length audio sequences
  • Curriculum Learning: progressive training strategy for stable convergence
  • Memory Efficient: optimized for training with limited GPU memory
  • Easy to Scale to more instruments: the architecture is designed to make adding new instrument classes straightforward
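As a rough illustration of the complex-domain input, the sketch below stacks the real and imaginary parts of the STFT and CQT into channel dimensions. This is a minimal, assumed version of the preprocessing (the actual parameters live in utilityFunctions.py); n_fft=1024 is an assumption that happens to yield the 513 frequency bins mentioned in the statistics files.

```python
import numpy as np
import torch
import librosa

def complex_stft_features(wav: np.ndarray, n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    """Stack real and imaginary STFT parts into a [2, freq, time] tensor (keeps phase)."""
    spec = librosa.stft(wav, n_fft=n_fft, hop_length=hop)          # complex-valued [513, time]
    return torch.from_numpy(np.stack([spec.real, spec.imag])).float()

def complex_cqt_features(wav: np.ndarray, sr: int = 22050, n_bins: int = 84) -> torch.Tensor:
    """Same idea for the CQT: real/imag channels instead of magnitude only."""
    cqt = librosa.cqt(wav, sr=sr, n_bins=n_bins)                   # complex-valued [84, time]
    return torch.from_numpy(np.stack([cqt.real, cqt.imag])).float()
```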

Architecture

The system consists of four main components:

  1. Style Encoder: Extracts instrument-specific style embeddings
  2. Content Encoder: Captures musical content independent of instrument
  3. Autoregressive Decoder: Reconstructs STFT with transferred style
  4. Discriminator: Ensures proper disentanglement between style and content
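The sketch below shows, in hedged form, how the four components could interact in one training step; the module call signatures here are assumptions, the real interfaces are defined in style_encoder.py, content_encoder.py, new_decoder.py and discriminator.py.

```python
import torch
from torch import nn

def training_step(x_src: torch.Tensor, x_tgt: torch.Tensor,
                  style_encoder: nn.Module, content_encoder: nn.Module,
                  decoder: nn.Module, discriminator: nn.Module):
    """Hypothetical wiring of the four components (signatures are assumptions)."""
    class_src, style_src = style_encoder(x_src)   # class-level and per-frame style embeddings
    content_src = content_encoder(x_src)          # instrument-agnostic sequence embedding

    x_rec = decoder(content_src, style_src)       # reconstruction: own content + own style

    class_tgt, style_tgt = style_encoder(x_tgt)
    x_transfer = decoder(content_src, style_tgt)  # transfer: same content + target style

    d_style = discriminator(class_src)            # should classify the instrument correctly
    d_content = discriminator(content_src)        # should perform at chance (disentanglement)
    return x_rec, x_transfer, d_style, d_content
```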

Repository Structure

Audio-Style-Transfer/
├── README.md
│
├── Core Models/
│   ├── style_encoder.py                            # Style encoder with CNN + Transformer
│   ├── content_encoder.py                          # Content encoder for musical structure
│   ├── new_decoder.py                              # Dynamic autoregressive decoder
│   ├── discriminator.py                            # Adversarial discriminator
│   └── SimpleDecoder_TransformerOnly.py            # Transformer-only decoder
│
├── Training & Loss Functions/
│   ├── losses.py                                   # InfoNCE, margin, adversarial, HSIC losses
│   ├── dataloader.py                               # Efficient dual-instrument data loading
│   └── train2.ipynb                                # Training file with regular training and curriculum training
│
├── Evaluation/
│   ├── evaluation_reconstruction.py                # evaluation file for reconstruction sequence modeling task
│   └── evaluation_style_transfer.py                # evaluation file for audio style transfer
│
├── Utilities & Testing/
│   ├── utilityFunctions.py                         # STFT/CQT processing, audio I/O
│   ├── test_correctness.ipynb                      # Model validation and testing
│   └── style_transfer_inference_test.ipynb         # Notebook for style transfer inference and audio export
│
├── Dataset Statistics/
│   └── train_set_stats/
│       ├── stats_stft_cqt_piano.npz                # Piano normalization statistics
│       ├── stats_stft_cqt_violin.npz               # Violin normalization statistics
│       └── stats_unified_stft_cqt.npz              # Combined statistics
│
├── Dataset Preprocessing/
│   └── Preprocessing_Dataset/
│       ├── unifies_violin_datasets.py              # Merges Bach and Etudes violin datasets
│       ├── split_BachViolinDataset.py              # Segments Bach violin recordings
│       ├── split_PianoMotion10M.py                 # Extracts piano segments from PianoMotion10M
│       ├── split_ViolinEtudes.py                   # Segments violin etudes recordings
│       ├── compute_unified_stats.py                # Computes combined normalization statistics
│       ├── compute_separated_stats.py              # Computes instrument-specific statistics
│       ├── read_unified_npz.py                     # Utility to inspect unified statistics
│       ├── read_separated_npz.py                   # Utility to inspect separated statistics
│       ├── dataset_trace_analysis.py               # Analyzes audio characteristics and metrics
│       └── dataset_variety.py                      # Visualizes dataset diversity using t-SNE

File Descriptions

Core Models

style_encoder.py

Implements the StyleEncoder class using ResNet-like CNN blocks followed by transformer layers. Key features:

  • Extracts style embeddings that capture instrument-specific characteristics
  • Provides both individual and class-level embeddings (aggregated through a CLS token)
  • Uses the full complex STFT + CQT as input
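The class-level aggregation can be pictured as a learned CLS token prepended to the frame sequence; the sketch below is illustrative only, with assumed dimensions and layer counts rather than the actual StyleEncoder configuration.

```python
import torch
from torch import nn

class TinyStyleEncoder(nn.Module):
    """Illustrative CLS-token aggregation (not the real StyleEncoder)."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x: torch.Tensor):
        # x: [batch, time, d_model] frame features (e.g. CNN features of STFT+CQT)
        cls = self.cls.expand(x.size(0), -1, -1)
        h = self.encoder(torch.cat([cls, x], dim=1))
        return h[:, 0], h[:, 1:]   # class-level embedding, per-frame style embeddings
```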

content_encoder.py

Implements the ContentEncoder class for extracting musical content representations; the architecture is similar to the style encoder.

  • No CLS token to focus on temporal content structure
  • Outputs sequence-level embeddings for temporal modeling, well suited to the autoregressive decoder

new_decoder.py

Dynamic transformer decoder that reconstructs STFT spectrograms. Features:

  • Autoregressive generation during inference with causal masking
  • Teacher forcing during training for stable convergence
  • CNN encoder-decoder for spatial feature processing
  • Dynamic sequence length handling without fixed parameters
  • Comprehensive loss functions
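The autoregressive part reduces to a causal attention mask during training plus a roll-out loop at inference; below is a hedged sketch of those two ingredients, with an assumed decoder call signature (not the actual new_decoder.py interface).

```python
import torch
from torch import nn

def causal_mask(t: int) -> torch.Tensor:
    """Upper-triangular -inf mask so frame i can only attend to frames <= i."""
    return torch.triu(torch.full((t, t), float("-inf")), diagonal=1)

@torch.no_grad()
def autoregressive_decode(decoder: nn.Module, content: torch.Tensor,
                          n_frames: int, frame_dim: int) -> torch.Tensor:
    """Greedy roll-out at inference; during training, ground-truth frames are fed instead (teacher forcing)."""
    frames = torch.zeros(content.size(0), 1, frame_dim)           # start-of-sequence frame
    for _ in range(n_frames):
        mask = causal_mask(frames.size(1))
        nxt = decoder(frames, content, tgt_mask=mask)[:, -1:]     # assumed call signature
        frames = torch.cat([frames, nxt], dim=1)
    return frames[:, 1:]
```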

discriminator.py

Simple MLP discriminator for adversarial training:

  • Ensures proper disentanglement between style and content representations
  • Designed to correctly classify style/class_emb while performing at chance on content_emb
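As a rough idea of its shape, an MLP of this kind could look like the sketch below (assumed embedding dimension and hidden size, not the actual discriminator.py configuration).

```python
import torch
from torch import nn

class TinyDiscriminator(nn.Module):
    """Illustrative MLP: predicts the instrument class from an embedding."""
    def __init__(self, emb_dim: int = 256, n_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb)   # logits over instrument classes (piano, violin)
```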

Training & Loss Functions

losses.py

Comprehensive loss function library including:

  • InfoNCE loss: Contrastive learning for style representation
  • Margin loss: Class separation in embedding space
  • Adversarial losses: Generator/discriminator training
  • HSIC-based disentanglement loss: Statistical independence between style and content
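As an example of the contrastive term, a generic InfoNCE over style embeddings can be written as below; this is a standard formulation and not necessarily the exact one in losses.py.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, negatives: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """anchor/positive: [B, D] same-instrument styles; negatives: [B, N, D] other-instrument styles."""
    anchor, positive, negatives = (F.normalize(t, dim=-1) for t in (anchor, positive, negatives))
    pos = (anchor * positive).sum(-1, keepdim=True) / temperature        # [B, 1]
    neg = torch.einsum("bd,bnd->bn", anchor, negatives) / temperature    # [B, N]
    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(anchor.size(0), dtype=torch.long)               # positive sits at index 0
    return F.cross_entropy(logits, target)
```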

dataloader.py

Efficient data loading for dual-instrument training:

  • Handles STFT/CQT computation with configurable parameters
  • Applies instrument-specific normalization statistics
  • Creates overlapping windows for temporal modeling
  • Ensures balanced batch creation across instrument classes for the contrastive training
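The overlapping-window step can be sketched with torch.Tensor.unfold; the window and hop sizes below are assumptions, not the values used in dataloader.py.

```python
import torch

def overlapping_windows(spec: torch.Tensor, win: int = 128, hop: int = 64) -> torch.Tensor:
    """spec: [2, freq, time] -> [n_windows, 2, freq, win], with 50% overlap at these defaults."""
    windows = spec.unfold(dimension=-1, size=win, step=hop)   # [2, freq, n_windows, win]
    return windows.permute(2, 0, 1, 3).contiguous()
```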

Dataset Statistics (train_set_stats/)

stats_stft_cqt_piano.npz

Piano-specific normalization statistics:

  • stft_mean, stft_std: STFT normalization parameters (shape: [2, 513])
  • cqt_mean, cqt_std: CQT normalization parameters (shape: [2, 84])
  • Computed from piano training data for instrument-specific normalization

stats_stft_cqt_violin.npz

Violin-specific normalization statistics:

  • Same structure as piano statistics
  • Ensures proper normalization for violin characteristics

stats_unified_stft_cqt.npz

Combined statistics from both instruments:

  • Used as fallback when separate statistics are unavailable
  • Provides general normalization for mixed-instrument scenarios
  • Maintains compatibility with different data configurations
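Applying these statistics amounts to a per-bin standardisation of the real and imaginary channels; a short example assuming the key names listed above:

```python
import numpy as np
import torch

stats = np.load("train_set_stats/stats_stft_cqt_piano.npz")
stft_mean = torch.from_numpy(stats["stft_mean"]).float()     # [2, 513]
stft_std = torch.from_numpy(stats["stft_std"]).float()       # [2, 513]

def normalize_stft(stft: torch.Tensor) -> torch.Tensor:
    """stft: [2, 513, time] real/imag channels; broadcast mean/std over the time axis."""
    return (stft - stft_mean[..., None]) / (stft_std[..., None] + 1e-8)
```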

Training Strategy

Training also implements a curriculum learning approach:

  • Phase 1 (0-20%): Reconstruction-only training
  • Phase 2 (20-40%): Add disentanglement losses
  • Phase 3 (40-60%): Introduce contrastive learning
  • Phase 4 (60-100%): Full adversarial training

This progressive approach ensures stable training and better convergence.
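One simple way to realise such a schedule is to gate the loss weights by training progress; the sketch below is illustrative only, the actual schedule is defined in train2.ipynb.

```python
def loss_weights(progress: float) -> dict:
    """progress in [0, 1]: enable loss terms following the four curriculum phases."""
    return {
        "reconstruction":  1.0,                                 # phase 1: always on
        "disentanglement": 1.0 if progress >= 0.20 else 0.0,    # phase 2
        "contrastive":     1.0 if progress >= 0.40 else 0.0,    # phase 3
        "adversarial":     1.0 if progress >= 0.60 else 0.0,    # phase 4
    }

# total_loss = sum(w * losses[name] for name, w in loss_weights(step / total_steps).items())
```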

Results and resources

Despite our experiments, the results were not convincing: some losses oscillated heavily, indicating instability, and the model did not reconstruct the input properly. However, we remain strongly convinced of the foundations and mathematical proofs behind our choices, and we believe that further experiments and refinements could lead to proper learning.

Resources

Dataset, statistics and checkpoints: https://drive.google.com/drive/folders/13P5gLzAJPyDuhp-urVOHHelahs5rQehu?usp=sharing

Requirements

Python 3.8+
PyTorch 1.12+
torchaudio
librosa
numpy
matplotlib
soundfile
