EnCodec
Developer	Meta AI
Initial release	October 24, 2022
Written in	Python (PyTorch)
Type	Neural audio codec
License	MIT License
Repository	github.com/facebookresearch/encodec

EnCodec is an open-source neural audio codec developed by Meta AI. It uses deep learning to compress audio at very low bit rates while maintaining high fidelity, producing files roughly ten times smaller than MP3 at comparable quality.^[1] The codec was introduced in October 2022 in the research paper "High Fidelity Neural Audio Compression" by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi.^[2]

Architecture

EnCodec employs a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.^[3]

The model uses a multiscale STFT-based (MS-STFT) discriminator during training, which serves as a perceptual loss function to reduce artifacts and produce high-quality samples.^[4] A novel loss balancer mechanism stabilizes training by defining the weight of a loss as the fraction of the overall gradient it should represent.^[2]

EnCodec is composed of three main components: an encoder network, a residual vector quantization (RVQ) bottleneck, and a decoder network.^[4]^[5]

Encoder and decoder

The encoder is a convolutional network based on the SEANet architecture. It processes raw audio waveforms through a stack of one-dimensional convolutional residual blocks with strided downsampling layers, reducing the temporal resolution and producing a low frame-rate latent sequence. Optionally, long short-term memory (LSTM) layers are appended to capture long-range temporal context.^[2]^[6]
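As a rough illustration of how strided downsampling sets the latent frame rate, the sketch below multiplies a set of per-layer strides into an overall hop length. The stride values (2, 4, 5, 8) are an assumption based on the reference implementation's defaults, not something stated in the text above, and `latent_frame_rate` is a hypothetical helper name.

```python
from math import prod

def latent_frame_rate(sample_rate_hz, strides):
    """Return (hop_length, frames_per_second) for a stack of strided convolutions."""
    hop_length = prod(strides)  # each layer divides the temporal resolution by its stride
    return hop_length, sample_rate_hz / hop_length

# Assumed strides; they are not given in the article text.
ASSUMED_STRIDES = (2, 4, 5, 8)

hop, fps = latent_frame_rate(24_000, ASSUMED_STRIDES)
print(hop, fps)  # 320 75.0
```

Under these assumed strides, one latent frame covers 320 input samples, which at 24 kHz gives 75 frames per second.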

The decoder mirrors the encoder using transposed convolutions and residual blocks to reconstruct the audio waveform from the quantized latent representation. For the 24 kHz model, all operations are strictly causal so that the system can run in a streaming, low-latency mode. For the 48 kHz stereo model, the architecture is non-causal and uses larger context for improved music quality.^[2]^[7]
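The difference between causal and non-causal operation can be shown with a toy one-dimensional convolution. This is a minimal sketch in plain Python, not the model's actual layer: `conv1d` is a hypothetical helper, and the only point illustrated is that causal padding goes entirely on the left, so no output sample depends on future input.

```python
def conv1d(signal, kernel, causal=True):
    """1-D convolution (cross-correlation) whose output has the input's length.

    causal=True pads only on the left, so output[t] depends on signal[:t + 1].
    causal=False pads on both sides, so output[t] also sees future samples.
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(signal) + [0.0] * right
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]

impulse = [0.0, 0.0, 1.0, 0.0, 0.0]  # single spike at t = 2
kernel = [1.0, 1.0, 1.0]

print(conv1d(impulse, kernel, causal=True))   # [0.0, 0.0, 1.0, 1.0, 1.0]
print(conv1d(impulse, kernel, causal=False))  # [0.0, 1.0, 1.0, 1.0, 0.0]
```

In the causal case the output stays at zero until the spike arrives. In the non-causal case the output already reacts at t = 1, one sample before the spike, which is why such a layer cannot run in a strictly streaming fashion.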

Residual vector quantization

The latent vectors produced by the encoder are quantized using residual vector quantization (RVQ). In RVQ, the latent vector is approximated by a sum of codebook entries: the first codebook captures the coarse structure, and each subsequent codebook encodes the residual error left by the previous stage. Each codebook contains 1024 learned vectors.^[2]^[8] The number of active codebooks determines the target bitrate: using 2, 4, 8, 16, or 32 codebooks on the 24 kHz model corresponds to approximately 1.5, 3, 6, 12, and 24 kbit/s, respectively.^[7]^[9]
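The stage-by-stage idea behind RVQ can be sketched with toy data. The example below uses two hand-written two-dimensional codebooks of four entries each, purely for illustration; real codebooks are learned and far larger, and the function names (`nearest`, `rvq_encode`, `rvq_decode`) are hypothetical rather than taken from the EnCodec code base.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )

def rvq_encode(vector, codebooks):
    """Quantize `vector` stage by stage; each stage encodes the previous residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected entries; passing fewer indices gives a coarser result."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy codebooks (illustrative values only).
coarse = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]]
codebooks = [coarse, fine]

x = [1.2, 0.3]
codes = rvq_encode(x, codebooks)
print(codes)                             # [1, 3]
print(rvq_decode(codes[:1], codebooks))  # [1.0, 0.0]
print(rvq_decode(codes, codebooks))      # [1.25, 0.25]
```

Decoding with only the first index gives a coarse approximation, and adding the second index moves the reconstruction closer to the input. Dropping trailing codebooks in this way is what lets the bitrate be reduced without changing the earlier stages.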

During training, a structured codebook dropout scheme is used so that a single model can operate at multiple bitrates without retraining.^[2]
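Because the same model can be run with fewer codebooks, the resulting bitrate follows from simple arithmetic: frames per second, times the number of codebooks, times the bits needed to index one codebook entry. The sketch below assumes the 75 Hz frame rate quoted for the 24 kHz model in the VALL-E section and the 1024-entry codebooks mentioned above; `bitrate_kbps` is a hypothetical helper name.

```python
from math import log2

def bitrate_kbps(num_codebooks, frame_rate_hz=75, codebook_size=1024):
    """Bitrate of an RVQ token stream: frames/s x codebooks x bits per index."""
    bits_per_index = log2(codebook_size)  # 10 bits for 1024 entries
    return frame_rate_hz * num_codebooks * bits_per_index / 1000

for n in (2, 4, 8, 16, 32):
    print(n, bitrate_kbps(n))
# 2 1.5
# 4 3.0
# 8 6.0
# 16 12.0
# 32 24.0
```

These values match the bitrates listed for the 24 kHz model.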

Training objective

EnCodec is trained end-to-end with a compound loss consisting of three terms:^[2]^[6]

A time-domain reconstruction loss (L1 distance between the original and reconstructed waveform).
A multi-scale spectrogram loss computed over multiple STFT resolutions, encouraging accurate reconstruction in the frequency domain.
An adversarial loss from a multi-scale STFT discriminator (MS-STFTD), which operates on real and imaginary STFT components and encourages perceptually plausible outputs.
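With $x$ the original waveform and $\hat{x}$ the reconstruction, the three terms can be written schematically as a conventional weighted sum. The weights $\lambda_t, \lambda_f, \lambda_g$, the set of STFT scales $\mathcal{S}$, the spectrogram operator $S_s$, and the use of a plain L1 spectrogram distance are notational assumptions made here for illustration; the paper's exact spectrogram and adversarial formulations may differ in detail.

```latex
\ell_t(x, \hat{x}) = \lVert x - \hat{x} \rVert_1, \qquad
\ell_f(x, \hat{x}) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}}
    \lVert S_s(x) - S_s(\hat{x}) \rVert_1, \qquad
L = \lambda_t \, \ell_t + \lambda_f \, \ell_f + \lambda_g \, \ell_g
```

Here $\ell_g$ stands for the adversarial term supplied by the discriminator.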

A key contribution of the paper is the loss balancer: rather than using fixed scalar weights for each loss term, the balancer normalizes each gradient by its running average magnitude, so that the contribution of each loss to the total gradient is proportional to a user-specified fraction. This decouples hyperparameter tuning from the typical scale of each loss and stabilizes adversarial training.^[2]^[10]
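The balancing idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the reference implementation: gradients are represented as lists of floats taken with respect to the model output, the decay of 0.999 and the target norm of 1.0 are assumed defaults, and the class and loss names are hypothetical.

```python
from math import sqrt

class LossBalancer:
    """Rescale per-loss gradients so each contributes a chosen share of the total."""

    def __init__(self, weights, decay=0.999, total_norm=1.0, eps=1e-12):
        self.weights = dict(weights)  # desired relative share of each loss
        self.decay = decay            # EMA decay for the running gradient norms
        self.total_norm = total_norm  # overall scale of the combined gradient
        self.eps = eps                # guards against division by zero
        self.avg_norm = {}            # running average of each gradient's L2 norm

    def combine(self, grads):
        """grads: {loss_name: gradient w.r.t. the model output, as a list of floats}."""
        weight_sum = sum(self.weights[name] for name in grads)
        combined = None
        for name, g in grads.items():
            norm = sqrt(sum(v * v for v in g))
            prev = self.avg_norm.get(name, norm)  # first call: start the EMA at the norm
            avg = self.decay * prev + (1 - self.decay) * norm
            self.avg_norm[name] = avg
            scale = self.total_norm * self.weights[name] / weight_sum / (avg + self.eps)
            scaled = [scale * v for v in g]
            combined = scaled if combined is None else [a + b for a, b in zip(combined, scaled)]
        return combined

# Two losses whose raw gradients differ in scale by a factor of 500,000.
balancer = LossBalancer({"time": 1.0, "adv": 3.0})
out = balancer.combine({"time": [1000.0, 0.0], "adv": [0.0, 0.002]})
print([round(v, 6) for v in out])  # [0.25, 0.75]
```

Although the raw gradients differ by several orders of magnitude, the combined gradient splits 25% to 75%, matching the requested weights of 1 and 3. In a real training loop the combined gradient would then be backpropagated through the rest of the network.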

Features and specifications

EnCodec supports two primary configurations:^[7]

A causal model operating at 24 kHz on monophonic audio, trained on diverse audio data, supporting bitrates of 1.5, 3, 6, 12, and 24 kbit/s.
A non-causal model operating at 48 kHz on stereophonic audio, trained on music data, supporting bitrates of 3, 6, 12, and 24 kbit/s.

The codec includes an optional lightweight Transformer language model that can further compress the discrete audio representation by up to 40% without additional quality loss.^[2]^[7]

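EnCodec can operate in two modes. In the non-streamable mode, the input audio is split into one-second chunks with a 10 ms overlap, and each chunk is encoded separately. In the streamable mode, weight normalization is used on the convolution layers, and the input is not split into chunks but padded on the left.^[3]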
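The chunking used in the non-streamable mode can be sketched as a computation of overlapping sample ranges. The helper below is hypothetical and only illustrates the arithmetic of one-second chunks with a 10 ms overlap. The 48 kHz rate in the usage line is an illustrative choice, and how overlapping regions are blended on decoding is not covered here.

```python
def chunk_bounds(num_samples, sample_rate_hz, chunk_s=1.0, overlap_s=0.010):
    """Return (start, end) sample indices of overlapping chunks covering the signal."""
    if num_samples <= 0:
        return []
    chunk = int(round(chunk_s * sample_rate_hz))
    overlap = int(round(overlap_s * sample_rate_hz))
    stride = chunk - overlap  # consecutive chunks share `overlap` samples
    if stride <= 0:
        raise ValueError("overlap must be shorter than the chunk")
    bounds, start = [], 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            return bounds
        start += stride

print(chunk_bounds(100_000, 48_000))
# [(0, 48000), (47520, 95520), (95040, 100000)]
```

Each chunk starts 480 samples (10 ms at 48 kHz) before the previous one ends, and the final chunk is shortened to fit the remaining audio.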

Models

The official EnCodec release provides two pretrained models:^[7]^[9]

Model	Sample rate	Channels	Supported bitrates	Use case
encodec_24khz	24 kHz	Mono	1.5, 3, 6, 12, 24 kbit/s	Speech and general audio
encodec_48khz	48 kHz	Stereo	3, 6, 12, 24 kbit/s	Music (non-causal)

The 24 kHz model is causal and designed for streaming applications with low algorithmic latency. The 48 kHz stereo model is non-causal and was described as the first neural audio codec applied to stereophonic audio at 48 kHz, a standard sample rate for music distribution.^[2]^[9]

Both models are accessible through the facebookresearch/encodec GitHub repository, through PyTorch Hub, and through the Hugging Face Transformers library.^[7]^[5]

Applications

A primary application of EnCodec is improving audio quality for voice calls over low-bandwidth connections, such as phone calls in areas with poor network coverage.^[1] Meta has suggested the technology could support "rich metaverse experiences without requiring major bandwidth improvements."^[11]

Beyond compression, EnCodec serves as a foundational component in Meta's generative audio models within the AudioCraft library. MusicGen and AudioGen use EnCodec to convert raw audio waveforms into discrete tokens that can be modeled by autoregressive language models.^[12] EnCodec can also be fine-tuned for specific audio tasks or integrated into larger audio processing pipelines for applications such as speech synthesis and music generation.^[3]

VALL-E

VALL-E, VALL-E X, and SpeechX, all developed at Microsoft, are neural codec language models that generate discrete codes derived from EnCodec based on text or acoustic inputs.^[13] In the original VALL-E system, the EnCodec model encodes speech at 24 kHz into discrete tokens at 75 Hz using eight codebooks at 6 kbit/s; an autoregressive transformer then generates these tokens conditioned on phoneme sequences and a short acoustic prompt.^[8]
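The figures quoted for VALL-E imply a fixed number of discrete tokens per second of speech. The sketch below works through that arithmetic; the three-second clip length is an arbitrary illustrative value, and `codec_token_count` is a hypothetical helper name.

```python
def codec_token_count(duration_s, frame_rate_hz=75, num_codebooks=8):
    """Number of frames and total discrete tokens for a clip of the given length."""
    frames = int(duration_s * frame_rate_hz)
    return frames, frames * num_codebooks

frames, tokens = codec_token_count(3.0)  # 3 s is an arbitrary illustrative length
print(frames, tokens)                    # 225 1800
```

At 75 frames per second and eight codebooks, each second of audio becomes 600 tokens. With 10 bits per token (1024-entry codebooks), that is the 6 kbit/s rate stated above.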

Performance

In subjective listening tests using the MUSHRA methodology, EnCodec was rated higher than baseline codecs at equivalent bitrates. Meta claims that EnCodec at 6 kbit/s achieves quality comparable to MP3 at 64 kbit/s.^[1] Evaluation results showed that EnCodec outperformed Opus and Lyra v2 at similar bandwidths, with EnCodec at 3 kbit/s reportedly performing better than Lyra v2 at 6 kbit/s and Opus at 12 kbit/s.^[3]
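The "ten times smaller" figure follows directly from the two bitrates in Meta's comparison. A minimal sketch of that arithmetic, with `compression_ratio` as a hypothetical helper name:

```python
def compression_ratio(reference_kbps, codec_kbps):
    """How many times smaller the codec stream is than the reference stream."""
    return reference_kbps / codec_kbps

print(round(compression_ratio(64, 6), 1))  # 10.7
```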

Software

EnCodec is implemented in Python using PyTorch and is released under the MIT License.^[7] Pretrained model weights are available on Hugging Face and through PyTorch Hub,^[3] and the codec is integrated into the Hugging Face Transformers library.^[7]

EnCodec uses the .ecdc file extension for its compressed output format.^[7]

References

[1] "Meta's AI-Powered Audio Codec Promises 10x Compression Over MP3". Slashdot. 2022-11-01. Retrieved 2026-05-10.
[2] Défossez, Alexandre; Copet, Jade; Synnaeve, Gabriel; Adi, Yossi (2022-10-24). "High Fidelity Neural Audio Compression". arXiv:2210.13438 [eess.AS].
[3] "facebook/encodec_24khz". Hugging Face. Retrieved 2026-05-10.
[4] He, Hecate (2022-10-27). "Meet Meta AI's EnCodec: A SOTA Real-Time Neural Model for High-Fidelity Audio Compression". Synced. Retrieved 2026-05-10.
[5] "facebook/encodec_24khz". Hugging Face. Retrieved 2025-05-09.
[6] "EnCodec: High Fidelity Neural Audio Compression". AudioCraft Documentation. Meta AI. Retrieved 2025-05-09.
[7] "EnCodec: State-of-the-art deep learning based audio codec". GitHub. Retrieved 2026-05-10.
[8] "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers". arXiv. 2023. arXiv:2301.02111. Retrieved 2025-05-09.
[9] "EnCodec: High-fidelity Neural Audio Compression". AudioCraft. Meta AI. Retrieved 2025-05-09.
[10] "High Fidelity Neural Audio Compression". OpenReview. 2023. Retrieved 2025-05-09.
[11] "Meta Says Its AI-Compressed Audio Codec Beats MP3 by 10x". ETCentric. 2022-11-09. Retrieved 2026-05-10.
[12] "AudioCraft". Meta AI. Retrieved 2026-05-10.
[13] "Towards audio language modeling - an overview". arXiv. 2024. arXiv:2402.13236. Retrieved 2025-05-09.
