EnCodec
| EnCodec | |
|---|---|
| Developer | Meta AI |
| Initial release | October 24, 2022 |
| Written in | Python (PyTorch) |
| Type | Neural audio codec |
| License | MIT License |
| Repository | github.com/facebookresearch/encodec |
EnCodec is an open-source neural audio codec developed by Meta AI. It uses deep learning to compress audio at very low bit rates while maintaining high fidelity, producing files roughly ten times smaller than MP3 at comparable quality.[1] The codec was introduced in October 2022 in the research paper "High Fidelity Neural Audio Compression" by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi.[2]
Architecture

EnCodec employs a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.[3]
The model uses a multiscale STFT-based (MS-STFT) discriminator during training, which serves as a perceptual loss function to reduce artifacts and produce high-quality samples.[4] A novel loss balancer mechanism stabilizes training by defining the weight of a loss as the fraction of the overall gradient it should represent.[2]
EnCodec is composed of three main components: an encoder network, a residual vector quantization (RVQ) bottleneck, and a decoder network.[4][5]
Encoder and decoder
The encoder is a convolutional network based on the SEANet architecture. It processes raw audio waveforms through a stack of one-dimensional convolutional residual blocks with strided downsampling layers, reducing the temporal resolution and producing a low frame-rate latent sequence. Optionally, long short-term memory (LSTM) layers are appended to capture long-range temporal context.[2][6]
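As a rough illustration of how strided downsampling sets the latent frame rate, the sketch below multiplies a set of stride values and divides the sample rate by the result. The stride values (2, 4, 5, 8) are an assumption based on the configuration commonly reported for the 24 kHz model and are not stated in this article; the function name is hypothetical.

```python
from math import prod


def latent_frame_rate(sample_rate_hz: int, strides: tuple) -> float:
    """Frames per second after a stack of strided downsampling layers."""
    hop_length = prod(strides)  # input samples consumed per latent frame
    return sample_rate_hz / hop_length


# Assumed strides for the 24 kHz model (not stated in this article).
ASSUMED_STRIDES = (2, 4, 5, 8)

print(latent_frame_rate(24_000, ASSUMED_STRIDES))  # 75.0
```

Under that assumption, each latent frame covers 320 input samples, which matches the 75 Hz token rate quoted for the 24 kHz model in the VALL-E section.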
The decoder mirrors the encoder using transposed convolutions and residual blocks to reconstruct the audio waveform from the quantized latent representation. For the 24 kHz model, all operations are strictly causal so that the system can run in a streaming, low-latency mode. For the 48 kHz stereo model, the architecture is non-causal and uses larger context for improved music quality.[2][7]
Residual vector quantization
The latent vectors produced by the encoder are quantized using residual vector quantization (RVQ). In RVQ, the latent vector is approximated by a sequence of codebook entries: the first codebook captures the coarse structure, and each subsequent codebook encodes the residual error of the previous stage. Each codebook contains 1024 learned vectors.[2][8] The number of active codebooks determines the target bitrate: using 2, 4, 8, 16, or 32 codebooks on the 24 kHz model corresponds to approximately 1.5, 3, 6, 12, or 24 kbit/s respectively.[7][9]
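The relationship between codebook count and bitrate follows from simple arithmetic, sketched below. It assumes the 75 Hz frame rate quoted for the 24 kHz model elsewhere in this article and ignores any container overhead or entropy coding; the function name is hypothetical.

```python
from math import log2


def rvq_bitrate_kbps(num_codebooks: int,
                     codebook_size: int = 1024,
                     frame_rate_hz: float = 75.0) -> float:
    """Raw bitrate of RVQ indices, before any entropy coding."""
    bits_per_index = log2(codebook_size)  # 10 bits for 1024 entries
    return num_codebooks * bits_per_index * frame_rate_hz / 1000.0


for n in (2, 4, 8, 16, 32):
    print(n, rvq_bitrate_kbps(n))
```

Each codebook index costs 10 bits per frame, so every additional codebook adds 0.75 kbit/s at 75 frames per second.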
During training, a structured codebook dropout scheme is used so that a single model can operate at multiple bitrates without retraining.[2]
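The residual quantization described above, and the effect of decoding with only the first few codebooks, can be sketched in NumPy as follows. This is a minimal illustration and not the reference implementation: the codebooks here are random rather than learned, the 16-dimensional latent is an arbitrary choice, and the function names are hypothetical.

```python
import numpy as np


def rvq_encode(latent, codebooks, num_active=None):
    """Return one index per active codebook for a single latent vector."""
    residual = np.asarray(latent, dtype=np.float64).copy()
    indices = []
    for codebook in codebooks[:num_active]:  # None keeps every codebook
        # Nearest entry to the current residual (Euclidean distance).
        distances = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(distances))
        indices.append(idx)
        residual = residual - codebook[idx]  # pass the error to the next stage
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entries of the first len(indices) codebooks."""
    return sum(codebooks[i][idx] for i, idx in enumerate(indices))


rng = np.random.default_rng(0)
# Hypothetical untrained codebooks: 8 stages of 1024 entries each.
codebooks = [rng.normal(scale=0.5 ** i, size=(1024, 16)) for i in range(8)]
latent = rng.normal(size=16)
for n in (2, 4, 8):
    approx = rvq_decode(rvq_encode(latent, codebooks, n), codebooks)
    print(n, float(np.linalg.norm(latent - approx)))
```

Because the decoder only sums the entries it receives, truncating the index list to the first few stages still yields a valid, coarser reconstruction, which is what lets one model serve several bitrates.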
Training objective
EnCodec is trained end-to-end with a compound loss consisting of three terms:[2][6]
- A time-domain reconstruction loss (L1 distance between the original and reconstructed waveform).
- A multi-scale spectrogram loss computed over multiple STFT resolutions, encouraging accurate reconstruction in the frequency domain.
- An adversarial loss from a multi-scale STFT discriminator (MS-STFTD), which operates on real and imaginary STFT components and encourages perceptually plausible outputs.
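The two reconstruction terms in the list above can be sketched in NumPy as follows. This is a simplification under stated assumptions: it uses plain STFT magnitudes with an L1 distance, the window sizes are arbitrary, and the adversarial term is omitted because it requires a trained discriminator. The function names are hypothetical.

```python
import numpy as np


def stft_magnitude(signal, window_size):
    """Magnitude spectrogram with a Hann window and 75% overlap.

    The signal must be at least window_size samples long.
    """
    hop = window_size // 4
    window = np.hanning(window_size)
    starts = range(0, len(signal) - window_size + 1, hop)
    frames = np.stack([signal[s:s + window_size] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))


def reconstruction_losses(target, estimate, window_sizes=(64, 128, 256, 512)):
    """Return (time-domain L1, multi-scale spectrogram L1)."""
    target = np.asarray(target, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    time_loss = np.mean(np.abs(target - estimate))
    spec_loss = np.mean([
        np.mean(np.abs(stft_magnitude(target, w) - stft_magnitude(estimate, w)))
        for w in window_sizes
    ])
    return float(time_loss), float(spec_loss)


# A 440 Hz tone at 24 kHz compared against a half-amplitude copy.
tone = np.sin(2 * np.pi * 440 * np.arange(2400) / 24_000)
print(reconstruction_losses(tone, 0.5 * tone))
```

Using several window sizes trades time resolution against frequency resolution, so errors that are hidden at one scale can still be penalized at another.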
A key contribution of the paper is the loss balancer: rather than using fixed scalar weights for each loss term, the balancer normalizes each gradient by its running average magnitude, so that the contribution of each loss to the total gradient is proportional to a user-specified fraction. This decouples hyperparameter tuning from the typical scale of each loss and stabilizes adversarial training.[2][10]
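A minimal sketch of the balancing idea follows, assuming each loss supplies its gradient with respect to the model output as a NumPy array. The class name, the decay value, and the choice to seed the running average with the first observed norm are assumptions made for illustration; the reference implementation may differ in these details.

```python
import numpy as np


class LossBalancer:
    """Rescale per-loss gradients so each contributes a fixed fraction."""

    def __init__(self, weights, ema_decay=0.999, total_norm=1.0, eps=1e-12):
        self.weights = dict(weights)
        self.ema_decay = ema_decay
        self.total_norm = total_norm
        self.eps = eps
        self._avg_norm = {}  # running average of each gradient's L2 norm

    def combine(self, grads):
        weight_sum = sum(self.weights[name] for name in grads)
        combined = 0.0
        for name, grad in grads.items():
            grad = np.asarray(grad, dtype=np.float64)
            norm = float(np.linalg.norm(grad))
            prev = self._avg_norm.get(name, norm)  # seed with the first value
            avg = self.ema_decay * prev + (1.0 - self.ema_decay) * norm
            self._avg_norm[name] = avg
            fraction = self.weights[name] / weight_sum
            combined = combined + self.total_norm * fraction * grad / (avg + self.eps)
        return combined


balancer = LossBalancer({"time": 1.0, "adversarial": 3.0})
step = balancer.combine({
    "time": np.array([1000.0, 0.0]),        # very large raw gradient
    "adversarial": np.array([0.0, 0.001]),  # very small raw gradient
})
print(step)  # approximately [0.25 0.75]
```

Although the raw gradients differ by six orders of magnitude, the combined step reflects only the requested 1:3 split, which is the decoupling from loss scale described above.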
Features and specifications

EnCodec supports two primary configurations:[7]
- A causal model operating at 24 kHz on monophonic audio, trained on diverse audio data, supporting bitrates of 1.5, 3, 6, 12, and 24 kbit/s
- A non-causal model operating at 48 kHz on stereophonic audio, trained on music data, supporting bitrates of 3, 6, 12, and 24 kbit/s
The codec includes an optional lightweight Transformer language model that can further compress the discrete audio representation by up to 40% without additional quality loss.[2][7]
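EnCodec can operate in two modes: a non-streamable mode, in which the audio is split into one-second chunks with 10 ms of overlap that are encoded separately, and a streamable mode, in which the input is not chunked but padded on the left and weight normalization is used on the convolution layers.[3]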
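The chunking used in the non-streamable mode can be sketched as follows. The function computes sample ranges for one-second chunks that overlap by 10 ms; how the reference implementation treats the final partial chunk or blends the overlapping regions on decoding is not covered here, and the function name is hypothetical.

```python
def chunk_bounds(num_samples, sample_rate,
                 chunk_seconds=1.0, overlap_seconds=0.010):
    """Return (start, end) sample ranges for overlapping chunks."""
    chunk = int(round(chunk_seconds * sample_rate))
    overlap = int(round(overlap_seconds * sample_rate))
    stride = chunk - overlap
    if stride <= 0:
        raise ValueError("overlap must be shorter than the chunk")
    bounds = []
    start = 0
    while start < num_samples:
        bounds.append((start, min(start + chunk, num_samples)))
        if start + chunk >= num_samples:
            break  # this chunk already reaches the end of the signal
        start += stride
    return bounds


# 2.5 seconds of audio at 48 kHz.
print(chunk_bounds(120_000, 48_000))
# [(0, 48000), (47520, 95520), (95040, 120000)]
```

At 48 kHz the overlap is 480 samples, so consecutive chunks start 47,520 samples apart.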
Models
The official EnCodec release provides two pretrained models:[7][9]
| Model | Sample rate | Channels | Supported bitrates | Use case |
|---|---|---|---|---|
| encodec_24khz | 24 kHz | Mono | 1.5, 3, 6, 12, 24 kbit/s | Speech and general audio |
| encodec_48khz | 48 kHz | Stereo | 3, 6, 12, 24 kbit/s | Music (non-causal) |
The 24 kHz model is causal and designed for streaming applications with low algorithmic latency. The 48 kHz stereo model is non-causal and was described as the first neural audio codec applied to stereophonic audio at that sample rate, which is a standard quality level for music distribution.[2][9]
Both models are accessible through the facebookresearch/encodec GitHub repository, through PyTorch Hub, and through the Hugging Face Transformers library.[7][5]
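A hedged sketch of encoding a file with the 24 kHz model through the `encodec` Python package is shown below. It assumes that `encodec`, `torch`, and `torchaudio` are installed and follows the usage pattern from the project README as best recalled; the calls should be checked against the current release. The wrapper function, its name, and the audio path are hypothetical.

```python
SUPPORTED_KBPS_24KHZ = (1.5, 3.0, 6.0, 12.0, 24.0)


def encode_to_codes(audio_path, bandwidth_kbps=6.0):
    """Encode an audio file with the 24 kHz model and return its codes."""
    if bandwidth_kbps not in SUPPORTED_KBPS_24KHZ:
        raise ValueError(f"unsupported bandwidth: {bandwidth_kbps} kbit/s")

    # Imported lazily so the check above works without the heavy dependencies.
    import torch
    import torchaudio
    from encodec import EncodecModel
    from encodec.utils import convert_audio

    model = EncodecModel.encodec_model_24khz()  # fetches weights on first use
    model.set_target_bandwidth(bandwidth_kbps)

    wav, sample_rate = torchaudio.load(audio_path)
    # Resample and remix to what the model expects (24 kHz, mono).
    wav = convert_audio(wav, sample_rate, model.sample_rate, model.channels)

    with torch.no_grad():
        frames = model.encode(wav.unsqueeze(0))  # list of (codes, scale) pairs
    # Concatenate along time: shape [batch, num_codebooks, num_frames].
    return torch.cat([codes for codes, _ in frames], dim=-1)
```

Calling `encode_to_codes("speech.wav", 6.0)` with a real file would be expected to return an integer tensor with eight codebook rows, consistent with the codebook counts given earlier.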
Applications
The primary applications for EnCodec include improving audio quality for voice calls over low-bandwidth connections, such as phone calls in areas with poor network coverage.[1] Meta has suggested the technology could support "rich metaverse experiences without requiring major bandwidth improvements."[11]
Beyond compression, EnCodec serves as a foundational component in Meta's generative audio models within the AudioCraft library. MusicGen and AudioGen use EnCodec to convert raw audio waveforms into discrete tokens that can be modeled by autoregressive language models.[12] EnCodec can also be fine-tuned for specific audio tasks or integrated into larger audio processing pipelines for applications such as speech synthesis and music generation.[3]
VALL-E
VALL-E, VALL-E X, and SpeechX, all developed at Microsoft, are neural codec language models that generate discrete codes derived from EnCodec based on text or acoustic inputs.[13] In the original VALL-E system, the EnCodec model encodes speech at 24 kHz into discrete tokens at 75 Hz using eight codebooks at 6 kbit/s; an autoregressive transformer then generates these tokens conditioned on phoneme sequences and a short acoustic prompt.[8]
Performance
In subjective listening tests using the MUSHRA methodology, EnCodec was rated higher than baseline codecs at equivalent bitrates. Meta claims that EnCodec at 6 kbit/s achieves quality comparable to MP3 at 64 kbit/s.[1] Evaluation results showed that EnCodec outperformed Opus and Lyra v2 at similar bandwidths, with EnCodec at 3 kbit/s reportedly performing better than Lyra v2 at 6 kbit/s and Opus at 12 kbit/s.[3]
Software
EnCodec is implemented in Python using PyTorch and is released under the MIT License.[7] Pretrained model weights are available on Hugging Face and through PyTorch Hub.[3] The codec is also integrated into the Hugging Face Transformers library, so it can be loaded through the same interface as other models in that library.[7]
EnCodec uses the .ecdc file extension for its compressed output format.[7]
References
1. "Meta's AI-Powered Audio Codec Promises 10x Compression Over MP3". Slashdot. 2022-11-01. Retrieved 2026-05-10.
2. Défossez, Alexandre; Copet, Jade; Synnaeve, Gabriel; Adi, Yossi (2022-10-24). "High Fidelity Neural Audio Compression". arXiv:2210.13438 [eess.AS].
3. "facebook/encodec_24khz". Hugging Face. Retrieved 2026-05-10.
4. He, Hecate (2022-10-27). "Meet Meta AI's EnCodec: A SOTA Real-Time Neural Model for High-Fidelity Audio Compression". Synced. Retrieved 2026-05-10.
5. "facebook/encodec_24khz". Hugging Face. Retrieved 2025-05-09.
6. "EnCodec: High Fidelity Neural Audio Compression". AudioCraft Documentation. Meta AI. Retrieved 2025-05-09.
7. "EnCodec: State-of-the-art deep learning based audio codec". GitHub. Retrieved 2026-05-10.
8. "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers". arXiv. 2023. arXiv:2301.02111. Retrieved 2025-05-09.
9. "EnCodec: High-fidelity Neural Audio Compression". AudioCraft. Meta AI. Retrieved 2025-05-09.
10. "High Fidelity Neural Audio Compression". OpenReview. 2023. Retrieved 2025-05-09.
11. "Meta Says Its AI-Compressed Audio Codec Beats MP3 by 10x". ETCentric. 2022-11-09. Retrieved 2026-05-10.
12. "AudioCraft". Meta AI. Retrieved 2026-05-10.
13. "Towards audio language modeling - an overview". arXiv. 2024. arXiv:2402.13236. Retrieved 2025-05-09.