EnCodec

Developer: Meta AI
Initial release: October 24, 2022
Written in: Python (PyTorch)
Type: Neural audio codec
License: MIT License
Repository: github.com/facebookresearch/encodec

EnCodec is an open-source neural audio codec developed by Meta AI. It uses deep learning to compress audio at very low bit rates while maintaining high fidelity; Meta reports compressed output roughly ten times smaller than MP3 at comparable quality.[1] The codec was introduced in October 2022 in the research paper "High Fidelity Neural Audio Compression" by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi.[2]

Architecture

Figure: Autoencoder architecture with encoder, bottleneck (latent space), and decoder components, similar to the structure used in EnCodec.

EnCodec employs a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.[3]

The model uses a multiscale STFT-based (MS-STFT) discriminator during training, which serves as a perceptual loss function to reduce artifacts and produce high-quality samples.[4] A novel loss balancer mechanism stabilizes training by defining the weight of a loss as the fraction of the overall gradient it should represent.[2]

EnCodec is composed of three main components: an encoder network, a residual vector quantization (RVQ) bottleneck, and a decoder network.[4][5]

Encoder and decoder

The encoder is a convolutional network based on the SEANet architecture. It processes raw audio waveforms through a stack of one-dimensional convolutional residual blocks with strided downsampling layers, reducing the temporal resolution and producing a low frame-rate latent sequence. Optionally, long short-term memory (LSTM) layers are appended to capture long-range temporal context.[2][6]

The decoder mirrors the encoder using transposed convolutions and residual blocks to reconstruct the audio waveform from the quantized latent representation. For the 24 kHz model, all operations are strictly causal so that the system can run in a streaming, low-latency mode. For the 48 kHz stereo model, the architecture is non-causal and uses larger context for improved music quality.[2][7]
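The distinction between the causal and non-causal variants can be illustrated with a toy one-dimensional convolution. The sketch below is not EnCodec's code: the function `conv1d`, the kernel, and the impulse input are hypothetical and chosen only to show how left-only padding keeps each output sample from depending on future input.

```python
def conv1d(signal, kernel, causal=True):
    """Zero-padded 1-D convolution (cross-correlation) with same-length output.

    causal=True pads only on the left, so output[t] depends on signal[:t + 1].
    causal=False pads on both sides, so output[t] also sees future samples.
    """
    k = len(kernel)
    if causal:
        left, right = k - 1, 0
    else:
        left = (k - 1) // 2
        right = k - 1 - left
    padded = [0.0] * left + list(signal) + [0.0] * right
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]


# An impulse at t = 3 shows which output positions it is allowed to influence.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
kernel = [1.0, 1.0, 1.0]

causal_out = conv1d(impulse, kernel, causal=True)       # nonzero only for t >= 3
noncausal_out = conv1d(impulse, kernel, causal=False)   # already nonzero at t = 2
```

In the causal case nothing before t = 3 changes, which is the property that allows sample-by-sample streaming; the non-causal case responds one step early because it looks ahead.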

Residual vector quantization

The latent vectors produced by the encoder are quantized using residual vector quantization (RVQ). In RVQ, the latent vector is approximated by a sequence of codebook entries: the first codebook captures the coarse structure, and each subsequent codebook encodes the residual error of the previous stage. Each codebook contains 1024 learned vectors.[2][8] The number of active codebooks determines the target bitrate: using 2, 4, 8, 16, or 32 codebooks on the 24 kHz model corresponds to approximately 1.5, 3, 6, 12, or 24 kbit/s respectively.[7][9]
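The stage-by-stage procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the EnCodec implementation: the helper names (`nearest`, `rvq_encode`, `rvq_decode`) are hypothetical, and the three tiny hand-made 2-D codebooks stand in for the learned 1024-entry codebooks.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, vector):
    """Quantize stage by stage; each stage encodes the residual left by the previous one."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Reconstruct by summing the selected entry of each stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def toy_codebook(scale):
    """Hypothetical 4-entry codebook: the corners of a square of half-width `scale`."""
    return [(scale * a, scale * b) for a in (-1, 1) for b in (-1, 1)]


# Coarse-to-fine toy codebooks (assumed for illustration, not learned).
codebooks = [toy_codebook(1.0), toy_codebook(0.5), toy_codebook(0.25)]

indices = rvq_encode(codebooks, [0.8, -0.3])
full = rvq_decode(codebooks, indices)            # all three stages
coarse = rvq_decode(codebooks[:1], indices[:1])  # first stage only
```

Decoding with only the first index gives a coarse approximation, and each additional index refines it. This is why dropping trailing codebooks lowers the bitrate at the cost of fidelity.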

During training, a structured codebook dropout scheme is used so that a single model can operate at multiple bitrates without retraining.[2]

Training objective

EnCodec is trained end-to-end with a compound loss consisting of three terms:[2][6]

  • A time-domain reconstruction loss (L1 distance between the original and reconstructed waveform).
  • A multi-scale spectrogram loss computed over multiple STFT resolutions, encouraging accurate reconstruction in the frequency domain.
  • An adversarial loss from a multi-scale STFT discriminator (MS-STFTD), which operates on real and imaginary STFT components and encourages perceptually plausible outputs.

A key contribution of the paper is the loss balancer: rather than using fixed scalar weights for each loss term, the balancer normalizes each gradient by its running average magnitude, so that the contribution of each loss to the total gradient is proportional to a user-specified fraction. This decouples hyperparameter tuning from the typical scale of each loss and stabilizes adversarial training.[2][10]
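The balancing rule can be sketched as follows. This is one reading of the description above, not the reference implementation: the class name `LossBalancer`, the bias-corrected moving average, the default decay, and the example weights and gradients are all assumptions made for illustration. Each gradient is divided by the running average of its norm and then scaled by its share of the total weight.

```python
import math


class LossBalancer:
    """Combine per-loss gradients so each contributes a chosen fraction of the total."""

    def __init__(self, weights, total_norm=1.0, ema_decay=0.999, eps=1e-12):
        self.weights = dict(weights)   # loss name -> relative weight
        self.total_norm = total_norm   # target scale of the combined gradient
        self.ema_decay = ema_decay
        self.eps = eps
        self._total = {}               # decayed sum of observed gradient norms
        self._count = {}               # decayed count, used for bias correction

    def _running_norm(self, name, norm):
        d = self.ema_decay
        self._total[name] = d * self._total.get(name, 0.0) + norm
        self._count[name] = d * self._count.get(name, 0.0) + 1.0
        return self._total[name] / self._count[name]

    def combine(self, grads):
        """grads maps loss name -> gradient of that loss w.r.t. the model output."""
        weight_sum = sum(self.weights[name] for name in grads)
        combined = [0.0] * len(next(iter(grads.values())))
        for name, grad in grads.items():
            norm = math.sqrt(sum(g * g for g in grad))
            avg_norm = self._running_norm(name, norm)
            fraction = self.weights[name] / weight_sum
            scale = self.total_norm * fraction / (avg_norm + self.eps)
            combined = [c + scale * g for c, g in zip(combined, grad)]
        return combined


# Two losses whose raw gradients differ in scale by a factor of 400 (made-up values).
balancer = LossBalancer({"reconstruction": 1.0, "adversarial": 3.0})
combined = balancer.combine({
    "reconstruction": [3.0, 4.0],    # norm 5
    "adversarial": [0.0, 2000.0],    # norm 2000
})
```

Despite the large difference in raw scale, the reconstruction term ends up contributing one quarter of the combined gradient norm and the adversarial term three quarters, matching the chosen weights rather than the losses' natural magnitudes.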

Features and specifications

EnCodec supports two primary configurations:[7]

  • A causal model operating at 24 kHz on monophonic audio, trained on diverse audio data, supporting bitrates of 1.5, 3, 6, 12, and 24 kbit/s
  • A non-causal model operating at 48 kHz on stereophonic audio, trained on music data, supporting bitrates of 3, 6, 12, and 24 kbit/s

The codec includes an optional lightweight Transformer language model that can further compress the discrete audio representation by up to 40% without additional quality loss.[2][7]

EnCodec can operate in two modes: a non-streamable mode where audio is split into one-second chunks with 10 ms overlap, and a streamable mode using weight normalization on convolution layers.[3]
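The chunking used in the non-streamable mode can be sketched as a simple computation of chunk boundaries. The function name `chunk_bounds`, its parameters, and the 48 kHz example are assumptions for illustration; the sketch only shows how one-second windows with a 10 ms overlap tile a signal and does not reproduce how the codec normalizes or blends the chunks.

```python
def chunk_bounds(num_samples, sample_rate, chunk_ms=1000, overlap_ms=10):
    """Return (start, end) sample indices of overlapping chunks covering the signal."""
    chunk_len = sample_rate * chunk_ms // 1000
    overlap = sample_rate * overlap_ms // 1000
    stride = chunk_len - overlap  # each chunk starts `overlap` samples before the previous one ends
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, stride)
    ]


# 2.5 seconds of audio at an assumed 48 kHz sample rate.
bounds = chunk_bounds(num_samples=120_000, sample_rate=48_000)
```

Here consecutive chunks share 480 samples (10 ms at 48 kHz), and the last chunk is shorter because the signal ends before a full second has elapsed.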

Models

The official EnCodec release provides two pretrained models:[7][9]

Model | Sample rate | Channels | Supported bitrates | Use case
encodec_24khz | 24 kHz | Mono | 1.5, 3, 6, 12, 24 kbit/s | Speech and general audio
encodec_48khz | 48 kHz | Stereo | 3, 6, 12, 24 kbit/s | Music (non-causal)

The 24 kHz model is causal and designed for streaming applications with low algorithmic latency. The 48 kHz stereo model is non-causal and was described as the first neural audio codec applied to stereophonic audio at that sample rate, which is a standard quality level for music distribution.[9][2]

Both models are accessible through the facebookresearch/encodec GitHub repository, through PyTorch Hub, and through the Hugging Face Transformers library.[7][5]

Applications

The primary applications for EnCodec include improving audio quality for voice calls over low-bandwidth connections, such as phone calls in areas with poor network coverage.[1] Meta has suggested the technology could support "rich metaverse experiences without requiring major bandwidth improvements."[11]

Beyond compression, EnCodec serves as a foundational component in Meta's generative audio models within the AudioCraft library. MusicGen and AudioGen use EnCodec to convert raw audio waveforms into discrete tokens that can be modeled by autoregressive language models.[12] EnCodec can also be fine-tuned for specific audio tasks or integrated into larger audio processing pipelines for applications such as speech synthesis and music generation.[3]

VALL-E

VALL-E, VALL-E X, and SpeechX, all developed at Microsoft, are neural codec language models that generate discrete codes derived from EnCodec based on text or acoustic inputs.[13] In the original VALL-E system, the EnCodec model encodes speech at 24 kHz into discrete tokens at 75 Hz using eight codebooks at 6 kbit/s; an autoregressive transformer then generates these tokens conditioned on phoneme sequences and a short acoustic prompt.[8]
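The figures quoted here are consistent with the codebook size given in the section on residual vector quantization: a codebook of 1024 entries needs 10 bits per index, so eight indices per frame at 75 frames per second come to 6 kbit/s. The helper below is a hypothetical name that simply carries out this arithmetic.

```python
import math


def bitrate_kbps(num_codebooks, frame_rate_hz=75, codebook_size=1024):
    """Bitrate of an RVQ token stream in kbit/s (1 kbit = 1000 bits)."""
    bits_per_index = math.log2(codebook_size)  # 10 bits for 1024 entries
    return num_codebooks * bits_per_index * frame_rate_hz / 1000


vall_e_rate = bitrate_kbps(8)  # eight codebooks at 75 Hz
rates = {n: bitrate_kbps(n) for n in (2, 4, 8, 16, 32)}
```

The same formula reproduces the bitrate ladder of the 24 kHz model, with each doubling of the number of codebooks doubling the bitrate.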

Performance

In subjective listening tests using the MUSHRA methodology, EnCodec was rated higher than baseline codecs at equivalent bitrates. Meta claims that EnCodec at 6 kbit/s achieves quality comparable to MP3 at 64 kbit/s.[1] In the reported evaluations, EnCodec outperformed Opus and Lyra v2 at similar bandwidths: EnCodec at 3 kbit/s reportedly performed better than Lyra v2 at 6 kbit/s and Opus at 12 kbit/s.[3]

Software

EnCodec is implemented in Python using PyTorch and is released under the MIT License.[7] Pre-trained model weights are available on Hugging Face and through Torch Hub.[3] The codec is also integrated into the Hugging Face Transformers library, allowing use at scale alongside other models.[7]

EnCodec uses the .ecdc file extension for its compressed output format.[7]

References

  1. "Meta's AI-Powered Audio Codec Promises 10x Compression Over MP3". Slashdot. 2022-11-01. Retrieved 2026-05-10.
  2. Défossez, Alexandre; Copet, Jade; Synnaeve, Gabriel; Adi, Yossi (2022-10-24). "High Fidelity Neural Audio Compression". arXiv:2210.13438 [eess.AS].
  3. "facebook/encodec_24khz". Hugging Face. Retrieved 2026-05-10.
  4. He, Hecate (2022-10-27). "Meet Meta AI's EnCodec: A SOTA Real-Time Neural Model for High-Fidelity Audio Compression". Synced. Retrieved 2026-05-10.
  5. "facebook/encodec_24khz". Hugging Face. Retrieved 2025-05-09.
  6. "EnCodec: High Fidelity Neural Audio Compression". AudioCraft Documentation. Meta AI. Retrieved 2025-05-09.
  7. "EnCodec: State-of-the-art deep learning based audio codec". GitHub. Retrieved 2026-05-10.
  8. "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers". arXiv. 2023. arXiv:2301.02111. Retrieved 2025-05-09.
  9. "EnCodec: High-fidelity Neural Audio Compression". AudioCraft. Meta AI. Retrieved 2025-05-09.
  10. "High Fidelity Neural Audio Compression". OpenReview. 2023. Retrieved 2025-05-09.
  11. "Meta Says Its AI-Compressed Audio Codec Beats MP3 by 10x". ETCentric. 2022-11-09. Retrieved 2026-05-10.
  12. "AudioCraft". Meta AI. Retrieved 2026-05-10.
  13. "Towards audio language modeling - an overview". arXiv. 2024. arXiv:2402.13236. Retrieved 2025-05-09.
