vLLM

Original authors: Sky Computing Lab, UC Berkeley
Developer: vLLM contributors
Initial release: 2023
Written in: Python, CUDA, C++
Type: Large language model inference engine
License: Apache License 2.0
Website: vllm.ai
Repository: github.com/vllm-project/vllm

vLLM is an open-source software framework for inference and serving of large language models and related multimodal models. Originally developed at the University of California, Berkeley's Sky Computing Lab,[1] the project is centered on PagedAttention, a memory-management method for transformer key–value caches, and supports features such as continuous batching, distributed inference, quantization, and OpenAI-compatible APIs.[2][3][4] According to a project maintainer, the "v" in vLLM originally referred to "virtual", inspired by virtual memory.[5]

History

vLLM was introduced in 2023 by researchers affiliated with the Sky Computing Lab at UC Berkeley.[2][3] Its core ideas were described in the 2023 paper "Efficient Memory Management for Large Language Model Serving with PagedAttention",[6] which presented the system as a high-throughput and memory-efficient serving engine for large language models.[3]

According to PyTorch's project page, the University of California, Berkeley contributed vLLM to the Linux Foundation in July 2024.[4] In May 2025, the PyTorch Foundation announced that vLLM had become a Foundation-hosted project.[7]

In January 2026, TechCrunch reported that the creators of vLLM had launched the startup Inferact to commercialize the project, raising $150 million in seed funding.[8]

Architecture

According to its 2023 paper, vLLM was designed to improve the efficiency of large language model serving by reducing memory waste in the key–value cache used during transformer inference.[3] The paper introduced PagedAttention, an algorithm inspired by virtual memory and paging techniques in operating systems. It described vLLM as combining block-level memory management with request scheduling to increase throughput while keeping latency comparable to that of earlier serving systems.[3]
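The block-level idea can be illustrated with a small sketch. It is not vLLM's implementation: the names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE`, and the block size of 16 tokens, are assumptions chosen for illustration. The sketch shows only the bookkeeping, in which each request holds a table mapping logical token positions to fixed-size physical blocks and a new block is allocated only when the previous one is full.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class BlockAllocator:
    """Hands out fixed-size physical blocks from a free list (hypothetical helper)."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop()

    def release(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block mapping."""

    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def locate(self, token_index: int) -> tuple[int, int]:
        # Translate a logical token position to (physical block, offset).
        return self.block_table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE

    def release_all(self, allocator: BlockAllocator) -> None:
        # Return every block to the pool once the request finishes.
        for block_id in self.block_table:
            allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
seq = Sequence()
for _ in range(40):
    seq.append_token(allocator)

# 40 tokens occupy 3 blocks, so at most BLOCK_SIZE - 1 slots are ever unused.
unused_slots = len(seq.block_table) * BLOCK_SIZE - seq.num_tokens
print(len(seq.block_table), unused_slots)  # 3 8
```

Because blocks are allocated on demand and need not be contiguous, the unused space per request in this sketch is bounded by one partially filled block, rather than by a buffer reserved in advance for the maximum possible output length.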

The project documentation and repository describe support for continuous batching, chunked prefill, speculative decoding, prefix caching, quantization, and multiple forms of distributed inference and serving.[2][4] PyTorch has described vLLM as a high-throughput, memory-efficient inference and serving engine that supports a range of hardware back ends, including NVIDIA and AMD GPUs, Google TPUs, AWS Trainium, and Intel processors.[7][4]
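Of the features listed above, continuous batching is the simplest to show with a toy model. The sketch below is a simplified simulation and not vLLM's scheduler: the function names are hypothetical, each request is reduced to a positive number of output tokens, and prefill, preemption, and memory limits are ignored. It compares the number of decode iterations needed when a batch must run until its longest request finishes against the number needed when free slots are refilled at every iteration.

```python
from collections import deque


def static_batching_steps(output_lengths, max_batch):
    """Count iterations when each batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(output_lengths), max_batch):
        steps += max(output_lengths[start:start + max_batch])
    return steps


def continuous_batching_steps(output_lengths, max_batch):
    """Count iterations when finished requests are replaced at every step."""
    waiting = deque(output_lengths)
    running = []  # remaining tokens for each request currently in the batch
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode iteration generates one token for every running request.
        running = [remaining - 1 for remaining in running]
        # Requests with no tokens left leave the batch immediately.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


lengths = [2, 8, 3, 4]  # output tokens per request (made-up values)
print(static_batching_steps(lengths, max_batch=2))      # 12
print(continuous_batching_steps(lengths, max_batch=2))  # 9
```

In this example the short first request frees its slot after two iterations, so the third request starts while the long second request is still decoding, which is where the saving comes from.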

References

  1. "vLLM - A High-Throughput and Memory-Efficient Inference and Serving Engine for LLMs". UC Berkeley, Sky Computing Lab.
  2. "GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs". GitHub. GitHub, Inc. Retrieved April 22, 2026.
  3. Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph E.; Zhang, Hao; Stoica, Ion (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention". Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles. Retrieved April 22, 2026.
  4. "vLLM". PyTorch. PyTorch Foundation. Retrieved April 22, 2026.
  5. "vLLM full name". GitHub. GitHub, Inc. August 23, 2023. Retrieved April 22, 2026.
  6. Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph E.; Zhang, Hao; Stoica, Ion (September 12, 2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention". arXiv:2309.06180 [cs.LG].
  7. "PyTorch Foundation Welcomes vLLM as a Hosted Project". PyTorch. PyTorch Foundation. May 7, 2025. Retrieved April 22, 2026.
  8. Temkin, Marina (January 22, 2026). "Inference startup Inferact lands $150M to commercialize vLLM". TechCrunch. Retrieved April 22, 2026.
