Examples

vLLM's examples are organized into the following categories:

  • basic/ – Minimal examples for offline inference and online serving.
  • generate/ – Text generation examples, including multimodal models.
  • pooling/ – Examples for embedding, classification, scoring, reward, etc.
  • speech_to_text/ – Speech transcription, translation, and real-time audio examples.
  • features/ – Demonstrations of individual vLLM features: automatic prefix caching, speculative decoding, LoRA, structured outputs, prompt embedding, pause/resume, batch invariance, KV events, data parallelism, and more.
  • reasoning/ – Examples for running reasoning models with vLLM.
  • tool_calling/ – Examples for function/tool calling with vLLM.
  • applications/ – Application examples such as chatbots and RAG (Retrieval-Augmented Generation).
  • rl/ – Reinforcement learning examples.
  • deployment/ – Examples for deploying vLLM in production.
  • ray_serving/ – Scalable serving using Ray.
  • disaggregated/ – Examples for disaggregated serving (separate prefill and decode), including various KV cache connectors (LMCache, Mooncake, FlexKV, P2P NCCL) and failure recovery.
  • observability/ – Metrics, logging, tracing (OpenTelemetry), and dashboards (Grafana, Perses).
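To give a flavor of what the basic/ examples look like, here is a minimal sketch of offline inference with vLLM's `LLM` and `SamplingParams` API. The prompts, sampling settings, and the `facebook/opt-125m` model id are placeholder assumptions for illustration; any Hugging Face model supported by vLLM can be substituted.

```python
def run_offline_inference() -> None:
    # Imported lazily so the sketch can be read without vLLM installed.
    from vllm import LLM, SamplingParams

    # Placeholder prompts; batch as many as you like.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    # SamplingParams controls decoding (temperature, nucleus sampling, ...).
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # facebook/opt-125m is just a small placeholder model for this sketch.
    llm = LLM(model="facebook/opt-125m")

    # generate() runs batched offline inference over all prompts at once.
    for output in llm.generate(prompts, sampling_params):
        print(output.prompt, "->", output.outputs[0].text)


if __name__ == "__main__":
    run_offline_inference()
```

The online-serving counterpart in basic/ exposes the same models behind an OpenAI-compatible HTTP endpoint instead of an in-process `LLM` object.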