Coles

vLLM Serving: High‑Throughput LLM APIs with PagedAttention and KV Cache Tuning

vLLM Serving: High‑Throughput LLM APIs with PagedAttention and KV Cache Tuning in Brampton, ON

Current price: $13.64
Size: Kobo eBook

*Product information and pricing may vary - to confirm current pricing, availability, shipping, and return information please contact Coles. In the event of a pricing discrepancy, the retailer's price will apply.
"vLLM Serving: High‑Throughput LLM APIs with PagedAttention and KV Cache Tuning"

Built for experienced ML systems engineers, platform architects, and performance-minded practitioners, this book is a deep technical guide to serving large language models with vLLM at production scale. Rather than treating inference as a black box, it explains the real control surfaces behind throughput, latency, and memory efficiency. Readers who already know LLM fundamentals but want to reason rigorously about serving behavior will find an internals-first, systems-oriented treatment.

At the core of the book are the mechanisms that make vLLM distinctive: PagedAttention, continuous batching, KV cache design, and scheduler-driven execution. You will learn how request flow, cache allocation, sequence length, prefix reuse, quantized KV storage, and offloading strategies interact to determine concurrency limits and user-visible performance. The book also covers OpenAI-compatible API serving, streaming semantics, realistic benchmarking, and disciplined troubleshooting, so readers can move from conceptual understanding to evidence-based tuning and operational decisions.

The emphasis throughout is on advanced mental models, trade-offs, and production diagnostics rather than introductory walkthroughs. This is a focused guide for readers comfortable with GPU inference, transformer decoding, and performance measurement who want a precise framework for designing, tuning, and operating high-throughput LLM APIs with confidence.
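As a rough illustration of how cache allocation, sequence length, and quantized KV storage can bound concurrency, here is a back-of-the-envelope sketch. It is not taken from the book. The model shape, the 40 GiB cache budget, and the helper names are hypothetical, and the 16-token block size is an assumption about a typical paged KV cache setup. It also assumes every sequence grows to its full length and ignores prefix sharing and offloading.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer shape; the values used below are illustrative only."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    kv_dtype_bytes: int  # 2 for fp16/bf16, 1 for an 8-bit KV cache


def kv_bytes_per_token(m: ModelShape) -> int:
    # Each layer stores one key and one value vector per KV head (hence the factor 2).
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.kv_dtype_bytes


def max_concurrent_sequences(m: ModelShape, cache_budget_bytes: int,
                             seq_len: int, block_size: int = 16) -> int:
    # A paged cache hands out memory in fixed-size blocks of `block_size` tokens.
    bytes_per_block = kv_bytes_per_token(m) * block_size
    total_blocks = cache_budget_bytes // bytes_per_block
    # A sequence occupies whole blocks, so its last block may be partly empty.
    blocks_per_seq = math.ceil(seq_len / block_size)
    return total_blocks // blocks_per_seq


GIB = 1024 ** 3
fp16_model = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128, kv_dtype_bytes=2)
int8_model = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128, kv_dtype_bytes=1)

print(kv_bytes_per_token(fp16_model))                          # 131072 bytes (128 KiB)
print(max_concurrent_sequences(fp16_model, 40 * GIB, 4096))    # 80
print(max_concurrent_sequences(int8_model, 40 * GIB, 4096))    # 160
```

Under these assumptions, halving the bytes stored per KV element doubles the number of full-length sequences that fit, while one token past a block boundary costs each sequence a whole extra block.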

More About Coles at Bramalea City Centre

Making Connections. Creating Experiences. We exist to add a little joy to our customers’ lives, each time they interact with us.

Find Coles at Bramalea City Centre in Brampton, ON
