LinkedIn says it has over 50 generative AI use cases in production powered by vLLM, the open-source library for LLM inference.

vLLM was first developed at the Sky Computing Lab at UC Berkeley. It is now a widely adopted, open-source (Apache 2.0) community project.

Engineers at Microsoft-owned platform LinkedIn described it as uniquely useful for “high-throughput, low-latency serving for LLMs while keeping GPU memory usage efficient via techniques like PagedAttention.”

(PagedAttention is a technique for reducing the memory footprint of the key-value (KV) cache that LLMs build up during inference. Memory bottlenecks are among the biggest challenges for those deploying AI applications, and vLLM was built explicitly to tackle this problem: it manages the KV cache in small blocks and allows cached keys and values to be shared flexibly across requests.)
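For readers who want to try it, the sketch below shows minimal offline inference with vLLM in Python; the model name and sampling settings are illustrative placeholders, and PagedAttention is applied automatically by the engine rather than configured explicitly.

```python
# Minimal vLLM offline-inference sketch. The model is a small placeholder;
# PagedAttention manages the KV cache internally, so no extra setup is needed.
from vllm import LLM, SamplingParams

# gpu_memory_utilization caps how much GPU memory vLLM pre-allocates for
# model weights plus the paged KV cache (0.9 = 90% of the device).
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.9)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Summarize this candidate profile:"], params)

for out in outputs:
    print(out.outputs[0].text)
```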

In a multi-author engineering blog published on August 26, senior software engineer Yingjiao Zhai said that LinkedIn now uses vLLM across “thousands of hosts across our platform, enabling everything from search enhancements to AI-powered recommendations…” for over 50 features.

See also: Not the CIO’s job? Getting your organisation AI agent ready – “the promise and the peril”

Its AI hiring assistant was one of several examples she shared. Deploying it, she noted, involves generating a large number of output tokens (“nearly 1000 on average per candidate”) – a highly computationally demanding process.

But vLLM ensures the process reuses computation for the shared portion of the input instead of recalculating it for every request. This dramatically cuts down “prefill latency and GPU load, which is critical for our inputs…”
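A hedged sketch of that idea using vLLM's automatic prefix caching follows; the prompts, model, and settings are illustrative and not LinkedIn's serving code.

```python
# Sketch: vLLM's automatic prefix caching reuses the KV cache computed for a
# shared prompt prefix across requests, so the prefill work is done once.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

shared_instructions = "You are a hiring assistant. Evaluate the candidate below.\n"
candidates = [
    "Candidate A: five years of backend experience...",
    "Candidate B: ML engineer focused on distributed systems...",
]

# Every prompt starts with the same instructions, so the cached prefix is
# reused instead of being recomputed per request, cutting prefill latency.
prompts = [shared_instructions + c for c in candidates]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))

for out in outputs:
    print(out.outputs[0].text)
```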

Other key takeaways from LinkedIn’s vLLM use?

Its engineers said that they re-architected their serving stack to be modular and OpenAI-compatible. This decoupled their custom gRPC server from the vLLM engine, dramatically reducing maintenance burden and allowing for the easy addition of new features like image generation.

(vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and several others, listed here, so models can be served and queried with any standard HTTP client. A collection of FAQs for engineers can be found here, and broader documentation here.)
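As a rough illustration of that OpenAI compatibility, a vLLM server started with the vllm serve command can be queried with the standard OpenAI Python client; the endpoint, model name, and prompt below are placeholders, not LinkedIn's setup.

```python
# Assumes a server was started separately, e.g.:
#   vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 8000
from openai import OpenAI

# Point the stock OpenAI client at the local vLLM endpoint; no real key needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```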

The team also decided from the start to expose key vLLM parameters to internal customers, allowing them to tune performance for their specific applications without modifying the core engine code, Zhai explained. 
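What that might look like in practice – purely hypothetical, not LinkedIn's code – is a thin wrapper that whitelists the sampling knobs internal teams may set and maps them onto vLLM's SamplingParams, leaving the engine itself untouched.

```python
# Hypothetical sketch of exposing a whitelisted set of vLLM tuning parameters
# to internal clients. The wrapper, field names, and defaults are illustrative.
from dataclasses import dataclass
from vllm import SamplingParams

@dataclass
class ClientTuning:
    temperature: float = 0.7
    top_p: float = 1.0
    max_tokens: int = 256

def to_sampling_params(tuning: ClientTuning) -> SamplingParams:
    # Only the knobs deliberately exposed to clients are forwarded;
    # everything else keeps the engine's defaults.
    return SamplingParams(
        temperature=tuning.temperature,
        top_p=tuning.top_p,
        max_tokens=tuning.max_tokens,
    )
```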

Using vLLM? Want to share your use case or experiences? Get in touch.
