“I don’t think we’ve ever seen a platform shift like this where the hardware architectures, the software architectures, and the model architectures are all changing at the same time”, Mark Collier tells The Stack.

The newly appointed Executive Director of the PyTorch Foundation has a lot on his plate given the open source community’s critical role in building AI inference infrastructure – for which there is rampant demand.

“Nobody wants to ship a new GPU accelerator without support in PyTorch and vLLM”, he says – but this rapid development puts “pressure on us to coordinate and collaborate across the whole world…

"We’ve got to keep raising our game.”

Scaling PyTorch

Collier was sitting down to chat with The Stack at the recent KubeCon Europe conference. He's not daunted by the scale of activity, having previously founded and scaled OpenStack and the OpenInfra Foundation.

As Executive Director, he will lead the operational and business execution of the Foundation, working closely with the Governing Board.

Collier’s predecessor Matt White moved to the CTO role at the foundation earlier this year, and also took on a new role for the broader Linux Foundation: CTO of AI.

“For me to scale PyTorch, I needed resources,” Collier says.

“The CTO role is much more focused on what’s next, what are the great projects out there that we should bring in.”

To answer that question, White is starting with key areas and issues PyTorch and LF are hoping to address and building up from there.

He's also going to be working closely with the seven working groups set up at the new Agentic AI Foundation – which are bringing in 30 to 40 people across highly regulated industries, government organisations, academia, enterprises and big tech.

“We’re doing everything from security, to privacy, to identity and trust, to governance risk, and regulation – working with industry bodies to help shepherd responsible AI or agentic AI usage.”

The full AI stack

The overarching theme, as White and Collier both made clear, is to bring projects from each layer of the AI stack under the PyTorch banner.

While open source projects, by their nature, can easily be used together, White says bringing them under one foundation “provides enough confidence for other organisations and big tech to start investing.”

See also: Hugging Face's Safetensors, Meta's Helion join PyTorch Foundation

“The framework we apply to projects makes them more durable, increases trust in the industry, and increases investment, ultimately making them more suitable for adoption.”

Foundations are also "the ultimate facilitators", says Collier, thanks to their ability to have an overview of open source projects and connect people and teams working on similar problems.

"It turns out that when you're talking about these hard technical problems, almost everyone's trying to solve the same set of problems."

Cross-collaboration

The idea of cross-project collaboration is one Collier advocated for on the KubeCon mainstage, telling the audience "it's on all of us to not be stuck in siloed thinking, and really look at what users actually need."

"I think we should approach this as almost like we're co-designing or co-evolving all these open source tools," he told his CNCF counterpart Jonathan Bryce, encouraging project maintainers to "think across communities, think globally" to move at pace on project innovation.

Expanding on his comments, he tells The Stack that the open source projects used by a company will never all come from one foundation, citing the use of tools like PyTorch and vLLM on Kubernetes infrastructure and Linux-based machines.

"We need to be thinking about [progress] from the lens of what problems need to get solved, and how do we solve them in a way that involves co-developing these different components [within a] bigger frame of reference of 'what are we running this on?', and 'how are those projects participating?'"

The profitability misnomer

White says he is also working to bring in researchers and students.

At the Linux Foundation, he's building an academic programme with three objectives: to ensure PyTorch Foundation tools are taught in academia, teach grad students open source best practice, and support projects that could one day end up at the foundation.

Following success with webinars, White says he is developing a short course on "how to do open source the right way":

What does that mean? "How to grow community, how to amplify your project, picking the right licenses... the difference between open weights and open models, these kinds of nuances that are not really well known."

The future of AI

While open source AI models like DeepSeek and Llama have played a significant role in the development of LLMs, there's no denying proprietary models from Google, OpenAI and Anthropic are leading the way.

But at KubeCon, there's an excitement around the role open source is playing in the growth of inference and agentic infrastructure.

On stage, Red Hat’s AI CTO Brian Stevens said the agentic era was “one of the first times I’ve seen open source truly lead a wave… from an infrastructure, platform perspective, there's no apologies for what’s being built here, it’s leading.”

Collier is keen to emphasise PyTorch's role in the "training era" of the early LLM boom, but says a focus on inference and scaling models is putting open source at the very centre of the AI industry.

"Some of these scaling problems... actually can't be solved without open source," he says, citing the example of bringing GPUs online, managing them reliably for model training, and then using the same pool for inference later on.

"I think these problems are technically difficult enough that everybody is better off if they share the load of trying to solve it, but also I'm not sure you could even solve them in an entirely closed manner."

"Sort of the default these days is 'we're going to solve challenges in the open', so we're not having to convince people to go open source."

ETT% and CI progress

The pressure is on for the Foundation to support its members – and also ensure they are contributing back as widely as they can.

Meta is one large example. As a Meta team noted in a recent blog, those training and serving large AI models "face aggressive ROI targets under tight compute capacity. As workloads scale, improving infrastructure effectiveness gets harder because end-to-end runtime increasingly includes overheads beyond 'real training' (initialization, orchestration, checkpointing, retries, failures, and recovery)."

Meta worked closely with the foundation to open-source and share building blocks like TorchRec and PyTorch 2 to improve Effective Training Time (ETT%), but there is a lot more to do.

The foundation is clear about some near-term priorities on the organisational side too. It cites these as "diversifying and optimizing CI architecture and costs; onboarding additional project CI workloads where shared accelerator access unlocks faster iteration;" and deepening community initiatives like mentorship for "stronger global enablement."

"As the scope grows," the foundation said when announcing Collier's appointment, "there is a straightforward operational reality: leadership capacity must scale so that organizational throughput, not leadership bandwidth, sets our pace."
