With $1.3 billion MosaicML acquisition, Databricks sets itself up for open source generative AI fight

Databricks has agreed the deal just weeks after releasing Dolly 2.0, a 12 billion parameter open source language model, as a battle begins for dominance of open source generative AI workflows and platforms.

Databricks to buy MosaicML

Databricks has agreed to buy MosaicML, a generative AI specialist that just this month released MPT-30B, an open source LLM licensed for commercial use that it claims outperforms the original GPT-3. The release was a statement of intent in an early battle to own the open source generative AI space.

The deal comes after ChatGPT became the fastest ever application to hit 100 million monthly users; it now sees some one billion visitors per month.

Originally an open source non-profit, OpenAI is now working to lock enterprise clients into paid subscriptions to its API, charging per token: $0.06 per 1,000 prompt tokens and $0.12 per 1,000 sampled tokens for the GPT-4 model with a 32k context length, according to updated June 2023 prices.
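To put those per-token prices in context, here is a minimal Python sketch of the arithmetic. It uses only the two rates quoted above; the function name `request_cost` and the example workload (call count and token counts) are hypothetical illustrations, not figures from OpenAI or Databricks.

```python
# Prices quoted in the article for GPT-4 with 32k context (June 2023),
# in US dollars per 1,000 tokens.
PROMPT_PRICE_PER_1K = 0.06
SAMPLED_PRICE_PER_1K = 0.12


def request_cost(prompt_tokens: int, sampled_tokens: int) -> float:
    """Estimate the USD cost of a single API call at the quoted rates."""
    prompt_cost = prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
    sampled_cost = sampled_tokens / 1000 * SAMPLED_PRICE_PER_1K
    return prompt_cost + sampled_cost


if __name__ == "__main__":
    # Hypothetical workload (an assumption for illustration only):
    # 10,000 calls, each with 2,000 prompt tokens and 500 sampled tokens.
    per_call = request_cost(2000, 500)
    print(f"per call: ${per_call:.2f}")
    print(f"10,000 calls: ${per_call * 10_000:,.2f}")
```

Under that assumed workload, each call costs about $0.18, or roughly $1,800 for 10,000 calls, which illustrates why per-token pricing matters to enterprises weighing hosted APIs against self-hosted models.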

But questions abound around the transparency of its approach, in a world in which legislation like GDPR demands AI "explainability" and enterprises are also increasingly looking to create LLMs on their own data.

Databricks and MosaicML: "An incredible opportunity"

“Every organisation should be able to benefit from the AI revolution… Databricks and MosaicML have an incredible opportunity to democratise AI and make the Lakehouse the best place to build generative AI and LLMs,” said Databricks CEO Ali Ghodsi in a statement shared June 26.

Databricks, founded by the creators of Apache Spark, offers a wide set of open source tools, including Delta Lake, to build “Lakehouses” for enterprise data warehousing and data science workloads. A machine learning pioneer and creator of the hugely popular MLflow toolkit, it has been pushing further into the open source LLM space itself this year.

In April, for example, Databricks released Dolly 2.0, a 12 billion parameter language model based on the EleutherAI pythia family and fine-tuned exclusively on a “new, high-quality human-generated instruction-following dataset, crowdsourced among Databricks employees”. Its open source MLflow platform, which helps connect arbitrary ML code and tools as part of larger workflows, sees over 10 million downloads monthly.

The original Dolly was trained on the Databricks Machine Learning Platform using a two-year-old open source model (GPT-J) “subjected to just 30 minutes of fine tuning on a focused corpus of 50k records (Stanford Alpaca).”

The deal will give customers “a simple, fast way to retain control, security, and ownership over their valuable data without high costs”, Databricks said today, adding that the combination of the Databricks and MosaicML platforms will ensure that, “combined with near linear scaling of resources, multi-billion-parameter models can be trained in hours, not days… training and using LLMs will cost thousands of dollars, not millions.”

The entire MosaicML team, including its industry-leading research team, is expected to join Databricks after the transaction closes, and its platform will be “supported, scaled, and integrated over time to offer customers a seamless unified platform where they can build, own and secure their generative AI models”, Databricks said on Monday.

MosaicML was valued at $136 million in its last funding round, meaning its valuation has leaped almost ten-fold in less than two years, as work to clean up enterprise datasets and excitement about the potential for generative AI continue.

Dr Peter Van Der Putten, Director AI Lab at Pegasystems, put it to The Stack recently: “As LLaMA is restricted to non-commercial use, the same goes for these other models, but there have been other more open releases such as OpenChatKit [a set of models trained on the OIG-43M training dataset; a collaboration between Together, LAION, and] and Dolly, with the latter only a very thin training veneer on GPT-J.

"With a small data science team and some hundreds to thousands in compute you can self-host and finetune models like these towards your use cases as for many, pretraining from scratch is still too costly," he said. "It may be worth it if for example your data is sensitive and/or proprietary, as you won’t have to call external services. That said, for quite a few use cases it will be acceptable to share prompts with a central service, and cloud vendors will allow clients to finetune base foundation models, without sharing the fine-tuning data with the base models. And to simplify further, many use cases can be achieved through clever automated prompt engineering without any fine-tuning."