Google’s massively multimodal Gemini 1.5 Pro preview shows a giant warming to its task

Google has unveiled a new generative AI model, Gemini 1.5, that expands the “context window” from which models can incorporate information to a massive million tokens. (This capability is currently limited to select developers and enterprise customers in private preview, as the company works on what its AI leaders said were “optimizations to improve latency, reduce computational requirements and enhance the user experience.”)

This makes the sparse mixture-of-experts Transformer-based model capable of (per Burak Gokturk, Google VP Cloud AI) analysing an hour of video, 11 hours of audio, 700,000 words, or 30,000 lines of code at once – alternatively, it could be given 22 hours of audio or three hours of video at one frame-per-second to analyse in a single upload. Users, in short, will be able to query an entire code repository or entire film in one go.

Gemini 1.5 Pro, the first model of this new family released for early testing, “performs at a similar level to 1.0 Ultra, our largest model to date. It also introduces a breakthrough experimental feature in long-context understanding,” Google DeepMind CEO Demis Hassabis said.

The company did not give a date for general availability of Gemini 1.5 Pro. It has also promised to slash the prices of its 1.0 Pro model with “today’s stable version… priced 50% less for text inputs and 25% less for outputs… pay-as-you-go plans for AI Studio are coming soon,” the company added.

Gemini 1.5 Pro will be available to try out “soon” in Google AI Studio (a browser-based IDE for prototyping with genAI models) and will come with a more modest 128,000 token context window by default – the same as OpenAI’s GPT-4 Turbo, a flurry of February 15 blogs from Google revealed.

So what’s actually GA now from Google? Well, for those (like most enterprise users) still experimenting with different foundation model outputs and general capabilities, Gemini 1.0 Pro, its general-purpose AI model, is now “generally available to all Vertex AI customers… Starting today, any developer can start building with Gemini Pro in production.”

Google Gemini 1.5: "Surprising new capabilities"

Google Labs’ Jaclyn Konzelmann and Wiktor Gworek said in one of Google’s many blogs: “We’ve [also] added the ability for developers to upload multiple files, like PDFs, and ask questions in Google AI Studio.”

The AI product managers added that the large token context window meant “we’ve been able to load in over 700,000 words of text in one go.”
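For a rough sense of what that means in practice, the sketch below estimates whether a body of text would fit in a given context window. It is a back-of-the-envelope illustration only: it assumes tokens scale linearly with whitespace-separated words at the ratio implied by Google’s own figures (about 700,000 words per million tokens), and the function names are hypothetical, not part of any Google API. Real token counts depend on the model’s tokenizer and on the content.

```python
# Back-of-the-envelope context window check.
# Assumption: tokens scale linearly with words at the ratio implied by
# the reported figures (700,000 words in a 1,000,000 token window).
LARGE_WINDOW_TOKENS = 1_000_000    # private preview window reported by Google
DEFAULT_WINDOW_TOKENS = 128_000    # default window reported for AI Studio
REPORTED_WORDS = 700_000           # words Google says it loaded in one go

TOKENS_PER_WORD = LARGE_WINDOW_TOKENS / REPORTED_WORDS  # roughly 1.43


def estimate_tokens(text: str) -> int:
    """Estimate the token count from a whitespace word count (heuristic)."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def fits_in_window(text: str, window_tokens: int = LARGE_WINDOW_TOKENS) -> bool:
    """Return True if the estimated token count is within the window."""
    return estimate_tokens(text) <= window_tokens


if __name__ == "__main__":
    sample = "word " * 200_000  # stand-in for a long document
    print(estimate_tokens(sample))
    print(fits_in_window(sample, DEFAULT_WINDOW_TOKENS))
    print(fits_in_window(sample, LARGE_WINDOW_TOKENS))
```

By this estimate a 200,000-word document would overflow the default 128,000 token window but sit comfortably inside the million-token one; anyone budgeting for real workloads should count tokens with the provider’s own tooling rather than rely on a word-count ratio.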

A separate 58-page paper from the Google Gemini team claimed Gemini 1.5 exhibited “continued improvement in predictive performance, near perfect recall (>99%) on synthetic retrieval tasks, and a host of surprising new capabilities like in-context learning from entire long documents.”

That paper added that the model “greatly surpasses Gemini 1.0 Pro, performing better on the vast majority of benchmarks, increasing the margin in particular for Math, Science and Reasoning (+28.9%), Multilinguality (+22.3%), Video Understanding (+11.2%) and Code (+8.9%)” – progress it attributed to a “host of improvements made across nearly the entire model stack (architecture, data, optimization and systems.)”

These are black-box changes: Google did not publish any details on the underlying mechanisms driving the reported breakthrough.

They do, however, put a fresh spotlight on the dizzying pace of change in the generative AI space with a genuine revolution clearly happening in front of us, even if downstream at the enterprise level the toolchains, cultural processes and datasets have some serious catching up to do.

The combination of a large context window and the ability to support a mix of audio, visual, text, and code inputs in the same input sequence could be hugely powerful, however. As Google noted in the primary Gemini 1.5 release paper, “understanding the limits of these capabilities and studying their exciting capabilities and applications remains an area of continued research exploration.”

Jeff Dean, Chief Scientist, Google DeepMind and Google Research said on X: “Coming soon, we plan to introduce pricing tiers that start at the standard 128,000 context window and scale up to 1 million tokens, as we improve the model. Early testers can try the 1 million token context window at no cost during the testing period. We’re excited to see what developer’s creativity unlocks with a very long context window…”

He added: “The model’s ability to quickly ingest large code bases and answer complex questions is pretty remarkable. three.js is a 3D Javascript library of about 100,000 lines of code, examples, documentation, etc.”

Dean posted on X: “With this codebase in context, the system can help the user understand the code, and can make modifications to complex demonstrations based on high-level human specifications (‘show me some code to add a slider to control the speed of the animation. use that kind of GUI the other demos have’), or pinpoint the exact part of the codebase to change the height of some generated terrain in a different demo…”
