For the past three years, AI has been largely fueled by a finite resource: the human-written internet. Large Language Models (LLMs) have devoured trillions of words from digitized books, Reddit threads, and news archives.
Models have now run headfirst into the "data wall" – the point at which they run out of quality human text to scrape – leaving organizations training LLMs on ever larger proportions of synthetic data, a potentially perilous approach.
By contrast, the physical world offers a largely untapped firehose of data: every pressure change in a hydraulic press, every temperature shift in a cold-chain truck, and every millisecond of CPU fluctuation in a data center, to give just three examples, represents "time series data" that can be pulled at varying degrees of granularity or “resolution” into specialized databases.
“Models of the real world”
"The beauty of the physical world is there’s no shortage of data. It’s an infinite proposition," says Evan Kaplan, CEO of InfluxData, creator of the time series database, InfluxDB. “If you can capture data every 10 milliseconds, you can begin to build models of the real world that are very different. That's the foundation of most sophisticated physical AI models.”
Kaplan sat down to chat alongside Brad Bebee, who leads AWS’s Amazon ElastiCache, Neptune, Timestream, and managed InfluxData services; the two say they see an ongoing shift from LLMs summarizing the past to "physical AI" that renders a high-resolution picture of what's happening in the present.
“High-velocity, high-resolution”
By instrumenting reality at a higher resolution (moving from checking a sensor every hour to every minute or second), companies are building training sets for machines that don't just chatter, but predict and self-heal.
As Kaplan emphasizes, the digital plumbing of these systems is specialized: “[When it comes to] high resolution instrumentation of the real world, the storage matters, right? You can't keep all of that stuff in memory, and you can't build your models in memory… the future of this whole thing is operational, high velocity, high resolution [data] that builds intelligence…”
(Resolution, in this context, he explains, is “very, very small time slices.”)
NinjaOne: Pre-empting the help desk
It’s not just operational technology plants deploying time series data. Take NinjaOne, an IT management platform that unifies and simplifies IT operations for 35,000 customers in 140+ countries.
The platform oversees a sprawling fleet of laptops, servers, and iPads that funnel telemetry from across the globe to its cloud environment.
Historically, managing such a fleet was reactive. A laptop would crash, a blue screen would appear, and a frustrated employee would submit a ticket to the help desk. The technician would then look at "low-resolution" data – snapshots of memory or CPU usage – to guess what went wrong.
"We historically stored that information with limited fidelity," explains Joel Carusone, SVP, Data and AI, at NinjaOne. Under their old relational database systems, storing the "heartbeat" of millions of devices at high frequency was cost-prohibitive.
Amazon Timestream for InfluxDB
NinjaOne has moved the needle on resolution by migrating to Amazon Timestream for InfluxDB, a fully managed, purpose-built service for running InfluxDB time series workloads on AWS.
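In practice, writing telemetry to the managed service looks much like writing to any other InfluxDB instance. A minimal sketch using the open-source InfluxDB Python client is shown below; the endpoint URL, token, bucket, and field names are illustrative assumptions and do not reflect NinjaOne's actual setup:

```python
# Hypothetical sketch: pushing one endpoint's telemetry sample into an
# InfluxDB instance (e.g. Timestream for InfluxDB) with the Python client.
# The URL, token, org, bucket, and field names below are made up.
from datetime import datetime, timezone

from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(
    url="https://example-influxdb.example.com:8086",  # assumed endpoint
    token="EXAMPLE_TOKEN",
    org="example-org",
)
write_api = client.write_api(write_options=SYNCHRONOUS)

# One telemetry sample for one device: tags identify the device,
# fields carry the measured values, the timestamp carries the precision.
point = (
    Point("endpoint_telemetry")
    .tag("device_id", "laptop-0042")
    .tag("site", "eu-west")
    .field("cpu_percent", 87.5)
    .field("mem_used_mb", 14322.0)
    .time(datetime.now(timezone.utc), WritePrecision.NS)
)
write_api.write(bucket="device-metrics", record=point)
```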
Every 120 seconds, NinjaOne’s agents report telemetry across its endpoints, creating what can be thought of as a high-definition movie of an entire device fleet’s health rather than a blurry Polaroid. This granularity, Carusone says, has allowed NinjaOne to build a "proactive" AI loop.
Instead of waiting for a crash, their models identify meaningful changes in performance: "We can proactively inform a technician: 'Hey, you’ve got a device you likely want to take a look at today,'" says Carusone.
"You may not have gotten a ticket from that user yet, but it’s coming."
Investing in HD reality
Improving model performance here is a quest for better resolution.
A sensor that reports temperature once an hour provides a grainy, low-fidelity picture; a sensor that reports every millisecond provides a high-definition stream.
"To use a metaphor," says Kaplan, "we started with cameras at two megapixels. Now, we’re at some absurd resolution. The better the resolution of the real world, the more predictive our models can become."
In an industrial context, high resolution is the difference between knowing a machine failed and knowing the exact micro-vibration that preceded the failure by ten seconds. InfluxDB 3, InfluxData’s latest engine, its CEO tells The Stack, now handles nanosecond resolution; that’s a level of detail previously reserved for quantum physics or high-frequency stock trading.
Compute and storage decoupled
However, high resolution creates a massive storage problem. If a company instruments every asset in a factory at millisecond intervals, the data volume quickly becomes a "tsunami" that would bankrupt a firm using traditional "hot" storage (fast, expensive disks). That’s just not tenable.
A breakthrough in InfluxDB 3 addresses this by decoupling "compute" from "storage."
By allowing the database to write directly to "cold" object storage, such as Amazon S3, companies can now afford to keep years of high-resolution history.
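The economics are easy to sanity-check with back-of-envelope arithmetic. All of the figures below are assumptions chosen for illustration, but they show why object storage is what makes long retention viable:

```python
# Back-of-envelope arithmetic (all inputs are illustrative assumptions).
sensors = 10_000                 # instrumented points in one factory
samples_per_second = 1_000       # millisecond-resolution sampling
bytes_per_sample = 20            # rough guess: timestamp + value + tag overhead

bytes_per_year = sensors * samples_per_second * bytes_per_sample * 60 * 60 * 24 * 365
terabytes_per_year = bytes_per_year / 1e12
print(f"~{terabytes_per_year:,.0f} TB per year")   # roughly 6,307 TB per year

# At that volume, the price gap between provisioned "hot" disk and object
# storage such as Amazon S3 decides whether years of high-resolution
# history are kept or thrown away.
```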
"It opens up new use cases to take a look at all of that time series data and what it is telling you over time about your systems," says AWS’s Brad Bebee.
“Previously it would be a financial decision about how much data I wanted to retain. By being able to separate the compute and storage, it really opens up new use cases to look at all of the time series data, and what is it telling you, over time, about your systems?
He adds: “It will be really interesting to see what customers do [now that they can] cost-effectively look back over longer periods of time and understand systematic behaviors.”
An infrastructure alliance: AWS and InfluxData
The partnership between AWS and InfluxData signals another detente in a historic cold war between cloud giants and open-source startups.
For a long time, hyperscalers faced allegations of "stripmining": taking open-source code, wrapping it in their own branding, and keeping the profits, leaving the original creators unsupported and in the dust.
AWS offering an InfluxDB service underscores a partnership Kaplan describes as strategic and mutually beneficial.
He attributes much of its success to Bebee and his team, who, according to Kaplan, have "consistently delivered on their commitments."
Open-source flexibility + a global utility.
The "relationship of trust," as he puts it, reflects a shift toward deeper collaboration, with AWS and InfluxData combining strengths to help enterprises more effectively manage and extract value from time series data.
For AWS, offering a "managed" version of InfluxData satisfies a specific customer demand: the desire for open-source flexibility combined with the reliability of a global utility. "Amazon has always been an active contributor to open source, but our approach has evolved," Bebee says.
By partnering directly with the creators, AWS aims to ensure that the version running on its cloud stays optimized and current with upstream features.
For InfluxData, the benefit is the ability to exist wherever the customer’s data lives. Being in the AWS ecosystem means its engine can sit right next to other heavy-duty AI tools like Amazon SageMaker. (This "plumbing" is essential for the AI era, Bebee notes; training a model requires moving massive amounts of data from the database to the processing engine.)
If those two things sit in the same cloud ecosystem, the "friction or impedance," as Kaplan puts it, disappears. The result is a seamless pipeline where raw sensor data is ingested, stored cheaply on S3, and fed into AI models that can trigger operational responses in milliseconds.
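Reduced to its simplest form, that pipeline is: query a window of sensor history out of InfluxDB, hand it to a model, and act on the score. The sketch below uses the open-source InfluxDB Python client and scikit-learn; the Flux query, bucket, machine names, and the choice of anomaly model are stand-ins rather than a documented integration:

```python
import pandas as pd
from influxdb_client import InfluxDBClient
from sklearn.ensemble import IsolationForest

client = InfluxDBClient(url="https://example-influxdb:8086", token="EXAMPLE_TOKEN", org="example-org")

# Pull the last 24 hours of vibration readings for one machine (illustrative Flux query).
flux = '''
from(bucket: "factory-metrics")
  |> range(start: -24h)
  |> filter(fn: (r) => r._measurement == "vibration" and r.machine_id == "press-7")
'''
df = client.query_api().query_data_frame(flux)
if isinstance(df, list):                 # multiple result tables come back as a list
    df = pd.concat(df, ignore_index=True)

# Fit a simple anomaly detector on the window and score the most recent points.
model = IsolationForest(contamination=0.01, random_state=0)
scores = model.fit_predict(df[["_value"]])   # -1 marks outliers

if (scores[-10:] == -1).any():
    # In a real deployment this is where the "operational response" fires:
    # open a ticket, throttle the machine, page an engineer, and so on.
    print("recent readings look anomalous for press-7")
```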
Towards the "self-driving" enterprise
The ultimate goal of this data architecture is to close the "operational loop." In the world of Generative AI, the loop often ends with a chatbot providing a recommendation to a human. In the world of time series AI, the loop may end with the system taking action on its own, becoming more autonomous.
"A canonical example of that is something like a self-driving car," Kaplan notes. "You run that loop of instrumenting, looking for anomalies... you run it a billion and a half times, and you get a car that can navigate the streets."
The tech leaders argue that the same logic is now being applied to the enterprise. To get there, the industry is moving toward "data portability."
One of the key roadmap initiatives, for example, is compatibility with Apache Iceberg, an open table format that allows different analytics tools to work on the same data simultaneously. This means a company’s time series data isn't trapped in a silo; it can be queried by a Python script for a custom machine learning model one second, and analyzed by a BI tool the next.
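If that compatibility lands as described, pulling the same time series table into a plain Python script could look roughly like the following; the catalog, namespace, and column names are invented for illustration and do not describe a shipped product:

```python
# Hypothetical sketch: reading a time series table exposed as Apache Iceberg
# from Python with pyiceberg. Catalog, namespace, and column names are invented.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("analytics")                     # configured in .pyiceberg.yaml
table = catalog.load_table("telemetry.device_metrics")  # same table a BI tool can query

# Scan a slice of the data into pandas for a custom ML workflow.
df = table.scan(row_filter="device_id = 'laptop-0042'").to_pandas()
print(df[["time", "cpu_percent"]].tail())
```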
The first inning of physical AI
Despite the rapid advances in cloud storage and database resolution, both Kaplan and Bebee agree that we are still in the "first batter, first inning" of the physical AI revolution. While the digital world may be feeling the constraints of a finite internet, the physical world remains untapped.
As Kaplan concludes: “If you're instrumenting the physical world, you're laying the foundation for intelligence – for AI. The ability to instrument the physical world at high resolution is where all of this is going.
“Time series is really coming to the forefront: it’s now for anything you can attach a sensor to; whether it's a virtual sensor like in software infrastructure, or a physical sensor, in the real world.”
Both InfluxDB 3 Core (open source) and InfluxDB 3 Enterprise are available as fully managed services on Amazon Timestream for InfluxDB.
Delivered in partnership with InfluxDB and AWS.