Running efficient data pipelines for a large online retailer is no mean feat.
Zalando sells some €3 billion ($3.4 billion) of fashion goods every quarter.
Its engineers bring together multiple streams of data about pricing, stock, sponsored products and more for its website, to create a “unified, enriched view of what a customer sees on the site when browsing…”
But in a recent blog, engineer Maryna Kryvko said what started as an “elegant, declarative solution” powered by a managed Apache Flink service on AWS ultimately “started crumbling under the weight of its own state…”
(Apache Flink originated from a university research project called "Stratosphere," based at the Technische Universität Berlin in 2009. It has since turned into a widely used data systems backbone, connecting databases, data lakes, message queues and more; it has been forked over 13,000 times and users include Apple, Netflix, Shopify, and Splunk.)
What used to look like a “magic” system powered by Apache Flink, its Table API and SQL (“you write a simple join statement, and the query optimizer handles the heavy lifting”) started costing “thousands of dollars in AWS bills and crashing [its] clusters every time a snapshot is triggered…” she said.
Ouch. Kryvko, a senior software engineer at the German retailer, sketched out tidily in a blog this month some of the compromises and fixes you have to make on the fly as you start to become a victim of your own success.

The issue started with how Flink SQL handles joins in this data “blender.”
In Flink 1.20 (the version Zalando was using, as it was the only one available as a managed service on AWS), when you join Table A to Table B, and then join that result to Table C, Flink treats these as isolated steps.
As Kryvko wrote: “In Flink 1.20, each join operator is a strictly independent unit. Because Flink must account for late-arrival data and potential updates, it must maintain data integrity by keeping every record in its internal state (RocksDB). When you chain four joins together, you aren't just adding state; you are multiplying it…” This state amplification led Zalando to a “staggering 235–245GB of state per application,” she explained.
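To see why chaining joins multiplies rather than adds state, consider a toy sketch of how a streaming join works. This is illustrative Python, not Flink code: each join operator must buffer every record from both of its inputs (to handle late arrivals and updates), so the second join in a chain re-buffers every intermediate row the first one emits.

```python
# Toy illustration (not Flink code) of state amplification in chained
# streaming joins: each join operator retains ALL records from both of
# its inputs, so intermediate join results get buffered again downstream.

class StreamingJoin:
    """Joins two keyed streams, keeping every input record as state."""
    def __init__(self):
        self.left_state = {}   # key -> list of buffered left records
        self.right_state = {}  # key -> list of buffered right records

    def on_left(self, key, record):
        self.left_state.setdefault(key, []).append(record)
        # Emit a joined row for every matching right record seen so far.
        return [(record, r) for r in self.right_state.get(key, [])]

    def on_right(self, key, record):
        self.right_state.setdefault(key, []).append(record)
        return [(l, record) for l in self.left_state.get(key, [])]

    def state_size(self):
        return (sum(len(v) for v in self.left_state.values())
                + sum(len(v) for v in self.right_state.values()))

# Chain: (A JOIN B) JOIN C — j2 must buffer every intermediate row.
j1, j2 = StreamingJoin(), StreamingJoin()

for i in range(3):                        # 3 records from stream A
    for row in j1.on_left("sku-1", f"a{i}"):
        j2.on_left("sku-1", row)
for i in range(3):                        # 3 records from stream B
    for row in j1.on_right("sku-1", f"b{i}"):
        j2.on_left("sku-1", row)          # intermediate rows re-buffered

print(j1.state_size())  # 6 — the 3 A records plus the 3 B records
print(j2.state_size())  # 9 — every A×B combination, before C even arrives
```

With four joins chained, each stage buffers the cross-products of the stages before it, which is how Zalando's state ballooned into the hundreds of gigabytes.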
The key outcome: An “unstable nightmare” caused by the fact that every hour, a cronjob would trigger a snapshot that required Flink to “iterate over the RocksDB state, serialize it, and move it to S3…” the engineer said.
This would keep the cluster's CPU at 100% for nearly 12 minutes, ultimately triggering a cascade of issues like multiple forced restarts, unreliable data backups and also scaling operations that sometimes took up to 20 minutes.
Enter: manual stream filtering...
Long story short (Kryvko has more detail in her blog), Zalando decided, instead of chaining joins, to unify all incoming data streams into a single DataStream[BaseEvent]: “This allowed us to replace the chain of joins with increasing state with a single KeyedProcessFunction – specifically, our custom MultiStreamJoinProcessor. The key in this case is the SKU – the product identifier, which is the common key across all streams,” she explained.
The company opted, whilst doing this, for “manual stream filtering.”
“If an incoming update has a timestamp that is earlier than what we already have, we drop it immediately, avoiding a state write altogether.”
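The shape of that approach can be sketched in a few lines of Python. This is not Zalando's code or Flink's API – the stream names and the emit-when-complete logic are assumptions for illustration – but it shows the two ideas at work: one keyed processor holding only the latest record per stream, and stale updates dropped before they ever touch state.

```python
# Minimal sketch (not Flink's API, not Zalando's implementation) of a
# keyed multi-stream join processor: state holds only the latest record
# per stream per SKU, and out-of-order updates are dropped up front.

class MultiStreamJoinProcessor:
    def __init__(self, streams):
        self.streams = streams
        self.state = {}  # sku -> {stream_name: (timestamp, payload)}

    def process(self, sku, stream, timestamp, payload):
        per_sku = self.state.setdefault(sku, {})
        current = per_sku.get(stream)
        # Manual stream filtering: if the update is not newer than what
        # we already hold, drop it — avoiding a state write altogether.
        if current is not None and timestamp <= current[0]:
            return None
        per_sku[stream] = (timestamp, payload)
        # Emit the enriched view once every stream has contributed
        # (an assumption here; the real emit logic may differ).
        if len(per_sku) == len(self.streams):
            return {s: v[1] for s, v in per_sku.items()}
        return None

proc = MultiStreamJoinProcessor(["price", "stock", "sponsored"])
proc.process("sku-1", "price", 10, {"eur": 49.99})
proc.process("sku-1", "stock", 12, {"qty": 3})
proc.process("sku-1", "price", 8, {"eur": 59.99})  # older — dropped
view = proc.process("sku-1", "sponsored", 15, {"boost": True})
print(view)
# {'price': {'eur': 49.99}, 'stock': {'qty': 3}, 'sponsored': {'boost': True}}
```

Note that the late price update (timestamp 8) never reaches state, so the emitted view keeps the newer €49.99 price – the core of the state-saving trick.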
That sounds painful, but was “less verbose than the SQL version, because our SQL was quite complex, with aggregations for calculating the maximal timestamps between several parts of the join [and] ranking functions [to ensure the] last record from the same part of the join always wins.”
The big wins: A 13% reduction in its AWS bill for this workload, and restarts that dropped to 4–5 minutes from 12 to 20 minutes. “But Flink 2+ is better?”
Indeed, Flink 2.1 introduced a “Multi-Way Join Operator” as an experimental feature. But Zalando is on an earlier version via the managed AWS service. “This feature alone will be worth the wait when we get there, but until then, we're covered by our home-baked solution,” she said.
Final words? “Flink SQL is perfect for 90% of use cases – it's fast, elegant, and maintainable. But a software engineer's value is in recognizing the remaining 10%: the use cases where the abstraction starts costing too much. And this was definitely one of those…”