Hot-Reloaded Rules in Flink: The Broadcast State Pattern (with Code)

Introduction

In the previous post of this series, we introduced the detection engine architecture: why static, hardcoded rules create operational drag, and how treating logic as data opens the door to truly dynamic systems. Now we go deeper. This post focuses on the first and most foundational pattern, threshold detection, and on the mechanism that makes it useful in production: hot-reloaded rules via Flink's broadcast state.

If you're new to this series, our Real-Time Pattern Detection primer covers the foundational design principles behind dynamic rule evaluation at scale.

Threshold Detection

Every real-time detection system begins with thresholds. They are the first line of defense against risk, abuse, and anomalous behavior. They are also the rules that change most frequently.

In most systems, threshold logic is hardcoded or loaded at startup. A small adjustment, a value change, a new condition, or a rule disable triggers a full deployment cycle. This turns the most basic form of detection into an operational bottleneck.

Before tackling more complex patterns like velocity or correlation, the system must first answer a simpler question: can detection logic change while the system is running, without downtime or state loss?

The Naïve Approach And Why It Fails

A common implementation embeds threshold rules directly into streaming code or static configuration files.

This approach breaks down quickly:

Rules require redeployments to change
Stateful jobs must restart even for trivial updates
Operational risk increases with every change
The engineering team becomes the gatekeeper for routine rule updates

As detection logic grows, this rigidity compounds. What should be a fast decision becomes a slow release process.

Architectural Approach For Logic To Travel Like Data

To remove this friction, detection logic must be treated as data.

Rules should:

Arrive continuously, not at startup
Be versioned and observable
Apply immediately to new events
Propagate consistently across all parallel tasks

This requirement leads directly to a dual-stream design: one stream for events, another for rules.

Context: What This Engine Does

At a high level, the engine:

Consumes events from Kafka (for example, events.raw)
Evaluates them against active detection rules
Emits enriched detections to a sink topic (for example, events.processed)

Rules themselves are streamed from Kafka (for example, rules.active) and hot-reloaded while the job is running.

[Figure: High-level flow (threshold path and beyond)]

[Figure: Rule reload (Kafka): how a new rule reaches every task]

Flink’s Concepts Explained

This blog series relies on the following Flink/PyFlink concepts. If you are new to Flink, use the table below as a quick reference.

Concept	What it means
MapStateDescriptor	A descriptor that tells Flink how to create a key-value state: you give it a name and the key/value types (e.g. STRING, STRING). Flink uses it to create the actual state at runtime. For broadcast state, the same descriptor is used on every task so all tasks share the same logical map.
Broadcast state	State that is updated from a broadcast stream: every record from that stream is sent to every parallel task, and each task updates its local copy of the state the same way. The result is that all tasks hold the same logical data (e.g. the same set of rules). Ideal for low-volume “control” data (like rules) that must be visible everywhere.
BroadcastProcessFunction	A process function that handles two inputs: (1) a normal stream (e.g. events) and (2) a broadcast stream (e.g. rules). You implement `process_broadcast_element` (called when a broadcast record arrives; you can read/write broadcast state) and `process_element` (called when a normal record arrives; you can read broadcast state and emit to the main output).
get_broadcast_state(descriptor)	Method on the context (ctx) passed to your `BroadcastProcessFunction`. Returns the broadcast state (a map-like object) for the given descriptor. In `process_broadcast_element` you can put/remove; in `process_element` you typically iterate (e.g. state.items()).
DataStream.broadcast(descriptor)	Called on a stream (e.g. the rules stream). Returns a BroadcastStream tied to that state descriptor. When you later call `normal_stream.connect(broadcast_stream).process(BroadcastProcessFunction)`, every record from the broadcast stream is sent to every task, where your `process_broadcast_element` can update the broadcast state.
connect(...) and .process(...)	`stream_a.connect(stream_b)` creates a connected stream (two inputs). `.process(MyProcessFunction)` runs your function: Flink calls your methods for each record from either input. With a broadcast stream as one input, Flink knows to broadcast that side and to give you the two different callbacks (process_broadcast_element vs process_element).
KafkaSource and from_source	`KafkaSource.builder()...build()` defines a Kafka consumer (topic, group, deserializer, etc.). `env.from_source(kafka_source, watermark_strategy, name, type_info)` turns it into an unbounded DataStream: records keep coming as they are produced to the topic.

Threshold Pattern

Threshold rules are the simplest and most common form of detection logic, but getting them right at scale requires more than a condition check. Here is how the engine structures and evaluates them.

What Is A Rule? (Structure Common To All Rule Types)

Every rule in the engine shares a common shape:

rule_id: Unique identifier (string).
version: Version string; when rules are loaded from Kafka, bumping the version triggers hot reload so the job picks up the new definition.
rule_type: One of threshold, velocity, or correlation; it determines how the engine evaluates the rule.
source_topic: The Kafka topic this rule applies to (e.g., events.raw).
Optional: name, description, enabled, tags.

The sink topic (where detections are written) is not set in the rule; it is set per job via --sink-topic (e.g., events.processed). Threshold rules add one more required block: conditions, a list of field conditions that must all be true for an event to match.

What Is A Threshold Rule?

A threshold rule is a set of conditions applied to event fields. If all conditions are satisfied for an event, the event matches, and we emit an enriched version to the sink. No history is kept; each event is evaluated independently (stateless).

Example rule:

Meaning: Emit events where type == "purchase" and value >= 20. Supported operators: ==, !=, >, <, >=, <=. Conditions are combined with AND logic.
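A rule with that meaning could be written roughly as follows. This is a sketch rather than the engine's exact schema: the top-level keys come from the rule structure described above, the per-condition keys (field, operator, value) follow the recap at the end of this post, and the rule_id, version, and name values are invented for illustration.

```python
import json

# Hypothetical threshold rule: purchase events with value >= 20.
# rule_id, version and name are made-up values for illustration.
example_rule = {
    "rule_id": "purchase-over-20",
    "version": "1",
    "rule_type": "threshold",
    "source_topic": "events.raw",
    "name": "Purchases of 20 or more",
    "enabled": True,
    "conditions": [
        {"field": "type", "operator": "==", "value": "purchase"},
        {"field": "value", "operator": ">=", "value": 20},
    ],
}

# The rule travels as a JSON string on the rules topic (rules.active in this post).
rule_message = json.dumps(example_rule)
print(rule_message)
```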

An example of matching and non-matching events:

Event	Match?	Why
`{"type": "purchase", "value": 25}`	Yes	Both conditions true: `type == "purchase"` and `value >= 20`.
`{"type": "purchase", "value": 10}`	No	`value` is below 20.
`{"type": "view", "value": 100}`	No	`type` is not `"purchase"`.

When an event matches, we emit an enriched JSON that includes the original payload plus processed, rule_id, rule_version, rule_type, and optional rule_name.
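The enrichment step can be sketched as a small helper. The function name enrich is hypothetical, and the value of the processed field is an assumption (a boolean marker here); the post only states which keys are added.

```python
import json


def enrich(event: dict, rule: dict) -> str:
    """Return the original payload plus rule metadata as a JSON string."""
    enriched = dict(event)           # copy so the input event is not mutated
    enriched["processed"] = True     # assumption: a simple boolean marker
    enriched["rule_id"] = rule["rule_id"]
    enriched["rule_version"] = rule["version"]
    enriched["rule_type"] = rule["rule_type"]
    if rule.get("name"):             # rule_name is optional
        enriched["rule_name"] = rule["name"]
    return json.dumps(enriched)


print(enrich({"type": "purchase", "value": 25},
             {"rule_id": "r1", "version": "2", "rule_type": "threshold"}))
```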

How We Built It: Flink’s Broadcast State And Our Rule Logic

We use Flink’s BroadcastProcessFunction so that rule updates and events are handled in one place: when a rule arrives, we store it in broadcast state via ctx.get_broadcast_state(descriptor); when an event arrives, we read that state and evaluate threshold rules. Flink’s MapStateDescriptor and broadcast() give us a shared map (rule_id → rule JSON) across all tasks; connect(broadcast_rules).process(...) wires the two streams to our function. Inside that function, we implement rule validation, condition evaluation, and enrichment.

Broadcast Function: Rule Storage And Event Evaluation

We extend BroadcastProcessFunction and use get_broadcast_state in both callbacks. When a rule arrives, we parse it, validate it with normalize_rule, and add it to or remove it from the state; when an event arrives, we iterate over the state and apply the threshold rules (velocity and correlation run in separate operators).

Condition Evaluation And Rule Validation

We use get_field_value to read event fields (including nested paths like user.id) and evaluate_condition to compare against the rule’s expected value (with numeric coercion for >, <, etc.). When we receive a rule, we call normalize_rule, which enforces required fields and a non-empty conditions list before we place it in the broadcast state.

Wiring: MapStateDescriptor, broadcast(), And connect().process()

We use Flink’s MapStateDescriptor to describe the shared map, rules_stream.broadcast(rules_state_desc) to turn the rules stream into the broadcast side, and events_stream.connect(broadcast_rules).process(RulesBroadcastFunction(...)) so that Flink delivers both rules and events to our function, where we read and write broadcast state.
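As a rough PyFlink sketch of that wiring: the helper names make_rules_descriptor and wire_threshold_path and the state name "rules" are hypothetical. The PyFlink imports are kept inside the descriptor helper so the wiring function itself depends only on the stream objects passed in.

```python
def make_rules_descriptor():
    """Describe the shared map: rule_id (STRING) -> rule JSON (STRING)."""
    from pyflink.common import Types
    from pyflink.datastream.state import MapStateDescriptor

    # "rules" is an assumed state name; any stable name works.
    return MapStateDescriptor("rules", Types.STRING(), Types.STRING())


def wire_threshold_path(events_stream, rules_stream, rules_state_desc,
                        rules_function, output_type=None):
    """Connect the events stream to the broadcast rules stream."""
    # Broadcast side: every rule record is sent to every parallel task.
    broadcast_rules = rules_stream.broadcast(rules_state_desc)

    # rules_function should be a BroadcastProcessFunction that uses the same
    # descriptor in ctx.get_broadcast_state(...) as the one passed here.
    return events_stream.connect(broadcast_rules).process(rules_function, output_type)
```

The same descriptor object has to be used both here and inside the process function, since it is what ties the broadcast stream to the state the function reads and writes.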

End-To-End Flow For Threshold

Putting it all together, here is how an event moves through the system from ingestion to enriched output.

Events are read from Kafka (events.raw) and form events_stream.
Rules are read from a Kafka topic (rules.active) and form rules_stream (unbounded).
We create the broadcast state descriptor and the broadcast view of rules:
broadcast_rules = rules_stream.broadcast(rules_state_desc)
so that every rule record is sent to every parallel task and updates the same logical map (rule_id → rule JSON).
We connect events and broadcast rules and process with our function:
threshold_processed = events_stream.connect(broadcast_rules).process(RulesBroadcastFunction(...), ...)
For each event, process_element runs; it reads the broadcast state, loops over threshold rules, evaluates conditions via get_field_value and evaluate_condition, and yields (sink_topic, enriched_json) for each match.
The output stream threshold_processed is later unioned with velocity and correlation results, mapped to a single JSON string (with _sink_topic set), and written to the Kafka sink.

So for threshold only: events stream + broadcast rules → connect → process (RulesBroadcastFunction) → threshold_processed → (later) union → map → sink.

Code Snippet: Condition Evaluation And Emit

Inside process_element, for each rule with rule_type == "threshold" we perform:

One event can match multiple threshold rules; we emit one (sink_topic, json) per matching rule.
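The per-event loop can be sketched in plain Python as below. This is an illustration, not the engine's code: emit_threshold_matches and condition_holds are hypothetical names, condition_holds is a simplified stand-in for evaluate_condition (flat fields only, no numeric coercion), and the processed value is assumed to be a boolean.

```python
import json
import operator
from typing import Iterable, Iterator, Tuple

OPS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
       "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def condition_holds(event: dict, cond: dict) -> bool:
    """Simplified stand-in for evaluate_condition (flat fields, no coercion)."""
    if cond["field"] not in event or cond["operator"] not in OPS:
        return False
    try:
        return bool(OPS[cond["operator"]](event[cond["field"]], cond["value"]))
    except TypeError:  # e.g. ordering a string against a number
        return False


def emit_threshold_matches(
    event: dict, rules: Iterable[Tuple[str, str]], sink_topic: str
) -> Iterator[Tuple[str, str]]:
    """Yield one (sink_topic, enriched_json) per matching threshold rule."""
    for _rule_id, rule_json in rules:  # same shape as state.items()
        rule = json.loads(rule_json)
        if rule.get("rule_type") != "threshold":
            continue  # velocity and correlation run in other operators
        # AND logic: every condition must hold for the event to match.
        if all(condition_holds(event, c) for c in rule["conditions"]):
            enriched = {**event, "processed": True, "rule_id": rule["rule_id"],
                        "rule_version": rule["version"], "rule_type": "threshold"}
            yield sink_topic, json.dumps(enriched)
```

Inside process_element, the rules argument would be the items of the read-only broadcast state, and each yielded tuple becomes one output record.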

Rule Reload

Hot reload is what separates a production-ready detection engine from a brittle one. Here is how rules flow from Kafka into every running task, without stopping the job.

How Do Rules Get Into The Job?

Rules are loaded from a Kafka topic (e.g. rules.active). The job subscribes to that topic; every message is a rule (JSON). New or updated rules are consumed as they are produced, so they are hot-reloaded without a restart. The rest of the job sees a stream of rule records (JSON strings) that is broadcast and fed into the BroadcastProcessFunction.
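A rules source along these lines could be built with PyFlink's KafkaSource. Treat it as a sketch: the function name build_rules_stream, the bootstrap address, the group id, and the choice to start from the earliest offset are assumptions, and the Kafka connector JAR has to be available to the job for it to run.

```python
def build_rules_stream(env, bootstrap_servers="localhost:9092", topic="rules.active"):
    """Return an unbounded DataStream of rule JSON strings read from Kafka."""
    from pyflink.common import Types, WatermarkStrategy
    from pyflink.common.serialization import SimpleStringSchema
    from pyflink.datastream.connectors.kafka import (
        KafkaOffsetsInitializer,
        KafkaSource,
    )

    rules_source = (
        KafkaSource.builder()
        .set_bootstrap_servers(bootstrap_servers)
        .set_topics(topic)
        .set_group_id("detection-engine-rules")  # hypothetical group id
        # Assumption: read the topic from the start so a fresh job sees all rules.
        .set_starting_offsets(KafkaOffsetsInitializer.earliest())
        .set_value_only_deserializer(SimpleStringSchema())
        .build()
    )
    # Rules carry no event-time semantics here, so no watermarks are generated.
    return env.from_source(
        rules_source, WatermarkStrategy.no_watermarks(), "rules-source", Types.STRING()
    )
```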

Where Are The Rules Stored?

Rules are stored in the Flink broadcast state (refer to the concepts table above).

We define it with a MapStateDescriptor: a named map with key type STRING (rule_id) and value type STRING (rule JSON). At runtime, Flink creates one such map per parallel instance of the operator; each instance runs in a TaskManager and keeps its copy of the broadcast state in memory at runtime.
Because the rules stream is broadcast, every rule record is delivered to every instance. Each instance runs process_broadcast_element and does state.put(rule_id, json.dumps(rule)) (or state.remove(rule_id) if the rule is disabled). So every instance’s map is updated the same way, and they all hold the same logical set of rules.

So: rules are stored in the broadcast state map (rule_id → rule JSON) maintained by each task. There is no separate external “rules store”; the map is the rules store for the duration of the job.
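The update applied to each task's map can be sketched without Flink. DictBroadcastState is a hypothetical stand-in that exposes only the put, remove, and items calls used in this post, and apply_rule_update is an illustrative name for the logic that would sit inside process_broadcast_element; treating a missing enabled flag as enabled is an assumption.

```python
import json


class DictBroadcastState:
    """Tiny in-memory stand-in for the broadcast state map."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def remove(self, key):
        self._data.pop(key, None)

    def items(self):
        return list(self._data.items())


def apply_rule_update(state, raw_rule: str) -> None:
    """Store an enabled rule under its rule_id, or drop a disabled one."""
    rule = json.loads(raw_rule)
    rule_id = rule["rule_id"]
    if rule.get("enabled", True):
        state.put(rule_id, json.dumps(rule))  # insert or overwrite
    else:
        state.remove(rule_id)                 # disabled rules leave the map


state = DictBroadcastState()
apply_rule_update(state, json.dumps({"rule_id": "r1", "version": "1"}))
apply_rule_update(state, json.dumps({"rule_id": "r1", "version": "2"}))
print(state.items())
```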

How Does Reload Work Internally?

“Reload” here means “the next event should see the new rule.” There is no special reload API; we just have to update the same map.

A new or updated rule is produced to rules.active (e.g. same rule_id, updated conditions).
Flink’s Kafka source for the rules topic reads the record and emits one element on the rules stream.
That stream is the broadcast side of the connected stream, so Flink sends that element to every parallel instance of the operator.
Each instance runs process_broadcast_element(value, ctx): we parse the JSON, normalize the rule, get the broadcast state with ctx.get_broadcast_state(rules_state_desc), and then state.put(rule_id, json.dumps(rule)) (or state.remove(rule_id) if disabled).
The next event that hits any instance triggers process_element; it calls state.items() and sees the updated rule. Evaluation uses the new definition immediately.

So internally: reload = new Kafka message → broadcast to all tasks → each task’s process_broadcast_element updates its broadcast state map → subsequent events see the new rules. No restart, no separate reload step.
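The propagation can be simulated in plain Python with one dictionary per parallel task. Nothing here is Flink API: PARALLELISM, broadcast_rule, and threshold_seen_by are illustrative names, and the loop over task_states plays the role of the broadcast stream.

```python
import json

PARALLELISM = 3
task_states = [dict() for _ in range(PARALLELISM)]  # one local map per task


def broadcast_rule(raw_rule: str) -> None:
    """Deliver one rule record to every task, as the broadcast stream does."""
    rule = json.loads(raw_rule)
    for state in task_states:
        state[rule["rule_id"]] = raw_rule


def threshold_seen_by(task_index: int, rule_id: str) -> int:
    """Read the threshold a given task would use for its next event."""
    rule = json.loads(task_states[task_index][rule_id])
    return rule["conditions"][0]["value"]


broadcast_rule(json.dumps({
    "rule_id": "r1", "version": "1",
    "conditions": [{"field": "value", "operator": ">=", "value": 20}],
}))
before = [threshold_seen_by(i, "r1") for i in range(PARALLELISM)]

# Same rule_id, bumped version, new threshold: the entry is overwritten in place.
broadcast_rule(json.dumps({
    "rule_id": "r1", "version": "2",
    "conditions": [{"field": "value", "operator": ">=", "value": 50}],
}))
after = [threshold_seen_by(i, "r1") for i in range(PARALLELISM)]

print(before, after)  # [20, 20, 20] [50, 50, 50]
```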

Flink Concepts Used For Rule Reload

What we achieve	Flink concept / API
Rules from Kafka (ongoing)	KafkaSource + from_source(...): unbounded stream
Same rules on every task	MapStateDescriptor + stream.broadcast(descriptor) + broadcast state in BroadcastProcessFunction
Connect events + rules, process both	events_stream.connect(broadcast_rules).process(RulesBroadcastFunction(...))
Read/update the shared rules map	ctx.get_broadcast_state(descriptor) then state.put / state.remove / state.items()

What This Pattern Gets You And Where It Goes Next

Here is a quick recap of what the threshold pattern and broadcast state give you, and a look at what the rest of this series will build on top of it.

Threshold: Stateless filtering via a list of conditions (field, operator, value). Implemented inside a BroadcastProcessFunction that reads rules from broadcast state and uses get_field_value and evaluate_condition; Flink provides the broadcast state (via MapStateDescriptor and broadcast) and the two-input process function contract.
Rule reload: Rules are stored only in broadcast state (a map per task, kept in sync by the broadcast stream). Rules are loaded from Kafka (KafkaSource / from_source); producing a new message to the rules topic updates the map on all tasks, and the next event sees the new rule, with no restart required.

In the next post, we’ll cover the velocity pattern (keyed state, windows, aggregations, and optional side output for correlation). The third post will cover the correlation pattern (primary + context streams, lookback, and metrics).

Building real-time detection systems and want to talk architecture? If you're exploring Flink, Kafka, or composable data pipelines for your platform, our team at Axelerant is happy to help.

Frequently Asked Questions

How does Flink broadcast state work?

Flink broadcast state is a key-value map that every parallel task of an operator keeps as its own local copy. We define it with a MapStateDescriptor (name, key type STRING, value type STRING), call rules_stream.broadcast(descriptor) to create a BroadcastStream, then wire it to events via connect().process(BroadcastProcessFunction). Every record on the broadcast side is delivered to all task instances, and each one runs process_broadcast_element to update its local copy. Because every task applies the same updates, all tasks hold the same logical set of rules once they have processed the same rule records.

Can you hot-reload Flink rules?

Yes, you can hot-reload Flink rules without restarting the job. We produce a new or updated rule (same rule_id, changed conditions) to the rules.active Kafka topic. Flink's KafkaSource reads it, broadcasts the record to every parallel task, and each task's process_broadcast_element calls state.put(rule_id, json.dumps(rule)). The next event that arrives triggers process_element, which calls state.items() and immediately evaluates against the updated rule — no restart, no separate reload step required.

What's the throughput cost of broadcast state?

This post does not report throughput benchmarks or latency numbers for broadcast state. Broadcast state is best suited to low-volume control data such as rules, because every broadcast record is delivered to every parallel task; it is not designed for high-frequency streams. In this design the event stream carries the high-volume load, while the rules stream is expected to change infrequently. For precise figures, benchmark against your own cluster configuration and workload.

When should you use broadcast state vs keyed state?

Use broadcast state when you have low-volume control data — like detection rules — that must be visible identically across all parallel tasks without partitioning. We use it for threshold rules because evaluation is stateless per event. Use keyed state when logic requires per-key history or aggregation, such as counting events per user over a window. The post explicitly notes that velocity rules (which require windowed aggregations) run in a separate operator using keyed state, not broadcast state.


About the Author

Qais Qadri, Senior Software Engineer

Qais enjoys exploring places, going on long drives, and hanging out with close ones. He likes to read books about life and self-improvement, as well as blogs about technology. He also likes to play around with code. Qais lives by the values of Gratitude, Patience, and Positivity.
