Deloitte Engineering Blog | Netherlands

The Event-Carried State Transfer pattern


In the third blog of the ‘engineer’s guide to event-driven architectures’ series, I discussed the most commonly applied event-driven software design pattern: the event notification pattern. Let’s explore an alternative, the event-carried state transfer pattern. In this blog, I explain the characteristics of the pattern, how it impacts the design of events, and the challenges to expect. I will finish up with a great tip on how to start getting experience with the event-carried state transfer pattern using realistic production data volumes and throughput.

The characteristics of the Event-Carried State Transfer pattern

As the name suggests, the main characteristic of the event-carried state transfer pattern is that events carry state, which is quite different from notification events that contain just an identifier to retrieve state from the producer. A stateful event for a customer placing an order might look like this:

{
  "specversion": "1.0",
  "type": "com.example.orderPlaced",
  "order": {
    "id": "A001-1234-1234",
    "time": "2020-12-15T00:00:00Z",
    "products": [
      {
        "id": "1234321",
        "name": "eBook Seven Languages in Seven Weeks",
        "price": 25.0,
        "quantity": 1
      }
    ]
  }
}

Including state in events eliminates the need for the consumer to make a call back to the producer to retrieve state. Instead, consumers build a private replica of state by storing the state from events they consume.

An example could be a billing microservice that listens to customerAddressUpdate events to keep the latest address available to suggest to the user as the default billing address for new orders. When a user changes their address, the customer microservice updates its own state, which is considered the “source of truth”, and then produces a stateful customerAddressUpdate event, which enables other microservices like billing to update their private replica of state.
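
To make that concrete, here is a minimal sketch (with hypothetical event fields and function names) of how the billing service could maintain its private replica purely from the events it consumes:

# Minimal sketch (hypothetical event fields): the billing service keeps a
# private replica of customer addresses, updated purely from consumed events.
customer_addresses = {}  # in production this would be a durable store, not an in-memory dict

def on_customer_address_update(event):
    """Apply a stateful customerAddressUpdate event to the private replica."""
    customer = event["customer"]
    customer_addresses[customer["id"]] = customer["address"]

def default_billing_address(customer_id):
    """Reads are served from the replica; no call back to the customer microservice."""
    return customer_addresses.get(customer_id)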

The consumer having all state readily available for processing subsequent events eliminates the need for a call back to the producer upon event consumption, making this pattern ideally suited for workloads with high throughput, robustness, and availability requirements.

Including state in events introduces new challenges: state management and replication, more demanding event design, and dealing with duplicate and out-of-order processing in event flow design. Let’s discuss these challenges and the available design options.

State management and replication

As consumers use stateful events to update their private replica of state, there’s a really short period in which the data is “lagging behind” compared to the “source of truth”. This means that rather than strong consistency, the system is eventually consistent. Eventual consistency is the theoretical guarantee that, provided no new updates to data are made, all reads of that data throughout the system will eventually return the last updated value.

Sacrificing strong consistency is not uncommon in distributed systems, as they must balance consistency, availability, and partition tolerance. As per the CAP theorem it’s impossible to have all three, and partition tolerance and availability are often prioritized when designing scalable systems.

Eventual consistency is not specific to microservices, events or the event-carried state transfer pattern. Clustered storage solutions are distributed systems in and of themselves and as such also make choices on the level of consistency guarantees. Cassandra, for example, is eventually consistent, and many other offerings have configuration to finetune consistency behavior. However, when using third-party solutions like Cassandra we have the luxury that much of the complexity of dealing with the challenges of eventual consistency is abstracted away behind battle-tested solutions that either work “out-of-the-box” or require minimal configuration to finetune to a particular use-case. It’s still important to conceptually understand what’s happening, but a lot has been done for us.

When using the event-carried state transfer pattern, nothing stands between you and the fickle reality of eventual consistency.

There’s one universal truth, all things that can fail will fail, it’s not a matter of “if” but a matter of “when”. If a single consumer fails, and that isn’t accounted for in the design, state replicas can diverge permanently.

Fortunately, there are strategies that deal with failures through robust event flows with retries and re-processing. See some examples here from Confluent and Uber on how to implement retries for Apache Kafka. While the implementation will be different for other brokers, the general concepts should be similarly applicable.

Besides eventual consistency, the assumption that all microservices should have the state they need readily available brings additional complexity when introducing new microservices that require historical state. There are three possible scenarios to deal with this:

Some event brokers, like Apache Kafka for example, support configurable retention periods for events. This means that all historic events remain available to be “re-played”. This is a great way for newly introduced services to build their state quickly (a minimal replay sketch follows after these scenarios).

It’s also possible to perform a data migration from the “source of truth” microservice(s). The approach should be based on attributes like dataset size, accuracy, repeatability, and transformation requirements. One concept that is this space that I found really interesting is merkle trees. Cassandra and Dynamo rely on them to compare state between cluster nodes and correct inconsistencies very efficiently across the network.

The final solution to build state in new microservices is a transition scenario: the consumer starts building state by consuming stateful events but, in case of missing state, has the fallback option to call other services for that state, like you would with the event notification pattern. The biggest drawback of this approach is that the runtime dependency between microservices is temporarily reintroduced until the new microservice has “caught up”.
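
Coming back to the first scenario, here is a minimal sketch of replaying retained events to bootstrap state in a new service. It assumes Apache Kafka with the confluent-kafka Python client and a hypothetical customer-events topic; other brokers will look different.

import json
from confluent_kafka import Consumer  # assuming the confluent-kafka client

def apply_event(event):
    ...  # apply the state carried by the event to the local replica

# A new consumer group with auto.offset.reset=earliest starts at the oldest
# retained event, so the new service can rebuild its state by replaying history.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "billing-service-v1",   # new group, so no committed offsets yet
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["customer-events"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    apply_event(json.loads(msg.value()))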

To conclude, do not underestimate the complexity of maintaining consistent state across microservices in a high-availability environment and at scale. Be prepared, because if it can fail, it will fail; the question is when.

Event design

When using the event notification pattern, events are quite small. They only contain references to state, accompanied by useful metadata. This also means that their design is stable, as it’s often the callback interface that changes as the system evolves and features are added. With stateful events, as is the case with event-carried state transfer, there are far more trade-offs to make during the design process. Not only that, but it also means continuously adapting the design of events to changing circumstances and (non-)functional requirements.

That’s easier said than done, especially with more complex systems. There’s one sentence in Martin Fowler’s article on evolutionary design where he states that one of the core skills for good design is knowing how to communicate the design to the people who need to understand it, using code, diagrams and above all: conversation. Practices from domain-driven design like bounded contexts, aggregates, context mapping and event storming while not mandatory in any way, can be extremely helpful in the design process and communicating that design. For the next section, it’s helpful to understand the basic concepts of bounded contexts and ubiquitous language.

Different types of events

Before we talk about how to scope stateful events, it’s important to distinguish between domain events and integration events. A domain event is something that happened in the domain that is relevant to other microservices within the same bounded context. Domain events can be part of the execution of a single business transaction. An integration event is something that happened which is (also) relevant to other bounded contexts in the system, often when the business transaction is completed successfully.

This distinction is important because events continuously adapt to new requirements as the system evolves, and integration events tend to be more challenging to adapt as they form a cross-bounded-context contract. Different bounded contexts are often implemented and maintained by different teams. Changes across teams require more planning and alignment, since teams tend to have their own goals, priorities, and roadmap.

Additionally, there’s a higher chance of having misunderstandings on a detailed level since ubiquitous language for both contexts are not guaranteed to be consistent. A seemingly simple concept like a ‘customer’ can mean different things to different people. To the marketing department someone might be considered a customer the moment they register while to the risk department only considers someone a customer after all the “know your customer” compliance checks are completed.

Unless these discrepancies are caught during design, they can become costly. This is because our delivery processes, automated tests and interactions are optimized for what happens within the team and teams working on the same bounded context. Focus on integrations with the broader system often comes into play later, meaning bugs have more impact in terms of rework.

So, make sure that the design process is adapted to the type of event and use techniques like context mapping and event storming to facilitate the right conversation.

How to scope stateful events

One of the key decisions for stateful events is scoping the state that is included in the event. In many publications an event that contains any state is referred to as a fat event. That would make all events that leverage the event-carried state transfer pattern fat events. Some authors, including myself, make a distinction between delta events and fat events. Delta events contain just the properties that changed: just enough detail, nothing more.

Events that contain redundant information with the goal of reducing the number of different events that a consumer must listen to are what I consider fat events. An example could be including the customer’s name in an orderPlaced event, so that the shipping service doesn’t have to separately build customer state to trigger personalized emails about the shipping status of the order.
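
As an illustration, here is the same business fact as a delta event versus a fat event (hypothetical fields, shown as Python dicts):

# Delta-style orderPlaced: just the new order and a reference to the customer.
delta_order_placed = {
    "type": "com.example.orderPlaced",
    "order": {"id": "A001-1234-1234", "customerId": "C-42"},
}

# Fat orderPlaced: customer details repeated so the shipping service does not
# have to build customer state itself.
fat_order_placed = {
    "type": "com.example.orderPlaced",
    "order": {
        "id": "A001-1234-1234",
        "customer": {"id": "C-42", "name": "Jane Doe"},  # redundant state for one consumer
    },
}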

The benefit of delta events is that they are a small and focused contract that is expressive in its intent. Clear intent is an often-underestimated characteristic of a clean and maintainable implementation. Small events also prevent coupling, since they’re not designed with specific consumer requirements in mind. This is a big benefit over fat events, which often include additional state for specific consumers. As consumers come and go over time, it often becomes unclear what state is used by which consumers, making it harder to change producers when using fat events. Additionally, fat events can turn into bloated events over time as more and more state is included to support additional features.

There are also some challenges with delta events, as they require the consumer to build state based on several small events and do lookups to associate things with each other. This adds complexity to the consumer and, depending on the number of events consumed, creates a more indirect form of coupling, since the consumer has a higher awareness of the producer’s business processes. A final drawback can be performance: state building and retrieval during event processing can limit performance potential and increase operational costs.

So, while both have their strengths and weaknesses, for me it’s “delta events, unless…”, meaning that I take delta events as the starting point. Their clear intent and expressiveness are extremely valuable, and they avoid the bloated event pitfall.

To conclude, state brings a sharp complexity increase to event design due to the higher number of considerations. In general, be very careful with fat events, don’t ever haphazardly add additional state to events for specific consumers. Be disciplined in design by examining the options and tradeoffs and considering the irreversibility of decisions to prioritize where to spend the time available.

Duplicate processing

The challenges related to event design are one thing; designing the event flow introduces additional challenges in the form of potential duplicate processing and out-of-order processing. Let’s start with duplicate processing.

Many event brokers offer at-least-once processing by default, which means that if the broker does not receive confirmation that the event is consumed successfully, the broker will resend the event. While this ensures that events get processed, it also means that this can happen multiple times if it’s the acknowledgement from the consumer after processing that failed.

The defense here is designing your events to be idempotent. Idempotency means that an event can be applied multiple times without changing the result.

Assume you have an itemQuantityIncreased event. If you simply design it as “increase quantity by one”, applying that event multiple times leads to a different outcome. By modeling it as “set quantity to 5” instead, you can apply the event numerous times and the outcome is always five.
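
A tiny sketch of the difference (hypothetical event fields):

cart = {"1234321": 4}

def apply_relative(event):
    # "increase quantity by one" -- not idempotent: a redelivery changes the result
    cart[event["itemId"]] += event["increaseBy"]

def apply_absolute(event):
    # "set quantity to 5" -- idempotent: every redelivery converges on the same value
    cart[event["itemId"]] = event["quantity"]

event = {"itemId": "1234321", "quantity": 5}
apply_absolute(event)
apply_absolute(event)   # duplicate delivery, the quantity is still 5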

Making an event idempotent might not always be possible. Some brokers offer features to detect and ignore duplicate deliveries. Always read the fine print: there’s often a limited time span in which this guarantee is given, which might be fine for most cases, but it might not be for the one that you’re designing. It’s also possible to implement duplicate detection manually, by keeping a cache of keys or hashes of recently processed events, or by using sequence numbers to detect and ignore duplicates. Since this detection requires state, you would either use partitioned consumption with a local state store or a centralized state store if you’re using competing consumer instances. This can be quite impactful to performance and scalability.
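
A minimal sketch of manual duplicate detection, assuming every event carries a unique id; with competing consumer instances the cache would have to live in a shared store instead of in memory:

from collections import OrderedDict

MAX_CACHE = 100_000
recently_processed = OrderedDict()   # event id -> None, ordered oldest-first

def apply_business_logic(event):
    ...   # the actual processing step

def handle(event):
    event_id = event["id"]
    if event_id in recently_processed:
        return                                       # duplicate delivery, ignore it
    apply_business_logic(event)
    recently_processed[event_id] = None
    if len(recently_processed) > MAX_CACHE:
        recently_processed.popitem(last=False)       # evict the oldest entry to bound memory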

To conclude, there are multiple ways to deal with duplicate processing, and modelling events to be idempotent is by far the easiest and most reliable, so take advantage of it when possible.

Order in event flows

Now that we covered event design and duplicate processing, let’s discuss out-of-order processing and the impact on designing the event flow.

When we talk about ordering, it’s important to realize that there are multiple levels of ordering: global ordering and partial ordering. There are very few use-cases that require global ordering across all events, but partial ordering is often relevant.

Ordering events across all customers is irrelevant as long as the events of a single customer are ordered properly. That partial ordering is needed because those events are related to each other, and processing them in the incorrect order can cause issues.

Guaranteeing partial ordering end-to-end requires a look at every link in the chain. For most event flows this means using a FIFO (first in, first out) capable event platform and partitioned consumption with the customer ID as the partition key. If these concepts are not familiar, they are covered in my previous blog. Depending on the expected frequency of events with the same partition key, also have a look at the producer side. If events are batch produced, this could break ordering. Here’s an example of how producer configuration in Apache Kafka impacts ordering if not set properly.
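
As a sketch of the producer side, assuming Apache Kafka with the confluent-kafka client (the exact configuration names differ per broker and client):

import json
from confluent_kafka import Producer  # assuming the confluent-kafka client

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,   # internal retries cannot reorder or duplicate writes
    "acks": "all",
})

def publish(event, customer_id):
    # Same partition key -> same partition -> FIFO ordering for that customer.
    producer.produce("customer-events", key=customer_id, value=json.dumps(event))

publish({"type": "com.example.customerAddressUpdate", "customerId": "C-42"}, "C-42")
producer.flush()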

Partial ordering also has the potential to become more difficult with event flows that have retries to ensure proper message processing. If not designed properly, these alternative non-happy flows can break ordered processing. Here’s an example of error handling in Apache Kafka with order guarantees kept.

In real life, things fail. Therefore, there’s often a mechanism like retries in place where events are written to a different topic or queue and processed by a retry application. These flows tend to break order guarantees unless the producer is specifically built to send subsequent related events to the retry flow as well, regardless of whether the underlying error has been resolved.

As we previously learned, with the event-carried state transfer pattern consumers can leverage events to build state that is required to perform a business task initiated by a different event. Since ordering guarantees are only provided at the queue/partition level, this can introduce consumption overhead.

Let’s explain this with an example. We’re responsible for the shipping microservice in a big eCommerce system. Our scope is simple: our consumer listens to two events, customerShippingAddressEntered and orderPlaced. We obviously have a flexible, state-of-the-art ordering process, so the customer can go back and forth in the ordering process and change things as they see fit. This means that the customer can change their shipping address, so we can have multiple customerShippingAddressEntered events belonging to one order. Since incorrectly shipped orders and angry customers are not good for business, we better process those customerShippingAddressEntered events before the orderPlaced event… So, how do we make sure that happens?

In general, there are three approaches to this when you have a broker that supports FIFO and partitioned consumption. In these options I will use the word queue, as that’s the most widely understood term for a user-defined segmentation of event streams at the broker level. Please note that not all solutions offer ordering guarantees on this level. Read the docs for the broker you plan to use and understand how ordering works in detail.

Now, on with the approaches:

The first approach would be to publish the entire event stream to a single queue and have all microservices that need events related to orders consume that queue, using a partitioned consumption pattern with the order ID as partition key. While this is a simple setup, there’s (significant) consumption overhead if consumers are only interested in a subset of the events. Our shipping microservice would have to consume all the events belonging to the order, while it only cares about the customerShippingAddressEntered and orderPlaced events (see the sketch after these three approaches).

The second approach would be to create different queues, each with a partitioned consumption pattern with the order ID as partition key; for example, an order queue and a shipping queue. Having different queues eliminates the consumption overhead from the previous approach. However, since there are no cross-queue ordering guarantees, listening to the shipping queue for the customerShippingAddressEntered event and the order queue for the orderPlaced event can break ordering. A simple solution would be to produce the orderPlaced event to both the order and the shipping queue, so that the shipping microservice only has to consume from the shipping queue. If the state in the orderPlaced event is kept minimal and is not adjusted to specific consumers, this approach doesn’t create coupling. However, the producer is no longer completely unaware of the consumers, as is common in pub/sub. This could be considered a form of coupling as well.

The third and final approach would be to publish the entire event stream to a single queue using a partitioned consumption pattern with the order ID as partition key, just like the first approach. However, instead of different microservices consuming that event stream, it’s consumed by a process orchestrator. That orchestrator can then send appropriate commands to microservices like shipping when appropriate. This also eliminates the consumption overhead, and there’s no need to produce the orderPlaced event to multiple queues. While not relevant to ordered processing, an orchestrator also allows centralizing flow logic, more granular time-based error handling scenarios, and SLA monitoring. The main challenge with orchestrators is that they are a potential single point of failure, and they require more discipline in their design since they are susceptible to complexity, because business processes might cover multiple bounded contexts.
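
Here is the consumer sketch referenced in the first approach, again assuming Apache Kafka with the confluent-kafka client and hypothetical topic and event names. The shipping service reads the full order stream, keyed by order ID, but only acts on the two event types it cares about; everything else is the consumption overhead mentioned above.

import json
from confluent_kafka import Consumer  # assuming the confluent-kafka client

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "shipping-service",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["order-events"])

RELEVANT_TYPES = {"com.example.customerShippingAddressEntered", "com.example.orderPlaced"}

def handle_shipping_event(event):
    ...   # update the shipping service's private replica / trigger shipping

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    if event["type"] not in RELEVANT_TYPES:
        continue                      # still read and deserialized, then thrown away
    handle_shipping_event(event)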

While far less likely, there are two scenarios remaining that are worth a short mention: dealing with situations where FIFO and partitioned consumption are not available, and how to deal with global ordering requirements.

While FIFO and partitioned consumption are commonly available in most popular event brokers and from the big cloud vendors, ordered processing is still possible without them. This requires a combination of a key like the order ID and event sequence numbers to indicate order. Consider this a fallback scenario, since it means that the consumer must consume all events in the stream, and without partitioned consumption you are either limited to running a single consumer instance or need a centralized state store when using competing consumer instances. This can be quite impactful to performance and scalability.
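
A sketch of that fallback, assuming every event carries the order ID and a per-order sequence number (hypothetical fields); out-of-order arrivals are buffered until the next expected number shows up:

from collections import defaultdict

next_expected = defaultdict(lambda: 1)   # order id -> next sequence number to apply
buffered = defaultdict(dict)             # order id -> {sequence number: event}

def apply(event):
    ...   # the actual state update

def on_event(event):
    order_id, seq = event["orderId"], event["sequence"]
    if seq < next_expected[order_id]:
        return                                        # older duplicate, already applied
    buffered[order_id][seq] = event
    while next_expected[order_id] in buffered[order_id]:
        apply(buffered[order_id].pop(next_expected[order_id]))
        next_expected[order_id] += 1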

When there are global ordering requirements, the most common approach is to use a buffer layer implementation of some kind and implement the ordering mechanics there. A common inclination is to use physical time in the form of timestamps for ordering. While possible, I would not recommend relying on physical time in a distributed system, since physical clocks are never perfectly synchronized across the network of systems. Rather, have a look at logical clocks like Lamport clocks or vector clocks. They are explained extremely well here by Martin Kleppmann. While I have not implemented these in production myself, they are a well-known concept in the distributed systems space when it comes to ordering.
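
For reference, the core of a Lamport clock is only a few lines; this is a minimal sketch, not production code:

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or send: advance the clock and stamp the outgoing event with it."""
        self.time += 1
        return self.time

    def receive(self, remote_time):
        """On receive: jump ahead of whatever the sender has observed."""
        self.time = max(self.time, remote_time) + 1
        return self.time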

To conclude, dealing with order in event flows can be challenging. While there are plenty of mechanisms that can help, implementing it properly end-to-end requires a good understanding of each link in the chain that the event flows through.

Controlled experimentation

After reading this blog, I hope I’ve been able to accurately communicate the challenges that the event-carried state transfer pattern introduces. I would not recommend using this pattern for production workloads without prior experience.

At the beginning of this blog, I hinted at a great tip on how to start getting experience with the event-carried state transfer pattern using realistic production data volumes and throughput. We’ve all heard of proof-of-concepts or technical spikes to gain valuable experience before fully committing. They’re great, but they’re also a magical fairy land of small data volumes in a rather controlled and predictable environment.

The problem with this is that when you decide to take something to production, the real world slaps. Hard. So, how can we close the gap between a spike and production when that gap is too big?

One unique and cool benefit of events and their flexible asynchronous nature is that it’s possible to perform a controlled experiment in production by introducing distributary flows. Distributary is a term from river deltas, specifically indicating a branch of a river that does not return to the main stream after leaving it. You can easily build such a river branch using events. It’s paramount that such a flow of events is designed to be fully separated from the main production flow by using different queues, consumers, etc. This flow can start small, experimenting with one or two microservices, and then be expanded over time.

So, if you have an existing microservice that publishes a notification event, publish an additional stateful event, as you would have done when using only event-carried state transfer, to a different queue or topic.
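
A sketch of what that could look like, again assuming Apache Kafka with the confluent-kafka client and hypothetical topic names; the distributary topic is read only by the experimental consumers:

import json
from confluent_kafka import Producer  # assuming the confluent-kafka client

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_order_placed(order):
    notification = {"type": "com.example.orderPlaced", "orderId": order["id"]}
    stateful = {"type": "com.example.orderPlaced", "order": order}

    # Existing production flow: notification event, consumers call back for state.
    producer.produce("orders", key=order["id"], value=json.dumps(notification))
    # Distributary flow: the same business fact, carrying full state, on a separate topic.
    producer.produce("orders-ecst-experiment", key=order["id"], value=json.dumps(stateful))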

Now launch a separate consumer and listen to those events and build state with it. See if and how challenges like eventual consistency, duplicate processing and out-of-order processing present themselves in realistic production data volumes and throughput.

Since the distributary flow is fully separated from the production flow, if any unforeseen issues are encountered with state management or otherwise, there’s no major panic, since it’s not a production issue impacting users. Go in, investigate, wipe the slate clean and start over, until you feel confident that you have a rock-solid implementation. If there are privacy concerns about looking at production data, ensure that the first distributary flow consumer pseudonymizes sensitive data items in the original event before any subsequent processing occurs. Experimenting with the event-carried state transfer pattern this way is relatively safe, and way more realistic than the fairytale isolated proof-of-concept environment often used.

That concludes this fairly substantial addition to the ‘engineer’s guide to event-driven architectures’ blogging series. Stay tuned for the next one. Until then, I wish you an awesome day and happy coding!