Real Encounters with Microservice Events
Ashley Newton, Software Engineer II, SkySpecs
At SkySpecs, we use drones to inspect wind turbines. We consider turbines our assets, and on my team we own a microservice that tracks data for those assets.
We needed a scalable way to share that data with other domains. Ultimately we chose to address this need with a Kafka-based events system.
In this post I’ll walk through how we implemented events, while sharing some of the challenges we faced (such as the dual-write problem), and the solutions we found (like using the outbox pattern).
Why Events?
To answer the question of why we decided to start publishing events, let’s take a step back and explore our application’s architecture.
How our application is organized: microservices
Our application is based on microservices, and my team works on the assets service — think turbines, towers, blades. Other teams work on other domains related to those assets, such as blade inspections, and tracking damages.
Given that we have microservices for different domains, and different domains need data from each other, let’s address the big question:
❓ How do we keep each other informed of changes to our data?
In the assets domain, we track real-time updates to assets. For example if a blade is replaced, we need to make sure other domains get the memo.
That way when a new inspection is recorded, we can be confident that the inspection is being linked to the new blade that was just installed, and not to the old blade that was removed.
And we wanted to do this in the most efficient way, and with the least amount of coupling possible.
For example, we didn’t want the inspections service to have to always request the latest asset data from the assets service. Furthermore, we didn’t want inspections to have to reach into the assets database.
Ultimately we chose to address this need with a Kafka-based events system.
So, to address the next question: what is meant by an events system?
What is an events system?
Our events system is a common message location and a common language that all services use to keep each other informed when there are changes to the things that they’re responsible for.
I wanted to share our story here to provide food for thought — this is not a comprehensive guide to events or to Kafka, but more of a meditation on specific challenges we faced while starting to implement events, and how we addressed them.
To be able to follow this post, you don’t need a comprehensive knowledge of Kafka or of event-driven architecture, but knowledge of 90s comedies might help!
Adding events to an existing service
This project was triggered because we wanted to start publishing events for our assets.
For example, when a new blade gets installed on a turbine, we wanted a way to publish data about that installation, like the date it happened and the reason.
But we ran into a challenge — our service was already running, and it already had operations in place to create and update asset data. We needed a way to keep everything running as usual while also starting to publish events.
And this is when we encountered…
The Dual Write Problem
To illustrate the dual write problem, let’s walk through our initial plan (spoiler: we ended up changing it):
Our initial plan was this:
- Step 1: A CRUD action takes place
- Step 2a: Update the data in the assets database (for example, in the assets table)
- Step 2b: Concurrently emit an event that’s shared with our consumers.
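To make this concrete, here is a minimal sketch of that initial plan in Python. The table name, topic name, and use of the confluent-kafka client are all illustrative assumptions, not our actual implementation:

```python
import json
import sqlite3

from confluent_kafka import Producer  # assumes the confluent-kafka client

producer = Producer({"bootstrap.servers": "localhost:9092"})

def update_asset_naive(conn: sqlite3.Connection, asset_id: str, parent_id: str) -> None:
    # Step 2a: update the local datastore
    with conn:
        conn.execute(
            "UPDATE assets SET parent_id = ? WHERE asset_id = ?",
            (parent_id, asset_id),
        )

    # Step 2b: publish the event. If the process crashes here, or the broker is
    # unreachable, the database changes but consumers never hear about it; if the
    # publish happens but the database write fails, consumers hear about a change
    # that never happened. The two writes are not atomic.
    producer.produce(
        "asset-events",
        key=asset_id,
        value=json.dumps(
            {"operation": "AssetUpdated", "assetId": asset_id, "parentId": parent_id}
        ),
    )
    producer.flush()
```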
But after some analysis, we realized a weakness in this plan: it did not offer transactional safety. One operation could succeed while the other failed, so we might end up either:
- updating our local data store but failing to emit an event, OR
- emitting an event without updating our local data store
“sending a message in the middle of a transaction is not reliable. There’s no guarantee that the transaction will commit. Similarly, if a service sends a message after committing the transaction there’s no guarantee that it won’t crash before sending the message.”
— Chris Richardson, Microservices.io
Put plainly, this initial plan would take us away from having a single source of truth.
Solution: The Transactional Outbox Pattern
In the end we addressed this with the transactional outbox pattern.
The Transactional Outbox pattern is a solution to this problem that ensures atomicity as we store our data and publish events.
As part of our solution, we:
- Write to our assets table.
- In the same transaction, write to an outbox table. This ensures transactional safety.
- From the outbox table, we use a Change Data Capture (CDC) connector to publish events.
Or in the words of Gunnar Morling:
“The answer is to only modify one of the two resources (the database or Apache Kafka) and drive the update of the second one based on that, in an eventually consistent manner.”
— Gunnar Morling, Debezium.io
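Here is a minimal sketch of steps 1 and 2 in Python, again with illustrative names: a hypothetical assets table and an outbox table living in the same database, written in a single transaction. The CDC connector (step 3) tails the committed outbox rows and publishes to Kafka, so the application never writes to Kafka directly.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

def update_asset_with_outbox(conn: sqlite3.Connection, asset_id: str, parent_id: str) -> None:
    """Apply the CRUD change and record the event row in one transaction."""
    event = {"operation": "AssetUpdated", "assetId": asset_id, "parentId": parent_id}
    with conn:  # both statements commit together, or neither does
        conn.execute(
            "UPDATE assets SET parent_id = ? WHERE asset_id = ?",
            (parent_id, asset_id),
        )
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, event_type, payload, created_on) "
            "VALUES (?, ?, ?, ?, ?)",
            (
                str(uuid.uuid4()),
                asset_id,
                event["operation"],
                json.dumps(event),
                datetime.now(timezone.utc).isoformat(),
            ),
        )
    # A CDC connector (e.g. Debezium) reads the committed outbox rows and
    # publishes them to Kafka, in an eventually consistent manner.
```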
This addresses our first major challenge.
Now, to introduce the next challenge we encountered…
The Day Zero Problem
❗ Disclaimer: The “Day Zero Problem” is an internally coined title, so searching for it probably won’t turn up relevant results. Related ideas, such as snapshotting in event sourcing, are covered in the further-reading links at the end of this post.
What is the Day Zero problem?
I’ll be honest, this is a bit tricky to explain, so for the moment I’ll step away from the wind turbines example, and talk about this in terms of pizza.
To illustrate, let’s say you work in a pizzeria, and you keep getting orders.
Every order includes a list of toppings, but before you can start adding toppings, you need a pizza to put them on, and you need to know the size.
So let’s say you receive an order that includes a list of toppings, but no size.
In this case you don’t know where to apply the toppings — you don’t know which pizza to put them on.
We had this issue when we started to publish events. We had different consumers, like the inspections service, that needed to relate inspections to our assets. But we learned they couldn’t apply that data if they had no prior knowledge of an asset.
This presented a dilemma:
Consumers only learn about an asset when some operation takes place against it, that is, when it gets created, updated, or deleted AFTER the event stream has already started.
The problem: we had lots of assets already in our registry, and there was no need to update them, so how could our consumers ever find out about them?
Solution: Snapshot Events
To address this, we decided to implement a type of event called snapshots.
A snapshot event is different from other event types. Most events provide updates on the state of our data after an operation is performed: an asset is created, updated, or deleted.
Here’s how a snapshot event is different:
- We deliver a batch of events synchronously to provide a snapshot of the current state of our entire datastore
- This provides the baseline current state of asset data for all assets that currently exist
- This way, when another service needs to reference asset data, we have assurance that they have a starting point for every single asset
Going back to the pizza example, this would ensure that every order includes a size before it gets to you. You always know what pizza to put the toppings on.
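In practice, a snapshot pass might look something like the sketch below: walk every asset that already exists and write one AssetSnapshot row per asset into the same hypothetical outbox table from earlier, giving consumers a baseline before regular update events arrive. (The table and column names are illustrative.)

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

def publish_snapshot_events(conn: sqlite3.Connection) -> None:
    """Emit one AssetSnapshot event per existing asset via the outbox table."""
    assets = conn.execute(
        "SELECT asset_id, asset_type, parent_id, site_id, turbine_id FROM assets"
    ).fetchall()
    with conn:  # one transaction for the whole batch
        for asset_id, asset_type, parent_id, site_id, turbine_id in assets:
            event = {
                "operation": "AssetSnapshot",
                "assetId": asset_id,
                "assetType": asset_type,
                "parentId": parent_id,
                "siteId": site_id,
                "turbineId": turbine_id,
            }
            conn.execute(
                "INSERT INTO outbox (id, aggregate_id, event_type, payload, created_on) "
                "VALUES (?, ?, ?, ?, ?)",
                (
                    str(uuid.uuid4()),
                    asset_id,
                    event["operation"],
                    json.dumps(event),
                    datetime.now(timezone.utc).isoformat(),
                ),
            )
```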
Now, we’ve covered some big design decisions, and next I’d like to talk about something smaller but just as important.
Schema Design
So far we’ve covered the why and the how of our events, but I want to touch briefly on the what — specifically, what is in our events.
Like any team working with data, we had to make some decisions about our schema.
Here are just a few general lessons learned I’d like to share!
Preventing Stale Hierarchy Data
The first lesson is about hierarchy data and how some of our data was at risk of becoming stale.
Our asset data is very rich with hierarchy information. For example, looking at the hierarchy of parts when it comes to blades:
- Blades are part of a turbine.
- Turbines have many other parts, not just blades.
- Blades themselves have various sub-parts.
Initially we decided to provide a few specific hierarchy data points for every event:
- direct parent
- site
- turbine
But over time we realized this could cause problems for events that show hierarchy changes, as in the following example of a blade and one of its sub-components when the blade is removed.
Consider the following examples, paying attention to the parentId and turbineId:
Snapshot Event — Blade
```json
{
  "operation": "AssetSnapshot",
  "assetId": "B_001",
  "assetType": "blade",
  "parentId": "T_001",
  "siteId": "S_001",
  "turbineId": "T_001",
  ...
}
```
Snapshot Event — Forward Shear Web
```json
{
  "operation": "AssetSnapshot",
  "assetId": "FSW_001",
  "assetType": "forward_shear_web",
  "parentId": "B_001",
  "siteId": "S_001",
  "turbineId": "T_001"
}
```
When the above blade is put into storage (meaning it used to be part of a turbine, and now it’s not), we publish one UPDATE event for that blade. In that event body, there is NO turbine in the asset’s ancestry.
This results in an UPDATE event something like the following, again paying attention to the parentId and turbineId:
```json
{
  "operation": "AssetUpdated",
  "assetId": "B_001",
  "assetType": "blade",
  "parentId": "S_001",
  "siteId": "S_001",
  "turbineId": null
}
```
But our first implementation did not account for blades having their own sub-parts: we did not propagate events all the way down the asset tree.
Sadly, we don’t send any events to represent those sub-parts, so consumers don’t know to update them. Their records still show these child assets having the original turbine data, even though the blade shows turbineId: null. The consumer’s record for the sub-part remains stuck like this:
```json
{
  "operation": "AssetSnapshot",
  "assetId": "FSW_001",
  "assetType": "forward_shear_web",
  "parentId": "B_001",
  "siteId": "S_001",
  "turbineId": "T_001"
}
```
To resolve this, we face two options:
- A) Publish more events: one per sub-part in the component tree, OR
- B) Provide a bit less information in each event
We found that by always providing the turbine ID, we were providing a bit too much information, and some of it was at risk of becoming stale.
If we provide only the parent ID, that covers all of our needs without requiring us to send additional events.
“Tell the truth, the whole truth, and nothing but the truth.”
— Adam Bellemare, Building Event-Driven Microservices
We realized we might need to limit the amount of hierarchy data in each event to resolve this.
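For example, a consumer that still needs to know which turbine an asset sits under can derive it from the parent IDs it has already received, rather than relying on a turbineId field that may go stale. A hypothetical consumer-side lookup:

```python
# Consumer-side records keyed by assetId, built up from received events
asset_records = {
    "S_001": {"assetType": "site", "parentId": None},
    "T_001": {"assetType": "turbine", "parentId": "S_001"},
    "B_001": {"assetType": "blade", "parentId": "T_001"},
    "FSW_001": {"assetType": "forward_shear_web", "parentId": "B_001"},
}

def find_turbine_id(asset_id):
    """Walk up the parentId chain until a turbine (or the top) is reached."""
    current_id = asset_id
    while current_id in asset_records:
        record = asset_records[current_id]
        if record["assetType"] == "turbine":
            return current_id
        current_id = record["parentId"]
    return None  # e.g. a blade in storage has no turbine in its ancestry

print(find_turbine_id("FSW_001"))  # T_001
```

When the blade moves into storage and its parent becomes the site, the single blade UPDATE event is enough: the sub-parts’ own parentId values are still correct, and the walk simply no longer finds a turbine.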
Making Events Opinionated
Another challenge is providing the right set of information for the right type of asset.
We track data for many different types of assets, such as turbines, blades, gearboxes, generators, (and more!) and each has its own set of attributes.
To support the different types, we initially started emitting generic events to inform consumers if any assets were created, updated, or deleted.
No matter the type of asset, the event schema always included a collection of attributes — a generic list of objects.
Consider the following examples of generic events:
Generic Event — Blade
```json
{
  "operation": "AssetCreated",
  "assetId": "B_001",
  "assetType": "blade",
  "createdOn": "2022-11-08",
  "attributes": [
    {
      "name": "Position",
      "value": "A"
    },
    {
      "name": "Status",
      "value": "Active"
    },
    {
      "name": "Serial Number",
      "value": 12345
    }
  ],
  ...
}
```
Generic Event — Turbine
```json
{
  "operation": "AssetCreated",
  "assetId": "T_001",
  "assetType": "turbine",
  "createdOn": "2022-11-08",
  "attributes": [
    {
      "name": "Name",
      "value": "Initech"
    },
    {
      "name": "Latitude",
      "value": "42.28"
    },
    {
      "name": "Longitude",
      "value": "-83.75"
    }
  ],
  ...
}
```
With these examples in mind, we learned that generic events require consumers to do extra work to filter out relevant information based on the asset type, and then structure it in a useful way.
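A rough illustration of that consumer-side work (the helper and the trimmed event literal here are hypothetical):

```python
# Generic event: every value hides inside a list of name/value objects
blade_event = {
    "operation": "AssetCreated",
    "assetId": "B_001",
    "assetType": "blade",
    "attributes": [
        {"name": "Position", "value": "A"},
        {"name": "Status", "value": "Active"},
        {"name": "Serial Number", "value": 12345},
    ],
}

def get_attribute(event, name):
    """Scan the generic attributes list for a single value by name."""
    for attribute in event.get("attributes", []):
        if attribute["name"] == name:
            return attribute["value"]
    return None

serial_number = get_attribute(blade_event, "Serial Number")  # 12345

# With an opinionated event (see below), the same value is a first-class field:
#   serial_number = blade_event["serialNumber"]
```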
We learned that the data could be more accessible to users if events were more opinionated, as in the following, new and improved examples.
Opinionated Event — Blade
```json
{
  "operation": "BladeCreated",
  "bladeId": "B_001",
  "createdOn": "2022-11-08",
  "position": "A",
  "status": "Active",
  "serialNumber": 12345,
  ...
}
```
Opinionated Event — Turbine
```json
{
  "operation": "TurbineCreated",
  "turbineId": "T_001",
  "createdOn": "2022-11-08",
  "name": "Initech",
  "latitude": "42.28",
  "longitude": "-83.75",
  ...
}
```
Making events more opinionated is something we’re still working on.
Recap
To sum up some of the biggest design challenges we encountered on our events journey, and the solutions we found:
- The Transactional Outbox Pattern allowed us to add events to an existing service and avoid the dual-write problem
- Event snapshotting allows consumers to establish a baseline state for every asset
- Limiting the data in each event can be more efficient than providing unnecessary data
- Opinionated events reduce the burden on consumers to process and reformat data
I hope this gives you something to think about if you’re planning to implement events, or just some ideas to reflect on if you already have events up and running.
Resources / Further reading
- Kafka product description, Apache.org
- Event Streaming, VMware Tech Insights
- Reliable Microservices Data Exchange With the Outbox Pattern, Gunnar Morling, Debezium.io, 2019
- Charity Majors quote, Twitter, 2018
- Pattern: Event Sourcing, Chris Richardson, Microservices.io
- Dual Writes: The Unknown Cause of Data Inconsistencies, Thorben Janssen, thorben-janssen.com
- Snapshots in Event Sourcing for Rehydrating Aggregates, Derek Comartin, CodeOpinion.com, 2021
- Snapshotting Strategies, Oskar Dudycz, Event Store Blog, 2021
- Building Event-Driven Microservices, Adam Bellemare, O’Reilly, 2020