Real Encounters with Microservice Events
Ashley Newton, Software Engineer II, SkySpecs
At SkySpecs, we use drones to inspect wind turbines. We consider turbines our assets, and on my team we own a microservice that tracks data for those assets.
We needed a scalable way to share that data with other domains. Ultimately we chose to address this need with a Kafka-based events system.
In this post I’ll walk through how we implemented events, while sharing some of the challenges we faced (such as the dual-write problem), and the solutions we found (like using the outbox pattern).
Why Events?
To answer the question of why we decided to start publishing events, let’s take a step back and explore our application’s architecture.
How our application is organized: microservices
Our application is based on microservices, and my team works on the assets service — think turbines, towers, blades. Other teams work on other domains related to those assets, such as blade inspections, and tracking damages.
Given that we have microservices for different domains, and different domains need data from each other, let’s address the big question:
❓ How do we keep each other informed of changes to our data?
In the assets domain, we track real-time updates to assets. For example if a blade is replaced, we need to make sure other domains get the memo.
That way when a new inspection is recorded, we can be confident that the inspection is being linked to the new blade that was just installed, and not to the old blade that was removed.
And we wanted to do this in the most efficient way, and with the least amount of coupling possible.
For example, we didn’t want the inspections service to have to always request the latest asset data from the assets service. Furthermore, we didn’t want inspections to have to reach into the assets database.
Ultimately we chose to address this need with a Kafka-based events system.
So, to address the next question: what is meant by an events system?
What is an events system?
Our events system is a common message location and a common language that all services use to keep each other informed when there are changes to the things that they’re responsible for.
I wanted to share our story here to provide food for thought — this is not a comprehensive guide to events or to Kafka, but more of a meditation on specific challenges we faced while starting to implement events, and how we addressed them.
To be able to follow this post, you don’t need a comprehensive knowledge of Kafka or of event-driven architecture, but knowledge of 90s comedies might help!
Adding events to an existing service
This project was triggered because we wanted to start publishing events for our assets.
For example, when a new blade gets installed on a turbine, we wanted a way to publish data about that installation, like the date it happened and the reason.
But we ran into a challenge — our service was already running, and it already had operations in place to create and update asset data. We needed a way to keep everything running as usual while also starting to publish events.
And this is when we encountered…
The Dual Write Problem
To illustrate the dual write problem, let’s walk through our initial plan (spoiler: we ended up changing it):
Our initial plan was this:
- Step 1: A CRUD action takes place
- Step 2a: Update the data in the assets database (for example, in the assets table)
- Step 2b: Concurrently emit an event that’s shared with our consumers.
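To make this concrete, here is a minimal sketch of that initial plan in Python. The table name, topic name, and use of the confluent-kafka client are all illustrative assumptions, not our actual implementation:

```python
import json
import sqlite3

from confluent_kafka import Producer  # assumes the confluent-kafka client

producer = Producer({"bootstrap.servers": "localhost:9092"})

def update_asset_naive(conn: sqlite3.Connection, asset_id: str, parent_id: str) -> None:
    # Step 2a: update the local datastore
    with conn:
        conn.execute(
            "UPDATE assets SET parent_id = ? WHERE asset_id = ?",
            (parent_id, asset_id),
        )

    # Step 2b: publish the event. If the process crashes here, or the broker is
    # unreachable, the database changes but consumers never hear about it; if the
    # publish happens but the database write fails, consumers hear about a change
    # that never happened. The two writes are not atomic.
    producer.produce(
        "asset-events",
        key=asset_id,
        value=json.dumps(
            {"operation": "AssetUpdated", "assetId": asset_id, "parentId": parent_id}
        ),
    )
    producer.flush()
```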
But after some analysis, we realized a weakness in this plan: it did not offer transactional safety. One operation could succeed while the other failed, so we might end up either:
- updating our local data store but failing to emit an event, OR
- emitting an event without updating our local data store
“sending a message in the middle of a transaction is not reliable. There’s no guarantee that the transaction will commit. Similarly, if a service sends a message after committing the transaction there’s no guarantee that it won’t crash before sending the message.”
— Chris Richardson, Microservices.io
Put plainly, this initial plan would take us away from having a single source of truth.
Solution: The Transactional Outbox Pattern
In the end we addressed this with the transactional outbox pattern.
The Transactional Outbox pattern is a solution to this problem that ensures atomicity as we store our data and publish events.
As part of our solution, we:
- Write to our assets table.
- In the same transaction, write to an outbox table. This ensures transactional safety.
- From the outbox table, we use a Change Data Capture (CDC) connector to publish events.
Or in the words of Gunnar Morling:
“The answer is to only modify one of the two resources (the database or Apache Kafka) and drive the update of the second one based on that, in an eventually consistent manner.”
— Gunnar Morling, Debezium.io
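Here is a minimal sketch of steps 1 and 2 in Python, again with illustrative names: a hypothetical assets table and an outbox table living in the same database, written in a single transaction. The CDC connector (step 3) tails the committed outbox rows and publishes to Kafka, so the application never writes to Kafka directly.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

def update_asset_with_outbox(conn: sqlite3.Connection, asset_id: str, parent_id: str) -> None:
    """Apply the CRUD change and record the event row in one transaction."""
    event = {"operation": "AssetUpdated", "assetId": asset_id, "parentId": parent_id}
    with conn:  # both statements commit together, or neither does
        conn.execute(
            "UPDATE assets SET parent_id = ? WHERE asset_id = ?",
            (parent_id, asset_id),
        )
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, event_type, payload, created_on) "
            "VALUES (?, ?, ?, ?, ?)",
            (
                str(uuid.uuid4()),
                asset_id,
                event["operation"],
                json.dumps(event),
                datetime.now(timezone.utc).isoformat(),
            ),
        )
    # A CDC connector (e.g. Debezium) reads the committed outbox rows and
    # publishes them to Kafka, in an eventually consistent manner.
```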
This addresses our first major challenge.
Now, to introduce the next challenge we encountered…
The Day Zero Problem
❗ Disclaimer: The “Day Zero Problem” is an internally coined title, so searching for it probably won’t turn up relevant results. Related ideas, such as snapshotting in event sourcing, are covered in the further-reading links at the end of this post.
What is the Day Zero problem?
I’ll be honest, this is a bit tricky to explain, so for the moment I’ll step away from the wind turbines example, and talk about this in terms of pizza.
To illustrate, let’s say you work in a pizzeria, and you keep getting orders.
Every order includes a list of toppings, but before you can start adding toppings, you need a pizza to put them on, and you need to know the size.
So let’s say you receive an order that includes a list of toppings, but no size.
In this case you don’t know where to apply the toppings — you don’t know which pizza to put them on.
We had this issue when we started to publish events. We had different consumers, like the inspections service, that needed to relate inspections to our assets. But we learned they couldn’t apply that data if they had no prior knowledge of an asset.
This presented a dilemma:
Consumers only learn about an asset when some operation takes place against it, that is, when it gets created, updated, or deleted AFTER the event stream has already started.
The problem: we had lots of assets already in our registry, and there was no need to update them, so how could our consumers ever find out about them?
Solution: Snapshot Events
To address this, we decided to implement a type of event called snapshots.
A snapshot event is different from other event types. Most events provide updates on the state of our data after an operation is performed: an asset is created, updated, or deleted.
Here’s how a snapshot event is different:
- We deliver a batch of events synchronously to provide a snapshot of the current state of our entire datastore
- This provides the baseline current state of asset data for all assets that currently exist
- This way, when another service needs to reference asset data, we have assurance that they have a starting point for every single asset
Going back to the pizza example, this would ensure that every order includes a size before it gets to you. You always know what pizza to put the toppings on.
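In practice, a snapshot pass might look something like the sketch below: walk every asset that already exists and write one AssetSnapshot row per asset into the same hypothetical outbox table from earlier, giving consumers a baseline before regular update events arrive. (The table and column names are illustrative.)

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

def publish_snapshot_events(conn: sqlite3.Connection) -> None:
    """Emit one AssetSnapshot event per existing asset via the outbox table."""
    assets = conn.execute(
        "SELECT asset_id, asset_type, parent_id, site_id, turbine_id FROM assets"
    ).fetchall()
    with conn:  # one transaction for the whole batch
        for asset_id, asset_type, parent_id, site_id, turbine_id in assets:
            event = {
                "operation": "AssetSnapshot",
                "assetId": asset_id,
                "assetType": asset_type,
                "parentId": parent_id,
                "siteId": site_id,
                "turbineId": turbine_id,
            }
            conn.execute(
                "INSERT INTO outbox (id, aggregate_id, event_type, payload, created_on) "
                "VALUES (?, ?, ?, ?, ?)",
                (
                    str(uuid.uuid4()),
                    asset_id,
                    event["operation"],
                    json.dumps(event),
                    datetime.now(timezone.utc).isoformat(),
                ),
            )
```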
Now, we’ve covered some big design decisions, and next I’d like to talk about something smaller but just as important.
Schema Design
So far we’ve covered the why and the how of our events, but I want to touch briefly on the what — specifically, what is in our events.
Like any team working with data, we had to make some decisions about our schema.
Here are just a few general lessons learned I’d like to share!
Preventing Stale Hierarchy Data
The first lesson is about hierarchy data and how some of our data was at risk of becoming stale.
Our asset data is very rich with hierarchy information. For example, looking at the hierarchy of parts when it comes to blades:
- Blades are part of a turbine.
- Turbines have many other parts, not just blades.
- Blades themselves have various sub-parts.
Initially we decided to provide a few specific hierarchy data points for every event:
- direct parent
- site
- turbine
But over time we realized this could cause problems for events that show hierarchy changes, as in the following example of a blade and one of its sub-components when the blade is removed.
Consider the following examples, paying attention to the parentId and turbineId:
Snapshot Event — Blade
```json
{
  "operation": "AssetSnapshot",
  "assetId": "B_001",
  "assetType": "blade",
  "parentId": "T_001",
  "siteId": "S_001",
  "turbineId": "T_001",
  ...
}
```
Snapshot Event — Forward Shear Web
```json
{
  "operation": "AssetSnapshot",
  "assetId": "FSW_001",
  "assetType": "forward_shear_web",
  "parentId": "B_001",
  "siteId": "S_001",
  "turbineId": "T_001"
}
```
When the above blade is put into storage (meaning it used to be part of a turbine, and now it’s not), we publish one UPDATE event for that blade. In that event body, there is NO turbine in the asset’s ancestry.
This results in an UPDATE event something like the following, again paying attention to the parentId and turbineId:
```json
{
  "operation": "AssetUpdated",
  "assetId": "B_001",
  "assetType": "blade",
  "parentId": "S_001",
  "siteId": "S_001",
  "turbineId": null
}
```
But our first implementation did not account for blades having their own sub-parts: we did not propagate events all the way down the asset tree.
Sadly, we don’t send any events to represent those sub-parts, so consumers don’t know to update them. Their records still show these child assets having the original turbine data, even though the blade shows turbineId: null. The consumer’s record for the sub-part remains stuck like this:
```json
{
  "operation": "AssetSnapshot",
  "assetId": "FSW_001",
  "assetType": "forward_shear_web",
  "parentId": "B_001",
  "siteId": "S_001",
  "turbineId": "T_001"
}
```
To resolve this, we face two options:
- A) Publish more events: one per sub-part in the component tree, OR
- B) Provide a bit less information in each event
We found that by always providing the turbine ID, we were providing a bit too much information, and some of it was at risk of becoming stale.
If we provide only the parent ID, that covers all of our needs without requiring us to send additional events.
“Tell the truth, the whole truth, and nothing but the truth.”
— Adam Bellemare, Building Event-Driven Microservices
We realized we might need to limit the amount of hierarchy data in each event to resolve this.
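For example, a consumer that still needs to know which turbine an asset sits under can derive it from the parent IDs it has already received, rather than relying on a turbineId field that may go stale. A hypothetical consumer-side lookup:

```python
# Consumer-side records keyed by assetId, built up from received events
asset_records = {
    "S_001": {"assetType": "site", "parentId": None},
    "T_001": {"assetType": "turbine", "parentId": "S_001"},
    "B_001": {"assetType": "blade", "parentId": "T_001"},
    "FSW_001": {"assetType": "forward_shear_web", "parentId": "B_001"},
}

def find_turbine_id(asset_id):
    """Walk up the parentId chain until a turbine (or the top) is reached."""
    current_id = asset_id
    while current_id in asset_records:
        record = asset_records[current_id]
        if record["assetType"] == "turbine":
            return current_id
        current_id = record["parentId"]
    return None  # e.g. a blade in storage has no turbine in its ancestry

print(find_turbine_id("FSW_001"))  # T_001
```

When the blade moves into storage and its parent becomes the site, the single blade UPDATE event is enough: the sub-parts’ own parentId values are still correct, and the walk simply no longer finds a turbine.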
Making Events Opinionated
Another challenge is providing the right set of information for the right type of asset.
We track data for many different types of assets, such as turbines, blades, gearboxes, generators, (and more!) and each has its own set of attributes.
To support the different types, we initially started emitting generic events to inform consumers if any assets were created, updated, or deleted.
No matter the type of asset, the event schema always included a collection of attributes — a generic list of objects.
Consider the following examples of generic events:
Generic Event — Blade
```json
{
  "operation": "AssetCreated",
  "assetId": "B_001",
  "assetType": "blade",
  "createdOn": "2022-11-08",
  "attributes": [
    {
      "name": "Position",
      "value": "A"
    },
    {
      "name": "Status",
      "value": "Active"
    },
    {
      "name": "Serial Number",
      "value": 12345
    }
  ],
  ...
}
```
Generic Event — Turbine
```json
{
  "operation": "AssetCreated",
  "assetId": "T_001",
  "assetType": "turbine",
  "createdOn": "2022-11-08",
  "attributes": [
    {
      "name": "Name",
      "value": "Initech"
    },
    {
      "name": "Latitude",
      "value": "42.28"
    },
    {
      "name": "Longitude",
      "value": "-83.75"
    }
  ],
  ...
}
```
With these examples in mind, we learned that generic events require consumers to do extra work to filter out relevant information based on the asset type, and then structure it in a useful way.
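A rough illustration of that consumer-side work (the helper and the trimmed event literal here are hypothetical):

```python
# Generic event: every value hides inside a list of name/value objects
blade_event = {
    "operation": "AssetCreated",
    "assetId": "B_001",
    "assetType": "blade",
    "attributes": [
        {"name": "Position", "value": "A"},
        {"name": "Status", "value": "Active"},
        {"name": "Serial Number", "value": 12345},
    ],
}

def get_attribute(event, name):
    """Scan the generic attributes list for a single value by name."""
    for attribute in event.get("attributes", []):
        if attribute["name"] == name:
            return attribute["value"]
    return None

serial_number = get_attribute(blade_event, "Serial Number")  # 12345

# With an opinionated event (see below), the same value is a first-class field:
#   serial_number = blade_event["serialNumber"]
```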
We learned that the data could be more accessible to users if events were more opinionated, as in the following, new and improved examples.
Opinionated Event — Blade
```json
{
  "operation": "BladeCreated",
  "bladeId": "B_001",
  "createdOn": "2022-11-08",
  "position": "A",
  "status": "Active",
  "serialNumber": 12345,
  ...
}
```
Opinionated Event — Turbine
```json
{
  "operation": "TurbineCreated",
  "turbineId": "T_001",
  "createdOn": "2022-11-08",
  "name": "Initech",
  "latitude": "42.28",
  "longitude": "-83.75",
  ...
}
```
Making events more opinionated is something we’re still working on.
Recap
To sum up some of the biggest design challenges we encountered on our events journey, and the solutions we found:
- The Transactional Outbox Pattern allowed us to add events to an existing service and avoid the dual-write problem
- Event snapshotting allows consumers to establish a baseline state for every asset
- Limiting the data in each event can be more efficient than providing unnecessary data
- Opinionated events reduce the burden on consumers to process and reformat data
I hope this gives you something to think about if you’re planning to implement events, or just some ideas to reflect on if you already have events up and running.
Resources / Further reading
- Kafka product description, Apache.org
- Event Streaming, VMware Tech Insights
- Reliable Microservices Data Exchange With the Outbox Pattern, Gunnar Morling, Debezium.io, 2019
- Charity Majors quote, Twitter, 2018
- Pattern: Event Sourcing, Chris Richardson, Microservices.io
- Dual Writes: The Unknown Cause of Data Inconsistencies, Thorben Janssen, thorben-janssen.com
- Snapshots in Event Sourcing for Rehydrating Aggregates, Derek Comartin, CodeOpinion.com, 2021
- Snapshotting Strategies, Oskar Dudycz, Event Store Blog, 2021
- Building Event-Driven Microservices, Adam Bellemare, O’Reilly, 2020