September 24, 2019

Simplifying Observability

by Chris Talbot
in Observability

Observability Enablement

Observability solutions are born out of an organization’s need to make the availability and performance of its business visible. Highly observable applications are easier to support, are better understood, and have higher availability.

In this post, we’ll share the steps we took to make our observability platform easier for our customers to both use and understand while setting them up for success as we shift observability left.

The Beginning

At Enova, our observability platform comprises several different products, each excelling in a specific observability domain. You’d likely be familiar with the names of several of the products in our observability stack, but the approach described here is applicable regardless of which vendor or open-source observability tools are in use.

While each of these tools provides critical observability coverage, we discovered that they often weren’t being leveraged effectively by stakeholders outside of operations. We spent time training business and engineering stakeholders on each of the tools in the platform so they could start self-serving their own observability, but the complexity was too great and the tools remained underutilized.

Make it Easy

Users mentioned that having so many tools was overwhelming and often confusing. Furthermore, they weren’t always sure what to observe, what constituted a good metric, or how they could best use the various tools at their disposal.

To enhance stakeholder buy-in and utilization of the observability platform, we initiated two projects: the Observability Standards project and the Observability Automator project.

Observability Standards

To make observability more prescriptive and less confusing, we defined an easy-to-digest set of Observability Standards. Where possible, we adopted industry-standard methodologies rather than defining our own.

Our Standards follow:

USE Method: Defined by Brendan Gregg, the USE method is a standard way to observe system performance: for every resource, check Utilization, Saturation, and Errors.
RED Method: Defined by Tom Wilkie, the RED method is a standard way to observe service/application performance: for every service, track request Rate, Errors, and Duration.
Custom Business Metrics (CBM): An Enova-created term, this defines our standards for custom metrics. These are metrics that are emitted from the application/service itself.
Incident Management: This standard describes how we handle incident management across the organization.
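To make the RED standard concrete, here is a minimal Python sketch of how the three RED signals could be computed from a window of request records. The Request type, the red_summary function, and the choice of median and maximum as duration summaries are illustrative assumptions, not part of our platform or of any vendor API.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape)."""
    duration_ms: float
    ok: bool


def red_summary(requests, window_seconds):
    """Summarize Rate, Errors, and Duration for one observation window."""
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    durations = [r.duration_ms for r in requests]
    return {
        # Rate: requests per second over the window.
        "rate_per_s": total / window_seconds,
        # Errors: share of requests that failed.
        "error_ratio": errors / total if total else 0.0,
        # Duration: simple summaries of the latency distribution.
        "median_duration_ms": median(durations) if durations else 0.0,
        "max_duration_ms": max(durations) if durations else 0.0,
    }


sample = [
    Request(100.0, True),
    Request(200.0, True),
    Request(300.0, False),
    Request(400.0, True),
]
summary = red_summary(sample, window_seconds=2)
```

In practice an APM product computes these values for you; the sketch only shows what each letter of RED measures.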

Observability Automator

After we identified our Observability Standards, we still needed to make these standard metrics easy to access and utilize. There is no point in defining standards if they are too difficult for customers to actually use.

We determined that dashboards that automatically pull from the applicable tools and display the standards to our customers would be ideal. We then had to find an organizational principle for these dashboards, and we chose Services.

Services and USE: Services were already defined for all of our infrastructure within our configuration management system. Fortunately, our USE-focused observability product decorates the USE metrics with service tags/attributes by default. It just took a simple API call to pull only the metrics matching a specific service tag.
Services and RED: We use a popular APM product that provides our RED visibility. The APM product does not expose service tags, so we had to map the APM application names to the service names defined in the configuration management product. That way our API call to the APM product would pull in only the corresponding RED metrics for the service.
Services and CBM: Our custom metric store is not organized by service, and due to the highly custom nature of these metrics, there was no way to automate their addition to a dashboard. So we simply defined a placeholder location in our dashboard where these metrics can be added as needed. We then instructed service owners to add their CBMs to the service dashboards themselves, as they know the critical custom metrics of their application far better than anyone else.
Services and Incident Management: Our incident management tool is natively designed around the concept of services. We simply named the service in the incident management tool the same as the configuration management service name. That way any alerts from the various tools could flow directly into the correct service in the incident management system.
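The two lookups described above, filtering USE metrics by service tag and mapping APM application names to service names, can be sketched in Python as follows. The metric record shape, the application names, and both function names are hypothetical; real products expose this through their own APIs, which are not shown here.

```python
# Hypothetical mapping from APM application names to the service names
# defined in configuration management. Maintained by hand in this sketch.
APM_APP_TO_SERVICE = {
    "billing-web-prod": "billing",
    "billing-worker-prod": "billing",
    "search-api-prod": "search",
}


def metrics_for_service(metrics, service):
    """Keep only the metrics whose 'service' tag matches the given service."""
    return [m for m in metrics if m.get("tags", {}).get("service") == service]


def apm_apps_for_service(service, mapping=APM_APP_TO_SERVICE):
    """Return the APM application names that belong to a service, sorted."""
    return sorted(app for app, svc in mapping.items() if svc == service)


# Assumed metric shape: a name plus a dict of tags.
use_metrics = [
    {"name": "cpu.utilization", "tags": {"service": "billing"}},
    {"name": "disk.saturation", "tags": {"service": "search"}},
    {"name": "net.errors", "tags": {}},
]

billing_use = metrics_for_service(use_metrics, "billing")
billing_apps = apm_apps_for_service("billing")
```

Keeping the name mapping in one place means the service name stays the single key that ties the tools together.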

We now have all the pieces of the puzzle and need to tie them together. We could create hundreds of service dashboards by hand, but that isn’t scalable. It’s also difficult for human beings to always remember to leverage our standards while building dashboards, so we automated service dashboard creation.

The automation solution looks at the list of service names that exist and iterates through each. It ensures that each service has both a service dashboard that displays our observability standards metrics and a corresponding incident management service. If either is not found, the automation creates it. There is no human intervention needed when new services are spun up. #WIN
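The loop described above is essentially an idempotent reconcile step. Here is a minimal Python sketch of that idea, using an in-memory stand-in for the dashboard and incident management tools. The InMemoryBackend class and the reconcile function are assumptions made for illustration; the real automation calls each product's own API.

```python
class InMemoryBackend:
    """Stand-in for a tool that stores named objects (hypothetical)."""

    def __init__(self, existing=()):
        self.items = set(existing)

    def exists(self, name):
        return name in self.items

    def create(self, name):
        self.items.add(name)


def reconcile(service_names, dashboards, incident_services):
    """Ensure every service has a dashboard and an incident service.

    Returns the list of (kind, name) pairs that had to be created, so a
    second run over the same services returns an empty list.
    """
    created = []
    for name in service_names:
        if not dashboards.exists(name):
            dashboards.create(name)
            created.append(("dashboard", name))
        if not incident_services.exists(name):
            incident_services.create(name)
            created.append(("incident_service", name))
    return created


dashboards = InMemoryBackend(existing=["billing"])
incident_services = InMemoryBackend()

first_run = reconcile(["billing", "search"], dashboards, incident_services)
second_run = reconcile(["billing", "search"], dashboards, incident_services)
```

Because the step only creates what is missing, it can run on a schedule and pick up new services without anyone triggering it.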

Outcome

We’ve achieved our objective of delivering automated service dashboards that show the standard observability metrics for every service at our company. Any user can now easily determine the health of any service by finding the corresponding service dashboard, without getting lost among all the various observability tools. Additionally, since these dashboards have the service name in them, any alert created in these dashboards will automatically route to the correct incident management policy assigned to that service.

What’s Next?

This is just the beginning. We’ve defined a solid foundation to build upon and will be adding more observability standards and corresponding automations in the future. Some options we are exploring include adding exception tracking and deployment visibility to each service.

Conclusion

There are many ways to think about expertise. One way is to thoroughly explain and detail the complexities of a topic in the hope that others can become experts themselves. A different way to think about expertise is the ability to make complex topics seem simple for non-specialists. That is the approach we took with our Observability Standards and Observability Automator initiatives, simplifying observability for our stakeholders by leveraging standardization and automation.
