Micrometer and the Modern Observability Stack

Philip Leonard
Published in Picnic Engineering

11 min read · May 6, 2020


In the Picnic backend landscape, our monitoring stack looks like this: New Relic is our APM provider for tracing and performance monitoring of our Java and Python backend services (we have also experimented with DataDog as an alternative provider), and Prometheus is our internal metrics collection system, scraping application metrics from backend services as well as core infrastructure such as our messaging, database and Kubernetes clusters, all behind a Grafana frontend.

It’s quite an array, and for a developer it can be confusing where your metrics should fit in this landscape. For that reason Micrometer is a perfect companion for a Java engineer. It is the facade between application metrics and the metrics infrastructure. As they describe it themselves, it is the SLF4J for metrics.

To add to that, the monitoring space is highly competitive, with various service providers fighting for your subscription. Putting a vendor-neutral facade between you and their APIs makes total sense: switching from one service provider to another should be completely transparent to application-side metric emission.

Micrometer: Setup

Micrometer is a project adopted by and integrated into the Spring ecosystem, in which we at Picnic are heavily invested, utilising projects like Project Reactor (for further details on Reactive Programming at Picnic see here), Spring WebFlux and Spring Core.

Let’s see how it is configured. If you are running Spring Boot 2.x then congrats, you already have it. If you aren’t then you need to use their legacy dependency:

<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-spring-legacy</artifactId>
  <version>${micrometer.version}</version>
</dependency>

Next we need to configure some MeterRegistry instances. There is a MeterRegistry implementation for each vendor that Micrometer supports, wrapping that vendor's telemetry SDK. If you are a vendor, this is where you contribute; if you are a consumer like us, then it is simple:

<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
  <version>${micrometer.version}</version>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-newrelic</artifactId>
  <version>${micrometer.version}</version>
</dependency>

Depending on which vendors you use, the configuration will differ; just check which vendors Micrometer supports. For example, Prometheus is a pull-based metrics system and New Relic is a push-based one. Therefore New Relic requires some account details, while Prometheus just requires enabling the management endpoint in your properties file:

# New Relic
management.metrics.export.newrelic.enabled=true
management.metrics.export.newrelic.api-key=YOUR_KEY
management.metrics.export.newrelic.account-id=YOUR_ACCOUNT_ID
# Prometheus
management.metrics.export.prometheus.enabled=true

All that is left now configuration-wise is to inject your MeterRegistry into any bean:

class SomeService {

  private final MeterRegistry reg;

  SomeService(MeterRegistry reg) {
    this.reg = reg;
  }
}

This injects a CompositeMeterRegistry that combines multiple vendor meter registries, but that is completely transparent to your application code, which doesn't care where, when or how this is done. The MeterRegistry API is where the fun happens: here we have access to a number of different types of meters.

Micrometer: Meters

There are a number of different meter types: Timer, Counter, Gauge, DistributionSummary, LongTaskTimer, FunctionCounter, FunctionTimer and TimeGauge. We have found that we use a handful of these the most frequently, and here is how the Meter APIs look in use.
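A minimal sketch of the two we reach for most often; the meter names, tags and the checkoutService call below are hypothetical:

// Count an event: the counter is created and registered on first use, then cached.
meterRegistry.counter("orders.placed", "region", "nl").increment();

// Time a unit of work: the duration is recorded when the supplier returns.
Order order =
    meterRegistry
        .timer("orders.checkout.duration", "region", "nl")
        .record(() -> checkoutService.placeOrder(basket));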

They are rather self-explanatory, but the way in which you use them when charting can leave you scratching your head.

Let’s start with the easiest: the Timer. You can simply time a provided function call or a segment of an operation. For the former, APM tracing can already cover your function calls in individual traces; however, if you want to avoid possible sampling, or if your APM metrics don't live in the same place as your Micrometer metrics, then it can be very handy. You can also start and finish a timing at any point in your application, as long as you can pass the resulting Sample around.
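A sketch of that last pattern, assuming the same injected MeterRegistry; Timer.start and Sample.stop are the Micrometer calls, while the meter name and the reserveDeliverySlot() call (assumed to return a CompletableFuture) are made up for illustration:

// Start the clock; the Sample can be handed to whatever code eventually completes the work.
Timer.Sample sample = Timer.start(meterRegistry);

reserveDeliverySlot()
    .whenComplete(
        (slot, error) ->
            // Stop against a named, tagged timer once the outcome is known.
            sample.stop(
                meterRegistry.timer(
                    "delivery.slot.reservation",
                    "outcome",
                    error == null ? "success" : "failure")));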

Let’s move on to the Counter. This should be used for monotonically increasing values in your application domain: for example, the number of customer orders, the number of requests, or the number of articles.

Here is where the confusing part lies. The Gauge is also used to count values, but it should be seen as a snapshot of some state that is not monotonically increasing; values that can fluctuate. For example, the number of active user sessions, current memory usage or the number of open delivery slots are all values whose state must be captured at one instant.
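A sketch of gauging such fluctuating state, with a hypothetical session map; note that the registry only keeps a weak reference to whatever a gauge observes, so the application has to hold on to the object itself:

// The gauge reports the map's size whenever metrics are scraped or published.
private final Map<String, UserSession> activeSessions =
    meterRegistry.gaugeMapSize(
        "sessions.active", Tags.of("region", "nl"), new ConcurrentHashMap<String, UserSession>());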

For creating the right metric, there is one rule of thumb to follow: always try to push aggregations and calculations to your querying frontend, be it NRQL, PromQL or whichever query language your metrics provider offers. Try to avoid calculating or pre-aggregating in your application; leave your metrics raw. The benefit of doing this is twofold: you have less chance of getting your metrics wrong and having to change them in code, and it makes metrics more reusable. Given the number of times I have changed queries, life would have been ten times more inconvenient had those calculations been embedded in my application code. I’ve lost count of the number of times that all the tools I needed to define new KPIs and queries were already at my disposal. Leave the hard work to the querying, not the metric production.
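As a tiny sketch of the difference (the meter names here are hypothetical): record the raw event and push the rate calculation into the query language.

// Anti-pattern: a pre-aggregated rate computed in the application.
// meterRegistry.gauge("orders.per.second", ordersPerSecond);

// Instead, record the raw event (which can also be tagged per occurrence)...
meterRegistry.counter("orders.placed", "region", "nl").increment();

// ...and derive throughput at query time, e.g. in PromQL:
// rate(orders_placed_total[1m])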

Another rule of thumb: if you are measuring anything temporal in your application that isn’t a Timer, then stop right there. Your time series data store should be responsible for letting you query throughput in time buckets. If your meter name contains "per.second" then you have made a wrong turn somewhere. It also makes it impossible to tag per event occurrence, and it is very hard, if not impossible, to disaggregate such metrics. Which brings me to a very important component of Micrometer: tagging.

Micrometer: Tags

Tags add dimensions to your data. What is a metric if you can’t compare it, filter it and cross-reference it? Think about how the metric is stored, how you might want to query it later, and how you might want to compare or cross-reference it with other metrics that share the same tag dimensions. For example, if you tag all the metrics in your application with a region tag, then once you define one high-level aggregate dashboard for a country with all of your metrics plotted as time series, you can filter by region for free!

The Tag object is a simple key-value pair: Tags.of("key", "value", "key1", "value1", ...). What is important about tags and naming in Micrometer is to use dot-separated names, which Micrometer translates to each vendor’s naming conventions behind the scenes, and to be consistent. Agree on metric and tag names that all teams, products and services use, and centralise them if need be.
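A small sketch with hypothetical names and shared tag-key constants; Prometheus, for instance, would render this counter as orders_delivered_total with region and warehouse labels, while other registries apply their own conventions behind the scenes:

// Tag keys defined once as shared constants so every service uses the same dimensions.
Tags tags = Tags.of(REGION, "nl", WAREHOUSE, "ams-1");
meterRegistry.counter("orders.delivered", tags).increment();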

Micrometer: Meter Registry Customisers

If you want to change MeterRegistry beans globally, or you want to add global tags, then you are looking for a MeterRegistryCustomizer. It is pretty simple: it is a Spring bean definition describing a function that customises all MeterRegistry beans.
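A minimal sketch, assuming Spring Boot’s auto-configured registries; the class name and tag values are hypothetical:

@Configuration
class MetricsConfiguration {

  // Applied to every MeterRegistry bean before meters are registered,
  // so each meter this service emits carries these common tags.
  @Bean
  MeterRegistryCustomizer<MeterRegistry> commonTags() {
    return registry -> registry.config().commonTags("service", "stock-availability", "region", "nl");
  }
}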

Micrometer: A Picnic Use Case

OK, now that we have all of the tools we need, let’s increase observability. Let me frame it in terms of a Picnic use case. Stock availability is an important aspect of both customer satisfaction and supply chain efficiency. If we have the item you want in stock, you are a happy customer. If it is out of stock, you are an unhappy customer; in fact, we have discovered that each out-of-stock event contributes massively to decreasing conversion, with the majority of customers giving up after the 4th such event in any session.

It is no use measuring stock levels alone; that doesn’t tell us what customers want to buy. Put it this way: if you walked into a high-street supermarket one day and 99% of its products were in stock, but the 1% of products that you wanted were all out of stock, that wouldn’t make for a good shopping experience, right? The same applies to our customers and our warehouses. We needed to track stock availability as customers see it, in real time. By registering each successful stock reservation and each unsuccessful one (a stock shortage), we can later calculate the percentage of successful stock reservation quantities over all stock reservation quantities (including stock shortages). This metric is called stock availability, and the code looks like this:

private final MeterRegistry meterRegistry;

StockAvailabilityServiceImpl(MeterRegistry meterRegistry) {
  this.meterRegistry = meterRegistry;
}

...

private void registerAvailable(
    ImmutableMap<Article, Integer> articleAmounts) {
  // One counter increment per article, tagged with the article's dimensions.
  articleAmounts.forEach(
      (article, amount) -> {
        Tags tags =
            Tags.of(
                ARTICLE_ID,
                article.getArticleId(),
                ARTICLE_NAME,
                article.getArticleName(),
                SUPPLIER,
                article.getSupplier());
        meterRegistry.counter(SAS_STOCK_AVAILABLE, tags)
            .increment(amount);
      });
}

private void registerShortage(
StockShortage shortage,
String articleName,
String supplier) {
Tags tags =
Tags.of(
ARTICLE_ID,
shortage.getArticleId(),
ARTICLE_NAME,
shortage.getArticleName(),
SHORTAGE_REASON,
shortage.getReason(),
SUPPLIER,
supplier);
meterRegistry.counter(SAS_STOCK_SHORTAGE, tags)
.increment(shortage.getAmount());
}

Micrometer has now taken care of the rest and, with a little infrastructure magic of course, we have these metrics available in both Prometheus and New Relic. We created dashboards with the following queries:

NRQL:

SELECT percentage(sum(throughput), where shortageReason IS NULL) FROM sasAvailable, sasShortage TIMESERIES 1m FACET sasWarehouseId

PromQL:

sum(increase(sas_available_total[1m])) / (sum(increase(sas_shortage_total[1m])) + sum(increase(sas_available_total[1m]))) * 100

Coronavirus has been a testing time for our stock availability. On a regular day, this dashboard is completely green.

This increases observability in the following ways: we can see whether our stock forecasting was accurate, we can spot issues with a supplier’s stock, and from a technical standpoint we can tell if there is a problem in the stock availability calculations or in the article data. For each of these points of observability we can set alerts: per warehouse, per supplier, even per specific article. This shortens the time to identifying issues and their origin. There aren’t many blog posts these days without Coronavirus being slipped in at least somewhere: our latest challenge in the world of inventory keeping has been exactly that. It has put unprecedented strain on our systems, forecasting and supply chain alike, and observability of these three components has never been more critical to Picnic.

We can also plot it in fun and inventive ways, such as over a map of our warehouses.

But the intrinsic value is the same. We set alerting based on acceptable stock SLAs and now, instead of our customers telling us about low availability, we can simply observe it ourselves in real time.

Micrometer: learnings

Some learnings about Micrometer and how to use it correctly:

  • Separate instance vs service metrics. Pull-based models such as Prometheus require that you make this distinction. If your metric is a singleton for your service, i.e. it is a snapshot of some database state, then you will need to configure your application load balancer to point to a single elected instance to retrieve its value.
  • Gauge vs Counter. Most of the time you want to log an event and aggregate it later; there you need a counter. Some of the time you need a snapshot of a state at a given moment; there you need a gauge. Don’t use a gauge to try to measure throughput.
  • Consistent naming and tagging. Having consistent tagging and naming across the board helps increase observability tremendously. Observability platforms offer cross-dashboard filtering; just take a look at what becomes possible when you tag consistently. You can effectively cross dimensions and domains with shared tags.

Out of the box instrumentation

Micrometer also provides a rich set of out-of-the-box instrumentation that your APM provider perhaps does not. Below are Micrometer's core instrumentation binders; they are a great resource for enriching your APM metrics and filling in any blind spots you might have there. I have ticked those that we use at Picnic, and we plan to use more:

── binder
│ ├── BaseUnits.java
│ ├── MeterBinder.java
│ ├── cache
│ │ ├── CacheMeterBinder.java
│ │ ├── CaffeineCacheMetrics.java ✅
│ │ ├── EhCache2Metrics.java
│ │ ├── GuavaCacheMetrics.java
│ │ ├── HazelcastCacheMetrics.java
│ │ ├── JCacheMetrics.java
│ │ └── package-info.java
│ ├── db
│ │ ├── DatabaseTableMetrics.java
│ │ ├── JooqExecuteListener.java ✅
│ │ ├── MetricsDSLContext.java ✅
│ │ └── PostgreSQLDatabaseMetrics.java ✅
│ ├── http
│ │ ├── DefaultHttpServletRequestTagsProvider.java
│ │ ├── HttpRequestTags.java
│ │ ├── HttpServletRequestTagsProvider.java
│ │ ├── Outcome.java
│ │ └── package-info.java
│ ├── httpcomponents
│ │ ├── DefaultUriMapper.java
│ │ ├── HttpContextUtils.java
│ │ ├── MicrometerHttpClientInterceptor.java
│ │ ├── MicrometerHttpRequestExecutor.java
│ │ ├── PoolingHttpClientConnectionManagerMetricsBinder.java ✅
│ │ └── PoolingNHttpClientConnectionManagerMetricsBinder.java ✅
│ ├── hystrix
│ │ ├── HystrixMetricsBinder.java
│ │ ├── MicrometerMetricsPublisher.java
│ │ ├── MicrometerMetricsPublisherCommand.java
│ │ └── MicrometerMetricsPublisherThreadPool.java
│ ├── jetty
│ │ ├── InstrumentedQueuedThreadPool.java
│ │ ├── JettyConnectionMetrics.java
│ │ ├── JettyServerThreadPoolMetrics.java
│ │ └── TimedHandler.java
│ ├── jpa
│ │ ├── HibernateMetrics.java
│ │ └── HibernateQueryMetrics.java
│ ├── jvm ✅
│ │ ├── ClassLoaderMetrics.java ✅
│ │ ├── DiskSpaceMetrics.java ✅
│ │ ├── ExecutorServiceMetrics.java ✅
│ │ ├── JvmCompilationMetrics.java ✅
│ │ ├── JvmGcMetrics.java ✅
│ │ ├── JvmHeapPressureMetrics.java ✅
│ │ ├── JvmMemory.java ✅
│ │ ├── JvmMemoryMetrics.java ✅
│ │ └── JvmThreadMetrics.java ✅
│ ├── kafka
│ │ ├── KafkaClientMetrics.java
│ │ ├── KafkaConsumerMetrics.java
│ │ ├── KafkaMetrics.java
│ │ └── KafkaStreamsMetrics.java
│ ├── logging
│ │ ├── Log4j2Metrics.java
│ │ └── LogbackMetrics.java
│ ├── mongodb
│ │ ├── MongoMetricsCommandListener.java ✅
│ │ └── MongoMetricsConnectionPoolListener.java ✅
│ ├── okhttp3
│ │ └── OkHttpMetricsEventListener.java
│ ├── system
│ │ ├── FileDescriptorMetrics.java
│ │ ├── ProcessorMetrics.java
│ │ └── UptimeMetrics.java
│ └── tomcat
│ └── TomcatMetrics.java

Next steps

Here are a few points on how we plan to improve observability at Picnic in the future.

Python and OpenCensus

Micrometer is not the only vendor-neutral facade out there. The OpenTelemetry project, and the OpenCensus and OpenTracing projects it grew out of, define telemetry APIs and SDKs for Java, Python and a number of other languages, with vendors such as New Relic and Prometheus providing their own exporters and integrations. As mentioned before, we are introducing ever more Python backend services into the Picnic tech landscape.

OpenCensus, one of the projects that merged to form OpenTelemetry, provides such a vendor-neutral facade for Python applications. We are beginning work to bring this same functionality to our Python backend!

Dashboard contracts

While solid naming patterns and an eagle eye can help, there is no contract between metrics and the dashboards that plot them. If a metric is dropped, there currently isn’t a way to know until you notice that it has disappeared from your dashboard. This is a frustrating process, and somewhere contracts could help. If the dashboard is the consumer side of the contract and the application the provider side, we can devise a simple test that exports the state of a dashboard and verifies at build time that the contract is fulfilled on the application side.

Observability coverage metric

As I mentioned at the beginning of this article, one of the frustrating things about observability is that it lacks a measure. We should focus on devising one. It won’t be perfect, but something akin to code coverage could be a good start: statically analyse how much of your application’s state is externalised. It’s going to be a minuscule number for most applications, but it should act as a good baseline, just like code coverage.


Tech Lead @ Picnic. Reactive Programming, Observability & Distributed Systems Enthusiast.