Metrics based monitoring and the wonders of data in Operations

Introduction

IT operations produce a whole lot of data daily… hourly… every second, even. Just think about the vast amount of work a system has to do. A core concept behind system infrastructure is uptime. To ensure uptime, we design platforms that are highly available, redundant and self-healing. Or at least that is what we all tell ourselves. Truth be told, this level of automation is only achievable after a very concerted effort that usually doesn’t make a lot of business sense.

We all went crazy for the DevOps and SRE paradigms of the 2010s, but the reality is that many organizations never got to a Google level of autonomous systems. One part we all adopted from them, though, was using metrics for our monitoring needs. Any system needs to be monitored, no matter how well it heals itself. Sooner or later a problem comes up that only a human can fix, and it is at times like that when monitoring is most critical.

So I invite you to sit around a proverbial campfire while I regale you with tales of monitoring IT infrastructure.

The ancient times

Back in the day, monitoring was dominated by the Zabbixes (Zabbixae? Zabbixies?) and Nagioses of the world. Any sysadmin who has been paged for an alert from these systems, only to have it disappear or to find a bunch of logs of that alert triggering in the morning, has had the thought “If only I could capture the system as it was at the point of the error”. This dream is so old, in fact, that there were solutions for it even in the dark ages. Munin has been around since 2003 and is still kicking… ish. Even though it is quite rudimentary, relying on the then-common approach of Perl scripts, we can still see the beginnings of metrics based monitoring in it. We weren’t really doing metrics per se, but rather capturing system states with some hacky setups, and there was no standardization out there. Naturally, the state of play would evolve in the coming years.

Inventing the wheel

In Greek mythology, Prometheus is one of the Titans and a god of fire. Prometheus is best known for defying the Olympian gods by stealing fire from them (which they had originally hidden from the people) and giving it to humanity in the form of technology, knowledge, and more generally, civilization. I believe that the engineers at SoundCloud did the same with their eponymous solution to the monitoring situation we had back in the day. In our allegory the “fire” is data, which isn’t as hot, but certainly is as illuminating. More can be read about it here. Other similar solutions exist, but Prometheus is emblematic of the modern monitoring stack. Also, my not-so-eloquent analogies don’t work as well with them.

I apologize to the not-so-technically inclined out there, but I just want to gush about the engineering behind it for the next few sentences. Feel free to skip ahead if you’re not interested. Prometheus is designed so that there is a main instance which requests data from whatever targets we’ve configured - our actual servers. So far, so good. Nothing all that innovative, since Nagios and Zabbix also have a master-agent design. Where this starts to differ is that instead of relying on custom tasks being executed by those agents, we run so-called “exporters” on those machines. An exporter is, simply put, a very small web server which collects whatever data we want it to collect and serves it as text over a specific port and in a specific syntax (though it is quite human-readable). These tiny servers aren’t expected to do anything but serve the data whenever they get a request. They don’t do any caching or anything like that. More often than not they are written in Go for its speed and portability. So then, how do I control scrape intervals or adjust anything for these servers? Easy - from the master itself! The setup is simplified in such a way that everything is controlled from the master. All we need to do is install our exporter on the target, then we can “Ctrl+D” out of it and manage our config from the actual Prometheus server. Of course there’s a lot more awesome engineering that went into it, but I’ll save my ravings for another day.
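For the curious, here is a rough idea of how little an exporter has to do. This is a minimal sketch in plain Python, not how any real exporter is implemented (those are usually built on the official client libraries). The metric names, the `collect_samples` and `render_metrics` helpers and port 8000 are all made up for illustration. It reads a couple of values at request time and serves them as text on `/metrics`.

```python
"""Toy exporter: a tiny web server that serves metrics as plain text."""
import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


def collect_samples():
    """Read values at request time; nothing is cached between scrapes."""
    samples = {"demo_uptime_seconds": time.time() - START_TIME}
    try:
        # Hypothetical metric name; getloadavg is missing on some platforms.
        samples["demo_load1"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass
    return samples


def render_metrics(samples):
    """Render name/value pairs as text lines, one sample per line."""
    lines = []
    for name, value in sorted(samples.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect_samples()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and opening `http://localhost:8000/metrics` in a browser shows the text a scraper would receive. Note that nothing in here knows about scrape intervals: the exporter only answers when asked, which is what lets the master own all of the scheduling.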

Suffice it to say, we now have a way of collecting metrics to our heart’s content on any specific machine, so we can get almost constant snapshots of the state of our system.

Bump in the road

So our technologies changed and our knowledge expanded, yet with time we went back to our old ways. It’s a tale as old as time: technology advances, but if our mindset does not change with it, we quickly revert to what is familiar. As with many innovations, we need to change our approach to things. Utilizing the wheel requires one to actually build a cart instead of carrying everything by hand. Our brains understand routine and repeatability and feel secure in them, so it is common for us to feel friction when doing something in a new way.

Sadly, that is the path metrics based monitoring went down. IT teams were given the means to generate data which shows their whole system state, not just rudimentary heartbeat information - our proverbial wheel. They were not given a means to analyze that data and translate it for the rest of the organization to emphasize the problems - the cart in our story. Therefore, no meaningful business outcomes were produced - the location our cart needs to be pulled to. As a side note, having infrastructure that is more self-healing, scalable and highly available is ultimately impactful on any modern venture. At the end of the day, all companies are tech companies to a greater or lesser extent. As a result, infrastructure teams went back to using metrics based monitoring for old school, alerting-style solutions.

Building a cart

Even though one is perfectly capable of making a cart on one’s own given enough time, cart-makers do exist. Also, our analogy breaks down a little, as carts are fundamentally something we can see, whereas data driven infrastructure engineering is a bit more ephemeral.

So how do we go back to actually using metrics instead of old school alerting? Google did it out of the necessity of scale. I can’t imagine their infrastructure costs now, but surely they could fund a mission to some celestial body with what classic monitoring would cost them.

Business people tend to actually be quite data driven, and IT infrastructure already produces a bunch of data, as we established. The key is to translate all of the infra data into something that makes sense from a business perspective and provides actionable items which turn into results. Be it through custom scripts, graphing software, bespoke solutions or anything else, this is the actual “translation” layer that is required. Sometimes that layer is a human, but this goes against our goals. The fulcrum of solving this problem is to use the brain’s desire for routine, which caused the problem in the first place, to instill a more data driven approach to monitoring. When we do something long enough it becomes a routine, the new normal. It is, therefore, paramount that what we fall back on is sound and data driven, scientific even, rather than some sort of tech-priest-like behavior where we act more on superstitions and daily rituals. Any solution should be as automated as possible, since anything that depends on input from an actual person will at some point be abandoned.
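To make the “translation” layer a little less abstract, here is one possible sketch of such a custom script. It assumes we already have a series of 1/0 “was the target up?” samples (similar in spirit to the `up` series Prometheus records for every scrape) and turns them into numbers a non-technical stakeholder can read. The `availability_report` function, the scrape interval and the revenue-per-hour figure are all hypothetical, and real downtime costs are rarely this linear.

```python
def availability_report(up_samples, scrape_interval_s, revenue_per_hour):
    """Turn raw 1/0 'up' samples into business-facing numbers.

    Assumes every failed sample means the service was down for one
    full scrape interval, which is a deliberate simplification.
    """
    if not up_samples:
        raise ValueError("need at least one sample")
    failed = sum(1 for sample in up_samples if sample == 0)
    availability = 1 - failed / len(up_samples)
    downtime_seconds = failed * scrape_interval_s
    return {
        "availability_pct": round(availability * 100, 3),
        "downtime_minutes": round(downtime_seconds / 60, 1),
        # Hypothetical cost model: revenue lost in proportion to downtime.
        "estimated_cost": round(downtime_seconds * revenue_per_hour / 3600, 2),
    }


# Example with made-up data: 100 scrapes, 15 seconds apart, 4 of them failed.
report = availability_report([1] * 96 + [0] * 4, 15, 1200)
print(report)
```

The point is not the arithmetic but the output: “one minute of downtime, roughly 20 in lost revenue” starts a very different conversation than “four failed scrapes”.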

I know it feels like I haven’t given you any direct actionable items like “build a tool that does X”, only general guidelines. The truth is each use case is unique (more tech-orientated stakeholders, lower or higher reliance on tech, etc.). Many a time this is a scary and dark path, and it requires somebody who has been down that road to light the way.

Takeaways

Data is power. It is quite fascinating what one might discover if one were to simply look at the data. For example, I recently saw a map showing that hotel bookings in the direct path of the 2024 solar eclipse in the continental US had been much higher than in the rest of the country. Of course that seems obvious now, and I’m a firm believer that all inventions seem blatantly obvious when looked at in hindsight, but with data we get at least some ability to see such things beforehand.

Not all data is as simple and easy to interpret, though. Our world is fraught with malicious actors who oversimplify, under- or over-analyze, mislead, misinterpret or plainly spread false information. Then there’s also the simple risk of losing the forest for the trees. Missing critical insights because non-experts are interpreting the information, or simply wasting time generating information only to fail at the actually important thing - analysis - is an ever-present risk. Running water through a pipeline is quite wasteful if there’s nobody at the other end to drink it.

To quote a wise uncle, “With great power comes great responsibility”, and the data journey is one which requires skills to light the way. New paths need to be trodden, old ways challenged or re-thought. The specifics of that are unique to each enterprise, but one thing remains the same: having somebody light the way is critical while we work towards the end goal of making pre-emptive decisions instead of post-hoc ones and having a truly self-healing IT infrastructure.