Have you been in a situation where you chose a specific solution to a set of problems because it seemed to be the right thing to use, only to discover later that it did not work well at all, was too expensive, or no one used it? I have been in that position multiple times. It is too easy to let the engineer brain jump to a solution before actually understanding the problem, and to assume what the problem is without doing the research.
An example: at a former employer, we used SumoLogic extensively for log analytics and monitoring of AWS workloads. It worked very well, and it was a useful tool for the operations group. Then some people from that group joined a startup and had to select the tooling for operating its AWS workloads. There, it did not work out at all: usage of the tool was low, and its value was questionable.
A key reason was that we picked the tool before understanding how the startup's needs differed from those at the old company:
- A different and smaller organization, with somewhat different roles and responsibilities
- Workloads that were not the same, with somewhat different use cases, although similar at a glance
- Different priorities
In short, the differences meant that applying the same solution pattern did not provide enough value to the organization. That does not mean SumoLogic is not a good tool - on the contrary, I think it is a very good tool and can provide significant value. It was just not the right choice in that context.
Amazon and Amazon Web Services, which pride themselves on being customer-obsessed, have an approach to product development that is about working backward from the customer: they start by writing a press release.
You certainly do not need to issue a press release, like Amazon does, to let people know which monitoring solution you have picked, but the general principle still applies.
A bit of working backward would have helped us make a better decision for the monitoring solution:
- Which roles need information about the state of the solution(s), and what are their responsibilities?
- What information do they require, in what form, and when?
Identify key performance indicators
- What are the business values that we want to uphold for our customers?
- How is that measured?
- What is the definition of a healthy state for those values?
- At what point is that good state at risk?
- What action(s) should we perform when the healthy state is at risk?
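The KPI questions above can be captured as a small piece of code. The sketch below is a minimal illustration, assuming a hypothetical "checkout latency" KPI; all names and thresholds are invented for the example.

```python
# Minimal sketch of a KPI definition, using a hypothetical
# "checkout latency" KPI. Names and thresholds are made up.
from dataclasses import dataclass


@dataclass
class Kpi:
    name: str                 # business value we want to uphold
    unit: str                 # how it is measured
    healthy_max: float        # definition of a healthy state
    at_risk_max: float        # point where the healthy state is at risk
    action_when_at_risk: str  # what to do when the healthy state is at risk

    def status(self, observed: float) -> str:
        """Classify an observed value against the KPI thresholds."""
        if observed <= self.healthy_max:
            return "healthy"
        if observed <= self.at_risk_max:
            return "at-risk"
        return "unhealthy"


checkout_latency = Kpi(
    name="checkout latency (p95)",
    unit="milliseconds",
    healthy_max=300.0,
    at_risk_max=800.0,
    action_when_at_risk="page the on-call engineer",
)

print(checkout_latency.status(250.0))  # healthy
print(checkout_latency.status(500.0))  # at-risk
```

Even if you never run such code, writing the answers down in this form forces you to be explicit about what "healthy" and "at risk" actually mean.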
Map KPIs and solution resources
- Identify resources/components that can provide data
- Identify types of insights that can be obtained (behaviour, faults, performance)
- Identify type of information to collect (logs, metrics)
- Identify source to collect from (e.g. log files, system performance data)
- Figure out thresholds and/or patterns in resources to use for alarms
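The mapping steps above can also be sketched as data. The example below assumes a hypothetical web shop; the resource names, metric names, and alarm thresholds are assumptions for illustration only.

```python
# Sketch of a KPI-to-resource mapping for a hypothetical web shop.
# Resources, metrics, and thresholds are invented for illustration.
kpi_sources = {
    "checkout latency (p95)": [
        {
            "resource": "application load balancer",
            "insight": "performance",           # type of insight
            "information": "metrics",           # type of information
            "source": "TargetResponseTime metric",
            "alarm": {"threshold_ms": 800, "datapoints": 3},
        },
        {
            "resource": "checkout service",
            "insight": "faults",
            "information": "logs",
            "source": "application log files",
            "alarm": {"pattern": "ERROR", "count_per_minute": 10},
        },
    ],
}


def alarm_candidates(kpi: str) -> list[str]:
    """List the data sources that should get an alarm for a given KPI."""
    return [entry["source"] for entry in kpi_sources.get(kpi, [])]


print(alarm_candidates("checkout latency (p95)"))
```

The point is not the exact structure, but that each KPI is explicitly traced to the resources, signal types, and alarm conditions that support it.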
Reports, alerts, actions
- What to report, and to whom
- How to deliver report data
- Which formats to use
- Determine severity
- Actions (automated, manual)
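The reporting decisions above boil down to a routing table. Here is a minimal sketch; the severity levels, delivery channels, and actions are assumptions for the example, not a recommendation.

```python
# Sketch of routing report data by severity: who gets it, in what
# format, and whether the action is automated or manual. The levels
# and channels below are assumptions for illustration.
ROUTING = {
    "critical": {
        "deliver_to": "on-call pager",
        "format": "short text",
        "action": "automated rollback, then manual follow-up",
    },
    "warning": {
        "deliver_to": "team chat channel",
        "format": "summary with links",
        "action": "manual review",
    },
    "info": {
        "deliver_to": "weekly report email",
        "format": "aggregated dashboard export",
        "action": "none",
    },
}


def route(severity: str) -> dict:
    """Look up how a finding of the given severity should be handled."""
    try:
        return ROUTING[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity}")


print(route("critical")["deliver_to"])  # on-call pager
```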
It is an iterative process: it involves getting into enough detail to answer the questions, and revisiting the answers as requirements change.
The general pattern here applies to many areas, not just monitoring solutions. The pattern also applies to automation. When should you automate a process or activity?
The answer is “it depends”, or rather, it is not the right question to ask first. If you ask the question that way, you may get answers such as “When you have repeated it three times” or “Always”. But there is no context in such answers.
A better question to start with is: why should this process or activity be automated? At the surface level, answers that come up might include:
- To save time for repeated tasks
- To be (more) consistent
- To avoid human error
- To distill expert knowledge into something that others may use
- To have some documentation of the steps of the process or activity
But these are still a bit vague. They may all be valid to some extent, yet they should still trace back to a defined business value or objective. You may well end up with most of your processes automated, but the steps to get there may be very different.
Start with your customer and work backward. It requires discipline and practice.
Even at Amazon Web Services, where they presumably practice this every day and it is part of the corporate culture, the customer experiences they provide can still be somewhat crappy. I do think there are other factors at play as well.
What do you think about working backward from the customer?