The 12 Factor App definition is now ten years old. In the world of software that is a long time, so perhaps it’s time to revisit and review what it says. As I have spent a lot of time around logging, I’ve focussed on Factor 11, Logging.
I have been fortunate enough to present on this subject at the hybrid JAX London conference. It was great to get out and see people at a conference, rather than just the screen and chat console of online-only events.
You can see my presentation here:
The core of the presentation revolves around two themes:
- The wording puts an emphasis on looking at logs retrospectively through the use of analytics tools such as Splunk and OpenSearch.
- How literally should we read the statement about the use of stdout?
The start of the 12 Factor App description for logs describes log events as a stream, which is something I agree with. But after this initial point is made, a lot of the focus is on getting logs into an analytics platform such as Splunk or OpenSearch. The problem is that this misses the opportunities for stream processing before the logs are stored. Probably the most potential comes from:
- Processing log events as close as possible to when they occur. In doing so, we can detect critical events and push notifications to people as issues occur.
- Near real-time views of metrics relating to our logs through the use of Prometheus and Grafana. For example, how many warnings and errors are occurring, or which component is experiencing transient issues.
- Examining logs for data that should not be in the logs. For example, someone introduces code that logs some freeform text, and a user puts credit card details into that freeform text. As a result, those details get propagated to different storage locations. The handling of credit card data, and of data that makes a person identifiable, is subject to a lot of controls. Processing log events as they occur gives us an opportunity to screen out such data before it appears in multiple locations, which avoids the problem of retrospectively removing it from every location those logs may have been copied to. It may sound far-fetched, but this has been observed to have happened.
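To make the three opportunities above a little more concrete, here is a minimal sketch of a processing step that sits between the log source and storage. Everything in it is an assumption made for illustration: the event shape (a dict with `level` and `message`), the names `process_stream`, `redact_cards` and `luhn_valid`, and the card pattern, which is far simpler than what a real compliance rule would need. In practice the counts would more likely feed a metrics tool such as Prometheus than a plain `Counter`.

```python
import re
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator

# Illustrative pattern only: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(candidate: str) -> bool:
    """Apply the Luhn checksum so ordinary long numbers are not masked."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def redact_cards(message: str) -> str:
    """Mask anything in the message that looks like a valid card number."""
    def mask(match):
        text = match.group(0)
        return "[REDACTED]" if luhn_valid(text) else text
    return CARD_PATTERN.sub(mask, message)


def process_stream(
    events: Iterable[Dict[str, str]],
    level_counts: Counter,
    notify: Callable[[Dict[str, str]], None],
) -> Iterator[Dict[str, str]]:
    """Redact, count and alert on each event before it is passed on for storage."""
    for event in events:
        cleaned = {**event, "message": redact_cards(event["message"])}
        level_counts[cleaned["level"]] += 1  # feeds a near real-time metric
        if cleaned["level"] == "CRITICAL":
            notify(cleaned)  # push a notification as the issue occurs
        yield cleaned
```

Because the function is a generator, each event is screened as it arrives rather than after it has already been copied into an analytics platform.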
When it comes to Prometheus, it is worth saying that it is not restricted to Kubernetes environments (the Prometheus first steps document makes no mention of Kubernetes). Since the 12 Factor App was written in 2012, the world of streaming has evolved a great deal.
Reading the statement about using stdout can be rather disconcerting. Most development languages have inbuilt logging frameworks, freely available open-source frameworks, or both. At the very minimum, the frameworks will help:
- Provide a means to structure each log output consistently, such as the date format and the separation between different attributes.
- Provide the ability to switch off development-level logging when operating in production environments. This improves performance (less I/O and string manipulation) and reduces resource consumption in terms of log volumes. It also removes the need to change code to enable and disable logging, or to custom implement the logic for switching logging on and off.
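As a small illustration of both points, the sketch below uses Python's standard `logging` module. The environment variable name `LOG_LEVEL`, the logger name `orders` and the function `configure_logging` are my own choices for the example, not anything the 12 Factor App prescribes.

```python
import logging
import os
import sys


def configure_logging(name: str = "orders", default_level: str = "INFO") -> logging.Logger:
    """Create a logger whose format is fixed and whose level comes from the environment."""
    # LOG_LEVEL is a hypothetical variable name chosen for this example.
    level = logging.getLevelName(os.environ.get("LOG_LEVEL", default_level).upper())
    if not isinstance(level, int):  # unknown level names fall back to INFO
        level = logging.INFO

    handler = logging.StreamHandler(sys.stdout)
    # One consistent layout: timestamp, level, logger name and message.
    handler.setFormatter(logging.Formatter(
        fmt="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S%z",
    ))

    logger = logging.getLogger(name)
    logger.handlers.clear()  # avoid duplicate handlers if called more than once
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger


log = configure_logging()
log.debug("cart contents: %s", ["book"])  # dropped unless LOG_LEVEL=DEBUG
log.warning("payment retry %d", 2)
```

Running the same code with `LOG_LEVEL=DEBUG` set turns the development-level output back on without touching the source.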
Using stdout and trusting the infrastructure to capture it can present some problems and overheads …
- Identifying the component that produced the log output. Even the simplest environment will have many processes running, and each process may have multiple threads. So diagnosing the origin of a log event can be difficult even on one server or container, before we scale things up.
- In Kubernetes environments, Kubernetes can intercept stdout, but the output will be associated with the pod. When the pod goes (either shut down or killed), so do the captured logs associated with it. As a result, you will lose logs if you’re not keeping up with capturing the log events from Kubernetes. That log information is likely to help you understand why Kubernetes has evicted the pod. Not only that, parts of Kubernetes still use klog, which has its own log format that isn’t very easy to consume.
- stdout is the lowest common denominator, so it is very hard to infer anything from it. If we know which application is generating the output, we can immediately infer some information about the log events. For example, an application log event has a different meaning to a log coming from a database server. If we can apply some structure, extracting more meaning programmatically becomes a lot easier.
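One way to address the first and last of these points is to have the logging framework attach the origin to every event and emit it in a structured form. The following is only a sketch using Python's standard library; the class `JsonFormatter`, the helper `build_logger`, the field names and the component name `billing-service` are all assumptions for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object that identifies its origin."""

    def __init__(self, component: str):
        super().__init__()
        self.component = component

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "component": self.component,  # which application produced the event
            "logger": record.name,
            "process": record.process,    # process id
            "thread": record.threadName,  # thread within that process
            "message": record.getMessage(),
        })


def build_logger(component: str, stream) -> logging.Logger:
    """Attach the JSON formatter to a stream such as sys.stdout."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(component))
    logger = logging.getLogger(component)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stands in for stdout so the result can be inspected
build_logger("billing-service", buffer).info("invoice %s created", "A-17")
event = json.loads(buffer.getvalue())
print(event["component"], event["thread"], event["message"])
```

The output still goes to a stream, but whatever collects it no longer has to guess which process or thread it came from.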
I think the intent of the 12 Factor App statement in this area is to promote the idea that the code should not be polluted with lots of additional logic to deal with logging-related activities, such as log rotation, as it makes the objective of the code hard to see. But a logging framework can be applied in a manner that makes the code more compact than just using stdout. I say this because it masks the need to generate a timestamp. You can also make the logging calls in a manner that means they can be optimized away, rather than either adding your own conditionality or commenting stdout calls in and out. The behavior of logging should be hidden from the application and driven through configuration, with the configuration defined for the circumstances and environment being used. So if the code is running in production, it is great if we can direct logs to tools such as Fluentd without writing to a file, but a developer running the same code probably does want the logs directed to stdout or a file.
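A rough sketch of what configuration-driven logging could look like in Python is shown below. The application function never changes; only the configuration does. The environment variable `APP_ENV`, the function names, and the choice of a syslog handler pointed at `localhost:5140` (where I am assuming a Fluentd syslog input is listening) are illustrative assumptions rather than a recommended setup.

```python
import logging
import logging.config
import os


def logging_config(environment: str) -> dict:
    """Build a dictConfig dictionary for the named environment."""
    if environment == "production":
        # Assumption: Fluentd accepts syslog over UDP on localhost:5140.
        handler = {
            "class": "logging.handlers.SysLogHandler",
            "address": ("localhost", 5140),
            "formatter": "standard",
        }
        level = "WARNING"
    else:
        # Developers get everything on stdout.
        handler = {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",
            "formatter": "standard",
        }
        level = "DEBUG"
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "standard": {"format": "%(asctime)s %(levelname)s %(name)s %(message)s"},
        },
        "handlers": {"output": handler},
        "loggers": {
            "app": {"level": level, "handlers": ["output"], "propagate": False},
        },
    }


def handle_order(order_id: str, items: list) -> int:
    """Application code: no timestamps, destinations or on/off switches here."""
    logger = logging.getLogger("app")
    # The message is only formatted if DEBUG is enabled for this logger.
    logger.debug("order %s has items %s", order_id, items)
    return len(items)


# APP_ENV is a hypothetical variable name used to pick the configuration.
logging.config.dictConfig(logging_config(os.environ.get("APP_ENV", "development")))
handle_order("A-17", ["book", "pen"])
```

Note that passing the arguments separately to `logger.debug` defers the string formatting; if computing an argument is itself expensive, a guard with `logger.isEnabledFor(logging.DEBUG)` is still needed.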
One thing I think is absent, or perhaps assumed, in the 12 Factor App definition is the use of log event time-stamping. As previously mentioned, the statement calls for logging as a stream of events. Typically a stream has a temporal characteristic, even if each event is only relative to the next. The definition also points out the need to aggregate logs from multiple sources. When we aggregate the logs, we need the events to be correctly ordered regardless of the source. This means we have to have a log timestamp of some sort.
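To show why the timestamp matters for aggregation, here is a small sketch that merges two sources into one ordered stream. The line layout (an ISO 8601 timestamp, a space, then the message), the source names and the functions `parse` and `merge_sources` are assumptions for the example. It also assumes each individual source is already in time order, which is what `heapq.merge` requires.

```python
import heapq
from datetime import datetime
from typing import Dict, List, Tuple

Event = Tuple[datetime, str, str]  # (timestamp, source name, message)


def parse(line: str, source: str) -> Event:
    """Split 'ISO-8601-timestamp message' into a comparable event."""
    stamp, _, message = line.partition(" ")
    return (datetime.fromisoformat(stamp), source, message)


def merge_sources(sources: Dict[str, List[str]]) -> List[Event]:
    """Merge per-source logs into one stream ordered by timestamp."""
    streams = [
        [parse(line, name) for line in lines]
        for name, lines in sources.items()
    ]
    # Timezone-aware timestamps compare as instants, so differing offsets are fine.
    return list(heapq.merge(*streams, key=lambda event: event[0]))


merged = merge_sources({
    "web": [
        "2022-10-04T10:00:01+00:00 request received",
        "2022-10-04T10:00:05+00:00 response sent",
    ],
    "db": [
        "2022-10-04T11:00:03+01:00 query started",  # 10:00:03 in UTC
    ],
})
for stamp, source, message in merged:
    print(stamp.isoformat(), source, message)
```

Without a timestamp on each event, and one that carries its timezone offset, there would be nothing reliable to order the database event between the two web events.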
As you can see, the 12 Factor App is not fundamentally wrong, but its statements could be taken a little too literally, which always makes me a little nervous. It is just like the way some people take the Agile Manifesto’s statement about documentation out of context. It says we value working software over comprehensive documentation. That doesn’t mean you don’t need to produce documentation. All it means is that, given the conflicting demands of documentation and delivering a working solution, you go for the working solution. Software earns the money to pay our salaries (unless you’re writing a book), and you can catch up with the documentation.