Fluent Bit v4 the big news

17 Thursday Apr 2025

Posted by mp3monster in Books, Fluentbit, General, Technology

Tags

book, CNCF, development, eBPF, Fluent Bit, logs, OpenTelemetry, processors, sampling, Security, Trace, zig

With the announcement of Fluent Bit v4 at Kubecon Europe, we thought it worthwhile to take a look at what it means, aside from celebrating 10 years of Fluent Bit.

Firstly, normally using Semantic Versioning would suggest likely breaking (or incompatible changes to use SemVer wording) changes. The good news is that, like all the previous version changes for Fluent Bit the numbering change only reflects the arrival of major new features.

This is good news for me as the author of Logs and Telemetry with Fluent Bit, as it means the book remains entirely relevant. The book obviously won’t address the latest features, but we’ll try to cover those here as supplemental content.

Let’s reflect upon the new features, their benefits, and their implications.

New features in Processors allowing:
- Conditionality to be included.
- Trace sampling.
More flexible support for TLS (v1.3, choosing ciphers to enable)
New language for custom plugins in the form of Zig

Security Improvements

While security for many is not something that will get most developers excited about, there are things here that will make a CSO (Chief Security Officer) smile. Any developer who knows implementing security behaviors because it is a good thing, rather than because you have been told to do it, makes a CSO happy, puts them in a good place, to be given some more lianency when there is a need to do something that would get the CSO hot under the collar. Given this, we can now win those points with CSOs by using new Fluent Bit configurations that control TLS versions (1.1 – 1.3) and ciphers to support in use.

But even more fundamental than that are the improvements around basic credentials management. Historically, credentials and tokens had to be explicit in a configuration file or referenced back to an environment variable. Now, such values can come from a file, and as a result, there is no explicitness in the configuration. File security can manage access and visibility of such details. This will also make credentials rotation a lot easier to implement.

Processor Improvements

The processor improvements are probably the most exciting changes. Processors allow us to introduce additional activities within the pipeline as part of a process such as an input, rather than requiring additional buffer fetch and return which we see in standard plugin operations.

Of course, the downside is that if the processor introduces a lot of effort, we can create unexpected problems, such as back pressure, for example, as a result of a processor working hard on an input.

The other factor that extending processors bring is that they are not supported in classic format, meaning that to exploit such formats, you do need to define your configuration using YAML. The only thing I’m not a fan of, is that the configuration for these features does make me think I’m having to read algorithms expressed with Backus Naur form (BNF).

Trace Sampling

Firstly, the processors supporting OpenTelemetry Tracing can now sample. This is probably Fluent Bit’s only weakness in the Open Telemetry domain until now. Sampling is essential here as traces can become significant as you track application executions through many spans. When combined with each new transaction creating a new trace, traces can become voluminous. To control this explosion on telemetry data, we want to sample traces, collecting a percentage of typical traces (performance, latency, no errors, etc) and the outliers, where tracing will show us where a process is suffering, e.g., an end-to-end process is slowing because of a bottleneck. We can dictate how the sampling is applied based on values of existing attributes, the trace status, status codes, latencies, the number of spans, etc.

Conditionality in Processors

Conditionality makes it easier to respond to aspects of logs. For example, only when the logging payload has several attributes with specific values do we want to filter the event out for more attention. For example, an application reporting that it is starting up, and logs are classified as representing an error – then we may want to add a tag to the event so it can be easily filtered and routed to the escalation process.

Plugins with Zig

The enablement of Zig for plugin development (input, output and filters) is strictly an experimental feature. The contributors are confident they have covered all the typical use cases. But the innate flexibility supporting a language always represents potential edge cases never considered and may require some additional work to address.

Let’s be honest: Zig isn’t a well-known language. So, let’s start by looking briefly at it and why the community has adopted it for custom plugin development as an alternative to the existing options with Lua and WASM.

So Zig has a number of characteristics that align with the Fluent Bit ethos better than Lua and WASM, specifically:

It is a compiled rather than interpreted language, meaning that we reduce the runtime overheads of an interpreter or JIT compiler such as Lua and the proxy layer of WASM. This aligns to be very fast/minimal compute overhead to do its job, – ideal for IoT and minimising the cost of side-care container deployments.
The footprint for the Zig executable is very, very small—smaller than even a C-generated binary! As with the previous point, this lends itself to common Fluent Bit deployments.
The language definition is formally defined, compact, and freely available. This means you should be able to take a tool chain from anyone, and it is easy for specialist chip vendors to provide compilers.
Based on those who have tried, cross-compiling is far easier to deal with than working with GCC, MSVC, etc. Making it a lot easier to develop with the benefits we want from Go. Unlike Go – to connect to the C binary of Fluent Bit doesn’t require the use of a translation layer.

One of Zig’s characteristics that differs from C is its stronger typing and its approach of, rather than prescribing how edge cases are handled, e.g., null pointers, working to prevent you from entering those conditions.

Zig has been around for a few years (the first pre-release was in 2017, and the first non-pre-release was in August 2023). This is long enough for the supporting tooling to be pretty well fleshed out with package management, important building blocks such as the HTTP server, etc.

While asking a large enterprise with more conservative approaches to development (particularly when IT is seen as an overhead, and source of risk rather than a differentiator/revenue generator) to consider adopting Zig could be challenging compared to adopting, say Go. The different potential values here, make for some interesting potential.

Not Only, but Also

While we have made some significant advancements, each Fluent Bit release brings a variety of improvements in its plugins. For example, working with it with eBPF, HTTP output supports more compression techniques, such as Snappy and ZSTD, and Exit having a configurable delay.

The Plus version of library dependencies is being updated to exploit new capabilities or ensure Fluent Bit isn’t using libraries with vulnerabilities.

Additional resources

- Chronosphere announcement

- Announcement YouTube video

- book
- My links to technical resources – we’ve extended to include Zig related resources

Fluent Bit and AI: Unlocking Machine Learning Potential

30 Monday Dec 2024

Posted by mp3monster in Fluentbit, General, Technology

≈ Leave a comment

Tags

AI, artificial-intelligence, Cloud, Data Drift, development, Fluent Bit, GenAI, Machine Learning, ML, observability, Security, Technology, Tensor Lite, TensorFlow

These days, everywhere you look, there are references to Generative AI, to the point that what have Fluent Bit and GenAI got to do with each other? GenAI has the potential to help with observability, but it also needs observation to measure its performance, whether it is being abused, etc. You may recall a few years back that Microsoft was trailing new AI features for Bing, and after only having it in use for a couple of days, it had been recorded generating abusive comments and so on (Microsoft’s Tay is such an example).

But this isn’t the aspect of GenAI (or the foundations of AI with Machine Learning (ML)) I was thinking about. Fluent Bit can be linked to GenAI through its TensorFlow plugin. Is this genuinely of value or just a bit of ‘me too’?

There are plenty of backend use cases once the telemetry has been incorporated into an analytics platform, for example:

Making it easy to query and mine the observability data, such as natural language searching – to simplify expressing what is being looked for.
Outlier / Anomaly detection – when signals, particularly metrics, diverge from the normal patterns of behavior, we have the first signs of a problem. This is more Machine Learning than generative AI.
Using AI agents to tune monitoring thresholds and alerting scenarios

But these are all backend, big data style use cases and do not center on Fluent Bit’s core value of getting data sources to appropriate destination systems for such analysis or visualization.

To incorporate AI into Fluent Bit pipelines, we need to overcome a key issue – AI tends to be computationally heavy – making it potentially too slow for streams of signals being generated by our applications and too expensive given that most logs reflecting ‘business as usual’ are, in effect, low value.

There are some genuine use cases where lightweight AI can deliver value. First, we should be a little more precise. The TensorFlow plugin is the TensorFlow Lite version, also known as LiteRT. The name comes from the fact that it is a lite-weight solution intended to be deployable using small devices (by AI standards). This fits the Fluent Bit model of having a small footprint.

So, where can we put such a use case:

Translating stack traces into actionable information can be challenging. A trained ML or AI model can help classify and characterize the cause of a stack trace. As a result, we can move from the log to triggering appropriate actions.
Targeted use cases where we’ve filtered out most signal data to help analyze specific events – for example, we want to prevent the propagation of PII data downstream. Some PII data can be easily isolated through patterns using REGEX. For example, credit card IDs are a pattern of 4 digits in 4 groups. Phone numbers and email addresses can also be easily identified. However, postal addresses aren’t easy, particularly when handling multinational addresses, where the postal code/zip code can’t be used as an indicative pattern. Using AI to help with such checks means we must filter out signals to only examine messages that could accidentally carry such information.

When adopting AI into such scenarios, we have to be aware of the problems that can impact the use of ML and AI. These use cases are less high profile than the issues of hallucinations but just as important. As we’re observing software, which will change over time. As a result, payloads or data shifts (technically referred to as data drift) and the detection rate can drop. So, we need to measure the efficacy of the model. However, issues such as data drift need to be taken into account, as the scenario being detected may change in volume, reflecting changes in software usage and/or changes in how the solution works.

There are ways to help address such considerations, such as tracking false positive outcomes, and if the model can provide confidence scoring, is there a trend in the score?

Conclusion

There are good use cases for using Machine Learning (and, to an extent, Artificial Intelligence) within an observability pipeline – but we have to be selective in its application as:

The cost of the computation can outweigh the benefits
The execution time for such computation can be notably slower than our pipeline, leading to risks of back pressure if applied to every event in the pipeline.
The effectiveness and how much data drift might occur (we might initially see very good results, but then things can fall off).

Possibly, the most useful application is when the AI/ML engine has been trained to recognize patterns of events that preceded a serious operational issue (strictly, this is the use of ML).

Forward-looking

The true potential for Gen AI is when we move beyond isolating potential faults based on pattern recognition to using AI to help recommend or even trigger remediation processes.

Fluent Bit 3.2: YAML Configuration Support Explained

23 Monday Dec 2024

Posted by mp3monster in Fluentbit, General, Technology

≈ Leave a comment

Tags

book, Cloud, config, configuration, development, Fluent Bit, parsers, streams, stream_task, YAML

Among the exciting announcements for Fluent Bit 3.2 is the support for YAML configuration is now complete. Until now, there have been some outliers in the form of details, such as parser and streamer configurations, which hadn’t been made YAML compliant until now.

As a result, the definitions for parsers and streams had to remain separate files. That is no longer the case, and it is possible to incorporate parser definitions within the same configuration file. While separate configuration files for parsers make for easier re-use, it is more troublesome when incorporating the configuration into a Kubernetes deployment configuration, particularly when using a side-car deployment.

Parsers

With this advancement, we can define parsers like this:

Classic Fluent Bit

[PARSER]
    name myNginxOctet1
    format regex
    regex (?<octet1>\d{1,3})

YAML Configuration

parsers:
  - name: myNginxOctet1
    format: regex
    regex: '/(?<octet1>\d{1,3})/'

As the examples show, we swap [PARSER] for a parsers object. Then, each parser is an array of attributes starting with the parser name. The names follow a one-to-one mapping in most cases. This does break down when it comes to parsers where we can define a series of values, which in classic format would just be read in order.

Multiline Parsers

When using multiline parsers, we must provide different regular expressions for different lines. In this situation, we see each set of attributes become a list entry, as we can see here:

Classic Fluent Bit

[MULTILINE_PARSER]
  name multiline_Demo
  type regex
  key_content log
  flush_timeout 1000
  #
  # rule|<state name>|<regex>|<next state>
  rule "start_state" "^[{].*" "cont"
  rule "cont" "^[-].*" "cont"

YAML Configuration

multiline_parsers:
  - name: multiline_Demo
    type: regex
    rules:
    - state: start_state
      regex: '^[{].*'
      next_state: cont
    - state: cont
      regex: "^[-].*"
      next_state: cont

In addition to how the rules are nested, we have moved from several parameters within a single attribute(rule) to each rule having several discrete elements (regex, next_state). In addition to this, we have also changed the use of single and double quote marks.

If you want to keep the configurations for parsers and streams separate, we can continue to do so, referencing the file and name from the main configuration file. While converting the existing conf to a YAML format is the bulk of the work, in all likelihood, you’ll change the file extension to be .YAML will means you must also modify the referencing parsers_file reference in the server section of the main configuration file.

Streams

Streams follow very much the same path as parsers. However, we do have to be a lot more aware of the query syntax to remain within the YAML syntax rules.

Classic Fluent Bit

[STREAM_TASK]
  name selectTaskWithTag
  exec SELECT record_tag(), rand_value FROM STREAM:random.0;

[STREAM_TASK]
  name selectSumTask
  exec SELECT now(), sum(rand_value)   FROM STREAM:random.0;

[STREAM_TASK]
  name selectWhereTask
  exec SELECT unix_timestamp(), count(rand_value) FROM STREAM:random.0 where rand_value > 0;

YAML Configuration

stream_processor:
  - name: selectTaskWithTag
    exec: "SELECT record_tag(), rand_value FROM STREAM:random.0;"
  - name: selectSumTask
    exec: "SELECT now(), sum(rand_value) FROM STREAM:random.0;"
  - name: selectWhereTask
    exec: "SELECT unix_timestamp(), count(rand_value) FROM STREAM:random.0 where rand_value > 0;"

Note, it is pretty common for Fluent Bit YAML to use the plural form for each of the main blocks, although stream definition is an exception to the case. Additionally, both stream_processor and stream_task are accepted (although stream_task is not recognized in the main configuration file)..

Incorporating Configuration directly into the core configuration file

To support directly incorporating these definitions into a single file, we can lift the YAML file contents and apply them as root elements (i.e., at the same level as the pipeline, and service, for example).

Fluent Bit book examples

Our Fluent Bit book (Manning, Amazon UK, Amazon US, and everywhere else) has several examples of using parsers and streams in its GitHub repo. We’ve added the YAML versions of the configurations illustrating parsers and stream processing to its repository in the Extras folder.

Books Books Books

22 Tuesday Oct 2024

Posted by mp3monster in Books, General, Technology

≈ Leave a comment

Tags

book, books, development, ebook, FluentBit, manning, pbook, print, published, reading

Today we got the official notification that our book has been published …

As you can see, the eBook is now available. The print edition can be purchased from Thursday (24th Oct). If you’ve been a MEAP subscriber, you should be able to download the complete book. The book will start showing up on other platforms in the coming weeks (Amazon UK has set an availability date, and Amazon.com you can preorder).

There are some lovely review quotes as well:

A detailed dive into building observability and monitoring.

Jamie Riedesel, author of Software Telemetry

Extensive real-life examples and comprehensive coverage! It’s a great resource for architects, developers, and SREs.

Sambasiva Andaluri, IBM

A must read for anyone managing a critical IT-system. You will truly understand what’s going on in your applications and infrastructure.

Hassan Ajan, Gain Momentum

And there is more …

I hadn’t noticed until today, but the partner book Logging in Action, which covers Fluentd, is available in ebook and print as well as audio and video editions. As you can see, these are available on Manning and platforms like O’Reilly/Safari…

In Logging in Action you will learn how to:

Deploy Fluentd and Fluent Bit into traditional on-premises, IoT, hybrid, cloud, and multi-cloud environments, both small and hyperscaled
Configure Fluentd and Fluent Bit to solve common log management problems
Use Fluentd within Kubernetes and Docker services
Connect a custom log source or destination with Fluentd’s extensible plugin framework
Logging best practices and common pitfalls

Logging in Action is a guide to optimize and organize logging using the CNCF Fluentd and Fluent Bit projects. You’ll use the powerful log management tool Fluentd to solve common log management, and learn how proper log management can improve performance and make management of software and infrastructure solutions easier. Through useful examples like sending log-driven events to Slack, you’ll get hands-on experience applying structure to your unstructured data.

I have to say that my digital twin, who narrated the book, sounds pretty intelligent.

Update

Amazon UK is correct now, and has an availability date

Fluent Bit config from classic to YAML

02 Tuesday Jul 2024

Posted by mp3monster in Fluentbit, General, Technology

≈ 2 Comments

Tags

configuration, development, FluentBit, format, tool, YAML

Fluent Bit supports both a classic configuration file format and a YAML format. The support for YAML reflects industry direction. But if you’ve come from Fluentd to Fluent Bit or have been using Fluent Bit from the early days, you’re likely to be using the classic format. The differences can be seen here:

[SERVICE]
    flush 5
    log_level debug
[INPUT]
   name dummy
   dummy {"key" : "value"}
   tag blah
[OUTPUT]
   name stdout
   match *
#
# Classic Format
#

service:
    flush: 1
    log_level: info
pipeline:
    inputs:
        - name: dummy
          dummy: '{"key" : "value"}'
          tag: blah
    outputs:
        - name: stdout
          match: "*"
#
# YAML Format
#

Why migrate to YAML?

Beyond having a consistent file format, the driver is that some new features are not supported by the classic format. Currently, this is predominantly for Processors; it is fair to assume that any other new major features will likely follow suit.

Migrating from classic to YAML

The process for migrating from classic to YAML has two dimensions:

Change of formatting
- YAML indentation and plugins as array elements
- addressing any quirks such as wildcard (*) being quoted, etc
Addressing constraints such as:
- Using include is more restrictive
- Ordering of inputs and outputs is more restrictive – therefore match attributes need to be refined.

None of this is too difficult, but doing it by hand can be laborious and easy to make mistakes. So, we’ve just built a utility that can help with the process. At the moment, this solution is in an MVP state. But we hope to have beefed it up over the coming few weeks. What we plan to do and how to use the util are all covered in the GitHub readme.

The repository link (fluent-bit-classic-to-yaml-converter)

Update 4th July 24

A quick update to say that we now have a container configuration in the repository to make the tool very easy to use. All the details will be included in the readme, along with some additional features.

Update 7th July

We’ve progressed past the MVP state now. The detected include statements get incorporated into a proper include block but commented out.

We’ve added an option to convert the attributes to use Kubernetes idiomatic form, i.e., aValue rather than a_value.

The command line has a help option that outputs details such as the control flags.

Update 12th July

In the last couple of days, we pushed a little too quickly to GitHub and discovered we’d broken some cases. We’ve been testing the development a lot more rigorously now, and it helps that we have the regression container image working nicely. The Javadoc is also generating properly.

We have identified some edge cases that need to be sorted, but most scenarios have been correctly handled. Hopefully, we’ll have those edge scenarios fixed tomorrow, so we’ll tag a release version then.

Useful Quick Reference Links when Writing API Specs

13 Monday May 2024

Posted by mp3monster in APIs & microservices, General, Technology

≈ Leave a comment

Tags

API, Async, development, OAS

Whether you’re writing Asynchronous or Open APIs unless you’re doing it pretty much constantly, it is useful to have links to the specific details, to quickly check the less commonly used keywords, or to check whether you’re not accidentally mixing OpenAPI with AsyncAPI or the differences between version 2 or version 3 of the specs. So here are the references I keep handy:

API Handyman’s excellent OpenAPI Map and his spec navigator for Async and OpenAPI
Async API Schema (v3) and v2.6
Open API Specification (3.1)
JSON Schema
YAML Schema
Common Mark (as allowed in AsyncAPI for formatting text)

There are some useful ISO Specs for common data types like dates. Ideally, if you’re working in a specific industry domain, it is worth evaluating the industry standard definitions (even if you elect to use entire standardized objects). But when you’re not in such a position, it is at least work using standard ways of representing data—it saves on documentation effort.

ISO Message Catalog covers a wide range from Swift to Telecos (emphasis is on financial-related standards)
ISO 3166 – Country Codes
ISO 639 – Language Codes
ISO 8601 Date & Time

Not a standard, but still an initiative to promote consistency back by the likes of Microsoft etc, so could provide some insights/ideas/templates for common data structures – https://schema.org/

There are, of course, a lot of technology-centered standards such as media streaming, use of HTTP, etc.

These and many more resources are in my Tech resources.

Observing the Observer (Fluent Bit monitoring)

25 Thursday Apr 2024

Posted by mp3monster in Fluentbit, General, Technology

≈ Leave a comment

Tags

development, FluentBit, Grafana, Jaeger, logs, metrics, OpenSearch, Prometheus, Traces

In the Fluent Bit book I touch upon the point that we should be observing the observer. After all, if we don’t monitor our observability stack, then we’ll be operating blind and may never know until things go catastrophically wrong, and we’re getting complaints that production business solutions are down. One of the peer review comments was it would be really good to have a visual representation in the book. While I’d love to incorporate such diagrams, for them to be readable, they do use up a lot of space on the printed page, and very long chapters can also put some readers off. So, given the point wasn’t a key theme, we simply couldn’t incorporate the diagram.

But the suggestion is a good one. So we’ve created the visual representation here.

Annotated diagram showing how Fluent Bit could be used to monitor an open-source observation stack.

The diagram shows how we can monitor our observation tech stack so we don’t become ‘blind’ without knowing about it. The Outflow from the right of the diagram would send a signal to a notification service, which could be as simple as email or as user-friendly as Slack. We’ve indicated which kinds of metrics, logs, and traces could offer the most value from our observation tech stack.

If we’re running everything within a Kubernetes cluster, it would be easy to say we don’t need such a sophisticated setup as we can use Kubernetes liveness probes if the containers are well configured. While it is true if one of our services starts to fail, a liveness check should pick it up and recycle the container. But such probes only worry about the HTTP response code, not the cause. If we don’t monitor and capture more information we’ll never understand the problem. At worst we could end up seeing Kubernetes starting and then killing our containers in a vicious cycle and struggling to resolve the cause. So collecting the logs and metrics remains just as important.

How to Publish Fluent Bit Metrics and Logs

To publish Fluent Bit’s metrics to Prometheus, we need to configure the fluentbit-metrics input plugin (it does sound odd as an input, but there are reasons that become clearer in the book). We then route the output that supports using Fluent Bit as a Prometheus node exporter or makes use of the remote write API.

The log output for Fluent Bit can be configured via the command line or in the SERVICE blog (using the attributes log_file and log_level in the configuration file. Today this is setting the log threshold and identifying the log file. We can then, of course, configure a tail input plugin against the file if we want to send the logs to OpenSearch. We can also set plugin-specific logging thresholds as overrides to the Fluent Bit wide setting in the SERVICE block of configuration.

Configuring the other monitoring tools

Grafana‘s configuration will allow it to publish Prometheus scrapable metrics and Traces that are OTLP compliant can be found documented here.
Prometheus provides metrics on itself (details here) and logging controls as part of its command line and generates logfmt or JSON logs, details here.
OpenSearch‘s logs can be accessed as documented here. The Logs are created with Log4j2, which means out of the box, it will be easy to parse them. Configuring the output of slow query reports does need to be switched on. OpenSearch also illustrates a pre OpenTelemetry/OpenMetrics approach to sharing internal metrics by writing them as logs. However, there are ways to convert such log events to OTLP Metrics with Fluent Bit.
Jaeger provides metrics endpoints that are Prometheus-compatible, along with JSON-based logs, and are documented here. There is some support for tracing.

Fluent Bit the engine to power ChatOps – update

17 Sunday Mar 2024

Posted by mp3monster in chatbots, Fluentbit, General

≈ 2 Comments

Tags

chatops, Cloud, conference, demo, development, FluentBit, video

The other month, I described a presentation and demo (Fluent Bit – Powering Chat Ops) we’ll be doing for the Cloud Native Rejekts conference, which is the precursor event to KubeCon in Paris this week. Since that post, we’re excited to say that, with Patrick Stephens’s contributions from Chronosphere, the demo is now in the Fluent GitHub repo. It has been nicely packaged with a Docker Compose, so everything runs in a couple of containers.

In addition, if you want to see the presentation and hear us discuss the solution and explain how it works, we recorded part of the presentation dry run, which can be heard here (Demo) and here (Code overview).

I couldn’t be in Paris in person, so Patrick took the job of presenting in Paris, we tried to enable my remote participation but had audio issues. Hopefully, you’ll see the recording of Pat’s physical presentation here. But I did manage to collaborate in the demo:

This means that the original repo I mentioned can be viewed as a beta or upstream version (it’s cluttered with some generated code from Helidon, which we will eventually get around to exploiting and making the utility a native binary executable).

Fluent Bit with Kubernetes book update

05 Tuesday Mar 2024

Posted by mp3monster in Books, Fluentbit, Technology

≈ Leave a comment

Tags

book, development, FluentBit, review

A quick update on the book – very early this morning or late last night (depending on your perspective), we sent our development editor the final chapter of the Fluent Bit with Kubernetes book. There is still a way to go before we’re completed (with multiple reviews to happen, appropriate edits to be made, copy editing, etc. Still, it is an important milestone from an author’s perspective.

For the keen readers who have signed up for the MEAP (Manning Early Access Programme) of the book, I can confirm that the editorial team (preparation for eBook and website formatting, checking the edits to address the Technical Editor and Development Editor haven’t introduced any obvious issues) are working on the preparation of Chapter 7 – so that should be available soon. When this chapter is available, the content covering all the foundational aspects of Fluent Bit will be available. The remaining chapters reflect the advanced features.

Cloud Observability in Action – Book Review

04 Thursday Jan 2024

Posted by mp3monster in Books, General

≈ Leave a comment

Tags

book, development, FluentBit, Fluentd, manning, Michael Hausenblas, o11y, observability, OpenTelemetry, Prometheus, review

With the Christmas holidays happening, things slowed down enough to sit and catch up on some reading – which included reading Cloud Observability in Action by Michael Hausenblas from Manning. You could ask – why would I read a book about a domain you’ve written about (Logging In Action with Fluentd) and have an active book in development (Fluent Bit with Kubernetes)? The truth is, it’s good to see what others are saying on the subject, not to mention it is worth confirming I’m not overlapping/duplicating content. So what did I find?

Cloud Observability in Action by Michael Hausenblas

Cloud Observability In Action has been an easygoing and enjoyable read. Tech books can sometimes get a bit heavy going or dry, not the case here. Firstly, Michael went back to first principles, making the difference between Observability and monitoring – something that often gets muddied (and I’ve been guilty of this, as the latter is a subset of the former). Observability doesn’t roll off the tongue as smoothly as monitoring (although I rather like the trend of using O11y). This distinction, while helpful, particularly if you’re still finding your feet in this space, is good. What is more important is stepping back and asking what should we be observing and why we need to observe it. Plus, one of my pet points when presenting on the subject – we all have different observability needs – as a developer, an ops person, security, or auditors.

Next is Michael’s interesting take on how much O11y code is enough. Historically, I’ve taken the perspective – that enough is a factor of code complexity. More complex code – warrants more O11y or logging as this is where bugs are most likely to manifest themselves; secondly, I’ve looked at transaction and service boundaries. The problem is this approach can sometimes generate chatty code. I’ve certainly had to deal with chatty apps, and had to filter out the wheat from the chaff. So Michael’s approach of cost/benefit and measuring this using his B2I ratio (how much code is addressing the business problems over how much is instrumentation) was a really fresh perspective and presented in a very practical manner, with warnings about using such a measure too rigidly. It’s a really good perspective as well if you’re working on hyperscaling solutions where a couple of percentage point improvements can save tens of thousands of dollars. Pretty good going, and we’re only a couple of chapters into the book.

The book gets into the underlying ideas and concepts that inform OpenTelemetry, such as traces and spans, metrics, and how these relate to Observability. Some of the classic mistakes are called out, such as dimensioning metrics with high cardinality and why this will present real headaches for you.

As the data is understood, particularly metrics you can start to think about how to identify what normal is, what is abnormal, or an outlier. That then leads to developing Service Level Objectives (SLOs), such as an acceptable level of latency in the solution or how many errors can be tolerated.

The book isn’t all theory. The ideas are illustrated with small Go applications, which are instrumented, and the generated metrics, traces, and logs. Rather than using a technology such as Fluentd or Fluent Bit, Michael starts by keeping things simple and directly connecting the gathering of the metrics into tools such as Prometheus, Zipkin, Jaeger, and so on. In later chapters, the complexity of agents, aggregators, and collectors is addressed. Then, the choices and considerations for different backend solutions from cloud vendor-provided services such as OpenSearch, ElasticSearch, Splunk, Instana and so on. Then, the front-end visualization of the data is explored with tools such as Grafana, Kibana, cloud-provided tools, and so on.

As the book progresses, the chapters drill down into more detail, such as the differences and approaches for measuring containerized solutions vs. serverless implementations such as Lambda and the kinds of measures you may want. The book isn’t tied to technologies typically associated with modern Cloud Native solutions, but more traditional things like relational databases are taken into account.

The closing chapters address questions such as how to address alerting, incident management, and implementing SLOs. How to use these techniques and tools can help inform the development processes, not just production.

So I would recommend the book, if you’re trying to understand Observability (regardless of a cloud solution or not). If you’re trying to advance from the more traditional logging to a fuller capability, then this book is a great guide, showing what, why, and how to evaluate the value of doing so.

To come back to my opening question. The books have small points of overlap, but this is no bad thing, as it helps show how the different viewpoints intersect. I would actually say that the Observability in Action shows how the wider landscape fits together, the underlying value propositions that can help make the case for implementing a full observability solution. Then, Logging in Action and the new book, Fluent Bit with Kubernetes, give you some of the common context, and we drill into the details of how and what can be done with Fluent Bit and Fluentd. All Manning needs now is content to deep dive into Prometheus, Grafana, Jaeger, and OpenSearch to provide an end-to-end coverage of first principles to the art of the possible in Observability.

I also have to thank Michael for pointing his readers and sections of Logging in Action that directly relate and provide further depth into an area.

Phil (aka MP3Monster)'s Blog

~ from Technology to Music

Tag Archives: development

Fluent Bit v4 the big news

Security Improvements

Processor Improvements

Trace Sampling

Conditionality in Processors

Plugins with Zig

Not Only, but Also

Additional resources

Books Books Books

And there is more …

Update

Fluent Bit config from classic to YAML

Why migrate to YAML?

Migrating from classic to YAML

The repository link (fluent-bit-classic-to-yaml-converter)

Update 4th July 24

Update 7th July

Update 12th July

Useful Quick Reference Links when Writing API Specs

Observing the Observer (Fluent Bit monitoring)

How to Publish Fluent Bit Metrics and Logs

Configuring the other monitoring tools

Fluent Bit with Kubernetes book update

Cloud Observability in Action – Book Review

Further reading

Security Improvements

Processor Improvements

Trace Sampling

Conditionality in Processors

Plugins with Zig

Not Only, but Also

Additional resources

Share this:

Conclusion

Forward-looking

Share this:

Parsers

Classic Fluent Bit

YAML Configuration

Multiline Parsers

Classic Fluent Bit

YAML Configuration

Streams

Classic Fluent Bit

YAML Configuration

Incorporating Configuration directly into the core configuration file

Fluent Bit book examples

Share this:

And there is more …

Update

Share this:

Why migrate to YAML?

Migrating from classic to YAML

The repository link (fluent-bit-classic-to-yaml-converter)

Update 4th July 24

Update 7th July

Update 12th July

Share this:

Share this:

How to Publish Fluent Bit Metrics and Logs

Configuring the other monitoring tools

Share this:

Share this:

Share this:

Further reading

Share this: