Fluent Bit and AI: Unlocking Machine Learning Potential

30 Monday Dec 2024

Posted by mp3monster in Fluentbit, General, Technology

Tags

AI, artificial-intelligence, Cloud, Data Drift, development, Fluent Bit, GenAI, Machine Learning, ML, observability, Security, Technology, Tensor Lite, TensorFlow

These days, everywhere you look, there are references to Generative AI, to the point that you might ask: what have Fluent Bit and GenAI got to do with each other? GenAI has the potential to help with observability, but it also needs observing to measure its performance, detect whether it is being abused, and so on. You may recall that a few years back Microsoft was trialing new AI features for Bing, and after only a couple of days in use it was recorded generating abusive comments (Microsoft’s Tay chatbot is an earlier example of the same problem).

But this isn’t the aspect of GenAI (or the foundations of AI with Machine Learning (ML)) I was thinking about. Fluent Bit can be linked to GenAI through its TensorFlow plugin. Is this genuinely of value or just a bit of ‘me too’?

There are plenty of backend use cases once the telemetry has been incorporated into an analytics platform, for example:

Making it easy to query and mine the observability data, such as natural language searching – to simplify expressing what is being looked for.
Outlier / Anomaly detection – when signals, particularly metrics, diverge from the normal patterns of behavior, we have the first signs of a problem. This is more Machine Learning than generative AI.
Using AI agents to tune monitoring thresholds and alerting scenarios

But these are all backend, big data style use cases and do not center on Fluent Bit’s core value of getting data sources to appropriate destination systems for such analysis or visualization.

To incorporate AI into Fluent Bit pipelines, we need to overcome a key issue – AI tends to be computationally heavy – making it potentially too slow for streams of signals being generated by our applications and too expensive given that most logs reflecting ‘business as usual’ are, in effect, low value.

There are some genuine use cases where lightweight AI can deliver value. First, we should be a little more precise. The TensorFlow plugin uses TensorFlow Lite, also known as LiteRT. The name comes from the fact that it is a lightweight solution intended to be deployable on small devices (by AI standards). This fits the Fluent Bit model of having a small footprint.
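
As a rough illustration of how small the integration is, the following sketch wires the TensorFlow Lite filter into a YAML pipeline. The parameter names (input_field, model_file, include_input_fields, normalization_value) are as we understand them from the plugin documentation, and the tag, field name, and model path are hypothetical. As far as we know, the filter also has to be enabled when Fluent Bit is built, so check the documentation for your version before relying on this.

```yaml
# Sketch only: tag, field name and model path are hypothetical.
pipeline:
  inputs:
    - name: mqtt
      tag: mqtt.data
  filters:
    - name: tensorflow
      match: mqtt.data
      # record field holding the data passed to the model (assumed name)
      input_field: image
      # path to a TensorFlow Lite model file (hypothetical path)
      model_file: /opt/models/classifier.tflite
      # drop the original input field from the output record
      include_input_fields: false
      # divide input values by this number before inference
      normalization_value: 255
  outputs:
    - name: stdout
      match: mqtt.data
```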

So, where can we apply it?

Translating stack traces into actionable information can be challenging. A trained ML or AI model can help classify and characterize the cause of a stack trace. As a result, we can move from the log to triggering appropriate actions.
Targeted use cases where we’ve filtered out most signal data to help analyze specific events. For example, we want to prevent the propagation of PII data downstream. Some PII data can be easily isolated through patterns using REGEX. For example, credit card numbers typically appear as four groups of four digits. Phone numbers and email addresses can also be easily identified. However, postal addresses aren’t easy, particularly when handling multinational addresses, where the postal code/zip code can’t be used as an indicative pattern. Using AI to help with such checks means we must first filter the signals so that the model only examines messages that could accidentally carry such information.
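
The regex-friendly end of this can be handled before any model is involved. A minimal sketch, assuming logs are read with the tail input (so the text sits in the log key) and that card numbers appear as four groups of four digits, optionally separated by a space or hyphen; the path and tag are hypothetical:

```yaml
# Sketch only: drops any record whose log field looks like it carries a card number.
pipeline:
  inputs:
    - name: tail
      path: /var/log/app/*.log
      tag: app.logs
  filters:
    - name: grep
      match: app.logs
      # exclude takes "<key> <regex>"; records matching the pattern are dropped
      exclude: 'log \d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}'
  outputs:
    - name: stdout
      match: app.logs
```

A pattern this simple will also catch other 16-digit sequences, so in practice it would need tuning. The harder cases, such as postal addresses, are where a model may earn its place.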

When adopting AI into such scenarios, we have to be aware of the problems that can impact the use of ML and AI. These problems are less high profile than hallucinations but just as important. We’re observing software, which will change over time. As a result, payloads and data shift (technically referred to as data drift), and the detection rate can drop. So, we need to measure the efficacy of the model. That measurement also has to allow for the scenario being detected changing in volume, reflecting changes in software usage and/or changes in how the solution works.

There are ways to help address such considerations, such as tracking false positive outcomes and, if the model can provide confidence scoring, watching for a trend in the scores.

Conclusion

There are good use cases for using Machine Learning (and, to an extent, Artificial Intelligence) within an observability pipeline – but we have to be selective in its application as:

The cost of the computation can outweigh the benefits
The execution time for such computation can be notably slower than our pipeline, leading to risks of back pressure if applied to every event in the pipeline.
The effectiveness of the model can degrade as data drift occurs (we might initially see very good results, but then things can fall off).

Possibly, the most useful application is when the AI/ML engine has been trained to recognize patterns of events that preceded a serious operational issue (strictly, this is the use of ML).

Forward-looking

The true potential for Gen AI is when we move beyond isolating potential faults based on pattern recognition to using AI to help recommend or even trigger remediation processes.

Fluent Bit 3.2: YAML Configuration Support Explained

23 Monday Dec 2024

Posted by mp3monster in Fluentbit, General, Technology

Tags

book, Cloud, config, configuration, development, Fluent Bit, parsers, streams, stream_task, YAML

Among the exciting announcements for Fluent Bit 3.2 is that support for YAML configuration is now complete. Until now, there have been some outliers, such as parser and stream configurations, which hadn’t been made YAML compliant.

As a result, the definitions for parsers and streams had to remain in separate files. That is no longer the case, and it is possible to incorporate parser definitions within the same configuration file. While separate configuration files for parsers make for easier re-use, they are more troublesome when incorporating the configuration into a Kubernetes deployment configuration, particularly when using a side-car deployment.

Parsers

With this advancement, we can define parsers like this:

Classic Fluent Bit

[PARSER]
    name myNginxOctet1
    format regex
    regex (?<octet1>\d{1,3})

YAML Configuration

parsers:
  - name: myNginxOctet1
    format: regex
    regex: '/(?<octet1>\d{1,3})/'

As the examples show, we swap [PARSER] for a parsers object. Then, each parser is a list entry holding its attributes, starting with the parser name. The names follow a one-to-one mapping in most cases. This does break down when it comes to parsers where we can define a series of values, which in classic format would just be read in order.

Multiline Parsers

When using multiline parsers, we must provide different regular expressions for different lines. In this situation, we see each set of attributes become a list entry, as we can see here:

Classic Fluent Bit

[MULTILINE_PARSER]
  name multiline_Demo
  type regex
  key_content log
  flush_timeout 1000
  #
  # rule|<state name>|<regex>|<next state>
  rule "start_state" "^[{].*" "cont"
  rule "cont" "^[-].*" "cont"

YAML Configuration

multiline_parsers:
  - name: multiline_Demo
    type: regex
    rules:
    - state: start_state
      regex: '^[{].*'
      next_state: cont
    - state: cont
      regex: "^[-].*"
      next_state: cont

In addition to how the rules are nested, we have moved from several parameters within a single attribute (rule) to each rule having several discrete elements (state, regex, next_state). We have also changed the use of single and double quote marks.

If you want to keep the configurations for parsers and streams separate, you can continue to do so, referencing the file by name from the main configuration file. While converting the existing conf to a YAML format is the bulk of the work, in all likelihood you’ll also change the file extension to .yaml, which means you must also modify the parsers_file reference in the service section of the main configuration file.
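
A minimal sketch of that reference, assuming the parser and stream definitions have been converted and saved as parsers.yaml and streams.yaml (hypothetical file names) alongside the main configuration file:

```yaml
# Main configuration file: only the service section is shown.
service:
  flush: 1
  # point at the converted YAML files rather than the old .conf files
  parsers_file: parsers.yaml
  streams_file: streams.yaml
```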

Streams

Streams follow very much the same path as parsers. However, we do have to be a lot more aware of the query syntax to remain within the YAML syntax rules.

Classic Fluent Bit

[STREAM_TASK]
  name selectTaskWithTag
  exec SELECT record_tag(), rand_value FROM STREAM:random.0;

[STREAM_TASK]
  name selectSumTask
  exec SELECT now(), sum(rand_value)   FROM STREAM:random.0;

[STREAM_TASK]
  name selectWhereTask
  exec SELECT unix_timestamp(), count(rand_value) FROM STREAM:random.0 where rand_value > 0;

YAML Configuration

stream_processor:
  - name: selectTaskWithTag
    exec: "SELECT record_tag(), rand_value FROM STREAM:random.0;"
  - name: selectSumTask
    exec: "SELECT now(), sum(rand_value) FROM STREAM:random.0;"
  - name: selectWhereTask
    exec: "SELECT unix_timestamp(), count(rand_value) FROM STREAM:random.0 where rand_value > 0;"

Note, it is pretty common for Fluent Bit YAML to use the plural form for each of the main blocks, although the stream definition is an exception. Additionally, both stream_processor and stream_task are accepted (although stream_task is not recognized in the main configuration file).

Incorporating Configuration directly into the core configuration file

To support directly incorporating these definitions into a single file, we can lift the YAML file contents and apply them as root elements (i.e., at the same level as the pipeline, and service, for example).
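
Putting the earlier fragments together, a single-file configuration might look something like the following sketch. The parser and stream task are the ones shown above; the random input, the stdout output, and the service values are assumptions added only to make the file complete.

```yaml
# Sketch of a single configuration file with parsers and stream tasks
# declared as root elements alongside service and pipeline.
service:
  flush: 1
  log_level: info

parsers:
  - name: myNginxOctet1
    format: regex
    regex: '/(?<octet1>\d{1,3})/'

stream_processor:
  - name: selectTaskWithTag
    exec: "SELECT record_tag(), rand_value FROM STREAM:random.0;"

pipeline:
  inputs:
    # the first random input instance is addressable as random.0
    - name: random
  outputs:
    - name: stdout
      match: '*'
```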

Fluent Bit book examples

Our Fluent Bit book (Manning, Amazon UK, Amazon US, and everywhere else) has several examples of using parsers and streams in its GitHub repo. We’ve added the YAML versions of the configurations illustrating parsers and stream processing to its repository in the Extras folder.

Binary Large Objects with Fluent Bit

16 Monday Dec 2024

Posted by mp3monster in Fluentbit, General, Technology

Tags

3.2.2, Azure, Binary object, BLOB, configuration, Fluent Bit, use cases

When I first heard about Fluent Bit introducing support for binary large objects (BLOBs) in release 3.2, I was a bit surprised; handling such data structures is atypical, and some might see it as an anti-pattern. Certainly, trying to pass such large objects through the buffers could very quickly blow up unless the buffers are suitably sized.

But rather than rush to judgment, I gave it a little thought, and the use cases for handling blobs became clear. The scenarios where I’d look to blobs to help are:

Microsoft applications can create dump files (.dmp). This is the bundling of not just the stack traces but the state, which can include a memory dump and contextual data. The file is binary in nature, and guess what? It can be rather large.
While logs, traces, and metrics can tell us a lot about why a component or application failed, sometimes we have to see the payload that is being processed – is there something in the data we never anticipated? There are several different payloads that we are handling increasingly even with remote and distributed devices, namely images and audio. While we can compress these kinds of payloads, sometimes that isn’t possible as we lose fidelity through compression, and the act of compression can remove the very artifact we need.

Real-world use cases

This latter scenario I’d encountered previously. We worked with a system designed to send small images as part of product data through a messaging system, so the data was distributed to many endpoints. The problem we encountered was that the master data authoring system didn’t have any restrictions on image size. As a result, when setting up some new products in the supply chain system, a new user uploaded the ultra-high-resolution marketing images before they’d been prepared for general use. As you can imagine, these are multi-gigabyte images, not the 10s or 100s of kilobytes expected. The messaging’s allocated storage structures couldn’t cope with the payload.

We had to remotely access the failure points at the time to see what was happening and realize the issue. While the environment was distributed, it wasn’t as distributed as systems can be today, so remote access wasn’t so problematic. But in a more distributed use case, or where the data could have been submitted to the enterprise more widely, we’d probably have had more problems. Here is a case where being able to move a blob would have helped.

A use case with similar characteristics was explained in the recent Release Webinar presented by Eduardo Silva Pereira. With modern cars, particularly self-driving vehicles, being able to transfer imagery back in the event that the navigation software experiences a problem is essential.

Avoid blowing up buffers.

To move the blob without blowing up the buffering, the input plugin tells the blob-consuming output plugin about the blob rather than trying to shunt the GBs through the buffer. The output plugin (e.g., Azure Blob) takes the signal and then copies the file piece by piece. By consuming the blob in parts, we reduce the possible impacts of network disruption (ever tried to FTP a very large file over a network, only for the connection to briefly drop and, as a result, needing to start from scratch?). The sender and receiver use a database table to track the communication and progress of the pieces and reassemble the blob. Unlike other plugins, there is a reverse flow from the output plugin back to the blob input plugin to enable the process to be monitored. Once complete, the input plugin can execute post-transfer activities.

This does mean that the output plugin must have a network ‘line of sight’ to the blob. That is not an issue when this is handled within a single Fluent Bit node, but it is something to consider if you want to operate in a more distributed model.

A word to the wise

Binary objects are known to be a means by which malicious code can easily be transported within an organization. This means that while observability tooling can benefit from being able to centralize problematic data for us to examine further, we could unwittingly help a malicious actor.

We can protect ourselves in several ways. Firstly, we must understand the source location for the blob and ensure it can only contain content that we know and understand. Secondly, wherever the blob is put, make sure it is ring-fenced and that the content is subject to processes such as malware detection.

Limitations

As the blob is handled with a new payload type, the details transmitted aren’t going to be accessible to any other plugins, but given how the mechanism works, trying to do such things wouldn’t be very desirable.

Input plugin configuration

At the time of writing, the plugin configuration details haven’t been published, but with the combination of the CLI and looking at the code, we do know the input plugin has these parameters:

Attribute Name	Description
path	Location to watch for blob files – just like the path for the tail plugin
exclude_pattern	We can define patterns that exclude files other than our blob files. The pattern logic is the same as all other Fluent Bit patterns.
database_file	The location of the database file the plugin uses to track the blob files and the progress of their transfer.
scan_refresh_interval	How often the path is rescanned for new blob files.
upload_success_action	This value tells the plugin what to do when the upload succeeds. The options are: do nothing (0), the default if no option is provided; delete (1), delete the blob file; add_suffix (2), add a suffix to the file, as defined by upload_success_suffix; emit_log (3), emit a Fluent Bit log record.
upload_success_suffix	If upload_success_action is set to use a suffix, then the value provided here will be used as the suffix.
upload_success_message	This text will be incorporated into the Fluent Bit logs
upload_failure_action	These are the same options as upload_success_action but applied if the upload fails.
upload_failure_suffix	This is the failure version of upload_success_suffix
upload_failure_message	This is the failure version of upload_success_message
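
To show how these attributes might fit together, here is a minimal sketch of a pipeline pairing the blob input with the Azure Blob output. The input attribute names come from the table above; the paths, suffix, message text, account, and container names are all hypothetical, and the Azure settings shown are only the basic connection options, so verify everything against the published documentation for your version.

```yaml
# Sketch only: paths, names and credentials are placeholders.
pipeline:
  inputs:
    - name: blob
      # watch for dump files, in the same way as the tail plugin's path
      path: /var/dumps/*.dmp
      # tracks which blobs have been seen and how far their transfer has got
      database_file: /var/lib/fluent-bit/blob.db
      # on success, rename the file so it is not picked up again
      upload_success_action: add_suffix
      upload_success_suffix: .sent
      # on failure, leave the file alone and write to the Fluent Bit log
      upload_failure_action: emit_log
      upload_failure_message: blob upload failed
  outputs:
    - name: azure_blob
      match: '*'
      account_name: myaccount
      shared_key: ${AZURE_SHARED_KEY}
      container_name: dumps
      auto_create_container: on
      tls: on
```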

Output Options

Currently, the only blob output option is the Azure Blob output plugin, which works with the Azure Blob service, but support for the Amazon S3 standard is being worked on. Once this is available, the feature will be much more broadly usable, as the S3 standard is widely supported, including by all the hyperscalers.

Note

The configuration information has been figured out by looking at the code. We’ll return to this subject when the S3 endpoint is provided and use something like Minio to create a local S3 storage capability.