Fluent Bit and AI: Unlocking Machine Learning Potential

30 Monday Dec 2024

Posted by mp3monster in Fluentbit, General, Technology

Tags

AI, artificial-intelligence, Cloud, Data Drift, development, Fluent Bit, GenAI, Machine Learning, ML, observability, Security, Technology, Tensor Lite, TensorFlow

These days, everywhere you look, there are references to Generative AI, to the point that what have Fluent Bit and GenAI got to do with each other? GenAI has the potential to help with observability, but it also needs observation to measure its performance, whether it is being abused, etc. You may recall a few years back that Microsoft was trailing new AI features for Bing, and after only having it in use for a couple of days, it had been recorded generating abusive comments and so on (Microsoft’s Tay is such an example).

But this isn’t the aspect of GenAI (or the foundations of AI with Machine Learning (ML)) I was thinking about. Fluent Bit can be linked to GenAI through its TensorFlow plugin. Is this genuinely of value or just a bit of ‘me too’?

There are plenty of backend use cases once the telemetry has been incorporated into an analytics platform, for example:

Making it easy to query and mine the observability data, such as natural language searching – to simplify expressing what is being looked for.
Outlier / Anomaly detection – when signals, particularly metrics, diverge from the normal patterns of behavior, we have the first signs of a problem. This is more Machine Learning than generative AI.
Using AI agents to tune monitoring thresholds and alerting scenarios

But these are all backend, big data style use cases and do not center on Fluent Bit’s core value of getting data sources to appropriate destination systems for such analysis or visualization.

To incorporate AI into Fluent Bit pipelines, we need to overcome a key issue – AI tends to be computationally heavy – making it potentially too slow for streams of signals being generated by our applications and too expensive given that most logs reflecting ‘business as usual’ are, in effect, low value.

There are some genuine use cases where lightweight AI can deliver value. First, we should be a little more precise. The TensorFlow plugin is the TensorFlow Lite version, also known as LiteRT. The name comes from the fact that it is a lite-weight solution intended to be deployable using small devices (by AI standards). This fits the Fluent Bit model of having a small footprint.

So, where can we put such a use case:

Translating stack traces into actionable information can be challenging. A trained ML or AI model can help classify and characterize the cause of a stack trace. As a result, we can move from the log to triggering appropriate actions.
Targeted use cases where we’ve filtered out most signal data to help analyze specific events – for example, we want to prevent the propagation of PII data downstream. Some PII data can be easily isolated through patterns using REGEX. For example, credit card IDs are a pattern of 4 digits in 4 groups. Phone numbers and email addresses can also be easily identified. However, postal addresses aren’t easy, particularly when handling multinational addresses, where the postal code/zip code can’t be used as an indicative pattern. Using AI to help with such checks means we must filter out signals to only examine messages that could accidentally carry such information.

When adopting AI into such scenarios, we have to be aware of the problems that can impact the use of ML and AI. These use cases are less high profile than the issues of hallucinations but just as important. As we’re observing software, which will change over time. As a result, payloads or data shifts (technically referred to as data drift) and the detection rate can drop. So, we need to measure the efficacy of the model. However, issues such as data drift need to be taken into account, as the scenario being detected may change in volume, reflecting changes in software usage and/or changes in how the solution works.

There are ways to help address such considerations, such as tracking false positive outcomes, and if the model can provide confidence scoring, is there a trend in the score?

Conclusion

There are good use cases for using Machine Learning (and, to an extent, Artificial Intelligence) within an observability pipeline – but we have to be selective in its application as:

The cost of the computation can outweigh the benefits
The execution time for such computation can be notably slower than our pipeline, leading to risks of back pressure if applied to every event in the pipeline.
The effectiveness and how much data drift might occur (we might initially see very good results, but then things can fall off).

Possibly, the most useful application is when the AI/ML engine has been trained to recognize patterns of events that preceded a serious operational issue (strictly, this is the use of ML).

Forward-looking

The true potential for Gen AI is when we move beyond isolating potential faults based on pattern recognition to using AI to help recommend or even trigger remediation processes.

Fluent Bit 3.2: YAML Configuration Support Explained

23 Monday Dec 2024

Posted by mp3monster in Fluentbit, General, Technology

≈ Leave a comment

Tags

book, Cloud, config, configuration, development, Fluent Bit, parsers, streams, stream_task, YAML

Among the exciting announcements for Fluent Bit 3.2 is the support for YAML configuration is now complete. Until now, there have been some outliers in the form of details, such as parser and streamer configurations, which hadn’t been made YAML compliant until now.

As a result, the definitions for parsers and streams had to remain separate files. That is no longer the case, and it is possible to incorporate parser definitions within the same configuration file. While separate configuration files for parsers make for easier re-use, it is more troublesome when incorporating the configuration into a Kubernetes deployment configuration, particularly when using a side-car deployment.

Parsers

With this advancement, we can define parsers like this:

Classic Fluent Bit

[PARSER]
    name myNginxOctet1
    format regex
    regex (?<octet1>\d{1,3})

YAML Configuration

parsers:
  - name: myNginxOctet1
    format: regex
    regex: '/(?<octet1>\d{1,3})/'

As the examples show, we swap [PARSER] for a parsers object. Then, each parser is an array of attributes starting with the parser name. The names follow a one-to-one mapping in most cases. This does break down when it comes to parsers where we can define a series of values, which in classic format would just be read in order.

Multiline Parsers

When using multiline parsers, we must provide different regular expressions for different lines. In this situation, we see each set of attributes become a list entry, as we can see here:

Classic Fluent Bit

[MULTILINE_PARSER]
  name multiline_Demo
  type regex
  key_content log
  flush_timeout 1000
  #
  # rule|<state name>|<regex>|<next state>
  rule "start_state" "^[{].*" "cont"
  rule "cont" "^[-].*" "cont"

YAML Configuration

multiline_parsers:
  - name: multiline_Demo
    type: regex
    rules:
    - state: start_state
      regex: '^[{].*'
      next_state: cont
    - state: cont
      regex: "^[-].*"
      next_state: cont

In addition to how the rules are nested, we have moved from several parameters within a single attribute(rule) to each rule having several discrete elements (regex, next_state). In addition to this, we have also changed the use of single and double quote marks.

If you want to keep the configurations for parsers and streams separate, we can continue to do so, referencing the file and name from the main configuration file. While converting the existing conf to a YAML format is the bulk of the work, in all likelihood, you’ll change the file extension to be .YAML will means you must also modify the referencing parsers_file reference in the server section of the main configuration file.

Streams

Streams follow very much the same path as parsers. However, we do have to be a lot more aware of the query syntax to remain within the YAML syntax rules.

Classic Fluent Bit

[STREAM_TASK]
  name selectTaskWithTag
  exec SELECT record_tag(), rand_value FROM STREAM:random.0;

[STREAM_TASK]
  name selectSumTask
  exec SELECT now(), sum(rand_value)   FROM STREAM:random.0;

[STREAM_TASK]
  name selectWhereTask
  exec SELECT unix_timestamp(), count(rand_value) FROM STREAM:random.0 where rand_value > 0;

YAML Configuration

stream_processor:
  - name: selectTaskWithTag
    exec: "SELECT record_tag(), rand_value FROM STREAM:random.0;"
  - name: selectSumTask
    exec: "SELECT now(), sum(rand_value) FROM STREAM:random.0;"
  - name: selectWhereTask
    exec: "SELECT unix_timestamp(), count(rand_value) FROM STREAM:random.0 where rand_value > 0;"

Note, it is pretty common for Fluent Bit YAML to use the plural form for each of the main blocks, although stream definition is an exception to the case. Additionally, both stream_processor and stream_task are accepted (although stream_task is not recognized in the main configuration file)..

Incorporating Configuration directly into the core configuration file

To support directly incorporating these definitions into a single file, we can lift the YAML file contents and apply them as root elements (i.e., at the same level as the pipeline, and service, for example).

Fluent Bit book examples

Our Fluent Bit book (Manning, Amazon UK, Amazon US, and everywhere else) has several examples of using parsers and streams in its GitHub repo. We’ve added the YAML versions of the configurations illustrating parsers and stream processing to its repository in the Extras folder.

Binary Large Objects with Fluent Bit

16 Monday Dec 2024

Posted by mp3monster in Fluentbit, General, Technology

≈ Leave a comment

Tags

3.2.2, Azure, Binary object, BLOB, configuration, Fluent Bit, use cases

When I first heard about Fluent Bit introducing the support binary large objects (BLOBs) in release 3.2. I was a bit surprised; often, handling such data structures is typical, and some might see it as an anti-pattern. Certainly, trying to pass such large objects through the buffers could very quickly blow up unless buffers are suitably sized.

But rather than rush to judgment, the use cases for handling blobs became clear after a little thought. First of all, there are some genuine use cases. The scenarios I’d look to blobs to help are for:

Microsoft applications can create dump files (.dmp). This is the bundling of not just the stack traces but the state, which can include a memory dump and contextual data. The file is binary in nature, and guess what? It can be rather large.
While logs, traces, and metrics can tell us a lot about why a component or application failed, sometimes we have to see the payload that is being processed – is there something in the data we never anticipated? There are several different payloads that we are handling increasingly even with remote and distributed devices, namely images and audio. While we can compress these kinds of payloads, sometimes that isn’t possible as we lose fidelity through compression, and the act of compression can remove the very artifact we need.

Real-world use cases

This later scenario I’d encountered previously. We worked with a system designed to send small images as part of product data through a messaging system, so the data was disturbed by too many endpoints. A scenario we encountered was the master data authoring system, which didn’t have any restrictions on image size. As a result, when setting up some new products in the supply chain system, a new user uploaded the ultra-high-resolution marketing images before they’d been prepared for general use. As you can imagine, these are multi-gigabyte images, not the 10s or 100s of kilobytes expected. The messaging’s allocated storage structures couldn’t cope with the payload.

We had to remotely access the failure points at the time to see what was happening and realize the issue. While the environment was distributed, it wasn’t as distributed as systems can be today, so remote access wasn’t so problematic. But in a more distributed use case, or where the data could have been submitted to the enterprise more widely, we’d probably have had more problems. Here is a case where being able to move a blob would have helped.

A similar use case was identified in the recent Release Webinar presented by Eduardo Silva Pereira, and a use case with these characteristics was explained. With modern cars, particularly self-driving vehicles, being able to transfer imagery back in the event navigation software experiences a problem is essential.

Avoid blowing up buffers.

To move the Blob without blowing up the buffering, the input plugin tells the blob-consuming output plugin about the blob rather than trying to shunt the GBs through the buffer. The output plugin (e.g., Azure Blob) takes the signal and then copies the file piece by piece. By consuming their blob in parts, we reduce the possible impacts of network disruption (ever tried to FTP a very large file over a network for the connection to briefly drop, as a result needing to from scratch?). The sender and receiver use a database table to track the communication and progress of the pieces and reassemble the blob. Unlike other plugins, there is a reverse flow from the output plugin back to the blob plugin to enable the process to be monitored. Once complete, the input plugin can execute post-transfer activities.

This does mean that the output plugin must have a network ‘line of sight’ to the blob when this is handled within a single Fluent Bit node – but it is something to consider if you want to operate in a more distributed model.

A word to the wise

Binary objects are known to be a means by which malicious code can easily be transported within an organization. This means that while observability tooling can benefit from being able to centralize problematic data for us to examine further, we could unwittingly help a malicious actor.

We can protect ourselves in several ways. Firstly, we must first understand and ensure the source location for the blob can only contain content that we know and understand. Secondly, wherever the blob is put, make sure it is ring-fenced and that the content is subject to processes such as malware detection.

Limitations

As the blob is handled with a new payload type, the details transmitted aren’t going to be accessible to any other plugins, but given how the mechanism works, trying to do such things wouldn’t be very desirable.

Input plugin configuration

At the time of writing, the plugin configuration details haven’t been published, but with the combination of the CLI and looking at the code, we do know the input plugin has these parameters:

Attribute Name	Description
path	Location to watch for blob files – just like the path for the tail plugin
exclude_pattern	We can define patterns that exclude files other than our blob files. The pattern logic, is the same as all other Fluent Bit patterns.
database_file	These are the same options as upload_success_action but are applied if the upload fails.
scan_refresh_interval	These are the same options as upload_success_action but are applied if the upload fails.
upload_success_action	This is a value that tells the plugin what to do, when successful. The options are: 0. Do nothing – the default action if no option is provided. delete (1). Delete the blob file add_suffix (2). Emit a Fluent Bit log record emit_log (3). Add suffix to the file – as defined by upload_success_suffix
upload_success_suffix	If the upload success_action is set to use a suffix, then the value provided here will be used as the suffix.
upload_success_message	This text will be incorporated into the Fluent Bit logs
upload_failure_action	These are the same options as upload_success_action but applied if the upload fails.
upload_failure_suffix	This is the failure version of upload_success_suffix
upload_failure_message	This is the failure version of upload_success_message

Output Options

Currently, the only blob output option is for the Azure Blob output plugin that works with the Azure Blob service, but support through using the Amazon S3 standard is being worked on. Once this is available, the feature will be widely available as the S3 standard is widely supported, including all the hyperscalers.

Note

The configuration information has been figured out by looking at the code. We’ll return to this subject when the S3 endpoint is provided and use something like Minio to create a local S3 storage capability.

Securing Fluent Bit operations

18 Monday Nov 2024

Posted by mp3monster in Fluentbit, General, Technology

≈ Leave a comment

Tags

Fluent Bit

I’ve been peer-reviewing a book in development called ML Workloads with Kubernetes for Manning. The book is in its first review cycle, so it is not yet available in the MEAP programme. I mention this because the book’s first few chapters cover the application of Apache Airflow and Juypter Notebooks on a Kubernetes platform. It highlights some very flexible things that, while pretty cool, could be seen by some organizations as potential attack vectors. I should say, the authors have engaged with security considerations from the outset). My point is that while we talk about various non-functional considerations, including security, there isn’t a section dedicated to security. So, we’re going to talk directly about some security considerations here.

It would be very easy to consider security as not being important when it comes to observability – but that would be a mistake, for a few reasons:

Logging Payloads

It is easy to incorporate all an application’s data payloads into observability signals such as traces and logs. It’s an easy mistake to make during initial development – you just want to initially see everything is being handled as intended during development, so include the payload. While we can go back and clean this up or even remove such output as we tidy up code – these things can slip through the wires. Just about any application today will want login credentials. Input credentials are about identifying who we are and determining if or what we can see. The fact that they can uniquely identify us is where we usually run into Data Protection law.

It isn’t unusual for systems to be expected to record who does what and when – all part of common auditing activities. That means our identity is going to often be attached to data flowing through our application.

This makes anywhere the records this data a potential gold mine of data, and the lack of diligence will mean that our operational support tools and processes will be soft targets.

Code Paths

Our applications will carry details of execution paths – from trace-related activities to exception stacks. We need this information to diagnose issues – it is even possible that the code will handle the issues, but it is typical to record the stack trace so we can see that the application has had to perform remediation (even if that is simply because we decided to catch an exception rather than have defensive code). So what? Well, that information tells us as developers what the application is doing – but in the wrong hands, that tells the consumer how they can induce errors and what third-party libraries we’re using (which means the reader can deduce what vulnerabilities we have) (see what OWASP says on the matter here).

Sometimes, our answer to a vulnerability might not be to fix it but to introduce mitigation strategies—e.g., we’ll block direct access to a system. The issue with such mitigations is that people will forget why they’re there or subvert them for the best of reasons, leaving them accidentally vulnerable again. So, minimizing exposure should be the second line of defense.

How does this relate to Fluent Bit?

Well, the first thing is to assume that Fluent Bit is handling sensitive data, remind ourselves of this from time to time, and even test it. This alone immediately puts us in a healthier place, and we at least know what risks are being taken.

Fluent Bit support SSL/TLS for network traffic

SSL/TLS traffic involves certificates; setting up and maintaining such things can be a real pain, particularly if the processes around managing certificates haven’t been carefully thought through and automated. Imposing the management of certificates with manual processes is the fastest way to kill off their adoption and use. Within an organization, certificates don’t have to be expensive ones that offer big payouts if compromised, such as those provided by companies like Thawte and Symantec. The Linux Foundation with Let’s Encrypt and protocols like ACME (Automated Certificate Management Environment) make it cost-free and provide automation for regular certificate rotation.

Don’t get suckered by the idea that SSL stripping at the perimeter is acceptable today. It used to be an acceptable thing to do because, among other reasons, the overhead of the processing of certificates was a measurable overhead. Moore’s law has seen to it that such computational overhead is tolerable if not fractions of a percentage cost. If not convinced, then consider the fact that there is sufficient drive that Kubernetes supports mutual SSL between containers that are more than likely to be actually running on the same physical server.

Start by Considering File systems on logs

If you’re working with applications or frameworks that direct logs to local files, you can do a couple of things. First, control the permissions on the files.

Many frameworks that support logging configuration don’t do anything with the logs (although some do, like Airflow). For those cases where log location doesn’t have a behavioral impact, we can look to control where the logs are being written. Structuring logs into a common part of the file system can make things easier to manage, certainly from a file system permissions viewpoint.

Watching for sensitive data bleed

If you’re using Fluent Bit to consolidate telemetry into systems like Loki, etc., then we should be running regular scans to ensure that no unplanned sensitive data is being caught. We can use tools like Telemetrygen to inject values into the event stream to test this process and see if the injected values are detected.

If or when such a situation occurs, the ideal solution is to fix the root cause. But, this isn’t always possible when the issue comes through a 3rd party library, an organization is reluctant to make changes or production changes are slow. In these scenarios and discussed in the book, we can use Fluent Bit configurations to mitigate the propagation of such data. But as we said earlier, if you use mitigations, it warrants verifying they aren’t accidentally undone, which takes us back to the start of this point.

Classifying and Tagging data

Telemetry, particularly traces and logs can be classified and tagged to reflect information about origin, nature of the event. This is mostly done nearest the source as understanding the origin helps the classification process. This task is something Fluent Bit can easily do and route accordingly as we can see in the book.

Don’t run Fluent Bit as root

Not running Fluent Bit with root credentials is security 101. But it is tempting when you want to use Fluent Bit to tap in and listen to the OS and platform logs and metrics, particularly if you aren’t a Linux specialist. It is worth investing in getting an OS base configuration that is secure while not preventing your observability. This doesn’t automatically mean you must use containers. Bare metal, etc., can be secured by not installing from a vendor base image but an image you’ve built, or even simpler, taking the base image and then using tools like Chef, Ansible, etc., to impose a configuration over the top.

Bottom Line

The bottom line is, as long as we keep in mind that our observability processes and data should be subject to the same care and consideration as our business data, along with the fact that security should never be an afterthought, something that we bolt on just before go live and pervasive rather than just at the boundary.

When I learnt to drive (in the dark ages), one of the things I was told is – if you assume that everyone on the road is a clueless idiot, then you’ll be ok. We should look at treating systems development and the adoption of security the same way – if you assume someone is likely to make a mistake and take defensive steps — then we’ll be ok — thiswill give us security in depth.