Fluent Bit and Otel Collectors at scale

26 Thursday Feb 2026

Posted by mp3monster in Fluentbit, General, Technology

Tags

AI, artificial-intelligence, Cloud, LLM, OpAMP, Open Telemetry, OTel, Protobuf, Spec, Technology

Fluent Bit and OpenTelemetry’s Collector (as well as many other observability tools) are designed to use a distributed/agent model for deployment. This model can pose challenges, including ensuring that all agents are operating healthily and correctly configured. This is particularly true outside of a Kubernetes ecosystem. But even within a Kubernetes ecosystem, more than basic insights are required (for example, is it running flat out or over-resourced). Fluent Bit exposes its own metrics and logs so that you can either configure Fluent Bit to forward the metrics and logs to an endpoint (or allow Prometheus to scrape the metrics).

As we’re usually using Fluent Bit to collect data and route it to tools like Prometheus, Grafana, etc., or perhaps a more commercial product. So it makes sense that Fluent Bit is also sharing its own status and health.

When it comes managing the configuration of Fluent Bit, we have lots of options for Kubernetes deployments (from forcing pod replacement, to sharing configurations via persistent volume claims, and Fluent Bit reloading configs), but given that Observability needs to operate on simple virtualized and bare metal scenarios, and not everything can be treated as dynamically replaceable more general strategies such as GITOps and potentially using Istio (yes, there really good use cases for Istio outside of Kubernetes) are available. But there is also more advanced tooling, such as Puppet, Chef, and Ansible.

The challenge is non of these tools provides out-of-the-box capabilities to fully exploit the control surface that OTel Collectors and Fluent Bit offer. So the OpenTelemetry community has elected to develop a new standard called OpAMP, which fits snuggly into the OTel ecosystem.

OpAMP defines an agent/client and server model in which the central server provides control, measurement, etc., to all Collectors. The agent/client side can be deployed in two ways: wired directly into a collector or via a separate supervisor tool. Integrating the client-side directly into the client is great, as it avoids introducing a new local process. The heart of OpAMP is the message exchange, which we’ll look at more in a moment.

Today, Fluent Bit would need to use the Supervisor model without modification (although GitHub shows a feature request to support the protocol, we haven’t heard whether or when it will be implemented). But it is early days, and some aspects of the protocol are still classified as ‘development’. That said, it is worth looking more closely at OpAMP as it offers some interesting opportunities, particularly around how we could easily evolve ideas such as chatOps.

While I’m not a fan of having a peer process for Fluent Bit, on the basis that we now have two distinct processes to support observability, we could experiment with the supervisor spawning Fluent Bit as a child process, which would let the supervisor easily spot Fluent Bit failing. At the sametime the supervisor can communicate with Fluent Bit using the localhost loopback adapter and the usual APIs.

Understanding the OpAMP Protocol and what it offers

Lets start with what the OpAMP protocol offers, firstly very little of it is mandatory (this is both good and bad – it means we can build compatibility in a more incremental manner, if certain behaviours are provided elsewhere, then the agent doesn’t have to offer a capability it is also tolerant of what an agent or collector can and can’t do):

Heatbeat and announce the agent’s existence to the server, along with what the agent is capable of/allowed to do regarding OpAMP features.
Status information, including:
- environmental information
- configuration being run
- modules being used
Perform updates to resources, including:
- Agent configuration
- TLS certificate rotation
- Credentials management
- module or even an entire agent installation
issuing of custom commands.
directing the agent’s own telemetry to specific services/endpoints.

The heart of this protocol is the contract between the client (agent/collector) and the server, defined using Protobuf3. This means you can easily create the code skeleton to handle the payload objects, which will be transmitted in binary form (giving network traffic efficiency and the price of not being humanly readable or dynamically processable, insofar as you need to know the Protobuf definition to extract any meaning).

In addition to the Protobuf definitions, there are rules for handling messages, specifying when a message or response is required, message sequence numbering, and the default heartbeat frequency. But there aren’t any complex exchanges involved.

The binary payloads are exchanged over Sockets (allowing full-duplex exchanges where the server can send requests at any time) or over HTTP (providing half-duplex, aka polling/client check-in, at which point the server can respond with an instruction) – a strategy that is becoming increasingly common today.

The benefits we see

Regardless of whether you’re in a Kubernetes environment, the ability to ask agents/collectors to quickly tweak their configuration is attractive. If you start suspecting a service or application is not behaving as expected, if Fluent Bit is filtering your logs or sampling traces, you can quickly push a config change out to allow more through, providing further insight into what is going on – you don’t need to wait for a Kubernetes scheduler to roll through with replacing pods.

With the rise of AI, having a central point of contact makes it easier to wrap a central server as an MCP tool and have a natural language command interface. Along with the possibility that you can potentially send custom commands to the agent (or supervisor) to initiate a task. This was part of the functionality we effectively implemented in our original chatops showcase. The problem was that in the original solution, we deployed a small Slack bot that directed HTTP calls to Fluent Bit. With the OpAMP framework, we can direct the request to the server, which will route the command to the correct Fluent Bit node through a more trustworthy channel.

Implementation

The OpAMP protocol is likely to be widely adopted by commercial service providers (Bindplane, OneUpTime, for example), as it allows them to start working with additional agents/collectors that are already deployed (for example, a supervisor could be used to manage a Fluentd fleet, where there isnt the appetite to refactor all the configuration to Fluent Bit or an OTel Collector). Furthermore, it has the potential to simplify by standardizing functionality.

In terms of resource richness in the OpenTelemetry GitHub repo, I suspect (and hope) there will be more to come (a UI and richer details on how a base server can be extended, for example). At the time of writing, within the OpenTelemetry GitHub repository, we can see:

The published spec
Go implementation of the protocol: the server accepts and responds to messages, and the client includes some test functionality to populate messages.
A supervisor implementation that, through configuration, can be pointed at an agent to observe, and the configuration so that the supervisor is uniquely identifiable to the server. This is also implemented in Go.
Open Telemetry Collector extension

AI to Agriculture

17 Friday Oct 2025

Posted by mp3monster in General, Oracle, Technology

≈ Leave a comment

Tags

AI, artificial-intelligence, Cloud, development, Oracle, Technology

Now that details of the product I’ve been involved with for the last 18 months or so are starting to reach the public domain (such as the recent announcement at the UN General Assembly on September 25), I can talk to a bit about what we’ve been doing. Oracle’s Digital Government Global Industry Unit has been working on a solution that can help governments address the questions of food security.

So what is food security? The World Food Programme describes it as:

Food security exists when people have access to enough safe and nutritious food for normal growth and development, and an active and healthy life. By contrast, food insecurity refers to when the aforementioned conditions don’t exist. Chronic food insecurity is when a person is unable to consume enough food over an extended period to maintain a normal, active and healthy life. Acute food insecurity is any type that threatens people’s lives or livelihoods.
World Food Programme

By referencing the World Food Programme, it would be easy to interpret this as a 3^rd world problem. But in reality, it applies to just about every nation. We can see this, with the effect the war in Ukraine has had on crops like Wheat, as reported by organizations such as CGIAR, European Council, and World Development journal. But global commodities aren’t the only driver for every nation to consider food security. Other factors such as Food Miles (an issue that perhaps has been less attention over the last few years) and national farming economics (a subject that comes up if you want to it through a humour filter with Clarkson’s Farm to dry UK government reports and US Department of Agriculture.

Looking at it from another perspective, some countries will have a notable segment of their export revenue coming from the production of certain crops. We know this from simple anecdotes like ‘for all the tea in China’, coffee variants are often referred to by their country of origin (Kenyan, Columbian etc.). For example, Palm Oil is the fourth-largest economic contributor in Malaysia (here).

So, how is Oracle helping countries?

One of the key means of managing food security is understanding food production and measuring the factors that can impact it (both positively and negatively), which range from the obvious—like weather (and its relationship to soil, water management, etc.) —to what crop is being planted and when. All of which can then be overlayed with government policies for land management and farming subsidies (paying farmers to help them diversify crops, periodically allowing fields to go fallow, or subsidizing the cost of fertilizer).

Oracle is a technology company capable of delivering systems that can operate at scale. Technology and the recent progress in using AI to help solve problems are not new to agriculture; in fact, several trailblazing organizations in this space run on Oracle’s Cloud (OCI), such as Agriscout. Before people start assuming that this is another story of a large cloud provider eating their customers’ lunch, far from it, many of these companies operate at the farm or farm cooperative level, often collecting data through aerial imagery from drones and aircraft, along with ground-based sensors. Some companies will also leverage satellite imagery for localized areas to complement these other sources. This is where Oracle starts to differentiate itself – by taking high-resolution imagery (think about the resolution level needed to differentiate Wheat and Maize, or spot rice and carrots, differentiate an orchard from a natural copse of trees). To get an idea, look at Google Earth and try to identify which crops are growing.

We take the satellite multi-spectral images from each ‘satellite over flight’ and break it down, working out what the land is being used for (ruling out roads, tracks, buildings, and other land usage). To put the effort to do this into context, the UK is 24,437,600,000 square meters and is only 78^th in the list of countries by area (here). It’s this level of scale that makes it impractical to use more localized data sources (imagine how many people and the number of drones needed to fly over every possible field in a country, even at a monthly frequency).

This only solves the 1^st step of the problem, which is to tell us the total crop growing area. It doesn’t tell us whether the crop will actually grow well and produce a good yield. For this, you’re going to need to know about weather (current, forecast, and historic trends), soil chemical composition and structure, and information such as elevation, angle, etc. Combined with an understanding of optimal crop growing needs (water levels, sun light duration, atmospheric moisture, soil types and health) – good crops can be ruined by it simply being too wet to harvest them, or store them dryly. All these factors need to be taken into account for each ‘cell’ we’re detecting, so we can calculate with any degree of confidence what can be produced.

If this isn’t hard enough, we need to account for the fact that some crops may have several growing seasons per year, or succession planting is used, where Carrots may be grown between March and June, followed by Cucumbers through to August, and so on.

Using technology

Hopefully, you can see there are tens of millions of data points being processed every day, and Oracle’s data products can handle that. As a cloud vendor, we’re able to provide the computing scale and, importantly, elasticity, so we can crunch the numbers quickly enough that users benefit from the revised numbers and can work out mitigation actions to communicate to farmers. As mentioned, this could be planning where to best use fertilizer or publishing advice on when to plant which crops for optimal growing conditions. In the worst cases recognizing there is going to be a national shortage of a staple crop and start purchasing crops from elsewhere and ensure when the crops arrive in ports they get moved out to the markets (like all large operations – as we saw with the Covid crises – if you need to react quickly, more mistakes can be made, costs grow massively driven by demand).

I mentioned AI, if you have more than the most superficial awareness of AI, you will probably be wondering how we use it, and the problems of AI hallucination – the last thing you want is a being asked to evaluate something and hallucinating (injecting data/facts that are not based on the data you have collected) to create a projection. At worst, this would mean providing an indication that everything is going well, when things are about to really go wrong. So, first, most of the AI discussed today is generative, and that is where we see issues like hallucinations. We’re have and are adopting this aspect of AI where it fits best, such as explainability and informing visualization, but Oracle is making heavy use of the more traditional ideas of AI in the form of Machine Learning and Deep Learning which are best suited to heavy numerical computational uses, that is not to say there aren’t challenges to be ddressed with training the AI.

Conclusion

When it comes to Oracle’s expertise in the specialized domains of agriculture and government, Oracle has a strong record of working with governments and government agencies from its inception. But we’ve also worked closely with the Tony Blair Institute for Global Change, which works with many national government agencies, including the agriculture sector.

My role in this has been as an architect, focused primarily on applying integration techniques (enabling scaling and operational resilience, data ingestion, and how our architecture can work as we work with more and more data sources) and on applying AI (in the generative domain). We’re fortunate to be working alongside two other architects who cover other aspects of the product, such as infrastructure needs and the presentation tier. In addition, there is a specialist data science team with more PhDs and related awards than I can count.

Oracle’s Digital Government business is more than just this agriculture use case; we’ve identified other use cases that can benefit from the data and its volume being handled here. This is in addition to bringing versions of its better-known products, such as ERP, Healthcare (digital health records management, vaccine programmes, etc.), national Energy and Water (metering, infrastructure management, etc).

For more on the agricultural product:

Fluent Bit and AI: Unlocking Machine Learning Potential

30 Monday Dec 2024

Posted by mp3monster in Fluentbit, General, Technology

≈ Leave a comment

Tags

AI, artificial-intelligence, Cloud, Data Drift, development, Fluent Bit, GenAI, Machine Learning, ML, observability, Security, Technology, Tensor Lite, TensorFlow

These days, everywhere you look, there are references to Generative AI, to the point that what have Fluent Bit and GenAI got to do with each other? GenAI has the potential to help with observability, but it also needs observation to measure its performance, whether it is being abused, etc. You may recall a few years back that Microsoft was trailing new AI features for Bing, and after only having it in use for a couple of days, it had been recorded generating abusive comments and so on (Microsoft’s Tay is such an example).

But this isn’t the aspect of GenAI (or the foundations of AI with Machine Learning (ML)) I was thinking about. Fluent Bit can be linked to GenAI through its TensorFlow plugin. Is this genuinely of value or just a bit of ‘me too’?

There are plenty of backend use cases once the telemetry has been incorporated into an analytics platform, for example:

Making it easy to query and mine the observability data, such as natural language searching – to simplify expressing what is being looked for.
Outlier / Anomaly detection – when signals, particularly metrics, diverge from the normal patterns of behavior, we have the first signs of a problem. This is more Machine Learning than generative AI.
Using AI agents to tune monitoring thresholds and alerting scenarios

But these are all backend, big data style use cases and do not center on Fluent Bit’s core value of getting data sources to appropriate destination systems for such analysis or visualization.

To incorporate AI into Fluent Bit pipelines, we need to overcome a key issue – AI tends to be computationally heavy – making it potentially too slow for streams of signals being generated by our applications and too expensive given that most logs reflecting ‘business as usual’ are, in effect, low value.

There are some genuine use cases where lightweight AI can deliver value. First, we should be a little more precise. The TensorFlow plugin is the TensorFlow Lite version, also known as LiteRT. The name comes from the fact that it is a lite-weight solution intended to be deployable using small devices (by AI standards). This fits the Fluent Bit model of having a small footprint.

So, where can we put such a use case:

Translating stack traces into actionable information can be challenging. A trained ML or AI model can help classify and characterize the cause of a stack trace. As a result, we can move from the log to triggering appropriate actions.
Targeted use cases where we’ve filtered out most signal data to help analyze specific events – for example, we want to prevent the propagation of PII data downstream. Some PII data can be easily isolated through patterns using REGEX. For example, credit card IDs are a pattern of 4 digits in 4 groups. Phone numbers and email addresses can also be easily identified. However, postal addresses aren’t easy, particularly when handling multinational addresses, where the postal code/zip code can’t be used as an indicative pattern. Using AI to help with such checks means we must filter out signals to only examine messages that could accidentally carry such information.

When adopting AI into such scenarios, we have to be aware of the problems that can impact the use of ML and AI. These use cases are less high profile than the issues of hallucinations but just as important. As we’re observing software, which will change over time. As a result, payloads or data shifts (technically referred to as data drift) and the detection rate can drop. So, we need to measure the efficacy of the model. However, issues such as data drift need to be taken into account, as the scenario being detected may change in volume, reflecting changes in software usage and/or changes in how the solution works.

There are ways to help address such considerations, such as tracking false positive outcomes, and if the model can provide confidence scoring, is there a trend in the score?

Conclusion

There are good use cases for using Machine Learning (and, to an extent, Artificial Intelligence) within an observability pipeline – but we have to be selective in its application as:

The cost of the computation can outweigh the benefits
The execution time for such computation can be notably slower than our pipeline, leading to risks of back pressure if applied to every event in the pipeline.
The effectiveness and how much data drift might occur (we might initially see very good results, but then things can fall off).

Possibly, the most useful application is when the AI/ML engine has been trained to recognize patterns of events that preceded a serious operational issue (strictly, this is the use of ML).

Forward-looking

The true potential for Gen AI is when we move beyond isolating potential faults based on pattern recognition to using AI to help recommend or even trigger remediation processes.

Fluent Bit 3.2: YAML Configuration Support Explained

23 Monday Dec 2024

Posted by mp3monster in Fluentbit, General, Technology

≈ Leave a comment

Tags

book, Cloud, config, configuration, development, Fluent Bit, parsers, streams, stream_task, YAML

Among the exciting announcements for Fluent Bit 3.2 is the support for YAML configuration is now complete. Until now, there have been some outliers in the form of details, such as parser and streamer configurations, which hadn’t been made YAML compliant until now.

As a result, the definitions for parsers and streams had to remain separate files. That is no longer the case, and it is possible to incorporate parser definitions within the same configuration file. While separate configuration files for parsers make for easier re-use, it is more troublesome when incorporating the configuration into a Kubernetes deployment configuration, particularly when using a side-car deployment.

Parsers

With this advancement, we can define parsers like this:

Classic Fluent Bit

[PARSER]
    name myNginxOctet1
    format regex
    regex (?<octet1>\d{1,3})

YAML Configuration

parsers:
  - name: myNginxOctet1
    format: regex
    regex: '/(?<octet1>\d{1,3})/'

As the examples show, we swap [PARSER] for a parsers object. Then, each parser is an array of attributes starting with the parser name. The names follow a one-to-one mapping in most cases. This does break down when it comes to parsers where we can define a series of values, which in classic format would just be read in order.

Multiline Parsers

When using multiline parsers, we must provide different regular expressions for different lines. In this situation, we see each set of attributes become a list entry, as we can see here:

Classic Fluent Bit

[MULTILINE_PARSER]
  name multiline_Demo
  type regex
  key_content log
  flush_timeout 1000
  #
  # rule|<state name>|<regex>|<next state>
  rule "start_state" "^[{].*" "cont"
  rule "cont" "^[-].*" "cont"

YAML Configuration

multiline_parsers:
  - name: multiline_Demo
    type: regex
    rules:
    - state: start_state
      regex: '^[{].*'
      next_state: cont
    - state: cont
      regex: "^[-].*"
      next_state: cont

In addition to how the rules are nested, we have moved from several parameters within a single attribute(rule) to each rule having several discrete elements (regex, next_state). In addition to this, we have also changed the use of single and double quote marks.

If you want to keep the configurations for parsers and streams separate, we can continue to do so, referencing the file and name from the main configuration file. While converting the existing conf to a YAML format is the bulk of the work, in all likelihood, you’ll change the file extension to be .YAML will means you must also modify the referencing parsers_file reference in the server section of the main configuration file.

Streams

Streams follow very much the same path as parsers. However, we do have to be a lot more aware of the query syntax to remain within the YAML syntax rules.

Classic Fluent Bit

[STREAM_TASK]
  name selectTaskWithTag
  exec SELECT record_tag(), rand_value FROM STREAM:random.0;

[STREAM_TASK]
  name selectSumTask
  exec SELECT now(), sum(rand_value)   FROM STREAM:random.0;

[STREAM_TASK]
  name selectWhereTask
  exec SELECT unix_timestamp(), count(rand_value) FROM STREAM:random.0 where rand_value > 0;

YAML Configuration

stream_processor:
  - name: selectTaskWithTag
    exec: "SELECT record_tag(), rand_value FROM STREAM:random.0;"
  - name: selectSumTask
    exec: "SELECT now(), sum(rand_value) FROM STREAM:random.0;"
  - name: selectWhereTask
    exec: "SELECT unix_timestamp(), count(rand_value) FROM STREAM:random.0 where rand_value > 0;"

Note, it is pretty common for Fluent Bit YAML to use the plural form for each of the main blocks, although stream definition is an exception to the case. Additionally, both stream_processor and stream_task are accepted (although stream_task is not recognized in the main configuration file)..

Incorporating Configuration directly into the core configuration file

To support directly incorporating these definitions into a single file, we can lift the YAML file contents and apply them as root elements (i.e., at the same level as the pipeline, and service, for example).

Fluent Bit book examples

Our Fluent Bit book (Manning, Amazon UK, Amazon US, and everywhere else) has several examples of using parsers and streams in its GitHub repo. We’ve added the YAML versions of the configurations illustrating parsers and stream processing to its repository in the Extras folder.

Fluent Bit the engine to power ChatOps – update

17 Sunday Mar 2024

Posted by mp3monster in chatbots, Fluentbit, General

≈ 1 Comment

Tags

chatops, Cloud, conference, demo, development, FluentBit, video

The other month, I described a presentation and demo (Fluent Bit – Powering Chat Ops) we’ll be doing for the Cloud Native Rejekts conference, which is the precursor event to KubeCon in Paris this week. Since that post, we’re excited to say that, with Patrick Stephens’s contributions from Chronosphere, the demo is now in the Fluent GitHub repo. It has been nicely packaged with a Docker Compose, so everything runs in a couple of containers.

In addition, if you want to see the presentation and hear us discuss the solution and explain how it works, we recorded part of the presentation dry run, which can be heard here (Demo) and here (Code overview).

I couldn’t be in Paris in person, so Patrick took the job of presenting in Paris, we tried to enable my remote participation but had audio issues. Hopefully, you’ll see the recording of Pat’s physical presentation here. But I did manage to collaborate in the demo:

This means that the original repo I mentioned can be viewed as a beta or upstream version (it’s cluttered with some generated code from Helidon, which we will eventually get around to exploiting and making the utility a native binary executable).

Fluent Bit with Oracle Cloud

09 Tuesday Jan 2024

Posted by mp3monster in Books, Fluentbit, Fluentd, General, Oracle

≈ Leave a comment

Tags

book, Cloud, demo, FluentBit, logging, monitoring, o11y, observability, OCI, Oracle

The hyper scaler cloud vendors all offer Logging and monitoring capabilities. But they tend to focus on supporting their native services. If you’re aware of Oracle’s Cloud (OCI) messaging, then you’ll know that there is a strong recognition of the importance of multi-cloud. This extends not only to connecting apps across clouds but also to be able to observe and manage cloud-spanning solutions. Ultimately, most organizations want to headline observability-related views of their solutions.

Late last year, I presented these ideas, illustrating the ideas with the use of Fluent Bit and OCI’s Observability and Management products to visualize and analyze what is happening. I finally found the time to write how the very basic demo was built from a clean sheet over on the Oracle Devs blog on Medium.

Useful Resources for Fluent Bit and Observability

This also highlights the fact that the Fluent Bit book, while I believe, once completed, will be through, can’t cover everything – and certainly not build end-to-end use cases like the Oracle Observability & Management example. To help address this, the book includes an appendix of helpful additional information, some of which I have included here, along with other content that we encounter – all of which can be found at Fluentd & Fluent Bit Additional stuff.

Speeding Ruby

30 Monday Oct 2023

Posted by mp3monster in development, Fluentbit, Fluentd, General, languages, Technology

≈ Leave a comment

Tags

Cloud, development, FluentBit, Fluentd, Ruby, Ruvy, Shopify

Development trends have shown a shift towards precompiled languages like Go and Rust away from interpreted or Just-In-Time (JIT) compiled languages like Java and Ruby as it removes the startup time of the language virtual machine and the JIT compiler as well as a smaller memory footprint. All desirable features when you’re scaling containerized solutions and percentage point savings can really add up.

Oracle has been leading the way with its work on GraalVM for some years now, and as a result, not only can GraalVM be used to produce native binary images from Java code, GraalVM also supports TuffleRuby and GraalPy, among others. As TruffleRuby is an open-source project, Oracle isn’t the only vendor contributing to it, work effort has also come from Shopify.

Helping Ruby move forward isn’t new for the Shopify engineering team, and part of that investment is that they have just announced the open-sourcing of a toolchain called Ruvy. Ruvy takes Ruby and creates a WebAssembly (WASM) from it the code. This builds on the existing project ruby.wasm. In doing so they’ve addressed the Ruby startup overhead of the language VM we mentioned. They have also simplified the process of deployment, eliminating the need for Web Assembly System Interface (WASI) arguments, and overcome constraints of class loading by reading files by having the code bundled within the assembly and then accessing the content using WASI-VFS, a simple virtual file system.

The published benchmarks show a massive performance boost in the process of executing where the Ruby code needs to be executed by the packaged JIT. For me, this is interesting as one of the related cloud-native trends is the shift from Fluentd to Fluent Bit. Fluentd was built with Ruby and has a huge portfolio of third-party extensions. But Fluent Bit is built using C to get those performance gains previously described. But it does support plugins through WASM. This raises an interesting question can we take existing Ruby plugins and wrap them so the required interfacing works – which should be minimal and more likely to be impacted by the fact Fluent Bit v2 has refined the internal data structure that was common to both Fluentd and Fluent Bit to allow Fluent Bit to more easily engaged with OpenTelemetry.

If the extra bit of wrapping code isn’t complex, then applying Ruvy should mean the core plugin can then work with Fluent Bit. If this can be templated, then Fluent Bit is going to make a big leap forward with the number of available plugins.

New Article for SE Daily…

27 Thursday Apr 2023

Posted by mp3monster in ExternalWebPublications, Fluentd, General, Technology

≈ Leave a comment

Tags

Cloud, external publications, Hybrid, monitoring, multicloud, observability, SE Daily, Software Engineering Daily

We’ve just had a new article published for Software Engineering Daily which looks at monitoring in multi-cloud and hybrid use cases and highlights some strategies that can help support the single pane of glass by exploiting features in tools such as Fluentd and Fluentbit that perhaps aren’t fully appreciated. Check it out …

Challenges of Multi-Cloud and Hybrid Monitoring

Getting in the queue – content here, there, and everywhere

10 Friday Mar 2023

Posted by mp3monster in development, ExternalWebPublications, General, Technology

≈ Leave a comment

Tags

API, Cloud, Devmio, DZone, external, external publications, log simulator, OCI, Oracle, Queue, video

We’ve been having a busy period building and helping people build content to use Oracle services:

Create and implement OCI Queue by using its APIs and the Java SDK – which includes a brand-new enhancement that illustrates the use of Stomp
Learn about selecting an appropriate messaging solution
Use Oracle Notifications service with applications
Another webinar relating to OCI Queue – with the recording freely available here.
OCI API piece on looking at the application of APIs can be found here

Contributed to the creation of a couple of demo videos:

Outside of my Oracle cloud-related content, we’ve just published an article on DZone. Those who follow this blog will be familiar with the article theme as it relates to the Log Simulator work. We’ve also written for Devmio – although we don’t yet know when the article will be published and whether the content will be publicly available or behind their paid firewall.

API Gateways to manage dynamic scaling

17 Friday Feb 2023

Posted by mp3monster in APIs & microservices, Cloud, development, General, Technology

≈ 1 Comment

Tags

Cloud, CloudEvents, CNCF, dynamic, Plans, scaling

One of the benefits of using cloud providers is the potential to scale solutions to meet demand by scaling out with additional compute nodes without concern for physical capacity limits (cost considerations are smaller, but still apply). Dynamic scaling is typically driven by monitoring defined groups of nodes for CPU use and when demand hits a threshold an additional node is spun up. Kubernetes can create some more nuanced scaling configurations but this is tends to be used for microservice style solutions.

This is well and fine, but if we need to finesse the configuration to ensure multiple container instances (such as microservices) can be allowed on a single node without compromising the deployment of containers across nodes of a cluster to provide resilience things can get a lot more challenging. We also want to manage the number of active containers – we don’t want to have unnecessary volumes of containers effectively idling. So how do we manage this?

In addition to this, what if we’re using services that demand tens of seconds or even minutes to be spun up to a point of readiness – such as instantiating database servers, adding new nodes to data caches such as Coherence which will need time to clone data, and adjust the demand balancing algorithms? Likewise, for more traditional application servers. Some service calls will impact upstream configuration changes, such as altering load-balancing configurations. Cloud-native Load Balancers typically can understand node groups, but what if your configuration is more nuanced? If you’re running services such as ActiveMQ you need to update all the impacted nodes in a cluster to be aware of the new node. Bottomline is some solutions or solution parts need lead time to handle increased demand?

This points to more advanced, nuanced metrics than simply current CPU load. And possibly the scaling algorithm needs to be aware of lead times of the different types of functionality involved. So how can we advance a more nuanced, and informed scaling logic? You could look at the inbound traffic on firewalls or load balancers. But this will require a fair bit of additional effort to apply application context. For example is the traffic just getting directed to static page content? We do need to derive some context so that the right rules are impacted. Even in simple K8s only hosted solutions a cluster may host services that aren’t related to the changes in demand and need to be scaled as a result.

A use case perspective

Let’s look at this from a hypothetical use case. We’re a music streaming service. Artists can control when their music is made available, but the industry norm is 12:01am on a Friday morning. There is always a demand spike at this time, but the size of the demand spike can vary, with some artists triggering larger spikes (this isn’t directly correlating to an artist’s popularity, some smaller groups can have very enthusiastic fans) – so dynamic scaling is needed, and reactive to demand. Reporting, analytics, and payment services see cyclical bumps around the monthly financial periods. These monthly cyclic activities can also be done during quieter operating hours. So it is clear we may wish to apply logical partitioning, but for maximum cost efficiency keep everything in the same cluster. There are correlations between different service demands. Certain data services see increased demand during the demand spikes and reporting period, but ‘The bottom line is we need to not only address dynamic scaling, but tailor the scaling to different services at different times.

Back to our question – how do we manage the scaling? We could monitor the firewall and load balancer logs and analyze the kinds of requests being received. That would need additional processing logic to determine which services are receiving the traffic. But our API gateway is likely to have that intelligence implicitly in place as we have different API policies for the different types of endpoints. So we can monitor specific endpoints or groups of endpoints very easily – meaning we can infer the types of traffic demand and how to respond. Not only that, we may have API plans in place so we can control priority and prevent the free versions of our service from using APIs to initiate high-quality media streams. So we already have business and specific process meaning. So linking scaling controls such as KEDA (Kubernetes Event-driven Autoscaling) directly to measurements such as plans and specific APIs creates a relatively easy way to control scaling. Further, we may also use the gateways to provide a rate throttle so it’s not possible to crash our backend with an instantaneous spike. This strategy isn’t that different from an approach we’ve demonstrated for scaling message processing backends (see here and here).

*Representation of how we can use the metrics from a gateway to not only support the scaling of a K8s cluster but also other cloud services*

We can also use the data KEDA can see in terms of node readiness to adjust the API Gateway rate limiting dynamically as well. Either as a direct trigger from the number of active container instances or by triggering a more advanced check because we still have to address our services that need more lead time before relaxing the rate limiting.

Handling slow scaling services

So we’ve got a way of targeting the scaling of services. But what do we do to address the slower scaling services we described? With the data dimensioned, we can do a couple of things. Firstly we can use the rate of change to determine how many nodes to add. A very sudden increase in demand and adding a node at a time will create the effect of the service performance stuttering as it scales, and all additional capacity is suddenly consumed and then scales again. Factoring in the rate of change to the workload threshold and the context of which services could generate the increased workload can be used to not only determine which databases may need scaling but also be used to adjust the threshold of workload that actually triggers the node introduction. So a sudden increase in demand on services that are known to create a lot of DB activity is then met with dropping the threshold that triggers new nodes from, say, 75% to 25% so we effectively start the nodes on a lower threshold, means the process is effectively started sooner.

Illustrating how the rate of growth in utilization can mean the need to change the threshold to trigger scaling up — *With different rates of utilization growth, we can see that we need to alter the trigger threshold for launching new resources*

For this to be fully effective we do need the Gateway tracking traffic both internally (East-West) and inbound (North-South). Using the gateway with East-West traffic means that we can establish an anti corruption layer.

Conclusion

Not all parts of a solution will be instantaneous in scaling – regardless of how fast the code startup cycle might be, some services have to address the need to move large chunks of data before becoming ready, e.g., in-memory databases. Some services may not have been built for the latest business demands and the ability to exploit cloud scale and dynamic scaling. Some services need time to adjust configurations.

We may also need to adjust our protection mechanisms if we’re protecting against service overloading.

Scaling capabilities that have response latency to the scaling process can be addressed by achieving earlier, more intelligent recognition of need than warning simply by CPU loads hitting a threshold. The intelligence that can be derived from implicit service context simplifies the effort in creating such intelligence and makes it easier to recognize the change. API Gateways, message queues, message streams, and shared data storage are all means that, by their nature, have implicit context.

Pushing the recognition towards the front of the process creates milliseconds or possibly seconds more warning of the demand than waiting for the impacted nodes to see compute spikes.