OpAMP with Fluent Bit – Observability and ChatOps

23 Monday Mar 2026

Posted by mp3monster in Fluent Observability, Fluentbit, General, OpAMP

Tags

AI, artificial-intelligence, Cloud, Fluentbit, Fluentd, LLM, observability, OpAMP, Technology

With KubeCon Europe happening this week, it felt like a good moment to break cover on this pet project.

If you are working with Fluent Bit at any scale, one question keeps coming up: how do we consistently control and observe all those edge agents, especially outside a Kubernetes-only world?

This is exactly the problem the OpAMP specification is trying to solve. At its core, OpAMP defines a standard contract between a central server and distributed agents/supervisors, so status, health, commands, and config-related interactions follow one protocol instead of ad-hoc integration per tool.

That is where this project sits. We’re implementing the OpAMP specification to support Fluent Bit (and later Fluentd).

In this implementation, we have:

a provider (the OpAMP server), and
a consumer acting as a supervisor to manage Fluent Bit deployments.

Right now, we are focused on Fluent Bit first. That is deliberate: it keeps scope practical while we validate the framework. The same framework is being shaped so it can evolve to support Fluentd as well.

The repository for the implementation can be found at https://github.com/mp3monster/fluent-opamp

Quick summary

The provider/server is the control plane endpoint. It tracks clients, accepts status, queues commands, and returns instructions using OpAMP payloads over HTTP or WebSocket.

The consumer/supervisor handles the local execution and reporting. It launches Fluent Bit, polls local health/status endpoints, sends heartbeat and metadata to the provider, and handles inbound commands (including custom ones). The server and supervisor can be deployed independently, which is important for real-world rollout patterns.

Because they follow the OpAMP protocol model, clients and servers can be interchanged with other OpAMP-compliant implementations (although we’ve not yet tested this aspect of the development).

Together, they give us a manageable, spec-aligned path to coordinating distributed Fluent Bit nodes without hard-coding one-off control logic into every environment.

Deployment options and scripts

There are a few practical ways to get started quickly:

Deploy just the server/provider using scripts/run_opamp_server.sh (or scripts/run_opamp_server.cmd on Windows).
Deploy just the client/supervisor using scripts/run_supervisor.sh (or scripts/run_supervisor.cmd on Windows).
Run both components either together in a single environment or independently across different hosts.

The scripts will set up a virtual environment and retrieve the necessary dependencies.

If you want an initial MCP client setup as part of your workflow, there are helper scripts for that too:

mcp/configure-codex-fastmcp.sh and mcp/configure-codex-fastmcp.ps1
mcp/configure-claude-desktop-fastmcp.sh and mcp/configure-claude-desktop-fastmcp.ps1

Server screenshots

Here is a first server view we can include in the post:

The UI is still evolving, but this gives a concrete picture of the provider side control plane we are discussing.

What the OpAMP server (provider) does

The provider is responsible for the shared view of fleet state and intent.

Today it provides:

OpAMP transport endpoints (/v1/opamp) over HTTP and WebSocket.
API and UI endpoints to inspect clients and queue actions.
In-memory command queueing per client.
Emission of standard command payloads (for example, restart).
Emission of custom message payloads for custom capabilities.
Discovery and publication of custom capabilities supported by the server side command framework.

Operationally, this means we can queue intent once at the server and let the next client poll/connection cycle deliver that action in protocol-native form.

What the supervisor (consumer) does for Fluent Bit

The supervisor is the practical glue between OpAMP and Fluent Bit:

Starts Fluent Bit as a local child process.
Parses Fluent Bit config details needed for status polling.
Polls Fluent Bit local endpoints on a heartbeat loop.
Builds and sends AgentToServer messages (identity, capabilities, health/status context).
Receives ServerToAgent responses and dispatches commands.
Handles custom capabilities and custom messages through a handler registry.

So for Fluent Bit specifically, the supervisor gives us a way to participate in OpAMP now, even before native in-agent OpAMP support is universal.

And to be explicit: this is the current target. Fluentd support is a planned evolution of this same model, not a separate rewrite.

Where ChatOps fits

ChatOps is where this gets interesting for day-2 operations.

In this implementation, ChatOps commands are carried as OpAMP custom messages (custom capability org.mp3monster.opamp_provider.chatopcommand). The provider queues the custom command, and the supervisor’s ChatOps handler executes it by calling a local HTTP endpoint on the configured chat_ops_port.

That gives us a cleaner control path:

Chat/user intent can go to the central server/API.
The server routes to the right node through OpAMP.
The supervisor performs the local action and can return failure context when local execution fails.

This is a stronger pattern than directly letting chat tooling call every node individually, and it opens the door to better auditability and policy controls around who can trigger what.

Reality check: we are still testing

This is important: we are still actively testing functionality.

Current status is intentionally mixed:

Core identity, sequencing, capabilities, disconnect handling, and heartbeat/status pathways are in place.
Some protocol fields are partial, todo, or long-term backlog.
Custom capabilities/message pathways are implemented as a practical extension point and are still being hardened with test coverage and real-world runs.

So treat this as a working framework with proven pieces, not a finished all-capabilities implementation.

What is coming next (based on `docs/features.md`)

Near-term priorities include:

stricter header/channel validation,
heartbeat validation hardening,
payload validation against declared capabilities,
server-side duplicate websocket connection control behaviour.

Broader roadmap themes include:

authentication/security model for APIs and UI,
persistence in the provider,
richer UI controls for node/global polling and multi-node config push,
certificate and signing workflows,
packaging improvements.

And yes, a key strategic direction is evolving the framework abstraction so it can support Fluentd in due course, not only Fluent Bit. Some feature areas (like package/status richness) make even more sense in that broader collector ecosystem.

Why this matters

OpAMP gives us a standard envelope for control-plane interactions; the server/supervisor split gives us pragmatic deployment flexibility; and ChatOps provides a human-friendly control surface.

Put together, this becomes a useful pattern for managing telemetry agents in real environments where fleets are mixed, rollout velocity matters, and “just redeploy everything” is not always an option.

If you are evaluating this right now, the right mindset is: useful today, promising for tomorrow, and still under active verification as we close feature gaps.

Enterprise scale NL2SQL

05 Thursday Mar 2026

Posted by mp3monster in AI, General, Technology

≈ Leave a comment

Tags

Cipher, context, GQL, graph, LLM, LLMs, Natural Language, NL, NL2SQL, ontology, OpenCipher, Oracle, OWL, PGQL, RDF, SPARQL

A number of books on the use of LLMs demonstrate how to generate queries against a table from natural language. There is no doubt that providing an LLM and the definition of a simple database and a natural language question, and seeing it provide a valid result is impressive.

The problem is that this struggles to scale to enterprise scenarios where queries need to run on very large databases with tens, if not hundreds, of tables and views. Take a look at the results of the evaluation and benchmarking frameworks like Bird, Spider2, NL2SQL360, and others.

So why do we see a drop in efficacy between small use cases and enterprise systems. Let’s call out the headline challenges you may run into…

The larger the database schema, the more tokens we use just to describe it.
Understanding how relationships between tables work isn’t easy – do we need SQL to include a join, union, or nested queries?
Which tables does the LLM start with?
Does our natural language need the LLM to perform on-the-fly calculations or transformations (for example, I ask for data using inches, but data is stored using millimeters)?
When the entity needs to use attributes that contain names (particularly people, places, chemicals, and medicines), the queries will fail, as there are often multiple spellings and truncations of names.
Does the LLM have permissions to access all the schema (enterprise systems will typically have data subject to PII).

Does this mean that NL2SQL, in practice, is a bust? Not necessarily, but it is a lot harder than it might appear, and how we apply the power of LLM’s may need to be applied in more informed and targeted ways. But we should ask a few questions about the application of LLMs that may impact tolerance to issues…

Will users understand and tolerate a degree of risk and the possibility of an error resulting from hallucination? It’s just like using a calculator for performing calculations – an error in the keying sequence or a failure to apply mathematical precedence, and the result will be wrong, but do you look at the answer and ask yourself whether this is in the right ‘ballpark’? For example, 4 + 2 x 8, if you simplify to 6 x 8, you’ll get the wrong result, as multipliers have precedence over addition, so we should perform 2 x 8 first. If there is no tolerance for error, you need your solution to be completely deterministic, which rules out the use of LLMs.
More generally, will we tolerate SQL that is not performance-optimized? As we’re dealing with a non-deterministic system, we can’t anticipate all the possible indices that might help or other optimizations that could improve the query. If you’re thinking of simply creating many indexes to mitigate risk, remember that maintaining indexes comes at a cost, and the more indexes the SQL engine has to choose from, the harder it has to work to select one.
What level of ‘explainability’ to the answer does the user want? Very few users of enterprise systems will want to have an answer to their questions explained by providing the SQL, and in the world of SaaS, we don’t want that to happen either.
The last dimension is: if a user submits a question prematurely or with poor expression, do we really want to return hundreds or hundreds of thousands of records?

In addition to these questions, we should look at the complexity of our database. This can be covered using several metrics, such as:

NA – Number of Attributes
RD – Referential Degree (the amount of table cross-referencing)
DRT – Depth Relational Tree (the number of relational steps that can be performed to traverse the data model)

The complexity score indicates how hard it will be to formulate queries; in other words, the LLM will have to work harder and, as a result, be more susceptible to errors. So, if we want to move forward with NL2SQL then we need to find a means to reduce the scores.

If you’d like to understand the formulas and theory, there are papers on the subject, such as:

How we respond to these questions will demonstrate our tolerance for challenges and our ability to drive enterprise adoption. Let’s look at some possible strategies to improve things.

Moving forward

One step forward is that each attribute and table can be described; this is essential when using shortened names and acronyms. Such schema metadata can be further extended by describing the relationships and purposes of tables, and even advise the LLM on how to approach common questions, start with table x when being asked about widgets, for example. In many respects, this is not too dissimilar to how we should approach APIs. The best APIs are clearly documented, have consistent naming, and so on.

We can use the LLM to analyse the question and use the result to truncate the provided schema details. For example, in an ERP system, we could easily generate natural-language questions about a supplier or monthly operating costs. But reducing the schema by excluding details from supplier-related tables will narrow the LLM’s scope and, therefore, reduce the likelihood of getting details wrong. This is no different from addressing an exam question and taking a moment to strike out redundant information, as it is noise. You could say we’re removing confounding variables. This idea could be taken further by restricting agents to only working with specific subdomains within the data model.

We can clean up questions. When names are involved, we can use standard language models to extract nouns and their types, then look them up before we formulate an SQL statement. If the question relates to a supplier name, we can easily extract it from the natural-language statement and determine whether such an entity exists. This sort of task doesn’t require us to use powerful, clever foundational LLMs; there are plenty of smaller language models that can do this.

Inject context from the UI

Ideally when a natural query is provided, the more traditional parts of the UI are showing the data, the user can navigate through in traditional manner – replacing years of UI use with a blank conversation page is going to really limit the users understanding of what is available in terms of data and the relationships that already exist, I’ve mentioned this before – Challenges of UX for AI. Assuming that we have something more than a blank conversation window, there is plenty of context to the natural language input. Let’s ensure we’re providing this in a rich manner; this could be more than just the fields on the screen, but associated attributes. For example, if the UI is showing a table, then identifying the entities or filters applied may well help.

Simplifying the schema

Most of our systems are not designed to support NL2SQL, but for core processes that demand considerations, such as frequently occurring queries that are efficient, even more importantly, so that we don’t have issues relating to data integrity. Unless you’re fortunate to be working with a greenfield and keeping the data model simple so processes like NL2SQL have an easy time, we’re going to need to take advantage of database techniques to try to simplify the schema (at least as the NL2SQL process sees it). The sort of things we can do include:

simplify with views
- We can join tables and denormalize table structures, such as supplier and address, for example. Navigating table relationships is one of the biggest risks in SQL processing.
- Use virtual columns so we can offer data in different formats, such as inches and millimetres, so the LLM doesn’t need to apply unnecessary logic, such as transforming data values. Providing the calculation ourselves means we have control over issues such as precision.
- Exclude columns that your NL2SQL does not require. For example, if each table has columns that record who made a change and when, is that needed by the LLM?
Synonyms and other aliasing techniques
- We can use synonyms to represent the same tables or columns, enabling us to map question entities to schema tables and columns. This can really help overcome specialist column names in a schema that the LLM is less likely to resolve.
- We can also use this to disambiguate meaning; for example, in our ERP, an orders table could have a column called sales. In the context of orders, that would be better expressed as sales-quantity. But then, in the accounting ledger, when we have a sales attribute, that really means sales value.

Simplifying through an Ontology and Graphs

We can model our data and the relationships using an ontology. This isn’t a small undertaking and can bring disruptions of its own. Using an ontology affords us several benefits:

Ontology can drive consistency in our language and its meaning.
An ontology in its pure form will capture the types of objects (entities), their relationships to other entities, and their connections to concepts.
We have the option to exploit standards to notate the ontology in a implementation independent manner.

Applying an ontology allows us to describe the objects, attributes, and relationships in our data model using a structured language. For example supplier has an invoicing address. One key is that we’ve abstracted away how the data is stored (some might refer to this as the physical model, but I hedge on that, as there is a whole layer of science in data-to-storage hardware).

To get to grips with ontology and how we can use it in practice, we need some familiarity with concepts such as Knowledge Graphs, OWL, and RDF. Rather than explaining all these ideas here, here are some resources that cover each technology or concept.

To get to grips with ontology and how we can practically use it, we need some familiarity with concepts such as Knowledge Graphs, OWL, and RDF. Rather than explaining all these ideas here, here are some resources that cover each technology or concept. We’ve added some references for this below.

By modelling our data this way, we make it easier for the LLM by describing our entities and relationships, and, importantly, the description makes it much easier to infer and understand those relationships.

The step up using an ontology gives us the fact that we can express it with a Graph database. We can describe the ontology using graph notation (our entities are objects connected by edges that represent relationships; e.g., a supplier object and an address object are connected by an edge indicating that a supplier has an invoicing address. When we start describing our schema this way, we can begin to think about resolving our question using a graph query.

Not only do we have the relationship between our entities, but we can also describe entity instances with the relationship of ‘is a ‘, e.g., Acme Inc is a supplier.

But, before we start thinking, this is an easily solvable problem; there are still some challenges to be overcome. There is a lot of published advice on how to approach ontology development, including not extrapolating from a database schema (sometimes referred to as a bottom-up approach). If you develop your ontology from a top-down approach, you’ll likely have a nice logical representation of your data. But, the more practical issue is that we can’t convert our existing ERP schema to a fit the logical model, or impose a graph database onto our existing solution; that has horrible implications, such as:

Our existing data model will, even with abstraction layers, affect our application code (yes, the abstractions will mitigate and simplify), but those abstractions will need to be overhauled.
Existing application schemas reflect a balancing act of performance, managing data integrity, and maintainability. Replacing the relational model with a Graph will impact the performance dimension because we’ve abstracted the relationships.
There is also the double-edged challenge of a top-down approach that may reveal entities and relationships we should have in our ERP that aren’t represented for one reason or another. I say double-edged, because knowing the potential gap and representation is a good thing, but it may also create pressure to introduce those entities into a system that doesn’t make it easy to realize them.

Taking a schema view and blending with an ontology – https://www.wikiwand.com/en/articles/Ontology_engineering

If you follow the guidance on developing an Ontology from the research referenced by NIST, which points to defining ontologies in layers and avoiding drawing on data models.

That said, there is guidance, standards, and tooling for building an ontology from a relational schema, such as the W3C’s A Direct Mapping of Relational Data to RDF, papers such as Bringing Relational Databases into the Semantic Web: A Survey. The true magic of an ontology, aside from its role in driving clarity, is that it can be modelled in a Graph database and therefore queried easily.

There is also the challenge of avoiding the ontology from imposing an enterprise-wide data model, which slows an organization’s ability to adapt to new needs quickly (the argument for microservices and schema alignment to services). While the effective use of an ontology should act as an accelerator (we know how to represent data constructs, etc.) within an organization that is more siloed in its thinking, it could well become an impediment. Eric Evans’ book Domain Driven Design tries to address indirectly.

Graphs – the magic sauce of ontology and its application

With a schema expressed as a graph (tables and attributes are nodes, and edges represent relationships), we can query the graph to determine how to traverse the schema (if possible or practical) to reach different entities. If the relational schema’s details can be enriched with strong semantic meaning in the relationships, we get further clues about which tables need to be joined. We say practical because it may be possible to traverse the schema, but the number of steps can result in a series of joins that generate unworkable datasets.

We could consider storing not just the schema as a graph, but also the data. There are limits when the nodes become records, and the edges link each node; as a result, data volume explodes, particularly if each leaf node is an attribute. Some graph databases also try to manage things in memory – for multimillion-row tables, this is going to be a challenge.

Some databases that support a converged model can map queries, schemas, and data across the different models. This is very true for Oracle, where services exist that allow us to retrieve the schema as a graph (docs on RDF-based capabilities), translate graph queries (SPARQL / GQL) into relational ones, and so on. The following papers address this sort of capability more generally:

One consideration when using graph queries is the constraints imposed by meeting data manipulation needs, since most relational database SQL provides many more operators.

Ontology more than an aide to AI

A lot of research and thought pieces on ontology have focused on its role in supporting the application of AI. But the concepts as applied to IT and software aren’t new. We can see this by examining standards such as OWL and RDF. Ontology can support many design activities, such as developing initial data models (pre-normalization), API design, and user journey description. The change is that these processes implicitly or indirectly engage with ontological ideas. For example, when we’re designing APIs, particularly for graph queries, we’re thinking about entities and how they relate. While we tend to view relationships through the CRUD lens, this is a crude classification of the relationship that exists.

Getting from Natural Language to the query

This leads us to the question of how to translate our natural language question into a Graph query without encountering the same problems. This is a question of understanding intent; the focus is no longer to jump to prompting for SQL generation, but to extract the attributes that are wanted by examining the natural language (we can even look to do this with smaller models focusing on language), and tease out the actual values (keys) that may be referenced. With this information, we can determine how to traverse the schema and potentially generate the query. If we can’t make that direct step with the graph (we may be asked to perform data manipulation operations that graph query languages aren’t designed for), relational databases offer a wealth of DML operators that graph query languages lack. We can now provide more explicit instructions to the LLM to generate SQL. To the point, our prompt for this step could look more like a developer prompt. Of course, with these prechecks, we can also direct the process to address ambiguities, such as matching the NL statement to the schemas’ attributes that should be involved.

As you can see through the use of graph representations of our relational data, we can look to move the LLM from trying to reason about complex schemas to examining the natural language, providing information on its intent, and then directing it to generate a formalized syntax – all of which leans into the strengths of LLMs. Yes, LLMs are getting better as research and development drive ever-improving reasoning. But ultimately, to assure results and eliminate hallucination, the more we can push towards deterministic behaviour, the better. After all, determinism means we can be confident that the result is right for the right reasons, rather than getting things right, but that the means by which the answer was achieved was wrong.

Growing evidence that graph is the way to go

This blog has been evolving for some time, since I started to dial into the use of ontologies and, more critically, graphs. I’ve started to see more articles on the strategy as a way to make NL2SQL work with enterprise schemas. So I’ve added a few of these references to the following useful resources.

Useful Resources

Why ontology for SQL
Palantir ontology blog series
Reverse engineering an ontology
NIST Guidance on ontology
Basic Formal Ontology
Timbr.ai on ontology for AI
Semantic Web Journal
A novel approach for learning ontology from relational database: from the construction to the evaluation
Ontology Development 101: A Guide to Creating Your First Ontology
SPARQL Standard – for querying RDF data.
Graph Query Language (GQL)
Property Graph Query Language (PGQL) (and formal standard specification)
Resource Description Framework (RDF) standard
Web Ontology Language – OWL standard
Knowledge Graph & RDF
Ontology Engineering
QueryWeaver – open source tool uses Graph to build SQL
Timbre.ai blog post

A final note on Graph Querying

Graph querying can be a bit of a challenging landscape to initially get your head around; after all, there is SPARQL, PGQL, GQL, and if you come from a Neo4J background, you’ll be most familiar with Cipher/OpenCipher. Rather than try to address this here, there is a handy blog here, and the following diagram from graph.build.

Returning to Chat Ops

03 Tuesday Mar 2026

Posted by mp3monster in chatbots, Fluent Observability, Fluentbit, Fluentd, General, OpAMP, Technology

≈ Leave a comment

Tags

AI, chatops, development, Fluent Bit, IncidentFox, innovation, LLM, OpAMP, OTel, runbooks, SRE

A couple of years ago, we wrote about the idea of Chat Ops, why the idea is valuable and interesting (see Fluent Bit – Powering Chat Ops, Fluent Bit with Chat Ops, for example). The essence of the idea was:

Using a collaboration or chat platform like Slack could ease and even accelerate the response to operational issues (as systems process more data and faster).
Conversational collaborative platforms are pervasive, usable across many devices, while still being able to enforce security. Taking away the need to log in to laptops, signing in to portals before you can even start considering what the problem is.
The quicker we can understand and resolve things, the better, and the less potential damage that may need to be addressed. Using collaboration platforms does this through collaborative work, and the fact that we can see content quickly and easily because of the push model of tools like Slack.
Any Observability agent tool that can detect issues as data is moved to a backend aggregation and analytics tool gives you a head start. This is something Fluent Bit can do easily.

We illustrated it this way:

The ChatOps deployment looked like:

ChatOps deployed using Slack to power collaboration, with Fluent Bit for ops execution

How do we advance this?

The demo was very light-touch to illustrate an idea, but we’ve since heard people pursuing the concept. So the question becomes: what has happened in the observability space that could make it easier to industrialize (e.g., scale, resilience, security, manageability)? One of the key advances being driven by the OpenTelemetry working groups within the CNCF is OpAMP. OpAMP provides a protocol for standardizing the management and control of agents and collectors such as Fluent Bit. I won’t go into the details of OpAMP here, as we covered that in a separate blog (here). But what it means is that we can either tap into the OpAMP protocol more directly as it provides a means to deliver custom commands to agents, or better still simplify by extending the management service so that we can talk to that, and it is responsible to sending on the command to the relevant agents and talk back to us if its knowledge about agent capabilities tells us its not possible.

This improves security (our bot only needs to talk to a single point in our infrastructure). Our agents, like Fluent Bit, can have a trust relationship with a central point. Effectively, we have introduced a better multifaceted trust layer into the answer. Not only that, as the OpAMP server has visibility into more of the agent deployment, we can also leverage its information and ask it to deploy our custom Fluent Bit configurations that help us observe and execute remediation processes.

Let’s look deeper

While industrialization is good, it occurs to me that there is more to collaboration and conversational (or chat) interfaces than we first envisaged.

Collaboration, to a large extent, is not about bringing multiple ideas to the table and choosing the best one; it is actually about knowledge mining. Our ideas and insights are built from individual knowledge banks, or as we might put it, experience. Conversational interfaces also allow us to see what has gone before (in effect, harvesting information to improve our knowledge). So perhaps we should be asking, within the context of a platform that allows us to easily access, share, and interpret knowledge for a specific context (i.e., our current problem), what can we do?

Knowledge is unstructured or semistructured information, and information is a semistructured composition of context and data. So how do we find and leverage information if not knowledge, particularly at 2am when we’re the only person on call? The answer is to facilitate enhanced information retrieval, which could be more Slack bots that we can use to provide structured commands to retrieve relevant data from our metrics tooling, traces, etc. Organizations that follow more ITIL-guided processes will most likely have runbooks, a knowledge base with error-code information, and previous incident logs documenting resolutions. All of which can bring together a wealth of information. Doing this in a collaborative conversational tool like Slack and MS Teams will save time and effort, as you will not have to sign in to different portals to track down the details.

But we can go better, with the rise of LLMs (large language models, Gen AI if you prefer), combined with techniques such as RAG (Retrieval Augmented Generation), MCP (Model Context Protocol), and further accelerate things as we no longer have to make our Slack bot requests use what structured commands. We use natural language, we can easily paste parts of the information we’re given as part of the request – all of which means that while a single LLM may take longer to execute a single request, it is more likely to surface the details faster because we’re not having to worry about getting the notation for the bot request correct, we’re not going to have to repeat requests to narrow data immediately. With agentic techniques, you can have the LLM pull data from multiple systems in a single go.

All of which is very achievable; many vendors (and open-source teams) have been seeking a competitive edge and exposing their APIs through MCP tools. There is the possibility that some of this is ‘AI washing’, but the frameworks to support MCP development are well progressed, so you can always refine or create your own MCP server. Here are just a few examples:

In addition to these, which align with widely used open-source solutions, there are MCP servers for vendor-specific offerings, such as Honeycomb and Chronosphere.

The use of MCP raises an interesting possibility – as it becomes very possible to have a universal MCP client, which can be easily interfaced into any conversational tool – after all, we just need to send the text that is addressed to the client, i.e., the adaptor to Slack, MS Teams, etc. is pretty simple. This means we can start to consider things like the same tooling, supporting tools like Claude Desktop, and its mobile equivalent.

If we bring AI into the equation, consider ML and small models as well, we can start looking at exploiting pattern recognition in the occurrences of issues. This would mean the value diagram becomes more like:

We should also keep in mind the development of the new OpAMP spec from the CNCF OpenTelemetry project that provides a mechanism to centrally track and interact with all our agents, such as Fluent Bit (not to mention Fluentd, OTel Collectors, and others).

Bringing all of this together isn’t a small task, and what of these ideas already exist? Are there building blocks we can lean on to show that the ideas expressed here are achievable? This brought us to IncidentFox.ai (GitHub link – IncidentFox). IncidentFox already exploits MCP servers from several products, such as Grafana, and supports a knowledge base geared towards intelligent searching that can also tap into common sources of operational knowledge, such as Confluence.

But IncidentFox has taken the use of LLMs further with its agentic approach, recommending actions, and once it has access to source code, LLMs can be used to start looking for root causes. To do this, IncidentFox has harnessed the improvements in Gen AI reasoning and the selection and use of the MCP-enabled tools we mentioned, which enable the extraction of data and information from mainstream observability tools.

IncidentFox isn’t the only organization heading in this direction (Resolve.ai, Randoli, and Kloudfuse are building AI SRE tools, but with a fully proprietary business model), but it is supporting an open-source core. All of these solutions come under the banner of AI SRE. Interestingly, a new LLM benchmark has emerged called SRESkillsBench, in addition to the models such as Bird and Spider, which specifically evaluate LLMs against SRE tasks. This kind of information certainly is going to create some interesting debates – do you use the best LLM for specific tasks, or the LLM the vendor has worked with and optimized their prompting to get the best from it?

A lot of these tools focus on detection and diagnostics, and steer you to the relevant runbooks, there seems to be less said about how runbooks can be translated into remediation actions that can be executed. The ability to translate runbooks into executable steps will require either tooling via MCP or custom models with suitable embedded functions. Whichever approach is adopted, having a well-understood control plane, such as Kubernetes, will make it easier. But not everything is native K8s. Lots of organizations still use just virtualization, or even bare metal. We also see this with cloud vendors that offer bare-metal services due to the need for maximum compute horsepower. To address these scenarios from a centralized control, either needs to allow remote access (ssh tunnels) to all the nodes, or a means to work with a distributed tooling mechanism, which is what we had with our original chatops idea and the possibilities offered through OpAMP.

Of course, introducing LLMs opens up another set of challenges in the observability space, which OpenLLMetry (driven by TraceLoop) and LangFuse are working to address. But that’s for another blog.

What does this mean to our ChatOps proof of concept?

Sticking with an open-source/standards-based approach means there is a clear direction of enabling AI Agents to be part of the OTLP (Open Telemetry Protocol) pipeline and leverage the MCP tooling of backend observability platforms. IncidentFox doesn’t yet provide an inline OTLP capability (although some of the proprietary options have gone this way), but that is less of a concern, as Fluent Bit provides a lot of capability to help identify issues (such as timeseries/frequency measurement, error identification, etc.), which can be combined with IncidentFox’ means to instruct it to initiate an incident. We could do this with direct API calls, but what would be more interesting and flexible is that, when our Fluent Bit process notifies the Ops team in Slack of an issue, it also ensures that the IncidentFox Slack agent picks up the message, allowing it to start its own analysis. In effect, a collaboration platform can be as impactful as an integration platform.

Furthermore, now that we have early warning of an issue via Fluent Bit, we can interact with IncidentFox through its tooling to start interrogating the diverse range of tools well before the new problem’s details have been fully ingested into the backend tooling and been detected as an issue. So we can now get a jump start by prompting IncidentFox, which can mine for information and recommendations. This is where, at 2am, we can still work collaboratively, albeit with a collaborator that is not another person but an LLM.

As I’ve mentioned, the challenge is making the runbook executable in environments that don’t provide a strong control plane, such as K8s. Here, we can further develop our ChatOps concept. By providing a set of MCP tools for Fluent Bit (IncidentFox supports providing your own tools), the runbook can describe remediation steps, and the use of the LLM ensures the Fluent Bit tool is invoked appropriately.

This would still mean that the central point of Incident Fox would need to talk with each Fluent Bit deployment, better than just opening up SSH directly. But we could be smarter. OpAMP provides a central coordination point, and through its Supervisor or embedded logic in an observability client, the protocol’s support for custom commands could be exploited. So we expose the way we interact with Fluent Bit via the custom command mechanism, and we have a robust, controlled mechanism. Furthermore, the concept could be extended to exploit the aggregated knowledge of agents and their configuration (or pushing out new configurations to solve operational problems as OpAMP has been designed to enable), we can automate the determination of whether the remediation action needs to be executed across operational nodes.

Let’s walk through a hypothetical (but plausible) scenario. We have manually deployed systems across a load-balanced set of servers in a pre-production environment used for load testing, each with TLS certificates. As it is a pre-production environment, we use self-signed certificates. We are starting to see errors because the certificates have expired, and someone forgot to recycle them. Fluent Bit has been forwarding the various error log messages to Loki, but the high frequency of the errors causes it to send details to Slack for the Ops team. This, in turn, nudges IncidentFox to act, determines the root issue, and finds remediation in a runbook that points to running a script on the host to provide a fresh self-signed certificate. As we have a Fluent Bit ops pipeline that can trigger that script, and it is deemed low risk, we allow IncidentFox to execute the runbook unsupervised. To execute the run book, it uses the OpAMP server to send the custom command to Fluent Bit. But the runbook says other nodes should be checked. As a result, IncidentFox works with the OpAMP server via its MCP tooling to identify which other Fluent Bit deployments are monitoring other nodes and to direct the custom command to those nodes as well.

To achieve this, we need an architecture along the following lines, in addition to IncidentFox and the associated MCP tools, we would need to have the OpAMP server, either the OpAMP client embeddable into Fluent Bit or an OpAMP supervisor that understands how to pass custom commands, something it perhaps could do by using another MCP interface to Fluent Bit directly.

Conclusion

As you can see, there has been significant advancement in how LLMs and MCP tooling can be used to enable operational support activities. To turn this into a reality, the core Fluent Bit committers need to ideally press on with supporting OpAMP. A standardized MCP tooling arrangement for an open-source OpAMP server will also make a significant advancement. Although we could get things working directly with MCP using Fluent Bit.

IncidentFox being able to function as an OpAMP server itself would really enable it shift the narrative, particularly if it provided a means to plug into a supervisor to handle custom actions.

Root cause analysis – understanding deployment changes

Looking at all of this, one dimension of operational issues is the ability to correlate an issue with environmental changes. That means understanding not only software versions, but also the live configuration. Again, not too difficult when your control plane is K8s. But trickier outside of that ecosystem. You need to identify when the change was defined and when it was applied. Is it time for some functionality that understands when details such as file stamps for configurations or certain binary files change? Perhaps a proxy service that can spot API calls to configuration endpoints and file system changes to configuration files? Then have that information logged centrally so that problems can be correlated to a system configuration or deployment problem.

Updates

Since writing this, we came across Mezmo, which, like Resolve.ai, is developing a proprietary solution but is looking to make an open-source offering. As part of that journey, they have sponsored an O’Reilly Report on Context Engineering for Observability.

Fluent Bit and Otel Collectors at scale

26 Thursday Feb 2026

Posted by mp3monster in Fluentbit, General, Technology

≈ 1 Comment

Tags

AI, artificial-intelligence, Cloud, LLM, OpAMP, Open Telemetry, OTel, Protobuf, Spec, Technology

Fluent Bit and OpenTelemetry’s Collector (as well as many other observability tools) are designed to use a distributed/agent model for deployment. This model can pose challenges, including ensuring that all agents are operating healthily and correctly configured. This is particularly true outside of a Kubernetes ecosystem. But even within a Kubernetes ecosystem, more than basic insights are required (for example, is it running flat out or over-resourced). Fluent Bit exposes its own metrics and logs so that you can either configure Fluent Bit to forward the metrics and logs to an endpoint (or allow Prometheus to scrape the metrics).

As we’re usually using Fluent Bit to collect data and route it to tools like Prometheus, Grafana, etc., or perhaps a more commercial product. So it makes sense that Fluent Bit is also sharing its own status and health.

When it comes managing the configuration of Fluent Bit, we have lots of options for Kubernetes deployments (from forcing pod replacement, to sharing configurations via persistent volume claims, and Fluent Bit reloading configs), but given that Observability needs to operate on simple virtualized and bare metal scenarios, and not everything can be treated as dynamically replaceable more general strategies such as GITOps and potentially using Istio (yes, there really good use cases for Istio outside of Kubernetes) are available. But there is also more advanced tooling, such as Puppet, Chef, and Ansible.

The challenge is non of these tools provides out-of-the-box capabilities to fully exploit the control surface that OTel Collectors and Fluent Bit offer. So the OpenTelemetry community has elected to develop a new standard called OpAMP, which fits snuggly into the OTel ecosystem.

OpAMP defines an agent/client and server model in which the central server provides control, measurement, etc., to all Collectors. The agent/client side can be deployed in two ways: wired directly into a collector or via a separate supervisor tool. Integrating the client-side directly into the client is great, as it avoids introducing a new local process. The heart of OpAMP is the message exchange, which we’ll look at more in a moment.

Today, Fluent Bit would need to use the Supervisor model without modification (although GitHub shows a feature request to support the protocol, we haven’t heard whether or when it will be implemented). But it is early days, and some aspects of the protocol are still classified as ‘development’. That said, it is worth looking more closely at OpAMP as it offers some interesting opportunities, particularly around how we could easily evolve ideas such as chatOps.

While I’m not a fan of having a peer process for Fluent Bit, on the basis that we now have two distinct processes to support observability, we could experiment with the supervisor spawning Fluent Bit as a child process, which would let the supervisor easily spot Fluent Bit failing. At the sametime the supervisor can communicate with Fluent Bit using the localhost loopback adapter and the usual APIs.

Understanding the OpAMP Protocol and what it offers

Lets start with what the OpAMP protocol offers, firstly very little of it is mandatory (this is both good and bad – it means we can build compatibility in a more incremental manner, if certain behaviours are provided elsewhere, then the agent doesn’t have to offer a capability it is also tolerant of what an agent or collector can and can’t do):

Heatbeat and announce the agent’s existence to the server, along with what the agent is capable of/allowed to do regarding OpAMP features.
Status information, including:
- environmental information
- configuration being run
- modules being used
Perform updates to resources, including:
- Agent configuration
- TLS certificate rotation
- Credentials management
- module or even an entire agent installation
issuing of custom commands.
directing the agent’s own telemetry to specific services/endpoints.

The heart of this protocol is the contract between the client (agent/collector) and the server, defined using Protobuf3. This means you can easily create the code skeleton to handle the payload objects, which will be transmitted in binary form (giving network traffic efficiency and the price of not being humanly readable or dynamically processable, insofar as you need to know the Protobuf definition to extract any meaning).

In addition to the Protobuf definitions, there are rules for handling messages, specifying when a message or response is required, message sequence numbering, and the default heartbeat frequency. But there aren’t any complex exchanges involved.

The binary payloads are exchanged over Sockets (allowing full-duplex exchanges where the server can send requests at any time) or over HTTP (providing half-duplex, aka polling/client check-in, at which point the server can respond with an instruction) – a strategy that is becoming increasingly common today.

The benefits we see

Regardless of whether you’re in a Kubernetes environment, the ability to ask agents/collectors to quickly tweak their configuration is attractive. If you start suspecting a service or application is not behaving as expected, if Fluent Bit is filtering your logs or sampling traces, you can quickly push a config change out to allow more through, providing further insight into what is going on – you don’t need to wait for a Kubernetes scheduler to roll through with replacing pods.

With the rise of AI, having a central point of contact makes it easier to wrap a central server as an MCP tool and have a natural language command interface. Along with the possibility that you can potentially send custom commands to the agent (or supervisor) to initiate a task. This was part of the functionality we effectively implemented in our original chatops showcase. The problem was that in the original solution, we deployed a small Slack bot that directed HTTP calls to Fluent Bit. With the OpAMP framework, we can direct the request to the server, which will route the command to the correct Fluent Bit node through a more trustworthy channel.

Implementation

The OpAMP protocol is likely to be widely adopted by commercial service providers (Bindplane, OneUpTime, for example), as it allows them to start working with additional agents/collectors that are already deployed (for example, a supervisor could be used to manage a Fluentd fleet, where there isnt the appetite to refactor all the configuration to Fluent Bit or an OTel Collector). Furthermore, it has the potential to simplify by standardizing functionality.

In terms of resource richness in the OpenTelemetry GitHub repo, I suspect (and hope) there will be more to come (a UI and richer details on how a base server can be extended, for example). At the time of writing, within the OpenTelemetry GitHub repository, we can see:

The published spec
Go implementation of the protocol: the server accepts and responds to messages, and the client includes some test functionality to populate messages.
A supervisor implementation that, through configuration, can be pointed at an agent to observe, and the configuration so that the supervisor is uniquely identifiable to the server. This is also implemented in Go.
Open Telemetry Collector extension

Intersection of API documentation and Search optimisation

18 Wednesday Feb 2026

Posted by mp3monster in APIs & microservices, General, Technology

≈ Leave a comment

Tags

API, API Commons, APIs.io, Async API, IETF, OAS, RFC, specifications, Technology, WELL-KNOWN

I have long said that APIs are more than just payload definitions. For public APIs or those being made available within large organisations, where discoverability and making it easy for APIs to be adopted are just as important to the definition of the API. API specification standards such as Open API Specification and Asynchronous API Specification have addressed some of the challenges, the specifications don’t address everything.

Diagram I was using in 2021, to convey the point that an API involves more than simply a specification – https://www.slideshare.net/slideshow/api-more-than-payload/247135283?utm_source=clipboard_share_button&utm_campaign=slideshare_make_sharing_viral_v2&utm_variation=control&utm_medium=share

I’ve recently been working on the public API and integration strategy for a product, and revisiting this very point to ensure there is time for the wider needs.

There is good news in this area. The IANA-governed WELL-KNOWN path has been added to as a result of the IETF RFC 9727, which defines a way of structuring a list of defined APIs for the api-catalog path (e.g., https://www.example.com/.well-known/api-catalog).

The URI then returns a JSON-based payload using the relatively new media type of linkset (application/linkset+json), which is consists of an anchor to the actual API. Additional attributes (per IETF recommendations) can be supplied to reference metadata, such as the API specification. and other attributes made up of an href and a type. The RFC includes a set of suggested additional metadata references, which could cover details such as:

Attribute Name	Description
service-doc	Link to the API specification document, such as the Open API Specification
status	Link to the endpoint providing status information the API, for example whether the service is currently available.
service-meta	Additional machine readable metadata that maybe needed.
availability	Details about the service availability, e.g. are there any SLAs/SLOs, when maintenance windows may occur etc
performance	Details of any rate limits imposed upon the API
usage	The provider of the api-catalog may wish to correlate requests to the /.well-known/api-catalog URI with subsequent requests to the API URIs listed in the catalog
current	Provide information about which services may be deprecated or no longer available.
	Suggested additional attributes.

Given this, the response data structure could look something like this:

			
{"linkset": [
     {
      "anchor": "https://developer.example.com/apis/foo_api",
      "service-desc": [
       {
        "href": "https://developer.example.com/apis/foo_api/spec",
        "type": "application/yaml"
       }
      ],
      "status": [
       {
        "href": "https://developer.example.com/apis/foo_api/status",
        "type": "application/json"
       }
      ],
      "service-doc": [
       {
        "href": "https://developer.example.com/apis/foo_api/doc",
        "type": "text/html"
       }
       ]
       },
       [#... next API service]
}

		

This opens the door to referencing additional content that may not be available in the current API specifications (and for wrapping structural context around GraphQL APIs, which don’t have as good documentation semantics as OAS, for example). But, as we’ve mentioned, even OAS doesn’t provide an easy way to publish references to SDKs for example, which using the catalog could not be easily referenced. The question is: how do we identify these different building blocks? Here, apicommons.org can come to our rescue. Rather than provide a highly prescriptive specification like the OpenAPI Specification and others. It has reviewed the different standards and practices and identified the commonly needed resources, ranging from authentication details to versioning strategies. Each entry provides recommended tags or attributes to use, which map perfectly into the additional entries for the IETF 9727 linkset.

If you review all the aspects described in apicommons.org, you’ll note that the potential list of resources that you may wish to offer is extensive, and may not be best suited to all being in each catalog entry. This isn’t an issue. Alongside the API Commons site, there is another partner site, APIS.json. This offers a machine-readable API definition. It does not seek to compete with OAS, etc., or to disrupt or displace standards such as OAS; doing so would be a real uphill struggle, as it is too well adopted (to the point we’ve seen good alternatives such as API Blueprint, and RAML fall away to minimal or no advancement). APIs.json is a wrapper definition to the likes of OAS, providing a standardization of the elements identified by APICommons. So we could simplify our catalog to the mandatory API endpoint and a supporting reference to the API.json, which will bring all the resources mentioned together.

While these sights place a clear emphasis on JSON-centric APIs, nothing prevents much of this from being applied to non-JSON APIs (in fact, if you look at the spec for JSON.APIs, you’ll see references to technologies such as WADL and others). The only JSON restriction is that files must be defined in JSON or YAML.

Human-searchable API catalog

Until a few years ago, the Programmable Web website managed a curated catalog of APIs – and it was a fantastic resource, both for discovering potential third-party resources, but also ideas on how to best model your own API specifications, as the API Evangelist blogged that site has since closed its doors. With all the above efforts, there is clearly an opportunity to automate curation of public API services, which is much more viable. This is the direction APIs.io is taking: it is not crawling the web to collect API documents. However, given the support of a search engine and the WELL-KNOWN specification, it is certainly achievable and more a question of funding and, probably, investment in functionality to filter out poor or incomplete API specifications. Currently, to have an API integrated into the catalog, you must manually submit your APIs.json specification.

When using APIs.io, APIS.json, and API Commons, we should treat these resources with respect; they’re beautifully clean sites, knowledge-rich, with no advertising to fund them, and being paid for by the likes of Kin Lane (API Evangelist), Nicolas Grenier (Typeform Inc.), and Steven Willmott (Timewarp Labs).

Useful links

Fluent Bit classic to YAML

16 Friday Jan 2026

Posted by mp3monster in development, Fluentbit, Fluentd, General, python, Technology

≈ Leave a comment

Tags

Fluent Bit, Fluentbit, tool, utility, YAML

Many of the most potent features being added to Fluent Bit are only available in the newer YAML format. At the same time Fluent Bit is celebrating its 10th anniversary, so we can be sure that are some established deployments that are being maintained with the classic format.

The transition between the two Fluent Bit formats can be a bit fiddly. We built a tool as described in my Fluent Bit config from classic to YAML post using Java over a year ago. While it addressed many of the needs, I wasn’t entirely happy with the solution, for several reasons:

Written in Java, which, as a language, is the one I’m most at home with, as an implementation, it does increase the workload for using and implementing – needing a container, etc., ideally. Not unreasonable, but as a small tool project, we’re not working with a nice enterprise build pipeline, so from a code tweak to test is just fiddlier. I’m sure any Java devs reading this will be shouting, ” Setting up Maven, Jenkins, JUnit isn’t difficult – and they’re right, but it’s a time-consuming distraction from working on the real problem.
While it works, the code didn’t feel as clean as it could or should.

Over the last couple of years, we’ve become more comfortable with Python, and messing around with LLMs and code generation has helped accelerate our development skills. Python offers a potentially better solution: many environments that use Fluentd or Fluent Bit are Linux-based and include Python by default, so you can use the tool without messing with containers if you don’t want to. For a tool intended to support development and maintenance of configurations rather than in production being able to use or modify/extend the code easily is attractive.

We’ve started afresh and built a new implementation that is more extensible and maintainable. The new Python implementation is under the same Apache 2 license, but in a different repo so that existing use and forks of the Java implementation aren’t disrupted.

We can wrap Python with all the production-class build processes, but it’s not necessary to get a tool up and running or to enable people to tinker as needed. So we have focused on that. Over time, we’ll play with Poetry or other packaging mechanisms to make it even easier.

The new implementation can be found at https://github.com/mp3monster/fluent-bit-classic-to-yaml-converter

Continue reading →

Fluent Bit Processors and Processor Conditions

18 Tuesday Nov 2025

Posted by mp3monster in General, Technology, Fluentbit

≈ Leave a comment

Tags

conditional, configuration, examples, FLB, Fluent Bit, Logs and Telemetry, processors

Fluent Bit’s processors were introduced in version 3. But we have seen the capability continue to advance, with the introduction of the ability to conditionally execute processors. Until the ability to define conditionality directly with the processor, the control has had to be fudged around (putting processors further downstream and controlling it with filters and routing, etc).

In the Logs and Telemetry book, we go through the basic Processor capabilities and constraints, such as:

Configuration file constraints
Performance differences compared to using filter plugins

So we’re not going to revisit those points here.

We can chain processors to run in a sequence by directly defining the processors in the order in which they need to be executed. You can see the chaining in the example below.

Scenario

In the following configuration, we have created several input plugins using both the Dummy plugin and the HTTP plugin. Using the HTTP plugin makes it easy for us to control execution speed and change test data values, which helps us see different filter behaviours. To make life easy we’ve provided several payloads, and a simple caller.[bat|sh] script which takes a single parameter [1, 2, 3, …] which identifies the payload file to send.

All of these resources are available in the Book’s GitHub repository as part of the extras folder, which can be found here. This saves us from embedding everything into this blog.

Filters as Processors and Chaining

Filters can be used as processors as well as the dedicated processor types such as SQL asnd content modifier. The Filters just need to be referenced using the name attribute e.g. name: regex would use the REGEX filter.

Processor Conditions

If you’re familiar with Kubernetes selectors syntax, then the definition conditions for a processor will feel familiar. The condition is made up of a condition (#–A–) and the condition will contain one or more rules (#–B–). The rule defines an expression which will yield a Boolean outcome by identifying:

An element of the payload (for example, a field/element in the log structure, or a metric, etc).
The value to evaluate the field against
The evaluation operator, which can be one of the typical operators e.g eq (equals), (see below for the full list).

Since the rules are a list, you can include as many as needed. The condition, in addition to the rule , has its own operator (#–C–), which tells the condition how to combine the results of each of the rules together. As we need a Boolean value, we can only use a logical and or a logical or. When we have a single rule then the operation is tested with itself.

In the following example, we have two inputs with processors to help demonstrate the different behavior. In the dummy source, we can see how a nested element can be accessed (i.e. $<element name>[‘<child element name>‘] ), performing a string comparison. Here we’re using a normal filter plugin as a processor.

With our HTTP source, we’re demonstrating that we can have two processors with their own conditions. The first processor is interesting, as it illustrates an exception to the convention; we can express conditionality within the Lua code (#–D–), but it ignores the condition construct. It is obviously debatable as to the value of a condition for the Lua processor, but it is worth considering, as there is an overhead when calling the LuaJIT if the condition can be quickly resolved internally.

service:
  flush: 1
  log_level: debug
  
pipeline:
  inputs:
    - name: dummy
      dummy: '{"request": {"method": "GET", "path": "/api/v1/resource"}}'
      tag: request.log
      Interval_sec: 60
      processors:
        logs:
          - name: content_modifier
            action: insert
            key: content_modifier_processor
            value: true
            condition:     #--A--
              op: and      #--C--
              rules:       #--B--
                - field: $request['method']"
                  op: eq
                  value: "GET"    
    - name: http
      port: 9881
      listen: 0.0.0.0
      successful_response_code: 201
      success_header: x-fluent-bit received
      tag: http
      tag_key: token
      processors:
        logs:
          - name: lua
            call: modify
            code: |
              function modify(tag, timestamp, record)
                new_record = record
                new_record["conditional"] = "condition-triggered"
                return 1, timestamp, new_record
              end
            condition:   #--D--
              op: and
              rules:
                - field: "$classifier"
                  op: eq
                  value:  "1"
          - name: content_modifier
            action: insert
            key: content_modifier_processor2
            value: true
            condition:
              op: and
              rules:
                - field: "$classifier"
                  op: eq
                  value: "2"
          - name: sql
            query: "SELECT token, classifier FROM STREAM;"
            condition:
              op: and
              rules:
                - field: "$classifier"
                  op: eq
                  value: "3"
  outputs:
    - name: stdout
      match: "*"

To run the demonstration, we’ve provided several test payloads and a simple script that will call the Fluent Bit HTTP input plugin with the correct file. We just need to pass the number associated with the log file e.g. <Log>1<.json> is caller.[bat|sh] 1, and so on. The script is a variation of:

set fn=log%1%.json
echo %fn
curl -X POST --location 127.0.0.1:9881 --header Content-Type:application/json --data @%fn%

An example of one of the test payloads:

{"msg" : "dynamic tag", "helloTo" : "the World", "classifier" : 1, "token": "token1"}

Conclusion

Once you’ve got a measure of the condition structure, making the processors conditional is very easy.

Operators available


Operator	Greater than ( > )
eq	Equals
neq	Not Equals
gt	Greater than, or equal to ( >= )
gte	Grather than, or equal to ( >= )
lt	Less than ( < )
lte	Less than or equal to ( =< )
in	Is a value in a defined set (array) e.g. op: in value: [“a”, “b”, “c”]
not_in	Is a value not in a defined set (array) e.g. op: not_in value: [“a”, “b”, “c”]
regex	Matches the regular expression defined e.g op: regex value: ^a*z
not_regex	Does not match the regular expression provided e.g op: not_regex value: ^a*z

MCP Security

30 Thursday Oct 2025

Posted by mp3monster in AI, development, General, Technology

≈ Leave a comment

Tags

AI, artificial-intelligence, attack, attacks, cybersecurity, MCP, model context protocol, Paper, Security, Technology, vectors

MCP (Model Context Protocol) has really taken off as a way to amplify the power of AI, providing tools for utilising data to supplement what a foundation model has already been trained on, and so on.

With the rapid uptake of a standard and technology that has been development/implementation led aspects of governance and security can take time to catch up. While the use of credentials with tools and how they propagate is well covered, there are other attack vectors to consider. On the surface, it may seem superficial until you start looking more closely. A recent paper Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions highlights this well, and I thought (even if for my own benefit) to explain some of the vectors.

I’ve also created a visual representation based on the paper of the vectors described.

The inner ring represents each threat, with its color denoting the likely origin of the threat. The outer ring groups threats into four categories, reflecting where in the lifecycle of an MCP solution the threat could originate.

I won’t go through all the vectors in detail, though I’ve summarized them below (the paper provides much more detail on each vector). But let’s take a look at one or two to highlight the unusual nature of some of the issues, where the threat in some respects is a hybrid of potential attack vectors we’ve seen elsewhere. It will be easy to view some of the vectors as fairly superficial until you start walking through the consequences of the attack, at which point things look a lot more insidious.

Several of the vectors can be characterised as forms of spoofing, such as namespace typosquatting, where a malicious tool is registered on a portal of MCP services, appearing to be a genuine service — for example, banking.com and bankin.com. Part of the problem here is that there are a number of MCP registries/markets, but the governance they have and use to mitigate abuse varies, and as this report points out, those with stronger governance tend to have smaller numbers of services registered. This isn’t a new problem; we have seen it before with other types of repositories (PyPI, npm, etc.). The difference here is that the attacker could install malicious logic, but also implement identity theft, where a spoofed service mimics the real service’s need for credentials. As the UI is likely to be primarily textual, it is far easier to deceive (compared to, say, a website, where the layout is adrift or we inspect URIs for graphics that might give clues to something being wrong). A similar vector is Tool Name Conflict, where the tool metadata provided makes it difficult for the LLM to distinguish the correct tool from a spoofed one, leading the LLM to trust the spoof rather than the user.

Another vector, which looks a little like search engine gaming (additional text is hidden in web pages to help websites improve their search rankings), is Preference Manipulation Attacks, where the tool description can include additional details to prompt the LLM to select one solution over another.

The last aspect of MCP attacks I wanted to touch upon is that, as an MCP tool can provide prompts or LLM workflows, it is possible for the tool to co-opt other utilities or tools to action the malicious operations. For example, an MCP-provided prompt or tool could ask the LLM to use an approved FTP tool to transfer a file, such as a secure token, to a legitimate service, such as Microsoft OneDrive, but rather than an approved account, it is using a different one for that task. While the MCP spec says that such external connectivity actions should have the tool request approval, if we see a request coming from something we trust, it is very typical for people to just say okay without looking too closely.

Even with these few illustrations, tooling interaction with an LLM comes with deceptive risks, partially because we are asking the LLM to work on our behalf, but we have not yet trained LLMs to reason about whether an action’s intent is in the user’s best interests. Furthermore, we need to educate users on the risks and telltale signs of malicious use.

Attack Vector Summary

The following list provides a brief summary of the attack vectors. The original paper examines each in greater depth, illustrating many of the vectors and describing possible mitigation strategies. While many technical things can be done. One of the most valuable things is to help potential users understand the risks, use that to guide which MCP solutions are used, and watch for signs that things aren’t as they should be.

Continue reading →

AI to Agriculture

17 Friday Oct 2025

Posted by mp3monster in General, Oracle, Technology

≈ Leave a comment

Tags

AI, artificial-intelligence, Cloud, development, Oracle, Technology

Now that details of the product I’ve been involved with for the last 18 months or so are starting to reach the public domain (such as the recent announcement at the UN General Assembly on September 25), I can talk to a bit about what we’ve been doing. Oracle’s Digital Government Global Industry Unit has been working on a solution that can help governments address the questions of food security.

So what is food security? The World Food Programme describes it as:

Food security exists when people have access to enough safe and nutritious food for normal growth and development, and an active and healthy life. By contrast, food insecurity refers to when the aforementioned conditions don’t exist. Chronic food insecurity is when a person is unable to consume enough food over an extended period to maintain a normal, active and healthy life. Acute food insecurity is any type that threatens people’s lives or livelihoods.
World Food Programme

By referencing the World Food Programme, it would be easy to interpret this as a 3^rd world problem. But in reality, it applies to just about every nation. We can see this, with the effect the war in Ukraine has had on crops like Wheat, as reported by organizations such as CGIAR, European Council, and World Development journal. But global commodities aren’t the only driver for every nation to consider food security. Other factors such as Food Miles (an issue that perhaps has been less attention over the last few years) and national farming economics (a subject that comes up if you want to it through a humour filter with Clarkson’s Farm to dry UK government reports and US Department of Agriculture.

Looking at it from another perspective, some countries will have a notable segment of their export revenue coming from the production of certain crops. We know this from simple anecdotes like ‘for all the tea in China’, coffee variants are often referred to by their country of origin (Kenyan, Columbian etc.). For example, Palm Oil is the fourth-largest economic contributor in Malaysia (here).

So, how is Oracle helping countries?

One of the key means of managing food security is understanding food production and measuring the factors that can impact it (both positively and negatively), which range from the obvious—like weather (and its relationship to soil, water management, etc.) —to what crop is being planted and when. All of which can then be overlayed with government policies for land management and farming subsidies (paying farmers to help them diversify crops, periodically allowing fields to go fallow, or subsidizing the cost of fertilizer).

Oracle is a technology company capable of delivering systems that can operate at scale. Technology and the recent progress in using AI to help solve problems are not new to agriculture; in fact, several trailblazing organizations in this space run on Oracle’s Cloud (OCI), such as Agriscout. Before people start assuming that this is another story of a large cloud provider eating their customers’ lunch, far from it, many of these companies operate at the farm or farm cooperative level, often collecting data through aerial imagery from drones and aircraft, along with ground-based sensors. Some companies will also leverage satellite imagery for localized areas to complement these other sources. This is where Oracle starts to differentiate itself – by taking high-resolution imagery (think about the resolution level needed to differentiate Wheat and Maize, or spot rice and carrots, differentiate an orchard from a natural copse of trees). To get an idea, look at Google Earth and try to identify which crops are growing.

We take the satellite multi-spectral images from each ‘satellite over flight’ and break it down, working out what the land is being used for (ruling out roads, tracks, buildings, and other land usage). To put the effort to do this into context, the UK is 24,437,600,000 square meters and is only 78^th in the list of countries by area (here). It’s this level of scale that makes it impractical to use more localized data sources (imagine how many people and the number of drones needed to fly over every possible field in a country, even at a monthly frequency).

This only solves the 1^st step of the problem, which is to tell us the total crop growing area. It doesn’t tell us whether the crop will actually grow well and produce a good yield. For this, you’re going to need to know about weather (current, forecast, and historic trends), soil chemical composition and structure, and information such as elevation, angle, etc. Combined with an understanding of optimal crop growing needs (water levels, sun light duration, atmospheric moisture, soil types and health) – good crops can be ruined by it simply being too wet to harvest them, or store them dryly. All these factors need to be taken into account for each ‘cell’ we’re detecting, so we can calculate with any degree of confidence what can be produced.

If this isn’t hard enough, we need to account for the fact that some crops may have several growing seasons per year, or succession planting is used, where Carrots may be grown between March and June, followed by Cucumbers through to August, and so on.

Using technology

Hopefully, you can see there are tens of millions of data points being processed every day, and Oracle’s data products can handle that. As a cloud vendor, we’re able to provide the computing scale and, importantly, elasticity, so we can crunch the numbers quickly enough that users benefit from the revised numbers and can work out mitigation actions to communicate to farmers. As mentioned, this could be planning where to best use fertilizer or publishing advice on when to plant which crops for optimal growing conditions. In the worst cases recognizing there is going to be a national shortage of a staple crop and start purchasing crops from elsewhere and ensure when the crops arrive in ports they get moved out to the markets (like all large operations – as we saw with the Covid crises – if you need to react quickly, more mistakes can be made, costs grow massively driven by demand).

I mentioned AI, if you have more than the most superficial awareness of AI, you will probably be wondering how we use it, and the problems of AI hallucination – the last thing you want is a being asked to evaluate something and hallucinating (injecting data/facts that are not based on the data you have collected) to create a projection. At worst, this would mean providing an indication that everything is going well, when things are about to really go wrong. So, first, most of the AI discussed today is generative, and that is where we see issues like hallucinations. We’re have and are adopting this aspect of AI where it fits best, such as explainability and informing visualization, but Oracle is making heavy use of the more traditional ideas of AI in the form of Machine Learning and Deep Learning which are best suited to heavy numerical computational uses, that is not to say there aren’t challenges to be ddressed with training the AI.

Conclusion

When it comes to Oracle’s expertise in the specialized domains of agriculture and government, Oracle has a strong record of working with governments and government agencies from its inception. But we’ve also worked closely with the Tony Blair Institute for Global Change, which works with many national government agencies, including the agriculture sector.

My role in this has been as an architect, focused primarily on applying integration techniques (enabling scaling and operational resilience, data ingestion, and how our architecture can work as we work with more and more data sources) and on applying AI (in the generative domain). We’re fortunate to be working alongside two other architects who cover other aspects of the product, such as infrastructure needs and the presentation tier. In addition, there is a specialist data science team with more PhDs and related awards than I can count.

Oracle’s Digital Government business is more than just this agriculture use case; we’ve identified other use cases that can benefit from the data and its volume being handled here. This is in addition to bringing versions of its better-known products, such as ERP, Healthcare (digital health records management, vaccine programmes, etc.), national Energy and Water (metering, infrastructure management, etc).

For more on the agricultural product:

Fluent Bit 4.1 what to be excited about in this major release?

10 Friday Oct 2025

Posted by mp3monster in Fluentbit, General, Technology

≈ Leave a comment

Tags

4.1, Fluent Bit, Fluentbit, release

Fluent Bit version 4’s first major release dropped a couple of weeks ago (September 24, 2025). The release’s feature list isn’t as large as 4.0.4 (see my blog here). It would therefore be easy to interpret the change as less profound. But that isn’t the case. There are some new features, which we’ll come to shortly, but there are some significant improvements under the hood.

Under the hood

Fluent Bit v4.1 has introduced significant enhancements to its processing capabilities, particularly in the handling of JSON. This is really important as handling JSON is central to Fluent Bit. As many systems are handling JSON, there is at the lowest levels, continuous work happening in different libraries to make it as fast as possible. Fluent Bit has amended its processing to use a mechanism based on a yyjson, without going into the deep technical details. If you examine the benchmarking of yyjson and other comparable libraries, you’ll see that its throughput is astonishing. So accelerating processing by even a couple of percentage points can have a profound performance enhancement.

The next improvement in this area is the use of SIMD (Single Instruction, Multiple Data). This is all about how our code can exploit microprocessor architecture to achieve faster processing. As logs often need characters like new line, quotes, and other special characters encoding, and logs often carry these characters (and we tend to overlook this – consider a stack trace, where each step of the stack chain is provided on a new line, or embedding a stringified XML or JSON block in the log, which will hve multiple uses of quotes etc. As a result, any optimization of string encoding will quickly yield performance benefits. As SIMD takes advantage of CPU characteristics, not every build will have the feature. You can check to see if SIMD is being used by checking the output during startup, like tthis:

This build for Windows x86-64 as shown in the 3rd info statement doesn’t have SIMD enabled.

Other backend improvements have been made with TLS handling. The ability to collect GPU metrics with AMD hardware (significant as GPUs are becoming ever more critical not only in large AI arrays, but also for AI at the edge).

So, how do we as app developers benefit?

If Fluent Bit’s performance improves, we benefit from reduced risk of backpressure-related issues. We can either complete more work with the same compute (assuming we’re at maximum demand all the time), or we can reduce our footprint – saving money either as capital expenditure (capex) or operational expenditure (opex) (not something that most developers are typically seeking until systems are operating at hyperscale). Alternatively, we can further utilize Fluent Bit to make our operational life easier, for example, by reducing sampling slightly, implementing additional filtering/payload load logic, and streaming to detect scenarios as they occur rather than bulk processing on a downstream platform.

Functional Features

In terms of functionality, which as a developer we’re very interested in as it can make our day job easier, we have a few features.

AWS

In terms of features, there are a number of enhancements to support AWS services. This isn’t unusual; as the AWS team appears to have a very active and team of contributors for their plugins. Here the improvement is for supporting Parquet files in S3.

Supervisory Process

As our systems go ever faster, and become more complex, particularly when distributed it becomes harder to observe and intervene if a process appears to fail or worse becomes unresponsive. As a result, we need tools to have some ‘self awareness’. Fluen bit introduces the idea of an optional supervisor process. This is a small, relatively simple process that spawns the core of Fluent Bit and has the means to check the process and act as necessary. To enable the supervisor , we can add to the command line –supervisor. This feature is not available on all platforms, and the logic should report back to you during startup if you can’t use the feature. Unfortunately, the build I’m trying doesn’t appear to have the supervisor correctly wired in (it returns an error message saying it doesn’t recognize the command-line parameter).

If you want to see in detail what the supervisor is doing – you can find its core in /src/flb_supervisor.c with the supervisor_supervise_loop function, specifically.

Conclusion

With the number of differently built and configured systems, we’ll see a 4.1.x releases as these edge case gremlins are found and resolved.

Phil (aka MP3Monster)'s Blog

~ from Technology to Music

Category Archives: Technology

Enterprise scale NL2SQL

Moving forward

Inject context from the UI

Simplifying the schema

Simplifying through an Ontology and Graphs

Graphs – the magic sauce of ontology and its application

Ontology more than an aide to AI

Getting from Natural Language to the query

Growing evidence that graph is the way to go

Useful Resources

A final note on Graph Querying

Returning to Chat Ops

How do we advance this?

Let’s look deeper

What does this mean to our ChatOps proof of concept?

Conclusion

Root cause analysis – understanding deployment changes

Updates

Intersection of API documentation and Search optimisation

Human-searchable API catalog

Useful links

Fluent Bit classic to YAML

Fluent Bit Processors and Processor Conditions

Scenario

Filters as Processors and Chaining

Processor Conditions

Conclusion

Operators available

MCP Security

Attack Vector Summary

Fluent Bit 4.1 what to be excited about in this major release?

Under the hood

So, how do we as app developers benefit?

Functional Features

AWS

Supervisory Process

Conclusion

Quick summary

Deployment options and scripts

Server screenshots

What the OpAMP server (provider) does

What the supervisor (consumer) does for Fluent Bit

Where ChatOps fits

Reality check: we are still testing

What is coming next (based on docs/features.md)

Why this matters

Share this:

Moving forward

Inject context from the UI

Simplifying the schema

Simplifying through an Ontology and Graphs

Graphs – the magic sauce of ontology and its application

Ontology more than an aide to AI

Getting from Natural Language to the query

Growing evidence that graph is the way to go

Useful Resources

A final note on Graph Querying

Share this:

How do we advance this?

Let’s look deeper

What does this mean to our ChatOps proof of concept?

Conclusion

Root cause analysis – understanding deployment changes

Updates

Share this:

Understanding the OpAMP Protocol and what it offers

The benefits we see

Implementation

Share this:

Human-searchable API catalog

Useful links

Share this:

Share this:

Scenario

Filters as Processors and Chaining

Processor Conditions

Conclusion

Operators available

Share this:

Attack Vector Summary

Share this:

So, how is Oracle helping countries?

Using technology

Conclusion

Share this:

Under the hood

So, how do we as app developers benefit?

Functional Features

AWS

Supervisory Process

Conclusion

Share this:

What is coming next (based on `docs/features.md`)