We’ve just had a new article published by Software Engineering Daily which looks at monitoring in multi-cloud and hybrid use cases, and highlights some strategies that can help support a single pane of glass by exploiting features in tools such as Fluentd and Fluent Bit that perhaps aren’t fully appreciated. Check it out …
Outside of my Oracle cloud-related content, we’ve just published an article on DZone. Those who follow this blog will be familiar with the article’s theme, as it relates to the Log Simulator work. We’ve also written for Devmio – although we don’t yet know when that article will be published, or whether the content will be publicly available or behind their paywall.
One of the benefits of using cloud providers is the potential to scale solutions to meet demand by scaling out with additional compute nodes without concern for physical capacity limits (cost considerations are smaller, but still apply). Dynamic scaling is typically driven by monitoring defined groups of nodes for CPU use; when demand hits a threshold, an additional node is spun up. Kubernetes can support some more nuanced scaling configurations, but this tends to be used for microservice-style solutions.
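As a minimal sketch of that standard approach (assuming a Kubernetes Deployment hypothetically named my-service already exists), the CPU-threshold model can be expressed with a single HorizontalPodAutoscaler:

```bash
# Classic CPU-driven scaling: add replicas when average CPU crosses a threshold.
# "my-service" is a hypothetical Deployment name - substitute your own.
kubectl autoscale deployment my-service --cpu-percent=75 --min=2 --max=10

# Inspect the resulting HorizontalPodAutoscaler and its current utilization.
kubectl get hpa my-service
```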
This is all well and good, but if we need to finesse the configuration so that multiple container instances (such as microservices) can be allowed on a single node without compromising the spread of containers across the nodes of a cluster for resilience, things get a lot more challenging. We also want to manage the number of active containers – we don’t want unnecessary volumes of containers effectively idling. So how do we manage this?
In addition to this, what if we’re using services that need tens of seconds or even minutes to be spun up to a point of readiness – such as instantiating database servers, or adding new nodes to data caches such as Coherence, which will need time to clone data and adjust the demand-balancing algorithms? Likewise for more traditional application servers. Some service calls will trigger upstream configuration changes, such as altering load-balancing configurations. Cloud-native load balancers typically understand node groups, but what if your configuration is more nuanced? If you’re running services such as ActiveMQ, you need to update all the impacted nodes in a cluster to make them aware of the new node. The bottom line is that some solutions, or parts of solutions, need lead time to handle increased demand.
This points to more advanced, nuanced metrics than simply current CPU load, and possibly a scaling algorithm that is aware of the lead times of the different types of functionality involved. So how can we build more nuanced, informed scaling logic? You could look at the inbound traffic on firewalls or load balancers, but that requires a fair bit of additional effort to apply application context. For example, is the traffic just being directed to static page content? We do need to derive some context so that the right scaling rules are triggered. Even in simple K8s-only hosted solutions, a cluster may host services that aren’t related to the change in demand and therefore shouldn’t be scaled as a result.
A use case perspective
Let’s look at this from a hypothetical use case. We’re a music streaming service. Artists can control when their music is made available, but the industry norm is 12:01am on a Friday morning. There is always a demand spike at this time, but the size of the spike can vary, with some artists triggering larger spikes (this doesn’t correlate directly with an artist’s popularity; some smaller groups have very enthusiastic fans) – so dynamic scaling is needed, and it must be reactive to demand. Reporting, analytics, and payment services see cyclical bumps around the monthly financial periods, and these monthly cyclic activities can be done during quieter operating hours. So it is clear we may wish to apply logical partitioning, but for maximum cost efficiency keep everything in the same cluster. There are correlations between different service demands – certain data services see increased demand during both the release spikes and the reporting period. The bottom line is we need to not only address dynamic scaling, but tailor the scaling to different services at different times.
Back to our question – how do we manage the scaling? We could monitor the firewall and load balancer logs and analyze the kinds of requests being received, but that would need additional processing logic to determine which services are receiving the traffic. Our API gateway, however, is likely to have that intelligence implicitly in place, as we have different API policies for the different types of endpoints. So we can monitor specific endpoints or groups of endpoints very easily – meaning we can infer the types of traffic demand and how to respond. Not only that, we may have API plans in place, so we can control priority and prevent the free versions of our service from using APIs to initiate high-quality media streams. In other words, we already have business- and process-specific meaning. So linking scaling controls such as KEDA (Kubernetes Event-driven Autoscaling) directly to measurements such as plans and specific APIs creates a relatively easy way to control scaling. Further, we may also use the gateway to provide a rate throttle so it’s not possible to crash our backend with an instantaneous spike. This strategy isn’t that different from an approach we’ve demonstrated for scaling message processing backends (see here and here).
Representation of how we can use the metrics from a gateway to not only support the scaling of a K8s cluster but also other cloud services
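To make this more tangible, here is a minimal sketch of what the KEDA side could look like, assuming the gateway’s per-endpoint request counts are exported to a Prometheus instance; the deployment name, metric name, and Prometheus address are all hypothetical:

```bash
# Minimal KEDA ScaledObject sketch: scale a (hypothetical) media-stream deployment
# on the request rate the gateway reports for the streaming endpoints.
# Assumes KEDA is installed and the gateway metrics are scraped by Prometheus.
kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: media-stream-scaler
spec:
  scaleTargetRef:
    name: media-stream            # hypothetical Deployment serving the streams
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        # hypothetical Prometheus address and gateway metric name
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(gateway_requests_total{endpoint="stream"}[2m]))
        threshold: "100"          # add a replica for roughly every 100 req/s
EOF
```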
We can also use the data KEDA sees about node readiness to adjust the API Gateway rate limiting dynamically, either as a direct trigger from the number of active container instances or by triggering a more advanced check, because we still have to address the services that need more lead time before relaxing the rate limiting.
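A rough sketch of that feedback loop might look like the following; the gateway admin URL and payload are entirely hypothetical stand-ins for whatever management API your gateway exposes:

```bash
# Sketch of the feedback loop: only relax the gateway rate limit once enough
# replicas report Ready. The gateway admin URL and payload are hypothetical -
# substitute the management API of whichever gateway is in use.
READY=$(kubectl get deployment media-stream -o jsonpath='{.status.readyReplicas}')
if [ "${READY:-0}" -ge 10 ]; then
  curl -s -X PATCH "https://gateway.example.com/admin/rate-limits/stream" \
    -H "Content-Type: application/json" \
    -d '{"requestsPerSecond": 2000}'   # raise the cap now extra capacity is ready
fi
```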
Handling slow scaling services
So we’ve got a way of targeting the scaling of services. But what do we do to address the slower-scaling services we described? With the data dimensioned, we can do a couple of things. Firstly, we can use the rate of change to determine how many nodes to add. With a very sudden increase in demand, adding one node at a time will make service performance stutter as it scales: the additional capacity is consumed as soon as it arrives, and the cycle repeats. Factoring in the rate of change of the workload, and the context of which services could generate the increased workload, lets us determine not only which databases may need scaling but also how to adjust the workload threshold that actually triggers the node introduction. So a sudden increase in demand on services that are known to create a lot of DB activity is met by dropping the threshold that triggers new nodes from, say, 75% to 25% – we start the new nodes on a lower threshold, which means the process is effectively started sooner.
With different rates of utilization growth, we can see that we need to alter the trigger threshold for launching new resources
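As a purely illustrative sketch of that idea (the growth bands and threshold values are assumptions, and the growth figure would come from the gateway metrics discussed above):

```bash
# Illustrative only: derive the scale-out trigger threshold from how quickly
# DB-heavy traffic is growing. The growth figure would come from the gateway
# metrics (e.g. the Prometheus query above); the bands and values are assumptions.
GROWTH_PCT_PER_MIN=${1:-5}

if [ "$GROWTH_PCT_PER_MIN" -ge 20 ]; then
  TRIGGER_THRESHOLD=25   # demand is spiking - start new DB nodes very early
elif [ "$GROWTH_PCT_PER_MIN" -ge 10 ]; then
  TRIGGER_THRESHOLD=50   # moderate growth - bring the trigger forward a little
else
  TRIGGER_THRESHOLD=75   # steady state - the usual threshold is fine
fi

echo "Scale-out will trigger at ${TRIGGER_THRESHOLD}% utilization"
```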
For this to be fully effective, we do need the gateway tracking traffic both internally (East-West) and inbound (North-South). Using the gateway with East-West traffic also means we can establish an anti-corruption layer.
Conclusion
Not all parts of a solution will scale instantaneously – regardless of how fast the code startup cycle might be, some services have to move large chunks of data before becoming ready, e.g., in-memory databases. Some services may not have been built for the latest business demands or the ability to exploit cloud scale and dynamic scaling. Some services need time to adjust configurations.
We may also need to adjust our protection mechanisms if we’re protecting against service overloading.
Scaling capabilities that have response latency to the scaling process can be addressed by achieving earlier, more intelligent recognition of need than simply waiting for CPU load to hit a threshold. Deriving that intelligence from implicit service context simplifies the effort of creating it and makes it easier to recognize the change. API gateways, message queues, message streams, and shared data storage are all components that, by their nature, carry implicit context.
Pushing the recognition towards the front of the process creates milliseconds or possibly seconds more warning of the demand than waiting for the impacted nodes to see compute spikes.
We haven’t blogged too much recently, as we have been busy producing content for my employer, Oracle, working with Software Engineering Daily, and developing a collaborative book. So, I thought I’d pull together some links to these new resources.
One of the areas I present publicly is the use of Fluentd, including the use of distributed and multiple-node configurations. As many events have been virtual, it has been easy to demo everything from my desktop – everything is set up so I can demo things very easily. While doing all of this on one machine does point to how compact and efficient Fluentd is, as I can run multiple instances concurrently, it does undermine the distributed capabilities somewhat.
Given that I now work for Oracle, it makes sense to use OCI resources. With that, I have been developing scripts to configure Ubuntu VMs to set up the demo environments – installing Ruby, Fluentd, and the various gems needed, and pulling in the relevant configurations. All the assets can be found in the GitHub repository https://github.com/mp3monster/logging-demos. The repository readme includes plenty of information as well.
While I’ve been putting this together using OCI, the fact that everything is based on Ubuntu should mean it can be run locally on VMs or WSL2, and adapted for macOS as well. The way the environment has been configured means you can still run on a single Ubuntu node if desired.
Additional Log Destinations
As the demo will typically be run on OCI, we can not only run the demo with a multi-node setup, we have also extended the setup with several inclusion files so we can utilize the OCI services OpenSearch and OCI Log Analytics. If you don’t want to use these services, simply replace the contents of the relevant inclusion files with the contents of the dummy_inclusion.conf file provided.
Representation of the Demo setup
The configuration works by each destination having one or two inclusion files. The files with the postfix label-inclusion.conf contain the configuration that directs traffic to the respective service, pushing log events to the destination at a very high frequency. The second inclusion file injects the duplication of log events to each service. The inclusion declarations in the main node’s Fluentd config file reference an environment variable that should provide the path to the inclusion file to use. As a result, by changing the environment variable to point to a dummy file, it becomes possible to configure out the use of one of the services. The two inclusions mean we can keep the store declarations compact and show multiple labels being used. With the OpenSearch setup, we have a variant of the inclusion file model where the route inclusion can reference the logic that we would use in the label directly within the store declaration.
The best way to see the use of the inclusions is to experiment with setting the different environment variables to reference the different files and then using the Fluentd dry-run feature (more on this in the book).
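For example, switching a destination in or out and then validating the assembled configuration might look like this; the variable names and paths are illustrative, so check the repository readme for the exact ones the demo uses:

```bash
# Point the OpenSearch inclusion at the real configuration, and switch the
# Log Analytics feed off by pointing its variable at the dummy file.
# The variable names and paths are illustrative - check the repository readme
# for the exact names the demo uses.
export OPENSEARCH_INCLUDE=./opensearch-label-inclusion.conf
export LOGANALYTICS_INCLUDE=./dummy_inclusion.conf

# Fluentd's dry-run flag parses the assembled configuration (with the include
# files resolved via the environment variables) without starting any event
# processing, so mistakes are caught before the demo runs.
fluentd --dry-run -c ./main-node.conf
```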
Setup script
The setup script performs a number of tasks, including the following (a trimmed sketch of the script’s overall shape follows the list):
Pulling from Git all the resources needed in terms of configuration files and folders
Retrieving the plugins that may be needed.
Setting up the various environment variables for:
Slack token
environment variables to reference inclusion files
shortcut environment variables and aliases
network (IP) address for external services such as OpenSearch
Setting up a folder for the OCI tokens needed.
Setting up temp folders to be used by OCI Plugins as a file-based cache.
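A heavily trimmed sketch of the script’s overall shape is shown below; the real script in the repository does rather more, and the values here are placeholders:

```bash
#!/bin/bash
# Trimmed illustration of the setup flow - the real script in the logging-demos
# repository does more; tokens, paths, and addresses here are placeholders.

# Pull the configuration files and folder structure from Git.
git clone https://github.com/mp3monster/logging-demos.git

# Retrieve the plugins that may be needed.
sudo gem install fluentd fluent-plugin-slack fluent-plugin-opensearch

# Environment variables: Slack token, inclusion files, shortcuts, service addresses.
export SLACK_TOKEN="xoxb-placeholder"
export OPENSEARCH_INCLUDE=./opensearch-label-inclusion.conf
export OPENSEARCH_HOST=10.0.0.10
alias fstart='fluentd -c ./main-node.conf'

# Folders for the OCI tokens and for the OCI plugins' file-based cache.
mkdir -p ~/.oci ~/fluentd-cache
```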
Feeding the Log Analytics service is a more complex process to set up, as the feeds need to have metadata about the events being ingested. The downside is that the configuration effort is greater, but the payback is that it becomes easier to extract meaningful information quickly, because the service has a greater understanding of the content. For example, attributing the logs to a type of source means the predefined or default log formats are immediately understood, and maximum meaning can be retrieved from the log event.
Going directly to OCI Log Analytics does cut out the need for the Service Connector Hub, which would otherwise allow rules and routing to be defined to different OCI services – functionality that can help with things such as directing log events to PagerDuty.
As a consultant working with clients, we always need to address security considerations for clients, their networks, and their data. Typically this might mean ensuring I can connect to the correct network through a VPN with the secure client software installed, and then working through a Citrix set-up for the tools we’re allowed to use.
Since the start of the pandemic, there seems to have been a marked shift towards issuing consultants with customer-provided laptops that have been configured and locked down. This means I can’t use the client laptop to connect to my employer’s network and interact with our own systems – which would make it easy to leverage our existing resources to support the customer – and conversely there is no trust or contractual position that might allow our company devices to connect to a VPN or ring-fenced part of the client’s network.
Interestingly, there seems to have been a drift away from the idea of BYOD (Bring Your Own Device), which may come from the fact that, outside of smaller, very tech-savvy organizations, BYOD can be seen as challenging to support.
As this Google Trends report shows, interest over the last five years has, until the last couple of months, been generally trending downwards. Not authoritative proof, but it hints that BYOD hasn’t accelerated as you might expect given remote working.
By the customer supplying a laptop, there is an effort to control intrusion and other security risks. But the problem is that I now have a device that I could easily take offline and work on to defeat the security setup, and the client would be none the wiser; or, worse, it is another laptop that could ‘get lost’ or be ‘stolen’, with a greater chance of holding sensitive material. Every new device is without a doubt an elevated risk for the client and a cost to support (this, of course, is also an argument against BYOD).
Today was the first run of some new presentation material looking at the use of GitHub Actions with Runners deployed on the OCI Free Tier. The presentation was actually physical rather than virtual, which, after two years of virtual presenting, was rather refreshing. Not to mention that the UKOUG hosted the event at the Oval cricket ground, which made for an interesting venue. The example configuration is included in our GitHub OCI Utilities repository (as we use this solution to help validate and test our development work).
The presentation itself (which includes screenshots of the setup of a simple Action and Runner) is here. Note that I have disconnected my Runners, so while you will be able to see the Action configuration, nothing will happen if you try to trigger activity through my repository.
While there has been a deserved amount of publicity around the introduction of Arm compute onto OCI with the Ampere CPU offering, and the amazing level of Always Free compute being provided (24GB of memory and 4 cores, which can be used in any combination of servers), there have been some interesting announcements that perhaps haven’t drawn as much attention as they deserve. These include OCI support for GitHub Actions, plus several new DevOps services and an Artifact Registry. We’ll come back to the new services in another post. Today, let’s look at GitHub Actions.
Oracle’s data integration product landscape outside of GoldenGate has, with the arrival of Oracle Cloud, been confusing at times. This has meant that finding the right product documentation can be challenging, and knowing which product to adopt in your own technology road-map can be harder to work out. I believe the landscape is starting to settle now. But to understand the position, let’s look at the causes of disturbance and the changes that have occurred.
Why the complexity?
This has come, I think, from a couple of key factors. First, the organizational changes triggered by Thomas Kurian’s departure, which resulted in the product organization going from essentially three parts – aligning roughly to Infrastructure, Platform, and Applications – to two: Infrastructure and Apps. Add to this that Oracle’s cloud has gone through two revolutions. Generation 1, now called Classic, was essentially a recognition that they needed an answer to Microsoft, Google, and AWS quickly (Oracle are now migrating customers off Classic). Then came Generation 2, a more considered strategy that leverages not just the lowest levels of the stack – virtualization of network and compute – but drives changes all the way through the internals of applications by having them use common technologies such as microservices, along with a raft of software services such as monitoring, logging, metering, events, notifications, FaaS, and so on. Essentially, all the services they offer are also integral to their own offerings. The nice thing about Gen2 is that you can see a strong alignment to the CNCF (Cloud Native Computing Foundation) along with other open public standards (formal or de facto, such as MicroProfile with Helidon, and Apache). As a result, despite the perceptions of Oracle, modern apps built this way stand a better chance of portability.
Impact on ODI
Oracle’s data integration capabilities, cloud or otherwise, have been best known as Oracle Data Integrator, or ODI. The original ODI was the data equivalent of SOA Suite, implementing Extract Load Transform (ELT) rather than ETL, as this meant the Oracle DB was fully leveraged. It was built on WebLogic Server.
Along Came Cloud
Oracle Cloud came along, and there was a natural need for ODI capabilities. Like SOA Suite, the first evolution was to provide ODI Cloud Service, just as SOA Suite had SOA Cloud Service. Both are essentially the same on-premises products with UIs to manage deployment and configuration.
ODI’s transformation for the cloud led to ODI CS evolving into DIPC (Data Integration Platform Cloud). Very much an evolution, but with a more web-centered experience for designing the integrations. However, DIPC is no longer available (except possibly to customers already using it).
Whilst DIPC had been evolving, the requirement to continue with on-premises ODI capabilities remained. Whilst we don’t know for sure, we can speculate that divergent development was happening, creating overhead, as ODI as an on-prem solution remained. We then saw the arrival of ODI Marketplace, which provides an easier transition, including taking licensing considerations into account. DIPC has been superseded by ODI Marketplace.
Marketplace
Oracle has developed a Marketplace, just like the other major players, so that 3rd-party vendors can offer their technologies on the Oracle cloud, just as you can with Azure and AWS. But Oracle has also exploited it to offer, in the cloud, their traditional products normally associated with on-premises deployments. As a result we saw ODI Marketplace – a smart move, as it offers the possibility of carrying on-prem licensing into the cloud, along with portability.
So far, the ODI capabilities in their different forms have continued to leverage their WebLogic foundations. But by this time the Gen2 Oracle Cloud, and the organizational structures behind it, had become well established and were working up the value stack. The products in the middleware space have been impacted by both the technology strategies and the organization. As a result, the API capabilities, for example, have been aligned to the OCI-native space, while Integration Cloud has been moved towards the Apps space. In many respects this reflects a low-code vs code-native model.
OCI ODI
Earlier this year (2020), Oracle launched a brand new ODI product – to use its full name, Oracle Cloud Infrastructure Data Integration. This is an OCI-native (i.e., Gen2) solution leveraging microservices technologies.
This new product appears to be very much a ground-up build, as it exploits Apache Spark and Function as a Service (FaaS) as foundational elements. As a ground-up build, it doesn’t inherit all the adapters the original ODI can offer. This means that, as a solution today, it is very focused on specific needs around supporting data movement between the various Oracle Cloud storage and Database as a Service solutions, rather than general ingestion and extraction processes.
Conclusion
Products are evolving, but the direction of travel appears to be resolving. However, we are still in that period where there are capability gaps between the Gen2-native solution and the traditional ODI-via-Marketplace solution. As a result, the question becomes less about which product, and more about when – and, if I have to invest in using ODI Marketplace now, how to migrate when the native product catches up.
A few weeks ago Oracle announced a new tool for all Oracle Cloud users, including the Always Free tier. Cloud Shell provides a Linux (Oracle Linux 7.7) environment to use freely (within your tenancy’s monthly limits) – no paying for a VM, using up your limited set of VMs (for free-tier users), or anything like that.
As you can see, the Shell can be started using a new icon at the top right (highlighted). When you open the shell for the first time, it takes a few moments to instantiate – and you’ll see the message at the top of the console window (also highlighted). The window provides a number of controls which allow you to expand it to full screen and back again, etc.
The shell comes preconfigured with a number of tools, such as Terraform with the Oracle extensions, the OCI CLI, Java, and Git, so linking to Developer Cloud or GitHub, for example, to manage your scripts is easy (as long as you know your Git CLI – cheat sheet here). The info for these can be seen in the following screenshots.
In addition to the capabilities illustrated, the Shell is set up with the following (a quick way to verify what’s available is sketched after the list):
Python (2 and 3)
SQL Plus
kubectl
helm
maven
Gradle
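A quick check of the preinstalled tooling once the shell opens could look like this (output will vary with the image version):

```bash
# Quick check of the preinstalled tooling in Cloud Shell.
for tool in terraform oci java git python3 sqlplus kubectl helm mvn gradle; do
  printf '%-10s ' "$tool"
  command -v "$tool" >/dev/null 2>&1 && echo "available" || echo "not found"
done
```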
The benefit of all of this is that you can work from pretty much any device you like. It removes the need to manage and refresh security tokens locally to run scripts.
A few things to keep in mind whilst trying to use the Shell:
It is access-controlled through IAM, so you can of course grant or block the use of the tool (a sketch of such a policy follows this list). Even with access to the shell, users will obviously still need access to the other services to use the shell effectively.
The capacity of the home folder is limited to 5GB – more than enough for executing scripts and a few CLI-based tools and plugins, but that will be all.
If the shell goes unused for 6 months then the tenancy admin will be warned, but if not used, then the storage will be released. You can, of course, re-activate the Shell features at a future date, but of course, it will be a blank canvas again.
For reasons of security, access to the shell using SSH is blocked.
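As an illustration of the first point, granting a group access to Cloud Shell is a one-line IAM policy; the group name and tenancy OCID below are placeholders:

```bash
# Grant a (hypothetical) "shell-users" group access to Cloud Shell.
# Replace the tenancy OCID placeholder with your own root compartment OCID.
oci iam policy create \
  --compartment-id "ocid1.tenancy.oc1..exampleplaceholder" \
  --name "cloud-shell-access" \
  --description "Allow shell-users to launch Cloud Shell" \
  --statements '["Allow group shell-users to use cloud-shell in tenancy"]'
```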
The shell makes for a great environment from which to manage and perform infrastructure development, and it will be a dream for hard-core Linux users. For those who like to be lazy with a visual IDE, there are ways around it (e.g., edit in GitHub and sync). But power users will be more than happy with vim or vi.