Improving Oracle Service Bus Performance: First, Measure It
A little while back, I was asked an excellent question by one of my favourite clients. How do you tune Oracle Service Bus (OSB) to get better performance?
Like all good questions, it lacks an easy answer. In fact, it raises another question: how do you want to improve performance? It’s frustrating to go the long way around (I hate these kinds of answers myself when I run into them), but performance tuning should always be done in response to a measured and understood problem. Before we tune to improve ‘performance’, we need to understand what performance means and be able to point at metrics that show a clear bottleneck. Otherwise, we’re just fiddling with knobs to shoot for some vague notion of ‘better’ – the cargo cult version of performance tuning.
In later posts, I’m going to talk about the generalities of performance tuning and some of the knobs and dials that can be used to tweak OSB’s performance. But first, let’s talk about how we can measure performance, because the same monitoring mechanisms you use to measure performance initially are the ones you’ll use to quantify any improvement. So let’s look at some tools you can use, without requiring any commercial software, to zoom in from a high-level view of your application’s performance to specific performance views.
General checkups with service monitoring
Service monitoring gives you the average, minimum and maximum execution time for a service, along with the number of requests the service received over a configurable window of time. It’s enabled by ticking a single checkbox in the Operational Settings tab for an individual service, so it’s a quick way to get a high-level view of what’s going on.
It’s worth digressing for a moment into a discussion of statistics. With service monitoring, the key thing to watch is the maximum execution time for a service. Averages are a lousy indicator of general performance, because they gather so many data points into such a coarse aggregate. Averages are great at hiding the underlying story of what’s going on, yet a lot of performance testing tools use average response time as the key performance indicator.
Performance problems will typically fit into one of two categories:
- Consistently long-running operations, where there’s something intrinsic to the service itself, like an untuned SQL query, which causes all responses to take longer than they should. These are much easier to spot, and averages work fine for these.
- Services whose execution time varies from request to request, or that suffer from intermittent bottlenecks or bizarre corner conditions in algorithms. These are the ones that averages hide.
Those corner cases can represent the performance issues frustrated users or clients will report, but when we look at the overall system performance, everything seems fine to us, so we shrug our shoulders.
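To make that concrete, here’s a minimal Python sketch using made-up numbers (the response times and the `summarise` helper are my own illustration, not anything OSB produces). It assumes 98 requests that return in 0.2 seconds and two that take 5 and 6 seconds, and shows how little the average moves compared with the maximum.

```python
import statistics

def summarise(times):
    """Return the headline statistics for a list of response times in seconds."""
    return {
        "count": len(times),
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "max": max(times),
    }

# Hypothetical data: 98 fast requests and 2 slow outliers.
response_times = [0.2] * 98 + [5.0, 6.0]
summary = summarise(response_times)

for name, value in summary.items():
    print(f"{name}: {value}")
```

The mean comes out at roughly 0.3 seconds, which looks perfectly healthy, while the maximum is the only figure that hints at the two users who waited five seconds or more.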
What we need is a way to dig deeper – to get an indication of performance from every request that goes through the system. And that’s where HTTP access logs can help.
The performance metrics ELF
WebLogic Server has an HTTP access log for every server instance, which is configured by default to use the common log format – the same shape of information you’d see with an Apache web server, for example. It’s fairly straightforward to configure WebLogic Server to use the W3C extended log format (ELF) instead, which opens up a much broader set of fields for logging, including the time taken to return a response to a calling client for every successful HTTP request.
You can find instructions on how to configure extended log format in the supplemental material over here.
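As a rough idea of what you can do with the result, here’s a small Python sketch that reads an extended-format access log and pulls out the response time for each request. The sample log lines, the field list and the `parse_elf` helper are all hypothetical; the sketch assumes your log starts with a `#Fields:` directive that includes `time-taken`, that the value is in seconds, and that no field contains spaces (a quoted user agent, for example, would need smarter splitting).

```python
# Hypothetical sample of an extended log format access log.
SAMPLE_LOG = """\
#Version: 1.0
#Fields: date time cs-method cs-uri sc-status time-taken
2012-05-01 10:15:01 POST /orders/OrderService 200 0.153
2012-05-01 10:15:02 POST /orders/OrderService 200 3.201
"""

def parse_elf(lines):
    """Yield one dict per request, keyed by the names in the #Fields directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directives start with '#'; only #Fields tells us the column names.
            if line.lower().startswith("#fields:"):
                fields = line.split(":", 1)[1].split()
            continue
        values = line.split()
        # Skip entries that don't line up with the declared columns.
        if fields and len(values) == len(fields):
            yield dict(zip(fields, values))

records = list(parse_elf(SAMPLE_LOG.splitlines()))
times = [float(record["time-taken"]) for record in records]
print(times)
```

Reading the column names from the `#Fields:` directive, rather than hard-coding positions, means the same script keeps working if you add or reorder fields later.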
Once you’ve got statistics for every request, you can then post-process them into something like a histogram of response times.
This gives you a visual summary of what your performance is like. Are all of your requests clustered over on the left, indicating optimal performance? Great! Do you have a small spike of responses around 3 or 5 seconds? That tells you you have an intermittent bottleneck somewhere – aberrant behaviour that needs to be analysed further. And that means drilling down into the behaviour of those individual requests, which I’ll talk about down the track.
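If you don’t have a charting tool handy, a text histogram gets you most of the way there. The sketch below is one possible approach in Python, with made-up response times in seconds; the `histogram` and `render` helpers and the one-second bucket width are my own choices for illustration.

```python
import math
from collections import Counter

def histogram(times, bucket_seconds=1.0):
    """Count requests per bucket, keyed by each bucket's lower bound in seconds."""
    counts = Counter(
        math.floor(t / bucket_seconds) * bucket_seconds for t in times
    )
    return dict(sorted(counts.items()))

def render(hist, bucket_seconds=1.0):
    """Draw one row per non-empty bucket, with one '#' per request."""
    rows = []
    for start, count in hist.items():
        end = start + bucket_seconds
        rows.append(f"{start:4.1f}-{end:4.1f}s | {'#' * count} {count}")
    return "\n".join(rows)

# Hypothetical response times: mostly fast, with a few slow stragglers.
sample_times = [0.1, 0.2, 0.2, 0.3, 0.4, 0.4, 0.5, 3.1, 3.4, 5.2]
hist = histogram(sample_times)
print(render(hist))
```

Empty buckets are left out, so a gap between the first row and the next one is itself a sign that the slow requests form a separate group rather than a gradual tail.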
Before I leave you for now, one last thing worth talking about is how to get visualisations quickly from post-processed data. There are a lot of solutions out there for that problem, all of which vary in price, complexity and flexibility. My personal favourite is Splunk, because I can get answers as quickly as I can think of questions, and get both detailed information and a visual summary as a result.
If you have a particular tool that works well for you, I’d love to hear about it.