Some 'more, in-depth' information on Oracle BPEL PM, ESB and other SOA, day2day things

Friday, July 06, 2007

SOA Suite 10.1.3.3 patchset and "lost" instances

It has been a long time since I last blogged, mainly due to the huge number of miles flown over the last 2 months.

After being in Europe for conferences, where I did evangelism on our next-generation infrastructure based on Service Component Architecture, it was time to revisit my roots and do some consulting in a POC, as well as help one of our customers with their performance problems.

Meanwhile, while I was on the road, we released the 10.1.3.3 patchset, which includes, among many small fixes here and there, some really cool enhancements, e.g.:
  1. A fault policy framework for BPEL, which allows you to specify policies for faults (e.g. remoteFault) outside of the BPEL process and trigger retry activities, or submit the activity with a different payload from the console.

  2. Performance fixes for Oracle ESB - which boost the performance way up

  3. Several SOAP related fixes, especially around inbound and outbound WS-Security - if you happen to have Siebel and use WS-Sec features, the patch will make you happy

  4. Adapters: several huge improvements on performance and scalability

  5. BPEL 2 ESB: several fixes with transaction propagation as well as more sophisticated tracking

You can download it from MetaLink; the patch number is 6148874.

After working on 10.1.3.3 for the last 3 weeks, we added an enhancement to implement a federated ESB, where one ESB system binds to another via UDDI. The enhancement request number is 6133448; it will be part of 10.1.3.4 (our next patch release) and works exactly the way it works today in BPEL 10.1.3.1.

Back to my performance adventure.
The customer reported that under high load on their 10.1.3.1 instance, a lot of async instances (that were submitted to the engine) "got lost", meaning they could not find any trace of a running instance, nor had the target systems that were called from the process been updated. Strange, isn't it?

A quick look into the recovery queue (basically a select against the invoke_message table) revealed that a lot of instances had been scheduled (status 0), but somehow they stayed in the queue. Huh, why is that? Restarting the server helped somewhat: some instances were created, but still way too many weren't.

Checking the settings that we preseed, we figured out that there is an issue with them. The Developer's Guide states:

"the sum of dspMaxThreads of ALL domains should be <= the number of listener threads on the workerbean".

Hmm. Checking orion-ejb-jar.xml, section WorkerBean, in the application-deployments/orabpel/ejb_ob_engine folder revealed that
  1. there are no listener-threads set and
  2. there are 40 ReceiverThreads
What does that mean? Given that we seed each domain with dspMaxThreads set to 100, five domains would need 500 WorkerBean threads, which is way too many. And what happened to listener-threads?

<message-driven-deployment name="WorkerBean" instances="100" resource-adapter="BPELjms">

A quick check with JMS engineering enlightened me on that. As we use JMS connectors now, you need to change ReceiverThreads to match the above formula.

<config-property>
  <config-property-name>ReceiverThreads</config-property-name>
  <config-property-value>40</config-property-value>
</config-property>

- and tune the dispatcher threads (dspMaxThreads) on the domains to a reasonable level.
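
The rule from the Developer's Guide is plain arithmetic, so it can be checked before touching a server. The following is only a minimal sketch: the function names and the domain names are made up for illustration, and the even split in `suggest_even_split` is an assumption of mine, not an Oracle recommendation.

```python
def check_thread_budget(receiver_threads, dsp_max_threads_by_domain):
    """Check that the sum of dspMaxThreads over all domains fits ReceiverThreads.

    Returns (ok, total), where total is the summed dspMaxThreads.
    """
    total = sum(dsp_max_threads_by_domain.values())
    return total <= receiver_threads, total


def suggest_even_split(receiver_threads, domain_names):
    """Hypothetical helper: divide ReceiverThreads evenly across the domains."""
    share = receiver_threads // len(domain_names)
    return {name: share for name in domain_names}


# The situation from this post: five domains seeded with 100 each, 40 ReceiverThreads.
seeded = {f"domain{i}": 100 for i in range(1, 6)}
print(check_thread_budget(40, seeded))   # (False, 500)

# One possible fix: keep 40 ReceiverThreads and give each domain 8 threads.
tuned = suggest_even_split(40, list(seeded))
print(check_thread_budget(40, tuned))    # (True, 40)
```

In practice the split would follow each domain's load rather than being even; the point is only that the total has to fit.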

Next question: what are dispatcher threads, and what does the engine need them for?

"ReceiverThreads specifies the maximum number of MDBs that can process BPEL requests asynchronously across all domains. Each domain can allocate a subset of these threads using the dspMaxThreads property; however, the sum of dspMaxThreads across all domains must not exceed the ReceiverThreads value.

When a domain decides that it needs another thread to execute an activity asynchronously, it will send a JMS message to a queue; this message then gets picked up by a WorkerBean MDB, which will end up requesting the dispatcher for some work to execute. If the number of WorkerBean MDBs currently processing activities for the domain is sufficient, the dispatcher module may decide not to request another MDB. The decision to request an MDB is based on the current number of active MDBs, the current number pending (that is, where a JMS message has been sent but an MDB has not picked up the message), and the value of dspMaxThreads.

Setting both ReceiverThreads and dspMaxThreads to an appropriate value is important for maximizing throughput and minimizing thread context switching. If there are more dspMaxThreads specified than ReceiverThreads, the dispatcher modules for all the domains will think there are more resources they can request than actually exist. In this case, the number of JMS messages in the queue will continue to grow as long as request load is high, thereby consuming memory and CPU. If the value of dspMaxThreads is not sufficient for a domain's request load, throughput will be capped.

Another important factor to consider is the value for ReceiverThreads: more threads does not always correlate with higher throughput. The higher the number of threads, the more context switching the JVM must perform. For each installation, the optimal value for ReceiverThreads needs to be found based on careful analysis of the rate of Eden garbage collections and CPU utilization. For most installations, a starting value of 40 should be used; the value can be adjusted up or down accordingly. Values greater than 100 are rarely suitable for small to medium sized boxes and will most likely lead to high CPU utilization just for JVM thread context switching alone."
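
To make the description above a bit more tangible, here is a deliberately simplified toy model of a single burst of load. It is not the engine's real dispatcher code: the rule in `should_request_worker` (request another MDB only while active plus pending is below dspMaxThreads) is my assumption based on the three inputs the text names, and all function and domain names are hypothetical.

```python
def should_request_worker(active, pending, dsp_max_threads):
    # Assumed rule: a domain asks for another MDB only while the MDBs it
    # already has (active) or has asked for (pending) are below its cap.
    return active + pending < dsp_max_threads


def pending_after_burst(receiver_threads, dsp_max_by_domain, work_items_by_domain):
    """Model one burst: every domain sends JMS messages until its rule says stop.

    Returns (picked_up, left_in_queue): how many messages the available MDBs
    can take, and how many stay in the queue waiting for a free MDB.
    """
    requested = 0
    for name, cap in dsp_max_by_domain.items():
        active, pending = 0, 0
        for _ in range(work_items_by_domain.get(name, 0)):
            if not should_request_worker(active, pending, cap):
                break
            pending += 1  # one more JMS message sent to the queue
        requested += pending
    picked_up = min(requested, receiver_threads)
    return picked_up, requested - picked_up


names = [f"domain{i}" for i in range(1, 6)]
load = {name: 200 for name in names}

# Seeded defaults: 5 x 100 dspMaxThreads against 40 ReceiverThreads.
print(pending_after_burst(40, {n: 100 for n in names}, load))  # (40, 460)

# Tuned: 5 x 8 dspMaxThreads against 40 ReceiverThreads.
print(pending_after_burst(40, {n: 8 for n in names}, load))    # (40, 0)
```

In both runs only 40 messages are actually worked on; the difference is that the untuned setup parks 460 extra messages in the queue, which is the wasted memory and CPU the quoted text warns about.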

With all the above in place, and a tuned dehydration store, we got them back on track: even under high load, all messages were picked up and ended up as instances. To recap:
  1. Make sure the sum of dspMaxThreads of all domains does not exceed your ReceiverThreads setting, and that both are set appropriately.

  2. If you have external adapters in use that connect, e.g., to AQ, make sure both AQ and the adapter are tuned. This is where you are most likely to get timeouts, which would also contribute to recoverable messages.
