
HP-UX 11i Knowledge-on-Demand:

performance optimization best-practices from our labs to you


Developer series

Solving Java performance problems - Webcast topic transcript

Welcome to the Solving Java Performance Problems presentation, covering Performance Optimization techniques for
multi-tier J2EE environments.
[NEXT SLIDE]
This is Patrick Sheehy from the HQ Presales Global War Room team. Today I will be presenting slides originally
created and presented by Kishan Thomas.
Kishan Thomas is a software engineer in Java Compiler and Tools Lab at HP. He has been working on Java related
technologies for the last five years.
He has worked on Java APIs, JPI and Java 3D on HP-UX 11i, with a current focus on Java and J2EE performance and
benchmarking. Earlier in his career he also worked on the first Mozilla port for HP-UX.
[NEXT SLIDE]
I'd like to take a moment to level set and set the stage for today's agenda. We are here to discuss Java performance
on HP Integrity servers with dual-core Montecito processors running HP-UX 11i, with regard to Java 2 Enterprise
Edition, or J2EE, multi-tier deployments.
[NEXT SLIDE]
Since we are looking at a multi-tier J2EE environment, we will be touching on many different topics. The overall
performance of a J2EE deployment depends on tuning and understanding each element of the whole multi-tier stack,
from the hardware and HP-UX, to the appserver and database products, as well as the network and storage systems.
Each individual component plays a role in the overall performance of the J2EE system, which is ultimately measured as
transaction throughput within predefined response time limits, sometimes referred to as service level agreements.
[NEXT SLIDE]

When we are discussing a complex J2EE application architecture, it is difficult to separate out Java performance
alone without looking at the other tiers of the J2EE system. A well-tuned Java appserver on the middle tier is
essential, but it requires a well-tuned database tier, as well as the other supporting systems, in order to achieve
optimal performance.
[NEXT SLIDE]
Let's start by looking at the broader J2EE deployment performance issues.
[NEXT SLIDE]
Here we see a typical J2EE configuration.
This diagram has been generalized to match the majority of the J2EE application environments we see deployed
in the industry. Obviously, there will be variations from this basic structure from customer to customer and
deployment to deployment, with much more complex interactions and even more sub systems involved. The
general tuning approach we discuss in this presentation can be applied to these different variations to some
extent with this generalization covering the key elements of tuning for the majority of deployments we have come
across.
This diagram represents a 3-tier J2EE architecture, including a fourth external subsystem that we need to interact
with. We have a mid-tier comprising the Java application server, a back-end database tier, and a client tier
which supplies the transaction requests from the users. There is also an external web application which
provides transaction support.
In this example, it is the mid-tier and database tier that are the performance critical servers we need to examine
and tune.
[NEXT SLIDE]
Here we see the same multi-tier architecture in terms of actual servers, storage and network layout.
In this configuration we have multiple HP Integrity rx6600 entry-level Montecito servers, and a larger single
back-end database server such as an rx7620 or rx8640, connected to a storage disk array such as an HP MSA, EVA or XP
storage system.
The client requests are managed by a load balancer system which distributes them across the multiple
appserver systems.

All of the mid-tiers are connected to the same database system in the back-end tier.
The network uses Gigabit Ethernet, and the database system connects to the storage array using Fibre Channel.
[NEXT SLIDE]
Let's begin with the configuration details of our example. This example configuration is based on an actual multi-tier internal tuning effort HP has performed, but simplified somewhat for presentation purposes. Next we will
cover a quick overview of the tuning approach.
First, we'll take a look at the specification of the appserver system and workload.
The application server was deployed in the mid-tier on an entry-class HP Integrity rx6600 server with 4 CPUs (8
cores). This server uses dual-core Montecito processors running at 1.6 GHz with 24 MB cache.
The workload we used was a distributed J2EE transaction workload deployed on a BEA Weblogic Server and
Oracle database.
We initially used a single appserver instance running on all 8 CPU cores on the mid-tier, but after the tuning
exercise we ended up with an optimized configuration of multiple appservers running on multiple HP-UX Processor
Sets to achieve the best results.
Keep in mind, this is just an overview; in the later slides I will explain this approach in more detail.
[NEXT SLIDE]
One of the main performance improvements was achieved by moving from a single application server instance to
multiple application server instances on a single physical server.
A single appserver means there will be a single Java Virtual Machine (or JVM) process running a single
appserver instance, compared to utilizing multiple JVM processes each running its own appserver instance. The
multiple appserver instance approach allows us to avoid some of the limits we would otherwise encounter with
regard to networking and overall JVM performance.
A single-instance appserver listens on a single network interface, and with a high volume of network-based
transaction requests it easily reaches network capacity limits. In contrast, a multiple-instance deployment allows
each JVM instance to listen or bind to a port on a separate network interface. In this manner, the same amount
of network traffic gets distributed across multiple interfaces rather than a single interface.
Additionally, multiple JVM instances can achieve better overall Java performance by distributing
overhead functions like garbage collection and thread management. With a single JVM instance you require a
large Java memory heap, which increases the cost of garbage collection, the recovery of unused
memory space for the heap. Additionally, a single JVM instance is required to manage a very large number of
threads, creating additional overhead unrelated to the J2EE application itself. This overhead is reduced by
distributing the total number of threads across multiple JVM instances.
[NEXT SLIDE]
To properly configure and optimize multiple appserver instances for performance, we use some features of the
appserver along with support from some key features of HP-UX, as well as some database configuration options.
Let's look at these features in more detail.
[NEXT SLIDE]

When you have multiple appserver instances, running them within processor-level resource partitions provides CPU
resource isolation and better performance within a single OS instance. Processor sets were chosen for their
dynamic features and their low administration overhead. Other HP-UX partitioning options, like virtual machines,
vPars and nPars, can also be used where applicable, if more flexibility and control at the OS and hardware level
are required.
Processor sets can be dynamically created and configured with root user access, using the HP-UX command
psrset. If you are on an 8-CPU-core server and you want to run 4 appserver instances, for example, you can
create four processor sets, each with two CPU cores. When you launch the appserver instances you will use the
psrset command to start the appserver process on each of the processor sets, or assign the PID of each JVM to the
desired processor set after the launch.
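The steps above can be sketched as follows. This is a sketch only: the core numbers, pset IDs, script path and PID are illustrative, and the flags shown follow psrset(1M) on recent HP-UX 11i releases, so verify them against your own system before use.

```shell
# Create four two-core processor sets on an 8-core server (run as root).
# psrset -c prints the ID of each new pset; we assume IDs 1-4 below.
psrset -c 0 1      # pset for appserver instance 1
psrset -c 2 3      # pset for appserver instance 2
psrset -c 4 5      # pset for appserver instance 3
psrset -c 6 7      # pset for appserver instance 4

# Start each appserver inside its pset (script path is hypothetical)...
psrset -e 1 /opt/appserver/bin/startInstance1.sh

# ...or bind an already-running JVM to a pset by PID after launch.
psrset -b 2 <jvm_pid>
```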

[NEXT SLIDE]
With multiple appserver instances it is necessary to distribute the client transaction requests equally across these
appserver instances to achieve maximum combined throughput.
Two options present themselves to handle this task: use of dedicated load-balancing hardware, or use of a
special round-robin DNS configuration. In both cases all the clients will connect to a single external appserver
address, but the load balancer or DNS will route them to all of the appserver instances, which are bound to
separate internal addresses.
Usually there is a single database backend instance to which all appserver instances connect, however each
appserver instance will have a separate JDBC connection pool to the database.
[NEXT SLIDE]
Even though we use multiple appserver instances, each individual instance runs the same business application as
all of the other instances. All of the major J2EE appserver products provide flexible configurations to enable the
administrator to host the same application across multiple appserver instances, through the use of features like
application domains.
We are able to configure each instance to bind to a separate network interface and address by either modifying
the configuration xml files or using the appserver administration console.
As stated earlier, in our example, you must start the appserver Java processes in separate processor sets. This
can be accomplished by modifying the appserver startup scripts to add the psrset command.
[NEXT SLIDE]
Let us look at what we can do at the database tier to optimize network traffic distribution. Remember, we are
using multiple appserver instances to talk to the same database.
Major RDBMS products like Oracle have a listener process which binds to a network address for incoming
database connection requests. In our example, we configure the database listener to bind to multiple network
interfaces on the database system.
For each appserver instance we use a unique database listener address when configuring the JDBC connection
pool to use separate network paths to distribute the appserver traffic to the database server.
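As an illustration with Oracle, the listener can be bound to multiple addresses in listener.ora, and each appserver's JDBC pool pointed at a different one. The host addresses, port and layout here are hypothetical; check the Net Services documentation for your Oracle release for the exact syntax.

```
LISTENER =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.20)(PORT = 1521))
        (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.2.20)(PORT = 1521)))))
```

Appserver instance 1's JDBC connection pool would then be configured against the 192.168.1.20 address, instance 2 against 192.168.2.20, and so on, giving each instance its own network path to the database.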
[NEXT SLIDE]

Here we can see the processor/JVM relationship for both a single JVM instance configuration and a multiple JVM
instance deployment on the same dual-core Montecito based system, such as the Integrity rx6600 server.
The first diagram on the left shows a single appserver running across all four processors, then two instances on
two processor sets of two processors each, and then an instance on each single-processor processor set.
These various configurations demonstrate the flexibility of processor sets and appserver instances on a multiprocessor server.
[NEXT SLIDE]
While multiple instance implementations offer greater throughput, they do present some common issues or
problems that must be dealt with.
Compared to a single appserver instance on a single server, an approach using multiple instances on multiple
servers adds additional complexity. This complexity is mainly in terms of network connectivity and load balancing
between the multiple systems with multiple network interfaces.
We have multiple transaction requests from multiple sources communicating with multiple appserver instances,
and all of them communicating with a single back-end database as well as additional external transaction
support servers. Unless we plan well, we can end up with a badly load-balanced configuration, with some of the
appserver instances getting overloaded, which in turn affects the overall transaction throughput and transaction
response times.
On the system level this will be indicated by unbalanced CPU utilization on each of the processor sets in which
the appserver instances are running. On a well balanced system we should see an equal amount of CPU
utilization across all processor sets.
[NEXT SLIDE]
There are a number of extra steps we can take to ensure proper balancing of network traffic for each appserver
instance. These additional steps should lead to a properly balanced CPU load on each processor set.
Within the Java Compilers and Tools lab, the team used UNIX and HP-UX tools like sar and glance to monitor the
network and CPU activity on the systems to identify these issues. They then went through extra optimizations to
address the resulting issues.
The required extra steps are summarized here, and I will go through each of them in detail in the next slides, also
covering an example of a poorly balanced network configuration.
Please note that while using tools like Glance is very useful during troubleshooting, you must be aware of their
overhead and disable them if needed after profiling is complete. The midaemon process started for the Glance
data collection is one such component which can add overhead to the performance of your overall system.
[NEXT SLIDE]
This diagram demonstrates a poorly balanced configuration similar to our internal tests prior to tuning.
You can see we have an 8-core server like the rx6600 with four separate psets, each running an appserver instance
bound to a different network interface. Before we applied the configuration optimizations, this is how the
traffic was flowing on the system.
Client requests coming in to the server were distributed across appserver instances through separate network
interfaces, which is good. However, notice that the outgoing traffic is not getting distributed at all. We can see
that one of the network cards is overloaded with traffic, and the processor set that services that network interface
will have to expend additional CPU cycles managing the interface, which leads to an unbalanced distribution of
CPU load across processor sets.
In this configuration all network interfaces are on the same network subnet, which is the normal case for most
standard system configurations.
[NEXT SLIDE]
Let's take a look at how the team found out about the network traffic issue and CPU load issue arising from the
poorly optimized configuration in the last slide.
This is the output from the sar tool showing the CPU and network activity on the appserver system with four
processor sets.
We can clearly see the unbalanced network traffic with lan0 having high traffic and lan2 having no traffic.
This lack of network balance is reflected in the CPU load as well. Note the CPU times are reported separately for
user, system and interrupt handling. You can see the imbalance in interrupt-handling load and idle
cycles between processor sets, and this imbalance leads to our first optimization.
[NEXT SLIDE]
In implementations where we have multiple network interfaces, a single CPU core is selected by HP-UX for
handling the interrupts for each card. Interrupts required for that interface will take CPU cycles from that CPU as
load increases on that interface.
When you create multiple psets for multiple appservers, you can end up with a combination of CPU cores
handling interrupts for different cards, which can lead to an unbalanced CPU load distribution.
A worst-case scenario would be as follows: all the cores in a specific pset are handling
interrupts for network interfaces that the appserver running on that pset is not even listening on, while
another pset has none of its cores doing any network interrupt handling at all. Obviously, in a vpar, npar
or virtual machine configuration this would not be an issue, since each appserver would have a dedicated OS
instance; in a resource partition implementation like psets, however, it is a risk.
There is a solution for this issue. After creating the pset, choose a processor core in that pset to handle the
network traffic for the interface the appserver running on that pset will be listening to. This defines
an affinity between the appserver and the network interface it is responsible for managing.
HP-UX provides the command intctl, which can be used to change the default interrupt-handling CPUs assigned
by the operating system to the assignments that you specify.
After we use intctl and choose the correct interrupt handling CPUs, we have each processor set handling
interrupts for the correct network interface only.
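A sketch of the idea follows; the hardware path and CPU number are hypothetical, and the exact intctl options vary by HP-UX release, so treat the flags below as illustrative and consult intctl(1M):

```shell
# Show current interrupt-to-CPU assignments for all I/O cards.
intctl

# Migrate the interrupt for the NIC at hardware path 0/2/1/0 to CPU 2,
# a core inside the pset whose appserver listens on that interface.
intctl -M -H 0/2/1/0 -c 2
```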
[NEXT SLIDE]
Our next optimization is related to network traffic distribution.
In the earlier slides we looked at the unbalanced network configuration and the sar network data output. We
saw that while the incoming traffic to the appserver system was distributed across application servers, the
outgoing traffic was all going through a single interface, without proper distribution across the available
interfaces.

This is the expected behavior if we use multiple interfaces all on the same network subnet. The easy solution is to
use separate subnets for each interface. When we configure separate subnets, traffic coming in through a
particular subnet address can only go back through the same subnet interface.
In our example, each of the appserver instances will bind to a separate subnet address. Similarly, we should
have the same subnets available on the other tiers like database and client request tiers.
We can use a physical IP address on the appserver tier with separate physical network interfaces and also virtual
IP addresses for the database listener where we define virtual interfaces on the same physical interface, each on
different subnet.
The decision to utilize physical or virtual IP addressing depends on the expected traffic and the number of
physical network interfaces available on the system. Either way, the network layout must be carefully planned to
optimize system performance.
[NEXT SLIDE]
While the use of separate subnets is highly recommended, there are cases where this is not possible. If, for
example, due to some network configuration constraints you are forced to use the same subnet, you still have
some options within HP-UX that will allow you to achieve the same effect.
The first option is to utilize an ndd parameter, set using the ndd command, which forces network traffic
to go back out on the same interface it came in on, even on the same subnet.
Another option is adding local routes to force traffic routing through separate interfaces.
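Both options might look roughly like this; the ndd parameter name and route syntax are stated as assumptions to be verified against ndd(1M) and route(1M) on your HP-UX release:

```shell
# Option 1: enable the strong end-system model so replies leave on the
# interface the request arrived on, even when interfaces share a subnet.
ndd -set /dev/ip ip_strong_es_model 1

# Option 2: add a host route forcing traffic to a given peer out
# through a specific local interface address (addresses illustrative).
route add 192.168.1.50 192.168.1.10
```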
[NEXT SLIDE]
Let's go back to the bad network configuration we started with and then compare it to the view after applying these
optimizations. Note that inbound traffic from the Transaction Request layer in the upper right is load balanced
across multiple network interfaces, but outbound requests, in blue and green, are all routed through NIC1.
[NEXT SLIDE]
Once we applied the network subnet and interrupt-handling distribution tuning, we have this view of the
implementation, with perfectly balanced traffic across all network interfaces.
[NEXT SLIDE]
Now, if we take a look at the output from sar, we can see equal network traffic and equal CPU load across all
processor sets. We now have a well-balanced multi-instance configuration.
[NEXT SLIDE]
When deploying a J2EE environment with multiple instances it is important to select the best number of instances.
When the Java Compilers and Tools team did a comparison with a typical J2EE load they found one appserver
instance per two processor cores gave them the best throughput.
Please note that the comparison graphs on this slide have been normalized to an expected best performance of
100%.
You can see that 4 appservers on an 8-core rx6600 provides the closest to best performance of the four
configurations.

To show the performance advantage of the dual core Montecito used on rx6600, we have the same comparison
on the right on an older Madison based 8 core rx7620.
Notice that the rx7620 also shows the same performance pattern, with best performance at one appserver
instance per two cores, but with lower overall performance than the rx6600.
The rx6600 has 8 cores on a single system board, or cell, thanks to its dual-core Montecito Itanium processors,
minimizing memory latency for all 8 cores. In contrast, the rx7620 has 8 single-core Madison Itanium
processors, four on each of two cell boards, adding an additional latency tier to the performance equation.
Having multiple cells creates additional system performance overhead in terms of memory access.
[NEXT SLIDE]
The main performance penalty on multiple-cell configurations is related to the memory latency for interleaved cell
memory, or memory accesses to remote LDOMs, or locality domains, in Cell Local configurations.
Even in a Cell Local configuration, part of the memory on each cell is local to the cell and part is interleaved
across the cells. The interleave is utilized by some components of the operating system outside of your
configuration control. There is higher latency and cost to use this cross-cell interleaved memory. With the dual-core
Montecito processors, the rx6600 provides 8-core performance without these cross-cell latency issues.
[NEXT SLIDE]
Let's take a look at some details specific to Java performance on HP-UX which are essential for high-performing
J2EE applications deployed on HP-UX Integrity servers.
When using Java on HP-UX, you are using a Java virtual machine and Java runtime engine which are already
highly optimized for best performance on Itanium-based Integrity servers; let's look at those in more detail.
[NEXT SLIDE]
HP provides a high performance Java implementation on all HP-UX systems specifically tuned to take advantage
of our Integrity servers running on Itanium processors.
On this slide you can see a list of some of the specific technology optimizations we make to provide the best
performance in the HP-UX JVM, specifically for the Itanium architecture.
These include: optimized code for the Itanium architecture (taking advantage of the instruction set, profile-based
optimization, speculation, predication and instruction-level parallelism), improved, higher-speed compile times,
and the new dynamic code generator.
[NEXT SLIDE]
One of the key elements of JVM performance is the garbage collection performance and memory management
aspects of the JVM itself.
HP has already highly tuned the garbage collection specifically for the Itanium based Integrity servers.
We use well-tuned allocation buffers, pre-fetching, optimized memory routines, and cache aware algorithms.
[NEXT SLIDE]
Additionally, the JVM development team works closely with the HP-UX and hardware teams to do end-to-end
profiling and analysis designed to improve performance, which has resulted in major improvements in various
areas of JVM design as well as overall operating system performance.


[NEXT SLIDE]
On this slide, you can see the JVM performance improvement across different releases of HP-UX.
Two sets of results display performance on two different industry-standard Java benchmarks: one for Java
throughput performance and one for a J2EE distributed load.
We expect similar performance improvements to continue with the new JVM 1.6 and other future releases.
[NEXT SLIDE]
In addition to the high performance JVM, HP also provides a set of free tools for performance characterization
and tuning of Java applications on HP-UX.
HPjmeter allows deep analysis and profiling of Java applications, HPjtune is utilized for garbage collection profiling
and tuning, and HPjconfig allows you to tune system parameters for Java.
These powerful tools are available free of charge from the HP website for all users.
[NEXT SLIDE]
In our next section let's take a look at HP-UX specific performance topics as they apply to multi-tier J2EE
workloads.
[NEXT SLIDE]
Here is a list of key HP-UX features utilized by the Java Compilers and Tools team for distributed J2EE workloads.
We already saw how the HP-UX VSE technology known as processor sets helped us in the multiple appserver
instance scenario.
We also looked at the network load balancing options on HP-UX.
Some of the additional features we use are large page support for the JVM process itself. We can run the
JVM in either 32-bit mode or 64-bit mode, depending on the memory requirements and workload needs of the
J2EE implementation.
HP-UX 11i v3 now provides the hyper-threading implementation for performance improvements as well.
[NEXT SLIDE]
There are some considerations with regard to selecting the physical memory. Selecting the correct density and
layout can provide better application performance on HP-UX through improved memory performance.
Select dual-rank memory DIMMs over single-rank, and select higher-density memory over lower-density.
The memory DIMMs go into a memory extender, and the layout of DIMMs on the extender also affects
performance. This layout is documented for each server system in the existing product documentation. For
example, the rx6600, which uses the zx2 chipset, has two memory controllers, and you should load the DIMMs on
both controllers equally for best performance.
[NEXT SLIDE]

Here is a graphic where you can see the differences between different types of memory DIMMs.
The differences range from single-sided to double-sided, and from single-rank to dual-rank.
[NEXT SLIDE]
Another key performance area for a J2EE workload is disk I/O. We need to look at the I/O on both the mid-tier and the database tier.
On the database tier, the database log writing should have high write performance, while the DB data I/O needs
a mix of read/write performance.
On the mid-tier we need read/write performance for JMS message persistence and any transaction logging on
the appserver, to ensure distributed transaction recoverability.
[NEXT SLIDE]
On HP-UX we can use the sar or Glance tools to collect I/O performance profiles.
In the sar output, look for average transfer rates and I/O queues. Make sure all kernel tunable options that
affect I/O are set; you can use the kctune command for this.
Another approach, if you are evaluating a new disk, is to compare against the I/O profile from a well-performing disk.
One example is the I/O performance evaluation between U320 SCSI disks and the newer SAS disks.
Beyond the J2EE workload itself, you can also use an I/O-specific benchmark like Diskbench for more specific I/O
performance evaluation.
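For example, a profiling session might look like the following; the tunable name shown is purely illustrative, and names and defaults vary by HP-UX release, so check kctune(1M) and your release's tunable documentation before changing anything:

```shell
# Sample disk activity ten times at five-second intervals; watch the
# avque, avwait and avserv columns for saturated devices.
sar -d 5 10

# List current kernel tunables and inspect the I/O-related ones.
kctune | more

# Example of setting a tunable with kctune (name shown for illustration).
kctune fs_async=1
```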
[NEXT SLIDE]
On the database tier, for I/O performance and capacity we usually use a high performance external disk array
instead of the local disks.
Follow standard best practices when configuring the disk array. Completely populate the array with fast disks,
and use the fastest Fibre Channel interface. Completely populating the array gives you maximum parallel I/O
performance.
On the database, separating the log and data I/O onto separate disk arrays can provide additional isolation and
better performance.
Configure the disk array cache according to the I/O requirements of your specific J2EE workload. This will depend
on the type of disk array solution you have chosen.
Some of them let you specifically set the cache options, while others automatically select the optimum caching
based on the I/O profile.
HP provides different classes of storage for the Integrity servers, such as the MSA, EVA and XP. Choose the right
storage solution depending on your workload's needs.
[NEXT SLIDE]
As examples, here are a few guidelines for selecting the PCI slots for the network and storage fiber connection
cards.
The different PCI slots run at different frequencies; please choose the high-speed slots for the most demanding
I/O components.
Give preference to the high-traffic network interfaces and storage connections.
There is a specific I/O slot called core I/O which runs at a lower speed; this should be avoided for high-traffic
interfaces.
Please refer to the system-specific user's guide to identify the PCI slots and their speeds.
[NEXT SLIDE]
When using Weblogic on HP-UX make sure the Weblogic performance pack is enabled for your application.
Size the different resource pools on the appserver specifically for the workload.
Some of the key appserver resources to tune are the execute queue counts and the JDBC connection pool.
Use the recommended JDBC drivers for the database backend.
Make use of the most current Weblogic tuning guides and recommendations.
[NEXT SLIDE]
Similar to the appserver, please follow the performance recommendations for the database on HP-UX.
Some of the key recommendations are the special scheduling priority (SCHED_NOAGE) and appropriate kernel
parameters (shared memory segment size) as well as updated patches.
If you are running on a large database system with many cores, you can isolate the log writer process to its own
processor set.
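As a sketch of that isolation step: the pset ID, core numbers and PID below are illustrative, and we assume an Oracle log writer process, which is typically named ora_lgwr_<SID>:

```shell
# Create a dedicated pset for the log writer on two spare cores (as root).
psrset -c 14 15          # assume this prints new pset ID 5

# Find the Oracle log writer's PID, then bind it to the new pset.
ps -ef | grep ora_lgwr
psrset -b 5 <lgwr_pid>
```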
[NEXT SLIDE]
In our final section let's take a look at some of the performance advantages of using the dual-core Montecito
based Integrity servers from HP.
[NEXT SLIDE]
By moving from the single core Madison based systems to the new dual-core Montecito systems you can double
your performance for J2EE workloads.
In addition to the improved performance, Montecito provides high system availability, reliability and enhanced
scalability and capacity.
[NEXT SLIDE]
Here you can see the comparison between the Madison single core and Montecito dual core processors.
By going dual-core you get better CPU density on the servers, as well as higher performance and throughput
with the bigger on-chip caches, faster front side bus and improved latency profile.
[NEXT SLIDE]
Here is a specific feature comparison of a 4-core entry-level Montecito server, the rx3600, to a similar system using
a Madison CPU.

Hyperthreading capabilities and a faster, improved chipset are both benefits.

[NEXT SLIDE]
Here are the same feature comparisons for the 8-core entry-level Montecito server rx6600 against a similar system
using a Madison CPU. There are even more benefits here, with the single-cell configuration and large cache.
[NEXT SLIDE]
Here we compare the two entry-level Montecito servers. Both of these systems can be used for high-performance
J2EE mid-tier workloads.
[NEXT SLIDE]
Here is a summary view of all the entry-level Montecito servers and their feature comparison.
[NEXT SLIDE]
The new Montecito based servers also come with the high performance zx2 chipset for 4 socket entry level
systems.
The zx2 chipset provides high memory bandwidth with very low latency and large memory capacity, as well as
support for PCI-X and PCI-Express.
[NEXT SLIDE]
For cell-board based systems we have the new sx2000 chipset, which provides similar performance benefits
for the mid-range and high-end systems.
[NEXT SLIDE]
The new Montecito servers, running the new HP-UX 11i v3, support Hyperthreading, which can provide better
performance by maximizing each core's execution resources.
[NEXT SLIDE]
In closing, let's summarize what we have talked about today:

We took a look at a multi-tier configuration

We walked through the unique tuning opportunities for each configuration and we examined tunes
across all tiers.

We examined the appserver and DB configuration flexibility and how to choose the right combinations.

Finally, we took a brief look at the HW and OS capabilities to maximize performance with current
technologies.
Thank you for taking the time today to join us in looking at J2EE multi-tier application performance on HP-UX.
We hope your time has been well spent and that we have been able to share some useful information with you.
Good luck and good selling!

For more information:


www.hp.com/go/knowledgeondemand

2007 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without
notice. The only warranties for HP products and services are set forth in the express warranty statements
accompanying such products and services. Nothing herein should be construed as constituting an additional
warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.
