
TIBCO Enterprise Message Service

Fault Tolerance Architecture, Tuning and Design


William P. McLane, Product Manager
Robert Kutter, Director of Engineering

This document (including, without limitation, any product roadmap or statement of direction data) illustrates the planned testing, release and availability dates for TIBCO products and services. This document is provided for informational purposes only and its contents are subject to change without notice. TIBCO makes no warranties, express or implied, in or relating to this document or any information in it, including, without limitation, that this document, or any information in it, is error-free or meets any conditions of merchantability or fitness for a particular purpose. This document may not be reproduced or transmitted in any form or by any means without our prior written permission.

Session Agenda
How to affect message delivery guarantees
TIBCO Enterprise Message Service fault tolerant/high availability configuration
  Internal configuration and setup
  External system requirements
    What options are available for meeting the external requirements

2006 TIBCO Software Inc. All Rights Reserved. Confidential and Proprietary.


Where and how data persistence is guaranteed and affected
Data persistence must be protected at every point where data is handed off from one interface to another. With JMS messaging, three hand-offs must be addressed:
  Data persistence between the publishing client and the JMS server
  Data persistence between the JMS server and the storage device (disk)
  Data persistence between the JMS server and the consuming client

TIBCO Enterprise Message Service provides options that affect each of these hand-offs.
Let's start at the beginning.


Persistence guarantees between the publishing client and the EMS Server (1)
The JMS specification defines two delivery modes for sending data from the publishing client to the JMS server.
PERSISTENT Delivery Mode: The PERSISTENT mode instructs the JMS provider to take extra care to ensure the message is not lost in transit due to a JMS provider failure. "In transit" means between the publishing client and the JMS server: once the JMS server has successfully received the message, responsibility for guaranteed delivery has been handed off to the JMS server. Applications sending messages with the PERSISTENT delivery mode are blocked until the EMS server successfully receives the message and writes it to stable storage (disk). The EMS server then sends an EMS acknowledgement to the publishing application, allowing it to resume sending.


Persistence guarantees between the publishing client and the EMS Server (2)
NON_PERSISTENT Delivery Mode: The NON_PERSISTENT mode is the lowest-overhead delivery mode because it does not require that the message be logged to stable storage. A JMS provider failure can cause a NON_PERSISTENT message to be lost. Whether a NON_PERSISTENT send blocks depends on the EMS server's authorization setting. If authorization is enabled, the application blocks until the EMS server acknowledges the message, which confirms that the client is authorized. If authorization is disabled, control returns to the application as soon as the message has been written into the TCP buffer. In that case the EMS server does not send an EMS acknowledgement, because the application is not expecting one for the sent message.


Persistence guarantees between the publishing client and the EMS Server (3)
TIBCO Enterprise Message Service provides an additional delivery mode: RELIABLE.
RELIABLE Delivery Mode: The TIBCO-defined RELIABLE delivery mode provides additional performance benefits beyond NON_PERSISTENT. It involves no system-level or EMS-level acknowledgement, so publishing applications are free to send data without restriction. Applications sending with RELIABLE never block on the send operation: once the message has been handed to the OS on the publishing application's machine for network delivery, the send call returns. The publishing application does not wait for a TCP acknowledgement or an EMS acknowledgement, and no EMS acknowledgement is ever sent for the message.
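
The blocking rules described for the three delivery modes can be summarised as a small decision function. This is only an illustrative Python sketch of those rules, not EMS client code; the function name `send_blocks` and the string constants are hypothetical.

```python
# Illustrative model of when a publisher's send call waits for an
# EMS acknowledgement. Names are hypothetical, not part of any EMS API.

PERSISTENT = "PERSISTENT"
NON_PERSISTENT = "NON_PERSISTENT"
RELIABLE = "RELIABLE"  # TIBCO-defined extension


def send_blocks(delivery_mode: str, authorization_enabled: bool) -> bool:
    """Return True if the sender waits for an EMS acknowledgement."""
    if delivery_mode == PERSISTENT:
        # Blocks until the server has received and stored the message.
        return True
    if delivery_mode == NON_PERSISTENT:
        # Blocks only when the server must confirm the client is authorized.
        return authorization_enabled
    if delivery_mode == RELIABLE:
        # Never waits for an acknowledgement.
        return False
    raise ValueError(f"unknown delivery mode: {delivery_mode}")
```

For example, `send_blocks(NON_PERSISTENT, authorization_enabled=False)` returns `False`, matching the case where control returns right after the TCP write.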


Persistence guarantees provided by the EMS Server


A message is written to safe storage under these circumstances:
The original message is sent with the delivery mode set to PERSISTENT, and the destination is either a queue or a topic with at least one subscriber.

When writing the message to safe storage, EMS can be configured to write asynchronously (non-failsafe) or synchronously (failsafe).
Asynchronous (non-failsafe) writes provide higher performance but can lose data.
Once the message is handed off to the OS, EMS assumes it has been written to safe storage. The OS may defer the actual write until resources are available, so the write can be lost in a system failure.

Synchronous (failsafe) writes block until the message is actually written to disk, which removes this window for message loss.
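
The difference between the two write styles can be illustrated at the operating-system level. The sketch below is generic Python file I/O, not a description of how the EMS server is implemented; the function names `write_async` and `write_sync` are hypothetical.

```python
import os


def write_async(path: str, payload: bytes) -> None:
    """Hand the data to the OS and return; the OS decides when it reaches disk."""
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()  # moves data from the process buffer into the OS cache only


def write_sync(path: str, payload: bytes) -> None:
    """Return only after the OS reports the data has been flushed to the device."""
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # block until the OS has flushed this file's data
```

Both functions leave the same bytes in the file when nothing fails. They differ only in what is guaranteed at the moment the call returns, which is the trade-off between performance and safety described above.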


Persistence guarantees provided between the EMS server and the consuming client
The EMS server guarantees message delivery to any client for as long as the client is connected to the EMS server.
To provide guaranteed delivery to applications that are not connected, the application should establish durable subscriptions.
Durable subscriptions protect against long-term network outages.

Protection against failure of the EMS server itself comes from running the EMS server in fault tolerant mode, not from the durability of the consumers. TCP/IP and EMS acknowledgements provide the consumer delivery guarantee.


Short conversation on acknowledgement modes (1)


JMS defines three types of message acknowledgement:
DUPS_OK_ACKNOWLEDGE, for consumers that are tolerant of duplicate messages. The client application does not expect the server to confirm the acknowledgement, so the server can miss the acknowledgement and resend the message.
AUTO_ACKNOWLEDGE, in which the session automatically acknowledges a client's receipt of a message. Once the client receives the message, the EMS library automatically acknowledges it before handing it off to the client application.
CLIENT_ACKNOWLEDGE, in which the client acknowledges the message by calling the message's acknowledge method. The client application is responsible for acknowledging receipt, which allows message processing to occur before the acknowledgement is sent.

TIBCO EMS provides an extension to the JMS acknowledgement model called NO_ACKNOWLEDGE.
In NO_ACKNOWLEDGE mode the EMS server deletes all information about the message delivery to the given client, because no acknowledgement is expected in return. This provides additional performance beyond AUTO_ACKNOWLEDGE mode, since less data is exchanged between the EMS server and the consumer.


Short conversation on acknowledgement modes (2)


No acknowledgement mode by itself can prevent duplicate message reception. There is always a window in which a message can be received twice, unless XA transactions are used. Always build applications so that they can detect and handle duplicate messages.
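
One common way to handle duplicates is to remember recently seen message IDs and skip repeats. The sketch below assumes every message carries a unique identifier (for example the JMSMessageID header) and keeps the IDs in memory only. The class `DuplicateFilter` and the function `handle` are hypothetical names, and a production application might need to keep the seen IDs in durable storage instead.

```python
from collections import OrderedDict


class DuplicateFilter:
    """Remembers the most recent message IDs and rejects repeats."""

    def __init__(self, capacity: int = 10_000) -> None:
        self._seen = OrderedDict()
        self._capacity = capacity

    def first_time(self, message_id: str) -> bool:
        """Return True the first time an ID is seen, False for a redelivery."""
        if message_id in self._seen:
            return False
        self._seen[message_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest remembered ID
        return True


def handle(message_id: str, body: str, dedup: DuplicateFilter, processed: list) -> None:
    """Process a message only if its ID has not been handled before."""
    if dedup.first_time(message_id):
        processed.append(body)
```

The bounded capacity is a design choice: it caps memory use, at the cost of no longer recognising a duplicate that arrives after its ID has been evicted.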


How to configure EMS fault tolerance and high availability within the EMS Server
TIBCO EMS provides a fault tolerant setup that uses a globally accessible shared state. Two EMS server machines (one primary, one secondary) must have access to this shared state at all times. To configure EMS fault tolerance, set up two server machines with the same server name, each with its ft_active parameter pointing at the other.
server = EmsServer && ft_active = tcp://EmsHostOne:7222
server = EmsServer && ft_active = tcp://EmsHostTwo:7222

Configure the two EMS server machines to point to the same globally accessible shared state.
store = /usr/global/EMS/datastore

Modify the ft_heartbeat interval if necessary.
ft_heartbeat sets the interval at which the primary server sends a heartbeat to the secondary server. (Default value is 3 seconds)

Modify the ft_activation interval if necessary.
ft_activation sets how long a secondary server waits for a heartbeat from the active server before trying to assume the role of primary. (Default value is 10 seconds)
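
The relationship between the two intervals can be sketched as a simple timing check. This is a hypothetical model of the secondary server's decision, not EMS source code. The function names are invented for illustration, and whether the real server compares with "greater than" or "greater than or equal" is an assumption.

```python
FT_HEARTBEAT_SECONDS = 3.0    # default ft_heartbeat stated above
FT_ACTIVATION_SECONDS = 10.0  # default ft_activation stated above


def should_attempt_activation(last_heartbeat_at: float, now: float,
                              ft_activation: float = FT_ACTIVATION_SECONDS) -> bool:
    """True once the secondary has heard nothing for longer than ft_activation."""
    return (now - last_heartbeat_at) > ft_activation


def heartbeats_per_activation_window(ft_heartbeat: float = FT_HEARTBEAT_SECONDS,
                                     ft_activation: float = FT_ACTIVATION_SECONDS) -> int:
    """Whole heartbeat intervals that fit inside one activation interval."""
    return int(ft_activation // ft_heartbeat)
```

With the default values, three full heartbeat intervals fit inside the activation interval, so the secondary tolerates several missed heartbeats before it tries to take over.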


How to configure EMS fault tolerance and high availability within the EMS clients
Client applications specify a comma-separated list of server URLs for fault tolerant failover.
tcp://EmsHostOne:7222, tcp://EmsHostTwo:7222

Fault tolerant URL support is provided by the connection constructor in C, .NET and Java, and by the connection factory in Java.
Administrators can also set up fault-tolerant-aware factories for JNDI lookup with tibemsadmin.
EMS supports only a dual primary/secondary failover model; however, clients can be configured to fail over across more than two comma-separated URLs. This:
  Provides for client failover where the EMS server has multiple bound listen addresses on the same machine (multi-homed host).
  Provides for client failover to a distinct set of fault tolerant servers where disaster recovery is warranted.
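
The client-side behaviour of a comma-separated URL list can be modelled as "try each URL in order until one connects". The EMS client libraries do this internally, so the sketch below is only a model of the idea. The `connect` callable and both function names are hypothetical and do not correspond to any EMS API.

```python
from typing import Callable, List


def parse_server_urls(url_list: str) -> List[str]:
    """Split 'tcp://a:7222, tcp://b:7222' into individual URLs."""
    return [u.strip() for u in url_list.split(",") if u.strip()]


def connect_with_failover(url_list: str, connect: Callable[[str], object]) -> object:
    """Try each URL in order and return the first successful connection."""
    last_error = None
    for url in parse_server_urls(url_list):
        try:
            return connect(url)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next URL
    raise ConnectionError(f"no server reachable in: {url_list}") from last_error
```

Because the list is just an ordered sequence of URLs, the same mechanism covers a primary/secondary pair, a multi-homed host, and an additional disaster recovery pair.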


EMS system requirements for shared state used by fault tolerance (1)
For the messaging system to be completely reliable, the system resources outside EMS that hold the shared state must themselves provide fault tolerant guarantees.

The shared state implementation must meet four requirements for EMS to fully guarantee fault tolerance:
  Write Order
  Synchronous Write Persistence
  Distributed File Locking
  Unique Write Ownership


EMS system requirements for shared state used by fault tolerance (2)
Write Order: The storage solution must write data blocks to shared storage in the same order as they occur in the data buffer.
When an application writes block A and then block B, the storage solution must write block A to disk before block B. Some storage solutions re-order data writes for efficiency and space utilization; it must be possible to turn this re-ordering off for EMS fault tolerance.

Synchronous Write Persistence: Upon return from a synchronous write call, the storage solution guarantees that all the data has been written to durable, persistent storage.
If a system failure occurs right after the write call returns successfully, the storage solution must guarantee that the data was written to long-term storage. Some storage solutions buffer data writes for efficiency; buffering opens an opportunity for data loss.


EMS system requirements for shared state used by fault tolerance (3)
Distributed File Locking: The EMS servers must be able to request and obtain an exclusive file lock on the shared storage. The storage solution must not assign a file lock to two servers simultaneously.
EMS uses this exclusive lock to determine whether the primary server is still online.
Some solutions allow multiple applications to lock the same file. This must be disallowed; exclusive locking is required.

Unique Write Ownership: The EMS server process that holds the file lock must be the only server process that can write to the file. Once the system transfers the lock to another server, pending writes queued by the previous owner must fail.
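
Exclusive locking can be illustrated on a single Unix host with an advisory `flock` lock. This is a local stand-in for the idea only. It is not how EMS is implemented, it says nothing about whether a given shared storage product honours such locks across machines, and it does not demonstrate Unique Write Ownership. The function names are hypothetical.

```python
import fcntl
import os


def try_exclusive_lock(path: str):
    """Open path and try to take an exclusive lock without waiting.

    Returns the open file descriptor on success, or None if another
    holder already has the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release_lock(fd: int) -> None:
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A second attempt to lock the same file fails while the first holder keeps its descriptor open, and succeeds once the lock is released. That mirrors how a secondary server can only obtain the lock after the primary has given it up.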


What options can satisfy the external storage solution requirements (1)
Hardware options for shared storage:
Dual Port SCSI Device
  Generally satisfies the Write Order and Synchronous Write Persistence criteria.
  Distributed File Locking and Unique Write Ownership must be met by software components.
SAN (Storage Area Network)
  Generally satisfies the Write Order and Synchronous Write Persistence criteria.
  Distributed File Locking and Unique Write Ownership must be met by software components.
NAS (Network Attached Storage)
  NAS solutions are extremely difficult to classify since each one is unique.
  NAS using NFS adds further complexity to this classification.
  All NAS solutions require the use of Cluster Server software.


What options can satisfy the external storage solution requirements (2)
Software options for shared storage management:
Cluster Server (CS)
  A cluster server monitors the EMS server processes and their host computers, and ensures that exactly one server process is running at all times. If the primary server fails, the CS restarts it; if it fails to restart the primary, it starts the backup server instead.
Clustered File System (CFS)
  A clustered file system lets the two EMS server processes run simultaneously. It even lets both servers mount the shared file system simultaneously. However, the CFS assigns the lock to only one server process at a time. The CFS also manages operating system caching of file data, so the backup server has an up-to-date view of the file system (instead of a stale cache).


What does TIBCO recommend?


Which solution performs best and provides the highest level of protection?

Storage Area Network (SAN) with Clustered File System (CFS) software to manage it


TIBCO recommendations when deploying a fault tolerant setup
Always validate with your hardware and software vendors which requirements they can fulfill.
We do not certify any given solution, since the requirements can be met by a number of hardware and software vendors.
We will support your solution as long as the requirements for shared storage are met.


Care to share?

What solutions for fault tolerance shared state are you using?
