
Mission:Messaging: Circular logs vs. linear logs
IBM WebSphere MQ provides two logging options. Circular logs provide transaction recovery and require no maintenance. Linear logs provide recovery from media failure but need to be managed.
Why the debate?

IBM WebSphere MQ logs enable recovery of persistent messages from various types of failure. When the system is running properly, the logging process is an overhead that reduces the peak messaging capacity of the system in return for increased reliability. Circular logging enables the queue manager to reconcile the status of any outstanding transactions on restart. Linear logging enables recovery from this and more drastic outages such as loss of the queue file.

If that was all there was to it, the obvious choice would always be to use linear logs. After all, making the messages persistent implies that there is some expectation that they are recoverable. But the increased reliability of linear logging does not come free: it is slower and requires regular maintenance. At the end of the day, the selection of which logging mode to use comes down to a compromise between reliability and cost. In this installment of Mission: Messaging you will learn how to evaluate the risks and costs of both so you can make the appropriate decision.

Comparing logging modes

Before proceeding, let's take a closer look at the different log modes. Protection against application, software, or power failure can be achieved with circular logs. Linear logs provide the same functionality, plus protection against media failure (a damaged queue file). Circular logging requires minimum human intervention because the queue manager automatically cycles through log extents, reusing them as needed. Linear logs are never reused and must be deleted or archived periodically.

Circular logs also provide faster throughput. The additional performance cost of linear logs is from creating and formatting new extents and, if the logs are saved rather than deleted, moving the log extent to long term storage. The following table compares the two options:

Table 1. Circular or linear logs?
Category: Recovery
  Circular logs: Circular logs are used to reconcile units of work that were outstanding at the time of failure. No provision to recover from damaged queue files.
  Linear logs: Linear logs contain a copy of all persistent messages that are queued. In a normal restart, linear logs perform the same function as circular logs: recovery of outstanding units of work. In addition, linear logs support recovery of data when queue files are damaged.

Category: Performance
  Circular logs: Circular logs are allocated once and then reused. Therefore no time is required to allocate and format new log extents or to delete or archive them.
  Linear logs: New linear logs must be allocated periodically, which degrades performance. In addition, the logs must be deleted or moved to prevent filling the underlying file system. Drive head contention during archive operations reduces performance.

Category: Overhead
  Circular logs: No administrative overhead is required during normal operations.
  Linear logs: Administrators must provide for management of the log files. In addition, the file system must be monitored to prevent the log files from consuming all available space. Human processes touch the administration, operations and support teams.

Category: Operational risk
  Circular logs: The loss of a queue file results in loss of all messages on that queue. Loss of a disk partition under the queue files results in loss of all messages on that queue manager.
  Linear logs: A normally running queue manager will eventually fill all available disk space if log files are not managed regularly. This will result in an outage of the queue manager if allowed to happen.

This high level comparison should give you an idea of which mode you would like to use, but in order to make a sound decision you will need to understand the costs and the probability of the risks that are involved. To help with that, the next sections will discuss what happens internally.

Queue file operations

There are two sets of files used by the queue manager to store message and transaction data. The queue files are the primary storage area for persistent messages and for non-persistent messages that overflow the in-memory buffer. The queue files are where persistent messages are hardened to survive planned or unplanned restart of the queue manager. There is usually one queue file for each queue. In some cases there are more because dynamic queues leave files behind that the queue manager will reuse.

Recoverability is assured by making sure the message data is on disk before allowing any subsequent operations that might jeopardize it. In general, WebSphere MQ flushes messages to disk prior to returning control to the calling program. Exceptions to this rule are made to optimize performance where transactionality is not affected. For example, when messages are written inside of a unit of work, the disk write can be cached safely up to the point in time where the transaction is committed. When the commit call is made, all pending writes are flushed to disk before the call completes. Similarly, a persistent message written outside of a transaction to a waiting getter might never be written to the queue file at all.
Because the queue files contain all recoverable messages, the size of the queue file depends on the number of messages queued up at any point in time. As new messages are queued, the queue file size tends to grow. As messages are removed, the queue file size tends to shrink. To optimize disk operations, resizing of queue files does not occur in real time. The shrinking of the queue file always lags behind the removal of messages. As a rule of thumb, the queue file for any queue is always at least as large as the amount of data stored in persistent messages on that queue which, depending on settings for MAXMSGL (the largest allowable message) and MAXDEPTH (the maximum number of messages that a queue can hold), can be quite large. It can become much larger.

Non-persistent messages are normally not written to the queue file but are held in memory. As messages begin to accumulate, eventually the in-memory buffers fill and non-persistent messages spill over to the queue files. During restart, non-persistent messages in the queue are discarded unless the queue's NPMCLASS attribute is set to HIGH. In that case, the queue manager will make an effort to restore any non-persistent messages found in the log file.

Circular logging

Circular logs are used to contain the messages while they are inside of a transaction. During restart of the queue manager, the log files are reconciled against the queue files to determine the disposition of these transactional messages. Any messages enqueued under syncpoint are automatically removed, as if a rollback had been issued. Similarly, any messages dequeued under syncpoint are returned to the queue, as if a rollback had occurred, and become available for subsequent processing. The exception to this is when the queue manager acts as an XA resource manager. In this case, messages are still replayed from the logs, but their final disposition is determined by the system acting as the transaction manager.

Circular logs are allocated once and then reused as needed. Because the total log allocation is finite, there is no danger of the logs growing to exceed the allotted file space. Of course, this assumes that the log partition is mounted to a dedicated file system of sufficient size to contain all of the log extents. If the underlying file system is too small or is shared with other queue managers or applications, then it is possible to consume all the file space, causing a queue manager outage.

The important thing to remember about circular logs is that the only messages guaranteed to be in them are those written inside a transaction. Without a copy of every persistent message, the circular logs cannot repair a damaged queue file.

Linear logging

Linear logging provides a superset of the functionality of circular logging. Queue manager restart operations using linear logs function the same as with circular logs: the log files are reconciled against the queue files to determine the disposition of transactional messages. In addition to the transactions under syncpoint, linear logs also contain a copy of all persistent messages. If one or more queue files are damaged, the queue can be recovered to the last known good state by replaying the linear logs. This is known as media recovery.

Unlike circular logs, which are reused, the number of linear logs increases without limit as messages move through the queue manager. The amount of log data produced in a daily processing cycle is proportional to the amount of data processed as persistent messages during that same period. Each gigabyte of persistent message data processed will generate slightly more than a gigabyte of log files.
The file system under the log partition must therefore be sized to hold all of the persistent messages that might pass through the queue manager in a typical processing day.

A newly created log extent is eligible to participate in both transaction recovery and queue recovery. Suppose an application writes a persistent message to the queue under syncpoint, but no application is there to consume it. The log extent first participates in the transaction during the enqueue operation. If the queue manager fails at this point, the message will be rolled back. Next the application issues a commit and the message becomes available on the queue. The log extent still contains the message and a record of the completed transaction. It can be used to recover the message, but it is no longer needed for transaction recovery. Eventually, the message is removed from the queue. At that point, the message is no longer recoverable. When all messages in a log extent are no longer recoverable, that extent becomes inactive and is eligible for archival or deletion.

Log file operations

Now let's see how all the pieces fit together. The transactions under syncpoint at any given time are tracked using a log head pointer and a log tail pointer. New put or get activity advances the head pointer while commit or rollback calls advance the tail pointer. The maximum distance between the head pointer and the tail pointer is calculated as the size of a single log file multiplied by the number of primary and secondary log extents. This value represents the maximum amount of data that can be held under syncpoint by the queue manager at one time.

In the case of a long-running transaction, it is possible to exhaust all of the primary and secondary log extents. Consider the case of an application that gets a message under syncpoint and then never calls commit. Eventually, all primary and secondary extents will be used and the long-running transaction will prevent the tail pointer from advancing.
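The head-to-tail limit described above is simple arithmetic, and a small sketch may make it concrete. It assumes log extents are sized in 4 KB pages (the LogFilePages setting). The function name and the example values are hypothetical and chosen only for illustration, and the result should be read as an upper bound rather than the exact usable space.

```python
# Sketch: upper bound on data that can be held under syncpoint at one time.
# Assumption: each log extent is LogFilePages pages of 4 KB each.
# The example values are illustrative, not recommendations.

PAGE_SIZE_BYTES = 4096


def max_syncpoint_bytes(log_file_pages: int, primary: int, secondary: int) -> int:
    """Size of one log extent multiplied by the total number of extents."""
    extent_bytes = log_file_pages * PAGE_SIZE_BYTES
    return extent_bytes * (primary + secondary)


if __name__ == "__main__":
    capacity = max_syncpoint_bytes(log_file_pages=4096, primary=3, secondary=2)
    print(f"Upper bound under syncpoint: {capacity / 2**20:.0f} MiB")
```

Raising any of the three values increases the amount of transactional data the queue manager can hold, at the price of a potentially longer restart.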
When the log fills in this way, the oldest outstanding transaction is rolled back. This frees the log tail pointer to advance, making room for the new transaction. The application holding the rolled-back transaction receives an appropriate return code and is free to retry the operation.

It is important to understand that space available for transaction recovery is always bounded by the number of log extents. This is true for linear as well as for circular logged queue managers. Because transaction recovery impacts restart time, it is necessary for WebSphere MQ to enable tuning of the maximum size and number of extents of log files. The queue manager might take a while to restart if there are many gigabytes of messages in the transaction logs. Many have wondered why it is necessary to specify primary and secondary log extents with linear logs, since they can grow indefinitely. The ability to trade off restart times against the amount of simultaneous transaction data is the reason why.

When a queue file is damaged, the administrator of a linear-logged queue manager can issue a command to recover messages in the queue. In this operation, the damaged queue file is deleted, an empty queue file is created, and the log file is parsed from the last known point of consistency. All put, get, and commit operations for that queue are replayed until the queue is restored to the state it was in after its last successful put, get, backout, or commit. The queue is then made available to the queue manager and the applications wishing to access it.

Determining the appropriate logging mode

Circular logging sacrifices the ability to recover persistent messages from a damaged queue file in return for performance and automated log file management. The queue manager can process many more messages per second, but if a queue file is lost, so are all the messages that were on it.
Circular logging is most appropriate when the messages are easily recreated or when the applications involved can automatically reconcile their state. The use of persistent messages is often an indication that neither of these conditions is true and that linear logging might be required. However, there is a cost associated with linear logging, and there are cases when this cost exceeds the financial impact of losing persistent messages. Making that determination requires a fairly accurate estimation of the costs of linear logging and the possible financial impact of losing one or more queues full of messages. If lost messages are the lesser impact, choose circular logging.

Similarly, linear logging should not be selected without a thorough understanding of its costs and risks. A normally operating linear logged queue manager will eventually consume all available disk space if the logs are not managed. If this is permitted to occur, the entire queue manager ceases to function. For some applications, the temporary unavailability of the queue manager is far worse than the loss of messages. For these applications circular logging may be indicated. To help make that determination, the next sections discuss the costs and risks in greater detail.

Risks of circular logging

Human error

The most common cause of damaged queue files is human error. A few real-life examples include:

- Queue files that were deleted by a system administrator responding to a disk space alarm.
- Automating the regular deletion of a queue file by an administrator intending to clear messages from the queue.
- Attempting to start a primary and secondary queue manager against the same set of files.
- Backing up WebSphere MQ files while the queue manager is running.
- Users opening queue files under edit and acquiring locks on them.

- Changing the group membership of the MQ service account.

Such risks are mitigated through training and practice. When evaluating the likelihood of human error, there is a tendency to underestimate both the impact and the probability of an occurrence. For example, the impact estimate is often based on normal operation of the system, when the queues are empty or nearly so. However, the queues and logs are usually sized to hold a significant amount of data in the event of an outage. It is during such an event, when the queues are at their high water marks and tensions are running high, that human error is most likely to occur. When responding to a critical outage, people often take expedient actions to prevent escalation. This is precisely the time when queues tend to get lost due to disk space alarms, improper access, or improper restart of contingency systems.

The likelihood of damaged queues also tends to rise over time as the implementation matures in the organization. Although the systems and requirements tend to be well understood at deployment time, it is only routine operations that are practiced on a daily basis. These routine operations are thus reinforced while the exception procedures fade from memory. Normal staff turnover has a tendency to replace formal training with on-the-job training focused on routine tasks. This further dilutes knowledge of non-routine procedures, causing the probability of human error to rise over time.

Of course, even the best trained staff can make mistakes under completely routine circumstances. Because the system depends on humans for its health and welfare, a certain amount of human error is inevitable. This is especially true where there is little or no excess capacity in the operational teams. The more time the team spends in triage mode and the less time spent on strategic activity, the more likely human error becomes.

System error

A less common cause of damaged queue files is system error.
From time to time problems are identified in which there is a possibility of damage to queue files through no fault of the human participants. When these result in changes to the code, they are identified as APARs and published. The most recent APAR that affects logging is IC60063. This APAR describes a situation in which a new circular log extent is formatted and written with disk caching enabled. If the server fails while disk writes remain cached, data might be lost or queue files damaged. Other APARs describing conditions resulting in damaged queue files include:

- APAR SE28955: During "Record MQ Object Image," queues are compacted if necessary. During the process of compacting the queue, a difference between the memory and the disk image happens as certain flags related to group messages are not propagated correctly. The error is thrown at the Record MQ Object Image request as WebSphere MQ ensures that the bad queue image is not recorded.
- IC51598: Damaged object following log errors caused by a sharing violation on the log itself. A transaction is rolled back but a STOP_ALL (and end queue manager) failure occurs between logging a CLR log record and the end transaction, meaning the end transaction log record is never written. This STOP_ALL also does not result in the queue manager terminating. The transaction then appears as active in a subsequent checkpoint, whereas it should in fact have been rolled back, causing problems during recovery and resulting in a damaged object.
- IC53204: Damaged objects following queue file resizing. File pointers in use by queue manager threads could be left invalid following reduction of the file. This could result in data in subsequent writes to the file being lost.

Although such APARs are rare and the possibility of triggering the sequence of events to cause the problem is remote, the fact remains that queue files do occasionally become damaged without disk crashes or human error.

Risks of linear logging

Linear logging dramatically reduces the chance of lost data. The possibility is not eliminated altogether because there is always a chance of losing both the queue files and log files. This is why IBM recommends that queue files and log files are mounted on separate dedicated partitions. If the file systems are mounted to separate partitions and IBM recommendations are followed, a very high degree of recoverability can be achieved. This recoverability comes at a price in terms of system performance and administrative overhead.

Linear logging also introduces an additional risk which can potentially cause a complete outage of the queue manager. While the additional risk is well understood, the cost of implementing linear log maintenance correctly is often underestimated. A normally operating linear logged queue manager will definitely experience an outage unless the logs are actively managed. Depending on message traffic load and disk space allocations, it might take a day or it might take a year, but eventually the logs will grow to consume all available space and the queue manager will crash unless you take steps to prevent it.

To understand the risk, it is necessary to understand the maintenance procedures. IBM provides several SupportPacs that manage linear log files. Typically, these tools use a scripting language to identify inactive log extents and dispose of them. The queue manager provides commands to inquire on which extents are active, as an option for users who wish to provide their own instrumentation for this process. Whatever tooling is used, it is typically automated to run nightly.

When evaluating the cost of linear log maintenance, most assessments consider only the scripting and automation. But like any other critical system process, something needs to make sure the script actually runs. Issues commonly encountered include:

- Due to heavy load, the file system fills before the next log archive interval.
- The log archive job or automation fails silently, allowing the file system to fill.
- Excessive MAXDEPTH and MAXMSGL settings let the queues hold more data than the capacity of the log partition.

Safe implementation of linear logs requires additional instrumentation and regular human oversight of the system. In practice, the most significant risk is failure to recognize this requirement and commit sufficient resources to implement a robust archive process. Linear logging is not unreliable, but implementing the archive process on a shoestring budget can make it appear so. Below are a couple of use cases describing how linear logging was successfully implemented. These are at opposite ends of the spectrum when it comes to complexity, but both proved very successful.

A complex linear log use case

The first case is a system that I implemented some years ago. It had a rather large footprint but it was robust and supported hundreds of linear logged queue managers. The basis of the system is SupportPac MS62: Linear Log Cleanup Utility, a Perl script that parses the error logs to identify the inactive extents and provides options to archive or delete them. In the implementation described here, the inactive logs were deleted. The SupportPac was bundled into a wrapper script which performed three functions:

1. Take a checkpoint using the rcdmqimg command.
2. Execute the log archive script from SupportPac MS62.
3. Report any errors.

The combined script was scheduled nightly. If we had stopped there, outages would have been a fairly common occurrence. Initially, that is exactly what happened. Eventually I added negative notifications. The original error reports let us know when something was wrong, but lack of notification could either mean that the system was healthy or it could mean that the archive job failed. The negative notifications were daily e-mails to the team that reported success of all archive jobs. If the notice did not arrive, we knew something was wrong.

The final implementation included many components and touched many processes:

- A Web service received updates from all log archive jobs and raised an alarm if there were problems.
- A report program reconciled all of the updates from the previous night against the database of queue managers. Any failed archive jobs from the previous night were listed. In addition, the report also listed linear logged queue managers where the archive job failed to report in. This provided negative notification.
- The report was e-mailed to all MQ administrators and was also available online using a Web browser. Failure to receive the e-mail provided additional negative notification.
- System monitors reported log partitions that fell low on free space.
- All of the human processes to support the system were documented in the various teams that participated in routine and exception procedures. These included the MQ administrators, platform OS support teams, and the staff of the operations command center.
- The deployment window for new queue managers was extended slightly to include setup and testing of the linear log automation.
- The provisioning process was modified to include message traffic profiling and sizing for log files in addition to the existing message traffic profiling that was in place to size queue files.
- A sandbox environment was provided to test the archive system and to practice recovery exercises.
- A development environment was provided to house a non-production version of the system.
- New, dedicated servers were provided for the Web server front-end and database. One of these was in the production data center and one in the disaster recovery data center.
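The reconciliation step in the report program can be sketched in a few lines. This is not the original system, only a minimal illustration of the idea: compare the queue managers that should have reported against the reports that actually arrived, and list both explicit failures and silent ones. The ArchiveReport type, the reconcile function, and the queue manager names are all hypothetical.

```python
# Sketch of negative notification: a missing report is treated as a problem,
# just like a report of failure. All names here are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveReport:
    queue_manager: str
    succeeded: bool


def reconcile(expected: set[str], reports: list[ArchiveReport]) -> dict[str, list[str]]:
    """Return queue managers whose archive job failed or never reported in."""
    reported = {r.queue_manager for r in reports}
    failed = sorted({r.queue_manager for r in reports if not r.succeeded})
    missing = sorted(expected - reported)
    return {"failed": failed, "missing": missing}


if __name__ == "__main__":
    inventory = {"QM1", "QM2", "QM3"}
    last_night = [ArchiveReport("QM1", True), ArchiveReport("QM2", False)]
    print(reconcile(inventory, last_night))
```

The "missing" list is the part that a plain error report cannot provide: a job that dies before it can report an error shows up there.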
In addition, organizational commitment was required to maintain two-deep expertise on staff capable of performing maintenance of the log archiving and reporting system. As these were fairly complex, the requirement was not simply having someone on staff who knew Perl and Web services, but specifically two people who were familiar with the system and qualified to modify it to resolve an outage.

When I implemented this system, we had already developed an online tool to administer the queue managers and leveraged much of that infrastructure to build out the log archive automation. This made the system a little more complex than something you might build from scratch, but the real improvement was when we mapped out all the human processes and formalized the touchpoints to properly provision, deploy, and administer the system.

A slightly less complex linear log use case

One of my clients had a different approach that enjoyed the benefit of simplicity and was just as robust as the first case. The basic problem to be solved is that an unmonitored linear logged queue manager will eventually run out of disk space. This shop had modest space requirements, so their approach was to massively over-allocate log file storage. They calculated the amount of storage a day's worth of message traffic would consume and then added a generous margin to allow for growth. They then multiplied that by 15 to come up with a two-week buffer. It was the job of the on-call person to monitor the disk partitions a couple of times a week. If one person neglected that duty, the on-call person the following week would be likely to catch the error. To add one more safeguard, the file system monitors were set at a very low threshold, calculated to signal a problem after two to three days. When I first heard of this system, it had already been in place for several years without any incidents.

Putting it all together

There is a school of thought which holds that circular logging is generally preferred, due to queue manager outages that can occur with linear logging. However, the perceived instability of linear logged systems is not due to any frailty in the systems themselves, but largely due to failure to fund the implementation sufficiently. If the system is robustly instrumented and the human processes accounted for, linear logging can provide exceptional levels of reliability for those applications that require it. Unfortunately, there have been many linear log implementations that lacked human oversight, redundancy, or negative notification, and which subsequently failed. This has led to the popular wisdom that linear logging introduces more risk than it mitigates. But that need not be the case.

I have provided two examples where linear logging was extremely reliable. The first was complex but allowed fine tuning of disk space and early warning of problems. The other was strikingly simple in using relatively cheap disk storage to account for week-long lapses in human oversight. Yours is likely to be somewhere in the middle of these extremes.

To make the decision of linear versus circular logging you will need to calculate the costs and risks for your shop and applications. Which is more costly to your business: message loss or a queue manager outage? What would it cost in your shop to implement a truly robust process for log maintenance? Do you need the throughput that only a circular logged queue manager can provide? When you can answer these questions with confidence, you will be well equipped to make the right logging decision for your business.

The default for WebSphere MQ is circular logging. If you do not specify otherwise when you create a queue manager, it will have circular logging invoked.
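Because the logging type is chosen when the queue manager is created, it can help to see what that choice looks like on the command line. The sketch below only assembles an argument list for the crtmqm command, using its -lc (circular), -ll (linear), -lf (log file pages), -lp (primary files) and -ls (secondary files) options. The helper function, the queue manager name, and the numbers are hypothetical, and nothing is executed.

```python
# Sketch: build (but do not run) a crtmqm command line for a given log type.
# The helper name, queue manager name, and sizes are illustrative assumptions.
from typing import Optional


def crtmqm_command(
    name: str,
    linear: bool = False,
    log_file_pages: Optional[int] = None,
    primary: Optional[int] = None,
    secondary: Optional[int] = None,
) -> list[str]:
    """Return the argument list for creating a queue manager."""
    args = ["crtmqm", "-ll" if linear else "-lc"]
    if log_file_pages is not None:
        args += ["-lf", str(log_file_pages)]
    if primary is not None:
        args += ["-lp", str(primary)]
    if secondary is not None:
        args += ["-ls", str(secondary)]
    args.append(name)
    return args


if __name__ == "__main__":
    print(" ".join(crtmqm_command("QM1", linear=True, log_file_pages=4096, primary=3, secondary=2)))
```

Passing the resulting list to something like subprocess.run would create the queue manager; that step is left out here so the sketch has no side effects.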

Changing the type of logging in the qm.ini after the queue manager has been created will not change the way the queue manager handles logging. If you wish to convert from circular to linear or vice versa, you would have to recreate the queue manager, specifying the new type of logging at creation time.

Linear logging allows you to recreate lost or damaged data by replaying the contents of the log. This developerWorks article gives a comparison between circular and linear logs to help you evaluate which one might be best for your system: Circular logs vs. linear logs

Should you choose to use linear logs, this will require that you manage the log files. Otherwise, the log files will grow infinitely and eventually fill your file system. You can archive inactive logs because they are not required for queue manager restart. There are some free MQ SupportPacs such as MS0L, MO73, and MS62 available to assist with linear log file management and cleanup, but these are provided "as-is" and without warranty or support by IBM.

You can determine which logs are no longer needed by monitoring the specific messages reporting the logs required for restart and media recovery. Periodically, the queue manager issues a pair of messages to indicate which of the log files are needed:

- Message AMQ7467 gives the name of the oldest log file required to restart the queue manager. This log file and all newer log files must be available during queue manager restart.
- Message AMQ7468 gives the name of the oldest log file needed for media recovery.

Note that the log files required will change as checkpoints and record images are performed, and the messages AMQ7468 and AMQ7467 will get updated to reflect the required logs for queue manager restart and media recovery.

Running rcdmqimg (record media image) writes the image of objects to the log for media recovery, thereby freeing up old log files for archival or deletion. But the rcdmqimg command does not run automatically, so it must be run manually or from an automatic task you have created.

Often when troubleshooting log problems, I request an output listing of the log file directory (for example, ls -ltR /var/mqm/log). This helps determine the number of logs and when they were written, and may give clues as to how fast the logs are being utilized.

You need to keep all log files back to the oldest log file required for media recovery (message AMQ7468) in order to recover queue manager objects, regardless of whether you archive them or not. You can save space and archive the log files between the log file required for media recovery up through the log file required for restarting the queue manager (message AMQ7467). All log files older than the log file required for media recovery are not required and can be deleted. Only log files required for queue manager restart, the active log files, are required to be online. Inactive log files can be copied to an archive medium such as tape for disaster recovery, and removed from the log directory. Inactive log files that are not required for media recovery can be considered as superfluous log files. You can delete superfluous log files if they are no longer of interest to your operation.
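The retention rules above can be sketched as a small classifier. It assumes that log extents are named in the form S0000000.LOG and that the AMQ7467 and AMQ7468 message text contains such a name. The sample message wording, the function names, and the file numbers are illustrative assumptions, and the sketch only reports what could be done with each file without touching anything.

```python
# Sketch: classify linear log extents from the two "oldest log file" messages.
# Assumptions: extent names look like S0000000.LOG and appear in the message
# text; the sample messages below are illustrative, not verbatim output.
import re

EXTENT_RE = re.compile(r"S(\d{7})\.LOG")


def extent_number(text: str) -> int:
    """Extract the extent sequence number from a file name or message."""
    match = EXTENT_RE.search(text)
    if match is None:
        raise ValueError(f"no log extent name found in: {text!r}")
    return int(match.group(1))


def classify(log_files: list[str], restart_msg: str, media_msg: str) -> dict[str, list[str]]:
    """Split extents into those needed online, archivable, and deletable."""
    restart = extent_number(restart_msg)  # from AMQ7467
    media = extent_number(media_msg)      # from AMQ7468
    result: dict[str, list[str]] = {"online": [], "archive": [], "delete": []}
    for name in sorted(log_files):
        number = extent_number(name)
        if number >= restart:
            result["online"].append(name)   # needed for queue manager restart
        elif number >= media:
            result["archive"].append(name)  # still needed for media recovery
        else:
            result["delete"].append(name)   # needed for neither
    return result


if __name__ == "__main__":
    files = [f"S{n:07d}.LOG" for n in range(8, 13)]
    amq7467 = "AMQ7467: oldest log file required to start the queue manager is S0000011.LOG"
    amq7468 = "AMQ7468: oldest log file required for media recovery is S0000009.LOG"
    print(classify(files, amq7467, amq7468))
```

A script built on this idea would still need the safeguards discussed in the article, most importantly some way of noticing when it did not run.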
