
STORAGE AREA NETWORK

Competitive Brief:
DCX Architecture and Performance
Reviews the Brocade DCX design and refutes erroneous
claims from Cisco about its architecture & performance


INTRODUCTION
Cisco has made numerous claims about the Brocade platform architecture and performance. This
paper responds to their marketing glossy "Performance Testing on Brocade 48k" (June 2006) and
similar claims they have made regarding the Brocade DCX Backbone.
The most obvious flaw in the Cisco testing is that it was not independent. Even if there were no other
problems with their process and results, this alone would be an issue. It is easy for a vendor to contrive test
conditions to artificially produce a desired result, and then further manipulate statistical claims based
on that already-manipulated result.1 Independent testing is one way to prevent such blatant cheating.
Cisco could have chosen to allow independent testing, or to participate in an open bake-off. It appears
that they did not believe that they would fare well in a fair test, and chose to avoid any testing
environment which would allow comparisons to be conducted under experimentally valid conditions.
Along the same lines, when Cisco makes negative statements about the Brocade DCX system, they do
not provide essential configuration details such as software and firmware versions, hardware
revisions, or cabling and traffic flow information sufficient to duplicate their testing. From a scientific
standpoint, this means that there is no evidentiary value to their claims, because it is not possible to
determine what they did to achieve their claimed results. Based on some of the results they typically
report, their testing appears to have been performed with defective equipment and/or using non-production hardware/software/firmware on the Brocade platforms. Unlike a bake-off, the Cisco testing
was conducted without any configuration assistance from Brocade or a qualified Brocade partner, so
the Cisco internal personnel configuring the test bed were unqualified to perform Brocade DCX
installation or configuration.
Some of the known problems with their methodology have historically included:

Using 64-byte frames without indicating that they were doing so
Not explaining the correlation between test patterns and real-world applications
Using defective cables and/or SFPs, or pre-production hardware/software levels
Using terminology incorrectly, such as referring to backplane traces as "ISLs" or muddling the use of the terms "blocking" vs. "congestion"
Creating congestion on latency tests, which measures queue depth instead of switching latency
Turning off features such as DPS in order to artificially create imbalances on the backplane

The Cisco claims have little if any technical validity, and in some cases are simply and demonstrably
direct falsehoods. Such claims can therefore be classified as marketing FUD (Fear, Uncertainty, and
Doubt) rather than as a technical comparison. Their intent appears to be to deflect attention from
their own architectural shortcomings, rather than to actually compare the platforms.
This whitepaper reviews the Brocade DCX architecture, refutes the worst of their erroneous claims,
and provides some level of discussion about SAN performance measurement methods and their
applicability to real-world performance. Where it is necessary to debunk a Cisco claim, this paper tries
to provide a balanced view of the matter, and on the potential cause of the error on the part of the
marketing personnel at Cisco who created the claims in question. For instance, if it is possible that
Cisco merely lacked knowledge or understanding, or used faulty equipment, then this paper will
indicate that this is a possibility rather than assuming that every incorrect claim was an intentional
falsehood designed to mislead their customers. The idea is not to follow Cisco down the road to a
mud-slinging match. Rather, it is to give the reader enough information to reach technically valid
conclusions about the Cisco claims, and about the Brocade platform characteristics.

1 Mark Twain said, "There are three kinds of lies: lies, damn lies, and statistics."


BROCADE DCX Features and Performance


All Platforms use a Multi-Chip Design
Cisco Claim:
The DCX uses a three stage network.
Reality:
The Brocade DCX platform does use separate chips on port blades vs. CP core blades. However, so
does Cisco and so does every other modular switching or routing product. It is logically impossible to have
one microchip which sits on multiple blades at the same time, so every bladed product will have some
form of multi-chip design. Frames crossing a Cisco platform go through port-card chips (stage 1), crossbar
chips (stage 2), and then back through still more port-card chips (stage 3). In fact, because Cisco is
essentially still shipping a first generation platform without any hardware refinement or chip-level
integration, each one of their stages actually consists of several chips.
Terminology may vary, such as the definition of "stage," but the effect is identical from the perspective of a
SAN-attached application. Applications care about whether or not their data gets to the other end of the
network intact and how long it takes to get there. They do not care whether a particular backplane trace is
called a "hop," "stage," "trace," or "link" by marketing personnel. It turns out that the Cisco "three stage"
platform design is vastly slower than the Brocade approach, from the point of view of the application.
The fact is, all bladed switches and platforms need internal connectivity between line cards or port
modules. Platforms have port-blades with chips that provide outward-facing connectivity, and
separate chips, usually on centralized blades, which provide bandwidth between blades. There are
different ways to construct and interconnect these chips. The two most popular approaches involve
shared memory and crossbar technology.
High-speed Ethernet and Fibre Channel switches and platforms use shared memory designs to
achieve the highest performance. Shared memory switches are most often built using customized
Application Specific Integrated Circuits (ASICs), which allow them to support advanced features in
hardware rather than relying on slower, less optimal approaches. Crossbars are typically used to lower
development costs, which increases profit for the company building the switch, but at the sacrifice of
performance and features. Inside the DCX, the back-end inter-ASIC connections use the same frame
format as the front-end ports, enabling the back-end ports to avoid latency due to protocol conversion.
Crossbar switches, in contrast, generally convert frames to a proprietary back-end protocol, and then
convert them back into the front-end protocol before they leave the switch. This double protocol conversion is inherently inefficient, and is one of the many reasons why the Cisco platform is so slow.
Cisco has been making a number of false claims related to this three stage design.
Cisco Claim:
Lack of a central arbiter contributes to uncoordinated forwarding decisions inside the switch.
Reality:
A central arbiter (also known as a "central failure point" or "central bottleneck") is required in a
crossbar switch because the chips on each port blade do not know how to forward frames. Cisco uses
unintelligent, commodity parts on their port blades in order to save on development cost. In the
Brocade architecture, each and every port blade has the intelligence necessary to make frame routing
decisions. Cisco is trying to spin their cost-cutting design by implying that intelligence on port blades
is a bad thing.


The evidence they provide for this is that the "uncoordinated forwarding decisions" made by DCX
port blades can result in imbalances in the allocation of bandwidth. There are numerous problems
with this theory.
For instance, it is possible to create an imbalanced scenario in any network device. The DCX can
exhibit imbalances, sure, but so can the Cisco directors. In fact, if you build a core/edge network of
Cisco directors, and the same network of DCX chassis, it is even possible to create an identical
imbalance if you load the networks with the same traffic pattern.
This leads to the second major problem with their theory. From the point of view of an application, it
doesn't matter if an imbalance occurs within a switch, or between switches in a network. The
application doesn't even actually care about imbalances per se. (If applications crashed when
bandwidth imbalances occurred, then there wouldn't be any functioning networks anywhere in the
world, because they all have imbalances in them.) What applications care about is whether or not they
get enough bandwidth to (a) keep their storage connections from timing out, and (b) meet their
application performance goals.
Now, it is critical to understand that imbalanced apportioning of bandwidth only occurs on links with
sustained congestion. Uncongested links provide each application with 100% of the requested
bandwidth. Even if the link is running at 99% of capacity, there will be no imbalance. Links which
congest only briefly may experience a slight variation in latency during the transient congestive state,
but this has no impact on application-level performance. Network imbalances such as the ones Cisco
is describing only occur and potentially impact applications when a congestive condition persists for
extended periods of time.
The significance of this is that, if you have links with continual congestion, you already have an
application level problem regardless of the balancing of the resource. You can design a congested
Cisco network which experiences imbalances. The fix is to add more ISLs, thus eliminating the
congestion, in which case the imbalance will also go away. The same scenario applies in the same way
to DCX networks.
It is also worth noting that many customers want an imbalanced allocation of bandwidth on congested
links. That is the whole point of QoS software: high priority flows get a bigger share of the bandwidth
than the lower priority flows. Brocade provides a number of mechanisms for deliberately creating
imbalances in order to favor higher priority applications, including QoS, ingress rate limiting, traffic
isolation zones, and local switching. But for most customers, less is more with these features. The
best option is to design a network which doesn't experience sustained congestion in the first place.
Cisco Claim:
The DCX has internal ISLs which cannot support 8Gbit
Reality:
It is important to note that the internal ASIC connections in a Brocade DCX are not E_Ports connecting
an internal network of switches via ISLs. The entire platform is a single domain, and a single hop in a
Fibre Channel network. When a port blade is removed, a fabric reconfiguration is not sent across the
network. Back-end connections use the same frame format as front-end ports to maximize efficiency,
but because they are contained within a single switch, there is no need to run any of the higher layer
(service) FC protocols across these connections.
The Brocade DCX features an internal Channeled Central Memory Architecture (CCMA) fabric of
purpose-built ASICs capable of switching at 320 Gbit/sec per chip. (640 Gbit/sec cross-sectional.) The
DCX is powered by a matrix of these Condor2 ASICs, which delivers up to 256 Gbit/sec per slot, net
of local switching. In addition, the high-density blades can take advantage of local switching to achieve
384 Gbit/sec per slot. This yields 3 Tbit/sec (6 Tbit/sec full duplex) for a single platform, not counting


the Inter-Chassis Links. If two platforms are connected via ICLs, the overall system delivers 6 Tbits/sec
(12 Tbits/sec cross-sectional).
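As a rough sanity check on these figures, the short sketch below simply multiplies out the numbers quoted above (it assumes eight 48-port blades at 8 Gbit/sec per port, which is consistent with the 384-port total cited later in this paper; it is illustrative arithmetic, not a Brocade specification):

# Rough bandwidth arithmetic behind the DCX figures quoted above.
# Assumptions (consistent with the text): 8 port-blade slots, 48 ports per
# high-density blade, 8 Gbit/sec per port, 256 Gbit/sec of backplane
# bandwidth per slot, plus local switching on the blade itself.

PORT_BLADE_SLOTS = 8
PORTS_PER_BLADE = 48
PORT_SPEED_GBIT = 8

ports_total = PORT_BLADE_SLOTS * PORTS_PER_BLADE          # 384 ports
per_slot_local_gbit = PORTS_PER_BLADE * PORT_SPEED_GBIT   # 384 Gbit/sec per slot with local switching
platform_gbit = ports_total * PORT_SPEED_GBIT             # 3072 Gbit/sec, i.e. ~3 Tbit/sec

print(ports_total, per_slot_local_gbit, platform_gbit)
# Doubling for full duplex gives ~6 Tbit/sec, and two ICL-connected chassis
# double the system total again, matching the 6/12 Tbit/sec figures above.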
The links between the ASICs within a DCX are CCMA links, not ISLs. ISLs carry traffic for a number of
standards-defined FC services such as FSPF, name server, zoning updates, and so on. None of this
protocol overhead is present on the DCX backplane, so all of the bandwidth described above is
available for application data. Since the term ISL is defined in the FC standards in a specific way,
and the inter-ASIC links inside the DCX do not match the definition of ISL, this specific Cisco claim is
a technical falsehood rather than simply a misleading bit of spin. The inter-ASIC links are only similar
to FC ISLs in one respect: they carry Fibre Channel frames. Of course, the same can be said about the
backplane of the Cisco platform. Carrying FC frames is, after all, the whole point of those links.
Besides, let's say for the sake of argument that Cisco was telling the truth, and the DCX was a "network in
a can" with ISLs on the backplane. Is Cisco saying that networks are bad? Cisco? Unless they are
willing to admit that their own products cannot, or at least should not, ever be networked together, then
their assertion that the internal characteristics of the DCX are like a network should be answered with a
resounding, "So what?"
Beyond that, there were fundamental flaws in the method Cisco used to "prove" that the DCX
backplane traces do not have high enough performance to support 8 Gbit. To reach this conclusion, they
disabled the backplane trace balancing mechanism. That is, they adjusted features on their FC frame
generator and/or turned off features on the DCX to artificially create hot spots on the backplane.
This is disingenuous for several reasons. Like the "uncoordinated forwarding decisions" discussion in
the previous section, it is possible to design a network of Cisco chassis and get the exact same
behaviors using the exact same test equipment settings. (Except, of course, with Cisco you need to
use 4 Gbit FC test equipment.) Like Brocade, Cisco uses FC exchange boundaries when balancing
links. If you turn off exchange ID rotation on the frame generator (thus making it behave radically
differently than any real FC device), then you will get hot spots inside the DCX, but you would also
create hot spots in the Cisco network.
In any case, this is academic. Because real hosts and storage devices change exchange IDs on a
very frequent basis, balancing IO on exchange boundaries works quite well, which is why both
Brocade and Cisco use this method.
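To make the exchange-balancing point concrete, here is a minimal illustrative sketch (not Brocade's or Cisco's actual routing logic; the hash function and the number of back-end links are assumptions chosen for illustration). It shows why a frame generator that rotates exchange IDs spreads traffic evenly, while one configured with a fixed exchange ID concentrates every frame on a single link:

# Illustrative only: shows why exchange-based balancing depends on the
# traffic source actually rotating exchange IDs (OX_IDs). The hash and
# the number of back-end links are assumptions, not vendor internals.
from collections import Counter

NUM_BACKEND_LINKS = 8

def pick_link(src_id: int, dst_id: int, ox_id: int) -> int:
    # Balance on exchange boundaries: frames of one exchange stay on one
    # link (preserving order); different exchanges spread across links.
    return hash((src_id, dst_id, ox_id)) % NUM_BACKEND_LINKS

def simulate(num_frames: int, rotate_exchanges: bool) -> Counter:
    usage = Counter()
    for frame in range(num_frames):
        # Fixed OX_ID models a mis-configured frame generator.
        ox_id = frame // 16 if rotate_exchanges else 0x1234
        usage[pick_link(0x010100, 0x020200, ox_id)] += 1
    return usage

print("rotating OX_IDs:", sorted(simulate(10_000, True).values()))   # roughly even per link
print("fixed OX_ID    :", sorted(simulate(10_000, False).values()))  # every frame on one link: a "hot spot"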

Brocade Has Frame-Level Trunking; Cisco Does Not


Cisco Claim:
Brocade can only trunk 8 adjacent links, whereas Cisco can trunk 16 non-adjacent links.
Reality:
Cisco and Brocade define the word "trunking" differently. There is no concrete standards-based
definition of the term, so this claim could be viewed as simply misleading rather than a direct
falsehood. However, since they radically change the definition of the word "trunking" in the middle of
the sentence without mentioning this to the reader, characterizing the claim as merely misleading
might be overly charitable.
It is true that the feature Brocade defines as trunking must use adjacent ports and uses a 2- to 8-link design per trunk. However, Cisco does not have any equivalent feature; they simply have a totally
different feature which they market under the same name. Cisco does not have trunking by Brocade
standards, but Brocade does have a feature equivalent to Cisco's "trunking." The Brocade feature is
called Dynamic Path Selection (DPS). DPS is not trunking, but it looks similar enough to trunking on
PowerPoint slides that Cisco is able to create confusion around the terminology.


The Brocade DPS feature does not require the use of adjacent ports, and balances IO on a per-exchange basis much like the feature Cisco calls trunking. When DPS is combined with Brocade
frame-level trunking, the DCX can produce a 64-link balanced pipe. Here is how that works:
Each Condor2 ASIC's ports can be combined into virtual interfaces, or frame-level trunks, of up to
64 Gbit/sec each. It is also possible to balance IO between multiple trunk groups to create a pipe of up
to 512 Gbit/sec between different platforms in a fabric by using DPS. Since Cisco is limited to just
the DPS-equivalent feature, lacks 8 Gbit ports, has no true frame-level trunking feature at all, and is
limited to 16-link pipes, it is clear that the Brocade feature is actually considerably faster and more
flexible. The maximum Cisco trunk bandwidth is 64 Gbit/sec, vs. 512 Gbit/sec for the DCX.
For reference, here is a feature comparison chart:
Brocade DCX Feature             Cisco Equivalent
Advanced Trunking               N/A: They have nothing equivalent; architectural flaws prevent them from delivering this feature
Dynamic Path Selection (DPS)    Feature they call "Trunking" for marketing reasons, since they were unable to deliver real trunking
Brocade Advanced Trunking provides superior performance to DPS or to Cisco's pseudo-trunking.
Advanced Trunking implements true load balancing by "spraying" frames across all the links in the
trunk, while preserving in-order delivery. The other advantage over DPS is that frames are not dropped
when a member link goes down. No re-routing takes place either. The only frames that might be
dropped are those physically in flight on the link which failed. With DPS (or Cisco pseudo-trunking) it is
necessary to re-route the group, which is disruptive.
DPS and Cisco pseudo-trunking merely implement load sharing, not true load balancing, allowing some
links to be congested while others are underutilized. This could happen, for instance, when multiple
exchanges hash to the same value and therefore end up on the same link. If several high-traffic, long-lived exchanges are directed to the same link, that link becomes congested at the same time that
other links carrying low-traffic exchanges have bandwidth to spare.
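To illustrate the difference between load sharing and true balancing, here is a small simulation sketch (the hash, the link count, and the traffic mix are assumptions chosen for illustration, not vendor internals). Hash-based sharing can leave one link congested while another sits idle; frame-level spraying keeps every link evenly loaded:

# Illustrative comparison: hash-based load sharing (DPS-style) vs.
# frame-level spraying (Advanced Trunking-style). Link count, hash, and
# traffic mix are assumptions for illustration only.
from collections import Counter
from itertools import cycle

LINKS = 4
# Four long-lived exchanges, each sending 1000 frames.
exchanges = {0x10: 1000, 0x14: 1000, 0x11: 1000, 0x12: 1000}

# Load sharing: every frame of an exchange follows that exchange's hash.
shared = Counter()
for ox_id, frames in exchanges.items():
    shared[hash(ox_id) % LINKS] += frames

# Frame spraying: frames are distributed round-robin across all links.
sprayed = Counter()
rr = cycle(range(LINKS))
for frames in exchanges.values():
    for _ in range(frames):
        sprayed[next(rr)] += 1

print("hash-based sharing:", dict(shared))   # two exchanges collide on link 0; link 3 stays idle
print("frame spraying    :", dict(sprayed))  # 1000 frames on every link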
For a more detailed description of DPS and frame-level trunking, see Chapter 8 of the book "Principles
of SAN Design," which is now in its second edition. (This edition was released in September 2007; go to
http://www.bbotw.com and search for "SAN Design" to purchase this book.)

Brocade Supports Cut-Through Switching; Cisco Does Not


When a frame enters the Condor2 ASIC, the address information is immediately read from the header,
which enables routing decisions to be made even before the whole frame has been received. This
allows the ASICs to perform cut-through routing: a frame can begin transmission out of the correct
destination port on the ASIC even before the initiating device has finished transmitting it. Only Brocade
offers a SAN platform that can make these types of decisions at the port level, enabling local switching
(below) and the ability to deliver 3 Tbit/sec of bandwidth in the platform (6 Tbit/sec cross-sectional).
Cisco Claim:
Brocade does not perform CRC calculations.
Reality:
Brocade does perform CRC calculations. It's really this simple: invert their claim to get the correct
answer. As a result, this must be classified as a direct lie on their part.


To be charitable, it is possible that the Cisco marketing personnel writing the document might not
have understood what a CRC is, how it is performed, what it is intended to accomplish, or how cut-through switching works.
The purpose of a Cyclic Redundancy Check in an FC switch is to detect alterations of data during
transmission, and to cause defective frames to be discarded before they reach the end-point
application or LUN. The device dropping the bad frame could be a switch, router, HBA, or storage
controller. As long as the bad frames are dropped before they can become bad data at the application
level, then the CRC did its job.
To perform a CRC on a frame within an FC switch, it is necessary to run a formula against all of the bits
in the frame, then compare the result of the formula against the CRC which was written into the frame
by the originating device. The definition of cut-through switching is that the switch will start
transmitting a frame out its destination port before the entire frame has been received. It is a logical
impossibility to perform a CRC on a frame until the entire frame is received because of the
mathematical nature of CRC formulas. Since the frame has already been largely transmitted before
the switch has enough information to completely calculate the CRC, it is certainly the case that a Brocade
platform can deliver a frame with a CRC error. Well, sort of.
The problem with Cisco's argument is that, before an FC frame can be considered delivered by a
switch port, the transmitting switch needs to append a valid 4-byte End of Frame (EoF) marker after all of
the data has been sent. This can be thought of as the "green flag" to the receiving device, which tells it
that all of the frame's data was properly sent, as well as telling that device if the frame was the last
frame in a sequence.
Contrary to Cisco's claim about the lack of CRC checking in a Brocade platform, the
transmitting port will in fact have determined whether or not the CRC is valid before the time comes to
transmit the EoF marker. If there was an error in the frame, the Brocade platform simply does not
transmit the "green flag" signal to the receiving device, but rather marks the EoF as bad, and the
standards mandate that the receiver discard the bad frame. Of course, the receiver could also
determine that the frame was bad by performing its own CRC calculation. In reality, FC end-point
devices do both: they perform their own CRC check and look for bad or missing EoF markers, so there
are actually two standards-based rules which prevent a node from accepting a bad frame.
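The short sketch below illustrates the mechanism just described. It is a conceptual model only, not ASIC logic: zlib.crc32 stands in for the FC CRC-32, and the "EOFt"/"EOFni" labels are simplified stand-ins for the good and invalid EoF markers. The CRC is accumulated as words stream through, and the frame is closed with a good or bad EoF depending on the result.

# Conceptual sketch of cut-through forwarding with CRC checking.
# Not actual ASIC behavior: zlib.crc32 is a stand-in for the FC CRC-32,
# and "EOFt"/"EOFni" are simplified labels for good/invalid EoF markers.
import zlib

def cut_through_forward(words, expected_crc, transmit):
    """Forward 4-byte words as they arrive, accumulating the CRC on the fly.

    The frame body is already being transmitted before the CRC can be
    checked; only the trailing EoF marker is chosen after the check.
    """
    running_crc = 0
    for word in words:                 # words arrive one at a time from the ingress port
        transmit(word)                 # cut-through: forward immediately, before the frame ends
        running_crc = zlib.crc32(word, running_crc)

    if running_crc == expected_crc:
        transmit(b"EOFt")              # "green flag": frame delivered intact
    else:
        transmit(b"EOFni")             # EoF marked invalid: receiver must discard the frame

# Usage sketch: a two-word "frame" whose CRC does not match is forwarded,
# but closed with an invalid EoF so the receiver drops it.
sent = []
cut_through_forward([b"\x01\x02\x03\x04", b"\x05\x06\x07\x08"],
                    expected_crc=0xDEADBEEF, transmit=sent.append)
print(sent[-1])  # b'EOFni'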
It turns out that this is the only way to handle CRC in a cut-through architecture. Again, it is
mathematically impossible to calculate the CRC until the entire frame is received, and the definition of cut-through is that the switch begins transmitting the frame before the entire frame is received. It also
handling CRCs since its first FC switch, and therefore it is implemented in over ten million ports of
production SAN switches and routers installed throughout the world. The method has been vetted with
all Brocade OEMs, and there has not been one single case of a frame with a bad CRC landing bad
data onto an application or LUN. For that to happen, the receiving device would need to violate FC
standards with respect to the EoF, and fail to perform its own CRC calculation. It turns out that no such
device exists, ever has existed, or is anticipated to exist in the future in any production datacenter.
Remember that the purpose of a CRC is to prevent bad data from reaching the endpoint in a network.
Brocade ASICs accomplish that goal. This means that the Brocade cut-through CRC method does, in
fact, accomplish the intended purpose.
It is also worth noting that CRC errors are (a) very rare, and (b) always indicate some kind of failure in
the fabric. For example, this could be a failing SFP, a bad cable, or a malfunctioning HBA. In a
working FC fabric, there should be essentially zero CRC errors. As a percentage of all SAN traffic on
working ports ever analyzed by Brocade support, traffic with CRC errors is a second-order
effect, i.e. the number is so close to zero that it is mathematically indistinguishable from zero, and
would therefore be dropped out of any statistical calculation related to SAN traffic. Many CRC errors
have been observed, but only on catastrophically failing links.


This implies two things: (1) Cisco's argument is irrelevant, even if their claim were true, which it
isn't. (2) Cisco substantially mis-configured their test equipment or used defective gear. A
working FC fabric does not have CRC errors. The fact that Cisco was detecting CRC errors in their
testing should have told their personnel that their fabric was not working and therefore had mis-configured and/or defective equipment in the test bed, e.g. a malfunctioning testing device, numerous
bad SFPs, bad cables, etc.
Since the document appears to have been written by marketing personnel rather than by engineers, it
may not be surprising that they did not know the significance of CRC errors and thus failed to correct
the defective equipment in their test configuration, but in any case, this issue alone would be more
than enough to invalidate their testing even if no other issues existed. Results from a known defective
test bed simply cannot be used to form any scientifically valid conclusions.

Brocade DCX Backplane Performance


Each blade in the Brocade DCX can deliver up to 256 Gbit/sec of non-blocking bandwidth across the
backplane to other slots (512 Gbit/sec full duplex). This is substantially more than the MDS 9513 can
deliver between slots. For customers who take even partial advantage of local switching (below), the
DCX can deliver full-speed, full-duplex 8 Gbit/sec performance on all 384 ports at the same time for a
total of 3 Tbit/sec (6 Tbit/sec cross-sectional) in the platform. This is many times the performance of the
Cisco platform.
This is an interesting marketing number, and is achievable in real world deployments with appropriate
planning, but it does not necessarily tell the whole story. All high-performance networking solutions
have cases in which they deliver best-possible performance, and others in which they deliver lower
performance. Brocade could not possibly claim with technical validity that the theoretical maximum
performance will be delivered under every conceivable condition. It is always possible to find a corner
case in which performance is lower, and concoct a test to show just that one result rather than
showing other cases even if there are billions of high-performing cases for each low performing case.
Before even releasing the DCX into the external beta test phase (Early Access program), Brocade had
already conducted extensive testing and analysis. This included, amongst other things, running
countless pattern tests using both Agilent and Spirent frame generators. The typical case throughput
was in excess of 800Mbytes/sec per port, i.e. full line rate. Ports could sustain this speed in full duplex
configuration in any traffic pattern which resembled real-world SAN traffic in any way whatsoever.
Since Cisco had no independent oversight when performing their tests, and did not include
configuration details in their marketing claims, it is not possible to be certain of just how they
jury-rigged the result. However, given the extensive testing conducted by Brocade, the fact that they
did rig the result is certain.
For example, in the 15:1 fan-out test, 15 traffic generator ports acted as initiators reading and writing
simultaneously from one traffic generator port, which acted as a target. This test is meaningful in the
real world, because it characterizes how a switch will perform in a storage consolidation solution. In
other words, the traffic pattern is indicative of something that really gets deployed in production
environments. When Brocade conducted this test, the target port was fully saturated at line rate. That
is, it reached and then sustained more than 800Mbytes/second. Each initiator port received an equal
share of that bandwidth. Cisco's marketing FUD indicates a similar result.
In some other claims, they indicate sub-par performance within the DCX. Brocade was not able to
duplicate their results with a properly configured, working DCX chassis. We could only achieve similar
results by using broken equipment (such as bad cables or SFPs) or deliberately mis-configuring the
equipment (such as forcing the frame shooter to never rotate FC exchange IDs).
It turns out that all of Cisco's testing either indicated a positive result for Brocade or, if negative, fell
into one of these categories:

1. The Cisco claim conflicts with testing performed by Brocade, which implies that Cisco had
malfunctioning equipment, did not know how to operate the gear, or falsified their results.
Given their CRC problems, the faulty equipment scenario seems likely. In any case, Brocade
has not observed a similar result, and cannot duplicate Cisco's result based on the
information in their claims. This applies to their claims about latency, performance
decreasing over time, lost frames, dead internal "ISLs," VC_RDY credit loss, and needing to
reboot the chassis to clear errors. For such claims, Brocade's response must be "the DCX
does not seem to work the way they indicate, so they appear to be lying or mistaken." Or:
2. The Cisco claim is based on a contrived traffic pattern, which bears no resemblance to IO patterns
ever observed by Brocade in any SAN. This case deserves more discussion. For example:
Cisco Claim:
Brocade exhibits lower performance when switching continual streams of 60-byte frames.
Reality:
The standard FC frame size is a bit more than 2k bytes: 30 times larger than the frames used in this
test. Most FC testing is conducted with 2k frames, since most frames in a real-world fabric are that
size. Brocade has been selling FC switches for more than ten years: over three times longer than
Cisco. Brocade is therefore in a good position to understand the traffic patterns of a fabric perhaps a
little better than Cisco does. This claim may simply be a mistaken understanding on their part about
what kinds of traffic patterns are actually going to traverse a fabric.
It turns out that there are usually between 5x and 20x more 2k frames in a typical FC SAN than all
other frame sizes combined. This has to do with the way that FC nodes interact with filesystem block
sizes, SCSI drivers, and HBAs. Small frames are typically only used by nodes at the beginning or end of
a conversation and all intermediate frames are full size. When Brocade conducts testing of its
switches and tunes its ASIC designs for performance, the focus is on testing and tuning things for
performance in real-world scenarios, which means optimizing for mostly large frames.
It is possible to conduct a test using nothing but 60-byte frames, even though no SAN application ever
created behaves this way. Cisco did so. Not surprisingly, the throughput of the platform is reduced
when running a continual stream of 60-byte frames. In one test, throughput dropped from over
800Mbytes/sec down to ~600Mbytes/sec. Brocade testing yielded the same result.
Brocade does not consider this to be an issue for several reasons.
(1) This traffic pattern has never been observed in a real world SAN deployment. Optimizing a product
for a non-existent traffic pattern would not seem productive. It may be that Cisco believes this case to
be important because they are too new to the industry to understand SAN traffic patterns yet, or it
could be that they contrived this case to mislead their customers. Either way, it doesn't seem valid.
(2) To a large extent, lower performance on small frames is a "laws of physics" problem, i.e. there isn't
even a theoretical way to get equal performance on this test. That is because the ratio of header,
trailer, inter-frame gap, and other overhead vs. payload size is different with small frames. With a 60-byte frame, almost half of the total frame is overhead. Indeed, the theoretical max throughput for 2k
frames is 840 Mbytes/sec, but for 60-byte frames it is only ~640 Mbytes/sec. It is just going to go slower.
(A short sketch of this overhead arithmetic follows the end of this list.)
(3) Even if an application were trying to generate a continual stream of 60-byte frames (which doesn't
happen), it would almost certainly not be able to do so because the sending node would become IOPS
bound. That is, end points in a fabric generally have a limited number of IO operations per second that
they can support, e.g. because of CPU constraints, and a sustained 4 Gbit/sec stream of 60-byte
frames would exceed the IOPS limit of even the fastest nodes. It is one thing to drive this IO pattern on
SAN testing hardware. It is quite another to drive it from a SAN-attached application.


(4) To create an application with this characteristic, it would be necessary to use a block size on the
order of 30 bytes. It seems likely that a block size that small would have performance problems no
matter what the SAN fabric did with the frames. Perhaps nobody on the Cisco team has any filesystem
experience outside of NFS or CIFS, but it seems likely that customers deploying SANs will know quite
a bit about the subject. Brocade would be most interested in discussing the use case if any customer
has a technical need for a filesystem with a 30-byte block size. In any case, Brocade most often sees
filesystems with block sizes considerably larger than the FC maximum frame size, and in that case it is
simply impossible for an application to generate IO similar to the contrived Cisco test case.
(5) Finally, it is worth noting that because of reason #2 (above), Cisco also exhibits lower performance
on this kind of contrived test, so Brocade cannot really view this as a serious competitive issue.
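As promised above, here is a rough sketch of the overhead and IOPS arithmetic behind points (2) and (3). The per-frame overhead, inter-frame gap, and usable line rate are simplifying assumptions, so the results only approximate the figures quoted in the text; the point is the ratio, not the exact numbers.

# Rough frame-efficiency and IOPS arithmetic for points (2) and (3) above.
# Assumptions: an 8 Gbit/sec link moves roughly 850 Mbytes/sec of raw bytes,
# and each frame is followed by a 24-byte minimum inter-frame gap. The exact
# standards accounting differs slightly, so these figures only approximate
# the 840 / ~640 Mbytes/sec numbers quoted in the text.

RAW_MBYTES_PER_SEC = 850
INTER_FRAME_GAP = 24

def throughput_mbytes(total_frame_bytes: int) -> float:
    # Fraction of the wire carrying frame bytes vs. idle gap between frames.
    return RAW_MBYTES_PER_SEC * total_frame_bytes / (total_frame_bytes + INTER_FRAME_GAP)

print(round(throughput_mbytes(2148)))   # full-size ~2k frame: ~840 Mbytes/sec
print(round(throughput_mbytes(60)))     # 60-byte frame: ~607 Mbytes/sec, in the ballpark of ~600-640

# Point (3): sustaining ~400 Mbytes/sec (4 Gbit/sec) of 60-byte frames would
# require millions of frames per second from the sending node.
print(f"{400_000_000 / 60:,.0f} frames/sec")   # ~6.7 million, far beyond typical host IOPS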
Note that this line of reasoning applies to most of the test cases claimed by Cisco. For instance, the
HoLB test that they claim Brocade "failed" was contrived by using two separate continual streams of
60-byte frames from one source. Brocade has never seen an application which generates one
continual stream of 60-byte frames, much less two at the same time. The bottom line is that there is
no real world applicability to any of the Cisco results contrived using streams of 60-byte frames,
which turns out to be most of the cases in which they claim a superior result.

Latency Under Load


Cisco Claim:
Brocade exhibits high latency under load.
Reality:
The short response is that Cisco is using "smoke and mirrors," or sleight-of-hand magic, to produce this
result. They are actually exhibiting the same application latency as Brocade in this scenario; they just
contrived the test to move the latency to a location where the Agilent test equipment would not notice
it. The more detailed response requires a bit of explanation.
First, it is necessary to understand that it is application-level latency that matters in a real-world SAN, and
generally not Layer 2 switching latency, except insofar as L2 latency can contribute to application
latency. Application latency is the delay between when an application wants to send data and when
that data actually lands at the destination. Contributing factors include congestion, L2 switching
latency, long distance links, protocol conversions, slow drain devices, FC-AL loops, inefficient RAID
layouts, bottlenecks in controllers, constraints inside servers, and many more possibilities.
Switch latency is how long it takes a switch or router to process a frame and forward it onwards.
When a traffic generator (such as Agilent) measures switch latency, it calculates the time between
when a frame enters the ingress port of the switch and when it enters the far end receiving port. This
is an accurate method of measuring the uncongested L2 latency of a switch or a fabric, but not an
accurate method of measuring application latency under load.
Once the frame starts going into one port of a Brocade switch, it normally also starts being sent out
the destination port even before it has been fully received. This is known as "cut-through" routing, and
is the most efficient way to move frames theoretically possible. Cisco does not use this method, and
instead relies on a form of "store and forward" switching. This means that the frame has to be
completely received before it will be processed, which results in significant delays in traffic delivery.
However, when Brocade links become saturated (e.g. due to the use of test equipment specifically
designed to saturate a network), it becomes logically impossible to use the cut-through method.
Saturation means that there is a queue of frames waiting to use the outbound link. In this case, if
another frame enters the switch, it cannot be cut through to the saturated destination port simply

because that port is already busy. The new frame must be stored in a buffer, and wait until the frames
in line ahead of it drain out of their buffers. This is shown in Figure 1.
In this example, three ingress ports (Sources #1 thru #3) are trying to transmit streams of frames to a
single destination. If each source is attempting to transmit at line rate, and the destination has the
same rate as the sources, then congestion will occur. Each source will receive a portion of its line rate
in sustained bandwidth. Since FC interfaces are serial, only one frame can be transmitted at a time, so
the switch will need to hold frames in buffer memory while previously-received frames are transmitted.
If a new frame enters the switch as shown in the upper-right of the diagram, it will have to wait for
other frames to be transmitted before it can get time on the shared interface. In this figure, depending
on how the output queue is prioritized, it would have to wait for at least seven frames (the other
source #3 traffic) or possibly more than 20 frames (adding in the source #1 and #2 traffic) before
being served by the transmission logic.

Figure 1 - Congested Transmit Queue Block Diagram 2


The delay between a new frame entering the switch and then leaving it is no longer related to the time
it takes to make a switching decision. Instead, it is a function of how many other frames were ahead of
it in the queue. When equipment is used to test the latency of a switch while saturating its links, the
test is no longer measuring switching latency. Instead, it is measuring the depth of the buffer queues
on the switch. The Agilent used in this case would measure latency as being the length of 20+ frames.
The more buffers, the higher the delay. It is easy to demonstrate this on a Brocade platform.
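To put rough numbers on that, here is a back-of-the-envelope sketch; the frame size, link speed, and queue depth are illustrative assumptions consistent with the example above, not measured values:

# Back-of-the-envelope: with a congested egress port, measured "latency"
# is dominated by serialization of the frames already queued ahead,
# not by the switching decision itself. Figures below are illustrative.

FRAME_BYTES = 2148            # roughly one full-size FC frame on the wire
LINK_BYTES_PER_US = 850       # ~850 Mbytes/sec on an 8 Gbit/sec link, i.e. ~850 bytes per microsecond
FRAMES_AHEAD_IN_QUEUE = 20    # the "20+ frames" from the example above

frame_time_us = FRAME_BYTES / LINK_BYTES_PER_US             # ~2.5 microseconds per frame
queueing_delay_us = FRAMES_AHEAD_IN_QUEUE * frame_time_us   # ~50 microseconds of pure queueing

print(round(frame_time_us, 1), round(queueing_delay_us, 1))
# Compare with the ~2.4 microsecond uncongested switching latency cited
# below: the tester is measuring queue depth, not switching speed.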
2 This is not a comprehensive diagram of a Condor2 ASIC. The specifics of the queue management
logic vary by port type and use case, and would take dozens of pages to fully explain. This figure
merely illustrates the concept of queue management on congested links.


First of all, as extensive testing has confirmed, the Brocade platform exhibits best-in-class latency
when interfaces are running even slightly below line rate. For locally switched traffic, latency is
measured in nanoseconds. For non-local traffic, latency is around 2.4 microseconds between 8 Gbit
interfaces. That's ten or more times faster than a Cisco platform, and it indicates the true switching
latency which will be experienced in real-world scenarios, because real applications do not actually
sustain 100% of line rate. At most, they might sustain a percentage of line rate in the mid 90s, and
even that is rare.
Second, it is possible to prove that the increase in latency observed at full line rate is related to the
number of output buffers on a port. Brocade platforms have the ability to tweak the number of
buffers allocated to a given port. This feature was designed for long distance applications, but it can
also be used in intra-datacenter scenarios. When this feature is used to increase the buffers on a
congested output port, the latency reported by fabric test equipment goes up, showing that the test
equipment is measuring buffer depth rather than switching delay.
At this point, it may not be clear what distinction is being made. After all, why would an application
care if delay was caused by queue depth, or by slow switching logic? The answer has to do with the
way that Cisco rigged their own results in this test case.
Since serialized FC interfaces cannot be made to transmit frames in parallel, all vendors will queue
frames in this scenario. Latency caused by queue depth is therefore a fundamental mathematical and
physical phenomenon; it isn't possible to move frames across a congested interface any faster than
Brocade does it. So how can Cisco be showing lower latency in this test case? The answer is that they
moved the delay to the far side of the tester interface from where the test equipment is measuring it.
This is where Cisco performs the sleight of hand. When their platform is saturated, it stops accepting
frames even if it has the memory to store them. This does not in any way whatsoever improve
application performance. It prevents frames from entering the congested switch, which means that
fewer frames are waiting in line for a congested port, which means that test equipment will show a
lower latency result. But it achieves that by preventing nodes from sending IO into the fabric, so
queuing is still occurring: it is just moved out of the fabric and into the application.
Figure 2 illustrates the same scenario as in the previous example, except that the switch is handling
the streams differently. Instead of letting frames enter the switch and handling queuing at the point of
congestion, the switch is now preventing source ports from sending data into the fabric in the first
place. Publicly available documentation from Cisco confirms that this is how their platform works.3
Frame-level testing equipment will not measure this effect at all because it has no application layer
capabilities. Still, the effect on application layer latency is equal to or greater than the latency caused
by buffering within the switch. This latency is generated artificially by "congestion control"
mechanisms, which push back on a node port. The effect of this on the application is that, instead of
waiting for frames to propagate through a congestion point in the fabric, the IO sits inside the host
waiting for the fabric to accept frames in the first place. In other words, it moves the latency from the
switch within the fabric to the application within the host. This makes Layer 2 test results look better by
hiding the delay from the test equipment, but it does not make the application run faster. The same
number of frames are still waiting in line; they still take the same amount of time to get to the head of
the line. Cisco just relocated the line.
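A tiny model makes the point: whether the frames wait in switch buffers or are held back at the host by push-back, a congested egress link drains at the same rate, so the last frame completes at the same time. (This is a deliberately simplified sketch, not a model of either vendor's scheduler.)

# Simplified model: N frames must cross one congested egress link that can
# transmit one frame per time unit. Whether the frames queue inside the
# switch or are held back at the host changes where they wait, not when
# the last frame arrives. Deliberately simplistic; not a vendor model.

def completion_time(num_frames: int, frames_admitted_per_tick: int) -> int:
    """Time at which the last frame finishes crossing the egress link."""
    egress_queue, admitted, sent, t = 0, 0, 0, 0
    while sent < num_frames:
        t += 1
        # Host side: either everything is admitted at once (deep switch
        # queue) or the switch pushes back and admits only a trickle.
        can_admit = min(frames_admitted_per_tick, num_frames - admitted)
        egress_queue += can_admit
        admitted += can_admit
        # The egress link drains one frame per tick regardless.
        if egress_queue:
            egress_queue -= 1
            sent += 1
    return t

print(completion_time(24, frames_admitted_per_tick=24))  # queue in the switch
print(completion_time(24, frames_admitted_per_tick=1))   # push back on the host
# Both print 24: the application waits the same total time either way.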

3 For example, their web site has a PDF entitled "Introduction to Storage Area Networking" which
describes "Forward Congestion Control" this way: "When a switch detects a congestion port (sic) in the
network, the switch generates an edge quench message to the sources as an alert to reduce the rate
at which frames should be injected into the fabric to avoid further head-of-line blocking." In addition to
admitting that they are solving downstream congestion by artificially producing congestion on
upstream ports, this is a public admission that their architecture is subject to head-of-line blocking.


[Figure 2 shows the same three sources as Figure 1, but here the switch pushes back on the ingress ports, stopping them from sending frames, so IO gets backed up inside the application within the host. This method under-utilizes buffer memory on the egress port: instead of queuing frames in the switch, the congestion and associated delay still occur; they just happen within the application. The circled point at the switch ingress is where Layer 2 latency is measured by the Agilent. The added application-level latency is hidden from Layer 2 test equipment, which only measures delay starting when a frame enters the fabric, not delay which prevents the frame from entering the fabric in a timely manner. Delay above the indicated point matters to the application just as much as delay below that point. The only thing that Cisco accomplishes by moving the congestion above the point at which test equipment measures delay is that their test results appear to show lower latency.]

Figure 2 - Cisco Queue Mismanagement "Sleight of Hand"


It would be like a movie theater telling patrons that they need to wait in their cars in the parking lot
instead of queuing up in front of the ticket office. If somebody were to take a photo of the outside of
the theater, the theater owner could use that photo to advertise that they had "No waiting! Shortest
lines in town!" However, nobody would actually get into their seats any faster.
This also explains why Cisco is not showing buffer depletion on congested interfaces. Of course they
are not running out of buffers: they are using the attached hosts and storage controllers as their
buffering mechanisms. When a Brocade DCX has buffers available, the platform hands them out to
the hosts or storage devices which are asking for them. Since Brocade hands out all of its advertised
buffers, the platform will show "buffer credit zero" indications when running at 100% of line rate.
In contrast, even when Cisco has buffer credits available, these credits are useless if the MDS
architecture will never allow them to be used to accept and queue traffic on congested links. If a memory location
never gets to hold any data, then why did Cisco charge the customer for the memory in the first place?
They advertise a bunch of buffer credits to attached nodes, and then refuse to hand out those buffers
under congestive loads, which is the only time when the buffers are actually needed. Cisco seems to
be bragging about the fact that they are withholding buffers from devices in the fabric at the times
when they are needed most!


In order to perform the "latency under load" test correctly, Cisco should have run the load up only to,
say, 99% of line rate. At 100% of line rate, the test would be more properly called "queue depth
detection" or "latency under congestive loads."
In any case, this category of test result is only applicable to customers who are running their FC
infrastructure at 100% of line rate for extended periods of time, instead of, for example, 99.9% of line
rate or less. If links are even slightly below the congestion point, then Cisco's "Forward Congestion
Control" does not do anything at all. In that case, their store-and-forward mechanism is a tenth or less
as fast as the Brocade cut-through approach. At 100% of line rate, Brocade will exhibit greater Layer
2 delay than Cisco, but Cisco will exhibit greater application layer delay than Brocade, because it will
stop source ports from inserting frames into the fabric and cause application traffic to back up
inside hosts. The bottom line is that the Cisco approach only provides a performance benefit for
customers who are using FC test equipment instead of using, for example, real applications.

Local Switching
In addition to supplying more backplane bandwidth per slot than Cisco, Brocade can deliver 8 Gbit/sec
bandwidth per port even on oversubscribed blades through a process called local switching.
In the Brocade DCX, each port blade ASIC exposes some ports for connectivity and other ports connect
to the backplane. If the destination port is on the same ASIC as the source, the chip can switch the
traffic without needing to leave the blade. On the Brocade 16- and 32-port blades, local switching is
performed within 16-port groups. On the 48-port blade, traffic can be localized in 24-port groups.
Even if the traffic in question is running on an oversubscribed blade, the localized traffic does not use
the oversubscribed resource. Since localized traffic does not use backplane bandwidth, it does not
count against the subscription ratio. It cannot impact or be impacted by traffic from other devices.
Because they never touch the backplane, locally switched devices are guaranteed 8 Gbit/sec bandwidth. This enables every port on
a Brocade DCX high-density blade to communicate at a full 8 Gbit/sec speed with port-to-port latency
of just 800 nanoseconds, about twenty times faster than the MDS. This is an important feature for
high-density/high-performance environments because it allows oversubscribed blades to achieve full
non-congested line rate.
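As a simple illustration of the port-group idea described above (a sketch only; the mapping of physical port numbers to ASIC port groups is an assumption for illustration, not the actual DCX port layout), traffic stays local whenever both ports fall in the same group:

# Illustration of the local-switching concept described above. The mapping
# of port numbers to ASIC port groups is assumed for illustration; it is
# not the actual DCX port numbering.

def same_port_group(port_a: int, port_b: int, group_size: int) -> bool:
    """True if two ports on a blade share an ASIC port group, so traffic
    between them can be switched locally and never touches the backplane."""
    return port_a // group_size == port_b // group_size

# 48-port blade: 24-port groups. Ports 3 and 20 switch locally; 3 and 30 do not.
print(same_port_group(3, 20, group_size=24))   # True  -> local switching, full 8 Gbit/sec
print(same_port_group(3, 30, group_size=24))   # False -> traffic crosses the backplane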
Cisco Claim:
Taking advantage of local switching is difficult.
Reality:
Sometimes it is difficult. Sometimes it is easy. It turns out that not all customers are trying to do the
exact same thing with their SANs.
For example, when a customer is initially migrating Directly Attached Storage (DAS) systems onto a
SAN, they already know the host-to-storage connectivity patterns, since DAS traffic is 100% localized by
nature. In a DAS-to-SAN migration, it is not only possible to maintain locality, it is actually very easy.
Similarly, many customers have a small number of mission-critical systems which have a known
relationship with their storage ports, and, because they are critical, these systems tend to be changed
rarely. In such cases, it is easy to understand which hosts talk to which storage ports, and simply
plug them into the same port groups. This delivers the availability and performance benefits of localization
for the systems which need it most, without needing to attempt to localize all flows in the entire fabric.
On the other hand, some SAN-native deployments (e.g. large VMware clusters) are architecturally
incompatible with the concept of localized traffic. Localization isn't hard in that case; it's impossible.


Brocade has never mandated the use of locality. It is offered for cases in which it makes sense.
Brocade supports local switching, but does not require it. Think of it as just one more tool in the
Brocade SAN design kit, which isn't present in the Cisco kit. Their claim that local switching isn't
needed would be more credible if it weren't for the fact that they are architecturally incapable of
delivering the feature. As it is, this appears to be a case of sour grapes on their part.

Credit Loss and Recovery


Cisco Claim:
The DCX backplane traces can lose credits.
Reality:
At no point has Brocade observed this behavior, or seen Cisco demonstrate it, or confirmed a case of
this happening at a customer site. This is an entirely unsubstantiated claim. It may be based on non-production-level hardware or software, or on a misunderstanding on their part of the DCX and Condor2
architectures, or on an incorrect projection of a related issue which occurs on some vendors' WDM
equipment.
In any case, the Condor2 chip supports the credit recovery standard. If it were ever discovered to be
true that the DCX backplane links lose credits, Brocade would probably just enable credit recovery on
the backplane links. However, there doesn't seem to be any reason to do so at this time because, as far as
we can tell, Cisco is simply making this up.
Now, it is true that some WDM equipment can drop the FC primitives which handle credit return.
If this happens persistently on a long distance link, performance will drop off as the credit pool
becomes exhausted. This is why Brocade implemented credit recovery in Condor2 in the first place.
Perhaps Cisco is simply unaware that the DCX backplane does not include a WDM, in which case their
claim might be an honest mistake rather than a deliberate and libelous falsehood.

CONCLUSION
The Cisco marketing FUD document entitled "Performance Testing on Brocade 48k" contains
numerous direct and seemingly deliberate falsehoods. The similar claims they make about the DCX
fall into the same category. The pseudo-technical content is misleading at best, and intentionally
dishonest at worst. They appear to have been using out-of-date, defective, and/or mis-configured test
equipment, and did not appear to understand basic SAN terminology or protocol characteristics. Since
they decided to conduct their testing behind closed doors rather than participating in open third-party
testing, and did not even include many test details in their marketing glossy, the document and
related claims cannot credibly be viewed as having any technical merit.
Based on extensive testing, it is clear that the Brocade DCX is the lowest-latency, highest-performing
platform on the market. The only way to illustrate other results involves concocting "smoke and
mirrors" test cases, which obscure the flaws in the MDS architecture and/or deliberately avoid using
the Brocade platform in realistic ways. If Cisco wishes to participate in a neutral, third-party refereed
bake-off test, rather than hiding their methodology in the shadows, then Brocade would be happy to
meet them head to head. But until they are willing to have competitive testing performed in the open,
there can be no evidentiary value assigned to their biased and unsubstantiated claims.


© 2008 Brocade Communications Systems, Inc. All Rights Reserved.


Brocade, the Brocade B-weave logo, McDATA, Fabric OS, File Lifecycle Manager, MyView, Secure Fabric OS, SilkWorm, and
StorageX are registered trademarks and the Brocade B-wing symbol and Tapestry are trademarks of Brocade Communications
Systems, Inc., in the United States and/or in other countries. FICON is a registered trademark of IBM Corporation in the U.S. and
other countries. All other brands, products, or service names are or may be trademarks or service marks of, and are used to
identify, products or services of their respective owners.
Notice: This document is for informational purposes only and does not set forth any warranty, expressed or implied, concerning
any equipment, equipment feature, or service offered or to be offered by Brocade. Brocade reserves the right to make changes
to this document at any time, without notice, and assumes no responsibility for its use. This informational document describes
features that may not be currently available. Contact a Brocade sales office for information on feature and product availability.
Export of technical data contained in this document may require an export license from the United States government.

