NOVEMBER 2010
on this interest as well by supplying the core DSP building block and by differentiating its features from those of the C6x and StarCore DSPs used by TI and Freescale, respectively. Meanwhile, TI's strategic gaze has turned toward analog technologies and microcontrollers. Although DSPs remain an important technology for the company, they no longer have the prominence they once did, resulting in longer product-introduction cycles in some markets and thereby creating opportunities for other DSP-technology suppliers, particularly Ceva.
Table 1. Key attributes of the Ceva-X1643 and Ceva-XC323 DSPs. Sporting four MAC units, the X1643 is a high-end general-purpose DSP. The XC323 adds even more number-crunching power by way of its two vector units, making it well suited to handling 4G-cellular signal processing. *40nm G process, SVT library, postlayout, worst-case process. (Source: Ceva)
1,056KB configured as a mixture of TCM and cache. Instructions are fetched 256 bits at a time and are queued pending decoding. Basic instructions are either 16 or 32 bits wide, and a VLIW packet can have from one to eight basic instructions. Instruction words therefore range from 16 bits to 256 bits. A dedicated 64-bit or 128-bit AXI master retrieves instructions from main memory. All AXI controllers support AXI's low-power mode. In addition to the AXI ports, the X1643 has an APB port for accessing low-speed peripherals built around the DSP. The X1643 provides a basic memory-control unit. Although virtual memory is not supported, as would be required for a general-purpose processor running a high-level operating system (OS), this unit provides protection among different processes to assist memory management by a real-time OS (RTOS). RTOSs supporting Ceva-X include Nucleus from Mentor Graphics, OSEck from Enea, and ThreadX from Express Logic. The memory-control unit is flexible, allowing the programmer to apply policies to particular address ranges and to change these policies on the fly. Policies include whether to cache the range and whether to enable hardware prefetching of data in the range.
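The packet-size range quoted above follows directly from the instruction widths and slot count. The sketch below only restates that arithmetic; the function name and the representation of a packet as a list of widths are hypothetical and do not reflect Ceva's tools or encoding.

```python
# Hypothetical model of the size rules stated in the text:
# basic instructions are 16 or 32 bits, and a VLIW packet holds one to eight.
FETCH_WIDTH_BITS = 256      # instruction-fetch width given in the text
VALID_WIDTHS = (16, 32)     # basic instruction widths given in the text
MAX_SLOTS = 8               # maximum basic instructions per packet


def packet_width_bits(instruction_widths):
    """Return the total width in bits of one VLIW packet."""
    if not 1 <= len(instruction_widths) <= MAX_SLOTS:
        raise ValueError("a packet holds one to eight basic instructions")
    for width in instruction_widths:
        if width not in VALID_WIDTHS:
            raise ValueError("basic instructions are 16 or 32 bits wide")
    return sum(instruction_widths)


smallest = packet_width_bits([16])       # a single 16-bit instruction
largest = packet_width_bits([32] * 8)    # eight 32-bit instructions
print(smallest, largest)
```

Under these rules the largest packet exactly fills one 256-bit fetch, which is consistent with the fetch width the article gives.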
Figure 1. Block diagram of Ceva-X1643 DSP. A VLIW design, the X1643 can dispatch eight instructions per cycle: one to each of the four computation-unit blocks, three to the data-address generation blocks, and one to the program-control unit. The X1643 features both caches and tightly coupled memories.
a more significant risk on the base-station side of a 4G wireless link owing to the additional processing and precision required. Ceva also added new instructions to the vector unit. For example, these include instructions to accelerate Viterbi coding, which is handled by a hardware accelerator in XC321 designs. The company moved this function to software for the infrastructure-targeted XC323 because it expects designers to use multiple DSP cores allocated to a changing set of functions depending on the task at hand. Providing dedicated hardware would consume die area for a function not consistently used. The DSP also has instructions to improve performance on algorithms for channel estimation, MIMO detection, interleaving, and other 3G- and 4G-cellular functions. Ceva does not publicly release the instruction set for its advanced DSPs, so it is unclear what changes have been made. Each vector unit in the XC323 has four function blocks: one each for arithmetic, logic, MAC, and division. The division block is optional and also supports instructions for maximum-likelihood decoding, square root, and inverse square root. The XC323's two vector units can complete thirty-two 16×16-bit MACs or sixty-four 16×8-bit MACs per cycle. The units can operate in lockstep as a double-wide SIMD unit, or they can operate independently. They can collectively issue four instructions per cycle, using four of the eight VLIW slots in the XC323 architecture. Like the X1643, the XC323 has a general-purpose computation unit with four function blocks. All four can perform arithmetic operations and multiply-accumulate operations on 16-bit operands and 40-bit accumulators. One also handles shifts and the other arithmetic and logic operations, as in the X1643. Ceva developed the computation unit with an eye toward supporting C-based general-purpose processing and 2G/3G baseband processing, deriving the unit from that in the X1643. The XC321, in contrast, had a simpler general computation unit.
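To put the vector-unit MAC rates in perspective, the following sketch estimates an ideal cycle count for a dot product. It is a back-of-the-envelope model only: the function names are hypothetical, and it ignores memory stalls, pipeline latency, and the four scalar MACs in the general-purpose computation unit.

```python
# Peak MACs per cycle across both vector units, as stated in the text.
VECTOR_UNITS = 2
MACS_PER_CYCLE = {"16x16": 32, "16x8": 64}


def macs_per_unit(mode):
    """Peak MACs per cycle for one vector unit operating independently."""
    return MACS_PER_CYCLE[mode] // VECTOR_UNITS


def ideal_cycles(num_macs, mode="16x16"):
    """Ceiling of num_macs / peak rate, with both units working in lockstep."""
    per_cycle = MACS_PER_CYCLE[mode]
    return -(-num_macs // per_cycle)   # integer ceiling division


print(macs_per_unit("16x16"), macs_per_unit("16x8"))
print(ideal_cycles(1024, "16x16"), ideal_cycles(1024, "16x8"))
```

For a 1,024-tap dot product, this idealized model gives 32 cycles in 16×16-bit mode and 16 cycles in 16×8-bit mode; real code would take longer once data movement is included.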
The XC323 also supports accelerated context switching. The quad-MAC capability provides horsepower for legacy baseband protocols. Owing to these features, the XC323 may appeal to designers of mobile basebands in addition to designers of base stations, as they may be able to consolidate in a single XC323 some functions performed by legacy 2G/3G modems and control CPUs. To keep the wide vector units fed with data, the XC323 has wide paths to memory. Two 1,024-bit paths connect to the data TCM. The AXI ports connected to the data-addressing unit are 128 bits wide. Like the X1643, the XC323 fetches instructions 256 bits at a time. Because a baseband processor for a base station is likely to have multiple DSPs, the XC323 incorporates features for multicore designs. The bus interface controllers snoop the AXI bus to provide a degree of coherence among the DSP cores' local memories. One DSP can also request
Ceva claims that 9 to 12 XC323s can handle a three-sector Category 5 transceiver card, compared with 12 triple-core TCI6488 or 6 hex-core MSC8156 chips (i.e., a total of 36 TI or Freescale DSP cores) and their associated accelerators. TI, however, is preparing a counterstrike. In February 2010, the company announced that it is working on a multicore architecture designed to deliver 256 gigaMACs per second (GMACS). Assuming this performance is divided among eight cores, the design would provide the same per-core performance (in terms of GMACS) as the XC323, suggesting TI is also adding vector capabilities. This new architecture has the added advantage of supporting both fixed- and floating-point operations, likely a better approach to providing enhanced precision for complex algorithms than merely extending precision from 16 to 32 bits.
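The core counts and per-core figures in this comparison can be reproduced with simple arithmetic. In the sketch below, the variable names are illustrative, the eight-core split is the article's own assumption, and the XC323 clock rate is not stated in this passage: the value computed is merely what the equal-per-core claim would imply given 32 vector MACs per cycle.

```python
# Core counts for the competing transceiver-card solutions named in the text.
ti_cores = 12 * 3          # twelve triple-core TCI6488 chips
freescale_cores = 6 * 6    # six hex-core MSC8156 chips

# TI's announced target, split across an assumed eight cores (per the article).
ti_total_gmacs = 256
ti_assumed_cores = 8
ti_gmacs_per_core = ti_total_gmacs / ti_assumed_cores

# XC323 peak 16x16-bit vector MACs per cycle, from the article.
xc323_macs_per_cycle = 32

# Inference, not a quoted specification: the clock rate at which the XC323
# would match TI's assumed per-core figure.
implied_xc323_clock_ghz = ti_gmacs_per_core / xc323_macs_per_cycle

print(ti_cores, freescale_cores)
print(ti_gmacs_per_core, implied_xc323_clock_ghz)
```

Both competing solutions come to 36 cores, matching the total quoted above, and the eight-way split yields 32 GMACS per core.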