©Celoxica 2008. T: +44 (0)1235 863656
Hardware co-processors have been discussed for a while in high-performance computing circles as a method for dramatically increasing system performance and offloading computation from CPUs. Celoxica's Accelerator is the first market data handler to directly connect a network interface to a co-processor, eliminating one of the major contributors to latency in a hardware/software co-processing system: the peripheral bus transactions between the co-processor and the network device.
Celoxica leverages the latest FPGA technology for the co-processing. The structure of FPGAs allows the sequential tasks of parsing several protocol layers to be executed in parallel, in what is known as a pipeline. A separate pipeline is run for each network port. This means that a complex protocol stack can reliably run at wire-speed without missing a single packet. Importantly, each protocol layer only requires a small number of extra pipeline stages, which add an extra latency measured in tens of nanoseconds with no effect on throughput. The protocols that are handled in the FPGA include IP, UDP, TCP and the financial message format in use, which may be FAST, FIX or a proprietary format.
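As a rough software model of the layered parsing described above, the following C sketch walks an IPv4 header and then a UDP header to locate the message payload. The type `udp_view` and the functions `parse_ipv4`, `parse_udp` and `parse_ip_udp` are hypothetical names chosen for illustration; they are not part of any Celoxica product, and the input is assumed to start at the IP header. On a CPU the two stages run one after the other for every packet, whereas in an FPGA pipeline each stage would be its own block of logic, so a new packet can enter the first stage while an earlier one is still in the second.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: where the UDP payload sits inside the packet. */
typedef struct {
    uint16_t       src_port;
    uint16_t       dst_port;
    const uint8_t *payload;
    size_t         payload_len;
} udp_view;

/* Read a 16-bit big-endian (network order) field. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Stage 1: validate the IPv4 header and report where the UDP segment starts. */
static int parse_ipv4(const uint8_t *pkt, size_t len,
                      size_t *l4_off, size_t *l4_len)
{
    if (len < 20 || (pkt[0] >> 4) != 4)
        return -1;                            /* too short or not IPv4 */
    size_t ihl = (size_t)(pkt[0] & 0x0F) * 4; /* header length in bytes */
    size_t total = rd16(pkt + 2);             /* total length field */
    if (ihl < 20 || total < ihl || total > len)
        return -1;
    if (pkt[9] != 17)
        return -1;                            /* protocol 17 is UDP */
    *l4_off = ihl;
    *l4_len = total - ihl;
    return 0;
}

/* Stage 2: validate the UDP header and expose the payload. */
static int parse_udp(const uint8_t *seg, size_t len, udp_view *out)
{
    if (len < 8)
        return -1;
    size_t udp_len = rd16(seg + 4);           /* header plus payload */
    if (udp_len < 8 || udp_len > len)
        return -1;
    out->src_port    = rd16(seg);
    out->dst_port    = rd16(seg + 2);
    out->payload     = seg + 8;
    out->payload_len = udp_len - 8;
    return 0;
}

/* Run both stages in sequence; returns 0 on success, -1 on any error. */
int parse_ip_udp(const uint8_t *pkt, size_t len, udp_view *out)
{
    assert(pkt != NULL && out != NULL);
    size_t off = 0, l4_len = 0;
    if (parse_ipv4(pkt, len, &off, &l4_len) != 0)
        return -1;
    return parse_udp(pkt + off, l4_len, out);
}
```

A further stage for the financial message format would take `payload` and `payload_len` as its input in the same way, which is why adding a layer costs only a few extra stages rather than a second pass over the packet.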
A common data feed topology has a pair of redundant UDP multicast feeds, known as the A and B feeds. Thanks to the parallel processing pipelines for each feed, the FPGA can detect errors in the A feed and switch to the B feed with zero latency. If the error occurs in both feeds, the FPGA can immediately request retransmission of the missing data over UDP or TCP; this can save many microseconds of latency compared with recovery using a standard network card and the software network stack.
The processing pipeline within the FPGA can also be extended to perform a variety of processing tasks on the data payload itself. The FPGA implements filtering on the content of the messages arriving, which can be customised to the user’s needs. This filtering means that the user’s application only gets the information that is of relevance, reducing the CPU load for processing the feed. Messages can also be translated into a binary structure that can be read directly from the user’s application, avoiding any processing time associated with converting message formats on the CPU.
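To make the filtering and translation steps concrete, here is a minimal C sketch that turns a FIX-style `tag=value` message into a fixed-layout binary record and reports whether it matches a symbol of interest. The `quote` layout, the fixed-point price scaling and the function name `translate_and_filter` are assumptions made for illustration, and the sketch performs no validation of the digits it reads. It uses the standard FIX tags 55 (Symbol), 44 (Price) and 38 (OrderQty) with the SOH (0x01) field delimiter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SOH '\x01'  /* FIX field delimiter */

/* Hypothetical binary record handed to the application. */
typedef struct {
    char     symbol[8];  /* NUL-terminated, truncated to 7 characters */
    int64_t  price_e4;   /* price multiplied by 10^4                  */
    uint32_t qty;
} quote;

/* Parse unsigned decimal digits (no validation in this sketch). */
static uint64_t parse_uint(const char *s, size_t n)
{
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v = v * 10 + (uint64_t)(s[i] - '0');
    return v;
}

/* Parse "123.45" into 1234500: fixed point with four decimal places. */
static int64_t parse_price_e4(const char *s, size_t n)
{
    size_t dot = 0;
    while (dot < n && s[dot] != '.')
        dot++;
    int64_t whole = (int64_t)parse_uint(s, dot);
    int64_t frac = 0;
    int digits = 0;
    for (size_t i = dot + 1; i < n && digits < 4; i++, digits++)
        frac = frac * 10 + (s[i] - '0');
    for (; digits < 4; digits++)
        frac *= 10;              /* pad missing decimal places */
    return whole * 10000 + frac;
}

/* Fill *out from msg; return true if the symbol matches want_symbol. */
bool translate_and_filter(const char *msg, size_t len,
                          const char *want_symbol, quote *out)
{
    assert(msg != NULL && want_symbol != NULL && out != NULL);
    memset(out, 0, sizeof *out);
    size_t i = 0;
    while (i < len) {
        size_t eq = i;
        while (eq < len && msg[eq] != '=' && msg[eq] != SOH)
            eq++;
        size_t end = eq;
        while (end < len && msg[end] != SOH)
            end++;
        if (eq < end) {          /* field has the form tag=value */
            uint64_t tag = parse_uint(msg + i, eq - i);
            const char *val = msg + eq + 1;
            size_t vlen = end - eq - 1;
            if (tag == 55) {
                size_t n = vlen < sizeof out->symbol - 1
                         ? vlen : sizeof out->symbol - 1;
                memcpy(out->symbol, val, n);
            } else if (tag == 44) {
                out->price_e4 = parse_price_e4(val, vlen);
            } else if (tag == 38) {
                out->qty = (uint32_t)parse_uint(val, vlen);
            }
        }
        i = end + 1;             /* step past the delimiter */
    }
    return strcmp(out->symbol, want_symbol) == 0;
}
```

An application that only trades one instrument would discard every message for which the function returns false and read the numeric fields of the rest directly from the record, with no text parsing left to do on the CPU.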
The co-processor board fits into a standard HyperTransport slot, which is the lowest-latency server communications bus available today. The HyperTransport core implemented by Celoxica in the FPGA is optimised for low-latency data transfer. Once converted to a binary representation, the network data is transferred via the HyperTransport bus directly to the CPU's memory. This transfer starts as soon as the data arrives off the wire, rather than buffering a complete packet before starting the transfer. The Accelerator Card initiates the data transfer itself, rather than waking up the software application to service the data waiting on the card. The C API is designed to be as simple, lightweight and low-latency as possible. Unlike the interrupt-driven API of a normal network card, the API uses a querying mechanism, which shaves several microseconds off the latency of receipt of data.
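The querying approach can be sketched as a ring buffer in host memory that the application polls. This is not the Celoxica API: the names `ring`, `record`, `ring_init`, `ring_push` and `ring_poll` are hypothetical, and `ring_push` merely stands in for the card writing a record into CPU memory. The point of the sketch is that `ring_poll` returns immediately whether or not data is present, so the application never sleeps waiting for an interrupt.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8u  /* must be a power of two */

/* Hypothetical binary record as it would appear in host memory. */
typedef struct {
    uint64_t seq;
    int64_t  price_e4;
    uint32_t qty;
} record;

/* Single-producer, single-consumer ring shared by writer and reader. */
typedef struct {
    _Atomic uint32_t head;  /* count of records written (producer) */
    _Atomic uint32_t tail;  /* count of records read (consumer)    */
    record slots[RING_SLOTS];
} ring;

void ring_init(ring *r)
{
    assert(r != NULL);
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side: in this sketch it plays the role of the card. */
bool ring_push(ring *r, const record *rec)
{
    assert(r != NULL && rec != NULL);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;                       /* ring is full */
    r->slots[head % RING_SLOTS] = *rec;
    /* Publish the record only after its contents are written. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: non-blocking query for the next record. */
bool ring_poll(ring *r, record *out)
{
    assert(r != NULL && out != NULL);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail == head)
        return false;                       /* nothing new yet */
    *out = r->slots[tail % RING_SLOTS];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

A typical caller would spin on `ring_poll` in a tight loop on a dedicated core. That trades CPU time for latency, since no interrupt handling or context switch sits between the arrival of a record and the application reading it.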