©Celoxica 2009. T: +44 (0)1235 863656
FPGA co-processors have long been discussed in high-performance computing circles as a way to dramatically increase system performance and offload computation from CPUs. Celoxica's Accelerator is the first market data handler to connect a network interface directly to a co-processor, eliminating one of the major contributors to latency in a hardware/software co-processing system: the peripheral bus transactions between the co-processor and the network device.
Once in the co-processor, the network data is processed at line speed in a highly parallel pipeline. The TCP/UDP/IP protocol stack, A/B feed arbitration, FAST and ITCH decoders, and message filtering based on customer-defined criteria are all implemented in dedicated hardware for the lowest possible latency.
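To give a feel for the kind of work a FAST decoder performs, the following is a minimal software sketch of decoding one stop-bit encoded unsigned integer, the basic field encoding used by the FAST protocol. It is an illustration only: the function name `decode_stop_bit_uint` is hypothetical, and the sketch says nothing about how the Accelerator implements decoding in hardware.

```python
from typing import Tuple


def decode_stop_bit_uint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one FAST stop-bit encoded unsigned integer.

    Each byte carries 7 data bits, most significant group first.
    The high bit is set only on the final byte of the field.
    Returns (value, offset_of_next_field).
    """
    value = 0
    pos = offset
    while pos < len(buf):
        byte = buf[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)  # append the 7 data bits
        if byte & 0x80:  # stop bit: this was the last byte of the field
            return value, pos
    raise ValueError("buffer ended before a stop bit was found")
```

For example, the three bytes 0x39 0x45 0xA3 decode to the value 942755, with the next field starting at offset 3.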
The co-processor board fits into a standard PCI Express (PCIe) or HyperTransport (HTX) slot, both of which offer high-speed, low-latency interconnects. Once converted to a binary representation, the network data is transferred over the high-speed bus directly to the CPU's memory. A C/C++ API is then available for you to integrate this system with your own application, getting feed updates before the competition.
A common data feed topology has a pair of redundant UDP multicast feeds, known as the A and B feeds. Thanks to the parallel processing pipelines for each feed, the FPGA can detect errors in the A feed and switch to the B feed with zero latency. If the error occurs in both feeds, the FPGA can immediately request retransmission of the missing data over UDP or TCP, which can save many microseconds of latency compared with recovery through a standard network card and the software network stack.
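The arbitration logic can be modelled in software as follows. This is a simplified sketch, not the Accelerator's implementation: it assumes every message carries a sequence number that increases by one (common on exchange feeds, but not stated above), and the class name `FeedArbiter` and its methods are hypothetical. The first copy of each sequence number is forwarded regardless of which feed supplied it, duplicates are dropped, and any sequence number that neither feed has supplied remains in `missing` as a candidate for a retransmission request.

```python
class FeedArbiter:
    """Software model of first-arrival arbitration across two redundant feeds."""

    def __init__(self, first_seq: int = 1):
        self.next_seq = first_seq  # lowest sequence number not yet seen on any feed
        self.missing = set()       # skipped sequence numbers not yet filled by either feed
        self.delivered = []        # (seq, feed, payload) in the order forwarded

    def on_message(self, feed: str, seq: int, payload: bytes) -> bool:
        """Return True if the message is forwarded, False if it is a duplicate."""
        if seq in self.missing:
            # A late copy from the other feed fills an earlier gap.
            self.missing.discard(seq)
        elif seq >= self.next_seq:
            # Anything skipped over is recorded as missing until a copy arrives.
            self.missing.update(range(self.next_seq, seq))
            self.next_seq = seq + 1
        else:
            return False  # already forwarded from the other feed
        self.delivered.append((seq, feed, payload))
        return True
```

In this model a message lost on feed A is simply taken from feed B when it arrives, and only sequence numbers lost on both feeds stay in `missing`. Note that a late gap-filling message is forwarded out of sequence order; a production arbiter would need a policy for reordering and for how long to wait before requesting retransmission.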
Rather than receiving the full feed on the host CPU, you can define custom filters through a simple API, dramatically reducing the amount of data that the CPU has to handle.
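As an illustration of what a customer-defined filter expresses, here is a small software model that keeps only updates for a chosen set of symbols. The `Update` fields and the helper names `symbol_filter` and `filter_updates` are assumptions made for this sketch; they are not the Accelerator API, and on the Accelerator the filtering itself runs in hardware before data reaches the host.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Update:
    """A decoded feed update (hypothetical fields, for illustration only)."""
    symbol: str
    price: float
    size: int


def symbol_filter(symbols: Iterable[str]) -> Callable[[Update], bool]:
    """Build a predicate that accepts only updates for the given symbols."""
    wanted = frozenset(symbols)
    return lambda update: update.symbol in wanted


def filter_updates(updates: Iterable[Update],
                   predicate: Callable[[Update], bool]) -> Iterator[Update]:
    """Yield only the updates that satisfy the predicate."""
    return (u for u in updates if predicate(u))
```

A host application interested in two instruments out of several thousand would then see only the updates matching its predicate, which is the reduction in CPU load described above.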
Because the co-processor FPGA is connected directly to the network port, the Accelerator can provide consistently low latency that is independent of data volume. The high-speed link to the host CPU provides the lowest-latency bus transfers available in standard servers.
