©Celoxica 2008. T: +44 (0)1235 863656
To demonstrate the performance of Celoxica’s packet capture technology, we benchmarked our hardware solution against a standard Gigabit Ethernet card. We tuned the Ethernet card to minimise latency so that the comparison would be as fair as possible. The Accelerator Card by Celoxica was also used as the measurement platform, because it provides very fine timestamp resolution.
The benchmarks were performed on a standard HP DL145 1U server with the latest production BIOS from HP.
The Ethernet card was optimised for latency by disabling packet coalescing, as this is the most likely configuration in a production trading application where latency is a major concern. The system was set up to provide a fair comparison.
Accurate latency benchmarking for networking applications is notoriously difficult due to variations in latency through a software network stack, buffering on a network card and the difficulty of measuring time accurately. The Accelerator Card itself provides an ideal platform for benchmarking both itself and any Gigabit Ethernet card. The Accelerator Card runs a clock at 125 MHz (for the Gigabit Ethernet data rate), so timestamps can be taken with 8 ns resolution. Packets can be generated in hardware on the Accelerator Card and a timestamp taken when the packet is sent from one Ethernet port directly connected to the FPGA. When the host application receives the complete packet via the Ethernet card under test, it can tell the Accelerator Card via the HyperTransport bus to generate another timestamp, which gives the total trip time.
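As a rough illustration of the timestamp arithmetic described above, the sketch below converts two raw clock counts into a trip time. It assumes the timestamps are free-running tick counts of the 125 MHz clock; the function name and the 32-bit counter width are hypothetical and are not taken from the Accelerator Card's interface.

```python
CLOCK_HZ = 125_000_000  # clock rate quoted in the text

# One tick of a 125 MHz clock lasts 8 ns.
NS_PER_TICK = 1_000_000_000 // CLOCK_HZ


def trip_time_ns(start_ticks: int, end_ticks: int, counter_bits: int = 32) -> int:
    """Return the elapsed time in nanoseconds between two tick counts.

    The modulo handles a single wrap of the counter. The 32-bit width
    is an assumption made for this example only.
    """
    elapsed_ticks = (end_ticks - start_ticks) % (1 << counter_bits)
    return elapsed_ticks * NS_PER_TICK
```

With these assumptions, two timestamps that are 1000 ticks apart correspond to a trip time of 8000 ns, that is, 8 microseconds.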
The two diagrams below show how the test works for benchmarking the Accelerator Card and the Ethernet card. The time that we are measuring is the time from the end of the packet being sent out onto the network to the complete packet being in user memory.
1. Timestamp on sending SOP on Port 0
2. Port 1 receives packet
3. Packet transferred to host
4. EOP appears in user memory
5. CPU requests timestamp from RCHTX
6. Timestamp taken

1. Timestamp on sending SOP on Port 0
2. Eth 0 receives packet
3. Packet transferred to host
4. EOP appears in user memory
5. CPU requests timestamp from RCHTX
6. Timestamp taken

The graph below shows the results that were measured for a single packet. The bottom line in blue shows the transmission time for the packet across the Gigabit Ethernet link: this is a constant dictated by the very nature of the network. The red crosses show the measurements taken for the built-in Ethernet card: you can see that the gradient of this line is steeper than the transmission time, demonstrating a linear growth in latency with the size of the packet. The Accelerator Card measurements, in green, show much lower latency than the Ethernet card, with no variation with the size of the packet.

The next graph shows the same data, but this time with the transmission time subtracted from the measurements for the Ethernet card and the Accelerator Card. This clearly shows the constant latency, independent of packet size, for the Accelerator Card by Celoxica.
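The subtraction can be sketched as follows. This is a minimal example, assuming the transmission time is simply the packet length divided by the 1 Gbit/s line rate. The `extra_wire_bytes` parameter is a hypothetical knob for counting CRC and preamble bytes; the text does not say whether the plotted transmission line includes them.

```python
LINK_BITS_PER_SECOND = 1_000_000_000  # Gigabit Ethernet line rate


def transmission_time_us(packet_bytes: int, extra_wire_bytes: int = 0) -> float:
    """Time in microseconds to serialise a packet onto a 1 Gbit/s link."""
    bits = (packet_bytes + extra_wire_bytes) * 8
    return bits * 1_000_000 / LINK_BITS_PER_SECOND


def residual_latency_us(measured_us: float, packet_bytes: int,
                        extra_wire_bytes: int = 0) -> float:
    """Measured latency with the wire transmission time removed."""
    return measured_us - transmission_time_us(packet_bytes, extra_wire_bytes)
```

Under this model each byte costs 8 ns on the wire, so a 1500-byte packet takes 12 microseconds to transmit. A card that adds a fixed delay on top of the transmission time gives a flat line after the subtraction, whereas a card whose delay grows with packet size still shows a slope.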

The measurements recorded and displayed above include the constant times for the transmit PHY (< 0.1 microseconds) on the Accelerator Card for sending the test data out and the read request over the HyperTransport bus to take the second timestamp (~0.25 microseconds).
After testing receipt of a single packet, we devised a test that would simulate real-world situations under varying network load. This test sends bursts of 500 packets from the Accelerator Card, inserting a delay between packets: the longer the delay, the lower the network load. We ran the test for a number of packet lengths to see how that impacted performance.
The resultant graph shows the latency (y) for different network loads (x). Each of the lines represents the measurements for a given packet size, except for the Accelerator Card measurements, which are all the same, so are only plotted once. It is clear from this graph that the Accelerator Card shows consistent, low latency, regardless of the size of the packet and the network load.
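The relationship between inter-packet delay and network load can be illustrated with one plausible definition of load: the fraction of time the link spends transmitting. This is an assumption made for the sketch; the text does not state how the x-axis of the graph was computed, and the function name is hypothetical.

```python
def network_load(packet_bytes: int, gap_us: float) -> float:
    """Fraction of link time spent transmitting, between 0 and 1.

    Assumes a 1 Gbit/s link (1000 bits per microsecond) and that
    gap_us is the idle time between the end of one packet and the
    start of the next.
    """
    if packet_bytes <= 0 or gap_us < 0:
        raise ValueError("packet_bytes must be positive and gap_us non-negative")
    tx_us = packet_bytes * 8 / 1000
    return tx_us / (tx_us + gap_us)
```

Under this definition a zero delay corresponds to a fully loaded link, and for a fixed delay smaller packets produce a lower load but a higher packet rate, which is the case the burst test stresses.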

The Ethernet card manages to give consistent latency for larger packets, but cannot keep up with bursts of smaller packets. The steep lines for 64-, 128- and 256-byte packets show that the latency increases very quickly as the system has to buffer packets. The system rapidly starts dropping packets in these cases.
The Ethernet card was configured with best-latency coalescing options and the IRQ handler was run on a different CPU to the packet receiver. If the IRQ handler is run on the same CPU, the average latency is lowered by around 2 microseconds, but worst-case performance deteriorates. This explains why the latencies on this graph start at 10 microseconds for the Ethernet card, whereas the previous graphs show them starting at around 8 microseconds.
Results were averaged over 1000 repetitions. The quoted packet lengths include payload, UDP, IP and Ethernet headers, but not CRC or preamble.
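To make the packet length convention concrete, the sketch below breaks a quoted length into its parts. It assumes IPv4 without options and an untagged Ethernet header (no VLAN tag), neither of which is stated in the text; the function names are hypothetical.

```python
ETHERNET_HEADER_BYTES = 14  # destination MAC, source MAC, EtherType
IPV4_HEADER_BYTES = 20      # assumes IPv4 with no options
UDP_HEADER_BYTES = 8
CRC_BYTES = 4               # frame check sequence, excluded from quoted lengths
PREAMBLE_BYTES = 8          # preamble plus start-of-frame delimiter, also excluded

HEADER_BYTES = ETHERNET_HEADER_BYTES + IPV4_HEADER_BYTES + UDP_HEADER_BYTES


def udp_payload_bytes(quoted_length: int) -> int:
    """Payload bytes left once the three headers are removed."""
    if quoted_length < HEADER_BYTES:
        raise ValueError("quoted length is shorter than the headers")
    return quoted_length - HEADER_BYTES


def wire_bytes(quoted_length: int) -> int:
    """Bytes actually serialised on the link, including CRC and preamble."""
    return quoted_length + CRC_BYTES + PREAMBLE_BYTES
```

Under these assumptions a quoted length of 64 bytes carries 22 bytes of UDP payload and occupies 76 bytes on the wire, not counting the inter-frame gap.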
These benchmarks demonstrate the unique capabilities of the Accelerator Card by Celoxica: constant ultra-low-latency access to network data regardless of network load. The Accelerator Card is an important component of a solution for traders who need the lowest-latency access to market data, even during spikes in message volumes.