The real difference in throughput between the DLPI and TLI APIs cannot be seen over Fast Ethernet, because both types of test saturate the link near its theoretical limit (100 Mbits/sec). To observe more definitive differences between the protocol stacks and APIs, we ran the same types of test over Gigabit Ethernet. The DLPI tests show the kind of throughput that can be achieved by stripping out the intermediate layers (TIMOD, UDP, and IP).
Figure 3.8 shows the sender and receiver throughput using the DLPI and TLI APIs, and their respective stacks, for message sizes up to the Maximum Transmission Unit (MTU) of 1500 bytes (footnote 3.2). As one would expect, DLPI achieves higher throughput, since the DLPI stack has fewer processing modules. Throughput increases almost linearly with message size, peaking at the sender's side at around 280 Mbits/sec for DLPI and 250 Mbits/sec for TLI. The sudden drop in performance beyond 1400 bytes is likely due to fragmentation of the messages into multiple Ethernet MTUs. Packet loss occurs at the receiver in these tests because there is no flow control between the sender and receiver. Figure 3.8 also shows the overhead of the extra processing modules in the TLI stack compared to the DLPI stack. From the results in the graphs, a throughput gain of around 7% was observed for DLPI over TLI.
The test was continued over TLI for large message sizes. As previously pointed out, the DLPI driver and Ethernet frame limitations restrict DLPI to a 1500-byte payload. With TLI, however, the intermediate processing modules handle segmentation and reassembly, so the test was continued over Gigabit Ethernet using TLI up to the maximum Transport Service Data Unit (TSDU) of 65,507 bytes.
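The relationship between message size and the number of Ethernet frames on the wire can be sketched as follows. This is a minimal illustration, not part of the test harness: it assumes UDP over IPv4 with a 20-byte IP header (no options) and an 8-byte UDP header, and the function name `ip_fragments` is hypothetical. Under those assumptions the 65,507-byte TSDU limit is simply 65,535 minus the two headers.

```python
import math

IP_HEADER = 20    # bytes; IPv4 header without options (assumption)
UDP_HEADER = 8    # bytes

# Largest UDP payload that fits in one IPv4 datagram
MAX_TSDU = 65535 - IP_HEADER - UDP_HEADER


def ip_fragments(payload_bytes, mtu=1500):
    """Estimate how many IPv4 fragments one UDP message needs."""
    datagram = payload_bytes + UDP_HEADER
    room = mtu - IP_HEADER            # IP payload capacity of one packet
    if datagram <= room:
        return 1                      # fits in a single frame
    aligned = room // 8 * 8           # non-final fragments carry a multiple of 8 bytes
    return 1 + math.ceil((datagram - room) / aligned)


if __name__ == "__main__":
    for size in (1000, 1472, 1473, 3000, 60000, MAX_TSDU):
        print(size, ip_fragments(size))
```

With a 1500-byte MTU, this estimate gives one frame for payloads up to 1472 bytes and 45 frames for a maximum-size TSDU.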
The results are shown in Figure 3.9. The graph shows a continuous decrease in packet loss at the receiver with increasing message size at the sender. For instance, with a message size of 3000 bytes, sender throughput is around 270 Mbits/sec, while receiver throughput is only 8 Mbits/sec (97% packet loss). With a message size of 60,000 bytes and the same sender throughput of 270 Mbits/sec, however, packet loss at the receiver drops to 20%.
To explain this improvement, and the general convergence of the sender and receiver throughput, one should understand how the Stream head allocates messages. The STREAMS subsystem provides a number of variable-sized kernel buffers that are pre-allocated by the system for STREAMS messages. When the Stream head copies a user buffer into kernel space, it attempts to place it in a best-fit pre-allocated kernel buffer. If the user message exceeds the largest allocatable kernel buffer size, the Stream head segments the message and copies it into a number of kernel buffers chained together into a complex STREAMS message.
For very large message sizes (e.g. 60,000 bytes), this segmentation incurs additional latency. In addition, if the STREAMS subsystem runs low on memory, the kernel dynamically allocates additional memory (up to a threshold value), which can also add latency. Finally, memory copy operations are known to be expensive, so more system time is spent doing the actual copying; this is likely the biggest contributor to latency. All of these operations introduce extra delay in the sender, which in turn reduces the number of packets sent out on the wire during a fixed interval. The delay incurred at the sender gives the receiver more time to handle incoming packets. In essence, sending very large user messages works as a crude flow-control mechanism, allowing better throughput measurements.
For instance, in the TLI library, t_sndudata() is used by the sender to send connectionless datagrams. The call returns when the Stream head has finished copying the user buffer into the STREAMS kernel buffers. For the duration of our test, we noted that t_sndudata() was executed 212,388 times for a message size of 1000 bytes, and only 4,854 times for a message size of 65,507 bytes.
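The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes that loss can be estimated as one minus the ratio of receiver to sender throughput (reasonable for fixed-size messages), and that both t_sndudata() counts cover test runs of the same duration. The helper name `packet_loss` is hypothetical.

```python
def packet_loss(sender_mbps, receiver_mbps):
    """Fraction of sent traffic that did not reach the receiving application."""
    return 1.0 - receiver_mbps / sender_mbps


# Throughput figures quoted in the text (Mbits/sec)
loss_3000 = packet_loss(270, 8)          # 3000-byte messages
receiver_60000 = 270 * (1 - 0.20)        # implied by 20% loss, not read off the graph

# t_sndudata() call counts quoted in the text, keyed by message size in bytes
calls = {1000: 212_388, 65_507: 4_854}
bytes_handed = {size: size * n for size, n in calls.items()}
call_ratio = calls[1000] / calls[65_507]

if __name__ == "__main__":
    print(f"loss at 3000 bytes: {loss_3000:.0%}")
    print(f"implied receiver throughput at 60,000 bytes: {receiver_60000:.0f} Mbits/sec")
    for size, total in bytes_handed.items():
        print(f"{size}-byte messages: {total:,} bytes handed to the Stream head")
    print(f"call ratio: {call_ratio:.1f}x")
```

Under these assumptions, the sender makes roughly 44 times fewer calls with the largest message size, while each call copies about 65 times more data.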
The SUN Gigabit card used in the tests limits the maximum Ethernet frame size to 1518 bytes (a 1500-byte MTU plus header and checksum). It seems logical from the graph that, if this limitation were removed, much higher throughput could be achieved for both DLPI and TLI. This would also be the case for Sockets, since they are similar to TLI (using SOCKMOD instead of TIMOD). Many vendors [1] do provide hardware support for larger Gigabit Ethernet MTUs (also called "jumbo frames", up to 9000 bytes). This would require additional support in the switch hardware as well as in the network adaptors. In addition, the network device drivers would need to be modified to provide this support (e.g. the SUN Gigabit Ethernet driver would need to modify the DLPI interface to support larger frames).
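A rough way to see why larger frames could help is to count how many frames per second the stack must process to sustain a given bit rate. The sketch below is an estimate only: it ignores Ethernet, IP, and UDP header overhead and assumes every frame is filled to the MTU. The function name `frames_per_second` is hypothetical.

```python
def frames_per_second(throughput_mbps, mtu_bytes):
    """Frames per second needed to carry throughput_mbps with full-size frames."""
    return throughput_mbps * 1_000_000 / (mtu_bytes * 8)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        rate = frames_per_second(270, mtu)
        print(f"MTU {mtu}: {rate:,.0f} frames/sec at 270 Mbits/sec")
```

At the 270 Mbits/sec sender rate quoted earlier, a 9000-byte MTU would cut the frame rate by a factor of six, and per-frame processing cost would fall accordingly.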