The Homa network protocol
Most networking applications are still based on TCP, which was designed for the efficient and reliable transport of streams of data across a distributed Internet. Data-center applications, instead, are often dominated by large numbers of small messages exchanged between many locally connected hosts. The requirements of TCP, including connection establishment and the ordering of data, add a lot of overhead for that kind of workload. Homa is designed to remove that overhead while taking advantage of what current data-center networking hardware can do, with a focus on minimizing the latency between a request and its response.
A quick Homa overview
At its core, Homa is designed for remote procedure call (RPC) applications; every interaction on a Homa network comes down to a request and associated reply. A client sends a request message, containing a unique request ID, to a server; the server sends back a reply that quotes that ID. The only state kept on the server exists from its receipt of the request until the client has received the response.
Much of the key to the performance of this protocol can be found in how these messages are handled. There is no connection setup; instead, the client starts transmitting the request, with no introductory handshake, to the server. There is a limit on how many bytes of this "unscheduled" request data can be sent in this manner, which is determined by the round-trip time of the network; it should be just high enough to keep the request-transmission pipeline full until an initial response can be received from the server side. The figure of about 10,000 bytes appears in some of the Homa papers.
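As a rough, back-of-the-envelope illustration (the link speed and round-trip time below are assumed values, not taken from the Homa papers), the unscheduled limit is essentially a bandwidth-delay product:

    /* Illustrative only: assumed numbers, not from the article or papers. */
    #include <stdio.h>

    int main(void)
    {
        double link_bytes_per_sec = 10e9 / 8;   /* a 10Gb/s data-center link */
        double rtt_sec = 8e-6;                  /* an 8-microsecond round trip */

        /* Unscheduled data should cover roughly one bandwidth-delay product,
         * keeping the pipe full until the first grant can arrive. */
        printf("unscheduled limit ~= %.0f bytes\n",
               link_bytes_per_sec * rtt_sec);   /* prints 10000 */
        return 0;
    }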
The initial request packet includes the length of the full request. If the request does not fit into the size allowed for the unscheduled data, the client will wait for a "grant" response before sending any more. That grant should, if the server is responding quickly, arrive just as the initial request data has finished transmitting, allowing the client to continue sending without a pause. Grants include a maximum amount of data that can be sent, and thus function like the TCP receive window.
This machinery is intended to get a request to the server as quickly as possible, but without the need for much, if any, buffering in the network path between the two machines. Priority queues are used to manage this traffic, with unscheduled packets normally having the highest priority. Lower priorities are used for granted traffic; the requests with the least amount of data remaining to be received are given the highest priority.
Once the server has received the full request and processed it, a response is sent back to the client. Once again, the initial bytes are sent as unscheduled packets, with grants required for the rest if the response is large enough. In the earlier descriptions of the protocol, the server would forget everything it knew about the request immediately after sending the response. That created the possibility that requests could be resent (if the response never arrives) and executed multiple times. More recent publications include an explicit acknowledgment message indicating that a response has been received, with the sender retaining the necessary state to retransmit a reply until that acknowledgment is received.
The details of the protocol are, of course, rather more complex than described here. There are, for example, mechanisms for clamping down on the amount of unscheduled data sent if a server is finding itself overloaded. The receiving side of a message can request retransmission if an expected packet does not arrive; unlike TCP and many other protocols, Homa puts the responsibility for detecting lost packets onto the receiving side. There is also a fair amount of thought that has gone into letting systems overcommit their resources by issuing more grants than they can immediately handle; the purpose here is to keep the pipelines full even if some senders do not transmit as quickly as expected.
See this paper for a more complete (and surely more correct) description of the Homa protocol, this page for some more recent changes, and this 2022 article for more details.
Homa on Linux
The Unix socket interface was designed around streams and is not a perfect fit for Homa, but the implementation sticks with it to the extent that it can. A single socket, created with socket() and the IPPROTO_HOMA protocol type, can be used to communicate with any number of other systems; Homa can run over either IPv4 or IPv6. On server systems, a bind() call sets up a well-known port to receive requests; clients need not bind to a port.
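As a minimal sketch of that setup (the homa.h header name and the use of SOCK_DGRAM as the socket type are assumptions here, not from the article), a server might create and bind its socket like this:

    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include "homa.h"        /* assumed to define IPPROTO_HOMA */

    int main(void)
    {
        /* One Homa socket can talk to any number of peers; IPv6 used here. */
        int fd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_HOMA);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        /* Servers bind to a well-known port; clients can skip this step. */
        struct sockaddr_in6 addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin6_family = AF_INET6;
        addr.sin6_addr = in6addr_any;
        addr.sin6_port = htons(4000);        /* arbitrary example port */
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind");
            return 1;
        }
        return 0;
    }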
Messages are sent and received, as one might expect, with sendmsg() and recvmsg(), but there are some Homa-specific aspects that developers must be aware of. When sending a message, an application must include a pointer to this structure in the msg_control field of the msghdr structure passed to sendmsg():
    struct homa_sendmsg_args {
        uint64_t id;
        uint64_t completion_cookie;
    };
If a request is being sent, id should be set to zero; the protocol implementation will then assign a unique ID to the request (and write it into id) before sending it to the server. For a reply message, id should be the ID value that arrived with the request being responded to. The completion_cookie value, which is only used for requests, will be passed back to the caller with the reply data when it is received.
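Sending a request might then look roughly like the following sketch; it assumes the homa_sendmsg_args structure shown above and a destination address prepared by the caller:

    #include <netinet/in.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include "homa.h"        /* assumed to define struct homa_sendmsg_args */

    /* Send a new request to "dest"; returns the kernel-assigned RPC ID. */
    static int send_request(int fd, const struct sockaddr_in6 *dest,
                            void *request, size_t length, uint64_t *id_out)
    {
        struct homa_sendmsg_args args = {
            .id = 0,                  /* zero means "this is a new request" */
            .completion_cookie = 1,   /* arbitrary value, returned with the reply */
        };
        struct iovec iov = { .iov_base = request, .iov_len = length };
        struct msghdr hdr = {
            .msg_name = (void *)dest,
            .msg_namelen = sizeof(*dest),
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = &args,
            .msg_controllen = sizeof(args),
        };

        if (sendmsg(fd, &hdr, 0) < 0)
            return -1;
        *id_out = args.id;    /* the implementation wrote the new ID here */
        return 0;
    }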
The receive side is a bit more complicated, because Homa requires that the buffer space for replies be registered before sending the first request on a socket. To do so, the process should allocate a range of memory, then pass it into the kernel with the SO_HOMA_RCVBUF setsockopt() operation, using this structure:
    struct homa_rcvbuf_args {
        void *start;
        size_t length;
    };
The start address must be page-aligned. This memory is split into individual buffers, called "bpages", each of which is HOMA_BPAGE_SIZE in length; that size is 64KB in the current implementation. Each message will occupy at least one bpage; large messages will be scattered across multiple, not necessarily contiguous, bpages.
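Setting up that buffer region might look like this sketch; the header name, the setsockopt() level (assumed here to be IPPROTO_HOMA), and the number of bpages are assumptions:

    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include "homa.h"   /* assumed: SO_HOMA_RCVBUF, HOMA_BPAGE_SIZE,
                           struct homa_rcvbuf_args */

    #define RCVBUF_LENGTH (64 * HOMA_BPAGE_SIZE)   /* room for 64 bpages */

    /* Allocate a page-aligned region and register it with the socket. */
    static void *setup_rcvbuf(int fd)
    {
        /* mmap() returns page-aligned memory, as the API requires. */
        void *region = mmap(NULL, RCVBUF_LENGTH, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED)
            return NULL;

        struct homa_rcvbuf_args args = {
            .start = region,
            .length = RCVBUF_LENGTH,
        };
        if (setsockopt(fd, IPPROTO_HOMA, SO_HOMA_RCVBUF,
                       &args, sizeof(args)) < 0) {
            munmap(region, RCVBUF_LENGTH);
            return NULL;
        }
        return region;
    }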
A message is received by making a call to recvmsg() with a pointer to this structure passed in the msg_control field of struct msghdr:
    struct homa_recvmsg_args {
        uint64_t id;
        uint64_t completion_cookie;
        uint32_t flags;
        uint32_t num_bpages;
        uint32_t bpage_offsets[HOMA_MAX_BPAGES];
    };
The flags field describes what the caller is willing to receive; it is a bitmask that can include either or both of HOMA_RECVMSG_REQUEST (to receive request messages) and HOMA_RECVMSG_RESPONSE (to receive responses). If id is zero, then HOMA_RECVMSG_RESPONSE will cause any response message to be returned; otherwise, only a response corresponding to the provided request ID will be returned. On return, num_bpages will indicate the number of bpages in the registered buffer area that have been used to hold the returned message; bpage_offsets gives the offset of each one.
The bpages returned by this call are owned by the application at this point, and will not be used by the kernel until they have been explicitly returned. That is done with a subsequent recvmsg() call, where num_bpages and bpage_offsets will indicate a set of bpages to be given back.
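Putting the receive side together, a server's receive call might look like the sketch below. It assumes the structures described above; the same homa_recvmsg_args structure is reused across calls so that the bpages still listed from the previous message are handed back when the next one is requested.

    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include "homa.h"   /* assumed: struct homa_recvmsg_args,
                           HOMA_RECVMSG_REQUEST, HOMA_RECVMSG_RESPONSE */

    /* Receive one incoming request; "ctl" must be zeroed before the first
     * call and then reused, so that any bpages it still lists from the
     * previous message are returned to the kernel. */
    static ssize_t receive_request(int fd, struct homa_recvmsg_args *ctl,
                                   struct sockaddr_in6 *peer)
    {
        ctl->id = 0;                          /* accept any incoming message */
        ctl->flags = HOMA_RECVMSG_REQUEST;    /* requests only, no responses */

        struct msghdr hdr = {
            .msg_name = peer,
            .msg_namelen = sizeof(*peer),
            .msg_control = ctl,
            .msg_controllen = sizeof(*ctl),
        };
        ssize_t length = recvmsg(fd, &hdr, 0);
        if (length < 0)
            return -1;

        /* The message body is in the registered buffer region, spread over
         * ctl->num_bpages bpages at the offsets in ctl->bpage_offsets; the
         * request ID to quote in the reply is (presumably) now in ctl->id. */
        return length;
    }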
This code has been "stripped down to the bare minimum" to be able to actually transmit requests and responses across the net; it is evidently about half of the full set of Homa patches. The intent, of course, is to ease the task of reviewing the work and getting initial support into the kernel; the rest of the work can come later. In its current form, according to the cover letter, its performance "is not very interesting", but that is expected to improve once the rest of the work is merged.
See this paper for more information on the Linux implementation of Homa.
Prospects
The Homa protocol originates at Stanford University, with support from a number of technology companies. Academic work often fails to make the transition from interesting prototype to production-quality code that can be accepted into Linux. In this case, though, John Ousterhout, Homa's creator, seems determined to get the code into the mainline, and is trying to do the right things to get it there. Thus far, the four postings of the code have yielded some conversations about the protocol, but have not yet resulted in a detailed review of the code. That suggests that the initial merge of Homa is not imminent.
It does seem likely to happen at some point, though. Then, it will be a matter of whether the operators of large data centers decide that it is worth using. Complicating that question is Ousterhout's assertion (in the above-linked paper) that, even in a kernel with less overhead than Linux, CPUs simply are not fast enough to keep up with the increases in networking speed. The real future for Homa, he suggests, may be inside the networking hardware itself. In that case, the merging into Linux would be an important proof of concept that accelerates further development of the protocol, but its use in real-world deployments might be limited. It does, in any case, show how Linux is firmly at the center of protocol development for modern networks.
Index entries for this article
    Kernel: Networking/Protocols
Why not UDP
Posted Dec 30, 2024 23:05 UTC (Mon) by npws (subscriber, #168248) [Link] (10 responses)
On a related note, 10, 15 years after DCCP has been merged, is there any real life use of this protocol?
Why not UDP
Posted Jan 3, 2025 19:55 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)
Except that the costs (as I understand it) basically destroy all the benefits! It's called UDP/IP for a reason (although we normally drop the IP bit, same as TCP). The IP stands for Internet Protocol, which is designed to go through routers etc etc. Homa is designed for a LAN, where typically you're routing MAC addresses through a switch, not IP addresses through a router.
So yes, the fundamental idea behind Homa and UDP is very similar. But the *implementation* is very different, and that's the whole point. And the whole point of Homa is that - by deleting all that overhead of routing - it gets a massive speedup.
As the original article said, don't try to run Homa over a WAN. It's the (in)ability to be efficient over the WAN, that helps it be the Cheetah in the server room.
Cheers,
Wol
Why not UDP
Posted Jan 3, 2025 20:17 UTC (Fri) by corbet (editor, #1) [Link]
I don't think that routing, in particular, is key to the design of Homa, with the possible exception of the assumption that round-trip times will be short and constant. Remember that Homa, too, is built on top of IP...
Should this be implemented in endpoints at all?
Posted Dec 31, 2024 6:03 UTC (Tue) by buck (subscriber, #55985) [Link] (6 responses)
If the expectation is that host hardware won't be able to keep up and that the functionality will end up being in NICs, and if it requires prioritization in the network for different message types, then aren't you most of the way to Infiniband or maybe RoCE with Data Center Networking (DCQCN etc.) tuned up on all the switches?
There was a somewhat skeptical viewpoint that came out on Homa from back about the same time LWN first wrote about it:
https://blog.ipspace.net/2023/01/data-center-tcp-replacem...
One of the LinkedIn comments reposted at the bottom contains a particularly eye-opening statement:
> Over 65% of Azure traffic (bytes and packets) uses RoCEv2, not TCP. TCP is a distant second.
I wonder how the proportion has changed in the interim.
While I think a dumb network and the end-to-end principle is usually for the best, it doesn't sound like anybody is really looking to add any value here but speed, which seems like a job for hardware, since the field is already full of hardware competitors.
Should this be implemented in endpoints at all?
Posted Dec 31, 2024 9:26 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)
This is a see-saw. Look at pretty much anything, you'll see it implemented in the CPU because hardware can't keep up, then dedicated hardware takes over because it's faster, then the CPU takes over because the hardware has ossified ...
This is just another iteration of the same loop...
Cheers,
Wol
Should this be implemented in endpoints at all?
Posted Jan 1, 2025 5:06 UTC (Wed) by Jordan_U (subscriber, #93907) [Link]
Progress!
Should this be implemented in endpoints at all?
Posted Jan 3, 2025 22:29 UTC (Fri) by bvanassche (subscriber, #90104) [Link] (1 responses)
> Infiniband has just about all of the performance problems of TCP when it comes to congestion control etc. The only advantage of Infiniband is that people like Mellanox built really nice NICs for it that bypass the kernel.
Is there any scientific paper that backs the above statement about congestion? Multiple papers have been published about how to handle congestion in datacenter RDMA networks. Two examples:
- Zhu, Yibo, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. "Congestion control for large-scale RDMA deployments." ACM SIGCOMM Computer Communication Review 45, no. 4 (2015): 523-536.
- Shpiner, Alexander, Eitan Zahavi, Omar Dahley, Aviv Barnea, Rotem Damsker, Gennady Yekelis, Michael Zus, Eitan Kuta, and Dean Baram. "RoCE rocks without PFC: Detailed evaluation." In Proceedings of the Workshop on Kernel-Bypass Networks, pp. 25-30. 2017.
Should this be implemented in endpoints at all?
Posted Jan 6, 2025 17:10 UTC (Mon) by paulj (subscriber, #341) [Link]
Congestion control is a bit easier on low-hop, tightly controlled networks - i.e. DCs - but even there it is not solved. Fairness across different kinds of CC in particular is a bitch, as is fairness across flows with very different RTTs and/or BDPs. E.g., congestion controller might work great competing with low-latency, fast connections (i.e. intra-DC), but have issues with fairness competing with flows with different properties, like much higher RTT (e.g., cross-region DC to DC). It's clearly not at all an easy problem.
Should this be implemented in software at all?
Posted Jan 4, 2025 6:35 UTC (Sat) by buck (subscriber, #55985) [Link]
Your reply was, by contrast, most gracious (I say as someone who has no emotional attachment to Infiniband design [grin]).
But since I can't withdraw my comment (I think), I at least fixed the Subject of this reply to reflect what my provocative question really was. (I certainly didn't mean to exclude NICs as an implementation target, which are probably considered, by anybody's definition, part of an "endpoint", or maybe even an "endpoint" in their own right, if they are smart NICs/"DPUs".)
That said, if you are being gracious enough to give your code away, it's not my business to question what use the rest of the world may find to make of it. Clearly it has found plenty of use for Raft, Tcl, etc.
Rx
Posted Dec 31, 2024 14:46 UTC (Tue) by grawity (subscriber, #80596) [Link]
And while I too like some protocols having their own IP protocol numbers, it definitely seems like a very limiting choice here, when UDP would provide multiplexing and unprivileged userspace access alike without any kernel changes (it's literally what UDP was meant for).
Does the Homa network protocol support zero-copy data reception?
Posted Jan 6, 2025 22:39 UTC (Mon) by bvanassche (subscriber, #90104) [Link] (1 responses)
To achieve high performance, the CPU should not copy the contents of network data buffers. Does the Homa network protocol support sending and receiving data without copying it? The RDMA userspace verbs API supports this. The Berkeley socket API does not support zero-copy TCP receive (except when using mmap()).
Does the Homa network protocol support zero-copy data reception?
Posted Jan 7, 2025 11:29 UTC (Tue) by paulj (subscriber, #341) [Link]
IIUC correctly anyway.
On the Tx side, sendmsg() API is already capable of zerocopy - see https://www.kernel.org/doc/html/latest/networking/msg_zer... . I assume, but do not know, that Homa relies on this existing userspace API for zerocopy on the Tx side.