
The Homa network protocol

By Jonathan Corbet
December 30, 2024
The origins of the TCP and UDP network protocols can be traced back a full 50 years. Even though networks and their use have changed radically since those protocols were designed, they can still be found behind most networking applications. Unsurprisingly, these protocols are not optimal for all situations, so there is ongoing interest in the development of alternatives. One such is the Homa transport protocol, developed by John Ousterhout (of Tcl/Tk and Raft fame, among other accomplishments), which is aimed at data-center applications. Ousterhout is currently trying to get a minimal Homa implementation into the kernel.

Most networking applications are still based on TCP, which was designed for efficient and reliable transport of streams of data across a distributed Internet. Data-center applications, instead, are often dominated by large numbers of small messages between many locally connected hosts. The requirements of TCP, including the establishment of connections and ordering of data, add a lot of overhead to that kind of application. The design of Homa is intended to remove that overhead while taking advantage of what current data-center networking hardware can do, with a focus on minimizing the latency between a request and its response.

A quick Homa overview

At its core, Homa is designed for remote procedure call (RPC) applications; every interaction on a Homa network comes down to a request and associated reply. A client will send a request message to a server that includes a unique request ID; the server will send a reply back that quotes that ID. The only state that exists on the server is held between the receipt of the request and the receipt of the response by the client.

Much of the key to the performance of this protocol can be found in how these messages are handled. There is no connection setup; instead, the client starts transmitting the request, with no introductory handshake, to the server. There is a limit on how many bytes of this "unscheduled" request data can be sent in this manner, which is determined by the round-trip time of the network; it should be just high enough to keep the request-transmission pipeline full until an initial response can be received from the server side. The figure of about 10,000 bytes appears in some of the Homa papers.

The initial request packet includes the length of the full request. If the request does not fit into the size allowed for the unscheduled data, the client will wait for a "grant" response before sending any more. That grant should, if the server is responding quickly, arrive just as the initial request data has finished transmitting, allowing the client to continue sending without a pause. Grants include a maximum amount of data that can be sent, and thus function like the TCP receive window.

This machinery is intended to get a request to the server as quickly as possible, but without the need for much, if any, buffering in the network path between the two machines. Priority queues are used to manage this traffic, with unscheduled packets normally having the highest priority. Lower priorities are used for granted traffic; the requests with the least amount of data remaining to be received are given the highest priority.

Once the server has received the full request and processed it, a response is sent back to the client. Once again, the initial bytes are sent as unscheduled packets, with grants required for the rest if the response is large enough. In the earlier descriptions of the protocol, the server would forget everything it knew about the request immediately after sending the response. That created the possibility that requests could be resent (if the response never arrives) and executed multiple times. More recent publications include an explicit acknowledgment message indicating that a response has been received, with the sender retaining the necessary state to retransmit a reply until that acknowledgment is received.

The details of the protocol are, of course, rather more complex than described here. There are, for example, mechanisms for clamping down on the amount of unscheduled data sent if a server is finding itself overloaded. The receiving side of a message can request retransmission if an expected packet does not arrive; unlike TCP and many other protocols, Homa puts the responsibility for detecting lost packets onto the receiving side. There is also a fair amount of thought that has gone into letting systems overcommit their resources by issuing more grants than they can immediately handle; the purpose here is to keep the pipelines full even if some senders do not transmit as quickly as expected.

See this paper for a more complete (and surely more correct) description of the Homa protocol; this page, which reflects some more recent changes; and this 2022 article for more details.

Homa on Linux

The Unix socket interface was designed around streams, and is not a perfect fit for Homa, but the implementation sticks with it to the extent it can. A socket() call is used to create a socket for communication with any number of other systems; the IPPROTO_HOMA protocol type is used. Homa can run over either IPv4 or IPv6. For server systems, a bind() call can be used to set up a well-known port to receive requests; clients need not bind to a port.
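
A minimal sketch of that setup for a server might look like the following; the SOCK_DGRAM socket type, the port number, and the fallback definition of IPPROTO_HOMA (using the IANA-assigned protocol number 146) are assumptions rather than details taken from the patch posting.

    /* Sketch: create a Homa socket over IPv6 and bind it to a port. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef IPPROTO_HOMA
    #define IPPROTO_HOMA 146    /* assumed; IANA-assigned protocol number */
    #endif

    int main(void)
    {
        int fd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_HOMA);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        /* Servers bind to a well-known port; clients can skip this step. */
        struct sockaddr_in6 addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin6_family = AF_INET6;
        addr.sin6_addr = in6addr_any;
        addr.sin6_port = htons(4000);    /* hypothetical port number */
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind");
            close(fd);
            return 1;
        }

        /* ... send and receive messages here ... */
        close(fd);
        return 0;
    }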

Messages are sent and received, as one might expect, with sendmsg() and recvmsg(), but there are some Homa-specific aspects that developers must be aware of. When sending a message, an application must include a pointer to this structure in the msg_control field of the msghdr structure passed to sendmsg():

    struct homa_sendmsg_args {
	uint64_t id;
	uint64_t completion_cookie;
    };

If a request is being sent, id should be set to zero; the protocol implementation will then assign a unique ID to the request (and write it into id) before sending it to the server. For a reply message, id should be the ID value that arrived with the request being responded to. The completion_cookie value, which is only used for requests, will be passed back to the caller with the reply data when it is received.
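
A sketch of the client-side send path, under those rules, might look like this; the helper function, the cookie value, and the IPv6-only destination are illustrative assumptions, not code from the patches.

    /* Sketch: send a new request on an existing Homa socket and return
     * the RPC ID assigned by the kernel (zero on error).  Requires
     * <stdint.h>, <stdio.h>, <sys/socket.h>, <sys/uio.h>, and
     * <netinet/in.h>. */
    static uint64_t send_homa_request(int fd, const struct sockaddr_in6 *dest,
                                      const void *buf, size_t len)
    {
        struct homa_sendmsg_args args = {
            .id = 0,                    /* zero: this is a new request */
            .completion_cookie = 42,    /* hypothetical; echoed with the reply */
        };
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
        struct msghdr hdr = {
            .msg_name = (void *)dest,
            .msg_namelen = sizeof(*dest),
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = &args,
            .msg_controllen = sizeof(args),
        };

        if (sendmsg(fd, &hdr, 0) < 0) {
            perror("sendmsg");
            return 0;
        }
        return args.id;    /* the kernel writes the assigned ID back here */
    }

A reply would be sent the same way, except that id would be set to the ID that arrived with the request being answered.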

The receive side is a bit more complicated, because Homa requires that the buffer space for replies be registered before sending the first request on a socket. To do so, the process should allocate a range of memory, then pass it into the kernel with the SO_HOMA_RCVBUF setsockopt() operation, using this structure:

    struct homa_rcvbuf_args {
	void *start;
	size_t length;
    };

The start address must be page-aligned. This memory is split into individual buffers, called "bpages", each of which is HOMA_BPAGE_SIZE in length; that size is 64KB in the current implementation. Each message will occupy at least one bpage; large messages will be scattered across multiple, not necessarily contiguous, bpages.
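
A sketch of that registration step appears below; the use of mmap() to obtain page-aligned memory, the 64-bpage region size, and the IPPROTO_HOMA level passed to setsockopt() are assumptions, as is the availability of HOMA_BPAGE_SIZE and struct homa_rcvbuf_args from a userspace Homa header.

    /* Sketch: register a receive-buffer region on a Homa socket.
     * Requires <sys/mman.h>, <sys/socket.h>, and <stdio.h>. */
    #define RCVBUF_SIZE (64 * HOMA_BPAGE_SIZE)    /* room for 64 bpages */

    static void *register_homa_rcvbuf(int fd)
    {
        /* mmap() returns page-aligned memory, as required for start. */
        void *region = mmap(NULL, RCVBUF_SIZE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED) {
            perror("mmap");
            return NULL;
        }

        struct homa_rcvbuf_args args = {
            .start = region,
            .length = RCVBUF_SIZE,
        };
        if (setsockopt(fd, IPPROTO_HOMA, SO_HOMA_RCVBUF,
                       &args, sizeof(args)) < 0) {
            perror("setsockopt(SO_HOMA_RCVBUF)");
            munmap(region, RCVBUF_SIZE);
            return NULL;
        }
        return region;
    }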

A message is received by making a call to recvmsg() with a pointer to this structure passed in the msg_control field of struct msghdr:

    struct homa_recvmsg_args {
	uint64_t id;
	uint64_t completion_cookie;
	uint32_t flags;
	uint32_t num_bpages;
	uint32_t bpage_offsets[HOMA_MAX_BPAGES];
    };

The flags field describes what the caller is willing to receive; it is a bitmask that can include either or both of HOMA_RECVMSG_REQUEST (to receive request messages) and HOMA_RECVMSG_RESPONSE (to receive responses). If id is zero, then HOMA_RECVMSG_RESPONSE will cause any response message to be returned; otherwise, only a response corresponding to the provided request ID will be returned. On return, num_bpages will indicate the number of bpages in the registered buffer area that have been used to hold the returned message; bpage_offsets gives the offset of each one.

The bpages returned by this call are owned by the application at this point, and will not be used by the kernel until they have been explicitly returned. That is done with a subsequent recvmsg() call, where num_bpages and bpage_offsets will indicate a set of bpages to be given back.
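
Putting the receive side together, a receive loop might be sketched as follows; peer-address handling and actual message processing are omitted, and the loop assumes that leaving the previous call's num_bpages and bpage_offsets values in place is how those buffers are handed back.

    /* Sketch: receive requests and responses on a Homa socket; rcvbuf is
     * the region registered with SO_HOMA_RCVBUF.  Requires <stdint.h>,
     * <string.h>, <stdio.h>, <sys/types.h>, and <sys/socket.h>. */
    static void homa_receive_loop(int fd, char *rcvbuf)
    {
        struct homa_recvmsg_args ctrl;
        memset(&ctrl, 0, sizeof(ctrl));

        for (;;) {
            struct msghdr hdr = {
                .msg_control = &ctrl,
                .msg_controllen = sizeof(ctrl),
            };

            /* Accept both requests and responses; id == 0 asks for any
             * response rather than one particular RPC. */
            ctrl.id = 0;
            ctrl.flags = HOMA_RECVMSG_REQUEST | HOMA_RECVMSG_RESPONSE;

            ssize_t len = recvmsg(fd, &hdr, 0);
            if (len < 0) {
                perror("recvmsg");
                break;
            }

            /* The message lives in the registered region, scattered
             * across num_bpages bpages of up to HOMA_BPAGE_SIZE each. */
            for (uint32_t i = 0; i < ctrl.num_bpages; i++) {
                char *chunk = rcvbuf + ctrl.bpage_offsets[i];
                /* ... process this chunk of the message ... */
                (void)chunk;
            }

            /* Leaving num_bpages and bpage_offsets untouched returns
             * these bpages to the kernel on the next recvmsg() call. */
        }
    }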

This code has been "stripped down to the bare minimum" to be able to actually transmit requests and responses across the net; it is evidently about half of the full set of Homa patches. The intent, of course, is to ease the task of reviewing the work and getting initial support into the kernel; the rest of the work can come later. In its current form, according to the cover letter, its performance "is not very interesting", but that is expected to improve once the rest of the work is merged.

See this paper for more information on the Linux implementation of Homa.

Prospects

The Homa protocol originates at Stanford University, with support from a number of technology companies. Academic work often does not successfully make the transition from interesting prototype into production-quality code that can be accepted into Linux. In this case, though, Ousterhout seems determined to get the code into the mainline, and is trying to do the right things to get it there. Thus far, the four postings of the code have yielded some conversations about the protocol, but have not yet resulted in a detailed review of the code. That suggests that the initial merge of Homa is not imminent.

It does seem likely to happen at some point, though. Then, it will be a matter of whether the operators of large data centers decide that it is worth using. Complicating that question is Ousterhout's assertion (in the above-linked paper) that, even in a kernel with less overhead than Linux, CPUs simply are not fast enough to keep up with the increases in networking speed. The real future for Homa, he suggests, may be inside the networking hardware itself. In that case, the merging into Linux would be an important proof of concept that accelerates further development of the protocol, but its use in real-world deployments might be limited. It does, in any case, show how Linux is firmly at the center of protocol development for modern networks.


IP protocol number

Posted Dec 30, 2024 23:02 UTC (Mon) by cesarb (subscriber, #6266) [Link]

For those who, like me, were wondering: this is IP protocol number 146 according to https://www.iana.org/assignments/protocol-numbers/protoco...

Why not UDP

Posted Dec 30, 2024 23:05 UTC (Mon) by npws (subscriber, #168248) [Link] (10 responses)

This begs the question why the protocol has not been built on top of UDP.

On a related note, 10 or 15 years after DCCP was merged, is there any real-life use of that protocol?

Why not UDP

Posted Dec 31, 2024 0:21 UTC (Tue) by willy (subscriber, #9762) [Link] (9 responses)

QUIC is layered on top of UDP because it has to contend with middleboxes. Homa is designed for the datacenter where you will generally not see middleboxes. What advantage would there be to putting UDP between Homa and IP?

Why not UDP

Posted Dec 31, 2024 0:51 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

Not having to do it in the kernel?

Why not UDP

Posted Dec 31, 2024 1:19 UTC (Tue) by josh (subscriber, #17465) [Link] (2 responses)

You can send and receive raw IP from userspace: https://www.man7.org/linux/man-pages/man7/raw.7.html

Why not UDP

Posted Dec 31, 2024 1:21 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Raw sockets require special permissions and are, in general, not great for performance if you have multiple listeners (i.e. multiple applications using Homa).

Why not UDP

Posted Dec 31, 2024 1:24 UTC (Tue) by josh (subscriber, #17465) [Link]

That's a much better argument for UDP, sure. I was noting that it's *possible* to write raw IP protocols from userspace.

Why not UDP

Posted Dec 31, 2024 7:03 UTC (Tue) by dvdeug (guest, #10998) [Link] (4 responses)

Because there's an open source kernel that lets them implement what they want instead of using workarounds?

Why not UDP

Posted Dec 31, 2024 9:48 UTC (Tue) by npws (subscriber, #168248) [Link] (3 responses)

It is questionable whether this would really be a workaround. "The only state that exists on the server is held between the receipt of the request and the receipt of the response by the client". This sounds like pretty much what UDP is meant for, stateless query-response protocols. The benefits of using UDP are obvious, no need for a kernel implementation, at least initially, easier development, easier deployment.

Why not UDP

Posted Jan 3, 2025 17:30 UTC (Fri) by john_ousterhout (guest, #175303) [Link]

A user-space implementation of Homa is probably not a good idea because Homa's congestion control algorithm depends on having complete information about all traffic on the uplink. With a user-space implementation, no one app would have the complete information needed for congestion control (perhaps you could use shared memory among the apps to keep track of this, but that sounds pretty complex).

Why not UDP

Posted Jan 3, 2025 19:55 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> The benefits of using UDP are obvious, no need for a kernel implementation, at least initially, easier development, easier deployment.

Except that the costs (as I understand it) basically destroy all the benefits! It's called UDP/IP for a reason (although we normally drop the IP bit, same as TCP). The IP stands for Internet Protocol, which is designed to go through routers etc etc. Homa is designed for a LAN, where typically you're routing MAC addresses through a switch, not IP addresses through a router.

So yes, the fundamental idea behind Homa and UDP is very similar. But the *implementation* is very different, and that's the whole point. And the whole point of Homa is that - by deleting all that overhead of routing - it gets a massive speedup.

As the original article said, don't try to run Homa over a WAN. It's the (in)ability to be efficient over the WAN, that helps it be the Cheetah in the server room.

Cheers,
Wol

Why not UDP

Posted Jan 3, 2025 20:17 UTC (Fri) by corbet (editor, #1) [Link]

I don't think that routing, in particular, is key to the design of Homa, with the possible exception of the assumption that round-trip times will be short and constant. Remember that Homa, too, is built on top of IP...

Should this be implemented in endpoints at all?

Posted Dec 31, 2024 6:03 UTC (Tue) by buck (subscriber, #55985) [Link] (6 responses)

Maybe I shouldn't be posting in ignorance, but …

If the expectation is that host hardware won't be able to keep up and that the functionality will end up being in NICs, and if it requires prioritization in the network for different message types, then aren't you most of the way to Infiniband or maybe ROCE with Data Center Networking (DCQCN etc.) tuned up on all the switches?

There was a somewhat skeptical viewpoint that came out on Homa from back about the same time LWN first wrote about it:

https://blog.ipspace.net/2023/01/data-center-tcp-replacem...

One of the LinkedIn comments reposted at the bottom contains a particularly eye-opening statement:

> Over 65% of Azure traffic (bytes and packets) uses RoCEv2, not TCP. TCP is a distant second.

I wonder how the proportion has changed in the interim.

While I think a dumb network and the end-to-end principle is usually for the best, it doesn't sound like anybody is really looking to add any value here but speed, which seems like a job for hardware, since the field is already full of hardware competitors.

Should this be implemented in endpoints at all?

Posted Dec 31, 2024 9:26 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

> While I think a dumb network and the end-to-end principle is usually for the best, it doesn't sound like anybody is really looking to add any value here but speed, which seems like a job for hardware, since the field is already full of hardware competitors.

This is a see-saw. Look at pretty much anything, you'll see it implemented in the CPU because hardware can't keep up, then dedicated hardware takes over because it's faster, then the CPU takes over because the hardware has ossified ...

This is just another iteration of the same loop...

Cheers,
Wol

Should this be implemented in endpoints at all?

Posted Jan 1, 2025 5:06 UTC (Wed) by Jordan_U (subscriber, #93907) [Link]

But now there won't be ossification because you'll just be able to update the docker container running on your NIC!

Progress!

Should this be implemented in endpoints at all?

Posted Jan 3, 2025 17:34 UTC (Fri) by john_ousterhout (guest, #175303) [Link] (3 responses)

Infiniband has just about all of the performance problems of TCP when it comes to congestion control etc.. The only advantage of Infiniband is that people like Mellanox built really nice NICs for it that bypass the kernel.

Should this be implemented in endpoints at all?

Posted Jan 3, 2025 22:29 UTC (Fri) by bvanassche (subscriber, #90104) [Link] (1 responses)

Infiniband has just about all of the performance problems of TCP when it comes to congestion control etc.. The only advantage of Infiniband is that people like Mellanox built really nice NICs for it that bypass the kernel.

Is there any scientific paper that backs the above statement about congestion? Multiple papers have been published about how to handle congestion in datacenter RDMA networks. Two examples:

  • Zhu, Yibo, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. "Congestion control for large-scale RDMA deployments." ACM SIGCOMM Computer Communication Review 45, no. 4 (2015): 523-536.
  • Shpiner, Alexander, Eitan Zahavi, Omar Dahley, Aviv Barnea, Rotem Damsker, Gennady Yekelis, Michael Zus, Eitan Kuta, and Dean Baram. "RoCE rocks without PFC: Detailed evaluation." In Proceedings of the Workshop on Kernel-Bypass Networks, pp. 25-30. 2017.

Should this be implemented in endpoints at all?

Posted Jan 6, 2025 17:10 UTC (Mon) by paulj (subscriber, #341) [Link]

Congestion control is far from a solved problem, especially not across wider, less-controlled / uncoordinated networks. Some say ECN is the magic bullet, especially in its recent updated form of L4S / TCP Prague. Others (including a regular commenter here on LWN on networking matters, and Linux congestion/buffering contributor) I think disagree with that.

Congestion control is a bit easier on low-hop, tightly controlled networks - i.e. DCs - but even there it is not solved. Fairness across different kinds of CC in particular is a bitch, as is fairness across flows with very different RTTs and/or BDPs. E.g., congestion controller might work great competing with low-latency, fast connections (i.e. intra-DC), but have issues with fairness competing with flows with different properties, like much higher RTT (e.g., cross-region DC to DC). It's clearly not at all an easy problem.

Should this be implemented in software at all?

Posted Jan 4, 2025 6:35 UTC (Sat) by buck (subscriber, #55985) [Link]

Sorry the comment you were responding to was so provocative.

Your reply was, by contrast, most gracious (I say as someone who has no emotional attachment to Infiniband design [grin]).

But since I can't withdraw my comment (I think), I at least fixed the Subject of this reply to reflect what my provocative question really was. (I certainly didn't mean to exclude NICs as an implementation target, which are probably considered, by anybody's definition, part of an "endpoint", or maybe even an "endpoint" in their own right, if they are smart NICs/"DPUs".)

That said, if you are being gracious enough to give your code away, it's not my business to question what use the rest of the world may find to make of it. Clearly it has found plenty of use for Raft, TCL, etc.

Rx

Posted Dec 31, 2024 14:46 UTC (Tue) by grawity (subscriber, #80596) [Link]

It reminded me a little bit of Rx, a fairly old RPC-transport used by AFS (and which also happens to have special kernel support as AF_RXRPC). From what I recall – if I recall correctly, that is –, Rx also does per-call congestion control instead of per-connection (it is connectionless), and it's what leads to its miserable performance over any sort of non-single-digit latency... But I guess it won't matter for a data center?

And while I too like some protocols having their own IP protocol numbers, it definitely seems like a very limiting choice here, when UDP would provide multiplexing and unprivileged userspace access alike without any kernel changes (it's literally what UDP was meant for).

Homa?

Posted Jan 5, 2025 18:26 UTC (Sun) by tao (subscriber, #17563) [Link]

With a name like Homa I was fully expecting it to deal with Home automation, not data-center applications...

Does the Homa network protocol support zero-copy data reception?

Posted Jan 6, 2025 22:39 UTC (Mon) by bvanassche (subscriber, #90104) [Link] (1 responses)

To achieve high performance, the CPU should not copy the contents of network data buffers. Does the Homa network protocol support sending and receiving data without copying it? The RDMA userspace verbs API supports this. The Berkeley socket API does not support zero-copy TCP receive (except when using mmap()).

Does the Homa network protocol support zero-copy data reception?

Posted Jan 7, 2025 11:29 UTC (Tue) by paulj (subscriber, #341) [Link]

I don't know Homa, other than via 2 articles on LWN and the papers on it. The interface Homa presents, as described in this article, appears to be explicitly designed for zero-copy receive though - note how userspace allocates the buffer, and the API then allows responsibility/ownership of the buffer to be transferred back and forth between kernel and userspace. I.e., userspace assigns a set of buffers to the kernel, allowing the kernel to setup direct DMA to the buffer, and then ownership is transferred back to userspace on completion - allowing it to do what it wants with the same buffer.

IIUC correctly anyway.

On the Tx side, sendmsg() API is already capable of zerocopy - see https://www.kernel.org/doc/html/latest/networking/msg_zer... . I assume, but do not know, that Homa relies on this existing userspace API for zerocopy on the Tx side.


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds