Winsock Programmer’s FAQ: Intermediate Winsock Issues

3.1 - How do I speak { HTTP, POP3, SMTP, FTP, Telnet, NNTP, etc. } with Winsock?

Winsock proper does not provide a way for you to speak these protocols, because it only deals with the layers underneath these application-level protocols. However, there are many ways for you to get your program to speak these protocols.

The easiest method is to use a third-party library. The Resources section lists several of these.

If you only need to speak the HTTP, FTP or gopher protocols, you can use the WinInet library exposed by Microsoft’s Internet Explorer. Newer versions of Microsoft’s development tools include components that make accessing WinInet simple.

Finally, you can always roll your own. You should start by reading the specification for the protocol you want to implement. Most of the Internet’s protocols are documented in RFCs. The Important RFCs page links to the most commonly referenced application-level RFCs. The complexity of the protocols varies widely, and the only way to gauge the difficulty of implementing the protocol is to read the relevant RFC(s). HTTP, for example, is a pretty simple protocol, but the authors of its RFC managed to fill 176 pages talking about it. Most RFCs aren’t that pretentious, luckily.

If you’ve read the RFC and still can’t figure the protocol out, you can always fall back to the standard Q&A resources.

3.2 - How can I encrypt my TCP stream with SSL/TLS?

Modern versions of Windows have an SSL/TLS mechanism built in.

Windows NT derivatives offer SSL through the security API. You can find sample code to show how these mechanisms work in the MSDN Library. You can also find code in the Platform SDK, in the Samples\WinBase\Security\SSL subdirectory.

Windows CE has a different SSL mechanism. There is an article in MSDN that describes how to use the functionality.

If all you need is basic HTTP, you can also use the WinInet API exposed by Internet Explorer. You can access https URLs through it just as well as http ones. You lose out on a lot of flexibility relative to writing your own HTTP client code, of course, but it's a lot simpler, too. MS Knowledge Base article KB168151 shows how to use this feature.

3.3 - How do I get my computer’s IP address?

There are three methods, which each have advantages and disadvantages:

The simplest method is to call getsockname() on a connected socket. If you pass it a disconnected socket, it will most likely return something useless, like 0.0.0.0.
To get your machine’s address without establishing a connection first, call WSAIoctl() on any socket, passing SIO_GET_INTERFACE_LIST for the second parameter. [C++ Example]
A more portable alternative to the previous item, with much the same effect, is to call gethostbyname(), passing the value gethostname() returns. It doesn’t return as much detail as the previous method. [C++ example]

The thing that makes this difficult is that a computer can have any number of network interfaces, each with an IP, all of which are “your” IP address. Which to choose?

At minimum, you will find two interfaces, one of which is the loopback interface, leaving the other as the one you want. With the second method, there is a flag you can check to find the loopback interface. In the third, if you can assume IPv4, you can heuristically detect it by checking if the first octet of the address is 127. (Don’t check for 127.0.0.1: all 127.x.y.z addresses are legal loopback addresses.)
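
The 127-prefix heuristic described above can be sketched as follows. This is only an illustration: it assumes IPv4 and an address that has already been converted to host byte order (with ntohl(), for example), and the function name is hypothetical.

```cpp
#include <cstdint>

// Returns true if an IPv4 address, given in HOST byte order, lies in the
// 127.0.0.0/8 loopback block. The name is_ipv4_loopback is illustrative only.
bool is_ipv4_loopback(std::uint32_t host_order_addr)
{
    // In host order the first octet is the most significant byte.
    return (host_order_addr >> 24) == 127;
}
```

Checking the whole first octet, rather than comparing against 127.0.0.1, covers every 127.x.y.z address.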

So now the question is, how likely is it that your program will be running on a machine with more than one external network interface? Quite likely, unfortunately. The computer I’m typing this on has seven: two Ethernet ports, a WiFi adapter, three virtual interfaces for VMware, and a Firewire port. (Firewire can be used for an ad-hoc peer-to-peer network.) You may also find a modem configured for PPP dial-up, a satellite Internet adapter, a cell network data adapter, etc. Any of these might be the one you want.

The computer’s network stack uses the route table to figure out which interface to use in a given situation. You can also retrieve the route table and try to work out the same answer, but it’s best, whenever possible, to leave this to the stack. This is the primary virtue of the first method: in establishing a connection to a remote peer, the stack selected one of the network interfaces based on the route table, so as far as that remote peer is concerned, you have only one IP.

If you’re simply trying to connect to a server running on the same machine with sockets, use the loopback interface.

If you cannot intelligently decide which interface to use based on heuristics like loopback vs. non, and cannot establish a connection to make the stack figure this out for you, it’s best to just ask the user. Use the second or third method above to collect a list of addresses, present it to the user, and make them pick one.

3.4 - What’s the proper way to impose a packet scheme on a stream protocol like TCP?

The two most common methods are delimiters and length-prefixing.

An example of delimiters is separating packets with, say, a caret (^). Naturally your delimiter must never occur in regular data, or you must have some way of “escaping” delimiter characters.

An example of length-prefixing is prepending a two-byte integer containing the packet length on every packet. See the FAQ article How to Use TCP Effectively for the proper way to send integers over the network. Also see the How to Packetize a TCP Stream example.

There are hybrid methods, too. The HTTP protocol, for example, separates header lines with CRLF pairs (a kind of delimiting), but when an HTTP reply contains a block of binary data, the server also sends the Content-Length header before it sends the data, which is a kind of length-prefixing.

I favor simple length-prefixing, because as soon as you read the length prefix, you know how many more bytes to expect. By contrast, delimiters require that you blindly read until you find the end of the packet.
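
As a rough sketch of length-prefixing, the code below frames each packet with a two-byte big-endian length and pulls complete packets back out of a receive buffer. The function names are hypothetical, the buffer is assumed to hold whatever bytes recv() has returned so far, and payloads are assumed to fit in 65535 bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Prepend a two-byte, big-endian (network order) length to a payload.
// Assumes payload.size() <= 65535; larger payloads need a wider prefix.
std::string frame_packet(const std::string& payload)
{
    std::string out;
    const std::uint16_t len = static_cast<std::uint16_t>(payload.size());
    out.push_back(static_cast<char>(len >> 8));    // high byte first
    out.push_back(static_cast<char>(len & 0xFF));  // then low byte
    out += payload;
    return out;
}

// Remove every complete packet from the front of `buffer` and return them.
// A trailing partial packet stays in `buffer` until more bytes arrive.
std::vector<std::string> extract_packets(std::string& buffer)
{
    std::vector<std::string> packets;
    std::size_t pos = 0;
    while (buffer.size() - pos >= 2) {
        const std::size_t len =
            (static_cast<unsigned char>(buffer[pos]) << 8) |
             static_cast<unsigned char>(buffer[pos + 1]);
        if (buffer.size() - pos - 2 < len) break;  // body not complete yet
        packets.push_back(buffer.substr(pos + 2, len));
        pos += 2 + len;
    }
    buffer.erase(0, pos);
    return packets;
}
```

The receiver appends each chunk it reads to the buffer and calls extract_packets() afterward, so it does not matter how TCP happened to split or coalesce the stream.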

3.5 - I’m writing a server. What’s a good network port to use?

If you’re writing a server for an existing, popular Internet protocol, it’s already got a port number assigned to it. You can find the most common of these numbers at the website for the Internet Assigned Numbers Authority (IANA).

If you’re writing a server for a new protocol, there are a few rules and suggestions you should obey when choosing your server’s port:

Ports 1-1023 are off-limits to people inventing new protocols. They are reserved by the IANA for standard protocols like POP3 and HTTP (110 and 80, respectively). Until your protocol is granted a port in this range by the IANA, you should use something outside this range. id Software’s choice of port 666 for their DOOM game server is cute, but it violates this rule. They cleaned up their act with Quake: it uses port 26000.
Ports 1024 through 49151 are Registered Ports, which are a good range to choose your ports from. Just beware that the entire world is choosing from ports in this range, so it may make sense for you to register your port, or at least check the current list of assigned ports.
Ports 49152 through 65535 are Dynamic Ports, meaning that operating systems use ports in this range when choosing random ports. (The FTP protocol, for example, uses random ports in the data transfer phase.) This is a poor range to choose ports from, because there’s a fairly decent chance that your program and the OS will fight over a given port eventually.
Many OSes pick local ports for client programs from the 1024-5000 range. You would do well to pick server ports higher than 5000, but this is not as rigid a rule as the previous ones.
There are plenty of uncontested port numbers to choose from in the “safe” 5000-49151 range. You should avoid port numbers with patterns to them, or a widely-recognized meaning. People tend to pick these since they’re easy to remember, but this increases the chances of a collision. Ports 6969, 5150 and 22222 are bad choices, for example.
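
The ranges in the list above can be summarized in a small helper. This is a sketch only: the enum and function names are made up for illustration, port 5000 itself is treated as part of the client range, and the code makes no attempt to detect “patterned” numbers like 6969.

```cpp
// Hypothetical labels for the port ranges discussed above.
enum class PortRange { Invalid, WellKnown, ClientEphemeral, Recommended, Dynamic };

// Classify a candidate server port according to the rules in the list above.
PortRange classify_port(int port)
{
    if (port < 1 || port > 65535) return PortRange::Invalid;
    if (port <= 1023)  return PortRange::WellKnown;        // reserved by the IANA
    if (port <= 5000)  return PortRange::ClientEphemeral;  // many OSes pick client ports here
    if (port <= 49151) return PortRange::Recommended;      // registered range above 5000
    return PortRange::Dynamic;                             // 49152-65535, chosen by the OS
}
```

A program might call this on a user-configured port and warn when the result is anything other than Recommended.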

You should also give some thought to making your program’s port configurable, in case your program is run on a machine where another server is already using that port. One way to do this is through Winsock’s getservbyname() function: if that function returns a port number, use that, otherwise use the default port number. Then users can change your program’s port by editing the services file, located in %WINSYSDIR%\drivers\etc.

3.6 - What is TCP?

The Transmission Control Protocol is a reliable stream transport protocol:

“Reliable” means that the stack either delivers the data to the remote peer or reports an error: TCP can deal with lost, corrupted, duplicated and fragmented packets.
“Stream” means that the remote peer sees incoming data as a stream of individual bytes: there is no notion of packets, from the program’s viewpoint.
“Transport” refers to a layer of the network stack just above the hardware layer, which works out how to transport raw data from one machine to another. Winsock is above the transport layer and your program is above the Winsock layer, so your program does not see TCP directly. It’s simply a service you request from Winsock by passing SOCK_STREAM as the second argument to socket(). You can dig down to the TCP layer with a sniffer or raw sockets.

TCP can coalesce sends, for efficiency: if you make four quick send() calls to Winsock with 100, 50, 30 and 120 bytes in each, Winsock is likely to pack all these up into a single 300-byte TCP packet when it decides to send them out on the network. (This is called the Nagle algorithm.) Compare UDP.

3.7 - What is UDP?

The User Datagram Protocol is an alternative to TCP. Sometimes you see the term “TCP/IP” used to refer to all basic Internet technologies, including UDP, but the proper term is UDP/IP, meaning UDP over IP.

Winsock gives you a UDP socket when you pass SOCK_DGRAM as the second argument to socket().

UDP is an “unreliable” protocol: the stack does not make any effort to handle lost, duplicated, or out-of-order packets. UDP packets are checked for corruption, but a corrupt UDP packet is simply dropped silently.

The stack will fragment a UDP datagram when it’s larger than the network’s MTU. The remote peer’s stack will reassemble the complete datagram from the fragments before it delivers it to the receiving application. If a fragment is missing or corrupted, the whole datagram is thrown away. This makes large datagrams impractical: an 8 KB UDP datagram will be broken into 6 fragments when sent over Ethernet, for example, because it has a 1500 byte MTU. If any of those 6 fragments is lost or corrupted, the stack throws away the entire 8 KB datagram.
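
The fragment count above can be worked out with a short calculation. The sketch below assumes a 20-byte IPv4 header with no options and the 8-byte UDP header; the function name is hypothetical.

```cpp
#include <cstddef>

// Estimate how many IPv4 fragments a UDP datagram needs.
// Assumptions: 20-byte IPv4 header (no options), 8-byte UDP header,
// and mtu >= 28 so that each fragment can carry at least 8 data bytes.
std::size_t udp_fragment_count(std::size_t payload_bytes, std::size_t mtu)
{
    const std::size_t ip_header = 20;
    const std::size_t udp_header = 8;
    // Every fragment except the last must carry a multiple of 8 data bytes.
    const std::size_t per_fragment = ((mtu - ip_header) / 8) * 8;
    // The UDP header travels inside the fragmented data.
    const std::size_t total = udp_header + payload_bytes;
    return (total + per_fragment - 1) / per_fragment;  // round up
}
```

With an 8192-byte payload and a 1500-byte MTU, each fragment carries 1480 bytes of the 8200 bytes to be sent, which rounds up to the 6 fragments mentioned above.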

Datagram loss can also occur within the stack at the sender or the receiver, usually due to lack of buffer space. It is even possible for two communicating programs running on the same machine to have data loss if they use UDP. (This actually happens on Windows under high load conditions, because it starts dropping datagrams when the stack buffers get full.) This limits UDP’s value as a local IPC mechanism.

If any of these types of loss occur, no notification will be sent to the sender or receiver, even if the loss happens within the network stack.

Duplicated datagrams are not dropped: they are delivered to the receiver. It is up to the application to detect this problem, and it is the program’s choice what to do with the duplicate datagram.

UDP datagrams can be delivered in any order. Datagrams often get reordered on the network when two datagrams get delivered via different routes, and the second datagram’s route happens to be quicker.

3.8 - What is UDP good for?

From the above discussion, UDP looks pretty useless, right? Well, it does have a few advantages over reliable protocols like TCP:

UDP is a slimmer protocol: its protocol header is fixed at 8 bytes, whereas TCP’s is 20 bytes at minimum and can be more.
UDP has no congestion control and no data coalescing. This eliminates the delays caused by the delayed ACK and Nagle algorithms. (This is also a disadvantage in many situations, of course.)
There is less code in the UDP section of the stack than the TCP section. This means that there is less latency between a packet arriving at the network card and being delivered to the application.
Only UDP packets can be broadcast or multicast.

This makes UDP good for applications where timeliness and control are more important than reliability. Also, some applications are inherently tolerant of UDP problems. You have likely experienced blips, skips and stutters in streaming media programs: these are due to lost, corrupted or duplicated UDP frames.

Be careful not to let UDP’s advantages blind you to its bad points: too many application writers have started with UDP, and then later been forced to add reliability features. When considering UDP, ask yourself whether it would be better to use TCP from the start than to try to reinvent it. Note that you can’t completely reinvent TCP from the Winsock layer. There are some features of TCP like path MTU discovery that require low-level access to the OS’s networking layers. Other features of TCP are possible to duplicate over UDP, but difficult to get right. Keep in mind, TCP/IP was created in 1981, and the particular implementation you are using probably has code in it going back nearly that far. A whole lot of effort has gone into tuning this protocol suite for reliability and performance. You will be throwing away those decades of experience in trying to reinvent TCP or invent something better.

If you need a balance between UDP and TCP, you might investigate RTP (RFC 1889) and SCTP (RFC 2960). RTP is a higher level protocol that usually runs over UDP and adds packet sequence numbers, as well as other features. SCTP runs directly on top of IP like TCP and UDP; it is a reliable protocol like TCP, but is datagram oriented like UDP.

3.9 - How do I send a broadcast packet?

With the UDP protocol you can send a packet so that all workstations on the network will see it. (TCP doesn’t allow broadcasting.)

To send broadcast packets, you must first enable the SO_BROADCAST option with the setsockopt() function. Then you simply send packets out using a special broadcast address.

The universal broadcast address is 255.255.255.255. Its advantage is that it’s generic. The disadvantage is that, because it can theoretically refer to every IP-connected machine on the planet, many network nodes will drop universal broadcast packets.

A smarter plan is to use your subnet’s “directed broadcast” address. This is an address you calculate using a network interface’s IP address and its netmask; packets sent to that address will stay within the subnet, so often routers that would drop a universal broadcast will pass directed broadcasts. To construct the directed broadcast address, do something like this:

        u_long host_addr = inet_addr("172.16.77.88");   // local IP addr
        u_long net_mask = inet_addr("255.255.224.0");   // LAN netmask
        u_long net_addr = host_addr & net_mask;         // 172.16.64.0
        u_long dir_bcast_addr = net_addr | (~net_mask); // 172.16.95.255

Potential Problems: Broadcasts can be useful at times, but keep in mind that this creates a load on all the machines on the network, even on machines that aren’t listening for the packet. This is because the part of the stack that can reject the packet is several layers down. To get around this problem, you may want to consider multicasting instead.

3.10 - Is Winsock thread-safe?

On modern Windows stacks, yes, it is, within limits.

It is safe, for instance, to have one thread calling send() and another thread calling recv() on a single socket.

By contrast, it’s a bad idea for two threads to both be calling send() on a single socket. This is “thread-safe” in the limited sense that your program shouldn’t crash, and you certainly shouldn’t be able to crash the kernel, which is handling these send() calls. The fact that it is “safe” doesn’t answer key questions about the actual effect of doing this. Which call’s data goes out first on the connection? Does it get interleaved somehow? Don’t do this.

Instead of multiple threads accessing a single socket, you may want to consider setting up a pair of network I/O queues. Then, give one thread sole ownership of the socket: this thread sends data from one I/O queue and enqueues received data on the other. Then other threads can access the queues (with suitable synchronization).
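
One possible shape for such a queue is sketched below, using standard C++ synchronization rather than anything Winsock-specific. The class name and its use of std::string for messages are assumptions made for illustration; a program would create two of these, one for outgoing data and one for received data.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// A minimal thread-safe message queue. MessageQueue is a hypothetical name.
class MessageQueue {
public:
    // Add a message and wake one waiting consumer.
    void push(std::string msg)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(msg));
        }
        ready_.notify_one();
    }

    // Block until a message is available, then remove and return it.
    std::string pop()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        std::string msg = std::move(items_.front());
        items_.pop();
        return msg;
    }

    // Non-blocking variant: returns false if the queue is empty.
    bool try_pop(std::string& out)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> items_;
};
```

The socket-owning thread would typically use try_pop() on the outgoing queue so that it never blocks while there is network I/O to service, while worker threads can use the blocking pop() on the incoming queue.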

Applications that use some kind of non-synchronous socket typically have some I/O queue already. Of particular interest in this case is overlapped I/O or I/O completion ports, because these I/O strategies are also thread-friendly. You can tell Winsock about several OVERLAPPED blocks, and Winsock will finish sending one before it moves on to the next. This means you can keep a chain of these OVERLAPPED blocks, each perhaps added to the chain by a different thread. Each thread can also call WSASend() on the block they added, making your main loop simpler.

3.11 - If two threads in an application call recv() on a socket, will they each get the same data?

No. Winsock does not duplicate data among threads.

3.12 - Is there any way for two threads to be notified when something happens on a socket?

No.

If two threads call WSAAsyncSelect() on a single socket, only the thread that made the last call to WSAAsyncSelect() will receive further notification messages.

If two threads call WSAEventSelect() on a socket, only the event object used in the last call will be signaled when an event occurs on that socket.

You can’t trick Winsock by calling WSAAsyncSelect() on a socket in one thread and WSAEventSelect() on that same socket in another thread. These calls are mutually exclusive for any single socket.

You also cannot reliably call select() on a single socket from two threads and get the same notifications in each, because one thread could clear or cause an event, which would change the events that the other thread sees.

As recommended above, you should give sole ownership of a socket to a single thread, then have it communicate with the other threads.

3.13 - How do I detect if there is an Internet connection?

It is sometimes useful for a Winsock program to only do its thing if the computer is already connected to the Internet.

In the old days, “connected to the Internet” usually meant having an established dial-up networking (DUN) connection. If that’s still the case in your situation, see this example. This doesn't help if the modem is configured to auto-dial: the fact that the DUN connection is down is not a problem, because attempting the connection will bring it up.

Now that analog phone line modems are no longer the primary way to connect to the Internet, what it means to be “connected” has gotten much fuzzier. You can check if there is a LAN or WiFi connection, such as by poking around in the network interface list, but this is a heuristic method at best. Your program can never guess all the different interface types that might be used as an Internet connection, nor rule out all plausible but wrong choices. Even if you do somehow manage to find the interface used for Internet access, you won’t know if there is a break in the Internet connection somewhere down the line without simply trying to connect.

The moral of the story is, rely on the user to know more about their system than your program can guess. If they started your program and the first thing it does is connect to the Internet, assume that is what the user wanted. If it runs in the background and only periodically connects to the Internet, it might still be best to just try blindly. Back in the days of modems, it was often more polite to let the user control the connection themselves, either manually or with a preference to check for a DUN connection, but this is fast becoming part of history.

3.14 - How can I get the local user name?

Use the Win32 function GetUserName(). [C++ Example].

3.15 - I’ve heard that asynchronous sockets are unreliable. Is this true?

Asynchronous sockets are reliable if your program obeys the letter of the Winsock specification.

Every so often, you hear stories about a program that loses asynch notification messages. As far as I can tell, it’s always due to a bug in the complainer’s program, due to misunderstanding Winsock’s parsimonious notification policy.

Consider the FD_WRITE notification. That only gets sent when a client’s connection is accepted by the remote peer, and from then on only when output buffer space becomes available after Winsock gives you a WSAEWOULDBLOCK error. To put it another way, FD_WRITE only gets sent to say, “Before now, it was not okay to write data on this socket; now it’s okay.” The conservative way to handle this is to always try to send data when you have it, whether you’ve received an FD_WRITE or not. You might get a WSAEWOULDBLOCK error, but that’s harmless and easy to handle. Your handler for FD_WRITE then just tries to send everything queued up until it sends it all or gets another WSAEWOULDBLOCK.
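
The conservative pattern described above might look something like the following sketch. To keep it self-contained, the socket is hidden behind a hypothetical SendFn callback that returns the number of bytes accepted, or -1 to mean “would block”; in a real program that callback would wrap send() and translate WSAEWOULDBLOCK. Other errors are not handled here.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a non-blocking send(): returns the number of
// bytes accepted, or -1 to mean "would block" (WSAEWOULDBLOCK in Winsock).
using SendFn = std::function<int(const char* data, std::size_t len)>;

class OutputQueue {
public:
    explicit OutputQueue(SendFn send_fn) : send_(std::move(send_fn)) {}

    // Call whenever the program has data: queue it, then try to send at once
    // rather than waiting for an FD_WRITE notification.
    void write(const std::string& data)
    {
        pending_ += data;
        flush();
    }

    // Call from the FD_WRITE handler as well. Sends until the queue is empty
    // or the send function reports that it would block. Returns true if empty.
    bool flush()
    {
        while (!pending_.empty()) {
            const int sent = send_(pending_.data(), pending_.size());
            if (sent <= 0) break;  // would block: wait for the next FD_WRITE
            pending_.erase(0, static_cast<std::size_t>(sent));
        }
        return pending_.empty();
    }

    std::size_t pending_bytes() const { return pending_.size(); }

private:
    SendFn send_;
    std::string pending_;
};
```

Because write() always attempts to send immediately, the program never depends on an FD_WRITE that Winsock has no reason to post.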

I’ve been using asynchronous sockets almost exclusively for many years now with no problems. Others who’ve been using asynchronous notification for years longer than I have agree. If you believe you’re losing notifications, you have to ask yourself whether it’s more likely that we’ve overlooked a bug in the stack or that there’s a bug in your program.

3.16 - What is the Nagle algorithm?

The Nagle algorithm is an optimization to TCP that makes the stack wait until all data is acknowledged on the connection before it sends more data. The exception is that Nagle will not cause the stack to wait for an ACK if it has enough enqueued data that it can fill a network frame. (Without this exception, the Nagle algorithm would effectively disable TCP’s sliding window algorithm.) For a full description of the Nagle algorithm, see RFC 896.

So, you ask, what’s the purpose of the Nagle algorithm?

The ideal case in networking is that each program always sends a full frame of data with each call to send(). That maximizes the percentage of useful program data in a packet.

The basic TCP and IPv4 headers are 20 bytes each. The worst case protocol overhead percentage, therefore, is 40/41, or 98%. Since the maximum amount of data in an Ethernet frame is 1500 bytes, the best case protocol overhead percentage is 40/1500, less than 3%.
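
The overhead figures above follow from a one-line calculation, sketched here under the same assumption of basic 20-byte TCP and IPv4 headers with no options. The function name is hypothetical.

```cpp
// Fraction of an IPv4/TCP packet taken up by headers, assuming the basic
// 20-byte IPv4 header plus 20-byte TCP header and no options.
double header_overhead_fraction(double payload_bytes)
{
    const double headers = 20.0 + 20.0;
    return headers / (headers + payload_bytes);
}
```

A 1-byte payload gives 40/41, and a 1460-byte payload, which fills a 1500-byte Ethernet frame, gives 40/1500.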

While the Nagle algorithm is causing the stack to wait for data to be ACKed by the remote peer, the local program can make more calls to send(). Because TCP is a stream protocol, it can coalesce the data in those send() calls into a single TCP packet, increasing the percentage of useful data.

Imagine a simple Telnet program: the bulk of a Telnet conversation consists of sending one character, and receiving an echo of that character back from the remote host. Without the Nagle algorithm, this results in TCP’s worst case: one byte of user data wrapped in dozens of bytes of protocol overhead. With the Nagle algorithm enabled, the TCP stack won’t send that one Telnet character out until the previous characters have all been acknowledged. By then, the user may well have typed another character or two, reducing the relative protocol overhead.

This simple optimization interacts with other features of the TCP protocol suite, too:

Most stacks implement the delayed ACK algorithm: this causes the remote stack to delay ACKs under certain circumstances, which allows the local stack a bit of time to “Nagle” some more bytes into a single packet.
The Nagle algorithm tends to improve the percentage of useful data in packets more on slow networks than on fast networks, because ACKs take longer to come back.
TCP allows an ACK packet to also contain data. If the local stack decides it needs to send out an ACK packet and the Nagle algorithm has caused data to build up in the output buffer, the enqueued data will go out along with the ACK packet.

The Nagle algorithm is on by default in Winsock, but it can be turned off on a per-socket basis by enabling the TCP_NODELAY option with setsockopt(). The algorithm should not be turned off except in a very few situations.

Beware of depending on the Nagle algorithm too heavily. send() is a kernel function, so every call to send() takes much more time than for a regular function call. Your application should coalesce its own data as much as is practical to minimize the number of calls to send().

3.17 - When should I turn off the Nagle algorithm?

Almost never.

Inexperienced Winsockers usually try disabling the Nagle algorithm when they are trying to impose some kind of packet scheme on a TCP data stream. That is, they want to be able to send, say, two packets, one 40 bytes and the other 60, and have the receiver get a 40-byte packet followed by a separate 60-byte packet. (With the Nagle algorithm enabled, TCP will often coalesce these two packets into a single 100 byte packet.) Unfortunately, this is futile, for the following reasons:

Even if the sender manages to send its packets individually, the receiving TCP/IP stack may still coalesce the received packets into a single packet. This can happen any time the sender can send data faster than the receiver can deal with it.
Winsock Layered Service Providers (LSPs) may coalesce or fragment stream data, especially LSPs that modify the data as it passes.
Turning off the Nagle algorithm in a client program will not affect the way that the server sends packets, and vice versa.
Routers and other intermediaries on the network can fragment packets, and there is no guarantee of “proper” reassembly with stream protocols.
If a packet arrives that is larger than the available space in the stack’s buffers, it may fragment a packet, queuing up as many bytes as it has buffer space for and discarding the rest. (The remote peer will resend the remaining data later.)
Winsock is not required to give you all the data it has queued on a socket even if your recv() call gave Winsock enough buffer space. It may require several calls to get all the data queued on a socket.

Aside from these problems, disabling the Nagle algorithm almost always causes a program’s throughput to degrade. The only time you should disable the algorithm is when some other consideration, such as packet timing, is more important than throughput.

Often, programs that deal with real-time user input will disable the Nagle algorithm to achieve the snappiest possible response, at the expense of network bandwidth. Two examples are X Window servers and multiplayer network games. In these cases, it is more important that there be as little delay between packets as possible than it is to conserve network bandwidth.

For more on this topic, see the Lame List and the FAQ article How to Use TCP Effectively.

3.18 - What is TCP’s sliding window?

In a naïve implementation of TCP, every packet is immediately acknowledged with an ACK packet. Until the ACK arrives from the receiver (in this naïve implementation, at any rate), the sender does not send another packet. If the ACK does not arrive within some particular time frame, the sending stack retransmits the packet.

The problem with this is that all that waiting limits network throughput drastically. The minimum time between packets with such a scheme must be at least the round trip time for that network: the time to send the packet plus the time for the receiver to send back an ACK. Add in processing time on each end, temporary hardware faults (e.g. Ethernet collisions), retransmissions, routing delays, and who knows what else: the stacks end up spending more time waiting for ACKs than sending data. This is a problem because it means you can’t effectively fill a network pipe with a single socket.

The limit of data throughput over a network link is the maximum amount of data it is possible to have in transit at once divided by the round trip time. Imagine a naive TCP/IP implementation running over a 100BaseT Ethernet. The maximum payload size for TCP over Ethernet is 1460 bytes, and the 100BaseT round trip time is roughly 0.3 ms. 1460 divided by 0.0003 seconds comes out to 4.8 MByte/s. If you’ve done any speed testing on a 100BaseT Ethernet, you know you can hit 6 MByte/s easily, 9 MByte/s with switched Ethernet, and with good hardware and software you can approach the theoretical maximum of 12.5 MByte/s. That’s two to three times the data rate we calculated above. We owe that speed jump to TCP’s “sliding window.”

A sliding window means that the stack can have several unacknowledged packets “in flight” before it stops and waits for the remote peer to acknowledge the first packet. When the TCP connection is established, the stacks tell each other how much buffer space they’ve allocated for this connection: this is the maximum window size. Since each peer knows how big the remote peer’s buffer is and how many unacknowledged bytes it has sent, it will stop sending data when it calculates that the remote peer’s buffer is full. Each peer then sends window size updates in each ACK packet, telling the remote peer that stack buffer space has become available.

Aside: “Why is it called a sliding window,” you ask? Imagine a TCP data stream as a long line of bytes. The sliding window is how the sender sees the receiver’s buffer: as a fixed-size “window” sliding along the stream of bytes. One edge of the window is between the last byte the receiver has read and the next byte to be read, and the other edge is between the last byte in the receiver’s input buffer and the first byte to be sent from the sender’s output buffer. As the receiver reads bytes out of the network buffers, the window slides down the stream; any time it slides into the sender’s buffer, the sender sends more data to fill up the window.

In modern Winsock stacks, the default sliding window is at least 8 KB. (You can change it in the registry: KB120642) That means that if it sends 8 KB of data without receiving an acknowledgement for the first packet, the stack won’t send any more data until the first packet is acknowledged or the retry timer goes off, at which point it will try to send the first packet again. As each packet at the front of the window gets acknowledged, the 8 KB window slides along the data stream, allowing the stack to send more data.

Dividing Microsoft’s 8 KB value by 0.0003 seconds gives about 26 MByte/s, which means you hit the medium’s maximum data rate (~12 MByte/s) before you hit the limit imposed by the round trip time.

Some networks have long round trip times which require large TCP windows if your application needs to be able to fill the entire pipe with a single TCP stream. Satellite systems are the most common example of this: the minimum round trip time we see on our satellite Internet connection at work is about 600 ms! Some DSL systems have pretty long round trip times, too, though not nearly as bad as satellite systems. You need to run the numbers to find out what the situation is for your system.

For what it’s worth, typical modem round trip times are in the 100-250 ms range. Calculating for 250 ms comes out to 32 KB/s, about five times the data rate of the fastest modem connections you’re likely to see. In other words, an 8 KB window is plenty large for modems, despite the long round trip times.
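
The throughput figures in this item all come from the same division: the amount of data that can be in flight at once over the round trip time. A minimal sketch, with a hypothetical function name:

```cpp
// Upper bound on single-connection throughput, in bytes per second:
// at most one window of data can be in flight per round trip.
double window_limited_throughput(double window_bytes, double rtt_seconds)
{
    return window_bytes / rtt_seconds;
}
```

Passing 1460 bytes and 0.0003 seconds reproduces the naïve one-packet-at-a-time case, 8192 bytes and 0.0003 seconds the 100BaseT case, and 8192 bytes and 0.25 seconds the modem case.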

See the next two items for related discussion.

3.19 - What is the silly window syndrome?

The silly window syndrome results when the sender can send data faster than the receiver can handle it, and the receiver calls recv() with very small buffer sizes.

The fast sender will quickly fill the receiver’s TCP window. The receiver then reads N bytes, N being a relatively small number compared to the network frame size. A naïve stack will immediately send an ACK to the sender to tell it that there are now N bytes available in its TCP window. This will cause the sender to send N bytes of data; since N is smaller than the frame size, there’s relatively more protocol overhead in the packet compared to a full frame. Because the receiver is slow, the TCP window stays very small, and thus hurts throughput because the ratio of protocol overhead to application data goes up.

The solution to this problem is the delayed ACK algorithm. This causes the window advertisement ACK to be delayed a bit, hopefully allowing the slow receiver to read more of the enqueued data before the ACK goes out. This results in a larger window advertisement, so the fast sender can send more data in a single frame.

Note that the delayed-ACK solution doesn’t mean your program can safely use small recv() buffers. You should still read as much as is reasonable in a single call, if only to minimize the number of context switches between kernel and user space.

3.20 - What is the delayed ACK algorithm?

In a simpleminded implementation of TCP, every data packet that comes in is immediately acknowledged with an ACK packet. (ACKs help to provide the reliability TCP promises.)

In modern stacks, ACKs are delayed for a short time (up to 200 ms, typically) for several reasons:

to avoid the silly window syndrome
to allow ACKs to piggyback on a reply frame if one is ready to go when the stack decides to do the ACK
to allow the stack to send one ACK for several frames, if those frames arrive within the delay period.

The stack is only allowed to delay ACKs for up to 2 frames of data.

3.21 - What platform should I deploy my server on?

Assuming that you’ve decided to use Windows, your only real choice for handling high loads is one of the Server class versions of Windows.

It is known that Microsoft’s modern home and workstation class operating systems are based on the same underlying kernel as the server class ones. There is thus no technical reason these less expensive operating systems should be less capable in kernel-controlled tasks like network communication.

Unfortunately, in order to segment the market, Microsoft has designed these operating systems so that the non-server kernels cripple themselves at startup time with respect to the equivalent capabilities offered by the server operating systems. This behavior was clear back in the WinNT and Win2K days when the workstation and server OSes came out together, but it’s still the case these days with the releases more apparently separate. Windows Server 2003 and Windows XP share a similar underlying kernel, as do Windows Server 2008 and Windows Vista. There may be differences due simply to progress — Win2K3 came out a few years after XP — but the main reason for capability differences is simply market segmentation.

The most important difference is that the connection backlog on the workstation-class OSes is limited to 5 slots. This means that your program has to call accept() fast enough that not more than 5 connections build up in the network stack’s connection backlog. The stack rejects new connections as long as the queue is full. For a well-written server, this is not normally a problem, but it does mean that a concerted attack (a SYN flood, for example) can fill the queue, denying service to legitimate users. The server-class OSes have much higher connection backlog limits and also have features specifically designed to minimize the impact of a SYN attack.

A less important difference from a practical standpoint is that the EULA for Microsoft’s workstation-class operating systems prohibits running a program that handles more than 10 connections concurrently. I don’t know of any recent version of Windows that enforces this limit in the kernel.