Problem : HTTP and SMTP traffic truncating at 5k / Solution : read on

So, with the generous help of my father and his colleague Dr. Bell, the problem that has been plaguing us has been solved. Here is a summary.

The machine in question

HP LP 1000 R rack mountable server

The operating system

Red Hat Fedora Core 2

The symptoms

From the machine in question, if I try to pull down web pages off of the internet over HTTP, the communication stops about 5k bytes in. For example, if I ran "curl http://www.yahoo.com/" from the machine in question, the first 5026 bytes of the page would come down just fine, and then the transfer would stop. This would happen for probably 90% of the sites out there, though for some it wouldn't: when pulling down http://www.cnn.com/, the whole 28 KB page would come down just fine.

The other curious thing is that any given page would always fail at the exact same point. The point differed from page to page (for example, yahoo would fail at 5026 bytes, slashdot at 5122 bytes, etc.), but every single time I pulled a given page, it would fail in the same place.

This wasn't happening when this machine was on a different subnet and behind a different firewall in our network. Additionally, I've got another server, on the same subnet as the machine in question, on which I can do the same thing (pull down www.yahoo.com) and it works fine. The differences between the servers are 1) the hardware and 2) the functioning server is running Red Hat 7.3. I also found out that inbound SMTP traffic would stop in the same way. SSH traffic and FTP traffic seemed to be unaffected.
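For anyone wanting to check whether they are seeing the same symptom, here is a minimal Python sketch that counts how many body bytes arrive before a transfer stalls. The function names, the chunk size, and the 10 second timeout are my own assumptions for illustration, not something from the original debugging session. Note that this counts HTTP body bytes only, so the numbers may not line up exactly with byte counts taken from curl or a packet capture.

```python
import io
import socket
import urllib.request


def count_bytes_until_stall(stream, chunk_size=512):
    """Read from a file-like stream until EOF or a timeout.

    Returns (bytes_read, stalled), where stalled is True if the read
    timed out before the stream reported end-of-file.
    """
    total = 0
    while True:
        try:
            chunk = stream.read(chunk_size)
        except (socket.timeout, TimeoutError):
            # No data arrived within the timeout: treat it as a stall.
            return total, True
        if not chunk:
            # Clean end of stream.
            return total, False
        total += len(chunk)


def probe(url, timeout=10.0):
    """Fetch url and report how far the transfer got (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return count_bytes_until_stall(response)


# Offline sanity check with an in-memory stream, no network needed.
print(count_bytes_until_stall(io.BytesIO(b"x" * 5026)))
```

Running probe() against the same URL several times and getting the same byte count with stalled set to True each time would match the "always fails in the same place" behaviour described above.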

The process

After I did a tcpdump of the conversation (see below) and my colleagues analyzed it, they found something interesting. The available _window size_ that the machine in question was advertising was very small.

The problem

In the opening SYN packet from the machine in question, a "wscale" of 7 is advertised. Yahoo never sees a scale that large, but the option is still active, so the machine in question thinks its initial request was accepted. The reason that this happens is best explained by the r0x0r authors over at lwn.net. In their article, "TCP window scaling and broken routers", the details of the why and how of this problem are explained. The basic series of variables that have caused this are 1) current Linux kernels set the window scale factor to 7 instead of 0 and 2) somewhere in my provider's upstream provider there is an old router that's zeroing out the window scale value, but leaving window scaling set to on. Read the article at lwn.net, it's well written and explains exactly what's going on.
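To make the mismatch concrete, here is a small Python sketch of the arithmetic. The scale factor of 7 comes from the capture described above; the 32768 byte receive window is a made-up number chosen only to keep the math easy, and the function names are my own. With window scaling, the 16-bit window field in each TCP header carries the real window shifted right by the sender's scale factor, and the other side is expected to shift it back left by the same amount.

```python
def advertised_field(real_window_bytes, sender_shift):
    """Value placed in the 16-bit TCP window field for a given scale shift."""
    return min(real_window_bytes >> sender_shift, 0xFFFF)


def interpreted_window(field, assumed_shift):
    """Window in bytes as understood by a peer assuming the given shift."""
    return field << assumed_shift


REAL_WINDOW = 32768   # hypothetical receive window, in bytes
LOCAL_SHIFT = 7       # wscale advertised by the machine in question
MANGLED_SHIFT = 0     # what the remote end sees after the router zeroes it

field = advertised_field(REAL_WINDOW, LOCAL_SHIFT)
intended = interpreted_window(field, LOCAL_SHIFT)
seen_by_peer = interpreted_window(field, MANGLED_SHIFT)

print(field, intended, seen_by_peer)  # 256 32768 256
```

So the machine in question believes it is offering 32768 bytes of window, while the remote server believes it is being offered 256 bytes, a factor of 128 too small. Since the window field in the SYN itself is never scaled (per the TCP window scale specification), a mismatch like this would only show up after the handshake, which seems consistent with transfers starting normally and then stalling.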
