Shlomif's Technical Posts Community

Follow-up and New Summary of my Linux Networking Problem [Jan. 23rd, 2008|10:26 pm]
shlomif_tech

[shlomif]
[Current Location |Home]
[Current Mood |tired]
[Current Music |Yehuda Poliker - Lo Yodea]

As a follow-up to my previous post about a networking problem I've encountered on my Linux box, I'd like to re-summarise all the currently available information.

Someone suggested the problem may be caused by a bad Ethernet card, so I borrowed an Intel Ethernet card, replaced my Ethernet card with it, and tried it for a few days. After the computer had been on for a while, it started exhibiting problems similar to those with the old card: no connectivity to the host of www.shlomifish.org, bad connectivity to Google, etc. These are solved only by a reboot.

I've uploaded Wireshark pcap dumps of a "lynx http://www.shlomifish.org/" command from both cards before and after the problem is exhibited.

So here's what I know now:

  1. The symptom is that after the computer is on for a while (two days or so), one cannot connect using TCP to some hosts, and the connection times out.
  2. The IP that causes the problems is 212.143.218.31, but it also affects www.google.com and possibly other hosts.
  3. It is exhibited by at least kernels 2.6.23, 2.6.24-rc1, 2.6.24-rc2 and 2.6.24-rc8.
  4. A different computer on the same home LAN, connected via a NAT/router, has no problem with that IP at the same time that the machine running Linux exhibits the problem.
  5. On one occasion when this happened, I could eventually connect to port 80 using telnet, but it took an awfully long time.
  6. I have problems with HTTP (port 80), POP and SSH.
  7. Restarting the network ("/etc/init.d/network restart") does not help - only a reboot.
  8. Some other hosts on the network, like Yahoo, work fine.
  9. Doing "echo 0 > /proc/sys/net/ipv4/tcp_window_scaling" after the problem appeared did not solve it, even after waiting at least 30 minutes.
  10. The problem is exhibited by both a RealTek card I have and an Intel Ethernet card. (both 100 Mbps).
  11. pcap dumps of an HTTP connection to the offending site before and after the problem occurs are available for both cards.
  12. I can't ping www.shlomifish.org, because it doesn't answer pings at all, but when the problem occurs again, I can try pinging www.google.com and see what happens.
  13. iptables is completely off:

    [root@telaviv1 ~]# iptables -L
    Chain INPUT (policy ACCEPT)
    target     prot opt source               destination
    
    Chain FORWARD (policy ACCEPT)
    target     prot opt source               destination
    
    Chain OUTPUT (policy ACCEPT)
    target     prot opt source               destination
    
  14. My household is connected to the Internet through a Sweex NAT/Router that doesn't have any updates yet.
  15. Quoting srlamb:

    The difference in these two packet captures seems to be that in the "bad" case, the www.shlomifish.org->you packets have a bad checksum and are ignored (see it retrying the SYN after it's already gotten a SYN/ACK?). I don't know why this would be, but it seems worthy of mention to the mailing lists you are asking for help.
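To make the checksum observation easier to check by hand, here is a minimal Python sketch of how a TCP checksum is computed: the standard 16-bit one's-complement sum over a pseudo-header plus the TCP segment. The function names are made up for illustration, the local address 192.168.1.2 and the header field values are hypothetical, and the segment bytes would have to be copied out of the pcap dumps manually.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum of data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold the carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum of a TCP segment whose checksum field (bytes 16-17) is zero."""
    pseudo_header = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, socket.IPPROTO_TCP, len(segment))
    )
    return internet_checksum(pseudo_header + segment)


def tcp_checksum_ok(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """True if the checksum stored in the segment matches a recomputed one."""
    stored = struct.unpack("!H", segment[16:18])[0]
    zeroed = segment[:16] + b"\x00\x00" + segment[18:]
    return stored == tcp_checksum(src_ip, dst_ip, zeroed)


if __name__ == "__main__":
    # Hypothetical 20-byte SYN/ACK header: ports, seq, ack, offset, flags,
    # window, checksum (zero for now), urgent pointer.
    header = struct.pack("!HHIIBBHHH", 80, 40000, 1, 2, 5 << 4, 0x12, 5840, 0, 0)
    src, dst = "212.143.218.31", "192.168.1.2"  # dst is an assumed LAN address
    good = tcp_checksum(src, dst, header)
    segment = header[:16] + struct.pack("!H", good) + header[18:]
    print(tcp_checksum_ok(src, dst, segment))
    # A checksum that is off by one is rejected.
    off_by_one = header[:16] + struct.pack("!H", (good - 1) & 0xFFFF) + header[18:]
    print(tcp_checksum_ok(src, dst, off_by_one))
```

A receiver that finds a mismatch silently drops the segment, which would be consistent with the SYN being retried after a SYN/ACK had already arrived.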

Some extra discussion can be found at the previous entry.
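
The connection symptom from items 1 and 5 can be timed without a browser. The following is a rough Python sketch, not the method I actually used (that was lynx and telnet); the function name probe_tcp and the 10-second default timeout are arbitrary choices made for illustration.

```python
import socket
import time


def probe_tcp(host: str, port: int, timeout: float = 10.0):
    """Try one TCP connection; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, refusals and DNS failures
        return False, time.monotonic() - start, str(exc) or type(exc).__name__
```

Calling probe_tcp("www.shlomifish.org", 80) and probe_tcp("www.yahoo.com", 80) before and after the problem appears would show whether only some hosts time out, and how long a slow connection takes to be established.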


Comments:
From: (Anonymous)
2008-02-04 07:08 am (UTC)
Try ss -s and see if you have tons of timewaiters.
From: (Anonymous)
2008-02-04 07:17 am (UTC)
Had a look at your dumps, and the problem has little to do with timewaiters. What is that mysterious 212.143.218.31, and why does it chop one off the TCP checksum value?
From: shlomif
2008-02-04 07:38 am (UTC)

212.143.218.31 is www.shlomifish.org

<<<<<<<<<<<<
shlomi:~$ host www.shlomifish.org
www.shlomifish.org is an alias for gilgamesh.eonspace.net.
gilgamesh.eonspace.net has address 212.143.218.31
>>>>>>>>>>>>

The dumps are generated by recording the output of "lynx http://www.shlomifish.org/".

I don't know why it chops one off the TCP checksum value, but like I said, Google exhibits a similar phenomenon.
From: shlomif
2008-02-06 02:11 pm (UTC)

Update: Can Ping www.google.com

As an update to this message, I'd like to note that I can ping www.google.com after the problem is exhibited. Here is the output of the ping command:

shlomi:~$ ping www.google.com
PING www.l.google.com (209.85.129.104) 56(84) bytes of data.
64 bytes from fk-in-f104.google.com (209.85.129.104): icmp_seq=1 ttl=246 time=72.7 ms
64 bytes from fk-in-f104.google.com (209.85.129.104): icmp_seq=2 ttl=246 time=72.3 ms
64 bytes from fk-in-f104.google.com (209.85.129.104): icmp_seq=3 ttl=246 time=73.1 ms
64 bytes from fk-in-f104.google.com (209.85.129.104): icmp_seq=4 ttl=246 time=72.1 ms
64 bytes from fk-in-f104.google.com (209.85.129.104): icmp_seq=5 ttl=246 time=72.9 ms
64 bytes from fk-in-f104.google.com (209.85.129.104): icmp_seq=6 ttl=246 time=71.8 ms
64 bytes from fk-in-f104.google.com (209.85.129.104): icmp_seq=7 ttl=246 time=73.4 ms
64 bytes from fk-in-f104.google.com (209.85.129.104): icmp_seq=8 ttl=246 time=72.5 ms
64 bytes from fk-in-f104.google.com (209.85.129.104): icmp_seq=9 ttl=246 time=71.8 ms
64 bytes from fk-in-f104.google.com (209.85.129.104): icmp_seq=10 ttl=246 time=71.8 ms
64 bytes from fk-in-f104.google.com (209.85.129.104): icmp_seq=11 ttl=246 time=72.5 ms
64 bytes from fk-in-f104.google.com (209.85.129.104): icmp_seq=12 ttl=246 time=73.4 ms
64 bytes from fk-in-f104.google.com (209.85.129.104): icmp_seq=13 ttl=246 time=73.2 ms

--- www.l.google.com ping statistics ---
14 packets transmitted, 13 received, 7% packet loss, time 17092ms
rtt min/avg/max/mdev = 71.810/72.619/73.416/0.587 ms
shlomi:~$                                          
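
As a sanity check on the summary lines, the statistics can be recomputed from the individual replies. This is a small Python sketch; it assumes that ping's "mdev" is the population standard deviation of the round-trip times, and because the per-packet times are printed rounded to 0.1 ms, the average and deviation it produces will only approximately match ping's own figures.

```python
import math

# Round-trip times in ms, copied from the 13 replies in the ping output.
rtts = [72.7, 72.3, 73.1, 72.1, 72.9, 71.8, 73.4,
        72.5, 71.8, 71.8, 72.5, 73.4, 73.2]
transmitted = 14  # from the "14 packets transmitted" summary line


def summarise(rtts, transmitted):
    """Return (loss_percent, min, avg, max, mdev) for the given replies."""
    received = len(rtts)
    # Integer percentage, truncated, as in the summary line.
    loss_percent = 100 * (transmitted - received) // transmitted
    avg = sum(rtts) / received
    # Population standard deviation; assumed to correspond to ping's "mdev".
    mdev = math.sqrt(sum(r * r for r in rtts) / received - avg * avg)
    return loss_percent, min(rtts), avg, max(rtts), mdev


if __name__ == "__main__":
    loss, lo, avg, hi, mdev = summarise(rtts, transmitted)
    print(f"{loss}% packet loss")
    print(f"rtt min/avg/max/mdev = {lo:.3f}/{avg:.3f}/{hi:.3f}/{mdev:.3f} ms")
```

The point of the exercise is that ICMP round trips stay steady at roughly 72 ms with a single lost packet, so the trouble appears to be specific to TCP rather than to general reachability.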

From: shlomif
2008-08-23 07:09 pm (UTC)

Update - setting the MTU

I should note that after taking someone's advice and lowering the MTU of both the router and the Linux machine to 1394, the problem seems to have become worse. Now when the problem occurs, I have trouble accessing Wikipedia and other Wikimedia sites, as well as most other sites, and so am almost completely neutralised.

I have now been told that Sweex products are bad, so this may be part of the problem.
