Long-term mystery SOLVED!
Tuesday, 06 February, 2007, 01:23 PST

For a few years, I have been having an intermittent problem with certain, infrequent Internet SMTP messages inducing TNOS crashes on ALWGW. The occurrences have caused more than one headache, but I am glad to say that I now have a solution for it!

Symptoms

Here's what happens. A message for one of my users arrives, is processed and queued by Postfix, and then delivery is attempted to TNOS. TNOS actually receives the message and writes it to the user's mail file on the system, but receives a SIGSEGV from the operating system before it completes the SMTP transaction. Consequently, Postfix keeps the message in its queue, and TNOS restarts. After the timeout period expires, Postfix again attempts to deliver the message to TNOS, and the process repeats itself: every time delivery of the message is attempted, TNOS crashes. Each time, an additional copy of the message is written to the user's mail area before TNOS dies.

Interestingly, when listing these messages, each one has the recipient's address and the message size in the listing, but no sender, date, or subject (even though these appear in the header). I didn't realize that this was significant until later. In the past, I've looked at the offending messages with Postfix's postcat utility, but never noticed anything that caught my eye.

Diagnosis

Due to an assortment of crashing problems last month, I was finally able to get core dumps out of this latest batch of segmentation violation crashes. This was excellent news! Running gdb on the core file, I found that strcmp was being called from within TNOS' reject.c code (line 149) with one of its string arguments pointing to memory location 0x0. When TNOS attempted the strcmp at address 0x0, the Linux kernel promptly gave it a SIGSEGV, hence the crash.

Since this was occurring in reject.c, would the offending message pass if "pbbs reject" was turned off? Since I had two offending messages to work with this morning, I gave TNOS the "pbbs reject off" command and then released ("unheld" with Postfix's postsuper) one of the messages causing the crashes. Sure enough, it was delivered without causing TNOS to crash. Progress!
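
The failure gdb pointed to, a strcmp on a pointer holding 0x0, can be illustrated and guarded against in a few lines of C. This is only a sketch under assumptions: split_fields and token_equals are hypothetical names, not functions from TNOS' reject.c, and the real parsing code may look quite different.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tokenizer (not TNOS code): split `line` in place on spaces
 * and tabs and store up to `max` token pointers in `fields`. Unused slots
 * are left as NULL, which is one way a caller can end up holding a 0x0
 * pointer for a field that never appeared on the line.
 * Returns the number of tokens found. */
size_t split_fields(char *line, char *fields[], size_t max)
{
    size_t n = 0;
    char *tok;

    for (size_t i = 0; i < max; i++) {
        fields[i] = NULL;
    }
    tok = strtok(line, " \t");
    while (tok != NULL && n < max) {
        fields[n++] = tok;
        tok = strtok(NULL, " \t");
    }
    return n;
}

/* Hypothetical guard: treat a missing (NULL) token as "no match" instead
 * of handing it to strcmp(), which would dereference address 0x0. */
int token_equals(const char *token, const char *keyword)
{
    if (token == NULL || keyword == NULL) {
        return 0;   /* nothing to compare, so no match */
    }
    return strcmp(token, keyword) == 0;
}
```

With a line that holds only one token, fields[1] stays NULL. Calling strcmp(fields[1], "something") directly would segfault on Linux, while token_equals(fields[1], "something") simply returns 0.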

However, I noticed that the message delivered looked the same on the filesystem as the others described above, that is, the message listing still left off the sender, date, and subject. Now, why was pointer variable cmd[1] incorrectly pointing to address 0x0 instead of a location with a NUL character or something equally valid? As I mulled this one over, and examined all of the memory contents that I could think of, I decided to review the messages again, and compare them with other similar messages that had not caused delivery problems. It was then that I noticed a line in the message header that looked like this:

    User-Agent: KMail/1.9.1
    X-Face: (?K`WPum>k,$xD:^5lco~&[g7t2C%Q5tO@~cnea''dNhA2\bd"=?iso-8859-15?q?6=5D=7DHnvlOcT= \
    3F+=5C/=3B=25TT=7DU=0A=090jAxk-?="Wt>*Xora^<,'Eykz^Ary#B"b`7TI7*Qf(-ooWi!c([h$y19t \
    (=?iso-8859-15?q?=7DAx=5C=3DVlDK=25=25w=5Cl=0A=09=5Ey/I=5F/rn=26lR?=(t#;q>sPRN9dwE|ZStatus: R %sml&so".vsD%^>Ce7`+^t*tx*dP"=?iso-8859-15?q?=7B8rMQHP+9=7D!54M=0A=09ndN=5CPCUl=3F3=5CQ=5BUSU=5B?=)GY:feNS-m
    Message-Id: <200702052247.38968.xxxxx@xxx.xxx.xx>

*Note: Most of it was all on one line, but the formatting messed up the readability of this page, so I added the carriage returns where the backslashes are located. Also, the Message-Id was doctored to protect the innocent.

Hmm, X-Face? I did a search on X-Face, and learned this. I also noticed that the recipient address appeared above the X-Face line, but the sender, date, and subject all appeared below the X-Face line. I now suspected this was significant, and decided to test my hypothesis.

Solution

(NOTE: While the solution below does in fact work, the SMTP crashing problem is now known to occur with other header lines, as documented here, where you will find a new workaround that should take care of them all.
--Updated 10 February, 2007)

In my Postfix header_checks file, I added the following regular expression line:

    /^X-Face:/ IGNORE

With this instruction, when Postfix encounters a line in the header that begins with "X-Face:", it silently deletes that line as it continues to process the message.

I then turned TNOS' "pbbs reject" back on, requeued the last held message using postsuper -r <queue_id>, and sure enough, it delivered as perfectly as could be! The resulting message listing had all of the correct information: recipient, sender, date, size, and subject. An examination of the delivered message also showed that the X-Face line had been stripped from the header as expected.

Ultimately, reject.c and probably smtpserv.c need to be tweaked so that they gracefully handle unexpected surprises without performing illegal instructions or attempting to access prohibited memory locations. Some checks should be done to prevent a pointer being assigned 0x0. For now, however, this works for me.

In a future article, I will document another observation that I made while chasing these pesky TNOS gremlins and analyzing the fallout. Stay tuned!

Note: The information in this article is still relevant, but the solution is superseded by More TNOS SMTP crashes and a workaround.
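
To make the effect of the /^X-Face:/ IGNORE rule concrete, here is a rough C sketch of dropping one header from a list of header lines. It is not Postfix code: drop_header is a hypothetical helper, and the sketch assumes that a folded header continues on following lines that start with a space or tab, and that the header name is matched case-insensitively (which I believe is the default for Postfix regexp tables).

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Illustrative only, not how Postfix is implemented. Copies header lines
 * from `in` to `out`, omitting any header whose name matches `name`
 * (for example "X-Face:", colon included) together with its folded
 * continuation lines. `out` must have room for `n_in` pointers.
 * Returns the number of lines kept. */
size_t drop_header(const char *in[], size_t n_in,
                   const char *out[], const char *name)
{
    size_t kept = 0;
    size_t name_len = strlen(name);
    int skipping = 0;   /* nonzero while inside the header being dropped */

    for (size_t i = 0; i < n_in; i++) {
        const char *line = in[i];

        if (line[0] == ' ' || line[0] == '\t') {
            /* Continuation line: shares the fate of the header it extends. */
            if (!skipping) {
                out[kept++] = line;
            }
            continue;
        }
        /* A new header starts here; decide whether it is the one to drop. */
        skipping = (strncasecmp(line, name, name_len) == 0);
        if (!skipping) {
            out[kept++] = line;
        }
    }
    return kept;
}
```

Given a header with To, User-Agent, a folded X-Face, Message-Id, and Subject lines, calling drop_header with "X-Face:" keeps the other four in their original order, which matches what I saw in the delivered message: everything intact except the X-Face line.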