Desperation!
Monday, 22 January, 2007, 21:53 PST

Beginning about 9 January, 2007, the TNOS server here (ALWGW) went on a crashing spree. TNOS' uptime has been anywhere from a few seconds to a few hours, and then it receives a segmentation violation signal (SIGSEGV) from the operating system and restarts. The log file simply shows a rather unhelpful, "A 'crashprotect'-preventable restart occurred from a SIGSEGV signal." Why? And is there any hope of correcting the problem? The answer to the second question is an unqualified, "Maybe!"
I've been running TNOS since 1998. Over the years, I've had quite a tolerate-hate relationship with it. It's rich on features, but poor on stability -- at least from a Linux administrator's point of view. However, I have usually been able to tolerate the application crashes when they occur, because ordinarily ALWGW will run for two or three weeks between restarts. In an amateur radio environment, this is generally tolerable. But the recent outbreak went way too far.
From past experience, I suspected that I was dealing with one of two likely culprits causing the TNOS crashes. First and most probable, a corrupted message was being received by the SMTP server, and it was dying trying to process it. When I say "corrupted," this also includes some messages with characters or a character set that TNOS doesn't like, but I've never been able to put my finger on "what" exactly it doesn't like. The second typical crash culprit is a corrupted file, typically a configuration file or mail control file. Error checking is not one of TNOS' strong suits.
Okay, so in this most recent spree, I went through my usual steps to correct the problem. No mail was queued up on my Postfix mail server, nor was there anything in TNOS' incoming queue. I examined numerous files, finding some small problems here and there, and correcting said problems, but the corrections never fixed the big problem. I deleted the spool/history file, cleaned up the users.dat file, worked with the white pages files, scrutinized everything I could think of, and it wasn't making a difference. What I desperately needed was some respectable logging information or a memory core dump, so I'd know more about the cause of the crashes. How could I coax a core file out of it?
I knew there was a "dump core" option in makefile.unx in the TNOS source code that could be enabled at compile-time, so I tweaked the Makefile and compiled a new binary. Or rather, I tried. It seems that some of the old code in TNOS 2.40 is not completely compatible with the gcc and libraries that are on Slackware 10.2. So, I scrounged up a libc5 package from an old Slackware version and installed it. I also edited the Makefile so it would compile without the barrage of compiler warnings (removing -Wid-clash-31 from WARNINGS and adding -fno-builtin-log to DEBUG). TNOS 2.40 then compiled without too many objections, so I put it into service and looked eagerly for my core file when it crashed. Not there. Dang!
After searching around on the 'Net, I learned that many modern Linux distributions (and probably most modern *nix variants) no longer dump core by default. Core dumping is controlled by a per-process resource limit, which is set from the command shell with the "ulimit" command. I added ulimit -c unlimited to my TNOS startup script, and sure enough, a core file appeared.
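For anyone wanting to do the same, here is a minimal sketch of what the top of such a startup script might look like. The function name enable_core_dumps and the commented-out binary path are my own placeholders, not anything shipped with TNOS. The fallback to the hard limit is there because a non-root user can't raise the limit beyond it.

```shell
#!/bin/sh
# Sketch of a startup-script preamble that turns on core dumps.
# enable_core_dumps is a hypothetical helper name, not part of TNOS.

enable_core_dumps() {
    # Try to lift the core-size limit entirely; if the hard limit forbids
    # that, fall back to whatever the hard limit allows.
    ulimit -c unlimited 2>/dev/null || ulimit -c "$(ulimit -H -c)" 2>/dev/null
    # Report the limit now in effect (a block count, or "unlimited").
    echo "core file size limit: $(ulimit -c)"
}

enable_core_dumps

# The limit is inherited by child processes, so it must be set in the
# same shell that launches the server. Hypothetical launch line:
# exec /path/to/tnos
```

If the reported limit is 0, no core file will be written no matter how the binary was compiled.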
I examined the core file with the GNU debugger gdb, and did a backtrace on it. The crashes were mostly occurring after a strcmp() was taking place in aprs.c, and this was following the receipt of APRS packets on various AX.25 interfaces! At this point, I figured that I better compile in the newest NOSaprs extensions to TNOS, by VE4KLM. I did this, and the problems continued. Considering that I had never configured the system to run APRS, I thought maybe I should, so I edited autoexec.nos and inserted the appropriate lines for a basic APRS setup. What do you know? It was still crashing, but the crashes and core files changed! Now I was getting a variety of SIGSEGVs and SIGABRTs and they were not always from the same TNOS processes. Generally, it looked like there were various memory bugs cropping up. How on earth to fix this without completely rewriting TNOS?
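Getting a backtrace out of a core file can be done non-interactively. The sketch below wraps gdb's batch mode in a small function. The name backtrace_core and the example paths are hypothetical, and the backtrace is only readable if the binary was built with debugging symbols (-g).

```shell
#!/bin/sh
# Sketch: print a backtrace from a core file without an interactive session.
# backtrace_core is a hypothetical helper name.

backtrace_core() {
    bin=$1     # the executable that crashed
    core=$2    # the core file it left behind
    [ -r "$bin" ] && [ -r "$core" ] || {
        echo "usage: backtrace_core <binary> <core>" >&2
        return 2
    }
    # -batch exits after running the commands; -ex bt runs "backtrace".
    gdb -batch -ex bt "$bin" "$core"
}

# Hypothetical usage:
# backtrace_core /path/to/tnos /path/to/core
```

The binary passed to gdb needs to be the exact one that produced the core, otherwise the addresses won't line up with the symbols.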
I finally decided that maybe the answer was to compile a statically-linked TNOS binary in an old Slackware environment. After all, TNOS 2.40 is basically ancient code, and there have been many changes to gcc and libc (glibc) over the past decade or so. Not being anywhere near the server, and not having any other machines with an old Slackware version installed, I schemed about placing an old version somewhere on the filesystem and chrooting into it for a step back into history. I happened to still have an old backup of my entire Slackware 8.1 installation located on the server, so I copied it into /home/slackware-8.1, inserted PS1='(chroot) '$PS1 into /home/slackware-8.1/etc/profile, and executed the following commands:
mount -t proc none /home/slackware-8.1/proc
chroot /home/slackware-8.1 /bin/bash --login
Voila! I had a shell prompt from mostly inside a Slackware 8.1 environment.
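The two commands above can be wrapped up so that /proc gets unmounted again when the chroot shell exits. This is only a sketch: enter_chroot is a name I made up, it has to be run as root, and the umount step is an addition of mine rather than something from the original procedure.

```shell
#!/bin/sh
# Sketch: enter an old-distribution chroot and clean up afterwards.
# enter_chroot is a hypothetical helper; run it as root.

enter_chroot() {
    root=$1
    # Refuse to continue if the tree (or its /proc mount point) is missing.
    [ -d "$root/proc" ] || {
        echo "no usable chroot tree at: $root" >&2
        return 1
    }
    mount -t proc none "$root/proc" || return 1
    # Blocks here until the login shell inside the chroot exits.
    chroot "$root" /bin/bash --login
    # Undo the mount so the tree can be moved or deleted later.
    umount "$root/proc"
}

# Usage, with the path from this article:
# enter_chroot /home/slackware-8.1
```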
From there, I updated the TNOS 2.40 source and Makefile as needed, and compiled the binary with Slackware 8.1's gcc 2.95.3 and glibc 2.2.5. Best I can tell, it seemed to have worked. The compiler was nice and quiet, and I copied the resulting binary back over to the main filesystem and started running it.
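Before copying a binary out of the chroot, it is worth confirming that it really is statically linked, since a dynamically linked one would pick up the host's newer libraries again. A rough check, assuming the file utility is installed (is_static is a hypothetical helper name):

```shell
#!/bin/sh
# Sketch: check whether an executable is statically linked.
# is_static is a hypothetical helper; it relies on the "file" utility.

is_static() {
    bin=$1
    [ -f "$bin" ] || {
        echo "no such file: $bin" >&2
        return 2
    }
    # For ELF executables, file(1) reports "statically linked"
    # or "dynamically linked" in its description.
    if file "$bin" | grep -q 'statically linked'; then
        echo "$bin: statically linked"
    else
        echo "$bin: not reported as statically linked"
        return 1
    fi
}

# Hypothetical usage:
# is_static /path/to/tnos
```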
As I write this, that all occurred almost 4 hours ago and so far, no crashes (knock on the wooden bit bucket). I really don't know if the problem is solved, but I should be able to analyze core files if the crashes continue to give me grief. I did want to document this before the cobwebs of time obscured the procedure that I took to get here. Hopefully it will benefit you, too, in case this actually does work.