Consistent TNOS errors!
Wednesday, 24 January, 2007, 00:08 PST

I think I finally have something that I've been seeking for a while.
I am now getting consistent errors as TNOS shuts down!
The problem appears to occur when a reply to a DNS query is
received and the resource record exceeds the buffer size
allocated for it in TNOS' domain.c (512 bytes, plus some margin for
overhead).
This 512-byte limit is defined in RFC 1035,
as mentioned in domain.c's source code.
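For reference, RFC 1035 limits a DNS message carried over UDP to 512 bytes, and a sender that has to cut a reply down to fit sets the TC (truncation) bit in the header. A minimal sketch of that check in C follows. The helper dns_udp_msg_ok and the macro names are made up for illustration and do not come from TNOS.

```c
#include <assert.h>
#include <stddef.h>

#define DNS_UDP_MAX 512   /* RFC 1035 limit for a DNS message over UDP */
#define DNS_HDR_LEN 12    /* fixed size of the DNS header */
#define DNS_FLAG_TC 0x02  /* truncation bit, in the third header byte */

/* Hypothetical helper (not TNOS code): returns 1 if the message fits the
   RFC 1035 UDP limit and is not flagged as truncated, otherwise 0. */
static int dns_udp_msg_ok(const unsigned char *msg, size_t len)
{
    assert(msg != NULL);
    if (len < DNS_HDR_LEN || len > DNS_UDP_MAX)
        return 0;               /* too short to be DNS, or over the limit */
    if (msg[2] & DNS_FLAG_TC)
        return 0;               /* sender says the reply was truncated */
    return 1;
}
```

The length is tested before any header byte is read, so a bogus length never causes a read past the data actually received.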
Here's the core file backtrace as viewed in the GNU debugger (gdb):
Program terminated with signal 11, Segmentation fault.
(gdb) bt full
#0  chunk_alloc (ar_ptr=0x81a9e80, nb=2056) at malloc.c:2928
        victim = (mchunkptr) 0x82035f0
        victim_size = 1836016496
        idx = 89
        bin = (mbinptr) 0x8203df8
        remainder = (mchunkptr) 0x8203df8
        remainder_size = 1836014440
        remainder_index = 136640600
        block = 135962240
        startidx = 136638992
        fwd = (mchunkptr) 0x6d6f5b69
        bck = (mchunkptr) 0x81a9e80
        q = (mbinptr) 0x6d6f5b69
#1  0x08121b11 in __libc_malloc (bytes=2048) at malloc.c:2810
        bytes = 1836014440
        ar_ptr = (arena *) 0x81a9e80
        nb = 2056
        victim = (mchunkptr) 0x81a9e80
        hook = (void *(*)()) 0x6d6f5b69
#2  0x080f51ad in mallocw (size=2048) at unix.c:280
        p = (void *) 0x4
        waited = 0
#3  0x0809713c in tcmdprintf (fmt=0x8160fca "*** Exiting TNOS...\n")
    at sockuser.c:263
        fmt = 0x8160fca "*** Exiting TNOS...\n"
        buf = 0x1446 <Address 0x1446 out of bounds>
        len = 135663562
#4  0x08049bba in where_outta_here (resetme=1, where=0x816cccf "proc_query")
    at main.c:1304
        StopTime = 136323568
        fp = (FILE *) 0x8202020
        inbuff = 0x1446 <Address 0x1446 out of bounds>
        intmp = 0x82042d0 ""
        bptr = 0x82021f0 "`\004\201"
#5  0x0807f3f6 in proc_query (unused=0, d=0x82049d0, b=0x8202020)
    at domain.c:2471
        i = 1836014440
        len = 5190
        buf = 0x82021f0 "`\004\201"
        server = {sin_family = 21930, sin_port = 21930, sin_addr = {
            s_addr = 1437226410}, sin_zero = "ªUªUªUªU"}
        rrp = (struct rr *) 0x82021f0
        rrans = (struct rr *) 0x82042d0
        rrtmp = (struct rr *) 0x8203df8
        qp = (struct rr *) 0x1446
#6  0x080935e8 in proc_launch () at ksubr.c:139
No locals.
#7  0x00000000 in ?? ()
No symbol table info available.
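A side note on reading that dump: fwd, q and hook all hold 0x6d6f5b69, which does not look like a heap address next to the 0x08xxxxxx pointers around it. Assuming this is a little-endian i386 build (an assumption, the dump doesn't say), the four bytes of that word in memory order are all printable ASCII. That would be consistent with text having been written over malloc's bookkeeping, though it is an interpretation and not something the backtrace proves. The sketch below does the byte split so values from the dump can be checked by hand. The helper word_to_ascii is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split a 32-bit word into its four bytes in
   little-endian (memory) order and store them as a C string, replacing
   anything outside printable ASCII with '.'. */
static void word_to_ascii(uint32_t w, char out[5])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)((w >> (8 * i)) & 0xff);
        out[i] = (c >= 0x20 && c <= 0x7e) ? (char)c : '.';
    }
    out[4] = '\0';
}
```

Feeding it 0x6d6f5b69 gives the bytes 'i', '[', 'o', 'm', while a genuine pointer such as 0x081a9e80 comes out mostly as dots.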
And here are lines 2462-2481 of domain.c,
showing the code being executed at the time of the crashes:
    if (Dtrace) {
        tcmdprintf ("DNS: replying");
        dumpdomain (dhdr, 0);
    }
    /* Maximum reply size is 512, see rfc1034/1035 */
    /* buf = mallocw(512); */
    buf = callocw (1, 5120);    /* quick patch */
    len = htondomain (dhdr, buf, 5120);
    if (len > 5120)    /* insufficient buffer space, we've trashed the arena */
        where_outta_here (1, "proc_query");
    free_dhdr (dhdr);
    server.sin_family = AF_INET;
    server.sin_port = dp->port;
    server.sin_addr.s_addr = dp->address;
    (void) sendto (Dsocket, buf, len, 0, (char *) &server, sizeof (server));
    free (buf);
    free ((char *) dp);
    dns_process_count--;
As you can see, the value of len in frame #5 of the backtrace (5190)
exceeds the 5120 bytes allocated for the buffer, so
a controlled shutdown of TNOS is initiated.
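Note that the len > 5120 test only runs after htondomain has returned, so by then any bytes past the end of the buffer have already been written. That fits the segfault landing inside malloc during the shutdown itself. One way to avoid the damage in the first place is to check the remaining space before every write. What follows is a minimal sketch of that idea in C, not a patch: struct outbuf and outbuf_put are hypothetical names, and I have not checked how htondomain writes internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bounded output buffer; none of these names exist in TNOS. */
struct outbuf {
    unsigned char *data;  /* start of the buffer */
    size_t cap;           /* total bytes available */
    size_t used;          /* bytes written so far */
};

/* Append n bytes only if they fit. Returns 0 on success, or -1 if the
   write would overflow, in which case nothing at all is written. */
static int outbuf_put(struct outbuf *ob, const void *src, size_t n)
{
    assert(ob != NULL && ob->used <= ob->cap);
    if (n > ob->cap - ob->used)
        return -1;                      /* refuse before touching memory */
    memcpy(ob->data + ob->used, src, n);
    ob->used += n;
    return 0;
}
```

With a writer like this, an oversized reply shows up as a -1 return while the heap is still intact, so the caller can drop the reply or truncate it instead of shutting down.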
So, what's the solution? I run DJB's dnscache
on my Linux server here. According to the information I've
been able to scrounge up, dnscache supposedly does not accept
non-RFC-compliant replies. I used to have TNOS configured to query my
dnscache and also the university's DNS as a backup.
I have since removed the uni's backup entry from autoexec.nos, and only have my local
dnscache's IP, e.g.
    domain addserver 44.12.3.129 10
If this doesn't work, I could (and might)
increase the buffer size in domain.c, but that seems like just putting
a bandage on the problem.
Stay tuned...