For the past little while cron on my laptop has been completely broken, dumping an endless succession of segfaults (segmentation faults: the program attempts to access a forbidden area of memory) to the syslog. Since there was no apparent movement on the bug report, I set about trying to do something about it.
First I ran strace and ltrace on cron to see if it was failing for some kind of obvious reason: no luck, nothing interesting in the output.
Next thing to try: isolate exactly what line of code was failing using GDB.
First build, install, and restart a debug version of cron and libpam-mount (otherwise gdb output will just be a mess of numbers). Then start "gdb" in a root terminal (cron runs as root) and at the gdb prompt, attach to the running cron:
    (gdb) attach 3489
    Attaching to process 3489
    Reading symbols from /usr/sbin/cron...done.
    Reading symbols from /lib/libpam.so.0...done.
    Loaded symbols for /lib/libpam.so.0
    Reading symbols from /lib/libselinux.so.1...done.
    Loaded symbols for /lib/libselinux.so.1
    Reading symbols from /lib/i686/cmov/libc.so.6...done.
    Loaded symbols for /lib/i686/cmov/libc.so.6
    Reading symbols from /lib/i686/cmov/libdl.so.2...done.
    Loaded symbols for /lib/i686/cmov/libdl.so.2
    Reading symbols from /lib/ld-linux.so.2...done.
    Loaded symbols for /lib/ld-linux.so.2
    Reading symbols from /lib/i686/cmov/libnss_compat.so.2...done.
    Loaded symbols for /lib/i686/cmov/libnss_compat.so.2
    Reading symbols from /lib/i686/cmov/libnsl.so.1...done.
    Loaded symbols for /lib/i686/cmov/libnsl.so.1
    Reading symbols from /lib/i686/cmov/libnss_nis.so.2...done.
    Loaded symbols for /lib/i686/cmov/libnss_nis.so.2
    Reading symbols from /lib/i686/cmov/libnss_files.so.2...done.
    Loaded symbols for /lib/i686/cmov/libnss_files.so.2
    0xb7f5e424 in __kernel_vsyscall ()
where the number after "attach" is the pid (process id) of cron (obtainable by running "ps -ef | grep cron"). Then:
    (gdb) set follow-fork-mode child
    (gdb) cont
    Continuing.
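A side note on the pid passed to "attach": "ps -ef | grep cron" also lists the grep process itself, so the pid has to be picked out by eye. A minimal sketch of an alternative, assuming pgrep (from procps) is installed and that the daemon's process name is exactly "cron"; the helper name oldest_pid is made up for this example:

```shell
# Print the pid of the oldest process whose name is exactly "$1".
# -x requires an exact name match; -o keeps only the oldest match,
# which for a daemon is the parent rather than a forked child.
oldest_pid() {
    pgrep -o -x "$1"
}

# Usage (as root, since cron runs as root):
#   gdb -p "$(oldest_pid cron)"
```

Starting gdb with -p attaches straight away, which amounts to typing "attach" with that pid at the gdb prompt.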
The follow-fork-mode parameter has to be set to "child" because cron is in the habit of spawning child processes, and it is in the child process where the error occurs. Then let cron (frozen at the moment of the "attach") continue running with "cont". And wait for a segfault:
    Continuing.
    Program received signal SIGSEGV, Segmentation fault.
    [Switching to process 19892]
    0x00000000 in ?? ()
    (gdb)
Now dump out some interesting information about the error:
    (gdb) info frame 0
    Stack frame at 0xbfba1aa0:
     eip = 0x0; saved eip 0xb7b4cda1
     called by frame at 0xbfba1af0
     Arglist at 0xbfba1a98, args:
     Locals at 0xbfba1a98, Previous frame's sp is 0xbfba1aa0
     Saved registers:
      eip at 0xbfba1a9c
    (gdb) up
    #1 0xb7b4cda1 in read_password (pamh=0x8841b00, prompt=0x8846278 "reenter password for pam_mount:", pass=0xbfba1b38) at pam_mount.c:160
    160 retval = conv->conv(nargs, message, resp, conv->appdata_ptr);
    (gdb) up
    #2 0xb7b4ddf3 in pam_sm_open_session (pamh=0x8841b00, flags=32768, argc=1, argv=0x8843ce0) at pam_mount.c:511
    511 ret = read_password(pamh, Config.msg_sessionpw, &system_authtok);
    (gdb) up
    #3 0xb7f693c1 in _pam_dispatch (pamh=0x8841b00, flags=32768, choice=4) at pam_dispatch.c:108
    108 retval = h->func(pamh, flags, h->argc, h->argv);
    (gdb) up
    #4 0xb7f6cfeb in pam_open_session (pamh=0x8841be8, flags=32768) at pam_session.c:23
    23 retval = _pam_dispatch(pamh, flags, PAM_OPEN_SESSION);
    (gdb) up
    #5 0x0804e848 in child_process (e=0x88418f8, u=0x88418d8) at ../do_command.c:228
    228 retcode = pam_open_session(pamh, PAM_SILENT);
    (gdb) up
    #6 0x0804e36d in do_command (e=0x88418f8, u=0x88418d8) at ../do_command.c:102
    102 child_process(e, u);
    (gdb) up
    #7 0x0804e1e3 in job_runqueue () at ../job.c:68
    68 do_command(j->e, j->u);
    (gdb) up
    #8 0x0804a777 in main (argc=142875624, argv=0x0) at ../cron.c:270
    270 job_runqueue();
    (gdb) up
    Initial frame selected; you cannot go up.
Above I used the "up" command (its counterpart is "down") to step through the backtrace and see the stack of procedure calls, and the line in each procedure, that was active at the moment of the segfault. From this we can divine that the problem seems to happen in libpam-mount, specifically at pam_mount.c:160 in frame #1. (Note that frame 0 is a mess of numbers without symbolic information, because gdb has no debugging symbols for whatever sits at that address, 0x00000000.)
    (gdb) frame 0
    #0 0x00000000 in ?? ()
    (gdb) up
    #1 0xb7b4cda1 in read_password (pamh=0x8841b00, prompt=0x8846278 "reenter password for pam_mount:", pass=0xbfba1b38) at pam_mount.c:160
    160 retval = conv->conv(nargs, message, resp, conv->appdata_ptr);
    (gdb) print *resp
    Cannot access memory at address 0x0
    (gdb) print resp
    $3 = (struct pam_response *) 0x0
    (gdb)
Above I used the "print" command to show the value of the pointer resp, which turns out to still be set to NULL, and which is being passed on to another procedure (frame 0) that barfs. This is the likely problem.
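The interactive session above could probably be replayed without typing, by putting the commands in a gdb command file. What follows is only a sketch: the file name cron-segv.gdb is arbitrary, "backtrace" stands in for the repeated "up" commands, and "print *conv" is an addition that was not part of the original session (since frame 0 sits at address 0x0, the function pointer being called may be worth a look as well).

```shell
# Write a gdb command file that replays the interactive session.
# The file name cron-segv.gdb is an arbitrary choice for this sketch.
cat > cron-segv.gdb <<'EOF'
set follow-fork-mode child
continue
backtrace
frame 1
print resp
print *conv
EOF

# Usage (as root), replacing 3489 with the current pid of cron:
#   gdb -batch -x cron-segv.gdb -p 3489
```

In batch mode gdb attaches first, then runs the file: "continue" blocks until the child segfaults, the remaining commands print their output, and gdb exits.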
I filed a bug report against libpam-mount suggesting a simple patch (which turned out to have bad side effects...), which prompted another developer to jump in and document an existing patch already applied upstream. The fix is on the way...