Monday, November 2, 2009

Solaris: analyze a core dump file

Whenever there is a coredump, the task of uncovering the source of the problem may not be easy. In Solaris we have some tools that will give us clues in order to solve the problem.

Coredump files on Solaris are generated in the directory:
/var/core/

Once in that directory, find your file core. Then run the following commands:
$ pstack 
core 'core_host_progname_1001_1_1256228346_13885' of 13885: /opt/coolstack/bin/perl -w /export/home/user/bin/progname
-----------------  lwp# 1 / thread# 1  --------------------
 fede5d77 lwp_yield (fed32a00) + 7
 feddfa66 mutex_unlock (fee5c5d0) + 133
 fed846ce malloc   (15, 8be8ed0, 8dcf888, 80cf342) + 3e
 080c20ac Perl_safesysmalloc () + 2c
-----------------  lwp# 2 / thread# 2  --------------------
 fede5ceb __lwp_park (87a93c8, 87a9398, 0) + b
 fede04f0 cond_wait_queue (87a93c8, 87a9398, 0) + 5e
 fede09c4 _cond_wait (87a93c8, 87a9398) + 64
 fede0a06 cond_wait (87a93c8, 87a9398) + 21
 fede0a3f pthread_cond_wait (87a93c8, 87a9398, 1, fe7a8108) + 1b
 fe7a82a8 XS_threads__shared_cond_wait (0, 0, 0, 0, 0, 0) + 1ac
 4d580000 ???????? ()
-----------------  lwp# 3 / thread# 3  --------------------
 feabf697 add_collation (fe71bc28, feaf49b7, 1a, feaf49b7) + 54b
 feac939b cs_leave (fe71bca0, fe71bd28, 1a, 0, fe71b0c0, fefd261b) + 83
 feada60d my_xml_leave (0, fe71b160, fe71b178, feada92b) + 141
 feadac9a my_xml_parse (fe71bca0, 8dee0e8, 4755, 0, 0, 0) + 662
 feac9ae7 my_parse_charset_xml (8dee0e8, 4755, feabf14c, 0, feffdd28, 4755) + 77
 feabf7d5 my_read_charset_file (fe71bf88, fe71bf00, fe71bf50, feffb818, fe71bf14, fefce6f6) + dd
 feac01f2 get_charset_by_csname (8dd0a08, 20, 10, 0) + 16a
 feadcbab mysql_init_character_set (8b00598, 8d01a8b, c, 2) + 5f
 feaddcb5 mysql_real_connect (8b00598, 8af44a8, 89ab6f8, 89b28a0, 8afa190, cea) + 6cd
 fec24031 mysql_dr_connect () + 1add
-----------------  lwp# 4 / thread# 4  --------------------
 feabf697 add_collation (fe65cc28, feaf49b7, 1a, feaf49b7) + 54b
 feac939b cs_leave (fe65cca0, fe65cd28, 1a, 0, feaaed64, feffb104) + 83
 feada60d my_xml_leave (0, fe65c160, fe65c178, feada92b) + 141
 feadac9a my_xml_parse (fe65cca0, 8df2848, 4755, 0, 0, 0) + 662
 feac9ae7 my_parse_charset_xml (8df2848, 4755, feabf14c, 0, fee5c5d0, 4755) + 77
 feabf7d5 my_read_charset_file (0, 0, 0, 0, 0, 0) + dd
 feac01f2 get_charset_by_csname (8dd0a18, 20, 10, 0) + 16a
 feadcbab mysql_init_character_set (8bfd4c0, 8d03a9b, c, 2) + 5f
 feaddcb5 mysql_real_connect (8bfd4c0, 8bf1bf8, 8bf2380, 8bf23c8, 8bf2440, cea) + 6cd
 fec24031 mysql_dr_connect () + 1add

The command above shows the last state of the stack for each thread (lwp) of the program.
That isn't enough we need to know in each thread the problem have occurred. There is another command that lets us know that, the "pflags" command.
$ pflags 
core 'core_host_progname_1001_1_1256228346_13885' of 13885: /opt/coolstack/bin/perl -w /export/home/user/bin/progname
        data model = _ILP32  flags = MSACCT|MSFORK
 /1:    flags = STOPPED  yield()
        why = PR_SUSPENDED
        sigmask = 0xffbff8f7,0x0000fff7
 /2:    flags = STOPPED  lwp_park(0x0,0x0,0x0)
        why = PR_SUSPENDED
 /3:    flags = STOPPED
        why = PR_SUSPENDED
        lwppend = 0x00000400,0x00000000
 /4:    flags = 0
        sigmask = 0xffffbefc,0x0000ffff  cursig = SIGSEGV

Now we know know that thread 4 have received the SIGNAL "SIGSEGV" (Segmentation Fault).
So, the SIGSEGV was generated in Thread 4 when using a MySQL function "add_collation".

With that information isn't easy to know the reason of the coredump. Lets use another command, the "pmap".
The "pmap" command display information about the address space of a process, in this case the program that causes the coredump.
$ pmap | less
core 'core_host_progname_1001_1_1256228346_13885' of 13885: /opt/coolstack/bin/perl -w /export/home/user/bin/progname
08042000      24K rw---    [ stack ]
08050000    1128K r-x--  /opt/coolstack/bin/perl
08179000      52K rwx--  /opt/coolstack/bin/perl
08186000   12752K rwx--    [ heap ]
FE65A000      20K rw---    [ stack tid=4 ]
FE660000      12K r-x--  /lib/nss_dns.so.1
FE673000       4K rw---  /lib/nss_dns.so.1
FE680000      24K r-x--  /lib/nss_files.so.1
FE696000       4K rw---  /lib/nss_files.so.1
FE6A0000      64K rwx--
FE6C0000      64K rwx--
FE719000      20K rw---    [ stack tid=3 ]
FE75E000       4K rw---    [ stack tid=2 ]
FE760000      64K rw---
FE780000      64K rw---
FE7A0000      44K r-x--  /opt/coolstack/lib/perl5/5.8.8/i86pc-solaris-thread-multi/auto/threads/shared/shared.so
FE7BA000       8K rwx--  /opt/coolstack/lib/perl5/5.8.8/i86pc-solaris-thread-multi/auto/threads/shared/shared.so
FE7C0000       8K r-x--  /opt/coolstack/lib/perl5/5.8.8/i86pc-solaris-thread-multi/auto/Cwd/Cwd.so
FE7D1000       4K rwx--  /opt/coolstack/lib/perl5/5.8.8/i86pc-solaris-thread-multi/auto/Cwd/Cwd.so
FE7E0000      64K rwx--
FE800000      40K r-x--  /usr/sfw/lib/libcrypto_extra.so.0.9.7
FE81A000       4K rw---  /usr/sfw/lib/libcrypto_extra.so.0.9.7
FE820000      36K r-x--  /usr/sfw/lib/libssl_extra.so.0.9.7
FE839000       4K rw---  /usr/sfw/lib/libssl_extra.so.0.9.7
FE840000      56K r-x--  /lib/libmd.so.1
FE85E000       4K rw---  /lib/libmd.so.1
FE860000      28K r-x--  /lib/libaio.so.1
FE877000       4K rw---  /lib/libaio.so.1
FE878000       4K rw---  /lib/libaio.so.1
FE880000     936K r-x--  /usr/sfw/lib/libcrypto.so.0.9.7
FE97A000      84K rw---  /usr/sfw/lib/libcrypto.so.0.9.7
FE98F000       8K rw---  /usr/sfw/lib/libcrypto.so.0.9.7
FE9A0000     200K r-x--  /usr/sfw/lib/libssl.so.0.9.7
FE9E2000      12K rw---  /usr/sfw/lib/libssl.so.0.9.7
FE9F0000      24K r-x--  /lib/libgen.so.1
FEA06000       4K rw---  /lib/libgen.so.1
FEA10000     216K r-x--  /lib/libresolv.so.2
FEA56000       8K rw---  /lib/libresolv.so.2
FEA60000      24K r-x--  /lib/librt.so.1
FEA76000       4K rw---  /lib/librt.so.1
FE878000       4K rw---  /lib/libaio.so.1
FE880000     936K r-x--  /usr/sfw/lib/libcrypto.so.0.9.7
FE97A000      84K rw---  /usr/sfw/lib/libcrypto.so.0.9.7
FE98F000       8K rw---  /usr/sfw/lib/libcrypto.so.0.9.7
FE9A0000     200K r-x--  /usr/sfw/lib/libssl.so.0.9.7
FE9E2000      12K rw---  /usr/sfw/lib/libssl.so.0.9.7
FE9F0000      24K r-x--  /lib/libgen.so.1
FEA06000       4K rw---  /lib/libgen.so.1
FEA10000     216K r-x--  /lib/libresolv.so.2
FEA56000       8K rw---  /lib/libresolv.so.2
FEA60000      24K r-x--  /lib/librt.so.1
FEA76000       4K rw---  /lib/librt.so.1
FEA80000      60K r-x--  /usr/lib/libz.so.1
FEA9E000       4K rwx--  /usr/lib/libz.so.1
FEAA0000     352K r-x--  /opt/coolstack/mysql_32bit/lib/mysql/libmysqlclient.so.16.0.0
FEB00000       4K rwx--
FEB07000    1040K rwx--  /opt/coolstack/mysql_32bit/lib/mysql/libmysqlclient.so.16.0.0
FEC0B000       4K rwx--  /opt/coolstack/mysql_32bit/lib/mysql/libmysqlclient.so.16.0.0

Since the problem occurred in the invocation of a MySQL function, with the "pmap" command, I've detected that my Perl used a DBD:: mysql library which was linked with a 32-bit MySQL library. My OS is 64 bits, however that is not a problem since 32 Bits libmysqlclient_32_bits is compatible with 64 bits MySQL DB's.
Despite all efforts to detect the problem in the program that I was developing, it was found that the problem was in the instance of a 64 bits MySQL DB that was using 32 bits libraries.