Monday, November 2, 2009

Solaris: analyze a core dump file

Whenever there is a coredump, the task of uncovering the source of the problem may not be easy. In Solaris we have some tools that will give us clues in order to solve the problem.

Coredump files on Solaris are generated in the directory:
  1. /var/core/  

Once in that directory, find your file core. Then run the following commands:
  1. $ pstack <core_filename>  
  2. core 'core_host_progname_1001_1_1256228346_13885' of 13885: /opt/coolstack/bin/perl -w /export/home/user/bin/progname  
  3. -----------------  lwp# 1 / thread# 1  --------------------  
  4.  fede5d77 lwp_yield (fed32a00) + 7  
  5.  feddfa66 mutex_unlock (fee5c5d0) + 133  
  6.  fed846ce malloc   (15, 8be8ed0, 8dcf888, 80cf342) + 3e  
  7.  080c20ac Perl_safesysmalloc () + 2c  
  8. -----------------  lwp# 2 / thread# 2  --------------------  
  9.  fede5ceb __lwp_park (87a93c8, 87a9398, 0) + b  
  10.  fede04f0 cond_wait_queue (87a93c8, 87a9398, 0) + 5e  
  11.  fede09c4 _cond_wait (87a93c8, 87a9398) + 64  
  12.  fede0a06 cond_wait (87a93c8, 87a9398) + 21  
  13.  fede0a3f pthread_cond_wait (87a93c8, 87a9398, 1, fe7a8108) + 1b  
  14.  fe7a82a8 XS_threads__shared_cond_wait (0, 0, 0, 0, 0, 0) + 1ac  
  15.  4d580000 ???????? ()  
  16. -----------------  lwp# 3 / thread# 3  --------------------  
  17.  feabf697 add_collation (fe71bc28, feaf49b7, 1a, feaf49b7) + 54b  
  18.  feac939b cs_leave (fe71bca0, fe71bd28, 1a, 0, fe71b0c0, fefd261b) + 83  
  19.  feada60d my_xml_leave (0, fe71b160, fe71b178, feada92b) + 141  
  20.  feadac9a my_xml_parse (fe71bca0, 8dee0e8, 4755, 0, 0, 0) + 662  
  21.  feac9ae7 my_parse_charset_xml (8dee0e8, 4755, feabf14c, 0, feffdd28, 4755) + 77  
  22.  feabf7d5 my_read_charset_file (fe71bf88, fe71bf00, fe71bf50, feffb818, fe71bf14, fefce6f6) + dd  
  23.  feac01f2 get_charset_by_csname (8dd0a08, 20, 10, 0) + 16a  
  24.  feadcbab mysql_init_character_set (8b00598, 8d01a8b, c, 2) + 5f  
  25.  feaddcb5 mysql_real_connect (8b00598, 8af44a8, 89ab6f8, 89b28a0, 8afa190, cea) + 6cd  
  26.  fec24031 mysql_dr_connect () + 1add  
  27. -----------------  lwp# 4 / thread# 4  --------------------  
  28.  feabf697 add_collation (fe65cc28, feaf49b7, 1a, feaf49b7) + 54b  
  29.  feac939b cs_leave (fe65cca0, fe65cd28, 1a, 0, feaaed64, feffb104) + 83  
  30.  feada60d my_xml_leave (0, fe65c160, fe65c178, feada92b) + 141  
  31.  feadac9a my_xml_parse (fe65cca0, 8df2848, 4755, 0, 0, 0) + 662  
  32.  feac9ae7 my_parse_charset_xml (8df2848, 4755, feabf14c, 0, fee5c5d0, 4755) + 77  
  33.  feabf7d5 my_read_charset_file (0, 0, 0, 0, 0, 0) + dd  
  34.  feac01f2 get_charset_by_csname (8dd0a18, 20, 10, 0) + 16a  
  35.  feadcbab mysql_init_character_set (8bfd4c0, 8d03a9b, c, 2) + 5f  
  36.  feaddcb5 mysql_real_connect (8bfd4c0, 8bf1bf8, 8bf2380, 8bf23c8, 8bf2440, cea) + 6cd  
  37.  fec24031 mysql_dr_connect () + 1add</core_filename>  

The command above shows the last state of the stack for each thread (lwp) of the program.
That isn't enough we need to know in each thread the problem have occurred. There is another command that lets us know that, the "pflags" command.
  1. $ pflags <core_filename>  
  2. core 'core_host_progname_1001_1_1256228346_13885' of 13885: /opt/coolstack/bin/perl -w /export/home/user/bin/progname  
  3.         data model = _ILP32  flags = MSACCT|MSFORK  
  4.  /1:    flags = STOPPED  yield()  
  5.         why = PR_SUSPENDED  
  6.         sigmask = 0xffbff8f7,0x0000fff7  
  7.  /2:    flags = STOPPED  lwp_park(0x0,0x0,0x0)  
  8.         why = PR_SUSPENDED  
  9.  /3:    flags = STOPPED  
  10.         why = PR_SUSPENDED  
  11.         lwppend = 0x00000400,0x00000000  
  12.  /4:    flags = 0  
  13.         sigmask = 0xffffbefc,0x0000ffff  cursig = SIGSEGV</core_filename>  

Now we know know that thread 4 have received the SIGNAL "SIGSEGV" (Segmentation Fault).
So, the SIGSEGV was generated in Thread 4 when using a MySQL function "add_collation".

With that information isn't easy to know the reason of the coredump. Lets use another command, the "pmap".
The "pmap" command display information about the address space of a process, in this case the program that causes the coredump.
  1. $ pmap <core_filename>| less  
  2. core 'core_host_progname_1001_1_1256228346_13885' of 13885: /opt/coolstack/bin/perl -w /export/home/user/bin/progname  
  3. 08042000      24K rw---    [ stack ]  
  4. 08050000    1128K r-x--  /opt/coolstack/bin/perl  
  5. 08179000      52K rwx--  /opt/coolstack/bin/perl  
  6. 08186000   12752K rwx--    [ heap ]  
  7. FE65A000      20K rw---    [ stack tid=4 ]  
  8. FE660000      12K r-x--  /lib/nss_dns.so.1  
  9. FE673000       4K rw---  /lib/nss_dns.so.1  
  10. FE680000      24K r-x--  /lib/nss_files.so.1  
  11. FE696000       4K rw---  /lib/nss_files.so.1  
  12. FE6A0000      64K rwx--  
  13. FE6C0000      64K rwx--  
  14. FE719000      20K rw---    [ stack tid=3 ]  
  15. FE75E000       4K rw---    [ stack tid=2 ]  
  16. FE760000      64K rw---  
  17. FE780000      64K rw---  
  18. FE7A0000      44K r-x--  /opt/coolstack/lib/perl5/5.8.8/i86pc-solaris-thread-multi/auto/threads/shared/shared.so  
  19. FE7BA000       8K rwx--  /opt/coolstack/lib/perl5/5.8.8/i86pc-solaris-thread-multi/auto/threads/shared/shared.so  
  20. FE7C0000       8K r-x--  /opt/coolstack/lib/perl5/5.8.8/i86pc-solaris-thread-multi/auto/Cwd/Cwd.so  
  21. FE7D1000       4K rwx--  /opt/coolstack/lib/perl5/5.8.8/i86pc-solaris-thread-multi/auto/Cwd/Cwd.so  
  22. FE7E0000      64K rwx--  
  23. FE800000      40K r-x--  /usr/sfw/lib/libcrypto_extra.so.0.9.7  
  24. FE81A000       4K rw---  /usr/sfw/lib/libcrypto_extra.so.0.9.7  
  25. FE820000      36K r-x--  /usr/sfw/lib/libssl_extra.so.0.9.7  
  26. FE839000       4K rw---  /usr/sfw/lib/libssl_extra.so.0.9.7  
  27. FE840000      56K r-x--  /lib/libmd.so.1  
  28. FE85E000       4K rw---  /lib/libmd.so.1  
  29. FE860000      28K r-x--  /lib/libaio.so.1  
  30. FE877000       4K rw---  /lib/libaio.so.1  
  31. FE878000       4K rw---  /lib/libaio.so.1  
  32. FE880000     936K r-x--  /usr/sfw/lib/libcrypto.so.0.9.7  
  33. FE97A000      84K rw---  /usr/sfw/lib/libcrypto.so.0.9.7  
  34. FE98F000       8K rw---  /usr/sfw/lib/libcrypto.so.0.9.7  
  35. FE9A0000     200K r-x--  /usr/sfw/lib/libssl.so.0.9.7  
  36. FE9E2000      12K rw---  /usr/sfw/lib/libssl.so.0.9.7  
  37. FE9F0000      24K r-x--  /lib/libgen.so.1  
  38. FEA06000       4K rw---  /lib/libgen.so.1  
  39. FEA10000     216K r-x--  /lib/libresolv.so.2  
  40. FEA56000       8K rw---  /lib/libresolv.so.2  
  41. FEA60000      24K r-x--  /lib/librt.so.1  
  42. FEA76000       4K rw---  /lib/librt.so.1  
  43. FE878000       4K rw---  /lib/libaio.so.1  
  44. FE880000     936K r-x--  /usr/sfw/lib/libcrypto.so.0.9.7  
  45. FE97A000      84K rw---  /usr/sfw/lib/libcrypto.so.0.9.7  
  46. FE98F000       8K rw---  /usr/sfw/lib/libcrypto.so.0.9.7  
  47. FE9A0000     200K r-x--  /usr/sfw/lib/libssl.so.0.9.7  
  48. FE9E2000      12K rw---  /usr/sfw/lib/libssl.so.0.9.7  
  49. FE9F0000      24K r-x--  /lib/libgen.so.1  
  50. FEA06000       4K rw---  /lib/libgen.so.1  
  51. FEA10000     216K r-x--  /lib/libresolv.so.2  
  52. FEA56000       8K rw---  /lib/libresolv.so.2  
  53. FEA60000      24K r-x--  /lib/librt.so.1  
  54. FEA76000       4K rw---  /lib/librt.so.1  
  55. FEA80000      60K r-x--  /usr/lib/libz.so.1  
  56. FEA9E000       4K rwx--  /usr/lib/libz.so.1  
  57. FEAA0000     352K r-x--  /opt/coolstack/mysql_32bit/lib/mysql/libmysqlclient.so.16.0.0  
  58. FEB00000       4K rwx--  
  59. FEB07000    1040K rwx--  /opt/coolstack/mysql_32bit/lib/mysql/libmysqlclient.so.16.0.0  
  60. FEC0B000       4K rwx--  /opt/coolstack/mysql_32bit/lib/mysql/libmysqlclient.so.16.0.0</core_filename>  

Since the problem occurred in the invocation of a MySQL function, with the "pmap" command, I've detected that my Perl used a DBD:: mysql library which was linked with a 32-bit MySQL library. My OS is 64 bits, however that is not a problem since 32 Bits libmysqlclient_32_bits is compatible with 64 bits MySQL DB's.
Despite all efforts to detect the problem in the program that I was developing, it was found that the problem was in the instance of a 64 bits MySQL DB that was using 32 bits libraries.