Native mono crashes [kernel fix released]

threz_ · April 10, 2015, 5:37pm

Good to know. I’ll wait before doing anything with Mono.

If there is anything further I can do to help test, please let me know.

Taloth · April 10, 2015, 5:38pm

@threz_ You still have the session open? Can you get on IRC?

threz_ · April 10, 2015, 5:41pm

I’m at work right now. I’ll pop on to IRC when I get home this evening (roughly 5 hours from now).

Taloth · April 10, 2015, 5:54pm

That’s a problem: in 5 hours it will be 01:00 for me.

Okay, if you still have the gdb session open at home then you need to put in some commands. It’s better to add this to your ~/.gdbinit file before you run it next time, otherwise you’ll have to type it in the gdb console.

handle SIGXCPU SIG33 SIG35 SIGPWR nostop noprint

define mono_backtrace
 select-frame 0
 set $i = 0
 while ($i < $arg0)
   set $foo = (char*) mono_pmip ($pc)
   if ($foo)
     printf "#%d %p in %s\n", $i, $pc, $foo
   else
     frame
   end
   up-silently
   set $i = $i + 1
 end
end

define mono_stack
 set $mono_thread = mono_thread_current ()
 if ($mono_thread == 0x00)
   printf "No mono thread associated with this thread\n"
 else
   set $ucp = malloc (sizeof (ucontext_t))
   call (void) getcontext ($ucp)
   call (void) mono_print_thread_dump ($ucp)
   call (void) free ($ucp)
 end
end

Instead of backtrace you can then do mono_backtrace 15. This would add more details to the 4023cdd4 in ?? in the backtrace you provided earlier, or similar backtraces.
mono_stack will dump the managed stack into the application console (not the debugger console).
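
For anyone following along, here is a minimal sketch of getting those definitions into ~/.gdbinit without pasting them in twice. It assumes the handle line and the two define blocks above were saved to a file first. The function name, the marker comment and the file name mono-macros.gdb are placeholders made up for this sketch, not anything gdb or mono requires.

```shell
# Append a file of gdb macro definitions to a gdbinit file, at most once.
# Usage: install_gdb_macros MACROS_FILE [GDBINIT]
# (function name and marker text are hypothetical, chosen for this sketch)
install_gdb_macros() {
    macros_file=$1
    gdbinit=${2:-$HOME/.gdbinit}
    marker='# --- mono debugging macros ---'

    # Refuse to touch the gdbinit if the macro file is missing.
    [ -r "$macros_file" ] || return 1

    # Skip if a previous run already added the block.
    if [ -f "$gdbinit" ] && grep -qF -e "$marker" "$gdbinit"; then
        return 0
    fi

    # The marker is a gdb comment line, so gdb ignores it when loading the file.
    {
        printf '%s\n' "$marker"
        cat "$macros_file"
    } >> "$gdbinit"
}
```

Typical use would be `install_gdb_macros mono-macros.gdb`, then starting the application under gdb, for example `gdb --args mono NzbDrone.exe` (adjust the path to your own install) and typing `run`.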

threz_ · April 10, 2015, 6:07pm

Ok. I’ll do this tonight.

I can also pop into IRC tomorrow morning, which I guess would be around 3pm your time if you’ll be around then.

threz_ · April 10, 2015, 11:02pm

This is a new session in gdb, not the one from before. It seems like mono_backtrace gives varying results, depending on when the crash actually happens. Here are a few results:

[Info] HousekeepingService: Running housecleaning tasks

Program received signal SIGSEGV, Segmentation fault.
0x00000000005a86f5 in mono_object_isinst_mbyref ()
(gdb) mono_backtrace 15
#0  0x00000000005a86f5 in mono_object_isinst_mbyref ()
#1 0x400162b8 in  (wrapper managed-to-native) object:__icall_wrapper_mono_object_isinst (object,intptr) + 0x58 (0x40016260 0x400162e9) [0x9c5790 - NzbDrone.exe]
#2  0x0000000000b8a3b0 in ?? ()
#3  0x00007fffed776799 in ?? ()
#4  0x00007ffff6b29648 in ?? ()
#5  0x0000000000000006 in ?? ()
#6  0x0000000001374110 in ?? ()
#7  0x0000000000a33e60 in ?? ()
#8  0x0000000000000000 in ?? ()
Initial frame selected; you cannot go up.

Another session:

[Info] RssSyncService: Starting RSS Sync
[New Thread 0x7fffedaef700 (LWP 8521)]
[New Thread 0x7fffed230700 (LWP 8522)]
[New Thread 0x7fffed64f700 (LWP 8523)]
[New Thread 0x7fffc30ff700 (LWP 8524)]
[New Thread 0x7fffc2efe700 (LWP 8526)]
[New Thread 0x7fffc2c7b700 (LWP 8527)]
[New Thread 0x7fffc2a7a700 (LWP 8533)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffedeff700 (LWP 8374)]
0x000000004047af13 in ?? ()
(gdb) mono_backtrace 15
#0 0x4047af13 in Cannot access memory at address 0xffffffffd8324f90

A third for good measure:

[Warn] Newznab: NZB Example https://www.example/api?t=tvsearch&cat=5030,5040&extended=1&apikey=whatapikey&offset=0&limit=100 Error: SendFailure (Error writing headers)
[Thread 0x7fffc70fd700 (LWP 9887) exited]
[Thread 0x7fffed64f700 (LWP 9885) exited]
[New Thread 0x7fffed64f700 (LWP 9908)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffedb73700 (LWP 9666)]
0x0000000000000008 in ?? ()
(gdb) mono_backtrace 15
#0  0x0000000000000008 in ?? ()
#1 0x404b3c8c in Cannot access memory at address 0xffffffffdc0dd4f0

This is what I get from mono_stack:

[Info] Database: Database Compressed
[Thread 0x7fffedb73700 (LWP 10740) exited]
[New Thread 0x7fffedb73700 (LWP 10762)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffeedff700 (LWP 10734)]
0x000000004004a4a1 in ?? ()
(gdb) mono_stack

"Threadpool worker" tid=0x0x7fffeedff700 this=0x0x7fffeefd0dd0 thread handle 0x433 state : not waiting owns ()

And:

[Info] HousekeepingService: Running housecleaning tasks

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) mono_stack

"<unnamed thread>" tid=0x0x7ffff7fed7c0 this=0x0x7ffff7f78010 thread handle 0x403 state : not waiting owns ()

snazy2000 · April 11, 2015, 7:47pm

Ubuntu 14.04.2 LTS
Sonarr Ver. 2.0.0.3004
Mono JIT compiler version 3.12.1
MediaInfoLib - v0.7.72
SQLite version 3.8.2 2013-12-06 14:53:30

http://pastebin.com/fU7cwyHP

Ran again with gdb installed

http://pastebin.com/ZWrDM7qi

Thought I would run it again, and this time it managed to get to the end of the scanning. Once it was done I thought I would try to get a Big Bang Theory update, and then it crashed again.

http://pastebin.com/vskWxut4

Tommy_Frossman · April 15, 2015, 5:22pm

Any progress here? I have the same problems on Ubuntu; I actually had to add a cronjob that restarts (starts) Sonarr every other hour.

It’s getting very annoying for sure.

Anything I could help with debugging-wise?

Taloth · April 15, 2015, 8:12pm

Tommy, you’re welcome to dive into it using gdb. But you’ll want to install mono-runtime-dbg to get better traces.

So far I’ve seen countless backtraces, I’ve run over a couple of dozen debugging sessions, with little to show for it. Compiled mono no less than 6 times on various machines and vms.
First user had a physical memory corruption problem, there goes a couple of days.
Second user had a VirtualBox vm (of which I have a copy), on which I can reproduce crashes. But it’s too easily reproducible; I suspect something else is messing it up but still dunno what.
I’ve been working on the vm for the last couple of days. Ran it on 3.10 and 3.12, same crashes.

If by “Progress” you mean getting closer to a solution, then no, if you meant to ask if it’s being worked on, yes, to the point of utter frustration.

Problem is that by the time it crashes, the stack is already corrupted, it doesn’t make sense at all. Not giving up though.

Edit: Problem went away when I changed the nr of processors on the vm from 2 to 1. So could be a vbox vs mono thingy related to that.

Michael_Thwaite · April 16, 2015, 12:57pm

I’m experiencing the same problem; however, I’m not using Sonarr (hope I can still join in). I’m running a bespoke ASP.NET app and use XSP as the web server. I started to experience SIGSEGV crashes of XSP, though I use monit to restart it. On April 7th, the logged events began to climb. The failures occur more often when the process is dealing with more customers. I spun up three more front-end servers to spread the load and discovered that my new servers were a little more reliable, but load on them is not as high, so I think that could be discounted.

I was running mono 3.12.1 on Ubuntu 14.04.2 LTS but downgraded to 3.10.0 to test that. No improvement at all.

I suspect a component updated on April 7th, my APT history for that day follows.

I have a server that I can play with to test/recreate. I’m unfamiliar with gdb but happy to learn fast, as this is creating a real problem for my service.

Let me know if there’s anything that I can offer - ssh access, logs, testing, etc.

Thanks, Michael

APT log from April 7th

Michael_Thwaite · April 16, 2015, 2:01pm

I have logging to the console set - I usually get a SIGSEGV but this time managed to log this as the XSP process quit:

Handling exception type NullReferenceException
Message is Object reference not set to an instance of an object
IsTerminating is set to True
Object reference not set to an instance of an object
at System.Threading.Timer+Scheduler.SchedulerThread () [0x00000] in <filename unknown>:0
at System.Threading.Thread.StartInternal () [0x00000] in <filename unknown>:0
[ERROR] FATAL UNHANDLED EXCEPTION: System.NullReferenceException: Object reference not set to an instance of an object
at System.Threading.Timer+Scheduler.SchedulerThread () [0x00000] in <filename unknown>:0
at System.Threading.Thread.StartInternal () [0x00000] in <filename unknown>:0

Is the fact that this is somehow related to threads and scheduling a connection to the SIGSEGV in /lib/x86_64-linux-gnu/libpthread.so.0?

Michael_Thwaite · April 16, 2015, 3:38pm

I’ve noticed that my servers that are configured for single CPU are more reliable than the two cpu servers.

Taloth · April 16, 2015, 9:58pm

Michael, tnx for the info, definitely something we can explore. However, I’ve seen plenty of crashes, both native and managed nullreferenceexceptions. Both likely are caused by the same stack corruption we’re seeing.

The mono 3.12.0 -> 3.12.1 update is unrelated, since the problem happens in 3.8.x, 3.10.x and 3.12.x.
Didn’t think a kernel update would be an issue, but for completeness’ sake I switched back to an earlier Linux kernel version… no crashes.

Can you reboot and while in grub boot pick Advanced options for Ubuntu? Select linux 3.13.0-43 or -44, if still available.

Btw. I use the mono test bug-18026.exe, which generates a whole lot of threads. I noticed it failed when I built mono a couple of times, and it crashed fairly consistently on this vm (kernel -49). I tweaked the test a bit to make it hit harder, but even with that I sometimes have to run it a couple of times before it crashes.

You can grab the sources from github to check it yourself:

wget https://github.com/mono/mono/raw/master/mono/tests/bug-18026.cs
mcs bug-18026.cs
mono bug-18026.exe
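
Since a single run does not always crash, a small loop saves some retyping. This is only a sketch: the function name is made up, and it assumes that a crash or a failed test shows up as a non-zero exit status (a process killed by a signal is reported by the shell as 128 plus the signal number, so 139 for SIGSEGV).

```shell
# Run a command up to MAX times and stop at the first non-zero exit.
# Prints which attempt failed, or that no attempt did.
# Usage: run_until_crash MAX COMMAND [ARGS...]
run_until_crash() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        # Discard the command's own output; only the exit status matters here.
        "$@" >/dev/null 2>&1
        rc=$?
        if [ "$rc" -ne 0 ]; then
            echo "attempt $attempt failed with exit status $rc"
            return 1
        fi
        attempt=$((attempt + 1))
    done
    echo "no crash in $max attempts"
    return 0
}
```

For example, `run_until_crash 20 mono bug-18026.exe` tries the test up to 20 times. A clean result is weak evidence only, so it is worth repeating on a kernel before calling it fine.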

It’s almost unbelievable the kernel would be causing this, so let’s not jump the gun and verify.
So far I ran all these versions a couple of times:
3.13.0-43: fine
3.13.0-44: fine
3.13.0-46: fine
3.13.0-48: crashes
3.13.0-49: crashes
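
To compare a machine against that list, something like the following could be used. It is a sketch that encodes only the five results above; the function name is hypothetical, and any kernel not in the list (including -45 and -47) is reported as untested rather than guessed at.

```shell
# Classify a kernel release string against the results reported in this thread.
# Accepts either the bare ABI version (3.13.0-49) or the full release
# string as printed by `uname -r` (3.13.0-49-generic).
kernel_status() {
    release=$1
    case $release in
        3.13.0-43|3.13.0-43-*|3.13.0-44|3.13.0-44-*|3.13.0-46|3.13.0-46-*)
            echo fine ;;
        3.13.0-48|3.13.0-48-*|3.13.0-49|3.13.0-49-*)
            echo crashes ;;
        *)
            # Not one of the five kernels tested so far.
            echo untested ;;
    esac
}
```

Running `kernel_status "$(uname -r)"` checks the kernel that is currently booted.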

Please verify, need a second opinion on this. Kinda afraid of false positives/negatives.

So far -46 hasn’t crashed though, which doesn’t exactly match with the changes you observed on ~~April~~ March 7th.

LJSeinfeld · April 17, 2015, 4:59am

Just reverted to 3.13.0-46 from 3.13.0-49 and Sonarr is much more stable. I couldn’t even get it to stay running for more than a couple of minutes previously. SIGSEGV error. (Running inside VM)

Linux version 3.13.0-46-generic (buildd@orlo) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #79-Ubuntu SMP Tue Mar 10 20:06:50 UTC 2015

Sonarr Version 2.0.0.3004 - Mar 17 2015

Mono Version 3.10.1 (master/30ed59c Tue Oct 14 15:41:46 CDT 2014)

Michael_Thwaite · April 17, 2015, 11:33am

That crash test is telling: on my original machines that went bad on April 7th, it crashes instantly every time. On other machines and on Debian, no failure. However, I have two machines that are both:

Linux version 3.13.0-49-generic (buildd@brownie) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #81-Ubuntu SMP Tue Mar 24 19:29:48 UTC 2015

One works, one fails.

Launching an older kernel is tricky for me - never done it before on Amazon EC2 as console access is tricky (I think) plus, I often auto remove old kernels.

I started exploring machines, configs and CPU count - something I can change easily and I believe I’ve found a pattern that’s easily recreated. Single CPU is fine, two or more, and it fails reliably.

Here are my test results, feel free to add/edit/extend: Spreadsheet of tests that pass or fail

For now, I’m going to think about my options - continue with dual CPU on Debian or drop to single CPU on Ubuntu.

Taloth · April 17, 2015, 5:12pm

Tnx for checking this on a variety of systems, I guess it’s confirmed: something in that kernel update messes up mono, or mono did it wrong to begin with.

Going to compile mono 4.0 to see if it fixes the issue. If not then I guess I’ll have to submit a bug to mono.

And a reference so I won’t forget: https://bugzilla.xamarin.com/show_bug.cgi?id=29008

Michael_Thwaite · April 17, 2015, 7:18pm

I can perhaps save you the pain: crash, crunch, failure. I tried Debian Jessie with kernel 3.16 and the built-in mono 3.2.8, and it worked. However, switch to the alpha branch and install 4.0.0 and, well, even the installer was blowing up with SIGSEGV, and the test app failed too. Switching to a single processor didn’t help; however, the install failed and it’s not designed for this release, so that’s not totally unexpected. It seems that we’ve hit a version limit of kernel 3.13.0-46, mono 3.12.1 for multi-CPU installations.

Michael_Thwaite · April 17, 2015, 7:39pm

Reported to Mono at https://bugzilla.xamarin.com/show_bug.cgi?id=29212

Taloth · April 17, 2015, 9:22pm

Tnx Michael for posting it to the mono bugtracker. I couldn’t get 4.0.0 to build cleanly yet, much less run.

I’m just afraid that the cause of the issue might be hard to trace and thus might take a while till it’s fixed and released. I’m gonna try to come up with some workarounds for our userbase.
Setting processor affinity prolly works, ugly though.
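
A rough sketch of what the affinity workaround could look like, assuming taskset from util-linux is installed. The function name is made up, and whether pinning to one CPU actually avoids the crash is only suggested by the single-CPU reports in this thread, not confirmed.

```shell
# Run a command restricted to a single CPU (CPU 0) using taskset.
# Usage: run_on_one_cpu COMMAND [ARGS...]
run_on_one_cpu() {
    if ! command -v taskset >/dev/null 2>&1; then
        echo "taskset not found (it ships with util-linux)" >&2
        return 127
    fi
    # -c takes a CPU list instead of a bitmask; "0" means the first CPU only.
    taskset -c 0 "$@"
}
```

Usage would be along the lines of `run_on_one_cpu mono NzbDrone.exe`, with the path adjusted to the actual install. The obvious cost is that every thread in the process then shares one core.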

xgz · April 19, 2015, 6:05am

Hey guys, I’m getting a TON of crashes in OSX as well. Does this look like it might be related to what you’re trying to troubleshoot? There’s no rhyme or reason to why/when this is occurring. Could be two minutes after start, could be two days. During some crashes Sonarr is idle, during others it’s being hammered.

uname: Darwin Kernel Version 14.3.0: Mon Mar 23 11:59:05 PDT 2015; root:xnu-2782.20.48~5/RELEASE_X86_64 x86_64
Sonarr Version: 2.0.0.3004
Mono Version: 3.12.1 ((detached/b7764aa Fri Mar 6 15:32:47 EST 2015)
MediaInfo: mediainfo @0.7.72_0 (active)
SQLite: sqlite3 @3.8.8.3_0 (active)

http://pastebin.com/f2ssT9TX
http://pastebin.com/a914Bkwe