Debugging client errors


stone_cold_Jimi
03-20-2006, 08:41 AM
What's the status on errors? I have completed 6 jobs on my development 3800 so far and lost 3 to either:

<message> - exit code -1073741811 (0xc000000d)

or that plus

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x7C910F29 read attempt to address 0xBE363F94

These occur toward or at the end of a unit, once computing time has passed 12,000 seconds (2 of them were in the 14,000 second range). The WUs are overlapped between cores.

I note that my Venice 3000+ has not had a similar error after 10 units. Both machines are running the A64 updates.

Is this simply machine instability, likeliest since I've only had the CPU for a few days? Or bad luck / dodgy WU etc?
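
For reference, the negative exit code and the hex value in parentheses are the same 32-bit number, once read as signed and once as unsigned. A minimal Python sketch of the conversion follows. The function name `decode_exit_code` and the small lookup table are illustrative, not part of BOINC or Rosetta; the two status names are the standard Windows NTSTATUS names for these values.

```python
# Illustrative lookup table: only the two codes quoted in this thread.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def decode_exit_code(code):
    """Convert a signed 32-bit exit code to its unsigned hex form and a name."""
    unsigned = code & 0xFFFFFFFF  # reinterpret the signed value as unsigned 32-bit
    return "0x%08x" % unsigned, KNOWN_STATUS.get(unsigned, "unknown")


if __name__ == "__main__":
    print(decode_exit_code(-1073741811))  # ('0xc000000d', 'STATUS_INVALID_PARAMETER')
```

The name only says what Windows reported when the process died; it does not by itself say whether the cause was hardware or the work unit.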

mad mikee
03-20-2006, 08:58 AM
The ...5 one (0xc0000005) is a memory error, usually because something's 'not quite right'

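Stability-wise, PRIME > D2OL > Rosetta: an overclock that passes Prime can still fail D2OL, and one that passes D2OL can still fail Rosetta (Rosetta will really test your machine).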

Will Edit and add more later

lv_dicedealer
03-20-2006, 09:23 AM
I have been getting a rash of similar errors on one of my rigs that has GeIL Ultra 4400 2x512MB sticks in it....
Thanks for the direction there Mikee, will look into the memory probs later today.

mad mikee
03-20-2006, 09:59 AM
Mem (bad or 'just not at its correct settings'), ctlr, chipset if Intel :shrug:
If it happens once every 2-3 days, I ignore it, but if there are a lot, something is wrong :(

I ignore the 0x1 errs since they die @ 13 or 18 seconds :D

For example of why we say this is a great stability test:

I had my wife's puter @ 10x285 after running Dual prime for 2 days :D

Had to knock it down to 10x280 for D2OL :rolleyes:

Rosetta made me drop to 10x275 :mad:
(& then I zapped the chip :upset: )
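
To put numbers on those three settings, here is a small sketch of the arithmetic. It assumes "10x285" means a CPU multiplier of 10 on a 285 MHz reference clock, which is the usual reading but is not spelled out above; the function name is illustrative.

```python
def core_clock_mhz(multiplier, reference_mhz):
    """Core clock in MHz, assuming clock = multiplier x reference clock."""
    return multiplier * reference_mhz


if __name__ == "__main__":
    settings = [("Prime", 285), ("D2OL", 280), ("Rosetta", 275)]
    top = core_clock_mhz(10, settings[0][1])
    for name, ref in settings:
        clock = core_clock_mhz(10, ref)
        drop = (top - clock) / top * 100  # percent below the Prime-stable clock
        print("%-8s %d MHz (%.1f%% below Prime-stable)" % (name, clock, drop))
```

Under that assumption the Rosetta-stable setting is 2750 MHz, about 3.5% below the 2850 MHz that survived two days of Prime.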


:off: When I am checking stuff to get a cruncher going.

My 'tweak/qualify' suite

0. If new mem - run a quick few passes of memtest 5/8 to see if it works @ all

1. 1m/4m Spi (quick see if I'm close)

2. S&M 1.7.6 - set both cores to do 2 loops INCLUDING Power test 2 (video, will die w/in 40secs if something is awry)

3. Min 300-500% run on memtest/win (1 for 1G, 2 for 2G)

4. 1 cycle of dual prime 1024-4096 using 390m (1G) or 900m (2G) each.

5. If that is okay, I go let rosetta try it out and adjust if needed.

The pain of course comes when something fails and I have to tweak and start over :brick:.

stone_cold_Jimi
03-20-2006, 10:02 AM
Thanks, that's very helpful - I wonder how to track down something so infrequent. Just upped a good one so that's 4 good 3 bad. I'll have to take the machine off - is there a way to save work while pulling the client out for the duration?

stone_cold_Jimi
03-21-2006, 01:41 AM
I just had 2 unhandled exception errors in a WU without the client failing - got a success and credit -

HB_BARCODE_30_1opd__351_14372_0

Does Rosetta recover properly from errors, i.e. does it self-validate as it runs? Or is that result above unsafe?

Had another fail with the exit code. I'm wondering, does setting affinities for Rosetta help? I've just set that, see if that helps.

mad mikee
03-21-2006, 08:38 AM
Since the ...4.82... starts and stops every WU and is started by BOINC, setting affinity shouldn't make a difference, and it is fighting the OS.

stone_cold_Jimi
03-21-2006, 08:46 AM
Well, I'm going to try for a while - the first WU after I did this was clean as a whistle. If I get a failure or mem error, I'll stop doing it. Definitely seemed to make the machine slicker at other concurrent tasks, though.

Edit: since I started setting affinity, there have been no more errors (last 5 WUs?). If it shouldn't work, it's not working the way it should. How can I automate this?
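
One possible way to automate it, as a rough sketch rather than a tested recipe: a small Python script that finds the science-app processes and pins each one to its own core. It assumes the third-party psutil package is installed, that the process name contains "rosetta" (a guess, check Task Manager for the real name), and a 2-core machine. `plan_affinity` and `pin_matching` are made-up helper names.

```python
def plan_affinity(pids, n_cores):
    """Assign each PID to one core, round-robin, in sorted PID order."""
    return {pid: [i % n_cores] for i, pid in enumerate(sorted(pids))}


def pin_matching(name_fragment="rosetta", n_cores=2):
    """Pin every process whose name contains name_fragment (assumed name)."""
    import psutil  # third-party; imported here so plan_affinity works without it

    pids = [
        p.info["pid"]
        for p in psutil.process_iter(["pid", "name"])
        if name_fragment in (p.info["name"] or "").lower()
    ]
    plan = plan_affinity(pids, n_cores)
    for pid, cores in plan.items():
        try:
            psutil.Process(pid).cpu_affinity(cores)  # restrict to the chosen core
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass  # process ended between listing and pinning, or no permission
    return plan


if __name__ == "__main__":
    print(pin_matching())
```

Because the app is restarted for every WU, a one-off run only covers the current units; the script would have to be rerun periodically (for example from a scheduled task) to catch new ones.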