News on the Error-front


Fr3ak
03-19-2006, 01:45 AM
Dr. David Baker has just posted some interesting information here:
http://boinc.bakerlab.org/rosetta/forum_thread.php?id=1177

Looks like the errors are being worked on.

lv_dicedealer
03-19-2006, 02:33 AM
Good, less babysitting is always a plus :)

DAK1640
03-19-2006, 02:57 AM
This is excellent news...Can't wait till it's more automatic (similar to D2OL)

gpcola
03-19-2006, 04:39 AM
If Rom has really managed to reduce the failure rate by 20% (not counting the Mac and Linux clients) on ralph@home in just 9 days then he's my hero :p: :D

Hope they apply these changes to rosy asap.

"Now I’m on to the next biggest problem which has been deemed the ‘1% bug’." <-- this is the one we need fixed the most of course though

Carlz0r
03-19-2006, 07:37 PM
"Now I’m on to the next biggest problem which has been deemed the ‘1% bug’." <-- this is the one we need fixed the most of course though
The 1% bug? What's that?

[XC] moddolicous
03-19-2006, 08:12 PM
Where some WUs get stuck @ 1% for like 10hrs +

Carlz0r
03-19-2006, 08:37 PM
Where some WUs get stuck @ 1% for like 10hrs +
Ah, I figured that would be it, since it's happening to me right now.
CPU time: 36:00:57

chew*
03-19-2006, 08:39 PM
I dunno, maybe I'm just not lucky but I never get those :stick:

[XC] moddolicous
03-19-2006, 08:42 PM
Ah, I figured that would be it, since it's happening to me right now.
CPU time: 36:00:57
Sorry to hear. That happens every once in a while. Unfortunately, some "babysitting" is involved in Rosetta.

Fhqwhgads6680
03-19-2006, 09:10 PM
Hmm, I have yet to have that happen... is it a pretty frequent occurrence? Does the time to completion just never go down? Good to hear they are working on the error.

Brandon J

Carlz0r
03-19-2006, 10:15 PM
Sorry to hear. That happens every once in a while. Unfortunately, some "babysitting" is involved in Rosetta.
Is there any way to fix it, or do I just have to wait it out?

chew*
03-19-2006, 10:19 PM
Carlzor, try downclocking your RAM a tad. I've found that Rosetta is very harsh on RAM and will find instabilities that PI 32M and Prime95 won't. I get no errors whatsoever.

Carlz0r
03-19-2006, 10:25 PM
Carlzor, try downclocking your RAM a tad. I've found that Rosetta is very harsh on RAM and will find instabilities that PI 32M and Prime95 won't. I get no errors whatsoever.
Hmm. I downclocked it and Rosetta gave me a computation error :confused: Well, it's on to the next one. Hmm. That appears to have fixed the problem though. Downclocked RAM to 245 and I'm 7 minutes in and at 2.25%.

Fr3ak
03-19-2006, 10:35 PM
The 1% thing doesn't have to do with RAM.
The error is on the project's side.

chew*
03-19-2006, 10:39 PM
Then explain why I never ever get errors? And my machine cranks out more units than most machines, so it has a higher chance of getting an error. Anyway, LMK how you make out, Carlzor.

I know that Rosetta has some problems, but if you're repeatedly getting errors it's more than likely client-side.

Fr3ak
03-20-2006, 12:35 AM
You might be simply lucky :)
I have had that error with machines that were running at stock speed.
I hadn't had a 1% WU on my Opteron for more than 2 months before I got the first. And that thing didn't produce a single error in those 2 months, so I'd say it can't be more stable.
It's true that Rosetta uses RAM a lot. I had PCs crash with Rosetta that are 48h Prime95 stable at priority 10 and didn't crash even once in the 2 years I used them as gaming rigs.
Those 1% WUs were also much rarer 2 months ago. There are a lot more people having them right now than 2 months ago.
So it looks to me like those 1% errors are caused on the project's side and not the user's.
If you have similar or different experiences, let us know.

stone_cold_Jimi
03-20-2006, 01:01 AM
Those 1% WUs were also much rarer 2 months ago. There are a lot more people having them right now than 2 months ago.
So it looks to me like those 1% errors are caused on the project's side and not the user's.
If you have similar or different experiences, let us know.

I think Dr. DB stated that the jobs were getting bigger and more complex in the last couple of months (and the errors had increased). Certainly the content has changed.

Fr3ak
03-20-2006, 01:08 AM
More complex units. Does that mean the memory is stressed more now, which causes the 1% errors? Or does that mean the units got too complex, so there are loops in a few of them, which causes the 1% error?

I still don't think it's a memory problem, personally. With the memory not being stable you get computation errors, or the PC freezes or reboots.
At least from what I have noticed.

stone_cold_Jimi
03-20-2006, 01:14 AM
More complex units. Does that mean the memory is stressed more now, which causes the 1% errors? Or does that mean the units got too complex, so there are loops in a few of them, which causes the 1% error?

I still don't think it's a memory problem, personally. With the memory not being stable you get computation errors, or the PC freezes or reboots.
At least from what I have noticed.

Ah, just had a comp error (89% in or something). Core 0, I suspect there's a little bit of instability there.

From David Baker's journal 4th March:

"the large increase in errors two weeks ago was not due to new bugs, but was an unintended byproduct of our effort to help dial up users who wanted longer work units and the many other users who requested work units with more consistent lengths. since most work units had been taking on the order of an hour or two previously, when we set the default work unit length to 8 hours, this resulted in 4-8 times more structures being generated per work unit. This unmasked rare possibilities for error that were not evident before"
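
To put rough numbers on that quote, here is a minimal sketch of why more structures per work unit means more failed work units. It assumes every structure has the same small, independent chance of hitting an error; the function name and the 0.1% figure are made up for illustration and are not from the project.

```python
def wu_failure_probability(p_per_structure: float, structures_per_wu: int) -> float:
    """Chance that at least one structure in a work unit hits an error.

    Assumes each structure fails independently with the same probability,
    which is a simplification for illustration only.
    """
    return 1.0 - (1.0 - p_per_structure) ** structures_per_wu


if __name__ == "__main__":
    p = 0.001  # hypothetical per-structure error rate, NOT a project figure
    for multiplier in (1, 4, 8):
        chance = wu_failure_probability(p, multiplier)
        print(f"{multiplier}x structures per WU -> {chance:.4%} chance the WU fails")
```

Under those assumptions, the per-WU failure chance grows almost in proportion to the number of structures, so a rare error becomes noticeably more visible without any new bug being introduced.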

There's more to it, anyway:

http://boinc.bakerlab.org/rosetta/forum_thread.php?id=1177

gpcola
03-20-2006, 04:25 AM
For those of you who weren't aware of the 1% bug before reading this thread, here's what the R@H FAQ says on the matter:

Q. Why is my computer stuck on 1%?

A. This is another known BOINC and Rosetta@Home issue. There are a couple of things you can do about it, and they should be done in the order given here:

First, make certain the work unit is actually stuck or hung:

Depending on how the WU is configured, some may have over 1,500,000 steps in the first model and still not reach 1%. This can take over 5 hours of CPU time. There are a few even larger ones.

Between checkpoints (saves to disk, which occur when the percent complete changes) the time to completion in the BOINC work tab will increase instead of decreasing. This is because BOINC calculates the time to completion based on the number of CPU seconds that have passed and the percent complete. So if the CPU time rises and the percent remains the same, the time to completion will increase. When the running WU checkpoints, the percent complete will suddenly jump up, and the time to completion will suddenly drop. Then the entire process will repeat until the next checkpoint, or until the WU finishes.
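
The behaviour described above can be sketched with a simple proportional estimate. This is only an illustration under the assumption that the remaining time is extrapolated linearly from CPU seconds and percent complete, as the paragraph describes; the function name is hypothetical and this is not BOINC's actual code.

```python
from typing import Optional


def estimated_seconds_remaining(cpu_seconds: float, fraction_done: float) -> Optional[float]:
    """Linear extrapolation: if fraction_done took cpu_seconds, how long is left?

    Hypothetical helper for illustration; returns None when no progress
    has been reported yet, since nothing can be extrapolated.
    """
    if fraction_done <= 0.0:
        return None
    return cpu_seconds * (1.0 - fraction_done) / fraction_done


if __name__ == "__main__":
    # Stuck at 1% between checkpoints: the estimate keeps growing.
    for hours in (1, 2, 5):
        left = estimated_seconds_remaining(hours * 3600, 0.01)
        print(f"{hours}h CPU at 1%  -> about {left / 3600:.0f}h remaining")

    # After a checkpoint the percent jumps, and the estimate drops sharply.
    left = estimated_seconds_remaining(5 * 3600, 0.50)
    print(f"5h CPU at 50% -> about {left / 3600:.0f}h remaining")
```

So a rising time to completion on its own does not prove the WU is hung; it only shows that the percent complete has not changed since the last checkpoint.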

The time settings in your preferences have a lot to do with what the jumps will be as the WU progresses. If you have your time set to, say, 2 hours, and one of the large WUs comes along, it is possible for it to run for over five hours showing 1%, and suddenly jump to 100% and finish. If the WU is one of the smaller ones, running with the same settings, it might only run 5 min or less, doing about 35,000 steps for each model, and the percent will jump to something like 5% and keep rising until it hits about 1 hour and 55 mins and finishes.

So look at the graphic display and see what is going on every so often on these longer ones.

The project has said they are working on "leveling" all of this out so there is not so much variation. Those WUs should start appearing soon.

If the work unit is really hung:


1. Suspend the work unit: BOINC Manager -> Work (tab) -> click on the work unit, click the Suspend button (on the left-hand side), then the Resume button. Wait for the computer to restart the work unit (it will need to finish the new work unit it started, if it had another available) and see if it's still stuck. Give it about 20 min.
2. Shut down BOINC, restart BOINC, and see if the work unit is still stuck. Give it about 20 min.
3. Reboot your computer and see if the work unit is still stuck. Give it about 20 min.
4. Abort the work unit: BOINC Manager -> Work (tab) -> click on the work unit that's stuck, then click the Abort button (on the left-hand side).

This is a bug and is not caused by instability on the rig in question. There are going to be WUs out there that are bugged - it goes with the territory and there's not a lot we can do but wait for the R@H staff to fix them, which it looks like they are doing right now :D

In my experience, instability will usually cause a WU to error out within minutes of starting rather than hours into it.