HCC points calculations on multicores [Archive]

twilyth

12-01-2007, 02:35 PM

Here's a recent post from Lawrence Hardin at WCG about the pagefault problem on multicore machines and the resulting loss of points.

Everyone has their own reasons for running WCG, so I won't pretend to advise anyone, but you should at least be aware of the problem so you can make your own choices.

I suppose I'll dip my toe into the water again. Brrr. . . it's chilly!

A page fault occurs when the core wants to access memory that is not loaded into cache. This will slow things down because the kernel will have to load a new page from memory into the cache, while the core waits. Any application with a lot of page faults will run more slowly than one with only a few. But there is a second possible problem with performance. Multiple cores can 'queue up' a series of page faults so that each core has to wait until its own page fault gets serviced. This is called memory contention. If a number of cores are running applications with a high number of page faults, then performance will drop even more because of this memory contention.

How can this performance inefficiency be cured? The normal way is to run a preprocessing step over the data arrays and produce a new array that clusters data together the way that the program will access it. Sometimes this is possible. Unfortunately, sometimes it is not. It all depends on the algorithm. Even when it is possible, it produces unreadable data structures. This need not be a problem but when developing a new program that has to be rapidly changed to match research needs it is almost always a problem.
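To make the clustering idea concrete, here is a minimal Python sketch. It is not the HCC code; the function name `cluster_by_access_order` and the toy data are made up for illustration. It copies the elements into a new array in the order the program first touches them, so later accesses walk memory mostly front to back.

```python
def cluster_by_access_order(data, access_order):
    """Return (clustered, remap): a copy of data laid out in first-access order,
    plus a table mapping each old index to its new position."""
    clustered = []
    remap = {}
    # Place elements in the order the program first touches them.
    for idx in access_order:
        if idx not in remap:
            remap[idx] = len(clustered)
            clustered.append(data[idx])
    # Elements that are never accessed go at the end.
    for idx, value in enumerate(data):
        if idx not in remap:
            remap[idx] = len(clustered)
            clustered.append(value)
    return clustered, remap


# Hypothetical toy data and access pattern.
data = ["a", "b", "c", "d", "e", "f"]
access_order = [4, 1, 4, 5, 0]
clustered, remap = cluster_by_access_order(data, access_order)
new_order = [remap[i] for i in access_order]
print(clustered)  # ['e', 'b', 'f', 'a', 'c', 'd']
print(new_order)  # [0, 1, 0, 2, 3]
```

The readability cost shows up right away: every index the program uses now has to go through `remap`.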

[A personal reminiscence. A generation ago I spotted a neat 15-25 line section in an image processing assembler routine that I could optimize to speed up the program by 10%-15%. Even with paperwork, this change only cost me 2 or 3 days and we were running it constantly on a number of computers, so I considered it time well spent. I actually congratulated myself about this. (sob..) A little more than 2 years later the new computers changed the cache organization and I suddenly realized that my change was bound to cause problems down the road if the cache changed even more. After thinking it over for an hour, I eliminated the change. Programming to meet specific cache designs is very dangerous practice that has to be considered very suspiciously.]

So what is my estimate of the situation? I don't think that it makes sense to reprogram the application for this. The project scientists should be concentrating on the results and overworking the programmers to change the application to produce better results. Faster should be ignored at this stage.

But how should individual members of the World Community Grid feel about this? The high page-fault count is simply an artifact of the algorithm. It will slow down the flops/second but that will not matter as such. The CPU time spent running the kernel to load in new pages will show up as reduced credit, but for a single core the points impact should not be substantial. Memory contention will be much more substantial, so 4 and 8 core machines would show a much greater drop in points if running more than 1 HCC work unit. The WCG scheduler is sending out the HCC work units so a member can eliminate HCC from these multi-core machines without slowing down progress on HCC. And they could then run other projects such as FAAH and DDDT that would otherwise have to run on the single core computers that can handle HCC with the greatest efficiency.

An unrelated note. Some days ago someone posted a work unit awarded 8.3 points in this thread. This was immediately reported to the WCG staff. I don't know what went wrong and we have a number of more urgent issues, but it is an error unrelated to the main problem being addressed in this thread.

Lawrence
http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=17244&offset=50

sierra_bound

12-01-2007, 03:01 PM

This does not fully explain the problem. Even if the app is running slower than normal because of page faults, this should not affect the points awarded. The problem lies, in part, in the quorum. Work units crunched on machines with more than four cores are often not paired with similar machines. In some cases, some lowly single-core Pentium machine is being paired in the quorum with a high-powered multi-core cruncher. This is one of the failings of the quorum system. It's hard for a dual-Clovertown to be paired with a similar machine because there just aren't that many.

twilyth

12-01-2007, 03:29 PM

I don't get that either but then I'm pretty fuzzy on how points are calculated to begin with. The one thing I do understand is the bottom line - don't run HCC on multicores. Anyway, it doesn't seem it's a priority for the WCG staff, so I see no need to make it my priority either.

Sparky

12-01-2007, 03:38 PM

Well I only have single core CPUs on WCG so I guess I'm safe then

STEvil

12-01-2007, 06:24 PM

Just had a thought occur to me: what was that cache option for the i5000xt we were all talking about quite some time ago causing performance issues with quadcores under certain circumstances? Maybe this is one of them.

sierra_bound

12-01-2007, 06:36 PM

Claimed credit is not an issue. Sometimes with an 8-core system you will get WU's with 120-130 in claimed credit, but only 50-60 in granted credit. Something is wrong there. And it's not because of abnormal CPU benchmarks either. Dual and quad-cores will often have higher benchmarks, but the granted credit is closer to what was claimed.

Movieman

12-01-2007, 06:43 PM

Thanks twilyth for posting this.
I saw this earlier today as I'm subscribed to that thread.
Sierra is right on the credit part, but I think the issue is this: if an HCC WU goes to 2 machines, one a single core and one an 8 core, then the single does it in, say, 4 hours and claims proportionally. The 8 core does it in 7-8 hours and claims accordingly. The credit algorithm then sees that the 8 core must be overclaiming and awards both the lower value.
That's my take and as always I could be wrong..:D
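
The guess above can be written down in a few lines of Python. This is only a sketch of that theory, not WCG's actual validator; the function name `grant_lowest_claim` and the claim values are hypothetical.

```python
def grant_lowest_claim(claims):
    """Grant every member of the quorum the lowest claimed credit."""
    if not claims:
        return []
    lowest = min(claims)
    return [lowest for _ in claims]


# Hypothetical pairing: a single core claims 60, an 8 core claims 125.
print(grant_lowest_claim([60.0, 125.0]))  # [60.0, 60.0]
```

Under this rule the slower-per-WU machine always loses the difference, which would match the "claimed 120-130, granted 50-60" pattern reported earlier in the thread.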

[XC] serlv

12-01-2007, 06:46 PM

When WCG runs a Cancer project that effectively utilizes, and, yes, "compensates" multi-core machines then I will run it on those machines. However, all my single core rigs are retired. So no Cancer project for me. :(

Cancer is what prompted me to "crunch" to begin with.

sierra_bound

12-01-2007, 06:48 PM

In an ideal world the quorum should compare the results of similar machines. That rarely happens with Clovertown rigs because there are so few of them on the grid. Clovertowns get punished in the quorums in other projects as well, like FAAH. But it's even worse with HCC.

Movieman

12-01-2007, 06:53 PM

I think the answer for me is I will run FAAH on the clovers and try HCC on Carl's FX51..It's at 2423mhz and making app the same as my DX2400 was doing..Yea, with one core..Not electrically efficient but it's such a little brute that I can't shut it off.
I have a baseline of what it's doing in FAAH and in a week will have numbers to compare to for HCC.

sierra_bound

12-01-2007, 06:58 PM

The one project where Clovertowns do reasonably well is DDDT. Claimed and granted credit are often very close.

Movieman

12-01-2007, 07:02 PM

The one project where Clovertowns do reasonably well is DDDT. Claimed and granted credit are often very close.

That's interesting..Call it 20 points per hour per core then times 7 for 26,880 a day..
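
For anyone checking the math, the 26,880 figure works out as below. The sketch assumes 8 cores (a dual Clovertown is 2 x 4) running 24 hours, and it reads "times 7" as the factor between BOINC credit and WCG points; the post itself does not spell either of those out.

```python
CREDITS_PER_HOUR_PER_CORE = 20  # the round number from the post
CORES = 8                       # assumption: dual Clovertown = 2 x 4 cores
HOURS_PER_DAY = 24
POINTS_MULTIPLIER = 7           # the "times 7" from the post


def daily_points(per_hour_per_core, cores, multiplier=POINTS_MULTIPLIER):
    """Daily points for a machine crunching around the clock on every core."""
    return per_hour_per_core * cores * HOURS_PER_DAY * multiplier


print(daily_points(CREDITS_PER_HOUR_PER_CORE, CORES))  # 26880
```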

sierra_bound

12-01-2007, 07:07 PM

28K today :D

Movieman

12-01-2007, 07:33 PM

28K today :D

Yea, but thats 8 days and 14 hours..:ROTF:
Just did the following:
Mesh clover stays on FAAH
My Clover went to DDDT
The Q6600 stays with FAAH
FX51 went to HCC
Lets see what happens.
Should get a good baseline from those moves.

sierra_bound

12-01-2007, 07:39 PM

It takes awhile to validate all the WU's that are currently pending. So you may not get a definitive picture of how each machine is doing until next weekend.

Movieman

12-01-2007, 07:47 PM

It takes awhile to validate all the WU's that are currently pending. So you may not get a definitive picture of how each machine is doing until next weekend.

agreed, and I'm not aborting what's there and that's 3 days work so it will take longer. At least I feel we're on the right track here.:up:

=CDU= CNP

12-01-2007, 07:51 PM

I saw he mentioned the core having to dig up files from the HD because it wasn't in cache. Wouldn't a ram disk solve this problem?

twilyth

12-01-2007, 07:51 PM

agreed, and I'm not aborting what's there and that's 3 days work so it will take longer. At least I feel we're on the right track here.:up:
I've aborted a bunch of HCC wu's - why did you decide keep the ones in your cache?

sierra_bound

12-01-2007, 07:59 PM

I've aborted a bunch of HCC wu's - why did you decide keep the ones in your cache?
One of the problems with aborting WU's is that you get penalized for doing that. WCG treats them as errors and will likely send you fewer WU's, at least for a short period. Also, there are people who don't like to see those WU's in the error section of their WCG page.

I saw he mentioned the core having to dig up files from the HD because it wasn't in cache. Wouldn't a ram disk solve this problem?
You can be the first to test that.;)

Again, something goes haywire during the validation process. WU's crunched on Clovertown rigs often get the appropriate claimed credit, but then get hosed when it comes to granted credit.

=CDU= CNP

12-01-2007, 08:01 PM

One of the problems with aborting WU's is that you get penalized for doing that. WCG treats them as errors and will likely send you fewer WU's, at least for a short period. Also, there are people who don't like to see those WU's in the error section of their WCG page.
Oops I dumped them all !:eek:

Movieman

12-01-2007, 08:03 PM

I've aborted a bunch of HCC wu's - why did you decide keep the ones in your cache?

Didn't have any HCC on any machine..None to keep,just added HCC to the FX51 single core..:D

sierra_bound

12-01-2007, 08:03 PM

Oops I dumped them all !:eek:
I wouldn't worry too much.:) You'll still get work units, though they may be slower in coming for awhile.

Movieman

12-01-2007, 08:05 PM

OT: We got a nasty bugger of a storm heading in here tomorrow.
Word is 9-14F and over a ft of snow with 50-60MPH winds so if you don't see my ugly spamming face here you'll know why..:D

Sparky

12-01-2007, 08:06 PM

I wouldn't worry too much.:) You'll still get work units, though they may be slower in coming for awhile.

:with:

When I started BOINC back up on fold3 I found it started crunching old units from a long time ago. So silly me hit abort on them all. So not only were they reported extremely late, but also aborted, and WCG didn't give fold3 too much to do for a few days... :yawn:

Philly_Boy

12-03-2007, 07:32 AM

Yea, but thats 8 days and 14 hours..:ROTF:
Just did the following:
Mesh clover stays on FAAH
My Clover went to DDDT
The Q6600 stays with FAAH
FX51 went to HCC
Lets see what happens.
Should get a good baseline from those moves.

I know my micro-farm doesn't hang with the big boys...;) but I will switch the work and home lappy's (both single core Pentium M machines) to HCC. I already removed it from the quad last week when the HCC issues first started cropping up. The quad responded with an 18K day on Saturday...my second best day ever. The quad seems to chew thru these DDDT WU's in as little as 0.34 hour/WU. I am averaging 20-22 points per hour per core so I am pleased that I am in the right places with the quad.

We'll see how this shakes out later this week with the HCC WU's.

123bob

12-05-2007, 01:32 AM

I've largely stayed out of the fray on the HCC point scoring topic, since I only run quads, and can't compare smaller or larger machines. But you know, the explanation in the beginning of this thread pizzes me off. As my teammates know, I'm here to absolutely F. U. cancer. It is my war. What the explanation gives me is that page faults on WCG is somehow OK??? Well, it ain't. Why? We are not giving our best to the architecture of the current project. Anything labeled "fault" or "error" ain't OK....:mad:

Warning, RANT ON
I think it's time to let the biomath folks on the HCC project know we are not able to give our best. The caveats......I agree, in principle, that programming should not be done for cache architecture, processor type, 32 vs 64 bit, et.al., but I would think hot machines should provide hot results??? Mine are not slackers, they are mostly SLACRs, and other secret stuff!! What's going to happen when I bring $15k worth of yorkies, and next year's Nehalem 8 cores to the team? Are they going to be penalized? Will they provide the max, accurate work results they could provide? Will their results be useful to the mission? Have I pizzed thousands of dollars into the wind??

Don't get me wrong. I'm definitely not about points. I'm about useful production. This conversation on the HCC output has turned to performance on work units, not just awarded points. Now you got my attention.....
RANT OVER, for now

I thank Dave for bringing this to my attention to start with. I thought it was a contained "point problem", but I'm thinking now, it's not. Don't make me get out my slide rule.....!!:rofl: I mean it!! :ROTF:

Regards,
Bob

Movieman

12-05-2007, 01:48 AM

Hey Bob..
Up front, I'm not a software guy at all and if I step on my "Richard" here bear with me.
From what I've read it "seems" that this HCC project WU functions a bit different than the others. To get whatever info they need on these WU's it sounds like they are saying it works in X way and that is creating the page faults on multicores.
Add into this that they have a limited staff and that staff right now is working on getting HCC to work on Linux machines and you see why they put our concern on the back burner.
That's how I read this and yes, as always I could be wrong.
The fly in the ointment is when you get a guy like Diddly replying it sets people's tempers off and clouds the issue.
I, like you, want to run HCC but at the same time I want to get my honest "pay" for the time involved and that can't happen right now.
Now here's the killer: That old tech FX51 that Carl sent is eating those HCC WU up like crazy. I know, it's not an answer but it's all I got pal.
I'm trying the diplomatic route with WCG..
So we wait 2 weeks and ask again and see what is happening.

123bob

12-05-2007, 01:59 AM

Movieman

12-05-2007, 02:05 AM

I understand Dave, and I'm not that diplomatic on this topic, as you know. What I see is a hot, overclocked core 2 machine taking longer to produce a WU. That ain't right, is it? Why? The page faults....My hot processors have to recover, don't they? I admit, I'm no software guru either, but I do have a small understanding about memory hardware access.......

I can see in my farm production that something is really wrong. And not just in my points. My WU production is off too....!!!!

Bob

Exactly what I saw here. The clovers do the other WU 95% as fast per core as a Q6600..close anyway, you get my meaning..Under 4 hours a WU where the Q6600 does them maybe 10-15 mins faster.
One of the guys PM'd me with an idea..Load the whole app onto a ram disk, not hardware, just into memory, could that work or not?

123bob

12-05-2007, 02:23 AM

Exactly what I saw here. The clovers do the other WU 95% as fast per core as a Q6600..close anyway, you get my meaning..Under 4 hours a WU where the Q6600 does them maybe 10-15 mins faster.
One of the guys PM'd me with an idea..Load the whole app onto a ram disk, not hardware, just into memory, could that work or not?

From the description of the "problem" I get in the first post, no, it won't help. The problem seems to be going outside the processor cache, to RAM memory. Not the HD. Others correct me if I'm off here....BTW, Dave,YGPM

Bob

Jaco

12-05-2007, 02:25 AM

Bob,

The cancer project is very important to me, but after reading DDTung's post, I decided to temporarily cancel the project :eek:
As a team ,we need to make a strong statement here. WCG has to fix the issues first , then we'll jump on this project again.

just my :2cents:

123bob

12-05-2007, 02:30 AM

I'm very close to saying the same thing. I've moved my "hottest" machines to DDDT already......

I've got many more still doing HCC, but it is making me think HPF2 is where they will go....

Quote my posts to WCG, I don't care. If they can't hear, we can/will make them listen.......I'm no slacker on the production page there either.........

Bob

Movieman

12-05-2007, 02:36 AM

From the description of the "problem" I get in the first post, no, it won't help. The problem seems to be going outside the processor cache, to RAM memory. Not the HD. Others correct me if I'm off here....BTW, Dave,YGPM

Bob

See, told you I don't know software..:rofl:
But I can learn..So this is happening because the computations are going to a pagefile, i.e. the pagefaults, and that is the slowdown?

123bob

12-05-2007, 02:39 AM

See, told you I don't know software..:rofl:
But I can learn..So this is happening because the computations are going to a pagefile, i.e. the pagefaults, and that is the slowdown?

Yup.....

Movieman

12-05-2007, 02:43 AM

Yup.....

Could a bigger cached chip be the answer?
I wonder what would happen if I set the clover to do just one WU at a time for an experiment..

123bob

12-05-2007, 02:49 AM

Could a bigger cached chip be the answer?
I wonder what would happen if I set the clover to do just one WU at a time for an experiment..

Try it. It may cost short term, one to two days, but we may learn by it....Let me know ASAP.

We'll bring our specs to Intel regarding Nehalem, and try it there too......:ROTF: I'm sure they will listen......:up:

Movieman

12-05-2007, 02:54 AM

Try it. It may cost short term, one to two days, but we may learn by it....Let me know ASAP.

We'll bring our specs to Intel regarding Nehalem, and try it there too......:ROTF: I'm sure they will listen......:up:

Why go to Intel , I'm just about to assemble my 16 core dual Nehalem tomorrow..
(Pinches self, wakes up, wonders why desk now has a hole punched through it from the bottom up, notices torn pants and damp feeling)

I'll try one on the clover BUT it's not a real quad and had 2-4mb caches per chip, wonder if they can both be used.
Worth a one day shot..Just to see the time it takes to do the WU

123bob

12-05-2007, 07:21 AM

Why go to Intel , I'm just about to assemble my 16 core dual Nehalem tomorrow..
(Pinches self, wakes up, wonders why desk now has a hole punched through it from the bottom up, notices torn pants and damp feeling)

I'll try one on the clover BUT it's not a real quad and had 2-4mb caches per chip, wonder if they can both be used.
Worth a one day shot..Just to see the time it takes to do the WU

I'll tweak a profile and try a similar experiment here....

I do notice it is my fastest machines that get hit worse....to the tune of about 1/2 claimed vs granted when it occurs. It appears to occur on about 1/4 to 1/3 of the WUs.

EDIT: here's some examples. Maybe others can see more here. This is a quad at 3.6 with 2 gig RAM on a P5K deluxe. Like your new rig MM....

X0000048581513200503181019_ 0-- bobs-farm-01 Valid 12/04/2007 08:01:44 12/05/2007 12:30:11 2.80 64.2 / 62.9
X0000048580218200503181042_ 1-- bobs-farm-01 Valid 12/04/2007 07:08:56 12/05/2007 13:25:17 4.06 93.3 / 108.0
X0000048490284200504072220_ 1-- bobs-farm-01 Valid 12/04/2007 05:18:51 12/05/2007 11:57:14 2.93 65.4 / 75.6
X0000048490273200504072220_ 0-- bobs-farm-01 Valid 12/04/2007 05:17:08 12/05/2007 09:45:04 3.34 75.0 / 75.0
X0000048480998200503171255_ 0-- bobs-farm-01 Valid 12/04/2007 03:03:24 12/05/2007 09:45:04 4.67 104.7 / 59.3
X0000048480269200503171308_ 0-- bobs-farm-01 Valid 12/04/2007 02:35:40 12/05/2007 09:10:57 4.69 104.6 / 97.3
X0000048441493200504071942_ 0-- bobs-farm-01 Valid 12/04/2007 02:25:52 12/05/2007 06:44:47 4.39 102.2 / 55.5
X0000048441076200504071949_ 1-- bobs-farm-01 Valid 12/04/2007 02:12:31 12/05/2007 06:44:47 4.14 96.4 / 56.7
X0000048431444200503171022_ 1-- bobs-farm-01 Valid 12/03/2007 23:24:44 12/05/2007 06:44:47 5.12 119.2 / 87.7
X0000048431378200503171023_ 0-- bobs-farm-01 Valid 12/03/2007 23:23:06 12/05/2007 06:44:47 4.59 106.8 / 106.8
X0000048430824200503171033_ 0-- bobs-farm-01 Valid 12/03/2007 23:01:00 12/05/2007 06:44:47 3.46 80.4 / 80.4
X0000048290481200504061533_ 1-- bobs-farm-01 Valid 12/03/2007 21:54:02 12/05/2007 06:44:47 4.78 111.3 / 97.8
X0000048020870200503252217_ 1-- bobs-farm-01 Valid 12/03/2007 20:26:16 12/05/2007 06:44:47 4.59 106.8 / 73.2
X0000048020067200503041142_ 1-- bobs-farm-01 Valid 12/03/2007 19:05:46 12/05/2007 06:44:47 5.20 120.9 / 66.9
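
A quick way to check the "about half, on about 1/4 to 1/3 of the WUs" estimate against the list above is sketched below in Python. It assumes the last two columns are claimed / granted, and the 0.6 cutoff for "about half" is an arbitrary choice made for this sketch.

```python
# (claimed, granted) pairs copied from the list above, top to bottom.
results = [
    (64.2, 62.9), (93.3, 108.0), (65.4, 75.6), (75.0, 75.0),
    (104.7, 59.3), (104.6, 97.3), (102.2, 55.5), (96.4, 56.7),
    (119.2, 87.7), (106.8, 106.8), (80.4, 80.4), (111.3, 97.8),
    (106.8, 73.2), (120.9, 66.9),
]

LOW_RATIO = 0.6  # assumption: "about half" means granted/claimed below 0.6


def low_grants(pairs, threshold=LOW_RATIO):
    """Return the work units whose granted credit fell well below the claim."""
    return [(claimed, granted) for claimed, granted in pairs
            if granted / claimed < threshold]


hit = low_grants(results)
print(len(hit), "of", len(results))  # 4 of 14
```

Four of fourteen is a bit under 30%, so the listed results line up with the estimate.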

Movieman

12-05-2007, 07:29 AM

I'll tweak a profile and try a similar experiment here....

I do notice it is my fastest machines that get hit worse....to the tune of about 1/2 claimed vs granted when it occurs. It appears to occur on about 1/4 to 1/3 of the WUs.

I was seeing almost 8 hours to do them vs 4 for other machines.
Now the clovers might not be the fastest out there but there is nothing that is on air that will out do them by that factor..not yet anyway..:D
Maybe a Yorkie at 5000 will, but there aren't any of those on WCG yet that I know of..Maybe 4400-4600 tops..

Movieman

12-05-2007, 08:05 AM

Ok, double post but I want to show some numbers:
Mesh clover @3111mhz..
First is with a mix of FAAH and HCC WU:
Date...........CPU Time..........Points...WU

11/29/2007 0:007:08:26:21 20,387 39
11/28/2007 0:009:11:04:39 22,878 42
11/27/2007 0:010:00:02:18 21,436 38
11/26/2007 0:010:12:53:39 23,824 43

This is JUST FAAH with no other changes:
12/04/2007 0:009:18:04:44 33,084 62
12/03/2007 0:010:02:14:11 32,971 66
12/02/2007 0:009:03:34:00 28,976 58
12/01/2007 0:010:14:55:05 33,998 64

The change was made on November 30 so I left that day out intentionally.
Remember, all this machine does is WCG, nothing else at all.
Look at 11/27 and 12/03 (the 2 highlighted days)..Almost the same amount of cpu time but a difference of 11,500+ points..
Scary huh?
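
To put a number on the comparison, the sketch below turns two of the rows above into points per CPU hour. It uses 11/27 and 12/03, the two days with nearly equal CPU time, and assumes the CPU Time column is years:days:hours:minutes:seconds; the helper names are made up for this example.

```python
def cpu_hours(stamp):
    """Convert 'y:ddd:hh:mm:ss' into hours (assumes 365-day years)."""
    years, days, hours, minutes, seconds = (int(p) for p in stamp.split(":"))
    return (years * 365 + days) * 24 + hours + minutes / 60 + seconds / 3600


def points_per_cpu_hour(stamp, points):
    """Points earned per hour of CPU time for one day's totals."""
    return points / cpu_hours(stamp)


mixed = points_per_cpu_hour("0:010:00:02:18", 21436)  # 11/27, FAAH + HCC mix
faah = points_per_cpu_hour("0:010:02:14:11", 32971)   # 12/03, FAAH only
print(round(mixed, 1), round(faah, 1))
```

That comes to roughly 89 points per CPU hour on the mixed day against roughly 136 on the FAAH-only day.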

123bob

12-05-2007, 08:16 AM

Yeah, real scary....I'll have to analyze my WU production difference on a couple machines. I suspect my HCC machines are not as efficient as they could/should be....

twilyth

12-05-2007, 10:45 AM

Yeah, real scary....I'll have to analyze my WU production difference on a couple machines. I suspect my HCC machines are not as efficient as they could/should be....
Bob and Dave - great set of posts.

I just want to get some clarification on the exact nature of the problem. Forgive me if this is a little pedantic, but I want to help all the members who may not have a clear understanding of some of this stuff.

When a cpu needs data it first looks in the L1 cache, which normally runs at the same speed as the cpu. Then it looks to L2, which I think is usually slower and finally, depending on the chip architecture, L3. If it misses on all 3, that's not technically a page fault is it?

When the cpu looks beyond L3 and goes to the memory controller, it tries to find the data in RAM. If it's not in ram it looks in the page file and if it's not there then it looks on the hard drive.

It's a page fault when it doesn't find it in the page file - is that right? Or is it a page fault when it doesn't find it in RAM?

Assuming that by page fault they mean not finding it in RAM, what would cause that and how could the application be written differently. LH mentioned something about clustering data but I didn't really understand that.

So is the problem with the HCC wu's the fact that its memory access is fragmented - in the sense that it's not asking for data when the cpu expects it to?

I'd really like to have a firm handle on exactly what's going on.

Thanks.
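
On the terminology: a miss in L1/L2/L3 that gets served from RAM is a cache miss, not a page fault. A page fault is raised when the page being touched is not currently mapped to physical RAM, and the OS then has to step in, either without any disk access (a minor or soft fault) or by reading from disk (a major or hard fault). The OS keeps a per-process count of both. Below is a minimal Python sketch of reading those counters; it uses the standard `resource` module, which exists on Unix only, and the 4 KiB page size and 64 MiB buffer are assumptions picked for the demo. It says nothing about how the HCC application itself behaves.

```python
import resource  # Unix only; not available on Windows

PAGE = 4096  # assumption: 4 KiB pages


def fault_counts():
    """Return (minor, major) page fault counts for this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


before = fault_counts()
buf = bytearray(64 * 1024 * 1024)  # 64 MiB of freshly allocated memory
for offset in range(0, len(buf), PAGE):
    buf[offset] = 1  # touch one byte in every page
after = fault_counts()

print("minor faults:", after[0] - before[0])
print("major faults:", after[1] - before[1])
```

On a machine with free RAM the major count normally stays at or near zero; touching fresh memory shows up as minor faults, which never go near the hard drive.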

sierra_bound

12-05-2007, 11:03 AM

I don't think the WCG advisers really know what's going on. They're just making an educated guess. If they knew what the problem was, they would fix it. But they aren't going to bother searching for a solution because it mainly affects a small number of users. Rather disappointing attitude on their part. Fortunately there is more than one project available.

This is not a totally new problem. Clovertown rigs generally do poorly in the quorums. And they do even worse in HCC quorums.

Newer projects generally have issues, which is one reason I'm hesitant to run them when they first become available.

linflas

12-05-2007, 11:36 AM

Seems no worse or better than any other WU I do on Vista 64, The media center is Q6600 @ 3.0, Cube is Q6600 @ 3.6. All WU from all projects report about the same way for me, and have from the start.

Sparky

12-05-2007, 12:00 PM

Mine is screwed up, take a look at the pic. Only 2 in the whole list there are the granted points even close to the claimed. Fold3 and fold5 are both single core machines too.

twilyth

12-05-2007, 12:04 PM

Whoa there Sparky - looks like you're gettin' r@ped. Those are some nasty lookin numbers.

123bob

12-05-2007, 01:54 PM

Yeah Sparky, that's AFU. I had thought from reading over here and on WCG that this situation was isolated to multicore machines. Sparky's numbers say it's single core too!

twilyth, it appears from the wikipedia explanation here (http://en.wikipedia.org/wiki/Page_fault), that there are a number of mechanisms involved and various levels of impact when it occurs.

I did get the mention in this about overclocking possibly messing up pointers. So, even though my machines never had problems with the other projects, I will drop the clocks to stock on two machines and monitor. (I know, I know, I'm talking blasphemy....It will be temporary and only for the benefit of the team....:p: )

Another interesting observation is that I seem to be having more machine lockups lately. I dropped one of the problem children on DDDT. Let's see how that runs.

I'm willing to suffer the growing pains to run HCC and do the testing to get it right. I just want to make sure there isn't a real problem underneath what we are seeing....right now I'm not so sure....:shrug:

Regards,
Bob

Jaco

12-05-2007, 02:30 PM

Yeah Sparky, that's AFU. I had thought from reading over here and on WCG that this situation was isolated to multicore machines. Sparky's numbers say it's single core too!

Another interesting observation is that I seem to be having more machine lockups lately. I dropped one of the problem children on DDDT. Let's see how that runs.

I'm willing to suffer the growing pains to run HCC and do the testing to get it right. I just want to make sure there isn't a real problem underneath what we are seeing....right now I'm not so sure....:shrug:

Regards,
Bob

Linux or windows boxes ?

Oc-Ghost

12-05-2007, 02:40 PM

Hmm... got some of em.
Not bad thou, but not good either :(

WU..........................................runtime...claimed/granted
X0000047310996200502181306_ 1-- 7.61 96.6 / 103.8
X0000045910150200501220043_ 1-- 8.68 107.4 / 111.2
X0000045250365200502042359_ 0-- 8.49 106.8 / 66.9
X0000044691341200412271450_ 0-- 7.64 87.6 / 97.1
X0000044021388200501071233_ 0-- 8.94 108.9 / 48.3
X0000042011346200411111423_ 1-- 8.53 103.9 / 61.0
X0000041930549200411271541_ 1-- 9.52 115.9 / 115.9

under those 2 low point wu`s:
first:
X0000045250365200502042359_ 0-- 8.49 106.8 / 66.9 mine
X0000045250365200502042359_ 1-- 8.03 66.9 / 66.9

second:
X0000044021388200501071233_ 0-- 8.94 108.9 / 48.3 mine
X0000044021388200501071233_ 1-- 5.55 48.3 / 48.3

RAMMIE

12-05-2007, 02:42 PM

Something is wrong with 123bobs and SparkyJJOs run time.I'm seeing the same as linflas.Q6600 @3.6 and E6600 @3.6
No way a Q6600 @ 3.6 takes 6-14hrs to do these WUs.

Sparky

12-05-2007, 05:45 PM

123bob

12-05-2007, 11:27 PM

Linux or windows boxes ?

The two involved on most lockups are both XSOS server 2003 32 bit based....Again, they have run other projects for months, no problem....:confused: They are at stock as of today.....

Something is wrong with 123bobs and SparkyJJOs run time.I'm seeing the same as linflas.Q6600 @3.6 and E6600 @3.6
No way a Q6600 @ 3.6 takes 6-14hrs to do these WUs.

That's possibly an indicator of errors and the recovery causing them to take longer? Half the farm has been moved tonight to other projects to get some more data......Not what the HCC folks would want to have happen, but tough. They need to hear us....

Bob

Movieman

12-06-2007, 03:07 PM

I've been reading this thread and it just fascinates me as with everything folding.

I never get reduced points via a quorum on my QX9650. Takes me about 2.3 hours to fold a HCC WU and I claim between 60-70 points and always get around that plus or minus.

X0000041910744200411272106_ 2-- xquad9650-2 Valid 12/06/2007 07:17:21 12/06/2007 21:00:56 2.23 62.2 / 62.2
X0000049361506200506092127_ 0-- xquad9650-2 Valid 12/06/2007 06:53:40 12/06/2007 20:59:16 2.24 62.3 / 64.8
X0000049280738200506081630_ 1-- xquad9650-2 Valid 12/06/2007 04:10:09 12/06/2007 16:27:54 2.13 59.3 / 57.9
X0000049270056200506081616_ 1-- xquad9650-2 Valid 12/06/2007 01:40:07 12/06/2007 13:21:14 2.16 60.2 / 61.4
X0000049250624200506081517_ 1-- xquad9650-2 Valid 12/06/2007 00:04:11 12/06/2007 12:19:23 2.12 59.0 / 65.7
X0000049201347200505171649_ 0-- xquad9650-2 Valid 12/05/2007 20:10:55 12/06/2007 10:24:56 2.25 62.7 / 57.2
X0000049201275200505171650_ 0-- xquad9650-2 Valid 12/05/2007 20:09:15 12/06/2007 10:08:18 2.27 63.3 / 70.2
X0000049201105200505171652_ 1-- xquad9650-2 Valid 12/05/2007 20:04:57 12/06/2007 10:08:18 2.25 62.9 / 67.2

But I am starting to see a couple of reduced points as outlined in this thread for my Q6600.

X0000049160108200505161058_ 1-- tiny-quad-q6600 Valid 12/05/2007 16:08:23 12/06/2007 22:42:29 4.21 81.9 / 85.9
X0000048670853200504140937_ 1-- tiny-quad-q6600 Valid 12/04/2007 14:35:53 12/06/2007 19:45:22 7.44 146.2 / 48.9
X0000048670948200503240943_ 0-- tiny-quad-q6600 Valid 12/04/2007 13:42:14 12/06/2007 17:12:32 7.00 137.2 / 137.2
X0000048591296200504111313_ 0-- tiny-quad-q6600 Valid 12/04/2007 10:38:02 12/06/2007 10:10:44 4.01 76.5 / 79.5
X0000048591313200504111312_ 1-- tiny-quad-q6600 Valid 12/04/2007 10:36:23 12/06/2007 09:42:29 4.04 76.6 / 76.6
X0000048590501200504111326_ 0-- tiny-quad-q6600 Valid 12/04/2007 10:09:43 12/06/2007 09:40:49 5.09 95.9 / 87.6
X0000048580045200504111308_ 1-- tiny-quad-q6600 Valid 12/04/2007 08:03:12 12/06/2007 08:30:58 6.16 118.0 / 65.4

So I wonder if my QX9650 isn't affected by the page fault issue because of the 12 meg cache?

Just a thought and I'll do some testing. Just moved both the boxes off HCC for a day or so to see the different results.

Again just fascinating how all this is calculated and works. I'd hate to move away from HCC, it's my favorite cause. Maybe I'll leave the Yorkie on HCC, and move the Q6600 to something else.

As much as I love the points, can't make it all about that, but that's just me.

andyc

Do me a favor? Put your Yorkie JUST on HCC for a week and lets see what happens to your averages.
Obviously if you see it going to hell fast pull it but it would be good to know if that bigger 12mb cache is a factor.

Philly_Boy

12-06-2007, 04:50 PM

I took the quad off HCC and its output increased from like 15K-16K a day to 19-20K over the past few days. The lappy's (work and home) are both core solo and are only doing HCC...the average output from them hasn't changed much. I'll let this run for a bit to see if it continues to bear fruit, but I have had a few 20K++ days from the micro-farm since I took HCC off the quad.

123bob

12-10-2007, 06:22 PM

I've got a few days under the belt for some testing on this.

I took Farm-6 and put it all at stock clocks. It is running win server 2003 based 32bit op sys. It still generates some HCC WUs that take in the 6 hour timeframe and get awarded about 50% of claimed. It appears to be slightly less than before, but still noticeable.

I took Farm-09 and Farm-10 and installed Vista ultimate 64 bit. Farm-09 is overclocked, Farm-10 is not. Both of these machines have run totally clean with work units taking in the 2-3 hour time frame and it's getting very close to all of the claimed points.

I could conclude based on this that there is something up with 32bit op sys and HCC. I can also conclude my overclocks don't have diddly to do with the problem.

So, I'm going to ultimate across the farm and put this behind me.....

EDIT: HARDWARE
Farm 6 - Q6600 B3, eVGA 680i LT, stock clocks
Farm 9 - Q6600 G0, Abit IPpro35, overclocked
Farm 10 - Q6600 G0, DS3R Rev2, stock clocked (soon to get the cr*p clocked out of it...)

Bob

Sparky

12-10-2007, 06:32 PM

123bob

12-10-2007, 06:37 PM

Hmm, perhaps I should put my copy of XP Pro 64 bit to use then? I hadn't used it on my PC because when I got it (bundled with XP Pro 32 bit, student software) it wasn't supported well, so I just used 32 bit instead.

You might try it on a dedicated cruncher that was causing a problem for you before. Let us know the hardware and how it runs. I'd be curious....

Also, I edited my above post to include hardware types.

Thx,
Bob

twilyth

12-10-2007, 06:41 PM

Movieman

12-10-2007, 06:42 PM

Funny you should mention that. I don't have enough baseline to see a solid average, but I have noticed big swings. That's one of the reasons I haven't been tweaking or pushing the OC over 4.15. Didn't want anything getting in the way of getting a baseline, like lockups while I'm asleep.

Also, I have about 40-55 WU in the "Pending Validation" queue on any given day so I was hoping they would catch up. I guess I'm folding the units so much faster than the average that it takes a while to get a quorum.

So once I get a good average, I'll do that HCC test. Should be another good experiment which I enjoy much.

One thing that's funny. Now that I have the Yorkie pulling in DDDT's it's spitting one out every 8 minutes on average. I'm getting about 17-18 points on those.

andyc

Now that is funny..8 minutes...JEEZ...What is the average timeframe on a FAAH WU?
An hour?:rofl:

123bob

12-10-2007, 06:47 PM

Thanks for taking the time to do this and for the info.

Did you post any of this on WCG or email the staff? I don't know how interested they are in resolving the issue, but if they are interested, they might be able to use it.

I have not monitored the thread on WCG. I have not had the time. Anyone here who is, is welcome to cut and paste my posts over to there.

Bob

Sparky

12-10-2007, 06:48 PM

You might try it on a dedicated cruncher that was causing a problem for you before. Let us know the hardware and how it runs. I'd be curious....

Also, I edited my above post to include hardware types.

Thx,
Bob

I may wait until after break. At this point by the time I get it set up and starting to run I'd be shutting down for a month. Don't think that would help us much.

ShootStraight

12-10-2007, 08:52 PM

:D

I'd say 34 by the looks of things. Thing's just a crunching monster, man:eek:

faah2776_ tl3_ xmd16280_ 01_ 0-- xquad9650-2 Valid 12/07/2007 23:30:07 12/08/2007 14:40:54 2.25 62.6 / 62.6
faah2776_ tl3_ xmd11780_ 0A_ 0-- xquad9650-2 Valid 12/07/2007 16:15:24 12/08/2007 06:51:53 2.27 63.2 / 62.2
faah2776_ tl3_ xmd11060_ 01_ 0-- xquad9650-2 Valid 12/07/2007 15:15:13 12/08/2007 05:01:24 2.26 63.0 / 66.5
faah2776_ tl3_ xmd09830_ 03_ 0-- xquad9650-2 Valid 12/07/2007 13:26:39 12/08/2007 04:42:36 2.26 63.0 / 69.4
faah2776_ tl3_ xmd09710_ 01_ 1-- xquad9650-2 Valid 12/07/2007 13:12:00 12/08/2007 02:47:25 2.26 62.9 / 68.0
faah2776_ tl3_ xmd07400_ 04_ 1-- xquad9650-2 Valid 12/07/2007 09:48:20 12/08/2007 00:12:22 2.27 63.2 / 71.4
faah2776_ tl3_ xmd07350_ 09_ 1-- xquad9650-2 Valid 12/07/2007 09:42:28 12/07/2007 22:44:46 2.26 63.0 / 58.8

I took the average time it took to do one WU and divided by 4 for the DDDT's, which was 32/4. So I guess you could say the FAAH's are coming out one about every 34 minutes or so. At least that's the way I was doing the math.

Just fun as hell,

andyc

The bold are the run times for the job. There is no dividing as that is measured core time. One core - one wu. Still fast though:up:

Edit: I like your quorums too. It's doing very well. What OS?
-SS

ShootStraight

12-10-2007, 09:08 PM

1 every 2.26h. If you were doing one every 34 minutes then there would have to be completion times close to 34 minutes, which there aren't. See? It is displayed per result, not in the aggregate, so there is no dividing for the # of cores. 4 * 2.26 = 9.04 hours computation time per 2.26hrs wall time.
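For anyone who wants to play with the numbers, here is a rough Python sketch of the arithmetic above. The function and variable names are made up for illustration; the only inputs taken from this thread are 4 cores and 2.26 hours of CPU time per result, and it assumes one WU per core with all cores busy.

```python
# Per-result CPU time vs. machine-wide output for a quad running one WU per core.
# All names here are illustrative; nothing below comes from BOINC or WCG.

CORES = 4
CPU_HOURS_PER_WU = 2.26  # as reported per result, measured on a single core


def aggregate_stats(cores: int, cpu_hours_per_wu: float) -> dict:
    """Return the per-result time alongside the machine-wide averages."""
    # Results finished per wall-clock hour across all cores.
    wu_per_hour = cores / cpu_hours_per_wu
    # Average gap between returned results, in minutes.
    minutes_between_results = 60.0 / wu_per_hour
    # Core time consumed during one per-result stretch of wall time.
    cpu_hours_per_wall_period = cores * cpu_hours_per_wu
    return {
        "hours_per_result": cpu_hours_per_wu,
        "minutes_between_results": minutes_between_results,
        "cpu_hours_per_wall_period": cpu_hours_per_wall_period,
    }


stats = aggregate_stats(CORES, CPU_HOURS_PER_WU)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

So both ways of saying it hold up: each result costs about 2.26 hours of core time, while the box as a whole returns a result roughly every 34 minutes on average.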

-SS

Movieman

12-10-2007, 10:05 PM

Oh OK,

I was calculating how many WU were getting crunched in a period of time in total, or "spit out", which would be misleading I guess.

Thanks, my bad.

andyc

I understand your math but we'd just say it as one WU every 2.25 hours and understand that you're actually doing 4 at a time.
My Q6600 at 3600 is doing a FAAH unit in approx. 2 hours 48-49 minutes.
That's a fast little bastage you got there..
Hmm..some day we're going to see one of these at 5000MHz on water and it'll be scary..

123bob

12-11-2007, 08:19 AM

Andyc, what op system are you running on that yorkie?

Thx,
Bob

123bob

12-12-2007, 10:56 PM

OK, as requested earlier in this thread, I have put my 2 cents on the relevant WCG thread. It is here. (http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=17244&lastpage=yes)

I hope the real folks that matter, the WCG techs and project folks, see it....

Bob

(You guys let me know if I was out of line. Maybe I get banned with only 3 posts over there, I don't know......:rofl: )

EDIT: twilyth or mods, should the title of this thread be edited to be " HCC points calculations"? It's more than multi cores.....

twilyth

12-12-2007, 11:12 PM

Sparky

12-13-2007, 09:11 AM

Actually twilyth, go to the WCG forum then double click next to the title of the thread. You may be able to rename it that way (sometimes it works, sometimes it doesn't :shrug:)

twilyth

12-13-2007, 10:07 AM

Actually twilyth, go to the WCG forum then double click next to the title of the thread. You may be able to rename it that way (sometimes it works, sometimes it doesn't :shrug:)
Thanks but I wasn't very clear - I meant that I tried to change the title for this thread. I thought it was possible but couldn't figure out how to do it.

I'm a butthead.

123bob

12-13-2007, 10:26 AM

I took wing position on your post and changed the title.

Edit - ok, only changed title for post - no way to change title for thread - Dave might be able to do it though.

Thanks again.

twilyth, thank you for taking wing. I got three adviser responses in a row so I think it was heard over there....:D

That post by "Highwire" was interesting. I'm not a programmer so I did not understand much of it, but I hope the project programmers see it....

Regards,
Bob

Dave, can you tweak the thread title? Thx.

123bob

12-20-2007, 09:20 AM

Posted the below at WCG....Vista for me, it looks like...

Some may already know this, but.....

I've been looking for a workaround. I may have found one. The data below (I hope it formats right on the forum... [:p]) shows a problem machine before and after a Vista upgrade. This machine was running Server 2003 32 bit before, with BOINC 5.10.13. It is now running Vista Ult 64 bit with BOINC 5.10.28. This is a Q6600, mildly overclocked. (And yes, this also happens on two stock clocked machines I've been testing too...)

I'm not sure if it's the Vista or the BOINC that stabilized this thing. Page fault counts went from billions to thousands with this move!! Look at how consistent the WUs run.

I've made this move on four machines. All four show the same stabilization.

Regards,
Bob

Win Server 2003 32 bit, BOINC 5.10.13
Result Name Device Name Status Sent Time Time Due / Return Time CPU Time (hours) Claimed/ Granted BOINC Credit
X0000053701414200507181229_ 1-- BOBS-FARM-04 Valid 12/15/2007 11:47 12/17/2007 1:14 3.21 69.0 / 67.2
X0000053700504200507181245_ 0-- BOBS-FARM-04 Valid 12/15/2007 11:16 12/17/2007 1:14 3.1 66.5 / 73.7
X0000053700476200507181246_ 1-- BOBS-FARM-04 Valid 12/15/2007 11:15 12/17/2007 4:03 5.74 122.9 / 74.5
X0000053700736200507180920_ 1-- BOBS-FARM-04 Valid 12/15/2007 10:27 12/17/2007 1:14 5.3 113.9 / 69.0
ll117_ 00044_ 2-- BOBS-FARM-04 Valid 12/15/2007 10:26 12/16/2007 23:52 4.21 90.2 / 83.7
X0000053691317200508152322_ 1-- BOBS-FARM-04 Valid 12/15/2007 9:42 12/16/2007 23:52 5.17 110.8 / 57.8
ll116_ 00160_ 4-- BOBS-FARM-04 Valid 12/15/2007 9:19 12/16/2007 23:52 4.11 88.0 / 83.3
X0000053691167200507181207_ 1-- BOBS-FARM-04 Valid 12/15/2007 8:40 12/16/2007 23:52 5.03 107.9 / 73.3
X0000053691233200507180843_ 1-- BOBS-FARM-04 Valid 12/15/2007 7:35 12/16/2007 16:08 4.55 97.3 / 103.8
X0000053690140200507180901_ 1-- BOBS-FARM-04 Valid 12/15/2007 6:49 12/16/2007 23:52 6.11 131.0 / 131.0
X0000053341031200507130919_ 1-- BOBS-FARM-04 Valid 12/14/2007 18:07 12/16/2007 1:58 5.09 109.2 / 68.3
X0000053201106200507120844_ 0-- BOBS-FARM-04 Valid 12/14/2007 12:59 12/16/2007 1:58 3.46 74.2 / 77.4
X0000053200888200507120849_ 0-- BOBS-FARM-04 Valid 12/14/2007 12:49 12/16/2007 1:58 3.83 82.2 / 75.1
X0000053140693200507111427_ 0-- BOBS-FARM-04 Valid 12/14/2007 11:30 12/15/2007 22:25 5.13 110.4 / 78.3
X0000052980196200507080905_ 1-- BOBS-FARM-04 Valid 12/14/2007 9:31 12/15/2007 22:25 5.36 115.5 / 85.3

Vista Ultimate 64 bit, BOINC 5.10.28
Result Name Device Name Status Sent Time Time Due / Return Time CPU Time (hours) Claimed/ Granted BOINC Credit
X0000055740864200508171137_ 1-- bobs-farm-04 Valid 12/19/2007 5:38 12/20/2007 16:29 2.69 68.1 / 76.8
X0000055740640200508171139_ 1-- bobs-farm-04 Valid 12/19/2007 5:29 12/20/2007 16:29 2.65 67.1 / 59.4
X0000055740408200508171144_ 1-- bobs-farm-04 Valid 12/19/2007 5:14 12/20/2007 16:29 2.7 68.4 / 66.3
X0000055520867200509022121_ 1-- bobs-farm-04 Valid 12/19/2007 3:01 12/20/2007 16:29 2.71 68.8 / 72.0
X0000055520791200509022122_ 0-- bobs-farm-04 Valid 12/19/2007 3:00 12/20/2007 16:29 2.65 67.2 / 60.5
X0000055521380200508191534_ 1-- bobs-farm-04 Valid 12/19/2007 1:13 12/20/2007 16:29 2.7 68.4 / 76.6
X0000055520127200508120832_ 0-- bobs-farm-04 Valid 12/18/2007 23:07 12/20/2007 8:20 2.74 69.4 / 78.8
X0000055520004200508120834_ 0-- bobs-farm-04 Valid 12/18/2007 23:05 12/20/2007 7:19 2.65 67.1 / 71.2
X0000055511496200509022044_ 0-- bobs-farm-04 Valid 12/18/2007 23:04 12/20/2007 7:19 2.66 67.3 / 67.4
X0000055511365200509022047_ 1-- bobs-farm-04 Valid 12/18/2007 23:02 12/20/2007 6:02 2.77 68.7 / 68.7
X0000055511363200509022047_ 0-- bobs-farm-04 Valid 12/18/2007 23:02 12/20/2007 6:02 2.67 66.1 / 68.3
X0000055510854200508262138_ 0-- bobs-farm-04 Valid 12/18/2007 18:52 12/20/2007 2:39 2.66 65.9 / 66.0
X0000055510719200508262140_ 1-- bobs-farm-04 Valid 12/18/2007 18:42 12/20/2007 1:08 2.7 66.8 / 74.1
X0000055510644200508262141_ 1-- bobs-farm-04 Valid 12/18/2007 18:40 12/20/2007 0:51 2.61 64.6 / 64.5
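A small Python sketch for putting numbers on the two tables above. It only uses the first four rows of each table, copied as posted, and assumes every row ends with CPU hours, claimed credit, a slash, and granted credit. The function names are invented for this example.

```python
# Summarize claimed vs. granted credit from rows pasted off the WCG results page.
# Assumption: each row ends with "<cpu hours> <claimed> / <granted>".

SERVER_2003_ROWS = """\
X0000053701414200507181229_ 1-- BOBS-FARM-04 Valid 12/15/2007 11:47 12/17/2007 1:14 3.21 69.0 / 67.2
X0000053700504200507181245_ 0-- BOBS-FARM-04 Valid 12/15/2007 11:16 12/17/2007 1:14 3.1 66.5 / 73.7
X0000053700476200507181246_ 1-- BOBS-FARM-04 Valid 12/15/2007 11:15 12/17/2007 4:03 5.74 122.9 / 74.5
X0000053700736200507180920_ 1-- BOBS-FARM-04 Valid 12/15/2007 10:27 12/17/2007 1:14 5.3 113.9 / 69.0
"""

VISTA_64_ROWS = """\
X0000055740864200508171137_ 1-- bobs-farm-04 Valid 12/19/2007 5:38 12/20/2007 16:29 2.69 68.1 / 76.8
X0000055740640200508171139_ 1-- bobs-farm-04 Valid 12/19/2007 5:29 12/20/2007 16:29 2.65 67.1 / 59.4
X0000055740408200508171144_ 1-- bobs-farm-04 Valid 12/19/2007 5:14 12/20/2007 16:29 2.7 68.4 / 66.3
X0000055520867200509022121_ 1-- bobs-farm-04 Valid 12/19/2007 3:01 12/20/2007 16:29 2.71 68.8 / 72.0
"""


def parse_row(row: str) -> tuple:
    """Pull (cpu_hours, claimed, granted) off the end of one result row."""
    tokens = row.split()
    # tokens[-2] is the "/" separator between claimed and granted.
    return float(tokens[-4]), float(tokens[-3]), float(tokens[-1])


def summarize(block: str) -> dict:
    """Average CPU time, its spread, and total granted/claimed for a block of rows."""
    rows = [parse_row(line) for line in block.splitlines() if line.strip()]
    cpu_hours = [r[0] for r in rows]
    claimed = sum(r[1] for r in rows)
    granted = sum(r[2] for r in rows)
    return {
        "results": len(rows),
        "mean_cpu_hours": sum(cpu_hours) / len(cpu_hours),
        "cpu_hours_spread": max(cpu_hours) - min(cpu_hours),
        "granted_over_claimed": granted / claimed,
    }


before = summarize(SERVER_2003_ROWS)
after = summarize(VISTA_64_ROWS)
for label, stats in (("Server 2003 32 bit", before), ("Vista Ult 64 bit", after)):
    print(label, {key: round(value, 3) for key, value in stats.items()})
```

On these eight rows, the Server 2003 results are granted roughly three quarters of what they claim and the run times are spread over more than two hours, while the Vista results are granted about what they claim and the run times sit within a few minutes of each other. Paste in the full tables to check the rest.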

sierra_bound

12-20-2007, 09:31 AM

I'm not sure the discrepancy between claimed and granted credit is simply due to page faults. On Linux systems, for example, there is often a wide gap between claimed and granted credit, especially when using 64-bit versions.

You may be on to something. But personally I refuse to run Vista because I feel it's a piece of bloatware. Server 2008 is as close as I'm willing to get.

My suspicion is that there is more than one factor causing some people to get hosed in the quorums. Excessive page faults may be one factor. I think there are others. Clovertown rigs have always had problems in the quorums. Linux 64-bit systems tend to produce huge benchmark results, but granted credit is not always that impressive. I had a Dhrystone score of over 15,000 with Kubuntu 64-bit. But as Movieman and others have noted, BOINC tends to punish machines with results that look too good to be true.

Ultimately, the answer should not have to be upgrading an OS. This has been Microsoft's strategy for years - drop support for older operating systems and force people to upgrade to a newer OS requiring more powerful (and expensive) hardware. This is one reason why Linux has been a Godsend for people with older, less powerful PC's.

123bob

12-20-2007, 10:06 AM

DISCLAIMER - I don't pretend to know what I'm doing here. Just observing the parts I can understand.

I'm relatively sure it does have to do with page faults. I have watched the counts and how long it has taken to execute specific WUs. The long ones generate billions of faults. Then when I went to Vista, all WUs execute in about the same time and the faults only number in the thousands....There must be some relation. Whether it is "cause and effect", or "disease and symptom", I don't know.

"Highwire" on the WCG forum went into some detail on this. I don't understand half of what he's talking about, but the general concepts seem logical.

I do agree that Vista is a bloated pig. I'm using it for evaluation purposes on dedicated cruncher machines only. My main rig will stay XP Pro for as long as I can see....Having said that, Vista sure looks to be a possible solution for those running HCC.

Do you have any experience with other 64 bit op systems and HCC? I am wondering if it's the Vista, the BOINC version, or just 64 vs 32 bit that has an effect?

I don't have server 2003 or XP 64 bit. I'd be willing to test it out if I can get some.

Regards,
Bob

sierra_bound

12-20-2007, 10:13 AM

You can download trial versions of Server 2003 and Server 2008 from Microsoft's site.

My feeling is that this has more to do with a coding/programming issue than anything else. Other projects like FAAH and DDDT don't seem to be as adversely affected.

I realize many people want to run HCC. It's a worthy cause and easy on memory resources. But there's no way I'm going to put my Clovertown rig back on that project until this issue has been resolved to my satisfaction.

123bob

12-20-2007, 10:23 AM

I fully agree with you. I had pulled all but 4 of my machines from HCC for the same reason. I really want to work cancer, but I can't stand to have my machines used inefficiently. There is no doubt that machines exhibiting the problem with HCC take longer to produce WUs. (My scores the last few days can prove that...I went to vista on 4 machines. Their scores, and my totals, went up....:D )

I also agree that it is unique to HCC. FAAH, HPF2, and DDDT do not show the points or the page fault issues. That points directly to a coding issue inside HCC. The million point question is why I can get it to run so smoothly on Vista and BOINC 5.10.28....:confused:

Thanks for the trial download tip. I think I'll do it and report results in a few days.

Regards,
Bob

brot

12-20-2007, 11:10 AM

Well, I think they changed the HCC app, because Linux clients now get HCC WUs.
AFAIK they stopped the distribution of HCC to Linux clients because they had some troubles, maybe they killed the bug :)

Jaco

12-20-2007, 11:13 AM

123bob

12-20-2007, 06:09 PM

The problem is apparently still there. See this reply from WCG to my post over there....

"Thanks 123bob,

This shows that the problem is a solvable one. I don't have it (much) on my one-core machine, so I have been just guessing why it hits other people so hard. It looks as though it will be some sort of problem such as Highwire suggested.

The techs are aware of this problem (and several others) but I won't start nagging them until after New Year's Day. devilish

Lawrence"

Brot, I would be curious to see if you are seeing this issue in Linux. How are your scores looking there?

Regards,
Bob

twilyth

01-24-2008, 09:28 AM

I believe the major problems are confined to people with 8 cores. 4 cores may also cause problems, and hyperthreading may be a problem (but I don't recommend hyperthreading anyway, and this is a good example of why not). It depends how much memory you have available.

If in doubt, check your page fault count.
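On the Windows boxes discussed in this thread, the usual place to look is the per-process page fault counter (Task Manager can show it as a column). For the Linux crunchers, the following is only a rough sketch of reading the same kind of counter from Python. It uses the standard `resource` module, which exists on Unix-like systems only, and it reports faults for the Python process itself, not for the HCC task. The helper names are made up for this example.

```python
# Read page fault counters for the current process (Unix-like systems only).
# This illustrates the counter itself; it does not inspect BOINC or HCC.
import resource


def page_faults() -> tuple:
    """Return (minor, major) page fault totals for this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # ru_minflt: faults serviced without disk I/O; ru_majflt: faults that needed I/O.
    return usage.ru_minflt, usage.ru_majflt


def faults_during(work) -> tuple:
    """Run work() and return how many (minor, major) faults it added."""
    minor_before, major_before = page_faults()
    work()
    minor_after, major_after = page_faults()
    return minor_after - minor_before, major_after - major_before


def touch_memory(size_bytes: int = 64 * 1024 * 1024, stride: int = 4096) -> None:
    """Allocate a buffer and write one byte per stride to force pages to be mapped."""
    buf = bytearray(size_bytes)
    for offset in range(0, size_bytes, stride):
        buf[offset] = 1


minor, major = faults_during(touch_memory)
print(f"minor faults: {minor}, major faults: {major}")
```

Writing to freshly allocated memory normally shows up as minor faults. The exact counts depend on the OS and allocator, so treat the numbers as relative, not absolute.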

As a rule of thumb, I would say "don't run HCC on more than 2 cores at once". Also, "allow 1GB per core running HCC". Like all rules of thumb, these may be totally useless. But I believe they cover the vast majority of cases reported here.
This is Diddy's comment on which machines are affected by the HCC problem (read his previous post too - pretty funny). Anyone care to refute these claims? Here's the thread - http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=17244&offset=130