I am confident that you and Particle will solve it quickly... if possible, keep us updated :)
2128mhz
pretty cool josh
So, you guys with all the experience... what are my chances of doing better than stock on a dual-socket Tyan, similar to if not exactly this: http://www.tyan.com/product_SKU_spec...&SKU=600000042
I should have this within the next week or so.
Well, you can expect 220-230MHz HTT... final frequency then depends on the CPUs' multiplier.
BTW, it turns out the Opty hates HFCC WUs... takes 24h for one; it claims 250 credits per WU but only gets <100 on average. So I put it on HCC only for now.
Can you run CineBench please? You beat my Wprime by 8 seconds...that's a helluva PC you have there.
Cinebench would probably only get a ten-times speedup because it doesn't scale well with cores.
Cinebench is totally :banana::banana::banana::banana:ed up on this one, for some reason.. 1577 single 14500 multi CPU :rofl:
And if you think the Opty is fast in wprime.. check out this ;)
Turns out I do have a faster Intel rig for wprime :wasntme:
cyberduid how about this?
http://www.hwbot.org/result.do?resultId=874727
No $1000 a pop... just around 200, as said before: http://cgi.ebay.com/AMD-1-9GHZ-Opter...d=p3286.c0.m14
What does 700 stand for? No 700 Opties, at least in Socket F, as far as I can recall.
jcool
did you figure out your memory problem?
Nope.. still there. No idea what I can still try at this point. Except for contacting Supermicro.
Some news on the matter,
thanks to 06F150fx4, who runs the same CPUs on a different motherboard and is getting the same bad latency, my suspicion that it may be due to the CPUs being B2 stepping (hello, TLB bug) seems confirmed. I will ask Supermicro if there is a workaround for this issue, but they'll probably just answer that Quads aren't supported on my Rev. 1.01 board anyway because it has no split power planes etc. :rolleyes:
No wonder those CPUs didn't cost as much as the other Optys on the egg.
Yeah. Ever wonder why they are so cheap? Now you know :rolleyes:
Here's why it's so damn slow: http://en.wikipedia.org/wiki/Transla...okaside_Buffer Similar to branch prediction in the P4, but not quite as bad.
Miss penalty: 10-30 clock cycles
Got a reply from SM tech support:
Quote:
Hi Sir,
For the memory speed question, please disable the “CPU Page Translation Table” option in the BIOS. Go to BIOS, Advanced / CPU Configuration / CPU Page Translation Table
Neat idea, I was getting all excited when I read it earlier today, but now I tried it and... well. Same result, nothing has changed, at least not in Everest/Sandra.
Hmmm, so you're certain it's due to the TLB bug and not how it's dealing with NUMA? Sort of curious how a software managed TLB would do, tho I don't know what the normal hit is for such... you could mess with Linux if you want to find out :D
An aside:
Chumbucket, a better comparison would be to the L1 cache: if there's a miss, it'll be looking at a rather similar number of cycles to fetch from either L2 [, L3] or main memory. The reason it's a better comparison is that the page table is also resident in memory*; it's just segregated to its own spot and specialized (whereas L1 caches hold general program instruction/data words, used in a different context). So if the entry [from the page table] being requested is not cached in the TLB, it has to take the slow route and go find it. Now, just like the general caches, there should be a fairly low miss rate, so the delay shouldn't make too much of an impact in the grand scheme of things... unless there's a bug/workaround involved :p:
*The reason I bring this up is because branch-prediction is just that, it predicts what's going to happen based on accumulated history, whereas a page table is just a bunch of entries saying each Effective[Virtual] address really points to this Real[Physical] address. The former deals with guessing where a computation(branch evaluation) will lead and pushes instructions into the pipeline ahead of time based on that, whereas E-A translation (page table lookup) is a strict correlation and must be looked up.
EDIT: hopefully this explanation reads through a little better...
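The hit-vs-miss flow described above can be sketched as a toy model. This is purely illustrative Python (the `ToyTLB` class is hypothetical and the cycle counts are made-up placeholders, not measured Barcelona numbers):

```python
# Toy model of address translation with a TLB in front of the page table.
# Latencies are illustrative placeholders, not real hardware numbers.

PAGE_SIZE = 4096
TLB_HIT_CYCLES = 1        # translation already cached in the TLB
TLB_MISS_CYCLES = 30      # have to walk the in-memory page table instead

class ToyTLB:
    def __init__(self, page_table):
        self.page_table = page_table  # virtual page -> physical page
        self.cache = {}               # the TLB itself (tiny in real HW)

    def translate(self, vaddr):
        """Return (physical address, cycles spent on translation)."""
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.cache:                 # TLB hit: fast path
            return self.cache[vpage] * PAGE_SIZE + offset, TLB_HIT_CYCLES
        ppage = self.page_table[vpage]          # TLB miss: slow table walk
        self.cache[vpage] = ppage               # fill the TLB for next time
        return ppage * PAGE_SIZE + offset, TLB_MISS_CYCLES

tlb = ToyTLB({0: 7, 1: 3})
paddr, cost = tlb.translate(0x10)      # first touch of page 0: miss
paddr2, cost2 = tlb.translate(0x20)    # same page again: hit
```

Note there's no guessing involved anywhere, just like the post says: a lookup either hits the cached translation or pays the walk penalty.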
Back to the issue:
I guess I never delved too deep into the Barcelona TLB bug, but I thought it was this: either you ran without the BIOS fix and the chip went pretty much full bore (for the architecture/implementation) with the risk of failure (freezing is what I remember hearing) under high load, or you ran with the fix and took a 10-20% hit. I could be completely wrong on this, so if anyone knows, please correct me.
I assume the BIOS feature you flipped puts you in the former situation (or latter, since you're disabling it??? confuses me now), hence I'm curious if it has to do with NUMA or some other part...
I am not certain about anything here, except for the fact that this rig has a HUGE problem with memory latency causing it to suck ass in some apps. Fortunately, it runs HCC WUs decently.
Unfortunately I neither have the time or nerve to start wrestling with Linux...
Erm.. wut? :confused: :shrug: :rofl: :p:
No idea really, it definitely doesn't freeze tho (unless I overclock it too high :cool: )
No idea if switching that one setting made any impact on performance, will find out about that soon I guess. But since it changed absolutely nothing in the synthetic benchies, I am guessing there won't be any real world difference here.
Maybe SM enabled the TLB fix permanently in their bios, who knows.
lol, I know BP and TLB are very different things. I was comparing the miss penalty (even though the penalty can be much worse for the P4). Wouldn't a full pipeline flush be worse than a cache miss, though?
The fix should already be enabled in the BIOS. Here is an article on the patch: http://techreport.com/articles.x/13741/ Latency is actually worse with it on, but it's better than a system hang.
Does the board have the option of turning the TLB workaround off? Because if so, I'd do it. For crunching, it isn't an issue, but the workaround in the bios for the TLB does hinder performance greatly. I had a 9600BE crunching and made sure the TLB thing wasn't enabled, and it crunched fine.
I've heard rumors that certain windows OSes on certain service packs automatically force the TLB thing (though maybe that was just rumors).
that's not a rumor ... it's a fact :mad:
IF AMD uses the same register settings as for the Phenom, then you might be able to use this tool http://xtreview.com/images/TLB_ver1.04.rar (Phenom TLB fix disable tool - for certain OSes :yepp:)
It works very well on my Phenom.
Vista with service pack 1 and later. I don't know about XP, though :shrug:
He can test it very easily: run the benchmark in WinRAR (Alt+B). More than 1000 kb/s (can also be a little lower depending on what else is running) means the TLB bug fix is disabled. If he only gets around 300-400 kb/s, the fix is enabled (and then it's time for the tool I mentioned above).
Got it. Thanks :D
Sorry jcool if you feel I'm detracting too much from the thread, it's the WCG forum after all :eek:
I would say try that program mreuter80 linked to, since the registers SHOULD be the same - except there's the fact that you're running 4 physical CPUs, so hopefully it knows how to address/configure all of them. I don't know the specific registers involved; a registry dump might shed some light on the matter as well, but the WinRAR trick is probably much easier.
The only reason I bring up NUMA is that if your tests are pulling from the wrong bank with the wrong CPU, then you'll definitely feel a performance hit (CPU0 -> HT -> CPU1's MemCtrlr/RAM -> HT -> CPU0). I've never had the chance to use a NUMA machine myself tho, so I don't know if it would be a problem by default or not.
K, I just found the analogy a little bit on the far side at the time, so I had to say my 2cents :shakes: :shrug: :) :p:
(was bored at work waiting for a simulation to finish ;) )
As for pipeline flush vs cache misses, that all depends on the pipeline length and memory subsystem design, quite situation/implementation dependent. Also, I assume you're mainly referring to flushes caused by branch mispredicts, though quite a few other things can cause them as well.
You could say a branch mispredict caused pipeline flush is often (but not always, as below) cycle bounded by the pipeline's length (baaad for P4), whereas a cache miss could potentially cause a pipeline flush (since the subsequent instructions issued might depend on that load/save hit), plus the cache miss will have to wait for the retrieval from L2/Ram/Storage, which can also vary based on outstanding requests (transaction could take tens to hundreds-of-thousands of cycles, so probably much longer than the pipeline flush).
Note that I'm mainly referring to data cache misses, since if there's an instruction cache miss then you just plain have to wait for it to load from memory before you can fill the pipeline again, which could happen on the mispredict if the prefetcher didn't do its job well enough :)
Now if you throw in SMT, trying to compare things gets even more fun - everything pulls from the same caches/etc., except pipeline flushes can now be marginalized by being thread-specific. The kickback is that total throughput/efficiency gets a good boost, something we can see with having all the WCG workunit threads going, where we care about the total output :up:
Damn, I forgot this is the 4x4 thread :brick: I guess this program will only run for one CPU.
You also need to install Crystal CPU ID before running the program (it was such a long time ago when I installed it that I forgot -- getting older ... who is Dave :rofl:).
Oooh, the benchmark numbers above are for a Phenom. I'm sure your Opteron system will show different numbers.
OK, with Crystal CPU ID comes the MSR Editor, where you can access the register settings directly (manually). Here is the link to the download page: http://crystalmark.info/download/ind...l#CrystalCPUID
Then apply the same steps for each "CORE" ... which is for your True 4x4 quite a bit :rofl:
Here is the link to our own XtremeSystems guide, with some nice pictures: http://www.xtremesystems.org/forums/...d.php?t=171105
Quote:
Select the Core in the main window.
Enter C0010015 in the MSR Number field and hit RDMSR.
Change the last hex digit. Bit Nr. 3 (8h) must be unset: if the last digit is 8h, use 0h; if it's 9h, use 1h. Hit WRMSR to apply the changes.
Now enter C0011023 in the MSR Number field and hit RDMSR.
Change the last hex digit. Bit Nr. 1 (2h) must be unset. If the last digit is 2h change it to 0h. Hit WRMSR to apply the changes.
Close the MSR Editor, select the next core, start the MSR Editor again, and change the registers the same way as described above.
There was also a way to do this as a batch in this guide. So if it works you might want to look into it to have a batch running when you start the machine.
I hope that will help. Good luck man ... I keep my fingers crossed :up:
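Mechanically, the bit-unset steps from the guide boil down to simple bitmasking. A minimal Python sketch (the helper names are mine; the register numbers and values are the ones quoted in the guide and read out later in the thread):

```python
# Apply the guide's TLB-fix-disable edits to the low 32 bits (EAX) of
# the two MSRs: bit 3 of MSR C0010015 and bit 1 of MSR C0011023 must
# be cleared; everything else is left untouched.

def clear_bit(value, bit):
    """Return value with the given bit forced to 0."""
    return value & ~(1 << bit)

def patched_msr_values(eax_c0010015, eax_c0011023):
    return (clear_bit(eax_c0010015, 3),   # e.g. ...18h -> ...10h
            clear_bit(eax_c0011023, 1))   # e.g. ...22h -> ...20h

# The values jcool read out of his Opterons:
new15, new23 = patched_msr_values(0x01000018, 0x00A00022)
print(hex(new15), hex(new23))  # 0x1000010 0xa00020
```

This also shows why only the last hex digit changes in the guide's instructions: both bits live in the lowest nibble.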
Thanks for that, unfortunately the program doesn't load. It says "vcl60.bpl missing" if I fire up the TLB disable exe and "unable to load dll" if I try the enable. Server 08 x64 SP2.
Winrar sucks ass BTW, 230 kb/s... :rolleyes:
And I haven't seen a bios option directly advertising a TLB fix, but I guess it's the Translation table thingy, why else would SM support tell me to disable it?
Maybe the motherboard disabled it but windooze won't.. argh!
Hey mreuter,
thanks, this seems to fire up at least.
But I don't really understand what I need to change the values to. The guide says:
"Change the last hex digit. Bit Nr. 3 (8h) must be unset. If the last digit is 8h use 0h if it's 9h use 1h. Hit WRMSR to apply the changes."
Which field are they talking about? And how do I convert the hex code to the actual numbers that I have to enter? :shrug:
Entering MSR number 0xC0010015 gives me 0x01000018 for EAX.
Entering MSR number 0xC0011023 gives me 0x00A00022 for EAX.
So, into what do I change them?
And by the way.. damn. Doing that 16 times will be tedious, I need to use that batch file if it works :rofl:
jcool... long shot, but is the MCP55 cooled enough? Did you remove its HS to make sure it's all nicely TIMed? I was thinking a bit of thermal throttling on the chipsets... :shrug:
Holy :banana::banana::banana::banana: Jeesus, it worked :D
I just followed this:
"Hi,
Guess i found a way to disable the tlb fix if aod does not work and there is no option in the bios.
The latest bios for my M2A-VM included the tlb-fix. My everest memory read bandwidth dropped around 20%.
I expected bit nr 3 in the MSR register C0010015 to be responsible for the fix. So I compared the values between the two bios versions.
The old version showed 0x00000000 0x01000010 the new one 0x00000000 0x01000018 (bit nr 3 set)."
Since it showed 0x01000018 for mine as well, I just changed all the 16 entries to 0x01000010 and bam...
1630kb/s in winrar instead of 270kb/s :D
Quite the performance increase..
mreuter, do you think I need to change that 2nd entry too? The 0xC0011023 register, that is.
Edit: Memory latency improved from 280ns to 151ns. Still crappy, but better.
Ok guys,
I got both fixes to work now, using this batch (and extending it until CPU 16)
To give you an idea of what changed - first up, stock (well, not really stock :D ) Quad Opteron 8347HE:
Quote:
Originally Posted by mibo
http://database.he-computer.de/Bilde...tencyissue.jpg
Yeah, it sucks. Big time.
Next up: Changing the 0xC0010015 register from 0x01000018 to 0x01000010 on all cores:
http://database.he-computer.de/Bilde...erest_1fix.jpg
Yay! Latency still sucks, but overall a big improvement.
And, one step further: Changing 0xC0011023 register from 0x00A00022 to 0x00200020 (not sure if I should change the A in there? oh well it works)
http://database.he-computer.de/Bilde...erest_2fix.jpg
Now that's even better. Note how it improves the L3 cache latency.
Some real world number improvements:
1. Winrar: No fix: 270KB/s - Fix 1: 1630KB/s - Fix 2: 1660KB/s
2. Cinebench: No fix: 14600 xCPU - Fix 2: 19000 xCPU
Will try more :)
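For the record, the gains work out as follows - plain arithmetic on the posted figures, nothing assumed beyond them:

```python
# Speedup factors from the before/after numbers posted above.
winrar_no_fix, winrar_fix1, winrar_fix2 = 270, 1630, 1660  # kb/s
cine_no_fix, cine_fix2 = 14600, 19000                      # Cinebench xCPU

print(round(winrar_fix2 / winrar_no_fix, 1))  # roughly 6x in WinRAR
print(round(cine_fix2 / cine_no_fix, 2))      # roughly 1.3x in Cinebench
```

So the memory-latency-bound WinRAR benchmark gains far more than the mostly compute-bound Cinebench, which fits the TLB-workaround explanation.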
A HUGE thank you goes out to mreuter80 for being spot-on with his analysis and pointing me in the right direction! :toast:
Glad you got it working :clap: :up:
Sweet! Nice to see things are going in the right direction!
{Coffee sipping} MOIN
Great to see it works.
Don't change the A. I checked on my Phenom and the value should be 0x00A00020
:shocked: ... :woot:
Thanks for the flowers, but I didn't do the analysis. I just gave you the hint with the software.
I'm very glad it works and the numbers are pretty cool. I'm curious whether it will work fine with all cores crunching. Most of the processors can do it, but it is a bug and might have an effect.
Now I wonder whether PoppaGeek's opterons might have that issue as well and he can increase his numbers. I will send him a PM to check.
Awesome.
Stupid MS and forcing that TLB crap!
I just changed the file to write 0x00A00020 instead of 0x00200020 for the 2nd register. Performance decreased slightly, about on par with Fix 1 (without writing anything to the 2nd register).
So you should maybe try 0x00200020, it seems faster for me. No stability issues so far, been running benches and crunching for a while now.
Re-enabled HFCC WUs for the machine as well. Shouldn't take 25h per WU now! :rolleyes:
@Riptide: Yeah the MCP is getting pretty toasty, that's probably the reason why this damned mobo won't go any higher than 211 HTT ATM for stable operation.
I already removed the stock fans (those 2 tiny HSFs are actually 1 piece, cooling the MCP and an AMD PCI-X bridge chip). Unfortunately they use an extremely thick thermal pad for the MCP, like 5mm :rofl:
Explains the :banana::banana::banana::banana:ty temps, but due to the HSF sitting on the PCI-X bridge as well I can't just remove it and put real TIM on there. I'll have to find new, individual heatsinks (thinking something real big for the MCP ^^ )
Right now I'd love to put my phase head on the MCP and see how it clocks at -45C :p: :rofl:
Vmods for the chipset, anyone?
According to this, the 1354 does not have the TLB bug.
Poppa, all B2 step Opterons suffer from the TLB Bug, regardless of their model number. B3 and newer procs don't.
By the way, the 4x4 has crunched through the night just fine, and it seems that even the stoically RAM-ignoring HCC project has gained a little: 16 WUs now complete in 7:20h instead of 8h (at 2GHz CPU speed)
Ah, so there are no 1354s with B2 step? That's fine then :)
2090mhz vs 2004mhz
unfair!!!
:p
Too hot for 2,1 :p: