Thanks for copy/pasting my post :rolleyes:
Quote:
Originally Posted by SunFlowerSeeds
http://www.xtremesystems.org/forums/...&postcount=116
Yes, original post by informal. Nice info that fits this thread. :toast:
Quote:
Originally Posted by informal
Nevermind :), I posted it in that thread about K8L, since someone mentioned a "bug" in Intel's 64-bit implementation (which doesn't exist). Nice to have that info here, since it's the right place for it. :)
Quote:
Originally Posted by SunFlowerSeeds
Conroe can't make use of 64-bit as well as the Athlons can = True.
Quote:
Originally Posted by Carfax
Conroe runs 64-bit slower, as was said here: "it's still not up to par with AMD's." = False. It's app and compiler dependent =P Note: there are 64-bit apps that take a hit on the Athlon 64 as well. Not every 64-bit app gains on the A64. Fact, not my opinion.
Where the Athlon 64 can gain more performance, MORE times than NOT it was STILL SLOWER when it came to the final results.
Example: a 64-bit game in 64-bit mode.
Conroe going from 119 FPS in 32-bit to 122 FPS in 64-bit is a smaller increase than the A64 going from 99 to 105 FPS, but the A64 was still slower. Or where AMD was beaten by 37% in 32-bit, it is now beaten by only 30% in 64-bit mode in another app. Get it straight?
http://techreport.com/reviews/2006q3.../index.x?pg=13
Just because it is on some website doesn't make it fact. Even this one could be wrong. 98% of this review was performed with WinXP Pro 64-bit edition.
Quote:
Originally Posted by TR
Now this:
Though the P4 still ended up slower, so far it has shown the greatest gains from going to 64-bit :)
Quote:
Originally Posted by SunFlowerSeeds
How do threads go from "Conroe sees less of a gain, only about 2%" to "Conroe runs slower using 64-bit than 32-bit"? Conroe gaining 2% versus the A64's 5% still leaves the A64 slower by a WIDE margin!
Question: do any of you guys have this software installed?
This is a good thread; good points are being made here. All we have to do is sit back and wait to see why there's such a gap in performance from 32 to 64. It could just be Intel holding the chip back from further optimizations that they already have, simply because they can (it already beats the competition, so why make it better than it has to be)...
Nevertheless, this is still a good thread when you look at the info it presents.
You know, just because Conroe doesn't gain as much in 64-bit doesn't mean EM64T is flawed. It means Conroe isn't as optimized for 64-bit.
Mountain out of a molehill.
Isn't Woodcrest a little different than Conroe? Yes, I know that it's Merom-based, but I thought I read that there were special optimizations for Woodcrest. It would be interesting to compare Conroe to Woodcrest in 64-bit apps.
Right!
Quote:
Originally Posted by dandragonrage
I think your screen name suits you very well... It would suit you even better, however, if you added a "d" to the end :banana:
Quote:
Originally Posted by dum
Fair? Nothing is fair in this world :)
Quote:
Originally Posted by AndrewZorn
You're right.. My apologies :toast:
Quote:
Originally Posted by thecoldanddark
Taken from the link in Informal's post:
http://babelfish.altavista.com/babel...igai288_01.jpg
I guess we now know why 64-bit doesn't produce as significant a performance boost as it does with the K8.
Macro-Ops fusion doesn't work in 64-bit mode, and Macro-Ops fusion is an important advantage of the C2D architecture!
Why Intel did this, no one knows. Maybe to save power or transistors, or because they didn't have enough time to implement it properly.
Another reason could be that they didn't want C2D to excel too much in HPC, as that's where they are positioning IA-64.. :confused:
Like everyone else said, compared to the X6800 the 62 gets pwnd.
Quote:
Originally Posted by Carfax
LOL, Conroe was created for Vista, not XP :slapass:
Even in XP it is fast; it does not matter if it is 32-bit or even 64-bit.
Wait and see for the Vista benchmarks.
This is all rubbish; I tested it myself. Conroe 64-bit code versus Opteron 64-bit code shows about the same speed difference as Conroe 32-bit code versus Opteron 32-bit code. There is no weakness in EM64T on Conroe.
I had the info here and why not give the full cut'n'paste just in case:
http://www.xtremesystems.org/forums/...d.php?t=104646
[begin copy'n'paste]
I ran some 64-bit benchmarks in addition to my 32-bit benchmarks.
In my 32-bit FreeBSD tests for C/C++ compilation, a 2.40 Conroe with DDR533 is 5-8% faster than a 2.60 Opteron with DDR432 (*), or about 14-17% faster per MHz.
Running some similar C/C++ compilations on 64-bit Linux (all software 64-bit), the 2.40 GHz Conroe is 18-25% faster than a 2450 MHz Opteron.
I'd say the lack of 64-bit performance on Conroe is a myth.
This is preliminary. I have problems controlling the Conroe's clocks from Linux, so I am not 100% sure I am actually at 2.4 GHz here. Damn BadAxe...
(*) Before anybody cries foul: the 5-8% for the 2.4 Conroe versus the 2.6 Opteron is only for C/C++ compilation; Conroe goes to a 20% advantage for scripting languages and Lisp and to 25-30% for low-quality video. The numbers here are purely for C/C++ compilation. A full 64-bit version of my benchmarks is in the works.
%%
I ran some other not to be named stuff and I see more Conroe kicking.
I have two cases of Conroe being 100% faster per clockspeed (3.66 GHz DC Conroe three times faster than 2.4 GHz dual single-core Opteron - which has slow RAM).
One case is 32 bit code and one is 64 bit code, so that is really a non-issue.
The performance characteristic of the 32-bit case is very weird. If you are familiar with Common Lisp programs, you can run them in "fast mode" and "safe mode". Safe mode adds all kinds of checks: integer overflow, extra type checking, argument checking, etc. Without the extra checks the Conroe is 50% faster per clockspeed; with the extra checks it goes towards 100%. It is almost like you get the checks for free.
The 64-bit case of 100% faster is just another random C/C++ compilation of a huge tree, and I currently don't have an explanation for why the difference is bigger than in my own benchmarks. The tree in question is certainly template-heavy; maybe it is the bigger cache.
%%
I am running out of time to analyse some of the more interesting/puzzling results and won't come to conclusions before the 4th of July weekend.
I'm throwing some raw data at you, numbers are CPU time used (this is single-thread):
This is compiling/building a Common Lisp system. (BTW, Netburst completely sucks here; I'm glad that episode is over.)
Code:
Lisp build:
fast safe (fast/safe = target policy)
2.4 939-DC-Opt 505 826 (memory 218 MHz)
2.6 939-DC-Opt 469 731 (memory 216 MHz)
2.4 940-SC-Opt 537 846
2.4 Conroe 328 439
3.66 Conroe 222 296
You can build the program two ways: safe-mode and fast-mode. The safe-mode build will insert a lot of extra checks like type checks, integer overflow checking, function parameter counting etc.
As you can see, the Conroe completely spanks the AMD64s in the safe-mode build, more so than in the fast-mode build. That is puzzling; I have never in my 15 years of Lisp programming seen anything like this. Suddenly you get the insertion of the extra checks almost for free, whereas so far you had to weigh getting the checks against twiddling your thumbs for a while during the build.
Here is a relative index, the first rows giving you:
- performance/MHz fast-mode
- (same, relative to 2.4 Opteron)
- performance/MHz safe-mode
- (same, relative to 2.4 Opteron)
That means Conroe reaches a 50% advantage over AMD64 per clockspeed for the fast-mode build but goes up to a 90% advantage for the safe-mode build.
Code:
fast (rel) safe (rel) fast safe
------------------------------------------------------------
1212.0 (1.0) 1982.4 (1.0) 2.4 939-DC-Opt 505 826
1219.4 (1.0) 1900.6 (1.0) 2.6 939-DC-Opt 469 731
1288.8 (0.9) 2030.4 (1.0) 2.4 940-SC-Opt 537 846
787.2 (1.5) 1053.6 (1.9) 2.4 Conroe 328 439
812.5 (1.5) 1083.4 (1.8) 3.66 Conroe 222 296
Possible explanations include:
- It's the larger cache. Since this compilation is single-threaded, the AMD64 has 1 MB, while the Conroe has between 2 and 4 MB depending on how the cache is associated with the one core running the compiler
- Some parallel execution unit in the Conroe picks up the extra instructions to insert the safety checks
- We get the extra instructions cheaper than on AMD64 due to some clever rescheduling in Core2
- The extra instructions to insert the safety check in the compiler normally blow memory prefetch in AMD64 and Netburst but Core2's prefetch is clever enough to find the stuff in advance
- I am doing something wrong here (not likely, I re-ran it a few times)
- All of the above
I need to run a few programs that report cache hits and the like, but I don't think any of the existing performance counter programs run on Conroe yet.
uOpt, I appreciate what you've done, but I think you're making an erroneous claim concerning EM64T as having no weaknesses.
From Intel themselves, it's known that Macro-Ops fusion is not supported in long mode, which is definitely a significant handicap.
Perhaps your code doesn't make much use of Macro-Ops fusion, but that doesn't mean other programs won't, as evidenced by the Panorama scores..
Anyway, all in all it's not such a big deal. Conroe's EM64T is still more than good enough to compete with AMD's.
The K8L however will most likely change this, as it's expected to have a very large increase in HPC code..
Hopefully Intel will enable more 64-bit optimizations in a future rev or die shrink, perhaps the Penryn core (45nm shrink of C2D)
From what I can see, macrofusion (or macro-op fusion) in existing Intel designs, including Core 2, is limited to the one case of a compare/test followed by a branch.
Quote:
Originally Posted by Carfax
While this is a common sequence, there is no way it is common enough to cause a visible performance difference on its own (let's say 1-2% of total performance), because the branch will trigger all kinds of other actions such as cache lookups (possibly misses), entering the data into the branch predictor for reuse in later runs through the same place, and a few other things.
The claim that my code doesn't make use of compare/test followed by a branch is outright ludicrous. Much of my benchmarking is compiler runs, which are in fact very heavy on that sequence. Media encoding would be an example where it's less involved.
I have also yet to see an official Intel statement that macrofusion is in fact disabled in EM64T on Core2.
Do not tell me you're running 3-4-4-8 on the Opteron platform :(
Quote:
Originally Posted by uOpt
Taken from Anandtech:
Quote:
Originally Posted by uOpt
Of course, this doesn't disqualify your statement, but it does shed some light on the discussion, I think..
Quote:
The result is that on average, in a typical x86 program, for every 10 instructions, two x86 instructions (called macro-ops by Intel) are fused together. When two x86 instructions are fused together, the 4 decoders can decode 5 instructions in one cycle. The fused instruction travels down the pipeline as a single entity, and this has other advantages: more decode bandwidth, less space taken in the Out of Order (OoO) buffers, and less scheduling overhead. If Intel's "1 out of 10" claims are accurate, macro-ops fusion alone should account for an 11% performance boost relative to architectures that lack the technology.
If I were a programmer, I would be able to have a much more in-depth discussion with you about this.. However, I work in the medical field, so I'll have to cop out :D
Yep, that was my fault. I shouldn't have said your code doesn't make use of Macro-Ops fusion.
Quote:
The claim that my code doesn't make use of compare/test followed by a branch is outright ludicrous. Much of my benchmarking is compiler runs, which are in fact very heavy on that sequence. Media encoding would be an example where it's less involved.
It depends on whether you consider blue slides with the Intel logo official ;)
Quote:
miss an official Intel statement that macrofusion is in fact disabled in EM64T on Core2.
That is the wrong assumption right there.
Quote:
The result is that on average, in a typical x86 program, for every 10 instructions, two x86 instructions (called macro-ops by Intel) are fused together. When two x86 instructions are fused together, the 4 decoders can decode 5 instructions in one cycle. The fused instruction travels down the pipeline as a single entity, and this has other advantages: more decode bandwidth, less space taken in the Out of Order (OoO) buffers, and less scheduling overhead. If Intel's "1 out of 10" claims are accurate, macro-ops fusion alone should account for an 11% performance boost relative to architectures that lack the technology.
First of all, the math is wrong. If 10% of instructions are test/cmp-jcc sequences that are sped up by 50%, then you end up with a theoretical advantage of 5%, not 10%. You don't eliminate those 10%; you cut them in half.
But much more importantly, just reducing the execution time of pure instructions by 5% in a modern microprocessor doesn't make your program run 5% faster, by far. If that were the case, you would need no caches, no fast RAM, no out-of-order execution, no pipelines. In fact, reducing just the execution time can account for almost nothing. That is particularly true for jump instructions, which are subject to potential overhead all over the place, as I outlined above.
I don't say it is useless. But the effect will be small, and my benchmarks show it isn't a big deal even if it is in fact not supported in EM64T. Conroe keeps its advantage over AMD64 equally in 32-bit and 64-bit code.
I would also like to see Intel confirm, on Intel's own site, the initial chart stating that macrofusion is not supported in EM64T. The above charts are hosted by a third party and appear to be photographed from a live presentation. While I don't doubt that the presentation was real, there is still room for screwups. In any case, for me there is no such disadvantage to 64-bit code on Core2.
To be a little more useful I decided it's time to give Intel's current own compiler a spin to see whether that speeds up some of the code here.
Please don't quote mega-posts in full.
Quote:
Originally Posted by fhpchris
Yes, the particular test for 64 bits above was done with memory at 3-4-4-8. I needed 4 GB of RAM.
You can compare the effect of memory timings and memory frequency on the AMD64 for applications similar to what I used for these numbers here:
http://forum.useless-microoptimizati...ch-memory.html
You can see my base benchmarks for Core2 and AMD64, along with Netburst and Pentium-Ms here. This chart does include different memory setups for AMD64:
http://www.cons.org/cracauer/crabench/core2.user.html
http://www.cons.org/cracauer/crabench/core2.wall.html
Or cutting out some of the other platforms here: http://www.cons.org/cracauer/crabenc...only.user.html
http://www.cons.org/cracauer/crabenc...only.wall.html
Irony at its very, very best.
Quote:
Originally Posted by Carfax