Thanks for copy/pasting my post :rolleyes:
Quote:
Originally Posted by SunFlowerSeeds
http://www.xtremesystems.org/forums/...&postcount=116
Yes, original post by informal. Nice info that fits this thread. :toast:
Quote:
Originally Posted by informal
Nevermind :), I posted it in that thread about K8L, since someone mentioned a "bug" in Intel's 64-bit implementation (which doesn't exist). Nice to have that info here, since it's the right place for it. :)
Quote:
Originally Posted by SunFlowerSeeds
Conroe can't make use of 64-bit as well as the Athlons can = True.
Quote:
Originally Posted by Carfax
Conroe runs 64-bit slower, as was said here: "it's still not up to par with AMD's." = False. It's app and compiler dependent =P Note: there are 64-bit apps that take a hit on the Athlon 64 as well. Not every 64-bit app gains on the A64. Fact, not my opinion.
Where the Athlon 64 can gain more performance, MORE times than NOT it was STILL SLOWER when it came to the final results.
Example: a 64-bit game in 64-bit mode.
Conroe going from 119 FPS in 32-bit to 122 FPS in 64-bit is a smaller increase than the A64 going from 99 to 105 FPS, but the A64 was still slower. Or where AMD was beaten by 37% in 32-bit, it is now beaten by only 30% in 64-bit mode in another app. Get it straight?
http://techreport.com/reviews/2006q3.../index.x?pg=13
Just because it is on some website doesn't make it fact. Even this one could be wrong. 98% of this review was performed with WinXP Pro 64-bit edition.
Quote:
Originally Posted by TR
Now this:
Though the P4 still ended up slower, so far it has shown the greatest gains from going to 64-bit :)
Quote:
Originally Posted by SunFlowerSeeds
How do threads go from "Conroe sees less of a gain, only about 2%" to "Conroe runs slower using 64-bit than 32-bit"? Conroe gaining 2% versus the A64's 5% still leaves the A64 slower by a WIDE margin!
Question: do any of you guys have this software installed?
This is a good thread; good points are being made here. All we have to do is sit back and wait to see why there's such a gap in performance from 32 to 64. It could just be Intel holding the chip back from further optimizations that they already have, simply because they can (it already beats the competition, so why make it better than it has to be)...
Nevertheless, this is still a good thread when you look at the info it presents.
You know, just because Conroe doesn't gain as much in 64-bit doesn't mean EM64T is flawed. It means Conroe isn't as optimized for 64-bit.
Mountain out of a molehill.
Isn't Woodcrest a little different than Conroe? Yes, I know that it's Merom-based, but I thought I read that there were special optimizations for Woodcrest. It would be interesting to compare Conroe to Woodcrest in 64-bit apps.
Right!
Quote:
Originally Posted by dandragonrage
I think your screen name suits you very well... It would suit you even better, however, if you added a "d" to the end :banana:
Quote:
Originally Posted by dum
Fair? Nothing is fair in this world :)
Quote:
Originally Posted by AndrewZorn
You're right.. My apologies :toast:
Quote:
Originally Posted by thecoldanddark
Taken from the link in Informal's post:
http://babelfish.altavista.com/babel...igai288_01.jpg
I guess we now know why 64-bit doesn't produce as significant a performance boost as it does with the K8.
Macro-Ops fusion doesn't work in 64-bit mode, and Macro-Ops fusion is an important advantage of the C2D architecture!
Why Intel did this, no one knows. Maybe to save power or transistors, or because they didn't have enough time to implement it properly.
Another reason could be that they didn't want C2D to excel too much in HPC, as that's where they are positioning IA-64.. :confused:
Like everyone else said, compared to the X6800 the 62 gets pwnd.
Quote:
Originally Posted by Carfax
LOL, Conroe was created for Vista, not XP :slapass:
Even in XP it is fast; it does not matter if it is 32-bit or even 64-bit.
Wait and see for the Vista benchmarks.
This is all rubbish; I tested it myself. Conroe 64-bit code versus Opteron 64-bit code shows about the same speed difference as Conroe 32-bit code versus Opteron 32-bit code. There is no weakness in EM64T on Conroe.
I had the info here and why not give the full cut'n'paste just in case:
http://www.xtremesystems.org/forums/...d.php?t=104646
[begin copy'n'paste]
I ran some 64-bit benchmarks in addition to my 32-bit benchmarks.
In my 32-bit FreeBSD tests for C/C++ compilation, a 2.40 Conroe with DDR533 is 5-8% faster than a 2.60 Opteron with DDR432 (*), or about 14-17% faster per MHz.
Running some similar C/C++ compilations on 64-bit Linux (all software 64-bit), the 2.40 GHz Conroe is 18-25% faster than a 2450 MHz Opteron.
I'd say the lack of 64-bit performance on Conroe is a myth.
This is preliminary. I have problems controlling the Conroe's clocks from Linux, so I am not 100% sure I am actually at 2.4 GHz here. Damn BadAxe...
(*) Before anybody cries foul: the 5-8% for the 2.4 Conroe versus the 2.6 Opteron is only for C/C++ compilation; Conroe goes to a 20% advantage for scripting languages and Lisp and to 25-30% for low-quality video. The numbers here are purely for C/C++ compilation. A full 64-bit version of my benchmarks is in the works.
%%
I ran some other not to be named stuff and I see more Conroe kicking.
I have two cases of Conroe being 100% faster per clockspeed (3.66 GHz DC Conroe three times faster than 2.4 GHz dual single-core Opteron - which has slow RAM).
One case is 32 bit code and one is 64 bit code, so that is really a non-issue.
The performance characteristic of the 32-bit case is very weird. If you are familiar with Common Lisp programs, you can run them in "fast mode" and "safe mode". Safe mode adds all kinds of checks: integer overflow, extra type checking, argument checking, etc. Without the extra checks the Conroe is 50% faster per clockspeed; with the extra checks it goes towards 100%. It is almost like you get the checks for free.
The 64-bit case of 100% faster is just another random C/C++ compilation of a huge tree, and I currently don't have an explanation for why the difference is bigger than in my own benchmarks. The tree in question is certainly template-heavy; maybe it is the bigger cache.
%%
I am running out of time to analyse some of the more interesting/puzzling results and won't come to conclusions before the 4th of July weekend.
I'm throwing some raw data at you, numbers are CPU time used (this is single-thread):
This is compiling/building a Common Lisp system. (BTW, Netburst completely sucks here; I'm glad that episode is over.)
Code:
Lisp build:
fast safe (fast/safe = target policy)
2.4 939-DC-Opt 505 826 (memory 218 MHz)
2.6 939-DC-Opt 469 731 (memory 216 MHz)
2.4 940-SC-Opt 537 846
2.4 Conroe 328 439
3.66 Conroe 222 296
You can build the program two ways: safe-mode and fast-mode. The safe-mode build will insert a lot of extra checks like type checks, integer overflow checking, function parameter counting etc.
As you can see, the Conroe completely spanks the AMD64s in the safe-mode build, more so than in the fast-mode build. That is puzzling; I have never in my 15 years of Lisp programming seen anything like this. Suddenly you get the insertion of the extra checks almost for free, whereas so far you had to weigh getting the checks against twiddling your thumbs for a while during the build.
Here is a relative index, the first rows giving you:
- performance/MHz fast-mode
- (same, relative to 2.4 Opteron)
- performance/MHz safe-mode
- (same, relative to 2.4 Opteron)
That means Conroe reaches a 50% advantage over AMD64 per clockspeed for the fast-mode build but goes up to a 90% advantage for the safe-mode build.
Code:
fast (rel) safe (rel) fast safe
------------------------------------------------------------
1212.0 (1.0) 1982.4 (1.0) 2.4 939-DC-Opt 505 826
1219.4 (1.0) 1900.6 (1.0) 2.6 939-DC-Opt 469 731
1288.8 (0.9) 2030.4 (1.0) 2.4 940-SC-Opt 537 846
787.2 (1.5) 1053.6 (1.9) 2.4 Conroe 328 439
812.5 (1.5) 1083.4 (1.8) 3.66 Conroe 222 296
Possible explanations include:
- It's the larger cache. Since this compilation is single-threaded, the AMD64 has 1 MB, while the Conroe has between 2 and 4 MB depending on how the cache is associated with the one core running the compiler
- Some parallel execution unit in the Conroe picks up the extra instructions to insert the safety checks
- We get the extra instructions cheaper than on AMD64 due to some clever rescheduling in Core2
- The extra instructions to insert the safety check in the compiler normally blow memory prefetch in AMD64 and Netburst but Core2's prefetch is clever enough to find the stuff in advance
- I am doing something wrong here (not likely, I re-ran it a few times)
- All of the above
I need to run a few programs that report cache hits and the like, but I don't think any of the existing performance counter programs run on Conroe yet.
uOpt, I appreciate what you've done, but I think you're making an erroneous claim concerning EM64T as having no weaknesses.
From Intel themselves, it's known that Macro-Ops fusion is not supported in long mode, which is definitely a significant handicap.
Perhaps your code doesn't make much use of Macro-Ops fusion, but that doesn't mean other programs won't, as evidenced by the Panorama scores..
Anyway, all in all it's not such a big deal. Conroe's EM64T is still more than good enough to compete with AMD's.
The K8L however will most likely change this, as it's expected to have a very large increase in HPC code..
Hopefully Intel will enable more 64-bit optimizations in a future rev or die shrink, perhaps the Penryn core (45nm shrink of C2D)
From what I can see, macrofusion (or macro-op fusion) in existing Intel designs, including Core 2, is limited to the one case of a compare/test followed by a branch.
Quote:
Originally Posted by Carfax
While this is a common sequence, there is no way it is common enough to cause a visible performance difference on its own (let's say 1-2% of total performance), because the branch will trigger all kinds of other actions such as cache lookups (possibly misses), entering the data into the branch predictor for reuse in later runs through the same place, and a few other things.
The claim that my code doesn't make use of compare/test followed by a branch is outright ludicrous. Much of my benchmarking is compiler runs, which are in fact very heavy on that sequence. Media encoding would be an example where it's less involved.
I have also yet to see an official Intel statement that macrofusion is in fact disabled in EM64T on Core2.
Do not tell me you're running 3-4-4-8 on the Opteron platform :(
Quote:
Originally Posted by uOpt
Taken from Anandtech:
Quote:
Originally Posted by uOpt
Of course, this doesn't disqualify your statement, but it does shed some light on the discussion, I think..
Quote:
The result is that on average, in a typical x86 program, for every 10 instructions, two x86 instructions (called macro-ops by Intel) are fused together. When two x86 instructions are fused together, the 4 decoders can decode 5 instructions in one cycle. The fused instruction travels down the pipeline as a single entity, and this has other advantages: more decode bandwidth, less space taken in the Out of Order (OoO) buffers, and less scheduling overhead. If Intel's "1 out of 10" claims are accurate, macro-ops fusion alone should account for an 11% performance boost relative to architectures that lack the technology.
If I were a programmer, I would be able to have a much more in-depth discussion with you about this.. However, I work in the medical field, so I'll have to cop out :D
Yep, that was my fault. I shouldn't have said your code doesn't make use of Macro-Ops fusion.
Quote:
The claim that my code doesn't make use of compare/test followed by a branch is outright ludicrous. Much of my benchmarking is compiler runs, which are in fact very heavy on that sequence. Media encoding would be an example where it's less involved.
It depends on whether you consider blue slides with the Intel logo official ;)
Quote:
miss an official Intel statement that macrofusion is in fact disabled in EM64T on Core2.
That is the wrong assumption right there.
Quote:
The result is that on average, in a typical x86 program, for every 10 instructions, two x86 instructions (called macro-ops by Intel) are fused together. When two x86 instructions are fused together, the 4 decoders can decode 5 instructions in one cycle. The fused instruction travels down the pipeline as a single entity, and this has other advantages: more decode bandwidth, less space taken in the Out of Order (OoO) buffers, and less scheduling overhead. If Intel's "1 out of 10" claims are accurate, macro-ops fusion alone should account for an 11% performance boost relative to architectures that lack the technology.
First of all, the math is wrong. If 10% of instructions are test/cmp-jcc sequences that are sped up by 50%, then you end up with a theoretical advantage of 5%, not 10%. You don't eliminate those 10%; you cut them in half.
But much more importantly, just reducing the execution time of pure instructions by 5% in a modern microprocessor doesn't make your program run 5% faster, by far. If that were the case, you would need no caches, no fast RAM, no out-of-order execution, no pipelines. In fact, reducing just the execution time can account for almost nothing. That is particularly true for jump instructions, which are subject to potential overhead all over the place, as I outlined above.
I don't say it is useless. But the effect will be small, and my benchmarks show it isn't a big deal even if it is in fact not supported in EM64T. Conroe keeps its advantage over AMD64 equally in 32-bit and 64-bit code.
I would also like to see Intel confirm, on Intel's own site, the initial chart stating that macrofusion is not supported in EM64T. The above charts are hosted by a third party and appear to be photographed from a live presentation. While I don't doubt that the presentation was real, there is still room for screwups. In any case, for me there is no such disadvantage to 64-bit code on Core2.
To be a little more useful I decided it's time to give Intel's current own compiler a spin to see whether that speeds up some of the code here.
Please don't quote mega-posts in full.
Quote:
Originally Posted by fhpchris
Yes, the particular test for 64 bits above was done with memory at 3-4-4-8. I needed 4 GB of RAM.
You can compare the effect of memory timings and memory frequency on the AMD64 for applications similar to what I used for these numbers here:
http://forum.useless-microoptimizati...ch-memory.html
You can see my base benchmarks for Core2 and AMD64, along with Netburst and Pentium-Ms here. This chart does include different memory setups for AMD64:
http://www.cons.org/cracauer/crabench/core2.user.html
http://www.cons.org/cracauer/crabench/core2.wall.html
Or cutting out some of the other platforms here: http://www.cons.org/cracauer/crabenc...only.user.html
http://www.cons.org/cracauer/crabenc...only.wall.html
Irony at its very, very best.
Quote:
Originally Posted by Carfax