!!!The Ultimate K8L Thread 2007 & Beyond!!!

**MAS** · 02-12-2007, 05:07 AM

We've already had confirmation that 40% overall is simply not happening
Randy Allen and Patrik Patla (AMD directors) told us about 40 percent, and suddenly brentpresley appears and tells us

that 40% overall is simply not happening

Take a closer look at the picture:

The 40% advantage is for a rough multitasking environment.
The 10% is for single-threaded applications.

**nn_step** · 02-12-2007, 06:30 AM

Originally Posted by brentpresley

NOT LIKELY. He has A2/A3 silicon (possibly EVEN B1 at this point), and unless there are MAJOR bugs to be fixed, we will not see more than 2-5% more in performance over the ES chips. That puts it AT BEST 15% better than C2D on average, but right now all we know for sure is that current steppings perform only 10% better. That is a FAR cry from the performance lead that AMD used to have. And like I said earlier, how many of us power users buy these is going to completely depend on how well they overclock.

I've already said that SSE will run better on K10, but you have SEVERELY misplaced your faith if you think programmers are going to optimize and rework their SW just for that (on average it takes 6-12 months to rework a MAJOR piece of SW to take advantage of instruction-level changes like SSE/2/3/4 and most companies just aren't going to put the resources forward to optimize it when they consider current performance as adequate). As someone who has done a few years worth of programming, I can can tell you that the VAST majority of programs will remain integer-based. Only multimedia apps will continue to use floating point operations. There simply is no benefit to making integer apps faster (how FAST can you make Word? Everything is virtually instantaneous as is).

But hey, keep dreaming.

Funny, in my favorite programming class the teacher drilled code optimization and code splitting into us: knowing when to use integer approximation and when to do massively parallel floating point. Fun class, but definitely not for beginners.

**~~SOLDNER-MOFO64~~** · 02-12-2007, 07:00 AM

Originally Posted by brentpresley

NOT LIKELY. He has A2/A3 silicon (possibly EVEN B1 at this point), and unless there are MAJOR bugs to be fixed, we will not see more than 2-5% more in performance over the ES chips. That puts it AT BEST 15% better than C2D on average, but right now all we know for sure is that current steppings perform only 10% better. That is a FAR cry from the performance lead that AMD used to have. And like I said earlier, how many of us power users buy these is going to completely depend on how well they overclock.

Everyone who owns a C2D depends on how well it OCs!!! They'd be 30%-50% less powerful if they didn't scale as well.

Well....how many of you power users OC your C2D's? 100% of you?

When you start talkin' OC's then the performance gap will only grow further.

We all know a C2D HAS to be overclocked in order to attain the performance levels everyone talks of, so why should it be any different for AMD?

**Motiv** · 02-12-2007, 07:08 AM

All we can go off are estimates of Performance and a speculative 10% from s7's under NDA guy.

If this is the only figure we know of, then we expect probably 15% more speed than a c2 (clock for clock). Until other figures are released it is nothing but a pointless argument.

**~~SOLDNER-MOFO64~~** · 02-12-2007, 07:11 AM

Originally Posted by brentpresley

Barcelona INFO straight from AMD:

http://www.amd.com/us-en/Corporate/V...115794,00.html

Availability left open as "mid-2007"

Not much we didn't know, but good to have it in an official statement from AMD.

Agreed

Originally Posted by Motiv

All we can go off are estimates of Performance and a speculative 10% from s7's under NDA guy.

If this is the only figure we know of, then we expect probably 15% more speed than a c2 (clock for clock). Until other figures are released it is nothing but a pointless argument.

Hear, hear

**Shintai** · 02-12-2007, 08:01 AM

Just to finish the SSE FUD that somehow started: look at C2D vs. CD. Is the C2D something like 6x faster? C2D has 6x higher potential SSE throughput, but not that much of the total code is SSE.

Less dreaming, more reality please. x87-to-SSE patches for games don't even bring that much.

And SSE is still widely missing in many places... MS is trying to force the issue with no x87 in 64-bit, but mandatory SSE. However... don't dream. SSE is a nice boost but no miracle. It's more a matter of cleaning up the stupid x87 and getting it removed from the CPU over time.
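
The point that SSE is only part of the total code can be put in numbers with Amdahl's law. This is only a rough sketch: `overall_speedup` is a hypothetical helper, the SSE shares are made-up values, and the 6x is the peak ratio mentioned above, not a measurement.

```python
def overall_speedup(sse_fraction: float, sse_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the run time gets faster.

    sse_fraction: share of run time spent in SSE code (assumed, 0.0 to 1.0)
    sse_speedup:  how much faster that share becomes
    """
    if not 0.0 <= sse_fraction <= 1.0:
        raise ValueError("sse_fraction must be between 0 and 1")
    if sse_speedup <= 0:
        raise ValueError("sse_speedup must be positive")
    return 1.0 / ((1.0 - sse_fraction) + sse_fraction / sse_speedup)


# Hypothetical SSE shares; a real program would have to be profiled to get these.
for fraction in (0.1, 0.3, 0.5, 0.9):
    speedup = overall_speedup(fraction, 6.0)
    print(f"{fraction:.0%} SSE code, 6x faster SSE -> {speedup:.2f}x overall")
```

With only 10% of the run time in SSE code, even a 6x faster SSE unit gives roughly a 1.09x overall gain under this model.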

**nn_step** · 02-12-2007, 09:33 AM

Originally Posted by brentpresley

Then you know first-hand as well how hard it is.

It is not just a matter of changing a compiler flag and there you go.

Absolutely. Fortunately, a well-made and documented program can be updated rather quickly. I remember helping on a project to convert audio encryption code from integer to SSE3; it took a couple of days, but the performance boost was huge.
So it ultimately comes down to how important performance is to you.

**\Karting_freak** · 02-12-2007, 11:30 AM

The clue is "estimates".
I hope it's at least half true.

That would still give C2D a run for its money.

**nn_step** · 02-12-2007, 12:30 PM

Originally Posted by brentpresley

he he, haven't seen too many of those.

honestly.

I definitely agree with you there. Heck, take ten seconds to look at Microsoft source code and you'll wonder how the hell they got it to run. Some of them just seem to love goto statements. But I must admit their binary interfaces and the assembly they use for them are extremely well made.
Unfortunately, the technically skilled aren't the ones writing the most code.
And if you really want to see a 300% speed increase, transcribe .NET programs to pure C code. Talk about a huge improvement.

**Lightman** · 02-12-2007, 02:57 PM

Some interesting bits:

A 65nm silicon-on-insulator process is used for producing the near-450-million transistor device, with dual stress liners and a silicon germanium process is used to speed up the pFETs. Eleven layers of copper and low-k dielectrics connect the device.

At 95 degrees Celsius, modelling suggests the processor will run at between 2.2 and 2.8GHz at 1.15 volts. Each of the four cores include eight temperature sensors. The on-chip northbridge contains a further six.

The memory interface is 400 to 800Mbps from a 1.7 to 1.9 volt supply for DDR2, and 800 to 1,600Mbps from 1.4 to 1.6 volts for DDR3.

The HyperTransport interface supports legacy HT1 and 2 modes as well as HT3 at 2.4Gbps with a peak of 5.2Gbps.

Source:
http://www.edn.com/article/CA6415782.html?partner=enews

Enjoy!

**accord99** · 02-12-2007, 03:39 PM

Originally Posted by LOE

64 bit - c2d is slower in 64bit mode due to 2 reasons - the iAMD64 and the lack of macro ops fusion in 64bit mode, c2d could easily lose 7-10% of its performance

But it's not slower; sometimes it's faster, sometimes it's slower depending on the application, just like the K8. Overall, it still remains the fastest 64-bit x86 processor available today.

heavy multithreading - we already see quad FX running inferior chips outperforming core2quad in heavy multithreaded scenarios, that gap will only grow bigger when K10 comes out

We see one or two unrealistic scenarios where this happens, and they require specific situations that benefit from the Quad FX's additional memory controller. However, in a single-socket system, the desktop versions of Barcelona will only have 1 memory controller and 12.8GB/s of memory bandwidth.

Most other heavy multi-threaded scenarios have the QX6700 beating the Quad FX just as easily as it does in single-threaded scenarios.

are you sure?

C2D can process one 128bit sse instruction per cycle, do you mean pentium has a 21.33 (128/6) bit SSE engine

A C2D can execute 1 128-bit multiply, 1 128-bit add plus a load, store and jump in the same cycle.

**~~SOLDNER-MOFO64~~** · 02-12-2007, 03:45 PM

Originally Posted by Lightman

Some interesting bits:

Source:
http://www.edn.com/article/CA6415782.html?partner=enews

Enjoy!

**doompc** · 02-12-2007, 05:22 PM

Originally Posted by accord99

But it's not slower; sometimes it's faster, sometimes it's slower depending on the application, just like the K8. Overall, it still remains the fastest 64-bit x86 processor available today.

All CPUs speed up in 64-bit mode due to the larger number of registers and the standard SSE2 instructions.
But Core 2 does not speed up as much as K8, since macro-fusion doesn't work in long mode.

On SSE execution, K10 has little advantage.
Core 2 has 3 SSE units plus one load and one store unit.
K8 has 3 FPUs (that do SSE) plus a load/store unit that does two loads/stores per cycle; on K10 the FPUs are widened to 128 bits, so it can do 3 128-bit SSE operations per cycle plus 2 loads/stores.
So Core 2 does 3 SSE, 1 load and 1 store. K10 does 3 SSE plus 1 load and 1 store, or 2 loads, or 2 stores.
http://www.xbitlabs.com/articles/cpu...amd-k8l_5.html

**JumpingJack** · 02-12-2007, 10:55 PM

Originally Posted by savantu

Really ? Not even Conroe manages 1-1.2 except on few codes.

P4 was around 0.3-0.7 and K8 0.5-0.9 at least for SPEC IIRC.

I don't know why 0.9 to 1.2 keeps sticking in my head, but in some code base yes, P4 could do that 0.9 to 1.2 (some apps within the SPECINT bench showed this high):

http://www.princeton.edu/~jdonald/re...uck_pact03.pdf

The benchmarks that perform best in this environment are mcf, art and swim at 93%, 97% and 98% of peak respectively. eon and wupwise have relatively high instruction throughput of 0.9 and 1.2 IPC respectively, while mcf and swim have relatively low IPCs of .08, .2 and .4 (all IPCs measured in µops). Not unexpectedly, then, those applications with low instruction throughput demands due to poor memory performance are less affected by the statically partitioned execution resources. See Figure 1 for a summary of results from these runs.

(EDIT: it is from reading this paper some time ago that 0.9 to 1.2 sticks in my head, because my first thought was wow... a P4 can actually do that.)

The IPC, of course, is very code dependent (compiler optimizations, instruction ordering, etc) and how the architecture handles the ILP efficiency, combined with all sorts of factors. Truth is I have looked over probably half dozen to dozen papers where the IPC is measured/calculated, HT helps, I have seen IPC as high as 1.6 in some code base. However, the original point is that it really really stunk in a general sense.... a long pipeline with unoptimized code for that situation will generally crater the efficiency.

Another example of how well and how poorly the P4 can do IPC-wise:
http://www.geocities.com/ykchen913/p...ions/CAECW.pdf

In h.264, the IDCT chain could get as high as 1.16 (see table 4). This is a good paper, as it also shows FSB utilization on a P4 is quite low even with a high L2 miss rate.... this is on a 533 MHz FSB .... and multimedia is likely to have the highest demand on FSB.

Anyway, C2D I do believe is significantly higher than 1.0 IPC on average (some will be low of course, but others high), but I have not found any studies or data that has measured it.

Barcelona appears to be heading for a good IPC boost. Achieving something higher than C2D will be a true accomplishment; C2D did a good job in this department to show the improvements. I am anxious to see the data.
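
For anyone who wants to play with the numbers, here is a minimal sketch of how IPC is computed from counters and how IPC and clock combine. The function names and every input value are hypothetical, chosen only to illustrate the arithmetic; none of them are measurements of a real chip.

```python
def ipc(retired: int, cycles: int) -> float:
    """Instructions (or µops) retired per clock cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return retired / cycles


def relative_throughput(ipc_a: float, ghz_a: float,
                        ipc_b: float, ghz_b: float) -> float:
    """Work per second of chip A divided by chip B, using IPC x clock."""
    return (ipc_a * ghz_a) / (ipc_b * ghz_b)


# Hypothetical counter readings: 1.2 million µops retired in 1 million cycles.
print(f"IPC: {ipc(1_200_000, 1_000_000):.2f}")

# Hypothetical comparison: a long-pipeline chip at 3.6GHz and 0.6 IPC
# against a shorter-pipeline chip at 2.4GHz and 1.0 IPC.
ratio = relative_throughput(0.6, 3.6, 1.0, 2.4)
print(f"Chip A does {ratio:.2f}x the work of chip B")
```

The same workload has to be used for both chips, since IPC is so code dependent.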

Jack

**JumpingJack** · 02-12-2007, 11:08 PM

Originally Posted by Lightman

Some interesting bits:

Source:
http://www.edn.com/article/CA6415782.html?partner=enews

Enjoy!

95C ... this has gotta be a typo....

**Lightman** · 02-13-2007, 12:36 AM

Originally Posted by JumpingJack

95C ... this has gotta be a typo....

No, it's not a typo. They are modeling core frequency at high temperatures because most servers are running in tight blade cases, people in Africa want to use air cooling, etc.
To be serious, I think it is the industry standard for measurements. Look at the speed modeling of other CPUs, like Intel's 80-core monster. Its speed was modeled at the same 95C temperature.

This of course has good prospects for us, because if you keep this core at around 65C then at 1.15V you can go a bit higher than 2.8GHz on a quad core. Add some volts and job done! 3GHz should be easy....

**~~Grayfox84~~** · 02-13-2007, 12:52 AM

Originally Posted by brentpresley

Keep on flamebaiting and you will go the way of Serge84.

Does it make your e-Pen_is bigger to throw barbs when you can't come up with any facts to refute my arguments. Only 2 year olds throw temper tantrums.

He didn't violate NDA to me, just confirmed for me what my other friend at DOE had already said.

It's not a fact, lol, it's only a delay. XD

And friends tend to lend you things.

Anyways, back on topic...
Please, by all means, show us the K10 your friend has. We would all love to see it. I mean, if he really did show you these performance numbers, why don't you/he post them? Any kind of numbers. If you're being truthful with all your arguments, please state/show more facts on what you're saying. I'm not trying to flame; my statement is not meant as such. It's only that you're not the kind of person who backs up their claims very well, and we all just want to know more. That's what this thread is all about, after all.

Originally Posted by Lightman

No, it's not a typo. They are modeling core frequency at high temperatures because most servers are running in tight blade cases, people in Africa want to use air cooling, etc.
To be serious, I think it is the industry standard for measurements. Look at the speed modeling of other CPUs, like Intel's 80-core monster. Its speed was modeled at the same 95C temperature.

This of course has good prospects for us, because if you keep this core at around 65C then at 1.15V you can go a bit higher than 2.8GHz on a quad core. Add some volts and job done! 3GHz should be easy....

These processors can get hot. Opterons are built to take this heat. Server conditions usually range from 70C to 80C+ in a standard blade. Some of us don't know about server conditions, but those who do should be taken at their word. It gets very hot in a blade server, more than most would like, lol. Besides, CPUs are not that fragile. Xeons and Opterons alike: you would be amazed at the punishment they can take in testing as well as in 24/7 use.

Originally Posted by doompc

All CPUs speed up in 64 bits due to the larger amount of registers and the standard SSE2 instructions.
But Core2 does not speed up as much as K8 since MacroFusion doesn't work in long mode.

On SSE execution K10 has little advantage.
Core2 has 3 SSEs plus one load and one store units.
K8 has 3 FPUs (that do SSE) plus the load/store unit that do two loads/stores per cycle, on K10 the FPUs are widened to 128 bit so it can do 3 128 bit SSE per cycle plus 2 load/stores.
So Core2 does 3 SSE, 1 load and 1 store. K10 does 3 SSE, 1 load and 1 store or 2 loads or 2 stores.
http://www.xbitlabs.com/articles/cpu...amd-k8l_5.html

Great find.

Originally Posted by LOE

are you sure?

C2D can process one 128bit sse instruction per cycle, do you mean pentium has a 21.33 (128/6) bit SSE engine

Pentium D needs 2 cycles to execute one 128 bit sse instruction.. it has a 64bit engine

core 2 duo has 2x the SSE throughput of pentium

Yes it is missing in minesweeper, but every serious app supports SSE, there are even apps that DO NOT RUN if they don't detect at least SSE

One of the reasons c2d is 50-100% faster than pentium clock for clock is its double SSE throughput - check apps that render 3d, or encode video

I took this as follows...

Originally Posted by accord99

But it's not slower; sometimes it's faster, sometimes it's slower depending on the application, just like the K8. Overall, it still remains the fastest 64-bit x86 processor available today.

We see one or two unrealistic scenarios where this happens, and they require specific situations that benefit from the Quad FX's additional memory controller. However, in a single-socket system, the desktop versions of Barcelona will only have 1 memory controller and 12.8GB/s of memory bandwidth.

Most other heavy multi-threaded scenarios have the QX6700 beating the Quad FX just as easily as it does in single-threaded scenarios.

A C2D can execute 1 128-bit multiply, 1 128-bit add plus a load, store and jump in the same cycle.

I agree, only K10 has dual memory controllers in the specs, according to data in the previous threads about K10. Current bandwidth on AM2 is 20GB/s; it will be nearly 3x that bandwidth-wise, and double that on memory bandwidth. According to DailyTech, was it...

http://www.channelinsider.com/print_...ls/191008.aspx

Quad-core parts and other Revision H parts are rumored to have two 64-bit independent memory controllers each with its own physical address space thus giving an opportunity to better utilize the available bandwidth in case of random memory accesses occurring in heavily multi-threaded environment. This approach is in a contrary to the previous "interleaved" design, where the two 64-bit data channels are bounded to a single common address space. It will be the first single-chip implementation of the non-uniform memory access architecture.

http://www.realworldtech.com/page.cf...0206035626&p=1

http://www.realworldtech.com/page.cf...0206035626&p=2

http://www.google.com/search?hl=en&q...rs&btnG=Search

Just some more info on K10 from previous threads. You all should really read that thread to get the lowdown on K10. I should post a link at the front of the page as a continuation, rather than constantly re-hunting for data and having people act like it never existed. Sometimes it's silly for somebody to have to repeat themselves. XD

http://www.xtremesystems.org/forums/...d.php?t=117702

**Shintai** · 02-13-2007, 02:05 AM

Originally Posted by LOE

are you sure?

C2D can process one 128bit sse instruction per cycle, do you mean pentium has a 21.33 (128/6) bit SSE engine

Pentium D needs 2 cycles to execute one 128 bit sse instruction.. it has a 64bit engine

core 2 duo has 2x the SSE throughput of pentium

Yes it is missing in minesweeper, but every serious app supports SSE, there are even apps that DO NOT RUN if they don't detect at least SSE

One of the reasons c2d is 50-100% faster than pentium clock for clock is its double SSE throughput - check apps that render 3d, or encode video

Don't try to mix numbers in your favour. Core has 1 SSE port that's 64-bit. Core 2 has 3 SSE ports that are 128-bit. Yet Core at the same FSB/clock isn't much slower than Core 2. And that's with all the rest of the improvements too.

**~~Grayfox84~~** · 02-13-2007, 02:13 AM

http://techreport.com/reviews/2006q3/core2/index.x?pg=2

**accord99** · 02-13-2007, 02:24 AM

Originally Posted by Grayfox84

I agree, only K10 has dual memory controllers in the specs, according to data in the previous threads about K10. Current bandwidth on AM2 is 20GB/s; it will be nearly 3x that bandwidth-wise, and double that on memory bandwidth. According to DailyTech, was it...

AM2's memory bandwidth is 12.8GB/s. If you plug a Barcelona core into an AM2 socket, that's all you get, since there are physically only two channels connecting the memory to the socket.

http://www.channelinsider.com/print_...ls/191008.aspx

Quad-core parts and other Revision H parts are rumored to have two 64-bit independent memory controllers each with its own physical address space thus giving an opportunity to better utilize the available bandwidth in case of random memory accesses occurring in heavily multi-threaded environment. This approach is in a contrary to the previous "interleaved" design, where the two 64-bit data channels are bounded to a single common address space. It will be the first single-chip implementation of the non-uniform memory access architecture.

Intel's current DDR2 memory controllers are already this way.

**Lightman** · 02-13-2007, 03:14 AM

Originally Posted by accord99

AM2's memory bandwidth is 12.8GB/s. If you plug a Barcelona core into an AM2 socket, that's all you get, since there are physically only two channels connecting the memory to the socket.

Intel's current DDR2 memory controllers are already this way.

Yep! At the rated PC6400 speed, of course. At the moment AMD and a few other companies are trying to push a higher memory specification through JEDEC. I heard PC8500 is the target for them.
This would allow the DESKTOP version of AMD's quad to get 17GB/s memory bandwidth...

Of course servers are different animals, and I think the maximum we will see would be PC6400 registered DIMMs (DDR-II 800MHz).
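
The 12.8GB/s and 17GB/s figures follow from simple peak-bandwidth arithmetic. A quick sketch, assuming a dual-channel setup with 64-bit channels; `peak_bandwidth_gb_s` is a hypothetical helper, and these are theoretical peaks, not measured throughput.

```python
def peak_bandwidth_gb_s(transfers_mt_s: float, channels: int = 2,
                        bus_bits: int = 64) -> float:
    """Theoretical peak: transfers per second x bytes per transfer x channels."""
    bytes_per_transfer = bus_bits // 8
    return transfers_mt_s * 1e6 * bytes_per_transfer * channels / 1e9


# DDR2-800 (PC6400) and DDR2-1066 (PC8500), both dual channel.
for name, rate in (("PC6400", 800), ("PC8500", 1066)):
    print(f"{name}: {peak_bandwidth_gb_s(rate):.1f} GB/s")
```

Real-world bandwidth will land below these numbers, so they only bound what a single-socket desktop part could see.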

**Carfax** · 02-13-2007, 04:49 AM

Regarding SSE2, I still expect Core 2 to have the edge over the K10, as it has a higher peak theoretical throughput.

Core 2 can issue a max of 6 SSE instructions per cycle, while the K10 can do 3.

Of course, there are other factors involved besides peak SIMD throughput, like latency and memory bandwidth, and the K10 will have the edge there.

But not enough to trounce C2D IMO.

As for INT, C2D should still maintain a healthy lead as the C2D is a beast in INT. It will be interesting to see which processor holds the performance crown for gaming, as games tend to be far more INT based than FP.

**Carfax** · 02-13-2007, 04:53 AM

Originally Posted by Shintai

Don't try to mix numbers in your favour. Core has 1 SSE port that's 64-bit. Core 2 has 3 SSE ports that are 128-bit. Yet Core at the same FSB/clock isn't much slower than Core 2. And that's with all the rest of the improvements too.

Core Duo has 2 64-bit SSE2 ports, not 1.

As for the closer than expected performance delta between C2D and CD, I put it down to two things:

1) Merom is FSB limited at 667, far moreso than Yonah.

2) Yonah was already a very efficient high IPC processor. Actually, it was even faster than the K8 clock for clock in everything but FP intensive apps.
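
The peak ratios being argued over come down to ports times width. A rough sketch using only the port counts claimed in this thread (not checked against Intel documentation); `peak_sse_bits_per_cycle` is a hypothetical helper.

```python
def peak_sse_bits_per_cycle(ports: int, width_bits: int) -> int:
    """Peak SSE data width processed per cycle: number of ports x port width."""
    return ports * width_bits


# Port counts as claimed in the thread, treated here as assumptions.
core2 = peak_sse_bits_per_cycle(3, 128)              # three 128-bit ports
core_duo_one_port = peak_sse_bits_per_cycle(1, 64)   # the quoted figure
core_duo_two_ports = peak_sse_bits_per_cycle(2, 64)  # the corrected figure

print(core2 / core_duo_one_port)   # 6.0
print(core2 / core_duo_two_ports)  # 3.0
```

Either way, the peak ratio is far larger than the clock-for-clock gap people actually see, which is why the FSB and Yonah's already high IPC matter.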

**Carfax** · 02-13-2007, 05:11 AM

Originally Posted by doompc

All CPUs speed up in 64 bits due to the larger amount of registers and the standard SSE2 instructions.
But Core2 does not speed up as much as K8 since MacroFusion doesn't work in long mode.

I'm willing to bet this will be addressed in Penryn.

On SSE execution K10 has little advantage.
Core2 has 3 SSEs plus one load and one store units.
K8 has 3 FPUs (that do SSE) plus the load/store unit that do two loads/stores per cycle, on K10 the FPUs are widened to 128 bit so it can do 3 128 bit SSE per cycle plus 2 load/stores.
So Core2 does 3 SSE, 1 load and 1 store. K10 does 3 SSE, 1 load and 1 store or 2 loads or 2 stores.
http://www.xbitlabs.com/articles/cpu...amd-k8l_5.html

I don't know how accurate this information is. As far as I know, the K10 can issue 2 SSE operations, and one SSE MOV per cycle in the floating point store pipe.

So that's three instructions peak. Core 2, on the other hand, can potentially do double the K10's SSE issue rate.

**doompc** · 02-13-2007, 05:32 AM

Carfax, it's not clear whether the FMISC unit (which does FLOAD in K8) will be widened to 128 bits. If not, it could not do one 128-bit SSE load per cycle.

Core 2 can theoretically issue 6 micro-ops per cycle, and it decodes a maximum of 2+3 instructions, but it fetches 16 bytes, which is only 128 bits. With the data in the buffer waiting to be decoded, it may decode an average of 3 instructions per cycle.
I bet the 32-byte instruction fetch will keep K10's FPUs much busier than Conroe's.
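
To illustrate the fetch argument, here is a back-of-the-envelope sketch. The average instruction lengths and the decode widths (4 for the 16-byte fetch case, 3 for the 32-byte case) are assumptions for illustration only, and the model ignores buffers, alignment and branches; the function names are hypothetical.

```python
def fetch_limited_ipc(fetch_bytes_per_cycle: int, avg_instr_bytes: float) -> float:
    """Upper bound on instructions per cycle set by fetch bandwidth alone."""
    if avg_instr_bytes <= 0:
        raise ValueError("avg_instr_bytes must be positive")
    return fetch_bytes_per_cycle / avg_instr_bytes


def sustained_ipc(fetch_bytes_per_cycle: int, avg_instr_bytes: float,
                  decode_width: int) -> float:
    """The smaller of the fetch bound and the decode width."""
    return min(fetch_limited_ipc(fetch_bytes_per_cycle, avg_instr_bytes),
               decode_width)


# Assumed average instruction lengths; SSE code with prefixes tends to be long.
for avg_len in (3.0, 5.0, 7.0):
    narrow = sustained_ipc(16, avg_len, decode_width=4)  # 16-byte fetch
    wide = sustained_ipc(32, avg_len, decode_width=3)    # 32-byte fetch
    print(f"avg {avg_len:.0f} bytes: 16B fetch -> {narrow:.2f}, "
          f"32B fetch -> {wide:.2f} instr/cycle")
```

Under these assumptions the 16-byte fetch only becomes the bottleneck once instructions average more than about 4 bytes, which is where long SSE encodings would start to hurt.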
