!!!The Ultimate K8L Thread 2007 & Beyond!!!

02-12-2007, 11:08 PM
JumpingJack

Quote:

Originally Posted by Lightman

Some interesting bits:

Source:
http://www.edn.com/article/CA6415782.html?partner=enews

Enjoy!

:)

95C ... this has gotta be a typo....
02-13-2007, 12:36 AM
Lightman

Quote:

Originally Posted by JumpingJack

95C ... this has gotta be a typo....

No, it's not a typo. They are modeling core frequency at high temperatures because most servers run in tight blade cases, people in Africa want to use air cooling :p:, etc.
To be serious, I think it is the industry standard for measurements. Look at the speed modeling of other CPUs, like Intel's 80-core monster. Its speed was modeled at the same 95C temperature.

This of course has good prospects for us, because if you keep this core at around 65C then @1.15V you can go a bit higher than 2.8GHz on a quad core. :D Add some volts and job done! 3GHz should be easy.... :woot:
02-13-2007, 12:52 AM
Grayfox84

Quote:

Originally Posted by brentpresley

Keep on flamebaiting and you will go the way of Serge84. :fact:

Does it make your e-Pen_is bigger to throw barbs when you can't come up with any facts to refute my arguments. Only 2 year olds throw temper tantrums.

He didn't violate NDA to me, just confirmed for me what my other friend at DOE had already said.

It's not a fact, lol, it's only a delay. XD :fact: And friends tend to lend you things.

Anyways, back on topic...
By all means, please show us the K10 your friend has. We would all love to see it. I mean, if he really did show you these performance numbers, why don't you or he post them? Any kind of numbers. If you're being truthful with all your arguments, please show more facts to back up what you're saying. I'm not trying to flame; my statement isn't meant that way. It's only that you're not the kind of person who backs up their claims very well, and we all just want to know more. That's what this thread is all about, after all. :toast:

Quote:

Originally Posted by Lightman

No, it's not a typo. They are modeling core frequency at high temperatures because most servers run in tight blade cases, people in Africa want to use air cooling :p:, etc.
To be serious, I think it is the industry standard for measurements. Look at the speed modeling of other CPUs, like Intel's 80-core monster. Its speed was modeled at the same 95C temperature.

This of course has good prospects for us, because if you keep this core at around 65C then @1.15V you can go a bit higher than 2.8GHz on a quad core. :D Add some volts and job done! 3GHz should be easy.... :woot:

These processors can get hot, and Opterons are built to take this heat. Server conditions usually range from 70C to 80C+ in a standard blade. Some of us don't know about server conditions, but those who do should be taken at their word. It gets very hot in a blade server, more than most would like. lol Besides, CPUs are not that fragile. Xeons and Opterons alike, you would be amazed at the punishment they can take in testing as well as in 24/7 use.

Quote:

Originally Posted by doompc

All CPUs speed up in 64-bit mode due to the larger number of registers and the standard SSE2 instructions.
But Core2 does not speed up as much as K8 since MacroFusion doesn't work in long mode.

On SSE execution K10 has little advantage.
Core2 has 3 SSE units plus one load and one store unit.
K8 has 3 FPUs (that do SSE) plus the load/store unit, which does two loads/stores per cycle; on K10 the FPUs are widened to 128 bits, so it can do 3 128-bit SSE operations per cycle plus 2 loads/stores.
So Core2 does 3 SSE, 1 load and 1 store. K10 does 3 SSE, 1 load and 1 store or 2 loads or 2 stores.
http://www.xbitlabs.com/articles/cpu...amd-k8l_5.html

Great find.

Quote:

Originally Posted by LOE

Are you sure? :nono: C2D can process one 128-bit SSE instruction per cycle. Do you mean Pentium has a 21.33-bit (128/6) SSE engine? :rofl:

Pentium D needs 2 cycles to execute one 128-bit SSE instruction; it has a 64-bit engine.

Core 2 Duo has 2x the SSE throughput of Pentium.

Yes, it is missing in Minesweeper, but every serious app supports SSE; there are even apps that DO NOT RUN if they don't detect at least SSE.

One of the reasons C2D is 50-100% faster than Pentium clock for clock is its doubled SSE throughput. Check apps that render 3D or encode video.

I took this as follows... :ROTF: :wierd: :rofl:

Quote:

Originally Posted by accord99

But it's not slower; sometimes it's faster, sometimes it's slower depending on the application, just like the K8. Overall, it still remains the fastest 64-bit x86 processor available today.

We see one or two unrealistic scenarios where this happens, and it requires specific situations that benefit from the Quad FX's additional memory controller. However, in a single-socket system, the desktop versions of Barcelona will only have 1 memory controller and 12.8GB/s of memory bandwidth.

Most other heavy multi-threaded scenarios have the QX6700 beating the Quad FX just as easily as it does in single-threaded scenarios.

A C2D can execute 1 128-bit multiply, 1 128-bit add plus a load, store and jump in the same cycle.

I agree, only K10 has dual memory controllers in the specs, according to data in the past threads about K10. Current bandwidth on AM2 is 20GB/s; it will be nearly 3x that, bandwidth-wise. Double that on memory bandwidth. According to DailyTech, was it...

http://www.channelinsider.com/print_...ls/191008.aspx

Quad-core parts and other Revision H parts are rumored to have two independent 64-bit memory controllers, each with its own physical address space, thus giving an opportunity to better utilize the available bandwidth in the case of random memory accesses occurring in a heavily multi-threaded environment. This approach is in contrast to the previous "interleaved" design, where the two 64-bit data channels are bound to a single common address space. It will be the first single-chip implementation of the non-uniform memory access architecture.

http://www.realworldtech.com/page.cf...0206035626&p=1

http://www.realworldtech.com/page.cf...0206035626&p=2

http://www.google.com/search?hl=en&q...rs&btnG=Search

Just some more info on K10 from previous threads. You all should really read that thread to get the lowdown on K10. I should post a link on the front page as a continuation, instead of constantly re-hunting for data and having ppl act like it never existed. Sometimes it's silly for somebody to have to repeat themselves. XD

http://www.xtremesystems.org/forums/...d.php?t=117702
02-13-2007, 02:05 AM
Shintai

Quote:

Originally Posted by LOE

Are you sure? :nono: C2D can process one 128-bit SSE instruction per cycle. Do you mean Pentium has a 21.33-bit (128/6) SSE engine? :rofl:

Pentium D needs 2 cycles to execute one 128-bit SSE instruction; it has a 64-bit engine.

Core 2 Duo has 2x the SSE throughput of Pentium.

Yes, it is missing in Minesweeper, but every serious app supports SSE; there are even apps that DO NOT RUN if they don't detect at least SSE.

One of the reasons C2D is 50-100% faster than Pentium clock for clock is its doubled SSE throughput. Check apps that render 3D or encode video.

Don't try to mix numbers in your favour. Core has 1 SSE port that's 64-bit. Core 2 has 3 SSE ports that are 128-bit. Yet Core at the same FSB/clock isn't much slower than Core 2. And that's with all the rest of the improvements too.
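
For reference, the width arithmetic behind the quoted "64-bit engine needs 2 cycles" claim can be sketched in a few lines of Python. This is only an issue-rate toy model: the function names are made up for illustration, and it ignores latency, port count, and everything else that decides real performance, which is exactly the point being argued above.

```python
import math

def cycles_per_instruction(operand_bits: int, datapath_bits: int) -> int:
    """Cycles needed to push one operand through a datapath of the given width.

    Toy model: an operand wider than the datapath is split into chunks,
    one chunk per cycle. Latency and pipelining are ignored.
    """
    return math.ceil(operand_bits / datapath_bits)

def relative_throughput(narrow_bits: int, wide_bits: int, operand_bits: int = 128) -> float:
    """How many times faster the wide datapath issues operands than the narrow one."""
    narrow = cycles_per_instruction(operand_bits, narrow_bits)
    wide = cycles_per_instruction(operand_bits, wide_bits)
    return narrow / wide

# 128-bit SSE operand on a 64-bit engine versus a 128-bit engine.
print(cycles_per_instruction(128, 64))
print(cycles_per_instruction(128, 128))
print(relative_throughput(64, 128))
```

The model only says the wider unit can issue packed operations twice as often per port; whether that shows up in benchmarks depends on the rest of the core.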
02-13-2007, 02:13 AM
Grayfox84

http://techreport.com/reviews/2006q3/core2/index.x?pg=2
02-13-2007, 02:24 AM
accord99

Quote:

Originally Posted by Grayfox84

I agree, only K10 has dual memory controllers in the specs, according to data in the past threads about K10. Current bandwidth on AM2 is 20GB/s; it will be nearly 3x that, bandwidth-wise. Double that on memory bandwidth. According to DailyTech, was it...

AM2's memory bandwidth is 12.8GB/s. If you plug a Barcelona core into an AM2 socket, that's all you get, since there are physically only two channels connecting the memory to the socket.

Quote:

http://www.channelinsider.com/print_...ls/191008.aspx

Quad-core parts and other Revision H parts are rumored to have two independent 64-bit memory controllers, each with its own physical address space, thus giving an opportunity to better utilize the available bandwidth in the case of random memory accesses occurring in a heavily multi-threaded environment. This approach is in contrast to the previous "interleaved" design, where the two 64-bit data channels are bound to a single common address space. It will be the first single-chip implementation of the non-uniform memory access architecture.

Intel's current DDR2 memory controllers are already this way.
02-13-2007, 03:14 AM
Lightman

Quote:

Originally Posted by accord99

AM2's memory bandwidth is 12.8GB/s. If you plug a Barcelona core into an AM2 socket, that's all you get, since there are physically only two channels connecting the memory to the socket.

Intel's current DDR2 memory controllers are already this way.

Yeap! At the rated PC6400 speed, of course. At the moment AMD and a few other companies are trying to push a higher memory specification through JEDEC. I heard PC8500 is the target for them.
This would allow the DESKTOP version of the AMD quad to get 17GB/s of memory bandwidth...

Of course servers are different animals, and I think the maximum we will see would be PC6400 registered DIMMs (DDR-II 800MHz :) )
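
The 12.8GB/s and 17GB/s figures fall out of simple arithmetic: channels x bytes per transfer x transfers per second. A minimal Python sketch, assuming two 64-bit channels as on AM2 and taking PC8500 to mean roughly 1066 MT/s (the function name is just for illustration):

```python
def peak_bandwidth_gbs(channels: int, bus_width_bits: int, mega_transfers_per_s: float) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    # channels * bytes * MT/s gives MB/s; divide by 1000 for GB/s.
    return channels * bytes_per_transfer * mega_transfers_per_s / 1000

# Dual-channel, 64 bits per channel, as on AM2.
ddr2_800 = peak_bandwidth_gbs(2, 64, 800)    # PC6400
ddr2_1066 = peak_bandwidth_gbs(2, 64, 1066)  # PC8500, assuming 1066 MT/s

print(f"DDR2-800:  {ddr2_800:.1f} GB/s")
print(f"DDR2-1066: {ddr2_1066:.1f} GB/s")
```

These are theoretical peaks only; sustained bandwidth in real applications will be lower.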
02-13-2007, 04:49 AM
Carfax

Regarding SSE2, I still expect Core 2 to have an edge over the K10, as it has a higher peak theoretical throughput.

Core 2 can issue a max of 6 SSE instructions per cycle, while the K10 can do 3.

Of course, there are other factors involved besides peak SIMD throughput, like latency and memory bandwidth, and the K10 will have the edge there.

But not enough to trounce C2D IMO.

As for INT, C2D should still maintain a healthy lead as the C2D is a beast in INT. It will be interesting to see which processor holds the performance crown for gaming, as games tend to be far more INT based than FP.
02-13-2007, 04:53 AM
Carfax

Quote:

Originally Posted by Shintai

Don't try to mix numbers in your favour. Core has 1 SSE port that's 64-bit. Core 2 has 3 SSE ports that are 128-bit. Yet Core at the same FSB/clock isn't much slower than Core 2. And that's with all the rest of the improvements too.

Core Duo has 2 64-bit SSE2 ports, not 1.

As for the closer than expected performance delta between C2D and CD, I put it down to two things:

1) Merom is FSB limited at 667, far moreso than Yonah.

2) Yonah was already a very efficient high IPC processor. Actually, it was even faster than the K8 clock for clock in everything but FP intensive apps.
02-13-2007, 05:11 AM
Carfax

Quote:

Originally Posted by doompc

All CPUs speed up in 64-bit mode due to the larger number of registers and the standard SSE2 instructions.
But Core2 does not speed up as much as K8 since MacroFusion doesn't work in long mode.

I'm willing to bet this will be addressed in Penryn.

Quote:

On SSE execution K10 has little advantage.
Core2 has 3 SSE units plus one load and one store unit.
K8 has 3 FPUs (that do SSE) plus the load/store unit, which does two loads/stores per cycle; on K10 the FPUs are widened to 128 bits, so it can do 3 128-bit SSE operations per cycle plus 2 loads/stores.
So Core2 does 3 SSE, 1 load and 1 store. K10 does 3 SSE, 1 load and 1 store or 2 loads or 2 stores.
http://www.xbitlabs.com/articles/cpu...amd-k8l_5.html

I don't know how accurate this information is. As far as I know, the K10 can issue 2 SSE operations, and one SSE MOV per cycle in the floating point store pipe.

So that's three instructions peak. Core 2, on the other hand, can potentially do double the K10's SSE issue rate.
02-13-2007, 05:32 AM
doompc

Carfax, it's not clear if the FMISC unit (which does FLOAD in K8) will be widened to 128 bits. If not, it could not do one 128-bit SSE load per cycle.

Core2 can theoretically issue 6 micro-ops per cycle, and it decodes a maximum of 2+3 instructions, but it fetches 16 bytes, which is only 128 bits. With the data in the buffer waiting to be decoded, it may decode an average of 3 instructions per cycle.
I bet the 32-byte instruction fetch will keep K10's FPUs much busier than Conroe's.
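
To put rough numbers on the fetch-width argument, here is a small Python sketch. It bounds the decode rate by the smaller of what the fetch window can supply and what the decoders can accept. The average instruction lengths and the decode width of 4 are illustrative assumptions, not measured figures for either chip, and the function name is hypothetical.

```python
def fetch_limited_rate(fetch_bytes: int, avg_instr_bytes: float, decode_width: int) -> float:
    """Upper bound on instructions decoded per cycle.

    The front end cannot decode more instructions than the fetch window
    holds on average, nor more than the decoders accept per cycle.
    """
    return min(fetch_bytes / avg_instr_bytes, decode_width)

# ASSUMPTION: average x86 instruction lengths in bytes, chosen only for illustration.
# ASSUMPTION: the same decode width of 4 for both cases, to isolate the fetch effect.
for avg_len in (3.0, 5.0, 7.0):
    narrow = fetch_limited_rate(16, avg_len, decode_width=4)
    wide = fetch_limited_rate(32, avg_len, decode_width=4)
    print(f"avg {avg_len} bytes: 16B fetch -> {narrow:.2f}/cycle, 32B fetch -> {wide:.2f}/cycle")
```

Under these assumptions the 16-byte window only becomes the bottleneck once instructions get long, which is typical of SSE code with prefixes and 64-bit addressing.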
02-13-2007, 06:01 AM
Carfax

Doompc, do you think it's possible that Intel could implement a 32 byte instruction fetch in Penryn with the extra transistors?

How radical a change would it be to do something like that?
02-13-2007, 07:39 AM
doompc

Carfax, I think so, but Intel has not commented on this subject.

LOE, we don't have info on that. But even if FMISC is still 64 bit the decoder may simply route the SSE Loads to the Load/Store unit.
02-13-2007, 07:52 AM
Donnie27

Quote:

Originally Posted by MAS

We've already had confirmation that 40% overall is simply not happening
Randy Allen and Patrik Patla (AMD directors) told us about 40 percent, and suddenly brentpresley appears and tells us

look more closely at the picture

40% advantage is for a rough multitasking environment
10% is for single-threaded applications

Weren't those the same folks who said the Conroe Tests were Bogus and rigged? So why should we believe them?
02-13-2007, 07:57 AM
SEA

Quote:

Originally Posted by Carfax

Regarding SSE2, I still expect Core 2 to have an edge over the K10, as it has a higher peak theoretical throughput.

Core 2 can issue a max of 6 SSE instructions per cycle, while the K10 can do 3.

I'm sorry to say that you've got it mixed up.
1) C2D has 6 uOps per cycle. K10 has 3 SSE per cycle. See the difference?

Only scalar SSE operations are single uOps, while the vector operations are typically 2-4 uOps.

Don't believe it? Read here...

Just in case you did not know: mOps fusion, the old K8 has been using it for several years already.
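
The uOp argument above is just a division, sketched here in Python. It takes the post's figures as given (6 uOps per cycle, vector SSE operations costing 2-4 uOps each); those figures are the claim under dispute in this thread, not something the sketch verifies, and the function name is only for illustration.

```python
def sse_instructions_per_cycle(uops_per_cycle: int, uops_per_instruction: int) -> float:
    """Peak SSE instructions per cycle when each instruction costs a fixed number of uOps."""
    return uops_per_cycle / uops_per_instruction

# 6 uOps/cycle, with 1 uOp (scalar) or 2-4 uOps (vector) per instruction,
# as claimed in the post.
for uops in (1, 2, 4):
    rate = sse_instructions_per_cycle(6, uops)
    print(f"{uops} uOp(s) per instruction -> {rate} SSE instructions per cycle")
```

If the 2-4 uOps figure holds, 6 uOps per cycle works out to between 1.5 and 3 vector SSE instructions per cycle, not 6.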
02-13-2007, 08:18 AM
Donnie27

Quote:

Originally Posted by brentpresley

:stick:

That is a link to Core Duo not Core 2 Duo

Hehehe!
02-13-2007, 08:39 AM
SEA

Quote:

Originally Posted by brentpresley

:stick:

That is a link to Core Duo not Core 2 Duo

Thank you! :D

Thank you for not disappointing my expectations! I was sure that all the fanboys would immediately stop thinking right after they read CORE DUO and not C2D... :clap:

OK, enough laughing.
I'll repeat that link this way:
http://www.intel.com/technology/itj/...oved_cores.htm

Quote:

only scalar SSE operations are single uOps while the vector operations are typically 2-4 uOps.

See? That's the statement I linked it for, and it is the only purpose of the link.
02-13-2007, 09:06 AM
SEA

Quote:

Originally Posted by brentpresley

The SSE units were REWORKED in Core 2 vs. Core Duo.

That document specifically describes ONLY the enhancements in Yonah and doesn't cover Core 2 AT ALL (publication date is 3 MONTHS before C2D was released).

Your analysis is invalid. :D :stick:

Let me inform you: that's why fanaticism is such a bad thing.

I see I should make my point explicitly:

C2D has up to 6 uOps per cycle.
Not SSE instructions per cycle!
A uOp is NOT an SSE instruction!

and thus the assertion

Quote:

Core 2 can issue a max of 6 SSE instructions per cycle, while the K10 can do 3.

is wrong.
02-13-2007, 09:16 AM
SEA

Additionally:

Quote:

If you just compare floating-point addition and multiplication, both Core and K8L can do four (packed) double-precision operations per cycle (2 x fadd + 2 x fmul) or eight (packed) single-precision operations per cycle (4 x fadd + 4 x fmul). The fact that there's FP/SSE MOV hardware on each of Core's three main issue ports will give the Intel part an edge in handling memory traffic, though.

http://arstechnica.com/news.ars/post/20061011-7961.html

And as for "in handling memory traffic": AMD will balance it with other techniques. So only time will tell if it is 10% or whatever else...
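
The per-cycle numbers in the Ars quote follow from lane counts. A minimal Python sketch, assuming one 128-bit FP add unit and one 128-bit FP multiply unit per core as the quote implies (the function and parameter names are made up for illustration):

```python
def flops_per_cycle(unit_width_bits: int, element_bits: int,
                    add_units: int = 1, mul_units: int = 1) -> int:
    """Peak packed floating-point operations per cycle.

    Each unit processes unit_width_bits / element_bits lanes per cycle;
    adds and multiplies are assumed to issue in the same cycle.
    """
    lanes = unit_width_bits // element_bits
    return lanes * (add_units + mul_units)

print("double precision:", flops_per_cycle(128, 64))  # 2 x fadd + 2 x fmul
print("single precision:", flops_per_cycle(128, 32))  # 4 x fadd + 4 x fmul
```

Halving the unit width to 64 bits halves both numbers, which is the K8 versus K10 difference being discussed.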
02-13-2007, 10:00 AM
SEA

ok, first, you admit that

Quote:

C2D has 3 128-bit SSE units

that is it.
02-13-2007, 10:23 AM
SEA

next,

Quote:

That STILL only gives a MAXIMUM throughput of 4 64-bit SSE instructions per cycle.

wrong again...
you lost FSTORE
