!!!The Ultimate K8L Thread 2007 & Beyond!!!


02-13-2007, 10:30 AM
SEA

I did not get why you mentioned this: "THIS HAS NOTHING TO DO WITH SSE INSTRUCTIONS ISSUED PER CYCLE".
Just in case - we never talked about it.

Back to "4 64-bit SSE instructions per cycle":
2xFMUL + 2xFADD + FSTORE (plus 2x 128-bit SSE loads)
02-13-2007, 10:36 AM
SEA

Quote:

Originally Posted by brentpresley

You DO REALIZE you quoted the part where I was talking about K8L, right? So you are saying AMD is WRONG and they can't do 4 SSE 64-bit instructions per cycle?

It's really hard to talk out of both sides of your mouth, isn't it?

Can you translate that into English, please?
Meanwhile, I'll repeat more clearly:

You are wrong in this sentence:

Quote:

Originally Posted by brentpresley

RUMORS on Rev H (K8L/K10):
Double SSE throughput on the 2 SSE units (i.e. expanded to 128-bits each, and each able to work on 2 64-bit SSE instructions per cycle). That STILL only gives a MAXIMUM throughput of 4 64-bit SSE instructions per cycle.

So you can see: YOU are saying that, not me. You are wrong.
Wanna laugh more?
02-13-2007, 10:37 AM
nn_step

Quote:

Originally Posted by SEA

Can you translate that into English, please?
Meanwhile, I'll repeat more clearly:

You are wrong in this sentence:

So you can see: YOU are saying that, not me. You are wrong.
Wanna laugh more?

umm K8L has 3 SSE/FP units
just like K7 and K8, the big difference however is that all 3 are 128bit
02-13-2007, 10:39 AM
SEA

Quote:

Originally Posted by brentpresley

Because the integer and SSE units are COMPLETELY different parts of silicon that have NOTHING to do with each other.

I'm glad you admit it.
I just wondered why you mentioned it at all? (you can leave it w/out answer btw, i see your point anyway)
02-13-2007, 10:42 AM
SEA

Quote:

Originally Posted by nn_step

umm K8L has 3 SSE/FP units
just like K7 and K8, the big difference however is that all 3 are 128bit

Did i say 4 or 2?
02-13-2007, 10:46 AM
nn_step

http://www.chip-architect.com/news/O...t_Core_Ill.jpg
though you are correct that it is just 2 SSE units, since SSE runs on FPMUL and FPADD
02-13-2007, 11:02 AM
SEA

ah, you mean this:

Quote:

Originally Posted by SEA

2xFMUL + 2xFADD + FSTORE (plus 2x 128-bit SSE loads)

Right, and I hope everyone can now refresh their memory by looking at the picture :)
02-13-2007, 11:03 AM
Grayfox84

Quote:

Originally Posted by accord99

AM2's memory bandwidth is 12.8GB/s. If you plug in a Barcelona core into an AM2, that's all you get since there is physically only two channels connecting the memory to the socket.

Intel's current DDR2 memory controllers are already this way.

And OCing raises those numbers through the roof. Nobody said there was a bandwidth wall. You're only talking about stock speeds. People broke that wall long ago. Btw, there is a difference between system bandwidth and memory bandwidth. And we all know everybody here OCs to the extreme.
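
For reference, the 12.8GB/s stock figure quoted above can be reproduced with simple arithmetic. This is only a rough sketch: it assumes dual-channel DDR2-800 with 64-bit (8-byte) channels, which the thread does not spell out, and the DDR2-1000 line is a hypothetical overclock rather than a measured result. The function name is made up for illustration.

```python
# Peak theoretical memory bandwidth = channels * bytes per transfer * transfers per second.
# Assumption: 64-bit (8-byte) channels and DDR2-800 (800 million transfers/s) at stock.

def peak_bandwidth_gb_s(channels, bytes_per_transfer, mega_transfers_per_s):
    """Return peak theoretical bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return channels * bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9

# Stock dual-channel DDR2-800 on AM2 (assumed configuration).
stock = peak_bandwidth_gb_s(channels=2, bytes_per_transfer=8, mega_transfers_per_s=800)

# Hypothetical overclock to DDR2-1000; illustrative only.
overclocked = peak_bandwidth_gb_s(channels=2, bytes_per_transfer=8, mega_transfers_per_s=1000)

print(stock)        # 12.8
print(overclocked)  # 16.0
```

Both numbers are theoretical peaks; the channel count stays at two either way, so only the transfer rate moves the ceiling.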

http://www.amd.com/us-en/Processors/...E13042,00.html

Quote:

Originally Posted by brentpresley

I have been told benches will be forthcoming VERY soon (just what I have been told, remember, I don't have ES in hand, just contacts with them).

What's wrong Serge? PISSED I've got more info on Rev. H than you? Scooped your precious little thread here? :stick:

We all know this is your SECOND userid. Keep flamebaiting me, and we'll get that one banned as well. :fact:

Remember the LAST time you flamebaited me claiming that 64-bit K8 performance at 2.8GHz was as FAST as C2D at 2.8GHz? :slap:

I made you look REALLY stupid with MULTIPLE benchmarks on that one. :fact: :slapass:

The only one who's flaming is you. And frankly, I didn't get banned. You'll see on the 15th. Why would I be mad? You're being silly, and you show that you take this thread way too seriously, like the kid you are. I don't have to respond to your BS any more. I don't take things personally over the net like you apparently do. :rolleyes: This is my friend's account, btw.

I'm surprised they haven't done anything about you and your flaming yet. O_o

Oh you mean this?

http://s38.photobucket.com/albums/e1...urrent=ZX3.jpg
http://s38.photobucket.com/albums/e1...urrent=ZV4.jpg

http://s38.photobucket.com/albums/e1...&current=2.jpg

As far as I can tell, all I showed was that K8 in 64-bit mode can beat Conroe in math-related calculations. That is ALU work, so K8 has an advantage in loading data and in ALU-bound tasks in programs that depend on them, such as games. Conroe, however, has the advantage over K8 in 32-bit mode. Conroe's FPU/SSE has the advantage in multimedia and encoding, and AMD has the bandwidth advantage.

Conclusion Conroe is the best 32-bit chip ever made.
K8 is the best 64-bit chip ever made.
02-13-2007, 11:26 AM
SEA

We have been talking about just SSE operations the whole time.
An SSE LOAD is one of them.

And one more time all together:
K10: FMUL + FADD + FSTORE (+ 2x SSE loads)
C2D: FADD + FMUL + FLOAD + FSTORE

The last one is from the link you provided - http://www.xbitlabs.com/articles/cpu...preview_9.html
02-13-2007, 12:07 PM
SEA

You sound pretty sure. ;)

Which of these are the x87 operations you mentioned?
FMUL?
FADD?
FSTORE?
2x SSE loads?

Or... I'm afraid to ask... Have you gotten the optimization manual for Barcelona?!!! :hm:

Ah, i see...
02-13-2007, 12:13 PM
SEA

You mean they call them "FP execution units",
the ones that execute x87 / SSE (both use the same part of the ALU - sorry, FPU).

Also, there is a perfect diagram a few posts above.
02-13-2007, 12:25 PM
nn_step

Technically, SSE was created for SIMD calculations, but since most of the logic is the same, they combine both SIMD and floating-point math. They could be separated, but Intel and AMD see that as wasted transistors.
02-13-2007, 12:39 PM
SEA

brentpresley
I have read :D

You've got a funny style: when you have nothing to say, you state commonly known things as if nobody but you knew them :)
Also, could you please be more specific about what you corrected with your post? That FSTORE cannot process SSE stores?
Any link? Or is your historical word good enough? :)

Come on, admit just once that you were wrong! Find the time... :lol:

You can even find an SSE FSTORE (=FMISC) in K8.
I CAN link http://www.xbitlabs.com/articles/cpu...amd-k8l_5.html:

Quote:

This is illogical since writes from the 128-bit SSE registers into memory are executed on the FMISC (FSTORE) unit

oops... corrected link
02-13-2007, 12:40 PM
Grayfox84

Quote:

Originally Posted by brentpresley

NOT all FPUs are SSE enabled.

K8/K10 BOTH have 3 FPU units as pictured in NN_step's diagram. But only TWO of those are SSE capable.

x87 = FPU (floating point) calculations, but not all FPU operations are SSE operations.

HISTORY LESSON:
Intel invented SSE (and subsequent variants) as a more efficient mechanism to do floating point calculations than the old x87 instructions were capable of. Over time these have been extended to aid repetitive calculations (multimedia comes to mind). However, for BACKWARDS COMPATIBILITY, x87 was always maintained in the CPUs.

AMD historically has made VERY powerful x87 FPUs, but not utilized SSE very much. Intel has approached things from the opposite perspective: they have made very good SSE units, and only maintained BASIC x87 compatibility in their FPUs. With more and more code being ported to run SSE, AMD is moving more to the Intel-approach to FPUs (i.e. better SSE performance).

PLEASE stop posting when you don't know what you are talking about. You really are showing your ignorance on the subject.

I don't have time to correct you on EVERY single point anymore.

Someone else step up to the plate and link this guy a basic description of x86 superscalar processors please.

You're flaming. Please stop right now, you take things far too seriously. :rolleyes:
02-13-2007, 01:10 PM
Carfax

Quote:

Originally Posted by brentpresley

Someone else step up to the plate and link this guy a basic description of x86 superscalar processors please.

But you're doing such a good job ;)

If the instruction fetch is a weakness in Conroe right now, it is likely that Intel will increase it to 32 bytes, like the K10.

Not sure how great a change this is, but it doesn't seem too radical.

**Edit** Just got this from the Xbitlabs article:

Quote:

By the way, Conroe processors fetch instructions in 16-byte blocks, just like K8 processors do, so they can decode the instruction stream at a rate of 4 instructions per clock only when the average instruction length is no longer than 4 bytes. Otherwise the decoder cannot process not only 4 but even 3 instructions per clock. To fight this in short loops, the Conroe has a special 64-byte internal buffer that caches loops up to 64 bytes long (four 16-byte blocks) and allows fetching data in such loops at a rate of 32 bytes per cycle. If a loop is longer than 4 blocks, it cannot be cached in this buffer.

Apparently, the Intel engineers have found a clever workaround for the instruction fetch limitation.
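
The fetch limit in the Xbitlabs quote can be put into numbers with a very rough model: the decoder cannot consume more instructions per clock than the fetch block delivers. The sketch below assumes only the figures from the quote (16-byte fetch, 4-wide decode, 32 bytes per cycle from the loop buffer); the function name is hypothetical, and real hardware has alignment and pre-decode effects that this ignores.

```python
def max_decode_rate(fetch_bytes_per_cycle, avg_instr_len_bytes, decode_width):
    """Upper bound on instructions decoded per clock.

    Limited either by the decoder width or by how many average-length
    instructions fit into the bytes fetched each cycle.
    """
    return min(decode_width, fetch_bytes_per_cycle / avg_instr_len_bytes)

# 16-byte fetch, 4-wide decode: the full rate holds only up to 4-byte instructions.
for avg_len in (3, 4, 5, 6):
    print(avg_len, max_decode_rate(16, avg_len, 4))

# Short loop served from the 64-byte buffer at 32 bytes per cycle.
print(max_decode_rate(32, 6, 4))  # 4
```

In this model a 6-byte average already drops the 16-byte case below 3 instructions per clock, which matches the quote, while the 32-byte loop buffer restores the full 4.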
02-13-2007, 01:18 PM
SEA

Quote:

Originally Posted by brentpresley

:ROTF:

The link you provided has this diagram:
CLEARLY showing 2 SSE units and FPMISC (non-SSE enabled FPU).

The link I provided clearly states in text that FPMISC (=FSTORE) executes SSE STORES.
02-13-2007, 01:41 PM
SEA

Quote:

Originally Posted by brentpresley

:stick: :stick: :stick:

STORE. It just STORES it for FUTURE use. No OPERATION is performed on it.

The instruction is not ISSUED (i.e. COMPLETED).

:stick: :stick: :stick:

OK, I see your sudden turn ;)

Now we don't count any SSE load/store/move operations, since it turns out that they are not operations (aka "No OPERATION is performed on it." :D )

Good.
Let's count the so-called "OPERATIONS" :lol:
That is 4 SSE FP in C2D
That is 4 SSE FP in K10.

Done. So what was your point?

Mine initially was to correct this wrong sentence: "Core 2 can issue a max of 6 SSE instructions per cycle, while the K10 can do 3"
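
The "4 SSE FP" count can be sketched as follows, assuming one FADD and one FMUL pipe per core, each 128 bits wide, and counting 64-bit (double-precision) elements, as the posts above do. The helper name is made up for illustration, and the 64-bit-wide line stands for the pre-doubling K8 units mentioned earlier in the thread.

```python
def fp_ops_per_cycle(units, unit_width_bits, element_bits):
    """Peak arithmetic FP element operations per clock (loads/stores not counted)."""
    return units * (unit_width_bits // element_bits)

# FADD + FMUL, each 128 bits wide, 64-bit elements.
print(fp_ops_per_cycle(units=2, unit_width_bits=128, element_bits=64))  # 4

# The same two pipes at 64 bits wide (before the SSE units were doubled).
print(fp_ops_per_cycle(units=2, unit_width_bits=64, element_bits=64))   # 2
```

This counts element operations, not instructions: a single 128-bit packed instruction accounts for two of the four 64-bit operations.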
02-13-2007, 01:43 PM
SEA

Quote:

Correct me if I am wrong, but that is how I read that.

You are right this time.
02-13-2007, 04:52 PM
Vapor

SEA and Brent, cut out the flaming.
02-13-2007, 05:30 PM
Scimitar

While not an official benchmark or anything, I found the comments of Dr. Vijay Pande from Stanford University to be interesting. He stated that the SSE128 units in the K10 are on par with Conroe's SSE performance. Perhaps they have tested an ES for folding. Combine that with the K10's projected vastly superior floating point performance, and the Barcelona should be a great chip.

I still haven't seen any projected numbers for integer performance though.
02-13-2007, 05:35 PM
accord99

Quote:

Originally Posted by Scimitar

While not an official benchmark or anything, I found the comments of Dr. Vijay Pande from Stanford University to be interesting. He stated that the SSE128 units in the K10 are on par with Conroe's SSE performance. Perhaps they have tested an ES for folding. Combine that with the K10's projected vastly superior floating point performance, and the Barcelona should be a great chip.

SSE is floating point and is necessary to extract maximum performance.
02-13-2007, 07:06 PM
nn_step

Quote:

Originally Posted by accord99

SSE is floating point and is necessary to extract maximum performance.

ummm :banana::banana::banana::banana: NO..
Floating point means working with numbers that are arranged exactly like this
http://upload.wikimedia.org/wikipedi..._point.svg.png
and is properly classified as SISD, or Single Instruction, Single Data, math.

SSE is SIMD or Single Instruction, Multiple Data, which uses a single instruction to perform the same work on many different Data fields at once.

This is the difference between SISD and SIMD
http://arstechnica.com/cpu/1q00/simd/figure6.gif

Floating point is used for scientific calculations when fixed-point math isn't acceptable (though these days the improvements in floating point have reduced that requirement). SSE is for vector math and similar parallel instructions. Read about AltiVec if you want the most logical version of what exactly SIMD is.
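
The SISD/SIMD difference can be illustrated with a toy model that simply counts "instructions". This is plain Python, not real SSE: the 4-lane width assumes single-precision floats in a 128-bit register, and both function names are hypothetical.

```python
def scalar_add(a, b):
    """SISD style: one add instruction per pair of elements."""
    out, instructions = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        instructions += 1
    return out, instructions

def packed_add(a, b, lanes=4):
    """SIMD style: one add instruction handles `lanes` element pairs at once."""
    out, instructions = [], 0
    for i in range(0, len(a), lanes):
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        instructions += 1
    return out, instructions

a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [0.5] * 8

scalar_result, scalar_count = scalar_add(a, b)
packed_result, packed_count = packed_add(a, b)

print(scalar_count)                    # 8
print(packed_count)                    # 2
print(scalar_result == packed_result)  # True
```

Both paths produce the same floating-point results; only the number of instructions issued differs, which is why the two terms describe different things and are not mutually exclusive.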
02-13-2007, 07:18 PM
accord99

SSE1 is primarily about single-precision floating point, SSE2 is primarily about double-precision floating point. The way to extract maximum floating point capabilities from a modern x86 processor is to use SSE operations, plus x87 is deprecated in 64-bit Windows.
02-13-2007, 07:29 PM
nn_step

Quote:

Originally Posted by accord99

SSE1 is primarily about single-precision floating point, SSE2 is primarily about double-precision floating point. The way to extract maximum floating point capabilities from a modern x86 processor is to use SSE operations, plus x87 is deprecated in 64-bit Windows.

SSE1 was merely the addition of the required logic to do the most basic vector math. SSE2, however, enables the programmer to perform SIMD math of virtually any type (from 8-bit integer to 64-bit float) entirely with the XMM vector-register file, without the need to touch the (legacy) MMX/FPU registers. SSE2 IS everything SSE should have been. And if you actually programmed something for once in your life that specifically makes use of these pieces of the CPU, you would know they are VERY VERY different animals, and your repeatedly speaking of them as the same is not only inaccurate but annoying.
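
To make the "8-bit integer to 64-bit float" range concrete, here is a small sketch of how many elements of each type fit into one 128-bit XMM register. The table and helper are illustrative names only, not part of any real API.

```python
XMM_BITS = 128  # width of one XMM register

# Element types SSE2 can pack into an XMM register, with their widths in bits.
ELEMENT_BITS = {
    "8-bit integer": 8,
    "16-bit integer": 16,
    "32-bit integer": 32,
    "64-bit integer": 64,
    "32-bit float": 32,
    "64-bit float": 64,
}

def lanes(element_bits, register_bits=XMM_BITS):
    """Number of elements of the given width that fit into one register."""
    return register_bits // element_bits

for name, bits in ELEMENT_BITS.items():
    print(f"{name}: {lanes(bits)} per register")
```

So a single packed instruction covers anywhere from 2 elements (64-bit) to 16 elements (8-bit).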
02-13-2007, 07:37 PM
accord99

That's nice, but irrelevant to the discussion at hand. And doesn't change the fact that maximizing floating point performance comes from using SSE.
