I don't get why you wrote "THIS HAS NOTHING TO DO WITH SSE INSTRUCTIONS ISSUED PER CYCLE".
Just in case: we never talked about that.
Back to "4 64-bit SSE instructions per cycle":
2xFMUL + 2xFADD + FSTORE (and + 2x 128-bit SSE loads)
Quote:
Originally Posted by brentpresley
Can you translate that into English, please?
In the meantime, I'll repeat it more clearly:
You are wrong with this sentence of yours:
So you can see, YOU are saying that, not me. You are wrong.
Quote:
Originally Posted by brentpresley
Wanna laugh more?
Umm, K8L has 3 SSE/FP units
Quote:
Originally Posted by SEA
just like K7 and K8; the big difference, however, is that all 3 are 128-bit.
I'm glad you admit it.
Quote:
Originally Posted by brentpresley
I just wondered why you mentioned it at all. (You can leave that unanswered, by the way; I see your point anyway.)
Did I say 4 or 2?
Quote:
Originally Posted by nn_step
http://www.chip-architect.com/news/O...t_Core_Ill.jpg
Though you are correct that it is just 2 SSE, since it runs on FPMUL and FPADD.
Ah, you mean this:
Right, and I hope everyone can now refresh their memory by looking at the picture :)
Quote:
Originally Posted by SEA
And OCing raises those numbers through the roof. Nobody said there was a bandwidth wall. You're only talking about stock speeds. People broke that wall long ago. Btw, there is a difference between system bandwidth and memory bandwidth. And we all know everybody here who's extreme OCs.
Quote:
Originally Posted by accord99
http://www.amd.com/us-en/Processors/...E13042,00.html
The only one who's flaming is you. And frankly, I didn't get banned. You'll see on the 15th. Why would I be mad? You're silly, and you show that you take this thread way too seriously, like the kid you are. I don't have to respond to your BS any more. I don't take things personally over the net like you apparently do. :rolleyes: This is my friend's account, btw.
Quote:
Originally Posted by brentpresley
I'm surprised they haven't done anything about you and your flaming yet. O_o
Oh you mean this?
http://s38.photobucket.com/albums/e1...urrent=ZX3.jpg
http://s38.photobucket.com/albums/e1...urrent=ZV4.jpg
http://s38.photobucket.com/albums/e1...&current=2.jpg
As far as I can tell, all I showed was that K8 in 64-bit mode can beat Conroe in math-related calculations. That is the ALU, so K8 has an advantage in loading things faster and in doing operations and ALU-related tasks faster for programs that need them, such as games. Conroe, however, has the advantage over K8 in 32-bit mode. Conroe's FPU/SSE has the advantage in multimedia and encoding here, and AMD has the bandwidth advantage.
Conclusion: Conroe is the best 32-bit chip ever made.
K8 is the best 64-bit chip ever made.
We have been talking only about SSE operations the whole time.
An SSE load is one of them.
And one more time, all together:
K10: FMUL + FADD + FSTORE (+ 2x SSE loads)
C2D: FADD + FMUL + FLOAD + FSTORE
The last one is from the link you provided: http://www.xbitlabs.com/articles/cpu...preview_9.html
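For anyone who wants to check the counting, here is a rough Python sketch that tallies the units exactly as listed above. It is only a toy: the split into "arith" and "other" is my own labelling, and it assumes every unit is 128 bits wide and handles two 64-bit lanes per cycle, which is one way to read the "4 64-bit SSE" figure.

```python
# Toy tally of the per-cycle SSE units listed in the post above.
# Assumptions (not from any manual): every unit is 128 bits wide and
# processes a full register of 64-bit lanes each cycle.
UNIT_WIDTH_BITS = 128
LANE_BITS = 64  # one double-precision element

# Unit lists copied from the post; the arith/other split is hypothetical labelling.
UNITS = {
    "K10": {"arith": ["FMUL", "FADD"], "other": ["FSTORE", "LOAD", "LOAD"]},
    "C2D": {"arith": ["FADD", "FMUL"], "other": ["FLOAD", "FSTORE"]},
}


def lanes_per_unit():
    """Number of 64-bit lanes that fit in one 128-bit unit."""
    return UNIT_WIDTH_BITS // LANE_BITS


def fp_lane_ops_per_cycle(cpu):
    """64-bit FP operations per cycle, counting only FADD/FMUL units."""
    return len(UNITS[cpu]["arith"]) * lanes_per_unit()


def listed_sse_ops_per_cycle(cpu):
    """Every listed unit counted once, loads and stores included."""
    return len(UNITS[cpu]["arith"]) + len(UNITS[cpu]["other"])


if __name__ == "__main__":
    for cpu in UNITS:
        print(cpu, fp_lane_ops_per_cycle(cpu), listed_sse_ops_per_cycle(cpu))
```

Under those assumptions both chips come out at the same number of 64-bit FP operations per cycle; they only differ once loads and stores are counted as well.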
You sound pretty sure. ;)
Which of these are the x87 operations you mentioned?
FMUL?
FADD?
FSTORE?
2x SSE loads?
Or... I'm afraid to ask... have you gotten hold of the optimisation manual for Barcelona?! :hm:
Ah, I see...
You mean what they call "FP execution units":
the ones that execute x87 / SSE (both use the same part of the ALU, sorry, FPU).
Also, there is a perfect diagram a few posts above.
Technically, SSE was created for SIMD calculations, but since most of the logic is the same, they combine SIMD and floating-point math in the same units. They could be separated, but Intel and AMD see that as wasted transistors.
brentpresley,
I have read it :D
You've got a funny style: when you have nothing to say, you state commonly known things as if nobody but you knew them :)
Also, could you please point out more specifically what you corrected with your post? That FSTORE cannot process SSE stores?
Any link? Or is your historic word good enough? :)
Come on, admit just once that you were wrong! Find the time... :lol:
You can even find the SSE FSTORE (= FMISC) in K8.
I CAN link http://www.xbitlabs.com/articles/cpu...amd-k8l_5.html:
Quote:
This is illogical since writes from the 128-bit SSE registers into memory are executed on the FMISC (FSTORE) unit
oops... corrected link
You're flaming. Please stop right now; you take things far too seriously. :rolleyes:
Quote:
Originally Posted by brentpresley
But you're doing such a good job ;)
Quote:
Originally Posted by brentpresley
If instruction fetch is a weakness in Conroe right now, Intel will likely increase it to 32 bytes, like the K10.
I'm not sure how big a change that is, but it doesn't seem too radical.
**Edit** Just got this from the Xbitlabs article:
Quote:
By the way, Conroe processors fetch instructions in 16-byte blocks, just like K8 processors do, so they can decode the instruction stream at a rate of 4 instructions per clock only when the average instruction length is no longer than 4 bytes. Otherwise the decoder cannot process not only 4 but even 3 instructions per clock. To fight this in short loops, the Conroe has a special 64-byte internal buffer that caches loops up to 64 bytes long (four 16-byte blocks) and allows fetching data in such loops at a rate of 32 bytes per cycle. If a loop is longer than 4 blocks, it cannot be cached in this buffer.
Apparently, the Intel engineers have found a clever workaround for the instruction fetch limitation.
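To put numbers on the fetch limit described in that quote, here is a small Python sketch. It is only a back-of-the-envelope model: the function names are made up, and it ignores alignment, branches and everything else a real front end has to deal with.

```python
def sustained_decode_rate(fetch_bytes_per_clock, avg_instr_bytes, decoder_width=4):
    """Instructions per clock when decode is capped by fetch bandwidth.

    Simplified model: rate = min(decoder width, fetched bytes / average length).
    """
    if avg_instr_bytes <= 0:
        raise ValueError("average instruction length must be positive")
    return min(float(decoder_width), fetch_bytes_per_clock / avg_instr_bytes)


def fits_loop_buffer(loop_bytes, buffer_bytes=64):
    """True if a loop is short enough for the 64-byte buffer in the quote."""
    return loop_bytes <= buffer_bytes


if __name__ == "__main__":
    # 16-byte fetch, 4-byte average instruction length
    print(sustained_decode_rate(16, 4))
    # 16-byte fetch, 6-byte average instruction length
    print(sustained_decode_rate(16, 6))
    # 32-byte fetch (the loop-buffer rate from the quote), 6-byte average
    print(sustained_decode_rate(32, 6))
    print(fits_loop_buffer(48), fits_loop_buffer(80))
```

With 16 bytes per clock the model only reaches 4 instructions per clock at an average length of 4 bytes or less, and drops below 3 once the average goes past about 5.3 bytes, which matches the wording of the article. At 32 bytes per clock the 4-wide decoder becomes the limit again.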
Quote:
Originally Posted by brentpresley
The link I provided clearly states in text that FPMISC (=FSTORE) executes SSE STORES.
OK, I see your sudden turn ;)
Quote:
Originally Posted by brentpresley
Now we don't count any SSE load/store/move operations, since it turned out that they are not operations (aka no OPERATION is performed :D).
Good.
Let's count the so-called "OPERATIONS" :lol:
That is 4 SSE FP in C2D
That is 4 SSE FP in K10.
Done. So what was your point?
Mine initially was to correct this wrong sentence: "Core 2 can issue a max of 6 SSE instructions per cycle, while the K10 can do 3"
Quote:
Correct me if I am wrong, but that is how I read that.
You are right this time.
SEA and Brent, cut out the flaming.
While not an official benchmark or anything, I found the comments of Dr. Vijay Pande from Stanford University to be interesting. He stated that the SSE128 units in the K10 are on par with Conroe's SSE performance. Perhaps they have tested an ES for folding. Combine that with the K10's projected vastly superior floating point performance, and the Barcelona should be a great chip.
I still haven't seen any projected numbers for integer performance though.
SSE is floating point and is necessary to extract maximum performance.
Quote:
Originally Posted by Scimitar
ummm :banana::banana::banana::banana: NO..
Quote:
Originally Posted by accord99
Floating point means working with numbers that are laid out exactly like this:
http://upload.wikimedia.org/wikipedi..._point.svg.png
and it is properly classified as SISD, or Single Instruction, Single Data, math.
SSE is SIMD or Single Instruction, Multiple Data, which uses a single instruction to perform the same work on many different Data fields at once.
This is the difference between SISD and SIMD
http://arstechnica.com/cpu/1q00/simd/figure6.gif
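If the picture doesn't load, the same SISD vs SIMD difference can be sketched in a few lines of Python. This is only a conceptual model with made-up function names: plain Python lists stand in for vector registers, and no real SSE instructions are involved.

```python
def scalar_add(a, b):
    """SISD model: one instruction works on one pair of operands."""
    return a + b


def packed_add(a_lanes, b_lanes):
    """SIMD model: one 'instruction' adds every pair of lanes at once."""
    if len(a_lanes) != len(b_lanes):
        raise ValueError("both operands need the same number of lanes")
    return [x + y for x, y in zip(a_lanes, b_lanes)]


def instructions_needed(n_elements, lanes):
    """Add instructions needed for n elements with a given lane count."""
    return -(-n_elements // lanes)  # ceiling division


if __name__ == "__main__":
    print(scalar_add(1.0, 10.0))
    print(packed_add([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
    # 8 additions: scalar (1 lane) versus a 4-lane vector unit
    print(instructions_needed(8, 1), instructions_needed(8, 4))
```

The work done per element is identical in both cases; what changes is how many elements one instruction covers.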
Floating point is used for scientific calculations when fixed-point math isn't acceptable (though these days improvements in floating point have reduced that requirement). SSE is for vector math and similar parallel instructions. Read about AltiVec if you want the most logical version of what exactly SIMD is.
SSE1 is primarily about single-precision floating point, SSE2 is primarily about double-precision floating point. The way to extract maximum floating point capabilities from a modern x86 processor is to use SSE operations, plus x87 is deprecated in 64-bit Windows.
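As a quick illustration of the single vs double precision point, this Python sketch works out how many elements of each width fit into one 128-bit XMM register. It uses the standard `struct` module only to confirm the byte sizes; the names `XMM_BITS`, `lanes` and `PACKED_FORMATS` are made up for the example, and the 8-bit integer row reflects SSE2's packed integer support.

```python
import struct

XMM_BITS = 128  # width of one XMM register


def lanes(element_bits):
    """How many elements of the given width fit in one XMM register."""
    if element_bits <= 0 or XMM_BITS % element_bits:
        raise ValueError("element width must divide the register width")
    return XMM_BITS // element_bits


# struct format strings for one register's worth of each element type
PACKED_FORMATS = {
    "single precision": "<4f",  # 4 x 32-bit float
    "double precision": "<2d",  # 2 x 64-bit float
    "8-bit integer": "<16b",    # 16 x 8-bit int
}


def packed_size_bytes(fmt):
    """Size in bytes of the packed data described by a struct format."""
    return struct.calcsize(fmt)


if __name__ == "__main__":
    print(lanes(32), lanes(64), lanes(8))
    for name, fmt in PACKED_FORMATS.items():
        print(name, packed_size_bytes(fmt))
```

Each format packs to the same 16 bytes, so the register width stays fixed and only the lane count changes with the element type.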
SSE1 was merely the addition of the logic required to do the most basic vector math; SSE2, however, lets the programmer perform SIMD math of virtually any type (from 8-bit integer to 64-bit float) entirely within the XMM vector register file, without needing to touch the (legacy) MMX/FPU registers. SSE2 IS everything SSE should have been. And if you had actually programmed something for once in your life that specifically makes use of these pieces of the CPU, you would know they are VERY VERY different animals, and your repeatedly speaking of them as the same is not only inaccurate but annoying.
Quote:
Originally Posted by accord99
That's nice, but irrelevant to the discussion at hand. And doesn't change the fact that maximizing floating point performance comes from using SSE.