Considering that max Turbo kicks in at half-module use, it could deliver optimal performance, and it could be more efficient too.
You asked "why not be happy they showed us what kind of air/water OC we can expect." and i answered.
Why should i be happy about how it overclocks if i don't even know how it performs.
As for your concept car stuff... sorry, i couldn't make any sense out of it :shrug:
Windows has been multithread-aware (SMT at least) since XP; however, 'aware' and 'optimized' are two very different things. Vista did not improve on that, but Windows 7 implemented SMT parking (google it, you will find several references). While not nearly perfect, the improvement in performance in lightly threaded applications can be quite high. Though it's a completely different architecture, BD may or may not benefit from scheduling threads across modules as opposed to within a module, but it would make sense that Windows may initially view a module as a dual-threaded 'core', enumerate the contexts as such, and take advantage of better scheduling. Total guess, but it would make sense. I, personally, would take this with a healthy dose of skepticism until both final silicon and final Windows builds are actually released.
Yes, it's an anagram of DNA. What did I win ? :)
For me, it's the same thing as movieman (or whatever his name is) saying that AMD has a winner, etc...
Why is there still an ongoing discussion of cores vs. modules.........
I don't give a rats ass what Marketing calls the chip.
AMD's patent draws a clear picture. They say a picture is worth a thousand words, right? This is AMD's own picture from their own patent.
Note it's core 100, not module 100, aka core 0, and inside core 0 are 2 clusters, A and B.
Case closed.
Attachment 120124
I guess it doesn't matter what marketing calls it, and you are right, chew*. What matters is performance. If it fails to beat the X6 Thuban in multithreaded workloads then it doesn't warrant an 8C marketing name IMO. They would be better off calling it a 4C/8T part, since in that case they would get praised for being much faster than the X4 in many applications. But this way, equal to or barely faster than Thuban with sky-high Turbo clocks... it is not going to bring them much praise in reviews.
The only reason I have tried to point this out many times so far is people's expectations. Those expecting 100% native 8-core multithreaded performance have unrealistic expectations. Hopefully this gives them a better idea so they can have more realistic expectations.
a good marketing team can take anything and make it sound good
8 cores for the price of Intel's 4 cores
vs
our 4-core chips are faster than Intel's 4-core chips (pending perf results)
see.
Well, to be honest, they already said so at the Analyst Day presentations. They stated an "80% of CMP" approach while having less die area and less power draw. So this 80% means 0.8x the performance of a "native" X8 Bulldozer done the old way of stacking cores next to each other. In other words, without knowing whether they actually improved the "cores"/clusters versus K10, the speedup is a far cry from the 33% more we have on paper and in marketing slides.
tell that to apple, but you're absolutely right that only engineers create the product, while marketing creates the demand.
but you need both, and some companies depend on one more than the other. we all hope that a great product will sell itself, but that just isn't true these days with how much marketing affects our lives. and i don't mind some of it, where they try to push the good features of a great product, but i can't stand commercials which make their products appear a necessity rather than a luxury, or convince someone they are inferior until they buy it.
the benchmark i'm waiting for is OCed gaming perf for a price range vs the competition. Cores and Hz don't mean crap until then.
So... for example, 4C/8SMT > 6C/6T??? If effective 6C scaling is about 5.5x, for example, then 4C/8SMT could be about 6.3x?
Well, 80% of CMP is 1.6x to be exact (or 25% lower than perfect scaling). 4 "real cores", as chew said, each with 2 hardware threads running on them, should therefore get you around 6.4x, or 6.4/5.5 = 1.16x speedup over the 6C Thuban, provided same clock and same IPC. We have hints that IPC may be lower sometimes and higher sometimes. Also we have a 9% higher base clock than the 1100T. All in all, 20% faster than the 1100T should be the expected result, but as we can see from the Cinebench result for example, it is not faster. Maybe there are corner cases where modules don't perform so well and some cases where they perform exactly like 2 cores. So it averages to, say, 1.5x or so.
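The estimate above can be written out in a few lines of Python. To be clear, the 1.6x per-module scaling, the 5.5x Thuban scaling and the 9% clock advantage are this thread's assumptions, not measured numbers:

```python
# Rough model of the speedup estimate in the post above. All inputs are the
# thread's assumptions, not benchmark measurements.

per_module_scaling = 1.6            # two threads on one module vs. one thread
modules = 4
bd_scaling = per_module_scaling * modules    # 4 modules -> 6.4x over one core

thuban_scaling = 5.5                # assumed 6-core Thuban multithreaded scaling
print(bd_scaling / thuban_scaling)  # ~1.16 -> ~16% faster at equal clock and IPC

clock_advantage = 1.09              # assumed 9% higher base clock than the 1100T
print(bd_scaling / thuban_scaling * clock_advantage)  # ~1.27 before IPC differences
```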
80% is the second thread, so it would be 1.8x, not 1.6x. i asked JF about this on an AMD blog he did a while back.
OK, think about it like this: 100 pts is a base number for 2 full cores' performance running some workload, so performance without any "compromises". Call it a hypothetical dual-core Bulldozer done the old way. 80% of this is how much? 100 x 0.8 = 80 pts. How much slower is this than the CMP BD used for comparison? 100/80 = 1.25, or 25% slower. Or if you want to use 100 pts as a base for a single core and even count non-perfect scaling (95%) due to software scheduling limitations: 1 hypothetical BD core is 100 pts, 2 of those in a CMP design 195 pts. 80% of this is how much exactly? 195 x 0.8 = 156 pts. How much slower than the hypothetical CMP BD is this? Yes, 25%: 195/156 = 1.25.
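The points arithmetic above, written out as a quick check. All values are the hypothetical "pts" units used in the post, not benchmark scores:

```python
# 80%-of-CMP arithmetic from the post above, in hypothetical "pts" units.

cmp_dual = 100.0                 # old-style CMP dual core = 100 pts
module = 0.8 * cmp_dual          # module quoted at 80% of CMP -> 80 pts
print(cmp_dual / module)         # 1.25 -> the module is 25% slower than CMP

# Same result starting from 100 pts per core with 95% software scaling:
single = 100.0
cmp_pair = single * 1.95         # two CMP cores with imperfect scaling = 195 pts
module2 = 0.8 * cmp_pair         # 80% of 195 = 156 pts
print(cmp_pair / module2)        # 1.25 again
```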
How about official AMD slide? Is that way off too?
Attachment 120134
Read carefully what it says at the bottom. Where does it say exactly, in the official AMD presentation, that the second core adds 80%? The whole module has 80% of CMP performance while having less die area and less power. The CMP approach is 2 cores done the old way, or if you want another AMD slide, here it is:
Attachment 120135
http://blogs.amd.com/work/2010/08/30...ge-1/#comments
Quote:
Manicdan August 30, 2010
the 80% thing is still confusing many,
If I have 2 cores, I get 100% on top of the 100% of the first core = 100% each
If I have 2 BD cores in 1 module, do I get 80% on top of the 100%, for 90% each?
Or do we get 60% on top of the 100%, for 80% each?
Considering the 50% performance increase over MC, there really is no wrong answer here, but it does play a very fun role in the conspiracy theory math we like for trying to determine single threaded performance
John Fruehe August 30, 2010
It is all about throughput. To your question it is like 90% each.
One thread on one core = 100 units of throughput
Two threads on two cores in the same module = ~180 units of throughput
Two threads running on 2 cores in 2 different modules = ~200 units of throughput (I know Amdahl’s law says it won’t be straight scaling so it is actually less than that, just relax on that one for a moment.)
The point is that there is a small penalty for a shared environment. But, what is the payoff for that? How about more cores. If we did not share resources, that same die space that holds 16 cores might only hold, perhaps 12 cores, or so. (NOTE TO CONSPIRACY THEORISTS: this is just for example, don’t start making die space assumptions….) Would you give up 20% performance (or less) in order to get 33% more cores? If your application was highly threaded you would do that in a heartbeat.
People are fixating on what you give up by sharing and not what you gain. Think of SMT. You share integer pipelines. But in the example above, you would only get ~120 units of throughput vs. the 200 units of two full cores. So that penalty is 80% for sharing. Funny that nobody ever brings that up.
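For reference, JF's unit figures can be laid out side by side. The 100/180/200/120 numbers come straight from his post; nothing here is measured:

```python
# JF's hypothetical throughput units from the quoted blog comment.

scenarios = {
    "1 thread on 1 core":            100,
    "2 threads, 1 module (shared)":  180,   # ~90 units per thread
    "2 threads, 2 separate modules": 200,   # ignoring Amdahl, as he says
    "2 threads on 1 SMT core":       120,   # his SMT comparison
}

two_full_cores = scenarios["2 threads, 2 separate modules"]
for name, units in scenarios.items():
    print(f"{name}: {units} units ({units / two_full_cores:.0%} of two full cores)")
```

So sharing a module costs ~10% per pair of threads in his numbers, while sharing an SMT core's integer pipelines costs ~40%.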
Well, he is a marketing guy; the presentation I linked was done by the chief architect, Mike Butler. Who do you think knows better?
Note also that it was said to be an average figure. This means performance can be equal to or better than CMP (so yes, 1.8x applies) or 1.5x or even lower in some corner cases. It all depends on the micro-benchmark used. What matters is the average, and it is 80% (of the dual-core CMP approach).
or it means that on average the second core gives 80% when looking across many benchmarks. which makes sense to me because it's talking about die size and power in the same sentence. why say 80% perf when you're comparing one module to 2 cores? it makes more sense that the extra core costs less area than a traditional core and uses less power.
EDIT
btw if the second core is really that weak then it means single threaded stuff should be very strong.
Because the whole presentation was about a module and not about a single core running on it... You can even see it, it's so painfully obvious. It talks about 2 hardware execution threads and the good predictability of their performance (which they estimate at 80% of the "old" CMP dual-core approach). Read the first bullet point: "What it (the module) is? A monolithic dual core building block that supports two threads of execution". Then at the end: "Customer benefits: Estimated average of 80% of CMP performance (this is 2 cores, by the way) with much less die area and power". It clearly speaks about the module, since the module now has much less die area than 2 cores done the CMP way.
edit: sorry for the bold parts, I just had to point out the obvious in the slide since it somehow escapes you guys.
edit no2:
Single thread should definitely be stronger when running alone on the module; similar goes for SB and Nehalem. The difference is that this single thread should be a lot stronger than a single thread of SB (not directly compared, but compared when both run in isolation on their respective cores/modules).
Quote:
Originally Posted by Manicdan
SB sees a 0-30% speedup with SMT on. The Bulldozer module sees around a 1.6x speedup on average. This tells us that there is no erratic behavior with the module approach and that it is at least predictable. You can expect a 10-15% better single-thread result than what you would get from the multicore-scaled result.
The question is: is that single core running on a module alone still noticeably faster than a K10 core? If it isn't, then the multicore result will be a lot less impressive and you won't see 33% better scores over Thuban at similar clocks. You may see 10-20% better, depending on the application. Maybe even less than that (as Cinebench indicates: no better than Thuban...).
This again?
I can hardly wait to hear the kind of arguing that will ensue when they subtract the FP unit out of the module and send all the FP/SIMD work to the GPU. How many cores will it have then?
i tried to get clarity on what the 80% was because it's clearly confusing people.
i got an answer, and i trust that answer until we are told it was a mistake.