Thanks for answering that JF. :rolleyes:
So if he's getting accurate numbers from engineering, then it's a simple thing for him to say, "BD numbers are better than existing chips". Simple right?
That's pretty much what he has said countless times already.
http://www.xtremesystems.org/forums/...&postcount=602
I don't have any reason to be in building 400. And it is better off that the marketing guy is not "dropping in" on them.
Our performance engineering team has done a really accurate job of performance modeling in the past; I have no reason to doubt them. Generally the worst that we see is too much conservatism, not too much optimism.
OK, so let me get the gist of all of this whole thread down to two statements:
1. People are claiming Bulldozer will be slower than existing products because they are sharing resources in the processor and sharing is inherently worse.
2. People are claiming that even though Bulldozer has dedicated resources relative to the old architecture that shares them, this is worse.
OK, I got it now.
I'll make it short and easy to understand. Original quote:
Which is 100% true, as K10 has more execution units. I don't see the words performance, shared or dedicated in this post. Then you say:
Which is wrong, based on the above. I just pointed it out, but it seems it was a perfect excuse to ignore what the guy is actually saying (as you like to do) and repeat the same post you've been repeating, how many times now?
I hope you properly get it now.
Let's say their past history isn't as immaculate as you portray it. There is an alternate discussion of BD details on Aces, and Paul Demone directly answers JF's claims:
BD taped out a month or two ago. If they were lucky, silicon is mostly functional. If not, they are working overtime to fix it and get working samples. Silicon is being characterized and is in the pre-validation stage.
Quote:
Originally Posted by Paul Demone
In other words, benchmarks and performance are second place at this time, most important is getting a functional chip.
What this all means, every claim about BD performance is based on estimates done without having actual silicon in hand.
Sun Rock was meant to be the greatest chip of the past decade, with innovative features like transactional memory and scout threads. I still remember how ecstatic Jonathan Schwartz was over Rock.
Rock turned out a complete dud, burning 300 W with abysmal performance.
Six year old quotes? I'm going to start filling Sandy Bridge threads with info on Netburst, that's cool with you guys right?
I have to wonder why Paul would even say this unless he just wants to argue:
The first section deals with repeating the whole "OK, so it's faster overall but what about single threaded work?! Ha!" We've already been told that BD is faster than the current gen at both, which he even acknowledges in the second half. As such, what point was there to even posting the first part? As for the second part, who cares? If the frequencies are higher due to the changes in the chip and that permits an overall faster singlethreaded and multithreaded experience than is possible with current designs, then why does it matter if the new chip ticks faster? I'm not saying I think that clock for clock the new part will be slower at this point, but even if it were it would be fine given that in the end it's still faster and not only clocked higher.
Quote:
ROFL. A Niagara has higher "aggregate" (across all threads) IPC than a US-IV
but far lower single thread performance. Listen for what a salesman doesn't
say! Higher single thread performance than K10? Probably, but at far higher
clock rates enabled by a deeper pipeline, simpler cores, and a process shrink.
http://flamewheelspin.ytmnd.com/
perfectly sums up this thread...
:))
No need to; the point was simply to take the appropriate grain of salt with regard to marketing and performance claims for a future product.
Not in the slightest.
Quote:
Originally Posted by JF-AMD
First of all, nobody claimed BD will be slower than existing products, either in overall performance or single threaded performance. Nobody brought dedicated vs. shared resources into the discussion but you, so that's a false dilemma you have there.
The only point raised (by me at least) was that, given the design trade-offs BD made (which I addressed in detail in a previous post; my POV, nothing more, and which David Kanter also mentioned in his article), it is expected that BD will lose slightly in performance per clock compared to K10 in integer code. Overall performance of BD, including single threaded, will no doubt be higher than K10. But not per clock.
What do you base your opinion on? Deeper pipeline?
Are there any bits of info regarding cache inclusiveness/exclusiveness, other than 16 kB L1D, which hints for inclusive cache?
I'm still predicting an inclusive cache given the L1D size. There is no reason to stick with an exclusive cache, as it gives virtually no benefit because of the poor L2/L1 and L3/L2 ratios. It just slows every memory operation quite a bit while giving a marginal improvement in hit rate. Anyone with some knowledge of the performance penalty due to exclusive caches? I'd believe that an inclusive cache would bring more than enough to compensate for any loss the deeper pipeline could potentially cause, bringing the cache latencies to near Nehalem numbers, if not better. SB seems to be a real badass on this, so I can't see BD getting near its latencies even with an inclusive cache.
He was addressing JF's point about IPC being higher (Paul doubts that). I am surprised it isn't obvious.
Well, you see, neither I, Paul, nor others are interested in absolute values of benchmark scores. My interest is how they got there: the uarch, the trade-offs, the clever stuff done to hide bottlenecks, the corner cases, etc. I don't give a rat's ass if it scores 101 FPS in I-don't-know-what-game or does SuperPi in -2 sec.
Quote:
As for the second part, who cares? If the frequencies are higher due to the changes in the chip and that permits an overall faster singlethreaded and multithreaded experience than is possible with current designs, then why does it matter if the new chip ticks faster? I'm not saying I think that clock for clock the new part will be slower at this point, but even if it were it would be fine given that in the end it's still faster and not only clocked higher.
The fun is analyzing the intentions and the implementation, not the end result. I take great pleasure in reading about Netburst, Prescott, Tejas, Nehalem (the first one), Tanglewood, Rock, etc., even if some were duds in the end. It may suck, but it was innovative and challenging.
Well, after readin' all the stuff about BD, my nooby chip expertise tells me that:
IPC will be improved at the same clocks compared to current AMD processors.
It will take less space per core
It will clock higher than the current crop of AMD processors.
It looks like it will be highly competitive in the server market, but behind in the 'gamers' segment (possibly close to matching today's Intels because of clockspeed, but not surpassing it in IPC).
obviously no one will know until it gets leaked.
I don't agree with the word "estimates"
A design is validated and debugged long before it goes to silicon. Validation
is done both by cycle accurate software simulation and FPGA hardware
emulation. An FPGA hardware implementation of the core, or entire processor,
can run at 10+ MHz and can be made cycle accurate. This is also how you do
performance tuning during the design phase itself.
Typically operating systems are booted and many software applications
are run long before you go to silicon.
About your link......
What in these musings from the investment-board inhabitants can be classified
as anything but investor FUD, or of any technical relevance to
the architectural details of Bulldozer?
Regards, Hans
I don't get why you guys are so sure it can't offer more IPC than K10. I think it does make a difference that the 3 pipes could be used as either ALUs or AGUs, but not both simultaneously. Add the fact that many applications don't even saturate a full ALU/AGU, so combined with a better prefetcher, Bulldozer should offer good IPC gains.
It's been confirmed many times over that 80% number is integer cores in a single module vs integer cores in different modules, and the performance is lost due to shared components in the modules, not due to weaker cores.
In fact, it's been said a couple of times by JF-AMD himself...
Sharing must inevitably mean communism for some... but not for me. If it brings a good product at an affordable price with a big improvement over the last product, I'm all for it, really.
Arguing with the man who works at the company whose product you decided to pick on... and said person talks with the engineers who built the damn thing...
I agree, and savantu doesn't help one keep an objective view of the facts.
If JF said IPC will be better, it's true... Why? Simple: he doesn't want to be unemployed.
Good marketing is telling the truth... Henri Richard made some big mistakes, and now he doesn't work for AMD anymore.
Bad guys don't stay around for long...
What if there are multiple threads with lots of AVX instructions? Single module can only feed one AVX instruction at a time, or two 128-bit SSEx instructions, or 4 64-bit FPU instructions, right?
Up to <number of modules> threads running AVX, there should be no performance penalty as long as there are no other FP instructions in flight. The more there are, the lower the AVX performance will be. And if one adds more AVX threads, the FPU units will just starve and there is no performance improvement?
In short: If I want to do lots of AVX, I can only run <number of modules> threads for improved performance?
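To make the sharing arithmetic in the question above concrete, here is a toy Python model. The one-256-bit-AVX-op-per-module-per-cycle figure is this thread's reading of the design, not a confirmed AMD number:

```python
# Toy throughput model for AVX threads on shared-FPU modules.
# Assumption (from the post above, not confirmed): each module's shared
# FPU can issue one 256-bit AVX op per cycle, regardless of how many
# threads are scheduled on that module.

def avx_ops_per_cycle(threads, modules):
    """Chip-wide 256-bit AVX ops per cycle under the above assumption."""
    # Each module contributes at most one AVX op per cycle, so threads
    # beyond <number of modules> just starve.
    return min(threads, modules)

# Scaling is linear up to <number of modules> threads, then flat:
print([avx_ops_per_cycle(t, modules=4) for t in range(1, 9)])
# → [1, 2, 3, 4, 4, 4, 4, 4]
```

Under that assumption, the answer to the question is yes: past one AVX thread per module, aggregate AVX throughput stops improving.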
Repeating for the 10th time already: 10h can retire (at the back end of the chip) 3 macro-ops, period. It has 9 execution units. There's your problem.
By the time this what-if matters, there will be more powerful CPUs capable of doing more than a single AVX instruction per core, etc...
And anyway, isn't AVX better suited to massive multimedia tasks in terms of how it processes the data? So even if the CPUs might be limited by that fact, they will most likely finish their job easily, right?
Quotes taken out of context can be true, but in context they can mean something different. Your quote was a response to my post about pipes. He is trying to make 3 pipelines appear like 6 pipes. Which is a twist to the truth.
BD has more resources since it can use 2 ALUs and 2 AGUs every clock; Phenom II averages 1.5 ALUs and 1.5 AGUs since they share pipes. Again, if you can't use it, it isn't a resource. 2+2=4, (3+3)/2=3.
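That arithmetic, spelled out (these are the post's own numbers, i.e. one side of this debate, not settled fact):

```python
# Per-thread integer pipes, per the post's accounting (one side of the
# argument in this thread, not an agreed-upon fact).
bd_per_thread = 2 + 2          # BD: 2 ALUs + 2 AGUs, dedicated every clock

# Phenom II, as the post counts it: 3 pipes doing double ALU/AGU duty,
# i.e. an average of 1.5 ALUs + 1.5 AGUs available at once.
k10_per_thread = (3 + 3) / 2

print(bd_per_thread, k10_per_thread)   # → 4 3.0
```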
The discussion is still around IPC, even if you try to make it look different. And it's still about BD's integer execution capacity compared to K8 (10h); we are pointing out that BD's 4 pipes seem a bit stronger than K8's 3 pipes.
And by adding the different parts of K8's pipeline together, some people here are trying to make them look twice as strong.
4 pipes equals more resources than 3.
Is this clear enough?
http://www.realworldtech.com/include...ulldozer-4.png
Each integer core takes 4 macro-ops from the dispatch group buffers, while each 10h (Istanbul) core takes 3 macro-ops.
Where did you get those slides? :O They are more than clear about how the arch works compared to Westmere and Istanbul... Thanks
From here. It's a very good article ;)
Its not like it hadn't been already posted here in this thread:
http://www.xtremesystems.org/forums/...&postcount=685
I didn't see the post :)
PS: I read it from the first day it got published, http://www.google.com/realtime ;)
The thing is that it uses them. If the CPU can't use all 6 at the same time, that's another thing; all 6 will get used at some point. Either way, they are on the die, they're connected, and they are used. Maybe not all at the same time, whatever. But they are there, they are used, and thus they are a resource. K10 has more resources than BD (integer "clusters").
Instructions per clock (compared to K10). Frequency doesn't matter, this is per clock:
IPC (CPU level) --> Will be higher, more "modules", double integer resources per "module", less resources per integer "cluster", better use of available resources per integer "cluster".
IPC ("module" level) --> Will be higher, double integer resources per "module", less resources per integer "cluster", better use of available resources per integer "cluster".
IPC (single integer "cluster") --> Less resources, better use of available resources. Higher or lower instructions per clock?
The bold part is likely lower, and that's exactly what savantu, terrace and others are discussing here: IPC per integer "cluster". We don't know for sure, since JF just says "IPC will be higher". At which of the previous levels? After all the BS, bans, etc. he still hasn't answered this question.
Now, if you throw frequency into the mix, knowing that it will be higher than current K10 CPUs, of course you can say single integer "cluster" performance is higher. Just notice how he never uses IPC + higher + per integer "cluster" in the same sentence. The only info we have about single thread performance is that it will "be higher". Of course, because of the higher frequency, not because IPC is higher.
JF just has to answer the question and this debate is going to end fast: IPC per integer cluster has been increased or not? No BS, just yes or no.
Stargazer, can you read or not? The man said IPC will be higher and single thread performance will be higher. Can't you just stop beating the dead horse already? It's dead, alright?
K10 has a clear bottleneck in the retirement unit. It has a massive 9 execution units available (3 ALU, 3 AGU, 3 FPU) but can retire only 3 macro-ops per cycle.
What would make the IPC lower on "integer cluster level"? Deeper pipeline + "less" resources, L1D cut down by 75%?
Neither of those. Fewer absolute resources, but more usable resources per thread. This alone could possibly compensate for any IPC loss caused by the deeper pipeline, let alone the improvements in other areas. If the cache is actually inclusive, then that alone would compensate for every possible CPU-level change that could reduce IPC, even the ones the fiercest Intel fan could think of.
The potential integer throughput of those 2 ALU/2 AGU pipes says very little about IPC, let alone single-thread performance or whole-product performance. All you'd need is slightly faster cache access plus more aggressive prefetching and branch prediction to get a 10% IPC increase even with a 10% penalty on the "integer clusters".
What BS? He has already stated both single thread and multithreaded performance are both higher.
informal, some people can't believe their eyes.
Anyway, yeah, read the article at anand. It's quite clear about uarch changes there (and probably the only site other than realworldtech which bothered to make their own diagrams to help better understand). I haven't read the one on realworldtech yet but judging from that pic at post 726, it might be good.
David Kanter over at Real World Tech has a writeup about Bulldozer's uArch.
http://www.realworldtech.com/page.cf...WT082610181333
I figured this didn't have to be posted as a completely new thread.
2 pages is a few? :D
And finally we got a statement; it's the first time he explicitly mentioned this and the question was answered, ironically after the one who asked the question first was banned... :rolleyes:
If he had done so much earlier, we could have saved at least 15 pages of nonsense... anyway, I'm satisfied with the answer and there is nothing more to ask.
Or the guy who made 15 pages of nonsense could have read a few articles on the front page before starting on assumptions (and indirectly accusing people/slides of lying). Two ways of looking at it... :rolleyes:
1) As for what percentage improvement could be seen... JF has already said that with 33% more cores, a 50% performance gain on server workloads could be seen. This is the only information JF is willing to share, and unless you hold Intel stock or work for them, I see no reason why you'd press so hard for that information... which he already explained he couldn't share owing to the product being some time away from launch (I assume about a good 2 quarters or so...). Personally speaking, AMD wouldn't want Intel to have information on an upcoming product, as it would give Intel an edge and possibly a chance to outmaneuver them. It works the same way in the opposite direction... The only time Intel leaked information on an upcoming architecture (remember C2D) was when AMD was kicking them around left, right and center, in all segments of the market... Now if Intel finds out stuff, they could possibly evolve a new pricing strategy (given their scale and market share, it's easier now) or something else to counter a competitive product. Competitive BD is... :up:
2) IPC is higher... :yepp:
3) IPC compared to previous AMD architectures is higher... he said as much... many times over...
to be fair, historically companies that hide information until days before launch tend to have problems with their product, especially the ones who have many delays. Even if BD does have a sizeable increase over k10.5, which it should, I honestly don't think it will be enough to compete with Sandy Bridge.
That preview by Intel was a red cape for AMD to charge at, and I'm willing to bet that if they had a better product, they would have released their own preview, challenging for the top spot. My guess is that BD will be a fine product, just still not as powerful as Intel's in terms of pure performance. To me it seems it's more about power efficiency, as JF keeps mentioning 50% more performance from 33% more cores. Well, why not 100% from 33% more cores? Because the thermal envelopes would just be too high, not to mention the power draw would be astronomical, considering they don't have a working 32nm process.
At least from my perspective, it seems to me that AMD is done challenging for the top enthusiast performance spot. They seem to have shifted onto a new direction, trying to offer the most performance per dollar, especially over the long run when you consider electricity bills. That's quite reasonable, as Intel has far more money spent on their fabrication process, and thus have denser, faster caches which seriously helps out on applications like Super Pi.
Is the OS aware of the cores sharing resources? If 2 cores of a module have 80% of the performance of two independent cores, then when an application uses 2 threads (most games, for example), will the OS schedule them on two different modules or on a single module?
I'm reading the thread, sorry if it has already been answered...
No one is sure, all JF has said is that AMD is working with MS to devise core utilization order etc.
I would imagine, that ideally for multithreaded tasks you would want the same module due to the shared L2, but for separate tasks you would want different modules due to the performance loss from sharing components
At the same time as you have a performance loss from shared components, you have a boost from Turbo. If four threads run on one module each, you will have no turbo, since turbo management is at the module level, not the core level. If all threads run on two modules, you will have a 10% performance hit, but Turbo will make up for that and more.
I suspect we'll be seeing an AMD lineup using 8 cores vs 4 cores, even if that means 4 modules. The AMD cores within the modules are certainly more core-like than HT.
At the end of the day, the prices will be set based on workloads and how it copes with them. If a 4 module/8 core AMD chip (at say 2.5GHz) can deal with the same workload as a 4 core (8 HT) Intel chip (at 2GHz), then that will be its price window (speed values for argument's sake etc).
If you run one thread per module, all modules work at the same time and none rest, therefore no module can enter turbo. But if two modules each run two threads, then two modules rest, and if two modules rest, the other two can enter turbo mode.
You can't have turbo and all modules working at the same time; the fact that part of a module is idle doesn't matter, since turbo works at the module level.
And it's said everywhere that a second thread run in a module "only" increases performance by 80%. That is a 10% performance loss compared to a traditional dual core approach.
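A toy model of that trade-off, using this thread's numbers (the 80% second-thread scaling; the turbo bump below is a made-up placeholder, since the real figure is unknown):

```python
# Toy model: 4 threads on a 4-module BD, using this thread's numbers
# (a second thread on a module adds only 0.8 of a core). The turbo
# multipliers are hypothetical; AMD has not disclosed them.

def spread_out(base_ghz):
    # One thread per module: 4 full cores' worth of throughput, but no
    # module is idle, so (per the posts above) no turbo.
    return 4 * 1.0 * base_ghz

def packed(base_ghz, turbo=1.10):
    # Two threads on each of two busy modules: each busy module gives
    # 1.8 cores' worth, and the two idle modules allow turbo.
    return 2 * 1.8 * base_ghz * turbo

# Whether packing wins depends entirely on how big turbo is:
print(spread_out(3.4))             # 13.6 "core-GHz"
print(round(packed(3.4), 2))       # 13.46 with a hypothetical +10% turbo
print(round(packed(3.4, 1.2), 2))  # 14.69 with a hypothetical +20% turbo
```

So "Turbo makes up for the 10% hit and more" holds only if the turbo bump exceeds roughly 11%; below that, spreading threads out wins.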
I doubt that would happen, though, due to manufacturing costs. I have to believe 1 module is bigger than 1 Intel core. For consumers, the Intel core would make more sense, whereas for servers the module would make more sense, as you are comparing 130% to 180% of the integer performance. Thus, since server processors are always priced with much higher margins in mind, they could probably line up their processors that way, so even if Intel's IPC is 10% faster, they would still win the performance battle.
However, I just can't see AMD being able to price their products as you described for the general consumer and still make a profit, especially when Intel is at 32nm whereas AMD is stuck at 45nm. Even if they could, if the Intel product offers anywhere from 5-20% more IPC, I would just buy an unlocked K-series processor and be happy with that. Having anything beyond 4 threads is pretty much useless for me, so single threaded performance is what will earn my money.
I would not make assumptions about how our processor works based on how our competitor has implemented technology.
As you may (or may not) be aware, I was critical of the way that they implemented turbo. I am happy with the way that we have implemented it. I can't get into specifics, but I can assure you that when you look at the two implementations, you will see a clear difference and you'll appreciate what we have done with the technology.
I hate to say things like that without being able to disclose any of the detail, but more than that I hate people going down the path of assuming things about our product that might not be fully accurate. It's a fine line.
Just keep in mind that this is a brand new architecture and things are going to be approached from a different perspective. The modularity is only one small part of it; there are a lot of things that have been changed.
People have been asking for someone to really bring some real innovation to the market, I think you will see that.
It is working at a module level, but that is all we know. There are many things AMD didn't reveal, for obvious reasons.
This is quite a fascinating architecture. If that RWT article is accurate then I am extremely interested in seeing some benchmarks.
I don't buy that overall per-core IPC must necessarily decrease (relative to K10) because of fewer integer ALUs. Of course they will obviously miss out, compared to a 3 or 4 ALU core, in cases where int ILP is greater than 2. But in cases where the code is a more even mix of int and memory ops, IPC could go up relative to K10, based on available execution resources alone. Which case is more common obviously depends on the specific code being run. Though I'd suggest that a program with consistently high integer ILP would be more efficient using packed integers (handled by the FPU) anyway.
If we add to that the fact that missed branches and cache misses (both significantly improved in BD) have a much greater effect on overall IPC than some missed ILP cases, it's clear that claiming lower IPC than K10 isn't really justified based on fewer ALUs alone. I doubt that BD will have lower IPC per-core than K10. In reality it's probably somewhere in the vast gulf between PII and SB.
As already noted though, IPC isn't the only factor in a processor's performance. This is obviously a high frequency design. The memory and cache subsystems are a big leap forward for AMD. They are designed to keep a large number of cores well fed - to minimize the amount of time that execution resources are waiting on data and thus increase efficiency. Intel will probably continue to lead in IPC by a significant margin. Whether AMD can increase frequency enough to make single threaded performance competitive remains to be seen. On the multi-threaded side BD sounds like a monster.
If AMD can't match Intel's single threaded performance it looks like we will have a split market come 2011. Office users and gamers might do best with SB while people doing encoding, folding, heavy multitasking, HPC, and servers might do best with BD.
While I agree with everything else you put (that RWT article is a must read for anyone who hasn't), I would say this last statement is wrong.
I suspect that margins will be significantly lower for gamers/office users (although will bobcat/llano fill the office space?). It could be a great result for overclockers, as we'll have access to decent multicore tech, that should have a bit of room to mess with.
So unless Intel go for a price war, all AMD has to do is price match on a performance level.
It's only people wanting the absolute max who care about who has the best CPU. The mainstream gamer just wants to spend £200 on a CPU and make sure it is competitive with other CPUs around that price break.
Why don't we look at the argument from another view point.
Show me the source code to 1 program which can sustain under optimal conditions an IPC greater than 1.8, for which multi-threading isn't a better solution.
For those of you smart enough to actually wonder what makes IPC greater than 1 possible [in source code], let me save you a long winding trip and give you the answer: such a beast DOES NOT EXIST.
Let me think about that.
Hans wrote for the K8:
http://chip-architect.com/news/2003_...it_Core.html#3
Quote:
Each Scheduler can launch one ALU and one AGU operation per cycle. The ALU operation may come from one x86 instruction while the AGU operation may come from another.
That is not 1.5, that is 3... maybe you missed the fact that the MacroOps are split into µOps at that stage?
Correct, there are 3 full integer lanes in K8 and on that can do either ALU or AGU work, but as I understand it, it is more efficient, due to improved prefetchers and smaller die sizes, to use a 2+2 simplified design.
JF, so each BD is faster clock per clock than the Phenom cores? Or is it by just comparing the top clocked frequency processors of each product line?
No, it is not "either", it is both... what do you not understand in the quote from Hans' article?
That is correct; the typical IPC of usual code is around 1. I think Nehalem achieves 1.5-1.7 in the best cases, thus: 2 pipes are enough :)
Quote:
but as I understand it is more efficient due to improved prefetchers and smaller die sizes to use a 2+2 simplified design
Yes you are right, but I never said anything against that point ;-)
Maybe one note on that, because I read it earlier: the AGU results are not retired, they go immediately into the LD/STR units, so the waiting µOp can get its mem-data ;-) Later, after the calculation of the µOp is finished, that µOp is retired.
So in short the retire / ExU ratio is 1:2 for both, not 1:3. For K10 it's (3:6) and for BD it's (4:8).
AMD to Test Upcoming Bobcat Processors in Servers
Quote:
"We're definitely in the process of examining this as a design point," said Donald Newell, AMD's new server chief technology officer, in an interview. "It would be foolish not to."
Quote:
"There's only a few papers ... and there's a lot more data to collect," Newell said. "It really depends on a number of factors ... to whether or not that's a good design point."
Quote:
"It's hard for Arm to move up in the server world, like x86 would be to move down to dishwashers," Newell said.
AMD also is looking to mold graphics processors and separate accelerator units into its server offerings. Right now GPUs and accelerators are designed for specialist computing needs, but the company wants to build chips where all the architectural elements flawlessly work together, Newell said.
"We're definitely in the process of examining this as a design point" is not the same as actually testing Bobcat in a server environment. Today's journalists really like to twist words and jump to (wrong) conclusions. To get to actual testing, they first have to see if it makes sense at all.
That's correct :). BTW, I'm sure AMD at least investigated the other ALU/AGU possibilities and came out with the most efficient one. Wasted resources and power / diminishing returns is not what they would want from a design like Bulldozer, especially with the clock targets they have in mind :).
I have debunked this in several places. We are NOT "testing" bobcat in servers.
We are looking at the market to determine whether there is a place for it. It would be irresponsible not to consider every piece of silicon and IP that we have access to. But, as Bobcat is defined today, it does not meet the needs of the server market, just as Atom and ARM are coming up short as well. When you can get six cores @ 35W TDP in an Opteron 4000, why would you want to build more servers and have more physical hardware? The folks looking at really low power environments are looking at embedded, or they are looking to reduce management and power costs. 12 cores @ 35W/CPU in a single server makes a lot more sense than 6 low power (and low performance) dual core 1P servers. When you talk to the big cloud guys, core density is critical because that means fewer systems to manage.
Anybody who wants to speculate about clock rates ?
Just remembered IBM's 4.25 GHz POWER7 8-core chip with 4x SMT. That is on 45nm :eek:
So far I thought 5 GHz for BD was fanboy dreaming, but compared to that monstrous 45nm chip it seems rather reasonable now: a smaller BD die produced on 32nm with high-k/metal gate should be able to achieve that.
What do you think? Is it OK to speculate on x86 clocks by comparing to POWER / RISC numbers?
@informal:
I agree totally ;-)
Thanks
I have a hard time believing 5GHz stock, as that's just never been done before that I can recall. However, Intel's Sandy Bridge lineup covers 2.5-3.4GHz, and assuming they will have an IPC advantage, AMD may end up covering 3-4GHz (numbers pre-Turbo on both sides).
Overclocking BD should be fun if it is truly a high frequency design. Even though Netburst CPUs are just about worthless in terms of performance, they are still some of the most fun to mess with. AMD could perhaps combine the best of both worlds and give it more IPC than K10.5 while still making it clock like the P4s (that would be a major win among enthusiasts now that Intel is locking the FSB).
I will give it a try :D
@95W envelope we have 6 cores done on 45nm working @ 2.8Ghz. If BD was done on the same node I guess ,with the targeted 20% in clock speed due to pipeline changes, we could have 2.8x1.2=3.36 or round up to 3.4Ghz.BUT,it will go to 32nm highK/mg instead.I would still pick the same clock and power draw values just to be conservative(let's disregard the 45->32nm node improvement since we have 33% more cores).That's a 4 module part. Now,if count in 10-15% IPC improvement(pick average 12.5) and 33% more cores and at last divide by 1.1(10%) for the "performance hit" in fully loaded modules,in multithreaded workloads we get an equivalent performance of 4.65Ghz X6 Thuban .This is with no Turbo over stock.
Now,with the new Turbo(<=1/2 of the cores are idle,picking Thuban's Turbo conditions),I would expect ~20-30% clock increase,take a 25% as middle .We get => 3.4x1.25=4.25Ghz in poorly threaded or single threaded applications.Now add the speculated 10-15% IPC jump(pick 12.5 as arithm. mean value) to get the equivalent Thuban class core clock=> 4.25x1.125~=4.8Ghz Thuban in single threaded workloads(no 10% hit here).If the power gating happens in a way so that 2 modules are gated,we have the 10% hit due to core scaling in modules => 4.8/1.1=4.36Ghz Thuban class core speed in poorly threaded workloads(1<no. of threads active<=4).
So to sum it up, I expect a 95W 3.4 GHz "X8" Bulldozer model, with a 4.25 GHz effective Turbo and a 10-15% (pick 12.5%) IPC jump. This would be equal to:
- a 4.8 GHz Thuban in purely single-threaded workloads,
- a 4.36 GHz Thuban-class core in poorly threaded workloads, and
- a 4.65 GHz X6 Thuban in multithreaded workloads.
In the 125W range I would expect 3.6 and 3.8 GHz models, and if they really want to push the limit, a 4 GHz 125W model. Turbo would be smaller percentage-wise, and similar or slightly lower frequency-wise, than in the earlier example. So effectively just add 0.2, 0.4 and 0.6 GHz on top of the three "equivalent Thuban" numbers above and you will have a projection of how these 125 or 140W parts could perform (the top model, the hypothetical 125/140W 4 GHz one, could easily be equivalent to a 4.7-5.4 GHz Thuban-class core, depending on the workload).
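The back-of-the-envelope math above fits in a few lines of Python. This is purely a sketch of the post's speculation; the 12.5% IPC gain, 25% Turbo uplift, and 10% module-sharing penalty are all assumed figures from the post, not confirmed numbers:

```python
# Speculative Bulldozer -> Thuban-equivalent clock estimates,
# using the assumed figures from the post (not official numbers).
BASE_CLOCK = 3.4      # GHz, assumed 95W 4-module part
IPC_GAIN = 1.125      # assumed 10-15% IPC jump, 12.5% midpoint
TURBO = 1.25          # assumed 20-30% Turbo uplift, 25% midpoint
MODULE_PENALTY = 1.1  # assumed 10% hit with both cores of a module loaded
CORE_RATIO = 8 / 6    # 8 BD cores vs. a 6-core Thuban

# Single-threaded: Turbo clock plus IPC gain, no module sharing.
single = BASE_CLOCK * TURBO * IPC_GAIN
# Poorly threaded (2-4 threads): same, but modules are shared.
poor = single / MODULE_PENALTY
# Fully multithreaded: stock clock, IPC gain, 33% more cores, shared modules.
multi = BASE_CLOCK * IPC_GAIN * CORE_RATIO / MODULE_PENALTY

print(f"single-threaded ~ {single:.2f} GHz Thuban")     # ~4.78
print(f"poorly threaded ~ {poor:.2f} GHz Thuban")       # ~4.35
print(f"multithreaded   ~ {multi:.2f} GHz X6 Thuban")   # ~4.64
```

Which lands on the same ballpark as the rounded 4.8 / 4.36 / 4.65 GHz figures in the post.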
Enough of xtreme speculation from me :)
Assuming that the "50% more performance with 33% more cores" refers to IPC, we get a 12.5% increase in IPC relative to K10h. Estimate that the area of a module on 32nm is the same as a core of the previous generation, and that the power envelope (just of the cores now, not the whole chip) is the same for the same area as the previous generation. If we add a 20% higher frequency for the same power envelope, we've got a 35% increase for the same thermal envelope.
Each module has 30mm^2, so the total will be 120mm^2 for a 4-module part. Plus some 8MB of L3 cache at, say, 60mm^2, we have 180mm^2. The previous generation put out 125W at 3.4 GHz, so this one will be 4.1 GHz at 95W, with Turbo at 5 GHz.
Let's look at it performance-wise. With 4 modules / 8 cores, a Bulldozer will be 70% faster while consuming 30% less than a 3.4 GHz PhII.
The IPC of SB is 50% higher than PhII's, but it won't clock as high as BD. At 95W, a 4-core will be 3.3 GHz. So at this power envelope, BD will be 20% faster than SB, with about the same die area as SB, or slightly smaller.
Of course, Intel will release an 8-core SB, but its die area should be around 320mm^2, and there's no way its power consumption at 3.3 GHz will be lower than 150W. For servers, Intel must counter with at least a 10-core part, absolute minimum.
So, you see, BD will be a competitor for Ivy Bridge, not Sandy Bridge.
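The per-core uplift and die-area arithmetic in that post can be checked quickly. All inputs here are the post's own assumptions (30mm^2 per module, 60mm^2 of L3, 20% frequency gain at equal power), not known specifications:

```python
# Per-core IPC gain implied by "50% more performance with 33% more cores".
ipc_gain = 1.50 / (4 / 3)     # -> 1.125, i.e. 12.5%
# Assumed 20% frequency gain at the same power envelope.
clock_gain = 1.20
same_tdp_uplift = ipc_gain * clock_gain

# Hypothetical die-area estimate from the post's assumed figures.
module_area = 30              # mm^2 per module (assumption)
l3_area = 60                  # mm^2 for 8MB of L3 (assumption)
die = 4 * module_area + l3_area

print(f"IPC gain:        {ipc_gain:.3f}x")        # 1.125x
print(f"Same-TDP uplift: {same_tdp_uplift:.2f}x") # 1.35x
print(f"Estimated die:   {die} mm^2")             # 180 mm^2
```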
As you rightly said... Intel chippery has faster and denser caches, which help in most desktop situations... AMD will be good, but beat Intel? Not unless some multi-threading is thrown into the picture...
Then again, this is off-topic, but you've got to look beyond the architecture to see whether the binaries involved in building the software are any bother, and how much... As far as I'm aware, the latest Intel binaries do not allow AVX to work on any chip other than "GenuineIntel". This should shut up fanboys from both sides :P
Yes, in the server arena, which is the most lucrative for both Intel and AMD, BD will help AMD gain a competitive edge... People aren't yet using MC as much, as the big OEM partners are yet to come out with servers featuring MC at its best... However, will they be able to resist BD? Actually, Eagleton could not come any sooner for Intel... but as far as I've learned, it would be based on Sandy Bridge and not Ivy Bridge (please correct me if I'm wrong on this bit). What I'm saying is, BD is going to be a major win for AMD in the server arena, which is where they'll make most of their money.
AMD is keeping quiet because the product is 2 or more quarters away, and saying more would give Intel enough time to out-maneuver them in the market on factors like price, limiting options for AMD... Hence my theory of "Intel employee" when people want to know more than AMD has already offered :P
Seems like we all know what we need to know then. :D
In this very thread people discuss frequencies of 3.3-4 GHz for BD, which is significantly higher than MC (max 2.3 GHz).
The 50% more performance, 33% more cores applies versus Magny Cours. You also need to factor in frequency since this was the unknown part in the AMD slide.
The 8-core version will most likely be @ 3.0, maybe 3.2... the 6-core 3.4 and higher... and on and on.
I considered that the clock advantage would be related to the core area, which is really small in BD, such that 2 BD cores take about the same die space as one PhII core. So the clock advantage is even higher from a per-core point of view.
So, if the clocks are as low as you say, BD is more or less tied with, or a bit below, SB in perf per watt.
A 16-core part matching MC would run at 2.6 GHz, by my reasoning.
On the MCM part they are still bound by TDP... glue 2 of those dies together and you need to lower your clocks considerably to stay in the desired TDP, so I don't think you will see clocks higher than 2.3 GHz for the 16-core version...
Maybe 2.5 GHz for the 12-core MCM part on 32nm... and higher the fewer cores they have.
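The TDP argument in the last two posts can be sketched with a simple scaling model. This is purely illustrative: it assumes total power scales linearly with core count and roughly quadratically with clock (the exponent is a hand-wavy assumption; real voltage/frequency curves vary by process and part):

```python
def tdp_limited_clock(ref_clock, ref_cores, new_cores, exponent=2.0):
    """Estimate the clock a part with `new_cores` cores could sustain in
    the same TDP as a reference part, assuming power ~ cores * clock**exponent.
    The exponent is an assumption, not a measured value."""
    return ref_clock * (ref_cores / new_cores) ** (1.0 / exponent)

# Example: start from a hypothetical 8-core die at 3.0 GHz and glue two
# together (MCM) into a 16-core part at the same TDP.
print(f"16-core MCM: ~{tdp_limited_clock(3.0, 8, 16):.2f} GHz")  # ~2.12
print(f"12-core MCM: ~{tdp_limited_clock(3.0, 8, 12):.2f} GHz")  # ~2.45
```

With these assumed inputs the model lands in the same neighborhood as the "not above 2.3 GHz for 16 cores, maybe 2.5 for 12" guess above.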