Module divisions are transparent... so does that mean natural multi-threading? meaning a single threaded app will be split among 2 cores at the module level?
Yes, I already read the article...
The important part is the other enhancements. ;) Quote:
The 3rd ALU does have some performance benefits, and AMD canned it to reduce die size, but AMD mentioned that the 4-wide front end, fusion and other enhancements more than make up for this reduction. In other words, while there’s fewer single thread integer execution resources in Bulldozer than Phenom II, single threaded integer performance should still be higher.
A 4-wide front end is wide, but on its own it's useless if you can't process the ops fast enough, so the solution is higher clock speeds, aka turbo.
Again, as I said, it's highly likely that Bulldozer will have higher single-threaded performance, but IPC will go down compared to a single Deneb core.
IPC will go down? Are you serious? :p:
Having more underutilized units is bad. Having fewer units that are constantly fed and always have something to do (the part you made bold) will make the "IPC", that relative and mostly meaningless number (which is 99% of the time <=2), higher. That's the whole point of this machine.
I doubt they will double-pump the ALUs. The units will just be better utilized, unlike the units in today's parts, which sit idle a lot of the time.
I think that the guy answering the question in the teleconference got a bit confused. It is possible that the new parts won't work in old boards (while the old CPUs will work in the new boards), but this has never been the case in the past: just look at the AM2/AM2+/AM3 comparison.
Read the AT article. It is indeed the hardest thing to do, but the way this machine is built allows it to get a bit further ahead of itself (with all the BP, data speculation, much-improved out-of-order load and store capability, etc.). I guess the more in-depth slides from Hot Chips will shed more light on this subject later today.
Here is my take on how BD will operate on sockets AM3 and AM3+
Bulldozer CPU = AM3 and AM3+ compatible
BD + AM3 = Dual Channel DDR3 enabled
BD + AM3+ = Quad Channel DDR3 enabled
I bet the difference is a few extra memory features on socket AM3+, just as was previously done with sockets AM2 => AM2+/AM3
Hmm somehow I doubt there will be 4 ch. on AM3+ boards.
Well, I'm only speculating that the difference will be such that you get extra benefits on AM3+ without hindering AM3 compatibility.
But it's true QC won't be possible. Perhaps enhanced power features, like the turbo mode, could differ between the two sockets though.
The Ontario 2-core + DX11 GPU die is about 75 mm² based on analysis of photos of the wafer shown at Computex.
Yeah, I wonder how long we have to wait before we get an answer for that.
No matter whether AM3+ has many or very few improvements, AMD won't tell us too soon, I guess, because they want to sell as many CPUs and chipsets as possible before BD.
The 870/880G/890GX/890FX were launched this spring, and when were the specs revealed? Two or three months before? And that was even for chipsets with very minor updates, especially the NB.
This makes me think that it isn't very likely that we'll see any specs this year.
This socket confusion will get sorted out pretty soon tho.
If BD doesn't drop into AM3, it had better be because it costs a lot and competes very well; then maybe the extra $100-200 would be worthwhile/irrelevant.
Since I was just going to wait for AM3+ before dropping AM2+ anyway, I'm not feeling very hurt, but it does kinda suck.
If all the websites had complied with the NDA, they'd have more info now. Those who did comply have the new slides :rolleyes:
As for the socket compatibility, there is no info about it in the AMD slides, and what was said during the conference call doesn't say as much as the websites do, either. So I guess if they don't have another source, they just made up the information trying to hit the mark.
I just can't understand how you can talk for 6 pages about information based on nothing. Calm down, people! It's starting to be impossible to find anything important between all the crap here. If it continues this way, people will move to other forums :shakes:
The FSB interface was in the processor too, otherwise it couldn't communicate with the NB. So HT has nothing to do with an NB.
The NB has always been the memory controller; the main bus interface has always been present in the processor. K7 had an area called the "Bus interface unit".
So I still can't see what part of the NB, other than the IMC, has been integrated into Phenom. :confused:
John, is this true? :eek: Quote:
Fruehe says that AMD will be able to get six-core and eight-core Bulldozer chips in the 30 to 40 watt power range, which is pretty low for a server. "The question is this," says Fruehe. "Is there a need for a more discrete, less-threaded chip for servers?"
Tbh I don't see ANY reason why AMD can't release 3 module (6 core) Bulldozer units besides product portfolio choices.
Each module is independent and the design would permit this choice fairly easily, I believe.
Plus there will always be 4 module chips with 1 defective module to disable ;)
I'm just wondering how Bulldozer can manage such low power with 8 cores.
Finally, JF answered me. :D
http://www.amdzone.com/phpbb3/viewto...37878&start=50
I actually have a meeting on this at lunch. It's actually a pretty cool discussion. However, don't expect a detailed discussion until much later.
I believe the next round of "20 questions" answers (some time next week) will tackle part of this. But there are some pretty cool SW things to consider.
That is all I can say.
So do we have a definitive/official response as to whether or not Bulldozer will be compatible with current AM3 motherboards? This thread grows too fast for me to follow :p
I would not give AMD a hard time for needing a new mobo; I would rather they design the CPU and then the mobo to suit it, rather than making design trade-offs to accommodate the older mobo tech.
It worked, we know exactly where to plant the bug. Now I just need an inside man, a tracking bug (maybe a baby monitor?), and a black SUV with tinted windows parked in the grass just 20 ft away from the room. Sounds easy. Quote:
I think the fact that there is a waiter in the conference room at AMD would be enough of a tip
Sure, a low-voltage chip can do that. At 3 GHz? Unlikely.
No, I will not read that article. I don't need someone to tell me the sky is blue either.
The only way to increase ALU utilization efficiently is through software; using hardware to improve this is generally inefficient. GPUs are a good example: it is harder to exploit data parallelism, but the majority of transistors in GPUs are SRAM in register files/caches, plus ALUs. That is about as lean as you can get.
16KB L1D (down from 64KB in K10.5), only 2 ALUs per integer core (down from 3 in K10.5), a longer pipeline... compensated by some op fusion and better prefetching & branch prediction, but it's pretty clear now that single-threaded (and therefore low-threaded) integer IPC is NOT going to make a huge jump from K10.5. I'm not sure it will even match Nehalem; it might be close. SB should be well in the clear.
The only thing I'm wondering with regards to AM3/3+ compatibility of bulldozer, are there no slides regarding it? It seems like everything is word of mouth.
I was quick to believe what Opteron146 quoted, but now that I think of it, there aren't any official slides regarding the platform yet. Not all info was released yesterday (or today, for those still on the 24th).
- Anandtech Quote:
In many ways the architecture looks to be on-par with what Intel has done with Nehalem/Westmere. We finally have a wider front end, branch fusion, power gating/true turbo and more aggressive prefetching. Whether or not AMD can surpass Sandy Bridge’s performance really boils down to how many Bulldozer modules you get at what price. If 2 module (4-core) Bulldozer CPUs go up against dual-core Sandy Bridge things could get very interesting.
With all the delays, a (late?) 2011 Bulldozer, and now this: BD is looking more and more like Microsoft Vista. Hope I'm wrong.
It's more interesting that they slimmed down the integer cores to 2 ALUs/2 AGUs. This is a big change from the 3 ALUs/3 AGUs that AMD has used for K7/K8/K10. Combine this with the four-wide front end/decoder and it is clear that BD is a throughput design optimized for server-type workloads.
It looks like AMD concluded that it can't match Intel in single core performance and it would always have to throw 2 cores at every Intel core. It sat down and looked carefully at how much sharing and trimming it could do to make 2 of its cores closer in area and power to 1 Intel Nehalem/SB core. The biggest moves are sharing the front end and FPU between cores and slimming down the integer cores. This should keep AMD in the game in servers, where it can put two 8-core devices into an MCM and sell it against 8-core SB and 10-core Westmere-EX.
The loss of the third ALU/AGU in BD's integer cores likely won't make a lot of difference to commercial server workloads, which have low ILP and high MLP. The sharing of the FPUs probably won't hurt HPC performance much because with 8 cores per die and 16 cores per package, memory system limitations will tend to dominate.
IMO, where it becomes problematic is building client devices based on the BD architecture. AMD will really have to ratchet up its CPU clock rates to more than make up for the loss in IPC from the slimmed-down integer cores.
http://aceshardware.freeforums.org/a...iew-t1042.html
What are we looking at here, terrace? An opinion from a traditionally anti-AMD guy (DeMone) who has no real perf data at hand? What "losses in IPC" is he talking about?
There's "L3 Cache and NB" and "Integrated Northbridge Controller".
The "NB" in "L3 Cache and NB" is the northbridge component found on AMD processors since K8. The "Northbridge Controller" is the IOMMU (what we're used to referring to as northbridge, such as AMD 790FX) integrated into the processor die.
That's like saying that with K8 AMD concluded that it couldn't match the higher CPU clocks of NetBurst and that it would always have to throw more instructions per clock at Intel's MHz.
Crysis 2 will support 8 cores, good luck with the single thread performance thing.
The size of the cache isn't everything; it might be quicker now, maybe 8-way or more associative? And the L1 takes up a pretty large die area on all K7 derivatives, so the space might be used more efficiently now. And as they said, you could quite easily make up for the smaller cache size.
The pipeline part might be a good thing: Athlon 64 has a longer pipeline than K7, Core 2 has a longer pipe than A64, and Nehalem has an even longer one; only Prescott has a longer pipe than Nehalem among x86 cores. Now they might have evened out the execution to be more streamlined, and these stages could be related to prefetch, thus increasing efficiency.
And when it comes to pipes, Phenom II has 3 pipes, which means 3 ALU or 3 AGU ops, while Bulldozer has 4 pipes, which translates to 2 ALU and 2 AGU ops. That means Bulldozer will most likely have quicker execution than Phenom II, and might even save die space at the same time.
So even without the branch prediction, prefetch and op fusion improvements, Bulldozer might be faster. Now consider the very aggressive prefetch, which translates to better use of available bandwidth, the enhanced memory controller, dual 128-bit FPUs, and possibly higher clock speeds.
Bulldozer might be an insane performer; we simply have too little knowledge to judge at this point.
Honestly, who cares about single-thread performance? Geez, if people cared about it, they would not even be buying dual-cores. The whole nature of multiple cores is multi-threading; anything else is fail!
techpowerup fail
"At the chip-level, there's a large L3 cache, a northbridge that integrates the PCI-Express root complex, and an integrated memory controller. Since the northbridge is completely on the chip, the processor does not need to deal with the rest of the system with a HyperTransport link. It connects to the chipset (which is now relegated to a southbridge, much like Intel's Ibex Peak), using A-Link Express, which like DMI, is essentially a PCI-Express link. It is important to note that all modules and extra-modular components are present on the same piece of silicon die. Because of this design change, Bulldozer processors will come in totally new packages that are not backwards compatible with older AMD sockets such as AM3 or AM2(+). "
:rofl:
official slide from last year's Analyst Day:
http://prohardver.hu/dl/upc/2009-11/...road.thumb.png
http://prohardver.hu/dl/upc/2009-12/13063_amd_rmap.png
it does have some purpose
Imagine if we hear AMD say that at 4 threads you can get 75% of the performance you get with 8 threads on BD: either it sounds like core scaling sucks, or it means that when one thread has a BD module all to itself, the bonuses are astonishing. This is very important for things that do have multi-CPU support but still have one thread limiting the others. Like in games: one thread is at 100% usage while the other 2 may be at 40-70% usage. So if you can give your most resource-hogging thread more performance (like turboing it), it can run at up to 120%, letting the other threads reach 50-90% usage.
Getting the most power out of a CPU at 1, 2, 3, 4, 5, 6, 7, or all threads within a given TDP is what we are noticing both the blue and green teams are trying to accomplish, and in the end it means a WELL ROUNDED CPU.
Just look at laptops: if you buy a quad you get 2 GHz, if you buy a dual you get 3 GHz. With proper features the quad should ALWAYS be faster than the dual, even when just 2 threads are needed. And hopefully we will be there soon enough.
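A toy model of that "boost the hog thread" idea, as a quick Python sketch (purely illustrative: the per-thread load figures and the 20% turbo bump are assumptions pulled from the numbers in the post above, not anything AMD has published):

```python
# Toy model: a game limited by one heavy thread. If the frame time is set by
# the slowest thread, turboing the core that runs the hog thread lets the
# lighter threads (which were waiting on it) do proportionally more per frame.

def frame_rate(work_units, clocks):
    """Relative frame rate: the slowest (work / clock) pair sets the frame time."""
    frame_time = max(w / c for w, c in zip(work_units, clocks))
    return 1.0 / frame_time

# Work per frame for three game threads: one hog, two lighter helpers.
work = [1.00, 0.60, 0.45]

base  = frame_rate(work, clocks=[1.0, 1.0, 1.0])   # no turbo
turbo = frame_rate(work, clocks=[1.2, 1.0, 1.0])   # hog core turboed by 20%

print(f"frame rate gain from turboing the hog thread: {turbo / base:.2f}x")
# -> 1.20x, because the hog thread alone was setting the frame time
```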
Single-thread performance --> low-thread performance, due to the nature of 2-cored modules and Intel's 2-way HT.
So you are essentially saying "wtf cares about 4-threaded performance?!?!"
Well, that encompasses the vast majority of what you'll be doing on the desktop most of the time, so, the answer is: most people.
In that case, you'll have the 8-cored SB 2011 monster able to run one thread per core, while the 8-core Zambezi runs in full module-based-resource sharing mode. The result will not be pretty.
Again, single thread performance really translates as N-threaded performance, where N is either the number of cores in an Intel device, or the number of modules in a BD device. Because up until those thresholds, shared resources aren't really having much of an impact, and you are not really constrained much by power, etc.
It's a big mistake to think that single-thread performance doesn't matter. And I don't mean single-threaded programs. Just think about it: what is better in a multithreaded program, 8 fast cores or 8 slow cores? Now, if we're talking about upcoming high-end desktops, that would be 8x 4-wide, 2-thread-per-core-capable SB integer cores vs 8x 2-wide Bulldozer integer cores, and 8x 256-bit SB FPUs vs 4x 256-bit Bulldozer FPUs. Yes, I'm aware of the 16-core Interlagos, but that is probably going to be a multi-module, server-oriented chip.
Read the last post from yuri.cs. Now I don't know the reality about AM3+ vs AM3 :confused::shrug:
http://translate.google.cz/translate...409%23p8041409
Umm, Sandy Bridge uses a double-pumped FPU setup, which means legacy "non-AVX" code will run at 8x128-bit; Bulldozer, with 4 modules / 8 cores, will also run the same.
Can't understand the translation, can you give a summary of the post...
http://www.brightsideofnews.com/Data...t_Core_675.jpg
Bobcat article - http://www.brightsideofnews.com/news...-movement.aspx
AMD Bobcat Core plan: Add an 80-core Cedar GPU and you have - Ontario
80 shaders means it has double the cores of the 4200/3200, and since I have overclocked a 4290 to 900 MHz I know how well it performs. If this 80-core GPU is also clocked around 750-1000 MHz we may find quite some graphical horses under the Bobcat's bonnet... :D
Lost Planet 2 uses 12 threads; at 4.5 GHz it keeps all 12 above 70%.
I was thinking the same for the first 10 seconds. Looks like a fake map on a ship :D
LOL is descriptive of this thread
http://en.wikipedia.org/wiki/Amdahl's_law
A simple example:
A task is 50% parallel, 50% serial.
If I speed up the parallel part by 2x, I increase performance by a factor of 1.33.
If I speed up the parallel part by 10x, I increase performance by a factor of 1.81.
If I speed up the parallel part by 100x, I increase performance by a factor of 1.98.
If I speed up the parallel part infinitely, I increase performance by a factor of 2.
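Here is that arithmetic as a tiny Python sketch (nothing Bulldozer-specific, just the formula from the Wikipedia link applied to the 50/50 split above):

```python
# Amdahl's law: overall speedup = 1 / ((1 - p) + p / s)
# where p is the parallel fraction of the task and s is the speedup of the
# parallel part.

def amdahl(p: float, s: float) -> float:
    """Overall speedup when the parallel fraction p is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

p = 0.5  # 50% parallel, 50% serial, as in the example above
for s in (2, 10, 100, 1_000_000):
    print(f"parallel part {s:>7}x faster -> overall speedup {amdahl(p, s):.3f}x")

# parallel part       2x faster -> overall speedup 1.333x
# parallel part      10x faster -> overall speedup 1.818x
# parallel part     100x faster -> overall speedup 1.980x
# parallel part 1000000x faster -> overall speedup 2.000x
```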
Will BD DROP into current AM3 boards or not?
How do you know? In a true 12-core design, do you think it would be over 70%? Maybe it uses four cores, and the HT threads easily get maxed out. Four busy cores is around 66% of a hexa-core.
My point is, a hexa-core being utilized at 70% can only mean at least 4 cores' worth of work. It could be 12 threads, but there is no way for you to see it, and since it isn't at 100%, it seems like it isn't using all the cores.
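Rough back-of-the-envelope behind that (a sketch only; the 6-core/12-thread CPU and the ~70% figure are just the numbers from the posts above, and treating an HT pair as roughly one core's worth of throughput is a crude simplification):

```python
# Average utilization on a 6-core/12-thread CPU only tells you how many
# cores' worth of work is being done, not how many threads the game spawned.

logical_cpus = 12         # 6 cores with HT, as in the Lost Planet 2 example
avg_utilization = 0.70    # "all 12 above 70%" -> roughly 70% average

# Crude assumption: an HT sibling pair delivers about one core's throughput.
cores_worth_of_work = logical_cpus * avg_utilization / 2
print(f"~{cores_worth_of_work:.1f} cores' worth of work")        # ~4.2

# And the "4 busy cores is ~66% of a hexa-core" figure:
print(f"4 of 6 cores busy -> {4 / 6:.0%} average utilization")   # 67%
```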
Finally, chips from AMD have always been nicely ordered, pointing at a mostly hand-made layout. I can imagine that this leads to uneven power usage and unnecessarily long circuits and timings. And wastes die space.
Lol, it is the exact opposite: hand layout is much better. Humans are better at finding Eulerian paths and coming up with clever layouts; computers can't really do that as effectively with all of the design rules and other parameters. The difference in performance is 2.6-7x faster with custom-designed circuits.
Really, what happens is a coder will simulate his module and make sure it reaches the targeted timing, which is usually much higher than the actual delay to assure robust operation. If the logic can't reach the speed, it is either rewritten or circuit designers optimize it. In certain logic families it must be entirely custom designed.
Circuits that are custom designed are usually things like power gating, clock distribution, and analog circuits such as PLLs, DLLs, and memory controllers/IO pads.
The L1D dropping to a quarter of the size could possibly indicate two things: an inclusive cache hierarchy and, in rare cases, slightly higher clocks.
Deeper pipelines could then further increase the clocks and simplify the design. I believe an inclusive cache hierarchy would more than compensate for the loss in IPC, and if the BP is better, it could possibly even reduce the time spent in stalls vs. K10, with its weaker BP but shorter pipelines.
...or then... The cache hierarchy remains exclusive and slow. Any benefit of an exclusive cache is lost because of the smaller L1 compared to the L2, and L2 compared to L3. The deeper pipeline reduces IPC but doesn't allow much higher clocks to compensate for it. The aggressive BP isn't good enough to compensate for the deeper pipeline. More L1D misses. Speculative execution leads to wrong decisions and L1I misses too often, greatly worsening IPC in some cases while improving it only marginally in most cases. ...and GF messes up 32nm High-K SOI, effectively ruining the benefit of more cores in the hope of better yields.
An inclusive cache hierarchy would mean that AMD could close the gap between K10 and Nehalem cache performance, and bring BD to SB levels of cache performance.
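For what it's worth, here is the rough capacity arithmetic behind the exclusive-vs-inclusive argument as a Python sketch (all sizes below are placeholder numbers picked for illustration; Bulldozer's actual hierarchy and policy were not confirmed at this point):

```python
# Rough effective-capacity comparison of exclusive vs. inclusive hierarchies.
# Exclusive: a line lives in only one level, so unique data ~ L1 + L2 + L3.
# Inclusive: the L3 duplicates everything in L1/L2, so unique data ~ L3 only.

KB, MB = 1024, 1024 * 1024

# Placeholder sizes for a 4-core chip (illustration only, not real BD specs)
l1d_per_core = 16 * KB
l2_per_core  = 512 * KB
l3_shared    = 8 * MB
cores        = 4

exclusive_capacity = cores * (l1d_per_core + l2_per_core) + l3_shared
inclusive_capacity = l3_shared  # L3 already holds copies of the L1/L2 data

print(f"exclusive: {exclusive_capacity / MB:.2f} MiB of unique data")  # 10.06
print(f"inclusive: {inclusive_capacity / MB:.2f} MiB of unique data")  #  8.00
# The exclusive gain shrinks as the L3 dwarfs L1+L2, which is part of the
# point above: a small L1D gives up little by going inclusive, while the L3
# can then act as a snoop filter and simplify coherence.
```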
AMD says (http://www.youtube.com/watch?v=VIs1CxuUrpc)
"Synthesizable with small number of custom arrays"
Together with what was said before, I think one of the main goals AMD wants to achieve is easily customizable processors: add a GPU core here, some cache there and another core here. From the slide it looks like a lot of their design is already capable of being laid out by a computer.
We have the caches, the integer units and the floating point units as the fixed, hand-optimized blocks, with stuff like the x86 decode organically filling up the space in between. AMD also says it makes it easier to put the whole thing on a different process.
I have only limited knowledge of modern synthesis and floorplanning, from working with some FPGAs.
Maybe Hans or somebody in the industry can say something about Bobcat?
Only an hour and a half until the other NDA on the presentation slides drops.
I hope they'll talk about platforms (both server and desktop) on the second wave of slides.
Another deck of slides, or a good part of them, is up at a certain (well-)known website that deals with hardware inner things :D. I don't want to say which one since the NDA is not over yet (1 hour remaining), so there is a chance it gets pulled :). If you understood my word play you will know which website it is ;).
edit:
Since terrace asked me on the SA forum about the load/store BW, here it is, confirmed: 2x128-bit load and 1x128-bit store capability per core (the slide talks about dedicated integer cores and distinct/non-shared features). This is from the other slide deck with more detailed info on Bulldozer. The thing I noticed about the shared FPU is the dual 128-bit packed integer pipelines, along with the 2 128-bit FMACs that sit inside the FPU monster.
Well, that doesn't actually go as far as saying you can do all 3 (2 loads, 1 store) at the same time...
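Just to put those 128-bit widths into perspective, a quick peak-number calculation; it assumes all three accesses really can issue in the same cycle (which, as noted above, the slide doesn't explicitly confirm), and the 3.5 GHz clock is a purely hypothetical figure:

```python
# Peak L1 data bandwidth per integer core from the quoted widths:
# 2 x 128-bit loads + 1 x 128-bit store per cycle (if all can issue together).

clock_ghz   = 3.5             # hypothetical clock, just to put a scale on it
load_bytes  = 2 * 128 // 8    # 32 bytes/cycle of loads
store_bytes = 1 * 128 // 8    # 16 bytes/cycle of stores

print(f"peak load bandwidth : {load_bytes  * clock_ghz:.0f} GB/s per core")  # 112
print(f"peak store bandwidth: {store_bytes * clock_ghz:.0f} GB/s per core")  #  56
```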
There's now info up at tech report, too.
1) JF told you at the other thread that IPC is higher.
2) If that's true then the higher frequency design comes on top of that.
3) And last but not least: power-gated turbo now allows much higher single-core frequencies.
Looks like a 1-2-3 speed bump for single thread performance to me....
Regards, Hans
Its AMD, you got to have faith they'll give Intel something to think about ;):up:
I love the under dog:hump:
I like how AMD is combating Intel's HT. I'm looking forward to both Bobcat and Bulldozer, although I still wonder how AMD is going to compete in the mobile market against Intel's Atom.
Not exactly sure which context you mean when you say they "do not floorplan", but they definitely allow floorplanning at some level. The first step of synthesis, RTL -> Netlist, doesn't floorplan (it just cares about standard cell usage & timing estimates/constraints), if that's what you're trying to get at. However, the second step of synthesis, Netlist -> Placement (placement tool), definitely does floorplanning.
Tools like Cadence Encounter take floorplan constraints and allow for partitioning sub-modules, however like the picture above shows the results tend to look like a jumbled mess, since strict boundaries aren't adhered to.
Well, historically [x86] chips from both camps have always been mainly custom layout in the datapath with a varying amount of synthesized control logic, seeing pictures like this is a bit of an eye opener from the norm :D
Another example is Intel's Pine-Trail (bigger):
http://www.intel.com/pressroom/enhan...netrail_06.jpg
The huge purple blotch running down the middle is all synthesized logic :yepp:
You pretty much sum up my thoughts on the matter, it looks like they shot for a semi-custom approach by supplying some of the main datapath logic (not necessarily say the whole FPU, etc., just the important chunks) and the arrays as hard-macros/external-IP (in- or out-of-house, doesn't matter) while synthesizing the rest.
While they're definitely not unique in the approach, it will certainly provide a quicker process adaptation, since only a standard cell library and select logic/array-IP pieces would technically be necessary. Granted there's still a bit more work than just swapping libraries/IP and pressing a few buttons :p:
Just pointing out that while we humans can definitely be more adept at coming up with these clever (sometimes novel) solutions to optimizing layout area/timing/congestion constraints, it's also a significant capital and time investment, so it's for ROI and time-to-market reasons that it doesn't always work out. The case of Bobcat is obviously an example of this, and Atom for that matter.
Honestly, I would find the logically optimal Euler path to be much easier for a computer to solve :D
But yes computerized tools aren't very good when it comes to balancing the plethora of added constraints in a physical world, hence us restricting them to sub-optimal standard cells + wiring constraints.
You say that so much, and I think it's easier to just visit another forum, not get so upset, ignore the posts, or even just set certain users to ignore. Notice it's not a big deal to anyone else. The Shintai era is over, dude.
Some people don't like AMD and you'll just have to deal with it. Personally, I am more frustrated with them (AMD) than anything else. I have high standards, especially for big companies, and yes, AMD is a big semiconductor company.
Another nice example is the 1.9W TDP, 2 GHz hard-macro version of the dual-core ARM Cortex-A9 in the TSMC 40G process (total size of only 6.7 mm²).
http://www.arm.com/products/CPUs/Cor...ard-Macro.html
http://www.arm.com/images/A9-osprey-hres.jpg
Regards, Hans
No, it's because Intel is raising their turbo clock by 66 MHz, from 3733 MHz in the 45 nm i7 880 to 3800 MHz in the 32 nm i7 2600. :yepp::yepp::yepp::yepp:
If Intel can only raise it by 66 MHz, then AMD should only be able to raise it by less than that, right? It's fundamentally impossible that AMD can do ANYTHING better, according to some people. :D:D
Yet, they keep on posting and bashing in threads about things they hate.. I mean, what's the point?
Maybe I should join a Britney Spears forum and see what being negative in a forum is all about, because I honestly don't know.
why Britney spears? there's nothing wrong with Britney!.....LEAVE BRITNEY ALONE!!!
http://www.youtube.com/watch?v=kHmvkRoEowc