Raqia@B3D: AMD Bulldozer Core Patent Diagrams [Archive]

Mechromancer

04-27-2009, 06:01 AM

Raqia over at Beyond3D has an interesting post about Bulldozer patents: http://forum.beyond3d.com/showthread.php?t=54018. The interesting link from his post is this one: http://citavia.blog.de/. The blog post itself links to filed patents and patent applications concerning Bulldozer. At this point this is fact, not speculation. What can we glean from this info?

My try to predict AMD's bulldozer core µarchitecture

by Dresdenboy @ 2009-04-15 – 10:37:15 am

It's time for my first blog post after publishing my thoughts in several forums for years.

I want to start with a graphic (first published on planet3dnow) showing what some of AMD's patent applications from last year contained as an exemplary MPU architecture. It's worth noting that this architecture fits nicely with a rumor brought up by Charlie at the Inquirer.

Another hint is this old AMD presentation, which mentions future developments like "Throughput Architecture" and "Cluster-based Multi-threading", although without explicitly stating their planned use. However, both sources tell us about some clusters. Now this is what appeared in the patent applications:

Additionally there are many other interesting bits hidden in many different patent applications (long numbers) and filed patents (shorter numbers):
clustered multithreading with 2 int clusters with each of them having:
  2 ALUs, 2 AGUs
  one L1 data cache
  scheduler, integer register file (IRF), ROB
  (see 20080263373*, 20080209173, 7315935)
a trace cache, not to make cheaper decoders but to quickly recover from a mispredicted branch (7197630 and many others)
read port arbitration for a faster IRF (7315935)
shared FPU supporting ADD, MUL, FMAC etc. and 64 or 128 bit max. operand width (20080263373)
FPU may run in full bit or reduced bit modes to save power (20080209185)
32 byte fetch, 4-way Decoder - multithreaded round robin or depending on queue saturation (20080263373, EP1244962)
fine grained power management (token based, 20080263373) for optimal usage of given TDP/ACP
a lot more speculation (data speculation, cache way prediction, see 7024537, 7028166 and many others)
2 loads from L1 D$ per cycle per cluster (7502914)
maybe 2 cycle effective L1 D$ latency instead of 4 thanks to replaying (7502914)
possibly a shared L2 (7502914)
loop detectors (7130991)
dynamically scalable cache architecture to save power by switching off cache portions or levels (20080104324)
AMD's turbo mode (running cores faster if others are less utilized, 7490254, filed 2005/08/02)
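
The decoder item in the list above ("multithreaded round robin or depending on queue saturation") describes two ways a shared front end could choose which thread to decode for in a given cycle. Below is a minimal Python sketch of both policies. Everything in it (function names, the queue capacity of 16, the drain rates, one drain step per cycle) is an illustrative assumption and is not taken from the patents; only the 4-wide decode width comes from the list.

```python
from typing import List, Sequence


def pick_round_robin(last_thread: int, num_threads: int) -> int:
    """Alternate strictly between threads, ignoring queue state."""
    return (last_thread + 1) % num_threads


def pick_by_queue_saturation(queue_fill: Sequence[int], last_thread: int) -> int:
    """Serve the thread whose queue is emptiest; ties rotate round-robin."""
    num_threads = len(queue_fill)
    # Visit threads starting after the one served last, so ties rotate fairly.
    order = [(last_thread + 1 + i) % num_threads for i in range(num_threads)]
    return min(order, key=lambda t: queue_fill[t])


def simulate(policy: str, drain_rates: Sequence[int], cycles: int,
             width: int = 4, capacity: int = 16) -> List[int]:
    """Return the number of instructions decoded per thread.

    drain_rates[t] is a made-up number of instructions that cluster t
    consumes from its queue each cycle (hypothetical workload model).
    """
    num_threads = len(drain_rates)
    queue_fill = [0] * num_threads
    decoded = [0] * num_threads
    last = num_threads - 1  # so that thread 0 is served first
    for _ in range(cycles):
        if policy == "round_robin":
            thread = pick_round_robin(last, num_threads)
        else:
            thread = pick_by_queue_saturation(queue_fill, last)
        # Decode up to `width` instructions, limited by free queue space.
        accepted = min(width, capacity - queue_fill[thread])
        queue_fill[thread] += accepted
        decoded[thread] += accepted
        last = thread
        # Each integer cluster drains its own queue independently.
        for t in range(num_threads):
            queue_fill[t] = max(0, queue_fill[t] - drain_rates[t])
    return decoded
```

With equal drain rates both policies split the decode slots evenly; when one cluster drains faster, the saturation-based policy gives it more slots, which is presumably the point of offering that option.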

Even if only some of these points turn out to be true for Bulldozer, it will be a very interesting MPU.

* Users registered on http://www.freepatentsonline.com/ may simply click on the links to open the pdf.

http://data5.blog.de/media/291/3428291_45cdf95bfa_m.png

Lightman

04-27-2009, 07:34 AM

Thanks for the link, Mechromancer!

If only half of that finds its way into Bulldozer, it will be a very interesting MPU. Judging from the marketing slides, projected Int and FP performance should be much better than current cores. Of course we don't know the projected core speeds on those slides, but it still looks good.
I just wonder what Intel is planning for 2011.
BTW I really like all these power savings patents in that list :)

Mechromancer

04-27-2009, 08:34 AM

Damn, AMD needed Bulldozer to come out in 2009! I hate so much it got delayed (most likely due to -$$$,$$$,$$$,$$$). It's a damn shame we have to wait until 2011 for this, but I suppose 6, 8, and 12 core K10.5 High-K CPUs will fill in the gap a little :yepp:.

demonkevy666

08-22-2009, 01:41 PM

Bump. I expect Nov 2010 if it's on 45nm.

2011 probably, like Phenom II for 2009....

Mechromancer

08-22-2009, 02:38 PM

I found something interesting while looking for K10.5 block diagrams to compare with the one I posted above. Apparently in January, some Germans read the patent concerning features of Bulldozer and it pretty clearly states SMT capability in the core design! Read what they discussed months ago for yourself, particularly Dresdenboy's post (translated): http://translate.google.com/translate?ie=UTF-8&oe=UTF&u=http%3A%2F%2Fwww.planet3dnow.de%2Fvbulletin%2Fshowthread.php%3Ft%3D354511%26page%3D6&sl=de&tl=en

Go to the second-to-last page of that forum thread (they've kept it going for months!) and you'll find a link to Dresdenboy's blog with an updated Bulldozer diagram he made from viewing the patents.

Here it is for you lazy folk:
http://info.nuje.de/Bulldozer_Core_uArch_0.4.png

Now let's figure out what it all means...

demonkevy666

08-22-2009, 04:21 PM

:o this core is a modular monster.

informal

08-22-2009, 05:29 PM

The diagram clearly illustrates the "clustered" core uarchitecture, first presented by Fred Weber in 2005 at the Analyst meeting.
The BD design that will come in 2011 (or late 2010) is a 2nd or maybe even 3rd iteration of the original idea that was born some 8 years ago at AMD.
The latest iteration, judging by the patents submitted in '07 and '08, is practically a throughput clustered design with the basic core being made of 2 (2-way!) int pipelines and one (4-way!) "super wide" 256-bit SSE/FP unit (a la Sandy Bridge AVX). The design should also have much more advanced power management features, with AMD's version of "Turbo" mode, many new tweaks to the branch prediction, IMC etc. There will be support for all previous and future x86 instruction set extensions (AVX and XOP/FMA4 included).
There is some talk in the patents about "adapting" the ISA capabilities of the cores (i.e. cores can "learn" via firmware instruction set extensions that were unsupported at first; this one may be slated for a 2012 refresh of the BD cores).
All in all, Dresdenboy has done an amazing job at detailing the patents so far. All we can do is wait for AMD's next big Hammer to land (on the opponents :D).

Glow9

08-22-2009, 05:36 PM

Now let's figure out what it all means...

AM3s like the 965 being priced lower and lower each month, then maybe something new in 2012? I'm just going off of recent AMD history. I hope not, but I'm going by a recent article posted on here saying the 965 could be around without anything new until mid 2010.

informal

08-22-2009, 05:39 PM

Bulldozer for desktop is Q1 2011, but looking at AMD's excellent Deneb/Istanbul launches (both server parts were pushed forward), BD may land even in late Q4 2010 IF GF meets its 32nm SOI goals.
Until that happens we may see higher clocked Denebs and a hex core on desktop. Maybe even a new revision of Deneb with 1MB of L2 per core done @ 32nm to tide them over until Bulldozer comes (sort of a 32nm practice run for both GF and AMD).

charged3800z24

08-22-2009, 05:58 PM

Reading this gets me all kinds of excited :p:. I keep thinking that the Phenom II line is just the "tide me over" that AMD is using while it really squares away this new design. This might be why the Phenom line is just getting minor tweaks here and there. Kinda like Intel did with the Core arch. Prescott was getting old quick, but Intel was able to string it along until "WAP" Core Duo hit like a bomb. I have high hopes, and in time we'll see.

demonkevy666

08-22-2009, 05:58 PM

If it's a cluster, will it improve single-threaded performance at all, or just fall off a cliff???

I'd love to see all cores loaded from one thread being woven through each core, hehe.

Mechromancer

08-22-2009, 10:09 PM

From the patent:

SUMMARY

Various embodiments of an apparatus for executing branch predictor directed prefetch operations are disclosed. The apparatus may include an instruction cache, a fetch unit, and a branch prediction unit. According to one embodiment, the branch prediction unit may provide an address of a first instruction to the fetch unit. The fetch unit may send a fetch request for the first instruction to the instruction cache to perform a fetch operation. In response to detecting a cache miss corresponding to the first instruction, the fetch unit may execute one or more prefetch operation while the cache miss corresponding to the first instruction is being serviced. The branch prediction unit may provide an address of a predicted next instruction to the fetch unit. The branch prediction unit may predict the address of the next instruction based on the predicted outcome of various branches in the instruction stream. The fetch unit may send a prefetch request for the predicted next instruction to the instruction cache to execute the prefetch operation.

In one embodiment, in response to detecting a cache miss corresponding to the predicted next instruction, the fetch unit may send a prefetch request for the predicted next instruction to a next level of memory, e.g., an L2 cache. If a cache hit is detected in the next level of memory, the fetch unit may store prefetched instruction data corresponding to the predicted next instruction in the instruction cache. In other embodiments, the fetch unit may store prefetched instruction data corresponding to the predicted next instruction in a prefetch buffer. The prefetch request may be sent to other parts of the memory hierarchy of the system. For instance, if a cache miss is detected in the L2 cache, the prefetch request may be sent to an L3 cache or main memory until the instruction data is found. In other embodiments, the prefetch operation may be aborted if a cache miss is detected in the L2 cache.

In one embodiment, if a cache hit is detected corresponding to the predicted next instruction, the fetch unit may send a next prefetch request for a subsequent predicted instruction to the instruction cache to execute a next prefetch operation. The fetch unit may obtain the address of the subsequent predicted instruction from the branch prediction unit. After servicing the cache miss corresponding to the first instruction, the fetch unit may stop executing prefetch operations and resume execution of fetch operations to the instruction cache.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of one embodiment of an exemplary front end of a processor core;

FIG. 2 is a flow diagram illustrating a method for performing instruction fetch and prefetch operations, according to one embodiment;

FIG. 3 is a block diagram of one embodiment of the processor core of FIG. 1; and

FIG. 4 is a block diagram of one embodiment of a processing unit including multiple processing cores.
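
The fetch/prefetch flow described in the SUMMARY above can be sketched in a few lines of Python. This is only a rough model under stated assumptions: the class and method names, the "one prefetch attempt per cycle" timing, the fixed miss latency, and the use of plain sets as caches are all made up for illustration. On an L2 miss the sketch simply aborts the prefetch, which is one of the embodiments the summary mentions; the other embodiments (prefetch buffer, going on to L3 or main memory) are not modeled.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Set


@dataclass
class FrontEnd:
    icache: Set[int]  # line addresses currently in the instruction cache
    l2: Set[int]      # line addresses currently in the L2 cache
    prefetched: List[int] = field(default_factory=list)

    def prefetch_during_miss(self, predicted: Iterator[int],
                             miss_latency: int) -> None:
        """Issue prefetches along the predicted path while a miss is serviced."""
        for _ in range(miss_latency):  # assumption: one attempt per cycle
            addr = next(predicted, None)
            if addr is None:
                break  # the branch predictor has no further addresses
            if addr in self.icache:
                continue  # I-cache hit: go on to the next predicted address
            if addr in self.l2:
                self.icache.add(addr)  # L2 hit: fill the instruction cache
                self.prefetched.append(addr)
            # L2 miss: abort this prefetch (one embodiment in the summary)

    def fetch(self, addr: int, predicted: Iterator[int],
              miss_latency: int = 3) -> bool:
        """Return True on an I-cache hit; on a miss, prefetch meanwhile."""
        if addr in self.icache:
            return True
        self.prefetch_during_miss(predicted, miss_latency)
        self.icache.add(addr)  # the demand miss has now been serviced
        return False  # prefetching stops and normal fetch resumes


# Hypothetical addresses: a demand miss on 0x0C0 with four predicted targets.
fe = FrontEnd(icache={0x100}, l2={0x140, 0x180})
hit = fe.fetch(0x0C0, iter([0x100, 0x140, 0x1C0, 0x180]), miss_latency=3)
```

In this run only 0x140 gets pulled into the instruction cache: 0x100 was already there, 0x1C0 misses in the L2 and is dropped, and 0x180 is never reached because the demand miss finishes first.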

I can't find a site that has these 4 diagrams referenced above, but what is up with FIG. 4? What are "multiple processing cores"? SMT maybe?

I don't want to get my own hopes up, but this CPU sounds like it will definitely hit us with a ton of bricks like Intel's Core uarch did.

Thank you Informal for your added info. I honestly was not all that excited about Bulldozer, but this has rekindled the fire.

Mechromancer

08-23-2009, 12:35 PM

I just posted an update and continuation of this story on the news page: http://www.xtremesystems.org/forums/showthread.php?t=232784

ajaidev

08-23-2009, 12:58 PM

As I said on the news page, this is a very good core if true. The core is a cluster and reminds me of a semi-modular form of the Nehalem arch. The arch is of course something new altogether. The K10/K10.5 is not so complex; I don't know about Nehalem though. The cores seem to have higher inter-core dependency and communication. Sandy Bridge seems more of an evolution of Nehalem/Clarkdale than a fully new arch.

muziqaz

08-24-2009, 01:06 PM

One of AMD's server PR people said that AMD does not have any plans for a Hyperthreading-like core system. So that means Bulldozer will not have any kind of HT implementation.

FlanK3r

08-25-2009, 08:20 AM

I heard AMD has some "Orochi" samples at this time... but nothing more :(

Lightman

08-25-2009, 09:49 AM

About time!

If they are really lucky, they can probably validate and correct the design within 12-14 months. That would mean a Q3/Q4 2010 launch, which is the most we can count on. But bear in mind AMD is doing a new arch + new node at the same time, so it might not be so smooth for them ... TIME WILL TELL :up:

chew*

08-25-2009, 10:10 AM

I hope these continue the trend of liking the cold.

FlanK3r

08-25-2009, 11:25 AM

chew*, roger :D ... and performance better than PII :)

FlanK3r

08-25-2009, 01:26 PM

Whoa, news at X-bit labs :up:

http://www.xbitlabs.com/news/cpu/display/20090825073221_AMD_s_Bulldozer_Processors_to_Feature_Simultaneous_Multi_Threading_Technology.html

UPDATE: AMD has contacted X-bit labs claiming that it has not announced any simultaneous multi-threading technologies for Bulldozer processors. Still, there are other multi-threading implementations that may still be supported.

Advanced Micro Devices announced during Hot Chips conference that its next-generation code-named Bulldozer microprocessors will feature a multi-threading technology (SMT) which would be akin to Intel Corp.’s well-known HyperThreading.

AMD did not reveal many details about its multi-threading capability and only said that its Bulldozer processors would support it in 2011. Still, it is rather likely that AMD’s approach may be somewhat different compared to Intel’s HT and may even be of the same kind as Sun Microsystems’ simultaneous multi-threading feature that supports execution of four threads on one physical core.

etc...

haylui

08-25-2009, 10:26 PM

Throw more MAC and ALU units into the core, please.
Also don't forget vertex units for multimedia-enhanced apps.
An IPC increase is hard to achieve in a short time. Unlike Intel, which has a lot of resources to throw at fine-tuning IPC, AMD has no extra resources for this research, so it should focus on more units per core.
The same goes for the branch prediction hit rate: shorten the branch a little bit and shorten the rebuild time after a branch miss.