Dresdenboys' blog: AMD Bulldozer - Patent based research

**Mechromancer** · 11-11-2009, 01:01 PM

Originally Posted by informal

Just in from web cast:

Bulldozer won't have a classic "core" but something AMD calls modules!
Bobcat is alive: sub-1W operation, super low power, but has 90% of the performance of today's mainstream CPUs! Fully modular and ready for APU implementation, has OoO abilities, 2-way execution, very high performance and IMO looks like one BD "module".

Now on to BD: confirmed CMT design! More in a minute!
In: Int units are shared (2x 2-way execution), one 256b-wide FPU. My God, DDboy hit the nail on the head, he is 99% correct in his speculation.

more: highly advanced clock gating, shutting down individual modules for best perf./watt ratio, Turbo-like APM functionality.
AMD states all of this is going to be a game changer.

ALL HAIL DRESDENBOY!

Hyperthreading = PWNT

**informal** · 11-11-2009, 01:07 PM

^^ In the above quote there is a chance it's 2x 4-way int clusters instead of DDboy's speculation about 2x 2-way, since AMD lists 4 "pipes" in the BD module diagram. But I have no idea whether these are simple or complex instructions. In the patents there is a mention of a possible total of 8 (eight!) instructions being executed in parallel (due to the ability to execute an additional 4 fastpath ones in the same clock cycle).

**Mechromancer** · 11-11-2009, 01:13 PM

Originally Posted by informal

^^ In the above quote there is a chance it's 2x 4-way int clusters instead of DDboy's speculation about 2x 2-way, since AMD lists 4 "pipes" in the BD module diagram. But I have no idea whether these are simple or complex instructions. In the patents there is a mention of a possible total of 8 (eight!) instructions being executed in parallel (due to the ability to execute an additional 4 fastpath ones in the same clock cycle).

Once again I say:

AMD =

**Chumbucket843** · 11-11-2009, 01:26 PM

Originally Posted by Mechromancer

Hyperthreading = PWNT

CMT = paper

**informal** · 11-11-2009, 01:27 PM

I just checked again and it is indeed 4-way execution, with 2x 2-way clusters within one module (CPU core), and these two share one wide (256b) SIMD unit. The front end for 4x 4-way would be way too complex and expensive, at least for this generation of products. But it is still an option for future iterations of this (previously) unseen design approach. The fastpath comment still stands (even more so now), since 4 fastpath on top of 4 complex instructions gives us precisely a total of 8 instructions in one cycle, as dresdenboy found out in his research.
What is amazing is the level of detail he "guessed"; he has been correct in almost every part of his speculations. I remember Savantu and his bashing of ddboy's blog, how it is just pure wishful thinking and imagination, how semi companies patent useless stuff all the time etc. Looks like he is this year's honorable bunnysuit winner.

Originally Posted by Chumbucket843

CMT = paper

Yes, for now, but it is a mini-revolution in 2011. The approach is novel and deserves applause, since it's a brave move from AMD.
CMT has been all paper for years now; there are academic research papers, but not one firm has ever even presented a possible design solution. The design is much more potent than half-threading (SMT in Intel's way of doing things), since resource sharing is done much better in hardware (via a common front end, separate int execution units that can share data, and one shared dual-threaded SIMD unit: a best-of-both-worlds approach). How it will work in practice we'll have to wait and see, but AMD stated that one small Bobcat core (based on the same Bulldozer) is at the 90% level of today's mainstream performance, all with that very low power draw.

edit: let's not forget Hans de Vries and his chip-architect website, which detailed this very same approach 7 years ago (IIRC). This was the original Hammer design, not the Sledgehammer aka K8 which AMD launched back in 2003 (not to say K8 wasn't good, quite the opposite). Back in those days Hans presented a possible future core from AMD that resembles exactly what dresdenboy depicted in his diagrams and what AMD presented today.

**Mechromancer** · 11-11-2009, 01:33 PM

Originally Posted by informal

I just checked again and it is indeed 4-way execution, with 2x 2-way clusters within one module (CPU core), and these two share one wide (256b) SIMD unit. The front end for 4x 4-way would be way too complex and expensive, at least for this generation of products. But it is still an option for future iterations of this (previously) unseen design approach. The fastpath comment still stands (even more so now), since 4 fastpath on top of 4 complex instructions gives us precisely a total of 8 instructions in one cycle, as dresdenboy found out in his research.
What is amazing is the level of detail he "guessed"; he has been correct in almost every part of his speculations. I remember Savantu and his bashing of ddboy's blog, how it is just pure wishful thinking and imagination, how semi companies patent useless stuff all the time etc. Looks like he is this year's honorable bunnysuit winner.

Yes, for now, but it is a mini-revolution in 2011. The approach is novel and deserves applause, since it's a brave move from AMD.
CMT has been all paper for years now; there are academic research papers, but not one firm has ever even presented a possible design solution. The design is much more potent than half-threading (SMT in Intel's way of doing things), since resource sharing is done much better in hardware (via a common front end, separate int execution units that can share data, and one shared dual-threaded SIMD unit: a best-of-both-worlds approach). How it will work in practice we'll have to wait and see, but AMD stated that one small Bobcat core (based on the same Bulldozer) is at the 90% level of today's mainstream performance, all with that very low power draw.

Originally Posted by Chumbucket843

CMT = paper

And we'll soon have a runner-up.

**madcho** · 11-11-2009, 01:34 PM

If I understand correctly, 2 cores share 8 int pipelines. So on a dual core with a dual-threaded app you have up to 8 int/clock.
And on the same processor, with a single-threaded app you can have up to 8 int/clock, because it's shared between the 2 cores.
On a quad, with a multithreaded bench with 4 threads you can have up to 16 int/clock, and with only 2 threads you can have up to 16 int/clock if the "good cores" are used. With only one thread, 8/clock.

Phenom II is based on Athlon, with only 3/clock/core.

The performance increase could be amazing if they increase L3 to feed that monster.

**informal** · 11-11-2009, 01:42 PM

Madcho, you are mixing some things up. You need to reread the webcast and look again at dresdenboy's blog.

Anyhow, Charlie D. has a new dirty tidbit:
http://www.semiaccurate.com/2009/11/...rth-has-moved/

Bulldozer has taped out, the earth has moved
More analyst day dirt dug up
by Charlie Demerjian

November 11, 2009

THREE VERY INTERESTING tidbits snuck out in the Q&A session at the AMD analyst day today. It seems that Fusion and the new cores have taped out and are at the fabs.

The new cores were said to begin sampling to OEMs in 2010. When pressed on the timing of tapeouts, one AMD spokesperson said that the fabs were 'running product now'. That means the chips have taped out and the fun is about to begin.

Next up was the process the Fusion cores will be on. The first of them will be made on a silicon-on-insulator (SOI) process, something that makes a lot of sense. It is much easier to port a GPU from bulk silicon to SOI than to do things the other way around. The answer did not preclude bulk silicon variants of Fusion in the future, but since the first generation cores are not made on it, I would not expect that to happen for a while.

The last bit was confirmation of what we have known, or at least have strongly suspected for a while, that the first generation of Fusion products will be a 'stars' core. The optimistic view of this is that AMD is reusing the old K10 variant for time to market reasons. Basically the uncore was done first, and since it is modular, why not use it?

If you are pessimistic, you could see this as the Bulldozer and Bobcat cores being massively late. Given that they were on the roadmap for 45nm and delayed about 2 years ago to 32nm, this has a ring of truth to it. Because it was a planned move, and one that rationalizes a likely untenable earlier schedule, I don't think this is a delay, or even a bad thing. The 'delay' probably avoided another "Barcelona".

In the end, it looks like AMD is on track. 2010 will likely be full of pain, but you can finally see the light at the end of the tunnel. The first of the new parts have taped out, so it is only a matter of time before details start leaking. Then we will know if the grand plan is working, at least on a technical level. S|A

**Dresdenboy** · 11-23-2009, 12:59 PM

I've updated my blog regarding Bulldozer's FMAC units.

The information provided during the Analyst Day simply was not enough to satisfy me (and maybe most of us).

**informal** · 11-23-2009, 01:18 PM

Thanks for the update dresdenboy!

It's amazing how many things "you got right". I still remember some skeptic Intel fans (savantu, where art thou?) who claimed that your patent based research would not be successful at all since companies "patent all kinds of stuff daily" and that the Bulldozer you predicted was some wishful thinking. We all know how that turned out.

Very interesting find on the possible FMAC structure (especially that not-so-confidential-anymore paper).

**Piotrsama** · 11-23-2009, 04:03 PM

This way no [instruction] fusion of FADDs and FMULs (in todays code) is necessary, which would have not only added complexity in the decoders but would only work for certain combinations.

Does that rule out some sort of micro-op fusion like the Core architecture has?

**freeloader** · 11-23-2009, 06:54 PM

So, "in English for the rest of us", how much performance will BD have over Phenom II, roughly?

**Chumbucket843** · 11-23-2009, 07:02 PM

no one here has any idea.

**nn_step** · 11-23-2009, 07:19 PM

Originally Posted by freeloader

So, "in English for the rest of us", how much performance will BD have over Phenom II, roughly?

In what exactly? Because the performance is going to vary based upon the task being used as the basis for comparison.

**freeloader** · 11-23-2009, 07:32 PM

Originally Posted by nn_step

In what exactly? Because the performance is going to vary based upon the task being used as the basis for comparison.

Things that matter to me are Folding@Home and video & audio editing/transcoding.

**god_43** · 11-23-2009, 11:37 PM

It will for sure have i7 power (more, probably), but I think it is really up in the air. From what I understand, this design is very... different/new, and because of this it is hard to tell what type of power it will yield. Any of the gurus care to correct me?

**haylui** · 11-24-2009, 04:36 AM

Originally Posted by Chumbucket843

CMT = paper

AMD didn't choose multi-threading for the Athlon 64 like Intel did for their Pentium 4, and they realized that it was a mistake. So they won't do this again!!!

**haylui** · 11-24-2009, 04:43 AM

Originally Posted by freeloader

So, "in English for the rest of us", how much performance will BD have over Phenom II, roughly?

At the very least 50% over Phenom II, because AMD's engineers know very well that if the minimum 50% can't be achieved, it would be doomed, as Intel will be launching a new architecture to counter Bulldozer's architecture.

From the paper, it is very clear that Bulldozer is going to benefit from the new design in terms of power dissipation and much higher IPC in ALU and FPU. Hopefully Bulldozer can make use of a built-in GPU to do much of the FPU-intensive work.

Expect about 80%~100% over the current Phenom II in certain areas like encoding and ALU; overall, 60%.

**Particle** · 11-24-2009, 06:48 AM

Originally Posted by Chumbucket843

CMT = paper

I don't think that's accurate. Since CMT isn't something they're likely to just tack on at the end and AMD is likely to be experimenting with pieces on silicon at this point, I think it's rather more likely that it isn't just some neat concept paper. At the very least, its physical implementation has probably been designed.

**informal** · 11-24-2009, 06:56 AM

Particle is correct, since Mr Bergman stated in the Q&A session of the Analyst Day that they are twiddling around with the first samples at this moment and that they will be shipping the product to their partners (for evaluation and testing purposes) in the first half of 2010, right around the time the whole range of Magny Cours and Lisbon products is launched.

**Eson** · 11-24-2009, 07:18 AM

Originally Posted by http://www.sun.com/processors/throughput/faqs.html#5

What is chip multithreading (CMT)? How does it differ from chip multiprocessing (CMP) and simultaneous multithreading (SMT)?

Today's traditional single-core processors can only process one thread at a time, spending a majority of time waiting for data from memory. In sharp contrast, chip multithreading (CMT) refers to a processor's ability to process multiple software threads. A CMT processor could implement this multithreaded capability using a variety of methods, such as (i) having multiple cores on a single chip (CMP), (ii) executing multiple threads on a single core (SMT), or (iii) combination of both CMP and SMT.

Didn't AMD say SMT was nothing for them and they focused on CMP?

**haylui** · 11-24-2009, 08:35 AM

Originally Posted by Eson

Didn't AMD say SMT was nothing for them and they focused on CMP?

At the time that was said, less than 0.1% of software supported this, and VMware was only used on servers.
Now VMware is entering the desktop, and more and more software is taking advantage of multi-core and multi-threading.
Things change and so do trends. Intel once thought their CPUs would reach 10GHz in a few years. Weren't they right at the time they said it??
Do not just take a paragraph out of context.

**ajaidev** · 11-24-2009, 09:54 AM

• Two integer clusters share fetch and decode logic but have their own dedicated Instruction and Data cache
• Integer clusters can not be shared between threads: integer cores act like a Chip Multi Processing (CMP) CPU.
• The extra integer core (schedulers, D-cache and pipelines) adds only 5% die space
• L1-caches are similar to Barcelona/Shanghai (64 KB 2-way? Not confirmed)
• Up to 4 modules share a L3-cache and Northbridge
• Two times 4 Bulldozer modules (2 x 8 "cores" or 16 cores) are about 60 to 80% faster than the twelve core Opteron 6100 CPU in SPECInt_rate.

http://it.anandtech.com/IT/showdoc.aspx?i=3681&p=3

Very interesting article for whoever was asking about L1 instruction cache and CMP related info. Lastly, SPECInt_rate: hehe, I am too tired to use the calculator, someone put those percentages into numerical values.
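For anyone trying to keep the "modules" vs "cores" counting straight, here is a minimal sketch of the bookkeeping implied by the bullets above. It assumes what has been said in this thread (two integer cores and one shared FPU per module); the `Package` class and its field names are made up purely for illustration and are not anything AMD has published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    """Hypothetical helper for counting Bulldozer-style resources."""
    dies: int
    modules_per_die: int
    int_cores_per_module: int = 2   # assumption: two integer clusters per module
    fpus_per_module: int = 1        # assumption: one shared 256b FPU per module

    @property
    def modules(self) -> int:
        return self.dies * self.modules_per_die

    @property
    def int_cores(self) -> int:
        return self.modules * self.int_cores_per_module

    @property
    def shared_fpus(self) -> int:
        return self.modules * self.fpus_per_module


# "Two times 4 Bulldozer modules" from the list above.
two_by_four = Package(dies=2, modules_per_die=4)
print(two_by_four.modules, "modules,",
      two_by_four.int_cores, "integer cores,",
      two_by_four.shared_fpus, "shared FPUs")
```

So the "16 cores" figure counts integer cores, while the number of FPUs would only be half that if each module really shares one.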

**kl0012** · 11-24-2009, 11:06 AM

Originally Posted by ajaidev

http://it.anandtech.com/IT/showdoc.aspx?i=3681&p=3

Very interesting article for whoever was asking about L1 instruction cache and CMP related info. Lastly, SPECInt_rate: hehe, I am too tired to use the calculator, someone put those percentages into numerical values.

Depends on frequency, but here it is:
A+ Server 1021M-UR+B, AMD Opteron 2439 SE (12 cores, 2.8GHz) - 215
CELSIUS R670, Intel Xeon W5590 (8 cores, 3.3GHz) - 274 (+27%)
Bulldozer (16 cores 2.8GHz?) - 344-387 (+60%-80%)
SandyBridge (12-16 core server version?) - ???
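The arithmetic behind those numbers can be checked with a short sketch. It only rescales the scores quoted in this post by the claimed +60% to +80%; the Bulldozer range is a projection from that claim, not a measurement, and the function names are just illustrative.

```python
# SPECint_rate figures copied from the post above; the Bulldozer range is a
# projection from the quoted +60% to +80% claim, not a measured result.
OPTERON_2439SE = 215   # 12 cores, 2.8GHz
XEON_W5590 = 274       # 8 cores, 3.3GHz


def project(baseline: float, uplift_pct: float) -> int:
    """Scale a baseline score by a percentage uplift, rounded to a whole score."""
    return round(baseline * (1 + uplift_pct / 100))


def uplift(score: float, baseline: float) -> int:
    """Percentage by which score exceeds baseline, rounded to a whole percent."""
    return round((score / baseline - 1) * 100)


bulldozer_low = project(OPTERON_2439SE, 60)
bulldozer_high = project(OPTERON_2439SE, 80)
xeon_uplift = uplift(XEON_W5590, OPTERON_2439SE)

print(f"Xeon W5590 vs Opteron 2439 SE: +{xeon_uplift}%")
print(f"Projected Bulldozer range: {bulldozer_low}-{bulldozer_high}")
```

Keep in mind the baseline here is the Opteron 2439 SE system, used as a stand-in for the twelve core Opteron 6100 in the article, so the projected range moves with whatever the real Opteron 6100 score turns out to be.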

**informal** · 11-24-2009, 01:06 PM

First of all, we don't know the clocks of Interlagos ATM. Second, there will also be a 2P version of the 16 core variant (4 modules/8 cores in an MCM via Direct Connect 2.0, resulting in 16 cores within a single MPU; 4 DDR3 channels). That one will have massive int/fp rate results. And judging by Dresdenboy's latest blog post about the actual implementation of the FMAC units (bridged as described in patents), the fp/sse part will be brutally strong.
