Dresdenboys' blog: AMD Bulldozer - Patent based research

**Mechromancer** · 08-27-2009, 04:47 PM

DRESDENBOY! Glad to have you here finally. Keep up the great work.

PS: While we're at it, let's bring Hans into the fold

**Mechromancer** · 08-27-2009, 06:14 PM

Dresdenboy updated his blog. Check out the link on the first page.

**saaya** · 08-27-2009, 09:09 PM

How about the rumored "reverse hyper-threading" from a few years ago? After reading the paper somebody linked here, I keep getting the impression that somebody back then misunderstood CMP and made up the whole reverse hyper-threading hype, since a clustered core can share its resources to work on one single thread... Yet the same paper makes it sound like CMP overall still sucks big time for single-threaded applications?

So is there a way to go CMP without hurting single-thread performance? Somebody mentioned switching it on and off on the fly?
Or is there some other way to even boost single-thread performance with CMP, for example by running branches on different parts of a clustered core?

**JumpingJack** · 08-27-2009, 10:14 PM

Originally Posted by Dresdenboy

But in case of AMD I think we can take patents in a different way. Intel, IBM and other large companies have a lot of people working on such designs and try to cover many ideas to fill their IP pool. AMD is not so much IP oriented (just protecting itself somehow) because it can't afford to pay for tons of potentially useless patents and waste a lot of the design teams' time for developing "fun architectures" and patenting them. However, if someone developed an idea with some future potential, they might patent it just in case. That likely happens often during the early design stages.

Actually, when one really studies the field, AMD has a very impressive patent portfolio ... it's not quantity, it's quality. I guess that is pretty much what you said.

Jack

**Hans de Vries** · 08-27-2009, 11:48 PM

Originally Posted by JumpingJack

Originally Posted by Dresdenboy

But in case of AMD I think we can take patents in a different way. Intel, IBM and other large companies have a lot of people working on such designs and try to cover many ideas to fill their IP pool. AMD is not so much IP oriented (just protecting itself somehow) because it can't afford to pay for tons of potentially useless patents and waste a lot of the design teams' time for developing "fun architectures" and patenting them. However, if someone developed an idea with some future potential, they might patent it just in case. That likely happens often during the early design stages.

Actually, when one really studies the field, AMD has a very impressive patent portfolio ... it's not quantity, it's quality. I guess that is pretty much what you said.

Jack

Generally AMD patents highlight aspects from a certain micro-architecture
which is the same in a whole set of patents and described in each of them.

These micro-architectures have evolved over the last 10-15 years.
Cluster based superscalar processing has evolved to Cluster based multi-
threading as a more effective way of doing SMT which suffers from a
high pressure on the Load/Store units and the L1 caches as Fred Weber
already mentioned years ago as AMD CTO.

The origins of Cluster based superscalar processing as a successor of
superscalar processing go back even further in time as this DEC patent
filed in 1995 shows: http://www.freepatentsonline.com/6167503.pdf

Regards, Hans

**Dresdenboy** · 08-28-2009, 02:03 AM

Originally Posted by Hornet331

Originally Posted by Dresdenboy

Charlie Demerijan
know....

Well, another case of wrong information by omitting parts of it.

I wrote "Charlie Demerijan seems to know", just for the record. That means I'm not 100% sure about what he knows. He never said that he was just guesstimating, but rather that he had seen the microarchitecture during a meeting with some AMD guy.

@thread:
Hello Hans, nice to see you here.

**Hornet331** · 08-28-2009, 02:46 AM

Originally Posted by Dresdenboy

Well, another case of wrong information by omitting parts of it.

I wrote "Charlie Demerijan seems to know", just for the record. That means I'm not 100% sure about what he knows. He never said that he was just guesstimating, but rather that he had seen the microarchitecture during a meeting with some AMD guy.

And that's exactly my problem with guys like Mr. Demerjian, Fuad, etc. They have always been shown non-public information by "some guy" (an insider, someone who knows an insider, etc.), and most of the time it turns out to be hot air.

Your work, on the other hand, is based on facts, and it's cool to see someone invest so much time in scraping together little pieces and trying to make a picture out of them.

Your gathered information coincidentally fits with one of his articles; maybe he had a reliable source for once. I wouldn't rely on his articles to verify your theory.

**Hans de Vries** · 08-28-2009, 07:26 AM

Originally Posted by Dresdenboy

Hello Hans, nice to see you here.

Hi Matthias

**ajaidev** · 08-28-2009, 07:38 AM

Originally Posted by Dresdenboy

Well, another case of wrong information by omitting parts of it.

I wrote "Charlie Demerijan seems to know", just for the record. That means I'm not 100% sure about what he knows. He never said that he was just guesstimating, but rather that he had seen the microarchitecture during a meeting with some AMD guy.

@thread:
Hello Hans, nice to see you here.

Yep, he seems to know things, but even if he knows, there is a chance he may mislead people. Similar things happened with the RV770 specs: most websites said it had 480-640 SPs, but it actually had 800 SPs.

Since you and Hans are quite knowledgeable about chip architecture, it seems from this thread that the proposed architecture would work in reality. Oh, and any speculation on Sandy Bridge, since we already have a possible die shot of it?

Also, is there a chance that SMT is implemented in a way more similar to POWER7 than to Core i7?

**Smartidiot89** · 08-28-2009, 08:25 AM

Originally Posted by Hornet331

And that's exactly my problem with guys like Mr. Demerjian, Fuad, etc. They have always been shown non-public information by "some guy" (an insider, someone who knows an insider, etc.), and most of the time it turns out to be hot air.

Your work, on the other hand, is based on facts, and it's cool to see someone invest so much time in scraping together little pieces and trying to make a picture out of them.

Your gathered information coincidentally fits with one of his articles; maybe he had a reliable source for once. I wouldn't rely on his articles to verify your theory.

Offtopic: It can't be denied that Charlie has tons of connections, and most of his articles are actually very nice. This Charlie Demerjian hate spawned because he's the opposite of an Nvidia fanboy, so today, even when obvious facts are brought up, we hear the green men coming into threads and typing "CHARLIE DEMERJIAN TEH LULZ!"

Nice to see both Hans de Vries and Dresdenboy in this thread. I will be following it closely.

**Mechromancer** · 08-29-2009, 10:05 AM

Dresdenboy, concerning the diagram below, specifically the decode stage: you previously stated that it looks like this uarch will be able to do up to 8 threads if the threads are separate and made up of 4 individual microcode and 4 individual fastpath instructions. Will software engineers need to target this architecture specifically to get the most efficiency out of that kind of setup? Will this turn out to be another largely unsupported AMD feature like SSE4A, or will it make the majority of multi-threaded software run crazy fast?

**Dresdenboy** · 08-30-2009, 03:42 AM

Originally Posted by Mechromancer

Dresdenboy, concerning the diagram below, specifically the decode stage: you previously stated that it looks like this uarch will be able to do up to 8 threads if the threads are separate and made up of 4 individual microcode and 4 individual fastpath instructions. Will software engineers need to target this architecture specifically to get the most efficiency out of that kind of setup? Will this turn out to be another largely unsupported AMD feature like SSE4A, or will it make the majority of multi-threaded software run crazy fast?

I didn't say 8 threads (just 2), but I wrote about the micro ops. So this means that one thread could have one (probably even more) microcoded (complex) operations decoded, while the other thread could be decoded using the fast path decoders. So one thread with simple ops could be decoded for the appropriate int cluster while the second thread could have some (probably microcoded, see the KGC blog entry) AVX instructions being decoded for the FPU.

**Mechromancer** · 08-30-2009, 06:01 AM

Originally Posted by Dresdenboy

I didn't say 8 threads (just 2), but I wrote about the micro ops. So this means that one thread could have one (probably even more) microcoded (complex) operations decoded, while the other thread could be decoded using the fast path decoders. So one thread with simple ops could be decoded for the appropriate int cluster while the second thread could have some (probably microcoded, see the KGC blog entry) AVX instructions being decoded for the FPU.

Thank you very much. I needed that clarification. Will taking advantage of that capability require specific software enhancements for this architecture or does this look like a method the CPU can handle with any basic software coded with AVX support? As you can tell, I'm no expert by any means, but I want to just have an idea whether or not software engineers have to do AMD specific coding to get these enhancements to work or not (like SSE4A).

**Dresdenboy** · 08-30-2009, 07:30 AM

Originally Posted by Mechromancer

Thank you very much. I needed that clarification. Will taking advantage of that capability require specific software enhancements for this architecture or does this look like a method the CPU can handle with any basic software coded with AVX support? As you can tell, I'm no expert by any means, but I want to just have an idea whether or not software engineers have to do AMD specific coding to get these enhancements to work or not (like SSE4A).

It depends on the instruction mix of the two threads, but for any given code it would be nearly impossible to compile and execute threads in a way to make maximum use of this capability. There might be combinations of threads or processes with for example standard integer code in one thread and a lot of microcoded instructions (an AVX version of Prime95 if AVX would be microcoded) in the other.

**nn_step** · 08-30-2009, 08:07 AM

Originally Posted by Dresdenboy

I didn't say 8 threads (just 2), but I wrote about the micro ops. So this means that one thread could have one (probably even more) microcoded (complex) operations decoded, while the other thread could be decoded using the fast path decoders. So one thread with simple ops could be decoded for the appropriate int cluster while the second thread could have some (probably microcoded, see the KGC blog entry) AVX instructions being decoded for the FPU.

Given that all previous CPUs indicate that microcoded operations are rare, dedicating that many transistors to microcode decoding for a single thread would be wasteful.
Fast path decoding is more energy efficient and cheaper in transistor count.
Thus the logical outcome is that if there are 8 decoders, at most 2 of them would be microcode decoders and the remaining 6 would be fast path.
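
To put rough numbers on that argument, here is a toy steady-state model. It assumes, purely as a simplification for illustration and not as anything taken from the patents, that each decoder handles one instruction per cycle, that fast path decoders take only simple instructions, and that microcode decoders take only microcoded ones. The function name `decode_rate` and the 5% microcode share are hypothetical.

```python
def decode_rate(fast: int, micro: int, micro_fraction: float) -> float:
    """Sustained instructions decoded per cycle under a toy model.

    Assumes each decoder handles one instruction per cycle, fast path
    decoders take only simple instructions, and microcode decoders take
    only microcoded ones. micro_fraction is the share of microcoded
    instructions in the stream (0.0 to 1.0).
    """
    if not 0.0 <= micro_fraction <= 1.0:
        raise ValueError("micro_fraction must be between 0 and 1")
    limits = []
    if micro_fraction < 1.0:
        # Simple instructions must fit into the fast path decoders.
        limits.append(fast / (1.0 - micro_fraction))
    if micro_fraction > 0.0:
        # Microcoded instructions must fit into the microcode decoders.
        limits.append(micro / micro_fraction)
    return min(limits)


# Hypothetical workload: 5% microcoded instructions (an assumed figure).
for fast, micro in [(4, 4), (6, 2)]:
    rate = decode_rate(fast, micro, micro_fraction=0.05)
    print(f"{fast} fast path + {micro} microcode: {rate:.2f} instructions/cycle")
```

Under these assumptions the 6+2 split sustains more instructions per cycle than 4+4 when microcoded instructions are rare; the 4+4 split only wins once microcoded instructions make up a large share of the stream.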

**Dresdenboy** · 08-30-2009, 02:22 PM

Originally Posted by nn_step

Given that all previous CPUs indicate that microcoded operations are rare, dedicating that many transistors to microcode decoding for a single thread would be wasteful.
Fast path decoding is more energy efficient and cheaper in transistor count.
Thus the logical outcome is that if there are 8 decoders, at most 2 of them would be microcode decoders and the remaining 6 would be fast path.

The microcode decoders aren't necessarily full microcode decoders (like the one vector decode/microcode unit in K7-K10) but could be partial decoders. In the end, they just fetch microcode from the microcode ROM. That's also an efficient task (like a table lookup), compared to a fully hardwired fast path decode unit, which needs to switch a lot of transistors to translate an x86 ISA instruction into the right micro ops for the internal execution engines.

**Mechromancer** · 08-30-2009, 04:05 PM

There's some interesting discussion about this topic over at AMDZone. Check it out: http://www.amdzone.com/phpbb3/viewto...st=0&sk=t&sd=a

**Nedjo** · 08-31-2009, 06:51 AM

Originally Posted by Mechromancer

There's some interesting discussion about this topic over at AMDZone. Check it out: http://www.amdzone.com/phpbb3/viewto...st=0&sk=t&sd=a

Heheh, Hans is K6-2 with "K6-2 Fresh Boarder" at the AMDZone forum.

This is definitely the most important piece of info 'cos it comes from John Fruehe:

Originally Posted by JF-AMD

There is nothing to put in jeopardy. I have been consistent with what we have told the press, customers and the public.

Between now and Bulldozer there are no HT/SMT-type technologies in our processors. Can't say what is beyond that because they only let me see up to Bulldozer (I have a ~3-5-year technology horizon for my job).

We have been pretty clear that we believe that a strategy of cores delivers more predictable results than SMT, which can show small performance increases in some workloads and potentially even performance degradation in other workloads. We are pretty transparent on this topic.

If, in the post-bulldozer timeframe enhancements to SMT allow for real scalability and no performance hit, there is no reason that AMD wouldn't consider it. For now, A.) we don't plan it any time soon and B.) really haven't locked down the post-bulldozer products 100% yet.

Bulldozer is a completely new architecture, from the ground up. Any assumptions that you make today are based on your knowledge of existing platforms, not the future platforms.

When you make definitive statements about what bulldozer will/will not be, you run the risk of being wrong. Just as most of the web speculation is wrong as well. I can't comment on the product until we release more data, but I will say that many of the commonly held beliefs about processors may change in the bulldozer timeframe.

**BrowncoatGR** · 08-31-2009, 12:37 PM

Originally Posted by Nedjo

This is definitely the most important piece of info 'cos it comes from John Fruehe:

I can't comment on the product until we release more data, but I will say that many of the commonly held beliefs about processors may change in the bulldozer timeframe.

That is one bold statement

**Mechromancer** · 08-31-2009, 01:02 PM

Bulldozer may very well not have CMT, but the generation after it, or even refreshes (like K8 to K10), may. One thing that may be safe to assume is Bulldozer having a much higher IPC than anything today. Lots of cores with high IPC per core is just fine and doesn't lend itself to the weaknesses of certain multi-threading implementations. I still can't imagine what "commonly held beliefs" JF is talking about. It better be exciting!

**savantu** · 08-31-2009, 09:18 PM

Originally Posted by Mechromancer

Bulldozer may very well not have CMT, but the generation after it, or even refreshes (like K8 to K10), may. One thing that may be safe to assume is Bulldozer having a much higher IPC than anything today. Lots of cores with high IPC per core is just fine and doesn't lend itself to the weaknesses of certain multi-threading implementations. I still can't imagine what "commonly held beliefs" JF is talking about. It better be exciting!

And how exactly do you get that "high IPC"? K10 and Core 2 struggle to reach an average IPC of 1; Nehalem does slightly better (think 1.2), even though it has by far the most advanced prefetchers in the world (thanks to Netburst). That is less than 1/3 of what is possible.

The reason? x86. It simply lacks the ingredients that would allow high IPC to be extracted. This is why Netburst was designed in the first place. You can improve performance in 2 ways:
-increase IPC
-increase frequency
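
As a rough illustration of the two levers listed above, throughput can be modeled to first order as IPC multiplied by clock frequency. The sketch below is only that: the function name `instructions_per_ns` and all the numbers are made up for illustration and are not measurements of any real CPU.

```python
def instructions_per_ns(ipc: float, freq_ghz: float) -> float:
    """First-order throughput model: instructions per cycle times cycles per ns."""
    if ipc < 0 or freq_ghz < 0:
        raise ValueError("ipc and freq_ghz must be non-negative")
    return ipc * freq_ghz


# Hypothetical design points (assumed values, chosen only to show the trade-off).
wide_and_slow = instructions_per_ns(ipc=1.2, freq_ghz=3.0)
narrow_and_fast = instructions_per_ns(ipc=0.8, freq_ghz=4.5)

print(f"wide/slow:   {wide_and_slow:.2f} instructions per ns")
print(f"narrow/fast: {narrow_and_fast:.2f} instructions per ns")
```

In this toy model both design points land at about the same throughput, which is the point of the two-lever view: a lower IPC can be offset by a higher clock and vice versa. It ignores memory stalls, power limits, and everything else that makes real designs hard.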

Since IPC is so hard to get, Intel decided let's try to reach a frequency as high as possible. Of course, discovering new territories back in the late '90s with 180/130/90nm proved quite a challenge. Without a thermal limit, they would have probably reached Power 6 like frequencies ( 5-6GHz ).

This was one of the paths they took. The other was EPIC, a completely new instruction set, designed for parallelism from the ground up. We're talking about an instruction set that tells the processor everything it needs to know, because the optimizations happen at compile time (the compiler inserts hints in the code flow: process this, then that, etc.). In x86 the CPU needs to find the IPC at run time; well, most of the time you're SOL.
Itanium is able to get 4-5 IPC, but it lacks frequency by being so wide.

Basically it boils down to : if you chase IPC you lose frequency because of the complexity ( prefetchers,decoders,run-ahead,scouts); if you chase frequency you can't have a high IPC.
The solution is in the middle : K8/10,Core 2/Nehalem. To expect something revolutionary out, like very high IPC in x86, is simply wishing for pigs to fly. While some do try, most fail.

Why do you think Bulldozer was delayed some 2+ years ? Where are they now compared with the original expectations ? Could it be that Bulldozer is more Niagara like, lots of small, simpler cores with accelerators + GPUs for hard crunching ( FP ) ? That's the path everyone seems to be taking; not the uber Alpha EV8 like cores.

**Boschwanza** · 09-01-2009, 04:41 AM

This is pretty heavy stuff and my knowledge of CPU design is very limited, but as far as I understand it, Dresdenboy's plan describes a CPU that tries to reach an IPC of ~2 on the integer clusters, or at least an IPC above 1, by having one integer cluster speculate heavily to keep the other as busy as possible.

**Smartidiot89** · 09-01-2009, 04:58 AM

Originally Posted by BrowncoatGR

That is one bold statement

Core i7 is the long evolution of Pentium Pro, and Phenom II is that of K7, and probably goes back a bit further than that as well.

Bulldozer is more or less something completely new built from the ground up. The only thing this CPU will have similar to K10.5 and Nehalem/Westmere is x86 pretty much.

**~~Chad Boga~~** · 09-01-2009, 05:01 AM

Originally Posted by Smartidiot89

Core i7 is the long evolution of Pentium Pro, and Phenom II is that of K7, and probably goes back a bit further than that as well.

Bulldozer is more or less something completely new built from the ground up. The only thing this CPU will have similar to K10.5 and Nehalem/Westmere is x86 pretty much.

Fall for marketing much?

Bulldozer may well be a great CPU, but what a crock to suggest that it will be any more radical than what some of the x86 CPU's before it were in their respective time.

**gosh** · 09-01-2009, 05:05 AM

Originally Posted by savantu

And how exactly do you get that "high IPC"? K10 and Core 2 struggle to reach an average IPC of 1; Nehalem does slightly better (think 1.2), even though it has by far the most advanced prefetchers in the world (thanks to Netburst). That is less than 1/3 of what is possible.

Prefetchers on nehalem wouldn't do much. In many situations I think it will be bad for performance. Especially on optimized server applications.
