
Thread: AMD Bulldozer to be launched on 19th September

  1. #76
    I am Xtreme
    Join Date
    Dec 2007
    Posts
    7,750
    I think their initial concern was making a proper module that gives the best perf/mm2 they could. If on top of that they had worried about a massive IPC increase, they might have messed up a lot of the balancing and created accidental bottlenecks.

    I think what will be good to watch is the number of cores for a typical chip (250-300mm2) on each process. Core counts are currently increasing rapidly because the K8 design has only been slightly updated for almost a decade now, so with each shrink they can simply pack in more. So in the next decade we may either see chips with 30+ cores as things continue to just double, or we might see a trend leading to a cap as IPC starts to become the main focus.
    2500k @ 4900mhz - Asus Maximus IV Gene-Z - Swiftech Apogee LP
    GTX 680 @ +170 (1267mhz) / +300 (3305mhz) - EK 680 FC EN/Acetal
    Swiftech MCR320 Drive @ 1300rpms - 3x GT 1850s @ 1150rpms
    XS Build Log for: My Latest Custom Case

  2. #77
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Quote Originally Posted by Manicdan View Post
    I think their initial concern was making a proper module that gives the best perf/mm2 they could. If on top of that they had worried about a massive IPC increase, they might have messed up a lot of the balancing and created accidental bottlenecks.

    I think what will be good to watch is the number of cores for a typical chip (250-300mm2) on each process. Core counts are currently increasing rapidly because the K8 design has only been slightly updated for almost a decade now, so with each shrink they can simply pack in more. So in the next decade we may either see chips with 30+ cores as things continue to just double, or we might see a trend leading to a cap as IPC starts to become the main focus.
    IPC won't be the main focus for a long time; more cores and ISA extensions will be. Intel did manage to get some good increases out of the Core design, that is true, but with Haswell you can expect something more like an on-board coprocessor and more specialized instructions for specific tasks. And yes, more cores. The same goes for Bulldozer's successors.

  3. #78
    Xtreme Enthusiast
    Join Date
    Oct 2005
    Location
    Melbourne, Australia
    Posts
    529
    Quote Originally Posted by informal View Post
    Indeed, IPC will probably increase with the Bulldozer core, but not by much - think 10% in int and similar for FP (without recompile). With Llano we had a >6% gain (AT measured up to 14% and as low as 3%), just from some smaller tweaks and upping the L2 cache size to 1MB. With BD, we have a whole new L2 cache that is now shared, we have some massive pipeline changes (to allow higher clocks) and we have some major microarchitectural improvements in instruction fetch, decode, branch prediction and, last but not least, prefetching. All these are a big step up and will offset any IPC impact that the longer pipeline (or the loss of the 3rd ALU) might have had on Bulldozer's average IPC.
    IPC was definitely not a focus of Llano. It's basically just K10.5 plus a GPU, and minor changes. And BD has a completely new core, so it's not really comparable.

    IPC is definitely not as high as AMD could have made it, but I think it lands somewhere in the ballpark of SB (where exactly, I don't know).

    You do need a good balance of IPC and frequency (not just focus on one and ignore the other, e.g. Netburst), and I think AMD tried to be aggressive up to a point in both, but they do often trade off against each other (as well as against power consumption and die area).

    So yes it is more complex than saying "AMD focused on getting IPC up", but it's not completely inaccurate.
    Last edited by Apokalipse; 08-30-2011 at 01:16 PM.

  4. #79
    Xtreme Mentor
    Join Date
    Mar 2006
    Posts
    2,978
    Quote Originally Posted by v0dka View Post
    Can you elaborate on that? I seriously doubt that BD wasn't meant to be faster clock-for-clock. AMD hasn't pushed real IPC improvements in K10 so that would mean that they would be satisfied with IPC that is comparable to K8 (2003?).
    http://ieeexplore.ieee.org/Xplore/lo...hDecision=-203

    Compared to previous AMD x86-64 cores, project goals reduce the number of FO4 inverter delays per cycle by more than 20%, while maintaining constant IPC, to achieve higher frequency and performance in the same power envelope, even with increased core counts.
    AMD published 3 articles on Bulldozer at ISSCC 2011. This one describes the integer cores and scheduler in detail; another paper describes the details of the module, sharing, and circuit design; and a third discusses the L3 cache design. All three were very interesting reads.

    Jack
    One hundred years from now It won't matter
    What kind of car I drove What kind of house I lived in
    How much money I had in the bank Nor what my clothes looked like.... But The world may be a little better Because, I was important In the life of a child.
    -- from "Within My Power" by Forest Witcraft

  5. #80
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    "while maintaining constant IPC" is the key word. The design is clearly optimized for good average IPC,keeping the pipelines fed via the aggressive prefetch and 3 separate schedulers.

  6. #81
    Xtreme Enthusiast
    Join Date
    Oct 2005
    Location
    Melbourne, Australia
    Posts
    529
    Does "constant IPC" mean that it doesn't vary much between different loads? (ie being able to keep feeding the cores with aggressive prefetching, etc)

  7. #82
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Yes, at least that's how I read it.

  8. #83
    Xtreme Addict
    Join Date
    Jul 2007
    Posts
    1,488
    There are always going to be significant IPC differences between various types of workloads. Some algorithms will naturally make better use of certain architectures than others.

    The way I read "constant IPC" was as maintaining a consistently high IPC within any given workload. As opposed to, say, blazing through the arithmetic only to stall hard because of a mispredicted branch. The new branch prediction, prefetch, beefy decoding, cache hierarchy, etc. I think back that up. It's to keep the reduced number of ALUs consistently fed. If they were going for maximum IPC they could have added extra execution units as well. It would benefit certain types of code with high ILP, but the extra units would be hard to keep fed on average and so IPC would fluctuate more.
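
    For example (a rough sketch in plain C, nothing BD-specific, just my own illustration of the mispredict stall): the loop does the same arithmetic either way, but the data-dependent branch is predictable when the array is sorted and essentially random when it isn't, and on most out-of-order CPUs the random case runs noticeably slower because mispredicts flush the pipeline and average IPC drops.

    Code:
    /* Hypothetical illustration: same instruction mix, different branch
     * predictability. With random data the branch mispredicts often and the
     * pipeline stalls; with sorted data it is almost always predicted. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 20)

    static long long sum_if_big(const int *v, size_t n)
    {
        long long sum = 0;
        for (size_t i = 0; i < n; i++)
            if (v[i] >= 128)          /* branch: predictable only if data is sorted */
                sum += v[i];
        return sum;
    }

    static int cmp_int(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    int main(void)
    {
        int *v = malloc(N * sizeof *v);
        if (!v)
            return 1;
        for (size_t i = 0; i < N; i++)
            v[i] = rand() % 256;

        clock_t t0 = clock();
        long long a = sum_if_big(v, N);   /* random order: many mispredicts */
        clock_t t1 = clock();

        qsort(v, N, sizeof *v, cmp_int);
        clock_t t2 = clock();
        long long b = sum_if_big(v, N);   /* sorted: branch almost always predicted */
        clock_t t3 = clock();

        printf("random: %lld in %ld ticks, sorted: %lld in %ld ticks\n",
               a, (long)(t1 - t0), b, (long)(t3 - t2));
        free(v);
        return 0;
    }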

  9. #84
    Xtreme Member
    Join Date
    Apr 2007
    Location
    Serbia
    Posts
    102

    IPC depends on software optimisations, but it is also hardware dependent. All CPUs have relatively similar peak IPC, but average IPC isn't the same. Average CPU IPC reflects how much ILP the software exposes on a certain CPU, or even a GPU. That average IPC is determined by memory access, cache hit rate, how the CPU handles out-of-order execution, the number of execution units, the efficiency of branch prediction, the type of workload, etc...
    If you have a CPU with a long pipeline, a bad branch predictor, high memory latency and an average prefetcher, but on the other hand lots of FP execution resources and high memory bandwidth, that CPU could potentially be a bad performer with pure branchy integer code, but will be a monster at accessing parallel data and SIMD execution.
    However, CPUs are trade-offs in most cases.
    In a Bulldozer module we have one flexible FPU, which has almost the same area as the 10h FPU. But that FPU, with two threads from two integer cores, has almost the same IPC per thread as the 10h FPU alone. With two threads the FlexFPU is completely utilised and there are no wasted resources. Also the front end: an average workload has an IPC of no more than 1, and the BD front end can decode up to 4 instructions. That is enough to feed the execution engine with two strong threads and has minimal impact on performance per thread per core (not per module).

    My point is that if you have a strong FPU and a weak integer core, the CPU is not balanced for most workloads. Likewise, if you have a strong integer core and a weak FPU, the CPU isn't balanced well. You can build an ultimate-IPC CPU with an ultimate FPU, but that is expensive, or can't clock high, or isn't power efficient. If instead you make trade-offs in the critical elements, you might lose 5-10% of IPC, but you get a less expensive CPU that clocks higher and dissipates less energy.

    Because of that, IPC isn't everything; it is one part of the whole picture. We all want a CPU with good IPC, that clocks higher than today's CPUs, that also has good single-thread performance and a cool-running core. That is more important than how many DP ops per cycle the FPU in such a CPU can execute.

    BD will be very good if IPC per module is 15-25% higher than 10h and frequency is 15-20% higher, in the same or a lower power envelope.

    Quote Originally Posted by Solus Corvus View Post
    It's to keep the reduced number of ALUs consistently fed. If they were going for maximum IPC they could have added extra execution units as well. It would benefit certain types of code with high ILP, but the extra units would be hard to keep fed on average and so IPC would fluctuate more.
    Two ALUs aren't that much of a problem because the L1 data cache is 2-ported; there are only two memory operations per cycle, on 10h or even BD. There are two AGUs used for memory address calculations. Most integer operations are movs (reg-mem, mem-reg, etc.). The BD integer core will be much stronger than 10h because of much better memory-level parallelism.
    Last edited by drfedja; 08-31-2011 at 04:39 AM.
    "That which does not kill you only makes you stronger." ---Friedrich Nietzsche
    PCAXE

  10. #85
    Xtreme Addict
    Join Date
    Jul 2007
    Posts
    1,488
    Quote Originally Posted by drfedja View Post
    Two ALUs aren't that much of a problem because the L1 data cache is 2-ported; there are only two memory operations per cycle, on 10h or even BD. There are two AGUs used for memory address calculations. Most integer operations are movs (reg-mem, mem-reg, etc.). The BD integer core will be much stronger than 10h because of much better memory-level parallelism.
    I didn't say it was a problem. Sometimes you really can do more with less. If they had kept the extra ALU or even added more to satisfy certain high-ILP code (i.e., not average) then they would just be wasting power and transistors that could have been, and were, spent elsewhere (more cores).

  11. #86
    Xtreme Enthusiast
    Join Date
    Oct 2005
    Location
    Melbourne, Australia
    Posts
    529
    Quote Originally Posted by drfedja View Post
    My point is that if you have a strong FPU and a weak integer core, the CPU is not balanced for most workloads. Likewise, if you have a strong integer core and a weak FPU, the CPU isn't balanced well. You can build an ultimate-IPC CPU with an ultimate FPU, but that is expensive, or can't clock high, or isn't power efficient. If instead you make trade-offs in the critical elements, you might lose 5-10% of IPC, but you get a less expensive CPU that clocks higher and dissipates less energy.

    Because of that, IPC isn't everything; it is one part of the whole picture. We all want a CPU with good IPC, that clocks higher than today's CPUs, that also has good single-thread performance and a cool-running core. That is more important than how many DP ops per cycle the FPU in such a CPU can execute.

    BD will be very good if IPC per module is 15-25% higher than 10h and frequency is 15-20% higher, in the same or a lower power envelope.
    Right. I mean, it wouldn't be that hard to increase IPC by 25 or 30% from K10.5. But it would be very hard to do it without using a lot of transistors, using a lot of power and/or sacrificing a lot of frequency.
    Instead of increasing IPC in any way possible, you'd want to increase IPC in inexpensive ways, and try not to spend transistor budget in areas that won't get much use.

  12. #87
    Xtreme Member
    Join Date
    Apr 2007
    Location
    Serbia
    Posts
    102
    Quote Originally Posted by Solus Corvus View Post
    I didn't say it was a problem. Sometimes you really can do more with less. If they had kept the extra ALU or even added more to satisfy certain high-ILP code (i.e., not average) then they would just be wasting power and transistors that could have been, and were, spent elsewhere (more cores).
    Maybe in rare cases, but on average ILP depends much more on MLP than on how many ALUs you have.

    Quote Originally Posted by Apokalipse View Post
    Right. I mean, it wouldn't be that hard to increase IPC by 25 or 30% from K10.5. But it would be very hard to do it without using a lot of transistors, using a lot of power and/or sacrificing a lot of frequency.
    Instead of increasing IPC in any way possible, you'd want to increase IPC in inexpensive ways, and try not to spend transistor budget in areas that won't get much use.
    Increasing ILP (or IPC) matters more for executing non-parallel (single-threaded) code; for highly parallel code, what matters more is how many cores you have and how much power they dissipate. That is the main paradigm of Bulldozer's CMT - good single-thread ILP with use of all the shared resources, and average multi-thread ILP with lots of low-power cores. If you take a fat CPU core and add a little, you get two tiny cores. Because of that, a BD module acts like one fat CPU core, or like two tiny ones within the power budget of one fat core.
    "That which does not kill you only makes you stronger." ---Friedrich Nietzsche
    PCAXE

  13. #88
    Xtreme Addict
    Join Date
    Jul 2007
    Posts
    1,488
    Quote Originally Posted by drfedja View Post
    Maybe in rare cases, but on average ILP depends much more on MLP than on how many ALUs you have.
    ILP isn't at all dependent on the number of ALUs. Having more ALUs, and a smart front end, only allows you to take advantage of ILP when present and convert it into IPC, but the ILP is fixed.

    ILP depends on the specific code being executed. Instructions that depend on the results of previous instructions reduce ILP. Some pieces of code don't have any ILP and only clocks or speculative execution can help.
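
    To make that concrete (just a rough sketch in plain C, my own example): both loops below do the same number of additions, but the first is one long dependency chain with almost no ILP, while the second splits the work into four independent accumulators, exposing ILP that a superscalar/out-of-order core can turn into higher IPC.

    Code:
    /* Hypothetical illustration of ILP being a property of the code:
     * the first loop is one long dependency chain (each add needs the previous
     * result), while the second uses four independent accumulators, exposing
     * ILP that the hardware can convert into higher IPC. */
    #include <stddef.h>

    double sum_serial(const double *x, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i];                  /* every add depends on the previous add */
        return s;
    }

    double sum_parallel(const double *x, size_t n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {    /* four independent dependency chains */
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        for (; i < n; i++)              /* leftover elements */
            s0 += x[i];
        return (s0 + s1) + (s2 + s3);
    }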

  14. #89
    Xtreme Member
    Join Date
    Apr 2007
    Location
    Serbia
    Posts
    102
    Quote Originally Posted by Solus Corvus View Post
    ILP isn't at all dependent on the number of ALUs. Having more ALUs, and a smart front end, only allows you to take advantage of ILP when present and convert it into IPC, but the ILP is fixed.
    Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. A goal of compiler and processor designers is to identify and take advantage of as much ILP as possible.
    ILP of course depends on a smart front end, and also on the memory ordering subsystem. A CPU with good memory disambiguation will have more ILP than a CPU without it.
    When I talk about memory-level parallelism I mean how the CPU handles load/store operations; loads and stores are one of the CPU's main functions. ILP and IPC are different words for the same thing; the only difference is how it is measured. IPC measures how many instructions are executed per cycle. ILP is not a measurement, it is a characteristic - for example, how many instructions are executed per minute, second or cycle is a characteristic of instruction throughput.

    ILP depends on the specific code being executed. Instructions that depend on the results of previous instructions reduce ILP. Some pieces of code don't have any ILP and only clocks or speculative execution can help.
    ILP and IPC depend on the specific code. Strictly speaking, instructions don't depend on previous instructions; data has the dependencies. If the n-th instruction uses data calculated by the (n-1)-th instruction, it has a data dependency on that previously calculated result. Because of data dependencies, instructions must wait in the scheduler for the data they need.
    Speculative memory address calculation can help with data dependencies.
    Last edited by drfedja; 08-31-2011 at 08:33 AM.
    "That which does not kill you only makes you stronger." ---Friedrich Nietzsche
    PCAXE

  15. #90
    Xtreme Addict
    Join Date
    Jul 2007
    Posts
    1,488
    Quote Originally Posted by drfedja View Post
    Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. A goal of compiler and processor designers is to identify and take advantage of as much ILP as possible.
    ILP of course depends on a smart front end, and also on the memory ordering subsystem. A CPU with good memory disambiguation will have more ILP than a CPU without it.
    [...]
    ILP and IPC are different words for the same thing; the only difference is how it is measured.
    The first part is true, and is an unsourced quote from Wikipedia. The rest isn't true: ILP is a function of the code, it doesn't depend at all on the hardware it is being run on.

    IPC and ILP are related but not the same thing. ILP is a measure of how much total parallelism, at the instruction level, exists in any given code. Instructions are either dependent on each other or they aren't. IPC is a measure of how effective any particular CPU is at recognizing and utilizing that ILP, or speculatively bypassing dependencies, to keep available processing units occupied and how long it takes those units to complete the given instructions.

    Maybe an analogy would help: think of Thread Level Parallelism. A program with 2 threads, for example, has a TLP of 2. It has that TLP regardless of whether it is running on a single-core or dual-core CPU; obviously a dual core would be able to utilize that inherent parallelism better. Even though the TLP of the program doesn't change depending on the CPU, the throughput (a rough proxy for IPC at the TLP level) will be higher on the CPU that can recognize and utilize that parallelism.
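
    Something like this (a rough POSIX-threads sketch, just my own illustration of the analogy): the program always exposes two threads' worth of work, i.e. TLP = 2, whether it runs on one core or two; only the achieved throughput changes with the hardware.

    Code:
    /* Hypothetical illustration of the TLP analogy: this program always exposes
     * two threads' worth of parallelism (TLP = 2). On a single core the threads
     * are time-sliced; on two or more cores they can really run simultaneously.
     * The program's TLP does not change, only how well the CPU exploits it. */
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 100000000UL; i++)
            x += i;                      /* independent busy work per thread */
        (void)arg;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        puts("both workers finished");
        return 0;
    }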

    The way I see it, AMD perceives that the code its target market mostly runs depends more on MLP and TLP than on further increases in IPC. Instead of fighting with Intel over diminishing returns in IPC, at a high cost in power and transistors, they are targeting the relatively unexploited parallelism in highly threaded, data-driven applications. This says to me that they are aiming, first and foremost, directly at retaking the server and scientific market.

  16. #91
    Xtreme Member
    Join Date
    Apr 2007
    Location
    Serbia
    Posts
    102
    Quote Originally Posted by Solus Corvus View Post
    The first part is true and an unsourced quote from wikipedia. The bolded part isn't true. ILP is a function of the code, it doesn't depend at all on the hardware it is being run on.
    OK... I may have missed something... the definition from the wiki says: Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously (from Wikipedia).

    IPC is a measure of how effective any particular CPU is at recognizing and utilizing that ILP, or speculatively bypassing dependencies, to keep available processing units occupied and how long it takes those units to complete the given instructions.
    ...that explains why profilers use IPC for measuring the performance of code. If your code has more ILP, IPC will be higher and the performance of the code could be higher (though not necessarily).
    "That which does not kill you only makes you stronger." ---Friedrich Nietzsche
    PCAXE

  17. #92
    Xtreme Addict
    Join Date
    Jul 2007
    Posts
    1,488
    Quote Originally Posted by drfedja View Post
    ...that explains why profilers use IPC for measuring the performance of code. If your code has more ILP, IPC will be higher and the performance of the code could be higher (though not necessarily).
    Profilers use IPC because it is the more useful measurement. ILP is how many instructions could theoretically be executed in parallel given an idealized CPU. IPC is related to how many instructions actual hardware can execute simultaneously. Rather than some theoretical figure that doesn't really help, IPC is telling you how the code is playing out in real hardware.
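
    For what it's worth, here's roughly how that measurement works under the hood (a Linux-only sketch using the perf_event_open syscall, minimal error handling, just my own illustration): count retired instructions and core cycles around the code you care about and divide.

    Code:
    /* Rough Linux-only sketch: count retired instructions and cycles around a
     * piece of work and print instructions-per-cycle, which is roughly what a
     * profiler reports as IPC. */
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static int open_counter(uint64_t config)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = config;              /* instructions or cycles */
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        /* measure this process on any CPU */
        return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int fd_ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
        int fd_cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
        if (fd_ins < 0 || fd_cyc < 0) {
            perror("perf_event_open");
            return 1;
        }

        ioctl(fd_ins, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd_cyc, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd_ins, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(fd_cyc, PERF_EVENT_IOC_ENABLE, 0);

        volatile double x = 0.0;           /* the workload being measured */
        for (long i = 1; i < 50000000L; i++)
            x += 1.0 / (double)i;

        ioctl(fd_ins, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(fd_cyc, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t ins = 0, cyc = 0;
        if (read(fd_ins, &ins, sizeof(ins)) != sizeof(ins)) ins = 0;
        if (read(fd_cyc, &cyc, sizeof(cyc)) != sizeof(cyc)) cyc = 0;
        printf("instructions=%llu cycles=%llu IPC=%.2f\n",
               (unsigned long long)ins, (unsigned long long)cyc,
               cyc ? (double)ins / (double)cyc : 0.0);

        close(fd_ins);
        close(fd_cyc);
        return 0;
    }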

    Generally more ILP is good, but it's not a rule. ILP is just a measure of how many instructions can be executed in parallel, it doesn't tell us anything about how much useful work is being accomplished by those instructions. There are cases where a lot of parallel but simple instructions accomplish less than a few serial but more complex instructions (such as a bunch of individual math ops VS a single AVX or SSE instruction), or where a more efficient version of an algorithm (or one that is a better match for a given CPU's branch prediction capability, etc) has less ILP but still accomplishes a task faster than a less efficient algorithm. Again, this is why measuring the real performance of your code on actual hardware with a profiler is important.
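
    As a rough illustration of the "bunch of individual math ops vs a single SSE instruction" case (my own sketch with SSE intrinsics, nothing BD-specific): both functions below compute the same four sums, but the scalar one issues four independent adds (plenty of ILP on paper) while the SSE one does it with a single packed add, so comparing their ILP or IPC tells you little about which finishes the work faster.

    Code:
    /* Hypothetical illustration: both functions compute the same four sums.
     * The scalar version issues four independent adds (high ILP, several
     * instructions); the SSE version does the same work with one packed add
     * (fewer instructions, "less ILP" on paper). Raw ILP/IPC numbers therefore
     * don't say which version finishes the task faster -- profile on real
     * hardware instead. */
    #include <xmmintrin.h>   /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

    void add4_scalar(const float *a, const float *b, float *out)
    {
        out[0] = a[0] + b[0];
        out[1] = a[1] + b[1];
        out[2] = a[2] + b[2];
        out[3] = a[3] + b[3];
    }

    void add4_sse(const float *a, const float *b, float *out)
    {
        __m128 va = _mm_loadu_ps(a);            /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(out, _mm_add_ps(va, vb)); /* one instruction adds all 4 */
    }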

  18. #93
    Xtreme Member
    Join Date
    Jan 2011
    Location
    Slovakia
    Posts
    169
    drfedja
    BD will be very good if IPC per module is 15-25% higher than 10h and frequency is 15-20% higher, in the same or a lower power envelope.
    I agree. Llano is also better by 6-7%, and those were just some minor tweaks plus a doubled L2 cache.
    Now BD has an integer cluster that is 15% bigger than Llano's, and the whole module (excluding the second integer cluster) is 60% bigger. I think we will see a healthy increase in IPC.

    Compared to previous AMD x86-64 cores, project goals reduce the number of FO4 inverter delays per cycle by more than 20%, while maintaining constant IPC, to achieve higher frequency and performance in the same power envelope, even with increased core counts.
    If that's true, it would be great for notebook Trinity, because honestly the Llano APU for notebooks is too limited by TDP, and that's why the CPU frequency is so low.

