
Thread: Can Llano do AVX?

  1. #76
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Location
    Buenos Aires, Argentina
    Posts
    644
    Then, the Sandy Bridge GPU would be substantially faster than Clarkdale, and slower than the Fusion GPU. I agree that that is as accurate as we can get, but still, that range is a pretty huge void.

    The memory bandwidth issue is pretty interesting because we could be talking about a potentially HUGE bottleneck there. The Radeon 5570 and 5670 nominal specifications call for DDR3 and GDDR5 respectively, with a 128-bit bus width for both, at 900 MHz and 1000 MHz, providing a theoretical 28.8 and 64 GB/s. If Llano uses a dual-channel IMC, you could get a theoretical 21.3 and 25.6 GB/s for DDR3 at 1333 MHz and 1600 MHz respectively, with the important difference that on Llano you are going to share it between the GPU and the other 4 cores. Actually, we can get a coherent idea of how much it can impact performance right now by benchmarking how much these two video cards scale down with lower VRAM frequencies, which would provide less memory bandwidth and higher access latency.
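    To make the numbers explicit, here is a rough back-of-the-envelope calculation in Python (a sketch only, using the nominal specs cited above; real sustained bandwidth will be lower):

    ```python
    # Rough theoretical-bandwidth estimates for the parts discussed above.
    # Nominal specs only; real sustained bandwidth is lower.

    def peak_gbps(bus_width_bits, effective_mts):
        """Peak bandwidth in GB/s = bus width in bytes * effective transfer rate in MT/s / 1000."""
        return bus_width_bits / 8 * effective_mts / 1000

    # Discrete cards, 128-bit bus:
    hd5570_ddr3  = peak_gbps(128, 900 * 2)    # DDR3 @ 900 MHz   -> 1800 MT/s ~= 28.8 GB/s
    hd5670_gddr5 = peak_gbps(128, 1000 * 4)   # GDDR5 @ 1000 MHz -> 4000 MT/s ~= 64.0 GB/s

    # Llano-style dual-channel DDR3 IMC (2 x 64-bit channels),
    # shared between the GPU and the 4 CPU cores:
    llano_1333 = peak_gbps(128, 1333)         # ~21.3 GB/s
    llano_1600 = peak_gbps(128, 1600)         # ~25.6 GB/s

    print(hd5570_ddr3, hd5670_gddr5, llano_1333, llano_1600)
    ```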

  2. #77
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    I'm not researching marketing strategies, but I think these Fusion designs are meant for notebooks and low-power (likely also low-budget) PCs, where currently the situation is like one of these:
    1) (LFB == ) GPU =HT3= CPU == DDR2/DDR3
    gpu <-> mem latency high, bandwidth medium (textures, vertex data)
    gpu <-> cpu latency medium, bandwidth medium (commands/shader code, vertex data)
    result code: +ooo (sum: +)

    2) GDDR5 == GPU =PCIe= Chipset =HT3= CPU == DDR2/DDR3
    gpu <-> mem latency low, bandwidth high
    gpu <-> cpu latency high, bandwidth medium
    result code: +-+o (sum: +)

    And Fusion will be:
    GPU-CPU == DDR3
    gpu <-> mem latency low, bandwidth medium (but CPU cores have twice the L2 for less pressure on IMC compared to 1))
    gpu <-> cpu latency low, bandwidth high
    result code: +o++ (sum: +++)

    And fewer bridges and external interconnects save power. Same performance at less power = more performance/watt, which means you can actually do more at e.g. 50 W than with design 1).
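    To put the performance/watt point in concrete terms, a tiny sketch with purely hypothetical overhead numbers (my own illustration, not measured figures): at a fixed platform budget, watts not spent on bridges and external links are watts left for actual CPU/GPU work.

    ```python
    # Illustrative only: the interconnect/bridge overheads below are hypothetical,
    # chosen just to show the shape of the argument.

    BUDGET_W = 50.0

    design_overhead_w = {
        "1) NB IGP + LFB (HT3 + bridge chip)":         8.0,
        "2) discrete GPU (PCIe + bridge + GDDR5 I/O)": 12.0,
        "3) Fusion (on-die links only)":                2.0,
    }

    for name, overhead in design_overhead_w.items():
        usable = BUDGET_W - overhead
        print(f"{name:45s} {usable:4.1f} W left for compute ({usable / BUDGET_W:.0%} of budget)")
    ```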
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  3. #78
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by zir_blazer View Post
    Then, the Sandy Bridge GPU would be substantially faster than Clarkdale, and slower than the Fusion GPU. I agree that that is as accurate as we can get, but still, that range is a pretty huge void.

    The memory bandwidth issue is pretty interesting because we could be talking about a potentially HUGE bottleneck there. The Radeon 5570 and 5670 nominal specifications call for DDR3 and GDDR5 respectively, with a 128-bit bus width for both, at 900 MHz and 1000 MHz, providing a theoretical 28.8 and 64 GB/s. If Llano uses a dual-channel IMC, you could get a theoretical 21.3 and 25.6 GB/s for DDR3 at 1333 MHz and 1600 MHz respectively, with the important difference that on Llano you are going to share it between the GPU and the other 4 cores. Actually, we can get a coherent idea of how much it can impact performance right now by benchmarking how much these two video cards scale down with lower VRAM frequencies, which would provide less memory bandwidth and higher access latency.
    just look at how little sideport actually helped igp boards...
    it gives them a perf boost of 10-25%... so i don't think memory bandwidth is such a big issue for the igp...

  4. #79
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Location
    Buenos Aires, Argentina
    Posts
    644
    Quote Originally Posted by saaya View Post
    just look at how little sideport actually helped igp boards...
    it gives them a perf boost of 10-25%... so i don't think memory bandwidth is such a big issue for the igp...
    But that is because Sideport gives overkill memory bandwidth to a very small GPU. The Llano GPU is not small, and it is also sharing that bandwidth. Though I take it from an informal comment in the other thread that the IMC was "THE KEY" component, I actually think that it must be very efficient if it wants to be successful.

  5. #80
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by zir_blazer View Post
    But that is because Sideport gives overkill memory bandwidth to a very small GPU. The Llano GPU is not small, and it is also sharing that bandwidth. Though I take it from an informal comment in the other thread that the IMC was "THE KEY" component, I actually think that it must be very efficient if it wants to be successful.
    IIRC we heard from Sam Naffziger that the IMC will be optimized for that task. But that would only be to keep internal latencies low and not waste bandwidth while transferring data between the GPU and DDR3 or the CPU cores.

    But what we're missing here is a diagram with a profile of memory interface usage by the cores and the GPU while doing graphics. On a per-frame basis, I assume it would look like this (see the sketch below):
    1. CPU cores calculating CPU part of scene data (CPU uses IMC)
    2. CPU initiates transfer of shader code, vertices, texture addresses etc. to GPU (IMC mostly used by GPU)
    3. GPU renders frame data, fetches textures and vertices from external RAM and the latter also partly from CPU caches (because they have been processed there) (IMC mostly used by GPU)
    4. store modified frame data (in local cache or external RAM) and loop to 2. if not done for this frame (IMC mostly used by GPU)

    The CPU will surely already be working on the next frame (e.g. doing physics calculations), but it has L2 caches to stay local during most cycles.
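    A minimal sketch of that assumed per-frame profile, just to make the step-by-step explicit (the phase descriptions and the dominant IMC client are my own reading of the list above):

    ```python
    # Sketch of the per-frame flow assumed above, tagging which side dominates
    # the IMC in each phase. Purely illustrative.

    FRAME_PHASES = [
        ("CPU cores calculate the CPU part of the scene data",              "CPU cores"),
        ("CPU initiates transfer of shader code, vertices, texture addrs",  "mostly GPU"),
        ("GPU renders, fetching textures/vertices from RAM (some straight "
         "from CPU caches)",                                                "mostly GPU"),
        ("GPU stores modified frame data, loops to step 2 if not done",     "mostly GPU"),
    ]

    def profile_frame(frame_no):
        print(f"--- frame {frame_no} ---")
        for step, (work, imc_user) in enumerate(FRAME_PHASES, start=1):
            print(f"{step}. {work}  [IMC: {imc_user}]")
        # Meanwhile the CPU already works on frame N+1 (e.g. physics),
        # staying mostly inside its L2 caches.

    profile_frame(1)
    ```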
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  6. #81
    Registered User
    Join Date
    Mar 2010
    Posts
    11

  7. #82
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    hmmm with the igp integrated, wouldn't there be a HUGE boost from larger L3 caches?
    provided the algorithms are up to the task of handling data that both cpu and gpu write and read, then this would be a huge shortcut compared to handling this in the memory...
    it would work around the memory latency and bandwidth limitations... but i think you'd need a BIIIIG l3 cache for that, right?

    Quote Originally Posted by zir_blazer View Post
    But that is because Sideport gives overkill memory bandwidth to a very small GPU. The Llano GPU is not small, and it is also sharing that bandwidth. Though I take it from an informal comment in the other thread that the IMC was "THE KEY" component, I actually think that it must be very efficient if it wants to be successful.
    you're right... wow, i didn't know llano would have that many shaders... :o
    12x that of the 880/890 chipset? geez...

    880/890=40shaders
    llano=480shaders

    that's more than a 5670... and a 5670 has 64 GB/s bandwidth...
    hmmmm if amd is smart they will use sideport on the cpu package...
    think of how much a single or maybe 2 gddr5 chips on the package could do
    then all they would need the L3 cache for is to buffer cpu to gpu traffic...
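    rough napkin math on the bandwidth-per-shader gap (my own estimate, assuming the 5670's 400 SPs and the dual-channel DDR3-1600 figure from earlier in the thread; theoretical peaks only):

    ```python
    # Napkin math: theoretical peak bandwidth per shader.

    hd5670_sps, hd5670_bw = 400, 64.0   # GDDR5, 128-bit, dedicated to the GPU
    llano_sps,  llano_bw  = 480, 25.6   # dual-channel DDR3-1600, shared with 4 CPU cores

    print(f"HD 5670: {hd5670_bw / hd5670_sps * 1000:.0f} MB/s per SP (dedicated)")
    print(f"Llano:   {llano_bw  / llano_sps  * 1000:.0f} MB/s per SP (before the CPU cores take their cut)")
    ```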

    thanks for the link
    but that's a stupid decision from amd if true...
    they have to write the l1 and l2 caches to SYSTEM MEMORY before they can C6?
    that's really stupid...
    why don't they power gate the core only and leave the L1 and L2 active? it's not like L1 and L2 consume that much power...
    that way the cores could c6 while their caches are still available for other cores, and the core doesn't have to load data from memory when powering on again...

  8. #83
    Xtreme Enthusiast
    Join Date
    Feb 2005
    Posts
    970
    It's getting close now. I'll just quote a post from S|A since it's easier.

    Fusion, coming soon: http://sites.amd.com/us/fusion/apu/P...ion.aspx

    Fusion whitepaper: http://sites.amd.com/us/Documents/48...epaper_WEB.pdf

    A bit about the system bus and memory controller:

    Quote: The key aspect to note is that all the major system elements – x86 cores, vector (SIMD) engines, and a Unified Video Decoder (UVD) for HD decoding tasks – attach directly to the same high speed bus, and thus to the main system memory. This design concept eliminates one of the fundamental constraints that limits the performance of traditional integrated graphics controllers (IGPs).

    and ...

    Quote: Although the APU's scalar x86 cores and SIMD engines share a common path to system memory, AMD's first generation implementations divide that memory into regions managed by the operating system running on the x86 cores and other regions managed by software running on the SIMD engines. AMD provides high speed block transfer engines that move data between the x86 and SIMD memory partitions. Unlike transfers between an external frame buffer and system memory, these transfers never hit the system's external bus. Clever software developers can overlap the loading and unloading of blocks in the SIMD memory with execution involving data in other blocks. Insight 64 anticipates that future APU architectures will evolve towards a more seamless memory management model that allows even higher levels of balanced performance scaling.
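    For what it's worth, here is a loose sketch of what that partitioned model sounds like from the programmer's side (entirely my own reading of the quoted paragraph; the class and function names are hypothetical, not an actual AMD API):

    ```python
    # Hypothetical model of the first-gen APU memory split described above:
    # one physical DDR3 pool carved into an x86 partition (OS-managed) and a
    # SIMD partition (managed by software on the SIMD engines), with block
    # transfer engines copying between them without touching any external bus.

    class ApuMemoryModel:
        def __init__(self, total_gb, simd_gb):
            self.x86_partition_gb  = total_gb - simd_gb   # OS-managed region
            self.simd_partition_gb = simd_gb              # SIMD-managed region

        def block_transfer(self, block_mb, direction="x86->simd"):
            # On-die block transfer engine: the copy goes through the shared IMC
            # only, so no PCIe/HT crossing is involved.
            return {"bytes": block_mb * 2**20, "direction": direction, "external_bus": False}

    mem = ApuMemoryModel(total_gb=4, simd_gb=0.5)
    job = mem.block_transfer(64, "x86->simd")   # stage 64 MB of input for the SIMD engines
    # A clever scheduler would overlap this copy with SIMD execution on other blocks.
    ```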

  9. #84
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Location
    Buenos Aires, Argentina
    Posts
    644
    The Links aren't working. Fixed Links: Webpage, PDF

    Llano could be one of the most interesting pieces of hardware in quite a long time and could OWN the entire OEM and notebook market given the right price, and as I stated several times, it could eat the Socket AM3 platform market ("mainstream" but overall looks better than the current enthusiast-class Thuban) at least until Bulldozer's arrival. However, I won't declare it an instant winner until I see the Sandy Bridge GPU in action.

  10. #85
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by flippin_waffles View Post
    It's getting close now. I'll just quote a post from S|A since it's easier.
    It seems that the first Fusion implementation is even less efficient than I thought. So to exchange data between the CPU and GPU you need to move data from one memory location to another.

  11. #86
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    I'd rather wait and see how it works in practice.

  12. #87
    I am Xtreme
    Join Date
    Jul 2007
    Location
    Austria
    Posts
    5,485
    Quote Originally Posted by kl0012 View Post
    It seems that the first Fusion implementation is even less efficient than I thought. So to exchange data between the CPU and GPU you need to move data from one memory location to another.
    Hmm, I don't know, the whitepaper is not really conclusive about what AMD means by "attach directly to the same high speed bus, and thus to the main system memory."

    To me it seems they are saying the SIMD array (aka the IGP) is still connected to the memory controller/CPU over an HT link, but it's an on-die HT link, which of course eliminates quite a bit of latency.
    Seems like Clarkdale, just one step down (from package level to die level).

    I'm curious how exactly it's implemented; time will tell.

  13. #88
    Registered User
    Join Date
    Sep 2009
    Posts
    77
    Hmmm, it seems that the multiple SIMD arrays are connected to the crossbar directly?
    Last edited by superrugal; 05-07-2010 at 12:19 PM.

  14. #89
    Xtreme Member
    Join Date
    Nov 2006
    Posts
    324
    I highly doubt the 480 shaders...
    But for an APU they don't need many shaders to start with.

    We already heard everywhere that the first generation of Llano comes this year? That document says the AMD Fusion family of Accelerated Processing Units is scheduled to arrive in 2011.
    They also talk here about APU-accelerated programs with GPU assistance, so the first Llano with K8 cores will be able to do that as well. I guess it is the same way it can be done now with separate graphics chips.
    Windows 8.1
    Asus M4A87TD EVO + Phenom II X6 1055T @ 3900MHz + HD3850
    APUs

  15. #90
    Xtreme Mentor
    Join Date
    Jul 2008
    Location
    Shimla , India
    Posts
    2,631
    "attach directly to the same high speed bus, and thus to the main system memory."

    By that they mean that a multi-directional HT bus connects the parts together ("HT1") and then another link is used to connect to the memory ("HT2").
    Coming Soon

  16. #91
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by Hornet331 View Post
    Hmm, I don't know, the whitepaper is not really conclusive about what AMD means by "attach directly to the same high speed bus, and thus to the main system memory."

    To me it seems they are saying the SIMD array (aka the IGP) is still connected to the memory controller/CPU over an HT link, but it's an on-die HT link, which of course eliminates quite a bit of latency.
    Seems like Clarkdale, just one step down (from package level to die level).

    I'm curious how exactly it's implemented; time will tell.
    they share the same memory controller but through software they partition memory. the gpu can't perform computations on the cpu's memory partition or vice versa. this is probably the simplest way of doing things, although the memory controller will be an issue. the gpu is throughput optimized and the memory controller on them is designed with advanced arbitration logic to prioritize dram r/w's for the SIMDs. it's like machine learning in hardware.
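    just to illustrate the arbitration idea (my own toy example, not how Llano's IMC actually works): the throughput-oriented GPU can get more scheduling slots per round than the latency-oriented CPU cores without starving them.

    ```python
    from collections import deque

    # Toy DRAM arbiter: GPU (SIMD) requests get more slots per round than CPU
    # requests, but the CPU is never starved. Purely illustrative.

    def arbitrate(cpu_reqs, gpu_reqs, gpu_slots=3, cpu_slots=1):
        cpu, gpu, order = deque(cpu_reqs), deque(gpu_reqs), []
        while cpu or gpu:
            for _ in range(gpu_slots):
                if gpu:
                    order.append(gpu.popleft())
            for _ in range(cpu_slots):
                if cpu:
                    order.append(cpu.popleft())
        return order

    print(arbitrate(["cpu_rd0", "cpu_wr1"], ["simd_rd0", "simd_rd1", "simd_rd2", "simd_rd3"]))
    ```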

  17. #92
    I am Xtreme
    Join Date
    Dec 2007
    Posts
    7,750
    well 480 SPs is a lot, and i'm sure they did a lot of math to know if it was going to be too much, or if they should have started with like half that. but i have to assume they knew what they were getting into, and that 480 wouldn't be memory starved or going to waste.

  18. #93
    Xtreme Enthusiast
    Join Date
    Feb 2009
    Posts
    800
    Quote Originally Posted by zir_blazer View Post
    The Links aren't working. Fixed Links: Webpage, PDF

    Llano could be one of the most interesting pieces of hardware in quite a long time and could OWN the entire OEM and notebook market given the right price, and as I stated several times, it could eat the Socket AM3 platform market ("mainstream" but overall looks better than the current enthusiast-class Thuban) at least until Bulldozer's arrival. However, I won't declare it an instant winner until I see the Sandy Bridge GPU in action.
    Cool flash animation, though it crashed the first time (Firefox beta hooray, only the plugin crashed)

  19. #94
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    The Fusion IMC is the key component of the Llano design; it will be enough for the integrated GPU and the x86 cores. There's some clever engineering going on inside that IMC and the chip as a whole.

  20. #95
    I am Xtreme
    Join Date
    Jul 2007
    Location
    Austria
    Posts
    5,485
    Quote Originally Posted by Chumbucket843 View Post
    they share the same memory controller but through software they partition memory. the gpu can't perform computations on the cpu's memory partition or vice versa. this is probably the simplest way of doing things, although the memory controller will be an issue. the gpu is throughput optimized and the memory controller on them is designed with advanced arbitration logic to prioritize dram r/w's for the SIMDs. it's like machine learning in hardware.
    That's not really the issue; my question was more about how it's connected to the IMC. Intel's solution for Clarkdale was to bundle the IGP with the IMC and access the IMC from the CPU through the high-bandwidth QPI link, all on the same package, which already cut the latency quite a bit.

    And from what I'm reading right now in AMD's whitepaper, it looks like they'll do it the opposite way plus on a deeper level, but still through a bus (HT) and not a direct connection.

  21. #96
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    i am sort of confused by what you are saying.
    the SIMDs in the GPU and the x86 processors in the CPU are both "cores" and can request dram access. there is no reason for a QPI/HT link.


    Until now, transistor budget constraints typically mandated a two chip solution for such systems, forcing system architects to use a chip-to-chip crossing between the memory controller and either the CPU or GPU as shown in Figure 3. These transfers affect memory latency, consume system power and thus impact battery life. The APU's scalar x86 cores and SIMD engines share a common path to system memory to help avoid these constraints.

  22. #97
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by Hornet331 View Post
    Hmm, I don't know, the whitepaper is not really conclusive about what AMD means by "attach directly to the same high speed bus, and thus to the main system memory."

    To me it seems they are saying the SIMD array (aka the IGP) is still connected to the memory controller/CPU over an HT link, but it's an on-die HT link, which of course eliminates quite a bit of latency.
    Seems like Clarkdale, just one step down (from package level to die level).

    I'm curious how exactly it's implemented; time will tell.
    It seems that there is no need for HT this time. The Fusion IGP (just like a normal IGP or even a discrete GPU) has its own dedicated memory region and has no direct connection with the CPU cores; that is, data exchange between them goes through transferring data from one memory location to another. It seems that all the innovation lies in the fact that now the GPU is on the same die as the CPU and shares a common channel to memory with it.

  23. #98
    Xtreme Addict
    Join Date
    Jun 2007
    Location
    Thessaloniki, Greece
    Posts
    1,307
    Quote Originally Posted by kl0012 View Post
    It seems that there is no need for HT this time. The Fusion IGP (just like a normal IGP or even a discrete GPU) has its own dedicated memory region and has no direct connection with the CPU cores; that is, data exchange between them goes through transferring data from one memory location to another. It seems that all the innovation lies in the fact that now the GPU is on the same die as the CPU and shares a common channel to memory with it.
    read again
    AMD provides high speed block transfer engines that move data between the x86 and SIMD memory partitions. Unlike transfers between an external frame buffer and system memory, these transfers never hit the system’s external bus.
    Seems we made our greatest error when we named it at the start
    for though we called it "Human Nature" - it was cancer of the heart
    CPU: AMD X3 720BE@ 3,4Ghz
    Cooler: Xigmatek S1283(Terrible mounting system for AM2/3)
    Motherboard: Gigabyte 790FXT-UD5P(F4) RAM: 2x 2GB OCZ DDR3 1600Mhz Gold 8-8-8-24
    GPU:HD5850 1GB
    PSU: Seasonic M12D 750W Case: Coolermaster HAF932(aka Dusty )

  24. #99
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by BrowncoatGR View Post
    read again
    Why do you think I didn't read it? Besides, I mentioned "shared" memory access, which automatically means that "those transfers" won't go through the external bus. But anyway, the data path from CPU to GPU remains long: cpu -> cpu_memory -> gpu_memory -> gpu. Not much better than a current IGP.
    Last edited by kl0012; 05-07-2010 at 10:48 PM.

  25. #100
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Location
    Buenos Aires, Argentina
    Posts
    644
    I suppose that this refers to command communication between the CPU and GPU, basically, when the CPU tells the GPU what to do. When that is happening, it should not use the IMC, HyperTransport, or anything else, because it is on the same piece of silicon. However, when the GPU goes to request the bulk of the data, it is going to access its VRAM (shared in the RAM) anyway, so you are basically cutting the CPU-to-GPU coordination latency. Not having to use the external bus at all would mean that it has a sort of huge cache or buffers to store tons of info so it does not have to actually access the external bus, or that the CPU uploads data to the GPU for processing in real time instead of storing it in VRAM and then telling the GPU to go retrieve it.
    In current platforms the topology works like this (see the sketch after this list):

    _Northbridge IGP without Sideport (shared RAM): The CPU uploads graphics data to the shared RAM at one hop using the IMC, and also commands the GPU one hop away using HyperTransport. The IGP has to travel two hops, using HyperTransport then the CPU IMC, to retrieve the data.
    _Northbridge IGP with Sideport (exclusive): The CPU uploads data to the exclusive IGP VRAM that is two hops away, using HyperTransport to the Northbridge then the mini IMC that it should have to manage the Sideport. It still commands the GPU one hop away, and the GPU has its VRAM one hop away, too.
    _Northbridge IGP with Sideport AND shared RAM: You're using both the Sideport and some shared RAM as VRAM, which means extra overhead. The topology depends on what part of the VRAM the CPU is uploading to and where the GPU is retrieving it from, but basically it is the previous two cases at the same time.
    _PCIe video card: The GPU is two hops away from the CPU (HyperTransport to the Northbridge, then the PCIe bus to the video card), and the VRAM is at three (adding the GPU IMC). Nuff' said.
    _PCIe video card with TurboCache/HyperMemory: Similar to the Northbridge IGP with Sideport AND shared RAM, as the GPU also has some VRAM of its own, except that the GPU is two hops away (HyperTransport, PCIe bus), its exclusive VRAM at three (add the GPU IMC to the previous path), and the shared one at one hop from the CPU (using the CPU IMC) but three from the GPU (PCIe, HyperTransport, processor IMC). But the concept is the same.
    _Fusion: The CPU commands the GPU internally, with the lowest possible latency at 0 hops. You are using shared RAM at just one hop through the IMC, so whatever either the CPU or the GPU wants to access has to go through the same bus. Possibly the most important improvement would be that data to process is uploaded directly from the CPU to the GPU in real time, instead of the CPU just telling it where in RAM the data has been placed, in which case the GPU would have to retrieve it from the VRAM.
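    And a compact tabulation of those hop counts, just to line them up side by side (my own encoding of the numbers in the list above; the TurboCache/HyperMemory entry shows only the shared-RAM part):

    ```python
    # Hop counts from the list above: CPU->GPU command path, CPU->VRAM, GPU->VRAM.
    # "VRAM" is wherever the graphics data actually lives in that topology.

    topologies = {
        "NB IGP, shared RAM only":             {"cpu->gpu": 1, "cpu->vram": 1, "gpu->vram": 2},
        "NB IGP, Sideport only":               {"cpu->gpu": 1, "cpu->vram": 2, "gpu->vram": 1},
        "PCIe card, local VRAM":               {"cpu->gpu": 2, "cpu->vram": 3, "gpu->vram": 1},
        "PCIe card, TurboCache (shared part)": {"cpu->gpu": 2, "cpu->vram": 1, "gpu->vram": 3},
        "Fusion (Llano)":                      {"cpu->gpu": 0, "cpu->vram": 1, "gpu->vram": 1},
    }

    for name, hops in topologies.items():
        print(f"{name:38s} {hops}  total={sum(hops.values())}")
    ```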
