
Thread: Can Llano do AVX?

  1. #101
    Registered User
    Join Date
    Nov 2008
    Posts
    28
    The best solution would be to sell the APU soldered onto a board w/o commodity dimms etc; instead, use soldered GDDR5 on a wide bus for > 100 GB/s of bandwidth for both CPU and GPU. There's no need for most consumers to have commodity dimms on their machines. At this level of integration, the whole system could be designed like a graphics card is today.
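To put rough numbers on the bandwidth gap being discussed (a quick back-of-the-envelope in Python; the bus widths and transfer rates below are illustrative picks, not Llano specs):

Code:
    # Peak bandwidth = bus width in bytes * transfer rate.
    def peak_bw_gbs(bus_bits, transfers_per_sec):
        return (bus_bits / 8) * transfers_per_sec / 1e9

    print(peak_bw_gbs(128, 1.866e9))  # 128-bit DDR3-1866:       ~29.9 GB/s
    print(peak_bw_gbs(128, 4.0e9))    # 128-bit GDDR5 @ 4 GT/s:   ~64 GB/s
    print(peak_bw_gbs(256, 4.0e9))    # 256-bit GDDR5 @ 4 GT/s:  ~128 GB/s

So the > 100 GB/s figure essentially implies a graphics-card-style 256-bit (or faster) GDDR5 setup.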

  2. #102
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    There is a paper about the mem bandwidth required by games:
    http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  3. #103
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Posts
    644
    Quote Originally Posted by Dresdenboy View Post
    There is a paper about the mem bandwidth required by games:
    http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
What I take from it is that 2005-2006 games sent around 60 MB/s of data to the GPU (with Oblivion being the lone exception, topping out at 142 MB/s), while the GPU itself used around 11 GB/s of memory bandwidth from the VRAM. I'm not sure how a similar study would look these days, considering that GPUs and game engines have evolved quite a bit, but that slightly old study shows a pattern.
So I suppose that basically means games benefit almost exclusively from the lower access latency when overclocking the VRAM, and little to nothing from the excess memory bandwidth.

  4. #104
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by zir_blazer View Post
_Fusion: the CPU commands the GPU internally, with the lowest possible latency at 0 hops. You are using shared RAM at just one hop through the IMC, so whatever either the CPU or GPU wants to access has to go through the same bus. Possibly the most important improvement would be that data to process is uploaded directly from the CPU to the GPU in real time, instead of the CPU just telling it where in RAM the data has been placed, in which case it would have to retrieve it from the VRAM.
1. The CPU sends the "command buffer" to the GPU through memory (and not through I/O ports). All CPU I/O (non-memory) operations are slow by nature; they are not cached.
2. The RAM of the "Fusion" CPU (at least for its first version) isn't shared between the GPU and CPU. Only the memory controller is shared, and the RAM is divided into regions, each of which is dedicated to either the CPU or the GPU. So in order to exchange data between the CPU and GPU you need to copy it from one memory region to another. This is probably not too efficient.
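As a sketch of what that partitioned arrangement implies for software (purely conceptual, not AMD's actual driver code), the hand-off still ends up as a copy between two areas behind the same controller:

Code:
    # Conceptual model: one physical pool behind a shared IMC, but two
    # dedicated regions, so a CPU->GPU hand-off still costs a copy.
    cpu_region = bytearray(64 * 1024 * 1024)   # region reserved for the CPU
    gpu_region = bytearray(64 * 1024 * 1024)   # region reserved for the GPU

    def upload_to_gpu(src_off, dst_off, length):
        # Data prepared in the CPU region must be copied into the GPU region
        # before the GPU may consume it (a read plus a write on the same bus).
        gpu_region[dst_off:dst_off + length] = cpu_region[src_off:src_off + length]

    upload_to_gpu(0, 0, 4096)   # e.g. hand a 4 KB block of vertices to the GPU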

  5. #105
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by zir_blazer View Post
What I take from it is that 2005-2006 games sent around 60 MB/s of data to the GPU (with Oblivion being the lone exception, topping out at 142 MB/s), while the GPU itself used around 11 GB/s of memory bandwidth from the VRAM. I'm not sure how a similar study would look these days, considering that GPUs and game engines have evolved quite a bit, but that slightly old study shows a pattern.
So I suppose that basically means games benefit almost exclusively from the lower access latency when overclocking the VRAM, and little to nothing from the excess memory bandwidth.
    The estimation was done for 1024x768 screen resolution. For 1920x1200 with high texture quality you would probably need much higher bandwidth.
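A quick scaling estimate (assuming the traffic grows roughly with pixel count, which ignores texture detail, AA and overdraw, so treat it as a lower bound):

Code:
    # Naive resolution scaling of the paper's figures.
    scale = (1920 * 1200) / (1024 * 768)   # ~2.93x more pixels
    print(scale * 11)    # ~32 GB/s of GPU<->VRAM traffic
    print(scale * 60)    # ~176 MB/s of CPU->GPU traffic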

  6. #106
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by kl0012 View Post
    2. The RAM of "fusion" cpu (at least for its first version) isn't shared between GPU and CPU. Only a memory controller is shared and the RAM is divided into regions each of which is dedicated to CPU or GPU. So in order to exchange data between CPU and GPU you need to copy data from one mem region to another. This is probably not too efficient.
It sounds like the efficient part of this story is that the CPU doesn't have to do the copying by executing code. In fact it could be powered off while the copying takes place. That's what the Fusion paper suggests (at least to me). What if the command blocks (a few MB per frame in total) are sent in packets of a few hundred KB? These could still be residing in the L2 cache and fetched from there. The copying itself has to be initiated by the graphics driver, which in Llano's case should be able to program/control the IMC accordingly.

    Textures are a different story, but they don't have to be copied per frame from CPU to GPU.
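To put rough numbers on that command traffic (my own arithmetic, using the "few MB per frame" figure above and an assumed packet size):

Code:
    # Command-stream budget: a few MB per frame at 60 fps, split into packets.
    mb_per_frame = 3          # assumed, per the "few MB" estimate
    packet_kb = 256           # assumed packet size of a few hundred KB
    print(mb_per_frame * 60)                 # ~180 MB/s of command traffic
    print(mb_per_frame * 1024 // packet_kb)  # ~12 packets per frame, L2-sized

That is tiny next to the GB/s the shaders need for textures and render targets, which fits the earlier point that the command path isn't where the bandwidth pressure is.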
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  7. #107
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by Dresdenboy View Post
It sounds like the efficient part of this story is that the CPU doesn't have to do the copying by executing code. In fact it could be powered off while the copying takes place. That's what the Fusion paper suggests (at least to me).
I'm not sure the CPU is doing anything while a data transfer from main memory to VRAM takes place in the current scheme; a DMA engine is probably responsible for it. The only difference from current solutions is a memory transfer that doesn't hit an external bus (PCIe in the case of discrete graphics, or HT in the case of integrated graphics).

What if the command blocks (a few MB per frame in total) are sent in packets of a few hundred KB? These could still be residing in the L2 cache and fetched from there. The copying itself has to be initiated by the graphics driver, which in Llano's case should be able to program/control the IMC accordingly.
Textures are a different story, but they don't have to be copied per frame from CPU to GPU.
I think it would be possible if AMD implemented some on-die "command buffer" memory which can be mapped by the CPU. Currently the GPU reads commands from a special region in system memory which is mapped as I/O to prevent caching.
    Last edited by kl0012; 05-10-2010 at 03:34 AM.

  8. #108
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by zir_blazer View Post
Basically, that would place it slightly above the Radeon 5570/5670 but a fair bit below the 5750, those having 400 and 720 SPs respectively. That means we can speculate quite accurately about Fusion GPU performance, with two exceptions: having the GPU directly connected to the CPU on the same piece of silicon basically eliminates their communication latency, which is a benefit; on the other hand, how much of an impact will it have that the GPU shares memory bandwidth with the other cores, at a higher latency than a video card's own VRAM? Well, that is all that is left to know about the Fusion GPU besides actual numbers.
Now... what do we know about Sandy Bridge? Do we have even a remote idea of its performance? Otherwise, I would still hold off until more info surfaces. The worst thing you can do is declare yourself I N V I N C I B L E and get owned before you finish saying the classic sentence.

BTW... where the hell is Hans de Vries? His input would be useful in this thread after so many days.
    Hans is still wondering what we can conclude from this new info....

    http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx
    http://sites.amd.com/us/Documents/48...epaper_WEB.pdf

    It's easier to talk about what I would like to see in it.
    Nathan Brookwood talks all about HPC/GPGPU applications
    and a little bit about the architecture.

    The all important question is still: Does Llano have a GDDR5 sideport?

    It seems to be required for both HPC and graphics since for SPEC_FP_rate
    we see the 128 bit bus as a real bottleneck already for Istanbul, let alone
    for a GPGPU, and for graphics we can see how much faster the Radeon HD
    5670 is compared to the Radeon HD 5570 by using GDDR5 instead of GDDR3.
    (http://www.anandtech.com/show/2935)

    The only hint for sideport GDDR5 memory is the statement about the
    autonomous data transfer (presumably scatter/gather DMA controllers)
    between "CPU" and "GPU" based memory. If the GPU memory is physically
in the DDR3 DIMMs then you can just allocate the required amount as
a non-cacheable area to store game data.

In the context of Nathan's white paper you'd expect AMD to port all
its math and HPC libraries to the GPGPU via ACML-GPU, see:
    (http://developer.amd.com/gpu/acmlgpu/Pages/default.aspx)
    Here you need a shell which copies portions of the large data structures
    from CPU memory (which can be dozens of GByte) to the smaller GDDR5
    memory of the GPGPU (1/2 to 2 GByte) for high bandwidth, high throughput
    processing. The DMA units should be able to double the copy bandwidth
    over software copying.
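A sketch of the kind of "shell" described above (hypothetical, not ACML-GPU's actual interface): stream a host-resident array through the smaller GDDR5 pool in tiles, with the DMA engines doing the copies:

Code:
    # Hypothetical tiling shell: the dataset (tens of GB) lives in host DDR3,
    # each working tile fits in the GPU's 0.5-2 GB of GDDR5.
    def process_in_tiles(dataset, tile_elems, dma_in, gpu_kernel, dma_out):
        # dma_in/dma_out and gpu_kernel stand in for driver/library calls.
        results = []
        for start in range(0, len(dataset), tile_elems):
            tile = dataset[start:start + tile_elems]
            gpu_buf = dma_in(tile)          # host DDR3 -> GDDR5 (DMA engine)
            out = gpu_kernel(gpu_buf)       # high-bandwidth crunching on the GPU
            results.extend(dma_out(out))    # GDDR5 -> host DDR3
        return results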

Now, the Llano die photo doesn't show a sideport memory interface, which
should be 128 bits wide IMO. We also know that not the whole die is shown,
so I'm not sure. I'm sure the ATI guys would fight very hard to get one
on the die, and rightly so, but there is no proof yet. If not, then we
just have to be satisfied with the latest Sep 2009 JEDEC file which
standardizes DDR3-1866 and DDR3-2133...
    http://www.jedec.org/standards-docum...ocs/jesd-79-3d


    Regards, Hans
    Last edited by Hans de Vries; 05-10-2010 at 07:09 AM.

  9. #109
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
if you look at things from a bigger perspective... then im sure itll have sideport...
amd has been talking about fusion for a long time, and they started sideport back in the 6xx igp days already... shortly after they took over ati...
that was with 40 and 80 shader cores, and 64bit... and even though it didnt give a huge boost and some mainboard and laptop makers actually opted NOT to use sideport...
they continued to support it in new chipsets

and now we are talking about 480 shader cores!
and no sideport at all?
it would be really odd and a step back NOT to have sideport on llano...
and the fact alone that they dont talk about having or not having sideport is a hint that they are at least considering it imo...
sometimes the most important things are what somebody DOESNT say, not what he does say

  10. #110
    Xtreme Member
    Join Date
    Oct 2007
    Location
    Sweden
    Posts
    127
    Quote Originally Posted by Hans de Vries View Post
    .............If not then we just have to be satisfied with the latest Sep 2009 JEDEC
    file which standardizes DDR3-1866 and DDR3-2166...
    http://www.jedec.org/standards-docum...ocs/jesd-79-3d


    Regards, Hans
    Fetching that paper needs registration so I need to ask, is that really DDR3-2166
    and not DDR3-2133?

Personally I would have preferred that they moved in bigger steps this time, that
is to DDR-2000 and DDR-2400, before moving on to a future DDR4. The upside of
faster memory feels increasingly marginal to me these days.

  11. #111
    Xtreme Addict
    Join Date
    Apr 2006
    Location
    City of Lights, The Netherlands
    Posts
    2,381
Llano's die size doesn't seem big enough to support a sideport IMHO. I mean, Llano is only around 160 to 170 mm² and it needs to have the pin-out for the 128-bit DDR3 memory bus, around 20 PCIe lanes and the display connections. I don't think we'll be seeing a sideport on Llano.
    "When in doubt, C-4!" -- Jamie Hyneman

    Silverstone TJ-09 Case | Seasonic X-750 PSU | Intel Core i5 750 CPU | ASUS P7P55D PRO Mobo | OCZ 4GB DDR3 RAM | ATI Radeon 5850 GPU | Intel X-25M 80GB SSD | WD 2TB HDD | Windows 7 x64 | NEC EA23WMi 23" Monitor |Auzentech X-Fi Forte Soundcard | Creative T3 2.1 Speakers | AudioTechnica AD900 Headphone |

  12. #112
    Registered User
    Join Date
    Nov 2008
    Posts
    28
I think AMD is more focused on low cost and smooth execution with this first generation of Fusion, so it probably won't have a more expensive and complicated arrangement with a sideport. It would make sense to solder the APU and memory onto a single board or package in future iterations though, since this is intended to be a stand-alone consumer product. Why not integrate an entire consumer-level system into a unit like today's graphics cards, and keep the modularity and upgradeability for servers?

  13. #113
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by Kej View Post
    Fetching that paper needs registration so I need to ask, is that really DDR3-2166
    and not DDR3-2133?
    Right, corrected. It's all in steps of 266 MHz

    Regards, Hans

  14. #114
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by Helmore View Post
    Llano's die size doesn't seem big enough to support a sideport IMHO. I mean, Llano is only around 160 to 170 mm² and it needs to have the pin-out for the 128-bit DDR3 memory bus, around 20 PCIe channels and the display connections. I don't think we'll be seeing a sideport on Llano.
    You may be referring to the smaller die which they showed at the Nov.
    2009 analyst meeting which was cut off through the 128 bit DDR3 IMC
    and one of the PCIe interfaces.

The larger one shown at the notebook meeting is ~205 mm^2, but it's
possibly incomplete as well. It seems to be a partial die; also, there are
no display I/O cells visible.

    Llano has ~ 1 billion transistors according to an AMD slide which places
    it in the same league as a six core Westmere with 1.17 Billion transistors.
    From the transistors one would expect something like 320 to 400 Shader
    Processor units which is a lot.

The ATI Redwood die, which has 400 SP units, has a die size of 104 mm^2
for 627 million transistors, including a full 128-bit GDDR5 IMC, on TSMC 40nm.


    Regards, Hans

  15. #115
    Xtreme Cruncher
    Join Date
    Apr 2005
    Location
    TX, USA
    Posts
    898
    Just got around to reading that whitepaper and wading through the marketing talk for any real information :X

    The only snippet I could find was that part mentioning the memory handling, as already discussed. My thoughts are:
1) Memory transfers between the CPU and "SIMD Arrays" are done through block transfer engines, so essentially a DMA controller in concept. I assume that's technically how CPU/GPU communication is already done (DMA handles the PCIe data chunk shuffling), or should be..., except this would be more tightly coupled and not go across the external bus (I assume this is how they refer to the main PCIe/HT buses).
2) Memory is shared, yet divided/partitioned between the CPUs and SIMD, so maybe some sort of virtual memory scheme? It could be done such that a transfer between CPU/SIMD wouldn't have to modify main memory at all and could just deal with virtual address mapping (see the toy sketch below). Caching becomes an interesting point too: having it shared might help if you're feeding data that's freshly touched/produced by the CPU, but if it's a large data set just sitting in main memory waiting to be processed it wouldn't help much. Somewhere in here the sideport could be thrown in too, since things would be abstracted by the virtual address management.
3) The part about not needing to touch the external bus, I think, must just be referring to everything being kept within the CPU die and not touching the HT bus, since this will be tied directly into the memory controller/bus/crossbar, however you want to describe it. The only situation where it should even need to go out to the HT bus would be an SMP implementation, which afaik Llano isn't reaching for...
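To make point 2 concrete, here's a toy model of the two hand-off styles (my own illustration, not anything from the whitepaper): a physical copy moves every byte across the memory bus, while a remap-style hand-off only changes which side the pages belong to:

Code:
    # Toy model: physical memory as pages, plus an ownership table.
    pages = [bytearray(4096) for _ in range(1024)]
    owner = ["cpu"] * len(pages)   # which side each page is dedicated to

    def handoff_by_copy(src, dst):
        # Classic scheme: the data crosses the memory bus twice (read + write).
        pages[dst][:] = pages[src]

    def handoff_by_remap(page_indices):
        # Speculative virtual-memory scheme: no bytes move, only the mapping
        # (here just an ownership flag) changes, so main memory isn't touched.
        for i in page_indices:
            owner[i] = "gpu"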


    Quote Originally Posted by saaya View Post
    hmmm with the igp integrated, wouldnt there be a HUGE boost from larger L3 caches?
provided the algorithms are up to the task of handling data that both cpu and gpu write and read, then this would be a huge shortcut compared to handling this in the memory...
    it would work around the memory latency and bandwidth limitations... but i think youd need a BIIIIG l3 cache for that, right?
    <snip>
    then all they would need the L3 cache for is to buffer cpu to gpu traffic...

    thanks for the link
    but thats a stupid decision from amd if true...
    they have to write the l1 and l2 caches to SYSTEM MEMORY before they can C6?
    thats really stupid...
    why dont they power gate the core only and let the L1 and L2 active? its not like L1 and L2 consume that much power...
    that way the cores could c6 while its caches are still available for other cores, and the core doesn't have to load data from memory when powering on again...
    Been really busy the last week, but to answer your question about high/low performance transistors and leakage from earlier:
Yes, if, say, you design for 2x the clock speed and use higher-speed transistors, your leakage will increase a bare minimum of 2x (on modern processes), and in all honesty probably very much more than that. I don't have any empirical data on hand here, but tweaking the threshold voltage for more performance/current has an exponential effect on sub-threshold leakage if you look straight at the equation. This ignores the fact that going for 2x clock would require better pipelining/logic redesign, which in itself would almost certainly involve higher dynamic/static power due to additional gates/clock routing, since high-performance transistors aren't nearly that much better.
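To sanity-check that "exponential" claim numerically (textbook subthreshold model and a slope factor of n ≈ 1.5 assumed, not numbers for any specific process):

Code:
    import math

    # Subthreshold leakage grows exponentially as Vth is lowered:
    #   I_leak ~ exp(-Vth / (n * kT/q))
    kT_q = 0.026   # thermal voltage at room temperature, ~26 mV
    n = 1.5        # assumed subthreshold slope factor

    def leakage_increase(delta_vth_mv):
        # Factor by which leakage rises when Vth is reduced by delta_vth_mv.
        return math.exp((delta_vth_mv / 1000) / (n * kT_q))

    print(leakage_increase(50))    # ~3.6x for a 50 mV lower Vth
    print(leakage_increase(100))   # ~13x  for a 100 mV lower Vth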

    Back OT:
I do think that sharing the L3 would provide a boost in some situations, perhaps sometimes a good amount, but definitely not all the time. The big thing to keep in mind is what algorithms are being used: if there's a very large working set and the SIMDs are just crunching through it (stream/throughput processing), then there's a high likelihood that once data is used it won't be used again, hence it won't stick around in the cache. However, as you mention, the L3 might make a good buffer between CPU and GPU, since currently it's known that the latency involved in transferring the data and starting a kernel on a GPU keeps shorter data sets from seeing an overall speedup. However, it's outside my knowledge to even guess what the current break-even point is for data-set size and positive speedup (it's also highly algorithm dependent).

    There's also the fact that the SIMD might start trashing the CPU data in L3 (remember, multiple cores, some working on other threads/tasks). Of course, this could be managed with a simple solution such as lock bits, letting the SIMD/CPUs partition the L3 when necessary.

About the C6 state stuff: from what I understand, C6 is essentially fully power gated (i.e. off). Since leakage is a recurring theme these days and L1/L2 are relatively large structures made of transistors, there's always at least some noticeable leakage going on; granted, the L1/L2 would probably use higher-Vt transistors (lower leakage), but it's still power draw. I don't think the gains of letting other cores use the L1/L2 would outweigh the power savings, since the core is inactive and no longer attempting to fill its cache. If the core were to actually flush to L3 instead of main memory, that would be good enough. You also have to consider that with multiple cores, the thread running on the powered-down core will likely move over to one of the other cores (which will only fill its cache with the necessary data); more likely still, the core is going to C6 because the thread is done, hence there is no need for the corresponding data again.

    I'd say the most important point is C6 is meant for when a core is expected to be turned off for a relatively long period of time (in cpu terms), so the incurred memory latency is amortized, especially when it's likely that the core won't be using the same data when woken up.
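Putting that amortization argument into a rough break-even form (the numbers below are placeholders I picked for illustration, not Llano measurements):

Code:
    # C6 only pays off if the energy saved while power-gated exceeds the
    # energy spent flushing the caches and refilling them on wake-up.
    idle_power_w   = 2.0    # assumed leakage of an idle, non-gated core
    flush_refill_j = 0.05   # assumed energy cost of flush + cold-cache refill

    breakeven_s = flush_refill_j / idle_power_w
    print(breakeven_s * 1000)   # ~25 ms: sleep shorter than this and C6 loses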


    Someday I'll have to learn to write less, heh.



  16. #116
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by Raqia View Post
    The best solution would be to sell the APU soldered onto a board w/o commodity dimms etc; instead, use soldered GDDR5 on a wide bus for > 100 GB/s of bandwidth for both CPU and GPU. There's no need for most consumers to have commodity dimms on their machines. At this level of integration, the whole system could be designed like a graphics card is today.
    i think thats where itll ultimately end, but i highly doubt thats what llano will be already... even if they get rid of 128bit gddr3 and make that gddr5... then thats still not enough to really feed the cpu and gpu cores... and if they cant feed them, why put them on the die to begin with?

    rcofell, thanks!
    if you dont power down l1 and l2 then you dont save as much power, but youll be able to turn off the cpu cores a lot faster and more frequent... idk how much power cache consumes compared to the cpu cores, but id be surprised if cache consumes as much or more than the cores...
    johan made some excellent experiments in his article on anandtech, and he came to the conclusion that turbo works great for servers, not so much because it overclocks the cpu, but mostly because it turns off cores and reduces their power very efficiently. even when the package was 60% loaded several cores spent quite some time in C6, and saved a lot of power that way.

    im worried that with amd depending on L1 and L2 to be flushed and moved and copied and reloaded... they will either end up with sleeping cores when work needs to be done, or they wont be able to ever go to sleep as theres always some little work that needs to be done...

    its ironic cause thats what made k8 so efficient, it could switch much faster from one power state to another than intels cpus, which saved a lot of power.

  17. #117
    Xtreme Member
    Join Date
    Nov 2006
    Posts
    324
Apparently you guys are thinking in the wrong direction...
Llano is not about blazing graphics speed but about much faster streaming computation...
    Windows 8.1
    Asus M4A87TD EVO + Phenom II X6 1055T @ 3900MHz + HD3850
    APUs

  18. #118
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by SEA View Post
Apparently you guys are thinking in the wrong direction...
Llano is not about blazing graphics speed but about much faster streaming computation...
oh yeah, youre right! that doesnt require a lot of bandwidth at all :P

if amd really doesnt do anything but slap an igp/gpu on their cpu package, castrate it with a 128bit ddr3 imc and call THAT fusion after over 5 years of marketing talk and supposedly working on fusion all that time...
ill just point a finger at them and laugh... how pathetic is that? they could have done the same thing YEARS ago
    Last edited by saaya; 05-11-2010 at 06:56 AM.

  19. #119
    Xtreme Mentor
    Join Date
    May 2008
    Location
    cleveland ohio
    Posts
    2,879
    Quote Originally Posted by saaya View Post
    i think thats where itll ultimately end, but i highly doubt thats what llano will be already... even if they get rid of 128bit gddr3 and make that gddr5... then thats still not enough to really feed the cpu and gpu cores... and if they cant feed them, why put them on the die to begin with?

    rcofell, thanks!
    if you dont power down l1 and l2 then you dont save as much power, but youll be able to turn off the cpu cores a lot faster and more frequent... idk how much power cache consumes compared to the cpu cores, but id be surprised if cache consumes as much or more than the cores...
    johan made some excellent experiments in his article on anandtech, and he came to the conclusion that turbo works great for servers, not so much because it overclocks the cpu, but mostly because it turns off cores and reduces their power very efficiently. even when the package was 60% loaded several cores spent quite some time in C6, and saved a lot of power that way.

    im worried that with amd depending on L1 and L2 to be flushed and moved and copied and reloaded... they will either end up with sleeping cores when work needs to be done, or they wont be able to ever go to sleep as theres always some little work that needs to be done...

    its ironic cause thats what made k8 so efficient, it could switch much faster from one power state to another than intels cpus, which saved a lot of power.
it's not a K8 die, it's a K10 die.
The NB has its own clock now, even without the L3 cache.
    Quote Originally Posted by saaya View Post
oh yeah, youre right! that doesnt require a lot of bandwidth at all :P

if amd really doesnt do anything but slap an igp/gpu on their cpu package, castrate it with a 128bit ddr3 imc and call THAT fusion after over 5 years of marketing talk and supposedly working on fusion all that time...
ill just point a finger at them and laugh... how pathetic is that? they could have done the same thing YEARS ago
I think they're going to need a bit of VRAM (1 GB) for those 480 shaders.
    HAVE NO FEAR!
    "AMD fallen angel"
    Quote Originally Posted by Gamekiller View Post
    You didn't get the memo? 1 hour 'Fugger time' is equal to 12 hours of regular time.

  20. #120
    Xtreme Member
    Join Date
    Nov 2006
    Posts
    324
    Quote Originally Posted by saaya View Post
oh yeah, youre right! that doesnt require a lot of bandwidth at all :P

if amd really doesnt do anything but slap an igp/gpu on their cpu package, castrate it with a 128bit ddr3 imc and call THAT fusion after over 5 years of marketing talk and supposedly working on fusion all that time...
ill just point a finger at them and laugh... how pathetic is that? they could have done the same thing YEARS ago
roll your eyes back for a minute please
1) What AMD has done you can read on the first page, from AMD itself...
2) Contemporary integrated GPUs are already on par with or better than top-level CPUs in some distributed computing projects.
3) Summing up the two points above:
you actually got the picture. So hold your early laugh...
    Windows 8.1
    Asus M4A87TD EVO + Phenom II X6 1055T @ 3900MHz + HD3850
    APUs

  21. #121
    Xtreme Addict
    Join Date
    Dec 2007
    Location
    Hungary (EU)
    Posts
    1,376
    Quote Originally Posted by Hans de Vries View Post
The larger TLB is good for newer, large workloads. A fast integer divide
is a bit overdue compared to Core/Nehalem. I think the somewhat larger
L1 caches (8 transistors/bit instead of 6 transistors/bit) opened up the
extra space in the layout needed for a fast integer divider.
Any impact is very program specific.


    Regards, Hans
Would you explain where the relationship between the larger caches and the integer divider unit comes from?
    -

  22. #122
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by Oliverda View Post
Would you explain where the relationship between the larger caches and the integer divider unit comes from?
He meant that the area of the caches increased due to the 8T design (vs. 6T before). Integrating these into the layout, plus some other changes, freed up some space where something else could be put in (physically).
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  23. #123
    Xtreme Cruncher
    Join Date
    Apr 2005
    Location
    TX, USA
    Posts
    898
    Quote Originally Posted by Oliverda View Post
Would you explain where the relationship between the larger caches and the integer divider unit comes from?
I'll go out on a limb here and assume he's correlating it with the fact that the major units are custom designed (hierarchically: caches, ALUs, etc. laid out separately) and hence somewhat fixed in dimension/aspect ratio for area/timing-efficient designs. In this case it looks like the L1 caches are roughly the same width but a little bit taller, so everything else in the same row also has to use up the same height or else leave empty/wasted space. From there, the extra space opens things up for more features.
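For scale, the raw transistor cost of the 6T-to-8T move is easy to estimate (assuming K10-style 64 KB L1I + 64 KB L1D per core and counting data bits only, no tags/ECC, so purely indicative):

Code:
    # Extra transistors per core from moving the L1 arrays from 6T to 8T cells.
    l1_bytes = 2 * 64 * 1024          # 64 KB instruction + 64 KB data (assumed)
    extra = l1_bytes * 8 * (8 - 6)    # 2 extra transistors per bit cell
    print(extra)                      # ~2.1 million extra transistors per core

Die area doesn't scale linearly with transistor count, but it gives a feel for why the cache arrays grow and the floorplan around them shifts.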



  24. #124
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by demonkevy666 View Post
it's not a K8 die, it's a K10 die.
The NB has its own clock now, even without the L3 cache.
    i know, why did you mention this?

    Quote Originally Posted by demonkevy666 View Post
I think they're going to need a bit of VRAM (1 GB) for those 480 shaders.
    yeah, if they use system memory only its going to completely kill perf i think...
    i think the cpu cores would be somewhat ok, as single channel 64bit is actually acceptable for 2 cores, even 4 if you dont push them very hard... but the gpu having less than 128bit... like i said, then why put so many sps on the chip if you cant feed them?

    Quote Originally Posted by SEA View Post
    Contemporary Integrated GPUs are already on par or better then top level cpus in some distributed projects.
that doesnt make sense though... igps are cut down gpus, so if youre after DP flops then why use a cut down gpu instead of the real deal that offers 20x the perf? especially if you look at platform costs that makes a lot more sense, as you can cram at least 8 gpus into one server if you use dual gpu cards, vs a single tiny gpu per platform in a llano server...

so llano as a server chip... idk... i cant really think of it as such a great idea. it will probably offer better perf/cost and perf/watt, but like i said, look at integration... unless there are pcie llano cards with 2 or more llano chips on them, it wont be that useful i think.
