
Thread: Can Llano do AVX?

  1. #76
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Location
    Buenos Aires, Argentina
    Posts
    644
    Then, the Sandy Bridge GPU would be substantially faster than Clarkdale, and slower than the Fusion GPU. I agree that that is as accurate as we can get, but still, that range is a pretty huge void.

    The memory bandwidth issue is pretty interesting because we could be talking about a potentially HUGE bottleneck there. The Radeon 5570 and 5670 nominal specifications call for DDR3 and GDDR5 respectively, with a 128-bit bus width for both, at 900 MHz and 1000 MHz, providing a theoretical 28.8 and 64 GB/s. If Llano uses a dual-channel IMC, you could get a theoretical 21.3 and 25.6 GB/s for DDR3 at 1333 MHz and 1600 MHz respectively, with the important difference that on Llano you are going to share it between the GPU and the other 4 cores. Actually, we can get a coherent idea of how much it can impact performance right now by benchmarking how much these two video cards scale down with lower VRAM frequencies, which would provide less memory bandwidth and higher access latency.
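    To make the numbers explicit, here is a rough back-of-the-envelope calculation in Python (a sketch only, using the nominal specs cited above; real sustained bandwidth will be lower):

    ```python
    # Rough theoretical-bandwidth estimates for the parts discussed above.
    # Nominal specs only; real sustained bandwidth is lower.

    def peak_gbps(bus_width_bits, effective_mts):
        """Peak bandwidth in GB/s = bus width in bytes * effective transfer rate in MT/s / 1000."""
        return bus_width_bits / 8 * effective_mts / 1000

    # Discrete cards, 128-bit bus:
    hd5570_ddr3  = peak_gbps(128, 900 * 2)    # DDR3 @ 900 MHz   -> 1800 MT/s ~= 28.8 GB/s
    hd5670_gddr5 = peak_gbps(128, 1000 * 4)   # GDDR5 @ 1000 MHz -> 4000 MT/s ~= 64.0 GB/s

    # Llano-style dual-channel DDR3 IMC (2 x 64-bit channels),
    # shared between the GPU and the 4 CPU cores:
    llano_1333 = peak_gbps(128, 1333)         # ~21.3 GB/s
    llano_1600 = peak_gbps(128, 1600)         # ~25.6 GB/s

    print(hd5570_ddr3, hd5670_gddr5, llano_1333, llano_1600)
    ```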

  2. #77
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    I'm not researching marketing strategies, but I think these Fusion designs are meant for notebooks and low-power (likely also low-budget) PCs, where currently the situation is like one of these:
    1) (LFB == ) GPU =HT3= CPU == DDR2/DDR3
    gpu <-> mem latency high, bandwidth medium (textures, vertex data)
    gpu <-> cpu latency medium, bandwidth medium (commands/shader code, vertex data)
    result code: +ooo (sum: +)

    2) GDDR5 == GPU =PCIe= Chipset =HT3= CPU == DDR2/DDR3
    gpu <-> mem latency low, bandwidth high
    gpu <-> cpu latency high, bandwidth medium
    result code: +-+o (sum: +)

    And Fusion will be:
    GPU-CPU == DDR3
    gpu <-> mem latency low, bandwidth medium (but CPU cores have twice the L2 for less pressure on IMC compared to 1))
    gpu <-> cpu latency low, bandwidth high
    result code: +o++ (sum: +++)

    And fewer bridges and external interconnects save power. Same performance at less power = more performance/watt, which means you can actually do more at e.g. 50 W than with design 1).
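    To put the performance/watt point in concrete terms, a tiny sketch with purely hypothetical overhead numbers (my own illustration, not measured figures): at a fixed platform budget, watts not spent on bridges and external links are watts left for actual CPU/GPU work.

    ```python
    # Illustrative only: the interconnect/bridge overheads below are hypothetical,
    # chosen just to show the shape of the argument.

    BUDGET_W = 50.0

    design_overhead_w = {
        "1) NB IGP + LFB (HT3 + bridge chip)":         8.0,
        "2) discrete GPU (PCIe + bridge + GDDR5 I/O)": 12.0,
        "3) Fusion (on-die links only)":                2.0,
    }

    for name, overhead in design_overhead_w.items():
        usable = BUDGET_W - overhead
        print(f"{name:45s} {usable:4.1f} W left for compute ({usable / BUDGET_W:.0%} of budget)")
    ```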
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  3. #78
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by zir_blazer View Post
    Then, the Sandy Bridge GPU would be substantially faster than Clarkdale, and slower than the Fusion GPU. I agree that that is as accurate as we can get, but still, that range is a pretty huge void.

    The memory bandwidth issue is pretty interesting because we could be talking about a potentially HUGE bottleneck there. The Radeon 5570 and 5670 nominal specifications call for DDR3 and GDDR5 respectively, with a 128-bit bus width for both, at 900 MHz and 1000 MHz, providing a theoretical 28.8 and 64 GB/s. If Llano uses a dual-channel IMC, you could get a theoretical 21.3 and 25.6 GB/s for DDR3 at 1333 MHz and 1600 MHz respectively, with the important difference that on Llano you are going to share it between the GPU and the other 4 cores. Actually, we can get a coherent idea of how much it can impact performance right now by benchmarking how much these two video cards scale down with lower VRAM frequencies, which would provide less memory bandwidth and higher access latency.
    just look at how little sideport actually helped igp boards...
    it gives them a perf boost of 10-25%... so i don't think memory bandwidth is such a big issue for the igp...

  4. #79
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Location
    Buenos Aires, Argentina
    Posts
    644
    Quote Originally Posted by saaya View Post
    just look at how little sideport actually helped igp boards...
    it gives them a perf boost of 10-25%... so i don't think memory bandwidth is such a big issue for the igp...
    But that is because Sideport gives overkill memory bandwidth to a very small GPU. The Llano GPU is not small, and it is also sharing that bandwidth. Though I take it from an informal comment in the other thread that the IMC was "THE KEY" component, I actually think that it must be very efficient if it wants to be successful.

  5. #80
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by zir_blazer View Post
    But that is because Sideport gives overkill memory bandwidth to a very small GPU. The Llano GPU is not small, and it is also sharing that bandwidth. Though I take it from an informal comment in the other thread that the IMC was "THE KEY" component, I actually think that it must be very efficient if it wants to be successful.
    IIRC we heard from Sam Naffziger that the IMC will be optimized for that task. But that would only be to keep internal latencies low and not waste bandwidth while transferring data between the GPU and DDR3 or the CPU cores.

    But what we're missing here is a diagram with a profile of memory interface usage by the cores and the GPU while doing graphics. On a per-frame basis, I assume it would look like this (see the sketch below):
    1. CPU cores calculating CPU part of scene data (CPU uses IMC)
    2. CPU initiates transfer of shader code, vertices, texture addresses etc. to GPU (IMC mostly used by GPU)
    3. GPU renders frame data, fetches textures and vertices from external RAM and the latter also partly from CPU caches (because they have been processed there) (IMC mostly used by GPU)
    4. store modified frame data (in local cache or external RAM) and loop to 2. if not done for this frame (IMC mostly used by GPU)

    The CPU will surely already be working on the next frame (e.g. doing physics calculations), but it has L2 caches to stay local during most cycles.
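    A minimal sketch of that assumed per-frame profile, just to make the step-by-step explicit (the phase descriptions and the dominant IMC client are my own reading of the list above):

    ```python
    # Sketch of the per-frame flow assumed above, tagging which side dominates
    # the IMC in each phase. Purely illustrative.

    FRAME_PHASES = [
        ("CPU cores calculate the CPU part of the scene data",              "CPU cores"),
        ("CPU initiates transfer of shader code, vertices, texture addrs",  "mostly GPU"),
        ("GPU renders, fetching textures/vertices from RAM (some straight "
         "from CPU caches)",                                                "mostly GPU"),
        ("GPU stores modified frame data, loops to step 2 if not done",     "mostly GPU"),
    ]

    def profile_frame(frame_no):
        print(f"--- frame {frame_no} ---")
        for step, (work, imc_user) in enumerate(FRAME_PHASES, start=1):
            print(f"{step}. {work}  [IMC: {imc_user}]")
        # Meanwhile the CPU already works on frame N+1 (e.g. physics),
        # staying mostly inside its L2 caches.

    profile_frame(1)
    ```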
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  6. #81
    Registered User
    Join Date
    Mar 2010
    Posts
    11

  7. #82
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    hmmm with the igp integrated, wouldn't there be a HUGE boost from larger L3 caches?
    provided the algorithms are up to the task of handling data that both cpu and gpu write and read, then this would be a huge shortcut compared to handling this in the memory...
    it would work around the memory latency and bandwidth limitations... but i think you'd need a BIIIIG l3 cache for that, right?

    Quote Originally Posted by zir_blazer View Post
    But that is because Sideport gives overkill memory bandwidth to a very small GPU. The Llano GPU is not small, and it is also sharing that bandwidth. Though I take it from an informal comment in the other thread that the IMC was "THE KEY" component, I actually think that it must be very efficient if it wants to be successful.
    you're right... wow, i didn't know llano would have that many shaders... :o
    12x that of the 880/890 chipset? geez...

    880/890=40shaders
    llano=480shaders

    that's more than a 5670... and a 5670 has 64 GB/s bandwidth...
    hmmmm if amd is smart they will use sideport on the cpu package...
    think of how much a single or maybe 2 gddr5 chips on the package could do
    then all they would need the L3 cache for is to buffer cpu to gpu traffic...
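    rough napkin math on the bandwidth-per-shader gap (my own estimate, assuming the 5670's 400 SPs and the dual-channel DDR3-1600 figure from earlier in the thread; theoretical peaks only):

    ```python
    # Napkin math: theoretical peak bandwidth per shader.

    hd5670_sps, hd5670_bw = 400, 64.0   # GDDR5, 128-bit, dedicated to the GPU
    llano_sps,  llano_bw  = 480, 25.6   # dual-channel DDR3-1600, shared with 4 CPU cores

    print(f"HD 5670: {hd5670_bw / hd5670_sps * 1000:.0f} MB/s per SP (dedicated)")
    print(f"Llano:   {llano_bw  / llano_sps  * 1000:.0f} MB/s per SP (before the CPU cores take their cut)")
    ```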

    thanks for the link
    but that's a stupid decision from amd if true...
    they have to write the l1 and l2 caches to SYSTEM MEMORY before they can C6?
    that's really stupid...
    why don't they power gate the core only and leave the L1 and L2 active? it's not like L1 and L2 consume that much power...
    that way the cores could c6 while their caches are still available for other cores, and the core doesn't have to load data from memory when powering on again...

  8. #83
    Xtreme Enthusiast
    Join Date
    Feb 2005
    Posts
    970
    It's getting close now. I'll just quote a post from S|A since it's easier.

    Fusion, coming soon: http://sites.amd.com/us/fusion/apu/P...ion.aspx

    Fusion whitepaper: http://sites.amd.com/us/Documents/48...epaper_WEB.pdf

    A bit about the system bus and memory controller:

    Quote: The key aspect to note is that all the major system elements – x86 cores, vector (SIMD) engines, and a Unified Video Decoder (UVD) for HD decoding tasks – attach directly to the same high speed bus, and thus to the main system memory. This design concept eliminates one of the fundamental constraints that limits the performance of traditional integrated graphics controllers (IGPs).

    and ...

    Quote: Although the APU's scalar x86 cores and SIMD engines share a common path to system memory, AMD's first generation implementations divide that memory into regions managed by the operating system running on the x86 cores and other regions managed by software running on the SIMD engines. AMD provides high speed block transfer engines that move data between the x86 and SIMD memory partitions. Unlike transfers between an external frame buffer and system memory, these transfers never hit the system's external bus. Clever software developers can overlap the loading and unloading of blocks in the SIMD memory with execution involving data in other blocks. Insight 64 anticipates that future APU architectures will evolve towards a more seamless memory management model that allows even higher levels of balanced performance scaling.
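    For what it's worth, here is a loose sketch of what that partitioned model sounds like from the programmer's side (entirely my own reading of the quoted paragraph; the class and function names are hypothetical, not an actual AMD API):

    ```python
    # Hypothetical model of the first-gen APU memory split described above:
    # one physical DDR3 pool carved into an x86 partition (OS-managed) and a
    # SIMD partition (managed by software on the SIMD engines), with block
    # transfer engines copying between them without touching any external bus.

    class ApuMemoryModel:
        def __init__(self, total_gb, simd_gb):
            self.x86_partition_gb  = total_gb - simd_gb   # OS-managed region
            self.simd_partition_gb = simd_gb              # SIMD-managed region

        def block_transfer(self, block_mb, direction="x86->simd"):
            # On-die block transfer engine: the copy goes through the shared IMC
            # only, so no PCIe/HT crossing is involved.
            return {"bytes": block_mb * 2**20, "direction": direction, "external_bus": False}

    mem = ApuMemoryModel(total_gb=4, simd_gb=0.5)
    job = mem.block_transfer(64, "x86->simd")   # stage 64 MB of input for the SIMD engines
    # A clever scheduler would overlap this copy with SIMD execution on other blocks.
    ```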

  9. #84
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Location
    Buenos Aires, Argentina
    Posts
    644
    The Links aren't working. Fixed Links: Webpage, PDF

    Llano could be one of the most interesting pieces of hardware in quite a long time and could OWN the entire OEM and notebook market given the right price, and as I stated several times, it could eat the Socket AM3 platform market ("mainstream" but overall looks better than the current enthusiast-class Thuban) at least until Bulldozer's arrival. However, I won't declare it an instant winner until I see the Sandy Bridge GPU in action.

  10. #85
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by flippin_waffles View Post
    It's getting close now. I'll just quote a post from S|A since it's easier.
    It seems that the first Fusion implementation is even less efficient than I thought. So to exchange data between the CPU and GPU you need to move data from one memory location to another.

  11. #86
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    I'd rather wait and see how it works in practice.

  12. #87
    I am Xtreme
    Join Date
    Jul 2007
    Location
    Austria
    Posts
    5,485
    Quote Originally Posted by kl0012 View Post
    It seems that the first Fusion implementation is even less efficient than I thought. So to exchange data between the CPU and GPU you need to move data from one memory location to another.
    Hmm, I don't know, the whitepaper is not really conclusive about what AMD means by "attach directly to the same high speed bus, and thus to the main system memory."

    To me it seems they are saying the SIMD array (aka the IGP) is still connected to the memory controller/CPU over an HT link, but it's an on-die HT link, which of course eliminates quite a bit of latency.
    Seems like Clarkdale, just one step down (from package level to die level).

    I'm curious how exactly it's implemented; time will tell.

  13. #88
    Registered User
    Join Date
    Sep 2009
    Posts
    77
    Hmmm, it seems that the multiple SIMD arrays are connected to the crossbar directly?
    Last edited by superrugal; 05-07-2010 at 12:19 PM.

  14. #89
    Xtreme Member
    Join Date
    Nov 2006
    Posts
    324
    I highly doubt the 480 shaders...
    But for an APU they don't need many shaders to start with.

    We already heard everywhere that the first generation of Llano comes this year? That document says the AMD Fusion family of Accelerated Processing Units is scheduled to arrive in 2011.
    They also talk here about APU-accelerated programs with GPU assistance, so the first Llano with K8 cores will be able to do that as well. I guess it is the same way it can be done now with separate graphics chips.
    Windows 8.1
    Asus M4A87TD EVO + Phenom II X6 1055T @ 3900MHz + HD3850
    APUs

  15. #90
    Xtreme Mentor
    Join Date
    Jul 2008
    Location
    Shimla , India
    Posts
    2,631
    "attach directly to the same high speed bus, and thus to the main system memory."

    By that they mean that a multi-directional HT bus connects the parts together ("HT1") and then another link is used to connect to the memory ("HT2").
    Coming Soon

  16. #91
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by Hornet331 View Post
    Hmm, I don't know, the whitepaper is not really conclusive about what AMD means by "attach directly to the same high speed bus, and thus to the main system memory."

    To me it seems they are saying the SIMD array (aka the IGP) is still connected to the memory controller/CPU over an HT link, but it's an on-die HT link, which of course eliminates quite a bit of latency.
    Seems like Clarkdale, just one step down (from package level to die level).

    I'm curious how exactly it's implemented; time will tell.
    they share the same memory controller but through software they partition memory. the gpu can't perform computations on the cpu's memory partition or vice versa. this is probably the simplest way of doing things, although the memory controller will be an issue. the gpu is throughput optimized and the memory controller on them is designed with advanced arbitration logic to prioritize dram r/w's for the SIMDs. it's like machine learning in hardware.
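    just to illustrate the arbitration idea (my own toy example, not how Llano's IMC actually works): the throughput-oriented GPU can get more scheduling slots per round than the latency-oriented CPU cores without starving them.

    ```python
    from collections import deque

    # Toy DRAM arbiter: GPU (SIMD) requests get more slots per round than CPU
    # requests, but the CPU is never starved. Purely illustrative.

    def arbitrate(cpu_reqs, gpu_reqs, gpu_slots=3, cpu_slots=1):
        cpu, gpu, order = deque(cpu_reqs), deque(gpu_reqs), []
        while cpu or gpu:
            for _ in range(gpu_slots):
                if gpu:
                    order.append(gpu.popleft())
            for _ in range(cpu_slots):
                if cpu:
                    order.append(cpu.popleft())
        return order

    print(arbitrate(["cpu_rd0", "cpu_wr1"], ["simd_rd0", "simd_rd1", "simd_rd2", "simd_rd3"]))
    ```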

  17. #92
    I am Xtreme
    Join Date
    Dec 2007
    Posts
    7,750
    well 480 SPs is a lot, and i'm sure they did a lot of math to know if it was going to be too much, or if they should have started with like half that. but i have to assume they knew what they were getting into, and that 480 wouldn't be memory starved or going to waste.

  18. #93
    Xtreme Enthusiast
    Join Date
    Feb 2009
    Posts
    800
    Quote Originally Posted by zir_blazer View Post
    The Links aren't working. Fixed Links: Webpage, PDF

    Llano could be one of the most interesting pieces of hardware in quite a long time and could OWN the entire OEM and notebook market given the right price, and as I stated several times, it could eat the Socket AM3 platform market ("mainstream" but overall looks better than the current enthusiast-class Thuban) at least until Bulldozer's arrival. However, I won't declare it an instant winner until I see the Sandy Bridge GPU in action.
    Cool flash animation, though it crashed the first time (Firefox beta hooray, only the plugin crashed)

  19. #94
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    The Fusion IMC is the key component of the Llano design; it will be enough for the integrated GPU and the x86 cores. There's some clever engineering going on inside that IMC and the chip as a whole.

  20. #95
    I am Xtreme
    Join Date
    Jul 2007
    Location
    Austria
    Posts
    5,485
    Quote Originally Posted by Chumbucket843 View Post
    they share the same memory controller but through software they partition memory. the gpu can't perform computations on the cpu's memory partition or vice versa. this is probably the simplest way of doing things, although the memory controller will be an issue. the gpu is throughput optimized and the memory controller on them is designed with advanced arbitration logic to prioritize dram r/w's for the SIMDs. it's like machine learning in hardware.
    That's not really the issue; my question was more about how it's connected to the IMC. Intel's solution for Clarkdale was to bundle the IGP with the IMC and access the IMC from the CPU through the high-bandwidth QPI link, all on the same package, which already cut the latency quite a bit.

    And from what I'm reading right now in AMD's whitepaper, it looks like they'll do it the opposite way plus on a deeper level, but still through a bus (HT) and not a direct connection.

  21. #96
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    i am sort of confused by what you are saying.
    the SIMDs in the GPU and the x86 processors in the CPU are both "cores" and can request dram access. there is no reason for a QPI/HT link.


    Until now, transistor budget constraints typically mandated a two chip solution for such systems, forcing system architects to use a chip-to-chip crossing between the memory controller and either the CPU or GPU as shown in Figure 3. These transfers affect memory latency, consume system power and thus impact battery life. The APU's scalar x86 cores and SIMD engines share a common path to system memory to help avoid these constraints.

  22. #97
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by Hornet331 View Post
    Hmm, I don't know, the whitepaper is not really conclusive about what AMD means by "attach directly to the same high speed bus, and thus to the main system memory."

    To me it seems they are saying the SIMD array (aka the IGP) is still connected to the memory controller/CPU over an HT link, but it's an on-die HT link, which of course eliminates quite a bit of latency.
    Seems like Clarkdale, just one step down (from package level to die level).

    I'm curious how exactly it's implemented; time will tell.
    It seems that there is no need for HT this time. The Fusion IGP (just like a normal IGP or even a discrete GPU) has its own dedicated memory region and has no direct connection with the CPU cores; that is, data exchange between them goes through transferring data from one memory location to another. It seems that all the innovation lies in the fact that now the GPU is on the same die as the CPU and shares a common channel to memory with it.

  23. #98
    Xtreme Addict
    Join Date
    Jun 2007
    Location
    Thessaloniki, Greece
    Posts
    1,307
    Quote Originally Posted by kl0012 View Post
    It seems that there is no need for HT this time. The Fusion IGP (just like a normal IGP or even a discrete GPU) has its own dedicated memory region and has no direct connection with the CPU cores; that is, data exchange between them goes through transferring data from one memory location to another. It seems that all the innovation lies in the fact that now the GPU is on the same die as the CPU and shares a common channel to memory with it.
    read again
    AMD provides high speed block transfer engines that move data between the x86 and SIMD memory partitions. Unlike transfers between an external frame buffer and system memory, these transfers never hit the system’s external bus.
    Seems we made our greatest error when we named it at the start
    for though we called it "Human Nature" - it was cancer of the heart
    CPU: AMD X3 720BE@ 3,4Ghz
    Cooler: Xigmatek S1283(Terrible mounting system for AM2/3)
    Motherboard: Gigabyte 790FXT-UD5P(F4) RAM: 2x 2GB OCZ DDR3 1600Mhz Gold 8-8-8-24
    GPU:HD5850 1GB
    PSU: Seasonic M12D 750W Case: Coolermaster HAF932(aka Dusty )

  24. #99
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by BrowncoatGR View Post
    read again
    Why do you think I didn't read it? Besides, I mentioned "shared" memory access, which automatically means that "those transfers" won't go through the external bus. But anyway, the data path from CPU to GPU remains long: cpu -> cpu_memory -> gpu_memory -> gpu. Not much better than a current IGP.
    Last edited by kl0012; 05-07-2010 at 10:48 PM.

  25. #100
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Location
    Buenos Aires, Argentina
    Posts
    644
    I suppose that this refers to command communication between the CPU and GPU, basically, when the CPU tells the GPU what to do. When that is happening, it should not use the IMC, HyperTransport, or anything else, because it is on the same piece of silicon. However, when the GPU goes to request the bulk of the data, it is going to access its VRAM (shared in the RAM) anyway, so you are basically cutting the CPU-to-GPU coordination latency. Not having to use the external bus at all would mean that it has a sort of huge cache or buffers to store tons of info so it does not have to actually access the external bus, or that the CPU uploads data to the GPU for processing in real time instead of storing it in VRAM and then telling the GPU to go retrieve it.
    In current platforms the topology works like this (see the sketch after this list):

    _Northbridge IGP without Sideport (shared RAM): The CPU uploads graphics data to the shared RAM at one hop using the IMC, and also commands the GPU one hop away using HyperTransport. The IGP has to travel two hops, using HyperTransport then the CPU IMC, to retrieve the data.
    _Northbridge IGP with Sideport (exclusive): The CPU uploads data to the exclusive IGP VRAM that is two hops away, using HyperTransport to the Northbridge then the mini IMC that it should have to manage the Sideport. It still commands the GPU one hop away, and the GPU has its VRAM one hop away, too.
    _Northbridge IGP with Sideport AND shared RAM: You're using both the Sideport and some shared RAM as VRAM, which means extra overhead. The topology depends on what part of the VRAM the CPU is uploading to and where the GPU is retrieving it from, but basically it is the previous two cases at the same time.
    _PCIe video card: The GPU is two hops away from the CPU (HyperTransport to the Northbridge, then the PCIe bus to the video card), and the VRAM is at three (adding the GPU IMC). Nuff' said.
    _PCIe video card with TurboCache/HyperMemory: Similar to the Northbridge IGP with Sideport AND shared RAM, as the GPU also has some VRAM of its own, except that the GPU is two hops away (HyperTransport, PCIe bus), its exclusive VRAM at three (add the GPU IMC to the previous path), and the shared one at one hop from the CPU (using the CPU IMC) but three from the GPU (PCIe, HyperTransport, processor IMC). But the concept is the same.
    _Fusion: The CPU commands the GPU internally, with the lowest possible latency at 0 hops. You are using shared RAM at just one hop through the IMC, so whatever either the CPU or the GPU wants to access has to go through the same bus. Possibly the most important improvement would be that data to process is uploaded directly from the CPU to the GPU in real time, instead of the CPU just telling it where in RAM the data has been placed, in which case the GPU would have to retrieve it from the VRAM.
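    And a compact tabulation of those hop counts, just to line them up side by side (my own encoding of the numbers in the list above; the TurboCache/HyperMemory entry shows only the shared-RAM part):

    ```python
    # Hop counts from the list above: CPU->GPU command path, CPU->VRAM, GPU->VRAM.
    # "VRAM" is wherever the graphics data actually lives in that topology.

    topologies = {
        "NB IGP, shared RAM only":             {"cpu->gpu": 1, "cpu->vram": 1, "gpu->vram": 2},
        "NB IGP, Sideport only":               {"cpu->gpu": 1, "cpu->vram": 2, "gpu->vram": 1},
        "PCIe card, local VRAM":               {"cpu->gpu": 2, "cpu->vram": 3, "gpu->vram": 1},
        "PCIe card, TurboCache (shared part)": {"cpu->gpu": 2, "cpu->vram": 1, "gpu->vram": 3},
        "Fusion (Llano)":                      {"cpu->gpu": 0, "cpu->vram": 1, "gpu->vram": 1},
    }

    for name, hops in topologies.items():
        print(f"{name:38s} {hops}  total={sum(hops.values())}")
    ```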
