
Thread: Can Llano do AVX?

  1. #101
    Registered User
    Join Date
    Nov 2008
    Posts
    28
    The best solution would be to sell the APU soldered onto a board w/o commodity dimms etc; instead, use soldered GDDR5 on a wide bus for > 100 GB/s of bandwidth for both CPU and GPU. There's no need for most consumers to have commodity dimms on their machines. At this level of integration, the whole system could be designed like a graphics card is today.
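To put rough numbers on the bandwidth gap being discussed (a quick back-of-the-envelope in Python; the bus widths and transfer rates below are illustrative picks, not Llano specs):

Code:
    # Peak bandwidth = bus width in bytes * transfer rate.
    def peak_bw_gbs(bus_bits, transfers_per_sec):
        return (bus_bits / 8) * transfers_per_sec / 1e9

    print(peak_bw_gbs(128, 1.866e9))  # 128-bit DDR3-1866:       ~29.9 GB/s
    print(peak_bw_gbs(128, 4.0e9))    # 128-bit GDDR5 @ 4 GT/s:   ~64 GB/s
    print(peak_bw_gbs(256, 4.0e9))    # 256-bit GDDR5 @ 4 GT/s:  ~128 GB/s

So the > 100 GB/s figure essentially implies a graphics-card-style 256-bit (or faster) GDDR5 setup.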

  2. #102
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    There is a paper about the mem bandwidth required by games:
    http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  3. #103
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Posts
    644
    Quote Originally Posted by Dresdenboy View Post
    There is a paper about the mem bandwidth required by games:
    http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
What I take from it is that 2005-2006 games sent around 60 MB/s of data to the GPU (with Oblivion being the lone exception, topping out at 142 MB/s), while the GPU itself used around 11 GB/s of memory bandwidth from the VRAM. I'm not sure how a similar study would look these days, considering that GPUs and game engines have evolved quite a bit, but that slightly old study shows a pattern.
So I suppose that basically means games benefit almost exclusively from the lower access latency when overclocking the VRAM, and little to nothing from the excess memory bandwidth.

  4. #104
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by zir_blazer View Post
_Fusion: the CPU commands the GPU internally, with the lowest possible latency at 0 hops. You are using shared RAM at just one hop through the IMC, so whatever either the CPU or GPU wants to access has to go through the same bus. Possibly the most important improvement would be that data to process is uploaded directly from the CPU to the GPU in real time, instead of the CPU just telling it where in RAM the data has been placed, in which case it would have to retrieve it from the VRAM.
1. The CPU sends the "command buffer" to the GPU through memory (and not through I/O ports). All CPU I/O (non-memory) operations are slow by nature; they are not cached.
2. The RAM of the "Fusion" CPU (at least for its first version) isn't shared between the GPU and CPU. Only the memory controller is shared, and the RAM is divided into regions, each of which is dedicated to either the CPU or the GPU. So in order to exchange data between the CPU and GPU you need to copy it from one memory region to another. This is probably not too efficient.
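As a sketch of what that partitioned arrangement implies for software (purely conceptual, not AMD's actual driver code), the hand-off still ends up as a copy between two areas behind the same controller:

Code:
    # Conceptual model: one physical pool behind a shared IMC, but two
    # dedicated regions, so a CPU->GPU hand-off still costs a copy.
    cpu_region = bytearray(64 * 1024 * 1024)   # region reserved for the CPU
    gpu_region = bytearray(64 * 1024 * 1024)   # region reserved for the GPU

    def upload_to_gpu(src_off, dst_off, length):
        # Data prepared in the CPU region must be copied into the GPU region
        # before the GPU may consume it (a read plus a write on the same bus).
        gpu_region[dst_off:dst_off + length] = cpu_region[src_off:src_off + length]

    upload_to_gpu(0, 0, 4096)   # e.g. hand a 4 KB block of vertices to the GPU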

  5. #105
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by zir_blazer View Post
What I take from it is that 2005-2006 games sent around 60 MB/s of data to the GPU (with Oblivion being the lone exception, topping out at 142 MB/s), while the GPU itself used around 11 GB/s of memory bandwidth from the VRAM. I'm not sure how a similar study would look these days, considering that GPUs and game engines have evolved quite a bit, but that slightly old study shows a pattern.
So I suppose that basically means games benefit almost exclusively from the lower access latency when overclocking the VRAM, and little to nothing from the excess memory bandwidth.
    The estimation was done for 1024x768 screen resolution. For 1920x1200 with high texture quality you would probably need much higher bandwidth.
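A quick scaling estimate (assuming the traffic grows roughly with pixel count, which ignores texture detail, AA and overdraw, so treat it as a lower bound):

Code:
    # Naive resolution scaling of the paper's figures.
    scale = (1920 * 1200) / (1024 * 768)   # ~2.93x more pixels
    print(scale * 11)    # ~32 GB/s of GPU<->VRAM traffic
    print(scale * 60)    # ~176 MB/s of CPU->GPU traffic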

  6. #106
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by kl0012 View Post
    2. The RAM of "fusion" cpu (at least for its first version) isn't shared between GPU and CPU. Only a memory controller is shared and the RAM is divided into regions each of which is dedicated to CPU or GPU. So in order to exchange data between CPU and GPU you need to copy data from one mem region to another. This is probably not too efficient.
It sounds like the efficient part of this story is that the CPU doesn't have to do the copying by executing code. In fact it could be powered off while the copying takes place. That's what the Fusion paper suggests (at least to me). What if the command blocks (a few MB per frame in total) are sent in packets of a few hundred KB? These could still be residing in the L2 cache and fetched from there. The copying itself has to be initiated by the graphics driver, which in Llano's case should be able to program/control the IMC accordingly.

    Textures are a different story, but they don't have to be copied per frame from CPU to GPU.
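To put rough numbers on that command traffic (my own arithmetic, using the "few MB per frame" figure above and an assumed packet size):

Code:
    # Command-stream budget: a few MB per frame at 60 fps, split into packets.
    mb_per_frame = 3          # assumed, per the "few MB" estimate
    packet_kb = 256           # assumed packet size of a few hundred KB
    print(mb_per_frame * 60)                 # ~180 MB/s of command traffic
    print(mb_per_frame * 1024 // packet_kb)  # ~12 packets per frame, L2-sized

That is tiny next to the GB/s the shaders need for textures and render targets, which fits the earlier point that the command path isn't where the bandwidth pressure is.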
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  7. #107
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by Dresdenboy View Post
It sounds like the efficient part of this story is that the CPU doesn't have to do the copying by executing code. In fact it could be powered off while the copying takes place. That's what the Fusion paper suggests (at least to me).
I'm not sure the CPU is doing anything while a data transfer from main memory to VRAM takes place in the current scheme; a DMA engine is probably responsible for it. The only difference from current solutions is a memory transfer that doesn't hit an external bus (PCIe in the case of discrete graphics, or HT in the case of integrated graphics).

What if the command blocks (a few MB per frame in total) are sent in packets of a few hundred KB? These could still be residing in the L2 cache and fetched from there. The copying itself has to be initiated by the graphics driver, which in Llano's case should be able to program/control the IMC accordingly.
Textures are a different story, but they don't have to be copied per frame from CPU to GPU.
I think it would be possible if AMD implemented some on-die "command buffer" memory which can be mapped by the CPU. Currently the GPU reads commands from a special region in system memory which is mapped as I/O to prevent caching.
    Last edited by kl0012; 05-10-2010 at 03:34 AM.

  8. #108
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by zir_blazer View Post
Basically, that would place it slightly above the Radeon 5570/5670 but a fair bit below the 5750, those having 400 and 720 SPs respectively. That means we can speculate quite accurately about Fusion GPU performance, with two exceptions: having the GPU directly connected to the CPU on the same piece of silicon basically eliminates their communication latency, which is a benefit; on the other hand, how much of an impact will it have that the GPU shares memory bandwidth with the other cores, at a higher latency than a video card's own VRAM? Well, that is all that is left to know about the Fusion GPU besides actual numbers.
Now... what do we know about Sandy Bridge? Do we have even a remote idea of its performance? Otherwise, I would still hold off until more info surfaces. The worst thing you can do is declare yourself I N V I N C I B L E and get owned before you finish saying the classic sentence.

BTW... where the hell is Hans de Vries? His input would be useful in this thread after so many days.
    Hans is still wondering what we can conclude from this new info....

    http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx
    http://sites.amd.com/us/Documents/48...epaper_WEB.pdf

    It's easier to talk about what I would like to see in it.
    Nathan Brookwood talks all about HPC/GPGPU applications
    and a little bit about the architecture.

    The all important question is still: Does Llano have a GDDR5 sideport?

    It seems to be required for both HPC and graphics since for SPEC_FP_rate
    we see the 128 bit bus as a real bottleneck already for Istanbul, let alone
    for a GPGPU, and for graphics we can see how much faster the Radeon HD
    5670 is compared to the Radeon HD 5570 by using GDDR5 instead of GDDR3.
    (http://www.anandtech.com/show/2935)

    The only hint for sideport GDDR5 memory is the statement about the
    autonomous data transfer (presumably scatter/gather DMA controllers)
    between "CPU" and "GPU" based memory. If the GPU memory is physically
in the DDR3 DIMMs then you can just allocate the required amount as
a non-cacheable area to store game data.

In the context of Nathan's white paper you'd expect AMD to port all
its math and HPC libraries to the GPGPU via ACML-GPU, see:
    (http://developer.amd.com/gpu/acmlgpu/Pages/default.aspx)
    Here you need a shell which copies portions of the large data structures
    from CPU memory (which can be dozens of GByte) to the smaller GDDR5
    memory of the GPGPU (1/2 to 2 GByte) for high bandwidth, high throughput
    processing. The DMA units should be able to double the copy bandwidth
    over software copying.
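A sketch of the kind of "shell" described above (hypothetical, not ACML-GPU's actual interface): stream a host-resident array through the smaller GDDR5 pool in tiles, with the DMA engines doing the copies:

Code:
    # Hypothetical tiling shell: the dataset (tens of GB) lives in host DDR3,
    # each working tile fits in the GPU's 0.5-2 GB of GDDR5.
    def process_in_tiles(dataset, tile_elems, dma_in, gpu_kernel, dma_out):
        # dma_in/dma_out and gpu_kernel stand in for driver/library calls.
        results = []
        for start in range(0, len(dataset), tile_elems):
            tile = dataset[start:start + tile_elems]
            gpu_buf = dma_in(tile)          # host DDR3 -> GDDR5 (DMA engine)
            out = gpu_kernel(gpu_buf)       # high-bandwidth crunching on the GPU
            results.extend(dma_out(out))    # GDDR5 -> host DDR3
        return results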

Now, the Llano die photo doesn't show a sideport memory interface, which
should be 128 bits wide IMO. We also know that not the whole die is shown,
so I'm not sure. I'm sure the ATI guys would fight very hard to get one
on the die, and rightly so, but there is no proof yet. If not, then we
just have to be satisfied with the latest Sep 2009 JEDEC file which
standardizes DDR3-1866 and DDR3-2133...
    http://www.jedec.org/standards-docum...ocs/jesd-79-3d


    Regards, Hans
    Last edited by Hans de Vries; 05-10-2010 at 07:09 AM.

  9. #109
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
if you look at things from a bigger perspective... then im sure itll have sideport...
amd has been talking about fusion for a long time, and they started sideport back in the 6xx igp days already... shortly after they took over ati...
that was with 40 and 80 shader cores, and 64bit... and even though it didnt give a huge boost and some mainboard and laptop makers actually opted NOT to use sideport...
they continued to support it in new chipsets

and now we are talking about 480 shader cores!
and no sideport at all?
it would be really odd and a step back NOT to have sideport on llano...
and the fact alone that they dont talk about having or not having sideport is a hint that they are at least considering it imo...
sometimes the most important things are what somebody DOESNT say, not what he does say

  10. #110
    Xtreme Member
    Join Date
    Oct 2007
    Location
    Sweden
    Posts
    127
    Quote Originally Posted by Hans de Vries View Post
    .............If not then we just have to be satisfied with the latest Sep 2009 JEDEC
    file which standardizes DDR3-1866 and DDR3-2166...
    http://www.jedec.org/standards-docum...ocs/jesd-79-3d


    Regards, Hans
    Fetching that paper needs registration so I need to ask, is that really DDR3-2166
    and not DDR3-2133?

Personally I would have preferred that they moved in bigger steps this time, that
is to DDR-2000 and DDR-2400, before moving on to a future DDR4. The upside of
faster memory feels increasingly marginal to me these days.

  11. #111
    Xtreme Addict
    Join Date
    Apr 2006
    Location
    City of Lights, The Netherlands
    Posts
    2,381
Llano's die size doesn't seem big enough to support a sideport IMHO. I mean, Llano is only around 160 to 170 mm² and it needs to have the pin-out for the 128-bit DDR3 memory bus, around 20 PCIe lanes and the display connections. I don't think we'll be seeing a sideport on Llano.
    "When in doubt, C-4!" -- Jamie Hyneman

    Silverstone TJ-09 Case | Seasonic X-750 PSU | Intel Core i5 750 CPU | ASUS P7P55D PRO Mobo | OCZ 4GB DDR3 RAM | ATI Radeon 5850 GPU | Intel X-25M 80GB SSD | WD 2TB HDD | Windows 7 x64 | NEC EA23WMi 23" Monitor |Auzentech X-Fi Forte Soundcard | Creative T3 2.1 Speakers | AudioTechnica AD900 Headphone |

  12. #112
    Registered User
    Join Date
    Nov 2008
    Posts
    28
I think AMD is more focused on low cost and smooth execution with this first generation of Fusion, so it probably won't have a more expensive and complicated arrangement with a sideport. It would make sense to solder the APU and memory onto a single board or package in future iterations though, since this is intended to be a stand-alone consumer product. Why not integrate an entire consumer-level system into a unit like today's graphics cards, and keep the modularity and upgradeability for servers?

  13. #113
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by Kej View Post
    Fetching that paper needs registration so I need to ask, is that really DDR3-2166
    and not DDR3-2133?
    Right, corrected. It's all in steps of 266 MHz

    Regards, Hans

  14. #114
    Xtreme Member
    Join Date
    Sep 2008
    Posts
    235
    Quote Originally Posted by Helmore View Post
    Llano's die size doesn't seem big enough to support a sideport IMHO. I mean, Llano is only around 160 to 170 mm² and it needs to have the pin-out for the 128-bit DDR3 memory bus, around 20 PCIe channels and the display connections. I don't think we'll be seeing a sideport on Llano.
    You may be referring to the smaller die which they showed at the Nov.
    2009 analyst meeting which was cut off through the 128 bit DDR3 IMC
    and one of the PCIe interfaces.

The larger one shown at the notebook meeting is ~205 mm^2, but it's
possibly incomplete as well. It seems to be a partial die; also, there are
no display I/O cells visible.

    Llano has ~ 1 billion transistors according to an AMD slide which places
    it in the same league as a six core Westmere with 1.17 Billion transistors.
    From the transistors one would expect something like 320 to 400 Shader
    Processor units which is a lot.

The ATI Redwood die, which has 400 SP units, has a die size of 104 mm^2
for 627 million transistors, including a full 128-bit GDDR5 IMC, on TSMC 40nm.


    Regards, Hans

  15. #115
    Xtreme Cruncher
    Join Date
    Apr 2005
    Location
    TX, USA
    Posts
    898
    Just got around to reading that whitepaper and wading through the marketing talk for any real information :X

    The only snippet I could find was that part mentioning the memory handling, as already discussed. My thoughts are:
1) Memory transfers between the CPU and "SIMD Arrays" are done through block transfer engines, so essentially a DMA controller in concept. I assume that's technically how CPU/GPU communication is already done (DMA handles the PCIe data chunk shuffling), or should be..., except this would be more tightly coupled and not go across the external bus (I assume this is how they refer to the main PCIe/HT buses).
2) Memory is shared, yet divided/partitioned between the CPUs and SIMD, so maybe some sort of virtual memory scheme? It could be done such that a transfer between CPU/SIMD wouldn't have to modify main memory at all and could just deal with virtual address mapping (see the toy sketch below). Caching becomes an interesting point too: having it shared might help if you're feeding data that's freshly touched/produced by the CPU, but if it's a large data set just sitting in main memory waiting to be processed it wouldn't help much. Somewhere in here the sideport could be thrown in too, since things would be abstracted by the virtual address management.
3) The part about not needing to touch the external bus, I think, must just be referring to everything being kept within the CPU die and not touching the HT bus, since this will be tied directly into the memory controller/bus/crossbar, however you want to describe it. The only situation where it should even need to go out to the HT bus would be an SMP implementation, which afaik Llano isn't reaching for...
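To make point 2 concrete, here's a toy model of the two hand-off styles (my own illustration, not anything from the whitepaper): a physical copy moves every byte across the memory bus, while a remap-style hand-off only changes which side the pages belong to:

Code:
    # Toy model: physical memory as pages, plus an ownership table.
    pages = [bytearray(4096) for _ in range(1024)]
    owner = ["cpu"] * len(pages)   # which side each page is dedicated to

    def handoff_by_copy(src, dst):
        # Classic scheme: the data crosses the memory bus twice (read + write).
        pages[dst][:] = pages[src]

    def handoff_by_remap(page_indices):
        # Speculative virtual-memory scheme: no bytes move, only the mapping
        # (here just an ownership flag) changes, so main memory isn't touched.
        for i in page_indices:
            owner[i] = "gpu"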


    Quote Originally Posted by saaya View Post
    hmmm with the igp integrated, wouldnt there be a HUGE boost from larger L3 caches?
provided the algorithms are up to the task of handling data that both cpu and gpu write and read, then this would be a huge shortcut compared to handling this in the memory...
    it would work around the memory latency and bandwidth limitations... but i think youd need a BIIIIG l3 cache for that, right?
    <snip>
    then all they would need the L3 cache for is to buffer cpu to gpu traffic...

    thanks for the link
    but thats a stupid decision from amd if true...
    they have to write the l1 and l2 caches to SYSTEM MEMORY before they can C6?
    thats really stupid...
    why dont they power gate the core only and let the L1 and L2 active? its not like L1 and L2 consume that much power...
    that way the cores could c6 while its caches are still available for other cores, and the core doesn't have to load data from memory when powering on again...
    Been really busy the last week, but to answer your question about high/low performance transistors and leakage from earlier:
Yes, if, say, you design for 2x the clock speed and use higher-speed transistors, your leakage will increase a bare minimum of 2x (on modern processes), and in all honesty probably very much more than that. I don't have any empirical data on hand here, but tweaking the threshold voltage for more performance/current has an exponential effect on sub-threshold leakage if you look straight at the equation. This ignores the fact that going for 2x clock would require better pipelining/logic redesign, which in itself would almost certainly involve higher dynamic/static power due to additional gates/clock routing, since high-performance transistors aren't nearly that much better.
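To sanity-check that "exponential" claim numerically (textbook subthreshold model and a slope factor of n ≈ 1.5 assumed, not numbers for any specific process):

Code:
    import math

    # Subthreshold leakage grows exponentially as Vth is lowered:
    #   I_leak ~ exp(-Vth / (n * kT/q))
    kT_q = 0.026   # thermal voltage at room temperature, ~26 mV
    n = 1.5        # assumed subthreshold slope factor

    def leakage_increase(delta_vth_mv):
        # Factor by which leakage rises when Vth is reduced by delta_vth_mv.
        return math.exp((delta_vth_mv / 1000) / (n * kT_q))

    print(leakage_increase(50))    # ~3.6x for a 50 mV lower Vth
    print(leakage_increase(100))   # ~13x  for a 100 mV lower Vth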

    Back OT:
I do think that sharing the L3 would provide a boost in some situations, perhaps sometimes a good amount, but definitely not all the time. The big thing to keep in mind is what algorithms are being used: if there's a very large working set and the SIMDs are just crunching through it (stream/throughput processing), then there's a high likelihood that once data is used it won't be used again, hence it won't stick around in the cache. However, as you mention, the L3 might make a good buffer between CPU and GPU, since currently it's known that the latency involved in transferring the data and starting a kernel on a GPU keeps shorter data sets from seeing an overall speedup. However, it's outside my knowledge to even guess what the current break-even point is for data-set size and positive speedup (it's also highly algorithm dependent).

    There's also the fact that the SIMD might start trashing the CPU data in L3 (remember, multiple cores, some working on other threads/tasks). Of course, this could be managed with a simple solution such as lock bits, letting the SIMD/CPUs partition the L3 when necessary.

About the C6 state stuff: from what I understand, C6 is essentially fully power gated (i.e. off). Since leakage is a recurring theme these days and L1/L2 are relatively large structures made of transistors, there's always at least some noticeable leakage going on; granted, the L1/L2 would probably use higher-Vt transistors (lower leakage), but it's still power draw. I don't think the gains of letting other cores use the L1/L2 would outweigh the power savings, since the core is inactive and no longer attempting to fill its cache. If the core were to actually flush to L3 instead of main memory, that would be good enough. You also have to consider that with multiple cores, the thread running on the powered-down core will likely move over to one of the other cores (which will only fill its cache with the necessary data); more likely still, the core is going to C6 because the thread is done, hence there is no need for the corresponding data again.

    I'd say the most important point is C6 is meant for when a core is expected to be turned off for a relatively long period of time (in cpu terms), so the incurred memory latency is amortized, especially when it's likely that the core won't be using the same data when woken up.
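Putting that amortization argument into a rough break-even form (the numbers below are placeholders I picked for illustration, not Llano measurements):

Code:
    # C6 only pays off if the energy saved while power-gated exceeds the
    # energy spent flushing the caches and refilling them on wake-up.
    idle_power_w   = 2.0    # assumed leakage of an idle, non-gated core
    flush_refill_j = 0.05   # assumed energy cost of flush + cold-cache refill

    breakeven_s = flush_refill_j / idle_power_w
    print(breakeven_s * 1000)   # ~25 ms: sleep shorter than this and C6 loses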


    Someday I'll have to learn to write less, heh.



  16. #116
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by Raqia View Post
    The best solution would be to sell the APU soldered onto a board w/o commodity dimms etc; instead, use soldered GDDR5 on a wide bus for > 100 GB/s of bandwidth for both CPU and GPU. There's no need for most consumers to have commodity dimms on their machines. At this level of integration, the whole system could be designed like a graphics card is today.
    i think thats where itll ultimately end, but i highly doubt thats what llano will be already... even if they get rid of 128bit gddr3 and make that gddr5... then thats still not enough to really feed the cpu and gpu cores... and if they cant feed them, why put them on the die to begin with?

    rcofell, thanks!
    if you dont power down l1 and l2 then you dont save as much power, but youll be able to turn off the cpu cores a lot faster and more frequent... idk how much power cache consumes compared to the cpu cores, but id be surprised if cache consumes as much or more than the cores...
    johan made some excellent experiments in his article on anandtech, and he came to the conclusion that turbo works great for servers, not so much because it overclocks the cpu, but mostly because it turns off cores and reduces their power very efficiently. even when the package was 60% loaded several cores spent quite some time in C6, and saved a lot of power that way.

    im worried that with amd depending on L1 and L2 to be flushed and moved and copied and reloaded... they will either end up with sleeping cores when work needs to be done, or they wont be able to ever go to sleep as theres always some little work that needs to be done...

    its ironic cause thats what made k8 so efficient, it could switch much faster from one power state to another than intels cpus, which saved a lot of power.

  17. #117
    Xtreme Member
    Join Date
    Nov 2006
    Posts
    324
Apparently you guys are thinking in the wrong direction...
Llano is not about blazing graphics speed but about much faster streaming computation...
    Windows 8.1
    Asus M4A87TD EVO + Phenom II X6 1055T @ 3900MHz + HD3850
    APUs

  18. #118
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by SEA View Post
Apparently you guys are thinking in the wrong direction...
Llano is not about blazing graphics speed but about much faster streaming computation...
oh yeah, youre right! that doesnt require a lot of bandwidth at all :P

if amd really doesnt do anything but slap an igp/gpu on their cpu package, castrate it with a 128bit ddr3 imc and call THAT fusion after over 5 years of marketing talk and supposedly working on fusion all that time...
ill just point a finger at them and laugh... how pathetic is that? they could have done the same thing YEARS ago
    Last edited by saaya; 05-11-2010 at 06:56 AM.

  19. #119
    Xtreme Mentor
    Join Date
    May 2008
    Location
    cleveland ohio
    Posts
    2,879
    Quote Originally Posted by saaya View Post
    i think thats where itll ultimately end, but i highly doubt thats what llano will be already... even if they get rid of 128bit gddr3 and make that gddr5... then thats still not enough to really feed the cpu and gpu cores... and if they cant feed them, why put them on the die to begin with?

    rcofell, thanks!
    if you dont power down l1 and l2 then you dont save as much power, but youll be able to turn off the cpu cores a lot faster and more frequent... idk how much power cache consumes compared to the cpu cores, but id be surprised if cache consumes as much or more than the cores...
    johan made some excellent experiments in his article on anandtech, and he came to the conclusion that turbo works great for servers, not so much because it overclocks the cpu, but mostly because it turns off cores and reduces their power very efficiently. even when the package was 60% loaded several cores spent quite some time in C6, and saved a lot of power that way.

    im worried that with amd depending on L1 and L2 to be flushed and moved and copied and reloaded... they will either end up with sleeping cores when work needs to be done, or they wont be able to ever go to sleep as theres always some little work that needs to be done...

    its ironic cause thats what made k8 so efficient, it could switch much faster from one power state to another than intels cpus, which saved a lot of power.
it's not a K8 die, it's a K10 die.
The NB has its own clock now, even without the L3 cache.
    Quote Originally Posted by saaya View Post
oh yeah, youre right! that doesnt require a lot of bandwidth at all :P

if amd really doesnt do anything but slap an igp/gpu on their cpu package, castrate it with a 128bit ddr3 imc and call THAT fusion after over 5 years of marketing talk and supposedly working on fusion all that time...
ill just point a finger at them and laugh... how pathetic is that? they could have done the same thing YEARS ago
I think they're going to need a bit of VRAM (1 GB) for those 480 shaders.
    HAVE NO FEAR!
    "AMD fallen angel"
    Quote Originally Posted by Gamekiller View Post
    You didn't get the memo? 1 hour 'Fugger time' is equal to 12 hours of regular time.

  20. #120
    Xtreme Member
    Join Date
    Nov 2006
    Posts
    324
    Quote Originally Posted by saaya View Post
oh yeah, youre right! that doesnt require a lot of bandwidth at all :P

if amd really doesnt do anything but slap an igp/gpu on their cpu package, castrate it with a 128bit ddr3 imc and call THAT fusion after over 5 years of marketing talk and supposedly working on fusion all that time...
ill just point a finger at them and laugh... how pathetic is that? they could have done the same thing YEARS ago
roll your eyes back for a minute please
1) What AMD has done you can read on the first page, from AMD itself...
2) Contemporary integrated GPUs are already on par with or better than top-level CPUs in some distributed computing projects.
3) Summing up the two points above:
you actually got the picture. So hold your early laugh...
    Windows 8.1
    Asus M4A87TD EVO + Phenom II X6 1055T @ 3900MHz + HD3850
    APUs

  21. #121
    Xtreme Addict
    Join Date
    Dec 2007
    Location
    Hungary (EU)
    Posts
    1,376
    Quote Originally Posted by Hans de Vries View Post
The larger TLB is good for newer, large workloads. A fast integer divide
is a bit overdue compared to Core/Nehalem. I think the somewhat larger
L1 caches (8 transistors/bit instead of 6 transistors/bit) opened up the
extra space in the layout needed for a fast integer divider.
Any impact is very program specific.


    Regards, Hans
Would you explain where the relationship between the larger caches and the integer divider unit comes from?
    -

  22. #122
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by Oliverda View Post
Would you explain where the relationship between the larger caches and the integer divider unit comes from?
He meant that the area of the caches increased due to the 8T design (vs. 6T before). Integrating these into the layout, plus some other changes, freed up some space where something else could be put in (physically).
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  23. #123
    Xtreme Cruncher
    Join Date
    Apr 2005
    Location
    TX, USA
    Posts
    898
    Quote Originally Posted by Oliverda View Post
Would you explain where the relationship between the larger caches and the integer divider unit comes from?
I'll go out on a limb here and assume he's correlating it with the fact that the major units are custom designed (hierarchically: caches, ALUs, etc. laid out separately) and hence somewhat fixed in dimension/aspect ratio for area/timing-efficient designs. In this case it looks like the L1 caches are roughly the same width but a little bit taller, so everything else in the same row also has to use up the same height or else leave empty/wasted space. From there, the extra space opens things up for more features.
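For scale, the raw transistor cost of the 6T-to-8T move is easy to estimate (assuming K10-style 64 KB L1I + 64 KB L1D per core and counting data bits only, no tags/ECC, so purely indicative):

Code:
    # Extra transistors per core from moving the L1 arrays from 6T to 8T cells.
    l1_bytes = 2 * 64 * 1024          # 64 KB instruction + 64 KB data (assumed)
    extra = l1_bytes * 8 * (8 - 6)    # 2 extra transistors per bit cell
    print(extra)                      # ~2.1 million extra transistors per core

Die area doesn't scale linearly with transistor count, but it gives a feel for why the cache arrays grow and the floorplan around them shifts.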



  24. #124
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by demonkevy666 View Post
it's not a K8 die, it's a K10 die.
The NB has its own clock now, even without the L3 cache.
    i know, why did you mention this?

    Quote Originally Posted by demonkevy666 View Post
I think they're going to need a bit of VRAM (1 GB) for those 480 shaders.
    yeah, if they use system memory only its going to completely kill perf i think...
    i think the cpu cores would be somewhat ok, as single channel 64bit is actually acceptable for 2 cores, even 4 if you dont push them very hard... but the gpu having less than 128bit... like i said, then why put so many sps on the chip if you cant feed them?

    Quote Originally Posted by SEA View Post
    Contemporary Integrated GPUs are already on par or better then top level cpus in some distributed projects.
that doesnt make sense though... igps are cut down gpus, so if youre after DP flops then why use a cut down gpu instead of the real deal that offers 20x the perf? especially if you look at platform costs that makes a lot more sense, as you can cram at least 8 gpus into one server if you use dual gpu cards, vs a single tiny gpu per platform in a llano server...

so llano as a server chip... idk... i cant really think of it as such a great idea. it will probably offer better perf/cost and perf/watt, but like i said, look at integration... unless there are pcie llano cards with 2 or more llano chips on them, it wont be that useful i think.
