Page 2 of 5
Results 26 to 50 of 124

Thread: Can Llano do AVX?

  1. #26
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by terrace215 View Post
    That's okay. Judging by your S|A article, you come off extremely uninformed... just an fyi...
    hah, that's fine, luckily the world isn't full of people like you, so I can learn
    seriously dude, take it easy and don't act like such an 4ss...

    about offloading fpu code to the gpu...
    I can't believe that it's THAT hard... doing it well and doing it efficiently, that's probably tricky...
    but you make it sound as if Intel's and AMD's engineers would just shrug if you asked them how it could be done...
    and I just can't believe that... they are solving problems day in and day out, and having worked a lot with hw and sw engineers, they can always think of a way to get xxx working somehow... except for Chinese/Taiwanese engineers, hah, they are good at copying or tweaking designs but not that good at coming up with something new, and tend to say "that's not possible" a lot
    Last edited by saaya; 04-22-2010 at 10:54 PM.

  2. #27
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by Hans de Vries View Post
    A few observations suggest that AMD's Llano could do AVX instructions.

    [...]

    There is extra integer logic. A good guess would be a faster version
    of the Integer divider. One that can produce multiple result bits/cycle
    like the ones in the Core2 and Nehalem architecture.
    First, thanks for the nice work. Two nice bits of information on Opteron's (no, not you, Alex) bday

    The lengthened integer block could be the H/W divider support (I wrote about it here, because there exists a patent for it) or another variant: a lower-power multiplier implementation (as mentioned here; the related patent is no. 7,028,068). I'm sure you could detect which is more likely. Maybe even both, because the integer divider support is more about avoiding unnecessary iterations thanks to some quick checks, which shouldn't need a lot of gates.
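The iteration-skipping idea is easy to sketch: a restoring divider normally runs one step per quotient bit, but a quick comparison of operand widths bounds how many steps can actually produce non-zero quotient bits. This is a toy model of the concept, not a description of AMD's actual circuit:

```python
def restoring_divide(n, d):
    """Restoring division that skips iterations a quick width check rules out."""
    assert d > 0
    # quick check: at most this many quotient bits can be non-zero
    steps = max(0, n.bit_length() - d.bit_length() + 1)
    q, r = 0, n >> steps          # the skipped high part is guaranteed < d
    for i in range(steps - 1, -1, -1):
        r = (r << 1) | ((n >> i) & 1)   # bring down the next dividend bit
        q <<= 1
        if r >= d:                       # trial subtraction succeeds
            r -= d
            q |= 1
    return q, r

# a 32-bit-style divide of small operands finishes in 5 steps instead of 32
print(restoring_divide(100, 7))   # (14, 2)
```

A full-width hardware divider would spend one cycle per register bit; the quick check above is why early-out support "shouldn't need a lot of gates" while still saving many cycles on small operands.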

    Quote Originally Posted by terrace215 View Post
    Even if they *could* support AVX in Llano, would AMD really want their first implementation of AVX to be crippled (128b exe units) vs the contemporaneous SB implementation?

    I guess it would be a trade-off: a little better performance than not supporting it vs. bad marketing from head-to-head comparisons with SB in "AVX benchmarks". Might it not be better to wait for a 256b implementation? I suppose that depends on how long it takes for such a successor to reach the Llano market space.
    Just remember the SSE1+2 implementation in NetBurst. It was implemented the same way, and nobody was unhappy with it - as long as the compiler created code for it.

    Quote Originally Posted by saaya View Post
    hmmm do you remember when people started talking about reverse hyper threading? intel can split the fpu, ie hyper threading... amd is going to use one fpu for 2 integer cores... this is what people could have interpreted or misunderstood as reverse hyper threading right?
    Quote Originally Posted by saaya View Post
    does anybody know how much work needs to be done to offload fpu code like avx to the gpu cores? any idea?
    The current problem is marked in bold plus a lot of overhead for managing the instructions, decode, retire with such long paths. GPU ALUs/FMADs just have to come closer to be utilized efficiently. And BTW there are some ATI patents (2004 or so) describing a processor both capable of doing GPU stuff and x86 code execution...

    P.S.: In addition to Hans' birthday gift I published another gift on my blog - maybe exactly one year before launch
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  3. #28
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by Dresdenboy View Post
    The current problem is marked in bold plus a lot of overhead for managing the instructions, decode, retire with such long paths. GPU ALUs/FMADs just have to come closer to be utilized efficiently. And BTW there are some ATI patents (2004 or so) describing a processor both capable of doing GPU stuff and x86 code execution...)
    thanks
    so the problem is for the cpu to decode for the gpu and make efficient use of the gpu cores? but even if it's not very efficient, it's extra processing power... for free... even if you only use 10% of the gpu cores, that's still going to be a boost...

    your blog doesn't load for me btw...

  4. #29
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by saaya View Post
    thanks
    so the problem is for the cpu to decode for the gpu, and make efficient use of the gpu cores? but even if its not very efficient, its extra processing power... for free... even if you only use 10% of the gpu cores, thats still going to be a boost...
    Think of it like this: you are the boss of a factory producing tables, and you have to make 100 tables of 30 different types. What's more efficient (time- and cost-wise): having 20 workers, or having 1000 with all the implications (communication, space, logistics, supervision...)?

    Quote Originally Posted by saaya View Post
    your blog doesnt load for me btw...
    Someone is trying to keep my information away from the public... woohooo, conspiracy ahoy

    You don't seem to be the only one:
    http://support.mozilla.com/de/forum/1/653217

    blog.de suffered a DoS attack and, as it seems, also uses some IP blocking. I'll look for a different place to provide a kind of backup.

    Maybe you could use an Atom reader for http://citavia.blog.de/feed/atom/posts/ (or RSS: http://citavia.blog.de/feed/rss2/posts/ ).

  5. #30
    Xtreme Member
    Join Date
    Aug 2004
    Posts
    210
    Quote Originally Posted by terrace215 View Post
    Even if they *could* support AVX in Llano, would AMD really want their first implementation of AVX to be crippled (128b exe units) vs the contemporaneous SB implementation?
    Even if there were some penalties, there would still be benefits from the use of 3-operand instructions.
    Quote Originally Posted by Dresdenboy View Post
    First, thanks for the nice work. Two nice bits of information on Opteron's (no, not you, Alex) bday
    Haha ok - noted ;-)

  6. #31
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    Quote Originally Posted by Hans de Vries View Post
    It's not that "crippled", not by a factor 2 (=256/128). For example:
    If an SIMD FP add takes 4 clock cycles then:

    128 bit: A+B+C takes 8 clock cycles.
    256 bit: A+B+C takes 9 clock cycles. (using pipelined 128 bit hardware)

    128 bit: A+B+C+D takes 9 clock cycles.
    256 bit: A+B+C+D takes 11 clock cycles. (using pipelined 128 bit hardware)

    It all depends on how many unused time-slots there are due to the data
    dependencies. A bigger bottleneck for Llano would be the L1 cache access
    bandwidth: 32 bytes/cycle for Llano versus 48 bytes/cycle for Sandy Bridge.


    Regards, Hans
    How will it turn out in heavier workloads, when a hundred calculations are to be done in a row? I'm not a chip architect, but I believe that if the FPU can be fed enough, a hundred 128-bit calculations would take something like 107 cycles while a hundred 256-bit ones would take 207.
    I know of course that it isn't that simple, but my point is that it can be misleading to talk about single calculations, as the pipeline length distorts the result compared to real-world performance. It's the same as comparing a 20-stage pipeline to a 10-stage one on single calculations: the real-world throughput could be nearly identical, yet on one single calculation the 10-stage one is twice as fast.
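Hans' cycle counts can be reproduced with a toy scheduler. The assumptions (a single pipelined 128-bit add port, 4-cycle latency, 256-bit ops cracked into two 128-bit halves issued on consecutive cycles) are taken from his example, not a claim about the real hardware:

```python
LATENCY = 4  # assumed SIMD FP add latency, as in Hans' example

class Port:
    """A single pipelined 128-bit add unit: one new half-op per cycle."""
    def __init__(self):
        self.free = 0  # next cycle the port can accept a half-op

def latency(expr, halves, port=None):
    """expr is an operand name (ready at cycle 0) or a (left, right) add.
    halves=1 models native 128-bit ops; halves=2 models 256-bit ops cracked
    into two 128-bit halves. Returns per-half completion cycles."""
    port = port or Port()
    if isinstance(expr, str):
        return [0] * halves
    a = latency(expr[0], halves, port)
    b = latency(expr[1], halves, port)
    done = []
    for h in range(halves):
        issue = max(a[h], b[h], port.free)
        port.free = issue + 1        # pipelined: next issue one cycle later
        done.append(issue + LATENCY)
    return done

chain3 = (("A", "B"), "C")           # A+B+C, a dependent chain
tree4  = (("A", "B"), ("C", "D"))    # (A+B)+(C+D), partially parallel
print(max(latency(chain3, 1)), max(latency(chain3, 2)))  # 8 9
print(max(latency(tree4, 1)),  max(latency(tree4, 2)))   # 9 11
```

The model reproduces all four of Hans' numbers, which makes his point concrete: the 256-bit penalty on cracked 128-bit hardware comes only from the one-cycle issue stagger, so the more independent work in flight, the smaller the gap.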

    Quote Originally Posted by saaya View Post
    thanks
    so the problem is for the cpu to decode for the gpu, and make efficient use of the gpu cores? but even if its not very efficient, its extra processing power... for free... even if you only use 10% of the gpu cores, thats still going to be a boost...

    your blog doesnt load for me btw...
    I'm not that good at this stuff, but I guess the problem is latency. It's quicker to do a random operation in the FPU than to send it all the way to the GPU. It's with large workloads that can be parallelized that the GPU is useful.
    Just doing one or a few operations at random is probably much more efficient in a high-clocked FPU than in a low-clocked GPU, at the moment.
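The latency point is just arithmetic: the round trip to the GPU is a fixed cost, so offloading only pays once the batch is big enough to amortize it. The numbers below are made up for illustration, not measurements of any real chip:

```python
cpu_ns_per_op = 1.0       # assumed: high-clocked FPU, one op at a time
gpu_ns_per_op = 0.05      # assumed: wide GPU, per-op throughput
roundtrip_ns  = 10_000.0  # assumed: launch + transfer latency, paid once

def best_device(n_ops):
    """Pick the faster device for a batch of n_ops independent FP ops."""
    cpu_time = n_ops * cpu_ns_per_op
    gpu_time = roundtrip_ns + n_ops * gpu_ns_per_op
    return "cpu" if cpu_time <= gpu_time else "gpu"

print(best_device(100))        # cpu: the round trip dominates a few ops
print(best_device(1_000_000))  # gpu: throughput wins on a large batch
```

Under these made-up constants the break-even batch is around 10,500 operations; everything smaller is cheaper to keep in the FPU, which is exactly the "one or a few operations at random" case.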

  7. #32
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by Dresdenboy View Post
    Think of it like this: You are the boss of a factory producing tables. You have to make 100 tables of 30 different types. What's more efficient then (time and cost wise): to have 20 workers or to have 1000 with all implications (communication, place, ways, controlling..)
    well, it's more like I HAVE 1000 employees, it's just a matter of how many I choose to put on a team to work on something...
    and I don't understand the reference to communication, space, logistics, supervision... is it that much work to get them data and pick up the results?
    in your example, I'd take 1 worker per table to do the final assembly, and tell the other guys to prepare them and do the basic work that is the same for all of them, and then there will be 1 worker for each table who puts it together or does whatever needs to be done on this special model that doesn't get done on all the other ones.

    I mean, that's the big thing here that I don't get... no matter how difficult it is to get them working or how bad they are at what they do... you already hired them; it's just that right now, most of the day, they are sitting around not doing anything.

    Quote Originally Posted by Dresdenboy View Post
    Someone is trying to keep my information away from public... woohooo conspiracy ahoi

    You seem to be not the only one:
    http://support.mozilla.com/de/forum/1/653217

    blog.de suffered some DOS attack and also uses some IP blocking as it seems. I'll look for a different place to provide kind of a backup.

    Maybe you could use an atom reader for http://citavia.blog.de/feed/atom/posts/ (or RSS: http://citavia.blog.de/feed/rss2/posts/ ).
    hahah, yeah, sounds like M from AMD security is on your heels hahaha

  8. #33
    Xtreme Addict
    Join Date
    Aug 2006
    Location
    eu/hungary/budapest.tmp
    Posts
    1,591
    Quote Originally Posted by Hans de Vries View Post
    A few observations

    [img]
    whoa... awesome job! I never thought one could tell so much from a die shot.
    Me, I can only tell the difference between cache and logic, but naming the
    parts... stunning
    Usual suspects: i5-750 & H212+ | Biostar T5XE CFX-SLI | 4GB RAndoM | 4850 + AC S1 + 120@5V + modded stock for VRAM/VRM | Seasonic S12-600 | 7200.12 | P180 | U2311H & S2253BW | MX518
    mITX media & to-be-server machine: A330ION | Seasonic SFX | WD600BEVS boot & WD15EARS data
    Laptops: Lifebook T4215 tablet, Vaio TX3XP
    Bike: ZX6R

  9. #34
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by Frank M View Post
    whoa... awesome job! Never thought one could tell so much from a die shot.
    Me, I can only tell the difference between cache and logic, but naming the
    parts... stunning
    well, if you've worked with something for long enough you get really good at spotting details and can read everything like a book

  10. #35
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    Quote Originally Posted by saaya View Post
    well, if youve worked with something for long enough you get really good at spotting details and can read everything like a book
    But even in books there is a difference between reading, and reading between the lines.
    Understanding words isn't the same thing as understanding the context.

  11. #36
    Xtreme Mentor
    Join Date
    Jun 2008
    Location
    France - Bx
    Posts
    2,601
    Intel AVX page

    For those who don't know this link ...

  12. #37
    Xtreme Addict
    Join Date
    Jan 2007
    Location
    Brisbane, Australia
    Posts
    1,264
    lol @ 3dnow.

    Llano + Quake II + 3dn patch ftw!

    Great work Hans, as always. It'll be interesting to see what it all turns out to be.

  13. #38
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by saaya View Post
    hah, thats fine, luckily the world isnt full of people like you so i can learn
    seriously dude, take it easy and dont act like such an 4ss...

    about offloading fpu code to the gpu...
    i cant believe that its THAT hard... doing it well and doing efficiently, thats probably tricky...
    but you make it sound as if intels and amds engineers would just shrug if youd ask them how it could be done...
    and i just cant believe that... they are solving problems all day in and out, and having worked a lot with hw and sw engineers, they can always think about a way to get xxx working somehow... except for chinese/taiwanese engineers, hah, they are good at copying or tweaking designs but not that good at coming up with something new and tend to say "thats not possible" a lot
    Transmeta attempted something similar. The difference is that they intended for the underlying architecture to intercept x86, and it was not inside an x86 CPU. Even if AMD tried, there would be so many issues that it would be counterproductive. It's pretty much impossible under any real-world constraints. Think about it. It is THAT hard.

  14. #39
    Xtreme Mentor
    Join Date
    Mar 2006
    Posts
    2,978
    Quote Originally Posted by Raqia View Post
    It's interesting that 3dnow! is still being kept around. It's a minuscule amount of die space but it can't be trivial to implement and debug. Company pride I suppose?
    Not surprising; a number of applications utilize a 3DNow! code path, including games and the staple, the MainConcept H.264 encoder. That little bit of die area will never be retired.
    One hundred years from now it won't matter
    what kind of car I drove, what kind of house I lived in,
    how much money I had in the bank, nor what my clothes looked like... but the world may be a little better because I was important in the life of a child.
    -- from "Within My Power" by Forest Witcraft

  15. #40
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    If anything, an integer divide is going to take more logic than an integer multiply.
    Fast computers breed slow, lazy programmers
    The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
    http://www.lighterra.com/papers/modernmicroprocessors/
    Modern Ram, makes an old overclocker miss BH-5 and the fun it was

  16. #41
    c[_]
    Join Date
    Nov 2002
    Location
    Alberta, Canada
    Posts
    18,728
    Quote Originally Posted by Chumbucket843 View Post
    transmeta attempted something similar. the difference is they intended for the underlying architecture to intercept x86 and it is not inside and x86 cpu. even if AMD tried there would be so many issues and it would be counterproductive. its pretty much impossible with any real world constraints. think about it. it is THAT hard.
    Really, it depends on the GPU core; maybe some parts of it are, or will be, designed somewhat like an "SPE", which could allow the CPU to utilize parts of it as an external resource... oh well

    All along the watchtower the watchmen watch the eternal return.

  17. #42
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by -Boris- View Post
    But even in books there is a difference between reading, and reading between the lines.
    Understanding words isn't the same thing as understanding the context.
    sure, reading tons of books doesn't mean you'll necessarily get all the hints... but what's your point? That Hans is reading things into the die shot that just aren't there? It's not like he even claimed it supports AVX; he found hints that it might, and the thread title carries a question mark, so I don't get why people troll him for hinting at an interesting find without posting anything useful themselves...

    even IF Llano CAN do AVX, it doesn't mean it will... it wouldn't be the first time that logic gets added that is only enabled in a later revision or design. it might be broken or not work as well as planned, and it will never be used...

    Quote Originally Posted by Chumbucket843 View Post
    transmeta attempted something similar. the difference is they intended for the underlying architecture to intercept x86 and it is not inside and x86 cpu. even if AMD tried there would be so many issues and it would be counterproductive. its pretty much impossible with any real world constraints. think about it. it is THAT hard.
    hmmm... correct me if I'm wrong, but isn't every modern x86 CPU basically blocks of RISC-like logic with an x86 decoder on top? so then the decoder needs to be able to spit out code that a GPU core can process, and the latter has to be able to output data in a way that can be assembled back into an x86-compatible result... it's basically adding a new type of FPU block to the CPU... so the GPU core wouldn't have to execute x86 code, right?
    please excuse my lack of knowledge about this
    Last edited by saaya; 04-24-2010 at 10:12 AM.

  18. #43
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    Quote Originally Posted by saaya View Post
    sure, reading tons of books doesnt man youll necessarily get all hints... but whats your point? that hans is interpreting things into the die shot that just arent there? its not like he even claimed it supports avx, he found hints that it might, the thread title carries a question mark, so i dont get why people troll him for hinting at an interesting find without posting anything useful themselves...

    even IF llano CAN do avx, it doesnt mean it will... its not like it would be the first time that logic gets added that is only enabled in a later rev or design. it might be broken or not work as well as planned, and it will never be used...

    No, it's not like that at all. I respect Hans very much, and I always have. I agree with everything you are saying.
    My point is simply that we don't know anything yet. And don't get me wrong, I love speculation and speculate a lot myself. But some readers have already taken his posts as accurate readings of the die. All I'm saying is that even the most skilled engineer can't read every detail in a die shot.
    And as Hans himself pointed out with his question mark, these are mostly fun speculations; however, I do think he has some interesting points, and I would love to see him proven right.
    But we should not take these speculations as truths. And I am convinced that Hans himself wouldn't like his posts to be interpreted that way.

    I apologize to Hans if my posts have in some way been perceived as trolling.

  19. #44
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by saaya View Post
    sure, reading tons of books doesnt man youll necessarily get all hints... but whats your point? that hans is interpreting things into the die shot that just arent there? its not like he even claimed it supports avx, he found hints that it might, the thread title carries a question mark, so i dont get why people troll him for hinting at an interesting find without posting anything useful themselves...

    even IF llano CAN do avx, it doesnt mean it will... its not like it would be the first time that logic gets added that is only enabled in a later rev or design. it might be broken or not work as well as planned, and it will never be used...


    hmmm... correct me if im wrong, but isnt every modern x86 cpu blocks of risc logic with an x86 decoder on top? so then the decoder needs to be able to spit out code that a gpu core can process, and the latter has to be able to spit out data in a way it can be built into an x86 compatible result again... its basically adding a new type of fpu block to the cpu... so the gpu core wouldnt have to execute x86 code, right?
    please excuse my lack of knowledge about this
    well, the underlying structure of modern CPUs is uniformly sized internal instructions and generally a separation of arithmetic from loads/stores, but the arithmetic logic itself needs to be able to produce the expected results efficiently, otherwise there will be serious inefficiency in the design.

    For example, to emulate 80-bit x87 calculations, it is more efficient just to make your FPU 80 bits wide than to attempt to emulate it with a 64-bit FPU.
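The emulation cost is easy to see from significand widths: a 64-bit double carries 53 significand bits, while the x87 extended format carries 64, so values that are exact in extended precision get silently rounded in a double. Python floats are 64-bit doubles, which makes a quick illustration possible:

```python
# x87 extended precision has a 64-bit significand; a 64-bit double only 53.
# Any integer needing more than 53 bits is exact in the x87 format but
# rounds when squeezed into a double, which is why faithfully emulating
# 80-bit x87 on 64-bit hardware needs costly multi-word arithmetic.
x = 2**60 + 1          # needs 61 significand bits: fits x87, not a double
assert float(x) == float(2**60)   # the low bit is rounded away
assert x - 2**60 == 1             # ...but the exact value really did differ
print("double rounds 2**60 + 1 down to", int(float(x)))
```

Recovering those lost bits in software means carrying extra words and handling rounding by hand on every operation, which is nn_step's point: dedicated 80-bit hardware is cheaper than the emulation.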

  20. #45
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by -Boris- View Post
    No, It's not like that at all, I respect Hans very much, and I always have. I agree in everything that you are saying.
    My point is simply that we don't know anything yet, and don't get me wrong, I love speculations and speculate alot myself. But some readers have already taken his posts as accurate readings of the die. All I'm saying is that even the most skilled engineer can't read the details in die shots.
    And as Hans himself pointed out with his question mark, this are mostly fun speculations, however I do think he does have some interesting points. And I would love to see him right.
    But we shall not take these speculations as truths. And I am convinced that Hans himself wouldn't like his post to be interpreted like that.

    I apologize to Hans if my posts in some way has been percieved as trolling.
    yeah, maybe some people got too carried away
    and even though I can't remember Hans being wrong about any of his die-shot speculations in the past...
    there were several cases where he was right and something was indeed in place, but disabled: hyper-threading in early P4s and... Yamhill in later ones, for example...

    Quote Originally Posted by nn_step View Post
    well the underlying structure for Modern CPUs are uniform sized instruction and generally separation of arithmetic from load/stores but the arithmetic logic itself need to able to approximate the expected results efficiently otherwise there will be serious inefficiency in the design.

    For example to emulate 80bit x87 calculations, it is more efficient to just to make your FPU 80bits, than to attempt to emulate it with a 64bit FPU
    sorry, I didn't understand the part about arithmetic and FPU coordination
    all I got is that the two need to work together, and that's probably a pita if the FPU is off-die or in a faraway block and doing things in a very different manner than what the engineers are used to working with...

    hmmmm, but how about 128 instead of 64? if all you do is widen the interface but the FPU can still only do 64 bits per clock... do you actually gain any performance whatsoever on 128-bit-wide operations? and do you gain 64-bit op perf?

    thx nn_step
    Last edited by saaya; 04-26-2010 at 07:57 AM.

  21. #46
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by saaya View Post
    hmmm... correct me if im wrong, but isnt every modern x86 cpu blocks of risc logic with an x86 decoder on top? so then the decoder needs to be able to spit out code that a gpu core can process, and the latter has to be able to spit out data in a way it can be built into an x86 compatible result again... its basically adding a new type of fpu block to the cpu... so the gpu core wouldnt have to execute x86 code, right?
    please excuse my lack of knowledge about this
    I'm not entirely sure how CPUs break down instructions; the decoder probably takes an instruction and breaks it down into simpler operations. Your concept of the digital logic is not quite right: an ISA is just a set of instructions a processor executes. The processor doesn't need special logic for that particular ISA; it just needs to be able to execute every instruction.

    here is a simple decoder. it's like a switch in a network, not some complex decrypting magic.
    http://www.asic-world.com/digital/co...er_using_Demux

    I understand what you are saying: could an x86 decoder allow instructions to run on a GPU? No, there are many issues. GPUs are still ASICs: they are designed around DirectX, they don't support all of the features that would be needed in a CPU, and they are very different architecturally. For instance, SSE has 8 architectural registers, while Evergreen has 128 GPRs (general purpose registers); both are 128 bits wide. An SSE programmer can't use 128 registers, because 120 of them don't exist!
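The "simple decoder" point is worth making concrete: an address decoder is pure combinational logic, a truth table and nothing more. A behavioural sketch of a 2-to-4 line decoder (a generic example, not any particular chip's circuit):

```python
def decode_2to4(a1, a0, enable=1):
    """2-to-4 line decoder: with enable high, exactly one of the four
    outputs goes high, selected by the two address bits."""
    index = (a1 << 1) | a0
    return [1 if (enable and i == index) else 0 for i in range(4)]

print(decode_2to4(0, 0))  # [1, 0, 0, 0]
print(decode_2to4(1, 1))  # [0, 0, 0, 1]
```

An x86 instruction decoder is enormously more elaborate (variable-length instructions, prefixes, µop cracking), but it is the same kind of machinery: inputs select outputs, no "decrypting magic".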

  22. #47
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Posts
    644
    Quote Originally Posted by Chumbucket843 View Post
    i understand what you are saying: could an x86 decoder allow instructions to run on a gpu? no. there are many issues. gpu's are still asic's. they are designed around directx. they dont support all of the features that would be needed in a cpu and they are very different architecturally. for instance sse has 8 GPR's(general purpose registers). evergreen has 128GPR's. both are 128 bits wide. an sse programmer cant use 128 gpr's because 120 of them dont exist!
    Are you forgetting that x86 processors got register renaming around a decade ago? Applications see the 16 SSE and x86-64 registers (you get 16 of both when running in long mode, and the SSE registers aren't supposed to be GPRs as far as I know), but you don't know how many the processor really keeps track of internally.
    Both the FPU and the GPU do the same thing: floating-point calculations. However, a processor FPU is a very complex monster in silicon that is also hard to program for (x87 assembly was always notoriously difficult, as far as I know), while the GPU has tons of simpler, special-purpose FPUs. Things aren't exactly like that any more, because GPUs evolved from fixed-function units to much more complex, programmable ones, though I don't know how far the potential reach of CUDA and ATI Close to Metal goes compared to the x87 FPU. But the more they work toward the GPGPU approach, the more complex these programmable units should become, and maybe they will acquire a high degree of flexibility.
    The question is whether you can make the less complex GPU programmable units able to understand and process x87 instructions, or vice versa with the processor FPU. If anything, it is possible that in two or three generations the FPU, SSE units and GPU become the same unit, capable of executing all the different instruction sets.
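The renaming point can be sketched as a map from architectural names to a larger physical register file: every write allocates a fresh physical register, so the 16 architectural XMM names fan out over many more real ones. A toy model (real renamers also recycle registers at retirement, which is omitted here):

```python
class Renamer:
    """Toy register renamer: 16 architectural regs over 128 physical ones."""
    def __init__(self, n_arch=16, n_phys=128):
        self.free = list(range(n_arch, n_phys))       # free physical regs
        self.table = {f"xmm{i}": i for i in range(n_arch)}

    def rename(self, dest, *srcs):
        """Map sources through the table, allocate a fresh reg for dest."""
        phys_srcs = tuple(self.table[s] for s in srcs)
        p = self.free.pop(0)
        self.table[dest] = p                          # later reads see p
        return p, phys_srcs

r = Renamer()
w1, _ = r.rename("xmm0", "xmm1", "xmm2")   # first write to xmm0
w2, _ = r.rename("xmm0", "xmm0", "xmm3")   # second write: fresh physical reg
print(w1, w2)   # two different physical registers back the same name
```

Note how the second write reads the old physical register for `xmm0` before remapping it; that is what removes the write-after-write dependency between the two instructions.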

  23. #48
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by zir_blazer View Post
    Are you forgetting that x86 Processors got Register Renaming from around a decade ago?
    that is a problem, because registers on the GPU are meant to be explicitly managed (you allocate registers per thread). that's just one example of why this can't work.
    Applications see the 16 SSE and x86-64 Registers (You got 16 on both when running on Long Mode. And SSE aren't supposed to be GPR as far that I know), but you don't know how much the Processor really keep track of internally.
    SSE has GPRs along with MMX, XMM and some other registers. I'm pretty sure all of them are doubled in long mode. Without all the register types, SSE is impossible.
    Both the FPU and the GPU do the same thing: Floating point calculations. However, a Processor FPU is a very complex monster on silicon that is also hard to program for (The x87 Assembly always sucked in difficulty, as far that I know) while the GPU got tons of simpler and specific purpose FPU. Things aren't specifically like that right now because GPUs evolved from having fixed function units to much more complex, programable ones, though I don't know how far is the potential reach of CUDA and ATI Close to Metal when compared to the x87 FPU. But the more they work for the GPGPU approach, the more complex these programmable units should become and maybe should adquiere a high grade of flexibility.
    while there is a lot of difference in how FP is implemented, that doesn't change the fact that the end result is just math. x87 doesn't have that many more capabilities w.r.t. functions: you can convert to int, or compute a square root or an exponent, nothing that is very important.
    The question is if you can make the less complex GPU programmable units being able to understand and process x87 instructions or viceversa with the Processor FPU. If anything, it could be possible than in two generations or three the FPU, SSE and GPU become the same unit capable of executing all the different instructions sets.
    the problem is "understanding". it's like translating another language verbatim: even though you translated it, it's still not necessarily correct, and the syntax will be messed up. there are fundamental differences in the way these processors work. GPUs and CPUs are merging architecturally. it's going to be interesting, and I hope we don't get stuck with x86 again.

  24. #49
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by saaya View Post
    yeah, maybe some people got too carried away
    and even though i cant remember hans being wrong about any of his speculations based on die shots in the past...
    there were several cases where he was right and something was indeed in place, but disabled hyper threadding in early p4s and ... yamhil in later ones for example...

    sorry, didnt understand the part about arythmetic and fpu coordination
    all i got is that the two need to work together and thats probably a pita if the fpu is off die or in far away block and doing things in a very different manner than what the engineers are used to working with...

    hmmmm but how about 128 instead of 64? if all you do is widen the interface but the fpu can still only do 64 per clock... then do you actually gain any performance whatsoever on 128bit wide operations? and do you gain 64bit op perf?

    thx nn_step
    well, given that no x86 or x86-64 instruction works on 128-bit floating-point math, it would be an atrocious waste of transistors to implement a true 128-bit FPU. Hence AMD's K10 implements an 80-bit and a 64-bit FPU in parallel, such that 128-bit SIMD instructions can be done at full speed and, if need be, an 80-bit and a 64-bit instruction can be processed in parallel.

  25. #50
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    Btw... there is an interesting poll on AMD's developer site...


    Quick Poll
    Are you currently developing or planning on supporting the AVX instruction set in your application development efforts?
    a) No plans at this time
    b) Currently researching how to take advantage of AVX
    c) Already developing specific AVX code
    d) What's AVX?

    source: http://developer.amd.com/Pages/default.aspx#

    And this AMD's blog mentions some interesting things too:
    http://blogs.amd.com/developer/2009/...ing-a-balance/
    Last edited by jogshy; 04-26-2010 at 10:40 PM.

