Page 2 of 5
Results 26 to 50 of 124

Thread: Can Llano do AVX?

  1. #26
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by terrace215 View Post
    That's okay. Judging by your S|A article, you come off extremely uninformed... just an fyi...
    hah, that's fine, luckily the world isn't full of people like you, so I can learn
    seriously dude, take it easy and don't act like such an 4ss...

    about offloading fpu code to the gpu...
    I can't believe that it's THAT hard... doing it well and doing it efficiently, that's probably tricky...
    but you make it sound as if Intel's and AMD's engineers would just shrug if you asked them how it could be done...
    and I just can't believe that... they are solving problems day in and day out, and having worked a lot with hw and sw engineers, they can always think of a way to get xxx working somehow... except for Chinese/Taiwanese engineers, hah, they are good at copying or tweaking designs but not that good at coming up with something new, and tend to say "that's not possible" a lot
    Last edited by saaya; 04-22-2010 at 10:54 PM.

  2. #27
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by Hans de Vries View Post
    A few observations suggest that AMD's Llano could do AVX instructions.

    [...]

    There is extra integer logic. A good guess would be a faster version
    of the Integer divider. One that can produce multiple result bits/cycle
    like the ones in the Core2 and Nehalem architecture.
    First, thanks for the nice work. Two nice bits of information on Opteron's (no, not you, Alex) bday

    The lengthened integer block could be the H/W divider support (I wrote about it here, because there exists a patent for it) or another variant: a lower-power multiplier implementation (as mentioned here; the related patent is no. 7,028,068). I'm sure you could detect which is more likely. Maybe even both, because the integer divider support is more about avoiding unnecessary iterations thanks to some quick checks, which shouldn't need a lot of gates.
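The iteration-skipping idea is easy to sketch: a restoring divider normally runs one step per quotient bit, but a quick comparison of operand widths bounds how many steps can actually produce non-zero quotient bits. This is a toy model of the concept, not a description of AMD's actual circuit:

```python
def restoring_divide(n, d):
    """Restoring division that skips iterations a quick width check rules out."""
    assert d > 0
    # quick check: at most this many quotient bits can be non-zero
    steps = max(0, n.bit_length() - d.bit_length() + 1)
    q, r = 0, n >> steps          # the skipped high part is guaranteed < d
    for i in range(steps - 1, -1, -1):
        r = (r << 1) | ((n >> i) & 1)   # bring down the next dividend bit
        q <<= 1
        if r >= d:                       # trial subtraction succeeds
            r -= d
            q |= 1
    return q, r

# a 32-bit-style divide of small operands finishes in 5 steps instead of 32
print(restoring_divide(100, 7))   # (14, 2)
```

A full-width hardware divider would spend one cycle per register bit; the quick check above is why early-out support "shouldn't need a lot of gates" while still saving many cycles on small operands.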

    Quote Originally Posted by terrace215 View Post
    Even if they *could* support AVX in Llano, would AMD really want their first implementation of AVX to be crippled (128b exe units) vs the contemporaneous SB implementation?

    I guess it would be a trade-off: a little better performance than not supporting it vs. bad marketing from head-to-head comparisons with SB in "AVX benchmarks". Might it not be better to wait for a 256b implementation? I suppose that depends on how long it takes for such a successor to reach the Llano market space.
    Just remember the SSE1+2 implementation in NetBurst. It was implemented the same way, and nobody was unhappy with it - as long as the compiler created code for it.

    Quote Originally Posted by saaya View Post
    hmmm do you remember when people started talking about reverse hyper threading? intel can split the fpu, ie hyper threading... amd is going to use one fpu for 2 integer cores... this is what people could have interpreted or misunderstood as reverse hyper threading right?
    Quote Originally Posted by saaya View Post
    does anybody know how much work needs to be done to offload fpu code like avx to the gpu cores? any idea?
    The current problem is marked in bold plus a lot of overhead for managing the instructions, decode, retire with such long paths. GPU ALUs/FMADs just have to come closer to be utilized efficiently. And BTW there are some ATI patents (2004 or so) describing a processor both capable of doing GPU stuff and x86 code execution...

    P.S.: In addition to Hans' birthday gift I published another gift on my blog - maybe exactly one year before launch
    Now on Twitter: @Dresdenboy!
    Blog: http://citavia.blog.de/

  3. #28
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by Dresdenboy View Post
    The current problem is marked in bold plus a lot of overhead for managing the instructions, decode, retire with such long paths. GPU ALUs/FMADs just have to come closer to be utilized efficiently. And BTW there are some ATI patents (2004 or so) describing a processor both capable of doing GPU stuff and x86 code execution...)
    thanks
    so the problem is for the cpu to decode for the gpu and make efficient use of the gpu cores? but even if it's not very efficient, it's extra processing power... for free... even if you only use 10% of the gpu cores, that's still going to be a boost...

    your blog doesn't load for me btw...

  4. #29
    Xtreme Member
    Join Date
    Jul 2004
    Location
    Berlin
    Posts
    275
    Quote Originally Posted by saaya View Post
    thanks
    so the problem is for the cpu to decode for the gpu, and make efficient use of the gpu cores? but even if its not very efficient, its extra processing power... for free... even if you only use 10% of the gpu cores, thats still going to be a boost...
    Think of it like this: you are the boss of a factory producing tables, and you have to make 100 tables of 30 different types. What's more efficient (time- and cost-wise): having 20 workers, or having 1000 with all the implications (communication, space, logistics, supervision...)?

    Quote Originally Posted by saaya View Post
    your blog doesnt load for me btw...
    Someone is trying to keep my information away from the public... woohooo, conspiracy ahoy

    You don't seem to be the only one:
    http://support.mozilla.com/de/forum/1/653217

    blog.de suffered a DoS attack and, as it seems, also uses some IP blocking. I'll look for a different place to provide a kind of backup.

    Maybe you could use an Atom reader for http://citavia.blog.de/feed/atom/posts/ (or RSS: http://citavia.blog.de/feed/rss2/posts/ ).

  5. #30
    Xtreme Member
    Join Date
    Aug 2004
    Posts
    210
    Quote Originally Posted by terrace215 View Post
    Even if they *could* support AVX in Llano, would AMD really want their first implementation of AVX to be crippled (128b exe units) vs the contemporaneous SB implementation?
    Even if there were some penalties, there would still be benefits from the use of 3-operand instructions.
    Quote Originally Posted by Dresdenboy View Post
    First, thanks for the nice work. Two nice bits of information on Opteron's (no, not you, Alex) bday
    Haha ok - noted ;-)

  6. #31
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    Quote Originally Posted by Hans de Vries View Post
    It's not that "crippled", not by a factor 2 (=256/128). For example:
    If an SIMD FP add takes 4 clock cycles then:

    128 bit: A+B+C takes 8 clock cycles.
    256 bit: A+B+C takes 9 clock cycles. (using pipelined 128 bit hardware)

    128 bit: A+B+C+D takes 9 clock cycles.
    256 bit: A+B+C+D takes 11 clock cycles. (using pipelined 128 bit hardware)

    It all depends on how many unused time-slots there are due to the data
    dependencies. A bigger bottleneck for Llano would be the L1 cache access
    bandwidth: 32 bytes/cycle for Llano versus 48 bytes/cycle for Sandy Bridge.


    Regards, Hans
    How will it turn out in heavier workloads, when a hundred calculations are to be done in a row? I'm not a chip architect, but I believe that if the FPU can be fed enough, a hundred 128-bit calculations would take something like 107 cycles while a hundred 256-bit ones would take 207.
    I know of course that it isn't that simple, but my point is that it can be misleading to talk about single calculations, as the pipeline length distorts the result compared to real-world performance. It's the same as comparing a 20-stage pipeline to a 10-stage one on single calculations: the real-world throughput could be nearly identical, yet on one single calculation the 10-stage one is twice as fast.
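Hans' cycle counts can be reproduced with a toy scheduler. The assumptions (a single pipelined 128-bit add port, 4-cycle latency, 256-bit ops cracked into two 128-bit halves issued on consecutive cycles) are taken from his example, not a claim about the real hardware:

```python
LATENCY = 4  # assumed SIMD FP add latency, as in Hans' example

class Port:
    """A single pipelined 128-bit add unit: one new half-op per cycle."""
    def __init__(self):
        self.free = 0  # next cycle the port can accept a half-op

def latency(expr, halves, port=None):
    """expr is an operand name (ready at cycle 0) or a (left, right) add.
    halves=1 models native 128-bit ops; halves=2 models 256-bit ops cracked
    into two 128-bit halves. Returns per-half completion cycles."""
    port = port or Port()
    if isinstance(expr, str):
        return [0] * halves
    a = latency(expr[0], halves, port)
    b = latency(expr[1], halves, port)
    done = []
    for h in range(halves):
        issue = max(a[h], b[h], port.free)
        port.free = issue + 1        # pipelined: next issue one cycle later
        done.append(issue + LATENCY)
    return done

chain3 = (("A", "B"), "C")           # A+B+C, a dependent chain
tree4  = (("A", "B"), ("C", "D"))    # (A+B)+(C+D), partially parallel
print(max(latency(chain3, 1)), max(latency(chain3, 2)))  # 8 9
print(max(latency(tree4, 1)),  max(latency(tree4, 2)))   # 9 11
```

The model reproduces all four of Hans' numbers, which makes his point concrete: the 256-bit penalty on cracked 128-bit hardware comes only from the one-cycle issue stagger, so the more independent work in flight, the smaller the gap.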

    Quote Originally Posted by saaya View Post
    thanks
    so the problem is for the cpu to decode for the gpu, and make efficient use of the gpu cores? but even if its not very efficient, its extra processing power... for free... even if you only use 10% of the gpu cores, thats still going to be a boost...

    your blog doesnt load for me btw...
    I'm not that good at this stuff, but I guess the problem is latency. It's quicker to do a random operation in the FPU than to send it all the way to the GPU. It's with large workloads that can be parallelized that the GPU is useful.
    Just doing one or a few operations at random is probably much more efficient in a high-clocked FPU than in a low-clocked GPU, at the moment.
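The latency point is just arithmetic: the round trip to the GPU is a fixed cost, so offloading only pays once the batch is big enough to amortize it. The numbers below are made up for illustration, not measurements of any real chip:

```python
cpu_ns_per_op = 1.0       # assumed: high-clocked FPU, one op at a time
gpu_ns_per_op = 0.05      # assumed: wide GPU, per-op throughput
roundtrip_ns  = 10_000.0  # assumed: launch + transfer latency, paid once

def best_device(n_ops):
    """Pick the faster device for a batch of n_ops independent FP ops."""
    cpu_time = n_ops * cpu_ns_per_op
    gpu_time = roundtrip_ns + n_ops * gpu_ns_per_op
    return "cpu" if cpu_time <= gpu_time else "gpu"

print(best_device(100))        # cpu: the round trip dominates a few ops
print(best_device(1_000_000))  # gpu: throughput wins on a large batch
```

Under these made-up constants the break-even batch is around 10,500 operations; everything smaller is cheaper to keep in the FPU, which is exactly the "one or a few operations at random" case.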

  7. #32
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by Dresdenboy View Post
    Think of it like this: You are the boss of a factory producing tables. You have to make 100 tables of 30 different types. What's more efficient then (time and cost wise): to have 20 workers or to have 1000 with all implications (communication, place, ways, controlling..)
    well, it's more like I HAVE 1000 employees, it's just a matter of how many I choose to put on a team to work on something...
    and I don't understand the reference to communication, space, logistics, supervision... is it that much work to get them data and pick up the results?
    in your example, I'd take 1 worker per table to do the final assembly, and tell the other guys to prepare them and do the basic work that is the same for all of them, and then there will be 1 worker for each table who puts it together or does whatever needs to be done on this special model that doesn't get done on all the other ones.

    I mean, that's the big thing here that I don't get... no matter how difficult it is to get them working or how bad they are at what they do... you already hired them; it's just that right now, most of the day, they are sitting around not doing anything.

    Quote Originally Posted by Dresdenboy View Post
    Someone is trying to keep my information away from public... woohooo conspiracy ahoi

    You seem to be not the only one:
    http://support.mozilla.com/de/forum/1/653217

    blog.de suffered some DOS attack and also uses some IP blocking as it seems. I'll look for a different place to provide kind of a backup.

    Maybe you could use an atom reader for http://citavia.blog.de/feed/atom/posts/ (or RSS: http://citavia.blog.de/feed/rss2/posts/ ).
    hahah, yeah, sounds like M from AMD security is on your heels hahaha

  8. #33
    Xtreme Addict
    Join Date
    Aug 2006
    Location
    eu/hungary/budapest.tmp
    Posts
    1,591
    Quote Originally Posted by Hans de Vries View Post
    A few observations

    [img]
    whoa... awesome job! I never thought one could tell so much from a die shot.
    Me, I can only tell the difference between cache and logic, but naming the
    parts... stunning
    Usual suspects: i5-750 & H212+ | Biostar T5XE CFX-SLI | 4GB RAndoM | 4850 + AC S1 + 120@5V + modded stock for VRAM/VRM | Seasonic S12-600 | 7200.12 | P180 | U2311H & S2253BW | MX518
    mITX media & to-be-server machine: A330ION | Seasonic SFX | WD600BEVS boot & WD15EARS data
    Laptops: Lifebook T4215 tablet, Vaio TX3XP
    Bike: ZX6R

  9. #34
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by Frank M View Post
    whoa... awesome job! Never thought one could tell so much from a die shot.
    Me, I can only tell the difference between cache and logic, but naming the
    parts... stunning
    well, if you've worked with something for long enough you get really good at spotting details and can read everything like a book

  10. #35
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    Quote Originally Posted by saaya View Post
    well, if youve worked with something for long enough you get really good at spotting details and can read everything like a book
    But even in books there is a difference between reading, and reading between the lines.
    Understanding words isn't the same thing as understanding the context.

  11. #36
    Xtreme Mentor
    Join Date
    Jun 2008
    Location
    France - Bx
    Posts
    2,601
    Intel AVX page

    For those who don't know this link ...

  12. #37
    Xtreme Addict
    Join Date
    Jan 2007
    Location
    Brisbane, Australia
    Posts
    1,264
    lol @ 3dnow.

    Llano + Quake II + 3dn patch ftw!

    Great work Hans, as always. It'll be interesting to see what it all turns out to be.

  13. #38
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by saaya View Post
    hah, thats fine, luckily the world isnt full of people like you so i can learn
    seriously dude, take it easy and dont act like such an 4ss...

    about offloading fpu code to the gpu...
    i cant believe that its THAT hard... doing it well and doing efficiently, thats probably tricky...
    but you make it sound as if intels and amds engineers would just shrug if youd ask them how it could be done...
    and i just cant believe that... they are solving problems all day in and out, and having worked a lot with hw and sw engineers, they can always think about a way to get xxx working somehow... except for chinese/taiwanese engineers, hah, they are good at copying or tweaking designs but not that good at coming up with something new and tend to say "thats not possible" a lot
    Transmeta attempted something similar. The difference is that they intended for the underlying architecture to intercept x86, and it was not inside an x86 CPU. Even if AMD tried, there would be so many issues that it would be counterproductive. It's pretty much impossible under any real-world constraints. Think about it. It is THAT hard.

  14. #39
    Xtreme Mentor
    Join Date
    Mar 2006
    Posts
    2,978
    Quote Originally Posted by Raqia View Post
    It's interesting that 3dnow! is still being kept around. It's a minuscule amount of die space but it can't be trivial to implement and debug. Company pride I suppose?
    Not surprising; a number of applications utilize a 3DNow! code path, including games and the staple, the MainConcept H.264 encoder. That little bit of die area will never be retired.
    One hundred years from now it won't matter
    what kind of car I drove, what kind of house I lived in,
    how much money I had in the bank, nor what my clothes looked like... but the world may be a little better because I was important in the life of a child.
    -- from "Within My Power" by Forest Witcraft

  15. #40
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    If anything, an integer divide is going to take more logic than an integer multiply.
    Fast computers breed slow, lazy programmers
    The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
    http://www.lighterra.com/papers/modernmicroprocessors/
    Modern Ram, makes an old overclocker miss BH-5 and the fun it was

  16. #41
    c[_]
    Join Date
    Nov 2002
    Location
    Alberta, Canada
    Posts
    18,728
    Quote Originally Posted by Chumbucket843 View Post
    transmeta attempted something similar. the difference is they intended for the underlying architecture to intercept x86 and it is not inside and x86 cpu. even if AMD tried there would be so many issues and it would be counterproductive. its pretty much impossible with any real world constraints. think about it. it is THAT hard.
    Really, it depends on the GPU core; maybe some parts of it are, or will be, designed somewhat like an "SPE", which could allow the CPU to utilize parts of it as an external resource... oh well

    All along the watchtower the watchmen watch the eternal return.

  17. #42
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by -Boris- View Post
    But even in books there is a difference between reading, and reading between the lines.
    Understanding words isn't the same thing as understanding the context.
    sure, reading tons of books doesn't mean you'll necessarily get all the hints... but what's your point? That Hans is reading things into the die shot that just aren't there? It's not like he even claimed it supports AVX; he found hints that it might, and the thread title carries a question mark, so I don't get why people troll him for hinting at an interesting find without posting anything useful themselves...

    even IF Llano CAN do AVX, it doesn't mean it will... it wouldn't be the first time that logic gets added that is only enabled in a later revision or design. it might be broken or not work as well as planned, and it will never be used...

    Quote Originally Posted by Chumbucket843 View Post
    transmeta attempted something similar. the difference is they intended for the underlying architecture to intercept x86 and it is not inside and x86 cpu. even if AMD tried there would be so many issues and it would be counterproductive. its pretty much impossible with any real world constraints. think about it. it is THAT hard.
    hmmm... correct me if I'm wrong, but isn't every modern x86 CPU basically blocks of RISC-like logic with an x86 decoder on top? so then the decoder needs to be able to spit out code that a GPU core can process, and the latter has to be able to output data in a way that can be assembled back into an x86-compatible result... it's basically adding a new type of FPU block to the CPU... so the GPU core wouldn't have to execute x86 code, right?
    please excuse my lack of knowledge about this
    Last edited by saaya; 04-24-2010 at 10:12 AM.

  18. #43
    Xtreme Enthusiast
    Join Date
    Oct 2008
    Posts
    678
    Quote Originally Posted by saaya View Post
    sure, reading tons of books doesnt man youll necessarily get all hints... but whats your point? that hans is interpreting things into the die shot that just arent there? its not like he even claimed it supports avx, he found hints that it might, the thread title carries a question mark, so i dont get why people troll him for hinting at an interesting find without posting anything useful themselves...

    even IF llano CAN do avx, it doesnt mean it will... its not like it would be the first time that logic gets added that is only enabled in a later rev or design. it might be broken or not work as well as planned, and it will never be used...

    No, it's not like that at all. I respect Hans very much, and I always have. I agree with everything you are saying.
    My point is simply that we don't know anything yet. And don't get me wrong, I love speculation and speculate a lot myself. But some readers have already taken his posts as accurate readings of the die. All I'm saying is that even the most skilled engineer can't read every detail in a die shot.
    And as Hans himself pointed out with his question mark, these are mostly fun speculations; however, I do think he has some interesting points, and I would love to see him proven right.
    But we should not take these speculations as truths. And I am convinced that Hans himself wouldn't like his posts to be interpreted that way.

    I apologize to Hans if my posts have in some way been perceived as trolling.

  19. #44
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by saaya View Post
    sure, reading tons of books doesnt man youll necessarily get all hints... but whats your point? that hans is interpreting things into the die shot that just arent there? its not like he even claimed it supports avx, he found hints that it might, the thread title carries a question mark, so i dont get why people troll him for hinting at an interesting find without posting anything useful themselves...

    even IF llano CAN do avx, it doesnt mean it will... its not like it would be the first time that logic gets added that is only enabled in a later rev or design. it might be broken or not work as well as planned, and it will never be used...


    hmmm... correct me if im wrong, but isnt every modern x86 cpu blocks of risc logic with an x86 decoder on top? so then the decoder needs to be able to spit out code that a gpu core can process, and the latter has to be able to spit out data in a way it can be built into an x86 compatible result again... its basically adding a new type of fpu block to the cpu... so the gpu core wouldnt have to execute x86 code, right?
    please excuse my lack of knowledge about this
    well, the underlying structure of modern CPUs is uniformly sized internal instructions and generally a separation of arithmetic from loads/stores, but the arithmetic logic itself needs to be able to produce the expected results efficiently, otherwise there will be serious inefficiency in the design.

    For example, to emulate 80-bit x87 calculations, it is more efficient just to make your FPU 80 bits wide than to attempt to emulate it with a 64-bit FPU.
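The emulation cost is easy to see from significand widths: a 64-bit double carries 53 significand bits, while the x87 extended format carries 64, so values that are exact in extended precision get silently rounded in a double. Python floats are 64-bit doubles, which makes a quick illustration possible:

```python
# x87 extended precision has a 64-bit significand; a 64-bit double only 53.
# Any integer needing more than 53 bits is exact in the x87 format but
# rounds when squeezed into a double, which is why faithfully emulating
# 80-bit x87 on 64-bit hardware needs costly multi-word arithmetic.
x = 2**60 + 1          # needs 61 significand bits: fits x87, not a double
assert float(x) == float(2**60)   # the low bit is rounded away
assert x - 2**60 == 1             # ...but the exact value really did differ
print("double rounds 2**60 + 1 down to", int(float(x)))
```

Recovering those lost bits in software means carrying extra words and handling rounding by hand on every operation, which is nn_step's point: dedicated 80-bit hardware is cheaper than the emulation.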

  20. #45
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    Quote Originally Posted by -Boris- View Post
    No, It's not like that at all, I respect Hans very much, and I always have. I agree in everything that you are saying.
    My point is simply that we don't know anything yet, and don't get me wrong, I love speculations and speculate alot myself. But some readers have already taken his posts as accurate readings of the die. All I'm saying is that even the most skilled engineer can't read the details in die shots.
    And as Hans himself pointed out with his question mark, this are mostly fun speculations, however I do think he does have some interesting points. And I would love to see him right.
    But we shall not take these speculations as truths. And I am convinced that Hans himself wouldn't like his post to be interpreted like that.

    I apologize to Hans if my posts in some way has been percieved as trolling.
    yeah, maybe some people got too carried away
    and even though I can't remember Hans being wrong about any of his die-shot speculations in the past...
    there were several cases where he was right and something was indeed in place, but disabled: hyper-threading in early P4s and... Yamhill in later ones, for example...

    Quote Originally Posted by nn_step View Post
    well the underlying structure for Modern CPUs are uniform sized instruction and generally separation of arithmetic from load/stores but the arithmetic logic itself need to able to approximate the expected results efficiently otherwise there will be serious inefficiency in the design.

    For example to emulate 80bit x87 calculations, it is more efficient to just to make your FPU 80bits, than to attempt to emulate it with a 64bit FPU
    sorry, I didn't understand the part about arithmetic and FPU coordination
    all I got is that the two need to work together, and that's probably a pita if the FPU is off-die or in a faraway block and doing things in a very different manner than what the engineers are used to working with...

    hmmmm, but how about 128 instead of 64? if all you do is widen the interface but the FPU can still only do 64 bits per clock... do you actually gain any performance whatsoever on 128-bit-wide operations? and do you gain 64-bit op perf?

    thx nn_step
    Last edited by saaya; 04-26-2010 at 07:57 AM.

  21. #46
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by saaya View Post
    hmmm... correct me if im wrong, but isnt every modern x86 cpu blocks of risc logic with an x86 decoder on top? so then the decoder needs to be able to spit out code that a gpu core can process, and the latter has to be able to spit out data in a way it can be built into an x86 compatible result again... its basically adding a new type of fpu block to the cpu... so the gpu core wouldnt have to execute x86 code, right?
    please excuse my lack of knowledge about this
    I'm not entirely sure how CPUs break down instructions; the decoder probably takes an instruction and breaks it down into simpler operations. Your concept of the digital logic is not quite right: an ISA is just a set of instructions a processor executes. The processor doesn't need special logic for that particular ISA; it just needs to be able to execute every instruction.

    here is a simple decoder. it's like a switch in a network, not some complex decrypting magic.
    http://www.asic-world.com/digital/co...er_using_Demux

    I understand what you are saying: could an x86 decoder allow instructions to run on a GPU? No, there are many issues. GPUs are still ASICs: they are designed around DirectX, they don't support all of the features that would be needed in a CPU, and they are very different architecturally. For instance, SSE has 8 architectural registers, while Evergreen has 128 GPRs (general purpose registers); both are 128 bits wide. An SSE programmer can't use 128 registers, because 120 of them don't exist!
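The "simple decoder" point is worth making concrete: an address decoder is pure combinational logic, a truth table and nothing more. A behavioural sketch of a 2-to-4 line decoder (a generic example, not any particular chip's circuit):

```python
def decode_2to4(a1, a0, enable=1):
    """2-to-4 line decoder: with enable high, exactly one of the four
    outputs goes high, selected by the two address bits."""
    index = (a1 << 1) | a0
    return [1 if (enable and i == index) else 0 for i in range(4)]

print(decode_2to4(0, 0))  # [1, 0, 0, 0]
print(decode_2to4(1, 1))  # [0, 0, 0, 1]
```

An x86 instruction decoder is enormously more elaborate (variable-length instructions, prefixes, µop cracking), but it is the same kind of machinery: inputs select outputs, no "decrypting magic".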

  22. #47
    Xtreme Enthusiast
    Join Date
    Mar 2005
    Posts
    644
    Quote Originally Posted by Chumbucket843 View Post
    i understand what you are saying: could an x86 decoder allow instructions to run on a gpu? no. there are many issues. gpu's are still asic's. they are designed around directx. they dont support all of the features that would be needed in a cpu and they are very different architecturally. for instance sse has 8 GPR's(general purpose registers). evergreen has 128GPR's. both are 128 bits wide. an sse programmer cant use 128 gpr's because 120 of them dont exist!
    Are you forgetting that x86 processors got register renaming around a decade ago? Applications see the 16 SSE and x86-64 registers (you get 16 of both when running in long mode, and the SSE registers aren't supposed to be GPRs as far as I know), but you don't know how many the processor really keeps track of internally.
    Both the FPU and the GPU do the same thing: floating-point calculations. However, a processor FPU is a very complex monster in silicon that is also hard to program for (x87 assembly was always notoriously difficult, as far as I know), while the GPU has tons of simpler, special-purpose FPUs. Things aren't exactly like that any more, because GPUs evolved from fixed-function units to much more complex, programmable ones, though I don't know how far the potential reach of CUDA and ATI Close to Metal goes compared to the x87 FPU. But the more they work toward the GPGPU approach, the more complex these programmable units should become, and maybe they will acquire a high degree of flexibility.
    The question is whether you can make the less complex GPU programmable units able to understand and process x87 instructions, or vice versa with the processor FPU. If anything, it is possible that in two or three generations the FPU, SSE units and GPU become the same unit, capable of executing all the different instruction sets.
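The renaming point can be sketched as a map from architectural names to a larger physical register file: every write allocates a fresh physical register, so the 16 architectural XMM names fan out over many more real ones. A toy model (real renamers also recycle registers at retirement, which is omitted here):

```python
class Renamer:
    """Toy register renamer: 16 architectural regs over 128 physical ones."""
    def __init__(self, n_arch=16, n_phys=128):
        self.free = list(range(n_arch, n_phys))       # free physical regs
        self.table = {f"xmm{i}": i for i in range(n_arch)}

    def rename(self, dest, *srcs):
        """Map sources through the table, allocate a fresh reg for dest."""
        phys_srcs = tuple(self.table[s] for s in srcs)
        p = self.free.pop(0)
        self.table[dest] = p                          # later reads see p
        return p, phys_srcs

r = Renamer()
w1, _ = r.rename("xmm0", "xmm1", "xmm2")   # first write to xmm0
w2, _ = r.rename("xmm0", "xmm0", "xmm3")   # second write: fresh physical reg
print(w1, w2)   # two different physical registers back the same name
```

Note how the second write reads the old physical register for `xmm0` before remapping it; that is what removes the write-after-write dependency between the two instructions.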

  23. #48
    Xtreme Cruncher
    Join Date
    May 2009
    Location
    Bloomfield
    Posts
    1,968
    Quote Originally Posted by zir_blazer View Post
    Are you forgetting that x86 Processors got Register Renaming from around a decade ago?
    that is a problem, because registers on the GPU are meant to be explicitly managed (you allocate registers per thread). that's just one example of why this can't work.
    Applications see the 16 SSE and x86-64 Registers (You got 16 on both when running on Long Mode. And SSE aren't supposed to be GPR as far that I know), but you don't know how much the Processor really keep track of internally.
    SSE has GPRs along with MMX, XMM and some other registers. I'm pretty sure all of them are doubled in long mode. Without all the register types, SSE is impossible.
    Both the FPU and the GPU do the same thing: Floating point calculations. However, a Processor FPU is a very complex monster on silicon that is also hard to program for (The x87 Assembly always sucked in difficulty, as far that I know) while the GPU got tons of simpler and specific purpose FPU. Things aren't specifically like that right now because GPUs evolved from having fixed function units to much more complex, programable ones, though I don't know how far is the potential reach of CUDA and ATI Close to Metal when compared to the x87 FPU. But the more they work for the GPGPU approach, the more complex these programmable units should become and maybe should adquiere a high grade of flexibility.
    while there is a lot of difference in how FP is implemented, that doesn't change the fact that the end result is just math. x87 doesn't have that many more capabilities w.r.t. functions: you can convert to int, or compute a square root or an exponent, nothing that is very important.
    The question is if you can make the less complex GPU programmable units being able to understand and process x87 instructions or viceversa with the Processor FPU. If anything, it could be possible than in two generations or three the FPU, SSE and GPU become the same unit capable of executing all the different instructions sets.
    the problem is "understanding". it's like translating another language verbatim: even though you translated it, it's still not necessarily correct, and the syntax will be messed up. there are fundamental differences in the way these processors work. GPUs and CPUs are merging architecturally. it's going to be interesting, and I hope we don't get stuck with x86 again.

  24. #49
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by saaya View Post
    yeah, maybe some people got too carried away
    and even though i cant remember hans being wrong about any of his speculations based on die shots in the past...
    there were several cases where he was right and something was indeed in place, but disabled hyper threadding in early p4s and ... yamhil in later ones for example...

    sorry, didnt understand the part about arythmetic and fpu coordination
    all i got is that the two need to work together and thats probably a pita if the fpu is off die or in far away block and doing things in a very different manner than what the engineers are used to working with...

    hmmmm but how about 128 instead of 64? if all you do is widen the interface but the fpu can still only do 64 per clock... then do you actually gain any performance whatsoever on 128bit wide operations? and do you gain 64bit op perf?

    thx nn_step
    well, given that no x86 or x86-64 instruction works on 128-bit floating-point math, it would be an atrocious waste of transistors to implement a true 128-bit FPU. Hence AMD's K10 implements an 80-bit and a 64-bit FPU in parallel, such that 128-bit SIMD instructions can be done at full speed and, if need be, an 80-bit and a 64-bit instruction can be processed in parallel.

  25. #50
    Xtreme Member
    Join Date
    Aug 2007
    Posts
    282
    Btw... there is an interesting poll on AMD's developer site...


    Quick Poll
    Are you currently developing or planning on supporting the AVX instruction set in your application development efforts?
    a) No plans at this time
    b) Currently researching how to take advantage of AVX
    c) Already developing specific AVX code
    d) What's AVX?

    source: http://developer.amd.com/Pages/default.aspx#

    And this AMD's blog mentions some interesting things too:
    http://blogs.amd.com/developer/2009/...ing-a-balance/
    Last edited by jogshy; 04-26-2010 at 10:40 PM.

