I'd also like to know what types of instructions the AGLU can execute. What I know about it comes from the Optimisation Manual plus my own assumptions: the AGLU can execute address calculations and LEA, and can probably execute INC. AMD's manual does say that the AGLU can execute simple ALU operations. Maybe I'm wrong about INC, but it seems possible for such a unit to support some instructions other than CALL and LEA.
If it can calculate an address, that unit can also execute a simple ADD or INC on unsigned integers (address + offset) and some logical operations like XOR or AND. That is my speculation, because the Optimisation Manual probably isn't complete.
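To make the reasoning above concrete: an x86 effective address is base + index*scale + displacement, i.e. a shift and a couple of additions, which is why it seems plausible that address-generation hardware could also do a plain ADD or INC. A minimal Python sketch of that arithmetic follows. The function name effective_address and the 64-bit wraparound are illustrative assumptions, not anything taken from AMD's manual, and the sketch says nothing about which unit actually executes what.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit address arithmetic with wraparound


def effective_address(base, index=0, scale=1, disp=0):
    """Compute base + index*scale + disp the way an x86 memory operand does.

    Hypothetical helper for illustration only; it models the arithmetic,
    not any particular execution unit.
    """
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale factor must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK64


# LEA with base, index and displacement: one shift plus two adds
lea_result = effective_address(0x1000, index=4, scale=8, disp=16)

# A plain ADD or INC is the degenerate case of the same adder
# (ignoring the flags that a real ADD/INC also updates)
add_result = effective_address(0x1000, disp=0x20)  # like ADD reg, 0x20
inc_result = effective_address(0x1000, disp=1)     # like INC reg

print(hex(lea_result), hex(add_result), hex(inc_result))
```

The one thing this leaves out is flag generation: LEA does not touch the flags, while ADD and INC do, so a real unit would need at least that extra logic.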
The Optimisation Guide also says this:
"There are four integer execution units per core. Two units which handle all arithmetic, logical and shift operations (EX). And two which handle address generation and simple ALU operations (AGLU). Figure 2 shows a block diagram for one integer cluster. There are two such integer clusters per compute unit."
The Optimisation Manual says that the AG0|AG1 units execute the LEA instruction when it works with 3 operands, but with legacy 2-operand forms LEA can be executed only on the EX0|EX1 units. AG0|AG1 can also execute CALL instructions, which are decoded as double ops: the first op executes on EX and the second op executes on the AGLU.
The CALL instruction clearly transfers control to another procedure, and the RET instruction returns to the instruction following the call.
But that isn't a big difference compared to K10. K10 also executes the CALL instruction as a double op, but on BD, CALL disp (near) and CALL reg (near) have 50% lower latency than on 10h, and CALL mem (near) is hardwired (double decoded), whereas on 10h it is microcoded.
According to the Optimisation Manual, the main difference between the BD AGU and the 10h AGU is that the BD AGU can execute LEA when it works with three operands, and CALL is fully hardwired, with slightly lower latencies.
Use google translate to learn Serbian...:P
I will translate those diagrams to English, that isn't a problem, but I think they are understandable in that version. A picture is worth a thousand words! :P
Last edited by drfedja; 08-09-2011 at 03:07 PM.
"That which does not kill you only makes you stronger." ---Friedrich Nietzsche
PCAXE
drfedja, awesome looking diagrams; however, I think there's a small error for the Bulldozer FPU. Bulldozer only has 1 IMAC unit and it's located in pipe 0, which it shares with an FMA unit and a convert unit. The two integer units in pipe 2 and pipe 3 are for vector integer ADD (multiplication is done in the IMAC unit) as well as AVX, SSE and x87 instructions that are not handled by the FMA, CVT, or XBAR units.
You are right. I will correct that.
There are no 256-bit integer AVX instructions (256-bit int. AVX comes with AVX2) and integer FMA is handled by FP pipe 0.
According to the Optimisation Manual:
"A 128-bit integer multiply accumulate (IMAC) unit is incorporated into FPU pipe 0. The IMAC performs integer fused multiply and accumulate, and similar arithmetic operations on AVX, MMX and SSE data. A crossbar (XBAR) unit is integrated into FPU pipe 1 to execute the permute instruction along with shifts, packs/unpacks and shuffles. There is an FPU load-store unit which supports up to two 128-bit loads and one 128-bit store per cycle."
All of them can execute integer SIMD instructions. Pipe 0 performs integer fused multiply-accumulate, and pipe 1 executes shuffles, FSTORE and all of the integer SIMD instructions.
There are four units. According to the instruction latencies table, the CVT unit is in pipe 0, which is shared with FMA0 (128-bit FP block); pipe 1 holds the XBAR unit, which is responsible for shuffle operations and is shared with FMA1; pipe 2 and pipe 3 are integer SIMD units which also handle some bitwise or logic operations, e.g. ANDNP, ANDNS are FP bitwise operations but they are handled by pipe 2 and pipe 3. That type of instruction has the same throughput on 10h, but there it shares a pipe with FADD or FMUL.
The thread's becoming more & more technical. I still wonder whether Bulldozer could use all four FPU units (2x 128-bit FMAC + 2x 128-bit MMX) to run SuperPi.
SuperPi uses about 50% legacy x87 FP operations. The average IPC of SPi is 0.65-0.7 on the 10h microarchitecture. SPi is a mixed type of code, and it is very memory dependent, because there are a lot of memory stack operations. In general, FPU throughput isn't the bottleneck when executing SPi. x87 execution of SPi is limited by the inefficiency of the 10h memory subsystem (LS units -> L1D -> L2 -> L3 caches). I think that SPi could be much better on BD, but still significantly slower than on Sandy or even Nehalem.
In general, SPi isn't code optimised for modern parallel SIMD architectures. There could also be problems with unaligned memory access, store-to-load forwarding and data dependencies that force the code to run serialized. That is probably the main reason why the Core architecture is so superior when running such unoptimized code.
Thanks, so there is no newer information about AGLUs.
My speculation: if the AGLUs can handle any of the ALU operations, then they must at least know normal, zero-extended and sign-extended register copies, so the instruction table is inaccurate. Additions must also be handled by them and, with slightly more complexity, the SUB, NEG, INC, DEC and CMP operations (not the fused compares, just the standalone ones). The logical NOTs, ANDs, ORs, XORs and (not fused) TESTs also require only slightly more simple circuitry.
And exactly these operations are performed by the double-pumped fast ALUs in Netburst, at a rate of 4 per cycle.
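The claim that SUB, NEG, INC, DEC and CMP need only a little more circuitry than ADD can be sketched in code: in two's complement, each of them is an addition with an optionally inverted operand and a carry-in. The Python below is only an illustration with hypothetical helper names and an assumed 64-bit width; it does not describe how the AGLUs are actually built.

```python
WIDTH = 64                 # assumption: 64-bit registers
MASK = (1 << WIDTH) - 1


def adder(a, b, carry_in=0):
    """The one shared building block: a + b + carry_in, truncated to WIDTH bits."""
    return (a + b + carry_in) & MASK


def add(a, b):
    return adder(a, b)


def sub(a, b):
    # a - b == a + ~b + 1 in two's complement
    return adder(a, ~b & MASK, 1)


def neg(a):
    # NEG is just 0 - a
    return sub(0, a)


def inc(a):
    # INC reuses the carry-in as the "+1"
    return adder(a, 0, 1)


def dec(a):
    # a - 1 == a + all-ones
    return adder(a, MASK)


def cmp_equal(a, b):
    # CMP is a SUB whose result is discarded; only the flags are kept.
    # Only the zero flag is modelled here.
    return sub(a, b) == 0


print(sub(10, 3), neg(1) == MASK, inc(MASK), dec(0) == MASK, cmp_equal(5, 5))
```

Every operation goes through the same adder function, which is the point: the extra hardware is an inverter on one input and control of the carry-in, plus flag logic that is not modelled here.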
-
Maybe. Some people on the Net have speculated that the simple ALU or AGLU on BD may handle instructions like the 30-year-old 6502.
However, mov, push and call are the most frequent instructions in x86 machine code. SUB, NEG, INC, DEC and CMP are also often used, so the AGLU could be very useful.
Attachment 118788
I think the MOV section on the picture covers simple loads and stores (certainly the AGUs' work) and register copies. Since the PUSH and POP instructions have already been reduced to single store and load operations at execution level by the K10's Stack Engine, they had to look for another instruction group to speed up (the concept I mentioned could raise general integer execution speed from K10's 3 to 4 ALU operations/cycle/thread). Many conditional jumps (je, jne and others) will be fused with the preceding TEST/CMP instruction, so the add-like and logical ALU instructions I listed would cover another 10-12% on the picture via the AGLUs.
-
good idea for cpus that are ONLY expensive, since people buying such chips can't really do anything about it. but for chips in the $300-400 range it's going to deter people who don't need the nice heatsink, unless the regular versions are also available at a discount.
2500k @ 4900mhz - Asus Maxiums IV Gene Z - Swiftech Apogee LP
GTX 680 @ +170 (1267mhz) / +300 (3305mhz) - EK 680 FC EN/Acteal
Swiftech MCR320 Drive @ 1300rpms - 3x GT 1850s @ 1150rpms
XS Build Log for: My Latest Custom Case
Intel Considers to Bundle Liquid Cooling Solution with Next-Generation Enthusiast Processors
http://www.xbitlabs.com/news/coolers...rocessors.html
That is an interesting concept, but then they have to deal with the potential for RMAs. Having a fan fail on an inexpensive heatsink isn't the same, or as likely, as having the pump fail on a water-cooling unit. Kind of a cool idea, but I'd be afraid it would come back to haunt them in the long term.
I had a Coolit sealed water system's pump pop. It took out the whole system except the CPU (5870, AM3 motherboard). I would imagine that if AMD went this way, they would use one of those sealed systems as they are the cheapest. If an air cooler fails, you get a system shutdown or freeze; if a water cooler fails, you get quite a bit of damage.
all alienware PCs come with these now, so i bet they are reliable enough to use with other products too
would AMD manufacture them or farm them out?
I think it would be a good idea: the top-model CPU bundled with a better air cooler (tower type) or a low-end WT setup such as the H50 or H70....
ROG Power PCs - Intel and AMD
CPUs:i9-7900X, i9-9900K, i7-6950X, i7-5960X, i7-8086K, i7-8700K, 4x i7-7700K, i3-7350K, 2x i7-6700K, i5-6600K, R7-2700X, 4x R5 2600X, R5 2400G, R3 1200, R7-1800X, R7-1700X, 3x AMD FX-9590, 1x AMD FX-9370, 4x AMD FX-8350,1x AMD FX-8320,1x AMD FX-8300, 2x AMD FX-6300,2x AMD FX-4300, 3x AMD FX-8150, 2x AMD FX-8120 125 and 95W, AMD X2 555 BE, AMD x4 965 BE C2 and C3, AMD X4 970 BE, AMD x4 975 BE, AMD x4 980 BE, AMD X6 1090T BE, AMD X6 1100T BE, A10-7870K, Athlon 845, Athlon 860K,AMD A10-7850K, AMD A10-6800K, A8-6600K, 2x AMD A10-5800K, AMD A10-5600K, AMD A8-3850, AMD A8-3870K, 2x AMD A64 3000+, AMD 64+ X2 4600+ EE, Intel i7-980X, Intel i7-2600K, Intel i7-3770K,2x i7-4770K, Intel i7-3930KAMD Cinebench R10 challenge AMD Cinebench R15 thread Intel Cinebench R15 thread
http://www.amdzone.com/phpbb3/viewto...rt=900#p209349
Starting to look like BD is a total dud. Sorry, it hurts me to say this but I believe OBR is correct. I just hope I didn't waste $230 CDN on a Crosshair V mobo.
BD = DNF of processors. ???
As quoted by LowRun......"So, we are one week past AMD's worst case scenario for BD's availability but they don't feel like communicating about the delay, I suppose AMD must be removed from the reliable sources list for AMD's products launch dates"
I'll wait till the end of October before I slowly start losing faith in BD. I hope that day never comes.
i7 920@4.34 | Rampage II GENE | 6GB OCZ Reaper 1866 | 8800GT (zzz) | Corsair AX750 | Xonar Essence ST w/ 3x LME49720 | HiFiMAN EF2 Amplifier | Shure SRH840 | EK Supreme HF | Thermochill PA 120.3 | MCP355 | XSPC Reservoir | 3/8" ID Tubing
Phenom 9950BE @ 3400/2000 (CPU/NB) | Gigabyte MA790GP-DS4H | HD4850 | 4GB Corsair DHX @850 | Corsair TX650W | T.R.U.E Push-Pull
E2160 @3.06 | ASUS P5K-Pro | BFG 8800GT | 4GB G.Skill @ 1040 | 600W Tt PP
A64 3000+ @2.87 | DFI-NF4 | 7800 GTX | Patriot 1GB DDR @610 | 550W FSP