It can do both. In recent papers the Bulldozer designers call those ops "Cops" (complex ops, the equivalent of an ALU/FP micro-op plus a mem op [load/store/load+store]).
Decode: 4 Cops/cycle/module, or up to 5 with branch fusion (IIRC the branch op then has to be in the last slot).
Issue: 2 ALU ops + 2 AGLU ops per cycle per core, plus 4 FP/SIMD ops per cycle per module in the FPU (which is shared by both threads).
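To put those numbers side by side, here is a rough back-of-the-envelope sketch in Python. It only adds up the peak per-cycle figures quoted above; the constant and function names are made up for illustration, and none of this says anything about sustained throughput.

```python
# Peak per-cycle figures as quoted in this post (not measured throughput).
# All names here are hypothetical, chosen only for this illustration.
CORES_PER_MODULE = 2
DECODE_COPS_PER_MODULE = 4   # up to 5 with branch fusion
ALU_OPS_PER_CORE = 2
AGLU_OPS_PER_CORE = 2
FP_OPS_PER_MODULE = 4        # FPU is shared by both threads


def peak_issue_per_module():
    """Sum of integer issue slots of both cores plus the shared FPU slots."""
    integer_slots = CORES_PER_MODULE * (ALU_OPS_PER_CORE + AGLU_OPS_PER_CORE)
    return integer_slots + FP_OPS_PER_MODULE


def issue_to_decode_ratio():
    """How many peak issue slots exist per decoded Cop (without fusion)."""
    return peak_issue_per_module() / DECODE_COPS_PER_MODULE


if __name__ == "__main__":
    print("peak issue slots per module:", peak_issue_per_module())
    print("issue slots per decoded Cop:", issue_to_decode_ratio())
```

With the figures above that gives 12 peak issue slots per module against 4 decoded Cops per cycle, which is why the decode width and the issue width should not be confused with each other.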
Bulldozer's decode unit extracts and decodes up to four x86 instructions per cycle from raw instruction bytes. The decode pipeline converts x86 instructions into Cops that can directly execute on the functional units.
The scheduler picks and schedules four Cops per cycle to the execution units out of order.
On the FPU:
AMD designed the Bulldozer FPU to deliver industry-leading performance on HPC, multimedia, and gaming applications. The primary means of achieving such performance is a four-wide, two-way, multithreaded, fully out-of-order FPU, combined with two 128-bit FMAC units supported by a 128-bit high-bandwidth load/store subsystem.
Source: Michael Butler, Leslie Barnes, Debjit Das Sarma, Bob Gelinas, "Bulldozer: An Approach to Multithreaded Compute Performance," IEEE Micro, pp. 6-15, March/April, 2011
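For a feel of what "two 128-bit FMAC units" means in peak terms, here is a small Python sketch. It assumes one fused multiply-add counts as two floating-point operations per lane and that both units issue a full-width FMA every cycle; the names are hypothetical and the result is a theoretical per-module peak, not a figure from the paper.

```python
# Theoretical peak derived from the quoted "two 128-bit FMAC units".
# Assumptions (not from the paper): each FMA counts as 2 FLOPs per lane,
# and both units issue a full-width FMA every cycle.
FMAC_UNITS = 2
FMAC_WIDTH_BITS = 128
FLOPS_PER_FMA = 2  # one multiply plus one add


def peak_flops_per_cycle(element_bits):
    """Peak FLOPs per cycle per module for a given element size in bits."""
    lanes = FMAC_WIDTH_BITS // element_bits
    return FMAC_UNITS * lanes * FLOPS_PER_FMA


if __name__ == "__main__":
    print("single precision:", peak_flops_per_cycle(32))
    print("double precision:", peak_flops_per_cycle(64))
```

Since the FPU belongs to both threads, this peak is per module, not per core.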
Here are some links related to Chuck Moore's comments at Financial Analyst Day 2010, where he mentioned an issue rate of 4 "instructions" per cycle per core, as well as the same decode bandwidth:
http://citavia.blog.de/2010/04/22/pr...143/#c12914412
David Kanter's article on BD gives more details if the software optimization manual is too cryptic.
http://realworldtech.com/page.cfm?Ar...WT082610181333