@Apokalipse, Dolk:
Nice to see at least some technical discussion going on, other than analyzing the wave of faked BD screenshots ^^
To your discussion:
Well, after providing 2 instruction streams to a BD module, the OS is kind of out of the loop of executing them.
The fetch/decode/dispatch path works on 2 threads, capable of switching between them on a per-cycle-basis. The dispatch unit dispatches groups of "cops" (complex ops, AMD term) belonging to either thread (I hope, my English skills aren't killing the message here

). Such a dispatch group might contain ALU, AGU and FP ops of one thread. I assume the FP ops or the according dispatch group have a tag denoting their respective thread.
From now on the integer and the FP schedulers are kind of independently working on the cops/uops they received, until control flow changes. The FP scheduler "sees" a lot of uops asking for ready inputs and producing outputs. Except for control flow changes it actually doesn't even need to know to which thread an op belongs to. The thread relation is assured by a register mapping (has to be done for OOO execution anyway). So the FPU could execute uops as their inputs become available.
If there are 256b ops, they're already decoded as 2 uops ("Double decode" type) to do calculations on both 128b wide halves of the SIMD vector. I don't think that there is an explicit locking mechanism. The FPU could basically issue both uops in the same cycle or - if there are ready uops of the other thread - issue them to one 128b unit during 2 consecutive cycles (the software optimization manual show latencies of 256b ops supporting this). This actually depends on the implemented policy: issue uops based on their age only or issue them ensuring balanced execution of both threads.
But what's important to your discussion: this happens independently from both integer cores.
Bookmarks