AMD Ontario APU pictured, die size ~77mm^2

**kl0012** · 09-07-2010, 11:28 PM

Originally Posted by informal

Yes.

I'll believe it when I see it. I have some doubts about these tests. The max int instruction throughput for Bobcat at 1.6 GHz is 3.2 GIPS (limited by its two decoders). I am very sceptical that Bobcat can reach nearly max instruction throughput in this synthetic test (while Conroe & Athlon 64 can't). Still very impressive if true.

**Calmatory** · 09-08-2010, 02:14 AM

Originally Posted by kl0012

I'll believe it when I see it. I have some doubts about these tests. The max int instruction throughput for Bobcat at 1.6 GHz is 3.2 GIPS (limited by its two decoders). I am very sceptical that Bobcat can reach nearly max instruction throughput in this synthetic test (while Conroe & Athlon 64 can't). Still very impressive if true.

Even if it had, say, 8 decoders it wouldn't be any faster. It could in theory run at 12.8 "GIPS", but in practice it wouldn't run any faster (except in cases where it could actually exploit ILP > 2, which I believe is quite rare with the given code).

But haters gonna hate.

**kl0012** · 09-08-2010, 07:28 AM

Originally Posted by Calmatory

Even if it had, say, 8 decoders it wouldn't be any faster. It could in theory run at 12.8 "GIPS", but in practice it wouldn't run any faster (except in cases where it could actually exploit ILP > 2, which I believe is quite rare with the given code).

But haters gonna hate.

But a CPU with only two decoders doesn't necessarily have the same IPC as a CPU with 4 decoders, even when it executes code with ILP <= 2. A simple example (a sequence of 4 arithmetic operations):
a = b + c
a = a + d
e = g + h
e = e + f
A CPU with 4 decoders can execute the first and third instructions in the same cycle, while a CPU with 2 decoders needs one more cycle to get there, because the third instruction is not decoded until the second cycle. Of course, in reality things are a bit more complex because of the out-of-order buffer, but again, I really doubt Bobcat has a bigger OoO instruction window than Conroe/Athlon 64.
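To put rough numbers on the example above, here is a toy cycle count in Python. It is only a sketch under loud assumptions that are not from any real core: in-order decode of `decode_width` instructions per cycle, unlimited execution units, 1-cycle latency for every op, and an instruction can execute no earlier than the cycle after it is decoded. The function name `cycles_to_finish` and the dependency-list encoding are made up for illustration.

```python
def cycles_to_finish(deps, decode_width):
    """Toy model: return the cycle in which the last instruction executes.

    deps[i] lists the indices of earlier instructions that instruction i
    depends on. Assumptions (hypothetical, not a real CPU): in-order decode
    of decode_width instructions per cycle, unlimited execution units,
    1-cycle latency for every operation.
    """
    executed = []  # cycle in which each instruction executes
    for i, producers in enumerate(deps):
        decoded = i // decode_width + 1  # cycle in which i is decoded
        inputs_ready = max((executed[p] for p in producers), default=0)
        # Execute one cycle after both decode and all producers are done.
        executed.append(max(decoded, inputs_ready) + 1)
    return max(executed, default=0)


# a = b + c ; a = a + d ; e = g + h ; e = e + f
example = [[], [0], [], [2]]

for width in (2, 4, 8):
    print(width, "decoders:", cycles_to_finish(example, width), "cycles")
```

In this toy model the 2-wide front end finishes one cycle later than the 4-wide one, and going from 4 to 8 decoders changes nothing because the code only has two independent chains. It says nothing about steady-state loops, where the out-of-order window can hide the difference.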

**Calmatory** · 09-08-2010, 10:46 AM

Originally Posted by kl0012

But a CPU with only two decoders doesn't necessarily have the same IPC as a CPU with 4 decoders, even when it executes code with ILP <= 2. A simple example (a sequence of 4 arithmetic operations):
a = b + c
a = a + d
e = g + h
e = e + f
A CPU with 4 decoders can execute the first and third instructions in the same cycle, while a CPU with 2 decoders needs one more cycle to get there, because the third instruction is not decoded until the second cycle. Of course, in reality things are a bit more complex because of the out-of-order buffer, but again, I really doubt Bobcat has a bigger OoO instruction window than Conroe/Athlon 64.

Two words: vectorization and SIMD.

**kl0012** · 09-08-2010, 08:54 PM

Originally Posted by Calmatory

Two words: vectorization and SIMD.

Vectorization is good, but it is not a panacea. Replace the third operation in my example with "mul", "and", "shift", "test" or "sub" and SIMD won't help (while these ops are still independent). But my point was simple: the bigger the pool of uops a CPU has available for execution, the better chance its OoO logic has to exploit ILP. This is why I'm surprised by the Bobcat results (if they are real). I would guess that they have used a loop buffer, but such a buffer would consume a lot of space on the CPU die.
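The SIMD point can be sketched the same way. The check below is a deliberately simplified, hypothetical rule (the name `simd_packable` and the tuple encoding are invented here): a group of scalar ops can share one SIMD instruction only if they all use the same opcode and none of them reads another's result. Real vectorizers look at far more than this (data layout, types, available instructions).

```python
def simd_packable(ops):
    """ops: list of (opcode, dest, src1, src2) tuples.

    Simplified, hypothetical rule: packable into one SIMD instruction only
    if every op has the same opcode and no op reads another op's destination.
    """
    same_opcode = len({op[0] for op in ops}) == 1
    independent = all(
        other[1] not in op[2:]
        for i, op in enumerate(ops)
        for j, other in enumerate(ops)
        if i != j
    )
    return same_opcode and independent


# Ops 1 and 3 of the original example: both adds, independent.
print(simd_packable([("add", "a", "b", "c"), ("add", "e", "g", "h")]))

# Third op swapped for a mul: still independent, but the opcodes differ.
print(simd_packable([("add", "a", "b", "c"), ("mul", "e", "g", "h")]))
```

Under this rule the first pair is packable and the second is not, even though both pairs are independent and a wide enough out-of-order core could still run either pair in parallel.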
