AMD Zambezi news, info, fans !

**undone** · 10-05-2011, 09:53 AM

Originally Posted by Voodoo²

Look at the production date of that chip. "1136", that is September, right?

Something else is weirder and more surprising:

http://semiaccurate.com/forums/showp...5&postcount=13

stepping A1, week 36, year 2011.

A1 chip production was underway in September? wtf?
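For reference, here is a minimal sketch of how a date code like "1136" can be read as year 2011, week 36. It assumes the code is laid out as YYWW and that the week follows ISO-8601 numbering; neither is confirmed in this thread, and `decode_date_code` is a hypothetical helper name.

```python
from datetime import date, timedelta


def decode_date_code(code: str) -> tuple[date, date]:
    """Return the (Monday, Sunday) of the week named by a YYWW code.

    Assumption: two-digit year followed by an ISO-8601 week number.
    """
    year = 2000 + int(code[:2])
    week = int(code[2:])
    monday = date.fromisocalendar(year, week, 1)  # day 1 = Monday
    return monday, monday + timedelta(days=6)


start, end = decode_date_code("1136")
print(start, end)
```

Under those assumptions the week runs from 5 to 11 September 2011, which fits the "September" reading above.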

**Voodoo²** · 10-05-2011, 10:17 AM

I'm not sure what "FA1" stands for, but as far as I remember the stepping was always indicated by one of the letters in the part number, here "FD8150FRW8KGU". For example:

Phenom II X4 955 C2 stepping HDZ955FBK4DGI

Phenom II X4 955 C3 stepping HDZ955FBK4DGM
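A quick way to see which character separates the two part numbers above is to compare them position by position. This is only an illustrative sketch; `diff_positions` is a made-up helper, and it says nothing about how AMD actually encodes the stepping.

```python
def diff_positions(a: str, b: str) -> list[tuple[int, str, str]]:
    """List (index, char_in_a, char_in_b) wherever two part numbers differ."""
    if len(a) != len(b):
        raise ValueError("part numbers must have the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


c2 = "HDZ955FBK4DGI"  # Phenom II X4 955, C2 stepping (from the post above)
c3 = "HDZ955FBK4DGM"  # Phenom II X4 955, C3 stepping (from the post above)
print(diff_positions(c2, c3))
```

For these two strings only the last character differs ("I" vs "M"), which is the letter the post refers to.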

**imamage** · 10-05-2011, 10:38 AM

Originally Posted by Opteron146

I read it, but do you have a BD? So why should I believe you?
Cinebench 11.5 scores are rather bad; even CB10 is doing better. Hence I do not believe that the FPU is maxed out at all, especially as neither FMAC nor XOP/"MMX" code (the other 2 pipes in the FPU) is used. Thus I think there is enough headroom for the 3.9 GHz Turbo stage. Anyway, we'll know in less than 1 week ;-)

Dang, I wish I had one before retail launch!!!

**BeepBeep2** · 10-05-2011, 11:43 AM

A1 is not the stepping unless they changed the naming scheme...

CACDC for Deneb translates to FA1 on these new processors...

**undone** · 10-05-2011, 11:52 AM

http://www.shopblt.com/cgi-bin/shop/...r_id=296538691

**Manicdan** · 10-05-2011, 11:57 AM

What's the difference between box and tray? The 4100 for $121 sounds very desirable, but it just seems a little strange looking.

**undone** · 10-05-2011, 12:03 PM

Originally Posted by Manicdan

What's the difference between box and tray? The 4100 for $121 sounds very desirable, but it just seems a little strange looking.

Someone guesses it's a copy-paste typo.

http://www.planet3dnow.de/vbulletin/...&postcount=625

**Manicdan** · 10-05-2011, 12:15 PM

Yeah, I was expecting something like that. The $189 should be 4C, and the $121 should be an X4 non-FX model CPU, for it all to make sense.

**Mechanical Man** · 10-05-2011, 12:23 PM

Originally Posted by Manicdan

What's the difference between box and tray? The 4100 for $121 sounds very desirable, but it just seems a little strange looking.

I don't think that is the real price. It's some kind of error.

But the difference between box and tray is the cooler. Box comes with a cooler, tray does not.

**bamtan2** · 10-05-2011, 12:52 PM

I get concerned when I see people touting review units that only include one chip.

There are 4 chips to review. If we get only one chip reviewed on the 12th I will be smashing things.

**Apokalipse** · 10-05-2011, 01:12 PM

Originally Posted by Musho

Also, the cores are running at 80% efficiency when both cores in a single module are loaded

AMD said 80% more performance than single threaded (in the same module, also presuming same frequency), meaning 180% of single threaded, or (180/2) 90% for each core.

**Leeghoofd** · 10-05-2011, 01:20 PM

Start smashing then; so far I only know about 8150 models being shipped in press kits...

**informal** · 10-05-2011, 01:25 PM

Originally Posted by Apokalipse

AMD said 80% more performance than single threaded (in the same module, also presuming same frequency), meaning 180% of single threaded, or (180/2) 90% for each core.

Actually they officially said (in presentations) 80% of a CMP design, which was presumably a CMP-type Bulldozer with nothing shared (except maybe L3). But we have been over this before.
I suspect the biggest hit will be running FP-heavy code and that the 80% figure comes from that. It's logical when you think about it: instead of replicating "full" cores in order to get 8 FPUs, you invest more resources in each FPU, increase the bandwidth to the unit and make it shareable between 2 integer cores. In the process you design the unit so that it uses SMT for 2 threads running on 2 dedicated pieces of hardware inside it. This way you have 4 new FPUs, now shared, that produce only 25% less throughput than 8 "full" ones in CMP (without SMT, probably), and all this saves you considerable die area and grants you some TDP and clock headroom. Pretty neat idea, isn't it?

**duron** · 10-05-2011, 06:37 PM

look at all those goodies that come with it

hmmm some got them early(scroll down to last part)
http://www.tipidpc.com/viewtopic.php...2787&page=4183

**tbone8ty** · 10-05-2011, 06:44 PM

results anybody? nda?

plenty of press kits around, give us a tease

**Daveburt714** · 10-05-2011, 09:54 PM

You know, I'm really excited that FX is so close now....

Regardless of final performance compared to Intel, I can't help but think that once we get our hands on these chips all the crazy fud/benchies are going to seem ridiculous....

I've been reading all this stuff for the last 9 months, and you wouldn't believe how bad I've been biting my tongue.

Some may have been right, some may have been wrong, but once I (we) can test for ourselves all the questions will finally be answered!

I'm sure there's some Firmware/Software/Hardware/OS tweaks that need to be done to get the best results from this new uARCH, but at least
it will finally be out there and worked on.

BRING'EM ON BABY.....

If nothing else, I need a new adventure, and this chip looks like fun!

**BeepBeep2** · 10-05-2011, 10:06 PM

Originally Posted by Daveburt714

You know, I'm really excited that FX is so close now....

Regardless of final performance compared to Intel, I can't help but think that once we get our hands on these chips all the crazy fud/benchies are going to seem ridiculous....

I've been reading all this stuff for the last 9 months, and you wouldn't believe how bad I've been biting my tongue.

Some may have been right, some may have been wrong, but once I (we) can test for ourselves all the questions will finally be answered!

I'm sure there's some Firmware/Software/Hardware/OS tweaks that need to be done to get the best results from this new uARCH, but at least
it will finally be out there and worked on.

BRING'EM ON BABY.....

If nothing else, I need a new adventure, and this chip looks like fun!

This architecture seems very voltage friendly as well

So is Llano, GF's 32nm is very voltage hungry...2v+ for CPU-Z validations on LN2/LHe with BD.

It will be fun

**Apokalipse** · 10-05-2011, 10:23 PM

Originally Posted by informal

Actually they officially said (in presentations) 80% of a CMP design, which was presumably a CMP-type Bulldozer with nothing shared (except maybe L3). But we have been over this before.

CMP (Chip Multi Processor) is two "full" cores, and CMT (Cluster-based MultiThreading) is what they call the modules idea:

That's the slide I was referencing, where they said 80% gain

Although in retrospect it also says 50% area investment, so I'm not sure if that exactly describes the actual BD modules used in Zambezi, which AMD said have 12% larger die area than a "full" core (hypothetical BD "full" core, not K10.5).

Originally Posted by informal

I suspect the biggest hit will be running FP-heavy code and that the 80% figure comes from that. It's logical when you think about it: instead of replicating "full" cores in order to get 8 FPUs, you invest more resources in each FPU, increase the bandwidth to the unit and make it shareable between 2 integer cores. In the process you design the unit so that it uses SMT for 2 threads running on 2 dedicated pieces of hardware inside it. This way you have 4 new FPUs, now shared, that produce only 25% less throughput than 8 "full" ones in CMP (without SMT, probably), and all this saves you considerable die area and grants you some TDP and clock headroom. Pretty neat idea, isn't it?

Bulldozer's FlexFP:
http://blogs.amd.com/work/2010/10/25/the-new-flex-fp/
Basically it is two 128-bit FMACs with a shared scheduler, which work alongside two integer cores.

The Flex FP unit is built on two 128-bit FMAC units. The FMAC building blocks are quite robust on their own. Each FMAC can do an FMAC, FADD or a FMUL per cycle. When you compare that to competitive solutions that can only do an FADD on their single FADD pipe or an FMUL on their single FMUL pipe, you start to see the power of the Flex FP – whether 128-bit or 256-bit, there is flexibility for your technical applications. With FMAC, the multiplication or addition commands don’t start to stack up like a standard FMUL or FADD; there is flexibility to handle either math on either unit. Here are some additional benefits:

Non-destructive DEST via FMA4 support (which helps reduce register pressure)
Higher accuracy (via elimination of intermediate round step)
Can accommodate FMUL OR FADD ops (if an app is FADD limited, then both FMACs can do FADDs, etc), which is a huge benefit

The new AES instructions allow hardware to accelerate the large base of applications that use this type of standard encryption (FIPS 197). The “Bulldozer” Flex FP is able to execute these instructions, which operate on 16 Bytes at a time, at a rate of 1 per cycle, which provides 2X more bandwidth than current offerings.

By having a shared Flex FP the power budget for the processor is held down. This allows us to add more integer cores into the same power budget. By sharing FP resources (that are often idle in any given cycle) we can add more integer execution resources (which are more often busy with commands waiting in line). In fact, the Flex FP is designed to reduce its active idle power consumption to a mere 2% of its peak power consumption.

The Flex FP gives you the best of both worlds: performance where you need it yet smart enough to save power when you don’t need it.

The beauty of the Flex FP is that it is a single 256-bit FPU that is shared by two integer cores. With each cycle, either core can operate on 256 bits of parallel data via two 128-bit instructions or one 256-bit instruction, OR each of the integer cores can execute 128-bit commands simultaneously. This is not something hard coded in the BIOS or in the application; it can change with each processor cycle to meet the needs at that moment. When you consider that most of the time servers are executing integer commands, this means that if a set of FP commands need to be dispatched, there is probably a high likelihood that only one core needs to do this, so it has all 256-bit to schedule.

Floating point operations typically have longer latencies so their utilization is typically much lower; two threads are able to easily interleave with minimal performance impact. So the idea of sharing doesn’t necessarily present a dramatic trade-off because of the types of operations being handled.

Here are the 4 likely scenarios for each cycle:

It looks like it almost has enough FP resources to get the same performance as two "full" cores, the exception being if two 256-bit instructions were issued at once - though the capability to do that requires much more (largely unused) die area.
So I would think two-thread scaling (in one module) is largely a matter of the shared front-end's capability to feed the execution resources (as well as memory bandwidth, latencies etc., which need to be improved the more cores you have).
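To make the sharing idea concrete, here is a toy per-cycle model of two threads competing for the two 128-bit FMACs. It is a sketch under my own assumptions, not AMD's actual scheduler: each thread is assumed to be guaranteed one FMAC when it asks for it, and a thread can borrow the second one only if the other thread leaves it idle. `allocate_fp_bits` is a hypothetical name.

```python
FMAC_WIDTH_BITS = 128  # width of one FMAC, per the AMD blog post
FMAC_COUNT = 2         # FMACs in one Flex FP unit


def allocate_fp_bits(demand_a: int, demand_b: int) -> tuple[int, int]:
    """Return FP bits granted to thread A and thread B in one cycle.

    Assumed policy: each thread first gets up to one FMAC's worth of
    width, then any idle width is lent out, thread A first.
    """
    capacity = FMAC_WIDTH_BITS * FMAC_COUNT
    grant_a = min(demand_a, FMAC_WIDTH_BITS)
    grant_b = min(demand_b, FMAC_WIDTH_BITS)
    spare = capacity - grant_a - grant_b

    extra_a = min(demand_a - grant_a, spare)
    grant_a += extra_a
    spare -= extra_a
    grant_b += min(demand_b - grant_b, spare)
    return grant_a, grant_b


for demand in [(256, 0), (128, 128), (256, 256)]:
    print(demand, "->", allocate_fp_bits(*demand))
```

In this toy model only the case where both threads want 256 bits in the same cycle leaves demand unmet, which is the exception described in the post.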

**-Boris-** · 10-05-2011, 11:52 PM

Originally Posted by informal

This way you have 4 new FPUs, now shared, that produce only 25% less throughput than 8 "full" ones in CMP (without SMT, probably), and all this saves you considerable die area and grants you some TDP and clock headroom. Pretty neat idea, isn't it?

Only neat if you get more than 25% higher frequencies. I doubt that doubling the FPUs would make a big hit on frequencies. The FPUs account for a very small part of total die area, so saving mm² isn't worth it. And I doubt higher power consumption would lower the frequencies much.

**Apokalipse** · 10-05-2011, 11:58 PM

Originally Posted by -Boris-

Only neat if you get more than 25% higher frequencies. I doubt that doubling the FPUs would make a big hit on frequencies.

Doubling the FPUs takes massively more die area, meaning more power usage, and you can't clock it as high if you want to remain within a certain TDP.

The FPUs count for a very small part of total die area

Floating point units are much more complex than integer units, and take up much more die area.
Instruction sets like SSE and AVX primarily use the FPU.

**informal** · 10-06-2011, 01:04 AM

@Apokalipse
The slide about CMT is from 2005, long before AMD had any real HW in their hands. I stick with what they said at FAD 2010, and that's 80% of a CMP design in less die area. The rest of the stuff you quoted is well-known information and doesn't go against what I wrote. I even believe there won't be a massive hit in integer throughput from running 2 threads on a module. FP may see the best numbers if threads are first scheduled on different modules, but this has to be verified.

@Boris
You do realize that in order to get 8 *full* FPUs the old way, you have to replicate front ends, integer exec. units and L1 and L2 caches, right? This leaves you with wasted and doubled die area that will mostly sit idle (especially the FP unit). The beauty of Bulldozer is exactly in maximizing perf./watt/mm^2. Btw, the most power-hungry part of the core is usually the FPU...

**Apokalipse** · 10-06-2011, 02:02 AM

Originally Posted by informal

@Apokalipse
The slide about CMT is from 2005, long before AMD had any real HW in their hands. I stick with what they said at FAD 2010, and that's 80% of a CMP design in less die area. The rest of the stuff you quoted is well-known information and doesn't go against what I wrote. I even believe there won't be a massive hit in integer throughput from running 2 threads on a module. FP may see the best numbers if threads are first scheduled on different modules, but this has to be verified.

My point is that I don't think the FlexFP will have much less performance than two conventional "full" 256-bit FPUs if the frontend can do its job and keep the execution resources fed.
I think the only case where it is limited in execution resources is if there are two 256-bit instructions from two threads at once, but that's a very rare case.

So yes, it won't be as fast as two "full" cores. I'm just saying that I don't think available execution resources are the main reason for this (for either integer or FP). The FlexFP looks very efficient and much less wasteful of transistors/die area than two conventional 256-bit FPUs in two "full" cores.

The frontend is very much beefed up vs K10.5 though; which it has to be to feed the extra execution resources for two threads.

**xdan** · 10-06-2011, 02:44 AM

Originally Posted by informal

@Apokalipse
The slide about CMT is from 2005, long before AMD had any real HW in their hands. I stick with what they said at FAD 2010, and that's 80% of a CMP design in less die area. The rest of the stuff you quoted is well-known information and doesn't go against what I wrote. I even believe there won't be a massive hit in integer throughput from running 2 threads on a module. FP may see the best numbers if threads are first scheduled on different modules, but this has to be verified.

@Boris
You do realize that in order to get 8 *full* FPUs the old way, you have to replicate front ends, integer exec. units and L1 and L2 caches, right? This leaves you with wasted and doubled die area that will mostly sit idle (especially the FP unit). The beauty of Bulldozer is exactly in maximizing perf./watt/mm^2. Btw, the most power-hungry part of the core is usually the FPU...

I am not against the BD CMT design, but without stronger IPC it's just useless.
So with CMT we have 80% of the performance of a true core. But the problem is that it's not a 100% performance core + an 80% CMT core (compared with Intel + HT); it's 80% performance for both cores in the module, so...
If we calculate 0.8 (80%) * 8 = 6.4, that is 6.4 true cores' worth of performance, so a big hit.

This design doesn't scale well the more cores you add.
If we assume the IPC isn't much better (maybe the same, or, to be pessimistic, lower), then what have we got?
6.4 cores with a 10% speed bump, maybe 6.8-7 true cores' worth of performance.
So what does "maximizing perf./watt/mm^2" give us? Not performance, anyway.

I have my info, and BD is a disappointment for an "8-core". As with its price, overall performance is between the 2500K and 2600K, and it will be hotter on air cooling than SB.

**Leeghoofd** · 10-06-2011, 02:51 AM

Originally Posted by xdan

and it will be hotter on air cooling than SB.

You sure? Got data to back that statement?

**dess** · 10-06-2011, 02:53 AM

Originally Posted by informal

Do you guys even read what I wrote? In floating-point-heavy code that employs all 8 threads, Turbo will almost never engage. Turbo will engage across all 8 integer cores though, but Cinebench will use the FlexFP coprocessors most of the time, where TDP will be maxed out. You can read all about BD exec. unit power draw and clock characteristics in the AMD blog posts following the ISSCC event.

You have a point, but I think Cinebench uses mostly scalar math, utilizing only 1/4 or 1/2 of the 128-bit-wide engines (depending on whether it uses single or double precision).

Also, it doesn't use FMA, so the underlying FADD and FMUL units in an FMAC never work at once (or at least only one execution starts per cycle).

0.5 x 0.5 = 0.25 -> 1/4 FPU utilization/thread (with MT)
0.25 x 0.5 = 0.125 -> 1/8 FPU utilization/thread (with MT)

Of course, it's quite theoretical as the sharing of the FPU is not exactly 50% per thread per module all the time, and these are the peak values.
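The two products above can be reproduced with a short sketch. The reading of the factors is my assumption, not something the post spells out: the first factor is taken as the fraction of a 128-bit FMAC that a scalar operand fills (64/128 for double, 32/128 for single), and the 0.5 as the penalty for never issuing a fused multiply-add. `peak_fmac_utilization` is a hypothetical helper.

```python
FMAC_WIDTH_BITS = 128   # width of one FMAC
NO_FMA_FACTOR = 0.5     # assumed: only the add or the multiply half is busy


def peak_fmac_utilization(scalar_bits: int) -> float:
    """Peak fraction of one FMAC used by scalar, non-FMA code."""
    lane_fraction = scalar_bits / FMAC_WIDTH_BITS
    return lane_fraction * NO_FMA_FACTOR


print("double precision:", peak_fmac_utilization(64))
print("single precision:", peak_fmac_utilization(32))
```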

Originally Posted by Apokalipse

It looks like it almost has enough FP resources to get the same performance as two "full" cores, the exception being if two 256-bit instructions were issued at once - though the capability to do that requires much more (largely unused) die area.
So I would think two-thread scaling (in one module) is largely a matter of the shared front-end's capability to feed the execution resources (as well as memory bandwidth, latencies etc., which need to be improved the more cores you have).

Depends on whether FMA is utilized and on whether one or two threads run in a given module, I think. AFAIK the FADD and FMUL units in the K10 cores are capable of working (or starting/finishing) in parallel. With BD, with regular code you can't have the underlying FADD and FMUL units utilized (or a new execution started/finished) at once in a given FMAC, unless you use FMA code. And you have only one FMAC per thread in case both threads need them at once...

So, with single-threaded (or one thread per module) regular code it will perform comparably to K10 (because the second FMAC can be utilized anytime), but if more than 4 threads are running, scaling will be worse.

But, perhaps I'm wrong somewhere. Feel free to correct me, then.
