Deneb Samples are almost out

**~~Bellisimo~~** · 11-05-2008, 09:09 AM

Originally Posted by Lightman

No it's not and Particle is more or less right.
The B3 fix for the TLB erratum was a bypass. It didn't restore the originally planned functionality. The fix brings a minimal performance penalty, because the TLB is flushed every time there can be some dirty data in it. I think Anand did a more in-depth analysis.

If you mean this?

http://www.nordichardware.com/news,7189.html
or this?
http://www.xbitlabs.com/articles/cpu...m-x4-9850.html

Unfortunately, AMD engineers didn’t really explain to us what was done specifically to fix the TLB bug in the new B3 processor stepping. However, some indirect data we have at our disposal gives us reason to believe that now, after the processor core changes the bit flags for page table entries stored in L2 cache, they are all evicted into L3 cache. This may be the reason for the latency to get a little bit higher.

**malice85** · 11-05-2008, 09:13 AM

That's also what I remembered, so I looked it up @ anand:

The hardware fix implemented in B3 Phenoms is that whenever a page table entry is modified, it's evicted out of L2 and placed in L3. There's a very minor performance penalty because of this but nowhere near as bad as the software/BIOS TLB fix mentioned above.

link

**~~Bellisimo~~** · 11-05-2008, 09:37 AM

If this fixes the TLB bug, it is a fix :P
it might not be the best fix around (i don't know really) but it fixes the problem

**Calmatory** · 11-05-2008, 10:38 AM

Any chance that the erratum is now fixed properly? I.e. the source of the problem is corrected so that it no longer generates errors?

**EniGmA1987** · 11-05-2008, 10:47 AM

Pardon my ignorance, but I thought the TLB bug was fixed in the B3 revision of the Phenom? Was that not a hardware fix?

**G0ldBr1ck** · 11-05-2008, 10:49 AM

Originally Posted by EniGmA1987

Pardon my ignorance, but I thought the TLB bug was fixed in the B3 revision of the Phenom? Was that not a hardware fix?

Just read the last couple of pages for that answer, that's what we have been discussing.

**malice85** · 11-05-2008, 10:57 AM

As far as I understand, with B3 the erratum was fixed with a very small to no performance penalty, but it wasn't fixed to work the way it was initially planned to. So for me the question is whether they were able to achieve this with Shanghai/Deneb.

**Lightman** · 11-05-2008, 01:17 PM

Originally Posted by Bellisimo

If you mean this?

http://www.nordichardware.com/news,7189.html
or this?
http://www.xbitlabs.com/articles/cpu...m-x4-9850.html

Yes, I meant that

**Rammsteiner** · 11-05-2008, 01:41 PM

Anyway, my prediction for Deneb vs Intel's products? I think that by the time Lynnfield is out, Deneb will have gone through a few revisions, eventually with high-k and metal gates, making it an actually good competitor for Lynnfield and the skt 775 platform. Performance-wise I do see Deneb being able to beat/keep up with Yorkfield a lot better, at least in most daily apps. Vs Lynnfield, I'm not sure about performance, but AMD will be able to be more competitive on price/performance than it is now with Agena vs Yorkfield or even Kentsfield.

Anyway, time will tell and it's only what I think. It depends on more factors than just 'Deneb owns' or 'Lynnfield has only dual channel anyway'. For example, the price of Lynnfield's platform. I don't think it's going to be as overpriced as the current Bloomfield platform due to less technology etc. But as said, time will tell.

**Cooper** · 11-05-2008, 01:49 PM

Let's not bring the TLB errata back again please. I don't recall anyone experiencing it on B2 chips with the fix disabled. B3 performance is the same as B2 - don't know where you got the BS of B2 being faster w/o the fix applied.

**~~Bellisimo~~** · 11-05-2008, 02:00 PM

Originally Posted by Cooper

Let's not bring the TLB errata back again please. I don't recall anyone experiencing it on B2 chips with the fix disabled. B3 performance is the same as B2 - don't know where you got the BS of B2 being faster w/o the fix applied.

B3 is a little bit faster in some benches, but it's within the margin of error.

xbitlabs just concluded the latency in Everest went up a bit.

**cegras** · 11-05-2008, 02:20 PM

Originally Posted by Cooper

Let's not bring the TLB errata back again please. I don't recall anyone experiencing it on B2 chips with the fix disabled. B3 performance is the same as B2 - don't know where you got the BS of B2 being faster w/o the fix applied.

Quoted for some massive truth.

**chew\*** · 11-05-2008, 03:23 PM

Originally Posted by Cooper

Let's not bring the TLB errata back again please. I don't recall anyone experiencing it on B2 chips with the fix disabled. B3 performance is the same as B2 - don't know where you got the BS of B2 being faster w/o the fix applied.

Quite the contrary, Cooper. I only brought it up to support the fact that the 25% increase in performance might be feasible IF they were able to eradicate the TLB errata on a hardware level. IF that is the case, IPC gains could easily be 15% and a hardware-level repair (a better fix than the current one) could easily add up to another 10% gain, totaling performance gains up to 25% in some apps.

**Helmore** · 11-05-2008, 03:33 PM

Originally Posted by chew*

Quite the contrary, Cooper. I only brought it up to support the fact that the 25% increase in performance might be feasible IF they were able to eradicate the TLB errata on a hardware level. IF that is the case, IPC gains could easily be 15% and a hardware-level repair (a better fix than the current one) could easily add up to another 10% gain, totaling performance gains up to 25% in some apps.

I'd like to get the same drug as you're using, where can I get it? *just kidding*

I think you are exaggerating a lot. Although I'd like you to be right with your 25% IPC increase, I just don't see it happening from AMD anytime soon.

**chew\*** · 11-05-2008, 03:56 PM

Originally Posted by Helmore

I'd like to get the same drug as you're using, where can I get it? *just kidding*

I think you are exaggerating a lot. Although I'd like you to be right with your 25% IPC increase, I just don't see it happening from AMD anytime soon.

It's more wishful thinking, but I did say "totaling performance gains up to 25% in some apps"

**Caveman787** · 11-05-2008, 03:59 PM

They did say 15-20% ipc increase and 20% increase from clocks. So ~35% overall.
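For what it's worth, IPC and clock gains compound rather than add, so the quoted figures land a bit above 35%. A quick sketch, assuming the two gains are independent and performance scales with their product (the function name is made up for illustration):

```python
def combined_gain(ipc_gain: float, clock_gain: float) -> float:
    """Overall speedup fraction when an IPC gain and a clock gain compound.

    Assumption: performance scales with IPC * clock, so gains multiply.
    """
    return (1 + ipc_gain) * (1 + clock_gain) - 1


# Figures quoted in the post: 15-20% IPC, 20% from clocks.
low = combined_gain(0.15, 0.20)
high = combined_gain(0.20, 0.20)
print(f"{low:.0%} to {high:.0%} overall")
```

Simply adding the percentages gives the ~35% low end; multiplying gives a slightly higher range.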

**Epsilon84** · 11-05-2008, 06:51 PM

Originally Posted by Caveman787

They did say 15-20% ipc increase and 20% increase from clocks. So ~35% overall.

Let me just say this: 15 - 20% IPC gains from a die shrink would be absolutely incredible and unheard of, but many here seem to think it's a realistic figure? 15 - 20% improvements are more akin to an architectural overhaul, for example K8 -> K10 gained about that much.

Just as a point of comparison, Penryn gained on average 5 - 6% per clock through a larger cache and several minor architectural improvements. AMD would have to work miracles to get 15 - 20% gains from Deneb, frankly I very much doubt it but I would be more than happy to eat humble pie if proven wrong.

**stangracin2** · 11-05-2008, 07:06 PM

Originally Posted by Epsilon84

Let me just say this: 15 - 20% IPC gains from a die shrink would be absolutely incredible and unheard of, but many here seem to think its a realistic figure? 15 - 20% improvements are more akin to an architectural overhaul, for example K8 -> K10 gained about that much.

Just as a point of comparison, Penryn gained on average 5 - 6% per clock through a larger cache and several minor architectural improvements. AMD would have to work miracles to get 15 - 20% gains from Deneb, frankly I very much doubt it but I would be more than happy to eat humble pie if proven wrong.

Isn't Deneb bringing HT 3.1 and 3x as much L3 cache?

**demonkevy666** · 11-05-2008, 07:11 PM

Originally Posted by Epsilon84

Let me just say this: 15 - 20% IPC gains from a die shrink would be absolutely incredible and unheard of, but many here seem to think its a realistic figure? 15 - 20% improvements are more akin to an architectural overhaul, for example K8 -> K10 gained about that much.

Just as a point of comparison, Penryn gained on average 5 - 6% per clock through a larger cache and several minor architectural improvements. AMD would have to work miracles to get 15 - 20% gains from Deneb, frankly I very much doubt it but I would be more than happy to eat humble pie if proven wrong.

He added higher clocks to that.

Phenom is 15% slower than Kentsfield, with 4 MB of cache vs 8 MB.

50% less cache.

We're getting 50% more cache: 512 KB x4 plus 6144 KB of L3 cache.

8 MB in total.
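The cache totals above are easy to check. A small sketch using only the figures from this post (the helper name is hypothetical); note that going from 4 MB to 8 MB is a doubling, while "50%" describes the 4 MB part's shortfall relative to 8 MB:

```python
KB_PER_MB = 1024


def total_cache_mb(l2_kb_per_core: int, cores: int, l3_kb: int) -> float:
    """Total L2 + L3 capacity in MB."""
    return (l2_kb_per_core * cores + l3_kb) / KB_PER_MB


deneb_mb = total_cache_mb(512, 4, 6144)  # per-core L2 and L3 sizes from the post
phenom_mb = 4.0                          # total quoted in the post

growth = deneb_mb / phenom_mb - 1        # increase relative to the 4 MB part
deficit = 1 - phenom_mb / deneb_mb       # shortfall of 4 MB relative to 8 MB
print(f"{deneb_mb:.0f} MB total, {growth:.0%} more, {deficit:.0%} less the other way")
```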

**Epsilon84** · 11-05-2008, 07:31 PM

Originally Posted by stangracin2

isn't deneb bringing HT 3.1 and 3x as much L3 cache

Faster HT won't bring anything for desktop performance. A larger L3 cache alone won't account for a 15 - 20% IPC gain except in very cache bound apps. I guess L3 speeds may go up as well which helps, but I still think these figures are very optimistic.

**mAJORD** · 11-05-2008, 07:32 PM

I agree, Epsilon,

I guess the factors that separate the two are:

A. People are of the belief that K10 has never been 'all there', and I tend to agree in some areas. Memory performance is lacking and performance is a little below expectations, a lot below in some cases where one would expect much more. It's the areas where K10 barely outperforms K8 that make it lose to Core on average.

Did the rush to get K10 out there leave some aspects of the design not functioning as they should?

B. AMD need the performance boost.

Factors working against any large IPC gains are:

A. Just looking at the die shots, anything visible on a core level is identical. Any uarch enhancements there would have to be minor. That stands to reason anyway: no one does a meaningful overhaul of an arch when changing process node. Old news.

B. benchmarks we've seen so far show 7-15% max.

Personally, a realistic guess would be 10% across the board, but next to nothing in some areas, and 15%+ in rare cases, like what we saw with POV-Ray, a benchmark that was a sore spot and will still lag behind Core.

It's important they squeezed what they could out of it for Deneb, especially at 3 GHz plus. Let's not forget a 5% IPC boost is the equivalent of a whole 200 MHz speed bin at these frequencies, and let's face it, no one's going over the high 3s (GHz) with these sorts of architectures any time soon.
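To put a rough number on the IPC-versus-speed-bin comparison, here is a sketch that assumes performance scales linearly with both IPC and clock, which real workloads only approximate (the function name is made up):

```python
def equivalent_clock_gain_mhz(base_mhz: float, ipc_gain: float) -> float:
    """Clock increase that would match a given IPC gain at a base clock.

    Assumption: performance is proportional to IPC * clock.
    """
    return base_mhz * ipc_gain


for base_mhz in (3000, 3500, 4000):
    extra = equivalent_clock_gain_mhz(base_mhz, 0.05)
    print(f"{base_mhz} MHz: 5% IPC is worth about {extra:.0f} MHz")
```

Under that assumption the 200 MHz figure is reached at 4 GHz; at 3 GHz a 5% IPC boost is worth closer to 150 MHz.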

Not that it really matters long term: Hyperthreading and any other means of increasing multi-threaded performance are all either CPU company will care about from now on. Let's face it, if Deneb was 15% slower clock/clock than i7 at single-threaded (as it most likely will be) but had 4 extra cores, it would still be the winner.

Deneb's lack of SMT technology is now more of an issue than lack of IPC. Now if only they had a large enough die size and power advantage to sneak on a couple of extra physical cores

**JumpingJack** · 11-05-2008, 07:50 PM

Originally Posted by stangracin2

isn't deneb bringing HT 3.1 and 3x as much L3 cache

As he said above ... HT communicates to the chipset NB/SB arrangement and is not the bottleneck in single socket implementations. Raising the HT speed will do nothing for observed IPC. Ironically, I just finished a FPS skew on Lost Planet with a Phenom (HT3.0) and FPS doesn't begin to drop off until about 600-800 MHz (down from 2000 MHz)... I can post that data if you like.

L3 cache will certainly help, but the rule of thumb is for every doubling of the cache expect a factor of sqrt(2) improvement in cache miss rate ... mAJORD mentioned above that the memory performance was poor, that is sorta a relative statement ... it's still very good, just not quite hitting expectations. AMD's IMC approach decreases penalties for cache misses, so it is even more likely that making L3 3x larger will have less of a general impact.
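The sqrt(2)-per-doubling rule of thumb can be extended to a 3x larger cache. A minimal sketch, assuming the rule generalizes to miss rate proportional to 1/sqrt(size), which is only an approximation and varies by workload (the function name is hypothetical):

```python
import math


def relative_miss_rate(size_ratio: float) -> float:
    """Miss rate after growing the cache by size_ratio, relative to before.

    Assumption: square-root rule of thumb, miss rate ~ 1 / sqrt(size).
    """
    return 1 / math.sqrt(size_ratio)


doubled = relative_miss_rate(2)   # one doubling
tripled = relative_miss_rate(3)   # the 3x L3 discussed above
print(f"2x cache: {doubled:.2f} of the original miss rate")
print(f"3x cache: {tripled:.2f} of the original miss rate")
```

Fewer misses only help in proportion to what each miss costs, which is why a low-latency IMC blunts the benefit.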

It would be nice though if AMD would disclose some other of their tweaks (as they most certainly made them) ... one area is in the L3 latency, they use an asynchronous link between different clock domains by implementing a FIFO buffer between L3 and the cores to absorb the clock skew ... this adds latency, and looking at the overall results on Phenom it was a pretty significant hit ... my guess is they really improved this part of the cache structure, which will be a big help.

I agree with mAJORD -- for desktop applications, ~10% IPC improvement is likely, with a few app-specific cases at 15% ...

When AMD quoted 20% over Barcelona, you need to be careful to take that in the right context, they are comparing at the server level, with server related benchmarks. Today's Barcelona Opterons are still on HT 2.0 I believe, going to HT 3.0, unlike desktop, will improve 2P server performance significantly ... as good as Barcelona is today on throughput, a faster socket to socket link will be even better. That 20% is not likely to translate into desktop.

That doesn't sound good, especially if you are a devoted AMD customer -- but 10% is a good, healthy gain IPC wise for just a shrink -- (btw, the leaked Deneb benchmarks are pointing to this 10% number that mAJORD mentioned above) ... with this, and a healthy 45 nm to get to 3.0 GHz clock... AMD will have a nice competitive CPU ... I plan on getting one when they launch, so they have already sold one

Jack

**Macadamia** · 11-05-2008, 09:51 PM

Hmm JJ, how does Nehalem achieve that then? No FIFOs, yet the "Core" and "Uncore" run at different speeds (except the 965 extreme)? Or is it related to per-core clocking instead?

Deneb won't be too much of a different experience on desktop for multimedia and rendering (POV has proved me wrong though), but in games we could see a sizable gain if they prefetch in a more aggressive manner.

**JumpingJack** · 11-05-2008, 09:59 PM

Originally Posted by Macadamia

Hmm JJ, how does Nehalem achieve that then? No FIFOs, yet the "Core" and "Uncore" run at different speeds (except the 965 extreme)? Or is it related to per-core clocking instead?

Deneb won't be too much of a different experience on desktop for multimedia and rendering (POV has proved me wrong though), but in games we could see a sizable gain if they prefetch in a more aggressive manner.

This was a topic of discussion a few months back as I recall, so a google to 'Nehalem Synchronous' yielded some hits:

http://techreport.com/discussions.x/14950

The processor runs all of its internal components—the CPU cores, memory controller, and I/O—in a decoupled fashion, so one can tune their respective frequencies and voltages independently. This isn't a new idea, Kumar stressed, but Intel's implementation is new in that it uses a synchronous interface between those components. Most past implementations have asynchronous interfaces, he claimed, which result in both higher latency and indeterminism—"if you test five different systems, you will get five different results." Because of the synchronous approach, Nehalem's memory-to-cache latency is allegedly "drastically smaller" than that of the competition.

How the heck they did it, I don't know -- the science of process technology, I can read and understand, architectural details I have been able to accumulate a great deal of understanding (much with Kanter's help and reading a lot of Hennessy) and I am always eager to learn more, but circuit level implementations -- frankly, I am clueless -- I can sketch out a 6T transistor SRAM cell or some simple 4T inverters or a NOR gate circuit, but ask me to string it together or throw in a PLL or a power gate -- all you will get is a dumb look

In terms of gaming -- I am not so sure, I don't think the cache/prefetching is a huge deal here (this is my opinion, and I could be wrong, there is no way to quantitatively ascertain anything) ... by its nature, gaming algorithms are 'branchy' for lack of a better word, by this I mean -- the flow of the code has dependencies that simply require code paths to branch, for example ... shoot a gun -- does it hit a dude (yes / no) -branch, does the dude die (yes/no) branch - do you issue the animation to lop off his head (yes/no) branch ... if AMD spent some time improving the branch prediction (direct or indirect, doesn't matter) then there will be a very nice gaming improvement, I think the cache will not be as important.
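To illustrate why 'branchy', data-dependent code is hard on a predictor, here is a toy model: a single 2-bit saturating counter, the textbook scheme. This is a sketch only, not how any AMD or Intel predictor actually works, and the names and outcome patterns are made up for illustration:

```python
import random


def mispredict_rate(outcomes: list[bool]) -> float:
    """Fraction of branches mispredicted by one 2-bit saturating counter."""
    state = 2  # 0-1 predict not taken, 2-3 predict taken; start weakly taken
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses / len(outcomes)


rng = random.Random(0)  # fixed seed so the run is repeatable
loop_like = [i % 8 != 7 for i in range(8000)]            # taken 7 times out of 8
game_like = [rng.random() < 0.5 for _ in range(8000)]    # hit/miss style coin flip

print(f"loop-like branch: {mispredict_rate(loop_like):.1%} mispredicted")
print(f"game-like branch: {mispredict_rate(game_like):.1%} mispredicted")
```

The regular loop-style branch is mispredicted once per pass, while the coin-flip branch is wrong about half the time no matter how the counter is tuned, and each miss costs a pipeline flush.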

Jack

**Macadamia** · 11-05-2008, 10:30 PM

I know previous gaming code was branchy, but I don't think the current trend is emphasized there any more.
Xenon and Cell for the consoles aren't too apt at branching, I last remember, especially with buffed up SIMD units. There will always be branchy code, but does it still comprise the majority of the engine?

AMD does need serious work on their predictors though - for general performance more than anything. Despite the improvements Intel has a really decisive lead here ever since Conroe.
