Page 6 of 10 FirstFirst ... 3456789 ... LastLast
Results 126 to 150 of 238

Thread: Deneb Samples are almost out

  1. #126
    Xtreme Mentor
    Join Date
    Mar 2006
    Posts
    2,978
    Quote Originally Posted by Macadamia View Post
    I know previous gaming code was branchy, but I don't think the current trend is emphasized there any more.
    Xenon and Cell for the consoles aren't too apt at branching, I last remember, especially with buffed up SIMD units. There will always be branchy code, but does it still comprise the majority of the engine?

    AMD does need serious work on their predictors though - for general performance more than anything. Despite the improvements Intel has a really decisive lead here ever since Conroe.
    Yeah I completely agree....

    A few comments ...

    It's really kind of intuitive when you think about it -- a high-level gedanken experiment helps explain why game code runs more branches than, say, compressing a file or encoding a movie clip. It is unavoidable; there is no way to program a game engine without it. The player's input actions are unpredictable, and the resulting cause and effect will always require testable conditions -- the crux of any gaming algorithm is ultimately nondeterministic. For the CPU duties of the game engine, yes, it is still important -- the CPU is responsible for receiving player input, tracking AI, culling, etc.
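    The point about testable conditions can be sketched with a toy per-frame update loop (hypothetical names, nothing from a real engine): which branches fire depends entirely on player input and entity state that only exist at runtime, so they can't be hoisted out of the hot loop.

```cpp
#include <cassert>
#include <vector>

// Toy entity update: every frame, data-dependent conditions decide what runs.
// The player's input and each entity's state are unknowable at compile time.
struct Entity {
    int hp;
    bool hostile;
    float dist_to_player;
};

int updateFrame(std::vector<Entity>& world, bool player_fired) {
    int actions = 0;
    for (Entity& e : world) {
        if (e.hp <= 0) continue;                      // dead entities are skipped
        if (e.hostile && e.dist_to_player < 10.0f) {
            ++actions;                                // AI decides to attack
        }
        if (player_fired && e.dist_to_player < 2.0f) {
            e.hp -= 5;                                // hit resolution
            ++actions;
        }
    }
    return actions;
}
```

    Every one of those `if`s is a conditional branch whose outcome the predictor has to guess each frame.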

    Kanter sent me a Word copy of this article a few months ago for review, and it is one of the only technical write-ups that really helps rationalize why C2D did such a good job with gaming code when it launched compared to K8 (it was the most dramatic feature of Conroe and really lit up the forums/net, of course): http://www.realworldtech.com/page.cf...2808015436&p=5

    Ironically -- at least in my opinion -- Intel's branch prediction capability, as seen in C2D, was probably the only good thing that ultimately came out of Netburst. The penalty for a mispredicted branch can be as bad as or worse than an L2 cache miss: you need to flush the pipeline, fetch the new code/data into the front end, reorder again, and repopulate the pipeline. I've seen numbers between 30-150 cycles wasted just to correct a mispredicted branch. To avoid this, I would not doubt that Intel's architects went all out to find any and all possible ways to improve branch prediction accuracy; even then, that 31-stage pipeline dragged it all back down, as two or three mispredictions in a thousand would cripple a Prescott.
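    To put rough numbers on that claim (a back-of-the-envelope sketch -- the penalty figures are illustrative assumptions, not measurements of any real chip), the extra cycles per instruction lost to mispredictions can be estimated like this:

```cpp
#include <cassert>

// Rough CPI penalty from branch mispredictions:
// extra cycles per instruction =
//   (mispredicts per 1000 instructions / 1000) * pipeline flush cost
double mispredictCpiPenalty(double mispredicts_per_1k, int flush_cycles) {
    return mispredicts_per_1k / 1000.0 * flush_cycles;
}
```

    With 3 mispredicts per 1000 instructions and a flush cost on the order of a 31-stage pipeline, that's roughly 0.09 extra CPI -- a big hit for a core trying to retire multiple instructions per cycle, which is why the long pipeline hurt so much.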

    The branch predictor logic most likely carried over, in some fashion and to some degree, into C2D -- probably the only feature of Netburst to make it into C2D... who knows, but it makes sense.

    So yeah, AMD's branch prediction is weaker than Intel's at the moment, though K8 could kick butt against Netburst gaming-wise -- because even with great branch prediction, that uber-long pipeline just stunk whenever there was a stall. Shortening the pipeline with C2D, widening it, and coupling that with strong branch predictors = a great scenario for gaming.

    EDIT: Here's a good one ... look at figure 4: http://www.research.ibm.com/people/m...s/2004_msp.pdf this is a neat paper, it also shows the L1/L2 misses for FPS games vs other applications ... so you can see why we get this generalized statement "games love L2 cache", true they do ... but they also love good branch predictors -- both of which C2D/Q's have lots of ...

    Jack
    Last edited by JumpingJack; 11-05-2008 at 11:09 PM.
    One hundred years from now It won't matter
    What kind of car I drove What kind of house I lived in
    How much money I had in the bank Nor what my clothes looked like.... But The world may be a little better Because, I was important In the life of a child.
    -- from "Within My Power" by Forest Witcraft

  2. #127
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    2,128
    Quote Originally Posted by Epsilon84 View Post
    Let me just say this: 15 - 20% IPC gains from a die shrink would be absolutely incredible and unheard of, but many here seem to think it's a realistic figure? 15 - 20% improvements are more akin to an architectural overhaul; for example, K8 -> K10 gained about that much.

    Just as a point of comparison, Penryn gained on average 5 - 6% per clock through a larger cache and several minor architectural improvements. AMD would have to work miracles to get 15 - 20% gains from Deneb, frankly I very much doubt it but I would be more than happy to eat humble pie if proven wrong.
    Depends on the architecture and the bottlenecks. E.g. if AMD is able to decrease cache latencies with a few minor fixes and then add more cache, this could, in some situations, yield a lot higher IPC improvement than anything Intel could do with their architecture, which is already giving all it has without major bottlenecks. Though this is just an example case -- I am not saying that AMD could "fix the latencies" with a few "minor fixes", or that lower latencies with possibly bigger caches would yield this or that much improvement for Deneb over Agena.

    Still, I doubt that we will see more than 10% improvement in any case, and the average improvement will fall to less than that. I'm more interested in the lower power consumption, better OC and lower price.

    What kind of cache associativity does K10 use? ... found out: L1 is 2-way associative, L2 is 16-way associative, L3 is 32-way associative.

    Edit: A bit offtopic, but while we are at it... As higher cache associativity increases the heat output of each cache access, and a bigger cache generally consumes more energy, wouldn't this mean that the L3 cache (big, 6 MB, 48-way?) would consume quite a lot of energy compared to e.g. the L1/L2 caches? Considering that with K7 and K8 the L2 stress greatly influenced heat generation (the S&M L2 cache test brought CPU coolers to their knees on K7, and caused higher-than-normal stress -- versus gaming or Orthos -- on K8), a 6 MB 48-way L3 would fry an egg compared to K7/K8. Though going 45nm will help greatly. Or am I missing something?
    Last edited by Calmatory; 11-06-2008 at 12:57 AM.

  3. #128
    Xtreme Enthusiast
    Join Date
    May 2008
    Posts
    612
    Quote Originally Posted by JumpingJack View Post
    It's really kind of intuitive when you think about it -- a high-level gedanken experiment helps explain why game code runs more branches than, say, compressing a file or encoding a movie clip. It is unavoidable; there is no way to program a game engine without it. The player's input actions are unpredictable, and the resulting cause and effect will always require testable conditions -- the crux of any gaming algorithm is ultimately nondeterministic. For the CPU duties of the game engine, yes, it is still important -- the CPU is responsible for receiving player input, tracking AI, culling, etc.
    Bull
    There are a lot of things a programmer can do to avoid branches, and code for different applications is very similar. What might differ is that a game developer tries to optimize the code more than other types of applications, because of the need for speed.
    And there is no way to program any application without a lot of branches. Most branches in applications are for checking errors.
    Example:
    Parsing data needs to check the data before taking action, because you never know if the data is corrupt.

  4. #129
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    2,128
    Quote Originally Posted by gosh View Post
    Bull
    There are a lot of things a programmer can do to avoid branches, and code for different applications is very similar. What might differ is that a game developer tries to optimize the code more than other types of applications, because of the need for speed.
    And there is no way to program any application without a lot of branches. Most branches in applications are for checking errors.
    Parsing data needs to check the data before taking action, because you never know if the data is corrupt.
    Errm, by doing simple tricks it is possible to reduce the number of branches, or to simplify them so the predictor is more likely to predict them right.

    Parsing data can be done in many ways. The order of instructions in a loop can affect the branch prediction success rate and/or the cache hit/miss rate, which then directly contributes to the performance of the given code. All of this is architecture-dependent: e.g. Core2 predicts branches a lot better than K8/K10, and K8/K10 misses cache a lot more often than Core2, which exaggerates the differences between architectures. The data does not necessarily have to be checked every time when it is split into chunks and the chunks are parsed. Chunk size influences cache hit/miss rates, the parsing algorithm has its own influence on the flow of data and instructions through the pipeline, and instruction order influences the branch predictor -- all of this can vary between programs, unless the machine code is exactly the same. Besides, compiling with a different compiler version can generate differences that easily affect performance, e.g. via branch prediction or data hits/misses. And there is more to the topic than branches and caches.
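    One classic example of the kind of trick being described (a generic bit-twiddling idiom, not taken from any particular codebase): selecting the larger of two ints with pure arithmetic, so there is no conditional branch for the predictor to miss on.

```cpp
#include <cassert>

// Branchy version: the predictor must guess the comparison every call.
int maxBranchy(int a, int b) { return (a > b) ? a : b; }

// Branchless version: a sign mask selects a or b arithmetically.
// Assumes a - b does not overflow, and that >> on a negative int is an
// arithmetic shift (true on mainstream compilers, but implementation-defined).
int maxBranchless(int a, int b) {
    int diff = a - b;
    int mask = diff >> 31;        // 0 if a >= b, -1 (all ones) if a < b
    return a - (diff & mask);     // a when a >= b, else a - (a - b) = b
}
```

    Whether this actually wins depends on the data: if the branch is predictable, the branchy version is usually just as fast or faster, which is part of why these comparisons are so architecture-dependent.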

    Offtopic? You choose.

  5. #130
    Xtreme Enthusiast
    Join Date
    May 2008
    Posts
    612
    Quote Originally Posted by Calmatory View Post
    Core2 predicts branches a lot better than K8/K10, and K8/K10 misses cache a lot more often than Core2
    K10 ??
    Do you have any information about how K10 predicts?

  6. #131
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    2,128
    Quote Originally Posted by gosh View Post
    K10 ??
    Do you have any information about how K10 predicts?
    http://www.realworldtech.com/page.cf...2808015436&p=9

    IIRC there were no major improvements to the K10 branch predictor. Correct me if I am wrong. Even if there were, could the improvements halve the mispredictions to match Core2?

  7. #132
    Xtreme Enthusiast
    Join Date
    May 2008
    Posts
    612
    Quote Originally Posted by Calmatory View Post
    http://www.realworldtech.com/page.cf...2808015436&p=9

    IIRC there were no major improvements to the K10 branch predictor. Correct me if I am wrong. Even if there were, could the improvements halve the mispredictions to match Core2?
    I don't know. I have tried to look for information, but the only thing I have found is that AMD hasn't released details about this; all I have read is that it is improved. I have seen people claiming that K10 is bad in this area, but none have shown sources for it.

    As you wrote in your previous message, there are a LOT of ways to do coding; if you are in a hurry, you don't have time to optimize. That type of code probably benefits from a good predictor. But if you optimize, it is possible to decrease the conditions a lot.
    Last edited by gosh; 11-06-2008 at 05:08 AM.

  8. #133
    Xtreme Addict
    Join Date
    Jan 2007
    Location
    Brisbane, Australia
    Posts
    1,264

  9. #134
    Xtreme Addict
    Join Date
    Nov 2004
    Posts
    1,363
    Quote Originally Posted by 003 View Post
    No. There is no "if". Just no.

    Deneb is just a die shrink of Agena, and at best will offer 5-10% performance improvement clock for clock. It's like the R600 to RV670... just a die shrink, with maybe some minor instruction set optimizations as well. How do we know this is the case?

    AMD has said nothing about architectural improvements, because there are none. It's the same architecture, just a die shrink, with some very general talk of "25% performance boost". My arse.

    It was the same with the transition from K8 to K10. AMD was silent on anything relating to architectural improvement, and instead talked about HT 3.0 and DDR2 and other things that don't really improve performance. And same with R600 to RV670.

    But you notice that with RV670 to RV770, rumors of its monster specs were flying out of control, and they proved to be true.

    The only saving grace Deneb may have is ability to easily overclock to the range of 3.6GHz, but even now most Q6600s will go over that mark. And don't even try and compare Deneb to Nehalem ... it will be absolutely no contest. Nehalem is simply two or three leagues above Deneb in terms of performance, and it will rape (or is raping, I should say) in orifices that do not exist naturally.
    See, this is where you have your information skewed. First, Agena has a whole host of issues, the first of which to be corrected is its ability to scale, in addition to lowering power draw.

    I fully expect the better 50% of chips to go over 4GHz

    Second, the issue with cache latency will be resolved, so we should see a significant increase in performance per clock. On top of that, Deneb will have 50% more cache than it has now, so the 15-25% it needs to match Nehalem/Yorkfield in games isn't too much of a stretch



    Quote Originally Posted by 003 View Post
    I'm sorry but I can't just sit around with my thumb in my ass while there is an entire thread of people who seem to think that deneb is going to match yorkfield or even kentsfield clock for clock. The difference between yorkfield and kentsfield is only 10% at best anyway, so people saying deneb will fall right between them are even more out of their mind.
    Sad... but we'll see soon enough
    NZXT Tempest | Corsair 1000W
    Creative X-FI Titanium Fatal1ty Pro
    Intel i7 2500K Corsair H100
    PNY GTX 470 SLi (700 / 1400 / 1731 / 950mv)
    Asus P8Z68-V Pro
    Kingston HyperX PC3-10700 (4x4096MB)(9-9-9-28 @ 1600mhz @ 1.5v)

    Heatware: 13-0-0

  10. #135
    Xtreme Mentor
    Join Date
    Mar 2006
    Posts
    2,978
    Quote Originally Posted by gosh View Post
    Bull
    There are a lot of things a programmer can do to avoid branches, and code for different applications is very similar. What might differ is that a game developer tries to optimize the code more than other types of applications, because of the need for speed.
    And there is no way to program any application without a lot of branches. Most branches in applications are for checking errors.
    Example:
    Parsing data needs to check the data before taking action, because you never know if the data is corrupt.
    I have already linked, above, an IBM paper showing branch prediction rates for 3 apps -- read it. Just think of it from an algorithmic standpoint: the nondeterministic nature of a game, the randomness it generates, requires conditional branches.

    There is no bull, you just don't know anything about what you spout.

  11. #136
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,730
    Quote Originally Posted by Calmatory View Post
    Depends on the architecture and the bottlenecks. E.g. if AMD is able to decrease cache latencies with a few minor fixes and then add more cache, this could, in some situations, yield a lot higher IPC improvement than anything Intel could do with their architecture, which is already giving all it has without major bottlenecks. Though this is just an example case -- I am not saying that AMD could "fix the latencies" with a few "minor fixes", or that lower latencies with possibly bigger caches would yield this or that much improvement for Deneb over Agena.
    ...
    Caches, especially the L1, are so tightly integrated into the core that playing with their latency means redoing basically everything.
    The L3 can avoid this by using an asynchronous interface; however, its performance is only a fraction of the L2's, not to mention the L1's.
    Quote Originally Posted by Heinz Guderian View Post
    There are no desperate situations, there are only desperate people.

  12. #137
    Xtreme Mentor
    Join Date
    May 2008
    Location
    cleveland ohio
    Posts
    2,879
    Quote Originally Posted by savantu View Post
    Caches, especially the L1 are so tightly integrated into the core that playing with their latency means redoing basically everything.
    The L3 can avoid this by using a async interface.However , it's performance is only a fraction of L2 , not to mention L1.
    this is why it's large though, isn't it?

    anyways, Barcelona's NB speed is slow compared to Agena's: 1400MHz vs 2000MHz

    I haven't ever seen a comparison of Agena vs Barcelona (wish someone had done that).

    Calmatory, you did miss something: they took out a resistor on the L3 cache, and they might have done the same to L2 and possibly L1 too, for higher clocks and less heat and fewer watts.
    HAVE NO FEAR!
    "AMD fallen angel"
    Quote Originally Posted by Gamekiller View Post
    You didn't get the memo? 1 hour 'Fugger time' is equal to 12 hours of regular time.

  13. #138
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    2,128
    Why would they have resistors there in the first place? I am not getting a CPU just to heat my room... Actually, after thinking more about the L3, its 48-way(?) associativity and heat, I think it is not going to be any kind of problem, as L1 and L2 will (hopefully) already have the data, and L3 would be the last resort before DRAM.

  14. #139
    Xtreme Mentor
    Join Date
    May 2008
    Location
    cleveland ohio
    Posts
    2,879
    Quote Originally Posted by Calmatory View Post
    Why would they have resistors there in the first place? I am not getting a CPU just to heat my room... Actually after I think more about the L3. 48-way(?) association and heat, I think that it is not going to any kind of a problem as L1 and L2 will (hopefully) have the data already and L3 would be the last resort before DRAM.
    don't know.

    I think that's why the L3 cache was set up to run at a different speed, but still faster than memory.

  15. #140
    Xtreme Enthusiast
    Join Date
    May 2008
    Posts
    612
    Quote Originally Posted by JumpingJack View Post
    I have already linked for you above showing branch prediction rate for 3 apps, by IBM, read it. Just think of it from an algorithm approach, the indeterministic nature of a game, the randomness it generates requires conditional branches.

    There is no bull, you just don't know anything about what you spout.
    Just look at some code, then you'll see what's going on. I have looked at game code, and what I have seen is the opposite: very few branches compared to other code. I don't know if that code is representative, though.

  16. #141
    I am Xtreme
    Join Date
    Jul 2007
    Location
    Austria
    Posts
    5,485
    Quote Originally Posted by gosh View Post
    Just look at some code, then you'll see what's going on. I have looked at game code, and what I have seen is the opposite: very few branches compared to other code. I don't know if that code is representative, though.
    ^
    fiction

    fact:

  17. #142
    Xtreme Enthusiast
    Join Date
    May 2008
    Posts
    612
    Quote Originally Posted by Hornet331 View Post
    ^
    fiction

    fact:
    Well, if you compare against highly optimized compression code, then you will probably notice that all other types of code use a lot more branches. They put loads of work into optimizing the algorithm or algorithms they use.

  18. #143
    Xtreme Member
    Join Date
    May 2007
    Posts
    206
    Quote Originally Posted by leoy View Post
    Nov 8?
    Probably false....
    Although it could be great.
    That's what I was thinking. Saturday?!?
    I thought the CPU industry was like the movie/video game industry in that they have their special launch day.

    i.e. video games are Tuesday and movies are Thursday

  19. #144
    Xtreme Member
    Join Date
    Mar 2007
    Posts
    170
    I have no idea where these November/December dates come from, but AFAIK nothing has changed and it'll be released in January.

  20. #145
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    2,128
    Quote Originally Posted by gosh View Post
    Just look at some code, then you'll see what's going on. I have looked at game code, and what I have seen is the opposite: very few branches compared to other code. I don't know if that code is representative, though.
    There are more unpredictable branches in a game than in a static program working on dynamic input (e.g. encoding).

  21. #146
    Xtreme Enthusiast
    Join Date
    May 2008
    Posts
    612
    Quote Originally Posted by Calmatory View Post
    More unpredictable branches in a game than in a static program with dynamic file(e.g. encoding).
    why?

    Optimizing one area of about 1,000 lines of code is simpler than optimizing 50,000; also, in those 1,000 lines you don't need to think about readability as much as in a larger codebase.
    Do you know which characters you need to check for when writing an XML parser, and how often the ones that change the state are found?
    Last edited by gosh; 11-06-2008 at 09:41 AM.
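    The XML-parser point can be made concrete with a toy state machine (a hypothetical sketch, simplified far below any real parser): the inner loop branches on a handful of state-changing characters, and how often those appear in the input governs how predictable the branches are.

```cpp
#include <cassert>
#include <string>

// Toy tag counter: the loop branches on the rare state-changing characters
// '<' and '>'. In text-heavy documents these branches almost always fall
// through, so a predictor learns them easily; tag-dense input is harder.
int countTags(const std::string& xml) {
    enum State { TEXT, TAG } state = TEXT;
    int tags = 0;
    for (char c : xml) {
        switch (state) {
            case TEXT:
                if (c == '<') state = TAG;             // rare in prose
                break;
            case TAG:
                if (c == '>') { state = TEXT; ++tags; } // end of a tag
                break;
        }
    }
    return tags;
}
```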

  22. #147
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    2,128
    Quote Originally Posted by gosh View Post
    why?

    Optimizing one area of about 1,000 lines of code is simpler than optimizing 50,000; also, in those 1,000 lines you don't need to think about readability as much as in a larger codebase.
    Do you know which characters you need to check for when writing an XML parser, and how often the ones that change the state are found?
    Because no two circumstances are alike. A lot of code flows through the CPU, so the predictor can't always keep the history of its predictions around for when the same event happens again. And as you said, parsers/archivers/en(or de)coders are highly optimized, so they are probably made as easy on the branch predictor as possible, to keep the misprediction rate as low as possible. Hence the higher prediction rate for those apps than for games.

    This kind of comparison can't really be proven either way, as there are thousands of games, thousands of apps, and trillions of situations where the results are the opposite of what the next-door neighbour is seeing. It depends on the uarch, the way it is programmed, what the app/game is doing, what it has done, how big the penalty for a miss is, etc.

    Just like telling that my dad is better than yours, can't really be said as a fact nor proven.

  23. #148
    Xtreme Enthusiast
    Join Date
    May 2008
    Posts
    612
    Quote Originally Posted by Calmatory View Post
    Just like telling that my dad is better than yours, can't really be said as a fact nor proven.
    Yes, but the thing with game code is that it is mostly about calculating dots in 3D space. The code for deciding how those dots are calculated is a very small part compared to other types of code. When they optimize game code, it is almost all about optimizing the number of draw calls, the number of objects and the detail level, if I have understood it correctly.

  24. #149
    Xtreme Addict
    Join Date
    Apr 2007
    Posts
    2,128
    Most of the 3D work is done by the GPU, unless you're using a software renderer. On the CPU side, for example, pathfinding for AI can be very heavy on the CPU and on the branch predictor, due to its recursive nature and continuous state checks.
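    A minimal sketch of why pathfinding is branchy (a plain BFS on a grid; real games typically use A* with heuristics, this just shows the data-dependent checks): every node expansion tests bounds, walls and visited state, and which tests fire depends entirely on the map data.

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// BFS shortest path on a grid of 0 = open, 1 = wall.
// Returns the step count from (sr,sc) to (tr,tc), or -1 if unreachable.
// Each neighbor expansion runs three data-dependent branches whose
// outcomes follow the map layout, not any pattern a predictor can learn.
int shortestPath(const std::vector<std::vector<int>>& grid,
                 int sr, int sc, int tr, int tc) {
    int R = grid.size(), C = grid[0].size();
    std::vector<std::vector<int>> dist(R, std::vector<int>(C, -1));
    std::queue<std::pair<int, int>> q;
    dist[sr][sc] = 0;
    q.push({sr, sc});
    const int dr[] = {1, -1, 0, 0}, dc[] = {0, 0, 1, -1};
    while (!q.empty()) {
        auto [r, c] = q.front(); q.pop();
        if (r == tr && c == tc) return dist[r][c];            // goal check
        for (int i = 0; i < 4; ++i) {
            int nr = r + dr[i], nc = c + dc[i];
            if (nr < 0 || nr >= R || nc < 0 || nc >= C) continue; // bounds
            if (grid[nr][nc] == 1) continue;                      // wall
            if (dist[nr][nc] != -1) continue;                     // visited
            dist[nr][nc] = dist[r][c] + 1;
            q.push({nr, nc});
        }
    }
    return -1;  // queue drained without reaching the target
}
```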

  25. #150
    Xtreme Mentor
    Join Date
    Mar 2006
    Posts
    2,978
    Quote Originally Posted by bedlamite View Post
    I have no idea where these November/December dates come from, but AFAIK nothing has changed and it'll be released in January.
    http://www.digitimes.com/news/a20081104PD203.html

    The rumor mill is saying Deneb 2.8 and 3.0 GHz by mid Nov.

    Jack

