Gosh -- this is what I meant by the 4870 X2 is better shown paired with an Intel CPU:
http://www.legionhardware.com/document.php?id=770
You can do some research --- generally, average and minimum FPS track each other to an extent. I don't entirely disagree: average FPS does not show the whole story. But neither does minimum by itself.
Average is just a statistical representation of the population it is sampling. If the CPUs were not important, then the averages would not be affected and the means of the two populations would come out statistically equivalent. That is not true here: the average FPS clearly responds to the power of the CPU (hence the reason for the CPU scaling article). Thus, average is not meaningless overall; it does allow one to conclude which CPU supports the GPU better. Average goes higher as both min and max go higher.
Fraps'ing any of those games shows that Intel Min is also greater than AMD min....
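To make the min/average point concrete, here is a minimal sketch (the frame times are invented, not taken from any of the linked reviews) of how both numbers fall out of the same Fraps-style frame-time log:
Code:
// Minimal sketch: min and average FPS computed from one frame-time log.
// The frame times below are made up, purely to show the arithmetic.
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    // Frame times in milliseconds, as a Fraps-style frametimes dump would give you.
    std::vector<double> frame_ms = {16.7, 18.2, 15.9, 33.4, 17.1, 20.5};

    double total_ms = 0.0, worst_ms = 0.0;
    for (double t : frame_ms) {
        total_ms += t;
        worst_ms = std::max(worst_ms, t);
    }

    // Average FPS = frames rendered / total time; min FPS = 1 / slowest frame.
    double avg_fps = 1000.0 * frame_ms.size() / total_ms;
    double min_fps = 1000.0 / worst_ms;
    std::cout << "avg FPS: " << avg_fps << ", min FPS: " << min_fps << "\n";
}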
However, my original point is that the 4870 X2 is a darn fast card -- it drives gaming into the CPU-limited domain even at high-end resolutions. This is the reason review sites use the fastest possible CPU to evaluate the capabilities of a GPU ... otherwise, it hides the performance and skews the evaluation. Regardless of which CPU is faster, the interesting question to ask is... why? Why (pre-Core 2 Duo) did AMD perform better at gaming code, but with C2D the tables turned?
Jack
I agree with this, good post :)
But one thing is clear: if the min fps is above 30, smoothness is guaranteed.
No sky-high avg or max fps numbers can guarantee smoothness; it may still be very choppy during some parts.
With that said, fps graphs like the ones @ [H] are very useful, in conjunction with min and avg numbers.
Yep, I myself prefer to match the refresh rate of my monitor (no chance of tearing) ... Phenom is a great gaming CPU; only in one or two cases could I really see an Intel quad having an advantage over Phenom. For the vast majority, both will yield the same gameplay experience.
Those games are not that heavy (hard on the hardware) to run (the CoH test is messed up). AMD will not show advantages until Intel's L2 cache stops being so advantageous (more demanding threads and more memory traffic will decrease that advantage), i.e. until Intel can no longer use its large cache, or until communication with other hardware increases. It is impossible to see that just by viewing the numbers Legion shows.
What that Legion test showed was pretty much the difference the L2 cache (size and speed) makes for whichever runs the test was done with. It is hard to draw much more of a conclusion than that. Deneb, which will have 6 MB of L3 cache, will gain some; if the latency is decreased it will gain some more in average FPS.
What was most interesting in the Legion test (knowing only the average FPS) was that you will get the same gameplay experience with a cheap X2 as with an expensive C2Q when running those games on a 4870 X2.
Well, I agree about CoH; Intel blows AMD away on that test by >50% clock for clock.... so I don't accept that as representative.
However, every one of those tests is a high-res multithreaded game (1920x1200, full AA).... it isn't that the software is light, it's that the 4870 X2 is such a powerful card that at high resolutions with full AA the bottleneck stays on the CPU... in fact, there is evidence that even the highest-end Intel CPU at 3.0 GHz still bottlenecks this card, meaning there is even more performance being held back.
What the Legion tests show is exactly what I have been showing in this thread.... and it counters your assertion. Intel is currently better at gaming code, single- or multithreaded. It also explains, again, why every (and I mean EVERY) site that evaluated the 4870 X2 used an Intel CPU.... again, to evaluate the competency of a component, you remove the ambiguity of any other bottlenecks by using the fastest supporting system available.
But it is not L2 cache that is doing this.... Intel has far superior branch prediction, and gaming code is the branchiest you will find. Deneb will be an improvement over Agena no doubt, but it will not overtake even a Kentsfield in gaming.
Jack
Actually, my theory is that the branch predictors are the only Netburst features that were ported over into the Core microarchitecture.
Netburst, at the end of its life, had some 31 stages in its pipeline... without question the largest performance hit was mispredicted branches, where some 30 cycles would be wasted just flushing the pipeline and repopulating it.
To recover as much as they could, the designers likely really worked over the branch predictors to avoid this penalty. Unfortunately, I cannot post the data, it is not mine, but RealWorldTech asked me for some input on an article they are working on, specifically asking 'why is Core so much better at gaming than K10'... so I know the answer already, as I have seen the data. :)
Jack
High resolution is something the video card has to handle; you know, what I first thought -- that games make the processor work harder at higher resolution -- was not right. The processor does the same work independently of the resolution.
You could do the exact same test with a slower video card just by decreasing the resolution.
Here is a very simple explanation showing how it works (single threaded game)
+ = processor is working
- = video card is working
Fast processor - low res
++--++--++--++--++--++--
Slow processor - low res
++++--++++--++++--++++--++++--++++--
Fast processor - high res
++---------++---------++---------++---------++---------++---------
Slow processor - high res
++++---------++++---------++++---------++++---------++++---------++++---------
With more threads, some could do work while another thread is waiting for the video card to paint the picture.
But the processor and the video card aren't working independently; they need to wait for each other. Some sort of synchronization is needed in a multithreaded environment as well.
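To put the + / - picture above into code, here is a toy model (all the millisecond figures are invented, purely for illustration) of a single-threaded loop where the CPU part and the GPU part run back to back, so the frame time is roughly their sum:
Code:
// Toy model of the '+'/'-' diagram: in a single-threaded game the CPU work and
// the GPU work happen one after the other, so frame time ~ cpu_ms + gpu_ms.
// The numbers are invented; only the relative behaviour matters.
#include <cstdio>

int main() {
    struct Case { const char* name; double cpu_ms; double gpu_ms; };
    const Case cases[] = {
        {"fast CPU,  low res", 2.0, 2.0},
        {"slow CPU,  low res", 4.0, 2.0},
        {"fast CPU, high res", 2.0, 9.0},
        {"slow CPU, high res", 4.0, 9.0},
    };
    for (const Case& c : cases) {
        double frame_ms = c.cpu_ms + c.gpu_ms;   // CPU works, then waits on the video card
        std::printf("%-19s: %.0f fps\n", c.name, 1000.0 / frame_ms);
    }
}
The same CPU gap that dominates at low resolution nearly disappears at high resolution, which is what the diagram is saying.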
When you say the processor bottlenecks the video card, can you explain what you mean?
It's quite simple: the CPU bottlenecks the GPU when the card can render the image faster than the CPU can provide the data that is necessary for the game (AI/physics etc.) to advance. Just think of a massive explosion with hundreds of parts, where the parts have a rather simple form and only plain textures.
The GPU could render this scene at several hundred fps, but the CPU can only provide the locations of all those parts 10 times a second (i.e. 10 fps), so your framerate would be 10 fps, regardless of what the graphics card can do.
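A minimal sketch of that explosion example, assuming the CPU and GPU stages can overlap across frames so the delivered rate is simply capped by the slower of the two (the 10 and 100 fps figures are just the ones from the post above):
Code:
// If the CPU can only produce 10 updates per second, the GPU's headroom is wasted.
#include <algorithm>
#include <cstdio>

int main() {
    double cpu_fps = 10.0;   // CPU: positions of all the explosion parts, 10 times/s
    double gpu_fps = 100.0;  // GPU: could draw the simple, plainly textured parts at 100 fps
    std::printf("delivered: %.0f fps (CPU-bound)\n", std::min(cpu_fps, gpu_fps));
}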
Branches are done in all software; you do it all the time -- conditions and loops are everywhere. I think that games try to avoid these in order to gain speed. That is a trick I do when I need better performance. Aligning data is also common, and that has the same speed goal: to have one single flow of data.
Do you know how the branch prediction works on K10?
Yeah, there is lots of literature on the branch predictors for both K10 and Core...
Branches in software are generated by code; branch predictors attempt to predict which way a branch will go and load that block of code into the pipeline... branch prediction is entirely in the hardware, a hard-wired logical algorithm intended to increase IPC. This is a fundamental comp-sci topic; you need to do some research.
EDIT: However, game code contains more conditional logic than, say, 'encoding' or 'rendering' code ... hang tight, I have several papers somewhere that document the number of branches for different code types. Games are the highest, and this is where strong branch prediction shines, hence the reason Intel wallops AMD in gaming code.
Intel's branch prediction algorithms are much stronger than AMD's.
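A standard little experiment -- my own illustration, not something from the linked articles -- that makes the predictor's influence visible: the same branchy loop runs much faster when the branch is predictable (sorted data) than when it is effectively random (unsorted data):
Code:
// Time a branchy loop over random data, then over the same data sorted.
// The work is identical; only the predictability of the branch changes.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

static long long sum_over_threshold(const std::vector<int>& v) {
    long long sum = 0;
    for (int x : v)
        if (x >= 128)        // this is the branch the predictor has to guess
            sum += x;
    return sum;
}

int main() {
    std::vector<int> data(1 << 22);
    std::mt19937 rng(42);
    for (int& x : data) x = static_cast<int>(rng() % 256);

    auto time_it = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        volatile long long s = sum_over_threshold(data);
        (void)s;
        auto t1 = std::chrono::steady_clock::now();
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("%s: %lld us\n", label, us);
    };

    time_it("random (mispredicted) branches");
    std::sort(data.begin(), data.end());
    time_it("sorted (well-predicted) branches");
}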
jack
Read this thread; all the explanations are there.... I have posted numerous links in which even nVidia explains how to determine bottlenecking at the CPU.
Wow.
What it essentially means is that the Phenom is not a good CPU to pair with a 4870 X2 even for high-resolution gameplay; you are holding back the potential of the video card.
Do you have a good link about K10?
About code that needs speed and branches: the general rule is to avoid branches as much as you can, and there are numerous ways to do that. It will also be harder for the compiler to optimize code if there are a lot of branches. I don't think games have a lot of branches; logically it would be the opposite. There are a lot of talks that look at game code and how to avoid branches.
If you do need branches, then try to have one single flow and avoid moving the instruction pointer with a conditional branch.
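One common flavour of that trick, as a sketch only (whether it actually wins depends on the compiler and the data): turn a data-dependent if into straight-line arithmetic so the instruction pointer never has to jump.
Code:
// Same result two ways: a conditional jump vs. pure arithmetic on the comparison.
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> samples = {12, 200, 90, 255, 3, 130};

    long long branchy = 0, branchless = 0;
    for (int x : samples) {
        if (x >= 128)                 // conditional branch the predictor must guess
            branchy += x;
        branchless += x * (x >= 128); // the comparison becomes 0 or 1; no jump needed
    }
    std::printf("branchy=%lld branchless=%lld\n", branchy, branchless);
}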
In a chain of work to get the job done, everything that takes time in that chain adds to the total time. If you have a slow processor, you may gain the most by changing the processor. If you have a slow video card, you gain more by changing the video card. Hopefully the developers have optimized the game. This talk about 'my processor bottlenecks my video card' or 'the video card bottlenecks the CPU' I do not understand. They both participate in the work to render the frame, and they both use time to render it. Buying a faster video card or a faster processor will decrease the total time, but both will still use up time that adds to the total time to render the frame.
Here is a good paper: http://ati.amd.com/developer/gdc/PerformanceTuning.pdf
http://www.realworldtech.com/page.cf...WT051607033728
Quote:
The branch prediction in the K8 also received a serious overhaul. The K8 uses a branch selector to choose between using a bi-modal predictor and a global predictor. The bi-modal predictor and branch selector are both stored in the ECC bits of the instruction cache, as pre-decode information. The global predictor combines the relative instruction pointer (RIP) for a conditional branch with a global history register that tracks the last 8 branches to index into a 16K entry prediction table that contains 2 bit saturating counters. If the branch is predicted as taken, then the destination must be predicted in the 2K entry target array. Indirect branches use a single target in the array, while CALLs use a target and also update the return address stack. The branch target address calculator (BTAC) checks the targets for relative branches, and can correct predictions from the target array, with a two cycle penalty. Returns are predicted with the 12 entry return address stack.
Barcelona does not fundamentally alter the branch prediction, but improves the accuracy. The global history register now tracks the last 12 branches, instead of the last 8. Barcelona also adds a new indirect predictor, which is specifically designed to handle branches with multiple targets (such as switch or case statements). Indirect branch prediction was first introduced with Intel’s Prescott microarchitecture and later the Pentium M. Indirect branches with a single target still use the existing 2K entry branch target buffer. The 512 entry indirect predictor allocates an entry when an indirect target is mispredicted; the target addresses are indexed by the global branch history register and branch RIP, thus taking into account the path that was used to access the indirect branch and the address of the branch itself. Lastly, the return address stack is doubled to 24 entries.
According to our own measurements for several PC games, between 16-50% of all branch mispredicts were indirect (29% on average). The real value of indirect branch misprediction is for many of the newer scripting or high level languages, such as Ruby, Perl or Python, which use interpreters. Other common indirect branch common culprits include virtual functions (used in C++) and calls to function pointers. For the same set of games, we measured that between 0.5-5% (1.5% on average) of all stack references resulted in overflow, but overflow may be more prevalent in server workloads.
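As a tiny illustration of the indirect branches the quote is talking about (the classes here are made up, just to show the mechanism): a virtual call jumps through a pointer, so the predictor has to guess the target address itself, not merely taken/not-taken.
Code:
// Each w->fire() below compiles to an indirect call: the CPU cannot know the
// destination until the pointer is loaded, which is what the indirect predictor
// in Barcelona (and earlier in Prescott/Pentium M) tries to guess.
#include <cstdio>
#include <memory>
#include <vector>

struct Weapon {
    virtual ~Weapon() = default;
    virtual void fire() const = 0;   // the call site becomes an indirect branch
};
struct Pistol : Weapon { void fire() const override { std::puts("pistol"); } };
struct Rocket : Weapon { void fire() const override { std::puts("rocket"); } };

int main() {
    std::vector<std::unique_ptr<Weapon>> loadout;
    loadout.push_back(std::make_unique<Pistol>());
    loadout.push_back(std::make_unique<Rocket>());
    for (const auto& w : loadout)
        w->fire();                   // which target? depends on the object, not the code
}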
Now think game code ... a player shoots a weapon (a rough code sketch follows this list):
a) In the evaluation loop (a loop is itself a branch condition on when to exit), it needs to check whether the player pulls the trigger (a branch).
b) If the trigger is pulled, what weapon is he firing? (another branch)
c) Calculate the physics: does he hit the bad guy, yes or no? (another branch)
d) Where does he hit the bad guy (head, arm, neck)? (another branch)
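Here is that a)-d) flow as a rough code sketch; every name in it is hypothetical, the point is only how quickly ordinary gameplay logic piles up branches:
Code:
// Hypothetical gameplay snippet: each lettered step above is a branch.
#include <cstdio>

enum class Weapon { Pistol, Rocket };
struct Input { bool trigger; Weapon weapon; };
struct Shot  { bool hit; bool headshot; };

// Stand-in for the physics/hit test; the real thing would branch far more.
Shot resolve_shot(Weapon w) { return {w == Weapon::Rocket, false}; }

void game_tick(const Input& in) {
    if (!in.trigger) return;                      // a) did the player pull the trigger?
    Shot s = resolve_shot(in.weapon);             // b) which weapon? (a switch or virtual call inside)
    if (!s.hit) return;                           // c) did he hit the bad guy?
    std::puts(s.headshot ? "headshot" : "hit");   // d) where did he hit him?
}

int main() {
    const Input frames[] = {{false, Weapon::Pistol}, {true, Weapon::Rocket}};
    for (const Input& in : frames)                // the evaluation loop itself branches on when to exit
        game_tick(in);
}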
Game code is the absolute branchiest of all code classes. I am still looking for those papers that show it 30-80% higher than any other major kind of code. The reason for this is the total amount of variability in the progression of the game. Game code does not know if you are going to jump, crouch, turn left or right, die or blow up, as opposed to something like a 3D renderer, which only needs to take data, do a calculation, then move to the next pixel, use the information, do the calculation. Same with encoding: take one frame of data, calculate the attributes based on the other pixels around it, move to the next pixel... very linear. This is why P4s could do well at multimedia but were so bad at gaming... so long as there was little branching in the code, P4s could handle the load.
I do agree, branchy code is to be avoided at all costs ... but some applications simply demand a large number of checks and conditions that generate new code paths (games are the biggest example; how much fun would a game be if you did the exact same thing every time?).
Jack
Eeehhh.. Are you joking with me?
Firstly, I can tell you that firing a weapon will need many more branches than that ;). There is a lot going on in the processor. Code isn't exactly like "If shoot then boom", if you know what I mean.
One type of branchy code is parsers. The complexity of the stream being parsed and how many different paths can be taken add to how branchy it is. But this can be optimized using different techniques.
Branch prediction is in fact mostly for stupid developers who don't understand how to write fast code for the processor.
The paper you showed about branch prediction was for K8; K10 has some improvements there, but as far as I know they have not shown how that works.
EDIT: The most branches in normal code are in fact used to handle errors.
What the hell are you talking about? First you say that game code is light on branches, I give you one case of a general algorithm that pins down how something as simple as shooting a weapon generates branches, and then you say it does much more than that.
Gosh, frankly -- in all this discussion, it is becoming more and more clear that you are pretty ignorant of how CPUs actually function; it is not worth my time any more.
You will continue to spam forums and get the flak you get for one reason: your preconceived notion of how things work and how they actually work are two different things. Most of your applications of the concepts are wrong. This is why you will always get flak from the community.
It would be worth your time to spend a bit more study on basic, fundamental computer science, unlearn what you think you have learned, and re-establish yourself in the fundamentals.
Take care...
jack