Intel Core i7 Review Thread

**JumpingJack** · 11-04-2008, 02:25 AM

Originally Posted by villa1n

What do you think accounts for such a huge performance advantage? What is Nehalem "un-bottlenecking"? Those are some pretty numbers, and how we would have hoped SLI/xfire would scale...?

Won't know until I get a CPU and run the 4870 X2's x-fired. But the BW demands on an FSB (or even HT) in gaming scenarios are not nearly as high as you would think. I posted a lot of data on this in another thread where, for example, I cranked the 2000 MHz HT on the Phenom down to 200 MHz and it made a grand total of about 2% difference at most (basically in the noise). EDIT: Also, while the PCIe lanes will service the command sets from main memory or CPU to card, most of the actual communication between GPUs in multi-GPU setups is arbitrated through the SLI or xFire link -- specifically to move away from the back bus bottlenecks.

Where you will see the interconnect bottlenecking graphics performance is in cases where the texture memory requirements exceed the onboard VRAM, which you can see in some games at 2560x1600 today, but at 1920x1200, 512 MB seems sufficient.

Nehalem has simply improved clock-for-clock performance. The sites running single GPUs and comparing against a QX9770 are simply showing gaming situations already railed up against the GPU's performance, so no matter what one does, the overall result will make the 'CPUs look the same'. The fact is, i7 is showing similar gains in gaming code execution as it is in 3D rendering or video encoding ... the current crop of reviews/GPUs hides it because the GPU is capping the results.

People focus on the QPI and IMC as the major changes to Nehalem, but these were not all the major changes -- Intel also deepened the execution window, and improved branch prediction (both good for gaming code). However, looking at the tri-SLI results from Guru3D and Toms (recently posted) actually surprised me -- I was expecting modest gaming improvements but some are just huge...

Take for example clock for clock -- a 60% improvement in Brothers in Arms (Guru3D data: QX9770 67 FPS, i7 965 107 FPS @ 1920x1200). Even Far Cry 2 shows a massive jump, which surprises me... I was expecting best case maybe 15-20%.
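
A quick sanity check on that 60% figure, as a minimal sketch. The helper name `percent_gain` is made up for illustration; the two FPS values are the Guru3D numbers quoted in this post.

```python
def percent_gain(baseline_fps: float, new_fps: float) -> float:
    """Relative improvement of new_fps over baseline_fps, in percent."""
    return (new_fps - baseline_fps) / baseline_fps * 100.0


# Guru3D Brothers in Arms numbers at 1920x1200, as quoted in this post.
qx9770_fps = 67.0
i7_965_fps = 107.0

gain = percent_gain(qx9770_fps, i7_965_fps)
print(f"i7 965 over QX9770: {gain:.1f}%")
```

That lands just under 60%, so the round figure is fair.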

Who knows for sure at this point; the reviewers are simply publishing their 'study' but do not run varied tests to really work out why... I am certain though, raising the FSB (keeping the same core clock speed) will not make up a 60% difference.

In the Guru3D charts comparing the QX9770 and i7 for tri-SLI, the QX9770 is clearly showing CPU-capped runs all the way to 1920x1200; the two CPUs only converge at 2560x1600 (a resolution you can say is now GPU limited).

**savantu** · 11-04-2008, 02:36 AM

Originally Posted by Bellisimo

crank your fsb up to 500 and you will have better results for the core2
this is just because of the qpi link, and few people use 3-way sli, even SLI isn't used a lot

BS.

What you're seeing is this :

- at resolutions below 1600x1200 you're CPU limited, the GPUs can do more
- at 1920x1200 the GPUs start to feel the pressure
- at 2560x1600 you're GPU limited

1st case: a more powerful CPU helps
2nd case: a more powerful CPU helps very little
3rd case: a more powerful CPU has no effect
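
The three cases can be sketched as a simple bottleneck model, assuming the frame rate you see is capped by whichever of the CPU or the GPUs is slower. All the numbers below are hypothetical, picked only to illustrate the shape of the result, not taken from any review.

```python
def observed_fps(cpu_fps: float, gpu_fps: float) -> float:
    """Frame rate is capped by the slower of the two stages (simplifying assumption)."""
    return min(cpu_fps, gpu_fps)


# Hypothetical CPU-side frame rate ceilings (treated as independent of resolution).
cpu_ceiling = {"slower CPU": 70.0, "faster CPU": 110.0}

# Hypothetical GPU-side ceilings, which fall as resolution rises.
gpu_ceiling = {"1280x1024": 200.0, "1920x1200": 75.0, "2560x1600": 60.0}

results = {
    res: {cpu: observed_fps(c, g) for cpu, c in cpu_ceiling.items()}
    for res, g in gpu_ceiling.items()
}

for res, row in results.items():
    print(res, row)
```

With these made-up ceilings the faster CPU wins big at low resolution, barely at 1920x1200, and not at all at 2560x1600.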

Skulltrail based Nehalem with 4 GPUs is going to be the all around monster.

**~~Bellisimo~~** · 11-04-2008, 02:40 AM

Originally Posted by JumpingJack

Won't know until I get a CPU and run the 4870 X2's x-fired. But the BW demands on an FSB (or even HT) in gaming scenarios are not nearly as high as you would think. I posted a lot of data on this in another thread where, for example, I cranked the 2000 MHz HT on the Phenom down to 200 MHz and it made a grand total of about 2% difference at most (basically in the noise). EDIT: Also, while the PCIe lanes will service the command sets from main memory or CPU to card, most of the actual communication between GPUs in multi-GPU setups is arbitrated through the SLI or xFire link -- specifically to move away from the back bus bottlenecks.

don't forget all memory reads/writes are placed on the FSB, if all data of 3 GTX280s has to be placed on that same FSB, a serious bottleneck is born
i also thought the fsb isn't bidirectional, so you have to wait before you can send something in a direction

and i didn't find large gains with 4870x2 either

**savantu** · 11-04-2008, 02:43 AM

Originally Posted by JumpingJack

..
Nehalem has simply improved clock-for-clock performance. The sites running single GPUs and comparing against a QX9770 are simply showing gaming situations already railed up against the GPU's performance, so no matter what one does, the overall result will make the 'CPUs look the same'. The fact is, i7 is showing similar gains in gaming code execution as it is in 3D rendering or video encoding ... the current crop of reviews/GPUs hides it because the GPU is capping the results.

People focus on the QPI and IMC as the major changes to Nehalem, but these were not all the major changes -- Intel also deepened the execution window, and improved branch prediction (both good for gaming code). However, looking at the tri-SLI results from Guru3D and Toms (recently posted) actually surprised me -- I was expecting modest gaming improvements but some are just huge...

Nehalem isn't just Core on steroids (IMC+QPI). They've improved every single part of the core; everything was tinkered with.
Some changes are more radical, like macro-op fusion in 64-bit mode and support for fusing a larger variety of instructions, SMT, a second-level TLB and so on.

Where Nehalem stumbles is the slower L1 and small L2. People forget that Penryn was an incredibly high standard to start with (look at AMD being unable to offer a solution, probably until 2010), with a nice, massive, very fast 6MB L2. As long as an app is cache friendly, Penryn will rock.

**JumpingJack** · 11-04-2008, 02:44 AM

Originally Posted by Bellisimo

don't forget all memory reads/writes are placed on the FSB, if all data of 3 GTX280s has to be placed on that same FSB, a serious bottleneck is born
i also thought the fsb isn't bidirectional, so you have to wait before you can send something in a direction

It isn't, that's the fault in your argument -- texture and vertices are stored in VRAM -- this is called precaching -- so you may experience better level loading times, but during actual game play the bulk of the massive slug of data is kept in very fast memory next to the GPU with a very fast BW connection (due to the necessary high throughput).

The data communicated between the card and CPU is only the command buffer which, in part, contains the pointers to the textures in VRAM... it is a very small data set relatively speaking.

On top of that, PCIe 2.0 enables bus mastering, so to access the command buffer the GPU does not go through the FSB on the Intel platform, it simply masters straight to system RAM.

Jack

**savantu** · 11-04-2008, 02:46 AM

Originally Posted by Bellisimo

don't forget all memory reads/writes are placed on the FSB, if all data of 3 GTX280s has to be placed on that same FSB, a serious bottleneck is born
i also thought the fsb isn't bidirectional, so you have to wait before you can send something in a direction

and i didn't find large gains with 4870x2 either

Huh? GPUs don't talk to memory over the FSB, they have DMA over the NB. Using DDR3-1600, GPUs have 25.6 GB/s of BW (minus what goes to the FSB) available.
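
For reference, the 25.6 GB/s figure is what the usual peak-bandwidth arithmetic gives, assuming dual-channel DDR3-1600 with 64-bit channels (the dual-channel part is an assumption; it is what makes the number work out). The same formula gives the theoretical peak of a 1600 MT/s, 64-bit FSB for comparison. A rough sketch with a made-up helper name:

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float, bus_width_bits: int, channels: int = 1) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


# Assumption: dual-channel DDR3-1600, 64 bits per channel.
ddr3_1600_dual = peak_bandwidth_gb_s(1600, 64, channels=2)

# 400 MHz quad-pumped FSB = 1600 MT/s on a 64-bit bus.
fsb_1600 = peak_bandwidth_gb_s(1600, 64)

print(f"DDR3-1600 dual channel: {ddr3_1600_dual:.1f} GB/s")
print(f"FSB at 1600 MT/s:       {fsb_1600:.1f} GB/s")
```

These are theoretical peaks only; sustained throughput on either path will be lower.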

**~~Bellisimo~~** · 11-04-2008, 02:47 AM

Originally Posted by JumpingJack

It isn't, that's the fault in your argument -- texture and vertices are stored in VRAM -- this is called precaching -- so you may experience better level loading times, but during actual game play the bulk of the massive slug of data is kept in very fast memory next to the GPU with a very fast BW connection (due to the necessary high throughput).

The data communicated between the card and CPU is only the command buffer which, in part, contains the pointers to the textures in VRAM... it is a very small data set relatively speaking.

On top of that, PCIe 2.0 enables bus mastering, so to access the command buffer the GPU does not go through the FSB on the Intel platform, it simply masters straight to system RAM.

Jack

so the cpu doesn't communicate with its own memory through the FSB?
come on jack

it goes like this cpu --> FSB --> NB --> memory

**JumpingJack** · 11-04-2008, 02:49 AM

Originally Posted by savantu

Nehalem isn't just Core on steroids (IMC+QPI). They've improved every single part of the core; everything was tinkered with.
Some changes are more radical, like macro-op fusion in 64-bit mode and support for fusing a larger variety of instructions, SMT, a second-level TLB and so on.

Where Nehalem stumbles is the slower L1 and small L2. People forget that Penryn was an incredibly high standard to start with (look at AMD being unable to offer a solution, probably until 2010), with a nice, massive, very fast 6MB L2. As long as an app is cache friendly, Penryn will rock.

Yeah, I agree ... the L1 and L2 cache sizes were probably settled upon based on power and die size constraints. It will be interesting to see if the Westmere tick bumps the L2 to 512 KB or more per core.

However, what Intel did was unique -- they fashioned a synchronous interface between cores and L3 which keeps the L3 latency very low. Their L2 latency (because it is small) is also better than the L2 latency on C2D (note I state this specifically on the absolute level of cache) -- but L1 is one cycle slower.... what it looks like is Intel did a real balancing act in designing the cache hierarchy.

Ultimately this proves to be somewhat of a hindrance in single thread, as we can see in the data, but overall the sum of the parts is as good as to very slightly better than Core 2 (single threaded).

David did a really nice job summarizing the major items in Nehalem... I am sure you have read it.

Jack

**JumpingJack** · 11-04-2008, 02:50 AM

Originally Posted by Bellisimo

so the cpu doesn't communicate with its own memory through the FSB?
come on jack

it goes like this cpu --> FSB --> NB --> memory

It does but not nearly as much as you are alluding to ... 500 MHz over 400 MHz FSB is not going to make up a 60% performance delta.... what you are seeing in the Guru3D data is pure CPU power over CPU power. This is an easy test.... FSB and mem BW through the FSB is not nearly as critical as you think... if it were, then the superior BW of the Phenom should be mopping up ... and truth is ... mem BW through the interconnect is one of 100 different components that work together to produce the result.

Go to the benchmarking section, look at the massive Phenom vs C2Q thread. I show 200 MHz and 400 MHz FSB gaming on a C2Q at 3.0 GHz -- guess what, no difference.

Do you have a C2Q?? It is an easy experiment to do ... you won't affect your gaming benches by much more than 5-8%. (EDIT: revised this after I pulled up my FSB scaling table).

Here is an example:
QX9650 with 200 MHz FSB using a 4870 X2

QX9650 with 400 MHz FSB using a 4870 X2

Snow is 0% difference, cave is 80.1 in the 200 MHz case, 85.7 in the 400 MHz case (that is a 7% difference) ...

500 MHz is not gonna amount to a hill of beans.
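
To put numbers on that, a back-of-the-envelope sketch using the cave result above. The linear extrapolation to 500 MHz is purely an assumption for illustration (real scaling is unlikely to be linear), and the helper name is made up.

```python
def percent_change(old: float, new: float) -> float:
    """Change from old to new, as a percentage of old."""
    return (new - old) / old * 100.0


# Cave scene on the QX9650 + 4870 X2: 200 MHz FSB vs 400 MHz FSB.
fps_gain = percent_change(80.1, 85.7)
clock_gain = percent_change(200.0, 400.0)  # FSB clock doubled

# Percent of FPS gained per percent of extra FSB clock.
sensitivity = fps_gain / clock_gain

# Naive linear projection for 400 MHz -> 500 MHz (assumption, not a measurement).
projected_gain = sensitivity * percent_change(400.0, 500.0)

print(f"200 -> 400 MHz: {fps_gain:.1f}% more FPS")
print(f"400 -> 500 MHz (naive projection): {projected_gain:.2f}% more FPS")
```

Even on that generous assumption the projection is a couple of percent at most, nowhere near a 60% delta.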

**~~Bellisimo~~** · 11-04-2008, 02:55 AM

Originally Posted by JumpingJack

It does but not nearly as much as you are alluding to ... 500 MHz over 400 MHz FSB is not going to make up a 60% performance delta.... what you are seeing in the Guru3D data is pure CPU power over CPU power.

of course it will not make up the 60% perf delta, i didn't say that
i said if you would overclock the fsb to 500 and keep the frequency at 3.2 ghz for the qx9770 you would see a similar gain (not as large)
also don't forget intel cores communicate with each other (2 dual-core dies on a quad-core) through the FSB
so FSB is occupied with this:
- Intercore traffic
- Memory access
- Peripheral stuff
- Graphics stuff

a 1600MHz fsb will not cut it....
especially when there are 3 very powerful graphics cards in the system

that's why i stated the QPI bus is largely responsible for the huge gain with 3-way SLI

Originally Posted by JumpingJack

Do you have a C2Q?? It is an easy experiment to do ... you won't affect your gaming benches by much more than 5 or 6%.

i don't, i have an E8400, but i don't have 3*GTX280 so i wouldn't be able to test it

http://anandtech.com/cpuchipsets/int...px?i=3448&p=12
oblivion is screwed, but look at QW with only 2 gpu's

**JumpingJack** · 11-04-2008, 03:00 AM

Originally Posted by Bellisimo

it goes like this cpu --> FSB --> NB --> memory

Also, don't be condescending or I will put you in your place but fast.

jack

**JumpingJack** · 11-04-2008, 03:02 AM

Originally Posted by Bellisimo

i don't, i have a E8400, but i don't have 3*GTX280 so i wouldnt be able to test it

That's ok, I already have... I already know the answer... and yes you did imply that 500 Mhz was the delta:

crank your fsb up to 500 and you will have better results for the core2
this is just because of the qpi link, and few people use 3-way sli, even SLI isn't used a lot

Which is, as Savantu states, BS.

**informal** · 11-04-2008, 03:03 AM

Originally Posted by Clairvoyant129

I wonder why SLI/Crossfire with an AMD platform under performs compared to an Intel platform. According to you, it's just the hyper transport, right?

It's because Phenoms run at relatively low clocks compared to Core 2. But soon we'll have a 3 GHz Deneb with hopefully higher Northbridge/L3 clocks, so we'll see how SLI/CF works with that chip at default and above its default clocks.

**~~Bellisimo~~** · 11-04-2008, 03:04 AM

Originally Posted by JumpingJack

That's ok, I already have... I already know the answer... and yes you did imply that 500 Mhz was the delta:

Which is, as Savantu states, BS.

i said performance would increase, not that it would make up the 60% perf delta, please just read what i type instead of interpreting it wrongly

Originally Posted by JumpingJack

Also, don't be condescending or I will put you in your place but fast.

jack

should i be scared? i am just posting some relevant stuff you don't agree with, and you start posting stuff like this? I had a lot of respect for you and your knowledge, but you lost it in 5 minutes

**JumpingJack** · 11-04-2008, 03:06 AM

Originally Posted by Bellisimo

i said performance would increase, not that it would make up the 60% perf delta, please just read what i type instead of interpreting it wrongly

should i be scared?

You said:

this is just because of the qpi link, and few people use 3-way sli, even SLI isn't used a lot

this is not true. The BW from the CPU to the cards is not at play here.

**~~Bellisimo~~** · 11-04-2008, 03:07 AM

Originally Posted by JumpingJack

You said: this is not true. The BW from the CPU to the cards is not at play here.

well, buy 3 GTX280 and find out yourself?

i clearly said, extra FSB results in better scaling
i didn't say (and didn't mean) that 100mhz extra fsb would result in 60% more performance

in the next phrase i state that the QPI link is responsible for the much better scaling compared to the FSB

**JumpingJack** · 11-04-2008, 03:07 AM

Originally Posted by Bellisimo

i said performance would increase, not that it would make up the 60% perf delta, please just read what i type instead of interpreting it wrongly

should i be scared? i am just posting some relevant stuff you don't agree with, and you start posting stuff like this? I had a lot of respect for you and your knowledge, but you lost it in 5 minutes

You did not show any respect, I was offended. And what you posted has no relevance because it is wrong.

**~~Bellisimo~~** · 11-04-2008, 03:11 AM

Originally Posted by JumpingJack

You did not show any respect, I was offended. And what you posted has no relevance because it is wrong.

obviously you confused the memory i was referring to with GPU memory, that's why i specified cpu - fsb - nb - mem

**JumpingJack** · 11-04-2008, 03:12 AM

Originally Posted by savantu

Huh? GPUs don't talk to memory over the FSB, they have DMA over the NB. Using DDR3-1600, GPUs have 25.6 GB/s of BW (minus what goes to the FSB) available.

He clearly does not understand this, or did not read into the concept that PCIe 2.0 can be a bus master.

**JumpingJack** · 11-04-2008, 03:14 AM

Originally Posted by Bellisimo

obviously you confused the memory i was referring to with GPU memory, that's why i specified cpu - fsb - nb - mem

Ahhh, I see... so how much bus BW do you think it takes to write the command buffer to the GPU? Do you actually have data or numbers? I mean, would you expect that if I halve the BW to the GPU via the FSB I should see roughly 1/2 the FPS performance?

**Shintai** · 11-04-2008, 03:16 AM

Originally Posted by Bellisimo

obviously you confused the memory i was referring to with GPU memory, that's why i specified cpu - fsb - nb - mem

Let me help simplify it for you.

CPU goes CPU->FSB->MCH->Memory.

Your GPUs go GPU->PCIe->MCH->Memory.

Nowhere does the GPU go over the CPU for its textures.

i7 and even Core 2 have a lot of untapped power. You can also see that in the low-res benches.

**savantu** · 11-04-2008, 03:17 AM

Originally Posted by Bellisimo

well, buy 3 GTX280 and find out yourself?

i clearly said, extra FSB results in better scaling
i didn't say (and didn't mean) that 100mhz extra fsb would result in 60% more performance

in the next phrase i state that the QPI link is responsible for the much better scaling compared to the FSB

Empirical evidence: why doesn't K10 perform similarly to Nehalem then? After all, HT offers similar BW to QPI.

**~~Bellisimo~~** · 11-04-2008, 03:18 AM

Originally Posted by JumpingJack

Ahhh, I see... so how much bus BW do you think it takes up to write the command buffer to the GPU? Do you actually have data or numbers?

i don't see your numbers contradicting what i am saying, did you check the anand link? not a lot of improvement from 790i - qx9770 to x58 - 965

**~~Bellisimo~~** · 11-04-2008, 03:19 AM

Originally Posted by Shintai

Nowhere does the GPU go over the CPU for its textures.

lol, where did i say that?

Originally Posted by savantu

Empirical evidence: why doesn't K10 perform similarly to Nehalem then? After all, HT offers similar BW to QPI.

because k10 is a slow cpu....
but maybe you should ask jack that question, after all, cpus and their buses are not related to anything a gpu does apparently

**Shintai** · 11-04-2008, 03:21 AM

Originally Posted by Bellisimo

lol, where did i say that?

You didn't, but you obviously don't understand what the GPU needs either.

The only part requiring large bandwidth is textures. And that doesn't go via the FSB on a Core 2. Hence why JJ's test showed no difference between 200 and 400.

And K10 being a "slow" CPU has nothing to do with it.

You have 3 people, including proof, against you. I think it's time for you to show the evidence for your claims.
