The GT300/Fermi Thread

**annihilat0r** · 12-21-2009, 06:00 AM

Those numbers in the AMD slides are the most misleading things I have ever seen. Shader cores and SP performance? 4870 should be a hell of a lot faster than GTX 285 if they are to mean anything.

**Piotrsama** · 12-21-2009, 07:03 AM

Originally Posted by annihilat0r

Those numbers in the AMD slides are the most misleading things I have ever seen. Shader cores and SP performance? 4870 should be a hell of a lot faster than GTX 285 if they are to mean anything.

Yes, they are marketing stuff.
Just like Nvidia's slides talking about "8X DP performance improvement" and not mentioning that the SP improvement is marginal.

**annihilat0r** · 12-21-2009, 07:18 AM

Marginal being, like, double the previous SP?

Plus, when Nvidia is misleading with something, they are greedy, cruel capitalists out for the money in our pockets; but when AMD is misleading it's "marketing". It's always like that around here.

**orangekiwii** · 12-21-2009, 07:45 AM

Seriously, those kinds of slides make me sick. They're so wrong. You'd see the same kind of "fail" hardware for Nvidia if you compared the GTX 280 to the 4870. Except the GTX 280 outperformed the 4870.

And I agree with annihilat0r: if Nvidia does something like this, it's "haha, the fools, we won't fall for their rebranding and overhyped product", but with AMD it's simply "marketing."

Stop being so biased, people. The sad part is that the people who say it's just marketing are the same people who always call others fanboys.

**Chumbucket843** · 12-21-2009, 09:07 AM

Originally Posted by Piotrsama

Yes, they are marketing stuff.
Just like Nvidia's slides talking about "8X DP performance improvement" and not mentioning that the SP improvement is marginal.

Theoretically the improvement is marginal, but in the real world it is much faster. The Tesla 2000 series is 80% faster than last gen if you exclude the MUL unit, and it's even faster with FMA efficiency. Comparing peak flops rather than the usual n-body demo is not a good marketing technique. Besides, having full-speed double precision is a very important feature of Fermi.

**trinibwoy** · 12-21-2009, 09:33 AM

The funny thing is that a lot of HPC workloads are bandwidth limited and can't even make use of all the flops because they can't get data to the cores fast enough. That's why caching and the use of shared memory are so important. For example, a lot of compute workloads just don't play nice with HD4xxx cards because the LDS there doesn't really function like it should. So people resort to a lot of other trickery, like using the texture units instead to pump data into the cores, but that's obviously not a scalable approach. Things should be much better with HD5xxx, but I haven't seen independent confirmation yet.
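To make the bandwidth-limited point concrete, here is a minimal roofline-style sketch. The roofline framing is not something named in the post, just a common way to express the same idea, and the peak and bandwidth figures are illustrative placeholders rather than numbers from any spec sheet.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Upper bound on throughput: capped by compute or by memory traffic,
    whichever limit is lower."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Illustrative numbers only (assumptions, not vendor specs).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GB_S = 100.0

# Streaming kernel: about 1 flop per 4-byte float fetched from DRAM.
streaming = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, 0.25)

# Blocked kernel: data staged in cache/shared memory and reused,
# so each DRAM byte feeds about 20 flops.
blocked = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, 20.0)

print(f"streaming: {streaming:.0f} GFLOPS, blocked: {blocked:.0f} GFLOPS")
```

Under these assumed numbers the streaming kernel is stuck at a small fraction of peak no matter how many flops the chip advertises, which is why on-chip reuse matters so much.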

**Piotrsama** · 12-21-2009, 10:37 AM

Originally Posted by annihilat0r

Marginal being, like, double the previous SP?

No, like the jump from the 933 GFLOPS of the current C1060 to the 1040 GFLOPS of the C2050 that is coming in Q2 2010.

Which amounts to: 1.11X

Originally Posted by annihilat0r

Plus, when Nvidia is misleading with something, they are greedy, cruel capitalists out for the money in our pockets; but when AMD is misleading it's "marketing". It's always like that around here.

I didn't say that.
Re-read what I wrote:

Originally Posted by Piotrsama

Yes, they are marketing stuff.
Just like Nvidia's slides talking about "8X DP performance improvement" and not mentioning that the SP improvement is marginal.

I'm saying that both sides' slides are marketing.

Originally Posted by orangekiwii

And I agree with annihilat0r: if Nvidia does something like this, it's "haha, the fools, we won't fall for their rebranding and overhyped product", but with AMD it's simply "marketing."

Same for you.

**ajaidev** · 12-21-2009, 10:41 AM

Originally Posted by trinibwoy

The funny thing is that a lot of HPC workloads are bandwidth limited and can't even make use of all the flops because they can't get data to the cores fast enough. That's why caching and the use of shared memory are so important. For example, a lot of compute workloads just don't play nice with HD4xxx cards because the LDS there doesn't really function like it should. So people resort to a lot of other trickery, like using the texture units instead to pump data into the cores, but that's obviously not a scalable approach. Things should be much better with HD5xxx, but I haven't seen independent confirmation yet.

Yes they are 960 dwords/clock.

Of course, these are advertised figures; I would say the worst they can do is around 960/3 dwords/clock, which is still a lot more than RV770.

Now, Fermi has 16 LDS, if I am not mistaken, and they work with 32 executions/clock. Given the huge bandwidth, I would say it is quite a bit more than what Evergreen can offer; not so sure about dual Evergreen, though.

**Chumbucket843** · 12-21-2009, 11:02 AM

Originally Posted by trinibwoy

The funny thing is that a lot of HPC workloads are bandwidth limited and can't even make use of all the flops because they can't get data to the cores fast enough. That's why caching and the use of shared memory are so important. For example, a lot of compute workloads just don't play nice with HD4xxx cards because the LDS there doesn't really function like it should. So people resort to a lot of other trickery, like using the texture units instead to pump data into the cores, but that's obviously not a scalable approach. Things should be much better with HD5xxx, but I haven't seen independent confirmation yet.

Yes, physics simulations will be a lot faster on Fermi than previously. I do remember seeing a slide on the cache hierarchy, and it was 3x faster than GT200. It was CFD, I think.

Originally Posted by Piotrsama

No, like the jump from the 933 GFLOPS of the current C1060 to the 1040 GFLOPS of the C2050 that is coming in Q2 2010.

Which amounts to: 1.11X

Well, first off, you compared the old high-end Tesla to the new low-end Tesla. Secondly, theoretical flops are not a measure of real-world performance (e.g. Larrabee). The MUL unit in GT200 makes the card look faster than it really is; it can be used, but not as often as it should be. You might want to take other factors into consideration, like 6GB of RAM, FMA, the memory hierarchy, more bandwidth, etc.

**trinibwoy** · 12-21-2009, 11:07 AM

http://www.nvidia.com/docs/IO/43395/...83-001_v01.pdf

So clocks are actually 1.25-1.4 GHz. Tesla SKUs will have 2 cores (SMs) disabled, probably due to yields and/or power consumption.

**Piotrsama** · 12-21-2009, 11:35 AM

Originally Posted by Chumbucket843

Well, first off, you compared the old high-end Tesla to the new low-end Tesla. Secondly, theoretical flops are not a measure of real-world performance (e.g. Larrabee). The MUL unit in GT200 makes the card look faster than it really is; it can be used, but not as often as it should be. You might want to take other factors into consideration, like 6GB of RAM, FMA, the memory hierarchy, more bandwidth, etc.

Yes, I realize that FLOPS aren't a real measure, but that didn't stop Nvidia from hyping an 8X improvement in DP FLOPS, did it?

What's the high end you're talking about? The C2070?
If it is (which I'm not sure of), the C2070 is 1260 GFLOPS, and coming in Q3 2010.

The jump from 933 to 1260 is 1.35X (and the card won't be on the market until Q3 2010).
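For anyone who wants to check the two ratios quoted in this exchange, a quick sketch using only the peak SP figures given in the posts (933, 1040 and 1260 GFLOPS); the variable names are just for illustration.

```python
# Peak single-precision figures as quoted in the thread (GFLOPS).
C1060_SP = 933.0
C2050_SP = 1040.0
C2070_SP = 1260.0


def speedup(new_gflops, old_gflops):
    """Ratio of new peak throughput to old peak throughput."""
    return new_gflops / old_gflops


c2050_ratio = speedup(C2050_SP, C1060_SP)
c2070_ratio = speedup(C2070_SP, C1060_SP)

print(f"C2050 vs C1060: {c2050_ratio:.2f}X")
print(f"C2070 vs C1060: {c2070_ratio:.2f}X")
```

Both ratios are on paper peak numbers only, which is exactly the limitation being argued about above.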

**trinibwoy** · 12-21-2009, 03:11 PM

Originally Posted by Piotrsama

Yes, I realize that FLOPS aren't a real measure, but that didn't stop Nvidia from hyping an 8X improvement in DP FLOPS, did it?

Sure, you can regurgitate Nvidia's marketing numbers, or you can draw your own conclusions based on what we know about the architectures so far.

GT200: MAD + MUL/SFU
Fermi: MAD + SFU

In the case of GT200, they assign the SFU 1 flop. With Fermi, the contribution of the SFU isn't counted, which further under-represents the compute power available. It's probably best to be conservative with Fermi expectations though - it'll limit the disappointment if it bombs.
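A rough sketch of how the counting convention moves the headline number, taking peak SP as cores × shader clock × flops per core per clock. The GT200 inputs (240 cores at 1.296 GHz for the current Tesla) are an assumption not stated in the thread; the Fermi inputs use the 448 cores and 1.25-1.4 GHz range from the Nvidia PDF discussed here. Counting MAD as 2 flops and the extra MUL/SFU slot as 1 follows the convention described above.

```python
def peak_sp_gflops(cores, shader_clock_ghz, flops_per_clock):
    """Paper peak: every core issues its maximum flops every clock."""
    return cores * shader_clock_ghz * flops_per_clock


# GT200 Tesla (assumed: 240 cores at 1.296 GHz).
gt200_with_mul = peak_sp_gflops(240, 1.296, 3)  # MAD (2) + MUL/SFU (1)
gt200_mad_only = peak_sp_gflops(240, 1.296, 2)  # MAD only

# Fermi Tesla (448 cores, clock range from the linked PDF), MAD/FMA only.
fermi_low = peak_sp_gflops(448, 1.25, 2)
fermi_high = peak_sp_gflops(448, 1.40, 2)

print(f"GT200 MAD+MUL: {gt200_with_mul:.0f} GFLOPS")
print(f"GT200 MAD only: {gt200_mad_only:.0f} GFLOPS")
print(f"Fermi: {fermi_low:.0f} to {fermi_high:.0f} GFLOPS")
print(f"Fermi low end vs GT200 MAD only: {fermi_low / gt200_mad_only:.2f}X")
```

With the MUL slot dropped from GT200, the like-for-like gap is much wider than the marketing peaks suggest, which is consistent with the "80% faster if you exclude the mul unit" remark earlier in the thread.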

**annihilat0r** · 12-21-2009, 04:44 PM

**Nedjo** · 12-21-2009, 05:15 PM

Expected! Performance in line of GTX295!

**annihilat0r** · 12-21-2009, 05:56 PM

well that's for Tesla actually

**eric66** · 12-21-2009, 06:11 PM

Is 1.4 GHz the shader clock or the GPU clock?

**Chumbucket843** · 12-21-2009, 06:26 PM

^^shader, tesla does not have a core clock.

Wow, memory clocks are quite low. That's only 50% faster than DDR3 and not much faster than the 1600MHz (effective) GDDR3 of the Tesla C1070. It is still probably a major source of power consumption, so they really need some 2Gb GDDR5 ICs. Judging by the voltage of Tesla, which is probably higher than GeForce (like Xeons/Opterons are to Phenom/Core), it could hit the original target clocks if frequency scales properly and power consumption is still within reason.

**zalbard** · 12-21-2009, 06:50 PM

Professional cards are usually lower clocked than gamer ones anyway.
But doesn't look bad IMO, does it?

**trinibwoy** · 12-21-2009, 06:54 PM

1 GHz GDDR5 is 88% more bandwidth than the 800 MHz GDDR3 on the C1070. Don't see that being a problem at all.
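The 88% figure can be reproduced with a back-of-the-envelope bandwidth calculation. The bus widths here are assumptions not stated in the thread (512-bit for the GT200-based Tesla, 384-bit for Fermi), as are the usual transfer rates of 2 per clock for GDDR3 and 4 per clock for GDDR5.

```python
def bandwidth_gb_s(bus_width_bits, mem_clock_ghz, transfers_per_clock):
    """Peak memory bandwidth in GB/s: bytes per transfer x transfers per second."""
    return (bus_width_bits / 8) * mem_clock_ghz * transfers_per_clock


# Assumed: 512-bit bus, 800 MHz GDDR3, 2 transfers per clock.
gddr3_tesla = bandwidth_gb_s(512, 0.8, 2)

# Assumed: 384-bit bus, 1 GHz GDDR5, 4 transfers per clock.
gddr5_fermi = bandwidth_gb_s(384, 1.0, 4)

ratio = gddr5_fermi / gddr3_tesla
print(f"GDDR3: {gddr3_tesla:.1f} GB/s, GDDR5: {gddr5_fermi:.1f} GB/s")
print(f"Increase: {(ratio - 1) * 100:.1f}%")
```

Under those assumptions the narrower bus is more than offset by the higher per-pin data rate.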

**ajaidev** · 12-21-2009, 06:58 PM

Nvidia castrates Fermi to 448SPs

"IT LOOKS LIKE we were right about Fermi being too big, too hot, and too late, Nvidia just castrated it to 448SPs. Even at that, it is a 225 Watt part, slipping into the future.

The main point is from an Nvidia PDF first found here. On page 6, there are some interesting specs, 448 stream processors (SPs), not 512, 1.40GHz, slower than the G200's 1.476GHz, and the big 6GB GDDR5 variant is delayed until 2H 2010. To be charitable, the last one isn't Nvidia's fault, it needs 64x32GDDR5 to make it work, and that isn't coming until 2H 2010 now..."

http://www.semiaccurate.com/2009/12/...-fermi-448sps/

Ouch

**safan80** · 12-21-2009, 08:45 PM

Originally Posted by ajaidev

Nvidia castrates Fermi to 448SPs

"IT LOOKS LIKE we were right about Fermi being too big, too hot, and too late, Nvidia just castrated it to 448SPs. Even at that, it is a 225 Watt part, slipping into the future.

That's why you wait for Nvidia to refresh Fermi. Nvidia has been too predictable lately, with the poorly designed GTX 280 (refreshed to 285 which still needs better cooling) and their dual pcb sandwiches.

**annihilat0r** · 12-21-2009, 08:46 PM

which node will a Fermi refresh be on? 32?

**[XC] gomeler** · 12-21-2009, 09:45 PM

Fermi is on 40nm. Weren't the last few rounds of GPUs done on half-nodes like 55nm and 40nm, compared to 65nm and 45nm for CPUs? So maybe 28nm for the Fermi refresh, if TSMC does the right spirit dance by 2011?

**tajoh111** · 12-21-2009, 11:41 PM

If those are the specs for the consumer card, that's really disappointing.

With the refresh, Nvidia can probably make do with 40nm, since the problem seems to be more about faults in the process and yields than anything else.

If these things are held back by heat, they will overclock like crazy (at least under LN2).

I am really surprised they are not getting higher clocks, because from what I remember, the MUL was what made the clock drop from the 9800 GTX to the GTX 280.

I am guessing ECC has a role to play in the clock drops all around, and without it in the consumer version, hopefully they can bring the clocks up a bit.

**Cybercat** · 12-22-2009, 12:14 AM

It's pretty bad to be stuck with 448 SPs when they've been promising 512 up and down since GDC. I mean they couldn't even do 480? And they have to do that for the whole product line, they can't give the more expensive cards better binning? Something must be REALLY wrong with it then.

Here's to hoping A3 works out for them.
