Fermi: What went wrong, from the mouth of Jen-Hsun Huang

**bamtan2** · 09-23-2010, 08:47 AM

http://www.golem.de/1009/78179.html

Very honest of him. It is clear now that Fermi is old news for him. He is talking about history.

**Manicdan** · 09-23-2010, 08:48 AM

translated:
http://translate.google.com/translat...Ds%26prmd%3Div

**bamtan2** · 09-23-2010, 08:50 AM

It sounds like they tried a complex technique that the simulations implied would work, but it came back from TSMC and didn't. That is why they had problems and AMD didn't. AMD didn't use such a complicated "fabric".

But more importantly, if I fill in the blanks, it sounds as though the design guys over-reached, and the hardware guys could have prevented the problem if they had more design responsibility. So a management change was implemented. Design and hardware, theory and practicality, are more tightly coupled.

**xdan** · 09-23-2010, 09:10 AM

That this could happen at all was, according to Jen-Hsun Huang, not just a technical problem. Rather, there was no dedicated development department for the fabric. "My engineers, dealing with architecture, and those dealing with the physics, sit in two different departments," said Huang. He continued: "The management lesson we have learned: There should always be a chief pilot - for everything in our business is complicated."

So the physics guys' push for so many SPs and so much L2 cache wrecked the engineers' work. They wanted too much from Fermi, which seemed to work on paper but didn't in practice. The "software guys" forgot that there are physics and power laws to respect.
The two departments didn't communicate well.

**damha** · 09-23-2010, 09:21 AM

Their g80 design is no longer scalable is what he is trying to say. It took a lot of trial and error to make it work and in the end it did, with a surface temp similar to that of the sun.

Lesson learned: change is good and smaller process cannot always save your a$$.

**Chumbucket843** · 09-23-2010, 09:24 AM

Originally Posted by bamtan2

It sounds like they tried a complex technique that the simulations implied would work, but it came back from TSMC and didn't. That is why they had problems and AMD didn't. AMD didn't use such a complicated "fabric".

AMD's chips, or for that matter any integrated circuit, also have a "fabric". it is just a lay term to create an idea or image of what an interconnect is like. keep in mind this still doesn't explain A2's problems.

**trinibwoy** · 09-23-2010, 09:28 AM

That's why people invented ring buses. Nvidia is in love with crossbar networks but at some point you just end up with too many wires. Like he said, there could be other problems plaguing the next architecture but this specific issue won't be repeated obviously.

Interesting that it had nothing to do with silicon defects or die-size like all the resident armchair engineers have been suggesting for the last year.

**Chumbucket843** · 09-23-2010, 09:37 AM

Originally Posted by damha

Their g80 design is no longer scalable is what he is trying to say. It took a lot of trial and error to make it work and in the end it did, with a surface temp similar to that of the sun.

Lesson learned: change is good and smaller process cannot always save your a$$.

not in the least. g80 and fermi are very different, not even the same ISA.

what he is saying, and keep in mind this is just for A1 silicon, is that they trusted TSMC's software tools which gave them misleading information and led to the failure of the interconnect.

**LordEC911** · 09-23-2010, 10:34 AM

Originally Posted by Chumbucket843

not in the least. g80 and fermi are very different, not even the same ISA.

what he is saying, and keep in mind this is just for A1 silicon, is that they trusted TSMC's software tools which gave them misleading information and led to the failure of the interconnect.

Why should they have trusted TSMC when they were having "similar" problems with 40nm GT2xx chips?
At some point you have to take a step back and say to yourself, we must be missing something. Nvidia management never did.

**Helmore** · 09-23-2010, 10:46 AM

Originally Posted by xdan

So the physics guys' push for so many SPs and so much L2 cache wrecked the engineers' work. They wanted too much from Fermi, which seemed to work on paper but didn't in practice. The "software guys" forgot that there are physics and power laws to respect.
The two departments didn't communicate well.

You're misunderstanding what Huang means by the physics guys and the engineers. The physics guys are responsible for the actual implementation of the architecture, while the engineers came up with the architecture. That's how Huang used the two terms here. In other words, the engineers came up with an architecture with lots of SPs and L2 cache, but that's not what caused the delay in Fermi, although it is part of the reason why it's so hot.

**trinibwoy** · 09-23-2010, 11:03 AM

Originally Posted by LordEC911

Why should they have trusted TSMC when they were having "similar" problems with 40nm GT2xx chips?
At some point you have to take a step back and say to yourself, we must be missing something. Nvidia management never did.

Where did you see they were having similar problems with GT2xx?

**safan80** · 09-23-2010, 01:03 PM

At least he is admitting that they took the wrong approach with Fermi. I believe he is wrong, though, about "Now is time for innovation, not for integration." I believe it is time for innovation and integration. From a business standpoint he should get more software developers to use CUDA to take advantage of the GPU. It's not about the GPU just doing games anymore; it's about the apps it can run. The GPU is more powerful than a CPU at certain tasks, and video encoding is one of them. If Nvidia takes too long developing the successor to Fermi this time around, I don't think they can survive in the computer GPU market any longer. They will have to switch to the mobile market as their primary means of profit, or the company will cease to exist.

**~~Svnth~~** · 09-23-2010, 01:09 PM

Originally Posted by safan80

At least he is admitting that they took the wrong approach with Fermi. I believe he is wrong, though, about "Now is time for innovation, not for integration." I believe it is time for innovation and integration. From a business standpoint he should get more software developers to use CUDA to take advantage of the GPU. It's not about the GPU just doing games anymore; it's about the apps it can run. The GPU is more powerful than a CPU at certain tasks, and video encoding is one of them. If Nvidia takes too long developing the successor to Fermi this time around, I don't think they can survive in the computer GPU market any longer. They will have to switch to the mobile market as their primary means of profit, or the company will cease to exist.

Cool, I'll pm you next time a thread needs doomsday predictions. Lol. NV has $3B in the bank btw. Just as an FYI.

**tajoh111** · 09-23-2010, 01:14 PM

Originally Posted by safan80

At least he is admitting that they took the wrong approach with Fermi. I believe he is wrong, though, about "Now is time for innovation, not for integration." I believe it is time for innovation and integration. From a business standpoint he should get more software developers to use CUDA to take advantage of the GPU. It's not about the GPU just doing games anymore; it's about the apps it can run. The GPU is more powerful than a CPU at certain tasks, and video encoding is one of them. If Nvidia takes too long developing the successor to Fermi this time around, I don't think they can survive in the computer GPU market any longer. They will have to switch to the mobile market as their primary means of profit, or the company will cease to exist.

I think they actually need to stop integrating to some extent, at least with their product line.

To me, it's going to be hard to make a competitive CGPU device that also does great at games.

It makes the silicon too big and makes the product a jack of all trades, master of none. They are lucky that there are no products to make their professional products look bad (FireGL products are POS because they were never supposed to be CGPU products in the first place). But at the same time, Fermi is really inefficient for gaming and, when you consider power consumption, is not a good gaming part (at least the GF100 variety).

I think they need to research two different lines: one for computing and one for games. I think they need to specialize these two lines, because making a single product to do both has diluted Fermi's performance and has prevented it from being a full-blown product, with every part being a salvage part.

**safan80** · 09-23-2010, 01:15 PM

Originally Posted by Svnth

Cool, I'll pm you next time a thread needs doomsday predictions. Lol. NV has $3B in the bank btw. Just as an FYI.

LOL It's just all those business classes that I'm taking. Since they have $3B in the bank they should start one

Originally Posted by tajoh111

I think they actually need to stop integrating to some extent, at least with their product line.

To me, it's going to be hard to make a competitive CGPU device that also does great at games.

It makes the silicon too big and makes the product a jack of all trades, master of none. They are lucky that there are no products to make their professional products look bad (FireGL products are POS because they were never supposed to be CGPU products in the first place). But at the same time, Fermi is really inefficient for gaming and, when you consider power consumption, is not a good gaming part (at least the GF100 variety).

I think they need to research two different lines: one for computing and one for games. I think they need to specialize these two lines, because making a single product to do both has diluted Fermi's performance and has prevented it from being a full-blown product, with every part being a salvage part.

What did Fermi in was the fact that they relied on a computer simulation to design the card, when they should have listened more to their engineers. Architects do this when designing buildings, and the design ends up being changed during construction because certain things just cannot be done while keeping the building standing. I think Nvidia looked at Boeing and the way they made the 777 and tried to apply that to making a super video card.

**trinibwoy** · 09-23-2010, 01:16 PM

Nvidia doesn't have a problem with integration or innovation. Just execution.

**-Boris-** · 09-23-2010, 01:28 PM

Originally Posted by trinibwoy

That's why people invented ring buses. Nvidia is in love with crossbar networks but at some point you just end up with too many wires. Like he said, there could be other problems plaguing the next architecture but this specific issue won't be repeated obviously.

Interesting that it had nothing to do with silicon defects or die-size like all the resident armchair engineers have been suggesting for the last year.

The number of wires increases exponentially with the number of units and the die size, just as interference does with the number of wires. A 200mm² chip has a much simpler crossbar and wouldn't suffer from this at all. So the crossbar simply didn't scale to the huge chips nVidia wanted to build. A ringbus might have helped, but we can't know if it's enough. Since the redesigned GF100 with a new fixed crossbar wasn't enough to make the chip entirely functional it seems like size was an important factor.
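To put rough numbers on the crossbar-versus-ring argument, here is a minimal sketch. It assumes an idealized full crossbar (every source port wired to every sink port) and a simple unidirectional ring; it is not a model of Fermi's actual interconnect, and the unit counts and function names are made up for illustration. Under those assumptions the crossbar grows with the product of its ports, which is quadratic (steep, though not literally exponential), while the ring grows linearly.

```python
def crossbar_crosspoints(sources: int, sinks: int) -> int:
    """Crosspoints in an idealized full crossbar: each source reaches each sink."""
    return sources * sinks


def ring_links(nodes: int) -> int:
    """Point-to-point links in a unidirectional ring: one outgoing link per node."""
    return nodes


def compare(unit_counts):
    """For n sources and n sinks, return (n, crossbar crosspoints, ring links).

    The ring connects all 2 * n endpoints in a single loop (an assumption).
    """
    return [(n, crossbar_crosspoints(n, n), ring_links(2 * n)) for n in unit_counts]


if __name__ == "__main__":
    # Hypothetical unit counts, chosen only to show the growth trend.
    for n, xbar, ring in compare([4, 8, 16, 32]):
        print(f"{n:>2} units per side: crossbar {xbar:>4} crosspoints, ring {ring:>2} links")
```

The sketch only counts connections. It ignores wire length, bus width, and latency, and latency is where a ring pays for its simpler wiring.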

EDIT:
I think GF100 was nVidia's R600 or Prescott. I think their upcoming chips will be a lot more efficient per mm². With R600 ATi learned that a huge 512 ringbus didn't pay off. And I've heard that since the failure with Prescott Intel only makes changes that produce at least a 2% performance increase for every 1% power increase.

**trinibwoy** · 09-23-2010, 02:08 PM

Originally Posted by -Boris-

The number of wires increases exponentially with the number of units and the die size, just as interference does with the number of wires. A 200mm² chip has a much simpler crossbar and wouldn't suffer from this at all. So the crossbar simply didn't scale to the huge chips nVidia wanted to build.

The problem wasn't simply one of scale though, design also played a part. The problem was that their simulations told them it would work and when they got chips back they found out those simulations were wrong. If it was caught up front their A1 might have looked like their A3.

A ringbus might have helped, but we can't know if it's enough. Since the redesigned GF100 with a new fixed crossbar wasn't enough to make the chip entirely functional it seems like size was an important factor.

Yep there definitely are still yield problems hence the disabled SM. But that's a far cry from completely non-functional chips as seems to have been the case with the A1 interconnect problems.

**Humminn55** · 09-23-2010, 02:13 PM

Originally Posted by safan80

LOL It's just all those business classes that I'm taking. Since they have $3B in the bank they should start one

Where do you see they have $3B in the bank? Nowhere I've looked at their financials shows any sort of figure like that.....which "$3B in the bank", as you put it, would refer to cash reserves. Latest statement, as of March 2010, showed $1.7B in cash reserves, not $3B.

So, can you please link to the $3B figure? Thanks!

**rogueagent6** · 09-23-2010, 02:16 PM

Originally Posted by Humminn55

Where do you see they have $3B in the bank? Nowhere I've looked at their financials shows any sort of figure like that.....which "$3B in the bank", as you put it, would refer to cash reserves. Latest statement, as of March 2010, showed $1.7B in cash reserves, not $3B.

So, can you please link to the $3B figure? Thanks!

I think he was responding to the quote below.

Originally Posted by Svnth

Cool, I'll pm you next time a thread needs doomsday predictions. Lol. NV has $3B in the bank btw. Just as an FYI.

**trinibwoy** · 09-23-2010, 02:29 PM

They have 3.7b in assets, 2.2b of which is cash or equivalent.

**Humminn55** · 09-23-2010, 02:30 PM

Originally Posted by rogueagent6

I think he was responding to the quote below.

Yeah, I keep seeing that figure bandied about like it's a "fact"....but the facts don't add up to anywhere near that piece of misinformation. Guess someone needs to learn to read financial reports.

**Humminn55** · 09-23-2010, 02:36 PM

Originally Posted by trinibwoy

They have 3.7b in assets, 2.2b of which is cash or equivalent.

Cash and Short Term Investments.......$1,728.23 (in millions), from NV's 10-K filing on 3/2010.

$3B in assets is NOT "in the bank". Inventory isn't exactly money in the bank and neither is their specialized equipment, property, etc. True, one can borrow against physical capital assets, but they're far from "in the bank" money.

And NV's liabilities wipe out their assets......making their company a zero sum company.

The sad part of NV's structure right now is that their net profit margin is only 6.36%, which is horrible. Compare that to Intel's (23.08%) or AMD's (24.92%).

**Sn0wm@n** · 09-23-2010, 02:37 PM

Originally Posted by Svnth

Cool, I'll pm you next time a thread needs doomsday predictions. Lol. NV has $3B in the bank btw. Just as an FYI.

nope ... they have a value of 3bn maybe .. but not 3bn in the bank ..

**570091D** · 09-23-2010, 02:40 PM

Originally Posted by Humminn55

Where do you see they have $3B in the bank? Nowhere I've looked at their financials shows any sort of figure like that.....which "$3B in the bank", as you put it, would refer to cash reserves. Latest statement, as of March 2010, showed $1.7B in cash reserves, not $3B.

So, can you please link to the $3B figure? Thanks!

you ever checked nvidia's SEC filings?
