AMD's Answer to Conroe? AMD's upcoming Rev G processors

**sladesurfer** · 05-27-2006, 06:01 PM

http://129.15.202.185/athlon_rev_g/wtf_mates.html

Athlon Rumor Mill has posted some exceptional information regarding AMD's upcoming Rev G processors. Somehow, the editor managed to obtain die shots of both a current Revision F processor and an upcoming Revision G CPU. An animated clip overlays the two images and reveals some significant architectural changes. The wild card here is the use of two different stains in the die shots, which can make the same objects look radically different. Regardless, the animation does seem to hint at some major changes, though we'll wait for official word from AMD before we dive further into this rumor.

**HiJon89** · 05-27-2006, 06:08 PM

Originally Posted by sladesurfer

AMD's Answer to Conroe?

Chapter 7 Bankruptcy

**n00b 0f l337** · 05-27-2006, 06:09 PM

Not different stains, just an inversion of color. Sounds like it's a fake made by copy-paste and rotation.

You mean Chapter 11?

**Mikesta** · 05-27-2006, 06:11 PM

NICE FIND.

Can anyone possibly 'guesstimate' the benefits of an added out-of-order L2 read/write buffer and an extra complex decoder?

**ozzimark** · 05-27-2006, 06:34 PM

Originally Posted by n00b 0f l337

Not different stains, just an inversion of color. Sounds like it's a fake made by copy-paste and rotation.

You mean Chapter 11?

odd that they call it rev G, but look at this. i pointed this out elsewhere about a week ago:

does the "secret" die shot look familiar?

**Thorry** · 05-27-2006, 06:59 PM

Well, I can. I won't reveal how or what regarding this pic; everybody knows it came from tweakers.net and which guys they discussed it with.

The extra decoder means the pipeline gets filled much faster, giving an overall boost in pure ALU performance. In the 'older' cores, more complex instructions would even have to be broken up an extra time (they get broken up anyway), so we'll see SSE3 performance rise a lot.

Also, they'll prolly combine this fast feeder with more powerful ALUs comparable to what Conroe has. That means excellent ALU performance (clearly visible in synthetic ALU benches like Dhrystone, Whetstone, PiFast, SuperPi etc). How fast the new decoders and ALUs will be is a total guess; we'll have to wait.

The out-of-order L2 has been a heavily requested feature, and its absence is the part AMD has gotten a lot of bad press for (in the techy world, that is). It's kinda hard to explain how it really works without writing up a 3000 word essay with a lot of technobabble, so I'm not going to do that. I'll try to explain why they did it instead of what it does:

It's well known that cache isn't as important for K8 as it is for NetBurst, for a couple of reasons:

Shorter pipelines, so pipeline stalls have less of an impact (though they are still relatively very bad)
Reasonable prefetchers and branch predictors (not as good as Intel's NetBurst ones)
Very high speed memory interface (so pipeline stalls can be fixed much faster)
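
To put rough numbers on the first point, here is a minimal sketch of how pipeline depth feeds into the cost of a mispredicted branch. The branch frequency, the mispredict rate, and the flush penalties of 12 and 31 cycles are illustrative assumptions (a penalty roughly equal to pipeline depth), not measured K8 or NetBurst figures.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once branch mispredict flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


# Assumed workload: 20% of instructions are branches, 5% of those are mispredicted.
short_pipe = effective_cpi(1.0, 0.20, 0.05, flush_penalty_cycles=12)
long_pipe = effective_cpi(1.0, 0.20, 0.05, flush_penalty_cycles=31)

print(f"short pipeline: {short_pipe:.2f} CPI")
print(f"long pipeline:  {long_pipe:.2f} CPI")
```

With the same predictor quality, the deeper pipeline pays more for every flush, which is why it leans harder on cache and prediction.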

There are a few problems when you try to increase the performance of the cache and the predictors/prefetchers. You can simply increase the cache size, which reduces cache misses and thus pipeline stalls. You can also improve the prefetchers and predictors.

However, cache is expensive: the cells are simple memory circuits, but they need to operate at core speed. With a lot of cache, yields can go down. Compare it to TFT panels: even if yields are very good and only 1 in, say, 100,000,000 cache circuits goes wrong, the bigger the cache, the bigger the chance that it contains a fault.
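
The yield argument can be sketched in a few lines. This assumes every cell fails independently at the "1 in 100,000,000" rate quoted above, counts one cell per bit, and ignores the spare rows and columns that real caches typically use to repair defects, so treat the output as an illustration of the trend only.

```python
def chance_cache_is_faulty(num_cells, defect_rate_per_cell):
    """Probability that at least one cell is defective, assuming independent defects."""
    return 1.0 - (1.0 - defect_rate_per_cell) ** num_cells


DEFECT_RATE = 1 / 100_000_000   # the "1 in 100,000,000" figure from the post
BITS_PER_MB = 8 * 1024 * 1024   # assumption: one memory cell per bit

for size_mb in (1, 2, 4, 8):
    p = chance_cache_is_faulty(size_mb * BITS_PER_MB, DEFECT_RATE)
    print(f"{size_mb} MB cache: {p:.1%} chance of at least one bad cell")
```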

Cache also increases die size and uses up a lot of resources. The returns diminish quickly too: you need to increase the cache size exponentially to keep getting the same improvement. So that's something you only do if you're desperate (like we've seen Intel do with Xeons).
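
A rough way to picture the diminishing returns: a common rule of thumb (an assumption here, not something taken from AMD or Intel data) is that miss rate falls with roughly the square root of cache size. The 10% starting miss rate at 1 MB is made up for illustration.

```python
def miss_rate(cache_mb, base_miss_rate=0.10, base_mb=1.0, exponent=0.5):
    """Rule-of-thumb power law: miss rate scales as size ** -exponent."""
    return base_miss_rate * (cache_mb / base_mb) ** -exponent


for size_mb in (1, 2, 4, 8, 16):
    print(f"{size_mb:>2} MB: {miss_rate(size_mb):.2%} miss rate")
```

Under this model every doubling of the cache buys a smaller absolute drop in misses than the doubling before it, while the die area keeps doubling.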

The other way is to improve the predictors and prefetchers. This is not only difficult (the circuits become very complex and hard to design and fabricate), it also requires a LOT of extra circuitry to make even small improvements. We've all seen the Presler die pics and how much of the core was actually dedicated to the prefetchers. Intel has a lot of experience from NetBurst; AMD does not.

The branch predictors and prefetchers are expensive to design but not as expensive to produce. Although the die size increases, it isn't as bad as with cache, where die size increases exponentially. There is however a BIG downside: these circuits get hot, and because they switch a lot, micro-wear becomes an issue as well.

Because AMD has developed this new, denser cache technology, we'll prolly see an increase from 2MB to 4MB or maybe 8MB. This will improve performance. AMD has prolly done a lot of research into where the 'cut-off' point lies between more expensive CPUs and improved performance, and prolly found a good point. (Don't think the CPUs will become more expensive than they are now: they'll release single-core and Sempron versions with less cache, and while the AM2 CPUs are expensive now, prices will drop, making room for the new CPUs at today's prices.)

The out-of-order L2 buffer helps the new branch predictors and prefetchers, so that's prolly why they did it. On its own it has almost no impact on performance, but it is needed to get better branch predictors and prefetchers. It also takes away an argument from most techies who want to badmouth AMD.

Rev G will improve performance, and with the die shrink to 65nm we could see clock speeds up to 4 GHz. This, however, isn't AMD's answer to Conroe. AMD and Intel are out of sync: the answer from one hits the market about 6 months after the release of the other.

It's also important to understand the way AMD runs its factories; it's completely different from Intel's fabs. AMD actually has a system where yields automatically get better and better. They have techniques to correct faults in the production of the CPUs (and in the whole process before the actual CPUs are made). They can also implement design improvements on a weekly basis (where Intel needs as much as a month or two to implement changes).

I've got a report written by Wouter Tinus in Dutch about AMD's fabs; it's actually amazing what they've got.

I know my post doesn't always make stuff any clearer, but that's because it is a complex world, the world of computers and especially CPUs.

**krisssi** · 05-27-2006, 07:38 PM

I'll take your word for it.

**Thorry** · 05-27-2006, 07:42 PM

Wow, you actually registered Dec 2004 and this is your first post?

Welcome

(One might say it's very un-nn of you :P)

**HiJon89** · 05-27-2006, 07:49 PM

Originally Posted by Thorry

Wow, you actually registered Dec 2004 and this is your first post?

Welcome

(One might say it's very un-nn of you :P)

He's a man of few words

**nn_step** · 05-27-2006, 07:54 PM

they forgot the SSE changes..

**metro.cl** · 05-27-2006, 08:00 PM

Originally Posted by Thorry

Well, I can. I won't reveal how or what regarding this pic; everybody knows it came from tweakers.net and which guys they discussed it with.

nice read.

**STEvil** · 05-27-2006, 08:02 PM

ok, so here is a question: what part of a pipeline usually "stalls" ?

If buffers/caches were placed along the pipe or a second one added to work in "SLI/Crossfire" when the first fails (And the "pipes" can be loaded to be executed one after the other in case of stalls) would that not prove exceedingly beneficial?

The mini buffers/caches could even help to hold data so that if the pipe stalls, the work is dumped and their data is loaded in while new work is lined up to refill the pipe(s). This could make the single pipe look like two (or two become 3+) in effect.

**VulgarHandle** · 05-27-2006, 08:32 PM

yeah, i think i watched a documentary on Discovery/Science/etc. (one of those) on AMD, and they talked about how their fabs basically would learn to do things better and better... though i'm sure they meant they make it modular, so they can implement changes quickly

edit: on topic, i hope rev. g turns out well...

also, will k8l work on sAM2?
if K8L allows for ddr3, will they fit in ddr2 slots?

**WeStSiDePLaYa** · 05-28-2006, 02:30 PM

if you look, there is also extra "boxes" by the data cache.

**Thorry** · 05-28-2006, 02:53 PM

Originally Posted by STEvil

ok, so here is a question: what part of a pipeline usually "stalls" ?

If buffers/caches were placed along the pipe or a second one added to work in "SLI/Crossfire" when the first fails (And the "pipes" can be loaded to be executed one after the other in case of stalls) would that not prove exceedingly beneficial?

The mini buffers/caches could even help to hold data so that if the pipe stalls, the work is dumped and their data is loaded in while new work is lined up to refill the pipe(s). This could make the single pipe look like two (or two become 3+) in effect.

That's a very good question (shows you have been paying attention :P).

A pipeline is a hard concept to understand, but fortunately there is 1 good website on the internet (actually one on my very short list of good sites on the internet). Ars Technica: http://arstechnica.com/

They've got this great series about the CPU on a technical level. It's a bit hard to understand if you're not into the subject, but they provide some basic knowledge.

There is actually a two-part guide (just to show how complex the concept of a pipeline is) about how a pipeline works, what pipeline stalls are, why they are a bad thing and why they are a fatal flaw in the NetBurst design.

http://arstechnica.com/articles/paed...pelining-1.ars
http://arstechnica.com/articles/paed...pelining-2.ars

They don't really go into what happens when a cache miss or branch prediction fault occurs, but if you read these two articles and read up on how cache works, why it's important etc, you can form a clear image in your mind of what actually happens to the pipeline when a cache miss or branch prediction fault occurs. (I can tell you, it isn't pretty.)

The short answer for those of you that are too lazy to read all this stuff (or simply don't have the time, skills, brain capacity etc):

The pipeline operates at almost the most basic level; any kind of higher intelligent behavior at this level is almost impossible. The benefits would be nice, but the price is almost certainly too high (if the current level of technology can do it at all).

**STEvil** · 05-28-2006, 11:37 PM

good ol'e arse

yes, a multi-staged pipeline would be hard to make..

**Anarki** · 05-29-2006, 12:27 AM

Originally Posted by Thorry

That's a very good question (shows you have been paying attention :P).

Thanks for sharing that article, very informative

**Cobalt** · 05-29-2006, 01:31 AM

As far as I can see, the die shot isn't nearly as impressive as this guy made it out to be. It looks like a more detailed shot, and of course there are the extra "boxes", but apart from that we can't get any real information from it. The extra ALU performance looks like it's specifically designed to meet the new expectations (sub-30s 2M SuperPi, for example), but will that give any extra performance outside of those very specific tasks?

**MaxxxRacer** · 05-29-2006, 02:14 AM

Adding a complex decoder will really help in the fight against Conroe, and here is why. Conroe has 3 simple decoders and 1 complex decoder, whereas AMD has 3 complex decoders. AMD's approach may seem better, but for the most part complex instructions are NOT used, so most of the work can be handled by simple decoders. Furthermore, Intel's Core architecture can break some complex instructions down into simple ones (usually 2, sometimes 3) and thus doesn't lose much performance. Now let's move to AMD's Rev G. With 4 COMPLEX decoders, there is not a snowball's chance in hell that Conroe could keep up clock for clock in decoding, even with its advanced system for breaking down complex instructions. One thing I should point out, though: Intel has what is called "Macro-Op Fusion", which essentially combines 2 x86 instructions into one. This lets the decoders theoretically do 5 decodes per clock, assuming a macro-op fusion is performed and all of the decoders are working at full tilt.
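
The decoder comparison can be played with in a toy model. This is only a sketch under simplifying assumptions: instructions are either simple ('S') or complex ('C'), a complex decoder can also take a simple instruction, decoding is strictly in order, and macro-op fusion is left out. The function name and the sample stream are hypothetical, and real front ends are far more involved.

```python
def clocks_to_decode(stream, simple_decoders, complex_decoders):
    """Count clocks to decode an in-order stream of 'S' (simple) / 'C' (complex).

    Toy model: decode stops for the current clock at the first instruction
    that finds no free decoder.
    """
    if "C" in stream and complex_decoders < 1:
        raise ValueError("stream contains complex instructions but no complex decoder")
    if simple_decoders + complex_decoders < 1:
        raise ValueError("need at least one decoder")
    clocks = 0
    i = 0
    while i < len(stream):
        simple_free, complex_free = simple_decoders, complex_decoders
        while i < len(stream):
            if stream[i] == "C":
                if complex_free == 0:
                    break
                complex_free -= 1
            elif simple_free > 0:
                simple_free -= 1
            elif complex_free > 0:
                complex_free -= 1
            else:
                break
            i += 1
        clocks += 1
    return clocks


STREAM = "SSCSCCSS"  # made-up instruction mix
print("3 simple + 1 complex:", clocks_to_decode(STREAM, 3, 1))
print("3 complex:           ", clocks_to_decode(STREAM, 0, 3))
print("4 complex (rumored): ", clocks_to_decode(STREAM, 0, 4))
```

Changing the share of 'C' in the stream shows how much the answer depends on the instruction mix, which is the crux of the argument above.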

An improved out-of-order loader will help as well, which AMD's K8 architecture is in need of.

nn also pointed out another direly important thing that AMD needs to upgrade: the SSE units. IIRC AMD is currently using 2 64-bit units whereas Intel is using 3 128-bit units which, obviously, have twice the bandwidth (per unit). Because of this, Intel trounces AMD whenever heavily SSE-optimized programs are used. Can somebody say SuperPi!
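
As back-of-the-envelope arithmetic on those SSE figures (taken from the post as recalled, "IIRC", and not verified here), peak datapath width per clock works out as below. It says nothing about how often real code can keep all units busy.

```python
def sse_bits_per_clock(units, width_bits):
    """Peak SSE datapath width per clock: number of units times unit width."""
    return units * width_bits


k8 = sse_bits_per_clock(units=2, width_bits=64)       # figures as stated in the post
conroe = sse_bits_per_clock(units=3, width_bits=128)  # figures as stated in the post

print(f"K8:     {k8} bits/clock")
print(f"Conroe: {conroe} bits/clock")
print(f"ratio:  {conroe / k8:.1f}x")
```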

For more info on this subject, read this article from AnandTech:
http://www.anandtech.com/cpuchipsets...oc.aspx?i=2748

P.S. I hope this revision "G" is true!

**Lightman** · 05-29-2006, 02:25 AM

We need to remember that if Rev G still has a 12/14-stage pipeline, it will be shorter than Conroe's 14-stage pipeline, so at the same clock it may be even faster! Of course, right now we have too little information about Rev G to make any firm judgement about performance.

PS. If AMD plays it the same way as Intel, they should show some pre-production Rev G samples on 23 July (Conroe's market debut).

**Anarki** · 05-29-2006, 06:14 AM

Originally Posted by Lightman

We need to remember that if Rev G still has a 12/14-stage pipeline, it will be shorter than Conroe's 14-stage pipeline, so at the same clock it may be even faster! Of course, right now we have too little information about Rev G to make any firm judgement about performance.

PS. If AMD plays it the same way as Intel, they should show some pre-production Rev G samples on 23 July (Conroe's market debut).

That would be pretty cool, AMD sure keep their advances low-key and quiet!

**dogsx2** · 05-29-2006, 06:16 AM

Even if AMD can match Conroe with a Rev G, can they match the price vs performance?

$316 for a 6600 and $530 for a 6700 are rock bottom prices.

**[XC] leviathan18** · 05-29-2006, 06:33 AM

Originally Posted by dogsx2

Even if AMD can match Conroe with a Rev G, can they match the price vs performance?

$316 for a 6600 and $530 for a 6700 are rock bottom prices.

remember right now amd is running 90nm if revision G is 65nm they will lower prices for sure

**Thorry** · 05-29-2006, 06:39 AM

Originally Posted by Willis

did anyone say 4ghz and 65nano?

Yes, I did actually

**Starscream** · 05-29-2006, 07:04 AM

Originally Posted by Thorry

Well, I can. I won't reveal how or what regarding this pic; everybody knows it came from tweakers.net and which guys they discussed it with.
...
GOOD STORY
...
I know my post doesn't always make stuff any clearer, but that's because it is a complex world, the world of computers and especially CPUs

good read.

to add (dunno if it's fully true, as it's been a while since I've read it):
The system AMD uses to get better and better yields is a piece of patented software.

The software by itself improves the production process and corrects stuff (no idea how).
This allows their fabs to correct stuff and improve things on the fly, where others (like Intel) have to shut down the machines to do this.
