DRESDENBOY! Glad to have you here finally. Keep up the great work.
PS: While we're at it, let's bring Hans into the fold.
Last edited by Mechromancer; 08-27-2009 at 04:50 PM.
Dresdenboy updated his blog. Check out the link on the first page.
How about the rumored reverse hyper-threading from a few years ago? After reading the paper somebody posted a link to here, I keep getting the impression that somebody back then misunderstood CMP and made up the whole reverse hyper-threading hype, since a clustered core can share its resources to work on a single thread. Yet the same paper makes it sound like CMP overall is still quite bad for single-threaded applications.
So is there a way to do CMP without hurting single-thread performance? Somebody mentioned switching it on/off on the fly.
Or is there a way to even boost single-thread performance with CMP, for example by running branches on different parts of a clustered core, or some other way?
One hundred years from now it won't matter what kind of car I drove, what kind of house I lived in, how much money I had in the bank, nor what my clothes looked like... but the world may be a little better because I was important in the life of a child.
-- from "Within My Power" by Forest Witcraft
Generally, AMD patents highlight aspects of a certain micro-architecture
which is the same across a whole set of patents and described in each of them.
These micro-architectures have evolved over the last 10-15 years.
Cluster-based superscalar processing has evolved into cluster-based multi-
threading as a more effective way of doing SMT, which otherwise suffers from
high pressure on the load/store units and the L1 caches, as Fred Weber
already mentioned years ago as AMD's CTO.
The origins of cluster-based superscalar processing as a successor to
superscalar processing go back even further in time, as this DEC patent
filed in 1995 shows: http://www.freepatentsonline.com/6167503.pdf
Regards, Hans
Last edited by Hans de Vries; 08-28-2009 at 12:29 AM.
~~~~ http://www.chip-architect.org ~~~~ http://www.physics-quest.org ~~~~
Well, another case of wrong information by omitting parts of it.
I wrote "Charlie Demerjian seems to know", just for the record. That means I'm not 100% sure about what he knows. He never said that he was just guessing; he said he had seen the microarchitecture during a meeting with some AMD guy.
@thread:
Hello Hans, nice to see you here.
And that's exactly my problem with guys like Mr. Demerjian, Fuad, etc. They were always shown non-public information by "some guy" (an insider, someone who knows an insider, etc.), and most of the time it turns out to be hot air.
Your work, on the other hand, is based on facts, and it's cool to see someone invest so much time in scraping together little pieces and trying to make a picture out of it.
Your gathered information coincidentally fits with one of his articles; maybe he had a reliable source for once. I wouldn't rely on his articles to verify your theory.
Yep, he seems to know things, but even if he does, there's a chance he may mislead people. Something similar happened with the RV770 specs: most websites said it had 480-640 SPs, but it actually had 800 SPs.
Since you and Hans are quite knowledgeable about chip architecture in this thread, it seems the proposed architecture could work in reality. Oh, and any speculation on Sandy Bridge, since we already have a possible die shot of it?
Also, is there a chance that SMT is implemented in a way more similar to POWER7 than to the i7?
Off-topic: the fact that Charlie does have tons of connections can't be denied, and most of his articles are actually very good. This Charlie Demerjian hate spawned because he's the opposite of an Nvidia fanboy, so today even bringing up obvious facts means the green men come into threads typing "CHARLIE DEMERJIAN TEH LULZ!"
Nice to see both Hans de Vries and Dresdenboy in this thread, will be following it closely!
SweClockers.com
CPU: Phenom II X4 955BE
Clock: 4200MHz 1.4375v
Memory: Dominator GT 2x2GB 1600MHz 6-6-6-20 1.65v
Motherboard: ASUS Crosshair IV Formula
GPU: HD 5770
Dresdenboy, concerning the diagram below, specifically the decode stage: you previously stated that it looks like this uarch will be able to handle up to 8 threads if the threads are separate and made up of 4 individual microcode and 4 individual fastpath instructions. Will software engineers need to target this architecture specifically to get the most out of that kind of setup? Will this turn out like another largely unsupported AMD implementation such as SSE4A, or will it make the majority of multi-threaded software run crazy fast?
[image: diagram of the proposed decode stage]
I didn't say 8 threads (just 2), but I wrote about the micro ops. So this means that one thread could have one (probably even more) microcoded (complex) operations decoded, while the other thread could be decoded using the fast path decoders. So one thread with simple ops could be decoded for the appropriate int cluster while the second thread could have some (probably microcoded, see the KGC blog entry) AVX instructions being decoded for the FPU.
Thank you very much. I needed that clarification. Will taking advantage of that capability require specific software enhancements for this architecture or does this look like a method the CPU can handle with any basic software coded with AVX support? As you can tell, I'm no expert by any means, but I want to just have an idea whether or not software engineers have to do AMD specific coding to get these enhancements to work or not (like SSE4A).
It depends on the instruction mix of the two threads, but for any given code it would be nearly impossible to compile and execute threads in a way that makes maximum use of this capability. There might be combinations of threads or processes with, for example, standard integer code in one thread and a lot of microcoded instructions in the other (an AVX version of Prime95, if AVX were microcoded).
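To make the idea concrete, here is a toy model of how two threads might share a decode stage in one cycle, one using the fastpath decoders and the other the microcode path. This is purely a sketch of the concept being speculated about in this thread; the decoder counts and the one-microcode-op-per-cycle limit are assumptions, not confirmed Bulldozer details.

```python
# Toy model: two threads share the decode stage in one cycle.
# Assumption (illustrative only): 4 fastpath decode slots, plus a
# microcode path that can service one complex op per cycle.

def decode_cycle(thread_a, thread_b, fastpath_slots=4):
    """Decode up to fastpath_slots fastpath ops from thread_a and one
    microcoded op from thread_b in the same cycle.
    Returns (ops decoded for A, ops decoded for B)."""
    decoded_a = 0
    while thread_a and thread_a[0] == "fastpath" and decoded_a < fastpath_slots:
        thread_a.pop(0)
        decoded_a += 1
    decoded_b = 0
    if thread_b and thread_b[0] == "microcoded":
        thread_b.pop(0)  # this op occupies the microcode path
        decoded_b = 1
    return decoded_a, decoded_b

# Example: an integer-heavy thread next to a microcode-heavy (e.g. AVX) thread
t_a = ["fastpath"] * 6
t_b = ["microcoded"] * 2
print(decode_cycle(t_a, t_b))  # -> (4, 1): both threads make progress
```

The point of the toy model is just that neither thread blocks the other as long as their op mixes differ, which is exactly the lucky-combination case described above.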
Given that all previous CPUs indicate that microcoded operations are rare, dedicating an excessive number of transistors to them for a single thread would be wasteful.
Fastpath decoding is more energy efficient and cheaper in transistor count.
Thus the logical outcome is that if there are 8 decoders, at most 2 of them would be microcode decoders and the remaining 6 should be fastpath.
Fast computers breed slow, lazy programmers
The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
http://www.lighterra.com/papers/modernmicroprocessors/
Modern RAM makes an old overclocker miss BH-5 and the fun it was
The microcode decoders aren't necessarily full microcode decoders (like the one vector decode/microcode unit in K7-K10) but could be partial decoders. However, they just fetch microcode from the microcode ROM. That's an efficient task (like a table lookup), as opposed to a fully hardwired fastpath decode unit, which needs to switch a lot of transistors to translate an x86 ISA instruction into the right micro-ops for the internal execution engines.
There's some interesting discussion about this topic over at AMDZone. Check it out: http://www.amdzone.com/phpbb3/viewto...st=0&sk=t&sd=a
Adobe is working on Flash Player support for 64-bit platforms as part of our ongoing commitment to the cross-platform compatibility of Flash Player. We expect to provide native support for 64-bit platforms in an upcoming release of Flash Player following the release of Flash Player 10.1.
Seems we made our greatest error when we named it at the start
for though we called it "Human Nature" - it was cancer of the heart
CPU: AMD X3 720BE@ 3,4Ghz
Cooler: Xigmatek S1283(Terrible mounting system for AM2/3)
Motherboard: Gigabyte 790FXT-UD5P(F4) RAM: 2x 2GB OCZ DDR3 1600Mhz Gold 8-8-8-24
GPU:HD5850 1GB
PSU: Seasonic M12D 750W Case: Coolermaster HAF932(aka Dusty)
Bulldozer may very well not have CMT, but the generation after, or even refreshes (like K8 to K10), may. One thing that may be safe to assume is that Bulldozer will have a much higher IPC than anything today. Lots of cores with high IPC per core is just fine and doesn't suffer from the weaknesses of certain multi-threading implementations. I still can't imagine what "commonly held beliefs" JF is talking about. It had better be exciting!
And how exactly do you get that "high IPC"? K10 and Core 2 barely manage 1 IPC on average; Nehalem does slightly better (think 1.2) while having by far the most advanced prefetchers in the world (thanks to Netburst). That is less than 1/3 of what is possible.
The reason? x86. It simply lacks the ingredients for extracting high IPC, which is why Netburst was designed in the first place. You can improve performance in 2 ways:
-increase IPC
-increase frequency
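The two levers multiply: performance is roughly IPC times frequency. Here is a trivial numeric sketch using the ballpark IPC figures mentioned in this post; the clock speeds are illustrative, not measured data.

```python
# Rough model: performance ~ IPC x frequency.
# Figures are illustrative ballpark numbers, not benchmarks.

def perf(ipc, ghz):
    """Instructions retired per second for a given average IPC and clock."""
    return ipc * ghz * 1e9

nehalem_like  = perf(1.2, 3.2)  # modest IPC, moderate clock
netburst_like = perf(0.8, 3.8)  # lower IPC, higher clock (illustrative)
print(nehalem_like > netburst_like)  # -> True
```

The sketch shows why a frequency-first design needs a big clock advantage to make up for even a small IPC deficit, which is the trade-off this post goes on to describe.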
Since IPC is so hard to get, Intel decided to try to reach as high a frequency as possible. Of course, discovering new territory back in the late '90s with 180/130/90nm proved quite a challenge. Without a thermal limit, they would probably have reached POWER6-like frequencies (5-6GHz).
This was one of the paths they took. The other was EPIC, a completely new instruction set designed for parallelism from the ground up. We're talking about an instruction set that tells the processor everything it needs to know, because the optimizations happen at compile time (the compiler inserts hints in the code flow: process this, then that, etc.). In x86 the CPU needs to find the IPC at run time; well, most of the time you're SOL.
Itanium is able to get 4-5 IPC, but by being so wide it lacks frequency.
Basically it boils down to this: if you chase IPC, you lose frequency because of the complexity (prefetchers, decoders, run-ahead, scouts); if you chase frequency, you can't have high IPC.
The solution is in the middle: K8/K10, Core 2/Nehalem. To expect something revolutionary, like very high IPC in x86, is simply wishing for pigs to fly. While some do try, most fail.
Why do you think Bulldozer was delayed some 2+ years? Where are they now compared with the original expectations? Could it be that Bulldozer is more Niagara-like, lots of small, simpler cores with accelerators + GPUs for hard crunching (FP)? That's the path everyone seems to be taking, not uber Alpha EV8-like cores.
It is pretty heavy stuff and my knowledge of CPU design is very limited, but as far as I understand it, Dresdenboy's plan describes a CPU which tries to get an IPC of ~2 on the integer clusters, or at least an IPC over 1, through heavy speculation by one integer cluster to keep the other as busy as possible.
Last edited by Boschwanza; 09-01-2009 at 04:48 AM.
Core i7 is the long evolution of the Pentium Pro, and Phenom II of the K7, and probably goes a bit further back than that as well.
Bulldozer is more or less something completely new, built from the ground up. Pretty much the only thing this CPU will share with K10.5 and Nehalem/Westmere is x86.