Hi cmaier,
as you seem to have a lot of insight into the development of Bulldozer, you might find the following quote interesting. It contains many bits of information and I received it from someone (not a native English speaker, it seems) whom I don’t want to disclose. Maybe you can confirm it. He also sent me a slide, which has to be kept secret, so I can’t post it here. I also heard of an unchanged instruction cache, 16k 4-way L1 data caches per core and up to 2M of L2 per module.
“So as you can see, there are 4 full integer pipelines per core, capable of executing up to 4 instructions per cycle or, at lower utilization, running both paths of a branch, which eliminates the branch misprediction penalty.
It can fetch from several threads (program pointers) in alternation, including possible branch targets. For that to work, the branch prediction unit (BPU) tries to identify branches and their targets and steers the instruction fetch unit (IFU). If the instruction queues of the units to be fed are already well filled, the IFU/BPU pair tries to prefetch code to avoid idle cycles. Even if only 50% of all such prefetches bring in the right code bytes, that is still better than having no code ready at all; in reality this hit rate is even better.
After a block of 32 code bytes is fetched and queued in an instruction fetch queue, the decode unit receives such a packet each cycle for decoding. To decode it quickly, there are four dedicated decode subunits, each of which can decode most x86 and SIMD instructions on its own at a rate of 1 per cycle per subunit. Rarely used or complex instructions are decoded using microcode storage (ROM and SRAM), which can happen in parallel with the decoding of the "simple" instructions. There are no inefficiencies like in K10. XOP and AVX instructions are decoded either one per subunit (if the operand width is <= 128 bit) or one per two subunits (256 bit, similar to the double decode of SSE and SSE2 instructions in K8). The results are "double mops" (pairs of micro-ops, similar to the former MacroOps). After decoding finishes, the double mops (which can have one unused slot) are sent to the dispatch unit, which prepares packets of up to four double mops (dispatch packets) and dispatches them to the cores or the FPU depending on their scheduler fill status. Already decoded mops are also written to the corresponding trace cache, to be reused if the code has to be executed again (e.g. in loops). Thanks to these caches, the actual decode units are freed up and can be used to decode code bytes further down the program path. If a needed dispatch packet is already in the cache, the dispatcher can send that packet to the core needing it and in parallel dispatch another packet (from the decoders or the other trace cache) to the second core. So there won't be any bottleneck here.
The schedulers in the cores and the FPU select the mops that are ready for execution by the four pairs of ALUs and AGUs per core, depending on available execution resources and operand dependencies. There is more flexibility here than in the K10, with its separate lanes and the inability of ops to switch lanes to find a free execution resource. To save power, the execution units are only activated when mops needing them become ready for execution; this is called wakeup.
The integer execution units - arithmetic logic units (ALUs) and address generation units (AGUs) - are organized in four pairs, one per instruction pipeline. They can execute x86 integer code and memory ops (also for FP/SIMD ops) and, which is the biggest change, can be combined to execute SSE or AVX integer code. This increases throughput significantly and takes some load off the FP units. The general purpose register file (GPRF) has been widened to 128 bit to allow for this feature. Registers are copied between the GPRF and the floating point register file (FPRF) if an architectural SIMD register (a register specified by the ISA) is used for integer first and floating point later, or vice versa. Since this doesn't happen often, it has practically no impact on performance. Thanks to the option to use the integer units for integer SIMD code (SSE, XOP and AVX), the overall throughput of SIMD code increases dramatically.
The FPU contains the already known two 128 bit wide FMAC units. These are able to execute either one of the new fused multiply-add (FMA) instructions or, alternatively, a floating point add and a mul operation (or other types of operations covered by the internal fpadd and fpmul units). This ability provides both lower energy consumption and higher throughput for the simpler operations. As AMD has already stated, the two 128 bit units are normally used in parallel by the two threads running on the integer cores, but in cycles where one core doesn't need the FPU, both can be used by a single thread, increasing its FP throughput. This happens on a per cycle basis and resembles some form of SMT. The FPU scheduler communicates with the cores, so that they can track the state of each instruction belonging to the threads running on them.
Both the integer and the floating point units need data to work with. This is provided by the two 16k L1 data caches. Each core has its own data cache and load store unit (LSU). The load store unit handles all memory requests (loads and stores) of the thread running on that core and of the shared FPU. It is able to serve two loads and one store per cycle, each of them up to 128 bit wide. This results in a load bandwidth of 32B/cycle and a store bandwidth of 16B/cycle - per core. A big change compared to the LSU of the K10 is the ability to do data and address speculation. So even without knowing the exact address of a memory operation (which isn't known before the mop has been executed in an AGU), the unit uses access patterns and other hints to speculate whether some data is the same as other data whose address is already known. And finally the LSU is also able to execute all memory operations out of order, not only loads. To make all this possible without too much effort, the engineers at AMD added the ability to create checkpoints at any point in time and, in case of a misspeculation, to go back to that point and replay the instruction stream.
To reduce the number of mispredicted branches and the latency of the resulting fetch operations, the branch predictors have been improved. They are able to predict multiple branches per cycle and can issue prefetches of code bytes that might be needed soon. Together with the trace caches, it is often the case that even after a branch misprediction (which is only known after executing the branch instruction), the correct dispatch packets are already in the trace cache and can be dispatched from there with low latency.
One big feature, which improves performance a lot, is the ability to clock units at different frequencies (provided by flexible and efficient clock generators), to power off any idle subunit, and to adapt the sizes of caches, TLBs and some buffers and queues according to the needs of the executed code. A power controller keeps track of the load and power consumption of each of the many subunits and adapts clocks and units as needed. Furthermore, it increases the throughput and power consumption of heavily loaded units as long as the processor doesn't exceed its power consumption and temperature limits. For example, if the queues and buffers of core 0 are filled and the FPU is idle, the power controller will switch off the FPU (until it is woken up to execute FP code) and increase the clock frequency of core 0. If core 0 doesn't have that many memory operations (less pressure on the cache), the cache might be downsized to 8kB 2-way by switching off 2 of its 4 ways. This way the power the processor is allowed to use is directed to where it is needed instead of driving idle units. This is called Application Power Management, as you might have heard in some rumors on the net.“
At least the details don’t sound like the architecture will be a miss. The guy also told me that first samples (I don’t know if they are already 32nm) run very well with really good performance and power characteristics, already outperforming their fastest desktop chips.
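
To get a better feeling for some of those claims, I put together a few small sketches of my own in plain C. None of this is from the slide, it is just what the described mechanisms would mean for ordinary code. First, the claim about running both paths of a branch at low utilization: the kind of code that would profit is a data-dependent branch that predictors get wrong about half the time, like this one:

/* A data-dependent branch that predictors typically get wrong around 50%
   of the time on random input. The claim is that, at low utilization, the
   four pipelines of a core could execute both sides speculatively and
   keep the correct result instead of paying a pipeline flush. */
int clamp_sum(const int *v, int n, int limit)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (v[i] > limit)      /* hard to predict for random data */
            sum += limit;
        else
            sum += v[i];
    }
    return sum;
}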
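
The part about combining the ALU pairs for SSE/AVX integer code and copying registers between GPRF and FPRF on a domain change would look like this from the software side (SSE2 intrinsics, compiles with gcc -msse2; which units actually execute what is of course my guess based on the quote):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    /* 128-bit integer SIMD adds - per the quote these could run on the
       paired integer ALUs instead of occupying the shared FPU. */
    __m128i a   = _mm_set_epi32(4, 3, 2, 1);
    __m128i b   = _mm_set_epi32(40, 30, 20, 10);
    __m128i sum = _mm_add_epi32(a, b);               /* PADDD */

    /* Using the same architectural XMM register in FP code afterwards is
       the "integer first, floating point later" case that would trigger
       a GPRF -> FPRF copy - rare, so supposedly cheap overall. */
    __m128 f      = _mm_cvtepi32_ps(sum);            /* CVTDQ2PS */
    __m128 scaled = _mm_mul_ps(f, _mm_set1_ps(0.5f));

    float out[4];
    _mm_storeu_ps(out, scaled);
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}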
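
The FMAC claim is easiest to see on a plain multiply-accumulate loop. If the numbers in the quote are right, the peak would be 2 units x 4 single-precision lanes x 2 flops = 16 SP flops per cycle for a module whose FPU is used by one thread in a given cycle, or 8 per core when both threads share it:

#include <stddef.h>

/* Each iteration is one fused multiply-add if the compiler emits FMA for
   the target. With two 128-bit FMAC units that would be up to
   2 x 4 SP lanes x 2 flops = 16 SP flops/cycle per module (taking the
   quoted description at face value). */
void mul_acc(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i] + c[i];
}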
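
And the data/address speculation in the LSU matters for code where a load may or may not alias an earlier store whose address isn't known yet, for example a store to a data-dependent index followed by an independent-looking load:

/* Without speculation, the load from src[i] would have to wait until the
   address of hist[idx[i]] is known. With the described address/data
   speculation, the LSU can guess "no alias", execute the load early, and
   replay from a checkpoint if the guess turns out to be wrong. */
void scatter_then_read(float *hist, const int *idx,
                       const float *src, float *dst, int n)
{
    for (int i = 0; i < n; i++) {
        hist[idx[i]] += 1.0f;    /* store to a data-dependent address */
        dst[i] = src[i] * 2.0f;  /* load that may or may not alias hist */
    }
}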