You speak the truth therrr
If you are going to say that there is a 20% compromise because we have shared resources, then you have to say that Intel has an 85% compromise from their shared architecture. They share execution units and HT gives you a ~14% integer increase.
Some people like to do math but they don't like to do all the math.
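For anyone who wants to do all the math, the comparison above can be sketched like this. It is only a rough sketch: `sharing_compromise` is a made-up helper name, and the 80% and ~14% second-thread gains are the figures quoted in this thread, taken at face value rather than measured.

```python
def sharing_compromise(second_thread_gain: float) -> float:
    """Fraction of a full extra core that is given up when a second thread
    shares resources, given the throughput the second thread adds.
    A gain of 1.0 would mean it scales like a completely separate core."""
    if not 0.0 <= second_thread_gain <= 1.0:
        raise ValueError("gain must be between 0 and 1")
    return 1.0 - second_thread_gain


# Figures as quoted in the thread (assumptions, not measurements).
module_gain = 0.80  # second integer core in a module adds ~80%
ht_gain = 0.14      # Hyper-Threading adds ~14% on integer code

print(f"module: {sharing_compromise(module_gain):.0%} compromise")
print(f"HT:     {sharing_compromise(ht_gain):.0%} compromise")
```

With a gain of ~14% the result lands at roughly the 85% figure mentioned in the post.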
Can you say, "single-threaded integer IPC will be higher than the previous generation" ?
Because that is what Alsup is saying is NOT the case. (integer performance, yes, but integer performance/clock drops slightly in a thread, made up for by a faster clock)
Anything less than that statement could be satisfied through the not-in-dispute FP improvements, or the more cores part. It's got to be: Integer (not FP), IPC (or "performance/clock", not just "performance"), Single-threaded
Something like, "BD will have higher single-threaded integer performance/clock than the previous generation."
That would actually address (and contradict) the statement that Alsup made.
Aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaargh
I'm devolving from the level of discussion in this threaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaargh
I'm breaking into tears.
Well I for one like to keep on topic.
madcho, the 33% more cores for 50% more perf is on server loads. You can't reliably guesstimate from there.
What happened to power in those statements? Does performance per watt suddenly not matter? ;)
Or could it be that the adding most of a second integer core actually uses a nice chunk of power when adding that 80% performance? :rolleyes:
Is it possible that the thing to look at would be the performance/W improvements that the 2 different degrees of resource-sharing provide? Nahhhhhh.
That statement is from AMD's ex Chief Architect. I think the guy is now retired, but if you think Intel is paying him to post on comp.arch... :rofl:
And your previous statements manage to artfully avoid getting all of "integer" "IPC" (note, IPC, not "performance") and "single-threaded" covered.
It wouldn't be hard to state, assuming your new architect will sign off on it.
Come on, this gets boring... we can discuss the matter when we have actual numbers. As it stands now, BD is expected to increase IPC and ST performance. There's not much point in discussing it with a marketing guy and trying to make him slip some information regarding this concern.
Apparently he wants it to be very specific. He thinks something like, AMD drops integer performance to 10% of K10, but increased FP performance by 1001%, hence increased single-threaded performance.
Meh, looks like he still hasn't read the whole thread. And he's running out of things to argue about.
EDIT: Anyway, third question answers terrace215's power questions. Honestly man, read the thread, read the slides, we're not here to spoonfeed you, especially seeing how eager you are. Stop nitpicking, and don't say you're not nitpicking.
http://blogs.amd.com/work/2010/08/30...%80%93-part-2/
Nothing that we don't know of though.
^^
lol
@ terrace215: Give it a break man?
i think that a certain forum rule REALLY describes what certain persons are doing in here...
19. Trolling
Anyone entering the forum with the express intent to cause trouble or harm is subject to immediate and permanent ban.
Terrace, are you buying a finished product or 1 kg of GHz, like tomatoes?
Who cares about clock for clock if the part performs better in a given TDP budget which it was designed for.
To everyone else reading this thread: can we start a pool for Movieman to ban terrace from all threads including the word AMD? :shakes:
I won't discount the possibility that taking out extra ALUs could lead to a bottleneck. We don't know enough to say otherwise at this point.
But I'm not going to reject the possibility that with all the frontend and cache improvements 2 well fed ALUs could beat 3 poorly fed ones. ALU count alone doesn't determine the average IPC, only the max. K10 isn't anywhere near 2 on average as you pointed out.
Why all the fascination with integer instructions? Real code uses a mix of int, fp, logical, and memory instructions. If the IPC of BD versus K10 increases when executing real world code isn't that what matters?
Of course adding a whole second set of execution resources is going to increase power consumption compared to HT. It's also going to perform better.
it's obvious that he is scared that bulldozer will become the 2010 pentium killer ... so his stock will likely plunge ... poor him
i guess that it's up to the admins to decide on this issue; not on us to start a witchhunt on certain persons....
while i think that the only purpose of terrace' posts is to create chaos and troll 80% of all forum members (and to secretly earn some more money from his shares / or directly from intel) we aren't the ones who should decide if a user gets banned only because he says things that we don't like
You guys apparently don't realize that:
performance != performance/clock (IPC)
IPC != single-threaded IPC
and that the Alsup statement was about the *Integer* pipeline.
It doesn't matter how big the font is that says "Single-thread performance is higher." or "Bulldozer IPC will be higher." Neither of those address the Alsup claim which was that:
Single-threaded integer performance PER CLOCK (i.e. IPC) will be ~5% lower.
I would not have thought that the distinction would require a great deal of analytical reasoning ability to comprehend, but the numerous replies (with one exception that I've seen) indicate that I am either incorrect in my assessment or that educational systems are failing.
And really, all the personal stuff because someone posts something you disagree with? Really?
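To put a number on the performance versus performance/clock distinction in the post above, here is a small sketch of how much extra clock it takes to offset a drop in IPC. It assumes single-thread performance is simply IPC times frequency; `breakeven_clock_gain` is a hypothetical helper, and the 5% figure is the one attributed to Alsup in this thread.

```python
def breakeven_clock_gain(ipc_change: float) -> float:
    """Relative clock increase needed so that IPC * frequency stays the same
    after IPC changes by `ipc_change` (e.g. -0.05 for 5% lower IPC)."""
    new_ipc = 1.0 + ipc_change
    if new_ipc <= 0:
        raise ValueError("IPC must stay positive")
    return 1.0 / new_ipc - 1.0


gain = breakeven_clock_gain(-0.05)
print(f"clock must rise by about {gain:.1%} just to break even")
```

Any clock increase beyond that break-even point shows up as higher single-thread performance even though performance per clock went down, which is why the two statements do not contradict each other.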
QFT
2011 ;)
he still has plenty of time to sell his stock; he only thinks that trolling the crap out of a forum is going to give him another 1-2 months of rising stock prices (but it's extremely unlikely that some nerds like us posting on this forum are going to affect stock prices :ROTF:)
This is interesting bit from Question's Set no 2 @ AMD blog:
Quote:
“Is there any ‘programmable-tangible’ improvement in synchronization between cores in the same module? In other words, will I get tangible performance improvement if I can partition my multi-threaded algorithm to pairs of closely interacting threads, and schedule each pair to a module?” – Edward Yang
That is a very interesting question.
For the majority of software, the OS will work in concert with the processor to manage the thread to core relationships. We are collaborating with Microsoft and the open source software community to ensure that future versions of Windows and Linux operating systems will understand how to enumerate and effectively schedule the Bulldozer core pairs. The OS will understand if your machine is setup for maximum performance or for maximum performance/watt which takes advantage of Core Performance Boost.
However, let’s say you want to explore if you can get a performance advantage if your threads were scheduled on different modules. The benefit you can gain really depends on how much sharing the two threads are going to do.
Since the two integer cores are completely separate and have their own execution clusters (pipelines) you get no sharing of data in the L1 – and there is no specific optimizations needed at the software level. However, at the L2 cache level there could be some benefits. A shared L2 cache means that both cores have access to read the same cache lines – but obviously only one can write any cache line at any time. This means that if you have a workload with a main focus of querying data and your two threads are sharing a data set that fits in our L2, then having them execute in the same module could have some advantages. The main advantage we expect to see is an increase in the power efficiency of the cores that are idle. The more idle other cores are, the better chance the busy cores will have to boost.
However, there is another consideration to this which is how available other cores are. You need to weigh the benefits of data sharing with the benefit of starting the thread on the next available core. Stacking up threads to execute in proximity means that a thread might be waiting in line while an open core is available for immediate execution. If your multi-threaded application isn’t optimized to target the L2 (or possibly the L3 cache), or you have distinctly separate applications to run, and you don’t need to conserve power, then you’ll likely get better performance by having them scheduled on separate modules. So it is important to weigh both options to determine the best execution.
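The two placement options described in the quoted answer can be sketched as CPU sets. This is a minimal sketch under loud assumptions: the helper names are invented for illustration, and it assumes CPU ids 2k and 2k+1 are the two cores of module k, which is not confirmed anywhere in the quoted text and depends on how the OS enumerates the cores.

```python
CORES_PER_MODULE = 2  # assumption: two integer cores share one module and its L2


def module_cpus(module: int) -> set[int]:
    """CPU ids of one module, assuming ids 2k and 2k+1 are siblings."""
    first = module * CORES_PER_MODULE
    return set(range(first, first + CORES_PER_MODULE))


def plan_paired(pairs: int) -> list[set[int]]:
    """One CPU set per pair of closely interacting threads: a whole module,
    so both threads of the pair share that module's L2."""
    return [module_cpus(m) for m in range(pairs)]


def plan_spread(threads: int) -> list[set[int]]:
    """One CPU set per thread: the first core of its own module,
    so no two threads share a module."""
    return [{m * CORES_PER_MODULE} for m in range(threads)]


if __name__ == "__main__":
    print("paired:", plan_paired(2))
    print("spread:", plan_spread(4))
```

On Linux, a set from either plan could be handed to `os.sched_setaffinity(pid, cpus)` to pin a process. Which plan wins depends on how much data the threads actually share, as the answer says.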
load/store performance is increased from K10...
just take your time and really think about it if you don't know it already
just stop trolling this forum if the only point in your posts is creating chaos
just stop hiding your real employer if you get paid by intel for spreading completely wrong information on this forum
Terrace, seriously, just give it a rest. If people don't want to listen to your view (whether it be wrong or right), saying it incessantly over and over isn't going to do much. You should know this, since people have been saying the same thing over and over to you, and you don't listen either, so, goes both ways ;)
Integer performance isn't all there is to a CPU's overall performance. There is a lot more to it than that. Come on...
Why not wait until the thing is closer to release and we have some harder numbers instead of this "he said she said they said" game that gives random tidbits for people to grab onto and wave around frantically insisting they have all the answers?
It's done. now lets move on..;)
That was a bit of a corporal punishment :p: ammm its ok to express ones view, if the other person does not like it he should ignore the other guy.
News section is one of the most happening section in XS :D but i am not one in charge or one who can judge. But i do think that now that he is gone the thread will turn boring with less people digging up technical jargon and what not....
The guy above somehow forgets that 2 integer pipelines are now "dislodged" from the integer core and placed inside the FP cluster. Those are integer SIMD units. So if you want to count properly, count all the integer resources: 2 ALUs, 2 Agens (we really have no idea if these can do more than address generation) + 2 or 1 integer SIMD pipelines.
Quote:
Originally Posted by random now banned due
A good function to calc terrasse IQ :
short int terrasse(void);
short int terrasse(void) {
    return 0;
}
Ok this is a troll sry :).
About definitions we all should know :
Performance on one thread = IPC x Frequency
CPU Performance in heavy multithread load= IPC x Frequency x multithread speed up.
I guess i'm right ;)
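Those two definitions written out as code, as a sketch only. The function names and the example numbers are made up for illustration and do not describe any real chip.

```python
def single_thread_perf(ipc: float, freq_ghz: float) -> float:
    """Performance on one thread = IPC x frequency."""
    return ipc * freq_ghz


def multi_thread_perf(ipc: float, freq_ghz: float, mt_speedup: float) -> float:
    """Performance under heavy multithreaded load =
    IPC x frequency x multithread speedup."""
    return single_thread_perf(ipc, freq_ghz) * mt_speedup


# Hypothetical chips: B has 5% lower IPC than A but a 10% higher clock.
chip_a = single_thread_perf(ipc=1.00, freq_ghz=3.0)
chip_b = single_thread_perf(ipc=0.95, freq_ghz=3.3)
print(f"A: {chip_a:.3f}  B: {chip_b:.3f}")
```

The multithread speedup term is where the module design and core count come in, so a chip can lose on one factor and still come out ahead overall.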
Edited: Lets keep this friendly huh? My typing fingers are getting tired.
I for one think that Movieman was more than patient with that dude. Actually that is an understatement :)
I'm really sick of people posting in a thread only to comment about how someone else is a troll. So what if they are? Ad hominems still don't make for valid arguments or civilized discussion. These threads would be so much cleaner if people only remembered "attack the argument, not the person".
Hooray!!! Thats the best thing I've read all day.:yepp: It was funny in the beginning but he started to get a little annoying after a while. :shrug:
Hopefully the discussion will stay on what is known about BD and not what people dream up in their minds.
I think once more info is released this thread will stay plenty active
Quote:
But i do think that now that he is gone the thread will turn boring with less people digging up technical jargon and what not....
Just wondering, what with all the chatter about Bulldozer being 2+2 (ALU+AGU), aren't all Intel CPUs from Core 2 and on 3+1? Correct me if I'm wrong, but if that's the case and the grand majority of consumer applications don't use more than 2 ALUs at a time, then I don't really see what the issue is.
Now what I can see being an issue is Sandy Bridge performing considerably better than expected (I recall many rumors saying it was just an efficiency platform, and minimal if no ipc improvements would be seen), however that really isn't a discussion for this thread anyways.
Wise words, but people resort to personal attacks if they have no or limited understanding of what's going on... :p:
While it's true that they are only 3+1, Conroe introduced a 4(+1) for the decoding stage, just as BD does now. So the utilisation of the ALUs is/was higher.
thanks for the update from round 2, i was waiting for this, he posts a thread in the AMD section, but im too busy here to check it out every 5 minutes.
the round of questions do provide some more fun info, and the comments are where the real goodies show up across the next few days
It may not refute the argument, but if someone is genuinely a "troll" they don't really deserve to be able to participate in a civil discussion. Thread crapping when a person can't prove an argument or disprove one they dislike does nothing but detract from the quality of the overall debate.
Nothing personal against "terrace215", or anyone else... Opinions are just that, opinions! Everybody has one, just like they have an a$$#0L3... Facts, we will only find out when someone perhaps will leak some numbers... I did read in forums here only that somewhere in a cave in my own damned country, there's a system running this piece of hardware in question... A whole 16-pack (for as many cores) :P will be given to you dear fella... find out more please! You know who you are...
Movieman, if you come to New Delhi, India, do sound me off, i'll buy you a beer. :)
I just wish that it would be here soon... :P More pleasurable than owning the chip itself would be knowing what black magic went into making it :D
EDIT:
1) Ok, i had to apologize for my language...
2) Seriously getting harder to decide which would be more fun, owning one... or knowing about it more... :P Both, would be better :D
Quite interesting, by that definition he wasn't a troll at all. He provided a logical argument/question with some facts (even when they were old), yet people had no real facts to counter his specific question (ST IPC). The only facts that were available were increased ST performance and increased IPC (not specified if it's ST or not).
Anyway, personally I prefer hard numbers, so all these theoretical mind games aren't my cup of tea.
only that every "fact" he posted was already debunked several posts earlier (actually in the BD news in the OP)
the 2 AGU / 2 ALU argument isn't genuine at all, as the new units make Bulldozer wider than K8, which was able to do only a total of 3 AGU / ALU operations at the same time; Bulldozer is capable of doing a total of 4 AGU/ALU operations at the same time
additionally Single threaded IPC has to increase to make 8 cores running at 90% of their peak performance (each core in a BD module, when both cores are fully utilized, runs at 90% of the performance compared to a single BD core with an unutilized second core in its own module)
so all cores in AMD's "50% more performance" statement have to run at only 90% of their peak single thread performance, which makes calculating single thread IPC even more complicated than ever before ;)
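The back-of-the-envelope reasoning in this post can be sketched as follows. It is only a sketch: the function name is invented, the inputs are the thread's own figures (33% more cores, 50% more performance, 90% per-core throughput in a fully loaded module), and the result is per-core performance, which still has clock speed folded in, so it says nothing about IPC on its own. The figures also come from server loads, as noted earlier in the thread.

```python
def implied_single_thread_ratio(perf_ratio: float,
                                core_ratio: float,
                                loaded_core_fraction: float) -> float:
    """New-vs-old single-thread performance implied by a throughput claim.

    perf_ratio:           new total throughput / old total throughput
    core_ratio:           new core count / old core count
    loaded_core_fraction: throughput of a core in a fully loaded module,
                          relative to the same core running alone
    """
    per_core_loaded = perf_ratio / core_ratio
    return per_core_loaded / loaded_core_fraction


ratio = implied_single_thread_ratio(perf_ratio=1.5,
                                    core_ratio=4 / 3,
                                    loaded_core_fraction=0.9)
print(f"implied single-thread performance vs old core: {ratio:.2f}x")
```

Splitting that ratio into an IPC part and a clock part needs the clock speeds of both chips, which nobody in the thread has.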
You are actually the worst troll on XS. You constantly complain about people trolling, and yet all you have to do is look at your own post history: http://www.xtremesystems.org/forums/...rchid=20942262 Do you agree that you are trolling far more in the Anand SB preview thread than the guy you are complaining about here did in this thread? We can compare posts and see who stuck to the topic at hand and offered rebuttals relevant to the subject, and who did not.
It's the same people trying to turn XS into the zone. You guys hate any negative input in any AMD thread, but just look at your posts in any Intel thread. Don't take my word for it, just look.
why cant we just let this thread die already...geeez gallag take it to pm man!
Hit a nerve? Again, just look at your post history. You are one of them, all pro-AMD and anti-Intel. I can understand you being pro-AMD, but I just hate the way you guys get on people's cases for being negative, or simply not being overly positive, in AMD threads, yet you will go into Intel threads and do it yourself???
Fanboyz (AMD, Intel, nVidia ... whatever) are so boring ...
Is it possible to have a fair discussion here between open-minded people ?
Yes, one module is capable of 4 AGU/ALU ops, but that requires 2 threads. For a single thread you're down to 2/2, but with a front end that's much more capable than that of K8.
You're making a mistake: you're talking about relative performance, which is not related to IPC at all, or at most is only one part of the equation.
Is that the paranoia about Intel paying off AnandTech that you like to bring up in Intel threads when you are trying to be negative? Could you just try not being a hypocrite? Stop going into Intel threads just to be negative, or stop complaining when people do it to AMD. And anyone can see I am not talking shi2 or being paranoid. All they have to do is look at your post history. It's not my opinion, it's all there in black and white.
I will not ruin this thread anymore. There is a lot of good info in it and it would be a shame for it to be locked, so all I am saying is: think about it guys, are you what you say you hate?
What paying off of AnandTech? What are you babbling about? :shrug: Where did I say that in the thread you mention? You are lost. The review @ AT was quite good.
Denial is just one symptom of paranoia btw.
A lot of good info was lost in between the 50 posts terrace alone produced in this thread, posts that were basically beating a dead horse, posting questions that can't be answered here, or even if they were (like JF did), the answers would be disregarded and the trolling would continue.
We don't know that for sure. And what was K8 capable in theory and what was possible in reality are 2 different things.
gallag so true:yepp:
Why its on the chart?
Sure, the front end remains and is capable of decoding 4+1 instructions, but for a single-thread int workload one core only has 2 ALUs/AGUs.
There is no way bulldozer can fuse the int cores to make it act like a virtual 4alu/agu core, afair even JF denied this in this thread.
There won't be any "fusing" of the cores, that's not possible. But AMD is not disclosing all the details, such as why they now call the 2 pipelines Agen. There was never such a term in their diagrams in the past. It has always been AGU, and this was paired with 1 ALU, with separate schedulers. Now we have a unified scheduler and this new Agen unit, outside of the L/S unit that is also on the diagram (standing separately too). And finally we have 2 additional integer units in the FPU. So if you want to count it, you can say it's 2+2+1(2) in terms of integer execution power.
edit: the only "fusing" that may happen is in terms of FPU execution,when one core can use both FMAC units for itself.
all i see is, AMDers are most likely proud of AMD's CPU mArch (sometime over optimist about it), and regarding Intel mArch, they admit its good but not ceding on Intel's superiority claim (sometime understating or underestimating Intel mArch capability). Intel trolls ? Well, nothing that AMD creates worth jacksh1t for 'em, that's one thing for sure. Sometime i'd like to think this forum drama as class struggle in life, beetween the bully haves and the defensive have nots, LOL. :rofl:
In current architecture our pipelines are shared between ALU and AGU. With bulldozer we actually break them out and make them dedicated, 2 for each.
Different engineers write things different ways. It is not that "AMD is changing what we call them" but that an engineer wrote it that way. Marketing tends to refrain from editing technical slides as they know more than us in that area.
Our friend who goes by the name dougsf30/terrace215/chipper/chipdesigner/tatertot/justaview/gloo...
and 100 more names has a history of driving people nuts (and maybe himself as well...)
To recapitulate this thread:
AMD Architects : IPC increases (Anand article commenting on the 2 ALUs and 16KB L1)
terrace215 post: IPC decreases, because of the 2 ALUs..
terrace215 post: IPC decreases, because of the 16KB caches
terrace215 post: IPC decreases, AMD presentation sheet no.X tells us so.
terrace215 post: IPC decreases, AMD presentation sheet no.Y confesses this.
JF-AMD posting: IPC increases!! instead of getting worse.
terrace215 post: IPC decreases, the marketing guy isn't talking about IPC
terrace215 post: IPC decreases, don't trust marketing guys.
terrace215 post: IPC decreases, Bulldozer is only optimized for server workloads.
terrace215 post: IPC decreases, AMD presentation sheet no.Y confesses this.
JF-AMD posting: IPC increases!!!! You are spreading FUD
terrace215 post: IPC decreases, AMD presentation sheet no.X tells us so.
terrace215 post: IPC decreases, The AMD architect says it decreases by 5%
terrace215 post: IPC decreases, because of the 2 ALUs..
terrace215 post: IPC decreases, AMD has given up improving IPC.
JF-AMD posting: IPC increases!!!!!!! How many times did I tell you!!!
forever{
terrace215 post: IPC decreases, because .....
terrace215 post: IPC decreases, says .... of AMD
terrace215 post: IPC decreases, according to AMD's presentation.
terrace215 post: IPC decreases, don't trust marketing guys.
terrace215 post: IPC decreases, because of the 2 ALUs..
terrace215 post: IPC decreases, the marketing guy isn't talking about IPC
terrace215 post: IPC decreases, because of the 16KB caches
terrace215 post: IPC decreases, AMD has given up improving IPC.
terrace215 post: IPC decreases, The AMD architect says it decreases by 5%
terrace215 post: IPC decreases, Bulldozer is only optimized for server workloads.
terrace215 post: IPC decreases, AMD presentation sheet no.X tells us so.
terrace215 post: IPC decreases, The more I post the more it decreases.
terrace215 post: IPC decreases, The more I post the more it decreases.
terrace215 post: IPC decreases, The more I post the more it decreases.
.....}
until (interrupt by Movieman)
Regards, Hans
^^^LOL! That is the most epic post EVER! So very very true.
Hello Hans.
Are you sure of all those names at the top of your post?
The reason I ask is that I got a "holier than thou" multi PM response from him after I removed his access to News and AMD section.
Yes, I chuckled too at your post!:D
Oh, forgot,no one drives me nuts. I just smile, grab my hammer and hit them upside the head so hard their grandchildren will walk with a 15degree list.:p:
"The more I post the more it decreases." part is the critical point the code above,cracked me up :D .
That was truly an epic post Hans :)
Agner Fog's microarchitecture.pdf is a good place to start. It has a part where it tries to identify the bottlenecks in every major x86 design today, so 10h (wrongly called K10) is in there. Essentially 10h can in theory do a massive 9 (nine) "micro ops"* but retire only 3 "macro ops"**. There is a bottleneck in the retirement part of the design (but the utilization of the 9 units can't be effectively measured in the real world, as the document says; it is clear that some of the time the execution units are underutilized, especially the 3rd AGU, which is redundant due to the 2 ports to the L1D cache).
*macro op is split into these micro instructions and then sent to execution units
**macro op is an instruction the decoder deals with;1 x86 instruction typically = 1 or 2 macro ops
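The bottleneck argument above boils down to "the narrowest stage sets the ceiling". A minimal sketch of that idea, using only the two 10h numbers from this post; `narrowest_stage` is a made-up helper, and comparing micro ops against macro ops directly is a deliberate simplification since one macro op can turn into more than one micro op.

```python
def narrowest_stage(widths: dict[str, int]) -> tuple[str, int]:
    """Return the pipeline stage with the smallest per-cycle width.
    Sustained throughput cannot exceed that width, however wide
    the other stages are."""
    stage = min(widths, key=widths.get)
    return stage, widths[stage]


# Per-cycle widths as quoted in the post (units glossed over on purpose).
family_10h = {"execute": 9, "retire": 3}

stage, width = narrowest_stage(family_10h)
print(f"ceiling set by '{stage}': {width} per cycle")
```

So extra execution units beyond what retirement can drain only help with bursts, not with the sustained rate.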
edit:
continued on to Bulldozer
The front end can take up to 4 x86 instructions (can't tell what the relation is to the RISC-like macro ops in the 10h decoder stage) and dispatch them in 2 groups of 4 (macro ops?). Each integer core can do 4 instructions (2 arithmetic and 2 address, but the Agen unit can maybe do some math work too). Still, a lot is unknown, so we can't say what else is in there and how AMD organized it. At least not until launch.
http://www.xbitlabs.com/articles/cpu...0_6.html#sect0
Quote:
Upon the availability of data, the scheduler may issue one integer operation to ALU and one address operation to AGU from each queue. There can be maximum two simultaneous memory requests. So, up to 3 integer operations and 2 memory operations (64-bit read/write in any combination) may be issue for execution per clock. Micro-operations from various arithmetic MOPs are issued for execution from their queues in an out-of-order manner, depending on the readiness of the data.
isn't the bulldozer going to be released in 2nd quarter 2011 when the sandybridge 8 core arrives to do battle? :D
page 251 of: http://support.amd.com/us/Processor_TechDocs/25112.PDF
or alternatively:
Quote:
A.3 Superscalar Processor
The AMD Athlon 64 and AMD Opteron processors are aggressive, out-of-order, three-way
superscalar AMD64 processors. They can fetch, decode, and issue up to three AMD64 instructions
per cycle with a centralized instruction control unit (ICU) and two independent instruction
schedulers—an integer scheduler and a floating-point scheduler. These two schedulers can
simultaneously issue up to nine micro-ops to the three general-purpose integer execution units
(ALUs), three address-generation units (AGUs), and three floating-point execution units. The
processors move integer instructions down the integer execution pipeline, which consists of the
integer scheduler and the ALUs, as shown in Figure 6 on page 252. Floating-point instructions are
handled by the floating-point execution pipeline, which consists of the floating-point scheduler and
the floating-point execution units.
http://www.chip-architect.com/news/2...Core.html#1.20
But don't forget that the average number of ALU instructions is something like 0.4/cycle
which is 4 to 5 times less than two ALUs can provide.
Regards, Hans
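The utilization point above is quick to check. A sketch only: the helper names are invented, and 0.4 ALU instructions per cycle is the rough average quoted in the post, not a measurement of any particular workload.

```python
def alu_utilization(avg_alu_ops_per_cycle: float, alu_count: int) -> float:
    """Average fraction of ALU capacity actually used per cycle."""
    return avg_alu_ops_per_cycle / alu_count


def alu_headroom(avg_alu_ops_per_cycle: float, alu_count: int) -> float:
    """How many times more ALU work the core could issue at peak."""
    return alu_count / avg_alu_ops_per_cycle


AVG_ALU_OPS = 0.4  # rough average quoted in the post (assumption)

for alus in (2, 3):
    print(f"{alus} ALUs: {alu_utilization(AVG_ALU_OPS, alus):.0%} used, "
          f"{alu_headroom(AVG_ALU_OPS, alus):.1f}x headroom")
```

An average like this hides bursts, so it shows that two ALUs are rarely the limit on average, not that a third one never helps.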
Guys ... David has posted a terrific summary of Bulldozer ... http://www.realworldtech.com/page.cf...2610181333&p=1
:rofl:
I honestly am waiting like thousands more to get a sneak peak... :P I don't live in that city which i mentioned earlier with the cave running the (early sample) hardware in question... or i'd have had done anything, akin to indiana jones (which is lamo me thinks) to get to the 16 core unobtanium-optronium! :P
No, a BD Core has 2 ALUs AND 2 AGUs available. 2+2=4. A Phenom II has 3 ALUs OR 3 AGUs. 6/2 = 3.
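The two ways of counting that people keep arguing over in this thread can be put side by side. This is a sketch, not a statement about the real hardware: `IntCore` and its `shared_lanes` flag are made up for illustration, with `shared_lanes=True` encoding the "ALU OR AGU" reading from this post and `shared_lanes=False` the "ALU AND AGU" reading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntCore:
    name: str
    alus: int
    agus: int
    shared_lanes: bool  # True: each lane does an ALU op OR an AGU op per cycle

    def peak_int_ops(self) -> int:
        """Peak ALU + AGU operations per cycle under the chosen counting."""
        if self.shared_lanes:
            return max(self.alus, self.agus)
        return self.alus + self.agus


cores = [
    IntCore("BD core (2 ALU AND 2 AGU)", alus=2, agus=2, shared_lanes=False),
    IntCore("Phenom II, counted as 3 ALU OR 3 AGU", alus=3, agus=3, shared_lanes=True),
    IntCore("Phenom II, counted as 3 ALU AND 3 AGU", alus=3, agus=3, shared_lanes=False),
]

for core in cores:
    print(f"{core.name}: peak {core.peak_int_ops()} ops/cycle")
```

Which of the two Phenom II rows is right is exactly what the posters disagree on, so the sketch only makes the disagreement explicit. Peak counts also say nothing about average IPC.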
EDIT:
And PLEASE, can we dedicate this thread to Bulldozer and not forum moderation? I too welcome the ban, but I'm sure we got enough criticism and back-patting here. There are other places we can continue doing that. :)
I heard via the grapevine that it'll be 1H2011 for server, 4Q2011 for desktop :(
K10 has 3 ALUs and 3 AGUs. No matter how hard you and others try to downplay K10 execution resources, fact is, a K10 integer core has more resources than a BD integer core.
The docs linked by Hans are pretty clear.
http://www.xtremesystems.org/forums/...&postcount=681
Bobcat is a 2-way (2 ALU + 2 AGU) design, has 90% of Propus performance and is a low-power design with solid performance. One can expect a Bulldozer core to stomp over a Bobcat core, but both have fewer ALUs/AGUs than 10h. The number of units means nothing if you can't use them effectively, and you know that. The number of core-level changes is pretty big, from L/S improvements, prefetch, BP, shared L2 etc. As Anand wrote (info from AMD), per-core performance will be better than 10h.
No, you are wrong. Old architecture has shared resources, new architecture has dedicated resources.
A BD integer core will do more IPC and perform single threads faster than an old core.
Why do you keep saying these things even though I have posted the information in multiple places?
I'm pretty sure that when you have the position JF has in a company you get pretty accurate numbers from engineering and so on. There is no need for him to sit down and bench engineering samples personally. Would be quite stupid if engineering lied about the performance in internal reviews and documents.
You know this isn't Dilbertland right? ;)
10h can retire 3 macro ops. A BD integer core/FP core should be able to do 4.