As quoted by LowRun: "So, we are one week past AMD's worst-case scenario for BD's availability, but they don't feel like communicating about the delay. I suppose AMD must be removed from the reliable-sources list for AMD's product launch dates."
That's pretty much what he has said countless times already.
http://www.xtremesystems.org/forums/...&postcount=602
I don't have any reason to be in building 400. And it is better off that the marketing guy is not "dropping in" on them.
Our performance engineering team has done a really accurate job on performance modeling in the past; I have no reason to doubt them. Generally the worst that we see is too much conservatism, not too much optimism.
OK, so let me get the gist of all of this whole thread down to two statements:
1. People are claiming Bulldozer will be slower than existing products because they are sharing resources in the processor and sharing is inherently worse.
2. People are claiming that even though Bulldozer has dedicated resources relative to the old architecture that shares them, this is worse.
OK, I got it now.
When AMD had 64-bit and Intel had only 32-bit, they tried to tell the world there was no need for 64-bit. Until they got 64-bit.
When AMD had IMC and Intel had FSB, they told the world "there is plenty of life left in the FSB" (actual quote, and yes, they had *math* to show it had more bandwidth). Until they got an IMC.
When AMD had dual core and Intel had single core, they told the world that consumers don't need multi core. Until they got dual core.
When Intel was using MCM, they said it was a better solution than native dies. Until they got native dies. (To be fair, we knocked *unconnected* MCM, and still do; we never knocked MCM as a technology, so hold your flames.)
by John Fruehe
I'll make it short and easy to understand. Original quote:
Which is 100% true, as K10 has more execution units. I don't see the words performance, shared or dedicated in this post. Then you say:
Which is wrong, based on the above. I just pointed it out, but it seems it was a perfect excuse to ignore what the guy is actually saying (as you like to do) and repeat the same post you've been repeating how many times now?
I hope you properly get it now.
Friends shouldn't let friends use Windows 7 until Microsoft fixes Windows Explorer (link)
Let's say their past history isn't as immaculate as you portray it. There is an alternate discussion of BD details on Ace's, and Paul Demone directly answers JF's claims:
Originally Posted by Paul Demone:
BD taped out a month or two ago. If they were lucky, silicon is mostly functional. If not, they are working overtime to fix it and get working samples. Silicon is being characterized and in the pre-validation stage.
In other words, benchmarks and performance take second place at this time; most important is getting a functional chip.
What this all means is that every claim about BD performance is based on estimates made without actual silicon in hand.
SUN Rock was meant to be the greatest chip of the past decade, with innovative features like transactional memory and scout threads. I still remember how ecstatic Jonathan Schwartz was over Rock.
Rock turned out to be a complete dud, burning 300 W with abysmal performance.
I have to wonder why Paul would even say this unless he just wants to argue:
Originally Posted by Paul Demone:
ROFL. A Niagara has higher "aggregate" (across all threads) IPC than a US-IV but far lower single thread performance. Listen for what a salesman doesn't say! Higher single thread performance than K10? Probably, but at far higher clock rates enabled by a deeper pipeline, simpler cores, and a process shrink.
The first section deals with repeating the whole "OK, so it's faster overall but what about single threaded work?! Ha!" We've already been told that BD is faster than the current gen at both, which he even acknowledges in the second half. As such, what point was there to even posting the first part? As for the second part, who cares? If the frequencies are higher due to the changes in the chip and that permits an overall faster singlethreaded and multithreaded experience than is possible with current designs, then why does it matter if the new chip ticks faster? I'm not saying I think that clock for clock the new part will be slower at this point, but even if it were it would be fine given that in the end it's still faster and not only clocked higher.
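Paul's Niagara-vs-US-IV point can be put in rough numbers. This is a toy illustration with made-up thread counts and per-thread IPC figures (not real measurements of either chip): a sea of slow threads can win on aggregate IPC while each individual thread is far slower.

```python
# Aggregate IPC vs. single-thread IPC, with illustrative (invented) figures.
def aggregate_ipc(threads, per_thread_ipc):
    """Total instructions retired per cycle across all running threads."""
    return threads * per_thread_ipc

niagara_like = aggregate_ipc(threads=32, per_thread_ipc=0.25)  # many slow threads
us_iv_like   = aggregate_ipc(threads=2,  per_thread_ipc=1.5)   # few fast threads

print("Niagara-like aggregate IPC:", niagara_like)
print("US-IV-like aggregate IPC: ", us_iv_like)
# The "faster" chip in aggregate is the slower one per thread (0.25 vs 1.5),
# which is exactly what a salesman's aggregate numbers would not mention.
```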
Particle's First Rule of Online Technical Discussion:
As a thread about any computer related subject has its length approach infinity, the likelihood and inevitability of a poorly constructed AMD vs. Intel fight also exponentially increases.
Rule 1A:
Likewise, the frequency of a car pseudoanalogy to explain a technical concept increases with thread length. This will make many people chuckle, as computer people are rarely knowledgeable about vehicular mechanics.
Rule 2:
When confronted with a post that is contrary to what a poster likes, believes, or most often wants to be correct, the poster will pick out only minor details that are largely irrelevant in an attempt to shut out the conflicting idea. The core of the post will be left alone since it isn't easy to contradict what the person is actually saying.
Rule 2A:
When a poster cannot properly refute a post they do not like (as described above), the poster will most likely invent fictitious counter-points and/or begin to attack the other's credibility in feeble ways that are dramatic but irrelevant. Do not underestimate this tactic, as in the online world this will sway many observers. Do not forget: Correctness is decided only by what is said last, the most loudly, or with greatest repetition.
Rule 3:
When it comes to computer news, 70% of Internet rumors are outright fabricated, 20% are inaccurate enough to simply be discarded, and about 10% are based in reality. Grains of salt--become familiar with them.
Remember: When debating online, everyone else is ALWAYS wrong if they do not agree with you!
Random Tip o' the Whatever
You just can't win. If your product offers feature A instead of B, people will moan how A is stupid and it didn't offer B. If your product offers B instead of A, they'll likewise complain and rant about how anyone's retarded cousin could figure out A is what the market wants.
http://flamewheelspin.ytmnd.com/
perfectly sums up this thread...
No need to; the point was simply to take the appropriate spoonful of salt with regard to marketing and performance claims for a future product.
Originally Posted by JF-AMD:
Not in the slightest.
First of all, nobody claimed BD will be slower than existing products either in performance overall or single threaded performance. Nobody brought in discussion the dedicated vs. shared resources but you, so false dilemma you have there.
The only point raised (by me, at least) was that, given the design trade-offs BD made (which I addressed in detail in a previous post; my POV, nothing more, and which David Kanter also mentioned in his article), it is expected that BD will lose slightly in per-clock performance compared to K10 in integer code. Overall performance of BD, including single-threaded, will no doubt be higher than K10's. But not per clock.
On what do you base your opinion? A deeper pipeline?
Are there any bits of info regarding cache inclusiveness/exclusiveness, other than the 16 kB L1D, which hints at an inclusive cache?
I'm still predicting an inclusive cache given the L1D size. There is no reason to stick to an exclusive cache, as it gives virtually no benefit because of the poor L2/L1 and L3/L2 ratios. It just slows every memory operation quite a bit while giving a marginal improvement in hit rate. Anyone with some knowledge on the performance penalty due to exclusive cache? I'd believe that an inclusive cache would bring more than enough to compensate for any loss the deeper pipeline could potentially cause, bringing the cache latencies near Nehalem numbers, if not better. SB seems to be a real badass on this, so I can't see BD getting near its latencies even with an inclusive cache.
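The trade-off being argued here can be sketched with a back-of-the-envelope AMAT (average memory access time) model. All latencies and hit rates below are invented for illustration, not real Bulldozer or Nehalem numbers; the point is only that when the L1 is tiny relative to the L2 (16 kB vs. megabytes), exclusion's extra effective capacity barely moves the L2 hit rate, while its swap traffic adds latency on every L1 miss.

```python
# Two-level AMAT model: illustrative numbers only, not real silicon data.
def amat(l1_hit_rate, l2_hit_rate, l1_lat, l2_lat, mem_lat):
    """Average memory access time (cycles) for a two-level cache hierarchy."""
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    return l1_lat + l1_miss * (l2_lat + l2_miss * mem_lat)

# Exclusive: L2 holds no duplicates of L1 lines, so its effective hit rate
# is a hair better, but each L1 miss pays extra cycles for the L1<->L2 swap.
inclusive = amat(0.95, 0.80, l1_lat=4, l2_lat=18, mem_lat=150)
exclusive = amat(0.95, 0.81, l1_lat=4, l2_lat=22, mem_lat=150)
print(f"inclusive AMAT: {inclusive:.2f} cycles")
print(f"exclusive AMAT: {exclusive:.2f} cycles")
```

With a 16 kB L1 against a 1-2 MB L2, the duplicate lines recovered by exclusion are a rounding error on the L2 hit rate, which is the post's argument for going inclusive.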
Last edited by Calmatory; 08-31-2010 at 05:35 AM.
He was addressing JF's point about IPC being higher (Paul doubts that). I am surprised it isn't obvious.
As for the second part, who cares? If the frequencies are higher due to the changes in the chip and that permits an overall faster singlethreaded and multithreaded experience than is possible with current designs, then why does it matter if the new chip ticks faster? I'm not saying I think that clock for clock the new part will be slower at this point, but even if it were it would be fine given that in the end it's still faster and not only clocked higher.
Well, you see, neither I, Paul, nor others are interested in absolute values for benchmark scores. My interest is how they got there: the uarch, the trade-offs, the clever stuff done to hide bottlenecks, the corner cases, etc. I don't give a rat's ass if it scores 101 FPS in I-don't-know-what game or does SuperPi in -2 sec.
The fun is in analyzing the intentions and the implementation, not the end result. I take great pleasure in reading about Netburst, Prescott, Tejas, Nehalem (the first one), Tanglewood, Rock, etc., even if some were duds in the end. It may suck, but it was innovative and challenging.
Well, after readin' all the stuff about BD, my nooby chip expertise tells me that:
IPC will be improved at the same clocks compared to current AMD processors.
It will take less space per core
It will clock higher than the current crop of AMD processors.
It looks like it will be highly competitive in the server market, but behind in the 'gamers' segment (possibly close to matching today's Intels because of clockspeed, but not surpassing it in IPC).
obviously no one will know until it gets leaked.
I don't agree with the word "estimates"
A design is validated and debugged long before it goes to silicon. Validation
is done both by cycle accurate software simulation and FPGA hardware
emulation. An FPGA hardware implementation of the core, or entire processor,
can run at 10+ MHz and can be made cycle accurate. This is also how you do
performance tuning during the design phase itself.
Typically operating systems are booted and many software applications
are run long before you go to silicon.
About your link......
What in this musing from the investment-board inhabitants can be classified
as anything other than investor FUD, or as having any technical relevance concerning
the architectural details of Bulldozer?
Regards, Hans
~~~~ http://www.chip-architect.org ~~~~ http://www.physics-quest.org ~~~~
I don't get why you guys are so sure it can't offer more IPC than K10. I think it does make a difference that K10's three lanes could each act as either an ALU or an AGU, but not both simultaneously. Add to that the fact that many applications don't even keep a full set of ALUs/AGUs busy, so combined with a better prefetcher, Bulldozer should offer good IPC gains.
It's been confirmed many times over that 80% number is integer cores in a single module vs integer cores in different modules, and the performance is lost due to shared components in the modules, not due to weaker cores.
In fact, it's been said a couple of times by JF-AMD himself...
Sharing inevitably means communism for some... but not for me. If it brings a good product at an affordable price with a big improvement over the last product, I'm all for it, really.
And you're arguing with a man who works at the company whose product you've decided to pick at, and who is in contact with the engineers who built the damn thing...
I agree, and savantu doesn't help anyone get an objective view of the facts.
If JF said IPC will be better, it's true... Why? Simple: he doesn't want to be unemployed.
Good marketing is telling the truth... Henri Richard made some big mistakes, and now he no longer works for AMD.
Bad guys don't stay long...
Last edited by madcho; 08-31-2010 at 05:55 AM.
What if there are multiple threads with lots of AVX instructions? A single module can only feed one AVX instruction at a time, or two 128-bit SSEx instructions, or four 64-bit FPU instructions, right?
Up to <number of modules> threads running AVX, there should be no performance penalty as long as there are no other FP instructions in flight. The more there are, the lower the AVX performance will be. And if one adds more AVX threads, the FPUs will just starve and there will be no performance improvement?
In short: if I want to do lots of AVX, I can only run <number of modules> threads for improved performance?
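The saturation argument in the question can be sketched as a toy model. It assumes, as the post does, that one module issues at most one 256-bit AVX op per cycle regardless of how many threads it hosts; the module count is a made-up parameter, and real scheduling is of course messier than this.

```python
# Toy model of AVX throughput on a chip with a shared per-module FPU.
# Assumption (from the post, not a measured fact): one module can retire
# at most one 256-bit AVX op per cycle, however many threads share it.
def avx_throughput(modules, avx_threads):
    """256-bit AVX ops retired per cycle, chip-wide."""
    # Beyond one AVX thread per module, extra threads just time-slice the
    # FPUs: aggregate throughput flatlines at <number of modules>.
    return min(avx_threads, modules)

chip_modules = 4  # hypothetical 4-module / 8-core part
for threads in range(1, 9):
    print(threads, "AVX threads ->", avx_throughput(chip_modules, threads), "ops/cycle")
```

The printout plateaus at 4 ops/cycle once more than four AVX-heavy threads are running, which is the "FPUs just starve" scenario the post describes.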
Repeating for the 10th time already: 10h can retire (at the back end of the chip) 3 macro-ops, period. It has 9 execution units. There's your problem.
By the time this what-if comes to pass, there will be more powerful CPUs capable of doing more than a single AVX instruction per core, etc...
And anyway, isn't AVX better suited to massive multimedia tasks in terms of the way it processes the info? So even if the CPUs might be limited by that fact, they will most likely finish their job easily, right?
Quotes taken out of context can be true, but in context they can mean something different. Your quote was a response to my post about pipes. He is trying to make 3 pipelines appear like 6 pipes, which is a twist of the truth.
BD has more resources since it can use 2 ALUs and 2 AGUs every clock; Phenom II averages 1.5 ALUs and 1.5 AGUs since they share pipes. Again, if you can't use it, it isn't a resource. 2+2=4; (3+3)/2=3.
The discussion is still about IPC, even if you try to make it look different. And it's still about BD's integer execution capacity compared to K8 (10h); we are pointing out that BD's 4 pipes seem a bit stronger than K8's 3 pipes.
And by adding the different parts of K8's pipeline together, some people here are trying to make them look twice as strong.
4 pipes equals more resources than 3.
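The 2+2=4 vs. (3+3)/2=3 arithmetic above, spelled out. This just encodes the post's own simplification (a 50/50 ALU/AGU instruction mix, lanes counted as half-and-half when shared); it is not a pipeline simulation.

```python
# The post's resource arithmetic: K10's three integer lanes can each act as
# an ALU or an AGU, but not both in the same cycle, so on a mixed workload
# each lane averages half an ALU plus half an AGU. Bulldozer's integer core
# has 2 dedicated ALUs plus 2 dedicated AGUs.
K10_LANES = 3            # shared ALU/AGU lanes
BD_ALUS, BD_AGUS = 2, 2  # dedicated pipes per Bulldozer integer core

k10_alus = k10_agus = K10_LANES / 2   # 1.5 usable ALUs, 1.5 usable AGUs
bd_total = BD_ALUS + BD_AGUS          # 2 + 2 = 4
k10_total = k10_alus + k10_agus       # (3 + 3) / 2 = 3

print("BD usable pipes/clock: ", bd_total)
print("K10 usable pipes/clock:", k10_total)
```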
Last edited by -Boris-; 08-31-2010 at 06:24 AM.