Istanbul vs. Nehalem clck-4-clock (almost) in Linpack

**AKM** · 07-01-2009, 12:44 AM

Here's AnandT using a newer Linpack version 10.1

http://www.anandtech.com/weblog/showpost.aspx?i=529

**~~Zucker2k~~** · 07-01-2009, 05:19 AM

Originally Posted by villa1n

I m just going to ignore your insults, and apparent lack of understanding as to why N size was adjusted. The Value of N for the nehalem, was the original value set, IE the largest number for that platform before swapping occurs

Facepalm. Platform? Please, don't be delirious, unless you're suggesting the x5550 system is incapable of handling more than 12GB. I have already said it, and I'll repeat, larger ram=larger problem size=larger gflops.

Impact of problemsize over 30k is almost negligible

Please post a complete quote.... as I couldn't find that claim in that document; on the other hand, I found this:

Linpack alternates between computation and communication when solving a system of equations. The work to be done in the computation phases depends on the number and size of the tiles to be solved. How the system of equations is tiled affects how the workload is balanced across the system, and how efficiently each tile can be processed. These factors can be significant. Tile definition and distribution are determined by the problem size, N, and the number of tiles that are used. This, and the speed of communication, affects how much time is spent in communication. As problem sizes increase, the amount of computation grows faster than the cost of communication, and the impact of communication performance diminishes accordingly.

Linpack is a kernel benchmarks; that is, it focuses on measuring the performance of a small, relatively simple mathematical kernel. In so doing it gives an accurate measure of the performance of a single subsystem of the computer. But Linpack is relatively insensitive to many important factors, so it is not a good indicator of the system or cluster as a whole.

2. System Performance
Linpack is a good indicator of how much performance can be obtained using 64-bit vector floating-point operations. Scalar operations can be substituted for vector operations, but when both are available the vector operations are usually faster. In this benchmark the operands are very often cached, so the operations tend to be processor-core-intensive. As such, performance is almost entirely dependent on core frequency. Performance also generally increases with problem size, rapidly for small sizes, more slowly for larger problems, so memory size is important. The benchmark is affected only slightly by cache size, memory speed and network performance. Figure 1 compares the performance of several systems over a variety of problem sizes, processor frequencies, processor cores, and cache sizes. Many different experiments were run for each problem size, and only the best results for each system are reported. Both single-core and dualcore AMD Opteron processors were used, as well as Intel Xeon processors with 1MB and 2MB L2 caches. All AMD Opteron processors have a 1MB L2 cache, and all Intel Xeon processors have a single core with Hyperthreads disabled. Systems are listed in order of increasing performance.

It is obvious by the hardware being used that this is an old document, but I'll ignore that fact since the software aspect is more interesting to me; that is, the significance of ram size to the realization of higher gflops due to the allocation of a bigger problem size. I don't know how many times I've said this in this thread, but I don't mind repeating myself.

Here's more:

Another feature that stands out is that performance increases (nearly) monotonically with problem size. Performance grows rapidly until N is in the range of 5000 to 10000, after which performance continues to improve much more slowly. In all cases N1/2 is well below 5000, and in most cases it is even below 1000. This is significant for two reasons. First, it shows that these systems are able to obtain a significant fraction of their achievable performance with very small problem sizes. Stated differently, near maximum performance can be achieved on many problems with modest investment. It isn’t an unusual occurance achieved only after heroic efforts.

Older hardware/software combination using SSE2 for computations. Things are very different now; imc, tri-channel, qpi, sse4.1.

The second reason this is significant is that it shows two of the older Linpack versions, DP and TPP, do not accurately reflect system performance. Linpack DP and Linpack TPP use small, fixed problem sizes of 100 equations and 1000 equations, respectively. It is very clear from the diagram that both show performance in a range that is heavily dominated by factors other than floatingpoint performance. Unfortunately, both benchmarks are still occasionally used by customers trying to assess server performance. When this happens, it should be pointed out that very small Linpack runs will probably not measure what they believe it will, and that better alternatives are available.

Ouch!

A third feature, one that is a little harder to see, is the comparison of cache sizes. Among the systems tested is one with two 3.6 GHz Intel Xeon processors, code named “Nocona”, with a 1MB L2 cache. Another system is nearly identical, but it uses two 3.6 GHz Intel Xeon “Irwindale” processors, with a 2MB L2 cache. The difference in performance is measurable and consistent, but it is quite small, only about 1%. This is due, in part, to the fact that the BLAS are so well tuned that their cache hit ratios are already quite high, and larger caches have little opportunity for improvement. It is also due to the fact that the BLAS libraries are tuned for a specific memory hierarchy, and the library we used was tuned for the 1MB caches. Using a library tuned for the larger cache would improve performance, perhaps by as much as 5%.

The next diagram, Figure 2, shows that Linpack is not especially sensitive to the memory performance of the system, because of the high cache utilization already mentioned. This set of experiments compares the performance of two e326 clusters, one with 333 MHz Double Data Rate (DDR333) memory installed, the other (actually the same cluster) with DDR400 memory installed. The cluster with DDR400 memory has better than 20% greater memory bandwidth than the cluster with DDR333 memory installed, yet the Linpack performance difference between the two clusters is less than 1%. This is a clear indication that memory performance does not play a significant role in determining Linpack performance.

There goes your memory performance theory too, ala ddr3 vs ddr2.

Originally Posted by informal

Good post villa1n,you summed it up well

. Even though you explained it already in previous post,somehow he failed to understand it

As for you, you're blind, and can follow him over the edge of the cliff.

**Shintai** · 07-01-2009, 05:32 AM

Good work Zucker2k. Hopefully it can save us from more misleading. But that would most likely be too much to hope for with all the brand loyalty around.

**Donnie27** · 07-01-2009, 06:24 AM

Originally Posted by marten_larsson

Nah, I just don't buy it. There's always something wrong with a review whichever CPU/GPU/whatever comes out on top, it's just that some tend to scream when it's in one direction and others in the other direction. Example; HardOCP's review of Deneb. They were using slower and less RAM for the AMD system than the Intel one (C2Q) and some were complaining. Still the "other" side had to defend the review and say "but the results are similar to what others have shown so it shouldn't matter". Why not?

If it's the same here I wouldn't know. But I do know that when I read a thread title like this one, I'm sure I'll be reading some comments about the review/whatever being bad.

To Movieman; you say trust "us"(XS) instead of "them"(reviewers/bloggers). The only problem is that quite a few of "them" are a part of "us". And seeing how some behave here I'd rather trust some of "them" than a few of "us"

In the end it's all about credibility. If someone is consistently showing skewed results one way or another then that person loses credibility and that goes to reviewers and forum members alike. (And I'm sure you agree with me here.)
In this case it was about not stating some of the settings (HT on/off turbo on/off etc). Instead of ranting, why not just ask him and maybe try it yourselves (those of you who got the hardware to do so) to validate his results (or not).

Long post and drifting away from topic. I'll stop.

Yes I agree! I said I don't know. Honest though, I or no one here should have to ask for a simple setup disclaimer or list/notes. I'm not bashing him or anyone else who thinks this test was OK! On that note, others shouldn't be jumping folks who thought it wasn't. There are some pretty good posts in this thread. The biggest "rant" was Pro or for this reveiw IMHO.

Martin I got nothing but respect for you and Villa1n. It gets harder and harder to just to talk with anyone here without having them act out just because of a disagreement. Or have some even what I call Grudge-flame I asked about HT being on or off? The point was that it wasn't disclosed. I'm not about to just blindly take most folks' word for it and it doesn't matter what the results are. I still think the test/s should have been done with 24GB in each system with the same data sets and let the chips fall where they may.

**informal** · 07-01-2009, 06:28 AM

AT already posted their results which clearly show that in Linpack,Shanghai is marginally behind Nehalem at the same clock. SMT does not provide any benefit in this workload and can even hurt performance on Nehalem system. That being said,real cores scale much better in this HPC workload and Istanbul is a performing better here.

**Donnie27** · 07-01-2009, 07:24 AM

Originally Posted by informal

AT already posted their results which clearly show that in Linpack,Shanghai is marginally behind Nehalem at the same clock. SMT does not provide any benefit in this workload and can even hurt performance on Nehalem system. That being said,real cores scale much better in this HPC workload and Istanbul is a performing better here.

That's moot informal! We know! Now post about Z's point about "Problem" size?

**ajaidev** · 07-01-2009, 07:25 AM

I am more interested in how much AMD has to pay and how much profit it gets.

I am not very sure about how much SMT Implementation costs to intel per chip, other than that High-k must be more expensive to implement than AMD's SOI. This all translates into processor prices.

The x5500 has 731M transistors and the Istanbul's have 904 million transistors, so that 173M tran. more and these cost about 83mm2 more per die. But this is not bad as shanghai was 758 million transistors and did not come very close to performance with the x5500.

In the end of the day what really matters is the platform cost, performance per watt and ideal energy consumption and AMD did a fine job in at least 2 of these. In a way this reminds me of the X7460, i am sure that a X7460 would out perform the x5500 also but eat up more energy as compared to the Istanbul in the process.

EDIT:- People say that the Xeon x5500 has much more cache than the Opteron, well if we see the total cache the X550 has 9Mb "L2-1Mb and the L3-8Mb" and the 2435 has 9Mb "L2-3Mb and the L3-6Mb"

Now i know that the 2435 has to share its cache among 6 cores that means .5Mb of L2 exclusive and L3 shared. The x5500 on the other hand has .25Mb L2 for each core and the L3 is shared.

Speed of the cache of L2 and L3 differ among similar speed processor's. Generally the speed table is something like this in 2435 vs x5500 in terms of performance:-

Intel's L2<AMD's L2<Intel's L3<AMD's L3

So, AMD's cores get much more or faster L2 exclusively as compared to the either limited L2 or slower L3 the x5500 cores get.

**demonkevy666** · 07-01-2009, 09:31 AM

Originally Posted by ajaidev

I am more interested in how much AMD has to pay and how much profit it gets.

I am not very sure about how much SMT Implementation costs to intel per chip, other than that High-k must be more expensive to implement than AMD's SOI. This all translates into processor prices.

The x5500 has 731M transistors and the Istanbul's have 904 million transistors, so that 173M tran. more and these cost about 83mm2 more per die. But this is not bad as shanghai was 758 million transistors and did not come very close to performance with the x5500.

In the end of the day what really matters is the platform cost, performance per watt and ideal energy consumption and AMD did a fine job in at least 2 of these. In a way this reminds me of the X7460, i am sure that a X7460 would out perform the x5500 also but eat up more energy as compared to the Istanbul in the process.

EDIT:- People say that the Xeon x5500 has much more cache than the Opteron, well if we see the total cache the X550 has 9Mb "L2-1Mb and the L3-8Mb" and the 2435 has 9Mb "L2-3Mb and the L3-6Mb"

Now i know that the 2435 has to share its cache among 6 cores that means .5Mb of L2 exclusive and L3 shared. The x5500 on the other hand has .25Mb L2 for each core and the L3 is shared.

Speed of the cache of L2 and L3 differ among similar speed processor's. Generally the speed table is something like this in 2435 vs x5500 in terms of performance:-

Intel's L2<AMD's L2<Intel's L3<AMD's L3

So, AMD's cores get much more or faster L2 exclusively as compared to the either limited L2 or slower L3 the x5500 cores get.

512 x6 +6144 L3 = 9216
256 x4 + 8192 L3 =9216

I think intel needs less transistors for cache.
I think AMD is still using sdram type for L3 which takes about 6 and intel uses a new way for 1
the nehalem has lots of transistors for Core logic, and IPC.

**Shintai** · 07-01-2009, 09:53 AM

Originally Posted by demonkevy666

512 x6 +6144 L3 = 9216
256 x4 + 8192 L3 =9216

I think intel needs less transistors for cache.
I think AMD is still using sdram type for L3 which takes about 6 and intel uses a new way for 1
the nehalem has lots of transistors for Core logic, and IPC.

SRAM cant be 1 transistor. Nehalem is 6 transistors per SRAM in the L3. And 8 transistors per SRAM in the L1 and L2 for power saving purposes.

**villa1n** · 07-01-2009, 12:06 PM

Originally Posted by Zucker2k

Facepalm. Platform? Please, don't be delirious, unless you're suggesting the x5550 system is incapable of handling more than 12GB. I have already said it, and I'll repeat, larger ram=larger problem size=larger gflops.

Please post a complete quote.... as I couldn't find that claim in that document; on the other hand, I found this:

It is obvious by the hardware being used that this is an old document, but I'll ignore that fact since the software aspect is more interesting to me; that is, the significance of ram size to the realization of higher gflops due to the allocation of a bigger problem size. I don't know how many times I've said this in this thread, but I don't mind repeating myself.

Here's more:

Older hardware/software combination using SSE2 for computations. Things are very different now; imc, tri-channel, qpi, sse4.1.

Ouch!

There goes your memory performance theory too, ala ddr3 vs ddr2.

As for you, you're blind, and can follow him over the edge of the cliff.

Again, you just search for Schema consistent information, as opposed to semantically converting what your reading into how it applies. You did find what i was pointing out in that article, you just skimmed over it, and focussed on the shortfalls of linpack... all of which have been avoided in the test. Like i said, this company does this for a living, they will not make the mistakes you are making or pointing out.

let me show you what you glossed over that is the key fact.

Another feature that stands out is that performance increases (nearly) monotonically with problem size. Performance grows rapidly until N is in the range of 5000 to 10000, after which performance continues to improve much more slowly.

notice the numbers.. they become important.. here is an illustration just to help you out incase you are a visual learner..
And the funny part is you had to click my link which stated "Impact of problemsize over 30k is almost negligible" and it meant exactly that.

The information your talking about is problem sizes 1/7, 1/4 smaller than the problem sizes used in the current test... where they cancelled out all the short comings BY USING A LARGER PROBLEM SET.

Now to be fair, you had to have seen this graph, but maybe math is not your strong point so i ll post and explain.

As you can see, the flattening out of the scores, shows that the bigger problem size removes those big variations that are weaknesses of SMALL N sizes. The main point in the article is about using small N sets to determine performance.. 1000-10000 . As you can see above 30 000 size for N, the Gflops peak and almost flatline, meaning the processors reach near maximum. So adding the ram you want to nehalem lets say to 24gb, allowing the size of N to be lets say 50 000 would only yield maybe .5% performance increase on gflops.. this is why they set custom ram sizes, to get each architectures most efficient spot, since istanbul has 6 cores, its max was higher, 6 physical cores.

You ve also saved me the time in showing that SMT has no effect on linpack..this is why the test shows that in LINPACK type scenarios ISTANBUL outperforms NEHALEM. Which is what is important to the clients since their applications directly reflect the type of calculations linpack does.

you also quoted

nfortunately, both benchmarks are still occasionally used by customers trying to assess server performance. When this happens, it should be pointed out that very small Linpack runs will probably not measure what they believe it will, and that better alternatives are available.

This is true, this is why if your looking for stream processing applications you should look to nehalem, this is why if your server is used for a multitude of different tasks you should probably go with nehalem, (depending on performance needs), but when your needs are cluster computing, with specific applications that emulate the calculations used in linpack, you should take heed to the results posted!! which is why the article was posted!! you keep expanding the scope, the scope is defined and specific, these results dont generalize to other areas, thats why they have tests to test other areas..!?

As for the ddr2 vs ddr3 you posted, anandtech showed this too, that it only amounts to a small difference, but they give it to nehalem to use it to the max, even if it only adds 6-8% over a ddr2 config, because you would use it in your construction of the node. I think thats what you fail to realize, this test is constructing, the most efficient node for each platform, and keeping everything else aside platform specific advantages the same, to assess which arch is the most efficient for designing clusters that use linpack calculations. thats it, pure and simple. Again i fail to see what is difficult to grasp about that?

You ve quoted random things, that don't apply to this situation, and i will blame that on schema consistant bias.

As well, please stop ending with insults, its really uneeded, and inappropriate. If you knew someone who was in a car accident and cant remember his wife or children,

is not the appropriate way. You are misusing the term as it is, but i will grant you ignorance in its actual psychological use, as you probably saw the term on tv.

**Hornet331** · 07-01-2009, 12:18 PM

Its kinda funny how HPC get thrown into here.

I didn't know gainestown was the HPC platform, i thought beckton on boxbro was.

.

**poke349** · 07-01-2009, 12:37 PM

Originally Posted by Hornet331

Its kinda funny how HPC get thrown into here.

I didn't know gainestown was the HPC platform, i thought beckton on boxbro was.

.

Anything with more than 1 socket kinda implies it's HPC. There's no reason have more than 1 socket other than for the computing power.

EDIT: Well... dual-socket mobos have more ram... but if you need that much ram anyway, you can definitely use more computing power.

**villa1n** · 07-01-2009, 12:41 PM

Originally Posted by Donnie27

That's moot informal! We know! Now post about Z's point about "Problem" size?

The thing is its not moot, it explains the results perfectly. 4 cores nehalem vs 4 cores shanghai = close results.

4cores nehalem vs 6 cores istanbul = istanbul ahead... this is obvious because its basically like adding a dual core ontop of a 4core shanghai, so if the results were close to begin with in the original comparison, and SMT has no effect or negative effect, then the addition of TWO more real cores on die, should put AMD ahead, pure and simple

I addressed the problem sized arguement. with problem sizes over 30k, variations are so small they are insignificant. So each platform was tuned to its most efficient use of cores. For nehalem, a triple channel 12gb set up with a 36k problem size N yielded the highest efficiency for the platform at 74gflops with 87% efficiency. that is 4 cores!! the cores pack way more punch than Istanbul, the problem is there is only 4 of them.
Istanbul with a 41k N size and 16gb ram required to flesh out hte problem size, hit 99 Gflops with 6cores!
The reason Istanbul is more efficient is that 6cores are on one die, hence allowing for EACH node of the HPC to have 99Gflops vs 74Gflops for the same cost. This is important in a Cost efficiency scaling market. There is no question single core vs single core that nehalem is a more powerful architecture, it just happens that in this case it is not the right choice in construction of Clusters for clients who have LinPack type calculation needs.

**villa1n** · 07-01-2009, 12:43 PM

Originally Posted by Hornet331

Its kinda funny how HPC get thrown into here.

I didn't know gainestown was the HPC platform, i thought beckton on boxbro was.

.

by themselves they are not, but these are nodes, not single servers.

**marten_larsson** · 07-01-2009, 01:41 PM

I believe villa1n has shown pretty clear that this test was not as bad as many has stated. You only need to look at the AT numbers to see that the claims could very well be reasonable. No one is questioning AT, right?

**Donnie27** · 07-01-2009, 02:16 PM

Originally Posted by villa1n

The thing is its not moot, it explains the results perfectly. 4 cores nehalem vs 4 cores shanghai = close results.

4cores nehalem vs 6 cores istanbul = istanbul ahead... this is obvious because its basically like adding a dual core ontop of a 4core shanghai, so if the results were close to begin with in the original comparison, and SMT has no effect or negative effect, then the addition of TWO more real cores on die, should put AMD ahead, pure and simple

I addressed the problem sized arguement. with problem sizes over 30k, variations are so small they are insignificant. So each platform was tuned to its most efficient use of cores. For nehalem, a triple channel 12gb set up with a 36k problem size N yielded the highest efficiency for the platform at 74gflops with 87% efficiency. that is 4 cores!! the cores pack way more punch than Istanbul, the problem is there is only 4 of them.
Istanbul with a 41k N size and 16gb ram required to flesh out hte problem size, hit 99 Gflops with 6cores!
The reason Istanbul is more efficient is that 6cores are on one die, hence allowing for EACH node of the HPC to have 99Gflops vs 74Gflops for the same cost. This is important in a Cost efficiency scaling market. There is no question single core vs single core that nehalem is a more powerful architecture, it just happens that in this case it is not the right choice in construction of Clusters for clients who have LinPack type calculation needs.

Sorry, that's not what he's talking about and what I called moot. We know Hyper-threading can affect many tests. I never said if Hyperthreading affected anything good or bad. That = moot! I asked as if it was disclosed?

According to this tests, actually Nehalem was more efficient, look again, I know it was a long time ago.

Results:
CPU Model Problem Size (N) Theoretical Peak Actual Peak Efficiency Node Cost $ per GFLOP
Nehalem X5550 2.66GHz 35840 85.12 GFLOPS 74.03 GFLOPS 86.97% $3,800.00 $51.33
Istanbul 2435 2.6GHz 41216 124.8 GFLOPS 99.38 GFLOPS 79.63% $3,500.00 $35.21

I'm sure he could have used the 6 core based Dunnington as well

**marten_larsson** · 07-01-2009, 03:05 PM

Originally Posted by Donnie27

Sorry, that's not what he's talking about and what I called moot. We know Hyper-threading can affect many tests. I never said if Hyperthreading affected anything good or bad. That = moot! I asked as if it was disclosed?

According to this tests, actually Nehalem was more efficient, look again, I know it was a long time ago.

Results:
CPU Model Problem Size (N) Theoretical Peak Actual Peak Efficiency Node Cost $ per GFLOP
Nehalem X5550 2.66GHz 35840 85.12 GFLOPS 74.03 GFLOPS 86.97% $3,800.00 $51.33
Istanbul 2435 2.6GHz 41216 124.8 GFLOPS 99.38 GFLOPS 79.63% $3,500.00 $35.21

I'm sure he could have used the 6 core based Dunnington as well

Well, he already said that the Nehalem performed better per core (or % of theoretical performance as well) but that's not the issue. The fact remains that the Istanbul performed better per chip than Nehalem did therefor higher efficiency per chip. You also clearly showed that Istanbul was far cheaper per GFLOP.

Would've been interesting to see power figures as well though.

**Hornet331** · 07-01-2009, 03:15 PM

Originally Posted by marten_larsson

Would've been interesting to see power figures as well though.

SpecPower

http://www.tecchannel.de/server/proz...35/index8.html

A 2S 2435 is a bit worse then a 2S X5570.

tought the Xeon system is a bit handycaped due to a rendundant PSU.

with one PSU it scores 100 Points more:
http://www.tecchannel.de/server/proz...5/index12.html

**villa1n** · 07-01-2009, 03:34 PM

Originally Posted by Donnie27

Sorry, that's not what he's talking about and what I called moot. We know Hyper-threading can affect many tests. I never said if Hyperthreading affected anything good or bad. That = moot! I asked as if it was disclosed?

According to this tests, actually Nehalem was more efficient, look again, I know it was a long time ago.

Results:
CPU Model Problem Size (N) Theoretical Peak Actual Peak Efficiency Node Cost $ per GFLOP
Nehalem X5550 2.66GHz 35840 85.12 GFLOPS 74.03 GFLOPS 86.97% $3,800.00 $51.33
Istanbul 2435 2.6GHz 41216 124.8 GFLOPS 99.38 GFLOPS 79.63% $3,500.00 $35.21

I'm sure he could have used the 6 core based Dunnington as well

Yes, i said that in my post.. nehalem per core efficiency was higher by about 7 percent. But in overall node performance, there was 25 gflop difference. The point of the test, which is relevant to the company was per node efficiency, total gflop output, vs price.

Dunnington could have been used, if they wanted maximum gflops regardless of cost.. but a quick google, and i m sure those in the server market can find lower prices, but this price seems to put dunnington out of the price performance market, with the cost of the cpu being double an istanbul , for comparison X5550 is about the same as opteron 2435, with the price here.

**~~Zucker2k~~** · 07-01-2009, 04:29 PM

Originally Posted by villa1n

Bla, bla bla...

It's simple, and I addressed it, but you're too slow to catch it so let me explain. You're comparing Nehalem's peak gflops throughput in Linpack to Intel processors released in 2004/2005 respectively.

Even a newbie won't make that comparison. The biggest flaw of your argument is that you're assuming that a combination of all the hardware (improvements in Nehalem) and software improvements (in linpack) makes no difference to the perceived hardware/software characteristics 4 years ago. Well, all that is is speculation until we know for sure that these characteristics are the same with Nehalem/Linpack today. Even you should know better than that.

I don't know where you get the idea I'm insulting you, you must be a very sensitive individual so here's an advice, you've dished more than you've proven you can take and believe me, if I start dishing you'd know for sure, so refrain from being rude before you demand you be treated otherwise.

**Hans de Vries** · 07-01-2009, 04:58 PM

Originally Posted by Donnie27

Results:
CPU Model Problem Size (N) Theoretical Peak Actual Peak Efficiency Node Cost $ per GFLOP
Nehalem X5550 2.66GHz 35840 85.12 GFLOPS 74.03 GFLOPS 86.97% $3,800.00 $51.33
Istanbul 2435 2.6GHz 41216 124.8 GFLOPS 99.38 GFLOPS 79.63% $3,500.00 $35.21

This shows how extremely efficient Intel's MKL (Math Kernel Library) is.
The actual peak scores are very close to the theoretical maximum which
is pretty amazing given that the datasets are all in DRAM.

AMD's own Math library is improving lately (if this here is representative)
comparing Istanbul with AMD's library to Nehalem with intel's library.

http://forums.amd.com/devblog/

With the remark that it's now acknowledged that there will be higher
speed SE Istanbuls in the near future (2.9GHz ???)

These results are one of the reasons that many Opteron based super
computers will be upgraded to Istanbul. For example the next no.1 on
the Supercomputer list, Jaguar which will reach a peak performance
of well over 2 PetaFlops this summer.

http://blogs.knoxnews.com/knx/munger...astest_co.html

The use of these highly efficient Math libraries is essential to reach
an actual performance which is not too far from the peak performance.

Regards, Hans.

**villa1n** · 07-01-2009, 05:46 PM

Originally Posted by Zucker2k

It's simple, and I addressed it, but you're too slow to catch it so let me explain. You're comparing Nehalem's peak gflops throughput in Linpack to Intel processors released in 2004/2005 respectively.

Even a newbie won't make that comparison. The biggest flaw of your argument is that you're assuming that a combination of all the hardware (improvements in Nehalem) and software improvements (in linpack) makes no difference to the perceived hardware/software characteristics 4 years ago. Well, all that is is speculation until we know for sure that these characteristics are the same with Nehalem/Linpack today. Even you should know better than that.

I don't know where you get the idea I'm insulting you, you must be a very sensitive individual so here's an advice, you've dished more than you've proven you can take and believe me, if I start dishing you'd know for sure, so refrain from being rude before you demand you be treated otherwise.

Sigh. I am showing how as N size increases, the scores flatten out, ie, the advantages of having enough memory to hold a bigger problemset, become less and less, as the size increases, after a certain threshold. Of course LinPack and CPu's have improved...but the math is the same, and we re still all x86 architecture..? So the graph may have shifted, and of course the Gflops have increased thats apparent, but the shape and distribution will hold.. you add a constant, the shape still says the same..?

Hans graph illustrates nicely what they were trying to avoid in the test by using large datasets, at the 400-500pt mark, look at the jump, and how distorted the results get, but as the problem size increases, they flatten out, and from that chart it looks as if nehalem at the end starts dropping off. I honestly think we re just so focused on micro applications, relatively ie desktop OS performance, that we forget CPU's have other largescale applications where strengths and weaknesses from desktop scale applications do not apply. I find it refreshing, because my assumption would have initially been, that nehalem would have out performed istanbul, by a fair margin, again, the data shows differently, the comparison given the context and parameters used, makes sense and demonstrates the advantages for the intended use.

As for the assumptions, I never made the assumption that hardware hasn't improved, I just used 2 data points, scores from today, and scores from then to show how scaling with these types of calculations work in processors.

As for the insults, you insulted me twice in the latter paragraph, calling me slow as a reference to a mental handicap or deficit of some sort, and calling me a newbie with the comparison that someone completely unfamiliar with what this is about, would not be foolish enough to make a foolish comparison.. It has nothing to do with sensitivity, although i will admit, the amnesia remark you made is a soft spot, since i have personal experience with people losing memories of family and self, post car accident, and i don't find it a laughing matter, as it wasn't said in jest on your part.
It has to do with respect and discussing something, I assume each others aim is to come to consensus as to what the data implies, not a matter of right and wrong, better or worse, I would think your goal is to improve understanding, but maybe i m too optimistic in that regard. I m not demanding anything, you can do what you want, I will just cease to respect or acknowledge you, its completely your perogative.

Again, i will flat out ask, What is it you are dissatisified with, or what do you hope the outcome of this will be. If you respond argumentatively despite whats being shown, then I might as well stop my discussion, as you are not seeking the same aims I am, if your goal is to troll any data that shows amd ahead in something. I m neither for or against either companies, they are a provider of something i like to play with as a hobby, and to that effect i take interest in both their strengths and weaknesses, since amd has been on the weak side for a while its nice to see them coming back and forcing innovation on both sides.

Originally Posted by Hans de Vries

This shows how extremely efficient Intel's MKL (Math Kernel Library) is.
The actual peak scores are very close to the theoretical maximum which
is pretty amazing given that the datasets are all in DRAM.

AMD's own Math library is improving lately (if this here is representative)
comparing Istanbul with AMD's library to Nehalem with intel's library.

http://forums.amd.com/devblog/

With the remark that it's now acknowledged that there will be higher
speed SE Istanbuls in the near future (2.9GHz ???)

These results are one of the reasons that many Opteron based super
computers will be upgraded to Istanbul. For example the next no.1 on
the Supercomputer list, Jaguar which will reach a peak performance
of well over 2 PetaFlops this summer.

http://blogs.knoxnews.com/knx/munger...astest_co.html

The use of these highly efficient Math libraries is essential to reach
an actual performance which is not too far from the peak performance.

Regards, Hans.

Thanks for the further Clarification Hans

That Picture is Fantastic!

**marten_larsson** · 07-02-2009, 03:12 AM

Originally Posted by Hornet331

SpecPower

http://www.tecchannel.de/server/proz...35/index8.html

A 2S 2435 is a bit worse then a 2S X5570.

tought the Xeon system is a bit handycaped due to a rendundant PSU.

with one PSU it scores 100 Points more:
http://www.tecchannel.de/server/proz...5/index12.html

I was referring to this specific task since it's the one being reviewed/tested.

And thank you Hans for clearing this up even further!

**~~Zucker2k~~** · 07-02-2009, 04:19 AM

Originally Posted by marten_larsson

I was referring to this specific task since it's the one being reviewed/tested.

And thank you Hans for clearing this up even further!

How did he do that?

**Ursus** · 07-02-2009, 05:04 AM

Originally Posted by villa1n

As for the insults, you insulted me twice in the latter paragraph, calling me slow as a reference to a mental handicap or deficit of some sort, and calling me a newbie with the comparison that someone completely unfamiliar with what this is about, would not be foolish enough to make a foolish comparison.. It has nothing to do with sensitivity, although i will admit, the amnesia remark you made is a soft spot, since i have personal experience with people losing memories of family and self, post car accident, and i don't find it a laughing matter, as it wasn't said in jest on your part.
It has to do with respect and discussing something, I assume each others aim is to come to consensus as to what the data implies, not a matter of right and wrong, better or worse, I would think your goal is to improve understanding, but maybe i m too optimistic in that regard. I m not demanding anything, you can do what you want, I will just cease to respect or acknowledge you, its completely your perogative.

Again, i will flat out ask, What is it you are dissatisified with, or what do you hope the outcome of this will be. If you respond argumentatively despite whats being shown, then I might as well stop my discussion, as you are not seeking the same aims I am, if your goal is to troll any data that shows amd ahead in something. I m neither for or against either companies, they are a provider of something i like to play with as a hobby, and to that effect i take interest in both their strengths and weaknesses, since amd has been on the weak side for a while its nice to see them coming back and forcing innovation on both sides.

Thanks for the further Clarification Hans

That Picture is Fantastic!

I have been reading posts on these forum for a while now, and today I decided to post for the first time.
I am amazed at your patience in dealing with an immature troll. (I am not just talking about this thread)
You stay calm and explain yourself in a clear manner even when someone is being impossible. I just had to say this and give you credit for your posting history, especially in this thread.

I am very interested in this particular subject as I run a business that could benefit from cluster computing, and the obvious first choice we will have to make is will we go with amd or intel.

Way to go.

Thread: Istanbul vs. Nehalem clck-4-clock (almost) in Linpack

Thread Tools

Search Thread

Rate This Thread

Display

Bookmarks

Bookmarks

Posting Permissions