
Thread: 48-cores

  1. #26
    Xtreme Member
    Join Date
    Dec 2009
    Location
    Southern California
    Posts
    175
    If you were to use small FFT, most of the instructions are loaded in the CPU's cache, so it may be more intensive...
    Nice scaling on wprime

    FEA is really computationally intensive. What software do you use at work? As engineering undergraduates, we mostly use SolidWorks, though some people use COMSOL.
    What if the hokey pokey really is what it is all about?

  2. #27
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Ontario
    Posts
    349
    Quote Originally Posted by AFQ View Post
    What are you using to cool it? Load temps are near 30 :o
    Air. And those are the low temperatures, not load temps. At full load, it's between 48 and 56 C.

    (Since the screenshot was taken, the system's been moved into the server room where it's significantly cooler).

    Quote Originally Posted by Movieman View Post
    Alpha:
    I'll save you the thought process.
    It won't OC, not even the dualie MC's will OC.
    What they are is rock solid and dependable.
    I have 2 of the 6168s (1900 MHz, 12 cores) in the Asus KGPE-D16 board and it's been at 100% load 24/7 since May without a hiccup.
    The Dynatron A6 heatsinks work excellently but are loud.
    And as to Sam's 27+ score in CB11.5, that was a fluke!
    Yes and thank Gawd for that. (That they're rock solid and stable).

    Quote Originally Posted by poke349 View Post
    Yeah, OpenMP (alone) is kinda useless beyond like 2 sockets. The best approach for these machines is to have one MPI process for each socket, and then use OpenMP within each socket.

    Obviously, converting an OpenMP program to an MPI one can require redesigning and rewriting it from scratch...
    Yea... OpenMP embedded in an MPI environment... you're almost asking for trouble at that point. Besides, I don't know if the BIOS actually enumerates the cores in order or if they're kinda random. I don't think that there's any way to truly tell (since it isn't like you can make one of the cores light up to indicate which processor it's on). Pity. It would be sooo helpful in MPP code development.

    Quote Originally Posted by FlawleZ View Post
    Does IntelBurn test stress the CPU more than Prime95?
    I don't think so. IntelBurnTest is based on LINPACK (HPL), and that's solving a linear algebra problem (a linear system of equations) using Gaussian elimination with partial pivoting. The total number of operations is O(n^3) and then some. So, yea. I forget what the formula was that gives the breakdown between the number of FPADD and FPMUL operations.
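    (For reference, the figure usually quoted for HPL on an n x n system is roughly 2/3*n^3 + 2*n^2 floating-point operations, with the n^3 part split about evenly between adds and multiplies.)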

    Quote Originally Posted by poke349 View Post
    I doubt it. IBT needs to transfer data between cores (which is slow for these massive systems).

    Prime95 does not.

    *Don't take my word for it, I could be completely wrong.
    You can download the source (written in C) for LINPACK HPL. I don't code, so it won't really mean anything to me. I'm not sure how they get the parallelization to work, but most supercomputers probably use MPI. And I'm not sure how much communication there is, because as far as I know, most of them do NOT increase the size of the problem (I think it's still on the order of either 100, 1000, or maybe 10000). So I would tend to believe that what really happens when they test the supercomputers with LINPACK HPL is that it spawns multiple copies of the problem and solves them independently of each other, but that's just my guess. I don't know for sure, and I haven't really been able to find precisely how supercomputers do their measurements.

    Quote Originally Posted by cdolphin View Post
    If you were to use small FFT, most of the instructions are loaded in the CPU's cache, so it may be more intensive...
    Nice scaling on wprime

    FEA is really computationally intensive. What software do you use at work? As engineering undergraduates, we mostly use SolidWorks, though some people use COMSOL.
    Ansys.

    COMSOL is VERY complicated. It's great if you know what's going on and it's great for multiphysics problems, but it isn't an easy program to work with. You pretty much DO need to be a Ph.D. to understand it.

    And my only other criticism of it is that it (at least) used to be Java, so Java errors were a fairly regular occurrence for me, which really put me off it (if the differential equations stuffed inside a matrix hadn't turned me off already).

    FEA CAN be very computationally intensive. It depends on how you set up your problem and what solver you use.

    The PCG solver is less intensive (from a strictly CPU perspective, because it doesn't do the whole Gaussian elimination - only FPADD and FPMUL operations), but as a solver it tends to take a long time to converge on a solution. However, as part of an analysis system, in some of the testing that I've done, it is at LEAST twice as fast as using the sparse solver (which is quite counterintuitive).
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  3. #28
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Yea... OpenMP embedded in an MPI environment... you're almost asking for trouble at that point. Besides, I don't know if the BIOS actually enumerates the cores in order or if they're kinda random. I don't think that there's any way to truly tell (since it isn't like you can make one of the cores light up to indicate which processor it's on). Pity. It would be sooo helpful in MPP code development.

    It's a very standard approach we use at UIUC. Virtually all our supercomputing research and projects do this. Typically we'll have a system with 1024 or more sockets - each with 4 - 8 cores.
    If MPI is installed and configured correctly (and they are on all our clusters), it will place a different process in each socket. All threads of that process will be restricted to cores belonging to that socket.

    The effect is that you can restrict all the MPI overhead to inter-node/inter-socket communication. All intra-socket communication goes through fast shared memory. Parallelism there is done using OpenMP since it has no message-passing overhead.
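    Just to make that concrete, here's a bare-bones hybrid "hello world" sketch (assuming any MPI implementation plus an OpenMP-capable C compiler - this isn't taken from any of our actual codes):

        #include <mpi.h>
        #include <omp.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, nprocs, provided;

            /* FUNNELED: only the master thread of each process makes MPI calls. */
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

            /* One process per socket; its OpenMP threads share that socket's memory. */
            #pragma omp parallel
            printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
                   rank, nprocs, omp_get_thread_num(), omp_get_num_threads());

            MPI_Finalize();
            return 0;
        }

    Launched as, say, 4 processes with 12 threads each, each rank's OpenMP threads stay on that rank's socket (given the binding described above).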


    EDIT:
    Dave if you're reading this... No, I can't run WCG on these machines.
    They're research machines that run Linux - and I would get fired if I hogged a whole machine (> 1000 cores) for more than an hour.
    Last edited by poke349; 11-30-2010 at 07:03 PM.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  4. #29
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Ontario
    Posts
    349
    Quote Originally Posted by poke349 View Post
    It's a very standard approach we use at UIUC. Virtually all our supercomputing research and projects do this. Typically we'll have a system with 1024 or more sockets - each with 4 - 8 cores.
    If MPI is installed and configured correctly (and they are on all our clusters), it will place a different process in each socket. All threads of that process will be restricted to cores belonging to that socket.

    The effect is that you can restrict all the MPI overhead to inter-node/inter-socket communication. All inner-socket is fast and shared memory. Parallelism here is done using OpenMP since it has no message-passing overhead.


    EDIT:
    Dave if you're reading this... No, I can't run WCG on these machines.
    They're research machines that run Linux - and I would get fired if I hogged a whole machine (> 1000 cores) for more than an hour.
    So your program has both MPI and OpenMP pragmas then? (or whatever their equivalents are for that -- I'm not a programmer, so I have no idea).

    That's certainly an interesting way of doing things.

    How can you ID the sockets though?

    And what's UIUC?
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  5. #30
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by alpha754293 View Post
    So your program has both MPI and OpenMP pragmas then? (or whatever their equilvalents are for that -- I'm not a programmer so I have no idea).

    That's certainly an interesting way of doing things.

    How can you ID the sockets though?

    And what's UIUC?

    MPI doesn't use pragmas. The programming model is completely different from OpenMP. You have multiple processes running the same program, and you can ID your process. Since every process is running its own copy of the program, you can do whatever you want within the program - including OpenMP.

    Each process has its own memory space, and they are completely independent. Communication between them is done explicitly by the programmer.

    Since each instance of the program is running on its own socket/node, you don't need to ID which physical socket it's on - since all sockets/nodes are considered equal. (for heterogeneous machines, there's more to be done, but I won't go into that.)

    UIUC = University of Illinois @ Urbana-Champaign
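    To make the "ID your process and send data explicitly" part concrete, a minimal sketch looks something like this (generic MPI in C, not any particular application; run it with at least two processes, e.g. mpirun -np 2):

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank;
            double data[4] = {0};

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* each process learns its own ID */

            if (rank == 0) {
                data[0] = 3.14159;
                /* Explicitly ship a copy of the buffer to process 1. */
                MPI_Send(data, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(data, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                /* From here on, rank 1 reads its own local copy - no more traffic. */
                printf("rank 1 received %f\n", data[0]);
            }

            MPI_Finalize();
            return 0;
        }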
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  6. #31
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Ontario
    Posts
    349
    Quote Originally Posted by poke349 View Post
    MPI doesn't use pragmas. The programming model is completely different from OpenMP. You have multiple processes running the same program, and you can ID your process. Since every process is running it's own copy of the program you can do whatever you want within the program - including OpenMP.

    Each process has it's own memory space and are completely independent. Communication between them is done explicitly by the programmer.

    Since each instance of the program is running on its own socket/node, you don't need to ID which physical socket it's on - since all sockets/nodes are considered equal. (for heterogeneous machines, there's more to be done, but I won't go into that.)

    UIUC = University of Illinois @ Urbana-Champaign
    If you can't tell which cores are on which socket, then how can you be sure that it's MPI between and OpenMP within?

    Like, I was testing the STREAM memory benchmark to try and find out what the memory bandwidth was intra-socket vs. inter-socket, and I couldn't really come up with any good way of doing it because, well... for one, the version that I was using was OpenMP (I don't think that it had any MPI in it). And two, whenever I reset OMP_NUM_THREADS, it would re-assign the core IDs or something, such that on a 48-core system like this, if I set OMP_NUM_THREADS=24, run the benchmark, set OMP_NUM_THREADS=48, run it again, then set OMP_NUM_THREADS=24 and run it a third time, the cores that were used in the first 24-"thread" test may or may not be the same as the cores that were used in the second.

    And the only way that I would have been able to "control" that would be to set the processor affinity of the command window (it's still running Windows) before starting the benchmark, so that it binds to the cores consistently.
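    For what it's worth, a tiny OpenMP program that just reports which logical CPU each thread lands on would show whether the runtime is reusing the same cores between runs. A rough sketch (this assumes GetCurrentProcessorNumber(), which Windows has had since Vista/Server 2008):

        #include <stdio.h>
        #include <windows.h>   /* GetCurrentProcessorNumber() */
        #include <omp.h>

        int main(void)
        {
            /* Each OpenMP thread prints the logical CPU it is currently running on,
               so you can see whether the runtime reused the same cores between runs. */
            #pragma omp parallel
            printf("thread %2d -> logical CPU %lu\n",
                   omp_get_thread_num(), GetCurrentProcessorNumber());
            return 0;
        }

    Most OpenMP runtimes also have an affinity knob (e.g. KMP_AFFINITY for Intel's compiler, GOMP_CPU_AFFINITY for gcc) that pins threads to fixed cores between runs.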
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  7. #32
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by alpha754293 View Post
    If you can't tell which cores are on which socket, then how can you be sure that it's MPI between and OpenMP within?

    Like, I was testing the Stream memory benchmark to try and find out the memory bandwidth was intra-socket vs. inter-socket and I couldn't really come up with any good way of doing it because well..for one, the version that I was using was OpenMP. (I don't think that it had any MPI in it). And two, whenever I reset the OMP_NUM_THREADS, it would like re-assign the core ID or something such that on a 48-core system like this, and if I set OMP_NUM_THREADS=24, run the benchmark, OMP_NUM_THREADS=48, run it again, OMP_NUM_THREADS=24 and run it a third time; the cores that were used in the first 24-"thread" test may or may not be the same as the cores that were used in the second.

    And that the only way that I would have been able to "control" that would be to set the processor affinity in the command window (it's still running Windows) before starting benchmark, so that it will bind it to the cores consistently.
    It's hard to explain unless you know how the MPI model works.


    I guess I'll start with OpenMP. And I'm probably gonna say a bunch of things you already know.

    When you use OpenMP, it will create a bunch of threads and randomly assign them to whatever cores it is allowed to use.

    On single-socket machines, this is perfectly fine because all cores share the same memory and have the same latency/bandwidth to all the memory.

    But on multi-socket/multi-node machines (yours probably has 4 sockets/nodes), memory isn't shared by all the cores. So the hardware essentially has to emulate it. The effect is NUMA (non-uniform memory access).
    If a core is accessing memory that happens to be in the same socket, it's fast. If it's in a remote socket, it needs to go through the inter-connect and is therefore very slow.
    This is why OpenMP falls apart on large systems with many sockets/nodes.

    Here's where MPI comes in. (now I'm gonna simplify things here.)
    The MPI model uses multiple processes. Each process (and all its threads) is locked to a specific socket/node. The protocol ensures this so the programmer doesn't need to worry about it.
    From the programmer's side, you don't know which process is on which socket, but you don't need to know because they'll all be the same (for most systems).
    This is similar to OpenMP, where you don't know which core each thread is on, but it doesn't matter since they're all the same.

    In MPI, each process is independent. They can't touch each other's memory. So the only way to get data between them is to do it explicitly. The programmer does this by calling an MPI function that sends XX data from process A to process B. This will send a copy of the data from A to B.

    Since each process is in its own socket, the only traffic over the inter-socket connection is these "messages" - which the programmer has complete control over.



    Now here is why MPI is faster than OpenMP for multi-socket/multi-node machines.

    In OpenMP, suppose you have two threads A and B.
    Thread A will compute some data. Then thread B will read it 100 times.
    In OpenMP, if A and B are in different sockets/nodes, that data will be sent over the interconnect 100 times.
    In MPI, the programmer will say, "Send the data from process A to process B". This moves the data from A to B. Now B has a local copy of the data. Therefore those 100 accesses will be very fast.

    OpenMP: 100 transfers over the interconnect.
    MPI: 1 transfer over the interconnect.


    Now, within each process/socket, you might want to run multiple threads. You can use OpenMP here. All threads for that process will be restricted to the cores within that socket. So you won't inadvertently thrash data over the interconnect.


    So in your case with 48 cores.

    The optimal way to do it is usually:
    Break the problem into 4 parts. At the MPI level, create 4 processes. MPI will automatically lock each process to a different socket.
    For each of these 4 parts, you run OpenMP with 12 threads.
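    On Linux, that launch usually boils down to something like OMP_NUM_THREADS=12 mpirun -np 4 ./your_solver, plus whatever socket-binding option your particular MPI offers (Open MPI has had flags along the lines of --bind-to-socket for this, but the exact spelling depends on the version, so treat that part as a sketch rather than gospel).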


    The reason you shouldn't run MPI with 48 processes is that you will have a lot of communication. Remember that each time MPI sends data from A to B, it makes a copy of it (and probably reads/writes it multiple times).
    But this is unnecessary if A and B are in the same socket. (Since memory is shared within the socket, there's no need to copy it.)


    Well, that's my "short" explanation. MPI is much more complicated and broad than this. I've been mentioning that MPI locks a process to a socket. But it can do other things if you have more/less processes than sockets/cores.
    It's theoretically possible to run 15 MPI processes on a 12 core machine... MPI will try to handle it in whatever way it thinks is optimal (and it has full knowledge of the hardware).
    I'm just trying to keep it simple.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  8. #33
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Ontario
    Posts
    349
    Quote Originally Posted by poke349 View Post
    It's hard to explain unless you know how the MPI model works.


    I guess I'll start with OpenMP. And I'm probably gonna say a bunch of things you already know.
    That's alright.

    Quote Originally Posted by poke349
    When you use OpenMP, it will create a bunch of threads and randomly assigns them to whatever cores it is allowed to use.

    On single-socket machines, this is perfectly fine because all cores share the same memory and have the same latency/bandwidth to all the memory.

    But on multi-socket/multi-node machines (yours probably has 4 sockets/nodes), memory isn't shared by all the cores. So the hardware essentially has to emulate it. The effect is NUMA (non-uniform memory access).
    If a core is accessing memory that happens to be in the same socket, it's fast. If it's in a remote socket, it needs to go through the inter-connect and is therefore very slow.
    This is why OpenMP falls apart on large systems with many sockets/nodes.
    It's 4 sockets, single node (if by node you are referring to a system in the (imaginary) cluster).

    Quote Originally Posted by poke349
    Here's where MPI comes in. (now I'm gonna simplify things here.)
    ok. Simple is good.

    Quote Originally Posted by poke349
    The MPI model uses multiple processes. Each process (and all its threads) is locked to a specific socket/node. The protocol ensures this so the programmer doesn't need to worry about it.
    From the programmer's side, you don't know which process is on which socket, but you don't need to know because they'll all be the same (for most systems).
    This is similar to OpenMP where you don't know which core each thread is, but it doesn't matter since they're all the same.

    In MPI, each process is independent. They can't touch each other's memory. So the only way to get data between them is to do it explicitly. The programmer does this by calling an MPI function that sends XX data from process A to process B. This will send a copy of the data from A to B.

    Since each process is in its only socket. The only traffic over the inter-socket connection are these "messages" - which the programmer has complete control over.

    Now here is why MPI is faster than OpenMP for multi-socket/multi-node machines.

    In OpenMP, suppose you have two threads A and B.
    Thread A will compute some data. Then thread B will read it 100 times.
    In OpenMP, if A and B are in different sockets/nodes, that data will be sent over the interconnect 100 times.
    In MPI, the programmer will say, "Send the data from process A to process B". This moves the data from A to B. Now B has a local copy of the data. Therefore those 100 accesses will be very fast.

    OpenMP: 100 transfers over the interconnect.
    MPI: 1 transfer over the interconnect.


    Now, within each process/socket, you might want to run multiple threads. You can use OpenMP here. All threads for that process will be restricted to the cores within that socket. So you won't inadvertently thrash data over the interconnect.
    So is parallelization with OpenMP ALWAYS done with/by multi-threading? I was also under the assumption that for multi-processing, it has to spawn a separate process.


    Quote Originally Posted by poke349
    So in your case with 48 cores.

    The optimal way to do it is usually:
    Break the problem into 4 parts. At the MPI level, create 4 processes. MPI will automatically lock each process to a different socket.
    For each of these 4 parts, you run OpenMP with 12 threads.


    The reason you shouldn't run MPI with 48 processes, is that you will have a lot of communication. Remember that each time MPI sends data from A to B, it makes a copy of it. (and probably reads/writes it multiple times)
    But this is unnecessary if A and B are in the same socket. (Since memory is shared within the socket, there's no need to copy it.)


    Well, that's my "short" explanation. MPI is much more complicated and broad than this. I've been mentioning that MPI locks a process to a socket. But it can do other things if you have more/less processes than sockets/cores.
    It's theoretically possible to run 15 MPI processes on a 12 core machine... MPI will try to handle it in whatever way it thinks is optimal (and it has full knowledge of the hardware).
    I'm just trying to keep it simple.
    So....suppose I wanted to time how long it takes for data to transfer INTRA-socket (between cores on the same physical chip), can I write it as MPI-OpenMP-MPI? Or how would I go about doing that?

    I understand that doing it INTER-socket would just be straight MPI.
    Last edited by alpha754293; 12-04-2010 at 04:47 PM.
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  9. #34
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Yes, OpenMP is always done by multi-threading. (I haven't seen otherwise.)


    So I asked my professor about how exactly MPI is implemented on these large machines. Turns out that it's a little different than I thought.

    Basically, every motherboard is running a separate OS. Within each motherboard are 2 or 4 sockets.
    Each node is defined as a single OS where all memory is shared. Within that node, the memory may be NUMA because of the different sockets.
    MPI processes are typically one per OS/motherboard, or one per socket. Sometimes you have to configure OpenMP or MPI to lock threads/processes to specific cores.

    So there's multiple levels of programming.
    From bottom up:

    All in one socket/node - use OpenMP.
    All in one motherboard - use OpenMP with affinities, or MPI
    Inter-node (cluster/supercomputer) - MPI only

    So I wasn't completely right about everything I said in my last post, but I was close.

    And the larger the supercomputer, the more levels you need. Each time you go up a level, the communication speed goes down. The slowest connection is between different server racks...
    An example is:
    All in one socket/node - use OpenMP.
    All in one motherboard - use OpenMP with affinities, or MPI
    Local Rack - MPI only
    Global - MPI only

    I'm not sure how they use MPI over multiple levels like this. I'll probably find out sometime in the next year or two when I get deeper into the program.


    EDIT:

    To answer your question:
    You can configure MPI/Linux to lock the processes into different sockets. Then you send a large message between different processes and time it.
    I'm still new to Linux so I don't know how to do this yet. But my prof said it is pretty simple.
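    The measurement itself is just a timed ping-pong between two ranks - something like this sketch (generic MPI in C; pin the two ranks to different sockets with your MPI's binding options or numactl to get the inter-socket number, and to the same socket for the intra-socket one):

        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(int argc, char **argv)
        {
            const int n = 8 * 1024 * 1024;              /* 8M doubles = 64 MB message */
            double *buf = calloc(n, sizeof(double));
            int rank;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            if (rank == 0) {                            /* bounce the buffer 0 -> 1 -> 0 */
                MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
            double t1 = MPI_Wtime();

            if (rank == 0)
                printf("round trip %.3f ms, ~%.2f GB/s each way\n",
                       (t1 - t0) * 1e3, 2.0 * n * sizeof(double) / (t1 - t0) / 1e9);

            MPI_Finalize();
            free(buf);
            return 0;
        }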
    Last edited by poke349; 12-04-2010 at 05:21 PM.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  10. #35
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Ontario
    Posts
    349
    Quote Originally Posted by poke349 View Post
    Yes, OpenMP is always done by multi-threading. (I haven't seen otherwise.)


    So I asked my professor about how exactly MPI is implemented on these large machines. Turns out that it's a little different than I thought.

    Basically, every motherboard is running a separate OS. Within each motherboard is 2 or 4 sockets.
    Each node is defined as a single OS where all memory is shared. Within that node, the memory may be NUMA because the different sockets.
    MPI processes are typically one per OS/motherboard, or one per socket. Sometimes you have to configure OpenMP or MPI to lock threads/processes to specific cores.

    So there's multiple levels of programming.
    From bottom up:

    All in one socket/node - use OpenMP.
    All in one motherboard - use OpenMP with affinities, or MPI
    Inter-node (cluster/supercomputer) - MPI only

    So I wasn't completely right about everything I said in my last post, but I was close.

    And the larger the supercomputer, the more levels you need. Each time you go up a level, the communication speeds goes down. The slowest connection being between different server racks...
    An example is:
    All in one socket/node - use OpenMP.
    All in one motherboard - use OpenMP with affinities, or MPI
    Local Rack - MPI only
    Global - MPI only

    I'm not sure how they use MPI over multiple levels like this. I'll probably find out sometime in the next year or two when I get deeper into the program.


    EDIT:

    To answer your question:
    You can configure MPI/Linux to lock the processes into different sockets. Then you send a large message between different processes and time it.
    I'm still new to Linux so I don't know how to do this yet. But my prof said it is pretty simple.
    I have to say, from what you've said, you have quite a remarkable and astounding understanding of it.

    Course, I am guessing that this is specifically what you're studying and if it isn't, then kudos to you man!

    I ask because, for example, with programs like Folding@Home, I've been arguing over there that F@H can be made to run distributed (across multiple systems/nodes), but they keep saying "no no no, cuz there's too much communication", and no one over there has been able to tell me HOW they measure it or how much communication there really is (i.e. no proof; they may be talking just from theory).

    I'm also interested in being able to measure the speeds.

    And when you said that you can tell which socket the processes belong to, I was even more curious because that would definitely come in handy.

    Right now, I'm running a single simulation job/task across 32 cores on the same physical system (node), so whenever I start it, I set the CPU affinity of the command prompt window (given that I'm still running Windows) before starting the actual run/simulation. By doing it that way, all of the MPI processes (slaves and master) will inherit the affinity of the parent window.

    And it works. What I can't do, though, is prevent migration WITHIN the 32-core affinity mask (not that I can "see" it happening, but I'm guessing that it probably is).

    I had a script once that someone wrote that did this (setting the CPU affinity) for F@H in Linux. I don't think that I have it anymore, and I don't really remember who wrote it either, but it would be nice to have something similar for Windows.
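    (A couple of Windows-side options that might do the job: newer versions of cmd.exe can start a program already pinned, e.g. start /affinity FF yourapp.exe with a hex mask, and a small wrapper along the lines of the sketch below - using the Win32 SetProcessAffinityMask() call - would do the same thing, since anything it launches inherits the mask.)

        #include <windows.h>
        #include <stdio.h>

        int main(void)
        {
            /* Pin this process - and anything it launches - to logical CPUs 0-31. */
            DWORD_PTR mask = 0xFFFFFFFF;
            if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
                printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
                return 1;
            }
            /* ...start the actual workload from here (system(), CreateProcess(), etc.)
               and it will run inside that affinity mask... */
            return 0;
        }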

    It's great that you seem to REALLLY know your stuff. More so than I do, and I thank you for your patience in answering my dumb questions.
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  11. #36
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Well, I'm just a first year grad student so it's not my area of expertise - yet.
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  12. #37
    PCMark V Meister
    Join Date
    Dec 2009
    Location
    Athens GR
    Posts
    771
    Is there a possibility to overclock the Tyan S7025??? I can't find anything on that :S

  13. #38
    Wuf
    Join Date
    Jul 2007
    Location
    Finland/Tampere
    Posts
    2,400
    Quote Originally Posted by Tiltevros View Post
    is there a possibility to overclock tyan s7025??? i cant find anything on that :S
    From the experience of the DP folks of xs.wcg I can say: No. Even if you find a SetFSB version that supports your mobo's PLL chip, you won't go over 136 BCLK. That's either by the design of the server mobo, or Intel has limited the max BCLK on server chipsets. The SR-2 is the ONE for OCing DP i7s.
    You use IRC and Crunch in Xs WCG team? Join #xs.wcg @ Quakenet
    [22:53:09] [@Jaco-XS] i'm gonna overclock this damn box!
    Ze gear:
    Main rig: W3520 + 12GB ddr3 + Gigabyte X58A-UD3R rev2.0! + HD7970 + HD6350 DMS59 + HX520 + 2x X25-E 32gig R0 + Bunch of HDDs.
    ESXI: Dell C6100 XS23-TY3 Node - 1x L5630 + 24GB ECC REG + Brocade 1020 10GbE
    ZFS Server: Supermicro 826E1 + Supermicro X8DAH+-F + 1x L5630 + 24GB ECC REG + 10x 3TB HDDs + Brocade 1020 10GbE
    Lappy!: Lenovo Thinkpad W500: T9600 + 8GB + FireGL v5700 + 128GB Samsung 830 + 320GB 2.5" in ze dvd slot + 1920x1200 @ 15.4"


  14. #39
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Ontario
    Posts
    349
    Quote Originally Posted by poke349 View Post
    Well, I'm just a first year grad student so it's not my area of expertise - yet.
    I will tell you that your knowledge and your area of expertise may prove to be handy one day.

    P.S. I tried running your Multi-threaded Pi program earlier today.

    Couldn't really get a good handle on it because I think that it's overloading the UPS when we run the system full blast.

    Hopefully that will be resolved tomorrow so that I can get some results back to you.

    P.S. #2 Have you ever tested the scalability of your Multi-threaded Pi program?
    Last edited by alpha754293; 12-06-2010 at 08:35 PM.
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}

  15. #40
    Xtreme Enthusiast
    Join Date
    Mar 2009
    Location
    Bay Area, California
    Posts
    705
    Quote Originally Posted by alpha754293 View Post
    I will tell you that your knowledge and your area of expertise may prove to be handy one day.
    Thanks! Well... it IS my research area right now. So I "better" learn it at some point for me to be of any use.

    P.S. I tried running your Multi-threaded Pi program earlier today.

    Couldn't really get a good handle on it because I think that it's overloading the UPS when we run the system full blast.

    Hopefully that will be resolved tomorrow so that I can get some results back to you.
    lol?

    P.S. #2 Have you ever tested the scalability of your Multi-threaded Pi program?
    I haven't, but others have. The scalability sucks on NUMA - because it's all shared-memory programming.
    On Windows, I use threads directly to bypass OpenMP overhead. But that doesn't solve the scalability problems of OpenMP. For that, as mentioned in earlier posts, I need MPI.
    Look at some of the quad-socket results on my thread. (I moved them off my thread to my site, but I link to the full list from my thread.)

    You can see that the quad-socket Barcelonas don't do too well... (they get beaten by single-socket i7s)
    There's also an 8-socket Barcelona in there - less than 10% faster than the quad-sockets at the same clock.

    The 4-socket Beckton machine gets like 40% less "throughput/cycle" compared to the Gainestowns and Westmeres...


    But... If you compare single-socket to dual-socket, they scale almost perfectly. (1.8x - 1.9x speedup from 1 -> 2 sockets @ same clock)

    Core 2 -> Harpertown: This is all uniform memory. Even the dual-socket Harpertown is uniform memory - both sockets go through the same external memory controller.
    Core i7 -> Gainestown: Gainestown/Westmere is NUMA, but barely so. The latency penalty for accessing the other socket's memory is only 30% - most of which gets hidden behind HyperThreading... lol
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)

  16. #41
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Ontario
    Posts
    349
    Quote Originally Posted by poke349 View Post
    Thanks Well... it IS my research area right now. So I "better" learn it at some point for me to be of any use.



    lol?



    I haven't, but others have. The scalability sucks on NUMA - because it's all shared-memory programming.
    On Windows, I use threads directly to bypass OpenMP overhead. But that doesn't solve the scalability problems of OpenMP. For that, as mentioned in earlier posts, I need MPI.
    Look at the some of the quad-socket results on my thread. (I moved them off my thread to my site, but I link to the full list from my thread.)

    You can see that the quad-socket Barcelona's don't do too well... (they get beaten by single-socket i7s)
    There's also an 8-socket Barcelona in there - less than 10% faster than the quad-sockets at the same clock.

    The 4-socket Beckton machine gets like 40% less "throughput/cycle" compared to the Gainestowns and Westmeres...


    But... If you compare single-socket to dual-socket, they scale almost perfectly. (1.8x - 1.9x speedup from 1 -> 2 sockets @ same clock)

    Core 2 -> Harpertown: This is all uniform memory. Even the dual-socket Harpertown is uniform memory - both sockets go through the same external memory controller.
    Core i7 -> Gainestown: Gainestown/Westmere is NUMA, but barely so. The latency penalty for accessing the other socket's memory is only 30% - most of which gets hidden behind HyperThreading... lol
    Okay, so it looks like we are exceeding the capabilities of the 1 kW power supply.

    That, I think, falls on the IT manager, because when we started speccing out the system, I told him to go with a 2U Supermicro rackmount barebones system, because then we'd know that everything was going to work.

    But he insisted on using a consumer-level case (Cooler Master Stacker 810) and he bought a Corsair 1 kW power supply.

    Trying to load the system up with just one core, and with all DIMMs populated -- the highest I saw on the Kill-A-Watt was 810 W before the system shut itself off/down.

    So it looks like now, he has to go hunt for another power supply.

    I haven't been able to run anything on it because of that, so it'll probably be like another two weeks before the system is back up and running and stable enough for me to do more testing. *rolls eyes*

    Oh well.

    In any case, I was using a program called CoreInfo (I think) that helped determine whether all four sockets were set up as NUMA or not.

    I think initially it mapped it such that there was only one NUMA node for all four sockets.

    But then people were telling me that that's bad, because it means the processors think there's one general pool of memory, so CPU0 could be trying to access memory belonging to CPU3 and it doesn't know any better.

    So, I had to play around with bank, node, and channel interleaving settings in the BIOS until CoreInfo showed four NUMA nodes.

    And that's what I've got it set to right now.
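    (If anyone wants to check the same thing: Coreinfo is the Sysinternals tool, and running it with no arguments dumps the core/socket/cache/NUMA layout; I believe coreinfo -n limits the output to just the NUMA node map, which is the part that matters here.)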

    *edit*
    You know your Multi-threaded Pi program -- is that OpenMP or MPI or some combination thereof?

    Also, do you think that you might be able to port a distributed version of that some time down the road? I only ask because our IT manager guy has pretty much said that by this time next year, he wants us to be at 128 cores total.
    Last edited by alpha754293; 12-07-2010 at 03:25 PM.
    flow man:
    du/dt + u dot del u = - del P / rho + v vector_Laplacian u
    {\partial\mathbf{u}\over\partial t}+\mathbf{u}\cdot\nabla\mathbf{u} = -{\nabla P\over\rho} + \nu\nabla^2\mathbf{u}
