
Thread: New Multi-Threaded Pi Program - Faster than SuperPi and PiFast


  1. #1
     rcofell | Xtreme Cruncher | Join Date: Apr 2005 | Location: TX, USA | Posts: 898
    Quote Originally Posted by poke349 View Post
    I'm getting speedups of up to 30% using numactl interleave on Linux. The speedup depends on the size - for smaller sizes, it sometimes backfires.
    I'll accept any settings as valid benchmarks. Tuning the OS like this is part of the game, lol.

    As far as I can tell, only Windows Server has any sort of NUMA-awareness support. But you need Win Server anyway to get more than 2 sockets in the first place.
    There might be an OS setting that forces interleaved memory allocation. If there isn't, maybe there's a special WinAPI malloc() function that does interleaved allocation. If I find it, I'll try it out.

    But in any case, interleaved memory allocation isn't a "solution" to NUMA.
    It lets the program get the full memory bandwidth, but it doesn't get rid of the interconnect contention and latency.

    The program will still need to be redesigned to run well on NUMA.
    I've been meaning to post in here for a while, as I've got access to an 8-socket Barcelona machine (with ~80 GB RAM) here in our research group at school that I've tried running things on. The NUMA nature of the machine definitely causes sub-optimal scaling for the smaller workloads (those below 1B).

    I tried numactl for 1B and performance goes from 648sec (without) to 466sec (interleave all), so it does make a noticeable difference. Almost a 1.4X speedup in this case.

    For comparison (without numactl), 10B takes 5,379sec and 100M takes 110sec.
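
    (For reference, a minimal sketch of what the interleave policy does, requested per-allocation from inside a program via libnuma instead of the numactl wrapper. Linux-only, link with -lnuma; illustrative only, not y-cruncher code.)

        // Hedged sketch: ask the kernel to stripe one big allocation's pages
        // round-robin across all NUMA nodes, so every socket contributes
        // memory bandwidth instead of one node holding the whole block.
        #include <numa.h>
        #include <cstdio>
        #include <cstddef>

        int main() {
            if (numa_available() < 0) {
                std::printf("no NUMA support on this machine\n");
                return 1;
            }
            size_t bytes = size_t(1) << 30;          // e.g. a 1 GiB working buffer
            void *buf = numa_alloc_interleaved(bytes);
            if (!buf) return 1;
            // ... run the computation out of buf ...
            numa_free(buf, bytes);
            return 0;
        }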

    Quote Originally Posted by poke349 View Post
    You are correct, but there's an even bigger problem with the program.
    y-cruncher allocates all the memory it needs in a single gigantic malloc(). The OS will put that entire memory block on the node that called the malloc(). So the result is that most (if not all) of the memory will be concentrated in a single node.

    The 4 nodes will all try to access that same memory - all of which is in one node. So not only do you have contention on the interconnect, but the aggregate memory bandwidth is only equal to that of one socket. (since all the memory is concentrated there)
    If the memory is interleaved, you still don't get rid of the interconnect traffic, but at least now you have 4 sockets of memory bandwidth.

    This problem does not occur in "normal" applications that do lots of small allocations. For these normal apps, the OS will bias all the allocations on the nodes that called them - so memory accesses are mostly local.
    But since y-cruncher does one massive malloc() at the beginning, the OS is not able to do this.

    There were several major reasons why I made the program do a single malloc() and its own memory management, as opposed to letting the OS handle it.
    1. Memory allocation is expensive - (hundreds/thousands of cycles?)
    2. malloc() is not thread-safe. Placing locks around it is not scalable. Even if it is thread-safe (with the right compiler options), it's still implemented with locks.
    3. Memory allocation is not deterministic. The address that malloc() returns differs between program instances even with the exact same settings.
      The memory manager I put into the program is deterministic - even with multi-threading.
      It's a lot easier to hold together 100,000 lines of multi-threaded code if it's deterministic.
    Yeah, tying all of the memory allocation to one thread/node doesn't do so well with NUMA I imagine. Beyond just the interconnect bandwidth impact, coherency traffic must become a significant overhead as well.

    Determinism definitely is desirable with parallel programs, though in what capacity are you using the deterministic malloc()? Is it mainly for debugging, or structured memory/pointer accesses? As for thread safety, are you saying that just using barriers to synchronize properly isn't good enough?

    At the cost of some memory capacity, could you have each thread malloc() a (relatively) small buffer to work out of, in the interest of creating some locality and doing burst transfers?
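
    (For context, a hedged sketch of the standard OpenMP "first-touch" workaround this question gestures at: allocate once, but have each thread write its own pages first, so Linux places those pages on the toucher's local node. Illustrative only; not what y-cruncher does.)

        // One big malloc(), but physical pages are committed on first write -
        // so a parallel initialization spreads them across the nodes.
        #include <omp.h>
        #include <cstdlib>

        int main() {
            const long long n = 1LL << 26;           // 512 MB of doubles
            double *a = static_cast<double*>(std::malloc(n * sizeof(double)));
            if (!a) return 1;
            // First touch: each thread's static chunk lands on its local node.
            #pragma omp parallel for schedule(static)
            for (long long i = 0; i < n; i++)
                a[i] = 0.0;
            // Later loops with the same static schedule hit mostly local memory.
            #pragma omp parallel for schedule(static)
            for (long long i = 0; i < n; i++)
                a[i] = a[i] * 2.0 + 1.0;
            std::free(a);
            return 0;
        }
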
    Quote Originally Posted by alpha754293 View Post
    More stupid questions:

    will MPI or OMP help?

    How is it doing the multi-threading/multi-processing now anyways?

    Can threads do malloc()? (You might have actually said it already, but I'm a dummy non-programmer trying to wrap my head around programming concepts.)

    I've always been under the impression/belief that multi-processing can happen ONLY if it spawns slave processes. So the whole idea of multi-threading != multi-processing in my mind (or at least it doesn't seem to be as efficient or doesn't use the full resources available to it).
    I'm curious about the current implementation as well.

    Is it all pthread-based? MPI would definitely be the way to go for ensuring proper locality on a NUMA system, at the expense of all the icky manual data sharing. I've only done some basic work with OpenMP, so I don't know too much about the advanced details like optimizations for NUMA systems, but is it really applicable to your current algorithm?

    I remember reading some time ago about your implementation spawning a large number of threads. Do you limit the number of work threads to the available hardware threads, or go beyond? Having excessive threads would likely thrash around too much with larger NUMA systems and be counterproductive.

    How independent are the threads? Is there much synchronization/sharing of data between them, or does each basically work on its own independent chunk of the series and combine in a tree-reduce fashion?
    Last edited by rcofell; 03-30-2011 at 08:36 AM.



  2. #2
     poke349 | Xtreme Enthusiast | Join Date: Mar 2009 | Location: Bay Area, California | Posts: 705
    Quote Originally Posted by alpha754293 View Post
    More stupid questions:

    will MPI or OMP help?

    How is it doing the multi-threading/multi-processing now anyways?

    Can threads do malloc()? (You might have actually said it already, but I'm a dummy non-programmer trying to wrap my head around programming concepts.)

    I've always been under the impression/belief that multi-processing can happen ONLY if it spawns slave processes. So the whole idea of multi-threading != multi-processing in my mind (or at least it doesn't seem to be as efficient or doesn't use the full resources available to it).
    Yes, threads can malloc(), but malloc() isn't thread-safe. Some compilers can make it thread-safe with the right options, but that's done using locks - which limits scalability. I'll get to that later.
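
    (A minimal sketch of what "thread-safe via locks" amounts to: every allocation funnels through one lock, so allocation is serialized no matter how many cores you have. std::mutex here is just a stand-in for whatever lock the runtime actually uses.)

        #include <mutex>
        #include <cstdlib>
        #include <cstddef>

        static std::mutex g_alloc_lock;

        // All threads queue up on the same lock; with N cores hammering the
        // allocator, N-1 of them are waiting at any given moment.
        void *locked_malloc(std::size_t bytes) {
            std::lock_guard<std::mutex> hold(g_alloc_lock);
            return std::malloc(bytes);
        }

        int main() {
            void *p = locked_malloc(1024);
            std::free(p);
            return 0;
        }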

    Quote Originally Posted by rcofell View Post
    I've been meaning to post in here for a while, as I've got access to an 8-socket Barcelona machine (with ~80 GB RAM) here in our research group at school that I've tried running things on. The NUMA nature of the machine definitely causes sub-optimal scaling for the smaller workloads (those below 1B).

    I tried numactl for 1B and performance goes from 648sec (without) to 466sec (interleave all), so it does make a noticeable difference. Almost a 1.4X speedup in this case.

    For comparison (without numactl), 10B takes 5,379sec and 100M takes 110sec.


    Yeah, tying all of the memory allocation to one thread/node doesn't do so well with NUMA I imagine. Beyond just the interconnect bandwidth impact, coherency traffic must become a significant overhead as well.

    Determinism definitely is desirable with parallel programs, though in what capacity are you using the deterministic malloc()? Is it mainly for debugging, or structured memory/pointer accesses? As for thread safety, are you saying that just using barriers to synchronize properly isn't good enough?

    At the cost of some memory capacity, could you have each thread malloc() a (relatively) small buffer to work out of, in the interest of creating some locality and doing burst transfers?

    I'm curious about the current implementation as well.

    Is it all pthread-based? MPI would definitely be the way to go for ensuring proper locality on a NUMA system, at the expense of all the icky manual data sharing. I've only done some basic work with OpenMP, so I don't know too much about the advanced details like optimizations for NUMA systems, but is it really applicable to your current algorithm?

    I remember reading some time ago about your implementation spawning a large number of threads. Do you limit the number of work threads to the available hardware threads, or go beyond? Having excessive threads would likely thrash around too much with larger NUMA systems and be counterproductive.

    How independent are the threads? Is there much synchronization/sharing of data between them, or does each basically work on its own independent chunk of the series and combine in a tree-reduce fashion?

    For computation threads, threading is done using recursive thread-forking.
    For I/O threads, it's done using iterative thread-spawning.

    I'll only focus on the computation threads.

    Basically, there's a ton of binary divide-and-conquer crap in the program. Since each of these divide-and-conquer algorithms typically involves some sort of data merging/splitting at each stage, you can't easily split them into N threads at the start. So instead, I integrate the recursive thread-splitting into these binary divide-and-conquer algorithms.
    At each stage, I split the current thread into two. In Windows, this is done by spawning another thread. In Linux, it's done using recursive OpenMP.
    There's no pthread version yet, but it would basically be a copy/paste of the Windows code with some renaming. So far, OpenMP works just fine, so there's no need for pthreads yet.
    By adjusting the depth of the thread-splitting, I can change the # of threads. This is the primary reason why y-cruncher is locked to power-of-two threads.

    Some of the new code that I'm working on right now is able to do non-power-of-two threads using the same recursive splitting method. So in the future, I might be able to lift the power-of-two restriction on the program.
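
    (A hedged sketch of that recursive splitting shape using nested OpenMP - illustrative only, not the actual y-cruncher code.)

        // Each level of the divide-and-conquer recursion forks the current
        // thread in two; a depth of d yields 2^d leaf threads - hence the
        // power-of-two restriction mentioned above.
        #include <omp.h>
        #include <cstddef>

        void process_range(double *data, std::size_t len, int depth) {
            if (depth == 0 || len < 2) {
                for (std::size_t i = 0; i < len; i++)   // leaf: sequential kernel
                    data[i] *= 2.0;
                return;
            }
            std::size_t half = len / 2;
            #pragma omp parallel sections num_threads(2)
            {
                #pragma omp section
                process_range(data, half, depth - 1);
                #pragma omp section
                process_range(data + half, len - half, depth - 1);
            }
            // (A real divide-and-conquer step would merge the two halves here.)
        }

        int main() {
            omp_set_nested(1);                  // allow recursive parallel regions
            static double data[1 << 12] = {0};
            process_range(data, 1 << 12, 3);    // depth 3 -> 2^3 = 8 leaf threads
            return 0;
        }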

    If the routine needs memory, then the memory heap is split in half (or some pre-determined ratio). This is also done recursively along with the thread-splitting.
    All memory allocations that a thread does are on its own part of the heap. They never touch another thread's memory - hence why it's deterministic. No locks, no synchronization, no overhead.
    The only non-deterministic bugs are the ones where one thread accidentally touches another thread's memory. But these are rare unless an out-of-bounds array access happens to cross a thread boundary in the heap.
    (In fact, I have yet to encounter a bug like this since I started this project.)

    There's a lot of fancy math needed to prove that a routine will never use more than XX memory - especially when you factor in memory fragmentation.
    Another fancy protocol is needed to join threads when a large value needs to be returned via the heap.

    So the memory allocation is deterministic, and completely free of locks/synchronization. A memory allocation is typically 20-100 cycles depending on how much stuff is on the heap.
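
    (A hedged sketch of that kind of allocator: one pre-allocated block carved up recursively, each thread bumping a private offset with no locks. The names and the 64-byte alignment are illustrative assumptions, not the actual implementation.)

        #include <cstdlib>
        #include <cstdio>
        #include <cstddef>

        struct Arena {
            char  *base;   // start of this thread's slice of the one big block
            size_t size;   // bytes in the slice
            size_t used;   // bump offset - deterministic given the same call sequence

            void *alloc(size_t bytes) {
                bytes = (bytes + 63) & ~size_t(63);  // keep 64-byte alignment
                if (used + bytes > size) return 0;   // the "fancy math" proves this can't fire
                void *p = base + used;
                used += bytes;
                return p;
            }

            // Hand a child thread the back half of whatever is left; the parent
            // keeps the front. Done recursively alongside the thread-splitting.
            Arena split_half() {
                size_t child_sz = (size - used) / 2;
                Arena child = { base + size - child_sz, child_sz, 0 };
                size -= child_sz;
                return child;
            }
        };

        int main() {
            size_t total = size_t(1) << 20;
            Arena root = { static_cast<char*>(std::malloc(total)), total, 0 };
            Arena child = root.split_half();   // the child's slice never overlaps the
            void *a = root.alloc(1000);        // parent's, so no locks are needed and
            void *b = child.alloc(1000);       // the addresses are repeatable run-to-run
            std::printf("%p %p\n", a, b);
            std::free(root.base);
            return 0;
        }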

    Locking/synchronization isn't scalable, so there's none of it in the program. Thread-joining is the only form of "synchronization" that's used.

    OS-managed memory allocation isn't scalable for this reason. All that page-committing stuff needs to access global data - using locks...
    To top it off, the OS needs to zero newly allocated memory before the program can use it. (This is for security reasons: so you can't, say, steal a password by logging into something, closing the program, and then allocating that physical memory into another program.)


    As for pre-allocating a buffer for each thread to get better locality on NUMA - that's essentially the same as MPI, since you have to explicitly move data in and out of that buffer.

    At any given point in the program, all the threads are entirely independent.
    When all threads complete their given tasks, they are assigned new tasks on different parts of memory. So no single part of memory is accessed by only one thread.

    Implementing this in MPI will require inter-node matrix transposes - which is a huge research topic in my field.
    In shared memory, these transposes can be eliminated by directly accessing the memory that's needed - this is what y-cruncher does right now.


    In any case, an MPI implementation will pretty much require a rewrite of all the high-level code. I don't have immediate plans for this. And when I do, it'll be built on top of the NUMA/shared-memory layer.


    EDIT:
    Yes, it's a tree-reduction.
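
    (A hedged toy of that shape - the real computation uses binary splitting of the series rather than plain chunk sums: each worker produces an independent partial result, and the results are combined pairwise up a binary tree.)

        #include <omp.h>
        #include <cstdio>

        int main() {
            const int N = 8;                     // power-of-two worker count
            double partial[N];
            #pragma omp parallel for num_threads(N)
            for (int t = 0; t < N; t++) {
                double s = 0.0;                  // each thread sums its own
                for (int k = t; k < 1000000; k += N)
                    s += 1.0 / (1.0 + k);        // independent chunk of the series
                partial[t] = s;
            }
            // Pairwise combine: N/2 sums, then N/4, ... - log2(N) tree levels.
            for (int stride = 1; stride < N; stride *= 2)
                for (int t = 0; t + stride < N; t += 2 * stride)
                    partial[t] += partial[t + stride];
            std::printf("sum = %.6f\n", partial[0]);
            return 0;
        }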

    EDIT 2:
    The program does in fact use more threads than there are hardware threads. This is to reduce the impact of load imbalance.
    For systems with a non-power-of-two number of hardware threads, the thread count needs to be rounded up to the next power of two anyway.
    Last edited by poke349; 03-30-2011 at 10:45 AM. Reason: typo
    Main Machine:
    AMD FX8350 @ stock --- 16 GB DDR3 @ 1333 MHz --- Asus M5A99FX Pro R2.0 --- 2.0 TB Seagate

    Miscellaneous Workstations for Code-Testing:
    Intel Core i7 4770K @ 4.0 GHz --- 32 GB DDR3 @ 1866 MHz --- Asus Z87-Plus --- 1.5 TB (boot) --- 4 x 1 TB + 4 x 2 TB (swap)
