PDA

View Full Version : Programming for clusters, anyone?


angra
05-18-2007, 12:04 PM
So, I was curious if any of you guys do parallel computer programming? I know that not everyone on XS is a coder (see how tiny the coder's corner is). With multi-core being the wave of the future (or the present), just about every program is going to be impacted in one way or another by parallelism, so it's an issue that's going to be more and more important...

so are any of you guys parallel systems programmers? Any of you learning? Thinking about it?

I really need a positive, friendly conversation on XS to help me keep the faith...

pythagoras
05-18-2007, 02:29 PM
Well over the years I have often thought of getting into coding, just never found the time, unless you count a bit of commodore64 basic:eek: .

I hear parallel coding requires a completely different set of skills. Would you say someone starting from scratch in the parallel world would have an advantage here, given similar skill sets as far as mathematics and logic go?

And where would one start with parallel coding, books websites?

Regards

John.

angra
05-18-2007, 02:42 PM
That's a good set of questions!

I don't think I can answer on whether starting from scratch on parallel is easier or not. I programmed for 20 years before I touched anything parallel, so I just can't even imagine starting from scratch in that mindset. My intuition is that at this time, the answer is no - it would not be easier to start from scratch with parallel. The reason for that is that the underlying abstractions that have been built up around single-cpu programming still dominate the input software languages. Some of them may support parallelism better than others, but for the most part, they are still built thinking of the computer as made up of (now many) elements that do one instruction at a time.

There are a good number of books that are intended to be introductions to parallel programming. I would have to research a bit to make a canonical list, but I would be happy to do so. to be honest, I think I could best explain the concepts to people who already know how to program, so if someone was learning from me, I would tell them to learn how to write software first, then worry about parallelism. A working knowledge of C is probably helpful but not really required.

As an alternate, I might suggest they look at streaming languages, such as StreamIt, Brook/Sequioa, or the GPU-centric languages from rapidmind and peakstream. These languages are intended to be implicitly parallel rather than explicitly parallel, which might be easier to grasp for a new programmer...code is written in a way that expresses the flow of data through operations rather than a sequence of operations execute, which can be more intuitive way to express some algorithms, and is much easier for compilers/tools to automatically parcel out to multiple execution units.

pythagoras
05-18-2007, 02:50 PM
I see, I was thinking more along the lines of coding parallel systems directly rather than through a compiler or interpreter. Is machine code programming the reserve of a few people these days?

If you are coding via compilers/interpreters is there much optimisation you can do, or is the talents of the compiler coder more important?

Regards

John.

When I'm talking about compilers/interpreters I suppose I really mean a high level language rather than coding in machine code and then compiling

angra
05-18-2007, 03:00 PM
I see, I was thinking more along the lines of coding parallel systems directly rather than through a compiler or interpreter. Is machine code programming the reserve of a few people these days?

If you are coding via compilers/interpreters is the much optimisation you can do, or is the talents of the compiler coder more important?

Regards

John.

another excellent set of questions.

Assembly programming (I think what you mean by machine code) is phenomenally time consuming (aka EXPENSIVE!) to do. This is especially true in the parallel environment where the overall system complexity goes up drastically. In most cases it is only a very few people in the world who will produce code via assembly that is much faster than what a compiler creates, because the complexity of optimization is so great.

One element that you overlooked in your assembly versus compilers/interpreters dichotomy is libraries. This is also known in some circles as middleware. A classic example that many people on XS can relate to of middleware is OpenGL. The idea is that for a specifc application or problem type someone would define a set of functions that covers the set of things you'd like to do in that domain (say 3d-graphics for opengl). It is then left to vendors or motivated individuals to create hyper-optimized implementations of the defined function set. This allows a relatively large number of programmers/application writers to benefit from the optimization work that is done by a small number of people. As you can guess, this is a hugely important method that is very great when succesful. the hard part is coming up with a stable API that a lot of people will use. It also happens to be my personal favorite way to leverage new technology. since the types of things people want to write programs to do typically changes much more slowly than the types of computing hardware we want to run on.

pythagoras
05-18-2007, 03:09 PM
I think the stanford guys would probably give you a good discussion on this topic, do you do much scientific programming of parallel systems? They are always looking for volunteers, they just never find any that understand the scienctific models enough to help with the coding.

Regards

John.

angra
05-18-2007, 03:22 PM
The stanford guys are great, and I know them well. I have worked on projects that some stanford people have been involved in as well, and I have tremendous respect for the professors and students working in this area over there. Bill Dally and Mark Horowitz come to mind as prime candidates for "pay attention to what these guys have to say". Stanford guys were some of the early pioneers of the GPGPU community (i.e. using graphics processors to do non graphics computing), and Stanford is the home of Stream-C, Brook, and Sequioa.

pythagoras
05-18-2007, 03:34 PM
I have to agree, and Erik Lindahl is a star. I know gromacs didnt start as parallel but elements of it have been ported to the smp client. It took Eric all of 18 hours to program around a hardware bug in the athlon and opteron proccessors so that they wouldnt freeze when presented with a really rare set of instructions. Me and another administrator at F@H spent quite a few weeks working on this with amd texas, and when we passed on the hardware dump he did in that many hours what amd didnt know how to get around, data alignment was the problem.

Regards

John.

embeejay
05-20-2007, 11:11 AM
Hey angra, nice thread :)

I have been a programmer for many years and have done my share of parallel running programs during the years, tho i have never tried to work with a cluster.
Mostly i have been using perl and php for the past many years, doing manly websites and backend for them (db integration and such stuff) but i used to do a lot of silly pet projects when i was younger and the one that comes to mind in this case was a truly parallel chat-daemon i coded in perl way back in the mid 90's. The idea was that every person who logged into this chat-server got his/her own thread (or well really processes because it was based on fork) and then they all talked together through unix sockets. There was no real reason not to make it a single process but i decided to go the fork way as an exercise and it was great fun, although it caused all kinds of problems i hadn't foreseen when i started out :D

And this is probably where this long windy writing gets interesting: The difference in my eyes between single threaded programming and multi threaded is the way you think about your code, or at least it needs to be in my eyes.

I don't think parallelism is more difficult as such, you just need to start thinking about everything you do in a different light, and as such, all beginning is hard. Where you usually don't have to think to much about I/O all of a sudden this becomes your main focus (race conditions) you have to start thinking in transactions (atomic procedures) and so on... there aren't really, the way i see it, more ways to make things go wrong, there are just other ways, than usual, where you are likely to make mistakes.

Most of all i think parallelism is cool, for me it's a bit how recursion was in the old days.. a powerful new way to do things easier than you could before.. it's like every new thing (not that it is new as such, i am talking about the paradigm shift) - we have to master it before we see it's true potential... i still remember how i felt the first time i learned regular expressions, today i can't live without them (or well to be more precise, do my work).

I don't know if this jumble of thoughts on the subject made any sense, sorry if not, i am still hungover, it's been a tough weekend :D

uOpt
05-21-2007, 09:25 AM
So, I was curious if any of you guys do parallel computer programming? I know that not everyone on XS is a coder (see how tiny the coder's corner is). With multi-core being the wave of the future (or the present), just about every program is going to be impacted in one way or another by parallelism, so it's an issue that's going to be more and more important...

so are any of you guys parallel systems programmers? Any of you learning? Thinking about it?

I really need a positive, friendly conversation on XS to help me keep the faith...

You are throwing too many things into one basket.

Parallel programming for SMP machines such as multi-cores and multi-CPU systems is an entirely different matter than programming clusters. Clusters usually means different computers, and then you don't share memory anymore (unless you go as far as a distributed virtual memory system).

When it comes to programming SMP systems, I only have these bits of advice:

Forget about all these fancy higher-level tools, unless you are willing to go as far as using a special-purpose parallel language such as Erlang (which I don't recommend either).
You build parallelism and well-performing parallel programs by applying the basic well, not by applying a gazillion of non-basic abstractions. You use pthreads (or Win32 threads), and C++, and that's that. Don't fool around with distractions. Yes, it is painful, but yes, it is worth it.
If you go actual cluster (multiple machines, not shared memory), then you use TCP or UP sockets, that's it. All that RPC rubbish just gets in the way, not to speak of CORBA. Now, I don't say things like CORBA are useless, but they are more useful if you need to do inter-vendor communications. They are much less useful if both sides of the exchange are hand-designed by you.


Having said that, there are some higher-level architectures that I respect greatly. Namely, the "radio" toolkit for Python has very valuable capabilities and doesn't constrain you unnecessary.

angra
05-21-2007, 10:05 AM
uOpt:

indeed you are correct, they are all different in many ways. I was mostly being broad in scope in order to maximize the chances of making a conversation :).

with regards to clusters, don't forget MPI - you can use sockets directly, but there are indeed standardized wrappers at varying levels of abstraction - not all of them suck. I agree with your overall assesment of RPC, and to some degree CORBA. I think CORBA does have a place, it's just not in ultra-high-performance systems, at least right now. There are some big players in the high performance computing world that are big advocates of an OMG-style approach to complexity management.

The primary problem with higher level languages is that they have not been suitably "matured" (for lack of a better term) there just hasn't been enough need for them to really put them through their paces and get them fully flushed out. Much like the situation with C vs. Assembly 15 or 20 years ago - the compilers have not matured enough yet to outdo (in general) what a smart human can do in terms of partitioning the problems, and managing the communication.

On the flip side, hand coding is a lot more work...in the hobbyist domain we can say "suck it up", but in the commercial world, effort equals money, and it often (usually?) is worth a performance loss to gain a reduced overall cost of ownership for the software. Sometimes it isn't, though, so if you want the absolute maximum performance, you are absolutely correct - hand code down to as low a level as necessary to gain the control you need to get the performance you want.

As an aside, multi-core does not necessarily go hand-in-hand with SMP. It does for AMD and Intel chips, but the Cell processor, as well as other heterogeneous chip multiprocessor systems (I'm thinking of various university research processors as my example here) rely heavily on the localization of memory to gain the performance they achieve.

anyway, good topic :) and some good comments so far. I assume uOpt that you do some of the type of programming you describe.

uOpt
05-21-2007, 11:49 AM
Well, as I said, the moment you actually have to talk to somebody else's software there's a good reason to go CORBA or XMLRPC or whatever. It won't be fast anyway :) and it is easier to assign blame.

I am in the commercial world, however, and we hand-code. The moment that performance and/or error handling is the whole point about your product, things go "hand-coding" very quickly.

Now, the question is: when is performance a critical point for your product/company's success? Obviously it usually doesn't matter much for some middleware layer, which is 80% of business programming. But performance is critical much more often than people think, and it comes in when the topic of "scalability" comes up. There is nothing worse than having a product that performs well for a limited amount of users/data/machines and that doesn't scale. Your customers will react very violently if they managed to grow their business (the business they use your software for) by a factor of 10, sales-wise, but your product breaks down even when just putting on 2 times the original load. Customers will eat you alife. Growth is everything, and you need to be scalable.

I also never got the point of demanding SMP-capabilities of software (aka OS-level threading) to make use of 2 or 4 CPUs, but implement the whole thing in a language that is 2 or 50 times slower than a high-performance language in the first place :)

EvlUndrWareNome
06-13-2007, 02:29 PM
i have access to a cluster that has now grown nearing a 100 nodes in a cluster at my old high school. [Its in a secure closet only me and the teacher have a key to] .

I would LOVE to have a cluster client... the folding power would... be sexy :D

id love to see this screen pegged full with folding at home:
http://bofh.be/clusterknoppix/pics/clusterknoppix3.png

not mine, ill have to dig up a screen shot of the 85 node beast we had.

P_1
06-13-2007, 08:27 PM
I have done a little bit of parallel programming with pthreads, using mutexes and semaphores, which I understand is very expensive. I am interested in learning more advanced parallel programming techniques and contributing to your goal at hand.

adamsleath
06-13-2007, 09:39 PM
Example parallel programming models

Parallel programming models include:

* POSIX Threads
* PVM
* MPI
* OpenMP
* TBB
* Charm++
* Cilk
* Global Arrays
* HPF
* SHMEM
* Stream processing
* Pipelining
* Partitioned global address space: UPC,Co-array Fortran, Titanium
* Occam (programming language)
* Ease (programming language)
* Erlang (programming language)
* Linda coordination language

Other research-level models are:

* Cray's Chapel,
* Sun’s Fortress,
* IBM’s X10

Programming languages/models:

* Message Passing Interface/MPICH
* Linda
* *Lisp
* Cilk
* BMDFM: Binary Modular Dataflow Machine - Parallel Runtime Environment (BMDFM)
* Sieve C++ Parallel Programming System

uOpt
06-14-2007, 12:04 PM
I have done a little bit of parallel programming with pthreads, using mutexes and semaphores, which I understand is very expensive. I am interested in learning more advanced parallel programming techniques and contributing to your goal at hand.

No, that's the light stuff. Everything else is based on top of it.

Exception: you can have cheaper mutexes than most POSIX threads implementations provide, at least if you are willing to bind to a particular model of CPU. Most POSIX threads implementations have function calls for mutexes, when locking and testing would only require a single atomic assembly instruction.

Of course none of this does any *clustering*, it is machine local only (unless you do to a system with distributed virtual memory).

SparkyJJO
06-14-2007, 02:35 PM
OK this has gone waaay over my head :p:

P_1
06-14-2007, 02:38 PM
No, that's the light stuff. Everything else is based on top of it.

Exception: you can have cheaper mutexes than most POSIX threads implementations provide, at least if you are willing to bind to a particular model of CPU. Most POSIX threads implementations have function calls for mutexes, when locking and testing would only require a single atomic assembly instruction.

Of course none of this does any *clustering*, it is machine local only (unless you do to a system with distributed virtual memory).

Well I was thinking about it in terms of game development, where the hit from using mutexes and semaphores makes multithreading almost pointless. I meant advanced as in learning how to write code that doesn't even need mutexes or semaphores and achieving data parallelism.

uOpt
06-15-2007, 11:24 AM
Well I was thinking about it in terms of game development, where the hit from using mutexes and semaphores makes multithreading almost pointless.


This is absolutely not true. Who told you this?

The reasons why multithreaded games do so badly is that they do domain decomposition (aka the threads do different things) and not data decomposition (the thread can do different things but they can also do the same thing at the same time, just on different data chunks). As a result, minimum framerate is almost not reduced at all, since exceptionally slow frames are usually only slow in one activity. Since one activity is spread to one CPU at most, the minimum frameframe doesn't go up as much as expected.


I meant advanced as in learning how to write code that doesn't even need mutexes or semaphores and achieving data parallelism.

Look for "lock-free algorithms".

However, even here is problem is not the overhead of mutexes or semaphores themselves. The problem is that what is locked is locked for too long.

For example, if you are too lazy to do a lock-free linked list and you use a lock on the whole list (a normal list with one mutex), then your performance might suffer if you have many threads trying to write different places of the list. However, the process of locking, as in the mutex functions, is trivial and doesn't play a role in the delay. The delay comes from threads being blocked on those locks.

P_1
06-16-2007, 05:59 PM
http://www.gamasutra.com/features/20051205/paquet_01.shtml

"Mutex objects are very expensive to use. Every operation performed on a mutex object requires a user mode to kernel mode transition"

There was also a presentation by Tim Sweeney that talked about mutexes being expensive as well.
http://chipsandbs.blogspot.com/2006/02/next-generation-programming-languages.html
“Manual synchronization (shared state concurrency) is hopelessly intractible here.”
“[Atomic] Transactions are the only plausible solution to concurrent mutable state”

I am assuming manual means using mutexes/semaphores.

uOpt
06-18-2007, 01:46 PM
http://www.gamasutra.com/features/20051205/paquet_01.shtml

"Mutex objects are very expensive to use. Every operation performed on a mutex object requires a user mode to kernel mode transition"

There was also a presentation by Tim Sweeney that talked about mutexes being expensive as well.
http://chipsandbs.blogspot.com/2006/02/next-generation-programming-languages.html
“Manual synchronization (shared state concurrency) is hopelessly intractible here.”
“[Atomic] Transactions are the only plausible solution to concurrent mutable state”

I am assuming manual means using mutexes/semaphores.

Yeah, well.

You can implement a mutex as just a busy-wait spinlock, then it does not involve a system call.

It's true, however, that at least in the Windoze world "mutex" usually means a kernel calling (sleep wait, not busy wait) construct and IIRC you have no busy-wait spinlocks at all. Apparently those are considered too dangerous to give to casual users :rolleyes:

P_1
06-18-2007, 09:32 PM
wouldn't a busy wait spin lock type mutex also use up a ton of clock cycles, maybe even more so than kernel calls?

uOpt
06-20-2007, 12:07 PM
wouldn't a busy wait spin lock type mutex also use up a ton of clock cycles, maybe even more so than kernel calls?

If you use it in a situation where you shouldn't use a busy wait, of course. It'll be catastrophic.

But if your estimated rate of collision is low enough, the rate of locks per time is high despite this and the time in locked state is small of course it's worth it. Typically this is the case when there is a very high rate of lock/unlock, but it is one thread that does the overwhelming majority of these locks and only occasionally you'll have other threads.

sc00p
06-21-2007, 10:24 AM
Could you propheads code an interface for this nVidia Tesla (http://www.nvidia.com/object/tesla_computing_solutions.html) so someone with deepish pockets could run f@h on it :p:

angra
06-26-2007, 01:18 PM
sc00p - in theory, if F@H as open source, it could be done. it isn't though (for good reason). Some of the number crunching routines could be ported (and I might just do it if I get the time, or assign a student...), but getting it used by the F@H project requires the blessing/integration/work (i.e. effort) of Stanford.