Is it possible to get a speed-up by running this on a GPU? The algorithm seems to lend itself well to parallelization, and if each computation is independent of the others, it should map well onto a GPU. However, the program seems to require a large amount of memory, and I don't know whether a GPU can access CPU (host) memory.

Have you looked into this at all?