2TB working memory

**zatak** · 01-03-2011, 07:31 AM

I need a system with 2TB of working memory. Before you laugh, let me give a little explanation. We are simulating an entire rhesus macaque brain. To do this fast we may use the darwin cluster at cambridge (128 C1060 NVIDA GPU, 7TB RAM cluster wide, mellanox infiniband). We might also try something like GPUGrid using boinc, either with volunteers or on our campus using CPU's.

unfortunately brain activity is not easily parcelled. Everything is related to everything else and happens at the same time. that's sort of why it's different to a computer, certainly a von-neumann computer at any rate. I digress.

What I'd like to do is build the cheapest system I can to test. clearly getting 2TB of RAM together even in multiple machines is very very costly. So, I was thinking of the following strategy:

Tyan make a barebones that can take 8 GPU's at full x16 speed, 2 xeons (one per bus) and I think 148GB of RAM.

Now, bear in mind that the GPU's will

a) be very busy computing things
b) move data to and from host ram across the PCIe bus only at x8 speed (x16 if data moves only in one direction, but it will be moving in both)
c) will be moving data constantly (they can compute at the same time, it's asynchronous, but they cannot compute with data they don't have and they get it over PCie)

the PCIe is a huge bottleneck (and nvidia, intel and co are working hard to try to get parallel co-processing devices like this behind that bus. the intel guy wouldn't say when - his device was cool though btw; basically 32 or 64 or something PIII chips on a pcie card with some onboard RAM, I digress again)

so:

what I thought was, replace two of eight GPU's with two really good pciex8 RAID cards. do x16 exist?

to those cards, attach many small (like 60GB or so) SSDs.

RAID0 them together with a stripe size of 4k (linux's memory page size)

the idea being that each RAID card could now send data at pciex8 speed, and hold 1TB.

format both RAID cards/devices as SWAP (one on each side of the machine, i.e. one per bus, one per cpu, one per bank of DIMMS)

linux can use swap space intelligently across multiple devices.

then bascailly, just pretend to have 2TB of RAM. but in fact just have a PCIex8 speed paging system and 148GB of RAM.

Like a cache hierarchy GPULOCAL<->GPUGLOBAL<->HOSTRAM<->RAIDSWAP

does that sound viable? would the RAID system hold together? what would be a good RAID card to look at?

cheers,

C

**One_Hertz** · 01-03-2011, 08:18 AM

I read your post 5 times but I do not quite understand what you want to do. Do you need an array that can read AND write at PCI-E 8x (1.0 or 2.0 standard???) at the same time? What block size do you need these kind of speeds at?

From my basic understanding of what you want to do, Flash media is simply not fast enough. Big slowdowns happen when you try to read and write to Flash media at the same time.

Is it possible to configure your setup to receive data from one array and send data to another array? That would make it more viable. You will not be able to achieve 4GB/s (I am assuming 2.0 PCI-E) read and 4GB/s write from two raid cards. Your best bet would be the LSI 9260. I believe one of those can read at over 2GB/s at maximum. Two of them should be able to give you 4GB/s reads. Writes are more difficult and I believe you would need three of those cards to write at 4GB/s. So five PCI-E slots overall... This is assuming you are reading and writing data in large blocks (128kb+). If you are doing this in small blocks then forget it. Also, I doubt your CPUs will keep up.

You may be able to do what you want with two IOdrive Octals:
http://www.fusionio.com/images/data-...-datasheet.pdf

6GB/s reads and 4.4GB/s writes each. You would likely still need to configure your system to read from one and write to the other.

**zatak** · 01-03-2011, 09:39 AM

yes, so:

4k is the RAID0 stripe size. And it is the unit of memory linux uses for paging. So I assume linux swap is a filesystem that uses 4k blocks, or is optimized for reads and write of 4k anyway. Data will always be written in 4k chunks, the same size as a stripe on the RAID0.

since it's a RAID0, there is the possibility that the RAID controller can allocate reads to some devices and writes to others in an optimal way. Surely that's a major part of what a RAID controller or algorithm does.

and that's part of the reason for using many small SSD's each with a read xor write of about 200MB/s. The other reason obviously is maximise the sum of the read xor write bandwidths.

If the RAID card can read and write simultaneously at 8x, and intelligently allocate reads and writes to different component SSD's of the RAID0 set, then that's 4x in each direction, which is a total of 4GB/s. So, that would mean 20 SSDs attached to the card, half of which would be reading and half writing at any one time in an optimal situation of simultaneous 4x read and 4x write over 8x link.

assuming the RAID controller can figure out to write to different devices in the RAID0 set than it is reading to.

And I'd have thought it would be good at that.

But I don't know what algorimths RAID0 controllers and software use.

in addition: I have not one, but two PCIe busses. And I propose to put a RAID card, and 20SSD's on each (totaling the 2TB).

The linux kernel can treat these two RAID0 volumes, formatted as swap, as a single swap space. but it knows they are different partitions, on different devices.

So the kernel may help, or be compiled with suitable options to help, by writing to one device and reading from the other in the most optimal way.

there will inevitably be times when data to be read is on a SSD that is being written to.

but not too many, because the flow of memory reads and writes in our code falls into a pattern where large areas are for reading at time t or writing at time t but not both.

linux's paging activity will reflect that. so the pattern of reads and writes should fall into a neatish pattern of address distribution across the RAID set when it's used as a page cache (virtual memory).

So: the points at issue are:

can raid controllers intelligently decide what device to write to and what to read from during simultaneous reading/writing? Surely a big database server would pretty much require that from it's RAID set?

the main question is whether there are RAID0 controllers that can take 20 SSD's, read (never mind read/write at same time) data from them simulteaneously @ 200MB/s each in 4k chunks and send 80K of data @ pciex8 speed.

**One_Hertz** · 01-03-2011, 10:41 AM

Raid controllers do not decide which device to do reads/writes on. They can not do that by definition. Your software (in conjunction with your filesystem) will request certain LBAs (logical blocks addresses) to be read and written to. These LBAs reside in fixed locations on fixed devices once you make your RAID.

Why did you decide to use a 4kb stripe size? This will put a lot of load on RAID controller CPU. This CPU will be your bottleneck so placing extra load on this is not a good idea. Larger stripe sizes will make the RAID easier for your RAID card to manage.

To be honest, I do not know if the LSI 9260 can do what you want it to do. I believe it is the most capable RAID card for the task, but it is not magic. There are users on here who have achieved over 2GB/s reads on the 9260, but it was not in 4kb chunks. I doubt it is possible in 4kb chunks.

The IODrive Octal would be much quicker (it is 16x) than any RAID you can make out of two PCI-E slots and would only occupy a single PCI-E slot, but it is $$$$$$ (but still many magnitudes cheaper than RAM) and I guess it would eat massive amounts of CPU power. Their sales team is great and will run specific benchmarks you ask of them on request.

**ripken204** · 01-03-2011, 10:55 AM

i just read "cheapest system" and "2TB of RAM" on the same line.
god i love this forum

this may be possible but speed is the question, you will be bottlenecked, depending on the speeds you are writing at which i am assuming will be crazy.
the raid controller would have to be top of the line.

then you need to worry about the life span of the SSDs....
if you are constantly writing to them at max speed, you will wear them out very fast.
and what happens when a few die? hopefully the system keeps on going.. but what about the data that was on them when they died.

**zatak** · 01-03-2011, 11:16 AM

You did read those phrases on the same line. But how the phrases relate semantically are important.

And of course all groups building supercomputers seek the cheapest means to their end. So cheapest 2TB system actually makes more sense to me than cheapest 24gb system.

If one of the drives goes we'd get another. It's working memory not storage.

I'm aware of these iodrive things. Probably best go for that. Was trying tobsave money

The stripe size was chosen because it's the size of a page.

But Linux probably has the smarts to do it 128k for this reason.

Thanks for help. Will continue to research.

This system is just to test proof of concept. We need the Cambridge or other large cluster for real work.

With the system I outlined I think the monkey would experience one second for every thousand we do. Most experiments last an hour in Neuroscience. Much longer to train an animal. 1000 hours per training session might get boring. And one of those drives would probably pop.

We'll work it out. Thnx advice.

**CrazyNutz** · 01-03-2011, 12:39 PM

What you are talking about sounds possible, 2tb SSD RAID0 as virtual memory.

1). Donot create a filesystem on it as that will only slow you down, make it a big swap partition, and use the mkswap command on it.

2). Modern RAID controllers like 3ware will make use of command queueing, and I think it can schedule time slices if there are other read/write commands in the queue. This would be a good question to ask 3ware, they are pretty helpful.

**Rubycon** · 01-03-2011, 01:40 PM

Too much latency to be practical for real time "thoughts".

Imagine being able to breathe underwater and you are told to run a marathon on the ocean floor. Except the sea is about as viscous as waffle syrup in the Yukon in February. Yeah probably like that.

While your network and processing is seemingly insanely fast (and compared to what pc users on the forum use - it is) it's not going to be what you want.

But in the name of science you have to give it a shot.

After all look at how many people laughed at the Wright Brothers.

**zatak** · 01-05-2011, 03:20 AM

who said it has to be real time?

as I said, with 6 GPU's and a virtual memory system like this working perfectly I'd estimate it would be about 1000x slower than real time.

and that may be too slow for practical reasons. But it may be fast enough to check everything works and we get the sorts of results we get, before we purchase large blocks of time on a big GPU cluster like Darwin.

even on darwin I wouldn't expect real time. but it doesn't matter at all.

it's doubtful the simulation, even with the level of anatomical detail we plan, will learn stimulus response patterns. ontogeny (the development of an organism) and epigenesis (differential gene expression depending on environment; hence identical twins can be different) are not part of the model. and in order to get scale, we have to sacrifice detail.

it's not really a model for looking at cognition or learning. it would be cool to get a result there. but too uncertain to base a project on.

the right patterns of oscillation in the right anatomical locations is more than good enough.

we're after the "resting state". a huge amount is going on at rest. basic mechanisms of functional network formation are there.

it also is intended as a "forward model" to help understand what EEG, fMRI and other imaging signals are really measuring.

so it doesn't need to be real time. there's no robot.

yet....

seriously though: 500 years away or something for the robot. sorry.

View Poll Results: can this possibly work?

Thread: 2TB working memory

Thread Tools

Search Thread

Rate This Thread

Display

2TB working memory

no robot. even with SSD RAID

Tags for this Thread

Bookmarks

Bookmarks

Posting Permissions