SC09: Intel speaks about 3D Web, demonstrates LRB.

**kl0012** · 11-18-2009, 10:54 AM

Here it is:
http://www.theregister.co.uk/2009/11...ttner_keynote/

On the SGEMM single precision, dense matrix multiply test, Rattner showed Larrabee running at a peak of 417 gigaflops with half of its 80 cores activated; and with all of the cores turned on, it was able to hit 805 gigaflops. As the keynote was winding down, Rattner told the techies to overclock it, and was able to push a single Larrabee chip up to just over 1 teraflops, which is the design goal for the initial Larrabee co-processors.

80 cores? Probably the author's mistake. More interesting is the 805 GFLOPS in SGEMM. As a reference point, Tesla (GTX280) hits 370 GFLOPS in the same task.
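For readers unfamiliar with how SGEMM figures like these are usually derived: multiplying an M×K matrix by a K×N matrix takes roughly 2·M·N·K floating-point operations, and the reported GFLOPS is that count divided by wall time. The sketch below is illustrative only. The function names are hypothetical and the matrix size and timing are made up; only the 417 and 805 GFLOPS figures come from the quoted article.

```python
def sgemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """GFLOPS for C = A @ B with A (m x k) and B (k x n).

    Each of the m*n outputs needs k multiplies and k adds, so ~2*m*n*k flops.
    """
    return 2.0 * m * n * k / seconds / 1e9


def scaling_efficiency(gflops_small: float, gflops_large: float, core_ratio: float) -> float:
    """Fraction of ideal linear scaling achieved when the core count grows by core_ratio."""
    return gflops_large / (gflops_small * core_ratio)


if __name__ == "__main__":
    # Hypothetical run: a 4096^3 SGEMM finishing in 0.25 s (made-up numbers).
    print(f"{sgemm_gflops(4096, 4096, 4096, 0.25):.1f} GFLOPS")

    # Figures from the article: 417 GFLOPS with half the cores, 805 with all of them.
    print(f"scaling efficiency: {scaling_efficiency(417, 805, 2.0):.1%}")
```

By this measure, going from half the cores to all of them gave a bit over 96% of ideal linear scaling, whatever the real core count was.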

**zalbard** · 11-18-2009, 10:56 AM

So Fermi is faster from a GPGPU point of view then?

**kl0012** · 11-18-2009, 11:09 AM

Originally Posted by zalbard

So Fermi is faster from a GPGPU point of view then?

We'll see. But if I remember correctly, Fermi only doubled the single-precision resources of GT200. Also, SGEMM depends greatly on memory bandwidth, so doubling FP resources doesn't always double performance. Intel probably designed LRB with a lot of memory bandwidth.
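The bandwidth point can be illustrated with a simple roofline-style bound, where attainable throughput is the lower of the compute peak and bandwidth times arithmetic intensity. This is a rough sketch, not a model of any real chip: the peak and bandwidth values are hypothetical, and the tile-based intensity estimate ignores output traffic and cache effects.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float, flops_per_byte: float) -> float:
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def blocked_sgemm_intensity(tile: int) -> float:
    """Approximate flops per byte for SGEMM with square tiles held in on-chip memory.

    Per tile step: 2*tile^3 flops for two tile x tile float32 input tiles
    (2 * tile^2 * 4 bytes loaded). Output traffic is ignored for simplicity.
    """
    return (2.0 * tile ** 3) / (2 * tile * tile * 4)


if __name__ == "__main__":
    bandwidth = 100.0  # GB/s, hypothetical
    for tile in (16, 64):
        intensity = blocked_sgemm_intensity(tile)
        for peak in (1000.0, 2000.0):  # GFLOPS, hypothetical; second value doubles the FP resources
            bound = attainable_gflops(peak, bandwidth, intensity)
            print(f"tile={tile:3d} peak={peak:6.0f} -> at most {bound:6.0f} GFLOPS")
```

With the small tile, doubling the hypothetical peak changes nothing because the bound is set by bandwidth. With the larger tile more data is reused on chip, and the extra FP resources start to pay off.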

**Hornet331** · 11-18-2009, 11:09 AM

On the SGEMM single precision...overclock...just over 1 teraflops

Failage?

I dunno what the HD5870 or the 285GTX gets on that benchmark, but 1TFlop SP is weaksauce on Larrabee's side.

**G.Foyle** · 11-18-2009, 11:17 AM

So Larrabee still fails to run rasterized DirectX graphics with acceptable performance.

**fiveprime** · 11-18-2009, 11:20 AM

...and for the OpenGL environment being pushed by Advanced Micro Devices and others.

I spy a typo.

I wish there were pictures or more information on this. They might be aiming to try and reach people who already have Xeon based computing systems. Again I wish they'd put out more details on it.

**kl0012** · 11-18-2009, 11:24 AM

Originally Posted by Hornet331

Failage?

I dunno what the HD5870 or the 285GTX gets on that benchmark, but 1TFlop SP is weaksauce on Larrabee's side.

Failage? Big win, imho. This time it is "real GFLOPS".
As I said, Tesla C1060 (GTX280), with a theoretical peak of 1 TFLOP, hits only 370 GFLOPS in matrix multiplication.
http://www.idre.ucla.edu/events/2009...avid_Tesla.pdf

**Hornet331** · 11-18-2009, 11:35 AM

Originally Posted by kl0012

Failage? Big win, imho. This time it is "real GFLOPS".
As I said, Tesla C1060 (GTX280), with a theoretical peak of 1 TFLOP, hits only 370 GFLOPS in matrix multiplication.
http://www.idre.ucla.edu/events/2009...avid_Tesla.pdf

Well, I said I didn't know how to compare it to others, so if the 280GTX really hits 370 GFLOPS, it doesn't look that bad. Any numbers on the HD4xxx/5xxx?

**Blacky** · 11-18-2009, 11:41 AM

Originally Posted by xoqolatl

So Larrabee still fails to run rasterized DirectX graphics with acceptable performance.

Well, what else can you expect from Intel in the graphics department?

**Helloworld_98** · 11-18-2009, 11:58 AM

Well, I think this is a win. Chances are your app is x86, not Open_L, so Larrabee wins from the consumer's perspective.

**kl0012** · 11-18-2009, 01:15 PM

Another interesting paper:
http://techresearch.intel.com/UserFi...2009_FINAL.PDF

Results Summary: Our parallel implementation of ray-casting delivers close to 5.8x performance improvement on quad-core Nehalem over an optimized scalar baseline version running on a single core Harpertown. This enables us to render a large 750x750x1000 dataset in 2.5 seconds. In comparison, our optimized Nvidia GTX280 implementation achieves from 5x to 8x speed-up over the scalar baseline. In addition, we show, via detailed performance simulation, that a 16-core Intel Larrabee [26] delivers around 10x speed-up over single core Harpertown, which is on average 1.5x higher performance than a GTX280 at half the flops. At higher core count, performance is dominated by the overhead of data transfer, so we developed a lossless SIMD-friendly compression algorithm that allows 32-core Intel Larrabee to achieve a 24x speed-up over the scalar baseline.

**Manicdan** · 11-18-2009, 01:20 PM

they just said single precision was just over a tflop?
and how big is 80 cores?

**Chumbucket843** · 11-18-2009, 02:27 PM

^^Intel doesn't like to tell the truth about GPGPU or ray tracing, so don't trust any blogs, articles or whitepapers from them.

Originally Posted by Hornet331

Well, I said I didn't know how to compare it to others, so if the 280GTX really hits 370 GFLOPS, it doesn't look that bad. Any numbers on the HD4xxx/5xxx?

i have seen 880 GFLOPs with dense matrix on a 4870 and the peak flops in that situation would be 960 GFLOPs. i would expect a 5870 to be a lot faster too. nvidia definitely needs to fix something here. probably memory access but shoddy code is always going to run very slow.
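To put the numbers quoted so far on the same footing, the sketch below computes achieved SGEMM throughput as a fraction of the stated peak. It uses only the figures posters gave in this thread (which are unverified forum claims), and the helper name is hypothetical. Larrabee is left out because no theoretical peak was given for it.

```python
def fraction_of_peak(achieved_gflops: float, peak_gflops: float) -> float:
    """Share of the theoretical peak that the benchmark actually reached."""
    return achieved_gflops / peak_gflops


# (achieved GFLOPS, peak GFLOPS) exactly as quoted by posters in this thread.
quoted = {
    "Tesla C1060 / GTX280 (per kl0012)": (370.0, 1000.0),
    "HD 4870 (per Chumbucket843)": (880.0, 960.0),
}

if __name__ == "__main__":
    for name, (achieved, peak) in quoted.items():
        print(f"{name}: {fraction_of_peak(achieved, peak):.1%} of peak")
```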

**Cybercat** · 11-18-2009, 02:33 PM

Originally Posted by xoqolatl

So Larrabee still fails to run rasterized DirectX graphics with acceptable performance.

Why, because they didn't demonstrate it at an HPC event?

**Nedjo** · 11-18-2009, 03:01 PM

well Larrabee is definitely on the path of Merced!

**Serra** · 11-18-2009, 03:20 PM

Originally Posted by article

Intel is also cracking the issue of sharing data between Core and Xeon CPUs and Larrabee GPU co-processors. Future Core and Xeon chips will be able to create a virtual shared memory pool that both the CPU and GPU can access so datasets are not crunched down, serialized, and moved over the PCI-Express bus from the CPU to the GPU and then back again after calculations are done. The shared virtual memory allows the CPU and GPU to work off the same data in sequence without any movement, which should radically improve performance and smooth out simulations.

^ This looks interesting to me. It's also a good way to go for Intel because nVidia has no way to do it if they lock them out. Of course, AMD could do it too now, and they may even be able to do it better at the moment.

As for performance... people can write all the blogs and take all the task-specific benches they want, I want to see actual numbers.

**fiveprime** · 11-18-2009, 03:28 PM

I too think the prospect of a shared memory pool is simply awesome. As with EVERY SINGLE HARDWARE RELEASE EVER EVER EVER you won't know till it's in the hands of the consumer. All other debate is just silly.

**Qkjhfhaiguihfma** · 11-18-2009, 08:31 PM

Originally Posted by xoqolatl

So Larrabee still fails to run rasterized DirectX graphics with acceptable performance.

ok? they just demonstrated 1tflop on a single preproduction chip, running x86. can anyone else do that?

**kl0012** · 11-18-2009, 09:38 PM

Originally Posted by Hornet331

Well, I said I didn't know how to compare it to others, so if the 280GTX really hits 370 GFLOPS, it doesn't look that bad. Any numbers on the HD4xxx/5xxx?

Well, I did a little Google research and found that, with all the "stream computing" hype from AMD, there are almost no official performance numbers. Still, there are some 4870 numbers flying around on the AMD forums. So here it is:
Up to 200 GFLOPS using OpenCL:
http://forums.amd.com/devforum/textt...readid=120413&
540 GFLOPS using IL/Brook+ (from AMD official):
http://forums.amd.com/forum/messagev...hreadid=105221
Some guy stated he was able to extract 880 GFLOPS using L1 texture caches but no confirmation that this method is usable in general:
http://cerberus.fileburst.net/showthread.php?t=54842
Also a nice summary by an AMD guy:

Although 7XX has multiple methods to access memory (a lot more than 2 if you read the ISA doc), OpenCL currently only has one, as the OpenCL programming model is pointer based, so all data has to be fully coherent (this is ignoring images, which are read_only or write_only, not both). This does not allow the use of the texture unit in the same way that Brook+/IL can use the texture unit. Brook+ does not allow you to alias pointers (unless you explicitly allow it), and in IL you do so at your own risk. Writing to memory and reading from that same memory with the texture unit does not produce deterministic behavior. OpenCL requires that all writes and reads to global memory are coherent, so this approach is not feasible. This is a performance hit compared to a streaming model because the GPU is natively a streaming device. There is another performance hit for the R7XX since it was not designed with OpenCL in mind; our new HD5XXX series was.
One of the goals of the Stream SDK is to provide a full software stack for many different types of programmers.
That means if you want performance, AMD provides CAL/IL to do that. If you want ease of programming to the streaming model, we also provide Brook+ to do that. If you want to program in the same language across multiple devices from the same source, OpenCL.

I think that the biggest advantage of Larrabee is its ISA, so you don't need to deal with various proprietary APIs. Also, its memory model (coherent caches, general-purpose memory hierarchy) allows much higher flexibility in code development.

**Drwho?** · 11-18-2009, 09:56 PM

here is what you need to know: Here
the rest is not important ...

Pommm pom pom pom

**fiveprime** · 11-19-2009, 12:20 AM

Sigh, I got excited for...that.....

**ajaidev** · 11-19-2009, 12:35 AM

If you tune the IL right the rv770 can put out a lot.

I tried Brook+ DP, but then I got two different types of instructions and the IL compiler did not combine the two. I tried several workarounds, but it was a no go.

**fiveprime** · 11-19-2009, 12:39 AM

Honestly all I want to know is what the pricing is going to look like.

**Hornet331** · 11-19-2009, 02:48 AM

Originally Posted by kl0012

*snip*

So if you go with the current best case (besides the theoretical 880 GFLOPS), Intel's Larrabee can pull off twice the performance of an HD4870. Now it remains to be seen how much more performance the HD5870 has brought to the table. If the scaling is the same, it should be around 1.2TF.
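One way to arrive at a figure near 1.2 TF, which may or may not be how the poster got there, is to scale the 540 GFLOPS IL/Brook+ result quoted earlier by the ratio of theoretical single-precision peaks. The peak values below (1200 GFLOPS for the HD 4870, 2720 GFLOPS for the HD 5870) are the published spec-sheet numbers, not figures from this thread, and the function name is hypothetical. The estimate assumes SGEMM efficiency carries over unchanged between generations, which is a big assumption.

```python
def project_by_peak(measured_gflops: float, old_peak_gflops: float, new_peak_gflops: float) -> float:
    """Scale a measured result by the ratio of theoretical peaks (same efficiency assumed)."""
    return measured_gflops * new_peak_gflops / old_peak_gflops


if __name__ == "__main__":
    hd4870_measured = 540.0  # GFLOPS, IL/Brook+ figure quoted earlier in the thread
    hd4870_peak = 1200.0     # GFLOPS, spec-sheet peak (assumption: not from the thread)
    hd5870_peak = 2720.0     # GFLOPS, spec-sheet peak (assumption: not from the thread)

    estimate = project_by_peak(hd4870_measured, hd4870_peak, hd5870_peak)
    print(f"projected HD 5870 SGEMM: about {estimate:.0f} GFLOPS")
```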

**SQjay** · 11-19-2009, 04:45 AM

Anyone wondering about 40 deactivated cores?

It seems to me that they didn't have a spare NPP nearby to power up the whole chip!

If half of the chip is already at the 300W limit, then all this comparing with the 4870/5870 is useless.
LRB with half the chip is on par with the 4870 performance-wise and on par with the 5970 (8x 4870) TDP-wise.
So when it finally comes out it will be, say, 10x slower than R900 (a very optimistic scenario for Intel), if they cut power by half and bring a 50% speed-up in clocks.

As a GPU, it would finally be nice to see LRB with TWO DIGITS FPS in any modern game.
