
Thread: The Case for Barrel Processors

  1. #1
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574

    The Case for Barrel Processors

    Barrel (or multi-context) processors have long existed in supercomputers (the CDC Cyber line, etc.), but for decades the idea has been ignored on the desktop because of a fascination with single-thread performance. That fascination makes little sense on a modern desktop, because the typical user is running several or more tasks at any given moment (web browser, music, operating system, GUI, etc.). BeOS is probably the best example of how threading improves the user experience: if one small task is delayed, the user isn't bothered, since the rest of the system keeps humming along. Contrast that with a single-threaded processor, where a process stuck in a simple loop can halt the entire system (until it is preempted).

    A barrel processor takes that simple logic to the extreme and turns a single processor into dozens of virtual processors, each running at a fraction of the speed, with significant benefits:
    1) No forwarding/feed-forward circuits [significant reduction in die space]
    2) All instructions execute in exactly 1 clock cycle [compiler design is simplified and processor logic is significantly simplified]
    3) Significantly less cache-sensitive [a 20-cycle L1 cache would give the same performance as a 1-cycle cache]
    4) Performance is significantly easier to quantify [threads * clock speed / pipeline length] and to scale [doubling the clock speed gives exactly double the performance, since there are no pipeline stalls, bubbles, or wasted clock cycles]
    5) Would promote more efficient use of memory [the modern practice of loop unrolling would have no benefit, and the stack would be significantly de-emphasized]
    6) Superior user experience [each application can have its own dedicated processor, significantly reducing the need for preemption and context switches]

    Since I figure we might as well do a dramatic example: if Conroe were modified into a barrel processor, it would run 56 threads, with each integer instruction taking 1 clock cycle and each SIMD instruction taking 1.2 clock cycles. Stripped down to bare execution logic, it would take up only 12% of the transistors currently used in Conroe, and would effectively be just the green regions in this picture (plus any extra logic required for interconnect):


    However, there are drawbacks:
    1) Each thread would effectively run at 300MHz
    2) Adding cache and the associated logic would only slow the processor down
    3) Modern operating systems such as Windows and Linux would waste some of the efficiency, because their preemption intervals are too short
    4) Excessive demand for memory bandwidth [an IMC would be required for efficient performance]
    5) The poor design choices of previous years would hamper performance for some applications [but not games or small applications]
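    To picture the round-robin scheduling being described, here's a rough C sketch. The pipeline depth, thread count, and 2.4GHz clock are just assumptions for illustration, not any real design:
    Code:
    /* Rough sketch of barrel-style round-robin issue -- purely illustrative,
       not any real design. Assumes a hypothetical core with STAGES pipeline
       stages and one hardware thread context per stage. */
    #include <stdio.h>

    #define STAGES   8               /* assumed pipeline depth      */
    #define THREADS  STAGES          /* one thread slot per stage   */
    #define CLOCK_HZ 2400000000ULL   /* assumed 2.4 GHz core clock  */

    int main(void)
    {
        unsigned long long cycles = 24;            /* simulate a few core cycles */
        unsigned long long issued[THREADS] = {0};

        /* Every core cycle, the next thread in strict round-robin order issues
           exactly one instruction; by the time its turn comes around again,
           that instruction has fully completed, so there are no stalls,
           bubbles, or forwarding paths. */
        for (unsigned long long c = 0; c < cycles; c++)
            issued[c % THREADS]++;

        for (int t = 0; t < THREADS; t++)
            printf("thread %d issued %llu instructions\n", t, issued[t]);

        /* Per-thread effective clock = core clock / number of stages. */
        printf("effective clock per thread: %llu MHz\n",
               CLOCK_HZ / STAGES / 1000000ULL);
        return 0;
    }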
    Fast computers breed slow, lazy programmers
    The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
    http://www.lighterra.com/papers/modernmicroprocessors/
    Modern Ram, makes an old overclocker miss BH-5 and the fun it was

  2. #2
    Xtreme Member
    Join Date
    Dec 2003
    Location
    England, south east
    Posts
    181
    And how does it turn one processor into many virtual ones? Can't see how in that text.

    Edit: are you saying it creates virtual processors to execute the same pieces of code over and over? I see what it does, don't see how
    Last edited by HungryForHertz; 10-21-2007 at 08:43 AM.

  3. #3
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by HungryForHertz View Post
    And how does it turn one processor into many virtual ones? Can't see how in that text.

    Edit: are you saying it creates virtual processors to execute the same pieces of code over and over? I see what it does, don't see how
    It effectively runs an instruction from a separate thread on each and every clock cycle.
    So the number of threads it can run is (# processors) x (# ALUs) x (# stages per ALU).
    Hence I say 56 threads per core for a modified Conroe.

    The processor only has 1 instruction per thread in flight at any given moment, which removes all dependencies and therefore means perfect branch and loop behaviour (the result is completely calculated before the next instruction from that thread is issued).

    Thus the virtual processors are slower, but they can run completely different applications. The more stages the processor has, the more threads it can run concurrently. However, single-thread performance is identical to a processor with perfect branch prediction and 100% efficiency running at (the processor's clock speed / number of stages).
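    Just to show how the arithmetic can work out, here's the formula in C. The unit counts and stage depths are my own guesses for a Conroe-like layout (3 integer pipes of 8 stages, 2 SIMD pipes of 16 stages), not official figures:
    Code:
    /* Back-of-the-envelope version of the thread-count formula above --
       the unit counts and stage depths are assumptions for a Conroe-like
       core, not published figures. */
    #include <stdio.h>

    int main(void)
    {
        int cores      = 1;
        int alu_units  = 3, alu_stages  = 8;   /* assumed integer pipes  */
        int simd_units = 2, simd_stages = 16;  /* assumed SIMD/FP pipes  */
        double clock_mhz = 2400.0;             /* assumed core clock     */

        /* threads = (#processors) x (#units) x (#stages per unit),
           summed over the different execution unit types */
        int threads = cores * (alu_units * alu_stages + simd_units * simd_stages);

        printf("hardware threads: %d\n", threads);          /* 56       */
        printf("per-thread integer clock: %.0f MHz\n",
               clock_mhz / alu_stages);                      /* 300 MHz  */
        return 0;
    }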
    Fast computers breed slow, lazy programmers
    The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
    http://www.lighterra.com/papers/modernmicroprocessors/
    Modern Ram, makes an old overclocker miss BH-5 and the fun it was

  4. #4
    Xtreme Member
    Join Date
    Mar 2005
    Posts
    353
    Interesting, might be the way of the future. My professor in Comp Arch was talking about this; he ended it with "there are no processors out that do this."
    Mac Pro 2X2.4GHZ Quad
    40GB OWC SSD
    6GB DDR3
    6TB Storage

  5. #5
    Xtreme Member
    Join Date
    Dec 2003
    Location
    England, south east
    Posts
    181
    I think I get it, one instruction being executed per thread at any one time. So no pipelining and therefore no stalls or bubbles?

    Thanks.

  6. #6
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by dicecca112 View Post
    Interesting, might be the way of the future. My professor in Comp Arch was talking about this; he ended it with "there are no processors out that do this."
    Believe it or not, this is exactly what Cray envisioned when he was designing his first "supercomputers", and many old supercomputers did this; however, the design was ditched in the attempt to improve single-thread performance.
    Quote Originally Posted by HungryForHertz View Post
    I think I get it, one instruction being executed per thread at any one time. So no pipelining and therefore no stalls or bubbles?

    Thanks.
    Correct
    Fast computers breed slow, lazy programmers
    The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
    http://www.lighterra.com/papers/modernmicroprocessors/
    Modern Ram, makes an old overclocker miss BH-5 and the fun it was

  7. #7
    Xtreme Member
    Join Date
    Dec 2003
    Location
    England, south east
    Posts
    181
    Seems pretty radical, would it not take a lot of reprogramming or is it too low level for that?

  8. #8
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by HungryForHertz View Post
    Seems pretty radical, would it not take a lot of reprogramming or is it too low level for that?
    Completely invisible to the programmer, but it would make current optimizations counterproductive.
    Fast computers breed slow, lazy programmers
    The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
    http://www.lighterra.com/papers/modernmicroprocessors/
    Modern Ram, makes an old overclocker miss BH-5 and the fun it was

  9. #9
    Xtreme Addict
    Join Date
    Oct 2006
    Posts
    2,141
    Quote Originally Posted by nn_step View Post
    4) Excessive demand for Memory bandwidth [IMC would be required for efficient performance]
    In a year or two, wouldn't this problem be mostly gone with the mass adoption of DDR3? And memory will only be getting faster...

    Or would it require far more speed than DDR-1800?
    Rig 1:
    ASUS P8Z77-V
    Intel i5 3570K @ 4.75GHz
    16GB of Team Xtreme DDR-2666 RAM (11-13-13-35-2T)
    Nvidia GTX 670 4GB SLI

    Rig 2:
    Asus Sabertooth 990FX
    AMD FX-8350 @ 5.6GHz
    16GB of Mushkin DDR-1866 RAM (8-9-8-26-1T)
    AMD 6950 with 6970 bios flash

    Yamakasi Catleap 2B overclocked to 120Hz refresh rate
    Audio-GD FUN DAC unit w/ AD797BRZ opamps
    Sennheiser PC350 headset w/ hero mod

  10. #10
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by EniGmA1987 View Post
    In a year or two, wouldn't this problem be mostly gone with the mass adoption of DDR3? And memory will only be getting faster...

    Or would it require far more speed than DDR-1800?
    Well, the more threads you have, the more bandwidth you need.
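    To put a rough number on it, here's a back-of-the-envelope calculation in C. Every figure (instruction size, load mix, clock) is an assumption, and it assumes almost everything misses cache, which is the whole premise here:
    Code:
    /* Back-of-the-envelope for the bandwidth point above. Every number here
       (instruction size, load/store mix, clock) is an assumption for
       illustration only. */
    #include <stdio.h>

    int main(void)
    {
        double clock_ghz       = 2.4;  /* assumed aggregate issue rate: 1 instr/cycle */
        double bytes_per_fetch = 4.0;  /* rough average x86 instruction size          */
        double loads_per_instr = 0.3;  /* assumed fraction of memory operations       */
        double bytes_per_load  = 8.0;

        /* with little or no cache, nearly every fetch and load goes to RAM */
        double gb_per_s = clock_ghz * (bytes_per_fetch +
                                       loads_per_instr * bytes_per_load);

        printf("rough sustained bandwidth needed: %.1f GB/s\n", gb_per_s);
        return 0;
    }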
    Fast computers breed slow, lazy programmers
    The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
    http://www.lighterra.com/papers/modernmicroprocessors/
    Modern Ram, makes an old overclocker miss BH-5 and the fun it was

  11. #11
    Xtreme Mentor
    Join Date
    May 2007
    Posts
    2,792
    Interesting but I've looked into this briefly before.

    -How would you work out the actual real-time core frequency and maintain equal speed on all the virtual processors?
    -How would you read this with an application?
    -How would your software keep accurate time?

    -How would core frequency and voltage be modified?

    -What is the power demand of such a processor?

    -How to implement technologies such as Speedstep?

    -One thread, one processor at 1GHz; how does it fare against a 1GHz Athlon XP (difference in Stream, FP, Memory, Int performance)?

    -What efficiency will the core have between executing 1 complete run of SPi 1M and 10 runs of it simultaneously? (how much of a performance impact on each thread as they increase)
    Say, hypothetically, you run Super Pi 1M at 1GHz. If it completes in 80 seconds, then 80*1E9 = 80E9 cycles to calculate 1E6 digits.

    How would the performance be affected, on a per-thread basis, by increasing the thread/v.processor count from 2 to 50?

    -How large is the logic for such a chip? To be as effective you would need all the logic needed for a single core nowadays for each thread to work on I would imagine.

    -Scaling in terms of total time/power taken for a dual-core nowadays to complete 10 runs of benchmark A vs. this processor?

    Some of the questions that spring to mind.

  12. #12
    c[_]
    Join Date
    Nov 2002
    Location
    Alberta, Canada
    Posts
    18,728
    Seems very close to a GPU to me.

    Maybe this is an indication of where CPUs and GPUs will converge in the future, once more realistic graphics become possible through ray tracing?

    Or maybe use a cut-down version of this as a thread handler on the CPU/GPU, which feeds threads into different, more specialized areas of a CPU/GPU and then recombines them at the end, so that many-core CPUs do get speed-up advantages on single-threaded apps?

    All along the watchtower the watchmen watch the eternal return.

  13. #13
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by KTE View Post
    Interesting but I've looked into this briefly before.

    -How would you work out the actual real-time core frequency and maintain equal speed on all the virtual processors?
    -How would you read this with an application?
    -How would your software keep accurate time?

    -How would core frequency and voltage be modified?

    -What is the power demand of such a processor?

    -How to implement technologies such as Speedstep?

    -One thread, one processor at 1GHz; how does it fare against a 1GHz Athlon XP (difference in Stream, FP, Memory, Int performance)?

    -What efficiency will the core have between executing 1 complete run of SPi 1M and 10 runs of it simultaneously? (how much of a performance impact on each thread as they increase)
    Say, hypothetically, you run Super Pi 1M at 1GHz. If it completes in 80 seconds, then 80*1E9 = 80E9 cycles to calculate 1E6 digits.

    How would the performance be affected, on a per-thread basis, by increasing the thread/v.processor count from 2 to 50?

    -How large is the logic for such a chip? To be as effective you would need all the logic needed for a single core nowadays for each thread to work on I would imagine.

    -Scaling in terms of total time/power taken for a dual-core nowadays to complete 10 runs of benchmark A vs. this processor?

    Some of the questions that spring to mind.
    1) The real clock frequency is identical to a normal single-tasking processor, but the effective frequency for each thread will be the real clock speed divided by the number of stages.
    2) Take the clock speed and divide by the number of stages.
    3) Identical to a normal processor.
    4) Depends entirely on the implementation.
    5) The power demand would be significantly reduced.
    6) Turn off stages when they are not being used (which is now more predictable).
    7) Compared on a single-threaded task, its performance will be lower (by about 33% or more, depending on the application).
    8) Clock speed scaling will be linear, so a 1GHz version of it will perform EXACTLY half as well as a 2GHz version.
    9) Thread-level scaling is also linear, meaning that an 8-thread edition and a 4,294,967,296-thread edition with the same number of stages would have exactly the same single-thread performance.
    10) The control logic is absolutely tiny compared to modern processors (think about 0.00688% of current processors), but the ALU and FPU will be approximately 3% larger because of the extra registers and the associated logic.
    11) The time taken to run 10 independent threads, compared to a single-core processor (which this is significantly smaller than), would be less by a few thousand cycles, simply because there is no context-switching penalty.
    12) Thus a 2-ALU [8-stage] and 1-FPU [16-stage] (16-thread) version of this would complete 16 copies of SuperPi in less time than Conroe would, assuming they ran at the exact same clock speed (though logically the significant reduction in die size would give the clock speed advantage to the barrel processor).
    13) In summary: when a user or a server is dealing with multiple requests and/or tasks (which is extremely common these days), a barrel processor has significant advantages, but in terms of single-thread performance it would be at a disadvantage. In terms of performance per watt: a significant advantage. In terms of transistors: significantly smaller and cheaper. In terms of real-world application experience: a significant improvement. [Open up Task Manager and see how many tasks are running, then imagine each and every single one of them had a dedicated processor. Don't you think that would make your system significantly more responsive?]
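    If it helps, here's a toy version of points 11 and 12 in C. Every number in it (task length, time slice, switch cost) is made up purely for illustration:
    Code:
    /* Toy comparison of the answers above: N independent equal tasks on
       (a) a time-sliced pipelined core and (b) a barrel core. All of the
       numbers (task length, quantum, switch cost) are made up. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long task_cycles = 100000000ULL; /* cycles per task  */
        unsigned long long n_tasks     = 16;
        unsigned long long quantum     = 1000000ULL;   /* preemption slice */
        unsigned long long switch_cost = 20000ULL;     /* context switch   */

        /* (a) conventional core: same total work, plus one context switch
           per expired quantum. */
        unsigned long long slices = (task_cycles * n_tasks) / quantum;
        unsigned long long normal = task_cycles * n_tasks + slices * switch_cost;

        /* (b) barrel core: each thread has its own register set, so the
           hardware interleaves them with no switch cost at all. */
        unsigned long long barrel = task_cycles * n_tasks;

        printf("time-sliced core: %llu cycles\n", normal);
        printf("barrel core:      %llu cycles\n", barrel);
        printf("saved:            %llu cycles\n", normal - barrel);
        return 0;
    }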
    Fast computers breed slow, lazy programmers
    The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
    http://www.lighterra.com/papers/modernmicroprocessors/
    Modern Ram, makes an old overclocker miss BH-5 and the fun it was

  14. #14
    Xtreme Mentor
    Join Date
    May 2007
    Posts
    2,792
    Sounds almost too good to be true. The idea is theoretically intriguing; applying it in practice is usually much harder with all the quirks involved (and I'm not a processor engineer), but I would love to see one of the main semiconductor manufacturers take a shot at something like this.

    Here are some more questions:

    When benchmarking code in typical software, time is calculated by the software by calling certain functions: the C function time(), the timeGetTime function, the processor clock-cycle timer accessed via the RDTSC instruction (which is accurate down to fractions of a nanosecond), the BIOS timer, the chipset timer, and the Windows API functions QueryPerformanceCounter & QueryPerformanceFrequency, or the Enhanced Timer function, AFAIK. Some of these functions become unreliable when the frequency fluctuates, such as when using SpeedStep, since the frequency at the start of the code's execution is not the same as at the end. For instance, if you start Super Pi and, while it's running, set the Windows time back 2 seconds, it will error and freeze because it cannot establish the correct time. Nor are many of them accurate for monitoring the real-time frequency of multiple cores based on PLL timer feedback. The timer Windows XP uses (which, AFAIK, they have not documented) checks only at system startup, and so shows only the boot-up frequency, not the current effective processor frequency.

    *When you have the BPU (Barrel Processor Unit, let's use that for ease), say the real-time clock frequency with only one thread is 1GHz (100 x 10). The minute three more threads are being executed, the effective clock for each thread drops to 250MHz. What's the margin for a virtual processor to err in the amount of frequency it allocates to a thread in the above scenario, and if you found a thread running slower than all the rest, how would you determine its allocated frequency or troubleshoot it? (since the real clock will remain unchanged and none of those time functions will show you a single VP's clock)

    *With this design, isn't the CPU being used much closer to 100% load than present CPUs are when multiple threads are executed?
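    For reference, this is roughly what the QueryPerformanceCounter path looks like in code. It's a generic Windows-only sketch, not taken from any particular benchmark, and note that it measures elapsed real time, so it never sees the per-thread "effective" clock of a virtual processor:
    Code:
    /* Generic example of high-resolution wall-clock timing on Windows via
       QueryPerformanceCounter/QueryPerformanceFrequency. Windows-only sketch. */
    #include <stdio.h>
    #include <windows.h>

    int main(void)
    {
        LARGE_INTEGER freq, start, stop;
        QueryPerformanceFrequency(&freq);   /* counter ticks per second */
        QueryPerformanceCounter(&start);

        /* ... the workload being benchmarked would go here ... */
        volatile double x = 0.0;
        for (long i = 0; i < 10000000L; i++)
            x += 1.0 / (i + 1);

        QueryPerformanceCounter(&stop);
        printf("elapsed: %.3f ms\n",
               (stop.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart);
        return 0;
    }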

  15. #15
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by KTE View Post
    Sounds almost too good to be true. The idea theoretically is intriguing. The application in practical sense is usually much harder with all the quirks involved (and I'm not a processor engineer), but would love to see a shot at something like this by one of the main semiconductor mfgs.

    Here's some more qs:

    When benchmarking code in typical software, time is calculated by the software by calling certain functions: the C function time(), the timeGetTime function, the processor clock-cycle timer accessed via the RDTSC instruction (which is accurate down to fractions of a nanosecond), the BIOS timer, the chipset timer, and the Windows API functions QueryPerformanceCounter & QueryPerformanceFrequency, or the Enhanced Timer function, AFAIK. Some of these functions become unreliable when the frequency fluctuates, such as when using SpeedStep, since the frequency at the start of the code's execution is not the same as at the end. For instance, if you start Super Pi and, while it's running, set the Windows time back 2 seconds, it will error and freeze because it cannot establish the correct time. Nor are many of them accurate for monitoring the real-time frequency of multiple cores based on PLL timer feedback. The timer Windows XP uses (which, AFAIK, they have not documented) checks only at system startup, and so shows only the boot-up frequency, not the current effective processor frequency.

    *When you have the BPU (Barrel Processor Unit, let's use that for ease), say the real-time clock frequency with only one thread is 1GHz (100 x 10). The minute three more threads are being executed, the effective clock for each thread drops to 250MHz. What's the margin for a virtual processor to err in the amount of frequency it allocates to a thread in the above scenario, and if you found a thread running slower than all the rest, how would you determine its allocated frequency or troubleshoot it? (since the real clock will remain unchanged and none of those time functions will show you a single VP's clock)

    *With this design, isn't the CPU being used much closer to 100% load than present CPUs are when multiple threads are executed?
    Well, the UltraSPARC T1 is close to being a proper example.
    As for the details of the timing logic: it is independent of clock speed and of the number of stages (as it already is with all modern processors), so why would that be a concern?
    Also, the effective clock speed for the threads never actually changes, regardless of the number of threads being executed.
    Fast computers breed slow, lazy programmers
    The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
    http://www.lighterra.com/papers/modernmicroprocessors/
    Modern Ram, makes an old overclocker miss BH-5 and the fun it was

  16. #16
    Xtreme Mentor
    Join Date
    May 2007
    Posts
    2,792
    Quote Originally Posted by nn_step View Post
    As for the details of the timing logic: it is independent of clock speed and of the number of stages (as it already is with all modern processors), so why would that be a concern?
    Are they both independent (as in, they do not affect each other) in modern CPUs?

    IIRC the CPU PLLs run at a base timer frequency (3.6MHz), which has to be synchronized with the motherboard/chipset PLLs to return a frequency value stored in the core's EAX:EDX registers. Isn't the time and frequency seen by a benchmark/application gathered by using the processor's Time Stamp Counter (a timing algorithm counting ticks per clock cycle), and isn't that how the start time, end time, and thus the time taken are calculated?
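    For illustration, reading the TSC out of EDX:EAX looks roughly like this (GCC-style inline asm, x86 only, purely to show the mechanism):
    Code:
    /* Sketch of reading the Time Stamp Counter as described above: RDTSC
       returns the 64-bit tick count split across EDX:EAX. */
    #include <stdio.h>
    #include <stdint.h>

    static inline uint64_t read_tsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));  /* EAX=lo, EDX=hi */
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        uint64_t t0 = read_tsc();
        /* ... code being timed would go here ... */
        uint64_t t1 = read_tsc();
        printf("elapsed ticks: %llu\n", (unsigned long long)(t1 - t0));
        return 0;
    }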

  17. #17
    YouTube Addict
    Join Date
    Aug 2005
    Location
    Klaatu barada nikto
    Posts
    17,574
    Quote Originally Posted by KTE View Post
    Are they both independent (as in, they do not affect each other) in modern CPUs?

    IIRC the CPU PLLs run at a base timer frequency (3.6MHz), which has to be synchronized with the motherboard/chipset PLLs to return a frequency value stored in the core's EAX:EDX registers. Isn't the time and frequency seen by a benchmark/application gathered by using the processor's Time Stamp Counter (a timing algorithm counting ticks per clock cycle), and isn't that how the start time, end time, and thus the time taken are calculated?
    Yes, because if they weren't, the majority of the people here would have their clocks running twice as fast as they should.
    The ISA interface doesn't need to change, so this is completely invisible to the programmer and the software. As far as the software knows, it is just a mini-cluster of slower procs...
    Fast computers breed slow, lazy programmers
    The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.
    http://www.lighterra.com/papers/modernmicroprocessors/
    Modern Ram, makes an old overclocker miss BH-5 and the fun it was

  18. #18
    Xtreme Mentor
    Join Date
    Aug 2006
    Location
    HD0
    Posts
    2,646
    How about one of these as a coprocessor, and then one incredibly fast CPU dedicated to single-thread performance...

  19. #19
    c[_]
    Join Date
    Nov 2002
    Location
    Alberta, Canada
    Posts
    18,728
    cell2

    All along the watchtower the watchmen watch the eternal return.

  20. #20
    Xtreme Mentor
    Join Date
    Aug 2006
    Location
    HD0
    Posts
    2,646
    Quote Originally Posted by STEvil View Post
    cell2
    More along the lines of asymmetrical computing, with the cores being clocked differently from one another depending on usage.

    It's not something I know how to do, I'll be blunt, and my idea would likely be a PAIN to do via traditional coding, but...
