Who said that?
At most, they were the first to implement it successfully in a commercial, high-volume product.
There is no misunderstanding. They are active at the same time. Even though the threads are decoded and retired in alternating cycles, that says nothing about parallel execution in the EUs. Since we are, after all, talking about shared resources, it is logical that some arbitration exists at different levels.

Back to my posting: I linked to an Intel slide in an article, which didn't allow direct linking. Fixed that now.
My point is: there is some misunderstanding about when and where those two threads in an SMT core are active. I wanted to show that during each cycle, both threads can execute on the available execution units. In other units, like the decode or retirement units of the Netburst architecture (see the paper you linked), the threads are decoded/retired in alternating cycles. The sketch below illustrates the distinction.
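To make that concrete, here is a minimal Python sketch of such a pipeline, assuming a hypothetical two-thread core: the front end decodes for only one thread per cycle (round-robin, as in Netburst), while the issue stage can draw ready µops from both threads in the same cycle. The instruction streams, decode width, and EU count are all invented for illustration.

[CODE]
# Toy SMT pipeline: round-robin decode, simultaneous issue (illustration only).
from collections import deque

THREADS = 2
DECODE_WIDTH = 3                 # hypothetical: uops decoded per cycle
EU_WIDTH = 2                     # hypothetical: uops issued to EUs per cycle

# Made-up instruction streams, one per thread.
streams = [deque(f"T{t}-op{i}" for i in range(4)) for t in range(THREADS)]
queues = [deque() for _ in range(THREADS)]

cycle = 0
while any(streams) or any(queues):
    # Front end: only ONE thread is decoded in this cycle (alternating).
    t = cycle % THREADS
    for _ in range(DECODE_WIDTH):
        if streams[t]:
            queues[t].append(streams[t].popleft())

    # Back end: up to EU_WIDTH uops issue this cycle, drawn from BOTH threads.
    issued = []
    qi = cycle % THREADS         # simple arbitration for the shared EUs
    while len(issued) < EU_WIDTH and any(queues):
        if queues[qi]:
            issued.append(queues[qi].popleft())
        qi = (qi + 1) % THREADS
    print(f"cycle {cycle}: decoded thread {t}, issued {issued}")
    cycle += 1
[/CODE]

Run it and you'll see cycles where µops from both threads issue together, even though decode only ever serves one thread per cycle. That is the whole point: alternation in one shared stage does not preclude simultaneity in another.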
Says who? You're basically narrowing the definition of SMT so you can build an argument on a corner case that would run against the new definition. How do you want to have a discussion when you're changing the definition of things so they fit your stance? After all, SMT actually is about simultaneously issuing instructions from multiple threads to a shared set of EUs to make better use of them.
Talk about a logical fallacy the size of Everest.
Both cache hit rates and branch-prediction accuracy are in the mid-90% range for modern CPUs. Yet they struggle to get above 1.5 IPC. The reason is simple: the ILP just isn't there.
Most code is written in a sequential manner, and data dependencies are the norm; see the sketch below.
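To put a number on it: on an ideally wide machine, the best-case IPC of a code fragment is its instruction count divided by the length of its longest dependency chain. A small Python sketch with an invented six-instruction fragment (the opcodes in the comments are only for flavor):

[CODE]
# Best-case IPC from a data-dependency graph (invented example).
from functools import lru_cache

# Each instruction lists the instructions whose results it consumes.
deps = {
    "i0": [],          # load  r1, [a]
    "i1": ["i0"],      # add   r2, r1, 1    (waits on the load)
    "i2": ["i1"],      # mul   r3, r2, r2   (chain continues)
    "i3": ["i2"],      # store [b], r3
    "i4": [],          # independent load elsewhere in the code
    "i5": ["i4"],      # add using that load
}

@lru_cache(maxsize=None)
def depth(instr):
    """Length of the longest dependency chain ending at instr."""
    return 1 + max((depth(d) for d in deps[instr]), default=0)

critical_path = max(depth(i) for i in deps)
print(f"{len(deps)} instructions, critical path {critical_path} cycles")
print(f"best-case IPC with infinite EUs: {len(deps) / critical_path:.2f}")
# -> 1.50: even with perfect caches, perfect prediction, and unlimited
#    execution units, this fragment can never exceed 1.5 IPC.
[/CODE]

The ceiling comes from the dependency chain, not from the number of EUs, which is exactly why more execution units alone don't help single-thread IPC.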
Another reason for more EUs than for the average use is if you have a checkpointing and replaying architecture. Sometimes speculation goes wrong and it's good to be able to replay instructions quickly.

But why don't you try using the free execution resources with the thread itself? If, for example, the thread stalls due to a cache miss, it could speculatively continue to run the instructions that load data, as a kind of prefetch. After the cache miss is served, the thread could execute normally, with the data it needs already in place. See the sketch below.
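What you describe is essentially run-ahead execution. A minimal Python sketch of the idea, assuming a made-up program format and a flat 200-cycle miss penalty: on a miss, checkpoint the PC, run ahead to prefetch the addresses of future loads, then roll back and replay once the miss is served.

[CODE]
# Toy run-ahead execution (all structures and latencies are invented):
# on a cache miss, checkpoint, run ahead to discover future load
# addresses and prefetch them, then roll back and replay.

MISS_PENALTY = 200                 # hypothetical DRAM latency in cycles
cache = set()                      # addresses currently cached

def execute(program):
    cycles, pc = 0, 0
    while pc < len(program):
        op, addr = program[pc]
        if op == "load" and addr not in cache:
            # Miss: instead of stalling idle, run ahead and prefetch.
            for future_op, future_addr in program[pc + 1:]:
                if future_op == "load":
                    cache.add(future_addr)   # overlap these misses
            cache.add(addr)
            cycles += MISS_PENALTY           # pay the penalty once
            continue                         # replay: pc is unchanged
        cycles += 1                          # hit or ALU op: one cycle
        pc += 1
    return cycles

prog = [("load", 0x100), ("add", None), ("load", 0x200), ("load", 0x300)]
print("with run-ahead:", execute(prog), "cycles")
# -> 204 cycles: one miss penalty covers all three loads. Serialized,
#    the same fragment would stall roughly 3 * 200 cycles.
[/CODE]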
I am sure future CPUs will include run-ahead and advanced replay microarchitectures. But it is possible that OoO execution will hinder such approaches. The first CPU with scout threads (run-ahead execution), Sun's Rock, was a major failure performance-wise.