Nehalem Performance Improvement Features:
Update: Anand's Analysis - http://www.anandtech.com/cpuchipsets...spx?i=3264&p=2
Update: http://www.hardwaresecrets.com/article/535/3
Nehalem allows for 33% more micro-ops in flight compared to Penryn (128 micro-ops vs. 96 in Penryn). This increase was achieved by simply enlarging the re-order window and other such buffers throughout the pipeline.
With more micro-ops in flight, Nehalem can extract greater instruction level parallelism (ILP), and the larger buffers also let each core track micro-ops from two threads at once.
Despite the increase in ability to support more micro-ops in flight, there have been no significant changes to the decoder or front end of Nehalem. Nehalem is still fundamentally the same 4-issue design we saw introduced with the first Core 2 microprocessors. The next time we'll see a re-evaluation of this front end will most likely be 2 years from now with the 32nm "tock" processor, codenamed Sandy Bridge.
Nehalem also improves unaligned cache access performance. SSE has two flavors of load/store instruction: one for data aligned to a 16-byte boundary, and one for unaligned data. On current Core 2 based processors, the aligned instructions execute faster than the unaligned ones. Every now and then a compiler produces code that uses an unaligned instruction on data that happens to be aligned, incurring a needless penalty. Nehalem fixes this case (through some circuit tricks): unaligned instructions running on aligned data are now just as fast.
In many applications (e.g. video encoding) you walk byte-by-byte through a stream of data. If an access crosses a cache line boundary (lines are 64 bytes) and an instruction needs data from both sides of that boundary, you incur a latency penalty for the unaligned cache access. Nehalem significantly reduces this penalty, so algorithms such as motion estimation are sped up significantly (hence the improvement in video encode performance).
Nehalem also introduces a second-level branch predictor per core. This new predictor augments the normal one that sits in the processor pipeline, much as an L2 cache backs an L1 cache. The second-level predictor draws on a much larger set of history data when predicting branches, but because its branch history table is much larger, it is also much slower. The first-level predictor works as it always has, predicting branches as best it can, while the second-level predictor evaluates the same branches in parallel. In cases where the first-level predictor makes a prediction based on the type of branch but lacks the historical data for a highly accurate one, the second-level predictor, with its larger history window, can catch mispredicts on the fly and correct them before a significant penalty is incurred.
The renamed return stack buffer is another important enhancement in Nehalem. In Penryn, mispredicts in the pipeline can populate incorrect data into the return stack (a data structure that tracks the address at which the CPU should resume execution after returning from a function). A return stack with renaming support prevents this corruption: as long as calls and returns are properly paired, you always get the right address out of Nehalem's stack, even in the event of a mispredict.
Full List:
Let's now explain other microarchitecture enhancements that Nehalem will incorporate.
First, Nehalem will have four dispatch units instead of three. So what does that mean? It means the CPU can internally process four microinstructions at the same time rather than three, like on other Core-based CPUs (Core 2 Duo, for example), a 33% increase in processing capability. Translation: at the same clock rate this CPU will be faster, because it can process four microinstructions at once instead of three.
Second, Nehalem will have a second, 512-entry TLB (Translation Look-aside Buffer). This circuit is a table used by the memory-management hardware to convert virtual addresses into physical addresses. Virtual memory is a technique that gives each program its own address space and lets the computer keep operating even when there is not enough RAM available: the operating system moves data from RAM into a file on the hard drive (called the swap file) to free up memory. According to Intel, adding this second table will improve CPU performance.
And third, there are enhancements to the branch prediction unit, with the addition of a second BTB (Branch Target Buffer). The branch prediction circuit tries to guess the next steps of a program in advance, fetching into the CPU the instructions it thinks will be needed next. If it guesses right, the CPU won't waste time loading those instructions from memory, as they will already be inside the CPU. Enlarging the BTB (or adding a second one, as in this CPU) lets this circuit track more branch targets, improving performance.
------------------------------------------------------
Performance Improvement Features:
With the next generation microarchitecture, Intel made significant core enhancements to further improve
the performance of the individual processor cores. Below we describe some of these enhancements.
Instructions per cycle improvements. The more instructions that can be run per clock cycle, the greater the performance. In addition, in many cases, running more instructions per clock lets a work task complete sooner, enabling the processor to get back into a lower power state more quickly. To run more instructions per cycle, Intel made several key innovations.
• Greater parallelism. One way to extract more parallelism from software code is to increase the number of instructions that can be run "out of order." This enables more simultaneous processing and overlaps latencies. To identify more independent operations that can run in parallel, Intel increased the size of the out-of-order window and scheduler, giving them a wider window in which to look for these operations. Intel also increased the size of the other buffers in the core to ensure they wouldn't become a limiting factor.
• More efficient algorithms. With each new microarchitecture, Intel has included improved algorithms in places where previous processor generations saw lost performance due to stalls (dead cycles). Next generation Intel microarchitecture (Nehalem) brings many such improved algorithms to increase performance. These include:
• Faster Synchronization Primitives: As multi-threaded software becomes more prevalent, the
need to synchronize threads is also becoming more common. Next generation Intel
microarchitecture (Nehalem) speeds up the common legacy synchronization primitives (such
as instructions with a LOCK prefix or the XCHG instruction) so that existing threaded
software will see a performance boost.
• Faster Handling of Branch Mispredictions: A common way to increase performance is
through the prediction of branches. Next generation Intel microarchitecture (Nehalem)
optimizes the cases where the predictions are wrong, so that the effective penalty of
branch mispredictions overall is lower than on prior processors.
• Improved hardware prefetch and better load-store scheduling: Next generation Intel
microarchitecture (Nehalem) continues the many advances Intel made with the 45nm next
generation Intel Core microarchitecture (Penryn) family of processors in reducing memory
access latencies through prefetch and load-store scheduling improvements.
Enhanced branch prediction. Branch prediction attempts to guess whether a conditional branch will be taken or not. Branch predictors are crucial in today's processors for achieving high performance. They allow processors to fetch and execute instructions without waiting for a branch to be resolved. Processors also use branch target prediction to attempt to guess the target of the branch or unconditional jump before it is computed by parsing the instruction itself. In addition to greater performance, an additional benefit of increased branch prediction accuracy is that it can enable the processor to consume less energy by spending less time executing mis-predicted branch paths.
Next generation Intel microarchitecture (Nehalem) uses several innovations to reduce branch mispredicts, which hinder performance, and to handle more efficiently the mispredicts that do occur.
• New second-level branch target buffer (BTB). To improve branch predictions in applications that have large code footprints, such as database applications, Intel added a second-level branch target buffer (BTB). BTBs reduce the performance penalty of branches in pipelined processors by predicting the
path of the branch and caching information used by the branch.
• New renamed return stack buffer (RSB). The RSB stores the return addresses associated with call and return instructions. Next generation microarchitecture's renamed RSB helps avoid many common return instruction mispredictions.
Intel Smart Cache Enhancements:
The new three-level cache hierarchy for next generation Intel microarchitecture (Nehalem) consists of:
• Same L1 cache as Intel Core microarchitecture (32 KB Instruction Cache, 32 KB Data Cache)
• New L2 cache per core for very low latency (256 KB per core for handling data and instruction)
• New fully inclusive, fully shared 8MB L3 cache (all applications can use entire cache)
A new two-level Translation Lookaside Buffer (TLB) hierarchy is also included in next generation Intel
microarchitecture (Nehalem). A TLB is a processor cache used by the memory management hardware to speed up address translation: it caches recently used virtual-to-physical address mappings.
All current desktop and server processors use a TLB, but next generation Intel microarchitecture (Nehalem)
adds a new second-level 512-entry TLB to further improve performance.
Improved virtualization performance. Next generation Intel microarchitecture (Nehalem) adds new features that let software further improve its performance in virtualized environments. For example, the next generation microarchitecture includes Extended Page Tables (EPT), which let the hardware translate guest addresses to host physical addresses directly, reducing the software page-table maintenance the virtual machine monitor had to perform on earlier processors.
Source and much more info (SMT, QuickPath): Intel (PDF)
PDF 2 (With Slides)
More briefings here : http://www.intel.com/pressroom/archi...i_20080317fact
Sounds awesome.. Excellent to see more uArch improvements (P.S. that's the way to go, AMD). Guess it's safe to debunk those "Penryn with an IMC" theories now...
Can't Wait.