LGA2011 is server, still thinking about high-end 1356
There has been some news that speaks only of sockets 1155 and 2011, no 1356 or 1355, e.g.:
http://www.engadget.com/2010/04/21/i...aving-those-p/
http://vr-zone.com/articles/a-look-i...ay/8877-1.html
If this is true, then I guess Intel will win in the enthusiast desktop segment, because it will be 8 real Intel cores (with SMT = 16 threads) and quad-channel memory against 4 Zambezi modules (with CMT = 8 threads) and dual-channel.
However, I wonder about the cost of the required 8-layer mainboards and of the CPUs; these will also be at an "enthusiast" level. Furthermore, even in 2011, 4 Zambezi modules should be enough for gaming.
I hope that Zambezi is better than a 4-core LGA1155 Sandy Bridge; then it would sit between Intel's 1155 and 2011 offerings, which would be ideal.
But let's wait and see.
If it weren't, it would be a disaster. Llano is supposed to be the part positioned against SB 1155; Zambezi is to go against the high-end desktop SB.
Although, now that you mention it, I guess it is possible that in gaming and other low-thread situations even SB 1155 might beat Zambezi.
You're absolutely right, people buy platforms.
Go to Dell's site or HP's site and configure apples to apples R710 vs. R715 or DL380 vs. DL385.
You will find that the processor savings do pass through to the platform level, and AMD platforms are 10-15% less expensive.
Right, because we can't currently pair AMD's enthusiast desktop parts with integrated GPU chipsets. :rolleyes:
Nor will AMD release any BD core products with fewer than 8 cores. :rolleyes:
Or not
http://software.intel.com/en-us/foru...st.php?p=97176
Quote:
Great questions … some more details to the response Max gave.
1) The chart is wrong, we will fix it. Sandy Bridge has true 256-bit FP execution units (mul, add, shuffle). They are on exactly the same execution ports as the 128-bit versions. You can get a 256-bit multiply (on port 0) and a 256-bit add (on port 1) and a 256-bit shuffle (port 5) every cycle. 256-bit FP add and multiply bandwidth is therefore 2X higher flops than 128. See IACA for the ports on an instruction-by-instruction basis.
2) The chart doesn’t mention 16-byte paths. We have true 32-byte loads (i.e. each load only uses one AGU resource and we have 2 AGU’s) but only a 48-byte/cycle total is supported to the L1 each cycle. You can’t get 48 bytes per cycle to the DCU using 128-bit operations (only 2 agu’s…). This is why a simple memory-limited kernel like matrix add (load, load, add, store) measures 1.42X speedup (would have predicted 1.5X with the current architecture in the limit; vs. 1.0X if we had double pumped).
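For anyone who wants to see what that memory-limited kernel looks like, here is a minimal sketch (my own illustration with made-up names, not Intel's benchmark code) of the load, load, add, store pattern in 256-bit AVX intrinsics:
Code:
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical AVX matrix-add kernel: each iteration issues two 32-byte
 * loads, one 256-bit add and one 32-byte store, i.e. 96 bytes of L1
 * traffic per 32 bytes of output. */
void matrix_add_avx(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {                        /* 8 floats = 256 bits */
        __m256 va = _mm256_loadu_ps(a + i);             /* 32-byte load, one AGU op */
        __m256 vb = _mm256_loadu_ps(b + i);             /* 32-byte load, one AGU op */
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb)); /* 256-bit add, 32-byte store */
    }
    for (; i < n; ++i)                                  /* scalar tail */
        c[i] = a[i] + b[i];
}
The 128-bit version would do the same with __m128/_mm_* intrinsics at 16 output bytes per iteration. If I read the quote correctly, the 48-byte/cycle L1 limit caps the AVX loop at 2 cycles per 32 output bytes, while the 2-AGU limit caps the SSE loop at 1.5 cycles per 16 output bytes, which is where the predicted 1.5X (versus the measured 1.42X) comes from.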
What can we understand from "true 256-bit EUs" and the denial of double pumping?
So Woodcrest - Wolfdale - Clovertown - Harpertown - Gainestown - Gulftown - Sandy Bridge
and Tigerton - Dunnington - delay - delay - delay - Beckton -
against
Barcelona - Shanghai - Istanbul - Lisbon/Magny-Cours - Valencia/Interlagos
and yet you tell me that you can't keep track of the AMD-based codenames; I wouldn't call you an enthusiast but a pure Intel fanboy.
For those who don't understand what the architecture "delay" means, it is delay => http://en.wikipedia.org/wiki/Delay (is that a good way to explain it, MM? :D )
Can you give me a link where SB is totally trashing Nehalem? So you are saying that we should totally stop buying any Intel-based solution and wait until the next generation. Well, the good part is that you'll have to buy a whole new platform anyway, as usual, thanks to Intel :up:
What I have seen are some VERY theoretical benchmarks, which always manage to show the best increase (although according to some Intel fanboys, some of those, like STREAM, can't be used when AMD shows data because they are too theoretical; now of course they count since it is Intel), with no single-thread improvement (so let's assume there will be some, thanks to the "untuned setup") and about 20% in multi-threaded, which proves the enhancements of Turbo and HT. So I wouldn't call it a killer, because Nehalem 45nm to 32nm is also a 10% integer increase core-for-core and GHz-for-GHz (check the official integer performance results), and the example baseline used to compare against SB is based on the 45nm Q720, so the real enhancement will be much less, about 10%.
Secondly, while Gulftown was able to add 2 cores and stay at the same TDP with the same GHz, it has yet to be seen whether SB can do the same thing: add 2 cores without having to reduce the GHz.
Do you have any personal experience with the server world? I guess not. First, it takes about 6 months before huge companies shift orders to a new platform, even when they get ES samples months in advance, and they still have huge orders that will remain on the old platform.
Secondly, this Nehalem influence you mention as the reason AMD server sales dropped: now old-school IT bosses can once again count on their "we are standardized on INTEL ONLY" rubbish, because thanks to Nehalem this CPU was actually better in most cases, while in previous generations it was very easy to show that the Opteron platform was most of the time the better buy price-, performance- and power-wise from an IT point of view. The MC introduction is not yet visible in server shipments. Perhaps some day you will understand that some people on this forum don't just work with a few servers or desktops but with thousands.
And since you have all the data, you of course already know, way before a platform is launched, what the outcome will be. Utter fanboy crap.
Please stay in Intel-related topics only, if that is all you can bring to the table.
Did you check the context of the double-pumping statement? To me this looks to be related to loads and the cache bandwidth/AGU resources; it's also contained in point 2). I highlighted some different parts. You can also double-pump cache accesses etc.
The first version of the chart (said to be wrong in point 1)) contained "AVX LO" and "AVX HI" units, also drawn at the same width as the 128-bit units. Maybe they're not even using double pumping but other techniques like wave pipelining (less likely).
How would you explain the nearly unchanged area of the FPU on the die? Surely not by chip stacking.
Well, you can always ask the question directly to Intel in the respective thread.
My answer: I do not know why the FPU is only 7% larger (if we were to trust an analysis based on a low-resolution photo with large margins of error). What increase in die area does doubling the datapaths to 256-bit cause? I haven't seen an analysis on this.
Instead of trying to find some weird explanations, we could take his words at face value; the words he uses are pretty straightforward. I wouldn't be surprised if there was some intentional misleading done previously to deceive the competition.
Quote:
"The chart is wrong, we will fix it. Sandy Bridge has true 256-bit FP execution units (mul, add, shuffle).
He is correct, Llano was supposed to go against SB. Yes, it has aged K10 derivatives, but the GPU is quite strong from what I heard. On the other hand, SB has a strong CPU but a not-so-strong GPU.
Intel tried to equalize this imbalance, and I also posted about it:
http://www.xtremesystems.org/forums/...8&postcount=16
It was also reported by Fud after 7 months; well, his report is a bit, ammm, wrong, I mean it's not 100% correct.
http://www.xtremesystems.org/forums/...59&postcount=1
It's no coincidence that the architects behind the very long SIMD words (256-bit, 512-bit and longer) are Doug Carmean and Eric Sprangle, who joined Intel from Ross Technologies.
These are exactly the hyperpipelining specialists at Intel:
(1) They co-authored the original hyperpipelining paper:
Increasing Processor Performance by Implementing Deeper Pipelines
(2) They led the original ~60-stage hyperpipelined Nehalem project.
http://www.theinquirer.net/inquirer/...em-slated-2005
(3) They initiated the Larrabee project. One of the main ideas behind Larrabee is to achieve the theoretical maximum number of FLOPs on a given die with a limited number of transistors. A fourfold hyperpipelined 128-bit unit running at 4.8 GHz can produce 512-bit results at 1.2 GHz using only 25% (plus a bit) of the transistors of a non-hyperpipelined unit (a quick arithmetic check of this follows the links below).
ftp://download.intel.com/technology/...abee_paper.pdf
http://www.drdobbs.com/high-performa...ting/216402188
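A quick back-of-the-envelope check of that 4.8 GHz / 1.2 GHz figure (my own arithmetic, not taken from the Larrabee paper): the raw result bandwidth of the narrow, fast unit matches that of the full-width unit at the base clock,

$$128\ \text{bit} \times 4.8\ \text{GHz} = 614.4\ \text{Gbit/s} = 512\ \text{bit} \times 1.2\ \text{GHz},$$

so the fourfold-pipelined 128-bit unit can in principle retire the same number of 512-bit results per second as a full 512-bit unit at 1.2 GHz.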
The SIMD units are the easiest (of all units) to hyperpipeline. All instructions which could cause problems for hyperpipelining have been systematically left out of the AVX and LNI specifications (for instance, data shuffles crossing 128-bit boundaries).
Regards, Hans
Intel will never discuss that info in public; you may ask all day long.
Also, the whole die is ~7% larger, not the FPU. The FPU is just a tiny bit bigger; the AVX support may have contributed to that.
When you look at the past, like Yonah to Merom (both done at 65nm), the core size investment was radical, going from 19mm2 to 31mm2. Some of that huge increase in core logic was due to the physical doubling of the SSE capabilities, which is the most prominent and largest performance change compared to Yonah. All of this resulted in a 15-20% performance increase on average over Yonah, and especially in SSE code the jump was ~50-60%. We can't use AMD as an example since Hound was done on 65nm while RevF was 90nm, but you can see Hans' work here. As Hans showed, a single Hound core (65nm) takes up 20% less space than an old single RevF core (which was 90nm), and you can see the 2nd FP unit in Hound highlighted by Hans De Vries.
edit: completely forgot the Brisbane core :). 20.8 to 25.5mm2 is Brisbane to Barcelona (single-core size), a 22.5% increase for various core improvements and 2x SSE throughput (in theory, due to the 2nd FP unit). This brought a very similar 15-20% performance increase on average, and 50-60% in SSE code.