AMD says (http://www.youtube.com/watch?v=VIs1CxuUrpc)
"Synthesizable with small number of custom arrays"
Together with what was said before, I think one of the main goals AMD wants to achieve is easily customizable processors: add a GPU core here, some cache there, another core over there. From the slide it looks like much of the design is already capable of being laid out by a computer.
We have the caches, the integer units, and the floating-point units as the fixed, hand-optimized blocks, with logic like the x86 decode organically filling up the space in between. AMD also says it makes it easier to port the whole thing to a different process.
I have only limited knowledge of modern synthesis and floorplanning, from working with some FPGAs.
Maybe Hans or somebody in the industry can say something about Bobcat?
Not exactly sure which context you mean when you say they "do not floorplan", but they definitely allow floorplanning at some level. The first step of synthesis, RTL -> netlist, doesn't floorplan (it only cares about standard-cell usage and timing estimates/constraints), if that's what you're trying to get at. However, the second step of synthesis, netlist -> placement (the placement tool), definitely does floorplanning.
Tools like Cadence Encounter take floorplan constraints and allow sub-modules to be partitioned; however, as the picture above shows, the results tend to look like a jumbled mess, since strict boundaries aren't adhered to.
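To illustrate the partitioning step mentioned above, here is a toy min-cut bipartitioner in the spirit of Kernighan-Lin. It is purely a sketch: the cell and net names are invented, and real partitioners inside tools like Encounter are multilevel and timing-aware.

```python
# Toy min-cut bipartitioning (Kernighan-Lin flavored): given a netlist,
# repeatedly swap cell pairs across the cut while the number of cut nets
# shrinks. Only shows the core idea, not a production algorithm.

def cut_size(nets, side):
    """Number of nets whose endpoints land in different partitions."""
    return sum(1 for u, v in nets if side[u] != side[v])

def improve_partition(nets, side):
    """Greedy pairwise swaps until no single swap reduces the cut."""
    cells = sorted(side)
    best = cut_size(nets, side)
    improved = True
    while improved:
        improved = False
        for a in cells:
            for b in cells:
                if side[a] == side[b]:
                    continue
                side[a], side[b] = side[b], side[a]
                c = cut_size(nets, side)
                if c < best:
                    best, improved = c, True
                else:
                    side[a], side[b] = side[b], side[a]  # revert the swap
    return best

# Two tightly connected clusters joined by one net (names are made up):
nets = [("a", "b"), ("b", "c"), ("a", "c"),
        ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
side = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}  # bad initial split
print(improve_partition(nets, side))  # → 1 (only the c-d net stays cut)
```

Starting from a deliberately scrambled assignment, the swap loop recovers the two natural clusters, leaving only the single inter-module net in the cut.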
Well, historically [x86] chips from both camps have mainly used custom layout in the datapath with a varying amount of synthesized control logic, so seeing pictures like this is a bit of an eye-opener compared to the norm.
Another example is Intel's Pine-Trail (bigger):
The huge purple blotch running down the middle is all synthesized logic
That pretty much sums up my thoughts on the matter: it looks like they went for a semi-custom approach, supplying some of the main datapath logic (not necessarily the whole FPU, just the important chunks) and the arrays as hard macros/external IP (in- or out-of-house, it doesn't matter) while synthesizing the rest.
While they're definitely not unique in this approach, it should certainly allow quicker process adaptation, since only a standard-cell library and select logic/array IP pieces would technically be necessary. Granted, there's still a bit more work involved than just swapping libraries/IP and pressing a few buttons.
Just pointing out that while we humans can definitely be more adept at coming up with clever (sometimes novel) solutions to layout area/timing/congestion constraints, it's also a significant capital and time investment, so for ROI and time-to-market reasons it doesn't always work out. Bobcat is an obvious example of this, and Atom for that matter.
Honestly, I would expect finding the logically optimal Euler path to be much easier for a computer to solve.
But yes, computerized tools aren't very good at balancing the plethora of added constraints of the physical world, hence our restricting them to sub-optimal standard cells plus wiring constraints.
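The Euler-path claim above is easy to demonstrate: finding such a path (the graph problem behind diffusion-sharing transistor ordering in a cell layout) is mechanically solvable with Hierholzer's algorithm. A minimal sketch, with a made-up 4-edge example graph:

```python
from collections import defaultdict

def euler_path(edges):
    """Hierholzer's algorithm: a walk using every edge exactly once."""
    adj = defaultdict(list)
    degree = defaultdict(int)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
        degree[u] += 1
        degree[v] += 1
    # An Euler path must start at an odd-degree vertex if one exists.
    odd = [n for n in degree if degree[n] % 2]
    start = odd[0] if odd else edges[0][0]
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if adj[v]:                 # unused edge left: keep walking
            u = adj[v].pop()
            adj[u].remove(v)
            stack.append(u)
        else:                      # dead end: retire this vertex
            path.append(stack.pop())
    return path[::-1]

# A small series/parallel network as an undirected graph:
print(euler_path([("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]))
# → ['a', 'c', 'b', 'a', 'd']
```

This is exactly the part computers handle trivially; the hard part is the surrounding physical constraints the following posts discuss.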
Another nice example is the 1.9 W TDP, 2 GHz hard-macro version of the dual-core ARM Cortex-A9 in the TSMC 40G process (total size of only 6.7 mm2).
http://www.arm.com/products/CPUs/Cor...ard-Macro.html
Regards, Hans
~~~~ http://www.chip-architect.org ~~~~ http://www.physics-quest.org ~~~~
The synthesis starts with rectangular shapes, but the logic migrates during the optimization process. Some pieces of one unit even end up in the middle of other units (typically interface logic between the two units). For some reason the hardware synthesizer concludes that it's electrically/timing-wise better to move it there.
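The migration effect described here can be reproduced with a toy 1-D placement sketch: cells get soft floorplan regions, but a net crossing two units can pull its cells out of their region when the wirelength gain beats the region penalty, which is how interface logic ends up between (or inside) other units. All names, numbers, and the greedy-descent optimizer are invented for illustration.

```python
import random

random.seed(0)

REGIONS = {"alu": (0.0, 0.5), "dec": (0.5, 1.0)}    # (xmin, xmax) per unit
cells = {"a1": "alu", "a2": "alu", "d1": "dec", "d2": "dec"}
nets = [("a1", "a2"), ("d1", "d2"), ("a2", "d1")]   # last net crosses units

def cost(pos, region_weight=0.1):
    """Total wirelength plus a soft penalty for leaving the home region."""
    wirelength = sum(abs(pos[u] - pos[v]) for u, v in nets)
    penalty = 0.0
    for c, unit in cells.items():
        lo, hi = REGIONS[unit]
        penalty += max(0.0, lo - pos[c]) + max(0.0, pos[c] - hi)
    return wirelength + region_weight * penalty

pos = {c: random.random() for c in cells}
best = cost(pos)
for _ in range(20000):                              # greedy random descent
    c = random.choice(list(cells))
    old = pos[c]
    pos[c] = min(1.0, max(0.0, old + random.gauss(0, 0.1)))
    new = cost(pos)
    if new < best:
        best = new
    else:
        pos[c] = old                                # reject worsening move

# The interface cells a2 and d1 drift toward the region boundary at 0.5,
# even though their home regions lie on opposite sides of it.
print(round(best, 3), round(pos["a2"], 2), round(pos["d1"], 2))
```

Because the soft penalty is weak relative to the crossing net, the optimizer happily parks interface cells at the unit boundary, just as the synthesized plots show.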
How do you know? In a true 12-thread load, do you think it would be over 70%? Maybe it uses four cores, and the HT threads easily get maxed out. Four fully loaded cores are around 66% of a hexa-core.
My point is: a hexa-core being 70% utilized can only mean at least four cores' worth of work. It could be spread over all twelve threads, but there is no way for you to see it, and since it isn't 100%, it seems like it isn't using all cores.
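A quick arithmetic check of the core-count argument above (a sketch only; scheduler time-slicing is ignored), which supports the "at least four cores" reading:

```python
import math

# Total utilization U on an n-core chip equals U*n "core-equivalents" of
# work, so at least ceil(U*n) physical cores must be at least partly busy,
# while k fully loaded cores show up as k/n overall utilization.
def min_busy_cores(utilization, n_cores):
    return math.ceil(utilization * n_cores)

print(min_busy_cores(0.70, 6))  # → 5 (70% of a hexa-core = 4.2 core-equivalents)
print(round(4 / 6, 2))          # → 0.67 (four loaded cores ≈ the "~66%" figure)
```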
Finally, chips from AMD have always been nicely ordered, pointing at a mostly hand-made layout. I can imagine that this leads to uneven power usage, unnecessarily long circuits and timings, and wasted die space.
Lol, it is the exact opposite: hand layout is much better. Humans are better at finding Eulerian paths and coming up with clever layouts; computers can't really do that as effectively with all of the design rules and other parameters. The difference in performance is 2.6-7x faster with custom-designed circuits.
Really, what happens is that a coder will simulate his module and make sure it reaches the targeted timing, which is usually set well above the actual required delay to assure robust operation. If the logic can't reach the speed, it is either rewritten or circuit designers optimize it. In certain logic families it must be entirely custom-designed.
Circuits that are custom-designed are usually things like power gating, clock distribution, and analog circuits such as PLLs, DLLs, and memory controllers / I/O pads.
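The simulate-check-rewrite loop just described can be sketched as a simple slack report. This is a hypothetical illustration: the path names, delays, and guard margin are all invented, not from any real flow.

```python
# Each path's estimated delay is checked against the target clock period
# minus a guard margin; failing paths go back for a logic rewrite or for
# circuit-level hand optimization.
def timing_report(paths, clock_period_ns, margin_ns=0.1):
    """Split (name, delay_ns) paths into (passing, failing) by slack."""
    passing, failing = [], []
    for name, delay_ns in paths:
        slack = clock_period_ns - margin_ns - delay_ns
        (passing if slack >= 0 else failing).append((name, round(slack, 3)))
    return passing, failing

paths = [("decode_stage", 0.45), ("alu_bypass", 0.62), ("sched_pick", 0.58)]
ok, bad = timing_report(paths, clock_period_ns=0.625)  # 1.6 GHz target
print(bad)  # → [('alu_bypass', -0.095), ('sched_pick', -0.055)]
```

The negative-slack paths are the ones a designer would either recode or hand over for custom circuit work.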
Given the information that AMD has released today, is it possible for anyone here to make an educated guess on how much faster BD will be clock for clock over Deneb? Those slides go beyond my basic understanding of CPU design.
As quoted by LowRun......"So, we are one week past AMD's worst case scenario for BD's availability but they don't feel like communicating about the delay, I suppose AMD must be removed from the reliable sources list for AMD's products launch dates"
Phenom was bad at branch prediction, so AMD improved it far beyond Intel's best; just look at the die space used by it on the little Bobcat. It's just impressive. The ALU pipes seem to have changed from 3 combined ALU/AGU units that can do about 1.5 of each per cycle to 2 AGU + 2 ALU that can do 2 of each per cycle. So more IPC.
About the L1D it's not very clear; we need to know whether it's inclusive or not. So it could be incredibly faster, or simply as good as the old Phenom II; latency is said to be masked.
L2 latency is 17 cycles, so it's not bad, and it seems to be 1 MB shared between the 2 ALU cores, which means less data will have to go through the L3 when work moves between cores.
The pipeline is longer, but that's aimed at ramping up clocks, and the far better prediction should hide the bad effects of a long pipe.
It's gonna be the "Core 2" effect, I think.
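The pipe-width arithmetic above can be made explicit with a toy peak-throughput comparison (the 1.5-per-cycle and 2-per-cycle figures follow the post, not official AMD documentation):

```python
# Peak sustained ops per cycle for an even ALU/AGU instruction mix: shared
# combined lanes split their issue slots 50/50 between the two op types,
# while dedicated lanes serve only their own type.
def per_cycle(alu_lanes, agu_lanes, shared_lanes=0):
    alu = alu_lanes + shared_lanes / 2
    agu = agu_lanes + shared_lanes / 2
    return alu, agu

print(per_cycle(0, 0, shared_lanes=3))  # → (1.5, 1.5)  3 combined ALU/AGU
print(per_cycle(2, 2))                  # → (2.0, 2.0)  dedicated 2 ALU + 2 AGU
```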
Computers need to be programmed by humans in order to make stuff up, but even then we don't make perfect stuff, so it would be easier for a human to design the best layout by hand than to let a robot do the work. That's all I'm saying. Some logic can be done by computers, but some complicated logic portions of a chip would indeed benefit from human intervention.