Thread: AMD Zambezi news, info, fans !

    Quote Originally Posted by DarthShader
    You might want to measure the voltages with a voltmeter, if possible on that board, because I don't think going as low as you describe is possible with silicon-based semiconductors, but I might be wrong.
    I'd be more than happy to, but I'm not entirely sure where I'd need to measure. I have no qualms about poking around, even at the small SMD caps, but if I took that approach it'd literally take me a week to get lucky enough to find it, lol. I'll drag out my spare PSU and HDD, boot the board, and probe around the major components (MOSFETs to start).

    Quote Originally Posted by Dresdenboy
    @Apokalipse, Dolk:
    Nice to see at least some technical discussion going on, other than analyzing the wave of faked BD screenshots ^^

    To your discussion:
    Well, after providing 2 instruction streams to a BD module, the OS is kind of out of the loop of executing them.

    The fetch/decode/dispatch path works on 2 threads, capable of switching between them on a per-cycle basis. The dispatch unit dispatches groups of "cops" (complex ops, an AMD term) belonging to either thread (I hope my English skills aren't killing the message here). Such a dispatch group might contain ALU, AGU and FP ops of one thread. I assume the FP ops, or the corresponding dispatch group, carry a tag denoting their respective thread.

    From now on, the integer and FP schedulers work kind of independently on the cops/uops they received, until control flow changes. The FP scheduler "sees" a lot of uops asking for ready inputs and producing outputs. Except for control flow changes, it doesn't even need to know which thread an op belongs to. The thread relation is assured by a register mapping (which has to be done for OOO execution anyway). So the FPU could execute uops as their inputs become available.

    If there are 256b ops, they're already decoded as 2 uops ("Double decode" type) to do calculations on both 128b-wide halves of the SIMD vector. I don't think there is an explicit locking mechanism. The FPU could basically issue both uops in the same cycle or - if there are ready uops of the other thread - issue them to one 128b unit during 2 consecutive cycles (the software optimization manual shows latencies for 256b ops that support this). This actually depends on the implemented policy: issue uops based on their age only, or issue them so as to ensure balanced execution of both threads.

    But what's important to your discussion: this happens independently from both integer cores.
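    To make that issue-policy question a bit more concrete, here's a toy model in C of the scheduler behavior he's describing. To be clear, the structure names, the two-pipe count, and the oldest-ready-first policy are all my own assumptions for illustration, not anything AMD has published:

    [CODE]
    /* Toy model of the shared FP scheduler described above. All names,
     * the two-pipe count, and the issue policy are illustrative guesses,
     * not AMD's actual implementation. */
    #include <stdio.h>

    #define NUM_PIPES 2   /* assume two 128-bit FMAC pipes per module */

    typedef struct {
        int thread;   /* hardware thread that dispatched this uop (0 or 1) */
        int age;      /* dispatch order; lower = older */
        int ready;    /* inputs available? */
    } uop;

    /* Oldest-ready-first selection: note the scheduler never looks at
     * q[i].thread -- it only needs readiness, as described above. */
    static int pick_oldest_ready(const uop *q, int n, const int *taken)
    {
        int best = -1;
        for (int i = 0; i < n; i++)
            if (q[i].ready && !taken[i] && (best < 0 || q[i].age < q[best].age))
                best = i;
        return best;
    }

    int main(void)
    {
        /* A 256b AVX op from thread 0 arrives pre-split into two 128b
         * uops (ages 0 and 1); thread 1 contributes two more uops. */
        uop queue[] = {
            { .thread = 0, .age = 0, .ready = 1 },  /* low half of 256b op  */
            { .thread = 0, .age = 1, .ready = 1 },  /* high half of 256b op */
            { .thread = 1, .age = 2, .ready = 1 },
            { .thread = 1, .age = 3, .ready = 0 },  /* waiting on an input  */
        };
        int n = (int)(sizeof queue / sizeof queue[0]);
        int taken[4] = { 0 };

        /* Each cycle, fill up to NUM_PIPES issue slots. With this age-based
         * policy, both halves of the 256b op issue together in cycle 0; a
         * fairness policy would interleave thread 1's ready uop instead. */
        for (int cycle = 0; cycle < 2; cycle++)
            for (int pipe = 0; pipe < NUM_PIPES; pipe++) {
                int i = pick_oldest_ready(queue, n, taken);
                if (i < 0) break;
                taken[i] = 1;
                printf("cycle %d, pipe %d: thread %d uop (age %d)\n",
                       cycle, pipe, queue[i].thread, queue[i].age);
            }
        return 0;
    }
    [/CODE]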
    To expand on this bit of info, here's something I found while Googling for hints on 'Dozer pricing, lol. It's from AMD's "20 Questions" blog, from Aug 2010...

    Quote Originally Posted by John Fruehe@AMD
    "Is there any "programmable-tangible" improvement in synchronization between cores in the same module? In other words, will I get tangible performance improvement if I can partition my multi-threaded algorithm to pairs of closely interacting threads, and schedule each pair to a module?" – Edward Yang

    That is a very interesting question.

    For the majority of software, the OS will work in concert with the processor to manage the thread-to-core relationships. We are collaborating with Microsoft and the open source software community to ensure that future versions of Windows and Linux operating systems will understand how to enumerate and effectively schedule the Bulldozer core pairs. The OS will understand whether your machine is set up for maximum performance or for maximum performance/watt, which takes advantage of Core Performance Boost.

    However, let’s say you want to explore if you can get a performance advantage if your threads were scheduled on different modules. The benefit you can gain really depends on how much sharing the two threads are going to do.

    Since the two integer cores are completely separate and have their own execution clusters (pipelines), you get no sharing of data in the L1 – and no specific optimizations are needed at the software level. However, at the L2 cache level there could be some benefits. A shared L2 cache means that both cores have access to read the same cache lines – but obviously only one can write any cache line at any time. This means that if you have a workload with a main focus of querying data and your two threads are sharing a data set that fits in our L2, then having them execute in the same module could have some advantages. The main advantage we expect to see is an increase in the power efficiency of the cores that are idle. The more idle the other cores are, the better chance the busy cores will have to boost.

    However, there is another consideration, which is how available the other cores are. You need to weigh the benefits of data sharing against the benefit of starting the thread on the next available core. Stacking up threads to execute in proximity means that a thread might be waiting in line while an open core is available for immediate execution. If your multi-threaded application isn't optimized to target the L2 (or possibly the L3 cache), or you have distinctly separate applications to run, and you don't need to conserve power, then you'll likely get better performance by having them scheduled on separate modules. So it is important to weigh both options to determine the best execution.
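    If you want to play with this yourself on Linux, here's a minimal sketch of pinning a pair of threads onto one module using the GNU pthread affinity extension. Big caveat: it assumes cores 0 and 1 enumerate as the two integer cores of module 0, which you'd want to verify against your actual topology (/proc/cpuinfo) before trusting it:

    [CODE]
    /* Minimal sketch: pin two closely interacting threads onto the two
     * cores of one Bulldozer module. The core numbers (0 and 1) are an
     * assumption about how the module enumerates -- verify on your box. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        long core = (long)arg;
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET((int)core, &set);
        /* Bind this thread to a single core of the module. */
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);

        printf("thread bound to core %ld\n", core);
        /* ... work on the shared data set that fits in the module's L2 ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;

        /* Cores 0 and 1: assumed to be the pair sharing one module's
         * front end, FPU and L2 (see the quoted discussion above). */
        pthread_create(&t0, NULL, worker, (void *)0L);
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }
    [/CODE]

    Compile with gcc -pthread. The Windows equivalent would be SetThreadAffinityMask, though per the blog entry above, the Windows 8 scheduler is supposed to handle this on its own.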
    It seems as though we won't really be able to utilize a Bulldozer chip to its full extent until Windows 8. I'm no coder, but I have a feeling making Windows 7 capable of that won't be as simple as M$ rolling out a 50MB patch, but I'll cross my fingers just in case!!

    EDIT: Just in case there is still any confusion, here are some comments he posted in the same blog entry:

    "John Fruehe September 13, 2010
    A module will have 2 cores in it. It will be seen by the hardware and software as 2 cores. The module will essentially be invisible to the system.
    "

    "John Fruehe September 8, 2010
    We have already said one core per thread, period. But a single thread gets all of the front end, all of the FPU and all of the L2 cache if there is not a second thread on the module.
    "
    Last edited by Formula350; 05-17-2011 at 04:59 PM.
