>If using the lock prefix is a legacy operation what are
>the modern ones?
Linus Torvalds:
I don't think there are any - I think they just meant that
they made the old legacy instructions run faster, instead
of trying to introduce anything new.
Which I really look forward to testing. The serialization
overhead of Core 2 is better than many other processors,
but everything else is so good that it still stands out
like a sore thumb. We have lots of kernel loads where one
of the biggest costs is just locking (even without any
nasty contention and cacheline ping-pong), because of how
it serializes the pipeline.
Now that people are trying to push more and more multi-
threaded programming paradigms, the locking is finally
getting some real exposure. It's always been a big issue
in kernels, but now all the fast user-level locking is
making it show up in "normal" loads too.
--------------------------------------
That's something I'm also looking forward to. Even without contention, acquiring locks is *painful*. Unless the data/code you're protecting takes a significant amount of time to process/execute, you'll be bitten by the sheer cost of the lock/unlock pairs, so there is room for *lots* of improvement there.
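To make that concrete, here is a minimal single-threaded sketch that times the same increment loop with and without an uncontended synchronized block. The class and method names are invented for illustration, and a serious measurement would use a harness such as JMH to control for warmup and JIT effects.

    // Rough single-threaded comparison of an uncontended synchronized block
    // versus the same work without one. Illustrative only; numbers will vary
    // by JVM and CPU, and a real benchmark would use a proper harness.
    public class UncontendedLockCost {
        private static final Object LOCK = new Object();
        private static long counter;

        private static long plain(int iterations) {
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                counter++;                      // no lock at all
            }
            return System.nanoTime() - start;
        }

        private static long locked(int iterations) {
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                synchronized (LOCK) {           // uncontended: only one thread ever runs this
                    counter++;
                }
            }
            return System.nanoTime() - start;
        }

        public static void main(String[] args) {
            int n = 50_000_000;
            plain(n); locked(n);                // warm up the JIT
            System.out.printf("plain : %d ms%n", plain(n) / 1_000_000);
            System.out.printf("locked: %d ms%n", locked(n) / 1_000_000);
        }
    }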
----------------------------------------
+1
It's not uncommon for Java workloads to waste 10% or more of their time processing uncontended locks, and I've seen up to ~30% in real-world apps (1).
The underlying reason is that many critical parts of the core Java library are synchronized (StringBuffer, Hashtable, many I/O classes). While there are newer APIs that avoid this (StringBuilder, HashMap, etc.), there is a lot of legacy code that uses the old APIs directly or indirectly.
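A rough illustration of the difference (the library classes are real, the surrounding program is invented): every call on StringBuffer and Hashtable acquires and releases a monitor even when only one thread ever touches the object, while StringBuilder and HashMap do no locking at all.

    import java.util.HashMap;
    import java.util.Hashtable;
    import java.util.Map;

    // Legacy synchronized classes vs. their unsynchronized replacements.
    public class LegacyVsModern {
        public static void main(String[] args) {
            StringBuffer legacyBuf = new StringBuffer();   // synchronized methods
            legacyBuf.append("a").append("b");             // two lock/unlock pairs

            StringBuilder modernBuf = new StringBuilder(); // no locking
            modernBuf.append("a").append("b");

            Map<String, Integer> legacyMap = new Hashtable<>(); // synchronized methods
            legacyMap.put("x", 1);                              // lock/unlock per call

            Map<String, Integer> modernMap = new HashMap<>();   // no locking
            modernMap.put("x", 1);

            System.out.println(legacyBuf + " " + modernBuf + " " + legacyMap + " " + modernMap);
        }
    }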
JVMs use optimization tricks to avoid this (lock removal, lock elision, lazy unlocking, etc.; see the elision sketch after the footnote), but that only alleviates the problem rather than resolving it entirely.
-- Henrik
(1) Measured as the increase in throughput when locks are forcefully disabled in JRockit (using -XXlazyunlocking or just hacking the JVM to not issue CAS instructions). The 30% number comes from a JSP-heavy app I ran into some time back. SPECjbb2005 gains ~10% by the use of -XXlazyunlocking.
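To illustrate the lock elision case Henrik mentions: in a sketch like the one below, the StringBuffer never escapes the method, so a JIT with escape analysis may remove the monitor operations entirely. Whether a particular JVM actually does so depends on its escape-analysis implementation and settings; this is a candidate for elision, not a guarantee.

    // Sketch of code that is a candidate for lock elision.
    // Each StringBuffer.append() is synchronized, but the buffer is
    // thread-confined (never published), so the lock/unlock pairs can
    // in principle be removed by the JIT.
    public class LockElisionCandidate {

        static String greet(String name) {
            StringBuffer sb = new StringBuffer();   // never escapes this method
            sb.append("Hello, ");                   // synchronized append
            sb.append(name);                        // synchronized append
            sb.append("!");                         // synchronized append
            return sb.toString();                   // only the resulting String escapes
        }

        public static void main(String[] args) {
            System.out.println(greet("world"));
        }
    }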
----------------------------------------
Faster Synchronization Primitives: As multi-threaded software becomes more prevalent, the need to synchronize threads is also becoming more common. Next generation Intel microarchitecture (Nehalem) speeds up the common legacy synchronization primitives (such as instructions with a LOCK prefix or the XCHG instruction) so that existing threaded software will see a performance boost.
That's actually the part I like the most. Better overall IPC is a very nice thing, but lowering the cost of the synchronization primitives is much more interesting. It enables parallelization of 'harder' workloads that don't lend themselves well to it and currently reap smaller benefits because of the synchronization overhead.
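For reference, the primitives the Intel text refers to are what JITs emit underneath operations like the ones below. On x86 the compare-and-set typically becomes a LOCK CMPXCHG and the atomic increment a LOCK XADD, though the exact instruction selection is up to the JVM and CPU generation.

    import java.util.concurrent.atomic.AtomicInteger;

    // The LOCK-prefixed instructions mentioned in the Intel quote sit
    // underneath operations like these; the comments note the typical
    // x86 instruction choices, which are JVM-dependent.
    public class AtomicPrimitives {
        public static void main(String[] args) {
            AtomicInteger counter = new AtomicInteger(0);

            counter.incrementAndGet();                      // typically LOCK XADD on x86
            boolean swapped = counter.compareAndSet(1, 42); // typically LOCK CMPXCHG

            System.out.println(counter.get() + " " + swapped);
        }
    }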