
Thread: Penryn SSE4 benchmark - DivX encoding

  1. #1
    all outta gum
    Join Date
    Dec 2006
    Location
    Poland
    Posts
    3,390

    Penryn SSE4 benchmark - DivX encoding

    There is experimental SSE4 support in the latest version of the DivX codec, and the DivX developers have benchmarked a Penryn ES with that feature:

    This is quite interesting and suggests SSE4 is a much bigger improvement than SSSE3. But mind the clause after the asterisk: I bet they just don't want people to know real Penryn performance.

    I haven't seen this posted yet, but sorry if it is already known.
    Source: http://www.divx.com/divx/windows/codec/?cid=DP0000216
    www.teampclab.pl
    MOA 2009 Poland #2, AMD Black Ops 2010, MOA 2011 Poland #1, MOA 2011 EMEA #12

    Test bench: empty

  2. #2
    Xtreme Member
    Join Date
    Mar 2007
    Location
    The Netherlands
    Posts
    199
    The Penryn will support SSE4.1 and Barcelona will support SSE4a btw.

  3. #3
    Xtreme Cruncher
    Join Date
    Aug 2006
    Location
    Denmark
    Posts
    7,747
    Quote Originally Posted by MichelinGuy View Post
    The Penryn will support SSE4.1 and Barcelona will support SSE4a btw.
    SSE4a is just 4 or 6 (can't remember) of the 47 (54) SSE4 instructions, since AMD isn't allowed to implement the rest until after a certain time. So 45nm K10 should be able to support SSE4.1, but not the last 7 instructions in SSE4.2.
    Crunching for Comrades and the Common good of the People.

  4. #4
    Registered User
    Join Date
    Dec 2004
    Posts
    16
    It's about the same results Anand came up with in his Penryn preview a couple of weeks ago. http://www.anandtech.com/cpuchipsets...oc.aspx?i=2972

    Looking forward to this fall, I certainly hope AMD can Intel to get back on track.

  5. #5
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    it sounds like sse4 brings a big boost to a filter technique developed especially for sse4.
    it's a nice feature but imo it's being hyped way too much.

  6. #6
    Xtreme Legend
    Join Date
    Mar 2005
    Location
    Australia
    Posts
    17,242
    Team.AU
    Got tube?
    GIGABYTE Australia
    Need a GIGABYTE bios or support?



  7. #7
    Xtreme X.I.P. MaxxxRacer's Avatar
    Join Date
    Aug 2004
    Location
    Los Angeles, Ca USA
    Posts
    12,551
    That bench was between SSE4 and SSE2. It SHOULD have been between SSE3 and SSE4. For all we know the difference between SSE4 and SSE3 is minimal.

  8. #8
    Xtreme Cruncher
    Join Date
    Aug 2006
    Location
    Denmark
    Posts
    7,747
    It's nice to see continued development of SSE, especially since SSE replaces x87 in 64-bit Windows. Not a moment too soon to get rid of x87.

    Intel said SSE4 would be as big as SSE2. SSE2 was a major step that could eliminate x87 as well as give a heavy performance boost.
    Crunching for Comrades and the Common good of the People.

  9. #9
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Belgrade, Serbia
    Posts
    187

    I think I know more about SSE4...

    I think I know more about SSE4 than you folks, so please let me explain:

    AMD has SSE4a in K10. It is just a handful of instructions:

    Code:
    1.6.1  AMD Instruction Set Enhancements
    
    The AMD Family 10h processor has been enhanced with the following new 
    instructions:
    
    LZCNT, POPCNT—Advanced Bit Manipulation (ABM) instructions operate on 
    general purpose registers. MOVNTSS, MOVNTSD, EXTRQ, INSERTQ -- SSE4a
    instructions operate on XMM registers.
    None of them are of much use except for the LZCNT and POPCNT.
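
    For anyone who hasn't met the two ABM instructions, here is a minimal portable C sketch of what they compute. The function names are made up for illustration, and the loops only model the results; they say nothing about how the hardware does it.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Models POPCNT: number of set bits in a 32-bit value. */
    unsigned popcnt32_model(uint32_t x)
    {
        unsigned count = 0;
        while (x != 0) {
            x &= x - 1;   /* clear the lowest set bit */
            ++count;
        }
        return count;
    }

    /* Models LZCNT: number of leading zero bits. Returns 32 for x == 0,
       unlike BSR, whose result is undefined for a zero input. */
    unsigned lzcnt32_model(uint32_t x)
    {
        unsigned count = 0;
        uint32_t mask = 0x80000000u;
        while (mask != 0 && (x & mask) == 0) {
            ++count;
            mask >>= 1;
        }
        return count;
    }
    ```

    With these, popcnt32_model(0xF0F0) gives 8 and lzcnt32_model(1) gives 31.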

    None of them will be available in Penryn.

    On the other hand, Intel has SSE4.1, SSE4.2 and POPCNT.

    SSE4.1 will come with Penryn and it has 47 new instructions, many of them already used in DivX 6.6.

    The reason a direct comparison is not possible in the above-mentioned benchmark is that the SSE4 code does a full window search for motion estimation instead of a partial one, so the results are not directly comparable with either older DivX versions or with the SSE2/SSE3/SSSE3 code path in version 6.6. That said, if you take into account that the SSE4 version does a lot more work than the old one, it is truly remarkable how much faster it is.
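
    To make "full window search" concrete, here is a scalar C sketch of exhaustive block matching with a sum of absolute differences (SAD). The 8x8 block size, the names and the frame layout are illustrative assumptions, not DivX's actual code. SSE4.1's MPSADBW produces eight neighbouring 4-byte SADs per instruction, which fits this kind of inner loop, but whether DivX 6.6 uses it that way is a guess on my part.

    ```c
    #include <assert.h>
    #include <limits.h>
    #include <stdint.h>
    #include <stdlib.h>

    enum { BLOCK = 8 };  /* illustrative block size, not DivX's */

    typedef struct { int dx, dy; unsigned sad; } motion_vec;

    /* Sum of absolute differences between two BLOCK x BLOCK regions. */
    unsigned block_sad(const uint8_t *cur, const uint8_t *ref, int stride)
    {
        unsigned sad = 0;
        for (int y = 0; y < BLOCK; ++y)
            for (int x = 0; x < BLOCK; ++x)
                sad += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
        return sad;
    }

    /* Full search: try every offset in [-range, +range] in both directions
       around the block at (bx, by). Offsets that would read outside the
       reference frame are skipped. On ties the first offset found wins. */
    motion_vec full_search(const uint8_t *cur, const uint8_t *ref,
                           int width, int height, int stride,
                           int bx, int by, int range)
    {
        motion_vec best = { 0, 0, UINT_MAX };
        assert(bx >= 0 && by >= 0 && bx + BLOCK <= width && by + BLOCK <= height);
        for (int dy = -range; dy <= range; ++dy) {
            for (int dx = -range; dx <= range; ++dx) {
                int rx = bx + dx, ry = by + dy;
                if (rx < 0 || ry < 0 || rx + BLOCK > width || ry + BLOCK > height)
                    continue;
                unsigned sad = block_sad(cur + by * stride + bx,
                                         ref + ry * stride + rx, stride);
                if (sad < best.sad) {
                    best.dx = dx;
                    best.dy = dy;
                    best.sad = sad;
                }
            }
        }
        return best;
    }
    ```

    A range of 4 already means 81 candidate positions per block; a partial search visits only a subset of them, which is why the two code paths don't do the same amount of work.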

    With that out of the way, let me explain SSE4.2. That extension comes with Nehalem and will consist of only a few instructions (so-called application accelerators) -- PCMPESTRI, PCMPESTRM, PCMPISTRI, PCMPISTRM, PCMPGTQ, CRC32, POPCNT. (PCMPEQQ belongs to SSE4.1.)

    The POPCNT instruction deserves further explanation. It seems that the first Nehalem CPUs might not implement it (or it will be reserved for server/workstation CPUs?), since it has its own dedicated bit for detecting its presence: bit 23 in ECX when you execute CPUID with EAX=1.
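
    A minimal C sketch of that detection follows. The bit positions (19 for SSE4.1, 20 for SSE4.2, 23 for POPCNT) come from the CPUID leaf 1 ECX layout; the struct and function names are made up for illustration, and the use of GCC/Clang's <cpuid.h> is an assumption about your toolchain.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    #if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    #include <cpuid.h>   /* GCC/Clang helper header (toolchain assumption) */
    #define HAVE_CPUID_H 1
    #endif

    typedef struct { bool sse41, sse42, popcnt; } sse4_features;

    /* Decode the ECX value returned by CPUID with EAX = 1. */
    sse4_features decode_cpuid1_ecx(uint32_t ecx)
    {
        sse4_features f;
        f.sse41  = (ecx >> 19) & 1u;  /* bit 19: SSE4.1 */
        f.sse42  = (ecx >> 20) & 1u;  /* bit 20: SSE4.2 */
        f.popcnt = (ecx >> 23) & 1u;  /* bit 23: POPCNT, reported separately */
        return f;
    }

    /* Query the running CPU. Reports nothing as supported when the
       CPUID helper is unavailable (non-x86 target or other compiler). */
    sse4_features detect_sse4(void)
    {
        uint32_t ecx = 0;
    #ifdef HAVE_CPUID_H
        unsigned int a, b, c, d;
        if (__get_cpuid(1, &a, &b, &c, &d))
            ecx = c;
    #endif
        return decode_cpuid1_ecx(ecx);
    }
    ```

    Because POPCNT has its own bit, code should test it separately rather than inferring it from the SSE4.2 bit.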

    I also want to bring your attention to a potential AMD/ATI killer -- MOVNTDQA, which comes with Penryn.

    That instruction enables extremely fast reads from MMIO space marked as USWC (uncacheable, speculative write combining), which is how video memory is typically mapped.

    What does that mean? Well, it means that instead of, say, 800 MB/sec readback from a video card you will have 7,000 MB/sec, which is nearly a 9x speedup. That was measured with two threads and a 1066 MHz FSB, and it is close to the theoretical peak of 8.5 GB/sec for that FSB speed.
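
    The arithmetic behind those figures, as a small C sketch. It assumes the 1066 MHz figure is the effective transfer rate (1066 MT/s) on a 64-bit, i.e. 8-byte, FSB data path; the function names are made up for illustration.

    ```c
    #include <assert.h>

    /* Peak bus bandwidth in MB/s: millions of transfers per second
       times the bus width in bytes. */
    double fsb_peak_mb_s(double mega_transfers_per_s, int bus_bytes)
    {
        assert(bus_bytes > 0);
        return mega_transfers_per_s * bus_bytes;
    }

    /* Ratio of new to old throughput. */
    double speedup(double new_mb_s, double old_mb_s)
    {
        assert(old_mb_s > 0.0);
        return new_mb_s / old_mb_s;
    }
    ```

    fsb_peak_mb_s(1066, 8) gives 8528 MB/s (the quoted 8.5 GB/sec), speedup(7000, 800) gives 8.75, and 7000 / 8528 is roughly 82% of the theoretical peak.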

    This will IMO have a great (positive) impact on GPGPU applications. The main obstacle in GPGPU today is that moving data to and from the GPU is slow, and readback is especially slow. MOVNTDQA should change that once GPU vendors optimize their drivers to use it.
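
    A rough C sketch of a readback loop built on the MOVNTDQA intrinsic, _mm_stream_load_si128. The function name and the alignment requirements are assumptions for illustration, and mapping a real write-combining region is driver territory that is not shown. On ordinary cached memory the instruction behaves like a normal load, so the speedup described above would only show up on USWC memory.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #if defined(__SSE4_1__)
    #include <smmintrin.h>   /* SSE4.1 intrinsics */
    #endif

    /* Copy `bytes` from `src` to `dst`. Both pointers must be 16-byte
       aligned and `bytes` a multiple of 16. When SSE4.1 is enabled at
       compile time (e.g. -msse4.1) the loads use MOVNTDQA; otherwise
       the function falls back to memcpy. */
    void stream_copy16(void *dst, const void *src, size_t bytes)
    {
        assert(((uintptr_t)dst % 16) == 0 && ((uintptr_t)src % 16) == 0);
        assert(bytes % 16 == 0);
    #if defined(__SSE4_1__)
        __m128i *d = (__m128i *)dst;
        __m128i *s = (__m128i *)src;  /* some headers declare a non-const parameter */
        for (size_t i = 0; i < bytes / 16; ++i)
            _mm_store_si128(d + i, _mm_stream_load_si128(s + i));
    #else
        memcpy(dst, src, bytes);
    #endif
    }
    ```

    A real driver would probably pull in a whole cache line's worth of streaming loads back to back before storing, but that tuning is beyond this sketch.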

    Of course it can also be used to speed up disk access for huge RAID0 arrays, network I/O, etc.

    If AMD CPUs don't get MOVNTDQA any time soon, I believe that they will have a serious problem. That problem will be called "ATI video cards working faster with Intel CPUs" -- irony at its best.

    That brings us to the conclusion -- Intel has become very aggressive in promoting new extensions this time. You can already download the Instruction Set Reference, an SDK and even an emulator to test your code. A new compiler is in the works as we speak (currently it is at version 10.0.018 beta) and I expect the final 10.0 release to have SSE4 support.

    Everything I wrote here you can find at this link.
    Last edited by audiofreak; 05-06-2007 at 03:17 PM.

  10. #10
    Xtreme Addict
    Join Date
    Apr 2005
    Location
    Houston, TX
    Posts
    1,196
    Try it out for yourself. You can get the DivX 6.6 Pro with K-Lite Codec Packs. If you convert a movie in VirtualDub and use DivX it asks to use the SSE4 optimization.

  11. #11
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Belgrade, Serbia
    Posts
    187
    Quote Originally Posted by Major_A View Post
    Try it out for yourself. You can get the DivX 6.6 Pro with K-Lite Codec Packs. If you convert a movie in VirtualDub and use DivX it asks to use the SSE4 optimization.
    Who should try it out?!?

    I certainly would but don't have Penryn here.

  12. #12
    Xtreme Enthusiast
    Join Date
    May 2005
    Location
    USA
    Posts
    563
    Quote Originally Posted by audiofreak View Post
    I also want to bring your attention to potential AMD/ATI killer -- MOVNTDQA which comes with Penryn.

    That instruction enables extremely fast reading from MMIO space which is always marked as USWC (uncacheable, write combining).

    What does that mean? Well it means that instead of say 800 MB/sec readback from video card you will have 7,000MB/sec which is 9x speedup. That was measured with two threads and 1066 MHz FSB and it is very close to theoretical peak of 8.5GB/sec for the mentioned FSB speed.

    This will IMO have a great (positive) impact on GPGPU applications. Main obstacle in GPGPU today is the fact that moving data to and from the GPU is slow. Especially readback is slow. MOVNTDQA should change that once GPU vendors optimize their drivers to use it.

    Of course it can also be used to speed up disk access for huge RAID0 arrays, network I/O, etc.

    If AMD CPUs don't get MOVNTDQA any time soon, I believe that they will have a serious problem. That problem will be called "ATI video cards working faster with Intel CPUs" -- irony at its best.

    That brings us to the conclusion -- Intel has become very aggressive in promoting new extensions this time. You can already download Instruction Set Reference, SDK and even emulator to test your code. New compiler is in the works as we speak (currently it is at version 10.0.018 beta) and I expect final 10.0 release to have SSE4 support.

    Everything I wrote here you can find at this link.
    Well that sucks... Penryn will have it and Barcelona won't? Bad news for AMD if that's true

  13. #13
    Xtreme X.I.P.
    Join Date
    Nov 2002
    Location
    Shipai
    Posts
    31,147
    thx audiofreak!
    interesting...

  14. #14
    Quote Originally Posted by derektm View Post
    Well that sucks... Penryn will have it and Barcelona won't? Bad news for AMD if thats true
    Not really. Considering neither the Netburst nor current Core 2 have it, it's hardly a must-have. And considering it won't exist on any 65nm Intel chips, and 45nm won't be a significant % of Intel's output until 2009 basically it's a non-event.

    This SSE4 thing is a red-herring thrown out to distract from Barcelona's expected all-round performance boost. Maybe AMD will unveil the WDGE instruction which will give a massive wedgie to anybody who's using an Intel proc

  15. #15
    Xtreme Addict
    Join Date
    Apr 2005
    Location
    Houston, TX
    Posts
    1,196
    Quote Originally Posted by audiofreak View Post
    Who should try it out?!?

    I certainly would but don't have Penryn here.
    Someone with a Penryn? The point was you get the DivX Pro codec in K-Lite codec packs, so it's free to try.

  16. #16
    Live Long And Overclock
    Join Date
    Sep 2004
    Posts
    14,058
    All I need is something on a laptop that's at least 50% faster than my turion x2 X(

    Perkam

  17. #17
    Xtreme Addict
    Join Date
    Jul 2006
    Location
    Between Sky and Earth
    Posts
    2,035
    I haven't seen any SSE3 encoding or similar since it was developed, only SSE2, so I'm reserved about SSE4: SSE2 is the only one actually used, and even that rarely. SSE4 is good for an individual, not for the majority, but that's OK too I guess.
    Last edited by XSAlliN; 05-06-2007 at 11:32 PM.

  18. #18
    Xtreme Addict
    Join Date
    Jan 2005
    Posts
    1,366
    Quote Originally Posted by XSAlliN View Post
    Didn't see no SSE3 encoding or similar since was developed, SSE2 only, so I'm reserved about SSE4 since SSE2 is the only one used and even this, rarely, SSE4's good for an individual not for the majoraty - but that's ok to I guess.
    Actually, SSE3 isn't very well suited to video encoding. Its instructions operate on "horizontal" data, which is perfect for graphics rendering.
    http://www.anandtech.com/cpuchipsets...spx?i=2350&p=2
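
    What "horizontal" means here, as a scalar C model of SSE3's HADDPS. The function name is made up for illustration; the lane order follows the instruction's definition.

    ```c
    #include <assert.h>

    /* Scalar model of HADDPS on two 4-float vectors: it sums adjacent
       pairs within each operand ("horizontal"), whereas ADDPS adds
       lane i of one operand to lane i of the other ("vertical"). */
    void haddps_model(const float a[4], const float b[4], float out[4])
    {
        /* Compute into temporaries first so out may alias a or b. */
        float r0 = a[0] + a[1];
        float r1 = a[2] + a[3];
        float r2 = b[0] + b[1];
        float r3 = b[2] + b[3];
        out[0] = r0;
        out[1] = r1;
        out[2] = r2;
        out[3] = r3;
    }
    ```

    For a = {1, 2, 3, 4} and b = {10, 20, 30, 40} the result is {3, 7, 30, 70}. Applying it twice to a vector of products sums all four lanes, which is the dot-product pattern common in 3D math.
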
    Last edited by kl0012; 05-06-2007 at 11:49 PM.

  19. #19
    Xtreme Addict
    Join Date
    Jul 2006
    Location
    Between Sky and Earth
    Posts
    2,035
    Quote Originally Posted by kl0012 View Post
    Actualy, SSE3 doesn't suitable too much for video encoding. These instruction operates on "horizontal" data and it is perfect for graphics rendering.
    http://www.anandtech.com/cpuchipsets...spx?i=2350&p=2
    By "similar" I meant graphics rendering and related work, since even that I've mostly seen done in SSE2.

  20. #20
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Belgrade, Serbia
    Posts
    187
    Quote Originally Posted by DeepThought86 View Post
    Not really. Considering neither the Netburst nor current Core 2 have it, it's hardly a must-have. And considering it won't exist on any 65nm Intel chips, and 45nm won't be a significant % of Intel's output until 2009 basically it's a non-event.

    This SSE4 thing is a red-herring thrown out to distract from Barcelona's expected all-round performance boost. Maybe AMD will unveil the WDGE instruction which will give a massive wedgie to anybody who's using an Intel proc
    If that is what you want to believe in the presence of some of the hard facts I gave you that's ok with me. I am not trying to convince/convert anyone.

    IMO, Barcelona improvements are there just to catch up with Core microarchitecture which has certainly set a new standard. K10 cannot take over the lead -- it simply doesn't have that much improvement installed. Especially if you consider that it won't compete against Core but against Penryn.

  21. #21
    all outta gum
    Join Date
    Dec 2006
    Location
    Poland
    Posts
    3,390
    We definitely need SSE1337 with one instruction: PWNGE in barcelona
    www.teampclab.pl
    MOA 2009 Poland #2, AMD Black Ops 2010, MOA 2011 Poland #1, MOA 2011 EMEA #12

    Test bench: empty

  22. #22
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    Quote Originally Posted by audiofreak View Post
    If that is what you want to believe in the presence of some of the hard facts I gave you that's ok with me. I am not trying to convince/convert anyone.

    IMO, Barcelona improvements are there just to catch up with Core microarchitecture which has certainly set a new standard. K10 cannot take over the lead -- it simply doesn't have that much improvement installed. Especially if you consider that it won't compete against Core but against Penryn.
    And how do you know how Barcelona will perform?
    K10 is architecturally more advanced (in terms of integration) than any Intel CPU until Nehalem is released. And Nehalem is a late 2008 product that will go against the native 8-core NextGen AMD core due out in 2009. By then, we will have MCM versions of Fusion, K11 (native 8-core) and Shanghai (probably MCM 8-core) on the AMD side, and Nehalem, Westmere and Larrabee at Intel's.

  23. #23
    Xtreme Member
    Join Date
    Apr 2006
    Location
    Belgrade, Serbia
    Posts
    187

    Quote Originally Posted by informal View Post
    And how do you know how Barcelona will perform?
    I have read their Software Optimization Guide for K10 for starters.

    I am aware that they have made architectural improvements. I also heard that they will launch at higher than expected clock speeds.

    If that particular bit of information is true, then it can only mean their architectural improvements were not enough to stay competitive.

    As far as I know, neither Intel nor AMD likes pushing the clock speed up when they don't need to, because it means much lower yields, fewer working chips at higher grades and thus less profit.

    Quote Originally Posted by informal View Post
    K10 is architecturally more advanced(in terms of integration) than any intel CPU until Nehalem is released.
    I never said it isn't. Unfortunately it is missing the streaming load instruction I mentioned above, and a bunch of other useful instructions.

    Quote Originally Posted by informal View Post
    And Nehalem is a late 2008 product that will go against native 8 core NextGen AMD core due out in 2009.
    I wouldn't bet so much on "late 2008", more like beginning of H2 -- they could still pull it back, it just depends on how fast they are able to convert their fabs to 45nm process. NextGen AMD will have to go against Westmere in 2009.

    Quote Originally Posted by informal View Post
    By than, we will have MCM versions of Fusion, K11 (native 8-core) and Shangai (probably MCM 8-core) on the AMD side, and Nehalem, Westmere and Larrabee at intel's.
    Don't forget Gesher. Yes, it will be fun. Now if only my pocket could follow...

  24. #24
    Xtreme Cruncher
    Join Date
    Jun 2006
    Posts
    6,215
    I thought that Gesher fits into the "tock" part of the new Intel roadmap strategy and thus is a 2010 product.

    As for Barcelona and its improvements, we will not have to wait much longer than June/July for benchmarks. From what I heard, the "official" SPEC scores are just ballpark estimates for bottom-line performance numbers in float and int! They are not numbers based on K10, but ones based on estimated quad-core K8 Opteron performance at the same clock as K10. So those results are just a ballpark figure; the real results are probably somewhat more interesting and are being kept for Computex time (in order to create the largest "bang" effect and avoid the unfortunate Osborne effect).
    Cheers

  25. #25
    Xtreme Addict
    Join Date
    Apr 2003
    Posts
    1,092
    Just like with SSE, SSE2 and SSE3 it will take years for it to become mainstream in software, so if you haven't got it in the next few years... big whoop.

    This seems like good old Intel marketing to me, telling you it has some feature which can do great things and you think you MUST have it... but you really don't :p

    Also: These instructions only really seem to help with encoding.. now who really does that on a daily basis? Only very few people I think. You always see it in benchmarks, but nobody really does that.
    Last edited by Thorry; 05-08-2007 at 03:49 AM.
    The world vs the USA: The whole world hates you!
    USA: Why?? Why does the whole world hate us?
    The world: Because the whole world hates you, and you don't even know why!
