
Thread: "Captain, she's only hitting on 4 cores...."

  1. #1
    Xtreme Cruncher
    Join Date
    Jan 2009
    Location
    Nashville
    Posts
    4,162

    "Captain, she's only hitting on 4 cores...."

    Dual Opteron 8356, 4 cores each. Running Debian. It has been crunching for over a year, close to 2. No problems.

    Was running a mix of WCG and E@H, 50/50. Noticed one day it was running 4 E@H and 2 WCG with more of both in queue. Rebooted and all 8 used.

    Yesterday I noticed again 1 WCG and 4 E@H with plenty of both in queue. Rebooted. Now running 4 E@H. All else Waiting to run. Since then I have:

    Shut off and restarted. Uninstalled BOINC and reinstalled. Checked the BIOS even though I had made no changes. All looked OK.
    Checked settings at E@H and WCG. Detached from WCG, aborted all tasks, removed any reference to WCG in the BOINC files. Changed settings like % of CPU and then changed them back. Checked all settings on the web site and in BOINC multiple times. Another dual 8356 is running the same E@H and WCG profile with no problems. Checked for heat.

    With top I noticed that both CPUs are being used, 2 cores each, and which cores are being used does change, at intervals of more than 15 minutes and less than 30. Installed Ubuntu 12.04.2 and got the same result. Logs show no errors. dmesg shows all 8 "cpus" detected and started, and the BOINC Event Log shows 8 cores.
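For anyone retracing this check, here is a rough sketch of how the "8 cpus detected, only 4 busy" observation could be confirmed from a shell. The function names `count_cpus` and `busiest_by_core` are made up for illustration, and the `--sort` option assumes the procps version of `ps` that Debian and Ubuntu ship.

```shell
# Count the logical CPUs listed in a cpuinfo-style file.
# Defaults to /proc/cpuinfo; a dual quad-core box should report 8.
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Show the busiest processes together with the core (PSR column) each one
# last ran on, so you can see how many distinct cores are really in use.
# Assumes procps ps (Linux); the first output line is the column header.
busiest_by_core() {
    ps -eo psr,pcpu,comm --sort=-pcpu | head -n "${1:-10}"
}
```

Running `count_cpus` and then `busiest_by_core 12` while the problem is happening should show whether the science apps are spread over only four PSR values.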

    It has a good Corsair PSU, will check that next, maybe CPUs not getting enough juice?? I am at a loss. Do not know what to do next. Have googled everything I can think of and nothing.

    Ideas?


  2. #2
    Xtreme Addict Evantaur's Avatar
    Join Date
    Jul 2011
    Location
    Finland
    Posts
    1,043
    what does top say... having zombie processes?
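A quick way to answer the zombie question without scanning top by eye might look like the sketch below. `filter_zombies` is a hypothetical helper name; it only relies on `ps` reporting a state that starts with `Z` for defunct processes.

```shell
# Read "STAT PID COMMAND" lines on stdin (the format printed by
# `ps -eo stat,pid,comm`) and keep only rows whose state starts with Z,
# i.e. zombie/defunct processes. The header row is dropped as a side effect.
filter_zombies() {
    awk '$1 ~ /^Z/'
}

# Usage:
#   ps -eo stat,pid,comm | filter_zombies
```

No output means no zombies.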

    I like large posteriors and I cannot prevaricate

  3. #3
    Xtreme Cruncher
    Join Date
    Jan 2009
    Location
    Nashville
    Posts
    4,162
    None. No zombies

  4. #4
    Xtreme crazy bastid
    Join Date
    Apr 2007
    Location
    On mah murder-sickle!
    Posts
    5,878
    Check and reset your BOINC preferences for WCG. It sounds like they might have been corrupted somehow.


  5. #5
    Xtreme Addict Evantaur's Avatar
    Join Date
    Jul 2011
    Location
    Finland
    Posts
    1,043
    you could try to borrow cpus from your other rig and see if it's still acting up with those


  6. #6
    Xtreme Legend
    Join Date
    Mar 2008
    Location
    Plymouth (UK)
    Posts
    5,279
    Heat? When in failed condition what does system monitor show under resources? (not used debian before so made a couple of assumptions here)


    My Biggest Fear Is When I die, My Wife Sells All My Stuff For What I Told Her I Paid For It.
    79 SB threads and 32 IB Threads across 4 rigs 111 threads Crunching!!

  7. #7
    Xtreme Addict Evantaur's Avatar
    Join Date
    Jul 2011
    Location
    Finland
    Posts
    1,043
    Quote: Originally Posted by OldChap
    Heat? When in failed condition what does system monitor show under resources? (not used debian before so made a couple of assumptions here)
    dmesg would have spam like this in it

    [ 1199.201405] CPU4: Core temperature above threshold, cpu clock throttled (total events = 1)
    [ 1199.201406] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
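If the log is long, the messages above can be picked out mechanically. This is only a sketch that matches the exact wording of the two sample lines shown here; other kernels or CPU drivers may word the message differently, and `throttled_cpus` is a made-up function name.

```shell
# Read kernel log text on stdin and print the CPUs that reported
# thermal throttling, one per line, without duplicates.
# Matches the message wording shown in the sample lines above.
throttled_cpus() {
    grep -o 'CPU[0-9]*: Core temperature above threshold' \
        | cut -d: -f1 \
        | sort -u
}

# Usage:
#   dmesg | throttled_cpus
```

An empty result suggests the kernel has not logged any throttling events of this kind since boot.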


  8. #8
    Xtreme crazy bastid
    Join Date
    Apr 2007
    Location
    On mah murder-sickle!
    Posts
    5,878
    Yes, I should have thought to get him to post dmesg output when it's happening. Might be some clues there.

    OC, you might want to copy and paste the output into a text file and attach it, as it can be quite large. Alternatively use the command
    Code:
    dmesg > dmesg.txt
    and it will send the output to the file for you, then just post/attach/link to the file.


  9. #9
    Xtreme Cruncher
    Join Date
    Jan 2009
    Location
    Nashville
    Posts
    4,162
    While it certainly did look like CPU throttling due to heat, there was no other evidence of it. Logs and dmesg showed nothing. In the end I uninstalled BOINC again, but this time I also deleted /var/lib/boinc-client, the data directory. Reinstalled and all good now. It looks as if there was a corrupted file somewhere in the data directory.
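A more cautious variation on the fix described above is to move the data directory aside rather than delete it, so it can still be inspected or restored later. The sketch below is an assumption-laden illustration, not what was actually run: `set_aside_dir` is a hypothetical helper, and the package and service are assumed to be named `boinc-client`, matching the data directory path.

```shell
# Move a directory aside to <dir>.bak-<timestamp> instead of deleting it,
# and print the new path. Hypothetical helper, not part of BOINC.
set_aside_dir() {
    dir="${1%/}"
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    backup="$dir.bak-$(date +%Y%m%d%H%M%S)"
    mv "$dir" "$backup" && echo "$backup"
}

# Possible sequence on Debian/Ubuntu, run as root (names are assumptions):
#   apt-get remove boinc-client
#   set_aside_dir /var/lib/boinc-client
#   apt-get install boinc-client
```

Because the project account files live in the data directory, expect to re-attach to E@H and WCG afterwards.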

    Thanks for the ideas of things to check!
