Processes hanging randomly

    Hi everyone,

    I upgraded to 15.04 yesterday on my laptop: everything went smoothly.

    Then I did the same on my desktop computer, and I'm running into a serious issue.

    Basically, my Java IDE (a relatively big piece of software that starts a lot of threads and so on) randomly hangs during startup. Then:
    - The console from which I started the IDE always hangs too. If I used Klauncher, then Klauncher also becomes unresponsive.
    - Firefox sometimes hangs as well, when loading new tabs.
    - Amarok seems to hang randomly too.

    After some time waiting, the whole display sometimes becomes unresponsive.

    CPU is idle, memory is fine. lsof shows no excessive number of open files.

    I did a clean reinstall of the whole system (also wiped my whole home folder). It did not help.

    The same set of software is working perfectly well on my laptop. So, I fear this is HW related.

    I have an Nvidia card. I tried both the proprietary drivers and the Nouveau drivers. Same problem.

    I spotted an interesting message in my kern.log, coming from the kernel scheduler (see kern_generic.log in the gist): INFO: task java:4509 blocked for more than 120 seconds.
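
For anyone wanting to scan their own log for the same warning, here is a minimal Python sketch. It assumes the message format is exactly the one quoted above; the function name find_hung_tasks is made up for illustration, and the log would normally be read from a file such as kern.log.

```python
import re

# Matches the hung-task warning in the form quoted in the post, e.g.
# "INFO: task java:4509 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (task name, pid, seconds) for each hung-task warning in lines."""
    hits = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append(
                (match.group("name"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits


# Demo on the single line quoted in the post.
sample = ["INFO: task java:4509 blocked for more than 120 seconds."]
print(find_hung_tasks(sample))  # prints [('java', 4509, 120)]
```

To run it against a real log, pass an open file object instead of the sample list, since the function only needs an iterable of lines.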

    Out of desperation, I installed the low-latency Linux image (which, I assume, uses a different scheduling algorithm). The message did not reappear, but the issue persisted.

    I also tried to start a failsafe session from SDDM (in order to see if this was somehow Plasma related). I was then confronted with two behaviors:
    - Simply selecting the choice "Failsafe" makes the login manager freeze.
    - After clicking on "login", I briefly see a black screen with some text ("starting version 219"), then I'm back at the login screen.

    I'm linking a gist with:
    - The complete strace (default print settings) of the Java process.
    - My kern.log with the generic and low latency kernel.
    https://gist.github.com/anonymous/ad83e170fae184612a2b

    I would be extremely thankful if anyone could point me in some direction on how I could debug this issue. I'm currently considering the following options:
    - Installing another window manager to see if the problem persists.
    - Installing a vanilla kernel from kernel.org.
    - Downgrading to 14.10 (I will eventually do that if I cannot find a solution quickly enough).

    Thanks,
    Marc

    #2
    OK, I may have managed to solve my issue (I remain cautious, since it was erratic by nature).

    I first tried to reproduce it using various kernel versions available from the Vivid git repository ( http://kernel.ubuntu.com/git/ubuntu/ubuntu-vivid.git/ ). I quickly spotted the problem with the 3.18 and 4.0 branches, but the 3.16 kernel (the version shipped with 14.10) was apparently working fine.

    Then I did a BIOS update (with little to no expectation)... And it apparently solved everything.

    The exact cause remains unclear, but I have some clues that may or may not be relevant:
    - Prior to the update, the kernel clocksource was set to HPET since TSC was found unstable on my machine (that was already the case on Kubuntu 14.10). This apparently affects X79-based motherboards from Asus with Intel multi-core processors. See this thread: http://www.overclock.net/t/1347771/s...#post_19018999
    - After the update, TSC was back as the current clocksource.
    - I recently learned about the -f option of strace. When I used it, I saw some interesting output from one of the child processes, which was apparently caught in a loop invoking a specific set of system calls:
    13768 clock_gettime(CLOCK_MONOTONIC, {1528, 419045733}) = 0
    13768 clock_gettime(CLOCK_MONOTONIC, {1528, 419097975}) = 0
    13768 clock_gettime(CLOCK_MONOTONIC, {1528, 419149658}) = 0
    13768 gettimeofday({1430110843, 398015}, NULL) = 0
    13768 clock_gettime(CLOCK_MONOTONIC, {1528, 419259449}) = 0
    13768 gettimeofday({1430110843, 398118}, NULL) = 0
    13768 clock_gettime(CLOCK_MONOTONIC, {1528, 419364212}) = 0
    13768 clock_gettime(CLOCK_MONOTONIC, {1528, 419411145}) = 0
    13768 clock_gettime(CLOCK_MONOTONIC, {1528, 419458358}) = 0
    13768 futex(0x2b250c42ea44, FUTEX_WAIT_BITSET_PRIVATE, 1, {1528, 429353595}, ffffffff) = -1 ETIMEDOUT
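
To put numbers on the loop above, here is a rough Python sketch that extracts the CLOCK_MONOTONIC readings from strace lines and computes the gap between consecutive calls. The function name monotonic_gaps_ns is made up for illustration, and the regular expression assumes the line format shown in the excerpt. Keep in mind that strace itself slows the traced process, so the gaps say more about the shape of the loop than about the real cost of a clock read.

```python
import re

# Captures the {seconds, nanoseconds} pair printed by strace for clock_gettime.
MONO_RE = re.compile(r"clock_gettime\(CLOCK_MONOTONIC, \{(\d+), (\d+)\}\)")


def monotonic_gaps_ns(lines):
    """Return the gaps, in nanoseconds, between consecutive CLOCK_MONOTONIC readings."""
    stamps = []
    for line in lines:
        match = MONO_RE.search(line)
        if match:
            seconds, nanos = int(match.group(1)), int(match.group(2))
            stamps.append(seconds * 1_000_000_000 + nanos)
    # Pair each reading with the next one and subtract.
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# First three readings from the excerpt above.
trace = [
    "13768 clock_gettime(CLOCK_MONOTONIC, {1528, 419045733})=0",
    "13768 clock_gettime(CLOCK_MONOTONIC, {1528, 419097975})=0",
    "13768 clock_gettime(CLOCK_MONOTONIC, {1528, 419149658})=0",
]
print(monotonic_gaps_ns(trace))  # prints [52242, 51683]
```

On these three lines the calls are roughly 52 microseconds apart while being traced.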

    (I did not go as far as generating a core dump from the thread. Since I'm working with HotSpot, I do not have debug symbols anyway, but I could have used OpenJDK for the same purpose.)

    So, I guess that the problem may have had something to do with the kernel clocksource.

    Or maybe it was solved by something entirely unrelated. The BIOS update was simply labelled "Improve system stability". In addition, an unstable TSC is not uncommon, and since it is one of the first error messages in the kernel logs, with little-known implications, it is easy to blame it for whatever goes wrong on the system. In my situation, the two other available clocksources (HPET and ACPI_PM) suffered from the same issue, even though they have little in common except being significantly slower than the TSC.
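
For reference, the current and available clocksources can be read from sysfs. Below is a small Python sketch that does so. The function name read_clocksources is made up for illustration, and the directory is passed as a parameter so that a different location can be tried if the usual sysfs path is not present on a given system.

```python
from pathlib import Path

# Usual sysfs location of the clocksource files on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current clocksource, list of available clocksources)."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


# Only query the live system when the sysfs directory actually exists.
if CLOCKSOURCE_DIR.is_dir():
    current, available = read_clocksources()
    print(f"current clocksource: {current}")
    print(f"available clocksources: {', '.join(available)}")
```

Comparing the output before and after a BIOS update (or between kernel versions) is a quick way to see whether the kernel fell back from TSC to another source.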

    Interestingly, the BIOS update seems to have solved another problem I had, with an Oracle Express database aborting transactions for no obvious reason when accessed from a WebLogic application server. Hell, even the Amarok equalizer started working again after the update (it had been ineffective since the 15.04 upgrade)... Go figure.
