Frequent OS crashes - looking for troubleshooting ideas

This topic is closed.

    #16
    This may already have been stated, but with issues like the ones you are describing, especially in a laptop, the things to suspect are: dirty/dusty air intake/exhaust ports; faulty RAM; an overheating CPU/GPU (heatsinks separating or losing contact).
    Using Kubuntu Linux since March 23, 2007
    "It is a capital mistake to theorize before one has data." - Sherlock Holmes



      #17
      How about a swap issue, like none or not enough?
      If you think Education is expensive, try ignorance.

      The difference between genius and stupidity is genius has limits.



        #18
        Originally posted by Snowhog
        This may already have been stated, but with issues like the ones you are describing, especially in a laptop, the things to suspect are: dirty/dusty air intake/exhaust ports; faulty RAM; an overheating CPU/GPU (heatsinks separating or losing contact).
        I'll check this out right away. I'm just waiting for a large zip file to complete, trying to force a crash. If it does, I will try to start my backup job again.
        CPU is at 80° right now
        Since writing the message above I had 3 more crashes, after the system ran all day

        08.11. Next crash 2 minutes after reboot. Not able to switch to tty2. Drive must have disappeared again right after the boot launcher as it was complaining about a missing cryptodrive, but the code showing the message must have booted from the same drive
        08.11. Next crash 2 minutes after reboot. Was thrown over to tty2 which was frozen already, 80GB backup was running again.
        08.11. Evening, system ran all day after the initial crash this morning. Crash happened during rsync of 80GB VM to second disk, not able to switch to tty2 any longer
        ...



          #19
          Originally posted by SpecialEd
          How about a swap issue, like none or not enough?
          Or even a failing HDD/SSD?

          How OLD is this laptop?


            #20
            How many errors does your SSD report?
            Code:
            $ sudo smartctl -a /dev/sda
            [sudo] password for jerry: 
            smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.18.0-11-generic] (local build)
            Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
            
            === START OF INFORMATION SECTION ===
            Device Model:     *****************GB
            Serial Number:    **************
            LU WWN Device Id: 5 002538 e4032b7df
            Firmware Version: RVT01B6Q
            User Capacity:    500,107,862,016 bytes [500 GB]
            Sector Size:      512 bytes logical/physical
            Rotation Rate:    Solid State Device
            Form Factor:      2.5 inches
            Device is:        Not in smartctl database [for details use: -P showall]
            ATA Version is:   Unknown(0x09fc), ACS-4 T13/BSR INCITS 529 revision 5
            SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
            Local Time is:    Thu Nov  8 12:19:17 2018 CST
            SMART support is: Available - device has SMART capability.
            SMART support is: Enabled
            
            === START OF READ SMART DATA SECTION ===
            SMART overall-health self-assessment test result: PASSED
            
            General SMART Values:
            Offline data collection status:  (0x00) Offline data collection activity
                                                    was never started.
                                                    Auto Offline Data Collection: Disabled.
            Self-test execution status:      (   0) The previous self-test routine completed
                                                    without error or no self-test has ever 
                                                    been run.
            Total time to complete Offline 
            data collection:                (    0) seconds.
            Offline data collection
            capabilities:                    (0x53) SMART execute Offline immediate.
                                                    Auto Offline data collection on/off support.
                                                    Suspend Offline collection upon new
                                                    command.
                                                    No Offline surface scan supported.
                                                    Self-test supported.
                                                    No Conveyance Self-test supported.
                                                    Selective Self-test supported.
            SMART capabilities:            (0x0003) Saves SMART data before entering
                                                    power-saving mode.
                                                    Supports SMART auto save timer.
            Error logging capability:        (0x01) Error logging supported.
                                                    General Purpose Logging supported.
            Short self-test routine 
            recommended polling time:        (   2) minutes.
            Extended self-test routine
            recommended polling time:        (  85) minutes.
            SCT capabilities:              (0x003d) SCT Status supported.
                                                    SCT Error Recovery Control supported.
                                                    SCT Feature Control supported.
                                                    SCT Data Table supported.
            
            SMART Attributes Data Structure revision number: 1
            Vendor Specific SMART Attributes with Thresholds:
            ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
              5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
              9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1161
             12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       155
            177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       4
            179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
            181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
            182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
            183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
            187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
            190 Airflow_Temperature_Cel 0x0032   074   047   000    Old_age   Always       -       26
            195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
            199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
            235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       6
            241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       4625517805
            
            SMART Error Log Version: 1
            No Errors Logged
            
            SMART Self-test log structure revision number 1
            No self-tests have been logged.  [To run self-tests, use: smartctl -t]
            
            SMART Selective self-test log data structure revision number 1
             SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
                1        0        0  Not_testing
                2        0        0  Not_testing
                3        0        0  Not_testing
                4        0        0  Not_testing
                5        0        0  Not_testing
            Selective self-test flags (0x0):
              After scanning selected spans, do NOT read-scan remainder of disk.
            If Selective self-test is pending on power-up, resume after 0 minute delay.
            As you can see, my SSD is clean and has used only 2.15 TBW of the 300 TBW guaranteed, after 1161 hours of run time, according to:
            https://www.virten.net/2016/12/ssd-t...ten-calculator
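For anyone who wants to check that figure by hand, here is a minimal sketch of the arithmetic. It assumes the raw value of attribute 241 (Total_LBAs_Written) counts 512-byte logical sectors, which matches the sector size reported above but is not guaranteed for every drive model.

```python
# Rough estimate of data written, taken from the smartctl attribute table above.
# Assumption: the raw value of attribute 241 counts 512-byte logical sectors.
SECTOR_BYTES = 512                   # "Sector Size: 512 bytes logical/physical"
TOTAL_LBAS_WRITTEN = 4_625_517_805   # raw value of attribute 241
RATED_TBW = 300                      # endurance figure quoted in the post

bytes_written = TOTAL_LBAS_WRITTEN * SECTOR_BYTES
tb_written = bytes_written / 1000**4    # decimal terabytes
tib_written = bytes_written / 1024**4   # binary tebibytes
percent_of_rating = 100 * tb_written / RATED_TBW

print(f"{tb_written:.2f} TB ({tib_written:.2f} TiB), "
      f"{percent_of_rating:.2f}% of {RATED_TBW} TBW")
```

The 2.15 figure corresponds to the binary (TiB) result; in decimal terabytes the same count comes out a little higher. Either way it is well under one percent of the rating.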
            Last edited by GreyGeek; Nov 08, 2018, 12:26 PM.
            "A nation that is afraid to let its people judge the truth and falsehood in an open market is a nation that is afraid of its people."
            – John F. Kennedy, February 26, 1962.



              #21
              Guys, I really appreciate your help, feels good to see I am not alone :-)

              Opened the bugger, and yes, there was quite some dust in there. The temperature looks lower now at 35°; I can't remember seeing anything below 40°, and it sounds quieter too. I'll keep a close eye on this.
              It's these slow, creeping changes that are easy to miss!
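Since slow changes like this are easy to miss, it can help to log temperatures now and then. Below is a minimal sketch, assuming a Linux kernel that exposes thermal zones under /sys/class/thermal with values in millidegrees Celsius; which zone maps to the CPU differs between machines, and the function name is only illustrative.

```python
from pathlib import Path


def read_zone_temps(base: str = "/sys/class/thermal") -> dict:
    """Return {zone name: temperature in degrees Celsius} for every thermal zone found."""
    temps = {}
    for temp_file in sorted(Path(base).glob("thermal_zone*/temp")):
        try:
            # The kernel reports millidegrees Celsius as a plain integer.
            millidegrees = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
        temps[temp_file.parent.name] = millidegrees / 1000.0
    return temps


if __name__ == "__main__":
    for zone, celsius in read_zone_temps().items():
        print(f"{zone}: {celsius:.1f} C")
```

Running this from cron or a loop during a backup would give a simple record of how hot the machine gets under load.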

              Just pulled out the invoice, purchased the machine in March last year.

              smartctl doesn't seem to work for NVMe drives:
              thomas@hermes:~$ sudo smartctl -a /dev/nvme0n1
              smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.18.0-10-generic] (local build)
              Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

              === START OF INFORMATION SECTION ===
              Model Number: SM961 NVMe SAMSUNG 512GB
              Serial Number: S34YNX0HC01903
              Firmware Version: CXA74D0Q
              PCI Vendor/Subsystem ID: 0x144d
              IEEE OUI Identifier: 0x002538
              Total NVM Capacity: 512.110.190.592 [512 GB]
              Unallocated NVM Capacity: 0
              Controller ID: 2
              Number of Namespaces: 1
              Namespace 1 Size/Capacity: 512.110.190.592 [512 GB]
              Namespace 1 Utilization: 304.360.701.952 [304 GB]
              Namespace 1 Formatted LBA Size: 512
              Local Time is: Mon Nov 5 19:09:13 2018 CET
              Firmware Updates (0x16): 3 Slots, no Reset required
              Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
              Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
              Warning Comp. Temp. Threshold: 70 Celsius
              Critical Comp. Temp. Threshold: 73 Celsius

              Supported Power States
              St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
              0 + 6.80W - - 0 0 0 0 0 0
              1 + 4.90W - - 1 1 1 1 0 0
              2 + 3.20W - - 2 2 2 2 0 0
              3 - 0.0400W - - 3 3 3 3 210 1500
              4 - 0.0050W - - 4 4 4 4 2200 6000

              Supported LBA Sizes (NSID 0x1)
              Id Fmt Data Metadt Rel_Perf
              0 + 512 0 0

              === START OF SMART DATA SECTION ===
              Read NVMe SMART/Health Information failed: NVMe Status 0x2002

              thomas@hermes:~$
              Hence I quickly looked up this https://www.percona.com/blog/2017/02...-flash-health/ and installed nvme-cli which gave me this:
              thomas@hermes:~$ sudo nvme -list
              Node SN Model Namespace Usage Format FW Rev
              ---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
              /dev/nvme0n1 S34YNX0HC01903 SM961 NVMe SAMSUNG 512GB 1 344,74 GB / 512,11 GB 512 B + 0 B CXA74D0Q
              thomas@hermes:~$ sudo nvme smart-log /dev/nvme0
              Smart Log for NVME device:nvme0 namespace-id:ffffffff
              critical_warning : 0
              temperature : 32 C
              available_spare : 100%
              available_spare_threshold : 50%
              percentage_used : 1%
              data_units_read : 180.612.670
              data_units_written : 39.295.503
              host_read_commands : 922.103.392
              host_write_commands : 427.185.726
              controller_busy_time : 2.012
              power_cycles : 3.376
              power_on_hours : 1.451
              unsafe_shutdowns : 250
              media_errors : 0
              num_err_log_entries : 406
              Warning Temperature Time : 0
              Critical Composite Temperature Time : 0
              Temperature Sensor 1 : 32 C
              Temperature Sensor 2 : 36 C
              Thermal Management T1 Trans Count : 0
              Thermal Management T2 Trans Count : 0
              Thermal Management T1 Total Time : 0
              Thermal Management T2 Total Time : 0
              thomas@hermes:~$
              Next backup attempt is running now, with clean fans...
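To put the data_units figures above into perspective, here is a small sketch that converts them to terabytes. It assumes the NVMe convention that one data unit equals 1000 units of 512 bytes, and that the dots in the output are thousands separators from the locale; the helper names are made up for illustration.

```python
DATA_UNIT_BYTES = 1000 * 512  # assumed size of one NVMe "data unit" in bytes


def parse_count(text: str) -> int:
    """Turn a locale-formatted count such as '39.295.503' into an int."""
    return int(text.replace(".", "").replace(",", ""))


def units_to_tb(units: int) -> float:
    """Convert NVMe data units to decimal terabytes."""
    return units * DATA_UNIT_BYTES / 1000**4


written = parse_count("39.295.503")   # data_units_written from the smart-log
read = parse_count("180.612.670")     # data_units_read from the smart-log
print(f"written: {units_to_tb(written):.1f} TB, read: {units_to_tb(read):.1f} TB")
```

Under those assumptions the drive has written roughly 20 TB, which fits with percentage_used still sitting at 1%.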



                #22
                Dust in desktop units isn't as much of a problem (not that it isn't one, just less of one) because there is so much air space. In a laptop, air space is extremely limited (it has to be), so obstructions (dust, dirt, lint, etc.) blocking the intake/exhaust ports can have a significant impact on performance and operation. If they are user accessible, the heat sinks are another thing to inspect. The thermal paste used to secure them to the CPU/GPU sometimes comes loose after many on/off cycles (expansion and contraction from heating up and cooling down). Often, either inferior thermal paste is used, or not enough of it is used to make full contact between the heat sink and the surface of the CPU/GPU, leaving spots that heat up more than the rest. Bottom line: laptops need regular, proper maintenance to get the longest use out of them.


                  #23
                  Thanks Snowhog, I knew about this... in theory. I started my laptop journey with this massive Compaq toaster https://en.wikipedia.org/wiki/Compaq_Portable_386 and have had laptops ever since, but for some reason I never ran into anything like this. Maybe I exchanged the machines more frequently when I was traveling globally, and now that I work from home I may collect more dust. In any case, the backup is running with an 80GB zip file being created in parallel, and the CPU is showing ~60°. That's a good 15° less than before...
                  I really hope that was it...



                    #24
                    I would look into these:
                    unsafe_shutdowns : 250
                    media_errors : 0
                    num_err_log_entries : 406
                    How are you supposed to shut down an NVME?


                      #25
                      No idea :-)
                      I would expect the NVMe driver to manage this, but in my case it may not have been possible. So far no crash since I blew out the innards of my machine... maybe, hopefully, this was it.
                      In any case, I will check the numbers a few times to see whether they go up.
                      num_err_log_entries went up to 410, the rest remained the same.
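Comparing two readings by eye is error-prone, so here is a minimal sketch that diffs two saved copies of the `nvme smart-log` text. It assumes the "name : value" layout shown earlier in the thread with dots as thousands separators; the function names and the shortened sample strings are only for illustration.

```python
import re


def parse_smart_log(text: str) -> dict:
    """Collect the purely numeric counters from `nvme smart-log` style text."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        # Accept plain integers or integers grouped with '.' (e.g. 39.295.503).
        if re.fullmatch(r"\d{1,3}(\.\d{3})*|\d+", value):
            counters[key.strip()] = int(value.replace(".", ""))
    return counters


def changed_counters(before: dict, after: dict) -> dict:
    """Return {name: increase} for every counter whose value differs."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] != before[name]
    }


earlier = "unsafe_shutdowns : 250\nmedia_errors : 0\nnum_err_log_entries : 406"
later = "unsafe_shutdowns : 250\nmedia_errors : 0\nnum_err_log_entries : 410"
print(changed_counters(parse_smart_log(earlier), parse_smart_log(later)))
```

With the two readings mentioned in this thread, only num_err_log_entries shows up, with an increase of 4.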



                        #26
                        In the log what are the errors attributed to?


                          #27
                          It's multiple entries like these two, which I get with sudo nvme error-log /dev/nvme0:
                          Entry[60]
                          ...
                          error_count : 350
                          sqid : 0
                          cmdid : 0x3b
                          status_field : 0x4212(INVALID_LOG_PAGE: The log page indicated is invalid. This error condition is also returned if a reserved log page is requested)
                          parm_err_loc : 0x28
                          lba : 0
                          nsid : 0xffffffff
                          vs : 0
                          cs : 0
                          ...
                          Entry[61]
                          ...
                          error_count : 349
                          sqid : 0
                          cmdid : 0x35
                          status_field : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
                          parm_err_loc : 0x2c
                          lba : 0
                          nsid : 0
                          vs : 0
                          cs : 0
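The status_field values above can be split into their parts with a few bit operations. This is a sketch based on my reading of the NVMe specification's status layout (bit 0 is the phase tag, followed by an 8-bit status code, a 3-bit status code type, and the More and Do Not Retry flags near the top); treat the layout as an assumption and check it against the spec revision for your drive.

```python
def decode_status(status_field: int) -> dict:
    """Split a 16-bit status_field as printed by `nvme error-log` into its parts."""
    status = status_field >> 1                  # drop bit 0, the phase tag
    return {
        "status": status,                       # 15-bit status without phase tag
        "sc": status & 0xFF,                    # status code
        "sct": (status >> 8) & 0x7,             # status code type (0 generic, 1 command specific)
        "more": bool(status & (1 << 13)),       # more information is in the error log
        "dnr": bool(status & (1 << 14)),        # do not retry
    }


for raw in (0x4212, 0x4004):
    d = decode_status(raw)
    print(f"{raw:#06x}: status={d['status']:#06x} sct={d['sct']} sc={d['sc']:#04x} "
          f"more={d['more']} dnr={d['dnr']}")
```

Decoded this way, 0x4004 becomes status 0x2002, the same number smartctl printed as "NVMe Status 0x2002" earlier in the thread. That hints, without proving it, that at least some of these entries are logged when a tool asks the drive for something it rejects.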



                            #28
                            It looks to me like the firmware on your NVMe drive does not match the structure of the drive. IOW, the NVMe drive has a buggy interface. Time to take this to Dell or the SSD manufacturer.


                              #29
                              Originally posted by GreyGeek
                              It looks to me like the firmware on your NVMe drive does not match the structure of the drive. IOW, the NVMe drive has a buggy interface. Time to take this to Dell or the SSD manufacturer.
                              Quick update, the story continues, no crash on Friday, plenty on Saturday.

                              To rule out the NVMe drive I installed Kubuntu on a USB attached HDD, thanks to btrfs it's been very easy to create an exact copy of my NVMe. Also a good test for my backup routine.
                              Unfortunately the USB based system crashed as well. Same story, root becomes ro and the OS falls over.
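To catch the moment root flips to read-only, a script can inspect the mount options. Below is a minimal sketch that parses text in the /proc/mounts format (device, mount point, filesystem type, options, ...); the sample lines and the device name are hypothetical, not taken from the affected machine.

```python
def root_is_read_only(mounts_text: str) -> bool:
    """Return True if the last mount entry for '/' carries the 'ro' option."""
    read_only = False
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == "/":
            read_only = "ro" in fields[3].split(",")
    return read_only


# Hypothetical sample lines for illustration only.
healthy = "/dev/mapper/root / btrfs rw,relatime,ssd 0 0"
broken = "/dev/mapper/root / btrfs ro,relatime,ssd 0 0"
print(root_is_read_only(healthy), root_is_read_only(broken))
```

On a live system the input would be the contents of /proc/mounts, for example `root_is_read_only(open("/proc/mounts").read())`, polled while the backup runs.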
                              Now, I can't completely switch off the NVMe drive, as the system boots from the UEFI partition on the NVMe before it points to the external HDD, but after the system has booted there should be no access to the NVMe drive. The NVMe driver is still in memory though, I guess?
                              I think that means the NVMe drive is off the radar now.
                              I also installed Win10 on the second internal drive and tried to put some stress on it, didn't manage to make it fall over so far.

                              That means I am probably back to a software issue rather than hardware? Maybe I need to abandon my backup and consider a minimal install next, adding must-have services one after another.

                              I also opened a call with Dell now, will see what they come up with.

                              By the way, the system is significantly cooler now that I blew the dust out, so that was definitely an exercise worth doing.

                              Edit: Almost 6 months after I started this thread, I finally have the solution.
                              It turned out to be a hardware problem with the Samsung NVMe drive as described here:


                              None of the solutions described worked, but at least it was clear what the problem was. Today I received a new (Toshiba) drive from Dell.
                              Boy, what a journey, but I learned a lot along the way.
                              Last edited by Thomas00; Feb 12, 2019, 11:44 AM. Reason: Found solution
