Linux Kernel panic issue: How to fix hung_task_timeout_secs and blocked for more than 120 seconds problem
A panic may occur as a result of a hardware failure or a software bug in the operating system. In
many cases, the operating system is capable of continued operation
after an error has occurred. However, the system is in an unstable state
and rather than risking security breaches and data corruption, the
operating system stops to prevent further damage and facilitate
diagnosis of the error and, in usual cases, restart. After recompiling a
kernel binary image from source code, a kernel panic during booting the
resulting kernel is a common problem if the kernel was not correctly
configured, compiled or installed. Add-on hardware or malfunctioning
RAM could also be sources of fatal kernel errors during start up, due to
incompatibility with the OS or a missing device driver. A kernel may
also go into panic() if it is unable to locate a root file
system. During the final stages of kernel userspace initialization, a
panic is typically triggered if the spawning of init fails, as the
system would then be unusable.
Background
When I faced this issue, I checked the log in /var/log/messages and found the following entries:
INFO: task jbd2/vda3-8:250 blocked for more than 120 seconds.
Not tainted 2.6.32-431.11.2.el6.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
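To see which tasks are being reported and how often, the log can be summarised with a small helper. This is only a sketch: the function name hung_tasks is made up for illustration, and it assumes the messages have the same shape as the entries above.

```shell
#!/bin/sh
# Hypothetical helper: list the tasks reported as hung, with a count for each.
# $1 is the log file to scan (defaults to /var/log/messages, as used in this post).
hung_tasks() {
    log_file="${1:-/var/log/messages}"
    # Match lines such as:
    #   INFO: task jbd2/vda3-8:250 blocked for more than 120 seconds.
    grep -E 'INFO: task .+ blocked for more than [0-9]+ seconds' "$log_file" |
        # Keep only the task name, dropping the trailing ":PID"
        sed -E 's/.*INFO: task (.+):[0-9]+ blocked for more than.*/\1/' |
        # Count identical names, most frequent first
        sort | uniq -c | sort -rn
}

# Example use:
# hung_tasks /var/log/messages
```

If one task (for example the jbd2 journal thread) dominates the list, that points towards slow disk I/O rather than a problem in a single application.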
Step by step troubleshooting data and logs
The following command shows the server's memory usage:
sar -r
With it we can check how much memory the server is consuming. Next, check the CPU usage with the following command:
sar -u
Solution for hung_task_timeout_secs
Explanation
By default, Linux uses up to 40% of the available memory for file system caching (the exact percentage is set by vm.dirty_ratio and varies between kernel versions). Once this mark is reached, the file system flushes all outstanding data to disk, and all subsequent I/O becomes synchronous. A task that stays blocked on this flush for longer than hung_task_timeout_secs, 120 seconds by default, is reported in the log. In this case the I/O subsystem was not fast enough to flush the data within 120 seconds. Because the I/O subsystem responds slowly while more requests keep arriving, system memory fills up, which results in the error above and disrupts the serving of HTTP requests.
Testing
I tested this theory by changing vm.dirty_ratio and vm.dirty_background_ratio:
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
Reload the settings (sysctl -w already applies the values immediately; sysctl -p re-reads /etc/sysctl.conf):
sysctl -p
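To get a feel for what these percentages mean in absolute terms, the arithmetic can be sketched as below. The helper name pct_of_kb is hypothetical, and using MemTotal as the base is a simplifying assumption: the kernel applies the ratios to the memory it considers dirtyable, which is smaller, so treat the result as a rough upper bound.

```shell
#!/bin/sh
# Hypothetical helper: a percentage of a memory size, both given as integers.
# $1 = total in kB, $2 = ratio in percent
# Assumption: the caller passes MemTotal; the real kernel threshold is based on
# "dirtyable" memory, so the figure printed here is only an upper bound.
pct_of_kb() {
    echo $(( $1 * $2 / 100 ))
}

# Example use on a live system:
# total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
# pct_of_kb "$total_kb" "$(cat /proc/sys/vm/dirty_ratio)"
# Current amount of dirty and in-flight data, for comparison:
# grep -E '^(Dirty|Writeback):' /proc/meminfo
```

On a machine with 8000000 kB of RAM, a ratio of 10 gives 800000 kB and a ratio of 5 gives 400000 kB, so the lower settings cap how much unwritten data can pile up before the flush starts.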
Make it permanent
Once the server had been stable for a week, with no kernel, swap or memory panics, I edited the /etc/sysctl.conf file so that these settings persist after a reboot.
vi /etc/sysctl.conf
Add these two lines at the bottom:
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
Save and exit, then reboot the server.
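After the reboot it is worth confirming that the running kernel picked up the values from the file. The following is a minimal sketch; conf_value is a made-up helper name, and it assumes simple "key = value" lines like the two added above.

```shell
#!/bin/sh
# Hypothetical helper: print the value assigned to a key in a sysctl-style file.
# $1 = key (for example vm.dirty_ratio), $2 = file (defaults to /etc/sysctl.conf)
conf_value() {
    awk -F= -v key="$1" '
        /^[ \t]*[#;]/ { next }                  # skip comment lines
        {
            k = $1; v = $2
            gsub(/[ \t]/, "", k)                # strip whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)      # trim whitespace around the value
            if (k == key) val = v               # the last assignment wins
        }
        END { if (val != "") print val }
    ' "${2:-/etc/sysctl.conf}"
}

# Example use: compare the file with the running kernel.
# [ "$(conf_value vm.dirty_ratio)" = "$(sysctl -n vm.dirty_ratio)" ] && echo "vm.dirty_ratio OK"
# [ "$(conf_value vm.dirty_background_ratio)" = "$(sysctl -n vm.dirty_background_ratio)" ] && echo "vm.dirty_background_ratio OK"
```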