Recently I came across a strange issue with a php server process running inside a docker container was mysteriously getting terminated in our production cluster. There was no sign of docker container crash, process was not crashed due some unhandled exception as there is no such indication in the application logs, also it was not getting killed by kernel OOM killer as there no indication of it in the system logs.
Unfortunately this issue was occurring in production and it is not easily possible to debug the application as the crash appeared to happen randomly when heavy work is done by the php process.
This issue was occurring when the application is given more work to do in a single use case. I was successfully able to reproduce in test environment after a few days as it was taking more than 10 hours or so for a crash to occur.
Crash was occurring at different points for different input in this use case but it was crashing reproducibly at the exact same point for same input.
Now after successfully reproducing the issue, next step was to identify possible causes for this bug.
Application was processing work in a background php process called by php-fpm worker process and as stated already in problem statement, this crash was not happening due it being killed by kernel OOM killer and the docker container was not crashed as well also other work that was being processed by application during the crash did continue unaffected.
This was an isolated crash and there was no error or exceptions logged either in the application logs or in the system logs.
Also we can rule out other users being involved as the crash was occurring consistently after a particular amount of progress for an input.
After ruling out the usual suspects, I enabled tracing for the running process using strace and monitored only for signals received by the process using below command:
strace -p "$pid" -e 'trace=!all' -e 'signal=!SIGCHLD' -o /var/log/monitor_${pid}.log
In the above strace command, I’m monitoring the process using -p pid and disabled tracing of all system calls using -e trace=!all and excluded the SIGCHLD signal alone as it was spawning multiple child processes for network IO and there was a lot of noise from this signal due to normal exit of child processes.
Once tracing was enabled, I patiently waited for the process to crash and once it crashed, I could see it was killed by SIGPROF signal.
--- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
+++ exited with 255 +++
SIGPROF signal is sent by kernel when a process cpu profiling timer expires.
Below is an excerpt about this signal from GNU libc manual:
This signal typically indicates expiration of a timer that measures both CPU time used by the current process, and CPU time expended on behalf of the process by the system. Such a timer is used to implement code profiling facilities, hence the name of this signal. [2]
Since we now know how this process is getting terminated, next step is find out why?
In php, SIGPROF signal can be sent to a process when the time limit set by set_time_limit() function or the value set by max_execution_time directive expires. This is used to terminate an unresponsive php web process.
PHP cli process by default has max_execution_time set to 0, meaning no limit and there was no set_time_limit() function called in this code base.
But this max_execution_time directive is directly set in php and it was added in beginning of a different file. This file imported by a different unrelated file which is then imported by a php function after this background process is created.
I found that this was added as a quick fix to an unrelated bug where a web process was taking more than 30 seconds of cpu time to complete and got killed by default max_execution_time of 30 seconds for fpm process.
After figuring out why this bug happened, correct fix was to remove this max_execution_time directive defined in php file and increasing max_execution_time directive in fpm/php.ini file so that it does not affect the background process.
Once the above was done, the php background process which was doing the heavy lifting no longer got killed :-) .
References: