I had an ap27_2.6_opencl WU running that was already past its estimated remaining time, which was displayed as ---, while the elapsed time kept counting up. It was also stuck at 27% and the progress did not change.
So I followed the advice from another WU issue here and suspended and resumed it. After that it stayed in the status "Waiting to run" forever.
I restarted BOINC, but it could not stop that process. Now the Linux kernel seems to have issues because of that process.
What's going on here? Can I do anything to complete this WU?
May 20 17:11:53 gaming kernel: INFO: task ap27_2.6_opencl:1605 blocked for more than 122 seconds.
May 20 17:11:53 gaming kernel: Not tainted 5.6.13 #1-NixOS
May 20 17:11:53 gaming kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 20 17:11:53 gaming kernel: ap27_2.6_opencl D 0 1605 1482 0x80004002
May 20 17:11:53 gaming kernel: Call Trace:
May 20 17:11:53 gaming kernel: ? __schedule+0x250/0x6d0
May 20 17:11:53 gaming kernel: ? schedule+0x4a/0xb0
May 20 17:11:53 gaming kernel: ? schedule_timeout+0x20f/0x300
May 20 17:11:53 gaming kernel: ? ttm_bo_move_to_lru_tail+0x28/0xc0 [ttm]
May 20 17:11:53 gaming kernel: ? ttm_eu_backoff_reservation+0x43/0x60 [ttm]
May 20 17:11:53 gaming kernel: ? dma_fence_default_wait+0x15f/0x1f0
May 20 17:11:53 gaming kernel: ? dma_fence_release+0x140/0x140
May 20 17:11:53 gaming kernel: ? dma_fence_wait_timeout+0xdd/0x100
May 20 17:11:53 gaming kernel: ? amdgpu_vm_fini+0xe7/0x470 [amdgpu]
May 20 17:11:53 gaming kernel: ? idr_destroy+0x71/0xb0
May 20 17:11:53 gaming kernel: ? amdgpu_driver_postclose_kms+0x15d/0x230 [amdgpu]
May 20 17:11:53 gaming kernel: ? drm_file_free.part.0+0x210/0x2c0 [drm]
May 20 17:11:53 gaming kernel: ? drm_release+0x4b/0x80 [drm]
May 20 17:11:53 gaming kernel: ? __fput+0xb9/0x250
May 20 17:11:53 gaming kernel: ? task_work_run+0x8a/0xb0
May 20 17:11:53 gaming kernel: ? do_exit+0x360/0xaa0
May 20 17:11:53 gaming kernel: ? handle_mm_fault+0xc4/0x1f0
May 20 17:11:53 gaming kernel: ? do_group_exit+0x3a/0xa0
May 20 17:11:53 gaming kernel: ? __x64_sys_exit_group+0x14/0x20
May 20 17:11:53 gaming kernel: ? do_syscall_64+0x4e/0x160
May 20 17:11:53 gaming kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
May 20 17:12:34 gaming sudo[25188]: davidak : TTY=pts/1 ; PWD=/home/davidak ; USER=root ; COMMAND=/run/wrappers/bin/su
May 20 17:12:34 gaming sudo[25188]: pam_unix(sudo:session): session opened for user root by (uid=0)
May 20 17:12:34 gaming su[25189]: Successful su for root by root
May 20 17:12:34 gaming su[25189]: pam_unix(su:session): session opened for user root by (uid=0)
May 20 17:12:41 gaming systemd[1]: Stopping BOINC Client...
May 20 17:13:56 gaming kernel: INFO: task ap27_2.6_opencl:1605 blocked for more than 245 seconds.
May 20 17:13:56 gaming kernel: Not tainted 5.6.13 #1-NixOS
May 20 17:13:56 gaming kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 20 17:13:56 gaming kernel: ap27_2.6_opencl D 0 1605 1 0x80004002
May 20 17:13:56 gaming kernel: Call Trace:
May 20 17:13:56 gaming kernel: ? __schedule+0x250/0x6d0
May 20 17:13:56 gaming kernel: ? schedule+0x4a/0xb0
May 20 17:13:56 gaming kernel: ? schedule_timeout+0x20f/0x300
May 20 17:13:56 gaming kernel: ? ttm_bo_move_to_lru_tail+0x28/0xc0 [ttm]
May 20 17:13:56 gaming kernel: ? ttm_eu_backoff_reservation+0x43/0x60 [ttm]
May 20 17:13:56 gaming kernel: ? dma_fence_default_wait+0x15f/0x1f0
May 20 17:13:56 gaming kernel: ? dma_fence_release+0x140/0x140
May 20 17:13:56 gaming kernel: ? dma_fence_wait_timeout+0xdd/0x100
May 20 17:13:56 gaming kernel: ? amdgpu_vm_fini+0xe7/0x470 [amdgpu]
May 20 17:13:56 gaming kernel: ? idr_destroy+0x71/0xb0
May 20 17:13:56 gaming kernel: ? amdgpu_driver_postclose_kms+0x15d/0x230 [amdgpu]
May 20 17:13:56 gaming kernel: ? drm_file_free.part.0+0x210/0x2c0 [drm]
May 20 17:13:56 gaming kernel: ? drm_release+0x4b/0x80 [drm]
May 20 17:13:56 gaming kernel: ? __fput+0xb9/0x250
May 20 17:13:56 gaming kernel: ? task_work_run+0x8a/0xb0
May 20 17:13:56 gaming kernel: ? do_exit+0x360/0xaa0
May 20 17:13:56 gaming kernel: ? handle_mm_fault+0xc4/0x1f0
May 20 17:13:56 gaming kernel: ? do_group_exit+0x3a/0xa0
May 20 17:13:56 gaming kernel: ? __x64_sys_exit_group+0x14/0x20
May 20 17:13:56 gaming kernel: ? do_syscall_64+0x4e/0x160
May 20 17:13:56 gaming kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
May 20 17:14:11 gaming systemd[1]: boinc.service: State 'stop-final-sigterm' timed out. Killing.
May 20 17:14:11 gaming systemd[1]: boinc.service: Killing process 1605 (ap27_2.6_opencl) with signal SIGKILL.
May 20 17:14:11 gaming systemd[1]: boinc.service: Failed with result 'timeout'.
May 20 17:14:11 gaming systemd[1]: Stopped BOINC Client.
May 20 17:14:11 gaming systemd[1]: boinc.service: Consumed 1month 3w 6d 1h 18min 20.255s CPU time, received 240.1M IP traffic, sent 7.4G IP traffic.
May 20 17:14:11 gaming systemd[1]: boinc.service: Found left-over process 1605 (ap27_2.6_opencl) in control group while starting unit. Ignoring.
May 20 17:14:11 gaming systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 20 17:14:11 gaming systemd[1]: Started BOINC Client.
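A note on the log above: the `D` in the task line means the process is in uninterruptible sleep, and a process in that state does not act on SIGKILL until the kernel call it is blocked in returns. That would fit with systemd not being able to kill it. Below is a minimal sketch for listing such processes, assuming a Linux `/proc` filesystem. The function names `parse_stat_state` and `uninterruptible_tasks` are made up for illustration and are not part of BOINC or any other tool.

```python
from pathlib import Path


def parse_stat_state(stat_line: str) -> tuple[int, str, str]:
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left, _, rest = stat_line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(left), comm, tail.split()[0]


def uninterruptible_tasks(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            pid, comm, state = parse_stat_state((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

Processes also enter state D briefly during normal I/O, so only an entry that stays listed across repeated runs points to a real hang like the one in the log.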
I'm not sure whether the PCIe connectors of the mainboard are OK. I bought it used and tested it with 4 different GPUs, and all of them had dropouts of 1-3 seconds. That would probably lead to calculation errors in BOINC, but I didn't have that issue with shorter WUs.
Update: I rebooted the system (well, the reboot didn't work, so I pulled the power plug) and the WU is running again. It's at 30% now and shows 30 minutes of runtime. It was 1h30m before! It seems to have crashed the whole system.
Does it make sense to let it run for a day, or for however long it takes, or is the WU invalid anyway?