Message boards :
Problems and Help :
AMD ROCm CL_OUT_OF_HOST_MEMORY when running through BOINC
Hello all,
Sorry if this has been asked before, but neither the forum search nor search engines return any recent results for this.
I'm trying to run jobs on an RX 6900 XT but all tasks eventually fail with CL_OUT_OF_HOST_MEMORY. An abbreviated log from one such task:
geneferocl 3.3.3-2 (Linux/OpenCL/64-bit)
Running on platform 'AMD Accelerated Parallel Processing', device 'gfx1030', vendor 'Advanced Micro Devices, Inc.', version 'OpenCL 2.0 ' and driver '3452.0 (HSA1.1,LC)'.
40 computeUnits @ 2660MHz, memSize=16368MB, cacheSize=16kB, cacheLineSize=64B, localMemSize=64kB, maxWorkGroupSize=256.
Supported transform implementations: ocl ocl2 ocl3 ocl4 ocl5
Command line: ../../projects/www.primegrid.com/geneferocl_linux64_3.3.3-2 -boinc -q 319238584^32768+1 --device 0
Normal priority change failed (needs superuser privileges.
Checking available transform implementations...
Using OCL2 transform
Error: OpenCL error detected: CL_OUT_OF_HOST_MEMORY.
Errors occurred for all available transform implementations
Waiting 10 minutes before attempting to continue from last checkpoint...
This was from a Genefer 15 task (the later "can't acquire lockfile" was from me trying to reproduce it).
The thing is, I can't reproduce it outside BOINC! If I suspend the task, copy the environment the task was run in (from /proc/$pid/environ), go to the slot (before BOINC cleans it up) as boinc:boinc, and manually execute the command in the snippet above minus the "-boinc", the task completes!
The output is slightly different as the OpenCL runtime complains about dlopen not finding nvidia and Intel platforms, so maybe the tasks run from BOINC use a different runtime and/or icd directory, but it could also just be from stderr redirection happening after these messages are printed. Unfortunately I'm not that experienced with OpenCL.
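A minimal sketch of the environment comparison described above, assuming a Linux /proc filesystem. The helper name print_environ is made up for illustration; it only turns the NUL-separated dump into sorted lines so that two runs can be compared with diff.

```shell
#!/bin/sh
# print_environ FILE: print a NUL-separated environment dump
# (such as /proc/<pid>/environ) as sorted NAME=value lines.
# The helper name is hypothetical, not part of BOINC or genefer.
print_environ() {
    tr '\000' '\n' < "$1" | sort
}
```

For example, `print_environ /proc/$pid/environ > boinc.env` for the task started by BOINC, the same into `manual.env` for the manual run, then `diff boinc.env manual.env`. Values that contain newlines would be split across lines, which this sketch ignores.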
From a manual (successful) run, similarly abbreviated:
geneferocl 3.3.3-2 (Linux/OpenCL/64-bit)
Command line: /var/lib/boinc-client/projects/www.primegrid.com/geneferocl_linux64_3.3.3-2 -q 319238584^32768+1 --device 0
Normal priority change succeeded.
Checking available transform implementations...
dlerror: libintelocl.so: cannot open shared object file: No such file or directory
dlerror: libMesaOpenCL.so.1: cannot open shared object file: No such file or directory
dlerror: libnvidia-opencl.so.1: cannot open shared object file: No such file or directory
Testing 319238584^32768+1...
Using OCL2 transform
Running on platform 'AMD Accelerated Parallel Processing', device 'gfx1030', vendor 'Advanced Micro Devices, Inc.', version 'OpenCL 2.0 ' and driver '3452.0 (HSA1.1,LC)'.
40 computeUnits @ 2660MHz, memSize=16368MB, cacheSize=16kB, cacheLineSize=64B, localMemSize=64kB, maxWorkGroupSize=256.
Starting initialization...
Initialization complete (0.026 seconds).
Estimated time for 319238584^32768+1 is 0:00:46
319238584^32768+1 is composite. (RES=6fe9fa0b8bfb5e5f) (278663 digits) (err = 0.0000) (time = 0:00:49) 11:51:38
Has anybody had similar problems? I'm running ROCm 5.2.3 on Debian testing (Linux 6.0.0-4-amd64), BOINC 7.20.2. | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,522,612 RAC: 5,341

|
Several OpenCL implementations appear to be installed (libintelocl, libMesaOpenCL, libnvidia-opencl). If the dlerror messages do not appear when the task runs under BOINC, it is possible that the wrong library was loaded there.
You should try to uninstall them.
| |
|
|
Thanks for the suggestion. Unfortunately it doesn't seem to work. The packages providing these implementations were already uninstalled, it was just the ICDs that remained in /etc/OpenCL/vendors. I've removed all of them except the one for the current AMD OpenCL installation. I also searched the whole filesystem for the library files (and variants) but they're not there. Both the working and the broken one load /opt/rocm-5.2.3/lib/libOpenCL.so.1.2 and the associated support libraries (checked using /proc/$pid/maps). | |
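The /proc/$pid/maps check mentioned above could look roughly like this. It is only a sketch: loaded_libs is a hypothetical helper name, and it assumes library paths contain no spaces.

```shell
#!/bin/sh
# loaded_libs FILE: list the distinct shared objects mapped by a process,
# given the contents of its /proc/<pid>/maps.
# The helper name is hypothetical.
loaded_libs() {
    # Field 6 of a maps line is the backing path (absent for anonymous maps).
    awk '$6 ~ /\.so(\.|$)/ { print $6 }' "$1" | sort -u
}
```

Running `loaded_libs /proc/$pid/maps` once for the failing BOINC task and once for the working manual run, and comparing the two lists, shows whether both load the same libOpenCL and support libraries.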
|
|
I have the same problem here.
This is a new computer I'm currently setting up, and the software is in a somewhat messy state after experimenting with different ways to get OpenCL up and running at all.
Now I have the AMD OpenCL stack in usable shape, as far as I can see, and the result is as can be seen here: https://www.primegrid.com/result.php?resultid=1507777023
I doubt it's a real out-of-memory situation, as (without a GPU task running):
# LANG=C free -m
               total        used        free      shared  buff/cache   available
Mem:           63427        5956       52301          25        5903       57471
Swap:            975           0         975
I would appreciate any hint about what I can do to fix this.
Thanks,
Zyfdnug | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,522,612 RAC: 5,341

|
I have the same problem here.
It is a different problem. The 'gfx1036' is the integrated graphics of the Ryzen 9 7900X. It cannot test a DYFL. | |
|
|
That is quite interesting... first, I was not aware this CPU had an integrated graphics device.
Also, if I start the binary from the shell, as root user, it uses a different device (which would be the discrete Radeon card):
# /var/lib/boinc-client/projects/www.primegrid.com/genefer22g_linux64_22.12.02 -p -n 22 -b 1053460 -f gproof
geneferg version 22.12.2 (linux x64, gcc-7.5.0, boinc-7.20.2)
Copyright (c) 2022, Yves Gallot
genefer is free source code, under the MIT license.
Command line: '-p -n 22 -b 1053460 -f gproof'
Running on device 'gfx1031', vendor 'Advanced Micro Devices, Inc.', version 'OpenCL 2.0 ', driver '3513.0 (HSA1.1,LC)', data size: 96 MB.
Resuming from a checkpoint.
7.58% done, 26:07:20 remaining, 1.21 ms/bit.
I did not notice the different device identifiers (and if I did, I wouldn't be able to interpret them anyway ;-)
So it appears that, for whatever reason, the binary picks different OpenCL devices to work with when called from the shell and from the boinc manager.
My question then is -- how can I tell boinc or PrimeGrid which GPU to use for OpenCL? | |
|
composite Volunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437
                        
|
# man clinfo
...
To selectively enable/disable platforms, one way is to move or rename the *.icd files present in
/etc/OpenCL/vendors/ and then restoring them one by one. When using the free-software ocl-icd
OpenCL library, a similar effect can be achieved by setting the OPENCL_VENDOR_PATH or
OCL_ICD_VENDORS environment variables, as documented in libOpenCL(7).
Other implementations of libOpenCL are known to support OPENCL_VENDOR_PATH too.
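As a sketch of the approach from the man page excerpt, without touching /etc/OpenCL/vendors itself: copy a single .icd file into a private directory and point the loader at it. The helper name single_icd_dir and the ICD file name in the usage note are assumptions, not taken from the thread.

```shell
#!/bin/sh
# single_icd_dir SRC_DIR ICD_NAME: create a temporary directory holding only
# the named .icd file from SRC_DIR and print the directory's path.
# The helper name is hypothetical.
single_icd_dir() {
    dest=$(mktemp -d) || return 1
    cp "$1/$2" "$dest/" || return 1
    printf '%s\n' "$dest"
}
```

Something like `OCL_ICD_VENDORS=$(single_icd_dir /etc/OpenCL/vendors amdocl64.icd) clinfo -l` would then list only that platform's devices, provided the loader in use honors the variable (OPENCL_VENDOR_PATH for other loaders, as the excerpt notes). `amdocl64.icd` is an assumed file name; check the actual directory contents.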
| |
|
|
In this case, both devices are handled by the same vendor's software, so disabling a vendor for OpenCL would be counterproductive ;-)
I tried disabling the CPU-integrated device in boinc's cc_config.xml file. The boinc log reports the device as disabled but also reports that the exclude_gpu tag is not recognized, and PrimeGrid's software still uses the device when run through boinc.
So I tried running the boinc client in the foreground (not through systemd), and got very promising results: https://www.primegrid.com/result.php?resultid=1508473533
Getting this behaviour controlled through configuration seems to be a bit tricky ;-) | |
|
mikey Send message
Joined: 17 Mar 09 Posts: 1895 ID: 37043 Credit: 825,277,592 RAC: 580,819
                     
|
In this case, both devices are handled by the same vendor's software, so disabling a vendor for OpenCL would be counterproductive ;-)
I tried disabling the CPU-integrated device in boinc's cc_config.xml file. The boinc log reports the device as disabled but also reports that the exclude_gpu tag is not recognized, and PrimeGrid's software still uses the device when run through boinc.
So I tried running the boinc client in the foreground (not through systemd), and got very promising results: https://www.primegrid.com/result.php?resultid=1508473533
Getting this behaviour controlled through configuration seems to be a bit tricky ;-)
You can also exclude it through a few lines in your cc_config.xml file:
<cc_config>
<options>
<use_all_gpus>1</use_all_gpus>
<exclude_gpu>
<url>https://www.primegrid.com/</url>
<device_num>0</device_num>
</exclude_gpu>
</options>
</cc_config>
Change the device number to match what BOINC reports in the Event Log when it first starts up. This gets a bit longer if you crunch for multiple projects, as you need an exclude section for each one. | |
|
|
The GPU exclusion via configuration is indeed what I tried next.
I have
coproc_info.xml:
<coprocs>
<ati_opencl>
<name>AMD Radeon RX 6750 XT</name>
<vendor>Advanced Micro Devices, Inc.</vendor>
<vendor_id>4098</vendor_id>
<available>1</available>
<half_fp_config>0</half_fp_config>
<single_fp_config>191</single_fp_config>
<double_fp_config>63</double_fp_config>
<endian_little>1</endian_little>
<execution_capabilities>1</execution_capabilities>
<extensions>cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program </extensions>
<global_mem_size>12868124672</global_mem_size>
<local_mem_size>65536</local_mem_size>
<max_clock_frequency>2880</max_clock_frequency>
<max_compute_units>20</max_compute_units>
<nv_compute_capability_major>0</nv_compute_capability_major>
<nv_compute_capability_minor>0</nv_compute_capability_minor>
<amd_simd_per_compute_unit>4</amd_simd_per_compute_unit>
<amd_simd_width>32</amd_simd_width>
<amd_simd_instruction_width>1</amd_simd_instruction_width>
<opencl_platform_version>OpenCL 2.1 AMD-APP (3513.0)</opencl_platform_version>
<opencl_device_version>OpenCL 2.0 </opencl_device_version>
<opencl_driver_version>3513.0 (HSA1.1,LC)</opencl_driver_version>
<device_num>0</device_num>
<peak_flops>14745600000000.000000</peak_flops>
<opencl_available_ram>12868124672.000000</opencl_available_ram>
<opencl_device_index>0</opencl_device_index>
<warn_bad_cuda>0</warn_bad_cuda>
</ati_opencl>
<ati_opencl>
<name>gfx1036</name>
<vendor>Advanced Micro Devices, Inc.</vendor>
<vendor_id>4098</vendor_id>
<available>1</available>
<half_fp_config>0</half_fp_config>
<single_fp_config>191</single_fp_config>
<double_fp_config>63</double_fp_config>
<endian_little>1</endian_little>
<execution_capabilities>1</execution_capabilities>
<extensions>cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program </extensions>
<global_mem_size>536870912</global_mem_size>
<local_mem_size>65536</local_mem_size>
<max_clock_frequency>2200</max_clock_frequency>
<max_compute_units>1</max_compute_units>
<nv_compute_capability_major>0</nv_compute_capability_major>
<nv_compute_capability_minor>0</nv_compute_capability_minor>
<amd_simd_per_compute_unit>4</amd_simd_per_compute_unit>
<amd_simd_width>32</amd_simd_width>
<amd_simd_instruction_width>1</amd_simd_instruction_width>
<opencl_platform_version>OpenCL 2.1 AMD-APP (3513.0)</opencl_platform_version>
<opencl_device_version>OpenCL 2.0 </opencl_device_version>
<opencl_driver_version>3513.0 (HSA1.1,LC)</opencl_driver_version>
<device_num>1</device_num>
<peak_flops>563200000000.000000</peak_flops>
<opencl_available_ram>536870912.000000</opencl_available_ram>
<opencl_device_index>1</opencl_device_index>
<warn_bad_cuda>0</warn_bad_cuda>
</ati_opencl>
<warning>NVIDIA: libcuda.so: cannot open shared object file: No such file or directory</warning>
<warning>ATI: libaticalrt.so: cannot open shared object file: No such file or directory</warning>
</coprocs>
cc_config.xml:
<cc_config>
<log_flags>
<task>1</task>
<file_xfer>1</file_xfer>
<sched_ops>1</sched_ops>
</log_flags>
<options>
<exclude_gpu>
<url>http://www.primegrid.com/</url>
<device_num>1</device_num>
</exclude_gpu>
</options>
</cc_config>
The coproc_info file should make it clear what the id numbers refer to.
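To read the device numbers out of coproc_info.xml without scanning the whole file by eye, a small sketch like the following may help. list_gpus is a hypothetical helper name; the parsing is line-based and assumes each <name> and <device_num> tag sits on its own line, with <name> coming first inside each entry.

```shell
#!/bin/sh
# list_gpus FILE: print "device_num<TAB>name" for each GPU entry in a
# BOINC coproc_info.xml. Line-based, not a real XML parser.
# The helper name is hypothetical.
list_gpus() {
    awk -F '[<>]' '
        $2 == "name"       { name = $3 }             # remember the latest name
        $2 == "device_num" { print $3 "\t" name }    # emit when the number follows
    ' "$1"
}
```

With the data directory from the log in this post, the call would be `list_gpus /var/lib/boinc-client/coproc_info.xml`; for the file shown here it should pair 0 with the AMD Radeon RX 6750 XT and 1 with gfx1036.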
This is what I observe now:
root@Zwerg:/var/lib/boinc# sudo -u boinc boinc
27-Apr-2023 16:37:46 [---] Starting BOINC client version 7.20.5 for x86_64-pc-linux-gnu
27-Apr-2023 16:37:46 [---] This a development version of BOINC and may not function properly
27-Apr-2023 16:37:46 [---] log flags: file_xfer, sched_ops, task
27-Apr-2023 16:37:46 [---] Libraries: libcurl/7.88.1 OpenSSL/3.0.8 zlib/1.2.13 brotli/1.0.9 zstd/1.5.4 libidn2/2.3.3 libpsl/0.21.2 (+libidn2/2.3.3) libssh2/1.10.0 nghttp2/1.52.0 librtmp/2.3
27-Apr-2023 16:37:46 [---] Data directory: /var/lib/boinc-client
27-Apr-2023 16:37:50 [---] OpenCL: AMD/ATI GPU 0: AMD Radeon RX 6750 XT (driver version 3513.0 (HSA1.1,LC), device version OpenCL 2.0, 12272MB, 12272MB available, 14746 GFLOPS peak)
27-Apr-2023 16:37:50 [---] OpenCL: AMD/ATI GPU 1 (ignored by config): gfx1036 (driver version 3513.0 (HSA1.1,LC), device version OpenCL 2.0, 512MB, 512MB available, 563 GFLOPS peak)
27-Apr-2023 16:37:50 [---] libc: version 2.36
27-Apr-2023 16:37:50 [---] Host name: Zwerg
27-Apr-2023 16:37:50 [---] Processor: 24 AuthenticAMD AMD Ryzen 9 7900X 12-Core Processor [Family 25 Model 97 Stepping 2]
27-Apr-2023 16:37:50 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi u
27-Apr-2023 16:37:50 [---] OS: Linux Debian: Debian GNU/Linux 12 (bookworm) [6.1.0-7-amd64|libc 2.36]
27-Apr-2023 16:37:50 [---] Memory: 61.94 GB physical, 976.00 MB virtual
27-Apr-2023 16:37:50 [---] Disk: 464.48 GB total, 438.70 GB free
27-Apr-2023 16:37:50 [---] Local time is UTC +2 hours
27-Apr-2023 16:37:50 [---] Config: GUI RPCs allowed from:
27-Apr-2023 16:37:50 [---] 192.168.0.1
27-Apr-2023 16:37:50 [---] Zwerg.redacteddomainname
27-Apr-2023 16:37:50 [PrimeGrid] Config: excluded GPU. Type: all. App: all. Device: 1
27-Apr-2023 16:37:50 [PrimeGrid] General prefs: from PrimeGrid (last modified 29-Nov-2020 16:20:55)
27-Apr-2023 16:37:50 [PrimeGrid] Computer location: home
I think the above log output shows that the correct GPU device is disabled for PrimeGrid.
Also, task reports show that only gfx1031 is in use now.
However, jobs running under systemd fail with the OpenCL error CL_OUT_OF_HOST_MEMORY. If I run the boinc client from the shell, as the boinc user, things seem to proceed correctly.
On the shell, as root user, it managed to chew through https://www.primegrid.com/result.php?resultid=1508328550
I'll let the client run for a while now, but at this point it looks like systemd or the unit file introduces the problem. I already tried running the boinc client with a much higher locked-memory limit, but that did not help. | |
|
|
Looks like this is an issue only when running under systemd. Both root and the boinc client user can, when started from the shell, properly crunch their numbers using OpenCL.
Unfortunately, I have no reference system where I could try PrimeGrid under boinc with a different Linux distribution.
Has anybody ever needed to change systemd unit settings, and can recommend anything?
Thanks,
Zyfdnug | |
|
|
I managed to resolve this -- patience, actually thinking a bit myself, and a good search engine brought me to
https://github.com/BOINC/boinc/issues/4948
which contained the actual solution here.
In short, adding ProtectSystem=full to the unit via systemctl edit boinc-client.service resolved the problem.
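For reference, the override could be written out as below. This is a sketch under assumptions: write_boinc_override is a made-up helper name, and the default path is where `systemctl edit boinc-client.service` normally places its drop-in. Only the ProtectSystem=full setting itself comes from this post.

```shell
#!/bin/sh
# write_boinc_override [DIR]: write a systemd drop-in that sets
# ProtectSystem=full for boinc-client.service.
# DIR defaults to the usual drop-in directory; writing there needs root.
# The helper name is hypothetical.
write_boinc_override() {
    dir=${1:-/etc/systemd/system/boinc-client.service.d}
    mkdir -p "$dir" || return 1
    cat > "$dir/override.conf" <<'EOF'
[Service]
ProtectSystem=full
EOF
}
```

After writing the file as root, `systemctl daemon-reload` followed by `systemctl restart boinc-client.service` applies it; `systemctl edit` does the reload step on its own.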
I'll reach out to the Debian project's boinc package maintainers with this information, so the installation can either be adapted, or at least the documentation be updated.
Zyfdnug | |
|