PrimeGrid

Message boards : Problems and Help : AMD ROCm CL_OUT_OF_HOST_MEMORY when running through BOINC

Luc Everse
Joined: 18 Nov 22
Posts: 2
ID: 1542475
Credit: 1,042,115
RAC: 0
Message 157996 - Posted: 23 Nov 2022 | 10:57:26 UTC

Hello all,

Sorry if this has been asked before, but neither the forum search nor search engines return any recent results for this.

I'm trying to run jobs on an RX 6900 XT but all tasks eventually fail with CL_OUT_OF_HOST_MEMORY. An abbreviated log from one such task:

geneferocl 3.3.3-2 (Linux/OpenCL/64-bit)
Running on platform 'AMD Accelerated Parallel Processing', device 'gfx1030', vendor 'Advanced Micro Devices, Inc.', version 'OpenCL 2.0 ' and driver '3452.0 (HSA1.1,LC)'.
40 computeUnits @ 2660MHz, memSize=16368MB, cacheSize=16kB, cacheLineSize=64B, localMemSize=64kB, maxWorkGroupSize=256.
Supported transform implementations: ocl ocl2 ocl3 ocl4 ocl5
Command line: ../../projects/www.primegrid.com/geneferocl_linux64_3.3.3-2 -boinc -q 319238584^32768+1 --device 0
Normal priority change failed (needs superuser privileges.
Checking available transform implementations...
Using OCL2 transform
Error: OpenCL error detected: CL_OUT_OF_HOST_MEMORY.
Errors occurred for all available transform implementations
Waiting 10 minutes before attempting to continue from last checkpoint...


This was from a Genefer 15 task (the later "can't acquire lockfile" was from me trying to reproduce it).

The thing is, I can't reproduce it outside BOINC! If I suspend the task, copy the environment the task was run in (from /proc/$pid/environ), go to the slot (before BOINC cleans it up) as boinc:boinc, and manually execute the command in the snippet above minus the "-boinc", the task completes!
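The environment comparison described above can be sketched in Python. On Linux, /proc/<pid>/environ holds the process environment as NUL-separated KEY=VALUE entries. The helper names parse_environ, read_environ and diff_environ are hypothetical, made up for this illustration, and the sample data is invented rather than taken from the failing task.

```python
from pathlib import Path


def parse_environ(raw: bytes) -> dict[str, str]:
    """Parse the NUL-separated KEY=VALUE blob found in /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry or b"=" not in entry:
            continue  # skip the trailing empty entry and malformed items
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_environ(pid: int) -> dict[str, str]:
    """Read a live process's environment (needs permission to read /proc/<pid>)."""
    return parse_environ(Path(f"/proc/{pid}/environ").read_bytes())


def diff_environ(a: dict[str, str], b: dict[str, str]) -> dict[str, tuple]:
    """Return the variables whose values differ; None marks a missing variable."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    # Invented sample data standing in for two real /proc/<pid>/environ reads.
    boinc_env = parse_environ(b"HOME=/var/lib/boinc-client\0LANG=C\0")
    shell_env = parse_environ(b"HOME=/var/lib/boinc-client\0LANG=C\0TERM=xterm\0")
    print(diff_environ(boinc_env, shell_env))
```

An empty diff would suggest that the difference between the two runs lies outside the environment variables, for example in sandboxing or resource limits applied by the service manager.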

The output is slightly different as the OpenCL runtime complains about dlopen not finding nvidia and Intel platforms, so maybe the tasks run from BOINC use a different runtime and/or icd directory, but it could also just be from stderr redirection happening after these messages are printed. Unfortunately I'm not that experienced with OpenCL.

From a manual (successful) run, similarly abbreviated:
geneferocl 3.3.3-2 (Linux/OpenCL/64-bit)
Command line: /var/lib/boinc-client/projects/www.primegrid.com/geneferocl_linux64_3.3.3-2 -q 319238584^32768+1 --device 0
Normal priority change succeeded.
Checking available transform implementations...
dlerror: libintelocl.so: cannot open shared object file: No such file or directory
dlerror: libMesaOpenCL.so.1: cannot open shared object file: No such file or directory
dlerror: libnvidia-opencl.so.1: cannot open shared object file: No such file or directory
Testing 319238584^32768+1...
Using OCL2 transform
Running on platform 'AMD Accelerated Parallel Processing', device 'gfx1030', vendor 'Advanced Micro Devices, Inc.', version 'OpenCL 2.0 ' and driver '3452.0 (HSA1.1,LC)'.
40 computeUnits @ 2660MHz, memSize=16368MB, cacheSize=16kB, cacheLineSize=64B, localMemSize=64kB, maxWorkGroupSize=256.
Starting initialization...
Initialization complete (0.026 seconds).
Estimated time for 319238584^32768+1 is 0:00:46
319238584^32768+1 is composite. (RES=6fe9fa0b8bfb5e5f) (278663 digits) (err = 0.0000) (time = 0:00:49) 11:51:38


Has anybody had similar problems? I'm running ROCm 5.2.3 on Debian testing (Linux 6.0.0-4-amd64), BOINC 7.20.2.

Yves Gallot
Project donor
Volunteer developer
Project scientist
Joined: 19 Aug 12
Posts: 843
ID: 164101
Credit: 306,522,612
RAC: 5,341
Message 158002 - Posted: 23 Nov 2022 | 13:08:40 UTC - in response to Message 157996.

Several OpenCL implementations appear to be installed (libintelocl, libMesaOpenCL, libnvidia-opencl). If the dlerror messages do not appear when the task runs under BOINC, it is possible that the wrong library was loaded there.
You should try uninstalling them.

Luc Everse
Message 158007 - Posted: 23 Nov 2022 | 18:45:14 UTC

Thanks for the suggestion. Unfortunately it doesn't seem to work. The packages providing these implementations were already uninstalled; only their ICD files remained in /etc/OpenCL/vendors. I've removed all of them except the one for the current AMD OpenCL installation. I also searched the whole filesystem for the library files (and variants), but they're not there. Both the working and the failing runs load /opt/rocm-5.2.3/lib/libOpenCL.so.1.2 and the associated support libraries (checked using /proc/$pid/maps).
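The /proc/$pid/maps check mentioned above can be sketched as follows. Each line of that file lists an address range, permissions, offset, device, inode and, for file-backed mappings, a path. The function name mapped_libraries and the sample text are assumptions for illustration only; the inode and addresses are invented.

```python
def mapped_libraries(maps_text: str, needle: str = "OpenCL") -> list[str]:
    """Return the unique file paths in a /proc/<pid>/maps dump that contain `needle`."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) == 6 and needle in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


if __name__ == "__main__":
    # Invented sample; a real dump would come from open(f"/proc/{pid}/maps").read().
    sample = (
        "7f1c2a000000-7f1c2a021000 r-xp 00000000 08:01 1234 "
        "/opt/rocm-5.2.3/lib/libOpenCL.so.1.2\n"
        "7f1c2a021000-7f1c2a022000 rw-p 00000000 00:00 0\n"
    )
    print(mapped_libraries(sample))
```

Running this against both the BOINC-launched process and the manual one makes it easy to confirm that the two load the same libraries, which is what was observed here.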

Zyfdnug
Joined: 27 Sep 19
Posts: 12
ID: 1196790
Credit: 419,079,109
RAC: 1,054,326
Message 161926 - Posted: 24 Apr 2023 | 22:44:32 UTC - in response to Message 157996.

I have the same problem here.

This is a new computer that I'm currently setting up, and the software is in a somewhat messy state after experimenting with different ways to get OpenCL up and running at all.

Now I have the AMD OpenCL stack in usable shape, as far as I can see, and the result is as can be seen here: https://www.primegrid.com/result.php?resultid=1507777023

I doubt it's a real out-of-memory situation, as (without a GPU task running):

# LANG=C free -m
               total        used        free      shared  buff/cache   available
Mem:           63427        5956       52301          25        5903       57471
Swap:            975           0         975


I would appreciate any hint about what I can do to fix this.

Thanks,

Zyfdnug

Yves Gallot
Message 161945 - Posted: 25 Apr 2023 | 12:19:17 UTC - in response to Message 161926.
Last modified: 25 Apr 2023 | 12:20:44 UTC

Zyfdnug wrote: "I have the same problem here."

This is a different problem. 'gfx1036' is the integrated graphics of the Ryzen 9 7900X. It cannot test a DYFL.

Zyfdnug
Message 161956 - Posted: 25 Apr 2023 | 23:27:32 UTC - in response to Message 161945.

That is quite interesting... first, I was not aware this CPU had an integrated graphics device.

Also, if I start the binary from the shell, as root user, it uses a different device (which would be the discrete Radeon card):

# /var/lib/boinc-client/projects/www.primegrid.com/genefer22g_linux64_22.12.02 -p -n 22 -b 1053460 -f gproof
geneferg version 22.12.2 (linux x64, gcc-7.5.0, boinc-7.20.2)
Copyright (c) 2022, Yves Gallot
genefer is free source code, under the MIT license.
Command line: '-p -n 22 -b 1053460 -f gproof'
Running on device 'gfx1031', vendor 'Advanced Micro Devices, Inc.', version 'OpenCL 2.0 ', driver '3513.0 (HSA1.1,LC)', data size: 96 MB.
Resuming from a checkpoint.
7.58% done, 26:07:20 remaining, 1.21 ms/bit.


I did not notice the different device identifiers (and if I did, I wouldn't be able to interpret them anyway ;-)

So it appears that, for whatever reason, the binary picks different OpenCL devices to work with when called from the shell and from the boinc manager.

My question then is -- how can I tell boinc or PrimeGrid which GPU to use for OpenCL?

composite
Volunteer tester
Joined: 16 Feb 10
Posts: 1172
ID: 55391
Credit: 1,211,016,878
RAC: 1,196,437
Message 161957 - Posted: 26 Apr 2023 | 5:00:32 UTC - in response to Message 161956.
Last modified: 26 Apr 2023 | 5:01:56 UTC

# man clinfo
...
To selectively enable/disable platforms, one way is to move or rename the *.icd files present in /etc/OpenCL/vendors/ and then restoring them one by one. When using the free-software ocl-icd OpenCL library, a similar effect can be achieved by setting the OPENCL_VENDOR_PATH or OCL_ICD_VENDORS environment variables, as documented in libOpenCL(7). Other implementations of libOpenCL are known to support OPENCL_VENDOR_PATH too.
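Before moving or renaming *.icd files as the man page suggests, it can help to see which library each one points at. An ICD file is a small text file naming the vendor's OpenCL library. The sketch below assumes the default /etc/OpenCL/vendors directory quoted above; the function name list_icds is hypothetical.

```python
from pathlib import Path


def list_icds(vendor_dir: str = "/etc/OpenCL/vendors") -> dict[str, str]:
    """Map each *.icd file name in vendor_dir to the library name or path it contains."""
    directory = Path(vendor_dir)
    if not directory.is_dir():
        return {}  # no ICD directory on this machine
    return {
        icd.name: icd.read_text().strip()
        for icd in sorted(directory.glob("*.icd"))
    }


if __name__ == "__main__":
    for name, library in list_icds().items():
        print(f"{name}: {library}")
```

Note that this only distinguishes vendors. Two devices served by the same vendor library, as in this thread, share a single ICD entry and cannot be separated this way.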

Zyfdnug
Message 161959 - Posted: 26 Apr 2023 | 8:11:52 UTC - in response to Message 161957.

In this case, both devices are handled by the same vendor's software, so disabling a vendor for OpenCL would be counterproductive ;-)

I tried disabling the CPU-integrated device in BOINC's cc_config.xml file. The BOINC log reports the device as disabled, but it also reports that the exclude_gpu tag is not recognized, and PrimeGrid's software still uses the device when run through BOINC.

So I tried running the boinc client in the foreground (not through systemd), and got very promising results: https://www.primegrid.com/result.php?resultid=1508473533

Getting this behaviour controlled through configuration seems to be a bit tricky ;-)

mikey
Joined: 17 Mar 09
Posts: 1895
ID: 37043
Credit: 825,277,592
RAC: 580,819
Message 161962 - Posted: 26 Apr 2023 | 10:41:17 UTC - in response to Message 161959.


You can also exclude it through a few lines in your cc_config.xml file:

<cc_config>
<options>
<use_all_gpus>1</use_all_gpus>

<exclude_gpu>
<url>https://www.primegrid.com/</url>
<device_num>0</device_num>
</exclude_gpu>
</options>
</cc_config>


Change the device number to match what BOINC reports in the Event Log when it first starts up. This gets a bit longer if you crunch for multiple projects, as you need an exclude section for each one.
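Since one exclude section is needed per project, generating the file can be less error-prone than editing it by hand. The following is a minimal sketch using only the tags shown in this thread (cc_config, options, exclude_gpu, url, device_num). The function name build_cc_config is hypothetical, and the second URL is a placeholder, not a real project.

```python
import xml.etree.ElementTree as ET


def build_cc_config(excludes: list[tuple[str, int]]) -> str:
    """Render a cc_config.xml with one <exclude_gpu> block per (url, device_num) pair."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for url, device_num in excludes:
        block = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(block, "url").text = url
        ET.SubElement(block, "device_num").text = str(device_num)
    ET.indent(root)  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_cc_config([
        ("https://www.primegrid.com/", 1),
        ("https://example.org/other_project/", 1),  # placeholder URL
    ]))
```

The device number passed in should be the one BOINC itself assigns, as shown in its startup log.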

Zyfdnug
Message 161990 - Posted: 27 Apr 2023 | 15:12:58 UTC

The GPU exclusion via configuration is indeed what I tried next.


I have
coproc_info.xml:

<coprocs>
  <ati_opencl>
    <name>AMD Radeon RX 6750 XT</name>
    <vendor>Advanced Micro Devices, Inc.</vendor>
    <vendor_id>4098</vendor_id>
    <available>1</available>
    <half_fp_config>0</half_fp_config>
    <single_fp_config>191</single_fp_config>
    <double_fp_config>63</double_fp_config>
    <endian_little>1</endian_little>
    <execution_capabilities>1</execution_capabilities>
    <extensions>cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program </extensions>
    <global_mem_size>12868124672</global_mem_size>
    <local_mem_size>65536</local_mem_size>
    <max_clock_frequency>2880</max_clock_frequency>
    <max_compute_units>20</max_compute_units>
    <nv_compute_capability_major>0</nv_compute_capability_major>
    <nv_compute_capability_minor>0</nv_compute_capability_minor>
    <amd_simd_per_compute_unit>4</amd_simd_per_compute_unit>
    <amd_simd_width>32</amd_simd_width>
    <amd_simd_instruction_width>1</amd_simd_instruction_width>
    <opencl_platform_version>OpenCL 2.1 AMD-APP (3513.0)</opencl_platform_version>
    <opencl_device_version>OpenCL 2.0 </opencl_device_version>
    <opencl_driver_version>3513.0 (HSA1.1,LC)</opencl_driver_version>
    <device_num>0</device_num>
    <peak_flops>14745600000000.000000</peak_flops>
    <opencl_available_ram>12868124672.000000</opencl_available_ram>
    <opencl_device_index>0</opencl_device_index>
    <warn_bad_cuda>0</warn_bad_cuda>
  </ati_opencl>
  <ati_opencl>
    <name>gfx1036</name>
    <vendor>Advanced Micro Devices, Inc.</vendor>
    <vendor_id>4098</vendor_id>
    <available>1</available>
    <half_fp_config>0</half_fp_config>
    <single_fp_config>191</single_fp_config>
    <double_fp_config>63</double_fp_config>
    <endian_little>1</endian_little>
    <execution_capabilities>1</execution_capabilities>
    <extensions>cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program </extensions>
    <global_mem_size>536870912</global_mem_size>
    <local_mem_size>65536</local_mem_size>
    <max_clock_frequency>2200</max_clock_frequency>
    <max_compute_units>1</max_compute_units>
    <nv_compute_capability_major>0</nv_compute_capability_major>
    <nv_compute_capability_minor>0</nv_compute_capability_minor>
    <amd_simd_per_compute_unit>4</amd_simd_per_compute_unit>
    <amd_simd_width>32</amd_simd_width>
    <amd_simd_instruction_width>1</amd_simd_instruction_width>
    <opencl_platform_version>OpenCL 2.1 AMD-APP (3513.0)</opencl_platform_version>
    <opencl_device_version>OpenCL 2.0 </opencl_device_version>
    <opencl_driver_version>3513.0 (HSA1.1,LC)</opencl_driver_version>
    <device_num>1</device_num>
    <peak_flops>563200000000.000000</peak_flops>
    <opencl_available_ram>536870912.000000</opencl_available_ram>
    <opencl_device_index>1</opencl_device_index>
    <warn_bad_cuda>0</warn_bad_cuda>
  </ati_opencl>
  <warning>NVIDIA: libcuda.so: cannot open shared object file: No such file or directory</warning>
  <warning>ATI: libaticalrt.so: cannot open shared object file: No such file or directory</warning>
</coprocs>


cc_config.xml:
<cc_config>
  <log_flags>
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
  </log_flags>
  <options>
    <exclude_gpu>
      <url>http://www.primegrid.com/</url>
      <device_num>1</device_num>
    </exclude_gpu>
  </options>
</cc_config>


The coproc_info file should make it clear what the id numbers refer to.
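To read the id numbers out of coproc_info.xml without scanning the whole file by eye, a short script can map each device_num to its name. This is a sketch that assumes the structure shown above (ati_opencl elements with name and device_num children); the function name opencl_devices is hypothetical and the embedded sample is trimmed to the two relevant tags.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample with the same element names as the file above.
SAMPLE = """
<coprocs>
  <ati_opencl>
    <name>AMD Radeon RX 6750 XT</name>
    <device_num>0</device_num>
  </ati_opencl>
  <ati_opencl>
    <name>gfx1036</name>
    <device_num>1</device_num>
  </ati_opencl>
</coprocs>
"""


def opencl_devices(coproc_xml: str) -> dict[int, str]:
    """Map BOINC's device_num to the device name for every <ati_opencl> entry."""
    root = ET.fromstring(coproc_xml)
    return {
        int(dev.findtext("device_num")): dev.findtext("name")
        for dev in root.iter("ati_opencl")
    }


if __name__ == "__main__":
    for num, name in sorted(opencl_devices(SAMPLE).items()):
        print(f"device_num {num}: {name}")
```

With the real file, device 1 is the integrated gfx1036, which matches the device_num used in the exclude_gpu section.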

This is what I observe now:

root@Zwerg:/var/lib/boinc# sudo -u boinc boinc
27-Apr-2023 16:37:46 [---] Starting BOINC client version 7.20.5 for x86_64-pc-linux-gnu
27-Apr-2023 16:37:46 [---] This a development version of BOINC and may not function properly
27-Apr-2023 16:37:46 [---] log flags: file_xfer, sched_ops, task
27-Apr-2023 16:37:46 [---] Libraries: libcurl/7.88.1 OpenSSL/3.0.8 zlib/1.2.13 brotli/1.0.9 zstd/1.5.4 libidn2/2.3.3 libpsl/0.21.2 (+libidn2/2.3.3) libssh2/1.10.0 nghttp2/1.52.0 librtmp/2.3
27-Apr-2023 16:37:46 [---] Data directory: /var/lib/boinc-client
27-Apr-2023 16:37:50 [---] OpenCL: AMD/ATI GPU 0: AMD Radeon RX 6750 XT (driver version 3513.0 (HSA1.1,LC), device version OpenCL 2.0, 12272MB, 12272MB available, 14746 GFLOPS peak)
27-Apr-2023 16:37:50 [---] OpenCL: AMD/ATI GPU 1 (ignored by config): gfx1036 (driver version 3513.0 (HSA1.1,LC), device version OpenCL 2.0, 512MB, 512MB available, 563 GFLOPS peak)
27-Apr-2023 16:37:50 [---] libc: version 2.36
27-Apr-2023 16:37:50 [---] Host name: Zwerg
27-Apr-2023 16:37:50 [---] Processor: 24 AuthenticAMD AMD Ryzen 9 7900X 12-Core Processor [Family 25 Model 97 Stepping 2]
27-Apr-2023 16:37:50 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi u
27-Apr-2023 16:37:50 [---] OS: Linux Debian: Debian GNU/Linux 12 (bookworm) [6.1.0-7-amd64|libc 2.36]
27-Apr-2023 16:37:50 [---] Memory: 61.94 GB physical, 976.00 MB virtual
27-Apr-2023 16:37:50 [---] Disk: 464.48 GB total, 438.70 GB free
27-Apr-2023 16:37:50 [---] Local time is UTC +2 hours
27-Apr-2023 16:37:50 [---] Config: GUI RPCs allowed from:
27-Apr-2023 16:37:50 [---] 192.168.0.1
27-Apr-2023 16:37:50 [---] Zwerg.redacteddomainname
27-Apr-2023 16:37:50 [PrimeGrid] Config: excluded GPU. Type: all. App: all. Device: 1
27-Apr-2023 16:37:50 [PrimeGrid] General prefs: from PrimeGrid (last modified 29-Nov-2020 16:20:55)
27-Apr-2023 16:37:50 [PrimeGrid] Computer location: home


I think the above log output shows that the correct GPU device is disabled for Primegrid.

Also, task reports show that only gfx1031 is in use now.

However, jobs running under systemd fail with the OpenCL error CL_OUT_OF_HOST_MEMORY. If I run the BOINC client from the shell, as the boinc user, things seem to proceed correctly.
From the shell, as root, the client managed to chew through https://www.primegrid.com/result.php?resultid=1508328550

I'll let the client run for a while now, but at this point it looks like systemd or the unit file is causing the problem. I already tried running the BOINC client with a much higher locked-memory limit, but that did not help.
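One way to confirm which locked-memory limit a process actually received is to query it from inside that process. The sketch below uses Python's standard resource module on Linux; the function names are hypothetical, and running it once from the shell and once from a systemd-started context is only a suggested comparison, not something done in this thread.

```python
import resource


def format_limit(value: int) -> str:
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


def memlock_limits() -> tuple[str, str]:
    """Return the (soft, hard) RLIMIT_MEMLOCK of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"RLIMIT_MEMLOCK soft={soft} hard={hard}")
```

If both contexts report the same limits, as the failed attempt above suggests, the cause is more likely a sandboxing option in the unit file than a resource limit.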

Zyfdnug
Message 162352 - Posted: 8 May 2023 | 20:37:55 UTC - in response to Message 161990.

It looks like this is an issue only when running under systemd. When the client is started from the shell, both root and the boinc user can crunch their numbers properly using OpenCL.


Unfortunately, I have no reference system where I could try PrimeGrid under boinc with a different Linux distribution.

Has anybody ever needed to change systemd unit settings, and can recommend anything?

Thanks,

Zyfdnug

Zyfdnug
Message 162444 - Posted: 11 May 2023 | 2:32:10 UTC - in response to Message 162352.

I managed to resolve this -- patience, actually thinking a bit myself, and a good search engine brought me to

https://github.com/BOINC/boinc/issues/4948

which contained the actual solution here.

In short, adding

ProtectSystem=full
via
systemctl edit boinc-client.service
proved to be a useful solution.
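For reference, systemctl edit opens an editor on a drop-in file for the unit, normally /etc/systemd/system/boinc-client.service.d/override.conf. The sketch below shows what that file would contain, assuming the setting goes in the [Service] section, which is where ProtectSystem= belongs. The Python wrapper is only an illustrative sanity check of the text and is not part of the fix; the names OVERRIDE_TEXT and parse_unit are made up.

```python
import configparser

# Assumed contents of the drop-in created by `systemctl edit boinc-client.service`.
OVERRIDE_TEXT = """\
[Service]
ProtectSystem=full
"""


def parse_unit(text: str) -> configparser.ConfigParser:
    """Parse unit-file style text, keeping systemd's case-sensitive key names."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # do not lowercase keys such as ProtectSystem
    parser.read_string(text)
    return parser


if __name__ == "__main__":
    unit = parse_unit(OVERRIDE_TEXT)
    print(unit["Service"]["ProtectSystem"])
```

After saving the drop-in, the service has to be restarted (systemctl restart boinc-client.service) for the new setting to apply to the running client.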

I'll reach out to the Debian project's boinc package maintainers with this information, so the installation can either be adapted, or at least the documentation be updated.

Zyfdnug
