|
1)
Message boards :
Problems and Help :
AMD ROCm CL_OUT_OF_HOST_MEMORY when running through BOINC
(Message 162444)
Posted 30 days ago by Zyfdnug
I managed to resolve this -- patience, actually thinking a bit myself, and a good search engine brought me to
https://github.com/BOINC/boinc/issues/4948
which contained the actual solution here.
In short, adding ProtectSystem=full via systemctl edit boinc-client.service proved to be a useful solution.
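For readers who have not used systemctl edit before: it stores the change as a drop-in file next to the packaged unit instead of modifying the unit itself. The sketch below only renders what such a drop-in could look like. The helper names are made up for illustration, and the path assumes the default location systemd uses for overrides created by systemctl edit.

```python
from pathlib import Path

UNIT = "boinc-client.service"  # unit name as given in the post


def dropin_path(unit: str, root: Path = Path("/etc/systemd/system")) -> Path:
    """Default location where `systemctl edit <unit>` stores its override."""
    return root / f"{unit}.d" / "override.conf"


def render_dropin(settings: dict) -> str:
    """Render a [Service] drop-in from key/value pairs (hypothetical helper)."""
    lines = ["[Service]"]
    lines += [f"{key}={value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the file name as a comment, followed by the drop-in body.
    print(f"# {dropin_path(UNIT)}")
    print(render_dropin({"ProtectSystem": "full"}), end="")
```

After restarting the service, systemctl show -p ProtectSystem boinc-client.service should report the value that is actually in effect.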
I'll reach out to the Debian project's boinc package maintainers with this information, so the installation can either be adapted, or at least the documentation be updated.
Zyfdnug
|
2)
Message boards :
Problems and Help :
AMD ROCm CL_OUT_OF_HOST_MEMORY when running through BOINC
(Message 162352)
Posted 32 days ago by Zyfdnug
Looks like this is an issue only when running under systemd. Both root and the boinc client user can, when the client is started from the shell, properly crunch their numbers using OpenCL.
Unfortunately, I have no reference system where I could try PrimeGrid under boinc with a different Linux distribution.
Has anybody ever needed to change systemd unit settings, and can recommend anything?
Thanks,
Zyfdnug
|
3)
Message boards :
Problems and Help :
SGS: execv failed twice: Text file busy
(Message 162337)
Posted 32 days ago by Zyfdnug
Looks like this is indeed resolved after waiting / restarting.
I'm still a bit surprised that this affected only a subset of the tasks, but have not checked if they had something -- most likely the slot they were assigned to -- in common.
I guess this question can reasonably well be considered fully answered -- thanks!
Zyfdnug
|
4)
Message boards :
Problems and Help :
SGS: execv failed twice: Text file busy
(Message 161992)
Posted 43 days ago by Zyfdnug
I've noticed a considerable number of Sophie Germain tasks fail with stderr such as
BOINC llr wrapper (version 8.04)
Using Jean Penne's llr (64 bit)
execl failed once: Text file busy
execl failed twice: Text file busy
Error reading the LLR version number, continuing...
LLR command line: primegrid_llr -d -oDiskWriteTime=1 -oThreadsPerTest=4 llr.in
execv failed once: Text file busy
execv failed twice: Text file busy
app error: 27648
20:07:43 (128522): called boinc_finish(27648)
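On Linux, "Text file busy" (ETXTBSY) is what execve returns when the executable is still open for writing by some process. The "failed once" / "failed twice" lines suggest the wrapper retries before giving up. The following is a minimal sketch of that retry pattern, not the wrapper's actual code; the function name, attempt count and delay are assumptions chosen for illustration.

```python
import errno
import time


def run_with_retry(action, attempts=3, delay=1.0, sleep=time.sleep):
    """Call action(); retry only while it fails with ETXTBSY.

    Any other OSError, or ETXTBSY on the last attempt, is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError as exc:
            if exc.errno != errno.ETXTBSY or attempt == attempts:
                raise
            # Give whoever is still writing the binary time to close it.
            sleep(delay)
```

In real use, action would be something like a subprocess.Popen call that starts the worker binary. Retrying only helps if the writer closes the file eventually, which would match the errors disappearing after waiting.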
I noticed these situations only today, while running boinc and Primegrid in a non-standard way.
However, this affects only a fraction of all SGS jobs:
https://www.primegrid.com/results.php?userid=1196790&offset=0&show_names=0&state=0&appid=2
Also, this seems to happen with only one host, which is a system I'm currently deploying and which is, software-wise, not in a particularly good state.
This, however, does not explain this particular issue -- first, only a selection of tasks is affected. Second, it's an error that I would not attribute to broken hardware; it looks more like a software issue, with boinc wrapping Primegrid, which in turn manages binaries for different projects and potentially updates the actual worker binaries occasionally.
On the other hand, I have seen such exec*() errors very rarely, and would be quite astonished to find that the actual worker software is that frequently updated today.
The fact that those issues apparently affected this particular host only, starting today at 14:00 UTC, correlates with me starting the boinc client, as boinc user, in a terminal. I can not claim to see a potential cause for that, but it *is* an interesting coincidence.
Has anybody noticed similar errors recently, or can suggest what I could do to further analyze things?
Best,
Zyfdnug
|
5)
Message boards :
Problems and Help :
AMD ROCm CL_OUT_OF_HOST_MEMORY when running through BOINC
(Message 161990)
Posted 43 days ago by Zyfdnug
The GPU exclusion via configuration is indeed what I tried next.
I have
coproc_info.xml:
<coprocs>
<ati_opencl>
<name>AMD Radeon RX 6750 XT</name>
<vendor>Advanced Micro Devices, Inc.</vendor>
<vendor_id>4098</vendor_id>
<available>1</available>
<half_fp_config>0</half_fp_config>
<single_fp_config>191</single_fp_config>
<double_fp_config>63</double_fp_config>
<endian_little>1</endian_little>
<execution_capabilities>1</execution_capabilities>
<extensions>cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program </extensions>
<global_mem_size>12868124672</global_mem_size>
<local_mem_size>65536</local_mem_size>
<max_clock_frequency>2880</max_clock_frequency>
<max_compute_units>20</max_compute_units>
<nv_compute_capability_major>0</nv_compute_capability_major>
<nv_compute_capability_minor>0</nv_compute_capability_minor>
<amd_simd_per_compute_unit>4</amd_simd_per_compute_unit>
<amd_simd_width>32</amd_simd_width>
<amd_simd_instruction_width>1</amd_simd_instruction_width>
<opencl_platform_version>OpenCL 2.1 AMD-APP (3513.0)</opencl_platform_version>
<opencl_device_version>OpenCL 2.0 </opencl_device_version>
<opencl_driver_version>3513.0 (HSA1.1,LC)</opencl_driver_version>
<device_num>0</device_num>
<peak_flops>14745600000000.000000</peak_flops>
<opencl_available_ram>12868124672.000000</opencl_available_ram>
<opencl_device_index>0</opencl_device_index>
<warn_bad_cuda>0</warn_bad_cuda>
</ati_opencl>
<ati_opencl>
<name>gfx1036</name>
<vendor>Advanced Micro Devices, Inc.</vendor>
<vendor_id>4098</vendor_id>
<available>1</available>
<half_fp_config>0</half_fp_config>
<single_fp_config>191</single_fp_config>
<double_fp_config>63</double_fp_config>
<endian_little>1</endian_little>
<execution_capabilities>1</execution_capabilities>
<extensions>cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program </extensions>
<global_mem_size>536870912</global_mem_size>
<local_mem_size>65536</local_mem_size>
<max_clock_frequency>2200</max_clock_frequency>
<max_compute_units>1</max_compute_units>
<nv_compute_capability_major>0</nv_compute_capability_major>
<nv_compute_capability_minor>0</nv_compute_capability_minor>
<amd_simd_per_compute_unit>4</amd_simd_per_compute_unit>
<amd_simd_width>32</amd_simd_width>
<amd_simd_instruction_width>1</amd_simd_instruction_width>
<opencl_platform_version>OpenCL 2.1 AMD-APP (3513.0)</opencl_platform_version>
<opencl_device_version>OpenCL 2.0 </opencl_device_version>
<opencl_driver_version>3513.0 (HSA1.1,LC)</opencl_driver_version>
<device_num>1</device_num>
<peak_flops>563200000000.000000</peak_flops>
<opencl_available_ram>536870912.000000</opencl_available_ram>
<opencl_device_index>1</opencl_device_index>
<warn_bad_cuda>0</warn_bad_cuda>
</ati_opencl>
<warning>NVIDIA: libcuda.so: cannot open shared object file: No such file or directory</warning>
<warning>ATI: libaticalrt.so: cannot open shared object file: No such file or directory</warning>
</coprocs>
cc_config.xml:
<cc_config>
<log_flags>
<task>1</task>
<file_xfer>1</file_xfer>
<sched_ops>1</sched_ops>
</log_flags>
<options>
<exclude_gpu>
<url>http://www.primegrid.com/</url>
<device_num>1</device_num>
</exclude_gpu>
</options>
</cc_config>
The coproc_info file should make it clear what the id numbers refer to.
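To read the device numbers off without scanning the whole file by eye, something like the following sketch could be used. It assumes the file parses as plain XML with the element names shown above; the SAMPLE string is a shortened copy of the listing, and list_devices is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the coproc_info.xml listing above (most fields omitted).
SAMPLE = """<coprocs>
  <ati_opencl>
    <name>AMD Radeon RX 6750 XT</name>
    <global_mem_size>12868124672</global_mem_size>
    <device_num>0</device_num>
  </ati_opencl>
  <ati_opencl>
    <name>gfx1036</name>
    <global_mem_size>536870912</global_mem_size>
    <device_num>1</device_num>
  </ati_opencl>
</coprocs>"""


def list_devices(xml_text):
    """Map device_num -> (name, global memory in MiB) for each ati_opencl entry."""
    root = ET.fromstring(xml_text)
    devices = {}
    for node in root.iter("ati_opencl"):
        num = int(node.findtext("device_num"))
        name = node.findtext("name").strip()
        mem_mib = int(node.findtext("global_mem_size")) // (1024 * 1024)
        devices[num] = (name, mem_mib)
    return devices


if __name__ == "__main__":
    for num, (name, mem_mib) in sorted(list_devices(SAMPLE).items()):
        print(f"device_num {num}: {name} ({mem_mib} MiB)")
```

The small 512 MiB device is the CPU-integrated one, so device_num 1 is the one to put into exclude_gpu.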
This is what I observe now:
root@Zwerg:/var/lib/boinc# sudo -u boinc boinc
27-Apr-2023 16:37:46 [---] Starting BOINC client version 7.20.5 for x86_64-pc-linux-gnu
27-Apr-2023 16:37:46 [---] This a development version of BOINC and may not function properly
27-Apr-2023 16:37:46 [---] log flags: file_xfer, sched_ops, task
27-Apr-2023 16:37:46 [---] Libraries: libcurl/7.88.1 OpenSSL/3.0.8 zlib/1.2.13 brotli/1.0.9 zstd/1.5.4 libidn2/2.3.3 libpsl/0.21.2 (+libidn2/2.3.3) libssh2/1.10.0 nghttp2/1.52.0 librtmp/2.3
27-Apr-2023 16:37:46 [---] Data directory: /var/lib/boinc-client
27-Apr-2023 16:37:50 [---] OpenCL: AMD/ATI GPU 0: AMD Radeon RX 6750 XT (driver version 3513.0 (HSA1.1,LC), device version OpenCL 2.0, 12272MB, 12272MB available, 14746 GFLOPS peak)
27-Apr-2023 16:37:50 [---] OpenCL: AMD/ATI GPU 1 (ignored by config): gfx1036 (driver version 3513.0 (HSA1.1,LC), device version OpenCL 2.0, 512MB, 512MB available, 563 GFLOPS peak)
27-Apr-2023 16:37:50 [---] libc: version 2.36
27-Apr-2023 16:37:50 [---] Host name: Zwerg
27-Apr-2023 16:37:50 [---] Processor: 24 AuthenticAMD AMD Ryzen 9 7900X 12-Core Processor [Family 25 Model 97 Stepping 2]
27-Apr-2023 16:37:50 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi u
27-Apr-2023 16:37:50 [---] OS: Linux Debian: Debian GNU/Linux 12 (bookworm) [6.1.0-7-amd64|libc 2.36]
27-Apr-2023 16:37:50 [---] Memory: 61.94 GB physical, 976.00 MB virtual
27-Apr-2023 16:37:50 [---] Disk: 464.48 GB total, 438.70 GB free
27-Apr-2023 16:37:50 [---] Local time is UTC +2 hours
27-Apr-2023 16:37:50 [---] Config: GUI RPCs allowed from:
27-Apr-2023 16:37:50 [---] 192.168.0.1
27-Apr-2023 16:37:50 [---] Zwerg.redacteddomainname
27-Apr-2023 16:37:50 [PrimeGrid] Config: excluded GPU. Type: all. App: all. Device: 1
27-Apr-2023 16:37:50 [PrimeGrid] General prefs: from PrimeGrid (last modified 29-Nov-2020 16:20:55)
27-Apr-2023 16:37:50 [PrimeGrid] Computer location: home
I think the above log output shows that the correct GPU device is disabled for Primegrid.
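The same check can be done mechanically. The sketch below pulls the GPU lines out of a startup log and reports which devices are marked "(ignored by config)". The pattern is written against the two OpenCL lines in the log above only; other BOINC versions or GPU vendors may format these lines differently, and gpu_status is a made-up helper name.

```python
import re

# Matches lines such as
#   ... OpenCL: AMD/ATI GPU 1 (ignored by config): gfx1036 (driver version ...
GPU_LINE = re.compile(
    r"OpenCL: AMD/ATI GPU (\d+)( \(ignored by config\))?: (.+?) \(driver version"
)

# Two lines taken from the log above, cut off after the driver version.
SAMPLE_LOG = """\
27-Apr-2023 16:37:50 [---] OpenCL: AMD/ATI GPU 0: AMD Radeon RX 6750 XT (driver version 3513.0 (HSA1.1,LC)
27-Apr-2023 16:37:50 [---] OpenCL: AMD/ATI GPU 1 (ignored by config): gfx1036 (driver version 3513.0 (HSA1.1,LC)
"""


def gpu_status(log_text):
    """Map device number -> (device name, True if ignored by config)."""
    status = {}
    for match in GPU_LINE.finditer(log_text):
        ignored = match.group(2) is not None
        status[int(match.group(1))] = (match.group(3), ignored)
    return status


if __name__ == "__main__":
    for num, (name, ignored) in sorted(gpu_status(SAMPLE_LOG).items()):
        print(f"GPU {num}: {name} -> {'ignored' if ignored else 'in use'}")
```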
Also, task reports show that only gfx1031 is in use now.
However, jobs running under systemd fail with opencl error: CL_OUT_OF_HOST_MEMORY. If I run the boinc client on the shell, as the boinc user, things seem to proceed correctly.
Started from the shell as the root user, the client managed to chew through https://www.primegrid.com/result.php?resultid=1508328550
I'll leave the client running for a while now, but at this time it looks like systemd or the unit file introduces some problem. I already tried running the boinc client with a much increased locked-memory limit, but that did not lead to success.
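One way to compare the two environments is to print the resource limits a process actually sees, once when started from the shell and once from a systemd unit. The sketch below uses Python's standard resource module on Linux. The selection of limits is an assumption about what might matter here, and sandboxing directives in the unit file do not show up this way.

```python
import resource

# Limits that plausibly matter for a GPU compute job (an assumption).
LIMITS = {
    "RLIMIT_MEMLOCK": resource.RLIMIT_MEMLOCK,  # locked memory
    "RLIMIT_AS": resource.RLIMIT_AS,            # address space
    "RLIMIT_NOFILE": resource.RLIMIT_NOFILE,    # open file descriptors
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def snapshot():
    """Return {limit name: (soft, hard)} for the current process."""
    out = {}
    for name, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        out[name] = (fmt(soft), fmt(hard))
    return out


if __name__ == "__main__":
    for name, (soft, hard) in snapshot().items():
        print(f"{name}: soft={soft} hard={hard}")
```

If both environments print the same values, the limits can be ruled out and the remaining unit settings become the more likely suspects.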
|
6)
Message boards :
Problems and Help :
AMD ROCm CL_OUT_OF_HOST_MEMORY when running through BOINC
(Message 161959)
Posted 44 days ago by Zyfdnug
In this case, both devices are handled by the same vendor's software, so disabling a vendor for OpenCL would be counterproductive ;-)
I tried disabling the CPU-integrated device in boinc's cc_config.xml file. The boinc log reports it as disabled, but also reports that the exclude_gpu tag is not recognized, and Primegrid's software still uses the device when run through boinc.
So I tried running the boinc client in the foreground (not through systemd), and got very promising results: https://www.primegrid.com/result.php?resultid=1508473533
Getting this behaviour controlled through configuration seems to be a bit tricky ;-)
|
7)
Message boards :
Problems and Help :
AMD ROCm CL_OUT_OF_HOST_MEMORY when running through BOINC
(Message 161956)
Posted 45 days ago by Zyfdnug
That is quite interesting... first, I was not aware this CPU had an integrated graphics device.
Also, if I start the binary from the shell, as root user, it uses a different device (which would be the discrete Radeon card):
# /var/lib/boinc-client/projects/www.primegrid.com/genefer22g_linux64_22.12.02 -p -n 22 -b 1053460 -f gproof
geneferg version 22.12.2 (linux x64, gcc-7.5.0, boinc-7.20.2)
Copyright (c) 2022, Yves Gallot
genefer is free source code, under the MIT license.
Command line: '-p -n 22 -b 1053460 -f gproof'
Running on device 'gfx1031', vendor 'Advanced Micro Devices, Inc.', version 'OpenCL 2.0 ', driver '3513.0 (HSA1.1,LC)', data size: 96 MB.
Resuming from a checkpoint.
7.58% done, 26:07:20 remaining, 1.21 ms/bit.
I did not notice the different device identifiers (and if I did, I wouldn't be able to interpret them anyway ;-)
So it appears that, for whatever reason, the binary picks different OpenCL devices to work with when called from the shell and from the boinc manager.
My question then is -- how can I tell boinc or PrimeGrid which GPU to use for OpenCL?
|
8)
Message boards :
Problems and Help :
AMD ROCm CL_OUT_OF_HOST_MEMORY when running through BOINC
(Message 161926)
Posted 46 days ago by Zyfdnug
I have the same problem here.
This is a new computer I'm currently setting up, and the software is in a somewhat messy state after experimenting with different ways to get OpenCL up and running at all.
Now I have the AMD OpenCL stack in usable shape, as far as I can see, and the result is as can be seen here: https://www.primegrid.com/result.php?resultid=1507777023
I doubt it's a real out-of-memory situation, as (without a GPU task running):
# LANG=C free -m
               total        used        free      shared  buff/cache   available
Mem:           63427        5956       52301          25        5903       57471
Swap:            975           0         975
I would appreciate any hint about what I can do to fix this.
Thanks,
Zyfdnug
|
9)
Message boards :
Problems and Help :
GPU Errors on new Linux Mint install
(Message 161925)
Posted 46 days ago by Zyfdnug
I'm currently struggling with the same issue. It's not really resolved yet, but one thing that got me further was
- installing the AMD-provided OpenCL software (messy, as I needed to install Ubuntu software on a Debian testing system, which left me with a slightly incorrect state, I think)
- uninstalling the Mesa OpenCL packages and removing the respective configuration, in particular from /etc/OpenCL/vendors
After that, neither clinfo nor the Primegrid software caused this particular error any more.
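To see which OpenCL implementations are still registered after such a cleanup, the ICD directory can be listed. On Linux the ICD loader conventionally reads *.icd files from /etc/OpenCL/vendors, each naming the library to load. The sketch below just prints them; list_icds is a made-up helper name, and the file names that show up depend on what is installed.

```python
from pathlib import Path


def list_icds(vendors_dir="/etc/OpenCL/vendors"):
    """Map each *.icd file name to the library it points at.

    Returns an empty dict if the directory does not exist.
    """
    directory = Path(vendors_dir)
    if not directory.is_dir():
        return {}
    return {
        icd.name: icd.read_text().strip()
        for icd in sorted(directory.glob("*.icd"))
    }


if __name__ == "__main__":
    icds = list_icds()
    if not icds:
        print("no ICD files found")
    for name, library in icds.items():
        print(f"{name}: {library}")
```

A leftover Mesa entry next to the AMD one would be a hint that both stacks are still being offered to applications.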
However, there's more to fix, probably...
Zyfdnug
|
10)
Message boards :
Problems and Help :
Radeon RX 6700XT Error on GFN-15 trough GFN-19
(Message 152702)
Posted 539 days ago by Zyfdnug
Indeed... works here, too.
Z
|