Message boards :
Sophie Germain Prime Search :
Why is there such a difference between these 2 run times?
Can someone explain why there is such a difference between these two workunits?
http://www.primegrid.com/workunit.php?wuid=263002613
and
http://www.primegrid.com/workunit.php?wuid=262760791 | |
|
Crun-chi Volunteer tester
Joined: 25 Nov 09 Posts: 3208 ID: 50683 Credit: 135,132,479 RAC: 57,320
|
One uses HT, the other does not.
One uses AVX, the other does not.
A user running AVX without HT will be significantly faster than one who doesn't use AVX but does use HT. And one last thing: you cannot see the clock speed of those CPUs. Maybe one is at stock speed and the other is heavily overclocked (let's say one is at 3.2 GHz and the second is at 4.2 GHz);
that alone would give a roughly 30% faster result (just from overclocking).
____________
92*10^1439761-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! | |
|
|
Both workunits were completed on the same computer.
The one using 6.13 ran faster than the one using AVX. | |
|
Crun-chi Volunteer tester
Joined: 25 Nov 09 Posts: 3208 ID: 50683 Credit: 135,132,479 RAC: 57,320
|
You can always send a PM to the user.
Why bother with assumptions when you can get a clear answer :)
|
|
Both workunits were completed on the same computer.
The one using 6.13 ran faster than the one using AVX.
The difference is less than 30%: 1160 vs 1400 seconds. It can be explained by the different versions of LLR as well as by the load on the other cores when you crunched those WUs.
Version 6.13 uses AVX, but only on AVX-capable CPUs, which I think is not the case for a first-generation Intel i7 such as yours. | |
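As a quick check of the "less than 30%" figure, the sketch below compares the two quoted run times both ways round. The function name is made up for illustration; only the 1160 and 1400 second values come from the post.

```python
def compare_run_times(fast_s: float, slow_s: float) -> tuple[float, float]:
    """Return (extra time of the slower run, time saved by the faster run).

    Both values are fractions: the first is relative to the faster run,
    the second is relative to the slower run.
    """
    slower_by = slow_s / fast_s - 1.0
    faster_by = 1.0 - fast_s / slow_s
    return slower_by, faster_by


# Run times quoted in the post, in seconds.
slower_by, faster_by = compare_run_times(1160, 1400)
print(f"slower run took {slower_by:.1%} longer; faster run saved {faster_by:.1%}")
```

Whichever run is taken as the baseline, the gap stays well below 30%.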
|
|
It is my computer and I would like to know why there is a difference in the CPU completion time. | |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1952 ID: 352 Credit: 6,016,693,611 RAC: 1,578,270
|
Similar experience on one host - http://www.primegrid.com/forum_thread.php?id=4206&nowrap=true#52235
____________
My stats | |
|
Crun-chi Volunteer tester
Joined: 25 Nov 09 Posts: 3208 ID: 50683 Credit: 135,132,479 RAC: 57,320
|
It looks like this is a problem on Intel CPUs.
On my AMD 6-core, all tasks finish within +/- 2 seconds of each other (at most).
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13956 ID: 53948 Credit: 393,160,197 RAC: 187,115
|
It is my computer and I would like to know why there is a difference in the CPU completion time.
You're comparing two different WUs, and they may run at different speeds. Usually, similar WUs run in similar times, but not always.
With your computer idle, try running those particular numbers manually (16978976538405*2^666667-1 and 16800905003685*2^666667-1), and see if the timings are reproducible. If they are, then probably what happened is that the larger of the two numbers required a larger FFT size (or a different FFT), which made the computation slower.
If they run at the same speed, then it's unlikely anyone on this forum will be able to provide you with a satisfactory answer since the behavior would be unreproducible.
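One way to check whether the timings are reproducible is to run the same command several times on an idle machine and look at the spread. The sketch below is only an illustration: the helper names are hypothetical, it measures wall-clock time only, and the placeholder command should be replaced with the real LLR invocation (the `-q` and `-d` flags in the comment are taken from the command lines posted later in this thread; the binary name is an assumption).

```python
import statistics
import subprocess
import sys
import time


def time_runs(argv, repeats=3):
    """Run a command several times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (mean, relative spread), where spread = (max - min) / mean."""
    mean = statistics.mean(durations)
    return mean, (max(durations) - min(durations)) / mean


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere. Substitute the real
    # test, e.g. ["./llr", "-q16978976538405*2^666667-1", "-d"] (assumed name).
    cmd = [sys.executable, "-c", "sum(range(10**6))"]
    mean, spread = summarize(time_runs(cmd))
    print(f"mean {mean:.3f} s, spread {spread:.1%}")
```

A small spread across repeats for each number, combined with a consistent gap between the two numbers, would point to the FFT explanation; a large spread points to something on the host.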
____________
My lucky number is 75898524288+1 | |
|
|
It is my computer and I would like to know why there is a difference in the CPU completion time.
You're comparing two different WUs, and they may run at different speeds. Usually, similar WUs run in similar times, but not always.
With your computer idle, try running those particular numbers manually (16978976538405*2^666667-1 and 16800905003685*2^666667-1), and see if the timings are reproducible. If they are, then probably what happened is that the larger of the two numbers required a larger FFT size (or a different FFT), which made the computation slower.
If they run at the same speed, then it's unlikely anyone on this forum will be able to provide you with a satisfactory answer since the behavior would be unreproducible.
You did notice that the LLR program was different, didn't you?
(3.8.8 vs 3.8.6)
Anyhow, it is a Windows issue, as you are running GPU WUs as well. I do occasionally see this with my Windows host, as it has HT only and is crunching on the GPU. I know that Windows needs all available cores on the CPU to get the best performance, but then the GPU suffers. In contrast, I find Linux more robust in this respect, as I see virtually no difference under load, as seen below with PRPNet LLR binaries.
If it still bothers you, copy your LLR binaries from the BOINC directory into a test directory and run the WU again from the command line, similar to the Linux example below.
From a very busy Linux host with no GPU and HT enabled, so times are unreliable - there are more CPU intensive tasks than available cores (I had to do 1 run as it was too different) :
$ ./llr_507 -v
Primality Testing of k*b^n+/-1 Program - Version 3.8.8
$ ./llr_507 -q"16978976538405*2^666667-1" -d
Starting Lucas Lehmer Riesel prime test of 16978976538405*2^666667-1
Using zero-padded Pentium4 type-1 FFT length 72K, Pass1=96, Pass2=768
V1 = 15 ; Computing U0...done.Starting Lucas-Lehmer loop...
16978976538405*2^666667-1, iteration : 60000 / 666667 [8.99%]. Time per iteration : 1.114 ms.
16978976538405*2^666667-1, iteration : 340000 / 666667 [50.99%]. Time per iteration : 1.088 ms.
16978976538405*2^666667-1 is not prime. LLR Res64: 9616FD7AE9FBE4B2 Time : 746.407 sec.
$ ./llr_507 -q"16800905003685*2^666667-1" -d
Starting Lucas Lehmer Riesel prime test of 16800905003685*2^666667-1
Using zero-padded Pentium4 type-1 FFT length 72K, Pass1=96, Pass2=768
V1 = 5 ; Computing U0...done.Starting Lucas-Lehmer loop...
16800905003685*2^666667-1, iteration : 70000 / 666667 [10.49%]. Time per iteration : 1.117 ms.
16800905003685*2^666667-1, iteration : 170000 / 666667 [25.49%]. Time per iteration : 1.041 ms.
16800905003685*2^666667-1, iteration : 440000 / 666667 [65.99%]. Time per iteration : 1.108 ms.
16800905003685*2^666667-1 is not prime. LLR Res64: 6C8A19FC527ADF5B Time : 747.959 sec.
$ ./llr_505 -v
Primality Testing of k*b^n+/-1 Program - Version 3.8.6
$ ./llr_505 -q"16800905003685*2^666667-1" -d
Starting Lucas Lehmer Riesel prime test of 16800905003685*2^666667-1
Using zero-padded Pentium4 type-1 FFT length 72K, Pass1=96, Pass2=768
V1 = 5 ; Computing U0...done.Starting Lucas-Lehmer loop...
16800905003685*2^666667-1, iteration : 160000 / 666667 [23.99%]. Time per iteration : 1.143 ms.
16800905003685*2^666667-1, iteration : 330000 / 666667 [49.49%]. Time per iteration : 1.128 ms.
16800905003685*2^666667-1 is not prime. LLR Res64: 6C8A19FC527ADF5B Time : 754.521 sec.
$ ./llr_505 -q"16978976538405*2^666667-1" -d
Starting Lucas Lehmer Riesel prime test of 16978976538405*2^666667-1
Using zero-padded Pentium4 type-1 FFT length 72K, Pass1=96, Pass2=768
V1 = 15 ; Computing U0...done.Starting Lucas-Lehmer loop...
16978976538405*2^666667-1, iteration : 160000 / 666667 [23.99%]. Time per iteration : 1.140 ms.
16978976538405*2^666667-1, iteration : 340000 / 666667 [50.99%]. Time per iteration : 1.135 ms.
16978976538405*2^666667-1 is not prime. LLR Res64: 9616FD7AE9FBE4B2 Time : 757.349 sec.
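To compare the two LLR versions from the transcript above, the result lines can be parsed and the total times set side by side. This is a minimal sketch: the regular expressions are written against the output format shown in this post only, and the function name is made up for illustration.

```python
import re
from collections import defaultdict

# Version banners and result lines copied from the transcript above.
TRANSCRIPT = """\
Primality Testing of k*b^n+/-1 Program - Version 3.8.8
16978976538405*2^666667-1 is not prime. LLR Res64: 9616FD7AE9FBE4B2 Time : 746.407 sec.
16800905003685*2^666667-1 is not prime. LLR Res64: 6C8A19FC527ADF5B Time : 747.959 sec.
Primality Testing of k*b^n+/-1 Program - Version 3.8.6
16800905003685*2^666667-1 is not prime. LLR Res64: 6C8A19FC527ADF5B Time : 754.521 sec.
16978976538405*2^666667-1 is not prime. LLR Res64: 9616FD7AE9FBE4B2 Time : 757.349 sec.
"""

VERSION_RE = re.compile(r"Version (\d+\.\d+\.\d+)")
RESULT_RE = re.compile(r"^(\S+) is (?:not )?prime.*Time : ([\d.]+) sec\.")


def parse_times(text):
    """Map LLR version -> {candidate: seconds} from banner and result lines."""
    times = defaultdict(dict)
    version = None
    for line in text.splitlines():
        if m := VERSION_RE.search(line):
            version = m.group(1)
        elif m := RESULT_RE.match(line):
            times[version][m.group(1)] = float(m.group(2))
    return dict(times)


times = parse_times(TRANSCRIPT)
for candidate, new in times["3.8.8"].items():
    old = times["3.8.6"][candidate]
    print(f"{candidate}: 3.8.8 {new:.3f} s, 3.8.6 {old:.3f} s, "
          f"3.8.6 slower by {old / new - 1:.2%}")
```

On these four runs the two versions differ by less than 2% for the same number, and the two numbers differ even less within one version, although the host was busy, so the figures are indicative only.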
| |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1952 ID: 352 Credit: 6,016,693,611 RAC: 1,578,270
|
The original question was "Why is there such a difference between these 2 run times?"
I would add "on the very same host".
I have taken the last 100 SGS results from the suspicious host - http://www.primegrid.com/results.php?hostid=164884&offset=0&show_names=0&state=3&appid=
CPU time ranges from 453 to 21253 secs, and the min and max of the run times are similar, a factor of 47X.
Total CPU time is 128 463 secs and total run time is 166 113 secs, a difference of around 24%. This is because the initialization phase of a task is taking more than one core, and sometimes not only during the initial phase.
The median is 553 secs for CPU time and 666 secs for run time.
Standard deviation is 2607, resp. 2765 (!).
What I would consider my *normal* host running SGS takes ~280-345 secs, a spread of around 25%; that is the natural difference in run time that stems from different WUs.
The difference between CPU time and run time is around 2%, which is good.
Standard deviation is 17, resp. 12.
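The summary figures above can be reproduced from a host's result list with a few lines of Python. The sketch below is an assumption-laden illustration: the per-task lists are hypothetical values made up to exercise the function, the function name is invented, and the sample standard deviation is used because the post does not say which variant was computed. Only 453, 21253, 128463 and 166113 are taken from the post.

```python
import statistics


def summarize(cpu_times, run_times):
    """Summary statistics of the kind quoted above for one host's results."""
    total_cpu, total_run = sum(cpu_times), sum(run_times)
    return {
        "cpu_min_max_ratio": max(cpu_times) / min(cpu_times),
        "cpu_median": statistics.median(cpu_times),
        "run_median": statistics.median(run_times),
        "cpu_stdev": statistics.stdev(cpu_times),
        "run_stdev": statistics.stdev(run_times),
        # Share of wall-clock time not accounted for as CPU time.
        "run_minus_cpu_share": (total_run - total_cpu) / total_run,
    }


# Figures quoted in the post: extremes of CPU time and the two totals.
print(f"max/min CPU time: {21253 / 453:.1f}x")
print(f"(run - CPU) / run: {(166113 - 128463) / 166113:.1%}")

# Hypothetical per-task times in seconds, only to exercise the function.
cpu = [453, 520, 553, 610, 21253]
run = [470, 600, 666, 700, 21900]
print(summarize(cpu, run))
```

A single outlier such as the 21253 second task is enough to inflate the standard deviation to thousands of seconds while leaving the median almost untouched, which is why the median is the better description of a typical task here.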
____________
My stats | |
|
pschoefer Volunteer developer Volunteer tester
Joined: 20 Sep 05 Posts: 685 ID: 845 Credit: 2,886,414,412 RAC: 77,022
|
My Q9450 with LLR64 3.8.8 stock app and Win7 x64 has stable runtimes, but on i7 980X (HT enabled) with LLR32 3.8.8 via app_info and Win7 x64 they vary a lot. Sometimes the CPU time is higher than the runtime.
I think the long startup time problem discussed in the other thread is actually the same issue.
I could imagine that this is related to GWNUM's broken hyperthreading detection (known bug in GWNUM 27.4, not fixed in 27.5), since nobody has reported those problems on non-HT CPUs and we've never seen similar problems with LLR 3.8.6/GWNUM 26.6.
____________
| |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1952 ID: 352 Credit: 6,016,693,611 RAC: 1,578,270
|
It is not an HT issue. I have it disabled since 1) my VMware edition is limited to 6 cores per CPU, and 2) it brings no benefit in performance.
____________
My stats | |
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
It is not an HT issue. I have it disabled since 1) my VMware edition is limited to 6 cores per CPU, and 2) it brings no benefit in performance.
1) VMware counts only the real cores, not the hyperthreaded ones, but uses all available cores.
2) HT only yields a performance gain if you have short instructions (instructions with short runtimes) or if different parts of a CPU can be used independently of each other.
An integer instruction on a modern Intel CPU needs, let's say, 3 cycles, but floating point (LLR) needs at least double or triple that number of cycles.
Another point is the dependence of FP on memory access. This was most visible after Intel decided to follow the AMD road and included the memory controller in the CPU (the step from Core2 to Nehalem). While integer performance increased by the usual amount (as seen with every new CPU before), floating point performance nearly doubled.
Integer instructions on BD (Bulldozer) can be executed on both parts of a module, but the FP unit is shared between both parts of a module. To reduce this limitation, AMD included a faster FP unit than the Intel counterpart. BD is able to calculate floating point instructions nearly as fast as integer instructions.
AMD's Phenoms are comparable to all Intel CPUs like Core, Core2 and Core_i, because Intel has not changed much between generations. You get new instructions, or older instructions need a different number of cycles to execute, but nothing comparable to AMD's road of development for BD.
This is not ever a false decision...
____________
Best wishes. Knowledge is power. by jjwhalen
| |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1952 ID: 352 Credit: 6,016,693,611 RAC: 1,578,270
|
1)VMware counts only the real cores not the hyperthreaded but uses all available cores.
Wasn't able to create virtual host with 8 cores anyway.
Let's keep it simple, i5-2500, 4 cores no HT, ESXi4, Windows 2008 R2 SP.
Running test standalone is fine - usage is limited to 1 core even with affinity available to all cores, runtimes around 436 and 438 secs.
primegrid_cllr.exe -q"34908861433845*2^666666+1"
primegrid_cllr.exe -q"34909025081535*2^666666+1"
The very same tests under BOINC 7.0.23 (using the same executable, same host) are already taking 25 minutes and counting, using more than a single core.
http://www.primegrid.com/workunit.php?wuid=264705378
http://www.primegrid.com/workunit.php?wuid=264705306
____________
My stats | |
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
1)VMware counts only the real cores not the hyperthreaded but uses all available cores.
Wasn't able to create virtual host with 8 cores anyway.
Sorry, I forgot that you have vSphere4. My links in the other thread point to vSphere5...
With vSphere4 you need an Enterprise Plus licence and a host with at least 8 cores (real or HT cores, it does not matter) if you want to have vSMP=8 in a VM. All lower vSphere4 editions, like Hypervisor (ESXi with free licence), ESXi (ESXi with paid licence), Standard, Advanced or Enterprise, have vSMP=4.
Let's keep it simple, i5-2500, 4 cores no HT, ESXi4, Windows 2008 R2 SP.
vSphere4 needs at least 2GB RAM for installation. After installation, ESXi consumes around 1GB RAM and wants to have one free core, or at least some free CPU cycles, for its own work.
Please be aware that a VM with vSMP>1 only gets computation time when the same number of cores is free, or when the VM gets a bonus via round-robin scheduling because 4 cores were not free for your VM with vSMP=4 in the last scheduling cycle.
You will see faster computation times and/or smaller differences in computation times on a quad-core host if you limit vSMP to 2 or 1 for your W2k8-R2 VM.
...or is your W2k8-R2 VM the only one running on this host?
____________
Best wishes. Knowledge is power. by jjwhalen
| |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1952 ID: 352 Credit: 6,016,693,611 RAC: 1,578,270
|
Yes, it runs only a single VM with W2K8R2, so this VM should have all CPU resources available. The host has 16GB RAM, with 14GB assigned to the only VM.
(It's basically a test environment with real applications but low CPU usage, or a prepared VM environment for times of need, when the HP server goes down or when a different VM needs to be tested.)
As far as I can see, the odd run times and behaviour occur with these features:
1. CPU with AVX
2. ESXi 4.1 (non AVX aware)
3. OS that is AVX aware (W2K8R2 SP)
Only the combination of these brings the problem with 3.8.8; version 3.8.6 is fine.
The other host with the same HW and the same ESXi, but with a non-AVX-aware OS (Win 2003), is OK.
Yet another host with the same HW, W2K8R2SP on bare metal, with AVX, is doing great.
So it is not hardware specific.
____________
My stats | |
|