Message boards :
Generalized Cullen/Woodall prime search :
How to speed up (some) GCW units by 10%+
serge
Joined: 21 Jun 12 Posts: 112 ID: 144858 Credit: 254,995,991 RAC: 86,449

Nothing of what I am about to write is new. Or difficult.
And it has already been discussed multiple times.
Observe that some of the chosen bases for this particular subproject happen to be squares.
25 = 5^2, 49 = 7^2, 121 = 11^2. Some people even know that this was a deliberate choice. I've been waiting for a workunit to arrive for one of these bases for a while, and now I have one.
The candidate is 754806*121^754806+1 which can also be easily regrouped as 754806*11^1509612+1.
Let's compare:
/home/serge/NumTheory/GCW> llr -d d.npg
Base prime factor(s) taken : 11
Starting N-1 prime test of 754806*121^754806+1
Using zero-padded AVX FFT length 720K, Pass1=320, Pass2=2304, a = 3
754806*121^754806+1, bit: 70000 / 5222413 [1.34%]. Time per bit: 5.654 ms.
/home/serge/NumTheory/GCW/2> llr -d d2.npg
Base prime factor(s) taken : 11
Starting N-1 prime test of 754806*11^1509612+1
Using zero-padded AVX FFT length 640K, Pass1=640, Pass2=1K, a = 3
754806*11^1509612+1, bit: 40000 / 5222416 [0.76%]. Time per bit: 4.698 ms.
That's 20% faster. (Results are similar for AVX2.)
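
As a rough check of those figures, the two "Time per bit" values can be turned into a throughput ratio and a projected run time. This is only a sketch in Python using the numbers printed in the two logs; it assumes the time per bit stays constant for the whole test, which the logs do not guarantee.

```python
# Figures copied from the two LLR log lines above.
runs = {
    "b=121": {"bits": 5222413, "ms_per_bit": 5.654},
    "b=11": {"bits": 5222416, "ms_per_bit": 4.698},
}


def projected_hours(bits, ms_per_bit):
    """Projected wall time in hours, assuming a constant time per bit."""
    return bits * ms_per_bit / 1000.0 / 3600.0


# How many times more bits per second the b=11 form processes.
speed_ratio = runs["b=121"]["ms_per_bit"] / runs["b=11"]["ms_per_bit"]
# The same gain expressed as a fraction of run time saved.
time_saved = 1.0 - 1.0 / speed_ratio

for name, r in runs.items():
    print(f"{name}: about {projected_hours(**r):.1f} h")
print(f"throughput ratio: {speed_ratio:.3f}")
print(f"run time saved:   {time_saved:.1%}")
```

Note that "20% faster" here is a throughput figure; the corresponding reduction in run time is somewhat smaller.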
Why is the server sending this workunit as 754806*121^754806+1?
Isn't it trivial, server-side, to fetch the candidate from the database and, if b=x^2, send it to the client not as
100000000000000:P:1:b:1
n n
but as
100000000000000:P:1:x:1
n 2n
In this particular case:
100000000000000:P:1:11:1
754806 1509612
Too hard to implement?
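
A minimal sketch of that rewrite, in Python. The function names `rewrite_square_base` and `newpgen_lines` are made up for illustration, and the two-line layout simply copies the example in this post; nothing here is PrimeGrid's actual server code. It only handles a single square root, so b=625 would become 25, not 5.

```python
import math


def rewrite_square_base(k, b, n):
    """Return (k, x, 2n) if b == x*x, otherwise (k, b, n) unchanged."""
    x = math.isqrt(b)
    if x > 1 and x * x == b:
        # k * (x^2)^n == k * x^(2n), so the number itself is unchanged.
        return k, x, 2 * n
    return k, b, n


def newpgen_lines(sieve_depth, k, b, n):
    """Format the candidate k*b^n+1 as the header line plus a 'k n' line."""
    k, b, n = rewrite_square_base(k, b, n)
    return [f"{sieve_depth}:P:1:{b}:1", f"{k} {n}"]


print("\n".join(newpgen_lines(100000000000000, 754806, 121, 754806)))
```

For the candidate above this reproduces the two lines given in the post; a non-square base passes through untouched.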
 

Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 279,473,760 RAC: 120,265

Thanks, Serge.
Do you know why LLR chooses different FFT sizes for the same number? I was not aware it would do that.
____________
My lucky number is 75898^{524288}+1  

serge
Joined: 21 Jun 12 Posts: 112 ID: 144858 Credit: 254,995,991 RAC: 86,449

For LLR it is not the number that matters, but the (k,b,n,c) form, and b is taken by it verbatim, as given.
I am not ready to go into a very deep explanation, but I will try to give an approximation of one (not meant to be taken as exact). With b=11, it is possible to form an array of length 640K where each element is a quasi-digit ("limb") in an unusual representation: some digits' weights are perhaps 11^6 and some digits' weights are 11^7, or something like that. (Off the top of my head, what I remember is that each limb on average is limited to keeping ~30 bits of information, or something like that.) Only powers of b can be used as limb weights.
In contrast, if the number is entered with b=121, the program can only work with limbs of, say, 121^3 and 121^2. (The larger the b, the fewer opportunities it has to pack.) For that reason it ponders the array of length 640K and thinks, "nah, some elements will be too large; gotta go for the next FFT size", and so it does.
Long story short, using the simplest possible (k,b,n,c) (with b as low as possible) leaves more possibilities for denser FFT arrays. If not only b=121 but k is also divisible by 11, the FFT size may be even smaller if (k,121,n,c) is transformed into (k/11,11,2*n+1,c).
As an aside, yes, it would be nice if LLR did it all itself, but as the timing test (shown earlier) demonstrates, it doesn't. But here's where we can help LLR and do the transformation externally, server-side. The transformation logic is quite straightforward.

Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 279,473,760 RAC: 120,265

Got it, thanks for the explanation. That makes sense.
____________
My lucky number is 75898^{524288}+1  

mackerel (Volunteer tester)
Joined: 2 Oct 08 Posts: 2533 ID: 29980 Credit: 492,113,674 RAC: 58,124

Maybe propose this optimisation as a feature request for a future LLR, if not already done?


Observe that some of the chosen bases for this particular subproject happen to be squares.
25 = 5^2, 49 = 7^2, 121 = 11^2. Some people even know that this was a deliberate choice.
25, 49, 121; that is all the prime squares in the range 13 ≤ b ≤ 121. I wonder if there is some easy reason why n*b^n + 1 is more often composite when b is a perfect square. Does sieving remove a larger fraction, so that the expected occurrence of primes is lower for these b values?
"Deliberate choice"? I thought these b values were chosen simply because they were the smallest b for which no known n with n > b-2 gives a prime n*b^n + 1?
/JeppeSN
Addition: I checked Steven Harvey's page on GC, and for all prime-square b among 121, 169, 289, 361, 529, 841, 961, 1369, 1681, 1849, 2209, 2809, 3481, 3721, 4489, 5041, 5329, 6241, 6889, 7921, 9409, the only time an n is known that satisfies n > b-2 is for b=5041, where:
8398*5041^8398 + 1 = 8398*71^16796 + 1
is a prime.  

Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 279,473,760 RAC: 120,265

Serge, thanks for the optimization tip. It's appreciated!
By the way, we (and by "we", I mean "Jim") have recomputed many of the FFT sizes for the candidates. You don't always get a reduced FFT size when you use the square root of b, but you do for the vast majority of candidates. For the other bases, dividing k by b one or more times hasn't reduced the FFT size once yet.
____________
My lucky number is 75898^{524288}+1  

serge
Joined: 21 Jun 12 Posts: 112 ID: 144858 Credit: 254,995,991 RAC: 86,449

It seems that in your client-server setup the ideal place to put the reformatter would be the primegrid_llr_wrapper. It would keep the initial task parameters, reformat for (c)llr, get the result back from (c)llr, and report back to the server as initially requested. Then the database, the server and the accounting code would be unchanged.
primegrid_llr_wrapper for now can do only:
▪ square simplification,
▪ k simplification
Later, it can be extended to recognize b being any power. See here 
Curiously, these numbers may be hard to recognize when written in standard form (emphasis mine).
For example, they may be like
18740*3^168662-1
which could be written
168660*3^168660-1.
More difficult to spot are those like the following:
9750*7^29250-1 = 9750*7^(3*9750)-1 = 9750*343^9750-1
8511*2^374486-1 = (8511*2^2)*2^(11*8511*4)-1 = 34044*2048^34044-1.
This is in fact how the GCWs for 25, 49, 121 will end up showing in UTM lists. (And this is how GW for b=4 looks, indeed.)  
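
The regroupings quoted above can be checked exactly with big-integer arithmetic. The Python sketch below only confirms that both sides of each identity are the same number; it does not search for such hidden forms. The tuple layout (k, b, n, m, B) is an illustrative convention, not anything from the quoted page.

```python
# Each tuple is (k, b, n, m, B). The claim is k*b^n == m*B^m, so that
# k*b^n - 1 is the generalized Woodall number m*B^m - 1.
claims = [
    (18740, 3, 168662, 168660, 3),
    (9750, 7, 29250, 9750, 343),
    (8511, 2, 374486, 34044, 2048),
]


def same_number(k, b, n, m, B):
    """Compare the two products exactly using Python's big integers."""
    return k * b**n == m * B**m


for k, b, n, m, B in claims:
    print(f"{k}*{b}^{n}-1 == {m}*{B}^{m}-1 : {same_number(k, b, n, m, B)}")
```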

Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 279,473,760 RAC: 120,265

It seems that in your client-server setup the ideal place to put the reformatter would be the primegrid_llr_wrapper.
That's not the preferred place for the change, but we're still evaluating options.
____________
My lucky number is 75898^{524288}+1  

axn (Volunteer developer)
Joined: 29 Dec 07 Posts: 285 ID: 16874 Credit: 28,027,106 RAC: 0

I could've sworn that LLR itself does the normalizing of the bases (perhaps only for power of 2?). This feature needs to be in LLR itself, tbh.
1. Normalize b if it is a power.
2. Normalize k if b divides k.
I would guess that it is a trivial change in LLR (except for printing output, where it is arguably important to use the unnormalized values).
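
The two steps listed above might look roughly like this in Python. This is a sketch under stated assumptions, not LLR code: `integer_root` and `normalize` are hypothetical helper names, and the root search is a plain binary search that is only meant for small bases.

```python
def integer_root(b, e):
    """Return r with r**e == b if such an integer exists, else None."""
    lo, hi = 1, b
    while lo <= hi:
        mid = (lo + hi) // 2
        p = mid**e
        if p == b:
            return mid
        if p < b:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


def normalize(k, b, n):
    """Rewrite k*b^n as k'*b'^n' with b' not a perfect power and b' not dividing k'."""
    if k < 1 or b < 2:
        raise ValueError("expected k >= 1 and b >= 2")
    # Step 1: try the largest exponent first, so the root found is the
    # smallest possible base; scale n to keep the value unchanged.
    for e in range(b.bit_length(), 1, -1):
        r = integer_root(b, e)
        if r is not None:
            b, n = r, n * e
            break
    # Step 2: move factors of the (reduced) base from k into the exponent.
    while k % b == 0:
        k, n = k // b, n + 1
    return k, b, n


print(normalize(754806, 121, 754806))
print(normalize(22, 121, 5))
```

The unnormalized (k, b, n) would still need to be kept alongside for printing and reporting, as noted above.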

serge
Joined: 21 Jun 12 Posts: 112 ID: 144858 Credit: 254,995,991 RAC: 86,449

Maybe LLR's philosophy is "the client is always right!"
I.e.: If the input file calls for a test of a specific FFT or a "specific arrangement of bits", then that's what it will run (even if slower, because "this is the test that was ordered").
But it indeed doesn't follow this rule for powers of 2.
bash-4.2$ llr -d -q"27*1024^10007+1"
Starting Proth prime test of 27*2^100070+1
Using all-complex FMA3 FFT length 10K, Pass1=128, Pass2=80, a = 11
27*2^100070+1 is not prime. Proth RES64: C14E6737D2E78E5E Time : 5.261 sec.
bash-4.2$ llr -d -q"28*729^10007+1"
Base prime factor(s) taken : 3
Starting N-1 prime test of 28*729^10007+1
Using all-complex FMA3 FFT length 10K, Pass1=128, Pass2=80, a = 3
28*729^10007+1 is not prime. RES64: E83080E955E9B281. OLD64: B89182BC01BD1780 Time : 4.888 sec.
bash-4.2$ llr -d -q"28*10000^10007+1"
Base factorized as : 2^4*5^4
Base prime factor(s) taken : 5
Starting N-1 prime test of 28*10000^10007+1
Using all-complex FMA3 FFT length 18K, Pass1=384, Pass2=48, a = 3
28*10000^10007+1 is not prime. RES64: 59CCA66A39ED54C4. OLD64: 0D65F33EADC7FE48 Time : 13.645 sec.
(And of course it is fully equipped to normalize the base, as a side effect of factoring the base for the purposes of the N-1 mechanics.)
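
To illustrate why a factored base makes normalization nearly free: once b is written as a product of prime powers, the gcd of the exponents gives the largest e with b = r^e. The Python sketch below uses trial division, which is an assumption chosen for brevity and not a description of what LLR does internally; `factorize` and `root_from_factors` are illustrative names.

```python
from functools import reduce
from math import gcd


def factorize(b):
    """Trial-division factorization; adequate for small bases like these."""
    if b < 2:
        raise ValueError("expected b >= 2")
    factors = {}
    p = 2
    while p * p <= b:
        while b % p == 0:
            factors[p] = factors.get(p, 0) + 1
            b //= p
        p += 1
    if b > 1:
        factors[b] = factors.get(b, 0) + 1
    return factors


def root_from_factors(b):
    """Return (r, e) with r**e == b and e as large as possible."""
    factors = factorize(b)
    # b is a perfect e-th power exactly when e divides every exponent.
    e = reduce(gcd, factors.values())
    r = 1
    for p, a in factors.items():
        r *= p ** (a // e)
    return r, e


for base in (1024, 729, 10000, 121):
    print(base, factorize(base), root_from_factors(base))
```

With (r, e) in hand, k*b^n can be handed to the test as k*r^(e*n).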
PFGW does what it is ordered by the input file, too.
 


And GeneFer seems to do different things, not normalizing or denormalizing:
.\genefer_windows64.exe -q "6^8388608+1"
.\genefer_windows64.exe -q "36^4194304+1"
.\genefer_windows64.exe -q "1296^2097152+1"
.\genefer_windows64.exe -q "1679616^1048576+1"
Even though the first form (where b=6 is not a square) is "canonical" and the one you would expect to see on Top 5000, it is not clear which form would actually be fastest.
Testing 6^8388608+1... 21684224 steps to go (1849:28:44 remaining)
Testing 36^4194304+1... 21684224 steps to go (747:08:06 remaining)
Testing 1296^2097152+1... 21684224 steps to go (367:41:08 remaining)
Testing 1679616^1048576+1... 21684224 steps to go (160:41:45 remaining)
Estimated time remaining for 1679616^1048576+1 is 1716:50:53
(the last one switches to the x87 (80-bit) transform).
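
For reference, the four inputs above are the same number written with different bases: squaring the base halves the exponent. A small Python sketch that lists such equivalent forms follows; `equivalent_gfn_forms` is an illustrative name, and the `count` limit is there only to avoid building very large bases.

```python
def equivalent_gfn_forms(b, n, count):
    """List up to `count` (base, exponent) pairs describing the same b^n + 1.

    Each step squares the base and halves the exponent, stopping early
    if the exponent becomes odd.
    """
    forms = [(b, n)]
    while len(forms) < count and n % 2 == 0:
        b, n = b * b, n // 2
        forms.append((b, n))
    return forms


for base, exp in equivalent_gfn_forms(6, 8388608, 4):
    print(f"{base}^{exp}+1")
```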
/JeppeSN  

Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 279,473,760 RAC: 120,265

Genefer and LLR/PFGW are completely different. Genefer doesn't normalize anything (although I suppose it could.)
With regard to LLR doing some normalizations but not others, does anyone know if that's LLR's code or gwnum's code?
____________
My lucky number is 75898^{524288}+1  

composite (Volunteer tester)
Joined: 16 Feb 10 Posts: 820 ID: 55391 Credit: 746,359,483 RAC: 404,191

If you let LLR do the normalization, it will be impossible to rerun serge's benchmark comparison. But once is enough to prove a point.  


I could've sworn that LLR itself does the normalizing of the bases (perhaps only for power of 2?). This feature needs to be in LLR itself, tbh.
This appears also not to be the case. Currently, I've seen reductions in overall testing time ranging from 9% to 33%, depending on FFT length and whether I'm on my Sandy Bridge or Haswell. So it appears that LLR is also not normalizing bases that are powers of 2, but in fact still tests k*16^n+/-1 as a base-16 number and not as k*2^(n*4)+/-1, even though the screen shows that k*2^(n*4)+/-1 is being tested.
To sum up, at least on my system, there can be up to a 33% reduction in testing time per k*b^n+/-1 test by normalizing the test, if b is a power of a base, to the smallest possible base.
Just my 2 cents, take care :)
Regards
KEP  

Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 279,473,760 RAC: 120,265

The candidate is 754806*121^754806+1 which can also be easily regrouped as 754806*11^1509612+1.
Let's compare:
/home/serge/NumTheory/GCW> llr -d d.npg
Base prime factor(s) taken : 11
Starting N-1 prime test of 754806*121^754806+1
Using zero-padded AVX FFT length 720K, Pass1=320, Pass2=2304, a = 3
754806*121^754806+1, bit: 70000 / 5222413 [1.34%]. Time per bit: 5.654 ms.
/home/serge/NumTheory/GCW/2> llr -d d2.npg
Base prime factor(s) taken : 11
Starting N-1 prime test of 754806*11^1509612+1
Using zero-padded AVX FFT length 640K, Pass1=640, Pass2=1K, a = 3
754806*11^1509612+1, bit: 40000 / 5222416 [0.76%]. Time per bit: 4.698 ms.
That's 20% faster. (Results are similar for AVX2.)
Why is the server sending this workunit task as 754806*121^754806+1 ?
Isn't it trivial[?]
You would be surprised at how utterly nontrivial it turned out to be. But it is done. Thanks for pushing us along in the right direction.
____________
My lucky number is 75898^{524288}+1  

Honza (Volunteer moderator, Volunteer tester, Project scientist)
Joined: 15 Aug 05 Posts: 1905 ID: 352 Credit: 3,934,788,813 RAC: 4,218,235

So, 3 of 14 bases will be 20% faster?
About 4% overall speedup for GCW LLR?
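
A back-of-the-envelope version of that estimate, as a Python sketch. It assumes every base accounts for an equal share of the total work, which the thread does not establish, and it takes the 20% figure from the first post at face value.

```python
square_bases = 3         # b = 25, 49, 121
total_bases = 14         # figure quoted in this post
throughput_gain = 0.20   # "20% faster" from the first post

# Assumption: each base contributes the same share of the total work.
share = square_bases / total_bases

# Simple estimate: share of the work times the gain.
simple_estimate = share * throughput_gain

# Run-time view: a task with 20% more throughput takes 1/1.2 of the time.
time_saved = share * (1.0 - 1.0 / (1.0 + throughput_gain))

print(f"simple estimate:  {simple_estimate:.1%}")
print(f"total time saved: {time_saved:.1%}")
```

Either way the result lands near the "about 4%" mentioned here.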
____________
My stats
Badge score: 1*1 + 5*1 + 8*3 + 9*11 + 10*1 + 11*1 + 12*3 = 186  

Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 279,473,760 RAC: 120,265

So, 3 of 14 bases will be 20% faster?
About 4% overall speedup for GCW LLR?
Something like that, yes.
____________
My lucky number is 75898^{524288}+1  

axn (Volunteer developer)
Joined: 29 Dec 07 Posts: 285 ID: 16874 Credit: 28,027,106 RAC: 0

You would be surprised at how utterly nontrivial it turned out to be. But it is done. Thanks for pushing us along in the right direction.
Is the base the only thing normalized or do you normalize k as well (the latter is applicable for all the bases, not just the square ones)?  

Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 279,473,760 RAC: 120,265

You would be surprised at how utterly nontrivial it turned out to be. But it is done. Thanks for pushing us along in the right direction.
Is the base the only thing normalized or do you normalize k as well (the latter is applicable for all the bases, not just the square ones)?
Just the base. In our tests there was no advantage to normalizing k.
____________
My lucky number is 75898^{524288}+1  

axn (Volunteer developer)
Joined: 29 Dec 07 Posts: 285 ID: 16874 Credit: 28,027,106 RAC: 0

Just the base. In our tests there was no advantage to normalizing k.
Hmmm... That was ... unexpected! Can you give me the set of (n,b) numbers used to test this? I am assuming that you used LLR's setup feature to get the FFTs?  

JimB (Honorary cruncher)
Joined: 4 Aug 11 Posts: 916 ID: 107307 Credit: 974,532,191 RAC: 0

Speaking as the person who made the code changes, we are in fact reducing k for all bases. I was supposed to remove that code, but chose to leave it in. I neglected to tell Mike about it until now. My real life is a bit busy at the moment, so sometimes I'm forgetting things like that.
# Fold factors of b from k into the exponent: k*b^n == (k/b)*b^(n+1)
while ($k % $b == 0) {
    $k /= $b;
    $n++;
}


Maximizing the return the challenge will have. Excellent!
____________
 

axn (Volunteer developer)
Joined: 29 Dec 07 Posts: 285 ID: 16874 Credit: 28,027,106 RAC: 0

Speaking as the person who made the code changes, we are in fact reducing k for all bases. I was supposed to remove that code, but chose to leave it in. I neglected to tell Mike about it until now. My real life is a bit busy at the moment, so sometimes I'm forgetting things like that.
while ($k % $b == 0) {
$k /= $b;
$n++;
}
LOL! Well, it doesn't hurt. But I replicated the result, and Mike's right: there is no need to normalize the k, since apparently LLR (or perhaps the gwnum library) is doing it. I can see that when k is a multiple of the base, it chooses a lower FFT (compared to adjacent k's), even without explicit normalizing.
Sorry about that. I should've done my homework before posting about it.

composite (Volunteer tester)
Joined: 16 Feb 10 Posts: 820 ID: 55391 Credit: 746,359,483 RAC: 404,191

Hmm, a process akin to normalization could be responsible for the WTF effect, which is using a timing side channel during sieving to "discover" small primes in the blocking factor. So far there is no other explanation for that weirdness.

Honza (Volunteer moderator, Volunteer tester, Project scientist)
Joined: 15 Aug 05 Posts: 1905 ID: 352 Credit: 3,934,788,813 RAC: 4,218,235

While the discussed and implemented trick with b=25, 49, 121 brings about a 4% speedup, the recently found prime makes GCW yet another 7% faster on top of that. Nice!
____________
My stats
Badge score: 1*1 + 5*1 + 8*3 + 9*11 + 10*1 + 11*1 + 12*3 = 186  
