Join PrimeGrid
Returning Participants
Community
Leader Boards
Results
Other
drummerslowrise

Message boards :
Number crunching :
hey! a more efficient sieve for b*2^k+1
Author 
Message 
compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

I've developed a twopart sieve for b*2^k+1 that uses the order of the multiplicative group (mod p) for test primes p that sieves kvalues rather than sieving b*2^k+1. It's exponentially faster to sieve kvalues so there will be significant speedup for large k. However the tradeoff is that determining the group order for a given p is not a trivial amount of work. Fortunately the group order of each p can be computed independently in parallel and fed at any time to a single ksieve, which runs very fast. I thought about keeping this a trade secret and sieving the heck out of SoB and other conjectures to get a significant LLR testing advantage, but that's not really in keeping with the spirit of community scientific research, and I'm not about to rent a gazillion DO droplets to find prime divisors and compute group orders. So here's how it works.
Program A receives "divisor" primes p (aka potential factors, by whatever method) and outputs the group order and kvalue whose b*2^k+1 == 0 (mod p) "the offset", or just the order if p is not a divisor for any k, or "no order" if p divides b. Only a fraction of these will be divisors for a given b. The real beauty of this is that the group order (mod p) is independent of b, so the output of program A can be reused by another instance of program A with a different b value, and this shortcircuits the looping because it doesn't need to find the group order. In fact because it's independent of b, the residue logic is identical to finding the group order for 2^k+1. The core logic without too much detail is r0 = b + 1; ... some initial processing for small p and set k = 1 or 2; ... r = r0; h=(p1)/2; while (k <= h && r != 0 && r != r0) { r = 2 * r  1; if (r >= p) r = p; k++;} THAT'S ALL FOLKS. There's no testing for r < 0 since it happens only once in the loop when r == 1 immediately following r == 0, so that test is brought outside the loop for a 10% speedup (which I saw experimentally), and a second loop resumes the search for the order (if not already supplied) sans testing for r == 0 (just until r == r0 or k == h). If the group order hasn't been found found by the time (k == h) then it's automatically (p1) and the second loop can exit as soon as it finds a 0 residue. When (r == r0) then the group order is (k  1) (because of k++ loop mechanics), and if the 0 residue hasn't been found then p is not a divisor for any kvalue. All the group orders should be included in the output whether p is divisor or not so that it can be reused for another b.
So the group order work is done once for all b values, whether Proth or PSP or SoB or 321. The group order for p is the step size used by the ksieve with initial k offset determined by program A (only the offset differs for each b value for a give p).
Program A is the one that takes a "long time" to run, finding the group order and offset for around 1,600 candidate primes per minute (single thread) at p ~ 31M. I realize this is very much smaller than the ~ 1P leading edge of sr5sieve, and also much slower than direct sieves on the values b*2^k+1, but the domain being sieved is way smaller (k rather than 2^k) so there's a benefit to be expected for larger k. And also not to berate the benefit of computing the group order of each p just once which is then reused for all other bvalues.
Program A processed p up to 51M (~ 3.1 million primes) in 880 minutes producing 1.7 million actual divisors for 321. The output fed to program B sieves the kvalues in about 3.4 seconds for k up to 50M. Using divisors up to 51M isn't a very deep sieve, leaving 3.9 million b*2^k+1 values left not factored for k up to 50M.
For comparison, this machine produces primes in 8 GB memory with the Sieve of Eratosthenes up to (2^37) in 977.5 seconds, getting significant speed by storing only odd numbers in the bit map. It might have gone faster but all cores were already churning out group orders and offsets.
I haven't run sr5sieve, so I'd like some feedback on runtimes from experienced users for comparison.
The values of p I processed so far are quite small  they fit in a 64bit long integer  so I would convert these to using GMP math functions and it would slow down a lot. It's simple to convert since programs A and B total to less than 500 lines of code, compared to the 15k+ lines of code in sr5sieve.  

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

How many candidates are left in PG's sieved files for 321 from the current leading edge of n=19,278,606 up to 50M?
My ksieve is now at p=67,342,087 and reports it has 2,393,276 candidates left up to 50M.
It's currently eliminating around 50 candidates per minute between that leading edge and 50M.
Edit: It reports 6,287,537 candidates from 19,278,606 up to 100M.
I just recompile program B with a different sieve size and it gives me the result for that in 5 seconds.
It's eliminating around 77 candidates per minute between 50M and 100M.  

streamVolunteer moderator Project administrator Volunteer developer Volunteer tester Send message
Joined: 1 Mar 14 Posts: 1051 ID: 301928 Credit: 563,881,725 RAC: 1,288

How many candidates are left in PG's sieved files for 321 from the current leading edge of n=19,278,606 up to 50M?
Proth part only (+1)  1,105,981.
321 was sieved to 95P.
My ksieve is now at p=67,342,087 and reports it has 2,393,276 candidates left up to 50M.
Is 67,342,087 a maximum prime factor you've tested? If yes, then I have to disappoint you. Right now I'm running sr1sieve on 321 (4 cores) and it advances p by 40,000,000 per second. We're usually throwing away factors below p=1e9 after processing because it's faster to resieve them if we ever will need them again.
 

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

My ksieve is now at p=67,342,087 and reports it has 2,393,276 candidates left up to 50M.
Is 67,342,087 a maximum prime factor you've tested? If yes, then I have to disappoint you. Right now I'm running sr1sieve on 321 (4 cores) and it advances p by 40,000,000 per second. We're usually throwing away factors below p=1e9 after processing because it's faster to resieve them if we ever will need them again.
When you put it that way, it seems pretty disappointing. But...
(Yes, all primes p up to 67M were tested for the given b, separating them into
a set of proven factors and a set of proven nonfactors for a given b.)
TLDR; Program_B output in real time, with web page updates in 15 to 20 second intervals.
live sieve report for b=21181
However I'm doing something quite different than sr1sieve.
Program_A classifies the p's into factors and nonfactors
(for any k's; we just don't yet know which k's the proven factors will divide until trying with the ksieve).
It distills these down to a couple of million good ones which could be used for sr1sieve (for a given b).
When I say I've checked 67M p, it means I know for certain which of these will never produce results
for a particular b, for any k up to infinity. The proven nonfactor p's would fail to produce any sieve hits
even with sr1sieve for a particular b, ever. So if you hand me a large prime, then Program_A tests in a
finite amount of time whether it would be good to use with sr1sieve on a given b.
sr1sieve can only prove that p is a factor when it stumbles upon some value b*2^k+1 which it divides.
sr1sieve can't prove the converse so you must retain (or regenerate) all values of p as possible divisors.
The speed of my sieve goes across all k values, for as many 0 <= k <= N as you can hold in a bitmap.
It's very fast to sieve with a proven factor for all 0 <= k <= billions. But this is not so useful for PrimeGrid.
We are both trying to eliminate candidate k values with factors, so that as few as possible are left over
to test for primality with LLR. Your approach is much faster and targets directly the range which will be
submitted for LLR testing, which makes a lot of sense for PrimeGrid because we want to produce
candidates that have the shortest LLR test times, first. It doesn't matter that Program_B can eliminate
huge values of k because there are so many smaller k that are left to test with LLR.
To summarize, we're kind of talking apples and oranges, even though they can both be eaten.
sr1sieve checks a range of flat space between MIN and MAX which contains some preinitialized
candidates b*2^k+1 with multiple b values and multiple k values which fall in that range, right?
Program_A determines whether p divides *any* value of b*2^k+1, and Program_B tries to find one.
Many of these proven factors don't have a k in the specified sieve range, but we are absolutely certain
some k's will be found given a large enough sieve range.
Program_A reports primes which are proven factors because 0 <= offset < prime for such primes.
Currently I use 1 offset to report a nonfactor, but i'll change that to be able to use unsigned integers.
I could run more simultaneous instances of Program_A to increase the rate of checking p & b combinations.
The speed increase for multiple b values occurs in Program_A comes from having the group order supplied to it already.
One instance has to be a leader on this so it can compute the order for other instances.
I've run it this way.
primes 1000000000 > p1G
cat p1G  ./program_A 21181  tee out.21181  ./program_A 22699  tee out.22699  ...
I don't like the way this works because tee uses buffered output even though program_A flushes
the output for every p record it writes. So the next instance has to wait until the O/S flushes multiple lines of output.
I like it better with
nohup cat p1G  ./program_A 21181 >out.21181 &
nohup tail f out.21181  cat out.21181   ./program_B 37596884 50000001 >report.21181 &
and then I can casually browse the current status with tail n 3 out.21181 and with tail n 3 report.21181
This is pretty close to what's happening to create the web page.
Program_B produces a line only when fil= changes. This is counted as the number of "effective factors".  

streamVolunteer moderator Project administrator Volunteer developer Volunteer tester Send message
Joined: 1 Mar 14 Posts: 1051 ID: 301928 Credit: 563,881,725 RAC: 1,288

sr1sieve checks a range of flat space between MIN and MAX which contains some preinitialized
candidates b*2^k+1 with multiple b values and multiple k values which fall in that range, right?
Assume you need to solve b*2^k+1 = 0 (mod p), where b is constant, k is a set of candidates inside sieving range k_min <= k < K, and p is a prime to test.
Of course we're not testing all existing k lying inside sieving range or stored in our candidate file. Instead, rewrite this equation as:
2^k = 1/b (mod p) [eq. 1]
or more algorithmically efficient form
1/(2^k) = b/c (mod p)
since c=1 for Proth primes, it simplifies further to
1/(2^k) = b (mod p) [eq. 2]
Equations [1] and [2] are discrete logarithm problem which is solved using BSGS algorithm.
Applying the algorithm and solving the equation, we'll quickly have all possible k values which must be removed from the sieve.
This how sr1sieve / sr2sieve works. sr1sieve is optimized for single sequence, sr2sieve have some tricks to work faster with many sequences (sieving few different b with same krange).
 

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

Applying the algorithm and solving the equation, we'll quickly have all possible k values which must be removed from the sieve.
This how sr1sieve / sr2sieve works. sr1sieve is optimized for single sequence, sr2sieve have some tricks to work faster with many sequences (sieving few different b with same krange).
OK, I didn't realize sr1sieve isn't testing with all primes.
So we're doing the same thing with k values, except I have O(n) search on each prime and BSGS is O(n^.5).
That's why sr1sieve selects primes much faster than Program_A.
How is sr1sieve determining the group order?
My list of possible factors from Program_A for a given b should be identical to yours,
and the sieve step length and starting point should be identical unless we're doing something different.
Can you check what sr1sieve does with the prime 89667313?
The group order I have for this one is 14944552
and the first k which is sieved out is 4605044
so the next k sieved out is 4605044+14944552=19549596
If we don't match then one of us is missing k values, or sieving out too many k's,
or sieving out the wrong k's entirely.
I think the sieve part (Program_B) would be about the same speed except that
sr1sieve has a significant performance advantage with no I/O overhead
to communicate between the 2 parts.
 

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

Applying the algorithm and solving the equation, we'll quickly have all possible k values which must be removed from the sieve.
This how sr1sieve / sr2sieve works. sr1sieve is optimized for single sequence, sr2sieve have some tricks to work faster with many sequences (sieving few different b with same krange).
OK, I didn't realize sr1sieve isn't testing with all primes.
So we're doing the same thing with k values, except I have O(n) search on each prime and BSGS is O(n^.5).
That's why sr1sieve selects primes much faster than Program_A.
How is sr1sieve determining the group order?
My list of possible factors from Program_A for a given b should be identical to yours,
and the sieve step length and starting point should be identical unless we're doing something different.
Can you check what sr1sieve does with the prime 89667313?
The group order I have for this one is 14944552
and the first k which is sieved out is 4605044
so the next k sieved out is 4605044+14944552=19549596
If we don't match then one of us is missing k values, or sieving out too many k's,
or sieving out the wrong k's entirely.
I think the sieve part (Program_B) would be about the same speed except that
sr1sieve has a significant performance advantage with no I/O overhead
to communicate between the 2 parts.
I neglected to say that the above check is for b=21181  

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

Wow, k=37598180 has been very resistant to being knocked out of my sieve as a candidate,
as the lowest candidate above the cutoff ever since prime=167.
Is it still in the official "n" candidates list for SoB 21181?
I suppose so, it must be just the next one on the list, as it is likely to have an astronomically large factor.  

JimBHonorary cruncher Send message
Joined: 4 Aug 11 Posts: 920 ID: 107307 Credit: 989,555,179 RAC: 23,710

21181*2^37598180+1 is divisible by 3799647223  

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

21181*2^37598180+1 is divisible by 3799647223
Ah, only 10 digits, not such a big factor after all.
What method did you use to get this factor?  

JimBHonorary cruncher Send message
Joined: 4 Aug 11 Posts: 920 ID: 107307 Credit: 989,555,179 RAC: 23,710

It was either sr1sieve or sr2sieve. I can't find the spiral notebooks I used at the time.  

streamVolunteer moderator Project administrator Volunteer developer Volunteer tester Send message
Joined: 1 Mar 14 Posts: 1051 ID: 301928 Credit: 563,881,725 RAC: 1,288

How is sr1sieve determining the group order?
Sorry, this question is above my math skills. As far as I know output of prime generator is fed directly into BSGS, all magic happens there. srsieve /sr1sieve/sr2sieve source is available, version with latest updates was posted somewhere on mersenneforum.org.
 

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

How is sr1sieve determining the group order?
Sorry, this question is above my math skills. As far as I know output of prime generator is fed directly into BSGS, all magic happens there. srsieve /sr1sieve/sr2sieve source is available, version with latest updates was posted somewhere on mersenneforum.org.
It's OK, there's more than one way to do it. I'll look at the code.
You said p advanced 40M per second in your sr1sieve with 4 cores at 95P.
That's not only due to a faster algorithm and implementation.
It's mainly because the density of primes is less at larger primes,
so there are bigger jumps between successive primes.
I'm currently seeing a difference of around 330 between successive primes at p ~ 119M.
With 19 instances of program_A running in parallel my throughput is about 125 primes per second,
so p is advancing at 125 * 330 ~ 41K per second now.  

Ravi FernandoProject administrator Volunteer tester Project scientist Send message
Joined: 21 Mar 19 Posts: 211 ID: 1108183 Credit: 14,457,450 RAC: 3,173

I assume by "group order" you mean the order of 2 in the multiplicative group mod p. This will always be a divisor of p1, and usually it will be (p1)/d where d is small. So the first algorithm that comes to my mind is: 1. Factor p1.
2. Initialize a "result" variable to equal p1.
3. Loop through all distinct prime divisors q of p1: a) Check whether 2^(result/q) is 1 mod p. If so, set result = result/q and repeat until either it fails or result is no longer divisible by q. 4. Return result. Disclaimer: I haven't read the source code of sr1sieve or any other sieving program, and I am virtually certain that it's smarter than anything I can come up with in 30 seconds.  

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

I assume by "group order" you mean the order of 2 in the multiplicative group mod p.
Yes. Thanks for that correction.
What I notice is that when the order of 2 in the multiplicative group formed by 2^k+1 (mod p) is p1,
then 2 is a generator with the same order in the multiplicative group formed by b*2^k+1 (mod p),
and the cycle of residues b*2^k+1 (mod p) for k = 0 .. p1 is simply a rotation by "e" positions
of the same cycle of residues of 2^k+1 (mod p). I haven't proven this, it's just a pattern I see.
Since p divides 2^{(p1)/2}+1 the pattern dictates that p also divides b*2^{e+(p1)/2}+1
so then it's sufficient to have the order of 2 and the rotation "e" to find the offset = e+(p1)/2 for the sieve.
As a concrete example with p = 29, the order of 2 in multiplicative group 2^k+1 (mod 29) is 28.
With k = 0, 2^k+1 = 2 and with b = 21181, b*2^k+1 = 2 when k = 3, so e = 3.
The k for which p divides 2^k+1 is (p1)/2 = 14, so with knowledge of the rotation e
we easily calculate that p divides b*2^k+1 with k = (p1)/2 + e = 17.
The ideal situation would be to have a formula which relates b, p, and e.  

rogueVolunteer developer
Send message
Joined: 8 Sep 07 Posts: 1259 ID: 12001 Credit: 18,565,548 RAC: 0

And if you think sr1sieve is fast, I suggest that you give srsieve2cl a try.  

Ravi FernandoProject administrator Volunteer tester Project scientist Send message
Joined: 21 Mar 19 Posts: 211 ID: 1108183 Credit: 14,457,450 RAC: 3,173

What I notice is that when the order of 2 in the multiplicative group formed by 2^k+1 (mod p) is p1,
then 2 is a generator with the same order in the multiplicative group formed by b*2^k+1 (mod p),
and the cycle of residues b*2^k+1 (mod p) for k = 0 .. p1 is simply a rotation by "e" positions
of the same cycle of residues of 2^k+1 (mod p). I haven't proven this, it's just a pattern I see.
How I would phrase this: the subgroup H generated by 2 in (Z/pZ)* has the same size as its coset bH. (That is, powers of 2 mod p cycle with the same period as b times powers of 2.) Moreover, if 2 is a generator of (Z/pZ)* (a primitive root), then the list b*2^k is the same as the list 2^k shifted cyclically by e steps, where e (a discrete logarithm) is chosen so that 2^e = 1/b mod p. This is all true, provided of course that b is not itself a multiple of p.
The ideal situation would be to have a formula which relates b, p, and e.
It sounds like you're looking for an efficient solution to the discrete log problem. If you find one, I'm sure the NSA would love to have a word with you.  

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

What I notice is that when the order of 2 in the multiplicative group formed by 2^k+1 (mod p) is p1,
then 2 is a generator with the same order in the multiplicative group formed by b*2^k+1 (mod p),
and the cycle of residues b*2^k+1 (mod p) for k = 0 .. p1 is simply a rotation by "e" positions
of the same cycle of residues of 2^k+1 (mod p). I haven't proven this, it's just a pattern I see.
How I would phrase this: the subgroup H generated by 2 in (Z/pZ)* has the same size as its coset bH. (That is, powers of 2 mod p cycle with the same period as b times powers of 2.) Moreover, if 2 is a generator of (Z/pZ)* (a primitive root), then the list b*2^k is the same as the list 2^k shifted cyclically by e steps, where e (a discrete logarithm) is chosen so that 2^e = 1/b mod p. This is all true, provided of course that b is not itself a multiple of p.
The ideal situation would be to have a formula which relates b, p, and e.
It sounds like you're looking for an efficient solution to the discrete log problem. If you find one, I'm sure the NSA would love to have a word with you.
I believe that they would. I took a couple of undergraduate courses instructed by H. C. Williams at around the time there was a rumour among the students that one of his publications was censored.  

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

It seems to me that I'm doing it backwards, trying to find a prime b * 2^{k} + 1
by using primes p to sieve out all the k's in the test range (except the first such k for each p)
where p divides b * 2^{k} + 1 so that the leftovers can be tested for primality.
The primality test is needed because a leftover k can still produce a composite
which hasn't been identified as such by a higher p than checked so far,
and more than 50% of primes tested fail as they are not factors with that b.
For numbers b * 2^{k} + 1 the recurrence relation is
S_{0} = b + 1
S_{k+1} = 2 * S_{k}  1
And this recurrence is performed (mod p).
S_{0,p} = b + 1 (mod p)
S_{(k+1),p} = 2 * S_{k,p}  1 (mod p)
For each p, the inner loop searches in t = 0 .. (order of 2)  1 for a value S_{t,p} = 0
and upon finding one, using (2 * t) as the starting offset for an iteration of the ksieve
with step length = (order of 2) to knock out k's where b * 2^{k} + 1 share the divisor p with b * 2^{t} + 1
or "do nothing" in the ksieve if this p isn't a divisor for any t (no residue of 0 is found in the cycle).
That's how I do it now, and it's pretty slow. Here's another idea.
Instead of looping (mod p), I could leave S_{k} asis,
and attempt to factor this huge number with feasibly small divisors,
using these factors as primes for determining cyclelengths
to use in the ksieve with offset k. The unfactorable part is used as if it
is also prime, with some loss of sieve coverage as a tradeoff to speed.
For example the factors of S_{0} = 21182 are 2, 7, 17, and 89
Since S_{k} is always odd for k > 0, we disregard 2 and we have
primes 7, 17, and 89 forming cycles with lengths 3, 8, and 11 (respectively)
all having cycle offset 0 to pick off k's in the ksieve.
The 3 cycle lengths are coprime so there is a long repeating supercycle in the ksieve
which knocks out (3 * 8 * 11 = 264) k values each supercycle. A simple implementation
just iterates through the whole sieve range for each small cycle length independently.
A spacetime tradeoff is realized by implementing the supercycle hits as a list of differences
between offsets within the supercycle, so that the supercycle is used with one pass of
the sieving range, resulting no repeated CPU cache refills of the sieve range, and
sometimes knocking multiple k's in the same cache line.
Unlike the first method above, we never have to search a prime's cycle for a 0 divisor at
some koffset, as we are given the offset in each iteration of the kloop.
On the other had, we have to use small trial divisors to produce factors of S_{k}.
Every value S_{k} has a divisor at offset k in the cycles corresponding to the factors
factors of S_{k}. These are the only factors which have this offset. If S_{k} is
prime, then there's only one factor and it's not being tested for here for primality.
Small factors often reappear as factors of another S_{k}.
For example 7 is a factor of S_{0} = 21182 and also of S_{3} = 169449 = 3 * 7 * 8069.
The ksieve disregards factors whose cycle lengths (in this case 2, 3, 8068) are <= k and as a result we
would sieve for offset k=3 only with cycle length 8068.
 

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

I wondered if that last idea would even be useful, so I took a random small prime, 163
$ factor 163
163: 163
computed the ksieve parameters with program_A
$ echo 163  ./cycle_filter 21181
pri=163 len=162 off=122
finding that it produces a starting value of 122 for the ksieve
then computed S_{122} = 21181 * 2^{122} + 1 and factored it
$ echo "2 122 ^ 21181 * 1 + p"  dc  factor
112617512714881212415902149375191913857025: 5 5 163 21673 353869 807707 198857917 22434688079929
Then I ran the unique factors of S_{122} through program_A
obtaining the unexpected result that all the factors > 163 of S_{122}
have the same starting offset as 163.
$ echo "2 122 ^ 21181 * 1 + p"  dc  factor  cut d : f 2  tr ' ' '\n'  uniq  ./cycle_filter 21181
pri=5 len=4 off=2
pri=163 len=162 off=122
pri=21673 len=2709 off=122
pri=353869 len=353868 off=122
pri=807707 len=807706 off=122
pri=198857917 len=198857916 off=122
pri=22434688079929 len=934778669997 off=122
THAT'S A FANTASTIC RELATIONSHIP BETWEEN THE FACTORS.
A few more examples have convinced me that this result is general.
Does anyone have a proof for this?
EDIT: I stated this in the previous post but it didn't sink in so deeply due to the density of my brain.
$ for ((k=12; k<23; k++)); do
s=$(echo "2 $k ^ 21181 * 1 + p"  dc)
echo e "\nS[$k] = $s ==========\n"
echo $s  factor  cut d : f 2  tr ' ' '\n'  uniq  ./cycle_filter 21181
done
S[12] = 86757377 ==========
pri=7 len=3 off=0
pri=941 len=940 off=12
pri=13171 len=2634 off=12
S[13] = 173514753 ==========
pri=3 len=2 off=1
pri=19279417 len=4819854 off=13
S[14] = 347029505 ==========
pri=5 len=4 off=2
pri=6469 len=6468 off=14
pri=10729 len=5364 off=14
S[15] = 694059009 ==========
pri=3 len=2 off=1
pri=7 len=3 off=0
pri=53 len=52 off=15
pri=71 len=35 off=15
pri=8783 len=4391 off=15
S[16] = 1388118017 ==========
pri=11 len=10 off=6
pri=13 len=12 off=4
pri=17 len=8 off=0
pri=19 len=18 off=16
pri=41 len=20 off=16
pri=733 len=244 off=16
S[17] = 2776236033 ==========
pri=3 len=2 off=1
pri=29 len=28 off=17
pri=739 len=246 off=17
pri=1489 len=744 off=17
S[18] = 5552472065 ==========
pri=5 len=4 off=2
pri=7 len=3 off=0
pri=158642059 len=158642058 off=18
S[19] = 11104944129 ==========
pri=3 len=2 off=1
pri=47 len=23 off=19
pri=8750941 len=583396 off=19
S[20] = 22209888257 ==========
pri=83077 len=1932 off=20
pri=267341 len=53468 off=20
S[21] = 44419776513 ==========
pri=3 len=2 off=1
pri=7 len=3 off=0
pri=23 len=11 off=10
pri=67 len=66 off=21
pri=1372633 len=228772 off=21
S[22] = 88839553025 ==========
pri=5 len=4 off=2
pri=89 len=11 off=0
pri=139 len=138 off=22
pri=287251 len=57450 off=22
 

Ravi FernandoProject administrator Volunteer tester Project scientist Send message
Joined: 21 Mar 19 Posts: 211 ID: 1108183 Credit: 14,457,450 RAC: 3,173

Then I ran the unique factors of S_{122} through program_A
obtaining the unexpected result that all the factors > 163 of S_{122}
have the same starting offset as 163.
That seems very expected to me. The offset of p is by definition the smallest k such that p divides 21181 * 2^k + 1. So you found a bunch of prime factors of 21181 * 2^122 + 1, calculated their offsets, and discovered that... all of them divide 21181 * 2^122 + 1, and most of them (the big ones, at least) don't divide any smaller number of the form 21181 * 2^k + 1. What's surprising about that?  

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

The offset of p is by definition the smallest k such that p divides 21181 * 2^k + 1.
That's true but not helpful to explaining why the larger factors don't also divide some value with smaller k.
I understand the reason, now, why the larger factors all have offset == k.
It is because the smaller factors have (order of 2) < k so they cover k with k = n * (order of 2) + offset with n > 0,
while the larger factors have (order of 2) > k, and the only way to equal k when n is 0 is to have their offset == k.
I think composite factors would follow this behaviour as well.
Indeed:
$ echo 25  ./cycle_filter 21181
pri=25 len=24 off=2
$ echo "21673 5 * p"  dc  ./cycle_filter 21181
pri=108365 len=10836 off=122
NB the program just prints "pri=", it doesn't expect the divisors to be prime.  

Ravi FernandoProject administrator Volunteer tester Project scientist Send message
Joined: 21 Mar 19 Posts: 211 ID: 1108183 Credit: 14,457,450 RAC: 3,173

It is because the smaller factors have (order of 2) < k so they cover k with k = n * (order of 2) + offset with n > 0,
while the larger factors have (order of 2) > k, and the only way to equal k when n is 0 is to have their offset == k.
I think composite factors would follow this behaviour as well.
Indeed:
$ echo 25  ./cycle_filter 21181
pri=25 len=24 off=2
$ echo "21673 5 * p"  dc  ./cycle_filter 21181
pri=108365 len=10836 off=122
NB the program just prints "pri=", it doesn't expect the divisors to be prime.
Looks correct to me, except that the order of 2 mod 25 is 20, not 24. Maybe a slight bug in the case of divisors with repeated prime factors?  

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

Looks correct to me, except that the order of 2 mod 25 is 20, not 24. Maybe a slight bug in the case of divisors with repeated prime factors?
I know why. I assumed that the input is prime, so then assumed that the order is (p1)
if it isn't less than or equal to (p1)/2. Obviously I violated that input assumption.
It's a case of garbage in, garbage out. I'll have to redo the order computation to handle composite input.  


Of course Euler's totient function Ï†(p) has to be used instead of p1 if p may be nonprime. /JeppeSN  

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

I see some interesting symmetries.
There's the obvious rotational symmetry between 2^k+1 (mod p) and b * 2^k + 1 (mod p).
Call the 2^k+1 (mod p) the principal cycle, and the other one the rotated cycle,
which is the same as the principal cycle rotated by "r" elements.
Looking at p = 29
The principal cycle and rotated cycle repeat with the (order of 2) = 28 = (p1).
The first difference between successive elements of the principal cycle is a cycle of length (p1)
that contains a pair of shorter cycles each with (p1)/2 elements, the second one identical to the first,
but with a change of sign of each element, provided that 0 in the principal cycle is replaced with p for
taking differences.
The first difference of the rotated cycle is the same thing as the first difference of the principal cycle,
but rotated by "r" elements.
The "inverse cycle" is constructed by taking the inverse of each element of the principal cycle,
using (p+1)/2 as the inverse of the 0 element to fix symmetries for first differences.
Like the principal cycle, inverse cycle has length (p1).
The "rotated inverse" cycle is the inverse of the rotated cycle.
It is the same as the inverse of the principal cycle, rotated by "r" elements.
The first difference of the inverse cycle has total length (p1) and contains a reflection symmetry
at (p1)/2, providing that the inverse of the 0 in the principal cycle taken as (p+1)/2.
The first difference of the rotated inverse cycle also has total length (p1) but
splits into 2 shorter reflection symmetries with one rotation of the total cycle.
The first symmetry contains (2r) elements (reflected about r).
This is seen in the "tid" column, as {7 4 5 5 4 7} in the 6 elements after the rotated 1.
The remaining (p12r) elements are in the second reflection below the first, reflecting around (p1)/2  r.
e = 2^k + 1
a = 2^k + 1 (mod 29) the "principal cycle"
ad = first difference of principal cycle
ai = inverse of the principal cycle
aid = first difference of inverse of principal cycle
t = b * 2^k + 1 (mod 29) with b = 21181 "the rotated cycle"
td = first difference of rotated cycle
ti = inverse of the rotated cycle ("rotated inverse cycle")
tid = first difference of rotated inverse cycle
"Synthetic zeroes", a way of dealing with the inverse of a 0 element that makes sense
for symmetries, sort of like a discrete version of "analytic continuation" of Complex Analysis.
I made this up. Maybe someone knows of real terminology for this.
Zeroes in the a and t columns replaced with 29* = p
Zeroes in the ai and ti columns replaced with 15* = (p+1)/2
The table through two full cycles (p1) each.
Differences in row k are for data[k]  data[k1]/
The differences ad, aid, td, tid, in row[0] are undefined,
so values were copied from the row further down where a=2.
p=29 k= 0 e= 2 a= 2 ad=14 ai=15 aid= 5 t=12 td= 9 ti=17 tid= 1
p=29 k= 1 e= 3 a= 3 ad= 1 ai=10 aid= 5 t=23 td= 11 ti=24 tid= 7
p=29 k= 2 e= 5 a= 5 ad= 2 ai= 6 aid= 4 t=16 td= 7 ti=20 tid= 4
p=29 k= 3 e= 9 a= 9 ad= 4 ai=13 aid= 7 t= 2 td=14 ti=15 tid= 5
p=29 k= 4 e= 17 a=17 ad= 8 ai=12 aid= 1 t= 3 td= 1 ti=10 tid= 5
p=29 k= 5 e= 33 a= 4 ad=13 ai=22 aid= 10 t= 5 td= 2 ti= 6 tid= 4
p=29 k= 6 e= 65 a= 7 ad= 3 ai=25 aid= 3 t= 9 td= 4 ti=13 tid= 7
p=29 k= 7 e= 129 a=13 ad= 6 ai= 9 aid=16 t=17 td= 8 ti=12 tid= 1
p=29 k= 8 e= 257 a=25 ad= 12 ai= 7 aid= 2 t= 4 td=13 ti=22 tid= 10
p=29 k= 9 e= 513 a=20 ad= 5 ai=16 aid= 9 t= 7 td= 3 ti=25 tid= 3
p=29 k=10 e= 1025 a=10 ad=10 ai= 3 aid=13 t=13 td= 6 ti= 9 tid=16
p=29 k=11 e= 2049 a=19 ad= 9 ai=26 aid= 23 t=25 td= 12 ti= 7 tid= 2
p=29 k=12 e= 4097 a= 8 ad=11 ai=11 aid=15 t=20 td= 5 ti=16 tid= 9
p=29 k=13 e= 8193 a=15 ad= 7 ai= 2 aid= 9 t=10 td=10 ti= 3 tid=13
p=29 k=14 e= 16385 a=29* ad= 14 ai=15* aid= 13 t=19 td= 9 ti=26 tid= 23
p=29 k=15 e= 32769 a=28 ad= 1 ai=28 aid= 13 t= 8 td=11 ti=11 tid=15
p=29 k=16 e= 65537 a=26 ad= 2 ai=19 aid= 9 t=15 td= 7 ti= 2 tid= 9
p=29 k=17 e= 131073 a=22 ad= 4 ai= 4 aid=15 t=29* td= 14 ti=15* tid= 13
p=29 k=18 e= 262145 a=14 ad= 8 ai=27 aid= 23 t=28 td= 1 ti=28 tid= 13
p=29 k=19 e= 524289 a=27 ad= 13 ai=14 aid=13 t=26 td= 2 ti=19 tid= 9
p=29 k=20 e= 1048577 a=24 ad= 3 ai=23 aid= 9 t=22 td= 4 ti= 4 tid=15
p=29 k=21 e= 2097153 a=18 ad= 6 ai=21 aid= 2 t=14 td= 8 ti=27 tid= 23
p=29 k=22 e= 4194305 a= 6 ad=12 ai= 5 aid=16 t=27 td= 13 ti=14 tid=13
p=29 k=23 e= 8388609 a=11 ad= 5 ai= 8 aid= 3 t=24 td= 3 ti=23 tid= 9
p=29 k=24 e= 16777217 a=21 ad= 10 ai=18 aid= 10 t=18 td= 6 ti=21 tid= 2
p=29 k=25 e= 33554433 a=12 ad= 9 ai=17 aid= 1 t= 6 td=12 ti= 5 tid=16
p=29 k=26 e= 67108865 a=23 ad= 11 ai=24 aid= 7 t=11 td= 5 ti= 8 tid= 3
p=29 k=27 e= 134217729 a=16 ad= 7 ai=20 aid= 4 t=21 td= 10 ti=18 tid= 10
p=29 k=28 e= 268435457 a= 2 ad=14 ai=15 aid= 5 t=12 td= 9 ti=17 tid= 1
p=29 k=29 e= 536870913 a= 3 ad= 1 ai=10 aid= 5 t=23 td= 11 ti=24 tid= 7
p=29 k=30 e= 1073741825 a= 5 ad= 2 ai= 6 aid= 4 t=16 td= 7 ti=20 tid= 4
p=29 k=31 e= 2147483649 a= 9 ad= 4 ai=13 aid= 7 t= 2 td=14 ti=15 tid= 5
p=29 k=32 e= 4294967297 a=17 ad= 8 ai=12 aid= 1 t= 3 td= 1 ti=10 tid= 5
p=29 k=33 e= 8589934593 a= 4 ad=13 ai=22 aid= 10 t= 5 td= 2 ti= 6 tid= 4
p=29 k=34 e= 17179869185 a= 7 ad= 3 ai=25 aid= 3 t= 9 td= 4 ti=13 tid= 7
p=29 k=35 e= 34359738369 a=13 ad= 6 ai= 9 aid=16 t=17 td= 8 ti=12 tid= 1
p=29 k=36 e= 68719476737 a=25 ad= 12 ai= 7 aid= 2 t= 4 td=13 ti=22 tid= 10
p=29 k=37 e= 137438953473 a=20 ad= 5 ai=16 aid= 9 t= 7 td= 3 ti=25 tid= 3
p=29 k=38 e= 274877906945 a=10 ad=10 ai= 3 aid=13 t=13 td= 6 ti= 9 tid=16
p=29 k=39 e= 549755813889 a=19 ad= 9 ai=26 aid= 23 t=25 td= 12 ti= 7 tid= 2
p=29 k=40 e= 1099511627777 a= 8 ad=11 ai=11 aid=15 t=20 td= 5 ti=16 tid= 9
p=29 k=41 e= 2199023255553 a=15 ad= 7 ai= 2 aid= 9 t=10 td=10 ti= 3 tid=13
p=29 k=42 e= 4398046511105 a=29* ad= 14 ai=15* aid= 13 t=19 td= 9 ti=26 tid= 23
p=29 k=43 e= 8796093022209 a=28 ad= 1 ai=28 aid= 13 t= 8 td=11 ti=11 tid=15
p=29 k=44 e= 17592186044417 a=26 ad= 2 ai=19 aid= 9 t=15 td= 7 ti= 2 tid= 9
p=29 k=45 e= 35184372088833 a=22 ad= 4 ai= 4 aid=15 t=29* td= 14 ti=15* tid= 13
p=29 k=46 e= 70368744177665 a=14 ad= 8 ai=27 aid= 23 t=28 td= 1 ti=28 tid= 13
p=29 k=47 e= 140737488355329 a=27 ad= 13 ai=14 aid=13 t=26 td= 2 ti=19 tid= 9
p=29 k=48 e= 281474976710657 a=24 ad= 3 ai=23 aid= 9 t=22 td= 4 ti= 4 tid=15
p=29 k=49 e= 562949953421313 a=18 ad= 6 ai=21 aid= 2 t=14 td= 8 ti=27 tid= 23
p=29 k=50 e= 1125899906842625 a= 6 ad=12 ai= 5 aid=16 t=27 td= 13 ti=14 tid=13
p=29 k=51 e= 2251799813685249 a=11 ad= 5 ai= 8 aid= 3 t=24 td= 3 ti=23 tid= 9
p=29 k=52 e= 4503599627370497 a=21 ad= 10 ai=18 aid= 10 t=18 td= 6 ti=21 tid= 2
p=29 k=53 e= 9007199254740993 a=12 ad= 9 ai=17 aid= 1 t= 6 td=12 ti= 5 tid=16
p=29 k=54 e= 18014398509481985 a=23 ad= 11 ai=24 aid= 7 t=11 td= 5 ti= 8 tid= 3
p=29 k=55 e= 36028797018963969 a=16 ad= 7 ai=20 aid= 4 t=21 td= 10 ti=18 tid= 10
p=29 k=56 e= 72057594037927937 a= 2 ad=14 ai=15 aid= 5 t=12 td= 9 ti=17 tid= 1
p=29 k=57 e=144115188075855873 a= 3 ad= 1 ai=10 aid= 5 t=23 td= 11 ti=24 tid= 7
p=29 k=58 e=288230376151711745 a= 5 ad= 2 ai= 6 aid= 4 t=16 td= 7 ti=20 tid= 4
 

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

This is seen in the "tid" column, as {7 4 5 5 4 7} in the 6 elements after the rotated 1.
The rotation by 1 element is superfluous.
It is eliminated by defining the first difference @ k as data[k+1]  data[k].  

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

How many candidates are left in PG's sieved files for 321 from the current leading edge of n=19,278,606 up to 50M?
Proth part only (+1)  1,105,981.
321 was sieved to 95P.
My ksieve is now at p=67,342,087 and reports it has 2,393,276 candidates left up to 50M.
Is 67,342,087 a maximum prime factor you've tested? If yes, then I have to disappoint you. Right now I'm running sr1sieve on 321 (4 cores) and it advances p by 40,000,000 per second. We're usually throwing away factors below p=1e9 after processing because it's faster to resieve them if we ever will need them again.
I decided to focus my energies for now on SOB 21181.
I find that conjectures are more appealing for sieve work.
How many candidates are left up to n=50M in SOB 21181, and how high was it sieved to?
I'm currently at p=1.44 billion and have 61'282 candidates remaining for SOB 21181 for 37'760'709 <= n <= 50'000'000.
At this stage it takes about 12 minutes to eliminate 4 candidates in that range by processing 114k primes.
I think we are comparing apples and oranges when talking about sieving.
The sr1sieve file is multigigabytes large, correct?
My sieve file is a bit map of candidate exponents up to 50M, so it's just 6 MB.  

streamVolunteer moderator Project administrator Volunteer developer Volunteer tester Send message
Joined: 1 Mar 14 Posts: 1051 ID: 301928 Credit: 563,881,725 RAC: 1,288

How many candidates are left up to n=50M in SOB 21181, and how high was it sieved to?
There are 135375 candidates total between 0 and 50M.
There are 32908 candidates left between 37,7M (current leading edge) and 50M.
SoB sieving was one of PG projects, it was run for many months (may be years) on hundreds of computers. It's sieved up to something insane, our log files have factors around p=10000000P.
The sr1sieve file is multigigabytes large, correct?
My sieve file is a bit map of candidate exponents up to 50M, so it's just 6 MB.
There is no need to sieve everything from 0 to 50M using sr1sieve. There is a firststage program  "srsieve". You need only to specify K and Nrange to sieve. It'll take just a few minutes to sieve, for example, up to p=1e9. This will eliminate majority of candidates and resulting output file will be only few megabytes long. Factors found during first stage are usually just thrown away, it's much faster to resieve again if necessary. Candidates survived during first stage should be sieved further using "sr1sieve" (tuned for single sequence) or "sr2sieve" (tuned for multiple sequences).
Please note that format "b*2^k+1" which you've used in first post is a bit unusual. In sieving software and everywhere on PrimeGrid we're using "kbnc" format  "k*b^n+c". In case of Proth primes it can be written as "k*2^n+1". In my previous replies I tried to adapt to your notation, but now I finally switched to standard "k" and "n".
 

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

Thanks, I'll switch to the standard notation. My title should say k*2^n+1.
For k=21181, I have about double the number of candidates left in both cases, 0 to 50M and 37'7M to 50M.
Given the exponential decay in number of candidates eliminated per prime p,
I have a very looooong way to go to catch up to PrimeGrid.
Sieving the very first divisor (p==3) eliminates 25M candidates (exactly 1/2) from 0 to 50M.
The next divisor (5) eliminates a further 12'5M candidates (exactly 1/4).
Then (7) eliminates 4'17M more candidates (exactly 1/12), leaving 8'3M candidates from 0 to 50M.
The first odd prime to be ignored is p==31.
It has an order of 2 (a cycle length in the 'n' sieve) equal to 5.
Since 31 divides none of those 21181*2^n+1 for n = (0 .. 4) then 31 never divides 21181*2^n+1 for any n.
Sieving to eliminate candidates is not the slow part.
Program B processes the first 79'2 million primes (up to p=1'44e9) in 50 seconds with one thread.
It gets information from Program A on which of these primes to ignore.
If Program A simply didn't output these primes, Program B would have 23'9 million fewer lines of input.
That should save a few seconds. Technically, if we want to keep the ignored primes, they should be
written to a different file than is used as input for Program B.
Computing the remainders to see whether a prime should be ignored is the job of Program A.
Computing each remainder is very fast.
It's just r[0]=21182 (when p > 21182) or r[0] = remainder(21182/p); then r[n+1]=2*r[n]1 (mod p);
but how many of these remainders there are to calculate depends on the factors of (p  1).
Program A gradually slows down with larger p, reaching a steady pace where it is rejecting
almost all primes as having an offset (the least n where p divides 21181*2^n+1) which exceeds
the sieve's upper limit.
There is no test in the loop over n to adjust a negative remainder, since the only case is r[n+1] == 1
which immediately follows r[n] == 0, at which point we've found p to be a divisor of 21181*2^n+1.
The first loop continues until (r[n+1] == r[0]) or (r[n+1] == 0) or n reaches the sieve limit (with tricks to reduce it).
If the loop ends on the first condition, we have found the order of 2 without finding a remainder of 0,
so p is not a divisor and we go onto the next prime. If it ends on the third condition then
we have exceeded the range we are sieving and we go onto the next prime. That n and prime and
remainder can be saved for later reuse if the sieve upper limit is ever extended, and we have the
massive disk space to store these residues.
Otherwise p is a divisor so we set r[n+1] == p1 then check whether it matches r[0] and if not
continue with another loop over n to find the order of 2, once again limiting the size of n with some tricks.
If the second loop terminates before it finds the order of 2, then at least it has found the
first n where r[n] == 0 which potentially eliminates one candidate in the sieve.
When Program A finds the offset AND the order of 2, then Program B loops through the sieve
with that offset and cycle length to try to eliminating multiple candidates for that p.
I strongly doubt that I've eliminated different candidates than the ones left in PrimeGrid's sieve.
 

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

I've been fiddling with AVX2 in C using Intel intrinsics.
Not quite there yet, but mostly worked out.
It's probably slower than the serial version because modulo requires
so many computations in the vector version.
Without the nitpicky edge cases, the serial way to see if p divides x = k*2^n+1 is:
x0 = k+1
x = x1 = 2 * x0  1
n = 1
result = 1 // assume not a divisor
while (x != x0) {
n++
x = 2 * x  1
if (x >= p) x = p // conditional subtraction performs mod p
if (x == 0) {
result = n
break
}
}
For AVX2 the SIMD calculation skips ahead by 4 serial calculations.
For AVX512 the vector length would be 8, and SIMD X = 256 * X  255
x0 = k+1
x1 = 2 * x0  1
X = ( x1, 2*x11, 2*(2*x11)1, 2*(2*(2*x11)1)1 ) // Initialize vector with the first 4 iterations of the serial loop
// conditional subtractions for modulo
P8 = ( 8*p, 8*p, 8*p, 8*p )
P4 = ( 4*p, 4*p, 4*p, 4*p )
P2 = ( 2*p, 2*p, 2*p, 2*p )
P = ( p, p, p, p )
E = ( x0, x0, x0, x0 ) // loop ending condition can occur in any element
result = 1
n = 1
while (true) {
// SIMD comparisons are tricky, destination argument is replaced
// matching elements set to all one bits, otherwise 0
T = E
SIMD compare T == X // updates T
break if any(X) != 0
n += 4
SIMD X = 16 * X  15 // In general the SIMD computation is (2^w)X  (2^w1) with vector length w
T = X
SIMD compare T >= P8 // updates T
SIMD AND T, P8 // updates T
SIMD X  T // updates X
T = X
SIMD compare T >= P4
SIMD AND T, P4
SIMD X  T
T = X
SIMD compare T >= P2
SIMD AND T, P2
SIMD X  T
T = X
SIMD compare T >= P
SIMD AND T, P
SIMD X  T
if any(X) == 0 {
result = n + "offset of element which is 0"
break;
}
}
 

compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,211,016,878 RAC: 1,196,437

The pseudocode in the previous post wasn't quite right.
The actual code snippets in this post are correct.
This isn't the full code, just a chunk for comparing speed.
As expected, the serial version is faster.
It takes around 24% less time than the AVX2 version.
I've begun to think that Intel's vector processing doesn't add more integer ALU resource to a core,
just more registers, and that vector processing is emulated in microcode with the existing serial ALU.
If Intel really added more ALU resource for SIMD, it would be largely wasted since very little code
uses it effectively.
When I get around to it, I'll try an AVX2 version that works on 4 independent primes in parallel
using the serial algorithm so that the modulo code is kept simple. Each element of the vector
must have independent loading and bookkeeping since each prime will cause the loop to exit
while other primes are still in progress.
serial version
#include <stdio.h>
static long offset;
static long cycle_len;
static void serial_version (long k, long prime) {
const long r0 = k+1; // n == 0
const long ini = 2*k+1; // n == 1
for (int i = 0; i < 100000; i++) { // repetitions for benchmarking
offset = 1L;
cycle_len = 1L;
long vec = ini; // vector of length 1 is a scalar
long n = 2;
for (; n <= prime  1; n++) {
vec = vec * 2  1;
if (vec >= prime) vec = prime;
if (vec == 0) {
offset = n;
break;
}
if (vec == r0) {
cycle_len = n+1;
break;
}
}
}
}
AVX2 version
#include <immintrin.h>
#include <stdlib.h>
#include <stdio.h>
static long offset;
static long cycle_len;
static void avx2_version (long k, long prime) {
/* memory for printing a vector */
__m256i *mem = (__m256i *) aligned_alloc(32, 32); // aligned on 32byte boundary
long *ptr = (long *) mem;
#define STORE(x) _mm256_store_si256(mem, x)
const __m256i c0 = _mm256_set_epi64x(0L, 0L, 0L, 0L);
const __m256i c15 = _mm256_set_epi64x(15L, 15L, 15L, 15L);
const __m256i pt1 = _mm256_set_epi64x(prime, prime, prime, prime);
const __m256i pt2 = _mm256_set_epi64x(prime*2, prime*2, prime*2, prime*2);
const __m256i pt4 = _mm256_set_epi64x(prime*4, prime*4, prime*4, prime*4);
const __m256i pt8 = _mm256_set_epi64x(prime*8, prime*8, prime*8, prime*8);
const __m256i pt1m1 = _mm256_set_epi64x(prime1, prime1, prime1, prime1);
const __m256i pt2m1 = _mm256_set_epi64x(prime*21, prime*21, prime*21, prime*21);
const __m256i pt4m1 = _mm256_set_epi64x(prime*41, prime*41, prime*41, prime*41);
const __m256i pt8m1 = _mm256_set_epi64x(prime*81, prime*81, prime*81, prime*81);
/* initial remainder */
const __m256i r0 = _mm256_set_epi64x(k+1, k+1, k+1, k+1); // n == 0
/* next remainder in 4 parallel streams */
const __m256i ini = _mm256_set_epi64x(2*k+1, 4*k+1, 8*k+1, 16*k+1); // n = (1, 2, 3, 4)
__m256i tst;
for (int i = 0; i < 100000; i++) { // repetitions for benchmarking
offset = 1L;
cycle_len = 1L;
__m256i vec = ini;
// STORE(vec);
// printf("vec: %ld %ld %ld %ld\n", ptr[3], ptr[2], ptr[1], ptr[0]);
long n = 5;
for (; n <= prime  1; n += 4) {
// vec = vec * 16  15;
vec = _mm256_sub_epi64(_mm256_slli_epi64(vec, 4), c15);
// vec = mod(vec, prime);
vec = _mm256_sub_epi64(vec, _mm256_and_si256(_mm256_cmpgt_epi64(vec, pt8m1), pt8));
vec = _mm256_sub_epi64(vec, _mm256_and_si256(_mm256_cmpgt_epi64(vec, pt4m1), pt4));
vec = _mm256_sub_epi64(vec, _mm256_and_si256(_mm256_cmpgt_epi64(vec, pt2m1), pt2));
vec = _mm256_sub_epi64(vec, _mm256_and_si256(_mm256_cmpgt_epi64(vec, pt1m1), pt1));
// STORE(vec);
// printf("vec: %ld %ld %ld %ld\n", ptr[3], ptr[2], ptr[1], ptr[0]);
// if (any(vec) == 0)
tst = _mm256_cmpeq_epi64(vec, c0);
if (! _mm256_testz_si256(tst, tst)) {
STORE(tst);
if (ptr[3] != 0) offset = n;
if (ptr[2] != 0) offset = n+1;
if (ptr[1] != 0) offset = n+2;
if (ptr[0] != 0) offset = n+3;
break;
}
// if (any(vec) == k+1)
tst = _mm256_cmpeq_epi64(vec, r0);
if (! _mm256_testz_si256(tst, tst)) {
STORE(tst);
if (ptr[3] != 0) cycle_len = n;
if (ptr[2] != 0) cycle_len = n+1;
if (ptr[1] != 0) cycle_len = n+2;
if (ptr[0] != 0) cycle_len = n+3;
break;
}
}
}
}
 

Message boards :
Number crunching :
hey! a more efficient sieve for b*2^k+1 