How about testing 2 candidate primes at the same time for most of the LLR iterations?
Since we are repeating the same calculations for different candidates, say X and Y, just with different modulii and different numbers of iterations, then use a common calculation for (X-1) iterations with the modulus which is the product of the candidates (X*Y).
At iteration X compute the LLR residue for candidate X as A (mod X), and replace the intermediate result A with A (mod Y), then continue as normally until iteration Y with A (mod Y).
Of course, we will be doing (X-1) iterations with a modulus twice as long, plus 1 additional (normal length) modulus operation.
But if the modulus code is faster than the rest of the code in each iteration, then we have a speedup by using the slower code just Y times for each 2 candidates instead of (X+Y) times.
There's a net speedup if the overhead of multiplying the candidates X and Y together, plus 1 additional normal length modulus operation, plus the extra time of using the longer modulus for (X-1) iterations,
is less than the time saved by not repeating the first X iterations of the slower code for candidate Y.
A(n+1) = An2-2 (mod X*Y)
At iteration X instead taking AX (mod X*Y)
compute LLR residue for candidate X = AX (mod X)
and replace AX with AX (mod Y)
then continue as normal until iteration Y with
A(n+1) = An2-2 (mod Y)