One possibility is that with -t4 the parent thread is getting bounced from one processor to another, which can be costly on some machines. By running more child threads than there are processors, the scheduler has more options to run a waiting child thread instead of the parent thread, and so the parent might stay on the same processor for longer.
Normally the kernel scheduler will settle on the best arrangement itself if given a bit of time; it can take a few minutes. You could try to intervene manually by restricting the processor affinity of the parent thread after sr2sieve has started and spawned the child threads: taskset -pc <CPU> <PID> will restrict process PID to processor number CPU. However this limits the scheduler's options and so it probably won't help in the long run.
Another possibility, though it probably doesn't apply in your case, is that another program is running, in which case spawning more sr2sieve threads gives sr2sieve a bigger share of total processor time at the expense of that other program.
There are two different threading models I am considering for use in future:
1. Each thread has its own prime generator, and so all threads can generate primes concurrently. This model will use (4+4T) bytes per sieve prime for T threads.
2. All threads access a single prime generator in shared memory, but only one can use it to generate primes at a time. This model will use 8 bytes per sieve prime, independent of the number of threads.
Both of these models require less communication between threads than the current model, and neither requires a separate parent thread, so the bouncing problem described above will not arise. Ideally I would include both models and choose between them at runtime depending on how much memory is available and on the L2/L3 cache configuration.