I had another idea: it turns out you can get a tiny speedup by ignoring cofactors right at the bottom of the 3LP range, say the bottom 2 bits of it. It looks like the improvement won't be any more than 1%, so I'm not sure it's worth bothering the developers about this.
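
To make the idea concrete, here is a rough sketch of the kind of filter I mean. Everything in it is an assumption for illustration: the names `lpb`, `mfb`, `skip_bits` and `should_attempt` are made up and not taken from any siever's source. I'm also assuming the 3LP range is "cofactors longer than `2*lpb` bits and at most `mfb` bits", and reading "the bottom 2 bits" as the two lowest bit-lengths in that range.

```python
def should_attempt(cofactor: int, lpb: int, mfb: int, skip_bits: int = 2) -> bool:
    """Decide whether a cofactor is worth passing to cofactorization.

    Hypothetical parameters (assumptions, not a real siever's API):
      lpb       -- large prime bound in bits (large primes are < 2**lpb)
      mfb       -- maximum cofactor size in bits
      skip_bits -- how many bit-lengths to drop at the bottom of the 3LP range
    """
    bits = cofactor.bit_length()

    # Small enough to be a 1LP/2LP cofactor: handle as usual.
    if bits <= 2 * lpb:
        return True

    # Too large for any allowed cofactor.
    if bits > mfb:
        return False

    # Inside the assumed 3LP range: ignore the lowest `skip_bits` bit-lengths,
    # where (per the idea above) the effort is presumably not paying off.
    return bits > 2 * lpb + skip_bits


if __name__ == "__main__":
    lpb, mfb = 32, 96  # example values only
    for c in ((1 << 64) - 1, 1 << 64, 1 << 65, 1 << 66, 1 << 96):
        print(c.bit_length(), should_attempt(c, lpb, mfb))
```

Setting `skip_bits=0` gives the unfiltered behaviour back, so the two can be timed against each other on the same special-q range to see whether the gain is really there.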
