To add to the very good explanation above: GPU stage 1 is multithreaded, so you can ask for more than one thread to drive the GPU. For problems of 100 digits and below, a GPU crunches through stage 1 so fast that, even ignoring the size optimization, it sits mostly idle. Throwing 2 or 3 threads at stage 1 (-t 2 or -t 3) generates hits about 30% faster purely from better GPU utilization, but you still have to boil all those hits down, and stage 2 is still the bottleneck. So this doesn't change the advice that a GPU is wasted on small problems; even for 150-digit polynomial selection, I've found that more than one thread is needed to keep the GPU fully occupied.
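
To put rough numbers on the bottleneck argument, here is a minimal Python sketch. It assumes the whole search behaves like a two-stage pipeline whose end-to-end rate is capped by the slower stage. The rates are made-up placeholders, not measurements; only the 30% figure comes from the paragraph above, and the function names are hypothetical.

```python
def pipeline_rate(stage1_rate, stage2_rate):
    """End-to-end rate of a two-stage pipeline: the slower stage sets the pace."""
    return min(stage1_rate, stage2_rate)


def backlog_growth(stage1_rate, stage2_rate):
    """Rate at which unprocessed stage 1 hits pile up when stage 2 cannot keep up."""
    return max(0.0, stage1_rate - stage2_rate)


# Placeholder rates in hits per hour (assumed purely for illustration).
stage1_one_thread = 1000.0
stage2 = 400.0

# "About 30% faster" stage 1 with 2-3 threads, as described in the post.
stage1_multi_thread = stage1_one_thread * 1.30

for label, s1 in [("1 thread", stage1_one_thread),
                  ("2-3 threads", stage1_multi_thread)]:
    print(f"{label}: stage 1 = {s1:.0f}/h, "
          f"end-to-end = {pipeline_rate(s1, stage2):.0f}/h, "
          f"backlog grows by {backlog_growth(s1, stage2):.0f}/h")
```

Under these assumed rates the extra threads only make the pile of unprocessed hits grow faster; the end-to-end rate does not move until stage 2 itself gets faster.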
