A higher IPC is not a bad thing, provided it holds for the instructions Prime95 mainly uses (SSE2 or x87 ops). What we are currently seeing is either
  • a roughly similar IPC for FP code (the throughput is the same, with only small differences in latencies), but at a lower clock (K7/K8 vs. P4), or
  • a lower IPC for FP code and also a lower clock (P-M vs. P4).

Future P-M derivatives like Yonah will have a somewhat improved FPU IPC. Some early benchmark results (Cinebench) have shown per-clock performance improvements, but too many factors differed in that test to attribute the gain to the core alone. Besides the core changes, the cache design, FSB, chipset and maybe even the type of memory differed from the Dothan-based system used for comparison.

But neither Intel's nor AMD's CPU designs will stop at the current throughput of 1 fp mul + 1 fp add per cycle per core. Besides increasing the number of execution units by adding cores, there will also be designs with increased throughput per core.

An Intel CPU developer, who recently switched to AMD, wrote a dissertation about multiscalar CPUs. Such CPUs (if they are ever implemented in hardware) could take parts of a single thread that would normally run serially and execute them in parallel, as long as their data dependencies and other conditions permit. This implies that such CPUs would have many more execution units than current ones. Even different loop iterations could be executed in parallel. But even without such dramatically new architectures, we will at least see architectures with higher throughput.
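
To make the point about data dependencies a bit more concrete, here is a rough toy model in Python. It is not how a multiscalar CPU actually schedules work; it only assumes a loop whose body is a single operation with a fixed latency, running on non-pipelined execution units. The function name and all numbers are made up for illustration.

```python
import math

def min_cycles(iterations, latency, units, loop_carried):
    """Toy lower bound on cycles for a loop with a one-operation body.

    Assumptions (illustrative only): every iteration is one operation of
    `latency` cycles, and the `units` execution units are not pipelined.
    """
    if loop_carried:
        # Each iteration needs the previous result, so extra units do not help.
        return iterations * latency
    # Independent iterations can be spread across the units in batches.
    return math.ceil(iterations / units) * latency

# Hypothetical example: 8 iterations, 4-cycle latency, 4 units.
print(min_cycles(8, 4, 4, loop_carried=True))   # dependent chain
print(min_cycles(8, 4, 4, loop_carried=False))  # independent iterations
```

With these made-up numbers the dependent chain needs 32 cycles and the independent loop 8, which is why extra execution units only pay off when the dependencies allow it.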

Just going to full-width 128-bit SSEn operation in hardware would give us twice the throughput per clock that we have now. Currently the P4, K8 and P-M perform their 128-bit SSEn operations as two 64-bit operations, which the units execute sequentially. IIRC, only the latest VIA CPU (C7 or Esther) is capable of doing full 128-bit SSEn operations.
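
The factor of two can be checked with some back-of-the-envelope arithmetic. The Python sketch below assumes a single, fully pipelined execution unit that needs one pass per unit-width slice of a packed operation; the function names and the op count are hypothetical and not taken from any vendor documentation.

```python
def passes_per_op(op_bits=128, unit_bits=64):
    """Passes through the unit for one packed operation (ceiling division)."""
    return -(-op_bits // unit_bits)

def issue_cycles(packed_ops, op_bits=128, unit_bits=64):
    """Cycles one fully pipelined unit needs to issue `packed_ops` operations."""
    return packed_ops * passes_per_op(op_bits, unit_bits)

def doubles_per_cycle(op_bits=128, unit_bits=64, element_bits=64):
    """Double-precision results per cycle from one unit under this model."""
    elements_per_op = op_bits // element_bits
    return elements_per_op / passes_per_op(op_bits, unit_bits)

# 64-bit unit (128-bit op split into two halves) vs. a full-width 128-bit unit.
print(issue_cycles(1000, unit_bits=64), doubles_per_cycle(unit_bits=64))
print(issue_cycles(1000, unit_bits=128), doubles_per_cycle(unit_bits=128))
```

Under these assumptions a 64-bit unit delivers one double per cycle, and a full-width unit delivers two, i.e. the same 1000 packed ops issue in half the cycles.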
