to save on gates, the idea is to share a pair of 32-bit multiply stages to produce 64-bit results. this will likely require the 64-bit FMUL to be a variable-length pipeline that carries out a matrix of HI-word / LO-word 32x32-bit multiplies and sums the shifted partial products. if either word in a given HI/LO pairing is zero, that partial product is zero, so the corresponding 32x32-bit multiply need not be performed.
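
the HI/LO decomposition can be sketched in software as below. this is only a behavioural model under stated assumptions, not the pipeline itself: it treats the operands as plain unsigned 64-bit integers (a real FMUL would feed in mantissas and handle exponent, sign and rounding separately), and the names `mul64_via_32`, `MASK32` and `n_mul` are hypothetical, chosen for illustration.

```python
MASK32 = (1 << 32) - 1  # hypothetical helper constant: low 32 bits


def mul64_via_32(a, b):
    """Multiply two unsigned 64-bit values using only 32x32-bit products.

    Returns (product, n_mul), where n_mul is how many 32x32-bit
    multiplies were actually needed (between 0 and 4).
    """
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32

    # the 2x2 matrix of word pairings, each with the shift its
    # partial product needs before being summed
    pairings = (
        (a_lo, b_lo, 0),
        (a_lo, b_hi, 32),
        (a_hi, b_lo, 32),
        (a_hi, b_hi, 64),
    )

    total = 0
    n_mul = 0
    for x, y, shift in pairings:
        if x == 0 or y == 0:
            continue  # partial product is zero: skip the multiply
        total += (x * y) << shift
        n_mul += 1
    return total, n_mul
```

for example, `mul64_via_32(0x0000000100000002, 3)` needs only two of the four multiplies, because the HI word of the second operand is zero. how `n_mul` maps onto pipeline length depends on how the two shared multiply stages are scheduled, which this sketch does not model.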