SSE2 packs two doubles into a single XMM register, so the first temptation is to store the real and complex parts of one number per register. I've had little success with this approach: whilst it works well for addition and subtraction, it would be difficult to contrive a storage method which worked poorly for those operations.
For multiplication, the elements always seem to be in the wrong place, so many shuffle instructions become necessary. Whilst the P4 executes SHUFPD, UNPCKHPD and UNPCKLPD in the MMX_SHFT functional unit, so they don't interfere with adds and multiplies in FP_ADD and FP_MUL, the code is still unwieldy, and it sacrifices opportunities for SIMD optimisation
| MOVAPD | XMM2, XMM0 | xmm2 = [x.r x.i] |
| UNPCKHPD | XMM2, XMM0 | xmm2 = [x.r x.r] |
| MULPD | XMM2, XMM1 | xmm2 = [x.r*y.r x.r*y.i] |
| MOVAPD | XMM3, XMM0 | xmm3 = [x.r x.i] |
| UNPCKLPD | XMM3, XMM0 | xmm2 = [x.i x.i] |
| MULPD | XMM3, XMM1 | xmm3 = [x.i*y.r x.i*y.i] |
| SHUFPD | XMM3, XMM3, 2 | xmm3 = [x.i*y.i x.i*y.r] |
| ORPD | XMM3, negator | xmm3 = [-x.i*y.i x.i*y.r]: negator is a bit-mask |
| ADDPD | XMM2, XMM3 | xmm2 = [x.r*y*r-x.i*y.i x.r*y.i+x.i*y.r] |
I found that it was better to optimise at a slightly higher level. If we keep the real parts of two complex numbers in one XMM register and the imaginary parts in another, the standard sequence
inline cplx_sse2 operator*(cplx_sse2 a, cplx_sse2 b)
{
cplx_sse2 c;
c.r = _mm_sub_pd(_mm_mul_pd(a.r,b.r), _mm_mul_pd(a.i,b.i));
c.i = _mm_add_pd(_mm_mul_pd(a.r,b.i), _mm_mul_pd(a.i,b.r));
return c;
}
will compute the product of the two numbers, in four multiplies and two adds. For the Newton-set fractals, the critical operation is the simultaneous evaluation of two polynomials. One approach would be to pack a0 and b0 into one pair of memory locations, a1 and b1 into the next, and so on; we then load [x,x] into a register, and do something like
cplx_sse2 accum = coeffs[0]; // [a_0, b_0]
cplx_sse2 t = xx;
for (int i=1; i<=poldeg; i++)
{
accum = accum + coeffs[i]*t;
if (i!=poldeg) t = t*xx;
}
but this is slightly wasteful in that we are effectively computing the powers of x twice. I preferred an approach in which the layout was
| a0.r | a1.r | a0.i | a1.i | b0.r | b1.r | b0.i | b1.i |
and t started as [1,x] and was repeatedly multiplied by [x2, x2]. This code looked like
cplx_sse2 x_power_0_1 = c2f2c(x, cfi(1));
cplx_sse2 x_power_2_2 = c2f2c(x, x);
x_power_2_2 = x_power_2_2 * x_power_2_2;
cplx_sse2 t0 = poly_coeffs_SSE2[0] * x_power_0_1;
cplx_sse2 t1 = poly_coeffs_SSE2[1] * x_power_0_1;
// OK, t0 is first two terms of p, t2 is first two terms of dp
int j=2;
for (int i=1; i<poly_terms_SSE2; i++,j+=2)
{
x_power_0_1 = x_power_0_1 * x_power_2_2;
t0 = t0 + poly_coeffs_SSE2[j] * x_power_0_1;
t1 = t1 + poly_coeffs_SSE2[1+j] * x_power_0_1;
}
// polynomials now computed, but in two halves
cplx tt0 = recover_sum(t0); cplx tt2 = recover_sum(t1);
iter++; oldx = x; x=x-tt0/tt2;
where the recover_sum routine takes a pair of SSE2 registers [even.r odd.r]
and [even.i odd.i] and returns [even.r+odd.r even.i+odd.i].