Some notes on complex arithmetic with SSE2

SSE2 packs two doubles into a single XMM register, so the first temptation is to store the real and complex parts of one number per register. I've had little success with this approach: whilst it works well for addition and subtraction, it would be difficult to contrive a storage method which worked poorly for those operations.

For multiplication, the elements always seem to be in the wrong place, so many shuffle instructions become necessary. Whilst the P4 executes SHUFPD, UNPCKHPD and UNPCKLPD in the MMX_SHFT functional unit, so they don't interfere with adds and multiplies in FP_ADD and FP_MUL, the code is still unwieldy, and it sacrifices opportunities for SIMD optimisation

MOVAPD XMM2, XMM0 xmm2 = [x.r x.i]
UNPCKHPD XMM2, XMM0 xmm2 = [x.r x.r]
MULPD XMM2, XMM1 xmm2 = [x.r*y.r x.r*y.i]
MOVAPD XMM3, XMM0 xmm3 = [x.r x.i]
UNPCKLPD XMM3, XMM0 xmm2 = [x.i x.i]
MULPD XMM3, XMM1 xmm3 = [x.i*y.r x.i*y.i]
SHUFPD XMM3, XMM3, 2 xmm3 = [x.i*y.i x.i*y.r]
ORPD XMM3, negator xmm3 = [-x.i*y.i x.i*y.r]: negator is a bit-mask
ADDPD XMM2, XMM3 xmm2 = [x.r*y*r-x.i*y.i x.r*y.i+x.i*y.r]
Code for multiplying two packed complex numbers

I found that it was better to optimise at a slightly higher level. If we keep the real parts of two complex numbers in one XMM register and the imaginary parts in another, the standard sequence

inline cplx_sse2 operator*(cplx_sse2 a, cplx_sse2 b)
{
	cplx_sse2 c;
	c.r = _mm_sub_pd(_mm_mul_pd(a.r,b.r), _mm_mul_pd(a.i,b.i));
	c.i = _mm_add_pd(_mm_mul_pd(a.r,b.i), _mm_mul_pd(a.i,b.r));
	return c;
}

will compute the product of the two numbers, in four multiplies and two adds. For the Newton-set fractals, the critical operation is the simultaneous evaluation of two polynomials. One approach would be to pack a0 and b0 into one pair of memory locations, a1 and b1 into the next, and so on; we then load [x,x] into a register, and do something like

cplx_sse2 accum = coeffs[0]; // [a_0, b_0]
cplx_sse2 t = xx;
for (int i=1; i<=poldeg; i++)
{
  accum = accum + coeffs[i]*t;
  if (i!=poldeg) t = t*xx;
}

but this is slightly wasteful in that we are effectively computing the powers of x twice. I preferred an approach in which the layout was

a0.r a1.r a0.i a1.i b0.r b1.r b0.i b1.i

and t started as [1,x] and was repeatedly multiplied by [x2, x2]. This code looked like

cplx_sse2 x_power_0_1 = c2f2c(x, cfi(1));
cplx_sse2 x_power_2_2 = c2f2c(x, x);
x_power_2_2 = x_power_2_2 * x_power_2_2;
cplx_sse2 t0 = poly_coeffs_SSE2[0] * x_power_0_1;
cplx_sse2 t1 = poly_coeffs_SSE2[1] * x_power_0_1;
// OK, t0 is first two terms of p, t2 is first two terms of dp
int j=2;
for (int i=1; i<poly_terms_SSE2; i++,j+=2)
{
	x_power_0_1 = x_power_0_1 * x_power_2_2;
	t0 = t0 + poly_coeffs_SSE2[j] * x_power_0_1;
	t1 = t1 + poly_coeffs_SSE2[1+j] * x_power_0_1;
}
// polynomials now computed, but in two halves

cplx tt0 = recover_sum(t0); cplx tt2 = recover_sum(t1);
iter++; oldx = x; x=x-tt0/tt2;
where the recover_sum routine takes a pair of SSE2 registers [even.r odd.r] and [even.i odd.i] and returns [even.r+odd.r even.i+odd.i].