The inverse-trigonometric functions, particularly on the Pentium 4, are extremely slow; the only one in the instruction set is ATAN, and CHECK LINK the P4 Optimisation Guide (an absolutely indispensable reference for all performance-concerned programmers) claims that this takes between 150 and 300 cycles. Admittedly, it's a very general-purpose instruction, and is guaranteed to give 64 binary places of accuracy for almost arbitrarily unpleasant input values: but often you don't need that generality or that accuracy.
|
I decided to approximate arccos, since that was the function I needed for spherical texture mapping. From the graph to the left, you see that the arccos function is nearly-symmetrical around zero, and has infinite gradient at -1 and +1. Infinite gradient isn't a good sign for approximation algorithms, so, looking at the series expansion of arccos around 1, I decided to take advantage of the RSQRTPS instruction, and approximate the much more pleasant function sqrt(2x) arccos(1-x), whose graph you see below.
|
This function is nearly linear, so doesn't need too high a degree of polynomial approximant. I tried various degrees, constructing the approximation by a least-squares fit to the function measured at 100,000 points uniformly spaced between 0 and 1, and measuring the maximum difference between the approximation and the real function at rather more points. Degree 5 turned out to be adequate, and the relevant polynomial is
apx = -0.000007239283986332 + |
|