[I am aware that these techniques had been developed for mainframes and supercomputers by the mid-1970s; but I'm deliberately using Intel's processors as examples]
The ideal of a processor is fairly straightforward: you feed it instructions, it looks at them one by one and executes them. The instructions might perform arithmetic operations, load or store data to or from memory, perform a yes/no test, or jump (unconditionally or dependent on the result of a test) to some other point in the instruction stream; it surprises some to discover that, for most x86 code, about 50% of the instructions are loads and stores, about 13% flow-control, and only about 23% arithmetic (the remaining 14% mostly move data around between internal registers).
On the 286, the instructions are executed in the order they're read in; read an instruction, execute it, wait for the execution to finish and move on. The 386 and 486 introduced progressively more and more overlap to this process: in a particular clock-tick the machine might be reading in one instruction, decoding two more, and executing a fourth.
The Pentium was the first x86 superscalar chip. What this means is that it read instructions from the stream in pairs and, if a pair had the appropriate structure, was capable of executing both instructions at once.
On the Pentium Pro and its successors up to the P3, AMD's K6 and Athlon, and CPUs of that vintage from other manufacturers (the Alpha 21264, and the more recent exemplars of the Sparc, PA-RISC, MIPS and PowerPC ranges), instruction execution takes place out of order. The CPU maintains an internal pool of instructions, reading new ones to add to it whenever it can. Each cycle, it picks a small handful of instructions from the ones in the pool whose inputs it knows, starts executing them simultaneously, collects the results from the instructions it started earlier which have just finished, feeds the results to any instructions in the pool which are awaiting them, and, if it's convinced that the result is correct and final, updates memory with it.
In an ideal world, the CPU would continue happily in this way forever, executing perhaps four instructions each cycle. But there are various things which can disrupt it:
An instruction upon whose result everything else in the pool depends takes a long time to execute, and nothing can be done until it's finished.
A branch occurs which the CPU wasn't expecting; many of the instructions in the pool, fetched on the assumption that the branch didn't occur, are now useless and must be discarded and replaced. Computations currently in-flight made on that assumption must be discarded. Nothing can be done until the pool and the calculation pipelines refill.
Something - an interrupt from a network card, the need to visit virtual memory, a division-by-zero exception - interrupts the CPU; now it has to throw away all its internal state, service the interrupt, and then pick up precisely where it left off and continue.
The classic instructions which take a long time to execute are memory operations. Owing to the inconvenient constancy of the laws of physics, the fastest technology which is cheap enough to form the main memory of a PC is perhaps a factor of fifty slower than the CPU.
All CPUs nowadays use multi-layer caches, storing results which are expected to be frequently used in smaller quantities of much faster memory located physically nearer the CPU. But sometimes this doesn't work and you have to go all the way to main memory: on a 1000MHz Pentium 3, it might take many dozen cycles to load data from main memory, even if you ignore the baroquely complicated case when retrieving the data causes a page fault, and the page with the information on it is stored on an NFS-mounted drive attached to an unreliable server in Helsinki.
If you add the fact that, on a modern CPU, all access to memory goes via a memory-management unit which can only store a finite page-lookup table and needs to rely on an operating-system routine if working outside the table, it may take up to 400 1GHz P3 clock cycles to load one piece of data - enough time to perform literally thousands of arithmetic operations.
With all these layers of hierarchy - registers to L1 cache to L2 cache to TLB to main memory - it's very difficult to predict memory performance theoretically. It's fairly easy to measure it, though, using something like my Ptrchase benchmark.
You will have noticed the phrase 'which the CPU wasn't expecting' in my description of the problems of branching; the CPU tries to avoid the horrible effects of branching by predicting which way each branch will go. My Branchmark benchmark tries to measure the cost of mispredicted branches and the quality of a CPU's branch prediction.