The entire memory hierarchy is measured, including onboard cache latency and size, external cache latency and size, main memory latency, and TLB miss latency.
Only data accesses are measured; the instruction cache is not measured.
The benchmark runs as two nested loops: the outer loop steps through the stride sizes, and the inner loop steps through the array sizes. For each array size, the benchmark creates a ring of pointers, each pointing forward one stride. The array is traversed by
p = (char **)*p;
in a for loop (the overhead of the for loop is not significant; the loop body is unrolled to 1000 loads). The loop stops after doing a million loads.
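The pointer ring and its traversal can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual source: the helper names build_ring and chase are hypothetical, the stride is assumed to be a multiple of the pointer size, and the traversal is a plain loop rather than the unrolled one described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: lay out a ring of pointers inside buf.
 * Slot i lives at buf + i * stride and points at slot i + 1;
 * the last slot wraps around to the first.
 * Assumes buf is pointer-aligned (e.g. obtained from malloc).
 */
static char **build_ring(char *buf, size_t size, size_t stride)
{
    /* Each slot must be able to hold an aligned pointer. */
    assert(stride >= sizeof(char *) && stride % sizeof(char *) == 0);
    assert(size >= stride);

    size_t n = size / stride;           /* number of slots in the ring */
    for (size_t i = 0; i < n; i++) {
        char **slot = (char **)(buf + i * stride);
        *slot = buf + ((i + 1) % n) * stride;
    }
    return (char **)buf;
}

/*
 * Hypothetical helper: perform `loads` dependent loads around the ring.
 * Each load's address comes from the previous load, so the loads cannot
 * overlap. Returning p keeps the compiler from discarding the loop.
 */
static char **chase(char **p, size_t loads)
{
    while (loads-- > 0)
        p = (char **)*p;
    return p;
}
```

Timing a call such as chase(head, 1000000) and dividing the elapsed time by the number of loads would give an approximate per-load latency for that array size and stride.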
The size of the array varies from 512 bytes to (typically) eight megabytes. For the small sizes, the array fits in cache, so the loads will be much faster. This becomes much more apparent when the data is plotted.
As a rough guide, you may be able to extract the latencies of the various parts as follows, but you should really look at the graphs, since these rules of thumb do not always work (some systems do not have onboard cache, for example).