Please excuse me while I toot my own horn. Take a look at this:
C:\latency >run
latency
imul : 57 - 53 = 4
lea shl : 56 - 53 = 3
just lea: 55 - 53 = 2
just shl: 54 - 53 = 1
That’s right, bitches, I am dynamically measuring the latency of a single x86 instruction — accurate down to one cycle! That’s ~380 picoseconds on my hardware.
This is really hard (impossible?) to do without a serializing read time-stamp counter instruction.
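
For the curious, here is a minimal sketch of the idea in C. It is not the program that produced the output above. It assumes GCC or Clang on x86 (the `__rdtscp` intrinsic from `<x86intrin.h>` plus GCC-style inline asm), and the names `baseline_cycles`, `imul_cycles`, and `TRIALS` are made up for illustration. It times an empty region and a region holding one IMUL, keeps the smallest reading over many trials, and leaves the subtraction to the caller.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp on GCC/Clang; MSVC declares it in <intrin.h> */

#define TRIALS 100000    /* hypothetical trial count; more trials, less noise */

/* Smallest tick count seen between two back-to-back RDTSCP reads.
   This is the fixed overhead of the measurement itself. */
static uint64_t baseline_cycles(void)
{
    unsigned aux;                       /* receives TSC_AUX; not used here */
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < TRIALS; i++) {
        uint64_t t0 = __rdtscp(&aux);
        uint64_t t1 = __rdtscp(&aux);
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Same measurement with a single IMUL placed between the two reads. */
static uint64_t imul_cycles(void)
{
    unsigned aux;
    unsigned x = 3;                     /* operand for the instruction under test */
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < TRIALS; i++) {
        uint64_t t0 = __rdtscp(&aux);
        __asm__ volatile("imul %0, %0" : "+r"(x));   /* instruction under test */
        uint64_t t1 = __rdtscp(&aux);
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Subtracting `baseline_cycles()` from `imul_cycles()` gives the estimate, the same way as the `57 - 53 = 4` line above. Expect noise: on CPUs with an invariant TSC the counter ticks at a fixed reference rate rather than the core clock, so the difference may not be an exact cycle count.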

That’s cool - which CPU has this instruction? Is this on real hardware that you are running at Microsoft?
Everything since K8 rev F has had it. Check out RDTSCP in the AMD64 Architecture Programmer's Manual, Volume 3:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24594.pdf
Cool!
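
On the question of which CPUs have the instruction: support can also be checked at run time, since CPUID reports RDTSCP in extended leaf 0x80000001, EDX bit 27. A minimal sketch, assuming GCC or Clang on x86 (the `__get_cpuid` helper from `<cpuid.h>`); the function name `has_rdtscp` is made up for illustration.

```c
#include <cpuid.h>      /* __get_cpuid: GCC/Clang helper, x86 only */
#include <stdbool.h>

/* Returns true if CPUID reports RDTSCP support
   (extended leaf 0x80000001, EDX bit 27). */
static bool has_rdtscp(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return false;                   /* extended leaf not available */
    return ((edx >> 27) & 1u) != 0;
}
```

Calling `has_rdtscp()` before using the instruction avoids an invalid-opcode fault on older parts.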