Assuming a 1 GHz CPU and 200 MHz memory, a memory read works out to 5 ticks. Floating-point multiply is what, 3 or 4 cycles? Call it 5 to be on the safe side. Add/sub can probably be done in a couple; call it 3. Total: 4*5 + 7 + 3*5 + 2*3 + 1*3 + 1 = 52 cycles.
Memory access cost will vary depending on whether the data is in cache. If it is, the access can be almost free. If not, you will pay a lot more than just 5 CPU cycles - we are talking 6-8 MEMORY cycles for reading a random place in memory, which in your example translates to 30-40 CPU cycles, and can be a lot more on a faster CPU. Hopefully x and y will be near each other, so the second read will already come from cache - but we are talking about a variance of 0 to 80 CPU cycles for memory access alone. A mispredicted branch at the end can easily cost 10-20 cycles, depending on the CPU type.
Depending on code layout, the computations can go in parallel, fitting well into the pipelines (though I would maybe try to move dx*dx into a temp variable before executing the second memory access). In any case, the cost of the rest of the code is probably in the 10-30 cycle range - plus the 0-80 cycle penalty for a possible main-memory access and a mispredicted branch.
If somebody cares to write such code in C (working on a realistic data set), it is easy to see the cache misses with AMD CodeAnalyst. It should also be possible to see actual pairing/pipeline stalls with it. Too bad it doesn't work with JIT-compiled Java code...