Performance Debugging for Distributed Systems of Black Boxes

Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener
HP Labs, Palo Alto
{Marcos.Aguilera, Jeff.Mogul, Janet.Wiener}@hp.com

Patrick Reynolds
Duke University
reynolds@cs.duke.edu

Athicha Muthitacharoen
MIT Lab for Computer Science
athicha@lcs.mit.edu

Motivation
o Modern large-scale system architectures -- hook together a bunch of other people's stuff
o Many programs: load balancer, web server, app server, ad server, database, LDAP, NAS, ...
o Each program is complex
o Many machines (dozens? hundreds? thousands?)

Enterprise
  400K desktops
  70K databases
  50K servers
  7K app servers
  100K "excel databases"

Goals
o Performance tuning, identifying bottlenecks
o Anomaly detection
o Failure detection, diagnosis, repair
o Capacity planning, evolution of infrastructure
...

Design space
continuum
  black box -- cannot change or instrument code internals; just look at message patterns (not even at contents of messages)
  gray box -- between the extremes. E.g., allow looking at some internals of messages (size, request/reply pairing, protocol)
  clear box -- can look at and perhaps modify code internals and message formats. E.g., generate important events, tag each message with a request ID, ...

advantages/disadvantages

tech trends? How will they affect these trade-offs?
  encryption
  system scale
  open source

This paper
  (mostly) black-box tracing
  off-line analysis
  goal -- identify sources of latency in the call graph

Approach
(1) RPC nesting
o Gray box -- assume call/return pairs are matched (or use a simple FIFO heuristic)
o Frequency "scoreboard"
  Keep track of outstanding possible parents.
  When a node sends an outbound message at time T_out, look at each of the k incoming possible-parent messages p, where p arrived at time T_pin, and increment scoreboard[T_out - T_pin] += 1/k
o Select from the possible parents the one with the highest scoreboard value as the "real" parent
  tweak raw score -- overlapping child penalty, same child penalty, generic-child penalty

(2) Generic -- convolution
  s_in(t) -- bitvector -- did the node receive a msg in interval t
  s_out(t) -- bitvector -- did the node send a msg in interval t
  C(t) -- cross-correlation of s_in and s_out
  "roughly, C(t) has a spike at position d iff s_out(t) contains a copy of s_in(t) time-shifted by d"

Evaluation
2 goals
(1) Can the algorithm identify sources of latency?
(2) Is this information useful to a programmer who is debugging the system?

How convincing is (1)?
o scale of experiments? (petstore v. real e-commerce)
o complexity/realism of experiments
o "Each new trace caused us to improve our approach" (e.g., by adding another penalty knob)
o magic numbers
  how to set the overlapping child penalty, same child penalty, generic-child penalty, and v in the convolution algorithm w/o ground truth for tuning?

How convincing is (2)?
o no case study examples...
o how useful is this data?
o could imagine that most black boxes would at least be able to tell you "how long is the avg request taking? How long is a request of type t taking?" How is this better? (Possible answer:

A rock and a hard place?
o Can black box tracing really work?
  How would the algorithms proposed here work for, say, Google search (which may go to dozens or hundreds of machines in parallel with high concurrency)?
o Can clear box tracing really work?
  Few companies with large-scale systems will have access to all code
  What are the odds of a tracing standard emerging?

How can we improve black box tracing?

Key research question: What are the limits of what we can accomplish with black box tracing v. clear box tracing?

Extensions
Could you use black box tracing for anomaly detection, fault diagnosis, ...
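
Sketch of the two approaches
To make the scoreboard and cross-correlation ideas from the Approach notes concrete, here is a minimal Python sketch for a single node. It is an illustration under stated assumptions, not the paper's implementation: timestamps are small integers, the `window` cutoff that limits which arrivals count as possible parents is a hypothetical parameter, the function names are invented for this example, the penalty tweaks are omitted, and the cross-correlation is a direct sum rather than anything optimized.

```python
from collections import defaultdict


def possible_parents(t_out, arrivals, window):
    """Arrivals that precede t_out by at most `window` (hypothetical cutoff)."""
    return [t_in for t_in in arrivals if 0 <= t_out - t_in <= window]


def build_scoreboard(arrivals, departures, window):
    """Histogram over delays: each outbound message spreads a total weight
    of 1 evenly (1/k each) over its k possible parents."""
    scoreboard = defaultdict(float)
    for t_out in departures:
        parents = possible_parents(t_out, arrivals, window)
        for t_in in parents:
            scoreboard[t_out - t_in] += 1.0 / len(parents)
    return scoreboard


def pick_parent(t_out, arrivals, scoreboard, window):
    """Choose the possible parent whose delay has the highest score."""
    parents = possible_parents(t_out, arrivals, window)
    return max(parents, key=lambda t_in: scoreboard[t_out - t_in], default=None)


def cross_correlation(s_in, s_out, max_shift):
    """C(d) = sum over t of s_in[t] * s_out[t + d], for d = 0..max_shift."""
    n = min(len(s_in), len(s_out))
    return [
        sum(s_in[t] * s_out[t + d] for t in range(n - d))
        for d in range(max_shift + 1)
    ]


# Toy trace (made up): every request is answered 3 time units later.
arrivals = [0, 10, 20, 30]
departures = [3, 13, 23, 33]

sb = build_scoreboard(arrivals, departures, window=15)
parent = pick_parent(23, arrivals, sb, window=15)

# Bitvectors: one slot per time unit, 1 if a message was seen in that slot.
s_in = [1 if t in arrivals else 0 for t in range(40)]
s_out = [1 if t in departures else 0 for t in range(40)]
c = cross_correlation(s_in, s_out, max_shift=15)
spike = c.index(max(c))
```

On this toy trace both methods point at the same delay: the scoreboard gives the most weight to a delay of 3, so the message sent at time 23 is attributed to the arrival at time 20, and C(d) peaks at d = 3. With real traces the delays would need bucketing, which is one place where the "magic numbers" concern above shows up.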