Troubleshooting

Vern's elements of troubleshooting:
- Reproducibility;
- Annotation;
- Logging;
- Profiling;
- Breakpointing;
- Reverse Execution;
- Speculative Execution;
- Reaction;
- Interception;
- Sampling
Architectural approaches to networking:
- Databases and knowledge planes?
- More communication between layers?
- More annotation?
- Reactive measurement?
- Design systems for self-monitoring and for easy diagnosis.
- How to architect the system to make it easy to answer "Why can't I reach this web site?".
- How do we architect for troubleshooting? for measurement?
What is the problem?
(1) Troubleshooting problems in the current architecture; and
(2) Troubleshooting potential problems in a future architecture (e.g., a delay-tolerant network); and
(3) Designing a future architecture to allow troubleshooting; and
(4) General principles and observations on troubleshooting.
- Not ISP problems, but problems visible to end user? Problems bigger than an ISP problem? Problems on the local wireless network or small end network? Problems that call for decentralized instead of centralized answers?
- Are we trying to identify the problem? The responsible party? (e.g., the ISP? the web server?) The way to avoid the problem? (e.g., routing around the problem?)
- Areas: Security? Routing? Transport. Link failures. End node failures. Middleboxes.

Email

Early November phone call:
For a new design, include the right hooks for troubleshooting at the beginning.
11/23 notes from a Vern/Mark chat: Troubleshooting is hard because of modular design.
For troubleshooting the web, the email goes through the steps on reproducibility, annotation, logging, profiling, breakpointing, watchpointing, reverse execution, speculative execution, reaction, interception, sampling.
Transaction IDs?
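One way to picture the transaction-ID idea, combined with annotation and logging: every step taken on behalf of one user action carries the same opaque ID, so the log entries from different layers can be stitched back together afterwards. The sketch below is only an illustration under assumptions; `fetch_page`, `log_event`, `trace`, and the in-memory `LOG` are hypothetical names, and no real DNS or transport calls are made.

```python
import uuid

# Hypothetical in-memory log; a real system would log per layer or per host.
LOG = []

def new_transaction_id():
    """Return a fresh opaque ID to carry through every layer."""
    return uuid.uuid4().hex

def log_event(txn_id, layer, message):
    """Record one annotated event for the given transaction."""
    LOG.append({"txn": txn_id, "layer": layer, "msg": message})

def trace(txn_id):
    """Return the events of one transaction, in the order they were logged."""
    return [(e["layer"], e["msg"]) for e in LOG if e["txn"] == txn_id]

def fetch_page(url, dns_ok=True):
    """Hypothetical web fetch that tags each simulated step with the same ID."""
    txn = new_transaction_id()
    log_event(txn, "application", f"GET {url}")
    if not dns_ok:
        # Simulated failure: the trace ends at the layer that failed.
        log_event(txn, "dns", "lookup failed")
        return txn
    log_event(txn, "dns", "lookup ok")
    log_event(txn, "transport", "connection established")
    return txn
```

Calling `trace(fetch_page("http://example.org/", dns_ok=False))` would show the application request followed by the failed lookup, which is the kind of cross-layer record that could help answer "Why can't I reach this web site?".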
11/27 notes from Mark/Scott chat: Cross-layer traceroute?
Tracking packets inside a node?
Logging, as in Ion and Scott's project for distributed debugging?
11/29 notes from a Vern/Scott/Sally chat:
What are the architectural constraints or guidelines imposed by the need to make things troubleshoot-able? [e.g., OPES]
What are the architectural implications, if any, of the various proposed troubleshooting mechanisms?
What about the particular troubleshooting problems posed by those middleboxes, IP tunnels, etc., that wish to stay hidden? Are these problems likely to go away in a future architecture?
How would one do troubleshooting in a different architecture, e.g., with delay-based networking?
1/12/06: Vern is interested in causality as a basic building block. "A Causal Network Architecture".
For domains with developed troubleshooting research, there is already work on causality, in an application-specific rather than a general way.
There may be no way to achieve a fine-grained causality, but part of the work is to explore the possibilities.
We are not necessarily looking for the fine-grained cause, but just looking for who to complain to. Who has responsibility? (E.g., custody transfer in DTNs.) Accountability?
Our second underlying concept is focusing on the user. Users should give credible and actionable information.
What is missing from the old proposal is the overall blueprint, with the key concepts.

Literature

Previous work below from ICSI is marked with "***".

Application Analysis

Prasad Calyam, Weiping Mandrawa, Mukundan Sridharan, Arif Khan, and Paul Schopis. H.323 Beacon: An H.323 Application Related End-to-End Performance Troubleshooting Tool. In ACM SIGCOMM Workshop on Network Troubleshooting, September 2004.
Troubleshooting an individual audio/video application.

Routing

Di-Fa Chang, Ramesh Govindan, and John Heidemann, Exploring The Ability of Locating BGP Missing Routes From Multiple Looking Glasses, ACM SIGCOMM Network Troubleshooting Workshop, September 2004.
Automatically detecting missing routes in the BGP routing table.
Jennifer Rexford's papers on Troubleshooting routing problems.

Troubleshooting and Robustness

When transport is designed to give robust performance in the presence of middleboxes, then troubleshooting (for this problem) is not necessary. So troubleshooting is needed when there is a failure of robustness. Or alternately, automatic troubleshooting can be a mechanism towards robustness.

Troubleshooting using Management Databases, Knowledge Planes, etc.

R. Kompella et al, Cross-layer Visibility as a Service, Hotnets 2005.
"In essence, a link at one layer (e.g., IP) consists of a path - a sequence of components - at the next layer (e.g., fibers and optical amplifiers). Greater visibility across layers would significantly improve network planning, risk assessment, fault diagnosis, and network maintenance."
Jennifer Rexford's other papers on a Network-wide Control Plane.
D.D. Clark et al, A Knowledge Plane for the Internet, SIGCOMM 2003.
"We propose a new objective for network research: to build a fundamentally different sort of network that can assemble itself given high level instructions, reassemble itself as requirements change, automatically discover when something goes wrong, and automatically fix a detected problem or explain why it cannot do so."
*** The Network Oracle, J. Hellerstein, V. Paxson, L. Peterson, T. Roscoe, S. Shenker and D. Wetherall, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 28(1).
"This paper sets out a high-level research agenda aimed at building a collaborative, global end-system monitoring and information infrastructure for the Internet's core state."
M. Wawrzoniak, L. Peterson, and T. Roscoe, Sophia: An Information Plane for Networked Systems. Hot Topics in Networks, Nov. 2003.
"Sophia is a distributed system that collects, stores, propagates, aggregates, and reacts to observations about the network's current conditions."

Troubleshooting Problems Caused by Middleboxes, IP Tunnels, Intermediaries, etc.?

Middleboxes No Longer Considered Harmful, M. Walfish, J. Stribling, M. Krohn, H. Balakrishnan, R. Morris and Scott Shenker, OSDI '04.
"We propose an extension to the Internet architecture, called the Delegation-Oriented Architecture (DOA), that not only allows, but also facilitates, the deployment of middleboxes. DOA involves two relatively modest changes to the current architecture: (a) a set of references that are carried in packets and serve as persistent host identifiers and (b) a way to resolve these references to delegates chosen by the referenced host."
RFC 3238: IAB Architectural and Policy Considerations for Open Pluggable Edge Services, Sally Floyd and Leslie Daigle, editors. RFC 3238, Informational, January 2002.
"The overall OPES framework needs to assist content providers in detecting and responding to client-centric actions by OPES intermediaries that are deemed inappropriate by the content provider."
"The overall OPES framework should assist end users in detecting the behavior of OPES intermediaries, potentially allowing them to identify imperfect or compromised intermediaries."
Measuring Interactions Between Transport Protocols and Middleboxes, Alberto Medina, Mark Allman, and Sally Floyd, Internet Measurement Conference 2004, August 2004.
"Advertising ECN prevents connection setup for a small (and diminishing) set of hosts."
"Less than half of the web servers successfully complete Path MTU Discovery. PMTUD is attempted but fails for one-sixth of the web servers."
"For roughly one-third of the web servers, no connection is established when the client includes an IP Record Route or Timestamp option in the TCP SYN packet. For most servers, no connection is established when the client includes an unknown IP Option."

Troubleshooting DNS Problems

IETF63 Review: DNS, Jaap Akkerhuis and Peter Koch, IETF Journal, 2005.
"With the deployment of anycast for nameservers, there is a need to have an identification of which server actually answered the question. This would help debugging of anycast systems. Progress was made on the way this should develop and a new ID by Rob Austein is expected."
DNS Name Server Identifier Option (NSID), R. Austein, Sept. 2005, draft-ietf-dnsext-nsid-00.
"With the increased use of DNS anycast, load balancing, and other mechanisms allowing more than one DNS name server to share a single IP address, it is sometimes difficult to tell which of a pool of name servers has answered a particular query." ... "This note defines a protocol extension to support this functionality."
Distributed DNS Troubleshooting, Pappas, Faltstrom, Massey, Zhang, SIGCOMM 2004 Workshop on Network Troubleshooting.
"We present a troubleshooting tool designed to identify a number of DNS configuration errors."

Troubleshooting for ISPs

FALCON: Fault Alarm Correlation for IP Networks, Matthias Grossglauser.
"The first goal of the FALCON project is to collect and to analyze information about faults occurring in AT&T's IP backbone, and to gain an understanding of the relationship between low-level faults and network reliability as experienced by the customer. The second goal of this project is to devise both support tools for network operators to make fault management more effective and efficient, and design guidelines for the networking infrastructure to reduce fault occurrence and to contain their symptoms."
A Survey of Papers on Fault Management.

Measurements

M. Grossglauser and J. Rexford, Passive Traffic Measurement for IP Operations, The Internet as a Large-Scale Complex System (Kihong Park and Walter Willinger, eds.), Oxford University Press, 2005.
"Traffic measurement plays a crucial role in providing operators with a detailed view of the state of their networks."

Collecting Information for Troubleshooting

*** Providing Packet Obituaries, K. Argyraki, P. Maniatis, D. Cheriton and Scott Shenker, Hotnets III, 2004.
"A packet's obituary should be returned to every AS along the path of a packet as well as the source." "We expect that each ISP will deploy some accountability infrastructure that will communicate with its counterparts in the neighboring ISPs." "Packet audits are always sent at periodic intervals."
*** Witness: An Architecture for Developing Behavioral History, Workshop on Steps to Reducing Unwanted Traffic on the Internet (SRUTI), Mark Allman, Ethan Blanton, Vern Paxson. July 2005.
"We envision ... devising a system that can accumulate reports of unwanted traffic in a general fashion across the entire network... A user of the database makes local decisions regarding the degree to which to trust information in the database, primarily in terms of the user's local assessment of the submitter's reputation."
The Friends Troubleshooting Network: Qiang Huang, Helen Wang, and Nikita Borisov, Network and Distributed System Security Symposium, 2005.
"We construct the Friends Troubleshooting Network (FTN), a peer-to-peer overlay network, where the links between peer machines reflect the friendship of their owners."
N. Duffield and M. Grossglauser, Trajectory Sampling for Direct Traffic Observation, IEEE/ACM Trans. on Networking, vol 9, no 3, June 2001.
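The central idea of trajectory sampling, as it is usually described, is that each measurement point decides whether to sample a packet by hashing the packet fields that do not change from hop to hop, so a packet is sampled either everywhere along its path or nowhere. A second hash gives a short label for reporting. The sketch below illustrates only that idea and is not the paper's implementation; the choice of SHA-256, the salts, the byte-string stand-in for the invariant fields, and the names `is_sampled`, `label`, and `observe` are all assumptions.

```python
import hashlib

def _digest(invariant_bytes, salt):
    """Map invariant packet content to a 64-bit integer (SHA-256 is an assumption)."""
    h = hashlib.sha256(salt + invariant_bytes).digest()
    return int.from_bytes(h[:8], "big")

def is_sampled(invariant_bytes, rate=0.01):
    """Sampling decision that depends only on packet content, not on the observer."""
    return _digest(invariant_bytes, b"select") < rate * 2**64

def label(invariant_bytes, bits=20):
    """Short identifier reported in place of the full packet."""
    return _digest(invariant_bytes, b"label") % (1 << bits)

def observe(packets, rate=0.01):
    """Labels that one measurement point would report for the packets it sees."""
    return {label(p) for p in packets if is_sampled(p, rate)}
```

Because the decision is a function of the packet alone, two measurement points that run `observe` with the same `rate` report the same labels for the packets they have in common, and a collector could match those labels to reconstruct each sampled packet's path.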

Tools (Commercial and Otherwise)

PC Pitstop's Internet Connection Center
"Are downloads taking too long to load--even with a high-speed connection? Does your ISP say "It's not our problem?" PC Pitstop's Internet Connection Center can help. This is the place to start to diagnose and hopefully fix connection problems for your home or office system. Our series of diagnostics will help you determine if you're getting optimal performance from your connection. The tools can also help diagnose whether the problem is your modem, your PC, the site you're visiting or your ISP."
S. Kandula, D. Katabi, and J. P. Vasseur. Shrink: A Tool for Failure Diagnosis in IP Networks. ACM SIGCOMM MineNet Workshop, Aug. 2005.
"We present Shrink, a tool for root cause analysis of network faults which, given a set of failed IP links, identifies the underlying cause of the faulty state... First, it effectively accounts for noisy measurement and inaccurate mapping between the IP and optical layers. Second, it has an efficient inference algorithm that finds the most likely failure causes in polynomial time and with bounded errors."
Smarts.
"SMARTS InCharge Infrastructure Management solutions deliver automated, real-time root cause and impact analysis of network problems, and give you critical lead-time to act before business services are affected... SMARTS puts you InCharge of your infrastructure management."
Network Forensics: Tapping the Internet. S. Garfinkel. O'Reilly, 2002.
"Where it once took the prowess of a national laboratory to systematically monitor all of the information sent over its external Internet connection, now this capability is available to all." "Another approach to monitoring is to examine all of the traffic that moves over the network, but only record information deemed worthy of further analysis." "Build a Monitoring Workstation."
TCP/IP Analysis and Troubleshooting Toolkit, A Wiley textbook by Kevin Burns, 2003.
"Very little literature exists on how to troubleshoot and analyze TCP/IP when things go wrong, especially in very complicated networks that now run on top of a maze of different vendor hardware and software with some to none interoperability."
Troubleshooting TCP/IP, a chapter in the 1997 O'Reilly textbook by Craig Hunt on TCP/IP Network Administration.
Practical TCP/IP: Designing, Using, and Troubleshooting TCP/IP Networks on Linux and Windows, 2003 Addison-Wesley textbook by Niall Mansfield.
Troubleshooting TCP/IP:
Cisco: Troubleshooting TCP/IP,
Microsoft: How to Troubleshoot TCP/IP Connectivity with Windows XP.
Random articles on DNS troubleshooting:
Admin's Choice: DNS TroubleShooting,
MOREnet: DNS TroubleShooting,
Cyberguard: DNS Troubleshooting - Everything Depends on It
Random articles on network troubleshooting:
Cyberguard: Network Troubleshooting A Complex Process Made Simple,
Computing at Cornell: Troubleshooting
Random articles on email troubleshooting:
Acme: Troubleshooting Email Problems,
Scott Forsyth: Troubleshooting Email, the Telnet Way,
Kavi: Guide to Troubleshooting Email,
Toastnet: Spam and Email Troubleshooting

Other

Old ICSI Proposal.
User interfaces;
Application analysis;
Network radar: "In addition to understanding the history of events that lead up to a problem, a troubleshooting system also needs to understand the operating environment of the host experiencing the problems."
Auxiliary measurement, including reactive measurements;
New measurement tools.
*** Distributed Debugging: what is the paper to cite for this?
*** Annotation Layer: what is the paper to cite for this?
*** Papers on security monitoring?
*** Papers on multi-layer traceback?

SIGCOMM 2004 Workshop: Network Troubleshooting: Research, Theory and Operations Practice Meet Malfunctioning Reality

Diagnostics?

To add:
Network Forensics? Which paper is this?
Danzig, self-monitoring?
Parasitic computing?
Henning Schulzrinne on user-centered troubleshooting?