Troubleshooting
-
Vern's elements of troubleshooting:
- Reproducibility;
- Annotation;
- Logging;
- Profiling;
- Breakpointing;
- Reverse Execution;
- Speculative Execution;
- Reaction;
- Interception;
- Sampling.
-
Architectural approaches to networking:
- Databases and knowledge planes?
- More communication between layers?
- More annotation?
- Reactive measurement?
- Design systems for self-monitoring and for easy diagnosis.
- How to architect the system to make it easy to answer "Why
can't I reach this web site?".
- How do we architect for troubleshooting? for measurement?
-
What is the problem?
(1) Troubleshooting problems in the current architecture; and
(2) Troubleshooting potential problems in a future architecture (e.g., a delay-tolerant network); and
(3) Designing a future architecture to allow troubleshooting; and
(4) General principles and observations on troubleshooting.
- Not ISP problems, but problems visible to end user?
Problems bigger than an ISP problem?
Problems on the local wireless network or small end network?
Problems that call for decentralized instead of centralized answers?
- Are we trying to identify the problem? The responsible party?
(e.g., the ISP? the web server?)
The way to avoid the problem?
(e.g., routing around the problem?)
- Areas: Security? Routing? Transport. Link failures. End node failures.
Middleboxes.
Email
- Early November phone call:
For a new design, include the right hooks for troubleshooting at the
beginning.
- 11/23 notes from a Vern/Mark chat:
Troubleshooting is hard because of modular design.
For troubleshooting the web, the email goes through the steps on
reproducibility, annotation, logging, profiling, breakpointing,
watchpointing, reverse execution, speculative execution, reaction,
interception, sampling.
Transaction IDs?
- 11/27 notes from Mark/Scott chat:
Cross-layer traceroute?
Tracking packets inside a node?
Logging, as in Ion and Scott's project for distributed debugging?
- 11/29 notes from a Vern/Scott/Sally chat:
What are the architectural constraints or guidelines imposed by the
need to make things troubleshoot-able? [e.g., OPES]
What are the architectural implications, if any, of the various
proposed troubleshooting mechanisms?
What about the particular troubleshooting problems posed by
those middleboxes, IP tunnels, etc., that wish to stay hidden?
Are these problems likely to go away in a future architecture?
How would one do troubleshooting in a different architecture,
e.g., with delay-tolerant networking?
- 1/12/06:
Vern is interested in causality as a basic building block.
"A Causal Network Architecture".
For domains with developed troubleshooting research, there is
already work on causality, in an application-specific rather than
a general way.
There may be no way to achieve a fine-grained causality, but
part of the work is to explore the possibilities.
We are not necessarily looking for the fine-grained cause, but
just looking for who to complain to. Who has responsibility?
(E.g., custody transfer in DTNs.)
Accountability?
Our second underlying concept is focusing on the user. Users should
give credible and actionable information.
What is missing from the old proposal is the overall blueprint, with
the key concepts.
Literature
Previous work below from ICSI is marked with "***".
Application Analysis
Routing
-
Di-Fa Chang, Ramesh Govindan, and John Heidemann,
Exploring The Ability of Locating BGP Missing Routes From Multiple Looking Glasses,
ACM SIGCOMM Network Troubleshooting Workshop, September 2004.
Automatically detecting missing routes in the BGP routing table.
- Jennifer Rexford's papers on
Troubleshooting routing problems.
Troubleshooting and Robustness
- When transport is designed to give robust performance in the
presence of middleboxes, then troubleshooting (for this problem) is not necessary.
So troubleshooting is needed when there is a failure of robustness.
Or alternately, automatic troubleshooting can be a mechanism towards
robustness.
Troubleshooting using Management Databases, Knowledge Planes, etc.
-
R. Kompella et al.,
Cross-layer Visibility as a Service, Hotnets 2005.
"In essence, a link at one layer (e.g., IP) consists of a
path - a sequence of components - at the next layer (e.g.,
fibers and optical amplifiers). Greater visibility across
layers would significantly improve network planning,
risk assessment, fault diagnosis, and network maintenance."
-
Jennifer Rexford's other papers on a
Network-wide Control Plane.
-
D.D. Clark et al.,
A Knowledge Plane for the Internet, SIGCOMM 2003.
"We propose a new objective for network research: to build a
fundamentally different sort of network that can assemble itself
given high level instructions, reassemble itself as requirements
change, automatically discover when something goes wrong, and
automatically fix a detected problem or explain why it cannot do so."
-
***
The Network Oracle,
J. Hellerstein, V. Paxson, L. Peterson, T. Roscoe, S. Shenker and D.
Wetherall,
Bulletin of the IEEE Computer Society Technical Committee on Data
Engineering, 28(1).
"This paper sets out a high-level research agenda aimed at building a
collaborative, global end-system monitoring
and information infrastructure for the Internet's core state."
-
M. Wawrzoniak, L. Peterson, and T. Roscoe,
Sophia: An Information
Plane for Networked Systems. Hot Topics in Networks, Nov. 2003.
"Sophia is a distributed system that collects,
stores, propagates, aggregates, and reacts to observations
about the network's current conditions."
Troubleshooting Problems Caused by Middleboxes, IP Tunnels, Intermediaries, etc.?
-
Middleboxes No Longer Considered Harmful,
M. Walfish, J. Stribling, M. Krohn, H. Balakrishnan, R. Morris and Scott
Shenker,
OSDI '04.
"We propose an extension to
the Internet architecture, called the Delegation-Oriented
Architecture (DOA), that not only allows, but also facilitates,
the deployment of middleboxes. DOA involves two
relatively modest changes to the current architecture: (a)
a set of references that are carried in packets and serve as
persistent host identifiers and (b) a way to resolve these
references to delegates chosen by the referenced host."
-
RFC 3238: IAB Architectural and Policy Considerations for Open Pluggable Edge Services,
Sally Floyd and Leslie Daigle, editors.
RFC 3238, Informational, January 2002.
"The overall OPES framework needs to assist
content providers in detecting and responding to client-centric
actions by OPES intermediaries that are deemed inappropriate by the
content provider."
"The overall OPES framework should assist end
users in detecting the behavior of OPES intermediaries, potentially
allowing them to identify imperfect or compromised intermediaries."
-
Measuring Interactions Between Transport Protocols and Middleboxes,
Alberto Medina, Mark Allman, and Sally Floyd,
Internet Measurement Conference 2004, August 2004.
"Advertising ECN prevents connection setup for a small (and diminishing)
set of hosts."
"Less than half of the web servers successfully complete Path MTU Discovery.
PMTUD is attempted but fails for one-sixth of the web servers."
"For roughly one-third of the web servers, no connection is
established when the client includes
an IP Record Route or Timestamp option in the TCP SYN packet.
For most servers, no connection is established when the client includes an
unknown IP Option."
Troubleshooting DNS Problems
-
IETF63 Review: DNS,
Jaap Akkerhuis and Peter Koch, IETF Journal, 2005.
"With the deployment of anycast for nameservers, there is a need to have
an identification of which server actually answered the question. This
would help debugging of anycast systems. Progress was made on the way
this should develop and a new ID by Rob Austein is expected."
-
DNS Name Server Identifier Option (NSID), R. Austein,
Sept. 2005, draft-ietf-dnsext-nsid-00.
"With the increased use of DNS anycast, load balancing, and other
mechanisms allowing more than one DNS name server to share a single
IP address, it is sometimes difficult to tell which of a pool of name
servers has answered a particular query." ... "This note defines a
protocol extension to support this functionality."
-
Distributed DNS Troubleshooting, Pappas, Faltstrom, Massey, Zhang,
SIGCOMM 2004 Workshop on Network Troubleshooting.
"We present a troubleshooting tool designed to identify a number of DNS
configuration errors."
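As a rough illustration of the NSID mechanism cited above, the sketch below builds a DNS query that asks the answering name server to identify itself. It assumes option code 3 and the empty-payload request convention from RFC 5001, the later published form of the draft; the -00 draft cited here may differ in detail. The function names, the default query ID, and the 4096-byte UDP size are illustrative choices, not taken from the source.

```python
import struct

NSID_OPTION_CODE = 3   # assumption: value from RFC 5001, not from the -00 draft
OPT_RR_TYPE = 41       # EDNS0 OPT pseudo-record type


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += struct.pack("!B", len(raw)) + raw
    return out + b"\x00"


def build_nsid_query(name: str, query_id: int = 0x1234, udp_size: int = 4096) -> bytes:
    """Build a DNS A query whose OPT record carries an empty NSID option."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    # Question: name, QTYPE=A (1), QCLASS=IN (1)
    question = encode_name(name) + struct.pack("!HH", 1, 1)
    # An NSID option with zero-length data asks the server to identify itself.
    option = struct.pack("!HH", NSID_OPTION_CODE, 0)
    # OPT RR: root name, TYPE=41, CLASS=UDP payload size, TTL=0, RDLENGTH, RDATA
    opt_rr = b"\x00" + struct.pack("!HHIH", OPT_RR_TYPE, udp_size, 0, len(option)) + option
    return header + question + opt_rr
```

Sending these bytes over UDP to port 53 of an anycast address and reading the NSID option back from the OPT record in the response would show which server instance answered, provided the server supports the option. Response parsing is left out of the sketch.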
Troubleshooting for ISPs
-
FALCON: Fault Alarm Correlation for IP Networks, Matthias Grossglauser.
"The first goal of the FALCON project is to collect and to analyze
information about faults occurring in AT&T's IP backbone, and to gain an
understanding of the relationship between low-level faults and network
reliability as experienced by the customer. The second goal of this
project is to devise both support tools for network operators to make
fault management more effective and efficient, and design guidelines for
the networking infrastructure to reduce fault occurrence and to contain
their symptoms."
-
A Survey of Papers on Fault Management.
Measurements
-
M. Grossglauser and J. Rexford,
Passive Traffic Measurement for IP Operations,
The Internet as a Large-Scale Complex System (Kihong Park
and Walter Willinger, eds.), Oxford University Press, 2005.
"Traffic measurement plays a crucial role in providing operators with a
detailed view of the state of their networks."
Collecting Information for Troubleshooting
-
***
Providing Packet Obituaries,
K. Argyraki, P. Maniatis, D. Cheriton and Scott Shenker,
Hotnets III, 2004.
"A packet's obituary
should be returned to every AS along the path of a
packet as well as the source."
"We expect that
each ISP will deploy some accountability infrastructure that
will communicate with its counterparts in the neighboring
ISPs."
"Packet audits are always sent
at periodic intervals."
- *** Witness: An Architecture for Developing Behavioral History,
Workshop on Steps to Reducing Unwanted Traffic on the Internet (SRUTI),
Mark Allman, Ethan Blanton, Vern Paxson.
July 2005.
"We envision ... devising a system that can accumulate
reports of unwanted traffic in a general fashion across the entire
network... A user of the database makes local decisions regarding the
degree to which to trust information in the database, primarily in terms
of the user's local assessment of the submitter's reputation."
- The Friends Troubleshooting Network:
Qiang Huang, Helen Wang, and Nikita Borisov, Network and Distributed
System Security Symposium, 2005.
"We construct the Friends Troubleshooting
Network (FTN), a peer-to-peer overlay network, where the links
between peer machines reflect the friendship of their owners."
- N. Duffield and M. Grossglauser,
Trajectory Sampling for Direct Traffic Observation,
IEEE/ACM Trans. on Networking, vol 9, no 3, June 2001.
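The central idea of the trajectory sampling work cited above is that every measurement point applies the same hash to the parts of a packet that do not change from hop to hop, so a packet is sampled either everywhere along its path or nowhere. The sketch below illustrates that idea only. SHA-256 is a stand-in and not the hash function from the paper, and the function names and the choice of invariant bytes are hypothetical.

```python
import hashlib


def sample_packet(invariant_bytes: bytes, sampling_rate: float) -> bool:
    """Return True if the packet should be sampled.

    The decision depends only on packet content that stays the same at
    every hop, so all routers using this function agree on the outcome.
    """
    digest = hashlib.sha256(invariant_bytes).digest()
    value = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    return value < int(sampling_rate * 2**64)


def packet_label(invariant_bytes: bytes) -> str:
    """Return a short label for matching reports of one packet across routers."""
    return hashlib.sha256(b"label:" + invariant_bytes).hexdigest()[:8]
```

A collector that receives (router, label) reports for sampled packets can line up reports with equal labels to reconstruct the path each sampled packet took. A packet that stops appearing after some hop points at where it was lost.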
Tools (Commercial and Otherwise)
-
PC Pitstop's Internet Connection Center
"Are downloads taking too long to load--even with a high-speed
connection? Does your ISP say "It's not our problem?" PC Pitstop's
Internet Connection Center can help. This is the place to start to
diagnose and hopefully fix connection problems for your home or office
system. Our series of diagnostics will help you determine if you're
getting optimal performance from your connection. The tools can also
help diagnose whether the problem is your modem, your PC, the site
you're visiting or your ISP."
-
S. Kandula, D. Katabi, and J. P. Vasseur.
Shrink: A Tool for Failure Diagnosis in IP Networks. ACM SIGCOMM MineNet
Workshop, Aug. 2005.
"We present Shrink, a tool for root cause analysis of network faults
which, given a set of failed IP links, identifies the underlying cause
of the faulty state... First, it effectively accounts for noisy
measurement and inaccurate mapping between the IP and optical layers.
Second, it has an efficient inference algorithm that finds the most
likely failure causes in polynomial time and with bounded errors."
-
Smarts.
"SMARTS InCharge Infrastructure Management solutions deliver automated,
real-time root cause and impact analysis of network problems, and give
you critical lead-time to act before business services are affected...
SMARTS puts you InCharge of your infrastructure management."
-
Network Forensics: Tapping the Internet.
S. Garfinkel. O'Reilly, 2002.
"Where it once took the prowess of a national laboratory to
systematically monitor all of the information sent over its external
Internet connection, now this capability is available to all."
"Another approach to monitoring is to examine all of the traffic that
moves over the network, but only record information deemed worthy of
further analysis."
"Build a Monitoring Workstation."
-
TCP/IP Analysis and Troubleshooting Toolkit,
A Wiley textbook by Kevin Burns, 2003.
"Very little literature exists on how to troubleshoot and analyze
TCP/IP when things go wrong, especially in very complicated networks that
now run on top of a maze of different vendor hardware and software with
some to none interoperability."
-
Troubleshooting TCP/IP,
a chapter in the 1997 O'Reilly textbook by Craig Hunt on
TCP/IP Network Administration.
-
Practical TCP/IP: Designing, Using, and Troubleshooting TCP/IP Networks
on Linux and Windows,
2003 Addison-Wesley textbook by Niall Mansfield.
-
Troubleshooting TCP/IP:
Cisco: Troubleshooting TCP/IP,
Microsoft: How to Troubleshoot TCP/IP Connectivity with Windows XP.
-
Random articles on DNS troubleshooting:
Admin's Choice: DNS TroubleShooting,
MOREnet: DNS TroubleShooting,
Cyberguard: DNS Troubleshooting - Everything Depends on It
-
Random articles on network troubleshooting:
Cyberguard: Network Troubleshooting A Complex Process Made Simple,
Computing at Cornell: Troubleshooting
-
Random articles on email troubleshooting:
Acme: Troubleshooting Email Problems,
Scott Forsyth: Troubleshooting Email, the Telnet Way,
Kavi: Guide to Troubleshooting Email,
Toastnet: Spam and Email Troubleshooting
Other
-
Old ICSI Proposal.
User interfaces;
Application analysis;
Network radar: "In addition to understanding the history of events
that lead up to a problem, a troubleshooting system also needs to
understand the operating environment of the host experiencing the
problems."
Auxiliary measurement, including reactive measurements;
New measurement tools.
- *** Distributed Debugging: what is the paper to cite for this?
- *** Annotation Layer: what is the paper to cite for this?
- *** Papers on security monitoring?
- *** Papers on multi-layer traceback?
Diagnostics?
To add:
Network Forensics? Which paper is this?
Danzig, self-monitoring?
Parasitic computing?
Henning Schulzrinne on user-centered troubleshooting?