Troubleshooting by Layer

The earlier sections of this chapter explained a general troubleshooting methodology. When going
through this methodology, it is often helpful to approach the problem in a logical manner that leverages
the OSI model. Therefore, Cisco has started backing a model of troubleshooting that does just
that. This model has three distinct approaches: bottom-up, top-down, and divide-and-conquer.
Bottom-Up Troubleshooting Approach
As the name implies, when you use the bottom-up troubleshooting approach, you start with the
bottom (the Physical layer of the OSI model) and work your way up to the top (the Application
layer). This approach is used when you suspect the problem is at the Physical layer, or when you
are troubleshooting a complex network problem. In these situations, ensuring that the core components
required for networking are in place can go a long way toward isolating the problem.

The downside to bottom-up troubleshooting is that it can require the checking of each interface
along the path to see if errors are occurring there. Depending on the length of the path from
the end points of the problem, this process can be very time-consuming. In these cases, determining
the most likely culprit based on the symptoms of the trouble can save a lot of time.
Top-Down Troubleshooting Approach
If you suspect that the problem lies in a piece of software, then top-down troubleshooting
should be used. You start by testing the application and work down the OSI layers to find the
source of the problem. The challenge to this type of troubleshooting is that you need to check
all the user’s network applications in order to find the one that is causing the errors. This is a
potentially time-consuming troubleshooting method if there are a large number of applications
that could be the source of the trouble.
Divide-and-Conquer Troubleshooting Approach
The divide-and-conquer troubleshooting approach allows you to select the specific layer (Data
Link, Network, or Transport) of the OSI model in which to begin troubleshooting. You make
your selection based on experience with similar problems in the past, along with the specific
symptoms of the current trouble. After selecting the layer you wish to start with, the next task
is to determine the direction of the problem by determining whether the problem exists at,
above, or below this layer. Most commonly this is done by studying output from the IOS commands
on the router or through analysis of the output of network management tools. Once the
direction of the problem is determined, you continue troubleshooting through the OSI model in
that direction until you isolate the difficulty.
Often you can check the first four layers (Physical through Transport) by using the traceroute
command.
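The three approaches differ mainly in the order in which the OSI layers are examined. The following is a minimal Python sketch of that ordering, not part of any Cisco tool. The function name layer_order and its parameters are hypothetical, and the default starting layer of "Network" is only an assumption for illustration.

```python
from typing import List

# OSI layers, listed from the bottom (Physical) to the top (Application).
OSI_LAYERS = [
    "Physical", "Data Link", "Network", "Transport",
    "Session", "Presentation", "Application",
]


def layer_order(approach: str, start: str = "Network",
                problem_is_above: bool = False) -> List[str]:
    """Return the order in which layers are examined for an approach."""
    if approach == "bottom-up":
        return list(OSI_LAYERS)
    if approach == "top-down":
        return list(reversed(OSI_LAYERS))
    if approach == "divide-and-conquer":
        i = OSI_LAYERS.index(start)
        if problem_is_above:
            return OSI_LAYERS[i:]                  # start layer, then upward
        return list(reversed(OSI_LAYERS[:i + 1]))  # start layer, then downward
    raise ValueError(f"unknown approach: {approach}")


print(layer_order("divide-and-conquer", "Network"))
```

With divide-and-conquer, the direction is chosen only after the starting layer has been tested, which is why it is passed in as a separate flag here.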
Summary
With the complexity of today’s networks, it is important to adhere to a troubleshooting model
to aid in efficiently and effectively isolating and resolving network problems.
Various methods of problem isolation and the troubleshooting method itself help administrators
pinpoint problem areas and foresee future trouble. Troubleshooting skills are gained
through experience. It is unreasonable to expect that you can jump in on your first network failure
and be able to solve it quickly. Experience is the best teacher. Following a problem-solving
model helps you to reach a timely solution to network failures. It helps to know your network,
but the “shooting-from-the-hip” style of troubleshooting is nowhere near as effective as a
methodical and logical process.

Using the three steps of the Cisco troubleshooting model in order is a clear, calculated, and
logical way to make a network run more smoothly. The three methods of problem isolation
(bottom-up, top-down, and divide-and-conquer) are more subjective, and it is up to each individual
to use the appropriate method for the problem that they are facing. It is important to document
changes so you have a trail of what was done on the network. Finally, it’s important to
reverse any network alterations that did not correct the problem.
Exam Essentials
Know the three steps to the Cisco troubleshooting model and the function that each step performs.
The three steps to the Cisco troubleshooting model are gather symptoms, isolate the
problem, and correct the problem. The steps are repeated if necessary. Once a problem is resolved, documentation
should be updated.
Know the troubleshooting methodologies and how to use them.
These troubleshooting methodologies are bottom-up, top-down, and divide-and-conquer. In addition
to understanding them, know when it is most appropriate to use each method.
Be able to apply the Cisco troubleshooting methodology to example situations.
Know how to apply each step of the troubleshooting model in real-life scenarios. You should be able to
determine what step in a troubleshooting scenario is next in the series, and understand how to
correlate a task with the correct step in the process.

Looks Can Be Deceiving

One common mistake when observing the results of a change is seeing symptoms go away
and assuming that the problem has been solved. For example, assume that users are complaining
about slow response time while accessing the Internet. In the course of troubleshooting,
you find and correct some suboptimally configured interface settings on the
router on the users’ segment. You then go back to the user who originally reported the problem.
She reports that everything is running fine now. However, she neglects to mention the
fact that there was a shift change, and now only two people are connecting to the Internet
where there used to be 50. The next day, when all of the users are back online, the problem
repeats itself. If an analysis of the observations had been done, it would have demonstrated
that the traffic flow to the Internet had dropped off and that this could be a contributing factor
to the improvement in response times.
As is demonstrated in this example, failure to analyze your observations creates the risk that
important information can be overlooked and the problem will recur. To avoid this possibility,
make sure to look at the entire scope of the problem. Use your network management tools to
help you determine whether the problem is really resolved. You can also look at your network
baseline information to find out what the “normal” traffic pattern looks like. In this example, it
should show a sharp drop-off in utilization when the shift changes. This would tell you that the
improvement in connection speed may not be due to the interface changes you’ve made, but
rather due to a lower volume of traffic. More verification may be needed.
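The check described in this example can be sketched in a few lines of Python. This is only an illustration: the function name fix_is_verified, the 20 percent load tolerance, and the response-time figures are assumptions, while the user counts of 50 and 2 come from the scenario.

```python
def fix_is_verified(response_ms_before: float, response_ms_after: float,
                    users_before: int, users_after: int,
                    load_tolerance: float = 0.2) -> bool:
    """Accept an improvement only if it was observed under comparable load."""
    improved = response_ms_after < response_ms_before
    # Load counts as comparable when the user count is within the tolerance
    # of the count seen while the problem was occurring.
    comparable_load = abs(users_after - users_before) <= load_tolerance * users_before
    return improved and comparable_load


# Response times are hypothetical; 50 users before the shift change, 2 after.
print(fix_is_verified(900.0, 120.0, users_before=50, users_after=2))
```

A result of False here does not mean the interface changes were wrong. It means the observation cannot confirm them until the test is repeated under normal load.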

Step 3: Correct the Problem

The investigation gave you three leads about the source of the problem. Now it’s a matter of
checking out each possibility and determining which one is most likely the source of the issue.

The majority of the possibilities point directly at the host machine, so start there. The first
two causes are host configuration issues. Now, assume that you’ve checked the TCP/IP configuration
on the host and everything is configured properly. You can eliminate the host machine
as the culprit.
You then move on to the remaining possible cause, which is an access list on the router. While
looking at the configuration on the router, you see that an access list is applied to the Ethernet
interface directly connected to the host segment. After reviewing the syntax of the access list, you
determine that it is the cause of the failure.
Great—you’ve found the problem. Now what? Once you find the problem, you must decide
what is needed to fix it. In this case, it is an access-list problem, so there are some special considerations
about how to restore functionality. You must be careful in your actions here, because that
access list may contain other entries that provide security or other network administrative functionality.
You can’t just remove the list—you could cause new problems as you fix the original one.
The best thing to do in this situation is to make a copy of the access list in a text editor,
and then make changes that are specific to your problem. When editing the access list, change
its number. After all of the changes are made in your text editor, ensure that you have a current
backup of the configuration on the router in case you need to restore the original configuration.
Then paste the modified access list back into the router. Finally, go to the interface
and apply the new access list. By following this procedure, the access list is never removed
from the interface.
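The copy, renumber, and reapply procedure can be sketched as a small Python helper that prepares the text to paste into the router. The list numbers 101 and 102, the access-list entries, and the interface name Ethernet0 are hypothetical, since the example does not show the actual list. The helper only builds text and does not talk to a router.

```python
from typing import List


def renumber_acl(acl_lines: List[str], old: int, new: int) -> List[str]:
    """Copy numbered access-list lines under a new list number."""
    prefix = f"access-list {old} "
    return [f"access-list {new} {line[len(prefix):]}"
            for line in acl_lines if line.startswith(prefix)]


def swap_commands(interface: str, new: int, direction: str = "in") -> List[str]:
    """Commands that point the interface at the new list in one step."""
    return [f"interface {interface}", f" ip access-group {new} {direction}"]


# Hypothetical original list; the entries are illustrative only.
original = [
    "access-list 101 deny tcp any any eq ftp",
    "access-list 101 permit ip any any",
]
edited = renumber_acl(original, 101, 102)
edited[0] = "access-list 102 permit tcp any any eq ftp"  # the correction
for line in edited + swap_commands("Ethernet0", 102):
    print(line)
```

Because the interface is pointed at list 102 while list 101 still exists, there is no moment at which the interface has no access list applied.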
Obviously, you have now changed the access-list number that is applied to the interface, so
any documentation that refers to the original number will need to be updated. Alternatively, if the access list
that was causing the problem was applied only to Ethernet 0, you can now safely remove
the old list, re-create it under its original number with the corrections that address your problem, and
put it back on the router. Then reapply this list to Ethernet 0. As was the case before, the interface is never
left without an access list.
When you are going through the troubleshooting methodology, it is important that you
don’t fix one problem and cause another. Before implementing any changes, think it through or
discuss it with coworkers to pick it apart, and make sure that your solution will fix the problem
without doing anything to create adverse side effects.
Another good practice when implementing changes is to change only one thing at a time, if
possible. If multiple changes must be made, it is best to make the changes in small sets. This way
it is easier to keep track of what was done, what worked, and what didn’t. Observing the effects
of a change becomes much more effective if only a single change is made at a time. There is nothing
worse than troubleshooting your self-induced errors in addition to the original difficulties!
To summarize, follow these practices and guidelines when making changes:

Make one change or a set of related changes at a time, and then observe the results.

Make non-impacting changes—this means trying not to cause other problems while implementing
the changes. The more transparent the change, the better.

Do not create security holes when changing access lists, TACACS+, RADIUS, or other
security-oriented configurations.

Most importantly, make sure you can revert to the original configuration if unforeseen problems
occur as a result of the change. Always have a backup or copy of the configuration.

In the preceding paragraphs, there were references to observing the results of the changes.
Observing results consists of using the exact same methods and commands that were used to
obtain information to gather symptoms—to see whether the changes you implemented had the
results you want. By making a change and then testing its effectiveness, you move toward the
correct solution.
It may take one or more changes to fix the problem, but you should observe each change separately
to monitor progress and to make sure that the alteration doesn’t create any adverse
effects. After the first change is made, you should be able to gather enough information to learn
whether or not the modification was effective, even if it doesn’t entirely solve the problem.

If the changes made have corrected the problem, move on and document the modifications
that were made to the network. If the changes did not work, you need to go back and either
gather more information or try another one of the potential issues that you identified while
isolating the problem.
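The practice of making one change, observing the result, and backing out a change that did not help can be sketched as a loop. This is a minimal illustration, assuming the configuration can be modeled as a dictionary. The names try_changes, Config, and Change are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, str]
Change = Tuple[str, Callable[[Config], Config]]


def try_changes(config: Config, candidates: List[Change],
                problem_fixed: Callable[[Config], bool]
                ) -> Tuple[Config, Optional[str], List[str]]:
    """Apply candidate changes one at a time, reverting those that do not help."""
    log: List[str] = []
    for name, apply_change in candidates:
        backup = dict(config)               # always keep a way back
        trial = apply_change(dict(config))  # make exactly one change
        if problem_fixed(trial):            # re-run the original tests
            log.append(f"kept: {name}")
            return trial, name, log
        config = backup                     # undo a change that did not help
        log.append(f"reverted: {name}")
    return config, None, log
```

If the function returns None as the name of the successful change, every candidate has been tried and reverted, which is the cue to go back and gather more information.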
Iterations—repetitions of certain steps within the troubleshooting model—are simply ways
of whittling away at a larger problem. By implementing changes and monitoring the results, you
can move toward solving the overall problem.

Iterations of the troubleshooting process allow you to focus with more and more detail on
the possible causes of the failure. The result of focusing on the problem is your ability to identify
more specific possibilities for the failure.
The iteration process has its own set of steps: While working through the process, you might
get more ideas of possible sources of the trouble. Write them down; if the current changes do
not work, you have notes about some other options. If you feel that you have exhausted all of
the possible causes, you should probably go back and gather more information. You will probably
find additional clues.
This is also the time to undo any changes that had adverse effects or that did not fix the problem.
Make sure to document what was done, so it will be easier to undo any configuration
modifications.

Document the Changes
The network problem has been officially resolved after you’ve implemented a change, observed
that the symptoms have disappeared, and can successfully execute the tests that were used to aid
in gathering information about the problem. In this example, the way to verify that the problem
is solved is for Host A to try to ftp to Host Z. If this test is successful, then the problem is resolved.
In the previous sections, we have emphasized that documentation is an integral part of troubleshooting.
When you keep track of the alterations that were made, the routers, switches, or hosts that
were changed, and when the changes occurred, you have valuable information for future reference.
There is always the possibility that something you changed might have affected something else and
you didn’t notice it. If this happens, you will have documentation to refer to, so you can undo the
changes. Or if a similar problem occurs in the future, you can refer to these documents to resolve the
new problem, based on what was done the last time. Later chapters in this book will give you more
information about documentation and establishing baseline information.
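One possible way to keep track of what was changed, where, and when is a simple record per change. The sketch below is an assumption about how such a log might be structured, not a format prescribed by Cisco. The names ChangeRecord and changes_for are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class ChangeRecord:
    device: str        # router, switch, or host that was changed
    description: str   # what was altered
    when: datetime     # when the change occurred
    resolved: bool     # whether the verification test passed afterward


def changes_for(log: List[ChangeRecord], device: str) -> List[ChangeRecord]:
    """Return a device's changes, newest first, for undoing or later reference."""
    return sorted((r for r in log if r.device == device),
                  key=lambda r: r.when, reverse=True)
```

Listing the newest change first matches the order in which changes would normally be undone.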

Step 2: Isolate the Problem

This step within the troubleshooting model is used to contemplate the possible causes of the failure.
Obviously, it is quite easy to create a very long list of possible causes. That’s why it’s so
important to gather as much relevant information as you can in the gathering symptoms phase.
By defining the problem and assigning the corresponding boundaries, the resulting list of possible
causes diminishes because the entries in the list will be focused on the actual problem and
not on “possible” problems.
First, review what you know about your sample problem:

Host A can’t ftp to Host Z.

Host A can’t ftp to any host on Campus B.

Host A can’t ping to anywhere outside its own network.

Host A can ftp to any host on its own network.

All other hosts on Host A’s network can ftp to Host Z, as well as to other hosts.
Based on what you know, you now need to list possible causes. These possible causes are
as follows:

No default gateway is configured on Host A.

The wrong subnet mask is configured.

There is a misconfigured access list on the router connected to the switch on Campus A.
If you had not gathered such specific information in step 1, this list could have included all
possible problems with any piece of equipment between Host A and Host Z. That would have
been a long list, and it would take a lot of time to eliminate all of the possible causes.
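The first two possible causes can be reasoned about with Python's standard ipaddress module. The sketch below shows how a host decides whether a destination is local, assuming proxy ARP is not masking the error. All addresses are hypothetical, because the example network's addressing is not given, and next_hop is an illustrative name.

```python
import ipaddress


def next_hop(host_ip: str, mask: str, gateway: str, dest_ip: str) -> str:
    """Describe where a host sends a packet, given its IP configuration."""
    network = ipaddress.ip_interface(f"{host_ip}/{mask}").network
    if ipaddress.ip_address(dest_ip) in network:
        return "direct"            # host ARPs for the destination itself
    if not gateway:
        return "unreachable"       # off-network and no default gateway
    return f"via {gateway}"


# Hypothetical addressing: Host A is 10.1.1.10, Host Z is 10.2.1.20.
print(next_hop("10.1.1.10", "255.255.255.0", "10.1.1.1", "10.2.1.20"))  # correct setup
print(next_hop("10.1.1.10", "255.255.255.0", "", "10.2.1.20"))          # no gateway
print(next_hop("10.1.1.10", "255.0.0.0", "10.1.1.1", "10.2.1.20"))      # mask too short
```

In both faulty cases, hosts on the local network are still reached directly, which is consistent with Host A being able to ftp to hosts on its own network while failing to reach anything outside it.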
Remember that because these are only possible causes, you still have to choose the most
likely option, implement it, and observe to see whether the changes made were effective. When
the list of possible problems is long, it may require more iterations of the problem-solving steps
to actually solve the problem. In this example, you have only three possible causes, so this is a
much more manageable list. Although there may be other possible causes that you can think of
(and it’s great that you can do that), for this example and in the interest of simplicity, only these
three are listed.
Here’s where it gets interesting. You now have to check each of these possibilities and fix
them if they are the cause of the problem.

Step 1: Gather Symptoms

As you can see, the user’s problem is vague; you need more information if you are to solve the
problem any time soon. This is where the first step comes in. Gathering symptoms is the step in
the troubleshooting model when details about the problem are gathered from as many sources
as is practical. These symptoms can come from a number of sources, including but not limited
to the network devices, users, monitoring tools, and console messages.
Now, while you still have the user on the line, the first step is to ask him what he means when
he says he can’t “get to” Host Z. The user then defines the situation by telling you that he can’t
ftp to Host Z. Ask the user if he experiences any other difficulties or if this is the only one. Verify
where the user is currently located. After these preliminary questions, you’ll have a basic idea of
what is and isn’t working. Unfortunately, you can’t simply assume that FTP is broken, because
there are many other pieces of the network that can contribute to this problem.
At this point, the problem is still pretty vague and needs more definition. Additional information
should include data that excludes other possibilities and helps pinpoint the actual problem.
An example in the case we’re discussing is to verify whether you can ping, traceroute, or
telnet to Host Z, thus reducing the number of possible causes.
Depending on the user and situation, you may or may not be able to get more detailed information.
It is up to you as a network engineer or administrator to solve the problem, which
means that you may have to get the information yourself.
It is important that you gain as much information as possible to actually define the problem
correctly. Without a proper and specific definition of the problem, it will be much harder to isolate
and resolve. Information that is useful for gathering symptoms is listed in Table 33.1.

TABLE 33.1  Useful Information for Gathering Symptoms

Information      Example
Symptoms         Can’t telnet, ftp, or get to the WWW.
Reproducibility  Is this a one-time occurrence, or does it always happen?
Timeline         When did it start? How long did it last? How often does it occur? Has the
                 current configuration ever worked properly?
Scope            What are you able to access successfully via Telnet or FTP? Which WWW
                 sites can you reach, if any? Who else does this affect?
Baseline Info    Were any recent changes made to the network configurations?

All of this information can be used to guide you to the actual problem and to create the problem
statement. Use your network topology diagram and check each item in Table 33.1. Once
you are done talking to the user, you need to define what is working and what isn’t.
Figure 33.4 is a picture of your network. Although the large X on the Frame Relay cloud represents
that there is an FTP connectivity issue, it does not indicate the location of the failure.
Right now, all you know is that a single user cannot ftp to Host Z.

Reproduce the Problem
Before spending time and effort trying to solve this problem, verify that it is still a problem.
Troubleshooting is a waste of time and resources if the problem can’t be reproduced. It’s just
like a dog chasing its tail. If the issue is intermittent, further steps should be taken to capture as
much information as possible about the event the next time it does occur. This will help narrow
down the scope of items you will look at.

Understand the Timeline
In addition to verifying whether the problem is reproducible, it is important to investigate the
frequency of the problem. For instance, maybe it happens only once or twice a day. By establishing
a timeframe you can more readily identify any possible causes. In addition, you need to
know whether this is the first time the user has attempted this function. There is a different set
of variables involved with an item that worked yesterday but not today than there is with something
that fails during first-time use. Obviously, if it worked yesterday, you can look at what
changed overnight and look for something that is broken. If the user has never used this feature
before, there may be an existing access list or other security device that has only now been activated
by the user’s initial use of this application.
Determine the Scope of a Problem
Next, you need to find out whether anyone else is unable to ftp to Host Z. If others can ftp to Host Z
(for the sake of this example, assume that they can), you can be pretty sure that the problem is
specific to the user, either on their station or on the destination host. This step determines the scope
of the problem and helps to differentiate between a user-specific problem and a more widely spread
problem. Figure 33.5 shows that other hosts can ftp to Host Z without any problems.
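Determining scope amounts to comparing the same test across several hosts. A minimal sketch follows, assuming the results of the ftp test have already been collected by hand or by a monitoring tool. The function name problem_scope and the category labels are hypothetical.

```python
from typing import Dict


def problem_scope(results: Dict[str, bool], reporter: str) -> str:
    """Classify scope from per-host test results (True means the test passed)."""
    failing = {host for host, ok in results.items() if not ok}
    if not failing:
        return "not reproduced"
    if failing == {reporter}:
        return "user-specific"
    if failing == set(results):
        return "widespread"
    return "partial"


# Results for the ftp test to Host Z; Host B and Host C are hypothetical peers.
print(problem_scope({"Host A": False, "Host B": True, "Host C": True}, "Host A"))
```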

Cisco Troubleshooting Model

Imagine trying to solve a network failure by using a different approach every time. With today’s
complex networks, the possible scenarios would be innumerable. Because so many different
things can go wrong within a network, it’s possible to start from many different points. Not
only is this an ineffective method of troubleshooting, but it’s also time-consuming, and time is
very valuable in a “network down” situation.
Cisco has designed an effective troubleshooting model that contains three steps. A troubleshooting
model is a list of troubleshooting steps or processes that can be followed to provide an
efficient manner of resolving network problems. The headings in this section contain information
specific to each step of the troubleshooting model. After the three steps are completed and the
problem is resolved, a few more actions follow, such as documenting the problem-solving events.
To be effective when troubleshooting and to achieve faster resolution times, follow the model
outlined in Figure 33.2. This flow chart shows the three steps.
The troubleshooting process begins when a network failure is reported to you. The following
are brief descriptions of the steps to take:
1. Gather symptoms. At this point in the process, it is important to gather and document the
symptoms of the problem that is being experienced.

2. Isolate the problem. After identifying the symptoms, the administrator looks for commonalities
in the symptoms and tries to determine at what layer of the OSI model the problem
is occurring. During this phase, it may be necessary to go back and gather more symptoms.

3. Correct the problem. Based on the information that was gathered and the determinations that
were made in the previous two steps, the network administrator now makes the changes necessary
to correct the problem. Once the corrective steps have been taken, the administrator
observes the results of the changes to ensure that the problem was corrected. If the problem
was not corrected, then the changes made should be backed out and the administrator should
start the troubleshooting process over with the gather symptoms stage. If the changes do correct
the problem, then the administrator should update the necessary documentation. The
final item of importance when correcting the problem is to make sure that you make only one
change at a time. This will ensure that you do not make unnecessary changes, which could
introduce new problems.
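The flow between the three steps, including the two points at which the process loops back to gathering symptoms, can be sketched as a small transition function. This is an illustrative reading of the steps described here, not a reproduction of the flow chart. The names Step and next_step are hypothetical.

```python
from enum import Enum


class Step(Enum):
    GATHER = "gather symptoms"
    ISOLATE = "isolate the problem"
    CORRECT = "correct the problem"
    DOCUMENT = "update documentation"


def next_step(current: Step, need_more_symptoms: bool = False,
              problem_fixed: bool = False) -> Step:
    """Return the step that follows `current` in the three-step model."""
    if current is Step.GATHER:
        return Step.ISOLATE
    if current is Step.ISOLATE:
        # Isolation may send you back for more symptoms.
        return Step.GATHER if need_more_symptoms else Step.CORRECT
    if current is Step.CORRECT:
        # A failed fix is backed out and the process starts over.
        return Step.DOCUMENT if problem_fixed else Step.GATHER
    raise ValueError("documentation is the final action")


print(next_step(Step.CORRECT, problem_fixed=False).value)
```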
The best way to understand how Cisco’s model works and how you should use it is by looking
at an example. For this example, assume you are in charge of operational support of the network
pictured in Figure 33.3. There are two campus networks, connected via a Frame Relay
cloud. Within each network, VLANs are connected to a Catalyst 6500 switch and then to a core
router that has a connection to the Frame Relay cloud in one way or another.

The fun begins when you get a call from a user who “can’t get to Host Z.” Based on this
information, let’s apply Cisco’s troubleshooting model to solve the user’s difficulty and fix the
problem in the network.

The Complexity of Internetworks

When a network failure occurs, time is of the essence. When a production network goes down,
several things are affected. The most important of these is the bottom line—network failures
cost money.
A good example is a call-center network. The company relies on the network to be available
for its employees so that they can take phone orders, answer inquiries, or perform other
business transactions that generate income. A failure in this environment needs to be diagnosed
and repaired in a timely manner. The longer the network is down, the more money the
company loses.
To minimize monetary and productivity losses, network failures must be resolved quickly.
Troubleshooting is an integral part of getting this done. Intimate knowledge of a network also
facilitates rapid resolution. Armed with a few troubleshooting skills and intimate knowledge of
your network, you can solve most problems rather quickly, thus saving money.
Hold on a minute. What if you’re new on the job and you don’t yet have an intimate
knowledge of the network? You can probably get up to speed quickly enough, right?
Although that may have been the case in the past, getting up to speed becomes an overwhelming
challenge in today’s complex networks. These networks consist of many facets of
routing, dial-up, switching, video, WAN (ISDN, Frame Relay, ATM, and others), LAN, and
VLAN technologies.
Figure 33.1 gives you an idea of how these technologies intertwine. Notice that ATM, Frame
Relay, Token Ring, Ethernet, and FDDI all are present. Each technology has its own properties
and commands to allow for troubleshooting. Various protocols are used for each of these technologies.
In addition, different applications require specific network resources. (At least the
seven-layer OSI model, which you will review in Chapter 36, “Protocol Attributes,” is used to
maintain a common template when designing new technologies and protocols.) It would take
you a long time to master all of the technologies implemented in the network and to be able to
solve network problems based on your knowledge of the network alone. All of these factors
contribute to today’s complex network environments.
There must be an easier, more logical way to efficiently and successfully troubleshoot without
having to become intimately familiar with every network environment. Well, you’ll be
happy to know that there is an easier option—following a troubleshooting model, which is discussed
in detail in this chapter. By following a troubleshooting model, the need for intimate
knowledge of the network is reduced. A troubleshooting model should be adopted to help
resolve network malfunctions and reduce downtime.
Let’s move on to discuss Cisco’s model in detail.

Troubleshooting Methodology

Troubleshooting is a skill that takes time and experience to fully
develop. To be successful when diagnosing and repairing network
failures, a good set of troubleshooting tools and skills is essential.
The information presented here is the foundation for the rest of the information covered on the
exam. This chapter emphasizes the importance of following a specific set of troubleshooting steps
when you try to diagnose and solve network problems. An effective troubleshooting methodology
is needed because of the complexity of today’s network environments. As a Cisco Certified Network
Professional (CCNP), you need to understand and know how to apply an efficient and systematic
troubleshooting methodology. Otherwise, you would be required to have a very intimate understanding
of the network you are troubleshooting. It is imperative that you learn troubleshooting
skills and understand the information available to you while solving network problems.