Resolving most problems requires a methodical approach and the application of your knowledge of TCP/IP and of your network.
TCP/IP is a four-layer hierarchy. Problems seen by the user in the Application Layer may be caused by problems in the lower layers.
IP requires that each system have a globally unique, software-defined address. IP uses the address to move data through networks and through the layers of software in a host. Unlike networks that use hardware addresses, IP relies on the system administrator to define the correct address. Problems are frequently caused by configuration errors.
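As an illustration of the kind of configuration error meant here, the following is a minimal sketch that checks whether an address, subnet mask, and broadcast address agree with one another. The function name check_interface and the sample addresses are assumptions made for this example, and it only covers IPv4 settings that you supply by hand.

```python
import ipaddress


def check_interface(address: str, netmask: str, broadcast: str) -> list:
    """Return a list of problems found in one interface's settings.

    check_interface is a hypothetical helper, not part of any tool.
    """
    problems = []
    # strict=False lets us pass a host address and derive its network.
    network = ipaddress.ip_network(f"{address}/{netmask}", strict=False)
    host = ipaddress.ip_address(address)

    # Networks with a /31 or /32 mask have no separate network and
    # broadcast addresses, so the host checks are skipped for them.
    if network.prefixlen < 31:
        if host == network.network_address:
            problems.append("address is the network address")
        if host == network.broadcast_address:
            problems.append("address is the broadcast address")

    if ipaddress.ip_address(broadcast) != network.broadcast_address:
        problems.append(
            f"broadcast {broadcast} does not match expected "
            f"{network.broadcast_address}"
        )
    return problems


# Illustrative values only: the broadcast address below was derived
# from the wrong mask, so one problem is reported.
print(check_interface("172.16.12.2", "255.255.255.0", "172.16.255.255"))
```

An empty list means the three values are consistent with each other. It does not prove they are the right values for your network.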
Routing is required to deliver data between any two systems that are not directly connected by the same physical network. Subnetting divides a network into separate physical networks so that routing may even be required within a single enterprise network.
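Whether routing is involved can be worked out from the local address and subnet mask. The sketch below assumes IPv4 and uses made-up addresses; needs_routing is a hypothetical name chosen for this example.

```python
import ipaddress


def needs_routing(local: str, remote: str, netmask: str) -> bool:
    """Return True if remote lies outside local's subnet.

    A True result means the data must pass through at least one router.
    """
    subnet = ipaddress.ip_network(f"{local}/{netmask}", strict=False)
    return ipaddress.ip_address(remote) not in subnet


# Both remote hosts are in the same enterprise network (172.16.0.0),
# but only the first shares a subnet with the local host.
print(needs_routing("172.16.12.2", "172.16.12.9", "255.255.255.0"))  # False
print(needs_routing("172.16.12.2", "172.16.1.3", "255.255.255.0"))   # True
```

The second call shows the point made above: with subnetting, two hosts in the same enterprise network can still need a router between them.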
Three steps in tracking down the real problem are:
Gather information. When the problem is reported, ask the user several questions. What application failed? What is the address and hostname of the remote computer? What is the address and hostname of the user's computer? What error message was displayed? If possible, have the user verify the problem by running the application while you talk the user through it. If possible, duplicate the problem yourself.
Run preliminary tests using another application, such as PING. Check if the problem occurs in other applications on the user's host. Check if the user's problem occurs with only one remote host, with all remote hosts, or only with hosts off the user's subnet. Check if the problem occurs on other local systems or just on the user's system. Does it fail from your system? How about from other systems on the user's subnet?
Visualise each protocol and device that handles the user's data. If the problem occurs on some systems and not others, think about differences in the path that data takes from those systems. Thinking about where and how things could go wrong helps you avoid oversimplifying the problem. It also highlights the areas that are most likely to cause the user's problem. The problem can be anywhere in the path you visualise.
Some hints on analysing the test results are:
If only one application is having a problem, the application may be misconfigured. If the same application fails on different local hosts, but only when connecting to a specific remote host, the application may not be available on the remote host. If the application that fails is from a different source than the TCP/IP protocol stack, e.g., a commercial protocol stack and a freeware application, the application and the stack may not be compatible. The last condition is particularly prevalent in Windows 3.1 and 3.11 when the application is designed for a specific WINSOCK.DLL and a different one is used by the stack.
If problems occur on all local PCs, regardless of the application or the remote host they are connecting to, the problem is in one of the devices that connects the network to the outside world. If the problem only occurs on systems on a single subnet, the problem is in the device that connects the subnet to the rest of your network. If the problem only occurs on one PC, that PC is probably misconfigured. Check its configuration. If it appears okay, take your laptop and check the network link.
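The hints above can be written down as a small decision function. This is only a sketch of the reasoning, not a diagnostic tool: the function name, its parameters, and the scope labels "one_pc", "one_subnet", and "all_local" are all assumptions made for this example.

```python
def first_suspect(only_one_app: bool, only_one_remote: bool, scope: str) -> str:
    """Suggest where to look first, following the hints in the text.

    scope says how widely the problem is seen locally:
    "one_pc", "one_subnet", or "all_local" (hypothetical labels).
    """
    if scope not in ("one_pc", "one_subnet", "all_local"):
        raise ValueError(f"unknown scope: {scope!r}")

    if only_one_app:
        # Same application failing from several local hosts, but only
        # toward one remote host: suspect the remote end.
        if only_one_remote and scope != "one_pc":
            return "application may not be available on the remote host"
        return "application may be misconfigured"

    if scope == "all_local":
        return "device connecting the network to the outside world"
    if scope == "one_subnet":
        return "device connecting the subnet to the rest of the network"
    return "configuration of that PC, then its network link"


print(first_suspect(only_one_app=False, only_one_remote=False, scope="one_subnet"))
```

The result is a starting point for testing, not a conclusion. Each suggestion still has to be confirmed by the evidence of the tests.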
Pay attention to the error messages. Error messages are often vague, but they contain valuable pointers to the underlying problem.
The error "Unknown host" indicates a name server problem. If other computers resolve the name correctly, the user's PC is probably misconfigured. If no system resolves the name correctly, the name the user has may be wrong or the name server may be misconfigured. Have the user try to connect with the numeric address.
The error "Network unreachable" indicates a routing problem. It means that there is no route to the remote host. If no system can reach it, the remote site might be down. If only the user's PC has the problem, check the PC's routing configuration.
The errors "Cannot connect", "No answer", and "Connection timed out" mean that the remote system is not responding. Either the remote system is down or a link between the user's PC and the remote system is down. If the user is trying to connect using a numeric address, it could mean that the user has the wrong address. Ask the user to try the remote system's hostname.
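The three error messages above can be kept in a simple lookup table. The sketch below is illustrative only: the exact wording of error messages varies between applications and systems, so it matches on the phrases quoted in the text, and the names HINTS and diagnose are assumptions made for this example.

```python
# Phrase (lower case) -> first thing to check, taken from the text.
NOT_RESPONDING = (
    "remote system not responding: the host or a link may be down; "
    "if a numeric address was used, try the hostname"
)
HINTS = {
    "unknown host": (
        "name service: compare with other systems, "
        "then try the numeric address"
    ),
    "network unreachable": (
        "routing: no route to the remote host; "
        "check the PC's routing configuration"
    ),
    "cannot connect": NOT_RESPONDING,
    "no answer": NOT_RESPONDING,
    "connection timed out": NOT_RESPONDING,
}


def diagnose(message: str) -> str:
    """Return a hint for an error message, or a fallback if none matches."""
    text = message.lower()
    for phrase, hint in HINTS.items():
        if phrase in text:
            return hint
    return "no hint recorded for this message; gather more information"


# The message below is a made-up example of what a user might report.
print(diagnose("telnet: Unknown host"))
```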
Troubleshooting deals with the unexpected. Network problems are usually unique and sometimes difficult to resolve. Troubleshooting is an important part of maintaining a stable, reliable network service. Effective troubleshooting requires a methodical approach to the problem and a basic understanding of how the network works. The key to solving a problem is understanding what the problem is. This is not as easy as it may seem. The surface problem is sometimes misleading, and the real problem is frequently obscured by many layers of software. When the true nature of the problem is understood, the solution is often obvious.
Approaching a Problem:
Gather detailed information about exactly what's happening. When a problem is first reported, talk to the user. Find out which application failed. What is the remote host's name and IP address? What is the user's hostname and address? What error message was displayed? If possible, verify the problem by having the user run the application while you talk him/her through it. If possible, duplicate the problem on your own system.
Does the problem occur in other applications on the user's host, or is only one application having trouble? If only one application is involved, the application may be misconfigured or disabled on the remote host. Because of rising security concerns, more and more systems are disabling some services.
Does the problem occur with only one remote host, all remote hosts, or only certain groups of remote hosts? If only one remote host is involved, the problem could easily be with that host. If all remote hosts are involved, the problem is probably with the user's system. If only hosts on certain subnets or external networks are involved, the problem may be related to routing.
Does the problem occur on other local systems? Make sure you check other systems on the same subnet. If the problem only occurs on the user's host, concentrate testing on that system. If the problem affects every system on a subnet, concentrate on the router for that subnet.
Once you know the symptoms of the problem, visualise each protocol and device that handles the data. Visualising the problem will help you avoid oversimplification, and keep you from assuming that you know the cause even before you start testing.
Approach problems methodically. Don't jump into another test scenario based on a hunch without ensuring that you can pick up your original test scenario where you left off.
Keep a historical record of the problem in case it reappears.
Don't assume a problem seen at the application level is not caused by a problem at a lower level.
Test each possibility and base your actions on the evidence of the tests.
Pay attention to error messages.
Duplicate the reported problem yourself.
Most problems are caused by human errors.
Keep your users informed. Users want solutions to their problems; they're not interested in speculative techno-babble.
Don't speculate about the cause of the problem while talking to the users.
Stick to a few simple troubleshooting tools.
Don't neglect the obvious. A loose Ethernet cable is a very common network problem. Check plugs, connectors, cables, and switches.
Small things can cause big problems.
Most network problems can be solved using free diagnostic software. Large networks probably need a network analyser, or at least a hardware tester such as a Time Domain Reflectometer (TDR).
ifconfig  : Provides information about the basic configuration of the
            interface. It is useful for detecting bad IP addresses,
            incorrect subnet masks, and improper broadcast addresses.
arp       : Provides information about Ethernet/IP address translation.
            It can be used to detect systems on the local network that
            are configured with the wrong IP address.
netstat   : Provides a variety of information. It is commonly used to
            display detailed statistics about each network interface,
            network sockets, and the network routing table.
ping      : Indicates whether a remote host can be reached.
nslookup  : Provides information about the DNS name service.
dig       : Provides information about name service.
ripquery  : Provides information about the contents of the RIP update
            packet being sent or received by your system.
traceroute: Tells you which route packets take going from your system
            to a remote system. Information about each hop is printed.
etherfind : Analyses the individual packets exchanged between hosts on
            the network. It is most useful for analysing protocol
Testing Basic Connectivity:
The ping command tests whether a remote host can be reached from your computer. This simple function is extremely useful for testing the network connection, independent of the application in which the original problem was detected. Ping allows you to determine whether further testing should be directed toward the network connection (the lower layers) or the application (the upper layers). If ping shows that packets can travel to the remote system and back, the user's problem is probably in the upper layers. If packets can't make the round trip, lower protocol layers are probably at fault.
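The same decision can be scripted. The sketch below makes several assumptions: a ping program is on the search path, it accepts -c (Linux, macOS, BSD) or -n (Windows) to set the packet count, and it exits with status 0 when replies were received. Exit-status behaviour differs between systems, particularly on Windows, so treat the result as a hint. The function names are hypothetical.

```python
import platform
import subprocess


def ping_command(host: str, count: int = 4) -> list:
    """Build a ping command line; the count flag differs by platform."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), host]


def can_reach(host: str, count: int = 4) -> bool:
    """Run ping and report whether it exited successfully.

    Assumes exit status 0 means replies came back (see lead-in).
    """
    result = subprocess.run(
        ping_command(host, count),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def next_focus(reachable: bool) -> str:
    """Apply the rule from the text to choose where to test next."""
    if reachable:
        return "upper layers: test the application"
    return "lower layers: test the network connection"


# Example use (not run here because it needs a network):
#   print(next_focus(can_reach("remote.example.com")))
```

Keeping the rule in next_focus separate from the call to ping means the reasoning can be checked without a network.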