Every broadcast system is now, inevitably, a collection of computer networks. Broadcast maintenance technicians and engineers need at least basic, triage-level LAN troubleshooting skills. This guide introduces the concepts and techniques for troubleshooting a network problem from the device's perspective, generally without the assistance of the network infrastructure manager. If you do have access to the switches and routers on your network, some tasks get easier, but the overall process does not change dramatically.
Where to Start
Check the Cable
It can't be said enough: the first place to start in troubleshooting a network connection is the physical connection. Even if you have been assured that the cable and physical connections are in order, check the cable when you begin investigating the problem. If the Network Interface Card (NIC) has a link light, check that the light is lit. Unfortunately, many NICs no longer have physical link lights, so be sure to check the link status from within the operating system.
But this isn’t a lesson in cable problems.
The Process
The front-line network connection troubleshooting process breaks down into a few major tasks.
- Get your bearings
- Verify the problem
- Locate the problem
- Repair the problem
- Verify everything works
Get Your Bearings and Verify the Problem
The first two steps are the basics of any troubleshooting experience. Talk to the person reporting the problem or review the automated outage alert. Then, to verify the problem, try to reproduce it for yourself. Go through the steps the user explained so you can see the problem firsthand. If your monitoring system is reporting a problem, check the system yourself. The problem may actually lie with the monitoring infrastructure.
Don’t gloss over these two steps. This is where you will weed out the most false information and save yourself from significant amounts of wasted time. Combine understanding and verifying the problem with a quick check of the physical connection and you will resolve many problems rapidly.
Locate the Problem
Once you know that you have a genuine, reproducible problem on your hands, you need to figure out where along the path from cable to application the problem lies. We've all had the OSI model drilled into us. But how does that actually help solve a problem?
You can think of the OSI model as mapping onto three areas in which to look for the problem. We have already covered the most straightforward one: the OSI Physical layer translates to checking the physical components of the network, the most important of which is the cable. A problem in the next three layers up (Data Link, Network and Transport) most often manifests as a problem with the networking protocol stack. If you are working with a Windows operating system, this means you should head for the connection properties in the Network Control Panel. Problems in the upper three OSI layers (Session, Presentation and Application) are usually problems in the application itself.
Since you now have a good handle on what the problem is, break down the elements of the problem and try to eliminate OSI layers where the problem isn't likely to be. For example, if you are on a DHCP network and the machine hasn't received an IP address, the problem likely lies in the Data Link or Physical layers. Alternatively, if you are working with an IP-based application and a ping succeeds, you probably want to look at the Presentation or Application layers.
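The elimination described above can be written down as a tiny decision function. This is only a sketch of the reasoning, not a diagnostic tool: the function name and its three yes/no inputs are made up for illustration, and real problems will not always sort themselves this cleanly.

```python
def likely_problem_area(link_up, has_ip_address, ping_ok):
    """Map three quick observations to the area worth checking first.

    All three arguments are booleans you fill in from your own checks;
    the function and its names are illustrative, not a standard API.
    """
    if not link_up:
        # Layer 1: cable, connector, NIC or switch port.
        return "physical"
    if not has_ip_address or not ping_ok:
        # Layers 2-4: addressing, DHCP, routing, the protocol stack.
        return "protocol stack"
    # Layers 5-7: the network is carrying traffic, so look at the application.
    return "application"


print(likely_problem_area(link_up=True, has_ip_address=True, ping_ok=True))
```

With a live link, an address and a successful ping, the function points at the application, which matches the second example in the paragraph above.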
Repair the Problem
After you have identified the likely source of the problem, it’s time to get your hands really dirty and fix things. Of course that is easier said than done. At the risk of beating a dead horse, remember to check the cable for obvious problems. Since you have already covered your bases there, here are some other techniques to help resolve the problem.
Remove, Reboot, Install, Reboot
For protocol stack related problems, you often go through the sequence, “Remove, Reboot, Install, Reboot.” First, remove, disable or disconnect the component that is causing the problem and reboot. Once the machine is up and running again, re-install, re-enable or re-connect the component that you are working with and reboot again. This is a decidedly Windows-centric technique but often comes into play with other operating systems, particularly if you are working with SAN technologies.
Check the Settings
You probably won't be surprised by the number of network-related problems that result from a simple, obvious and minor misconfiguration. Start a methodical verification of the settings. Work through each configuration item individually and slowly. Rushing will make you miss the problem that is staring you right in the face. When you have the time, change only one setting at a time and check after each change whether you have corrected the problem. When you are pressed for time, use a binary search: change half of the suspect settings at once, see whether the problem follows that half, and repeat on whichever half contains it. This is particularly useful for sets of boolean or checkbox settings.
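If the binary search idea is unfamiliar, the following sketch shows the shape of it. It assumes exactly one setting is at fault and that settings do not interact. The callback problem_present is hypothetical: in practice it is you, applying the suspect values for one half of the settings and checking whether the problem is still there.

```python
def find_bad_setting(names, problem_present):
    """Narrow a list of suspect settings down to the single culprit.

    names: setting names that differ from a known-good configuration.
    problem_present(subset): hypothetical callback that returns True when the
        problem occurs with only the settings in `subset` left at their
        suspect values. Assumes exactly one setting causes the problem.
    """
    candidates = list(names)
    if not candidates:
        raise ValueError("need at least one suspect setting")
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if problem_present(half):
            candidates = half                    # culprit is in the first half
        else:
            candidates = candidates[len(half):]  # culprit is in the second half
    return candidates[0]


# Illustrative run with invented setting names and a simulated culprit.
suspects = ["jumbo_frames", "flow_control", "vlan_tagging",
            "wake_on_lan", "checksum_offload"]
culprit = find_bad_setting(suspects, lambda subset: "vlan_tagging" in subset)
print(culprit)
```

Five settings take two or three checks this way instead of up to five, and the saving grows quickly as the list gets longer.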
DHCP
If you are working with a machine on a DHCP segment, renew the IP address. If a simple renew fails, try a release and then a renew. If you cannot obtain a lease after a release, this may indicate deeper problems. If you have consistent address assignment problems, be sure to discuss the implications of DHCP or BOOTP address assignment in a broadcast environment with your network infrastructure managers.
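On Windows, the renew and release steps are ipconfig /renew and ipconfig /release. The sketch below strings them together in the order described above. It assumes a Windows machine and assumes that ipconfig's exit status reflects whether the operation succeeded, which you should confirm on your own systems. The function names are made up for this example.

```python
import subprocess


def run_ipconfig(switch):
    """Run `ipconfig <switch>` and report whether it exited cleanly.

    Assumption: a zero exit status means the operation succeeded.
    """
    result = subprocess.run(["ipconfig", switch], capture_output=True)
    return result.returncode == 0


def recover_lease(run=run_ipconfig):
    """Try a plain renew first, then a release followed by a renew."""
    if run("/renew"):
        return "renewed"
    run("/release")  # give up the current lease, then ask again
    if run("/renew"):
        return "renewed after release"
    return "no lease"  # time to look deeper or call the network team
```

Calling recover_lease() from an elevated prompt runs the real commands; passing a different run function lets you try the logic without touching the network.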
Verify Everything Works
An important aspect of Service Management is verifying that the problem has actually been resolved. If the problem is with a user-facing aspect of the system, be sure to involve Operations, Production or Editorial as appropriate. Preferably, work with the person who originally reported the problem. Regardless of whom you work with, make sure the fact that the problem has been fixed, and verified as fixed, gets reported back to all affected users. Make sure that the resolution of the problem and the steps you took to get there are communicated to other engineers and technicians who will benefit from this knowledge.
Tools
A quick Google search will lead you to a variety of third-party tools for network problem solving. While you will find several high quality open source or low cost products, most of what you find will be quite expensive. Fortunately, most operating systems have several of the tools you need to resolve the majority of problems built right in.
ipconfig/ifconfig
A quick snapshot of IP settings can be printed with the command ipconfig or the Mac/Linux equivalent, ifconfig. Beyond showing the current configuration, both commands have the ability to modify some networking components. It isn't intuitive, but ipconfig is also used for DHCP lease and DNS cache management.
Ping
Ping attempts to answer the simple question, “Is a host reachable?” There are many reasons why a device will still be working on a network but ping won’t work. However, most of those don’t apply to LANs. You always want to try at least two ping packets before you give up. When troubleshooting, try pinging from the problem device to some place else on your network and from that other place to the problem device.
When pinging from the problem device, try a few different destinations. This is where it becomes important to learn your network's architecture. The next tool, tracert, will help with that, but good documentation and training are essential. Ping the device's own address (not localhost), then the device's gateway, then another machine on the same subnet, and move outwards from there. Routers, firewalls, DNS servers and other infrastructure devices are all good places to try.
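The "start close and move outwards" routine is easy to script once you know your addresses. The sketch below builds the ping command for the local platform (Windows uses -n for the packet count, Unix-like systems use -c) and walks an ordered list of destinations until one fails. The function names and the target list format are assumptions for illustration. Treating the exit status as "reachable" is also a simplification, since some ping implementations report success on replies such as "destination unreachable".

```python
import platform
import subprocess


def ping_command(host, count=2, system=None):
    """Build a ping command line; `system` defaults to the local platform."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def ping(host):
    """Send two echo requests and treat a zero exit status as reachable."""
    result = subprocess.run(ping_command(host), capture_output=True)
    return result.returncode == 0


def first_unreachable(targets, probe=ping):
    """Walk (label, address) pairs in order and return the first label that fails.

    Returns None when every target answers.
    """
    for label, address in targets:
        if not probe(address):
            return label
    return None
```

You would call first_unreachable with a list such as the device's own address, its gateway, a neighbor on the same subnet and a DNS server, in that order, filled in from your own documentation. The first label that comes back tells you roughly how far out the network is healthy.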
Although not previously common in broadcast environments, DNS can play a large role in how a device functions on the network. On Windows, ping's -a option forces a reverse DNS lookup of the IP address you are pinging. A ping that succeeds while the DNS lookup fails points you toward name resolution rather than connectivity. With most Unix operating systems, ping attempts a DNS lookup by default; the -n option prevents it. As mentioned previously, it is good to know the addresses of your DNS servers to speed troubleshooting.
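You can also run the same kind of reverse lookup outside of ping. The sketch below uses Python's standard socket.gethostbyaddr and returns None instead of raising when no name comes back. The wrapper function and its resolver parameter are made up for this example; the parameter exists only so the logic can be tried without a live DNS server.

```python
import socket


def reverse_lookup(ip_address, resolver=socket.gethostbyaddr):
    """Return the hostname registered for ip_address, or None if there isn't one."""
    try:
        hostname, _aliases, _addresses = resolver(ip_address)
        return hostname
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None
```

If a device answers ping by address but reverse_lookup on the same address returns None, compare notes with whoever manages DNS before you start changing settings on the device.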
tracert
Where ping tries to determine if a single host is reachable, tracert, or the Unix equivalent traceroute, asks each hop along a path for some status information. Hops are less important in modern LANs, but tracert is still very helpful. Each router between your problem device and the destination will respond to the tracert requests. Seeing the point where the trace stops will point to the source of a problem. When a point along the path doesn't respond, tracert will show asterisks instead of the response time. Remember to let at least two consecutive hops go by without a response before stopping the trace.
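The "two consecutive silent hops" rule can be expressed in a few lines. The sketch below deliberately does not parse tracert output, because the exact layout varies between platforms. It assumes you have already reduced the trace to a list of (hop number, address) pairs, with None for a hop that only showed asterisks. The function name and data shape are illustrative.

```python
def last_responding_hop(hops, patience=2):
    """Find the last hop that answered before the trace went quiet.

    hops: (hop_number, address) pairs in path order; address is None for a
        hop that did not respond.
    patience: how many consecutive silent hops to tolerate before giving up.
    Returns the last (hop_number, address) that responded, or None if none did.
    """
    last_good = None
    silent = 0
    for number, address in hops:
        if address is None:
            silent += 1
            if silent >= patience:
                break  # the trace has stopped; no point reading further
        else:
            silent = 0
            last_good = (number, address)
    return last_good


# Illustrative trace using documentation-only addresses.
trace = [(1, "192.0.2.1"), (2, "198.51.100.1"), (3, None), (4, None), (5, None)]
print(last_responding_hop(trace))
```

The device just beyond the hop this returns is the first place to look. A single silent hop followed by more answers is ignored, since some routers simply don't reply to trace requests.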
netstat
Once you have confirmed your problem device can talk on the network with ping and tracert, it is time to figure out if the device is actually connecting to everything it needs in order to function. You will want “protocol statistics and current TCP/IP network connections” to make that decision. That is where netstat comes in.
On Windows, simply running the netstat command with no options will print a list of active TCP connections. The -a option adds all listening ports to that list. If the device is a server or requires incoming connections to work, this is the listing to check to see if it has or can accept those connections. The -a option also forces netstat to show other protocols, such as UDP. If you want to see more than just the blinking lights in the System Tray, netstat -e will give you details on the different types of traffic coming and going from the device.
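netstat shows you what the device believes it is listening on. A useful cross-check is to try the port from another machine. The sketch below is one way to do that with Python's standard socket module. It is a complement to netstat, not a replacement, and the function name is made up for this example.

```python
import socket


def accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable or unresolvable all count as "no".
        return False
```

If netstat -a on the server lists the port as listening but accepts_connections returns False from a client, look at what sits between the two machines, such as a firewall.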
The behavior of Unix netstat is similar. Running the command with no options shows active connections for all protocols, not just TCP. To get packet statistics, check the man page to find out which command line option to use. You will probably have to specify which network interface you want statistics for.
nbtstat
The last useful tool that you already have is nbtstat. This command-line program helps you figure out if Windows Networking is working. More precisely, nbtstat gives some information on NetBIOS over TCP/IP. This has become less important in modern broadcast environments, but there are still some systems that rely on elements of NetBIOS. The command by itself will just print the usage message. The two most useful options are -r and -s. The -r switch will print information (statistics and computer names) for WINS resolution. The -s switch shows what devices are connecting to the computer you are working on.