Proactive Actions for Handling Network Failure
Network failure is extremely undesirable; nevertheless, it is unavoidable. Much of it is caused by the constant additions and changes to the network that are necessary because of the increasing demand for higher throughput, the addition of advanced service features (for example, Voice over IP [VoIP]), and the evolution of new technologies such as wireless LANs. Additional security services and protocols are constantly being developed to meet the security requirements of these new technologies and products. The best approach is to be prepared and equipped with all the tools and techniques you need to cope with network failure. It is always easier and faster to recover from a network failure if you are fully prepared ahead of time.
As a self-test, review the following list to check if you are fully prepared for a network failure:
-
Proper documentation: Clear documentation that is easy to follow is a must for any network, and in particular for troubleshooting network devices. It is also important during troubleshooting to document any changes being made to the network, so that it is easier to back out of an operation if troubleshooting has failed to identify the problem within the maintenance window. Having good backup copies of every device's configuration before the start of a troubleshooting session can help to restore the network to working condition more quickly.
Document the solution to every problem you solve, so that you create a knowledge base that others in your organization can follow if similar problems occur later. This will reduce the time needed to troubleshoot your networks and, consequently, minimize the business impact.
-
Having a network baseline: You must baseline your network, documenting normal network behavior and performance at different times of the day, on different days of the week, and in different months of the year. This is very important, especially for the security aspects of the network. You might also consider deploying external auditing or logging tools such as Syslog, MCP, and so on, for defining the baseline. Another good practice is to constantly collect logs and compare them with the baseline to uncover any abnormal behavior that requires investigation and troubleshooting. For example, suppose your baseline shows connection counts across a PIX firewall of about 10K during rush hours. If, on a Saturday night, the PIX suddenly experiences connection counts substantially above 10K, you know something is wrong, and this knowledge should trigger further investigation before a worm can spread and cause more harm to the network.
-
Backup of working configuration and version information of devices on controlled areas in the network: You must back up working copies of every device's configuration and version information, and keep them in a secured location that is readily accessible to selected personnel, so that you can revert if needed.
-
Clear, concise, and updated network topology: One of the most important requirements in any network environment is to have current and accurate information about that network (both physical and logical topology) available to the network support personnel at all times. Only with complete information can you make intelligent decisions about network change and troubleshoot as quickly and as easily as possible. The network topology should include, at a minimum, the names, types, and IP addresses of the devices. Additionally, it is advisable to know the types of protocols, links, and services configured on the devices.
-
Tools readily available: It is not acceptable to have to wait to download or install tools that might be required during troubleshooting. At a minimum, be sure to have available external syslog servers and sniffer software, in addition to an FTP server and a Trivial File Transfer Protocol (TFTP) server. For example, most PIX firewall issues can be diagnosed with the syslog server. Under some rare circumstances, you might need a sniffer capture if the syslog output does not give you a conclusive result.
-
Cohesive teamwork between Security Operations (SecOp) and Network Operations (NetOp) personnel: In a mid- to large-sized network, network operations and security operations are usually separate teams. Because security goes hand-in-hand with every technology and product, it is extremely important that you have a cooperative professional relationship with those involved in troubleshooting issues that involve technologies beyond security. For example, if you have a GRE over IPSec tunnel between two sites, Security Operations might be responsible for the IPSec VPN, whereas Network Operations handles routing and switching. If you have problems with an unstable VPN tunnel and packet drops, the problem might be either with the VPN or with the underlying IP network. If the IP network is not stable, the tunnel will not be stable. Or the problem might simply be with the hardware encryption. In a nutshell, to arrive at the fastest diagnosis and resolution, both teams should work as one cohesive team.
-
Change control review: You must produce a change control for every component you add to the network, every new service you turn on, and every new command you add to the devices. Change control includes documentation and thorough review with senior engineers and management. If possible, simulate the setup in a lab network before introducing the changes to production. Schedule a maintenance window within which to perform the task. If required, formally establish a change control review board with the senior members of the team.
-
Clear and concise escalation procedure: This is one of the most important, and most often overlooked, proactive measures. You should document a clear and concise escalation procedure and make it available to every member of the troubleshooting team. You should also list the points of contact for external networks, including the connections to the Internet. This helps not only you as a senior engineer in the company, but also others who are new to your network. Escalation procedures could include information on how to engage the next tier of engineering in your organization, and guidelines on when and how to engage the Cisco Support team.
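The baseline comparison described in the list above (connection counts recorded per day and hour, then checked against new observations) can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed method: the function names `build_baseline` and `is_abnormal`, the `(weekday, hour, count)` sample layout, and the three-standard-deviation tolerance are all assumptions made for the example, and the sample numbers are invented.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Group (weekday, hour, connection_count) samples into per-slot statistics.

    The sample layout is a hypothetical one chosen for this sketch.
    """
    slots = {}
    for weekday, hour, count in samples:
        slots.setdefault((weekday, hour), []).append(count)
    # Keep the mean and sample standard deviation for every slot that has
    # at least two samples (stdev needs two or more data points).
    return {slot: (mean(v), stdev(v)) for slot, v in slots.items() if len(v) >= 2}


def is_abnormal(baseline, weekday, hour, count, tolerance=3.0):
    """Return True when count exceeds the slot mean by more than
    `tolerance` standard deviations (an assumed threshold)."""
    if (weekday, hour) not in baseline:
        return False  # no baseline for this slot, so nothing to compare against
    avg, spread = baseline[(weekday, hour)]
    return count > avg + tolerance * spread


# Invented Saturday-night history, followed by two new observations.
history = [("Sat", 23, c) for c in (1200, 1100, 1300, 1250)]
baseline = build_baseline(history)
print(is_abnormal(baseline, "Sat", 23, 12000))  # True: far above the usual level
print(is_abnormal(baseline, "Sat", 23, 1250))   # False: within the normal range
```

Keeping one statistic per day-and-hour slot reflects the advice to baseline at different times and on different days: a count that is normal during rush hours can still be flagged when it appears on a Saturday night.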