Debugging DSLw
Problem
You want to debug and isolate DLSw problems.
Solution
The first thing to do with any DLSw issue is verify that the peers are working correctly, as in Recipe 15.13. If the peers are not established, then test IP connectivity with PING packets. If you can PING but the peers won't come up, then verify your configuration, as in Recipe 15.2. Ensure in particular that the remote-peer of each router precisely matches the local-peer of the other end.
If the DLSw peers are active, check the circuits, as in Recipe 15.13.
For failed circuits involving SDLC devices, check the interface, as in Recipe 15.14.
For Token Ring or Ethernet devices, verify that the interface is functioning properly as in Chapter 16.
If the peers are active and the interfaces look good, then there are three main things that could still be wrong. There could be a loop problem within the DLSw network. There could be a MAC address problem or a MAC or LSAP filtering issue. Or there could be a network congestion or performance problem.
There are several useful debug commands for use with DLSw. For looking at the router-to-router DLSw transport, you can use the debug dlsw command:
dlsw-branch#debug dlsw
You can get other useful information about SNA and LLC2 connection problems with these debug commands:
dlsw-branch#debug sna state
dlsw-branch#debug llc2 state
Discussion
If your routers will not establish a DLSw peer relationship, the most common problem is simple IP connectivity. Verify that you can PING the address in the remote-peer statement. Note also that at most, one end can use the promiscuous keyword to avoid configuring a remote-peer statement. This feature allows the router to sit and quietly wait for other routers to initiate connections. Somebody has to start the conversation.
Another common reason for failing to connect is simply a configuration mismatch between the local-peer and remote-peer definitions. Make sure that if you have a remote-peer statement, then the IP address exactly matches the address in the local-peer statement of the other router. It is not enough just to target any address on the remote-peer router; it must be exactly the same address.
If this is correct and you can PING, but still cannot connect, then there are two remaining possibilities. One is that there are port-specific filters or routing rules that permit PING but do something different with DLSw packets. This is particularly possible when using Policy-Based Routing to attempt to give DLSw packets a preferred path through the network. The preferred path may have serious congestion problems, for example.
The other possibility is that there is a complete incompatibility between the versions of DLSw running at the two ends. If both are Cisco routers, then this is not going to happen. But there are still some pre-RFC 1795 DLSw implementations out there, and Cisco's DLSw+ cannot negotiate a connection with noncompliant versions.
If you are trying to debug a DLSw problem, check first that the peers are connected:
dlsw-branch#show dlsw peers
Peers: state pkts_rx pkts_tx type drops ckts TCP uptime
TCP 10.1.1.5 CONNECT 1348082 547505 conf 0 3 0 4w6d
TCP 10.1.1.9 CONNECT 167013 200235 conf 0 3 0 4w5d
TCP 10.2.1.5 DISCONN 0 0 conf 0 0 - -
TCP 10.2.1.9 DISCONN 0 0 conf 0 0 - -
Total number of connected peers: 2
Total number of connections: 2
In this example, four peers have been configured, but only two of them are connected. A simple PING test would reveal in this case that the two disconnected peers are currently unreachable because of a network problem.
This command output also shows that there are three currently connected circuits to both of the active for a total of six active circuits. You can look at information on the individual circuits with the show dlsw circuits command:
dlsw-branch#show dlsw circuits
Index local addr(lsap) remote addr(dsap) state uptime
2164260933 4000.53aa.2440(04) 4000.3745.aaaa(04) CONNECTED 5d00h
1627390033 4000.53aa.8420(04) 4000.3745.aaaa(04) CONNECTED 2d01h
2684354644 4000.53aa.7023(04) 4000.3745.aaaa(04) CONNECTED 09:41:36
2013265928 000c.29d0.bb41(14) 4000.3745.aaab(04) CONNECTED 4w5d
654311497 4001.001b.a1f3(04) 4000.3745.aaab(04) CONNECTED 3d18h
1342177367 4000.53aa.e3a1(04) 4000.3745.aaab(04) CONNECTED 06:09:14
Total number of circuits connected: 6
Here you can see that these six circuits connect to two different FEP MAC addresses. All of the DSAP values are 0x04, and one of the circuits has an LSAP value of 0x14. This is useful in making sure that you have an appropriate SAP filters in place.
The following shows the output of a debug trace on this same router:
dlsw-branch#debug dlsw
DLSw reachability debugging is on at event level for all protocol traffic
DLSw peer debugging is on
DLSw local circuit debugging is on
DLSw core message debugging is on
DLSw core state debugging is on
DLSw core flow control debugging is on
DLSw core xid debugging is on
dlsw-branch#
May 15 22:18:35.320: DLSW Received-ctlQ : CLSI Msg : TEST_STN.Ind dlen: 40
May 15 22:18:35.320: CSM: Received CLSI Msg : TEST_STN.Ind dlen: 40 from TokenRing0/0
May 15 22:18:35.320: CSM: smac c001.001b.21aa, dmac 0020.353f.ab4a, ssap 4 , dsap 0
May 15 22:18:35.320: broadcast filter failed mac check
May 15 22:18:35.320: broadcast filter failed mac check
May 15 22:18:35.320: CSM: Write to all peers not ok - PEER_NO_CONNECTIONS
May 15 22:18:35.556: DISP Sent : CLSI Msg : TEST_STN.Req dlen: 46
May 15 22:18:35.556: DISP Sent : CLSI Msg : TEST_STN.Req dlen: 46
May 15 22:18:35.556: DISP Sent : CLSI Msg : TEST_STN.Req dlen: 46
May 15 22:18:35.556: DISP Sent : CLSI Msg : TEST_STN.Req dlen: 46
May 15 22:18:35.556: DISP Sent : CLSI Msg : TEST_STN.Req dlen: 46
May 15 22:18:35.556: DISP Sent : CLSI Msg : TEST_STN.Req dlen: 46
May 15 22:18:36.240: DLSW Received-ctlQ : CLSI Msg : TEST_STN.Ind dlen: 40
May 15 22:18:36.244: CSM: Received CLSI Msg : TEST_STN.Ind dlen: 40 from TokenRing0/0
May 15 22:18:36.244: CSM: smac c001.001b.21aa, dmac 0060.9442.7cf3, ssap 4 , dsap 0
May 15 22:18:36.244: broadcast filter failed mac check
May 15 22:18:36.244: broadcast filter failed mac check
May 15 22:18:36.244: CSM: Write to all peers not ok - PEER_NO_CONNECTIONS
Notice that one device, with a Token Ring MAC address of c001.001b.21aa, is attempting to make a connection to two different destination MAC addresses. These connections are being rejected with the message "broadcast filter failed mac check." This message appears because of the dlsw icanreach mac-exclusive command that filters out all MAC addresses from crossing the network unless they are explicitly configured on the router.
Watching this same debug trace for a little while longer reveals more interesting entries:
May 15 22:18:46.852: DLSw: 654311497 decr s - s:58 so:0 r:89 ro:0
May 15 22:18:46.852: DLSW Received-disp : CLSI Msg : DATA.Ind dlen: 336
May 15 22:18:46.852: DLSw: START-FSM (654311497): event:DLC-Data.Ind state:CONNECTED
May 15 22:18:46.852: DLSw: core: dlsw_action_l( )
May 15 22:18:46: %DLSWC-3-SENDSSP: SSP OP = 10( INFO ) to peer 10.1.1.5(2065) success
May 15 22:18:46.852: DLSw: END-FSM (654311497): state:CONNECTED->CONNECTED
May 15 22:18:46: %DLSWC-3-RECVSSP: SSP OP = 10( INFO ) from peer 10.1.1.5(2065)
May 15 22:18:46.920: DLSw: 654311497 decr r - s:58 so:0 r:88 ro:0
May 15 22:18:46.920: DLSw: START-FSM (654311497): event:WAN-INFO state:CONNECTED
May 15 22:18:46.924: DLSw: core: dlsw_action_m( )
May 15 22:18:46.924: DISP Sent : CLSI Msg : DATA.Req dlen: 308
May 15 22:18:46.924: DLSw: END-FSM (654311497): state:CONNECTED->CONNECTED
May 15 22:18:48.612: DLSw: Keepalive Request sent to peer 10.1.1.9(2065))
May 15 22:18:48.628: DLSw: Keepalive Response from peer 10.1.1.9(2065)
In this section of the trace you can see that one of the circuits (with the identifier 654311497) actually sends and receives a piece of information. You can see that it uses the first of the active peers, 10.1.1.5, and that the conversation happens on TCP port number 2065. A short time later, you can see the router exchange keep-alive messages with the other peer to ensure that the peer relationship remains active.
The following shows what happens when a particular circuit disconnects suddenly. In this case, we have deliberately cleared on of the circuits:
dlsw-branch#clear dlsw circuit 2013265928
May 15 22:18:55.412: DLSw: START-FSM (2013265928): event:ADMIN-STOP state:CONNECTED
May 15 22:18:55.412: DLSw: core: dlsw_action_r( )
May 15 22:18:55: %DLSWC-3-SENDSSP: SSP OP = 25( HLTN ) to peer 10.1.1.5(2065) success
May 15 22:18:55.412: DISP Sent : CLSI Msg : DISCONNECT.Req dlen: 4
May 15 22:18:55.412: DLSw: END-FSM (2013265928): state:CONNECTED->HALT_NOACK_PEND
May 15 22:18:55.424: SNA: Connection to Focal Point SSCP lost
May 15 22:18:55.424: DLSW Received-ctlQ : CLSI Msg : DISCONNECT.Cfm CLS_OK dlen: 8
May 15 22:18:55.424: DLSw: START-FSM (2013265928): event:DLC-Disc.Cnf state:HALT_NOACK_PEND
May 15 22:18:55.424: DLSw: core: dlsw_action_z( )
May 15 22:18:55.424: DISP Sent : CLSI Msg : CLOSE_STN.Req dlen: 4
May 15 22:18:55.424: DLSw: END-FSM (2013265928): state:HALT_NOACK_PEND->CLOSE_PEND
May 15 22:18:55.424: DLSW Received-ctlQ : CLSI Msg : DISCONNECT.Ind dlen: 8
May 15 22:18:55.424: DLSw: START-FSM (2013265928): event:DLC-Disc.Ind state:CLOSE_PEND
May 15 22:18:55.424: DLSw: END-FSM (2013265928): state:CLOSE_PEND->CLOSE_PEND
May 15 22:18:55.424: DLSW Received-ctlQ : CLSI Msg : CLOSE_STN.Cfm CLS_OK dlen: 8
May 15 22:18:55.424: DLSw: START-FSM (2013265928): event:DLC-CloseStn.Cnf state:CLOSE_PEND
May 15 22:18:55.428: DLSw: core: dlsw_action_y( )
May 15 22:18:55.428: DLSw: 2013265928 to dead queue
May 15 22:18:55.428: DLSw: START-TPFSM (peer 10.1.1.5(2065)): event:CORE-DELETE CIRCUIT state:CONNECT
May 15 22:18:55.428: DLSw: dtp_action_v( ), peer delete circuit for peer 10.1.1.5(2065)
May 15 22:18:55.428: DLSw: END-TPFSM (peer 10.1.1.5(2065)): state:CONNECT->CONNECT
May 15 22:18:55.428: DLSw: END-FSM (2013265928): state:CLOSE_PEND->DISCONNECTED
As you can see, there are several stages that the connection must go through to complete the disconnection request. First it goes into a HALT_NOACK_PEND state, meaning that both ends have not yet acknowledged the halt order. From there it goes into a CLOSE_PEND state, and finally DISCONNECTED, as the two DLSw peer routers finally agree that the connection has been terminated.
And, for one final example, here is what the debug sna state trace looks like for a similar disconnection and reconnection:
dlsw-branch#debug sna state
SNA state change debugging is on for all PUs
dlsw-branch#clear dlsw circuit 2130706520
dlsw-branch#
May 15 22:22:57.399: SNA: LS VDLCTEST: input=Disc.Ind, Connected -> PendClose
May 15 22:22:57.403: SNA: PU VDLCTEST: input=T2ResetPu, Active -> Reset
May 15 22:22:57.403: SNA: LS VDLCTEST: input=Close.Cnf, PendClose -> Reset
May 15 22:22:57.403: SNA: Connection to Focal Point SSCP lost
May 15 22:23:27.035: SNA: LS VDLCTEST: input=StartLs, Reset -> PendConOut
May 15 22:23:33.035: SNA: LS VDLCTEST: input=ReqOpn.Cnf, PendConOut -> Xid
May 15 22:23:33.303: SNA: LS VDLCTEST: input=Connect.Ind, Xid -> ConnIn
May 15 22:23:33.303: SNA: LS VDLCTEST: input=Connected.Ind, ConnIn -> Connected
May 15 22:23:33.427: SNA: PU VDLCTEST: input=Actpu, Reset -> Active
May 15 22:23:33.427: SNA: Connection to Focal Point SSCP established
In this trace, you can see the exchange of XID information and the final activation of the circuit takes less than a second here. The main time lag occurs as the end devices wait a few seconds after disconnection before trying to reconnect their SNA sessions.