core router service disruption [UPDATED 5 Nov 2015]

  • Written by
    Herman Bos
  • Published on

Service disruptions on one of the core routers (CR1) in TCN.

5 Nov 2015 - 01:00 CR1 Supervisor placement

Following up on the maintenance completed on Monday, we will add an extra router Supervisor to CR1 for added redundancy in case the main Supervisor fails. This maintenance will be carried out tonight, at 01:00 on the 5th of November. There is no expected impact.

This maintenance is a follow-up to the hardware replacement of the Supervisor in CR1. Following the manufacturer's advice, we reseated the Supervisor in the CR1 chassis on Monday. Tonight we will add the replacement Supervisor. It will run in hot-standby mode, which ensures that CR1 experiences only a short hiccup in case of another crash. We will follow up with another maintenance window in which we will perform a controlled, forced Supervisor switchover.

####Description

  • Place Supervisor card in CR1
  • Boot up Supervisor card
  • Sync both Supervisor cards

####Impact

  • There is no expected impact.

3 Nov 2015 - 7:00 Maintenance summary

Before replacing the hardware, we deployed the configuration changes initially planned for Thursday. This allowed us to limit the impact of taking CR1 out of service for maintenance.

During the hardware replacement we experienced some difficulties. This led to several additional reloads and the decision to resolve the remaining issues outside the maintenance window.

To mitigate the impact of another crash, we decided to move more of the infrastructure that depends on CR1 to access switches or CR2. This way we were able to conclude the maintenance for today around 5:15 in a state that minimizes the impact of a crash while we follow up with a new maintenance window for hardware replacement.

3 Nov 2015 - 0:00-4:00 Emergency maintenance

Following the repeated issue, we will replace the hardware in CR1 tonight. To minimize disruption, we will also pull forward the maintenance planned for this Thursday, since those changes will reduce the impact of the hardware replacement.

####Description

  • We will make changes to the network configuration which will change the MAC address of the gateway for each subnet.
  • Where applicable, servers will be moved from a direct router connection to an access switch.
  • CR2 will be configured as primary.
  • CR1 Router will be drained of traffic to limit impact of shutdown.
  • CR1 hardware replacement.

####Impact

  • Gateway change. Depending on the device it may take some time to pick up this change. Generally, busy servers will pick up the changes almost instantly and servers which are mostly idle may take a while.
  • Customers with direct connections to CR1 will have this link down during hardware replacement (15-30 min.).
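To check whether a server has already picked up the new gateway MAC address, you can compare the entry in its neighbour (ARP) cache before and after the maintenance. The following is a minimal sketch, assuming a Linux server with the iproute2 `ip` command available. The helper names and the sample addresses are illustrative and not part of our tooling.

```python
import subprocess
from typing import Optional


def parse_lladdr(neigh_line: str) -> Optional[str]:
    """Return the MAC address from one line of `ip neigh show` output, or None.

    A typical line looks like:
        192.0.2.1 dev eth0 lladdr 00:11:22:33:44:55 REACHABLE
    Entries without a resolved address (e.g. FAILED) have no `lladdr` field.
    """
    tokens = neigh_line.split()
    if "lladdr" in tokens:
        idx = tokens.index("lladdr")
        if idx + 1 < len(tokens):
            return tokens[idx + 1].lower()
    return None


def gateway_mac(gateway_ip: str) -> Optional[str]:
    """Look up the cached MAC address for gateway_ip (Linux with iproute2 assumed)."""
    result = subprocess.run(
        ["ip", "neigh", "show", gateway_ip],
        capture_output=True, text=True, check=False,
    )
    for line in result.stdout.splitlines():
        mac = parse_lladdr(line)
        if mac is not None:
            return mac
    return None


def gateway_changed(old_mac: Optional[str], new_mac: Optional[str]) -> bool:
    """True only when both MAC addresses are known and they differ."""
    if old_mac is None or new_mac is None:
        return False
    return old_mac.lower() != new_mac.lower()
```

For example, record `gateway_mac("192.0.2.1")` before the maintenance (192.0.2.1 is a placeholder; use the gateway address of your own subnet) and pass it to `gateway_changed` together with the value seen afterwards. If the entry has not changed on a mostly idle server, generating some traffic towards the gateway usually causes the cache entry to be refreshed.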

####Further information

If you have any questions concerning the maintenance, or have special requests, please contact us.

2 Nov 2015 - core router service disruption

Today the issue with the router recurred.

  • Analysis indicates that it is a hardware issue.
  • We are planning a hardware replacement in an emergency maintenance window.

Timeline 2 November 2015:

  • 09:50:52: Router crash
  • 09:51:00: Engineer notified and investigation started
  • 10:01:07: Full service restored

Impact:

  • directly connected network customers
  • directly connected servers
  • part of our virtualization platform

13 Oct 2015 - Update

  • Further analysis indicates it was a software issue.
  • During the reload the router automatically loaded an updated software image which was previously staged and waiting for a scheduled maintenance window.
  • Since we are now running a newer software version, we decided not to investigate this particular issue in the old version.
  • We confirmed there are no other routers in the network running the affected version.
  • High-availability mechanisms for all managed environments were evaluated today. The results will be reviewed per customer, and we will take additional measures where needed.
  • We double-checked spare parts availability to allow quick replacement in case the problem recurs and we decide to replace the hardware.
  • We will move some environments from router ports to access ports where this is a better fit, to lower the impact of a router failure.

12 Oct 2015 22:39-22:49 - core router service disruption

One of the core routers in TCN experienced an error and reloaded itself. This resulted in disrupted service for directly connected servers and customers.

Timeline 12 October 2015:

  • 22:39: Router crash and automatic reset
  • 22:49: Full service restored
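The duration of each disruption follows directly from the timestamps in the timelines in this post. The sketch below simply subtracts the crash time from the time full service was restored. The 12 October times were recorded to the minute only, so the seconds are assumed to be :00; the function name and the timestamp format are illustrative.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def outage_seconds(crash: str, restored: str) -> int:
    """Return the number of seconds between the crash and full service restoration."""
    delta = datetime.strptime(restored, TIME_FORMAT) - datetime.strptime(crash, TIME_FORMAT)
    return int(delta.total_seconds())


# Timestamps taken from the timelines in this post.
incidents = {
    "12 Oct 2015": ("2015-10-12 22:39:00", "2015-10-12 22:49:00"),  # seconds assumed
    "2 Nov 2015": ("2015-11-02 09:50:52", "2015-11-02 10:01:07"),
}

for label, (crash, restored) in incidents.items():
    secs = outage_seconds(crash, restored)
    print(f"{label}: {secs // 60} min {secs % 60} s")
# 12 Oct 2015: 10 min 0 s
# 2 Nov 2015: 10 min 15 s
```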

Impact:

  • directly connected network customers
  • directly connected servers
  • part of our virtualization platform
