Wednesday, January 21, 2009

Internet integrity, or should I say, fragility

How reliable is the Internet?

Our team is working on metrics for last year, and it gave me a chance to reflect on what works well and what doesn't. As I looked at the statistics, I was quite pleased and definitely proud of our team. We had several hardware issues that were difficult to push through as we worked with the hardware vendors. The goal of any carrier is to maintain a core network that meets or exceeds 99.999% reliability. In practice, that number means less than 6 minutes of total downtime out of a potential 525,600 minutes in a year. I have actually held our team accountable to 5 minutes and 18 seconds per year. We were able to meet 99.999% (around 3 minutes of total downtime) on our Internet core and 100% on our other networks. Does this mean that we didn't have any customers go down? Unfortunately, no. On the core network nodes that all customers cross, we hit our numbers. On direct links, we did have customers go down. As copper plants (DSL, T1s, DS3s) continue to age, the number of failures will continue to grow over time. :-(
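For anyone who wants to check the arithmetic behind "five nines," here is a minimal Python sketch. The function names are my own for illustration, and it assumes a 365-day year (the 525,600 minutes mentioned above), so treat it as a back-of-the-envelope helper rather than how any carrier actually computes its metrics.

```python
# Minutes in a 365-day year: the 525,600 figure used above.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_percent, total_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed per year at a given availability target."""
    return total_minutes * (1 - availability_percent / 100)


def achieved_availability(downtime_minutes, total_minutes=MINUTES_PER_YEAR):
    """Availability percentage achieved for a given amount of downtime."""
    return 100 * (1 - downtime_minutes / total_minutes)


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.999)
    print(f"Five-nines budget: {budget:.3f} minutes per year")
    print(f"Availability with 3 minutes down: {achieved_availability(3):.5f}%")
```

Running it shows how little room a five-nines target leaves, and that roughly 3 minutes of downtime sits comfortably inside that budget.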

So how does this relate to fragility on the Internet core? That question takes us to the basic infrastructure of how an enterprise or carrier network is set up. Most large networks are set up using OSPF and BGP. There are multiple ways for network engineers to configure the network, and I rarely find one engineer who agrees with another. (Can you say "standards"?) To understand how this is set up using two protocols, you need to understand how BGP determines a best path. Imagine you are headed to Washington, D.C. to watch the presidential inauguration. When you get to DC, you ask for directions. Here are your two sets of directions:

Person 1: To get to the party, go down this street and you should get there in about 30 minutes.

Person 2: To get to the party, take a right on Johnson Street and go 2.3 miles. Then turn right on Constitution Avenue, go 1.4 miles and merge onto X Street. Go 0.7 miles down X Street....

Which set of directions would you want to follow? The Internet works the same way. Routers look at the routes their neighbors announce and choose the most specific one. I watched another carrier in early 2001 take a large percentage of my carrier's Internet traffic to Europe by misconfiguring their BGP tables. They announced more specific routes for large portions of our routing table. It took a while, but we finally got them to notice what they had done. On the other end, I've watched customers misconfigure their own routers and try to suck down as much traffic as they could. We could handle the traffic easily, but they could not. We configure our customers with very specific elements to prevent these types of misconfigurations. I can't, however, stop another large carrier from fat-fingering a configuration. When they do, we work with other carriers to blackhole their routing announcements to minimize any issues.
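To make the "more specific path wins" idea concrete, here is a small Python sketch of longest-prefix matching using the standard `ipaddress` module. Everything in it is made up for illustration: the prefixes come from address ranges reserved for documentation, and the route labels (`our-network`, `other-carrier`, `default-upstream`) are hypothetical. It also leaves out everything else a real router weighs when choosing between BGP paths, so read it as a toy model of the situation I described, not as a router implementation.

```python
import ipaddress

# Hypothetical route table: (announced prefix, who announced it).
ROUTES = [
    ("198.51.100.0/24", "our-network"),    # the legitimate announcement
    ("198.51.100.0/25", "other-carrier"),  # a more specific, mistaken announcement
    ("0.0.0.0/0", "default-upstream"),     # catch-all default route
]


def best_route(routes, destination):
    """Return (network, announcer) for the most specific matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, announcer in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network, announcer))
    if not matches:
        return None
    # The longest prefix length is the most specific route.
    return max(matches, key=lambda match: match[0].prefixlen)


if __name__ == "__main__":
    for address in ("198.51.100.10", "198.51.100.200", "203.0.113.5"):
        network, announcer = best_route(ROUTES, address)
        print(f"{address} -> {network} via {announcer}")
```

In this toy table, traffic for the lower half of the /24 follows the other carrier's /25 even though our /24 is still being announced, which is the same effect as the incident above.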

The Internet is amazingly resilient in the way it handles traffic. When you are configuring BGP on your network, make sure you understand that your configuration has the potential to impact those around you.
