The Network Reliability Engineer's Manifesto

The industry is at a crossroads. And this isn’t just hyperbole - we as an industry have some decisions to make around what we stand for - as engineers, operators, and architects, and frankly, the people that (sometimes literally) keep the lights on. It’s time to have some real talk.

We’ve heard for years that we need to automate, but with so little clarity on why or how. Coming primarily from a customer background, I’ll be honest - network vendors haven’t done the industry many favors here, often speaking of automation as “the replacement for expensive network engineers”, or focusing on fairly minor benefits like doing things “faster”. As someone who has worked in network operations before and seen how fragile it can be, I find doing things faster an utterly terrifying prospect.

Software-Defined Networking really kick-started the network automation conversation back in 2012, but six years later, we’re still primarily talking about “pushing config changes” as the reason someone might want to automate. No wonder most network engineers think automation is only for hyper-scale companies like Facebook or Google. Even large enterprises make very few configuration changes to the physical network once deployed. Yet many vendors who claim leadership of the network automation conversation can rarely go beyond configuration management as a use case.

So ask yourself, as an infrastructure professional, what is really your job? Is it pushing configs? I think a lot of us on the vendor side might like to think you’re spending 100% of your time in our CLI, configuring our boxes - but I’ve spent time in operations and I know this isn't reality. I have been called back in to work because of a broadcast storm. I’ve had to drive to a data center because a bug brought part of the network down. I’ve had to build systems that talk to other systems that talk to other systems because the network is inherently a distributed application; one that touches every other aspect of IT. The reality is that network engineers have to touch all kinds of systems, both within and outside the network, ranging from routers and switches to servers, storage, applications, ticketing systems, and monitoring tools, all for the sole purpose of keeping the network running, and providing connectivity for applications and users. Automation should really be little more than a machine-compatible representation of that.

As it turns out, Facebook and Google aren’t the only organizations that care about reliability. Let’s face it - in 2018, the network is critical infrastructure for most of the planet. The cost of a network outage these days is almost always north of six figures, and for many organizations it gets much, much worse. In “Site Reliability Engineering” culture, there’s a saying:

“Hope is not a strategy”.

The time for hoping network outages won’t occur is over. It’s time to actively engineer against them. It’s time to make network reliability the total, unyielding focus of any automation effort. It’s time for you to start automating your network operations - not because it will put you out of a job, but because it will make you and your network invincible. It’s time to reset the conversation and center our efforts on Network Reliability Engineering, and how automation can play a role in making our networks more reliable.

The Focus on Reliability - What’s in it for me?

“Oh come on, Matt,” you’re probably saying by now. “We’ve always cared about reliability, so what is this all about?”

To be sure, it’s not like I’m making a massively controversial statement when I say reliability should be the focus. I don’t think anyone would explicitly argue against reliability, except for maybe pager manufacturers. However, there’s a big difference between not being outright against reliability, and obsessing over it to the point where every process is aimed at it as an ultimate goal.

Again though, whenever automation comes up, it’s always pitched as a cost-saving measure, or a way to make things go “faster”. These may seem like logical outcomes, but anyone who has started putting automation into practice will tell you it’s not this easy. Using Ansible doesn’t mean you’ve automated your network operations; I assure you, I can destroy a network just fine with an Ansible playbook. Automation is no more a time-saver than it is a cost-cutter.

Let’s say you build a huge rocket capable of going to Pluto and back. It’s a monster - an enormous machine with giant engines, worth billions of dollars and decades of research. Having the capacity to get to Pluto and back is truly impressive. But if it blows up on the pad because of a hastily assembled subsystem, literally none of that matters.

Network Reliability Engineering is needed because network operations as a whole simply isn't reliable today. We know this because much of the network industry is inherently terrified of automation. Engineers are horrified to think that the firefighting they're already dealing with could be magnified simply by doing it faster.

I’ll let you into the big secret about automation:

Automation is just reliable operations.

Automation, when done correctly, is the cornerstone of reliability. At its simplest, it is a machine-executable form of what’s already in your head. Not a “developer’s” head; yours. You are the one who is best positioned to make automation real in your environment, and offer your network as a responsive infrastructure for others to build on. Take Amazon for instance. They employ scores of network engineers just like you or me, but I've never seen an indication of which switches or routers are up when I go to the AWS services status page. Instead, I'm told which services are online. This isn't magic, and it's not "developers taking over the network". It's network engineers like you translating network-specific tools and workflows into a format that the customers of the network actually care about. This obsessive focus on customers is surely not something only for the hyperscalers.
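To make that translation a little more concrete, here is a minimal Python sketch of rolling device-level state up into service-level status. The device names, the service-to-device map, and the three status labels are all hypothetical and chosen purely for illustration; this is not a description of how AWS or any particular status page is built.

```python
# Hypothetical example: roll device-level state up into service-level status.
# Device names, services, and status labels are illustrative assumptions.

def service_status(required_devices, device_up):
    """Return 'online', 'degraded', or 'offline' for one service."""
    # A device missing from the state map is treated as down.
    up = [d for d in required_devices if device_up.get(d, False)]
    if len(up) == len(required_devices):
        return "online"
    if up:
        return "degraded"
    return "offline"


def status_page(service_deps, device_up):
    """Map every service name to its rolled-up status."""
    return {
        name: service_status(devices, device_up)
        for name, devices in service_deps.items()
    }


# Assumed inventory: which devices each service depends on.
SERVICE_DEPS = {
    "checkout": ["edge-rtr-1", "edge-rtr-2"],
    "reporting": ["core-sw-1"],
}

# Assumed current device state, as a monitoring system might report it.
DEVICE_UP = {"edge-rtr-1": True, "edge-rtr-2": False, "core-sw-1": True}

if __name__ == "__main__":
    for service, status in status_page(SERVICE_DEPS, DEVICE_UP).items():
        print(f"{service}: {status}")
```

The point of the sketch is the direction of the mapping: customers read the service names on the left, and only the network team needs to care about the device names on the right.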

What you’ll find when starting down the NRE journey and using automation to make your networks more reliable is that you’ll unlock new capabilities - both for your organization and for yourself - that you simply didn’t have before because you were mired in the toil of network operations and fire-fighting culture. Automation does eliminate tasks, but it doesn’t eliminate people. People are literally the fulcrum on which automated operations sit, and those who use automation to make their networks more reliable will always be moving their skill-sets forward and making themselves and their organizations more valuable.

All this isn’t to say that automation won’t let your IT organization move faster. On the contrary, I’d argue that reliable network operations must strive to reduce the mean time to respond (MTTR) and increase the mean time between failures (MTBF). Just because reliability is automation’s ultimate purpose doesn’t mean it’s the only purpose.
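One common way to see how MTTR and MTBF combine is steady-state availability, MTBF / (MTBF + MTTR). The Python sketch below assumes MTTR is measured as the full time to restore service rather than the time to first response, and the figures in the example are made up for illustration.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours, mttr_hours, hours_per_year=8760):
    """Expected yearly downtime implied by the availability figure."""
    return (1 - availability(mtbf_hours, mttr_hours)) * hours_per_year


if __name__ == "__main__":
    # Illustrative scenarios only; none of these numbers come from real data.
    scenarios = {
        "baseline": (500, 4),
        "faster restore (lower MTTR)": (500, 1),
        "fewer failures (higher MTBF)": (2000, 4),
    }
    for label, (mtbf, mttr) in scenarios.items():
        print(f"{label}: availability={availability(mtbf, mttr):.5f}, "
              f"downtime/year={downtime_hours_per_year(mtbf, mttr):.1f}h")
```

Either lever improves the result, which is why moving faster and being more reliable are not in tension when speed is applied to restoring service.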

You’ll find that the more reliable your network is, the less time you’ll spend in toil, the more time you’ll spend making your customers happy, and the sooner you’ll get home to spend uninterrupted time with your family. This is about keeping your eyes on what will trigger all of the other benefits. A focus on reliability means you get to have your cake, and eat it too.

The NRE Manifesto

As Network Reliability Engineers, it's time to converge on a set of ideals. We'll endlessly be debating which automation tool is best until we have a real conversation about what matters; what it is that drives us, and how we use automation to make our network operations more reliable.

We’re obsessed with our customers. Everything we do is aimed at meeting their expectations. We recognize that WE are the ones that are in a position to translate between networking-specific metrics and overall service health.
We don’t rely on hope. We proactively engineer reliability into everything we do. We don’t rest on our laurels after an outage; we obsess over the details of what happened, without blame, and engineer our systems to ensure it will never happen again. We don’t just assume the network is working; we run tests and experiments in production all the time to constantly challenge this assertion.
We are full-stack. We can recite every OSPF LSA type on command, but we also deploy tools and applications to public cloud, and have an opinion on the Kubernetes network model. We automatically invite ourselves to every meeting about service meshes. Why? Because these are “the network” just as much as the access points in our campus network, and if we’re going to engineer reliability into our network, we need to have a hand in the way our organizations deploy applications.
We embrace failure. Not because failure is fun, or outages are particularly pleasant experiences - they’re not. But we put processes and tools into place to ensure we never encounter the same failure twice. So, failure means we’re growing. Our infrastructure is encountering one more thing that will never catch it by surprise again. And in the future, where we’d normally be kept away from our families to fix an issue for the Nth time, we’ll be home for dinner.
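As one small illustration of not assuming the network is working, here is a hedged Python sketch of a reachability check that could run on a schedule. The target list is hypothetical (the addresses come from the 192.0.2.0/24 documentation range), and a TCP connect is only one of many possible probes; the probe is passed in as a parameter so the same loop can be exercised without touching a real network.

```python
import socket


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(targets, probe=tcp_probe):
    """Return the (host, port) targets that failed the probe."""
    return [(host, port) for host, port in targets if not probe(host, port)]


# Hypothetical targets; replace with the endpoints your customers depend on.
TARGETS = [("192.0.2.10", 443), ("192.0.2.20", 22)]

if __name__ == "__main__":
    # Simulated probe so the example runs anywhere: pretend port 22 is down.
    def simulated_probe(host, port):
        return port != 22

    for host, port in run_checks(TARGETS, probe=simulated_probe):
        print(f"FAILED: {host}:{port}")
```

In practice the list of failures would feed an alert or a ticket rather than a print statement, and each new failure mode you discover becomes one more check in the list.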

By Matt Oswalt