How to Handle Customer Escalations as a Software Engineer
If you're a software engineer working for a company that sells software, the term "customer escalation" (or maybe "SEV1") is most likely familiar to you. It might even be a bit scary to you. In my professional experience as an engineer (both Mechanical and Software), I have handled countless escalations, witnessed many colleagues handle them, and witnessed even more colleagues avoid them like the plague.
To be clear, I am using the term "customer" in this essay more with a B2B scenario in mind (meaning the customer is another company), but I have written software for B2C businesses as well, and handled both individual and business customers. The general ideas laid out in this essay are the same.
There's a reason why many developers shy away from these situations. Obviously, it's not a part of our job that we enjoy. It's a stressful situation that can be quite frustrating. On one side, you have a possibly annoyed customer who is often pressed for time, and to make matters worse, may not be very technical. And on the other side, there's you: a possibly timid developer who is already under stress (maybe due to that Sprint Review the next day), and to make matters worse, perhaps not a native speaker of the customer's language.
I believe it's important to give some thought to the term "escalation", as I have found that developers often take it as if it were just a name, and not a term that actually conveys something. An escalation usually originates in a support ticket (or a phone call) opened by the customer. These are first handled by the company's technical support team (or perhaps the "Customer Success" team). Their main objective is to resolve the customer's issue right then and there. If, however, the technical support representatives cannot provide a satisfactory resolution to the issue, the customer (or the representative themselves) will escalate the issue up your employer's chain of command (or "chain of support" if you will). Usually, each further link of that chain is more technical than the previous one, or possibly has more expertise relevant to the customer's issue.
Sometimes, however, customers will escalate an issue even to management levels. This is when the issues are either the worst, or the support the customer received up to that point was very unsatisfactory. The customer wants to make sure they receive the attention of the higher-ups of the company. The thing is, people in management positions can't really resolve technical issues, which is why developers often find themselves called in to assist with escalations where the customer is at their most frustrated and angry.
Keep this sentence firmly in your mind: from a technical standpoint, there is no higher level to escalate an issue than a developer! If you've been called on for a customer escalation, then it (usually) means that all previous links of the support chain have failed.
Now, I'm not saying this to frighten you. I'm saying this so that you understand how important it is, and why you should never take these things lightly. You wrote a piece of software that customers are now using and relying on, and it doesn't work as expected (or at all). If we can't solve the customer's issues, then no one can.
There's a silver lining to these situations, however, and I'm hoping to convince you to view escalations as opportunities to seize rather than as punishments to avoid. Escalations are arguably the best way to achieve several things:
Progress as an engineer: the Cyber world is quite vast. There's a lot to learn, and it's not uncommon for engineers in a software company to not get any new training for years. I've had many friends and colleagues describe their situation as being "stuck" knowledge-wise, unsure how to progress. Well then, look no further than a crashed system and an angry customer. I'm not saying that's the only way, but I attribute a lot of my knowledge and success to my experience solving issues.
The technologies you will encounter when working with a customer may be unfamiliar to you. The issue itself may be a total mystery to you. Is it a bug in your code? A networking issue? A hardware fault? The things you will learn tackling other people's issues will be invaluable to you, I guarantee it. And yes, you could have learned them reading tech books, of which there are many, but books and documentation will never cement such knowledge so well into your brain as handling a customer issue. They are important too, though.
Gain insight into how customers actually use your product: developers often live in a bubble of sorts. Even though we wrote the software, we don't necessarily know how customers are using it. Even if our software employs various tracking mechanisms, these are probably mostly used by product teams. We tend to repeat the exact same flows ad infinitum when doing manual tests, so we often miss the bigger picture. I've seen entire features get completely neglected because they were considered esoteric, only to find that some customers absolutely live on them. I've seen the same happy-day input being fed into systems over and over again with real-world parameters not even thought of.
I am often surprised and even taken aback when I see customers using my (or my company's) software in real time. The weird crooked flows they concoct, the impressive workarounds they create, the way they often like to push your software to its limits, and how they always abuse that one thing you were sure they never would.
When you handle customer escalations, you get to see with your own eyes how your software is actually used in the real world. Not only will this help you in the short term, it will also help you write better software and avoid mistakes later on in your career.
Gain confidence and improve your interpersonal skills: the average person doesn't like to be in the spotlight, definitely not in a negative situation such as a customer escalation call. It is not uncommon for developers to be frightened in these situations. I've seen junior colleagues go on calls with quaking voices, scared to mess up, and I've seen how with time they became more and more confident. This experience can be priceless, and not only for your career. Whether you succeed in some calls or fail in others, you will learn from all these experiences, become more confident in your abilities, and become a bigger focal point within your department, which can help you to attain a measure of leadership. And you can use these abilities anywhere. Public speaking may not look so scary to you anymore. You may find that you're capable of explaining yourself more clearly even in stressful situations. Your friends and colleagues may look up to you.
Remember, the courage to do something scary doesn't come before you do it. You have to actually do it first.
Gain more attention in the company: the way you handle yourself in bad times will count a lot. In fact, your mistakes and failures will not matter as much as how you handle them, and if you handle them well, you may come out of an incident that may well have been your fault in a better position than you went in. I can illustrate my point better with a story:
Early on in my professional career, I wrote a system that managed a critical component (from a third party) of the Israeli telephony system. I started working on that system on my first day working for the company (a large company with thousands of employees). One day, not a full year later, we had a major incident: that critical component was down, and most phone calls failed. This isn't just your mother not being able to call you to ask what you're gonna have for lunch, this is the banks, the military, government institutions and others not being able to make phone calls.
At first, I was sure this had nothing to do with me, and I watched from the sidelines as Israeli media increased the estimate of damages as time went on and the issue persisted. I remember it reaching somewhere in the hundreds of millions of NIS.
It was sometime in the afternoon when the company's CTO suddenly barged into my office and told me the logs from the manufacturer clearly showed that the component went down after my system performed a certain action. That's not how he phrased it, though. He phrased it as "the logs show that you did it." As far as he was concerned, the system and I were one and the same.
Later that day, and with the help of the manufacturer, the system administrators managed to bring the component back up, and the incident was over. For me, though, it had only begun. For two months after that incident, I found myself—a young, junior developer—in management meetings, board meetings, and even a hearing with government officials. I had to ship the entire code of my system to be audited by the manufacturer, because everybody was operating under the assumption that my code was somehow responsible for the issue. I had to go on calls with experienced engineers in the US to explain everything about my system. I had to write technical documents about the system and the exact things it did during that day. A month or so after sending my code to be audited, we received the results in a relatively long document. What mattered was the bottom line - the manufacturer had determined that my code was "well written and probably not the cause of the incident."
Some more time had passed, and the cause of the incident was still unknown. I was invited to a meeting that included company management, manufacturer representatives, and Israeli government representatives. The meeting was chaotic. People were shouting back and forth, and nothing of value was said. I, for the most part, stayed silent. The meeting was about to end with no real conclusions, when the CEO (who later went on to be a candidate for the office of Israeli Prime Minister) looked at me and asked "Ido, anything to add?" I said "yes," and looked at the manufacturer representatives. "Please look at how your component implements action X, I suspect there's a race condition based on past incidents that did not have a similar effect." I was a bit more specific, but I'm leaving it a bit abstract because it doesn't really matter. "Okay, we'll look into it," they said, and the meeting ended. A week or so later, the manufacturer found the issue - it was exactly what I said it was. They issued a fix, and this chapter of my life was finally put to rest.
This incident put me in the spotlight, though. After that, everybody knew who I was, and everyone in management was impressed. I won Employee of the Year a few months later, and received bigger and more challenging projects. And a raise.
Alright Then, You Convinced Me. Now What?
If you're still a novice developer, or really scared of dealing with customers directly, then you can start by joining your colleagues when they handle escalations as a silent observer. Listen to the interactions, learn the approach of your colleagues and try to recognize the things they did right or wrong. Joining as an observer relieves you of any stress or emotional attachment to the situation so you can focus on learning. Later on, when you feel ready, let your superiors know that you're available to help with escalations as well. Of course, for a lot of us there wasn't any choice to begin with, but I've known many developers who were never called on for escalations, for whatever reason, and I feel this was a hindrance to their careers.
Now, here are some guidelines I have formed over the years when it comes to our subject:
Escalations Trump Anything Else
A good engineer will put escalations above their day-to-day work and targets. I know it's fun to write features and sucks to solve customer issues, and I also understand that your team lead or R&D management expects you to finish that new feature by Friday regardless of any urgent escalations. With all due respect to management, we are not paid to be mindless code monkeys. We know and understand that if our existing customers cannot use our product's current set of features, then adding new ones certainly won't help.
I have sometimes been in situations of conflict with higher-ups because I gave more import to existing customers and their issues than to potential customers and the new features they were promised in order to make a sale. And I'll make the same choice and put myself in the same situation any day of the week, no questions asked. In my experience, in the long run your superiors will learn to appreciate that. In startups especially—which by their nature put growth and "accumulating logos" above anything else—this is a necessary balance. If they don't appreciate that, then may I be so bold as to suggest that you're better off working someplace else. If I can't be proud of my work and know that it provides value, what's the point?
You're Not Only There to Solve the Issue
I will repeat: when a developer joins an escalation call, that means all "front lines of defense" have failed. Our main job is, of course, to find and solve the problem. But we also have a hidden agenda: we need to instill confidence within the customer. Nothing guarantees that we can find the issue on that first call, or that we can even solve it then and there. If we can, then great, the customer will most likely leave the situation with renewed confidence in the product without us doing anything else. But if we can't solve it then and there, then it is imperative that we have the customer positively confident that we can and will find and solve their issues.
To do that, we must act professionally, but not robotically. I am well known as a generally non-formal person. Being overly formal is not a good idea anyway, if you ask me. So when it comes to dealing with customers, I make sure to be polite, receptive and empathic, but also non-robotic. Speak as you normally do (minus the swearing, I guess), be as natural as you can. Make it clear to the customer that you understand that the issue may be affecting their business.
Now, when I say empathic, I definitely do not mean that you should apologize to the customer for the issues, or thank them for their patience, or drag them through a 45 second monologue of how sorry you are. Leave that to the support guys. We're there to solve issues, not caress the customer to a gentle sleep. You need to be matter of fact, direct and to the point.
If you ask the right questions, take control of the call, and have confidence in yourself, then the customer will have confidence in you. If you're not confident in yourself, then fake it. We're not trying to fool the customer, though. You're not just trying to end the call and that's it, the customer needs to be confident that you can find and solve their issue, which you will.
Similarly, do not lie to the customer or hide anything from them. I find that admitting my/our mistakes to the customer, even if they're embarrassing, is often very appreciated by them and helps them regain confidence. I've had several large enterprise customers who would specifically ask that I handle their issues even when they were in parts of the product I had nothing to do with, because they could always count on getting honest answers.
No, the Customer Isn't Stupid
Earlier in the essay I mentioned that the customers we deal with during escalations may not be particularly tech savvy. Even if they are, their area of expertise may be very different than that required to understand your software and its technical domain. But never, ever, assume that the person you're talking with is an idiot. This may sound like a no-brainer, but if you've handled your fair share of customer calls, I guarantee you thought that about at least one of them.
If you think the customer is an idiot, it will eventually become apparent and the customer will feel that. That's not gonna be good for business, that's not gonna be good for anybody. So treat the customer as if they were more savvy than you, but never say anything technical without explaining it. Not in a condescending, "I know stuff and you don't" sort of way, but in a matter of fact, "this is how things work" sort of way.
For example, if you're asking the customer to follow a few steps in the software to reproduce the issue or to gather more information, explain why you want them to do that, what you expect to happen (technically) and what you hope to gain from that. If you just give them commands ("click this, click that, what does that say") then you're making the customer feel like an idiot.
Everything Is Evidence
On a more technical note, we often have preconceptions about which pieces (or sources) of information are important and relevant to the issue and which are not. I've made that mistake many times before. You should not dismiss anything. Not a single log line, no matter how irrelevant you think it is. I know you think that WARN-level log line is irrelevant, "it's just about a record that wasn't found in the database, has absolutely nothing to do with the issue," but the most benign incidents sometimes have cascading effects that are hard to predict. So if you are looking at logs, read them with scrutiny. Ask to get them in their entirety, if possible [1]. Also, request to have the call/session recorded. I have had calls where I thought I saw everything there was to see and heard everything there was to hear, only to go over the recording a day or two later and suddenly notice something I didn't notice before, or get that sudden light bulb that leads me to solve the issue. If you can't record the session, then write down as much as you can.
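To make the "don't dismiss that WARN line" point concrete, here is a minimal sketch of a log triage helper. It assumes a hypothetical log format of "<timestamp> <LEVEL> <message>" (your product's format will differ), and the function names are made up for illustration. Rather than filtering by level, it shows every line that preceded each ERROR, whatever its level.

```python
import re
from collections import Counter

# Assumed (hypothetical) log format: "<timestamp> <LEVEL> <message>".
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def level_counts(lines):
    """Count lines per level; lines that don't parse are counted, not dropped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        counts[match.group("level") if match else "UNPARSED"] += 1
    return counts


def context_before_errors(lines, window=3):
    """For every ERROR line, return the `window` lines that preceded it,
    regardless of their level, so a 'benign' WARN is not filtered away."""
    findings = []
    for index, line in enumerate(lines):
        match = LINE_RE.match(line)
        if match and match.group("level") == "ERROR":
            findings.append((line, lines[max(0, index - window):index]))
    return findings


# Made-up sample data, only to show the shape of the output.
sample = [
    "2024-01-01T10:00:00Z INFO request started",
    "2024-01-01T10:00:01Z WARN record 42 not found in database",
    "2024-01-01T10:00:02Z INFO falling back to default settings",
    "2024-01-01T10:00:03Z ERROR failed to render response",
]

if __name__ == "__main__":
    print(dict(level_counts(sample)))
    for error_line, preceding in context_before_errors(sample):
        print(error_line)
        for line in preceding:
            print("    preceded by:", line)
```

The design choice here is to widen the view around a failure instead of narrowing it by severity, which is the habit this section argues for.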
Everyone is a Suspect
When the customer explains/presents an issue, our instinct is to immediately limit our view to a narrow part of the problem: our code. It's important to know the relevant code and what it does (and often it really is that code that is at fault), but we must always broaden our view and look at the big picture and full flow, including everything around our code and product. This can mean a lot of things, especially if our software deals with networks (which most do).
So you gotta look at every possible link of the chain that may be at fault. The network itself may be down. A firewall may be blocking access. A proxy server may be scrambling responses. The operating system's file descriptors may be exhausted. The customer may have enabled SELinux recently. The system configuration may be preventing the system from doing its work. The platform on which our software is running may have a bug or be misconfigured (do not dismiss this, you may think that Docker or Kubernetes or ECS are perfect, but they can and have had bugs). The file system may be corrupted or just full.
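A few of these suspects can be checked mechanically. The following is a rough sketch, not a complete diagnostic tool: the function names are hypothetical, the file descriptor checks assume a Unix-like system (the open-descriptor count relies on Linux's /proc), and the host and port you probe are whatever your software actually needs to reach.

```python
import os
import shutil
import socket


def disk_usage_percent(path="/"):
    """Percentage of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def fd_limit():
    """(soft, hard) limit on open file descriptors, or None where the
    `resource` module is unavailable (for example on Windows)."""
    try:
        import resource
    except ImportError:
        return None
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Descriptors currently open in this process (Linux only), else None."""
    try:
        return len(os.listdir("/proc/self/fd"))
    except OSError:
        return None


def can_connect(host, port, timeout=3.0):
    """Try a plain TCP connection and report the reason when it fails.
    A firewall, a dead network, or a DNS problem all surface here."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    print("disk used (%):", round(disk_usage_percent("/"), 1))
    print("fd limit (soft, hard):", fd_limit())
    print("open fds:", open_fd_count())
    # "localhost" and 8080 are placeholders for a dependency of your system.
    print("tcp check:", can_connect("localhost", 8080))
```

None of this replaces looking at the customer's environment with them on the call, but it gives you a short list of facts to collect before you start reading your own code.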
Even the hardware itself is suspect [2]. I once had a distributed system that would sporadically return invalid responses to users. For a while, we had no idea what the issue could be, rummaging through countless lines of code trying to find the culprit, until one day we found that the network adapter on one of the system's servers was malfunctioning and occasionally dropping packets.
What this all means is that to be really good at finding issues, you have to be knowledgeable about computers, networks, operating systems, not just your product. This may sound daunting, but the good news (once again) is that escalations are one way to gain that knowledge. Do yourself a favor, though, and pick up that old copy of Modern Operating Systems; read up on the OSI networking model; build a Linux From Scratch system; write a TCP server and invent your own protocol just for learning.
Now, it may be tempting to try and find something else to blame for the customer's issues, and some customers are even extremely annoyed when we ask to check these supposedly external factors. But it's extremely important, and it's better to waste a few more of the customer's minutes there on the call than potentially a few more days or weeks with an unsolved issue. So be nice and ask the customer to humor you.
Trust None of What You Hear, and Half of What You See
When you get on a customer escalation call, the customer has probably already gone through the issue with the support teams multiple times. While it's beneficial to get information from those who dealt with this customer/issue beforehand, let me impart this on you: trust no previous knowledge. Believe none of what you hear. Ask to see everything with your own eyes, even if it annoys the customer (see previous section).
So if the support representative (or even the customer themselves) tells you that their Linux kernel version is 4.6.2, don't believe it until you've seen it for yourself. If they tell you they're getting error message X, ask to see it, because it may actually be error message Y. You'd be surprised how often such reports are inaccurate or just plain wrong. I cannot stress that enough: many an hour has been wasted due to such miscommunications.
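One way to turn "see it for yourself" into a habit is to compare what you were told against what the machine reports. This is a minimal sketch under stated assumptions: the function names and the dictionary keys are hypothetical, and it would have to run on the customer's machine (or they would run it and show you the output) for the observed values to mean anything.

```python
import platform


def observed_facts():
    """Facts read from the machine itself rather than from a ticket."""
    return {
        "os": platform.system(),       # e.g. "Linux"
        "kernel": platform.release(),  # the kernel release string on Linux
        "machine": platform.machine(), # e.g. "x86_64"
    }


def find_mismatches(claimed, observed):
    """Return {key: (claimed, observed)} for every claim that does not
    match what was actually seen. Claims nobody verified are flagged too."""
    mismatches = {}
    for key, claimed_value in claimed.items():
        actual = observed.get(key, "<not collected>")
        if actual != claimed_value:
            mismatches[key] = (claimed_value, actual)
    return mismatches


if __name__ == "__main__":
    # What the ticket says, using the kernel version from the text above.
    claimed = {"os": "Linux", "kernel": "4.6.2"}
    for key, (told, seen) in find_mismatches(claimed, observed_facts()).items():
        print(f"{key}: ticket says {told!r}, machine says {seen!r}")
```

The same idea applies to error messages: record the exact text you saw on screen, not the paraphrase in the ticket.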
Don't Make Them Go Away, Make Them Never Come
That CTO I mentioned earlier used to say that the best technical support a customer can get is if they never have to call with an issue to begin with. Back then it was the tag-line for a system we developed that was meant to automatically discover and fix technical issues before the customer even noticed them. But I took it to heart in a different way that should be more obvious: write better software that is resistant to failures.
There's a whole paradigm for this called Defensive Programming. If I had to boil it down to one sentence, it would be writing software for the worst case scenario, not the best case scenario. And the latter is—unfortunately—what we usually do, because that's what overly enthusiastic managers and unrealistic deadlines tend to cause us to do.
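As a small illustration of writing for the worst case first, here is a sketch that parses a hypothetical "retries" setting. The setting name, the default, and the upper bound are all assumptions made up for this example. The happy-day version would be a bare int(raw); the defensive version decides up front what happens when the value is missing, malformed, negative, or absurdly large.

```python
def parse_retry_count(raw, default=3, maximum=10):
    """Parse a user-supplied retry count, handling the worst cases first.

    - missing value: fall back to `default`
    - not an integer: fail loudly with a message the customer can act on
    - negative: reject, since a negative retry count is meaningless
    - huge: clamp to `maximum` so a typo cannot cause a retry storm
    """
    if raw is None:
        return default
    try:
        value = int(str(raw).strip())
    except ValueError:
        raise ValueError(f"retries must be an integer, got {raw!r}") from None
    if value < 0:
        raise ValueError(f"retries must not be negative, got {value}")
    return min(value, maximum)


if __name__ == "__main__":
    for candidate in (None, " 5 ", "500", "abc", "-1"):
        try:
            print(repr(candidate), "->", parse_retry_count(candidate))
        except ValueError as exc:
            print(repr(candidate), "-> rejected:", exc)
```

Whether to clamp or reject an oversized value is a judgment call; the point is that the decision is made deliberately in the code instead of being discovered during an escalation.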
So be a rebel. Refuse to write code that only kinda works during Sprint Reviews. Instead of presenting a half-baked feature, present nothing. Let your colleagues get the applause and back pats when they've presented that shiny new button. When your product goes live, you'll be sound asleep when they have to handle another middle-of-the-night escalation, because they never got around to actually implementing the callback for that button, it's still hard-coded.
This may sound counter to everything I wrote before. Are we trying to avoid these escalations now? Well, not exactly. Escalations are a fact of life. They won't go away whether you like it or not. They can and should, however, be limited and reduced, because if there are too many of them, they can become a burden. So don't bite off more than you can chew. Inspire your colleagues to handle them as well.
This was more of a philosophical essay, and for good reason as far as I'm concerned. I feel this is a very important subject for many engineers working in enterprise settings (probably not exclusively), but I couldn't find any substantial texts about it on the Internet. I hope you found this essay useful and that you utilize some of my suggestions in your careers. If you want to share your own suggestions, thoughts or experiences, please contact me as explained below.
[1] I know some of you by now are shaking your heads, thinking "why would you ever need to have your customers send logs?! Doesn't your application automatically collect diagnostics and errors?" Well, not every system can do that. Maybe your software is installed on-premises in internal, secure customer networks that are not allowed to access outside networks.
[2] Please, please don't be that developer who thinks that "there is no hardware, it's all virtual servers".