Who is the culprit?

anand pathak
5 min read · Apr 20, 2016

/* No matter how smartly we do things, we can’t escape problems! */

While working as a DevOps engineer I realised that the job of a DevOps engineer is very similar to a detective’s. There is very little difference between “finding a bug on a production server” and “finding the criminal behind a crime”.
Here in this blog I am not gonna explain it with facts or logic; rather, I am gonna tell you the story of how we handled a problem on a live server, and it’s on you to judge whether you find any similarities.

Date: 10th April 2016
Place: Kolkata

The environment was cool and calm and I was doing some casual things when suddenly we received an email alert.

Yup! A very similar reaction. Without waiting a second we accessed the server. We knew we couldn’t let the server go down, so we started patching, restarted a few processes and spun up another server to bear the load. Until we found the reason, we had to keep those extra precautions in place. When you think of a detective’s job it’s very similar: the moment they take up a case, they too first try to keep the victim alive and to protect whoever might be attacked next.

It doesn’t happen often that we need a discussion to solve a problem, because we usually solve them in a single shot, but that day was different. We weren’t sure what the problem was, so a meeting was called and we were asked for our thoughts on it, but none of us was sure about anything. So all we did was gather the evidence in one place, decide our roles in the debugging, and get on with the job.

Criminals are smart and use many tactics to deceive the detective; in the same manner we were deceived by the data and led down the wrong path. We spent a lot of time but did not find any strong evidence. We managed to keep the servers alive, yet we were unsuccessful in finding any clues. We post-mortemed the log files and used all our tools and techniques: we parsed all the logs, ran AWK and sed over every log file, and searched every resource graph, but got no useful information. All we knew was that CPU usage was climbing and the load was increasing while the number of users stayed the same. We were certain that somewhere some code was running in the background and eating the resources, but it is not easy to find a logical problem in code. It was a repository of 1,200 commits, and we knew that if the problem was in the code it had to be in recently committed code. We were left with only two options: dig through all the recently committed code, or clone the server and debug the programs there.

We split into two groups: some checked the code, while the rest of us created a clone of the server and set up some debugging tools on it. With the help of Xhprof and Xhgui we tried to identify which part of the code was eating the CPU, and we invested a lot of time in that, but we did not get anything out of it either. We also checked all our monitoring tools: New Relic, all the AWS monitoring graphs, the log parser, and the Logstash/Kibana interface.

Wished we had a Beautiful Mind on our team so we could crack patterns, but we didn’t. Gradually we started getting frustrated and began thinking of other alternatives. Our manager suggested rolling back the commits one by one and checking whether the problem still existed. You can’t refuse the boss’s idea, but it is even more frustrating to restart every time. We didn’t have any choice, so a few of us started rolling back commits and generating load similar to what we faced on production. I was still busy looking at the data we had when suddenly my eye caught a request in Xhgui that was taking more than the average time to execute and a bit more CPU than it should, but that was still not odd enough to be claimed as a clue. I asked a colleague who worked on the application about that request, and he deceived me again, saying it was not a problem because it would be served from the Varnish cache. I left it with disappointment and went out. I took a break, talked to a friend, and had a discussion about some new technology. Coming back, I forgot where I was and started again on the same request which had bugged me earlier. I don’t know why, but I checked the number of times it was being called on the back-end and boom! “I found a clue.” It was not being cached. Before telling anyone, I first confirmed it: I checked Varnish, Memcache, APC cache and the Drupal cache, and found that it was not cached at any level. I also calculated the number of occurrences of that request, which was higher than usual. Cool! Now I was in a position to tell the others about it.
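In case you are wondering what “calculating the occurrence of a request” looks like in practice, here is a rough sketch of the idea as a small Node.js script. The log path, the combined log format and the top-ten cut-off are assumptions for illustration; the same counting can be done with the AWK and sed commands mentioned earlier.

```javascript
// Hypothetical sketch: count how often each request path shows up in an
// access log, to spot a URL that is hitting the backend far more than it
// should. The log path and the combined log format are assumptions.
const fs = require('fs');

const lines = fs.readFileSync('/var/log/apache2/access.log', 'utf8').split('\n');
const counts = {};

for (const line of lines) {
  // The request field looks like "GET /some/path?timestamp=123 HTTP/1.1".
  const match = line.match(/"[A-Z]+ ([^ ?"]+)[^"]*"/);
  if (!match) continue;
  const path = match[1]; // query string stripped, so all variants group together
  counts[path] = (counts[path] || 0) + 1;
}

// Print the ten most requested paths, busiest first.
Object.entries(counts)
  .sort((a, b) => b[1] - a[1])
  .slice(0, 10)
  .forEach(([path, count]) => console.log(count, path));
```

A path that shows up far more often than the number of active users can explain is exactly the kind of smell we were looking for.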

The moment I explained it to the application team, they understood the problem: why it was not cached and why the number of requests to that URL was higher. The request URL carried a timestamp as a parameter, and since the timestamp always changes, no cache layer ever got a hit; on top of that, the AJAX call was made every 45 seconds, which drove the load even higher. It did not take them long to remove the timestamp parameter from the URL, and we felt the problem was solved.
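To make the root cause concrete, here is a minimal sketch of the pattern, assuming a hypothetical /ajax/updates endpoint and a parameter literally named timestamp; only the 45-second interval comes from our case, the rest is illustration.

```javascript
// Before: a cache-busting timestamp makes every URL unique, so Varnish,
// Memcache, APC and the Drupal cache can never return a hit, and every
// browser hits the backend every 45 seconds.
setInterval(() => {
  fetch('/ajax/updates?timestamp=' + Date.now())
    .then((res) => res.json())
    .then((data) => console.log(data));
}, 45000);

// After: a stable URL, so the response can be cached and the backend is
// only hit when the cached copy actually expires.
setInterval(() => {
  fetch('/ajax/updates')
    .then((res) => res.json())
    .then((data) => console.log(data));
}, 45000);
```

The caching layers key on the full URL, so a parameter that changes on every call effectively switches all of them off.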

Movies do not end without a climax, and neither did our problem. When we all felt we had solved it and relaxed, we came to know that the load had not reduced yet and the URL still carried the timestamp as a parameter. We all had a “what the heck” moment, but it did not take us long to figure out that many users still had the old version of the JavaScript file, and until their caches were cleared we could not do anything about it. So we did one more thing: we forcibly cached the URL at the Varnish level by ignoring the timestamp parameter.
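As a purely conceptual sketch (the real workaround was a rule in the Varnish configuration, not application code, and the parameter name is again an assumption), “ignoring the timestamp parameter” just means normalising the URL before it is used as a cache key, so that every variant of the request maps to the same cached object:

```javascript
// Conceptual only: normalise a request URL by dropping the volatile
// timestamp parameter, so the cache key stays stable across calls.
function cacheKeyFor(rawUrl) {
  const url = new URL(rawUrl, 'https://example.com'); // base is a placeholder
  url.searchParams.delete('timestamp');
  return url.pathname + url.search;
}

console.log(cacheKeyFor('/ajax/updates?timestamp=1460227200'));
// -> "/ajax/updates" no matter what the timestamp was
```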
And finally we did it! We solved the problem without facing any downtime, and we found it within a day.
