Visualising Production Data For The Team - 1st Challenge of 2013 - Duncan Nisbet

I’ve mentioned in previous posts that I’ve started working for a new company. I’ve also mentioned before how I came to love big data & how Development Teams can use it to get an idea of Customer experience on our website.

This post was meant to be a rolling narrative of how my challenge to visualise Production data in my company is progressing, but I’ve been too busy to write it!

What this post is now is a snapshot of where I’m currently up to, how I got there & what I plan to do next.

We currently run a suite of manual tests in our Test environment in order to prove a release. There is an ongoing task to automate these (at the GUI level).

We are also pushing for more unit & integration test coverage, as well as enabling us Testers to get at the code sooner - currently we have to request that the code be deployed into our Test environment before we even get a sniff of it.

On several occasions the release (to test) has immediately failed. This is really frustrating as invariably we’ve had to wait an hour or so for the ticket to go through & now we need another release (more on this in a “smoothing the lifecycle” post TBC).

We then run the same tests in Production once the code has been released (again, see “smoothing the lifecycle” post TBC).

I’m thinking Testers provide more value further upstream, carrying out thought work, where we can be more effective.

Part of being able to test upstream is less dependency on manually proving the release in Production. I’m not saying no Tester eyes on Production just yet, but I am thinking our Customers have a better idea of how they use the site, so let’s see what their experience is by visualising the log data for the team to see.

So the company has several different ways it is currently monitoring its Production environment. These have come about from different teams being moved & merged so now we have some slight overlap.

The aim is to primarily use Splunk for the majority of the monitoring, but we also have New Relic which provides similar information.

There are some restrictions with Splunk for us in that we don’t currently log Apache data, which isn’t great for me trying to get a handle on the customer’s experience. There are some transactions I can visualise from the Access Logs so I’m working with them for now.

At the lower end of the stack, we have Nagios & Gnome for the system monitoring.

My previous experience is with Graphite. Mark Crossfield has written a post on implementing Graphite. Maybe it’s because it was the first tool I used to view Production data, but so far I’m preferring the clean & simple UI of Graphite compared to both Splunk & New Relic. Don’t get me wrong - they both have a beautiful UI, but I’m not using them for their UI. The monitor / dashboard I have created has a lot of dead space I can’t seem to get rid of, even with “Kiosk” mode, which means I’m having to shrink the actual graphs. Meh, I’ll get over it I’m sure.

On arriving at my new company I noticed they were not visualising the stuff I was interested in, so I started setting up Graylog in our Test environment. This was a massive challenge for me as I had never tried setting anything like this up before. I didn’t get as far as I wanted: after a quick chat with the man in control it was clear that the company had invested in Splunk, both financially & in training.

As such, Splunk, even with its known costs & limitations, is currently preferred to a new open source solution unknown to the company.

So I put Graylog down & started working with Splunk. Only we ain’t logging the Apache data in Splunk as there is currently too much of it for our allocated limit (this conversation is already underway). Not handy for what I want to achieve. So I made my point about what I was trying to do to anyone that would listen / cared / mattered & climbed back in my box.

I then moved onto the web frontend team where I discovered they use New Relic & the Apache data is logged - Hoorah!

I’m interested in the New Relic Real User Monitoring functionality, but that is currently disabled for reasons unknown to me - it’s an action for me to find out why.

It’s taken me a while to get my head around the interface, what it can do & what metrics I need but I’m making progress.

I’ve been aiming to get to a point whereby I can demonstrate the value of having Production data visible to the Development team.

My dashboard monitor currently consists of a New Relic Dashboard for showing traffic for key transactions, a New Relic View for response times for key transactions & Google Analytics for each of the sites we operate.

Ideally, the graphs will have the same time scale & be stacked to make the correlation between a release going live & any degradation in the site’s performance immediately & easily obvious. A quick glance up & down the graphs will enable us to tie any change in the site performance or customer behaviour to a release - for better or for worse.

For example, if a fix is released to reduce page load times, the success / failure of the fix will be immediately obvious once traffic starts being sent to the boxes with the new code (dare I mention validated learning a la Lean Startup?).

Likewise, if site performance degrades after a release, this will also be obvious.
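
The before / after comparison described above can also be checked with numbers rather than by eye. The Python below is only a rough sketch under stated assumptions: the function name `compare_around_release`, the sample data & the release time are all made up for illustration, & nothing here talks to New Relic or Splunk. It assumes the response times have already been exported as (timestamp, milliseconds) pairs.

```python
from datetime import datetime
from statistics import median


def compare_around_release(samples, release_time):
    """Return (median_before, median_after, percent_change).

    samples: iterable of (timestamp, response_time_ms) pairs.
    release_time: the moment the new code started taking traffic.
    """
    before = [ms for ts, ms in samples if ts < release_time]
    after = [ms for ts, ms in samples if ts >= release_time]
    if not before or not after:
        raise ValueError("need samples on both sides of the release")
    m_before = median(before)
    m_after = median(after)
    # Negative change means the pages got faster after the release.
    change = (m_after - m_before) / m_before * 100
    return m_before, m_after, change


# Hypothetical numbers, purely for illustration.
release = datetime(2013, 1, 15, 12, 0)
samples = [
    (datetime(2013, 1, 15, 11, 40), 820),
    (datetime(2013, 1, 15, 11, 50), 800),
    (datetime(2013, 1, 15, 11, 55), 780),
    (datetime(2013, 1, 15, 12, 5), 610),
    (datetime(2013, 1, 15, 12, 10), 600),
    (datetime(2013, 1, 15, 12, 20), 590),
]

before, after, change = compare_around_release(samples, release)
print(f"median before: {before} ms, after: {after} ms, change: {change:+.1f}%")
```

The median is used here rather than the mean so that one slow request doesn’t swamp the comparison; that is a design choice for the sketch, not something the monitoring tools dictate.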

On my journey I’ve found some great yet obvious browser plugins / extensions - I’m not sure why I haven’t found them sooner. I’m guessing it’s because I haven’t needed them:

Chrome Reload - The New Relic dashboard apparently doesn’t have the option to refresh, so I get the browser to do it with this nifty extension

Revolver - At the moment, screen space is at a premium, so instead of having 5 different instances of Google Analytics open displaying Real Time traffic, I have 5 tabs in 1 browser set to auto revolve every 10 seconds.

(interestingly, I think one of these has a memory leak…)

Next Steps / Actions

Find out why we’re not using New Relic RUM
Find out if New Relic RUM might actually provide value for us
Learn how to include several metrics in one graph to save space (e.g. logins, deposits, bets placed)
Demo my idea / current dashboard to the relevant people
Take feedback / ideas on how to improve the dashboard
See if we’re ever going to get Apache data into Splunk (I know the Sys Admins want some other data logging as well)
If successful, get some monitors up on the walls
Look into setting up the dashboard / monitor on a web server & embed the charts in an HTML page to see if that saves space
Having the dashboard / monitor on a web server will help with different people / locations being able to see the dashboard via a URL
Investigate Reload & Revolver to see if they are having memory problems or find an alternative
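
For the “embed the charts in an HTML page” action above, here is a minimal Python sketch of what that page might look like. It rests on assumptions that still need checking: that each chart has its own URL & that the tools allow those URLs to be shown inside an iframe. The `build_dashboard` function, the `example.com` URLs & the chart titles are placeholders, not real New Relic or Google Analytics addresses. The standard HTML meta refresh tag reloads the page on a timer, which could stand in for a reload extension.

```python
from html import escape


def build_dashboard(charts, refresh_seconds=60):
    """Return an HTML page that stacks one iframe per chart.

    charts: list of (title, url) pairs; the URLs used below are placeholders.
    refresh_seconds: how often the browser should reload the whole page.
    """
    frames = "\n".join(
        f'<h2>{escape(title)}</h2>\n'
        f'<iframe src="{escape(url, quote=True)}" width="100%" height="250"></iframe>'
        for title, url in charts
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        f'<meta http-equiv="refresh" content="{int(refresh_seconds)}">\n'
        "<title>Production dashboard</title>\n"
        "</head>\n<body>\n"
        f"{frames}\n"
        "</body>\n</html>\n"
    )


# Hypothetical chart list, stacked top to bottom in this order.
charts = [
    ("Traffic for key transactions", "https://example.com/charts/traffic"),
    ("Response times for key transactions", "https://example.com/charts/response-times"),
]

print(build_dashboard(charts))
```

Saving the output as a file on any web server would give the team a single URL to open, & stacking the charts in one page keeps them lined up for the quick glance up & down.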

The demo is early next week; hopefully several of these actions will be boxed off after that.

I’ll keep you posted…