DevOps or SRE???
I love the way Google describes this: “If you think of DevOps as a philosophy, then SRE (Site Reliability Engineering) is a prescriptive way of accomplishing that philosophy. If DevOps was an interface in a programming language, you might almost say SRE is concrete CLASS that implements DevOps.”
So what does DevOps entail then as a philosophy?
Let’s look at how SRE helps accomplish the DevOps philosophy.
Having now established that the two disciplines are complementary, let’s first look at reliability and availability, where software development teams and operations teams typically have different definitions. Google had this problem in the early 2000s when they were building the search product, and came up with SRE to solve it.
Everyone, from the product developers to the SREs and management, agrees to this, and that way there is a shared responsibility for the product/service. SLOs (service level objectives) are agreed upon in advance with the product owners. In addition to SLOs, SLIs (service level indicators) are agreed upon in advance as well. They are metrics over time such as:
We aggregate SLIs over a period, such as a year, and measure the result against the SLO. So a measured 99.9% availability over a year will be compared against a binding SLO target that is mutually agreed upon at the beginning. Finally, we have the SLA (service level agreement), which is a commercial agreement that spells out the penalties for not meeting the agreed targets. The SLO is typically set tighter than the SLA, so that the team notices a problem before a contractual penalty applies.
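To make the SLI-to-SLO comparison concrete, here is a minimal sketch in plain Python. It assumes an availability SLI defined as good requests divided by total requests; the `WindowCounts` structure, the function names and the sample numbers are all made up for illustration and are not part of any Google tool.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts for one measurement window (hypothetical structure)."""
    good: int   # requests served successfully
    total: int  # all requests received


def availability_sli(windows):
    """Aggregate per-window counts into a single availability ratio."""
    good = sum(w.good for w in windows)
    total = sum(w.total for w in windows)
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good / total


def meets_slo(sli, slo_target):
    """Return True when the measured SLI is at or above the agreed SLO."""
    return sli >= slo_target


# Made-up sample data: three windows of traffic.
windows = [
    WindowCounts(good=99_950, total=100_000),
    WindowCounts(good=99_900, total=100_000),
    WindowCounts(good=99_990, total=100_000),
]
sli = availability_sli(windows)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

Summing the raw counts before dividing, rather than averaging per-window percentages, keeps a quiet window from weighing as much as a busy one.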
Is 100% availability needed? 100% availability is expensive and technically complex, and users typically don’t even notice it because something else in the path, such as their own network or device, is less reliable. There will always be some degree of risk in delivering a product. Cost is a key consideration for reliability: we need to weigh the cost of extra fault tolerance, extra testing time, or more time spent verifying that a release is good against the benefit to the user of increased reliability. An error budget is a quantitative measure of how much downtime leeway you have. A 99.9% SLO allows about 43 minutes of downtime in a 30-day month; that is the error budget. So it’s a direct tradeoff: if the product team wants to release a lot of features quickly with no cost increases for reliability, then they have to be OK with a lower SLO and more downtime. Your error budget needs to account for non-development aspects as well. These include network issues, storage issues, and issues with the CDN (content delivery network), in addition to application issues. It’s all about balancing innovation and stability.
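The error budget arithmetic is easy to check yourself. The sketch below is plain Python with hypothetical function names; it assumes the budget is simply the unavailable fraction (1 minus the SLO) multiplied by the length of the window. Under that assumption the 43-minute figure corresponds to a 30-day window.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of allowed downtime for an availability SLO over a window."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target, window_days, downtime_minutes):
    """Budget left after subtracting downtime already spent (may go negative)."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


monthly = error_budget_minutes(0.999, 30)    # 30-day window
yearly = error_budget_minutes(0.999, 365)    # one-year window
print(f"99.9% over 30 days:  {monthly:.1f} minutes")
print(f"99.9% over 365 days: {yearly / 60:.2f} hours")
print(f"Left after a 20-minute outage: {budget_remaining(0.999, 30, 20):.1f} minutes")
```

A negative result from `budget_remaining` would mean the budget is spent, which is the point at which a team would normally slow down releases in favor of stability.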
Measuring the SLOs accurately and determining when an SLO breach is about to occur is critical. We need a great monitoring tool, and this is where Stackdriver comes in.
Over the last year, Google has invested a lot of energy in building Stackdriver into a leading monitoring tool. The tool has been heavily influenced by SRE principles.
Native integration with Google Cloud data tools such as BigQuery, Cloud Pub/Sub, Cloud Storage, and Cloud Datalab.
Stackdriver provides full-stack insights.
Google Stackdriver is the monitoring, APM and logging product for the amazing Google Cloud Platform (GCP).
Let’s take a look at some examples of the cool things you can monitor using Stackdriver. Latency is key for a good user experience, and with Stackdriver you can get alerted if the latency goes over a threshold that you choose or if there is a sudden spike in latency. You can also set a rule that a certain percentage of requests (say 97%) should complete in under 1 second, and get an alert if more than 3% of requests are over this limit. Another cool feature is that Stackdriver can automatically determine what database you are running on Compute Engine and start monitoring it. If you are running a container cluster, you can use Stackdriver to monitor both the Container Engine resources and the underlying VMs. It also provides multi-cloud monitoring, with out-of-the-box integration with AWS in addition to GCP. Beyond the out-of-the-box metrics, you can build any custom metric you want; for example, you could set up the same kind of rule for memory usage as in the latency example.
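The "97% of requests under 1 second" rule can be sketched in a few lines. This is plain Python that only illustrates the logic of such a rule; it does not use the Stackdriver API, and the function names and sample latencies are hypothetical.

```python
def fraction_over(latencies_s, threshold_s):
    """Fraction of requests whose latency exceeds the threshold."""
    if not latencies_s:
        return 0.0  # assumption: no requests means nothing is slow
    slow = sum(1 for latency in latencies_s if latency > threshold_s)
    return slow / len(latencies_s)


def should_alert(latencies_s, threshold_s=1.0, max_slow_fraction=0.03):
    """Alert when more than max_slow_fraction of requests are too slow."""
    return fraction_over(latencies_s, threshold_s) > max_slow_fraction


# Made-up samples, in seconds.
healthy = [0.2] * 98 + [1.5] * 2    # 2% of requests are slow
degraded = [0.2] * 95 + [1.5] * 5   # 5% of requests are slow

print("healthy alert: ", should_alert(healthy))
print("degraded alert:", should_alert(degraded))
```

The same shape works for other metrics: swap the latency samples for memory usage readings and pick a different threshold.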
One of the features of Stackdriver, Incident Response and Management (IRM), provides the capability to investigate, understand, mitigate and recover from incidents quickly and efficiently. It offers metrics covering the end-to-end incident management lifecycle.
Let’s talk a little bit about what an SRE (Site Reliability Engineer) does. While an SRE fixes reliability issues, their main objective is to put in place engineering mechanisms that improve long-term reliability through automation. So one of the main tasks of an SRE is to identify manual, truly repetitive tasks and automate them. The goal here is for an SRE to spend 50-70% of their time on long-term engineering and automation work for reliability and performance.
The other aspect of the role of an SRE, or in our case of Virtue Group when we help implement solutions on GCP for customers, is to educate our customers on SRE so that they have a fair understanding of what to expect from the GCP platform in terms of reliability and performance. That includes educating our customers on SLIs, SLOs and SLAs, error budgets, the automation process and so on.
Stackdriver Service Monitoring is a tool for monitoring how your customers perceive your applications, and it lets you drill down to the underlying infrastructure when there’s a problem. Most IT operations teams look at compute, storage and networking metrics to infer customer experience, while application performance tools such as tracing and debugging look at the code level rather than the infrastructure level. Stackdriver Service Monitoring gives you a cost-effective, easy-to-use tool to monitor the customer-facing behavior of your applications. This works even with a microservices architecture, where the application is split up into many small pieces.
Finally, Google wrote a whole book on SRE. You can read it here: https://landing.google.com/sre/sre-book/toc/index.html