The difference between DevOps and SRE

Source: Infracloud

A friend and I were talking about career paths recently, and DevOps and SRE came up a lot. When I was first introduced to the two concepts a few years ago I often confused them, and unfortunately no one was around to clear up my confusion. Both are extremely popular now, but I found that my friend still has some misconceptions about the two roles. So I’ve gathered my thoughts into an article to share publicly.

The most common misconceptions:

  • DevOps is a new concept, it’s so advanced!
  • SRE is an advanced version of DevOps
  • Ops can easily turn into a DevOps engineer

DevOps is literally a combination of Dev (development) and Ops (operations). The strict definition of DevOps is as follows (via DevOps - Wikipedia).

DevOps (a combination of Development and Operations) is a culture, movement, or practice that values communication and collaboration between “software developers (Dev)” and “IT operations technicians (Ops)”.

SRE, or Site Reliability Engineering, was first introduced by Google and has been developed in its engineering practice. They also published a book by the same name, Site Reliability Engineering, which has helped to spread the concept widely among Internet engineers.

Google explains SRE as follows (via Site Reliability Engineering - Wikipedia).

Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies that to operations whose goals are to create ultra-scalable and highly reliable software systems.

I have redefined SRE as:

A site reliability engineer is a software engineer who is committed to building “highly scalable, highly available systems” and implementing them as a matter of principle.

By definition, DevOps is a culture, movement and practice, whereas SRE is a role with strict job requirements. Culture is a soft definition that leaves plenty of room for interpretation, whereas SRE leaves less room for imagination because it is precisely defined (or perhaps because the bar for SREs is high). According to Google, SRE engineers practice a DevOps culture. This is true, but DevOps is gradually being split out into a standalone DevOps engineer role, so in this article I will focus on the comparison between DevOps engineers and SRE engineers.

The needs of the Internet gave birth to DevOps. In the most traditional software companies, it was all Dev and no Ops, and Ops was probably still just technical support staff. Development followed a waterfall model (requirements analysis, system design, development, testing, delivery, operation), and a traditional software release was a heavyweight operation. At that time, Ops did not have direct, high-frequency contact with Dev, and some purely offline businesses had no Ops position at all.

After the Internet wave, software evolved from desktop software in the traditional sense to web and mobile applications. The core logic of the business, such as transactions and social behaviour, no longer ran on the user’s desktop but on back-end servers. This gave internet companies a lot of room to manoeuvre: they could change the business logic at any time, which facilitated rapid iteration. But even so, Dev and Ops remained two sharply divided camps. Ops didn’t care how the code worked, and Dev didn’t know how the code ran on the server.

While the industry was still relishing the idea of weekly releases, in 2009 Flickr blew the industry away by introducing the concept of 10+ releases per day. Flickr introduced several core ideas:

  • Businesses grow fast and need to embrace change and take small steps
  • The goal of Ops is not to make the site stable and fast, but to drive the business fast
  • Improve Dev / Ops connectivity based on automation tools: code versioning, monitoring
  • Efficient communication: IRC / IM Robot (the ChatBot routines popular today were already in use at Flickr 10 years ago)
  • A culture of trust, transparency, efficiency and supportive communication

It’s hard to believe that these DevOps ideas, which training companies and high-profile influencers are still promoting today, were already presented in a slide deck back in 2009. The classics never go out of fashion; they still shine with wisdom under the dust. Some people equate DevOps with operations automation, but this only scratches the surface. The goal of DevOps is to improve the speed of delivery of business systems and to provide the tools, systems and services to do so. Whatever else individuals or training providers attach to DevOps is added to or derived from this essence.

The history of SRE starts earlier, although it became widely known later. In 2003 Google’s Ben Treynor recruited a team of software engineers to help make Google’s production services more stable, robust and reliable. Unlike small and medium sized companies, Google serves over a billion users, and even a short period of service unavailability can be fatal. So Google was ahead of its time and the SRE role was created. The role was designed for very large clusters; smaller teams did not need a true SRE (and probably could not afford one). After some years of exploration, Google’s SRE team started to write up their experiences online and published the book in 2016.

Many companies now separate out the DevOps function into a role called DevOps Engineer. So let’s look at what a DevOps engineer cares about. With a DevOps culture aimed at delivery velocity, a DevOps engineer is naturally concerned with the entire lifecycle of the software/service. A simple formula is velocity = total volume / time; in engineering terms, that becomes: Speed of delivery = ((functional features * quality of work) / time to delivery) * risk of delivery.
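The formula can be written out as a small function. This is only a sketch: the article does not define units, so the numbers below are made up, and treating "risk of delivery" as a multiplier between 0 and 1 (where a lower value means a riskier delivery) is one possible reading, not something the formula itself states. The name `delivery_speed` and its parameters are hypothetical.

```python
def delivery_speed(features: float, quality: float,
                   time_to_delivery: float, risk_factor: float) -> float:
    """Speed of delivery = ((features * quality) / time to delivery) * risk.

    Assumption: risk_factor is in (0, 1], where 1 means no risk penalty.
    """
    if time_to_delivery <= 0:
        raise ValueError("time_to_delivery must be positive")
    return (features * quality) / time_to_delivery * risk_factor


# Same features, quality and risk; only the time to delivery changes.
slow = delivery_speed(features=10, quality=0.9, time_to_delivery=10, risk_factor=0.8)
fast = delivery_speed(features=10, quality=0.9, time_to_delivery=5, risk_factor=0.8)
print(fast / slow)  # halving the delivery time doubles the speed
```

With features fixed by the product side, the three remaining inputs are the only levers left in the function.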

Functional features are left to product managers and project managers, so DevOps engineers need to be concerned with the remaining factors: quality of work, time to delivery, and risk of delivery. The DevOps engineer function is as follows:

  • Manage the full application lifecycle (requirements, design, development, QA, release, operation)
  • Focus on full process efficiency improvement, uncovering bottlenecks and resolving them
  • Automated O&M platform design and development work (standardisation, automation, platformisation)
  • Support for O&M systems, including virtualisation technologies, resource management technologies, monitoring technologies, networking technologies

The key words for SRE are “high scalability” and “high availability”. High scalability means that when the number of service users increases dramatically, the application system and the services supporting it (server resources, network systems, database resources) can be scaled up without adjusting the system architecture or enhancing the performance of the machines themselves, but only by increasing the number of instances. High availability means that if any part of the application architecture becomes unavailable, such as application services, gateways, databases, etc., the whole system can be recovered and made available again within a predictable time. The SRE function can be summarised as follows:

  • Provide selection, design, development, capacity planning, tuning and fault handling for applications, middleware, infrastructure, etc.
  • Provide decision making for business systems based on availability, scalability considerations and participate in business system design and implementation
  • Locate, handle and manage faults, and optimise the components that cause them
  • Improve resource utilisation of components
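The two key words above can be made concrete with a little arithmetic. The sketch below assumes a simple model in which every instance handles a fixed request rate, so scaling out only changes the instance count, and it uses the common steady-state formula availability = MTBF / (MTBF + MTTR). The function names, the 20% headroom default and all the numbers are illustrative assumptions, not figures from the article.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float,
                     headroom: float = 0.2) -> int:
    """Instances required to serve peak_rps with some spare headroom."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return math.ceil(peak_rps * (1 + headroom) / rps_per_instance)


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures
    and mean time to recover."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Traffic grows 10x: only the instance count changes, not the architecture.
print(instances_needed(1_000, 140))    # 9
print(instances_needed(10_000, 140))   # 86
# A failure every 999 hours that is recovered within 1 hour.
print(round(availability(999, 1), 4))  # 0.999
```

The second function shows why a predictable recovery time matters: for the same failure rate, availability is determined by how quickly the system comes back.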

The different responsibilities lead to different job descriptions for the two positions, and I have listed the DevOps Engineer and SRE Engineer functions as follows:

DevOps

  • Set up the application lifecycle management system and its workflow
  • Develop and manage the development platform systems used by development and QA engineers
  • Develop and manage the release system
  • Develop, select and manage monitoring and alerting systems
  • Develop and manage the permissions system
  • Develop, select and manage the CMDB
  • Manage changes
  • Manage faults

SRE

  • Manage changes
  • Manage faults
  • Define SLA service standards
  • Develop, select and manage various types of middleware
  • Develop and manage distributed monitoring systems
  • Develop and manage distributed tracing systems
  • Develop and manage performance monitoring and probing systems (dtrace, flame graphs)
  • Develop, select and provide training on performance tuning tools
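To illustrate what defining an SLA standard implies in practice, the sketch below converts an availability target into the downtime it allows over a period. The function names, the 30-day period and the sample targets are assumptions chosen for illustration; a real SLA would also have to define what counts as downtime.

```python
def downtime_budget_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed by an availability target over a period."""
    if not 0 < sla_percent <= 100:
        raise ValueError("sla_percent must be in (0, 100]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


def budget_remaining(sla_percent: float, downtime_so_far_minutes: float,
                     period_days: int = 30) -> float:
    """Downtime budget left after the outages recorded so far (may go negative)."""
    return downtime_budget_minutes(sla_percent, period_days) - downtime_so_far_minutes


for target in (99.0, 99.9, 99.99):
    print(target, round(downtime_budget_minutes(target), 2))
```

Over 30 days, the three targets allow roughly 432, 43.2 and 4.32 minutes of downtime, which is why each extra "nine" changes how changes and faults have to be managed.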

It is an interesting contrast: both DevOps and SRE are concerned with the application lifecycle, particularly changes and failures within it. A DevOps team will typically provide a tool chain that includes development tools, versioning tools, CI (continuous integration) tools, CD (continuous delivery/release) tools, alerting tools, and troubleshooting tools. The SRE team is more concerned with change, fault, performance and capacity issues, and its work is more business specific. Its tool chain typically includes capacity measurement tools, logging tools, tracing (call-chain tracking) tools, metrics (performance measurement) tools, monitoring and alerting tools, etc.

DevOps was first a culture and later became a separate role, whereas SRE was clearly a role from the start. Many people confuse DevOps with SRE because the two look alike on the surface: both seem to share similar tooling and automation requirements. Some developers even take all of these operations jobs to be uniformly server + tools + automation. This is like the blind men and the elephant: each grasps one part and mistakes it for the whole.

In terms of skills, both require strong operations skills. In terms of career ceiling, DevOps engineers may lack some of the specialist skills of SREs: computer architecture skills; high-throughput, high-concurrency optimisation skills; scalable system design skills; complex system design skills; business system troubleshooting skills. Both require soft skills, but SREs face greater complexity, challenges and demands.

DevOps is universal, and modern Internet companies need DevOps, but not all teams need high availability and high scalability, so not all of them need SRE. DevOps engineers have opportunities to develop into SRE engineers once they have acquired the relevant skills. Conversely, I don’t believe a qualified SRE engineer, given the choice, would move to a DevOps engineer role.

In terms of professional background, both DevOps and SRE engineers need an R&D background, with the former requiring experience building development tool chains and the latter requiring strong architectural design experience. If you are an operations engineer who wants to transition into DevOps or SRE, then you need to catch up on your technical knowledge. After all, you can’t call yourself a DevOps / SRE engineer just because you can set up Jenkins + Kubernetes.

So what about those common misconceptions? I hope they are clearer now. Finally, I’ve included skill maps for the two roles, and I hope those who want to become either kind of engineer will work hard at them.

DevOps:

  • Operations Skills - Linux Basics
    Basic command operations
    Linux FHS (Filesystem Hierarchy Standard)
    Linux systems (differences, history, standards, development)
  • Scripting
    Bash / Python
  • Basic Services
    DHCP / NTP / DNS / SSH / iptables / LDAP / CMDB
  • Automation Tools
    Fabric / Saltstack / Chef / Ansible / Terraform / Pulumi / Puppet
  • Basic Monitoring Tools
    Zabbix / Nagios / Cacti
  • Virtualisation
    KVM Management / XEN Management / vSphere Management / Docker
    Container Orchestration / Mesos / Kubernetes
  • Services
    Nginx / F5 / HAProxy / LVS Load Balancing
    Common middleware operations (start up, shut down, restart, scale out)

Dev

  • Languages
    Python
    Go (optional)
    Java (understanding deployment)
  • Process and Theory
    Application Life Cycle
    12 Factor
    Microservices Concepts, Deployment, Lifecycle
    CI Continuous Integration / Jenkins / Pipeline / Git Repo WebHook
    CD Continuous Release System
  • Infrastructure
    Git Repo / Gitlab / Github
    Logstash / Flume Log collection
    Configuration file management (applications, middleware, etc.)
    Nexus / JFrog / Pypi Package Dependency Management
    Development environment management system for Dev / QA
    Online permission assignment system
    Monitoring and alerting system
    Fabric / Saltstack / Chef / Ansible based automation tools development

SRE:

  • Language and Engineering Implementation
    In-depth understanding of the development language (assuming Java)
    Development frameworks used by the business units
    Concurrency, multithreading, and locking
    Resource model understanding: network, memory, CPU
    Troubleshooting skills (analyse bottlenecks, familiarity with relevant tools, restore sites, provide solutions)
    Common business design scenarios and pitfalls (e.g. Business Modeling, N+1, remote calls, unreasonable DB structures)
    MySQL / Mongo OLTP type query optimisation
    Multiple concurrency models, and related Scalable designs
  • Problem location tools
    Capacity management
    Tracing (call-chain tracing)
    Metrics tools
    Logging system
  • Operations architecture skills
    Linux proficiency, understanding of Linux load model, resource model
    Familiarity with general middleware (MySQL, Nginx, Redis, Mongo, ZooKeeper, etc.) and ability to tune
    Linux network tuning, network IO model, and implementation in the language
    Resource orchestration systems (Mesos / Kubernetes)
  • Theory
    Capacity planning solutions
    Familiarity with distributed theory (Paxos / Raft / BigTable / MapReduce / Spanner etc.) and the ability to decide on the right solution for a scenario
    Performance models (e.g. Pxx understanding, Metrics, Dapper)
    Resource models (e.g. Queuing Theory, load scenarios, avalanche problems)
    Resource orchestration systems (Mesos / Kubernetes)
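Two of the theory items above, Pxx and queuing theory, can be shown in a few lines. The sketch below uses the nearest-rank definition of a percentile (one of several definitions in use, so monitoring tools may report slightly different values) and Little's law, which relates average concurrency to arrival rate and average latency. The function names and sample latencies are made up for illustration.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def average_concurrency(arrival_rate_rps: float, avg_latency_s: float) -> float:
    """Little's law: L = lambda * W (requests in flight on average)."""
    return arrival_rate_rps * avg_latency_s


# 98 fast requests and 2 slow ones (latencies in milliseconds).
latencies_ms = [10.0] * 98 + [500.0, 900.0]
print(sum(latencies_ms) / len(latencies_ms))  # the mean hides the slow tail
print(percentile(latencies_ms, 50))           # P50: the typical request
print(percentile(latencies_ms, 99))           # P99: the slow tail
```

In this sample the P50 is 10 ms while the P99 is 500 ms, which is the kind of gap that an average alone does not reveal.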

Can read and write!