DevOps Engineer Hiring Guide

DevOps in 2026 means one specific thing: someone who owns deployment, infrastructure, observability, and on-call, and writes code to do it. Not a system administrator with a Kubernetes certificate. Not a Jenkins button-pusher. Not someone whose entire job is converting YAML files into different YAML files. This is one of the highest-demand, lowest-supply roles in Indian IT staffing. Bangalore product companies will pay 40 to 80 lakhs per annum for a genuinely senior DevOps engineer, and perhaps two thousand people in the country actually meet the bar.

The fastest filter is the on-call question: ask about their worst 2am page. Real DevOps engineers light up; fake ones stall. The second filter is asking them to write Terraform live for a simple resource. Ninety percent of candidates who list Terraform cannot do this without searching.

Key skills

Must-have

Cloud platform depth

Three or more years hands-on with AWS, GCP, or Azure, including five or more services beyond the basics. Look for IAM design (roles, policies, trust relationships, not "I gave the team admin"), VPC networking, secrets management, and monitoring. Cloud certifications alone do not count; ask for specific incidents they resolved.

Infrastructure-as-Code fluency

Terraform, Pulumi, or CloudFormation. Has written modules from scratch, not just edited someone else's. Knows what Terraform state is, how to handle drift, when to import existing resources, and why you do not run terraform apply from your laptop in production.
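One way to probe the drift point: ask how they would detect drift in a scheduled job. A minimal sketch, assuming Terraform is installed and the working directory is already initialized. It relies on the documented behaviour of `terraform plan -detailed-exitcode` (0 for an empty diff, 1 for an error, 2 for a non-empty diff). The function names and the `workdir` argument are hypothetical.

```python
import subprocess
from enum import Enum


class PlanStatus(Enum):
    NO_CHANGES = 0  # plan succeeded, configuration and live state agree
    ERROR = 1       # the plan itself failed
    CHANGES = 2     # plan succeeded with a non-empty diff (possible drift)


def classify_plan_exit_code(code: int) -> PlanStatus:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    try:
        return PlanStatus(code)
    except ValueError:
        # Anything outside 0/1/2 is treated as an error rather than guessed at.
        return PlanStatus.ERROR


def check_drift(workdir: str) -> PlanStatus:
    """Run a plan (no apply) in `workdir`, a hypothetical path, and classify it."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A strong candidate will point out what this leaves open: where the job runs, which credentials it uses, and who gets notified when the status is CHANGES.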

CI/CD pipeline ownership

Has built and debugged pipelines in Jenkins, GitLab CI, GitHub Actions, CircleCI, or ArgoCD. Knows the commit-to-prod time for their current team and what the bottleneck is. Understands why fast pipelines matter — slow pipelines change engineering behavior.

Real on-call experience

Has been on a production rotation, paged at 2am, diagnosed under pressure, resolved, and written the postmortem the next day. Without this, "DevOps" is theoretical. Ask for a specific incident — symptom, hypothesis path, fix, follow-up actions.

Linux fundamentals

Comfortable with bash, systemd, systemctl, journalctl, strace, lsof, tcpdump basics. Can diagnose a hung process on a Linux box without GUI tools. Most cloud problems reduce to Linux problems.

Nice-to-have

Kubernetes operational depth

Has run Kubernetes in production — debugged pod crashes, networking, resource requests vs limits, HPA tuning, ingress problems. Not "I deployed a sample app to EKS in a workshop." Strong signal for Bangalore product company roles.

Observability tooling

Hands-on with Datadog, New Relic, Prometheus plus Grafana, or Honeycomb. Has configured alerts that actually fired, tuned them down once fatigue set in, and built dashboards engineers actually use during incidents.

Security and compliance awareness

IAM least-privilege design, secret rotation discipline, SOC2 or ISO 27001 exposure from the implementation side. Reduces ramp on regulated clients — BFSI, healthcare, enterprise SaaS.

Scripting beyond bash

Python or Go for tooling, operators, and automation. Pure GUI-driven DevOps is a yellow flag at 5+ years. Ask what they built last quarter that was not Terraform or YAML.

Database operational experience

Has handled a real PostgreSQL, MySQL, or MongoDB incident — replication lag, connection pool exhaustion, runaway query, failover. Databases are where most cloud pages originate.

Interview questions (8)

Walk me through the worst on-call incident of the last 12 months. Timeline — when you got paged, what the alert said, what you checked, what it turned out to be, and what you changed afterward.

What to listen for

Specific timeline with approximate timestamps. Real diagnostic method — logs, metrics, dashboard queries, hypothesis testing. Postmortem with concrete action items that shipped. "We never have incidents" is disqualifying. "I cannot discuss specifics due to NDA" without structure is a flag.

How do you decide between Kubernetes, ECS, and plain EC2 with a deploy script for a new service? Walk me through your decision tree.

What to listen for

Pragmatic, based on team size, operational capacity, deployment frequency, need for horizontal scaling. "Always Kubernetes" is dogmatic. Strong: "For a 3-person team deploying once a week, Kubernetes is overkill — I would run on ECS or a managed platform." Senior thinking is context, not tooling.

Describe the CI/CD pipeline for your current team. How fast is commit-to-production? What is the bottleneck?

What to listen for

Specific numbers — 12 minutes, 35 minutes, 2 hours. Awareness that the bottleneck is usually test suite duration, approval gates, or slow container builds. "Our pipeline takes about an hour" with no plan to improve means they have stopped thinking about it.
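A candidate who really knows their commit-to-prod time can usually say how it is measured. A minimal sketch of that measurement, assuming you can export commit and deploy timestamps as ISO-8601 strings. The sample data is made up and simply mirrors the 12-minute, 35-minute, and 2-hour figures above.

```python
from datetime import datetime
from statistics import median


def lead_times_minutes(deploys):
    """deploys: iterable of (commit_time, deployed_time) ISO-8601 string pairs."""
    minutes = []
    for committed, deployed in deploys:
        delta = datetime.fromisoformat(deployed) - datetime.fromisoformat(committed)
        minutes.append(delta.total_seconds() / 60)
    return minutes


# Illustrative data only, not taken from any real pipeline.
SAMPLE = [
    ("2026-01-12T10:00:00", "2026-01-12T10:12:00"),
    ("2026-01-12T11:30:00", "2026-01-12T12:05:00"),
    ("2026-01-13T09:00:00", "2026-01-13T11:00:00"),
]

times = lead_times_minutes(SAMPLE)
print(f"median commit-to-prod: {median(times):.0f} min, worst: {max(times):.0f} min")
# prints: median commit-to-prod: 35 min, worst: 120 min
```

Reporting the worst case next to the median matters: the slow outliers are usually where the bottleneck shows up.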

How do you handle secrets in pipelines and at runtime? Walk me through the specific tool and workflow you use today.

What to listen for

AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager, Kubernetes external-secrets operator. Not ".env files committed to the repo" or "plaintext env vars in Jenkins UI." Candidates without a real answer here will create security incidents on a regulated client.

What is your approach to alerting? How do you decide what pages someone and what goes to a dashboard? How do you avoid alert fatigue?

What to listen for

SLO-based or symptom-based, not cause-based. Tiered alerts (P1 pages, P2 tickets, P3 dashboards). Regular alert review where noisy ones get killed or tuned. "Alert on every error" is an anti-pattern. Bonus for mentioning Google SRE book concepts.
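For interviewers who want a concrete picture of SLO-based tiering, here is a minimal sketch built on error-budget burn rate. The function names and the thresholds of 10 and 2 are illustrative assumptions, not recommendations; real setups usually pair several time windows with different thresholds.

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo  # e.g. roughly 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def route_alert(rate: float, page_at: float = 10.0, ticket_at: float = 2.0) -> str:
    """Tier by burn rate. Thresholds are placeholders chosen for illustration."""
    if rate >= page_at:
        return "P1: page"
    if rate >= ticket_at:
        return "P2: ticket"
    return "P3: dashboard"


# 50 failed requests out of 10,000 against a 99.9% SLO burns budget about 5x too fast.
print(route_alert(burn_rate(50, 10_000, 0.999)))
```

The point to listen for is the same one the sketch makes: the page is triggered by user-visible budget consumption, not by an individual error or a CPU graph.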

A developer says "my service is slow in production but fine in staging." Walk me through your investigation.

What to listen for

Metrics first (CPU, memory, request rate, latency percentiles), then logs for slow queries or errors, then traces for end-to-end breakdown, then comparison of configuration between environments. Modern answer leans on observability tooling. Weak: "I would ssh into the server and check top."
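Why latency percentiles rather than averages? A small sketch with made-up numbers shows it: two environments with an identical median can differ sharply in the tail. The sample data and the helper name are assumptions for illustration; `statistics.quantiles` is from the Python standard library (3.8+).

```python
from statistics import quantiles


def latency_percentiles(samples_ms):
    """Return p50/p95/p99 for a list of request latencies in milliseconds."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Illustrative data: staging is uniformly fast, production has a slow 5% tail.
staging = latency_percentiles([20] * 100)
prod = latency_percentiles([20] * 95 + [900] * 5)

print("staging:", staging)
print("prod:   ", prod)
```

Here the p50 is the same in both environments, while production's p99 is 900 ms. A candidate who checks only the average or the median would miss the slow tail the developer is complaining about.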

How do you balance move-fast engineering culture with production reliability?

What to listen for

Error budgets, canary deploys, feature flags for progressive rollout, automated rollback triggers, blameless postmortems. Strong candidates treat reliability as an enabler of speed. Weak ones frame it as "DevOps is the gatekeeper who says no."
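If a candidate mentions automated rollback triggers, ask what the trigger actually compares. A minimal sketch of one possible rule, assuming request and error counts are available for both the canary and the stable baseline. Every parameter name and default here is hypothetical.

```python
def should_rollback(
    canary_errors: int,
    canary_requests: int,
    baseline_errors: int,
    baseline_requests: int,
    max_ratio: float = 2.0,    # canary may be at most 2x worse than baseline
    min_requests: int = 500,   # do not judge the canary on too little traffic
    floor: float = 0.001,      # lowest baseline error rate used for comparison
) -> bool:
    """Decide whether a canary's error rate justifies an automatic rollback."""
    if canary_requests < min_requests:
        return False  # not enough traffic to judge yet
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    # The floor keeps a zero-error baseline from making a single canary error fatal.
    return canary_rate > max(baseline_rate, floor) * max_ratio
```

Good follow-up questions: how long the canary bakes, what share of traffic it takes, and who is told when the rollback fires.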

Name one common DevOps anti-pattern you see repeated at most companies.

What to listen for

Specific, opinionated. Common strong answers: snowflake servers modified by hand, Kubernetes for workloads that should be Lambda or ECS, monitoring that alerts on causes rather than symptoms, secrets rotated once a year. Reveals depth and taste.

Evaluation rubric

Score each candidate against these weighted criteria. Total: 100%.

Criterion	Weight	Signal
Cloud and IaC depth	30%	Multi-year hands-on cloud plus production Terraform, Pulumi, or CloudFormation. Writes IaC live without searching. Knows state drift, import, destroy order.
On-call and incident response	25%	Has owned real production incidents with names, dates, resolutions. Authored real postmortems with action items that shipped.
Pipeline ownership	20%	Has built and optimized CI/CD. Knows commit-to-prod time. Has specific opinions on what to parallelize and cache.
Observability discipline	15%	Set up alerts that paged them, then tuned them down when they fired too often. Built dashboards other engineers use during incidents. Knows metrics vs logs vs traces.
Security mindset	10%	IAM least-privilege instinct, secret hygiene, awareness of common cloud misconfigurations (public S3, overly permissive security groups, long-lived access keys).
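
The weights above can be turned into a single score. A minimal sketch, assuming each criterion is rated on a 0 to 5 scale; the scale, the key names, and the sample ratings are assumptions, while the weights come from the table.

```python
# Weights from the rubric table, in percent. They must total 100.
WEIGHTS = {
    "cloud_and_iac": 30,
    "on_call_and_incidents": 25,
    "pipeline_ownership": 20,
    "observability": 15,
    "security": 10,
}


def weighted_score(ratings: dict, scale_max: int = 5) -> float:
    """ratings maps each criterion to 0..scale_max; returns a 0-100 score."""
    missing = WEIGHTS.keys() - ratings.keys()
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(WEIGHTS[c] * ratings[c] / scale_max for c in WEIGHTS)


# Hypothetical candidate: strong on incidents, weak on security.
candidate = {
    "cloud_and_iac": 4,
    "on_call_and_incidents": 5,
    "pipeline_ownership": 3,
    "observability": 3,
    "security": 2,
}
print(weighted_score(candidate))  # 24 + 25 + 12 + 9 + 4 = 74.0
```

Scoring every candidate with the same function keeps interviewers from quietly re-weighting the criteria for a candidate they happen to like.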

Red flags

CV is mostly certifications (AWS SAA, CKA, Terraform Associate) with no project depth or incident stories

Has never been on a real on-call rotation — their "DevOps" was 9am to 6pm only with a dedicated NOC team handling nights

Claims Kubernetes expertise but cannot debug a CrashLoopBackOff in a shared terminal exercise within five minutes

On-call described as "always-on 24 by 7" with no rotation structure: either a burned-out candidate or a culture red flag

Cannot name a specific production incident or debugging session even with 30 seconds of silence and a follow-up prompt
