DevOps in 2026 means one specific thing: someone who owns deployment, infrastructure, observability, and on-call, and writes code to do it. Not a system administrator with a Kubernetes certificate. Not a Jenkins button-pusher. Not someone whose entire job is converting YAML files to different YAML files. This is one of the highest-demand, lowest-supply roles in Indian IT staffing. Bangalore product companies will pay 40 to 80 lakhs per annum for a genuinely senior DevOps engineer, and perhaps two thousand people in the country actually meet the bar. The fastest filter is the on-call question: real DevOps engineers light up when asked about their worst 2am page; fake ones stall. The second filter is asking them to write Terraform live for a simple resource. Ninety percent of candidates who list Terraform cannot do this without searching.
Three or more years hands-on with AWS, GCP, or Azure including five or more services beyond basics. IAM design (roles, policies, trust relationships — not "I gave the team admin"), VPC networking, secrets, monitoring. Cloud certifications alone do not count — ask for specific incidents they resolved.
Terraform, Pulumi, or CloudFormation. Has written modules from scratch, not just edited someone else's. Knows what Terraform state is, how to handle drift, when to import existing resources, and why you do not run `terraform apply` from your laptop in production.
Has built and debugged pipelines in Jenkins, GitLab CI, GitHub Actions, CircleCI, or ArgoCD. Knows the commit-to-prod time for their current team and what the bottleneck is. Understands why fast pipelines matter — slow pipelines change engineering behavior.
Has been on a production rotation, paged at 2am, diagnosed under pressure, resolved, and written the postmortem the next day. Without this, "DevOps" is theoretical. Ask for a specific incident — symptom, hypothesis path, fix, follow-up actions.
Comfortable with bash, systemd, systemctl, journalctl, strace, lsof, tcpdump basics. Can diagnose a hung process on a Linux box without GUI tools. Most cloud problems reduce to Linux problems.
Has run Kubernetes in production — debugged pod crashes, networking, resource requests vs limits, HPA tuning, ingress problems. Not "I deployed a sample app to EKS in a workshop." Strong signal for Bangalore product company roles.
Hands-on with Datadog, New Relic, Prometheus plus Grafana, Honeycomb. Has configured alerts that went off, tuned them down after fatigue, built dashboards engineers actually use during incidents.
IAM least-privilege design, secret rotation discipline, SOC2 or ISO 27001 exposure from the implementation side. Reduces ramp on regulated clients — BFSI, healthcare, enterprise SaaS.
Python or Go for tooling, operators, and automation. Pure GUI-driven DevOps is a yellow flag at 5+ years. Ask what they built last quarter that was not Terraform or YAML.
Has handled a real PostgreSQL, MySQL, or MongoDB incident — replication lag, connection pool exhaustion, runaway query, failover. Databases are where most cloud pages originate.
Walk me through the worst on-call incident of the last 12 months. Timeline — when you got paged, what the alert said, what you checked, what it turned out to be, and what you changed afterward.
What to listen for
Specific timeline with approximate timestamps. Real diagnostic method — logs, metrics, dashboard queries, hypothesis testing. Postmortem with concrete action items that shipped. "We never have incidents" is disqualifying. "I cannot discuss specifics due to NDA" without structure is a flag.
How do you decide between Kubernetes, ECS, and plain EC2 with a deploy script for a new service? Walk me through your decision tree.
What to listen for
Pragmatic, based on team size, operational capacity, deployment frequency, need for horizontal scaling. "Always Kubernetes" is dogmatic. Strong: "For a 3-person team deploying once a week, Kubernetes is overkill — I would run on ECS or a managed platform." Senior thinking is context, not tooling.
Describe the CI/CD pipeline for your current team. How fast is commit-to-production? What is the bottleneck?
What to listen for
Specific numbers — 12 minutes, 35 minutes, 2 hours. Awareness that the bottleneck is usually test suite duration, approval gates, or slow container builds. "Our pipeline takes about an hour" with no plan to improve means they have stopped thinking about it.
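A strong candidate can usually back the bottleneck claim with per-stage timings. The sketch below shows the arithmetic behind such an answer. The stage names and durations are hypothetical, and it assumes stages run one after another rather than in parallel.

```python
def find_bottleneck(stage_minutes: dict[str, float]) -> tuple[str, float]:
    """Return the slowest stage and its share of total commit-to-prod time.

    Assumes stages run sequentially; parallel stages would need a
    critical-path calculation instead of a plain sum.
    """
    total = sum(stage_minutes.values())
    if total <= 0:
        raise ValueError("stage durations must sum to a positive number")
    slowest = max(stage_minutes, key=stage_minutes.get)
    return slowest, stage_minutes[slowest] / total


# Hypothetical 35-minute pipeline, for illustration only.
example_pipeline = {
    "build image": 9,
    "unit tests": 4,
    "integration tests": 18,
    "approval gate": 2,
    "deploy": 2,
}

stage, share = find_bottleneck(example_pipeline)
print(f"{stage} takes {share:.0%} of commit-to-prod time")
```

A candidate who can name the equivalent of "integration tests" for their own pipeline, and say what they tried in order to shrink it, is giving the kind of answer this question is looking for.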
How do you handle secrets in pipelines and at runtime? Walk me through the specific tool and workflow you use today.
What to listen for
AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager, Kubernetes external-secrets operator. Not ".env files committed to the repo" or "plaintext env vars in Jenkins UI." Candidates without a real answer here will create security incidents on a regulated client.
What is your approach to alerting? How do you decide what pages someone and what goes to a dashboard? How do you avoid alert fatigue?
What to listen for
SLO-based or symptom-based, not cause-based. Tiered alerts (P1 pages, P2 tickets, P3 dashboards). Regular alert review where noisy ones get killed or tuned. "Alert on every error" is an anti-pattern. Bonus for mentioning Google SRE book concepts.
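To make "SLO-based, tiered" concrete, here is a minimal sketch of routing by error-budget burn rate. The function names and the thresholds (`page_at`, `ticket_at`) are illustrative assumptions, not values from this guide. Real setups usually evaluate several time windows rather than one.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def route_alert(error_ratio: float, slo_target: float,
                page_at: float = 10.0, ticket_at: float = 1.0) -> str:
    """Map a burn rate to a tier. Thresholds are assumed, tune per service."""
    rate = burn_rate(error_ratio, slo_target)
    if rate >= page_at:
        return "P1-page"
    if rate >= ticket_at:
        return "P2-ticket"
    return "P3-dashboard"


# With a 99.9% SLO, a 2% error ratio burns budget about 20x too fast.
print(route_alert(error_ratio=0.02, slo_target=0.999))
print(route_alert(error_ratio=0.002, slo_target=0.999))
print(route_alert(error_ratio=0.0005, slo_target=0.999))
```

The point to listen for is the shape of the logic: the page fires on a user-facing symptom measured against an objective, not on an individual cause such as one host's CPU.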
A developer says "my service is slow in production but fine in staging." Walk me through your investigation.
What to listen for
Metrics first (CPU, memory, request rate, latency percentiles), then logs for slow queries or errors, then traces for end-to-end breakdown, then comparison of configuration between environments. Modern answer leans on observability tooling. Weak: "I would ssh into the server and check top."
How do you balance move-fast engineering culture with production reliability?
What to listen for
Error budgets, canary deploys, feature flags for progressive rollout, automated rollback triggers, blameless postmortems. Strong candidates treat reliability as an enabler of speed. Weak ones frame it as "DevOps is the gatekeeper who says no."
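Candidates who mention error budgets should be able to do the arithmetic on the spot. A rough sketch follows, assuming a 30-day window and a hypothetical policy that narrows rollouts once less than a quarter of the budget remains. Both the policy and the names are illustrative.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def rollout_decision(slo_target: float, bad_minutes_so_far: float,
                     window_days: int = 30, freeze_below: float = 0.25) -> str:
    """Hypothetical policy: ship freely, canary only, or freeze."""
    budget = error_budget_minutes(slo_target, window_days)
    remaining = (budget - bad_minutes_so_far) / budget
    if remaining <= 0:
        return "freeze"
    if remaining < freeze_below:
        return "canary-only"
    return "ship"


# A 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
print(rollout_decision(0.999, bad_minutes_so_far=10))
```

Framed this way, the budget is what licenses fast shipping while it lasts, which is the "reliability enables speed" stance the strong answer describes.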
Name one common DevOps anti-pattern you see repeated at most companies.
What to listen for
Specific, opinionated. Common strong answers: snowflake servers modified by hand, Kubernetes for workloads that should be Lambda or ECS, monitoring that alerts on causes not symptoms, secrets rotated once a year. Reveals depth and taste.
Score each candidate against these weighted criteria. Total: 100%.
| Criterion | Weight | Signal |
|---|---|---|
| Cloud and IaC depth | 30% | Multi-year hands-on cloud plus production Terraform, Pulumi, or CloudFormation. Writes IaC live without searching. Knows state drift, import, destroy order. |
| On-call and incident response | 25% | Has owned real production incidents with names, dates, resolutions. Authored real postmortems with action items that shipped. |
| Pipeline ownership | 20% | Has built and optimized CI/CD. Knows commit-to-prod time. Has specific opinions on what to parallelize and cache. |
| Observability discipline | 15% | Set up alerts that paged them — and tuned down when they fired too much. Built dashboards other engineers use during incidents. Knows metrics vs logs vs traces. |
| Security mindset | 10% | IAM least-privilege instinct, secret hygiene, awareness of common cloud misconfigurations (public S3, overly permissive security groups, long-lived access keys). |
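The weights above combine into a single number by simple arithmetic. The sketch below assumes each criterion is rated on a 0 to 5 scale by the interviewer. That scale and the dictionary keys are assumptions for illustration, since the table does not prescribe a rating scale.

```python
# Weights copied from the rubric table; they sum to 100.
WEIGHTS = {
    "cloud_and_iac": 30,
    "on_call_and_incidents": 25,
    "pipeline_ownership": 20,
    "observability": 15,
    "security_mindset": 10,
}

MAX_RATING = 5  # assumed interviewer scale of 0-5


def weighted_score(ratings: dict[str, float]) -> float:
    """Combine per-criterion ratings into a score out of 100."""
    if set(ratings) != set(WEIGHTS):
        raise ValueError("ratings must cover exactly the rubric criteria")
    for name, rating in ratings.items():
        if not 0 <= rating <= MAX_RATING:
            raise ValueError(f"{name} rating must be between 0 and {MAX_RATING}")
    return sum(WEIGHTS[name] * ratings[name] / MAX_RATING for name in WEIGHTS)


# Hypothetical candidate: strong on incidents, weak on security.
candidate = {
    "cloud_and_iac": 4,
    "on_call_and_incidents": 5,
    "pipeline_ownership": 3,
    "observability": 3,
    "security_mindset": 2,
}
print(weighted_score(candidate))
```

Rating every candidate on the same scale before comparing totals keeps the comparison consistent across interviewers.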
CV is mostly certifications (AWS SAA, CKA, Terraform Associate) with no project depth or incident stories
Has never been on a real on-call rotation — their "DevOps" was 9am to 6pm only with a dedicated NOC team handling nights
Claims Kubernetes expertise but cannot debug a CrashLoopBackOff in a shared terminal exercise within five minutes
On-call described as "always-on 24 by 7" with no rotation structure: either a burned-out candidate or a culture red flag
Cannot name a specific production incident or debugging session even with 30 seconds of silence and a follow-up prompt