Principal / Staff Software Engineer
Backend • MLOps • Cloud Infrastructure | Rajkot, Gujarat | 10+ Years | Full-time
Role Overview
We are looking for a Principal / Staff Software Engineer with deep expertise in Python, backend systems, MLOps, and cloud infrastructure. You will design, build, and scale production-grade platforms, lead technical architecture, establish engineering best practices, and deliver reliable, high-performance cloud-native solutions.
Key Areas of Expertise
- Python software development
- Backend, microservices, and distributed systems architecture
- MLOps and machine learning infrastructure
- Cloud infrastructure and networking — AWS, VPCs, DNS, ingress, and service mesh
- Web application development
- Database design and optimization
- CI/CD, DevOps, and automation
- Docker and Kubernetes — containerization, orchestration, and deployment
- Observability — Datadog, Prometheus, Grafana, ELK/OpenSearch, and distributed tracing
- System scalability, reliability, and performance
- Technical leadership and mentoring
Responsibilities
- Architect and build scalable Python backend services, APIs, and distributed systems.
- Design and operate end-to-end ML infrastructure — pipelines, model registry, serving, and monitoring.
- Own AWS cloud infrastructure: VPCs, EKS, IAM, networking, and service mesh.
- Manage Kubernetes clusters and containerized workloads in production.
- Build and maintain CI/CD pipelines and infrastructure-as-code (Terraform/Pulumi).
- Set up and maintain observability stacks, define SLOs, and drive reliability engineering.
- Lead code reviews, define engineering standards, and mentor engineers.
- Design and optimize relational and non-relational database schemas, and tune query performance.
Requirements
- 10+ years of software engineering experience with significant time in senior or staff-level roles.
- Expert-level Python — FastAPI, async, testing, and performance optimization.
- Hands-on MLOps experience: MLflow, Kubeflow, Airflow, or equivalent; model serving with TorchServe, Triton, or BentoML.
- Advanced AWS — EKS, EC2, RDS, S3, Lambda, VPC design, IAM, and CloudFront.
- Production Kubernetes — Helm, autoscaling, network policies, and RBAC.
- CI/CD and IaC — GitHub Actions, Terraform or Pulumi, GitOps with ArgoCD or FluxCD.
- Deep observability experience: Datadog, Prometheus, Grafana, OpenTelemetry, and ELK/OpenSearch.
- Database proficiency — PostgreSQL, Redis, MongoDB; query tuning and schema design.
- Strong communication skills and ability to lead technical discussions with engineering and non-engineering stakeholders.