Site Reliability Engineer - I / II

  • Full Time
  • India
  • Posted 3 years ago

Location: Gurgaon, Noida, Bangalore, Mumbai, Pune, Chennai

We are looking for passionate and highly collaborative SREs to join our international Infrastructure team, where you will have the opportunity to work on interesting and challenging projects and help us build and maintain reliable end-to-end infrastructure solutions.

The SRE team at OLX Group is responsible for reliability, scalability, availability, latency, performance, efficiency, monitoring, alerting, emergency response, and capacity planning of the OLX platform that caters to multiple countries. We are developing tools and optimizing strategies that make the end-user experience smooth and stable.

Responsibilities:

  • Build and maintain end-to-end infrastructure solutions.
  • Write and maintain Infrastructure as Code (IaC).
  • Develop new tools/platforms for deployment flows, monitoring/dashboarding, alerting, incident management, observability, automation, security, horizontal requirements, and much more.
  • Manage and improve the whole lifecycle of microservices – deployments, architecture, operations, security, performance tuning, etc.
  • Diagnose, resolve, and prevent production issues. Perform RCAs/postmortems and automate the process where required.
  • Introduce new technologies and tools that could help in faster and more efficient development, keeping reliability, scalability, resiliency, and availability in mind.
  • Architect solutions and design robust pipelines (including CI/CD and Data pipelines).
  • Provide on-call support on a rotation basis.
  • Write proper documentation and publish system design / architectural blueprints. Promote best practices and standards. Do regular code reviews.
  • Work on infrastructure cost optimization and sustainability.
  • Work towards effective SLIs/SLOs.
  • Be open to exploring and adapting to new technologies/tools/languages/trends.

Qualifications/Requirements:

  • B.Tech/B.E. in Computer Science or a related technical discipline with 4+ years of relevant experience.
  • Expertise in building reliable, scalable, highly available, and resilient API-driven platforms, preferably in Java, Python, Node.js, or Golang.
  • Strong grasp of Data Structures, Algorithms, and Design Patterns.
  • Understanding of Unix/Linux operating systems, system administration, and networking stack (TCP/IP, NAT, DNS, SSL, iptables, routing, network topologies and protocols).
  • Experience in designing, analyzing, and troubleshooting large-scale distributed systems and cloud-based architectures.
  • Extensive knowledge of cloud infrastructure, operations, networking, resources, and services is a must; experience with Amazon Web Services (AWS) or Google Cloud Platform (GCP) is required.
  • Working knowledge of containers (Docker or rkt) and orchestration systems like Kubernetes is mandatory.
  • Proven track record of working with microservice-based architectures in production environments. Should have a solid understanding of microservice design patterns such as Circuit Breaker, Aggregator, etc.
  • Good knowledge of monitoring, alerting, and dashboarding tools like Prometheus, Sensu, Grafana, New Relic, Datadog, AppDynamics, Instana, PagerDuty, VictorOps, OpsGenie, etc.
  • Experience writing Infrastructure as Code (IaC) using Terraform.
  • Strong hands-on experience building CI/CD automation pipelines using platforms like GitLab, GitHub, Jenkins, Spinnaker, etc.
  • Demonstrated experience working towards effective SLIs/SLOs.
  • Ability to perform application performance tuning and reason about security and process interaction.
  • Solid understanding of version control systems (Git, SVN, etc.).
  • [Bonus] Experience with the ELK Stack (Elasticsearch, Logstash, Beats, and Kibana).
  • [Bonus] Hands-on experience with distributed systems like Kafka, RabbitMQ, Redis, Aerospike, Airflow, ZooKeeper, Solr, Elasticsearch, etc.
  • [Bonus] Experience with at least one relational database such as MySQL or PostgreSQL and at least one NoSQL database such as MongoDB, Cassandra, DynamoDB, etc. Should know about database clustering, management, upgrade processes, disaster recovery mechanisms, performance, scalability, high availability, and reliability.
  • [Bonus] Working knowledge of Helm and Helmfile/Helmsman.
  • [Bonus] Experience in developing Custom Controllers and Dynamic Admission Controllers in Kubernetes.
  • [Bonus] Experience with OpenTelemetry and related projects (OpenTracing, OpenMetrics, and OpenCensus).
  • Fluent in both written and spoken English.

Benefits:

  • Competitive compensation and additional benefits/perks.
  • Collaborative learning and an abundance of learning resources to help you improve every day.
  • Company mobile phone.
  • Laptop of your choice: MacBook Pro, Windows, or Linux.

To apply for this job, email your details to hv@jobtrix.in
