Site Reliability Engineer

Activities I performed as a Site Reliability Engineer:

  • Maintained 99.95% uptime for Docker & Kubernetes clusters (hosted on AWS / Ubuntu + Windows)
  • Created & managed resources on AWS infrastructure: VPC, S3, IAM, SSL, CloudWatch
  • Implemented monitoring using Nagios, Prometheus, Grafana and PagerDuty
  • Prepared and verified RCAs
  • Drafted Post Mortem reports for incidents and outages
  • Identified problems and suggested workarounds / solutions
  • Managed VPC routing on AWS
  • Identified automation opportunities and implemented them
  • Prepared playbooks for L1/L2 support engineers
  • Defined requirements for technical engineers (to be recruited)
  • Managed the hiring process for AWS / DevOps / CI/CD / Linux / Windows engineers
  • Provided support for Docker infrastructure (Linux/Windows) deployed on AWS
  • Managed AWS resources - S3, IAM, VPC, EC2, SQS, SES, RDS
  • Maintained EC2 instances, OS troubleshooting - Ubuntu 16.04 LTS and Windows Server 2016
  • Installed SSL certificates on load balancers
  • Troubleshot disk errors - LVM / EBS
  • Created IAM policies for AWS resources
  • Troubleshot faulty containers
  • Granted users access to AWS infrastructure by creating AWS policies
  • Automation using Ansible, Terraform, Bash
  • Troubleshot AWS outages with AWS Technical Support
  • Maintained Docker clusters with over 3000 containers
  • Maintained a Kubernetes cluster with around 700 pods and applications
  • Assisted in the dockerization of products
  • Troubleshot Windows technical errors with the Microsoft team
  • Managed Nginx proxies, Consul troubleshooting
  • Data migration from AWS and the data center to Docker / Kubernetes
  • Day-to-day maintenance of Docker hosts, proxy servers, load balancers
  • Post mortem for outages and RCA preparation
  • Maintained Git repositories
  • Problem solving, documentation
  • Maintained Jira, Confluence, PagerDuty, Zabbix