Software Reliability Engineer

No. of Positions: 2

Location: Noida

Tentative Start Date: September 30, 2021

Work From: Offsite

Rate: $15 - $25 (Hourly)

Experience: 5 to 8 Years

Posted: 23 Days ago
Job Category: Information Technology & Services
Duration: Long-Term
Key Skills Required
Kafka, Python, Puppet, Kubernetes, OpenStack, ITIL
Description

Hi,

Here are the job details:


A. RESPONSIBILITIES

Important aspects of the job include:

It's MySQL: thousands of instances in hundreds of replication hierarchies, some of them under substantial load. It is the foundation of our Application Data Infrastructure.

It's automated. But as our systems are evolving, this automation needs improvement, extension and refactoring to meet the changing requirements of a different environment.

It's Python and Go. And being at the center of most, if not all, applications, it talks to literally everything else.

It's moving to all platforms, including OpenStack, Kubernetes and the public cloud.

It's dynamic. With automated capacity testing, restore testing, failover testing and disaster recovery testing, it needs to adapt to planned and unplanned changes in production conditions and environments.

Sometimes it has problems. Sometimes our customers cause problems. Good monitoring and alerting are required to catch problems as they develop, or ideally before they develop.

It's in multiple data centers, both our own and in the public cloud. Replication and communication over long distances pose their own scaling and performance problems.

As an SRE on the data infrastructure team, you will be responsible for planning, building, improving and refactoring solutions to these problems. You will also share the on-call rotation and be an escalation contact for incidents. You will work in close collaboration with cross-functional teams in Core Infrastructure and in the Application Teams.

 

B. SKILLS

 

What will you bring to the role?

 

Must have

  1. Experience managing a production Kafka cluster in a very large-scale environment. That is, do you know how to scale up, or how to react to an issue such as "disk space quickly filling up"? For background, the cluster in the current infrastructure moves 1 TB of data per second.
  2. Operational experience maintaining a Kafka cluster like the one above, and familiarity with ITIL best practices (incident, problem and change management)
  3. Experience writing production Java code to extend a control plane written in Java
  4. Familiarity with the best practices that come with the SRE role

 

Nice to have

  1. Experience in automation and capacity management using Java, Puppet, Terraform, etc.
  2. Participation in the standby rotation schedule, with 24/7 support
  3. Exposure to coding at scale
  4. Experience with private cloud solutions
  5. Knowledge of Python programming and scripting