Job Type: Full Time
We are looking for a team player to help us optimize and protect the software and systems behind our internal and customer offerings, keeping an ever-watchful eye on their reliability, latency, performance, and capacity.
- Deploy and maintain our production infrastructure hosted on Azure
- Analyze complex system behavior, performance, and application issues
- Analyze and plan capacity for our cloud services
- Deploy, manage, and improve databases
- Apply modern engineering practices to drive down operational overhead through automation and system design
- Promote security excellence across a broad set of internal and external customers
- Define and create standard operating procedures for support teams
- Ensure all infrastructure and application alerts are actionable or trigger self-healing automation
- Work closely with the service development team, offering education and guidance on integration, support, and monitoring across the toolset
- Serve as the Tier 3 escalation point for support, responsible for troubleshooting as well as mentoring and coaching others
- Demonstrate complex troubleshooting skills, deep knowledge of the services running on the infrastructure, and work with engineers and vendors to resolve issues
- Live Site Management: as an SRE, you will play a crucial role in a global team driving huge-scale live sites 24/7 and gain a deep understanding of availability, performance, and security
- Automate processes
- Conduct periodic on-call duties
- Work cross-team with Azure
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
- Experience in cloud environments (Azure/AWS/GCP).
- Proven technical troubleshooting and performance tuning experience.
- Experience with distributed systems, networking, hardware, logistics and operations, or capacity planning.
- Strong written and oral communication skills required.
- Ability to contribute to multiple projects/demands simultaneously.
- 2+ years of experience with Linux system administration.
- 3+ years of experience handling critical production incidents.
- 3+ years of database administration (DBA) experience.
- 2+ years of software development experience, or a BS degree in Computer Science, Electrical & Computer Engineering, or Mathematics, or equivalent experience.
- 3+ years of experience with a monitoring system (Pingdom, Datadog, Splunk, Grafana, Azure Monitor).
- Experience defining and measuring internal and customer-facing OLAs/SLAs.
- 3+ years of service automation using scripting tools: Python/PowerShell/Bash (PowerShell preferred).
- Experience with MongoDB.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.
Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.