Site Reliability Engineer

Last updated 3 days ago
Location: Redmond, Washington
Job Type: Full Time

Site Reliability Engineer, Secure Admin Services, CSEO


Core Services Engineering builds and manages the critical products and services that Microsoft runs on. We boldly pursue big ideas that power transformational advances at Microsoft and for our customers, while helping Microsoft teams work smarter, faster and more securely every day. Core Services Engineering employees have deep technical and business expertise, customer insights, and a clear point of view that comes from first-hand, large-scale experience with Microsoft and industry solutions. We are engineers, technology leaders and experts, digital transformation change agents, and customer advocates.
We have exciting opportunities for you to innovate, influence, transform, inspire and grow within our organization and we encourage you to apply to learn more!

Microsoft has been a leading company in computing for decades. We are a global company, relied on by companies, governments, utilities, stores, schools, universities and co-operatives to deliver the things they need to work, every day. In order to make this work, we need to make it reliable. In order to make it reliable, we need you -- someone who already is, or is interested in becoming, a Site Reliability Engineer (also known as SRE), within our SAS Site Reliability Engineering team.

The Site Reliability Engineering (SRE) team provides leadership, direction and accountability for application architecture, system design, and end-to-end implementation. As a Site Reliability Engineer, you will identify and deliver service improvements using your expertise in services engineering, systems, networks and software know-how, reliability and dependency analysis and scalable system design principles. Strong collaboration skills will be required to work closely with other engineering teams, service owners and support teams to ensure services/systems are highly stable and performant, meeting the expectations of our user base across the company.

Site Reliability Engineering is a hybrid role, comparatively rare in industry but crucially important to how things work behind the scenes today. SREs are people who take engineering-based approaches to solving operations problems; we like infrastructure, we like seeing how the big complicated thing works, and most importantly, we gain great satisfaction from making it better.

Our Site Reliability Engineers are persistent problem solvers, always focused on mitigating issues and owning a problem until a resolution is in place. To accomplish this, they work in close collaboration with various engineering teams. They are also involved in automation, developing tools to support a DevOps model, and analyzing vast amounts of data to find trends and suggest improvements. Creativity and data-driven decision making are heavily valued in this emerging role.

Site Reliability Engineers build, monitor, and maintain the systems and infrastructure that ensure our customers can quickly access their data and run workloads whenever they need to. We identify service problems and areas for improvement, and we help implement solutions. Our work is key to the security of many Microsoft services and to Microsoft’s credibility. Secure Admin Services provides access to Microsoft’s entire infrastructure and ecosystem in a secure manner.

Responsibilities

Key responsibilities:

  • Provide technical engineering for a cross-functional, highly visible operations team supporting the secure access services platform for Microsoft’s corporate network.
  • Identify opportunities and drive the implementation of automation to improve service health, manageability, reliability and telemetry.
  • Own, triage, investigate and resolve service issues with an emphasis on broad communications, learning & teaching throughout the process.
  • Read, write, configure, design, and script end-to-end service telemetry, alerting and self-healing capabilities for platforms.
  • Author functional and technical documentation.
  • Communicate on a deeply technical level with product engineering, project management and operations teams to improve and optimize products, improve infrastructure, and evolve services.
  • Remain current on new technologies, methods and procedures including, but not limited to, coding practices such as Test Driven Development, Continuous Integration, and Continuous Deployment.

Qualifications

Required Qualifications:

  • Familiarity with one or more general purpose programming languages including but not limited to: C/C++, C#, Python, PowerShell, JavaScript
  • Experience with the Microsoft cloud and/or stack including O365, Azure, Windows or other Microsoft software/services
  • Experience leveraging cloud architecture, applying site reliability principles, and/or demonstrating sensitivity to operational concerns
  • Demonstrated ability to debug, fix, and optimize code
  • Full-stack troubleshooting skills across network, application, hardware, management fabric, and distributed services layers

Preferred, not required:

  • BA/BS in Computer Science, Computer Engineering or related technical discipline or, in place of a 4-year degree, an equivalent industry internship or industry software engineering experience
  • 2+ years of scripting and programming experience (preferably .NET, PowerShell, Python, C#)
  • Excellent troubleshooting skills, which are essential to success in this role
  • Out-of-the-box, quick and agile thinking to adapt to a fast-paced, changing environment
  • Deep knowledge of system design & architecture, and of running complex, large-scale online services
  • Demonstrated technical experience with site reliability engineering or software development and operations
  • Experience building distributed cloud-based software services
  • Fast learner, introspective
  • Ability to contribute to multiple projects/demands simultaneously

The ideal candidate will have experience in a team environment, experience running and deploying cloud-scale services and platforms, technical depth in the security of cloud platforms, safe deployment paradigms at cloud scale, agile development practices, and experience in designing & tuning monitoring/telemetry.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.
