Location: New York, New York
Job Type: Full Time
Site Reliability Engineer II, PromoteIQ
PromoteIQ provides intelligent vendor marketing solutions for the next generation of e-commerce. Our platform helps retailers implement, automate, and scale their brand-funded digital vendor marketing programs. We sit at the intersection of marketing and e-commerce and have a singular mission of empowering retailers and brands to maximize their e-commerce performance.
PromoteIQ embodies a strong startup culture that values diversity, collaboration and craftsmanship - and above all else, results. Our bias towards execution balances critical thinking, root cause analysis and pragmatic problem solving. We expect a lot from one another and value our thoughtful and intellectually curious company culture.
PromoteIQ is headquartered in New York City and supports a global footprint of e-commerce retailers and brands. The company was acquired by Microsoft in August 2019 and continues to operate as an independent division within Microsoft Advertising. Learn more at https://www.promoteiq.com. This role is based in our SoHo/NYC office.
Microsoft Advertising is a worldwide Sales, Marketing and Services organization on the cutting edge of the digital advertising industry. Microsoft Advertising offers a compelling portfolio of advertising products, innovative solutions and the opportunity to engage with some of the brightest minds in the digital industry. Microsoft Advertising is the destination for experienced, collaborative, and passionate digital advertising professionals seeking a rewarding career and lifestyle.
Who We’re Looking For
At PromoteIQ, DevOps Engineers specialize in developing scalable methods for building, deploying, and supporting our cloud-agnostic enterprise services and systems. This is a highly collaborative role in which you will work closely with our Software Engineers to deploy and operate our solutions; automate and streamline our processes; build and maintain deployment tools; monitor IT operations; and troubleshoot and resolve issues in our dev, test, and production environments.
Responsibilities
- Design and build infrastructure and systems that provide high levels of scalability, reliability, and performance for PromoteIQ’s stack, while balancing security, maintainability, and operational excellence
- Interface across teams to codify and reliably test infrastructure changes using PromoteIQ’s software development lifecycle
- Partner with Product and Dev teams to provide guidance and best practices around scalability, reliability, and performance of our production systems, infrastructure, and software
- Work as a team on escalations, resolving critical issues that impact our highly available dev, test, and production systems
- Work with a creative engineering team to continuously implement and improve reliable and speedy build environments for DEV & QA; provide timely build status updates; automate as much as possible to improve efficiency and quality
- Promote innovation, implementation of cutting-edge technologies, outside-of-the-box thinking, teamwork, and self-organization
- Work with GitHub Actions or other build tools in a CI/CD process to build and deploy to our cloud-agnostic environment
- Ensure traceability, observability, and retrievability of sources and deliverables
- Build logging, monitoring, and alerting systems to identify bottlenecks and assist with debugging, analysis, and optimization in a cloud-agnostic environment
- Improve operational efficiency through automation and deployment or development of new tools
- Experiment with and recommend new technologies that simplify or improve PromoteIQ's stack
- Craft solid, clearly explained designs, playbooks, and documentation for consumption by teammates and the larger engineering organization
- Participate in an off-hours on-call rotation, and perform periodic off-hours work during maintenance windows
Qualifications
- 3+ years of experience in cloud SRE/infrastructure or a related field
- Experience with cloud-agnostic configuration management frameworks (Ansible, Terraform, etc.)
- Experience configuring and managing cloud infrastructure (AWS, GCP, Azure)
- Knowledge of the ELK stack including Logstash config
- Knowledge of messaging systems such as Kafka, RabbitMQ, and SQS
- Understanding of SSH, VPN, TCP/IP, DNS, HTTP(S), network routing and subnetting
- Experience with managing and tuning datastore clusters (Elasticsearch, RDS, MySQL, and Aerospike)
- Experience with an always-on, high-volume web server stack (Nginx, HAProxy, Squid, etc.)
- Experience with CI/CD tools such as Jenkins, Travis CI, Azure DevOps, and TeamCity
- System observability experience (Zabbix, CloudWatch, PagerDuty, Datadog, Azure Monitor, SignalFx, Grafana, etc.)
- Knowledge of Linux (Debian/Ubuntu) architecture, security, administration, performance monitoring/tuning, troubleshooting, and production operations
- Fluent in Python and shell scripting, with experience implementing automation and monitoring using shell scripts and related tools
- Experience with containerization technologies (Docker, Kubernetes, etc.)
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.
Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.