Staff Site Reliability Engineer

Location: Remote

Job Posted: Jun 07, 2024
Company: dutchie

About This Job

What You'll Do...

  • Lead SRE Strategy: Define the overall technical direction and strategy for SRE at Dutchie, aligning with business goals and ensuring the highest levels of system reliability and stability.
  • Technical Leadership: Mentor and guide other engineers on best practices, emerging technologies, and industry trends, fostering a culture of continuous learning and improvement.
  • Project Execution: Drive the execution of key SRE projects, ensuring timely delivery, quality, and alignment with business objectives.
  • Operational Excellence: Collaborate with development and product teams to optimize system performance, reliability, and scalability.
  • Incident Management: Troubleshoot and resolve complex issues in production environments. Lead the resolution of critical incidents, conduct post-incident reviews, identify trends and implement preventative measures to minimize future disruptions.
  • Automation: Champion automation initiatives to streamline processes, reduce manual toil, and improve operational efficiency.
  • Performance Optimization: Continuously monitor system capacity and performance, identify bottlenecks, and implement optimization strategies to maximize efficiency and resource utilization.
  • Collaboration: Partner with stakeholders across the organization to understand their needs, communicate SRE initiatives, and foster a collaborative environment.
  • Mentorship: Provide technical guidance and mentorship to junior SREs, helping them develop their skills and grow professionally.
  • Maximize Observability: Drive successful adoption and use of observability tools (Datadog) and logging (Splunk) across the organization. Implement and manage monitoring, alerting and logging systems to ensure early detection of issues.
  • Business Continuity: Lead the design and implementation of disaster recovery and business continuity plans.
  • Support: Participate in on-call rotation to ensure 24/7 availability of our systems and services.

What You Bring...

  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • 10+ years of experience as a Site Reliability Engineer or a related role with a proven track record.
  • Strong expertise in cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (e.g., Kubernetes).
  • Strong technical expertise and leadership skills.
  • Proficient in scripting and automation using languages such as Python, Shell, or Go.
  • Solid understanding of networking, security, and infrastructure-as-code principles.
  • Experience with observability tools such as Datadog and logging solutions such as Splunk.
  • Proven track record of successfully leading incident response efforts and conducting post-mortems.
  • Experience in enabling application teams to enhance observability and reliability of their services.
  • Excellent communication and collaboration skills, with the ability to work effectively in a team environment.
  • Excellent problem-solving and troubleshooting skills.

It's a bonus if you...

  • Have a Master's degree in Computer Science, Computer Engineering, or a related field.
  • Have experience with containerization technologies (e.g., Docker, Kubernetes).
  • Have experience with Infrastructure as Code (IaC) tools (e.g., Pulumi, Terraform, CloudFormation).
  • Have experience with agile development methodologies (e.g., Scrum, Kanban).
  • Hold relevant industry certifications (e.g., CKAD).

You'll Get...

  • Full medical benefits including dental and vision plans to ensure you always have the best care.
  • Equity packages, in the form of stock options, offered to all employees.
  • A technology allowance (hardware, software, reading materials, etc.).
  • Flexible vacation and sick days.

Company Benefits

Benefits for this job may vary.

Career Advancement, Dental, Medical, & Vision Benefits, Flexible Time Off / Unlimited Vacation, Inclusive Environment, Paid Time Off, Life Insurance, Lunch & Learns, Paid Parental Leave, Relaxed / Casual Dress Code, Short-Term & Long-Term Disability, Work From Home / Remote

About dutchie

Dutchie is an all-in-one technology platform offering a full suite of solutions to simplify operations: Point of Sale, Ecommerce, Payments, and more. Our mission is to provide consumers with safe and easy access to cannabis and to support the wave of positive societal change that cannabis is bringing to the world.