PaloAltoRecruiter Since 2001
the smart solution for Palo Alto jobs

Senior Site Reliability Engineer

Company: Rivian Automotive
Location: Palo Alto
Posted on: May 3, 2021

Job Description:

Rivian is on a mission to keep the world adventurous forever. This goes for the emissions-free Electric Adventure Vehicles we build, and the curious, courageous souls we seek to attract. As a company, we constantly challenge what's possible, never simply accepting what has always been done. We reframe old problems, seek new solutions and operate comfortably in areas that are unknown. Our backgrounds are diverse, but our team shares a love of the outdoors and a desire to protect it for future generations. We operate development centers in Plymouth, Michigan; Southern California (Irvine, Carson & LA); Silicon Valley (San Jose and Palo Alto); Vancouver, British Columbia; and Surrey, England; as well as a manufacturing facility in Normal, Illinois.

Rivian's Digital Technology Team is responsible for the end-to-end implementation of the digital experience outside the vehicle (e.g. vehicle configurator, payment gateway, vehicle delivery management, service scheduling) across web, mobile app and in-store. To that end, we are developing a world-class commerce platform that will make learning about and purchasing electric adventure vehicles intuitive, seamless and fun.

We are seeking a Senior Site Reliability Engineer who will join an SRE team in creating best practices and solutions to keep the Rivian Digital Technology sites and applications highly available and reliable. This is an exciting role working with software engineering teams from the ground up to build cloud-based solutions using the latest technologies, tools, and practices. The right candidate will be passionate about site reliability and how to serve millions of customers with full automation and limited downtime.

Responsibilities:

- Work with engineering teams to deliver high-quality products and solutions that delight Rivian customers.
- Work with engineering teams to design robust cloud-based architectures and redundant, fault-tolerant solutions utilizing practices around CI/CD, blue-green deployments, canary testing, and traffic management.
- Define non-functional requirements (NFRs) for engineering teams around security, logging, monitoring, alerting, configuration, and testing, and work with those teams as they implement apps and services.
- Develop runbooks and standard operating procedures (SOPs) for each service and application to ensure DevOps and SRE teams can detect incidents or issues before customers are impacted and act quickly to restore impacted services.
- Define practices and procedures around postmortems and root cause analysis to ensure service quality and maintainability KPIs are improving and downtime and service interruption are negligible.
- Work collaboratively with various stakeholders to provide team-based solutions, creating a culture of inclusion and diversity of skillsets.
- Participate in a 24x7 on-call rotation and define and implement on-call practices and procedures.

Qualifications:

- 5 years in a technical role in Site Reliability, Operations, Systems Administration, or Cloud Infrastructure.
- 5 years of experience being responsible for the uptime and reliability of customer-facing web or mobile applications and critical services.
- 5 years of experience maintaining and administering large-scale Linux-based environments with best practices for security and automation.
- 5 years of experience providing and maintaining cloud-based infrastructure such as AWS, GCP, Azure, or internal data center solutions based on vSphere, OpenStack, etc.
- 3 years implementing and maintaining monitoring and alerting systems, creating service level indicators (SLIs) and service level objectives (SLOs), and focusing on systems that self-heal or alert teams to take action before system downtime.
- 3 years designing and operating fault-tolerant systems with little to no downtime.
- Expert knowledge of monitoring systems such as AppDynamics, New Relic, Prometheus, Grafana, Graphite, Nagios, AWS CloudWatch, etc.
- Knowledge of network architectures, security, and troubleshooting of connectivity or latency issues.
- Comfortable managing deployments of several thousand nodes and the automation it takes to ensure system uptime and redundancy.
- Experience with Docker, K8s, and AWS Lambda is a plus.
- Proficiency in writing automation scripts and tools using Bash, Python, awk, etc.
- Bachelor's degree in computer science, electrical engineering, information systems, or equivalent work experience.

Department: Digital Technology
Location: Palo Alto, CA

Equal Opportunity:

Rivian is an Equal Opportunity Employer and Prohibits Discrimination and Harassment of Any Kind: Rivian is committed to the principle of equal employment opportunity for all employees and to providing employees with a work environment free of discrimination and harassment. All employment decisions at Rivian are based on business needs, job requirements and individual qualifications, without regard to race, color, religion or belief, family or parental status, or any other status protected by the laws or regulations in the locations where we operate. Rivian will not tolerate discrimination or harassment based on any of these characteristics. Rivian encourages applicants of all ages.

Privacy:

We take your privacy seriously. For details please see our Candidate Privacy Notice.

Keywords: Rivian Automotive, Palo Alto, Senior Site Reliability Engineer, Other, Palo Alto, California
