Site Reliability Engineer
Palo Alto, CA (United States)
cassandra java amazon-web-services
Apple's Applied Machine Learning team has built systems for a number of large-scale data science applications. We work on high-impact projects that serve various Apple lines of business. We use the latest in open source technology and, as committers on some of these projects, we are pushing the envelope! Working with multiple lines of business, we handle many streams of Apple-scale data. We bring it all together and extract the value. We do all this with an outstanding group of software engineers, data scientists, SRE/DevOps engineers and managers.
Monitor production, staging, test and development environments for a myriad of applications in an agile and multifaceted organization.
You are an independent, self-directed problem-solver who can deftly handle multiple competing priorities and deliver solutions in a timely manner.
Provide incident resolution for all technical production issues.
Create and maintain accurate, up-to-date documentation reflecting configuration. Write justifications and status reports, document procedures, train users in sophisticated topics, and interact with Apple staff and management.
Provide guidance to improve the stability, security, efficiency and scalability of systems.
Determine future needs for capacity and investigate new products and/or features.
Apply strong troubleshooting skills daily: independently isolate issues and resolve their root cause through investigative analysis, even in environments where you have little prior knowledge, experience, or documentation.
Administer the backup systems and ensure that backups execute accurately.
Provide 24x7 on-call support to handle urgent critical issues.