Engineering - Software
San Jose, California
Senior DevOps Engineer
Req ID: DSA31472
The Internet of Things (IoT) is a true digital revolution. Digitization of the entire world of Things will be the core of the Next Internet, and it is happening now, not in 2020. Samsung, through multiple initiatives and at many different levels (from hardware to software to services), is positioned to play a fundamental role in that revolution.
ARTIK Cloud (https://artik.cloud) is a small, agile team of smart, highly motivated people building this platform to enable production products as well as demos. IoT is all about data: collecting, transporting, processing, and analyzing it, then acting on it, all at unprecedented volume and scale. The platform is therefore a complex system requiring cutting-edge big data and real-time technologies.
In this role you will manage server and application infrastructure for our production and development environments, including deployment, release management, instrumentation and monitoring, alerting, backups, high-availability setup, security, and intrusion detection systems. You will play an integral part in scaling out our platform as we grow, and you will be responsible for maintaining high-quality documentation of operational processes.
You are a strong coder who is not afraid to tackle tools and languages you don't know, or to lend a hand on projects outside your area when needed. There is no "box" in our team, and versatility is a quality we truly value in ourselves and expect from others.
Our group is an extremely fast-paced environment with high demands on our engineers. A successful candidate will have experience in such an environment and will hold high standards for the quality of everything they deliver.
You can choose to work from our SF or San Jose office.
- Understand and take responsibility for all operational workflows and operating procedures, down to a granular, detailed level.
- Measure, optimize, and tune system performance, and ensure that systems run reliably and are highly available in a 24/7 production environment as well as in the development stack.
- Learn new third-party software, hardware, and other solutions quickly and integrate them within our platform and other deliverables.
- Create and maintain operational documentation pertaining to infrastructure, procedures, tools, etc.
- Collaborate with other engineers to optimize application and infrastructure for performance, reliability, failover, and scale.
- Participate in the 24/7 on-call rotation, responding to system problems and emergencies.
- Proactively identify, manage and mitigate risks.
- Perform miscellaneous job-related duties as assigned, including occasional off-hours work to maximize production uptime.
- Savvy about which open source tools to leverage and what to build in-house.
- Strong Linux administration skills - installing packages, troubleshooting, etc.
- Minimum 5 years of experience managing a Linux environment in production.
- Strong scripting skills - Bash a MUST. One high-level language like Ruby/Python a MUST.
- At least two years of AWS experience.
- Should have managed a production platform on AWS. Should have a good understanding of the AWS API. Terraform experience a huge plus.
- Chef, Puppet, or Ansible (at least one)
- Has managed at least one of the following datastores in a production environment: Cassandra, MongoDB, Elasticsearch, or HDFS.
- Experience with ANY logging/graphing/trending/monitoring infrastructure like: Nagios, Icinga, Sensu, Graphite, StatsD, Graylog, Logstash, Splunk, ELK, InfluxDB, Ganglia.
- Solid team player and quick learner. Ability to switch gears quickly due to changing requirements.
- Experience managing a production environment.
- Must have experience troubleshooting/managing/deploying Java applications on Linux.
- MySQL cluster administration experience is highly desired.