Role Overview
The AI DevOps Engineer works at the intersection of machine learning, software engineering, and platform operations. This role ensures the reliable, scalable, and secure deployment of AI/ML models into production by building automated pipelines, optimizing model serving infrastructure, and integrating observability across the entire ML lifecycle. The AI DevOps Engineer partners closely with data scientists, ML engineers, and platform teams to accelerate the delivery of AI solutions.
________________________________________
Key Responsibilities
1. ML Infrastructure & Platform
• Design, build, and maintain AI platform components (model training, model registry, feature store, inference services).
• Implement container-based and serverless architectures for scalable AI workloads.
• Manage GPU/TPU compute clusters and optimize resource utilization for training and inference.
2. CI/CD for ML (MLOps)
• Build and maintain CI/CD pipelines for ML workflows including data validation, model testing, packaging, and automated deployment.
• Integrate model governance, approval workflows, and rollback mechanisms into pipelines.
• Enable reproducible pipelines using tools like MLflow, Kubeflow, Vertex AI, Databricks, Azure ML, or Amazon SageMaker.
3. Production Model Deployment & Inference
• Deploy real-time, batch, and streaming inference pipelines.
• Optimize performance of model serving systems (e.g., Triton Inference Server, TorchServe, BentoML, Ray Serve).
• Implement A/B testing, shadow deployments, and model versioning strategies.
4. Monitoring & Observability
• Build end-to-end observability, including:
o Model performance monitoring (drift, bias, accuracy decay).
o System health monitoring (latency, throughput, resource usage).
o Data quality checks using automated detectors.
• Integrate monitoring dashboards and alerts via Prometheus, Grafana, ELK, Datadog, etc.
5. Security, Compliance & Governance
• Ensure secure handling of model artifacts, datasets, and inference endpoints.
• Implement identity, access, and compliance controls (PII handling, GDPR, SOC 2, ISO, Responsible AI frameworks).
• Conduct threat modeling for AI systems (model stealing, prompt injection, data poisoning).
6. Collaboration & Engineering Practices
• Work closely with data scientists to productionize research prototypes.
• Partner with cloud, SRE, and platform teams to align on best practices.
• Write high-quality documentation, runbooks, and architectural diagrams.
________________________________________
Required Skills & Qualifications
Technical Skills
• Strong programming experience in Python (preferred), plus experience with Bash, Go, or Java.
• Strong knowledge of cloud services (Azure / AWS / GCP), including managed ML services.
• Hands-on experience with containerization and orchestration: Docker, Kubernetes, Helm.
• Experience with CI/CD tools: GitHub Actions, Azure DevOps, GitLab CI, Jenkins.
• Familiarity with ML frameworks: PyTorch, TensorFlow, scikit-learn.
• Experience deploying and scaling AI inference systems.
DevOps & Infra Skills
• Strong Linux fundamentals and system troubleshooting skills.
• Knowledge of networking, load balancing, and distributed systems.
AI/ML Skills
• Understanding of the ML lifecycle, model artifacts, hyperparameter tuning, and model evaluation.
• Experience with ML metadata management, experiment tracking, and data validation tools.
________________________________________
Preferred Qualifications
• Experience with LLMOps: deploying and optimizing large language models, vector DBs, and retrieval pipelines.
• Knowledge of frameworks such as LangChain, LlamaIndex, Milvus, Weaviate, Pinecone.
• Prior work in high-scale, low-latency inference environments.
• Certifications in cloud (Azure/AWS/GCP) or ML engineering.
________________________________________
Behavioral Competencies
• Strong problem-solving and debugging skills.
• Ability to collaborate with cross-functional teams.
• Ownership mindset with a focus on reliability, performance, and automation.
• Effective communication with both technical and non-technical stakeholders.