Miguel A. Lobato
Staff AI Platform Engineer
Production AI reliability, agent platforms, and inference operations at scale
Summary
Staff-level platform engineer focused on running AI systems in production with reliability, observability, governance, and operational discipline.
- I design and operate agent and inference platforms on Kubernetes, integrating real systems, human approval flows, and dynamic context loading.
- My work spans agent runtimes, LLMOps, multi-model gateways, evals, guardrails, cost controls, and AI observability.
- I bring a strong platform foundation in Snowflake, orchestration, data quality, governance, and production operations.
- I work as a hands-on Staff IC at the intersection of AI platform, SRE, and developer platform.
Core Capabilities
AI Platform
Agent runtimes
LLMOps / AgentOps
Multi-model routing
MCP integration
Human-in-the-loop workflows
Reliability & Operations
Inference on Kubernetes
SLOs, autoscaling, canary, rollback
Incident automation
Observability and alerting
Cost and capacity controls
Data & Governance
Snowflake, Airflow, dbt-ready patterns
ETL pipelines and data platform modernization
Data quality and policy-as-code
OpenLineage and catalog integration
Platform governance
Representative Outcomes
- Reduced operational toil and incident response time through agent-assisted workflows, better observability, and production automation.
- Improved reliability and performance of inference services through SLO-driven operations, autoscaling, canary releases, and rollback patterns.
- Added delivery discipline to LLM and agent systems with eval harnesses, golden datasets, regression checks, scorecards, dashboards, and approval gates.
- Controlled cost and capacity with routing, budget-aware operations, and platform-level operational guardrails.
Work Experience
New Work SE
Staff AI Platform Engineer
Sep 2022 - Present
- Owned production AI platform reliability across agent, inference, and data platform workloads.
- Built agent workflows for PR validation and incident response with tools, memory, retries, fallbacks, and human approval.
- Integrated agents with production systems including Kubernetes, ArgoCD, Dash0, AWS, and CloudWatch, using MCP-style context loading.
- Established AI observability covering traces, task success, latency, and cost, backed by dashboards, alerts, and error taxonomies.
- Added eval-driven delivery with offline and online evals, golden datasets, regression checks, and scorecards.
- Introduced guardrails and operational controls for PII, secrets, and grounding, with approval gates for sensitive actions.
- Ran inference services on Kubernetes with SLOs, autoscaling, canary releases, rollback, and production troubleshooting.
- Led platform modernization from legacy Hadoop foundations to Snowflake, Astronomer, AWS, Kubernetes, and Crossplane.
- Strengthened governance with data contracts, OpenLineage, RBAC, catalog integration, DQ testing, and policy-as-code.
Smart Protection
Staff Data Platform Engineer
Apr 2021 - Aug 2022
- Built platform foundations for data and ML workloads on AWS, Spark, Kafka, Delta Lake, and Kubernetes.
- Orchestrated pipelines and operations with Airflow, Terraform, and cloud-native infrastructure patterns.
- Established observability discipline with InfluxDB, CloudWatch, and Grafana.
- Created governance standards that improved consistency, ownership, and delivery velocity.
Evacode Studio
Co-founder & CTO
Sep 2007 - Sep 2010
- Co-founded and led an IT services company across software delivery, business development, and technical leadership.