Miguel A. Lobato

Staff AI Platform Engineer

Production AI reliability, agent platforms, and inference operations at scale

miguellobato.com


Summary

Staff-level platform engineer focused on running AI systems in production with reliability, observability, governance, and operational discipline.

  • I design and operate agent and inference platforms on Kubernetes, integrating real systems, human approval flows, and dynamic context loading.
  • My work spans agent runtimes, LLMOps, multi-model gateways, evals, guardrails, cost controls, and AI observability.
  • I bring a strong platform foundation in Snowflake, orchestration, data quality, governance, and production operations.
  • I work as a hands-on Staff IC at the intersection of AI platform, SRE, and developer platform.

Core Capabilities

AI Platform

Agent runtimes

LLMOps / AgentOps

Multi-model routing

MCP integration

Human-in-the-loop workflows

Reliability & Operations

Inference on Kubernetes

SLOs, autoscaling, canary, rollback

Incident automation

Observability and alerting

Cost and capacity controls

Data & Governance

Snowflake, Airflow, dbt-ready patterns

ETL pipelines and data platform modernization

Data quality and policy-as-code

OpenLineage and catalog integration

Platform governance

Representative Outcomes

  • Reduced operational toil and incident response time through agent-assisted workflows, better observability, and production automation.
  • Improved reliability and performance of inference services through SLO-driven operations, autoscaling, canary releases, and rollback patterns.
  • Added delivery discipline to LLM and agent systems with eval harnesses, golden datasets, regression checks, scorecards, dashboards, and approval gates.
  • Controlled cost and capacity with routing, budget-aware operations, and platform-level operational guardrails.

Work Experience

New Work SE

Staff AI Platform Engineer

Sep 2022 - Present

  • Owned production AI platform reliability across agent, inference, and data platform workloads.
  • Built agent workflows for PR validation and incident response with tools, memory, retries, fallbacks, and human approval.
  • Integrated agents with production systems including Kubernetes, Argo CD, Dash0, AWS, and CloudWatch, with MCP-style context loading.
  • Established AI observability covering traces, task success, latency, and cost, backed by dashboards, alerts, and error taxonomies.
  • Added eval-driven delivery with offline and online evals, golden datasets, regression checks, and scorecards.
  • Introduced guardrails and operational controls for PII, secrets, grounding, approval gates, and sensitive actions.
  • Ran inference services on Kubernetes with SLOs, autoscaling, canary releases, rollback, and production troubleshooting.
  • Led platform modernization from legacy Hadoop foundations to Snowflake, Astronomer, AWS, Kubernetes, and Crossplane.
  • Strengthened governance with data contracts, OpenLineage, RBAC, catalog integration, DQ testing, and policy-as-code.

Smart Protection

Staff Data Platform Engineer

Apr 2021 - Aug 2022

  • Built platform foundations for data and ML workloads on AWS, Spark, Kafka, Delta Lake, and Kubernetes.
  • Orchestrated pipelines and operations with Airflow, Terraform, and cloud-native infrastructure patterns.
  • Established observability discipline with InfluxDB, CloudWatch, and Grafana.
  • Created governance standards that improved consistency, ownership, and delivery velocity.

Evacode Studio

Co-founder & CTO

Sep 2007 - Sep 2010

  • Co-founded and led an IT services company across software delivery, business development, and technical leadership.

More experience at LinkedIn

Education

Executive Master in Business Administration

Universidad de Sevilla

2009 - 2011

International Master in Business Administration

Universidad Pablo de Olavide

2011 - 2012

Computer Software Engineer

Universidad de Sevilla

2002 - 2008

More education at LinkedIn