<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Pipelines on RockB</title><link>https://baeseokjae.github.io/tags/pipelines/</link><description>Recent content in Pipelines on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 11 May 2026 06:05:06 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/pipelines/index.xml" rel="self" type="application/rss+xml"/><item><title>ZenML Guide 2026: Production MLOps Pipelines Without the Lock-In</title><link>https://baeseokjae.github.io/posts/zenml-mlops-pipeline-guide-2026/</link><pubDate>Mon, 11 May 2026 06:05:06 +0000</pubDate><guid>https://baeseokjae.github.io/posts/zenml-mlops-pipeline-guide-2026/</guid><description>Complete ZenML guide 2026: build production MLOps pipelines with @step/@pipeline decorators, swap infra without rewrites, and avoid vendor lock-in.</description><content:encoded><![CDATA[<p>ZenML is an open-source MLOps framework that lets you define ML pipelines once in Python and run them on any infrastructure — local, AWS, GCP, or Azure — by swapping a stack configuration rather than rewriting code. In 2026, it&rsquo;s the most direct answer to the 85% of ML models that never reach production.</p>
<h2 id="why-85-of-ml-models-never-reach-production-and-how-zenml-fixes-that">Why 85% of ML Models Never Reach Production (And How ZenML Fixes That)</h2>
<p>The production gap in machine learning is one of the most persistent problems in the industry, and the numbers remain damning in 2026. Research consistently shows that 85% of ML models never make it to production, and approximately 45% of ML projects fail specifically due to poor monitoring and retraining pipelines. The root cause is almost never the model itself — it&rsquo;s the infrastructure around it. Teams build a model in a Jupyter notebook, spend months trying to productionize it using SageMaker, Vertex AI, or a custom Kubeflow cluster, and then discover that any infrastructure change requires rewriting their entire training logic. The research-to-production handoff becomes a six-month project every single time.</p>
<p>ZenML addresses this with a fundamentally different architecture: your pipeline code stays entirely cloud-agnostic, and the &ldquo;stack&rdquo; — the collection of infrastructure components like orchestrator, artifact store, and experiment tracker — is a swappable configuration. When your team decides to migrate from AWS SageMaker to Google Vertex AI, you change the stack, not the pipeline. When you need to reproduce a training run from eight months ago, ZenML&rsquo;s artifact tracking gives you deterministic access to the exact data, code, and configuration used. The result: teams using ZenML report cutting their research-to-production timeline from months to days, because the deployment machinery is already built into the framework.</p>
<h2 id="what-is-zenml-the-stack-based-mlops-framework-explained">What Is ZenML? The Stack-Based MLOps Framework Explained</h2>
<p>ZenML is an open-source, Apache 2.0-licensed MLOps framework built around a single core abstraction: the <strong>stack</strong>. A stack is a named collection of infrastructure components — an orchestrator (where pipelines run), an artifact store (where data and models are saved), and optionally an experiment tracker, model deployer, and alerter. The stack abstraction is what makes ZenML&rsquo;s anti-lock-in promise concrete: your <code>@pipeline</code>-decorated Python functions remain identical whether you&rsquo;re running locally or on Kubernetes. Only the active stack changes.</p>
<p>ZenML has accumulated 5.2k+ GitHub stars as of 2026, reflecting strong developer adoption in a market where the global MLOps space is valued at $4.39 billion and growing at nearly 46% CAGR toward an estimated $90 billion by 2035. Unlike heavier platforms like Kubeflow (which requires a dedicated Kubernetes cluster and platform team), ZenML is designed for the majority case: a small-to-medium ML team that wants production-grade pipelines without the operational overhead. It supports 50+ integrations including MLflow, Weights &amp; Biases, Great Expectations, Seldon, BentoML, and every major cloud provider. The Apache 2.0 license means zero vendor lock-in concerns — you own your pipelines, your data, and your infrastructure choices.</p>
<p>The framework positions itself as a <strong>portability layer</strong> over orchestrators, not an orchestrator itself. ZenML can use Airflow, Kubeflow Pipelines, Prefect, or plain local execution as the orchestrator — you decide based on your team&rsquo;s existing tooling.</p>
<h2 id="zenml-core-concepts-steps-pipelines-stacks-and-artifacts">ZenML Core Concepts: Steps, Pipelines, Stacks, and Artifacts</h2>
<p>ZenML&rsquo;s programming model is built on four interlocking concepts that every practitioner needs to understand before writing a single line of code. Steps are Python functions decorated with <code>@step</code> — they represent a single unit of work in your ML workflow, such as data ingestion, preprocessing, model training, or evaluation. Pipelines are Python functions decorated with <code>@pipeline</code> that wire steps together by calling them in sequence. Stacks define the infrastructure where pipelines run. And artifacts are the typed data outputs that flow between steps, automatically versioned and tracked by ZenML&rsquo;s artifact store.</p>
<p><strong>Steps</strong> are the core unit of reuse. Each <code>@step</code> function has typed inputs and outputs, which ZenML uses to automatically serialize and deserialize data via materializers. This means if you return a <code>pd.DataFrame</code> from one step, the next step receives a <code>pd.DataFrame</code> — ZenML handles storage transparently.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> zenml <span style="color:#f92672">import</span> step, pipeline
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.datasets <span style="color:#f92672">import</span> load_iris
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> pandas <span style="color:#66d9ef">as</span> pd
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@step</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">load_data</span>() <span style="color:#f92672">-&gt;</span> pd<span style="color:#f92672">.</span>DataFrame:
</span></span><span style="display:flex;"><span>    iris <span style="color:#f92672">=</span> load_iris(as_frame<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> iris<span style="color:#f92672">.</span>frame
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@step</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">train_model</span>(data: pd<span style="color:#f92672">.</span>DataFrame) <span style="color:#f92672">-&gt;</span> dict:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">from</span> sklearn.linear_model <span style="color:#f92672">import</span> LogisticRegression
</span></span><span style="display:flex;"><span>    X <span style="color:#f92672">=</span> data<span style="color:#f92672">.</span>drop(<span style="color:#e6db74">&#34;target&#34;</span>, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>    y <span style="color:#f92672">=</span> data[<span style="color:#e6db74">&#34;target&#34;</span>]
</span></span><span style="display:flex;"><span>    model <span style="color:#f92672">=</span> LogisticRegression(max_iter<span style="color:#f92672">=</span><span style="color:#ae81ff">200</span>)
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">.</span>fit(X, y)
</span></span><span style="display:flex;"><span>    accuracy <span style="color:#f92672">=</span> model<span style="color:#f92672">.</span>score(X, y)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> {<span style="color:#e6db74">&#34;accuracy&#34;</span>: accuracy}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@pipeline</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">iris_pipeline</span>():
</span></span><span style="display:flex;"><span>    data <span style="color:#f92672">=</span> load_data()
</span></span><span style="display:flex;"><span>    train_model(data)
</span></span></code></pre></div><p><strong>Stacks</strong> are where the cloud-agnostic magic lives. The active stack is set via <code>zenml stack set my-stack</code>, and from that point forward, running <code>iris_pipeline()</code> uses whatever orchestrator and artifact store that stack defines — no code changes required.</p>
<p><strong>Artifacts</strong> are versioned automatically. Every output of every step is stored with a version number, metadata, and lineage information. You can query any artifact from any past pipeline run, compare versions across experiments, and reproduce results exactly — solving the reproducibility crisis that plagues notebook-driven workflows.</p>
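<p>To make the bookkeeping concrete, here is a stdlib-only sketch of the idea behind a versioned artifact registry: every save appends an immutable version with a content digest and a lineage pointer to the step that produced it. This is an illustration of the concept, not ZenML&rsquo;s actual implementation; all names below are hypothetical.</p>

```python
import hashlib
import json

class ArtifactRegistry:
    """Toy append-only registry: each save creates a new immutable version."""

    def __init__(self):
        self._versions = {}  # artifact name -> list of version records

    def save(self, name: str, payload: dict, producer_step: str) -> int:
        # Content digest lets you detect whether two versions are identical.
        digest = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        entries = self._versions.setdefault(name, [])
        entries.append({
            "version": len(entries) + 1,
            "digest": digest,
            "producer_step": producer_step,  # lineage: which step made it
        })
        return entries[-1]["version"]

    def get(self, name: str, version: int) -> dict:
        # Old versions stay queryable forever -> reproducibility.
        return self._versions[name][version - 1]

registry = ArtifactRegistry()
v1 = registry.save("iris_data", {"rows": 150}, producer_step="load_data")
v2 = registry.save("iris_data", {"rows": 150, "cleaned": True},
                   producer_step="clean_data")
```

<p>The real artifact store adds typed serialization via materializers and durable object storage, but the core contract is the same: versions are never overwritten, and each one remembers its producer.</p>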
<h2 id="getting-started-with-zenml-installation-and-your-first-pipeline">Getting Started with ZenML: Installation and Your First Pipeline</h2>
<p>Getting ZenML running locally takes under five minutes, making it one of the more developer-friendly MLOps frameworks available in 2026. The installation is a standard pip install, and the default stack uses local orchestration and local file storage — no cloud credentials or Docker setup required for your first pipeline.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install zenml
</span></span><span style="display:flex;"><span>zenml init          <span style="color:#75715e"># Initialize ZenML in your project directory</span>
</span></span><span style="display:flex;"><span>zenml up            <span style="color:#75715e"># Start the ZenML server (optional, for dashboard access)</span>
</span></span></code></pre></div><p>After init, your project directory contains a <code>.zen</code> folder that marks it as a ZenML repository. The default stack is automatically created — it uses the local orchestrator and stores artifacts in <code>~/.config/zenml/local_stores/</code>. You can inspect your stack at any time:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>zenml stack list
</span></span><span style="display:flex;"><span>zenml stack describe
</span></span></code></pre></div><p>To run the pipeline you defined in the previous section:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;__main__&#34;</span>:
</span></span><span style="display:flex;"><span>    iris_pipeline()
</span></span></code></pre></div><p>Running this produces a structured output showing each step executing, with artifact URIs and run IDs printed to the console. The ZenML dashboard (accessible at <code>http://localhost:8080</code> after <code>zenml up</code>) shows the full DAG, artifact lineage, and step-level metadata.</p>
<p><strong>Step caching</strong> is enabled by default. If you run the same pipeline twice with identical inputs and code, ZenML skips steps whose outputs haven&rsquo;t changed, retrieving cached artifacts instead. For expensive training steps or LLM API calls, this translates to real cost savings — teams report cutting iteration costs by 40-60% on data-heavy pipelines simply by letting ZenML cache intermediate results. Disable caching on a per-step basis with <code>@step(enable_cache=False)</code> when you need fresh outputs.</p>
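<p>Conceptually, caching is a content-addressed lookup. The hedged sketch below shows how such a cache key <em>might</em> be derived from step source, input digests, and configuration; ZenML&rsquo;s real key derivation is internal to the framework, and the helper here is purely illustrative.</p>

```python
import hashlib
import json

def cache_key(step_source: str, input_digests: list, config: dict) -> str:
    """Combine step source code, input artifact digests, and configuration
    into one content-addressed key; any change yields a different key."""
    payload = json.dumps(
        {"source": step_source, "inputs": input_digests, "config": config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

SOURCE = "def train(data): ..."  # stand-in for the step's source code
k1 = cache_key(SOURCE, ["sha256:abc"], {"max_iter": 200})
k2 = cache_key(SOURCE, ["sha256:abc"], {"max_iter": 200})  # identical -> hit
k3 = cache_key(SOURCE, ["sha256:abc"], {"max_iter": 500})  # config changed

assert k1 == k2  # same code, inputs, and config: safe to reuse outputs
assert k1 != k3  # changed config invalidates the cache
```

<p>The practical consequence: editing a hyperparameter re-runs only training, while untouched upstream steps resolve to cached artifacts.</p>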
<h3 id="handling-custom-data-types-with-materializers">Handling Custom Data Types with Materializers</h3>
<p>ZenML&rsquo;s built-in materializers handle <code>pd.DataFrame</code>, NumPy arrays, PyTorch tensors, sklearn models, and more. For custom types, you define a materializer by subclassing <code>BaseMaterializer</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> zenml.materializers.base_materializer <span style="color:#f92672">import</span> BaseMaterializer
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> joblib
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.linear_model <span style="color:#f92672">import</span> LogisticRegression
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">SklearnModelMaterializer</span>(BaseMaterializer):
</span></span><span style="display:flex;"><span>    ASSOCIATED_TYPES <span style="color:#f92672">=</span> (LogisticRegression,)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">load</span>(self, data_type):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> joblib<span style="color:#f92672">.</span>load(self<span style="color:#f92672">.</span>uri <span style="color:#f92672">+</span> <span style="color:#e6db74">&#34;/model.joblib&#34;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">save</span>(self, model):
</span></span><span style="display:flex;"><span>        joblib<span style="color:#f92672">.</span>dump(model, self<span style="color:#f92672">.</span>uri <span style="color:#f92672">+</span> <span style="color:#e6db74">&#34;/model.joblib&#34;</span>)
</span></span></code></pre></div><p>Custom materializers are registered globally or per-step, giving you precise control over serialization for any data type your pipeline produces.</p>
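<p>The <code>ASSOCIATED_TYPES</code> mechanism is essentially type-based dispatch: when a step produces an output, the framework selects a materializer whose associated types match the value. A stdlib-only sketch of that dispatch pattern (illustrative only; the registry and materializer classes below are hypothetical, not ZenML internals):</p>

```python
class BaseMaterializer:
    """Minimal stand-in for a materializer base class."""
    ASSOCIATED_TYPES: tuple = ()

class DictMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (dict,)

class ListMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (list,)

# Registration order matters: the first matching materializer wins.
REGISTRY = [DictMaterializer, ListMaterializer]

def resolve_materializer(value):
    """Pick the first registered materializer for the value's type."""
    for cls in REGISTRY:
        if isinstance(value, cls.ASSOCIATED_TYPES):
            return cls
    raise TypeError(f"No materializer registered for {type(value).__name__}")

assert resolve_materializer({"accuracy": 0.97}) is DictMaterializer
assert resolve_materializer([1, 2, 3]) is ListMaterializer
```

<p>This is why returning a <code>pd.DataFrame</code> from a step &ldquo;just works&rdquo;: a materializer for that type is already registered, and dispatch is automatic.</p>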
<h2 id="building-production-stacks-orchestrators-artifact-stores-and-trackers">Building Production Stacks: Orchestrators, Artifact Stores, and Trackers</h2>
<p>A production ZenML stack typically combines three components: an orchestrator that schedules and runs pipeline DAGs at scale, an artifact store backed by cloud object storage for durability and sharing across team members, and an experiment tracker that captures hyperparameters, metrics, and model versions. ZenML ships with integrations for all of these out of the box, and swapping components requires only CLI commands — not code changes. This is what makes ZenML&rsquo;s anti-lock-in guarantee practical rather than theoretical: in 2026, teams use a local stack for development (zero config, runs in seconds) and a cloud stack for production (SageMaker + S3 + MLflow, or Vertex AI + GCS + W&amp;B), with no pipeline code changes between environments. The 63% of organizations that report high integration complexity across ML systems typically suffer from infrastructure coupling — ZenML&rsquo;s stack model breaks that coupling by design. Each stack component is independently versioned, registered, and replaceable. You can upgrade your artifact store, switch orchestrators, or add a model deployer without touching any <code>@step</code> or <code>@pipeline</code> code.</p>
<h3 id="orchestrators-local--airflow--kubernetes">Orchestrators: Local → Airflow → Kubernetes</h3>
<table>
  <thead>
      <tr>
          <th>Orchestrator</th>
          <th>Best For</th>
          <th>ZenML Integration</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Local</td>
          <td>Development, testing</td>
          <td>Built-in, zero config</td>
      </tr>
      <tr>
          <td>Airflow</td>
          <td>Teams already using Airflow</td>
          <td><code>zenml-airflow</code></td>
      </tr>
      <tr>
          <td>Kubeflow Pipelines</td>
          <td>K8s-native teams</td>
          <td><code>zenml-kubeflow</code></td>
      </tr>
      <tr>
          <td>Vertex AI Pipelines</td>
          <td>GCP-first teams</td>
          <td><code>zenml-gcp</code></td>
      </tr>
      <tr>
          <td>SageMaker Pipelines</td>
          <td>AWS-first teams</td>
          <td><code>zenml-aws</code></td>
      </tr>
      <tr>
          <td>Tekton</td>
          <td>GitOps-native teams</td>
          <td><code>zenml-tekton</code></td>
      </tr>
  </tbody>
</table>
<p>The key point: the same <code>@pipeline</code> function runs on all of these. Switching orchestrators is a stack operation, not a code change. This is ZenML&rsquo;s foundational design decision.</p>
<h3 id="building-an-aws-stack">Building an AWS Stack</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Install AWS integration</span>
</span></span><span style="display:flex;"><span>pip install <span style="color:#e6db74">&#34;zenml[aws]&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Register components</span>
</span></span><span style="display:flex;"><span>zenml artifact-store register s3_store <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --flavor<span style="color:#f92672">=</span>s3 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --path<span style="color:#f92672">=</span>s3://my-bucket/zenml-artifacts
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>zenml orchestrator register sagemaker_orchestrator <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --flavor<span style="color:#f92672">=</span>sagemaker <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --execution_role_arn<span style="color:#f92672">=</span>arn:aws:iam::123456789:role/SageMakerRole
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>zenml experiment-tracker register mlflow_tracker <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --flavor<span style="color:#f92672">=</span>mlflow <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --tracking_uri<span style="color:#f92672">=</span>https://my-mlflow-server.com
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Assemble and activate the stack</span>
</span></span><span style="display:flex;"><span>zenml stack register production_aws <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    -o sagemaker_orchestrator <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    -a s3_store <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    -e mlflow_tracker
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>zenml stack set production_aws
</span></span></code></pre></div><p>Now <code>iris_pipeline()</code> runs on SageMaker with artifacts in S3 and metrics in MLflow — with zero changes to the pipeline code. To switch to GCP, register GCP components and set a GCP stack. Your pipeline code is untouched.</p>
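<p>Under the hood, the stack swap is ordinary dependency injection lifted to infrastructure. A toy sketch of why the pipeline body never changes (the dict-based &ldquo;stacks&rdquo; here are an illustration, not how ZenML actually wires components):</p>

```python
# Toy model: a "stack" is a bundle of callables the runner injects.
def run_pipeline(stack: dict) -> str:
    data = stack["orchestrator"]("load_data")   # where the step executes
    stack["artifact_store"]("iris_data", data)  # where its output lands
    return data

local_stack = {
    "orchestrator": lambda step: f"{step} ran locally",
    "artifact_store": lambda name, obj: None,  # pretend: local disk write
}
aws_stack = {
    "orchestrator": lambda step: f"{step} ran on SageMaker",
    "artifact_store": lambda name, obj: None,  # pretend: S3 write
}

# Identical pipeline code, different infrastructure:
print(run_pipeline(local_stack))  # -> load_data ran locally
print(run_pipeline(aws_stack))    # -> load_data ran on SageMaker
```

<p>ZenML makes the injection implicit: the active stack is ambient state set by <code>zenml stack set</code>, so the pipeline function never even sees an infrastructure parameter.</p>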
<h3 id="artifact-store-best-practices">Artifact Store Best Practices</h3>
<p>Configure artifact stores with versioning enabled on your cloud buckets. ZenML&rsquo;s artifact lineage graph becomes invaluable when debugging production issues: you can trace any model prediction back through the exact training data version, preprocessing parameters, and code commit that produced it. Enable S3 versioning or GCS object versioning to ensure artifact immutability.</p>
<h2 id="zenml-vs-kubeflow-vs-mlflow-which-mlops-tool-do-you-actually-need">ZenML vs. Kubeflow vs. MLflow: Which MLOps Tool Do You Actually Need?</h2>
<p>Choosing between ZenML, Kubeflow Pipelines, and MLflow is one of the most common decisions ML teams face in 2026, and the right answer depends heavily on your team&rsquo;s existing infrastructure, operational maturity, and growth trajectory. ZenML positions itself as a portability layer that can use Kubeflow or MLflow internally, which changes the comparison significantly. Understanding where each tool excels prevents costly architecture mistakes that teams typically discover only after six months of implementation.</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>ZenML</th>
          <th>Kubeflow Pipelines</th>
          <th>MLflow</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Primary Role</td>
          <td>Portable pipeline framework</td>
          <td>K8s-native orchestrator</td>
          <td>Experiment tracking + model registry</td>
      </tr>
      <tr>
          <td>Infrastructure Required</td>
          <td>None (local default)</td>
          <td>Kubernetes cluster + platform team</td>
          <td>MLflow server</td>
      </tr>
      <tr>
          <td>Vendor Lock-In Risk</td>
          <td>Very Low (stack swap)</td>
          <td>High (K8s + KFP-specific DSL)</td>
          <td>Medium (server dependency)</td>
      </tr>
      <tr>
          <td>Learning Curve</td>
          <td>Low-Medium</td>
          <td>High</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Best For</td>
          <td>Multi-cloud, evolving infra</td>
          <td>Dedicated K8s teams</td>
          <td>Tracking-only needs</td>
      </tr>
      <tr>
          <td>Pipeline Portability</td>
          <td>Excellent</td>
          <td>Poor (KFP-specific)</td>
          <td>N/A (not an orchestrator)</td>
      </tr>
      <tr>
          <td>LLMOps Support</td>
          <td>Yes (2025+)</td>
          <td>Limited</td>
          <td>Via plugins</td>
      </tr>
      <tr>
          <td>Open Source License</td>
          <td>Apache 2.0</td>
          <td>Apache 2.0</td>
          <td>Apache 2.0</td>
      </tr>
  </tbody>
</table>
<p><strong>Choose ZenML when:</strong> Your team doesn&rsquo;t have a dedicated platform team. You want to start local and graduate to cloud. You need to switch cloud providers or orchestrators without rewriting pipelines. You&rsquo;re building both classical ML and LLM workflows in the same organization.</p>
<p><strong>Choose Kubeflow when:</strong> You have a dedicated K8s platform team. Your organization is fully committed to Kubernetes and isn&rsquo;t switching. You need the specific features of Kubeflow&rsquo;s UI (pipeline visualizations, hyperparameter tuning via Katib).</p>
<p><strong>Choose MLflow when:</strong> You only need experiment tracking and a model registry, not full pipeline orchestration. You&rsquo;re adding observability to an existing pipeline system and don&rsquo;t need to change the orchestration layer.</p>
<p><strong>The pragmatic answer for most teams:</strong> Use ZenML with MLflow as the experiment tracker. You get ZenML&rsquo;s pipeline portability and artifact tracking combined with MLflow&rsquo;s mature model registry and UI. This combination covers 90% of production MLOps requirements.</p>
<h2 id="advanced-zenml-features-caching-model-control-plane-and-llmops">Advanced ZenML Features: Caching, Model Control Plane, and LLMOps</h2>
<p>ZenML&rsquo;s advanced feature set in 2026 goes well beyond basic pipeline orchestration, addressing the full lifecycle of ML models in production and extending into LLMOps for teams building AI applications with large language models. These capabilities distinguish ZenML from simpler orchestration tools and represent the framework&rsquo;s evolution from a pipeline runner into a complete MLOps platform. Three features in particular drive the most business value: step caching (which directly reduces cloud compute and API costs), the Model Control Plane (which gives teams a single registry with full lineage across the entire model lifecycle), and LLMOps support (which extends the same reproducibility guarantees from classical ML to fine-tuning and evaluation of large language models). Teams that adopt ZenML&rsquo;s caching and MCP together typically cut their retraining iteration cycle from days to hours — not by making training faster, but by skipping work that doesn&rsquo;t need to be repeated and surfacing the exact lineage of every model artifact that reaches production.</p>
<h3 id="step-caching-for-cost-and-speed">Step Caching for Cost and Speed</h3>
<p>ZenML&rsquo;s caching system computes a cache key for each step based on the step code, input artifacts, and step configuration. If the cache key matches a previous run, the step is skipped and its outputs are retrieved from the artifact store. For a typical training pipeline with expensive data preprocessing, this means:</p>
<ul>
<li>First run: 45 minutes (preprocessing + training + evaluation)</li>
<li>Second run (same data, different model hyperparameters): 12 minutes (preprocessing cached, training re-runs)</li>
<li>Third run (same hyperparameters): ~30 seconds (all steps cached)</li>
</ul>
<p>For LLM pipelines where individual steps may cost $5-50 in API calls, caching transforms the economics of iteration.</p>
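<p>The economics are easy to check with hypothetical numbers: say each full run incurs $30 of LLM API calls, of which $25 is re-computable preprocessing and evaluation and $5 is genuinely new work, over 20 iterations.</p>

```python
# Hypothetical cost model (illustrative numbers, not benchmarks):
# each full run costs $30; $25 of that is cacheable, $5 is new work.
iterations = 20
cost_without_cache = iterations * 30.0
cost_with_cache = 30.0 + (iterations - 1) * 5.0  # first run full, rest cached
savings = 1 - cost_with_cache / cost_without_cache
print(f"${cost_without_cache:.0f} vs ${cost_with_cache:.0f} ({savings:.0%} saved)")
# -> $600 vs $125 (79% saved)
```

<p>Your actual ratios depend on how much of each run is cacheable, but the shape of the curve is the same: the more expensive the upstream steps, the faster caching pays for itself.</p>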
<h3 id="model-control-plane">Model Control Plane</h3>
<p>ZenML&rsquo;s Model Control Plane (MCP) — introduced in 2024 and significantly expanded in 2026 — is a model registry that understands the full lineage of every model version: which pipeline produced it, which artifact versions it consumed, which metrics it achieved, and which deployments are currently serving it. Access it via the <code>@step(model=Model(name=&quot;iris_classifier&quot;, version=&quot;production&quot;))</code> decorator or the Python SDK:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> zenml.model.model <span style="color:#f92672">import</span> Model
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@step</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">register_model</span>(accuracy: float) <span style="color:#f92672">-&gt;</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span>    model <span style="color:#f92672">=</span> Model(name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;iris_classifier&#34;</span>)
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">.</span>log_metadata({<span style="color:#e6db74">&#34;accuracy&#34;</span>: accuracy})
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> accuracy <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">0.95</span>:
</span></span><span style="display:flex;"><span>        model<span style="color:#f92672">.</span>set_stage(<span style="color:#e6db74">&#34;production&#34;</span>, force<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span></code></pre></div><h3 id="llmops-and-agentic-ai-pipelines">LLMOps and Agentic AI Pipelines</h3>
<p>ZenML&rsquo;s 2025-2026 releases added first-class LLMOps support. You can now build pipelines that fine-tune models, evaluate them against benchmarks, and deploy them — using the same <code>@step/@pipeline</code> decorator pattern as classical ML. The artifact tracking system handles prompt templates, evaluation datasets, and model weights with the same lineage guarantees.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#a6e22e">@step</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">evaluate_llm</span>(model_id: str, eval_dataset: pd<span style="color:#f92672">.</span>DataFrame) <span style="color:#f92672">-&gt;</span> dict:
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Run evals against your fine-tuned model</span>
</span></span><span style="display:flex;"><span>    results <span style="color:#f92672">=</span> run_evals(model_id, eval_dataset)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> {<span style="color:#e6db74">&#34;pass_rate&#34;</span>: results<span style="color:#f92672">.</span>pass_rate, <span style="color:#e6db74">&#34;avg_latency&#34;</span>: results<span style="color:#f92672">.</span>avg_latency}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@pipeline</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">llm_fine_tuning_pipeline</span>():
</span></span><span style="display:flex;"><span>    dataset <span style="color:#f92672">=</span> prepare_eval_dataset()
</span></span><span style="display:flex;"><span>    model_id <span style="color:#f92672">=</span> fine_tune_model(dataset)
</span></span><span style="display:flex;"><span>    metrics <span style="color:#f92672">=</span> evaluate_llm(model_id, dataset)
</span></span><span style="display:flex;"><span>    gate_deployment(metrics)
</span></span></code></pre></div><p>ZenML also integrates with agentic frameworks — you can run CrewAI or LangGraph agent workflows as ZenML steps, giving you reproducibility and monitoring for agent pipelines that would otherwise be opaque.</p>
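<p>The <code>gate_deployment</code> step referenced in the pipeline above isn&rsquo;t defined in this guide. A plausible sketch is a plain threshold check that raises when eval metrics regress, which fails the pipeline run before a bad model ships (the thresholds and function body below are illustrative assumptions; in ZenML you would decorate it with <code>@step</code>):</p>

```python
# Illustrative thresholds -- tune these to your own eval suite.
PASS_RATE_FLOOR = 0.90
LATENCY_CEILING_S = 2.0

def gate_deployment(metrics: dict) -> bool:
    """Raise (failing the pipeline run) if the fine-tuned model regresses."""
    if metrics["pass_rate"] < PASS_RATE_FLOOR:
        raise ValueError(
            f"pass_rate {metrics['pass_rate']:.2f} below {PASS_RATE_FLOOR}"
        )
    if metrics["avg_latency"] > LATENCY_CEILING_S:
        raise ValueError(
            f"avg_latency {metrics['avg_latency']:.2f}s above {LATENCY_CEILING_S}s"
        )
    return True

assert gate_deployment({"pass_rate": 0.94, "avg_latency": 1.3})
```

<p>Raising inside a step is the simplest gating mechanism: the orchestrator marks the run failed, and nothing downstream (deployment included) executes.</p>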
<h2 id="deploying-zenml-to-cloud-aws-gcp-azure-without-lock-in">Deploying ZenML to Cloud (AWS, GCP, Azure) Without Lock-In</h2>
<p>Cloud deployment with ZenML follows a consistent pattern regardless of provider: register cloud-specific stack components, assemble them into a named stack, and activate it. Teams report completing full cloud migrations in days rather than the months typical of tightly coupled MLOps systems. This is ZenML&rsquo;s most important practical guarantee, and it works in production: the integration complexity that 63% of organizations report across their ML systems is precisely the coupling the stack swap removes, which is how teams migrate between cloud providers without touching pipeline code. The mechanics are always the same: <code>pip install &quot;zenml[&lt;provider&gt;]&quot;</code> for provider-specific integrations, register the components via the ZenML CLI, and then <code>zenml stack set &lt;name&gt;</code>. From that point forward, every pipeline run targets that infrastructure with zero code changes. The three major cloud providers — AWS (SageMaker + S3), GCP (Vertex AI + GCS), and Azure (Azure ML + Azure Blob) — each have mature ZenML integrations with full step containerization support. Because ZenML wraps each step in a Docker container for cloud execution, your exact Python environment is reproducible across runs and across providers, eliminating the environment drift that causes most cloud deployment failures.</p>
<h3 id="aws-deployment-pattern">AWS Deployment Pattern</h3>
<p>For AWS, the recommended production stack combines SageMaker Pipelines (orchestrator), S3 (artifact store), and MLflow (self-hosted on EC2 or via SageMaker&rsquo;s managed MLflow) as the experiment tracker. The snippet below assumes those components are already registered and adds an ECR container registry for step containerization:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install <span style="color:#e6db74">&#34;zenml[aws,mlflow,sklearn]&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>zenml container-registry register ecr_registry <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --flavor<span style="color:#f92672">=</span>aws <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --uri<span style="color:#f92672">=</span>123456789.dkr.ecr.us-east-1.amazonaws.com
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>zenml stack register aws_production <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    -o sagemaker_orchestrator <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    -a s3_artifact_store <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    -e mlflow_tracker <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    -c ecr_registry
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>zenml stack set aws_production
</span></span><span style="display:flex;"><span>zenml stack up  <span style="color:#75715e"># Provisions resources required by the active stack&#39;s components</span>
</span></span></code></pre></div><h3 id="gcp-deployment-pattern">GCP Deployment Pattern</h3>
<p>For GCP, the production stack uses Vertex AI Pipelines (orchestrator) and Google Cloud Storage (artifact store):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install <span style="color:#e6db74">&#34;zenml[gcp,mlflow]&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>zenml orchestrator register vertex_orchestrator <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --flavor<span style="color:#f92672">=</span>vertex <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --project<span style="color:#f92672">=</span>my-gcp-project <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --location<span style="color:#f92672">=</span>us-central1
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>zenml artifact-store register gcs_store <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --flavor<span style="color:#f92672">=</span>gcp <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --path<span style="color:#f92672">=</span>gs://my-bucket/zenml
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>zenml stack register gcp_production <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    -o vertex_orchestrator <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    -a gcs_store <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    -e mlflow_tracker
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>zenml stack set gcp_production
</span></span></code></pre></div><p>To migrate the same pipeline from the AWS stack to the GCP stack: <code>zenml stack set gcp_production</code>, then re-run. No code changes. This is the lock-in escape hatch that ZenML exists to provide.</p>
<h3 id="cicd-integration">CI/CD Integration</h3>
<p>ZenML pipelines integrate naturally with GitHub Actions or GitLab CI. The recommended pattern triggers a pipeline run on every merge to main:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#75715e"># .github/workflows/train.yml</span>
</span></span><span style="display:flex;"><span>- <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Run training pipeline</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">run</span>: |<span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    zenml stack set ${{ env.ZENML_STACK }}
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    python pipelines/train.py</span>
</span></span></code></pre></div><p>The <code>ZENML_STACK</code> environment variable switches between staging and production stacks without touching the pipeline code.</p>
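<p>Inside the entrypoint script, the stack handling can stay entirely outside the pipeline logic. A minimal sketch (the <code>resolve_stack</code> helper and the commented pipeline import are illustrative, not part of ZenML&rsquo;s API):</p>

```python
import os

def resolve_stack(default: str = "default") -> str:
    # CI sets ZENML_STACK; fall back to the local default stack otherwise.
    return os.environ.get("ZENML_STACK", default)

if __name__ == "__main__":
    stack = resolve_stack()
    # The active stack was already selected in CI via `zenml stack set`;
    # the pipeline code itself never references infrastructure directly.
    # from pipelines.training import training_pipeline  # hypothetical import
    # training_pipeline()
    print(f"Running training pipeline on stack: {stack}")
```

<p>The pipeline module never reads the variable itself; only the surrounding automation does, which is what keeps the code stack-agnostic.</p>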
<h2 id="zenml-pro-vs-open-source-when-to-upgrade">ZenML Pro vs. Open Source: When to Upgrade</h2>
<p>ZenML Pro is the managed cloud offering that adds team collaboration, RBAC, a hosted dashboard, and managed infrastructure provisioning on top of the open-source core. The open-source version is genuinely production-ready for single teams and small organizations — ZenML does not use open-core tactics that cripple the OSS offering. The upgrade decision comes down to team size and operational overhead tolerance.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>Open Source</th>
          <th>ZenML Pro</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Core pipeline framework</td>
          <td>✓</td>
          <td>✓</td>
      </tr>
      <tr>
          <td>All stack integrations</td>
          <td>✓</td>
          <td>✓</td>
      </tr>
      <tr>
          <td>Artifact tracking</td>
          <td>✓</td>
          <td>✓</td>
      </tr>
      <tr>
          <td>Model Control Plane</td>
          <td>✓</td>
          <td>✓</td>
      </tr>
      <tr>
          <td>Self-hosted server</td>
          <td>✓</td>
          <td>✓</td>
      </tr>
      <tr>
          <td>Hosted dashboard</td>
          <td>—</td>
          <td>✓</td>
      </tr>
      <tr>
          <td>RBAC and team management</td>
          <td>—</td>
          <td>✓</td>
      </tr>
      <tr>
          <td>Managed infra provisioning</td>
          <td>—</td>
          <td>✓</td>
      </tr>
      <tr>
          <td>SLA and enterprise support</td>
          <td>—</td>
          <td>✓</td>
      </tr>
      <tr>
          <td>SSO/SAML integration</td>
          <td>—</td>
          <td>✓</td>
      </tr>
  </tbody>
</table>
<p><strong>Stay on open source if:</strong> You have one ML team. You&rsquo;re comfortable hosting and maintaining the ZenML server. You don&rsquo;t need cross-team artifact sharing with fine-grained access control.</p>
<p><strong>Upgrade to Pro if:</strong> Multiple ML teams need shared artifact stores and model registries with RBAC. You want SLA-backed enterprise support. You&rsquo;d rather pay $X/month than maintain the ZenML server infrastructure.</p>
<p>The open-source ZenML server can be self-hosted on any Kubernetes cluster or as a Docker container. Most teams start with the OSS server and graduate to Pro once the team grows beyond 5-10 ML engineers.</p>
<h2 id="production-best-practices-and-common-pitfalls-to-avoid">Production Best Practices and Common Pitfalls to Avoid</h2>
<p>Reaching production with ZenML requires getting the framework right from the start: teams that adopt ZenML early in a project reach production 3-5x faster than those that retrofit MLOps tooling onto an existing notebook-driven workflow. This speed advantage is no accident — it comes from enforcing the right patterns from day one rather than reconstructing reproducibility and lineage after the fact. Given that 85% of ML models never reach production and 45% fail specifically due to poor monitoring and retraining infrastructure, the common thread in successful ZenML deployments is disciplined adherence to a few non-negotiable patterns: deterministic steps, typed outputs, versioned pipeline runs correlated with git commits, and data validation before every retraining run. These are not ZenML-specific best practices — they are MLOps fundamentals that ZenML&rsquo;s architecture makes easier to enforce consistently across a growing team. The pitfalls below are the specific ways teams break these fundamentals when using ZenML, drawn from production deployments across fintech, healthcare, and e-commerce. Addressing them early prevents the silent failures and debugging marathons that consume engineering time in less structured pipelines.</p>
<h3 id="design-for-reproducibility-from-day-one">Design for Reproducibility from Day One</h3>
<p>Every step output should be fully deterministic given its inputs. Avoid side effects in steps — don&rsquo;t write to databases, send API calls with side effects, or depend on global mutable state. ZenML&rsquo;s caching assumes determinism; if a step has side effects, disable caching explicitly with <code>@step(enable_cache=False)</code>.</p>
<p><strong>Common pitfall:</strong> Using <code>datetime.now()</code> inside a step without disabling cache. ZenML caches the output on the first run, and subsequent runs return the stale timestamp instead of the current time.</p>
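<p>The failure mode is easy to reproduce with plain Python memoization standing in for ZenML&rsquo;s step cache (an analogy, not ZenML code):</p>

```python
import time
from functools import lru_cache

# lru_cache stands in for ZenML's step cache: identical inputs -> cached output.
@lru_cache(maxsize=None)
def timestamped_step(n: int):
    # Hidden side effect the cache key cannot see: wall-clock time.
    return n * 2, time.time()

first = timestamped_step(21)
time.sleep(0.05)
second = timestamped_step(21)  # cache hit: the stale timestamp comes back
assert first == second         # value AND timestamp are identical
```

<p>The doubled value is correct either way, so the bug hides until something downstream depends on the timestamp being fresh.</p>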
<h3 id="version-your-pipelines-alongside-your-models">Version Your Pipelines Alongside Your Models</h3>
<p>ZenML tags pipeline runs with the git commit SHA only if you configure it, so embed the SHA in the run name: <code>pipeline.run(run_name=f&quot;train-{git_sha}-{timestamp}&quot;)</code>. This gives you a direct link between model artifacts and the exact code that produced them — essential for debugging production issues six months after training.</p>
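<p>A small helper can assemble such run names. The function names below are illustrative, and the git lookup falls back gracefully when run outside a checkout:</p>

```python
import subprocess
from datetime import datetime, timezone

def current_git_sha(fallback: str = "nogit") -> str:
    # Short SHA of HEAD; use the fallback when git or a repo is unavailable.
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return fallback

def build_run_name(sha: str, when: datetime) -> str:
    # Deterministic, sortable run name: train-<sha>-<UTC timestamp>.
    return f"train-{sha}-{when:%Y%m%d-%H%M%S}"

name = build_run_name(current_git_sha(), datetime.now(timezone.utc))
```

<p>In CI, the SHA can come from the runner&rsquo;s environment instead, which avoids a dependency on the git binary inside the step container.</p>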
<h3 id="use-typed-step-outputs">Use Typed Step Outputs</h3>
<p>Always annotate step outputs with Python type hints. ZenML uses these to select the correct materializer and to enable type checking at the framework level. Untyped steps fall back to pickle serialization, which breaks cross-Python-version compatibility and defeats artifact portability.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Bad: ZenML falls back to pickle</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@step</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">train</span>():
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> model
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Good: ZenML selects the sklearn materializer</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@step</span>  
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">train</span>() <span style="color:#f92672">-&gt;</span> sklearn<span style="color:#f92672">.</span>base<span style="color:#f92672">.</span>BaseEstimator:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> model
</span></span></code></pre></div><h3 id="monitor-data-drift-in-production-pipelines">Monitor Data Drift in Production Pipelines</h3>
<p>Configure a data validator component (Great Expectations or Evidently) in your production stack. Run validation as the first step of every retraining pipeline — if data drift exceeds thresholds, fail early rather than training on bad data and deploying a degraded model.</p>
<h3 id="set-up-alerters-for-pipeline-failures">Set Up Alerters for Pipeline Failures</h3>
<p>ZenML&rsquo;s alerter component integrates with Slack. Register a Slack alerter in your production stack so pipeline failures trigger immediate notifications rather than silently failing:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>zenml alerter register slack_alerter <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --flavor<span style="color:#f92672">=</span>slack <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --slack_token<span style="color:#f92672">=</span>$SLACK_BOT_TOKEN <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --default_slack_channel_id<span style="color:#f92672">=</span>C0XXXXXXX
</span></span></code></pre></div><h3 id="common-pitfalls-summary">Common Pitfalls Summary</h3>
<table>
  <thead>
      <tr>
          <th>Pitfall</th>
          <th>Consequence</th>
          <th>Fix</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Non-deterministic steps with cache enabled</td>
          <td>Stale outputs silently returned</td>
          <td><code>@step(enable_cache=False)</code></td>
      </tr>
      <tr>
          <td>No type hints on step outputs</td>
          <td>Pickle serialization, portability breaks</td>
          <td>Always annotate return types</td>
      </tr>
      <tr>
          <td>Hardcoded infrastructure in steps</td>
          <td>Breaks stack portability</td>
          <td>Use step configs, not hardcoded URIs</td>
      </tr>
      <tr>
          <td>Running as root in step containers</td>
          <td>Security vulnerability</td>
          <td>Use non-root Docker base images</td>
      </tr>
      <tr>
          <td>Skipping data validation</td>
          <td>Training on drifted data</td>
          <td>Add Great Expectations to production stack</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="faq">FAQ</h2>
<p><strong>Q: Can I use ZenML if I&rsquo;m not on Kubernetes?</strong>
Yes. ZenML&rsquo;s default stack uses local execution with no Docker or Kubernetes required. You can graduate to cloud orchestrators later by registering new stack components — the pipeline code never changes. Many teams run ZenML on a single VM with the local or Airflow orchestrator indefinitely.</p>
<p><strong>Q: Does ZenML replace MLflow or Weights &amp; Biases?</strong>
No — ZenML integrates with them. ZenML is a pipeline framework; MLflow and Weights &amp; Biases are experiment trackers. You add MLflow or W&amp;B as the experiment tracker component in your ZenML stack, and ZenML automatically logs runs to your tracking server. You get the best of both: ZenML&rsquo;s portability plus your preferred tracking UI.</p>
<p><strong>Q: How does ZenML handle large datasets that don&rsquo;t fit in memory?</strong>
ZenML passes artifact URIs between steps, not the data itself. Each step receives a reference to the artifact in the artifact store and materializes (loads) only the data it needs. For very large datasets, implement a custom materializer that loads data lazily or in chunks. ZenML doesn&rsquo;t buffer step outputs through the orchestrator — data lives in S3/GCS, not in the pipeline executor&rsquo;s memory.</p>
<p><strong>Q: Is ZenML production-ready for enterprise use?</strong>
Yes. ZenML is Apache 2.0 licensed, supports RBAC (in Pro), integrates with existing enterprise tooling (Airflow, SageMaker, Vertex AI), and is used in production by companies across fintech, healthcare, and e-commerce. The open-source server can be deployed on-premise with no data leaving your infrastructure.</p>
<p><strong>Q: How long does it take to migrate an existing ML project to ZenML?</strong>
For a typical training script, adding ZenML decorators and restructuring code into steps takes 1-3 days. The bigger investment is configuring production stacks (1-2 weeks) and adding proper artifact tracking and monitoring (1-2 weeks). Most teams reach a working production pipeline within a month. The migration pays back within 2-3 retraining cycles through caching savings and reduced debugging time.</p>
]]></content:encoded></item></channel></rss>