<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Data Privacy on RockB</title><link>https://baeseokjae.github.io/tags/data-privacy/</link><description>Recent content in Data Privacy on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 13 Apr 2026 12:00:00 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/data-privacy/index.xml" rel="self" type="application/rss+xml"/><item><title>AI Coding Tool Data Privacy Comparison 2026: Trae Telemetry vs Open-Source vs Enterprise</title><link>https://baeseokjae.github.io/posts/ai-coding-tool-data-privacy-comparison-2026/</link><pubDate>Mon, 13 Apr 2026 12:00:00 +0000</pubDate><guid>https://baeseokjae.github.io/posts/ai-coding-tool-data-privacy-comparison-2026/</guid><description>Compare AI coding tool privacy in 2026, from Trae telemetry concerns to local agents, Cursor, Copilot, and Tabnine.</description><content:encoded><![CDATA[<p>AI coding tool privacy in 2026 comes down to three questions: what code context leaves your machine, who can use it for training, and whether telemetry can be audited or disabled. I&rsquo;ve found that brand claims matter less than the actual architecture, contract terms, and default data flows.</p>
<h2 id="what-is-the-short-answer-for-ai-coding-tool-privacy-in-2026">What Is the Short Answer for AI Coding Tool Privacy in 2026?</h2>
<p>If you are working on throwaway code, almost any AI coding assistant is acceptable as long as you do not paste secrets, tokens, customer data, or unreleased product logic into the prompt. If you are working on proprietary source code, the default should be stricter: use an enterprise plan with no-training commitments and admin controls, or use an open-source agent wired to a local or self-hosted model endpoint.</p>
<p>The uncomfortable middle ground is the free or lightly documented AI IDE. Trae is the current example I would treat carefully. Its US privacy policy, last updated January 22, 2026, says the product may collect prompts, text and code, file uploads, embeddings, metadata, technical data, and usage data. It also says AI chatbot inputs may be shared with LLM providers and that information may be used to develop, train, and improve technology. That is not a vague privacy footnote. That is the operating boundary.</p>
<p>In practice, I split AI coding tools into three privacy models:</p>
<table>
  <thead>
      <tr>
          <th>Privacy model</th>
          <th>Examples</th>
          <th>Main risk</th>
          <th>Best fit</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vendor-hosted AI IDE</td>
          <td>Trae</td>
          <td>Broad collection, telemetry ambiguity, hosted model routing</td>
          <td>Low-sensitivity experiments</td>
      </tr>
      <tr>
          <td>Open-source or local agent</td>
          <td>Cline, Continue, Aider, OpenCode, Hermes</td>
          <td>Privacy depends on endpoint and configuration</td>
          <td>Developers who can operate local or trusted inference</td>
      </tr>
      <tr>
          <td>Enterprise managed assistant</td>
          <td>GitHub Copilot Business/Enterprise, Cursor Teams/Enterprise, Tabnine Enterprise</td>
          <td>Contract and plan details vary</td>
          <td>Company source code with procurement review</td>
      </tr>
  </tbody>
</table>
<p>That split is more useful than arguing whether a tool is &ldquo;private&rdquo; in the abstract. Open source is not automatically private. Enterprise is not automatically offline. A local model is not automatically compliant. You have to trace the data.</p>
<p>For related architecture decisions, I use the same mental model I described in <a href="/posts/multi-model-fallback-architecture-guide-surviving-ai-model-outages-in-production/">Multi-Model Fallback Architecture Guide</a> and <a href="/posts/ai-agent-deployment-infrastructure-guide-2026-ampere-sh-e2b-northflank-and-modal-compared/">AI Agent Deployment Infrastructure Guide 2026</a>: the important boundary is where execution, storage, and vendor control actually happen.</p>
<h2 id="what-does-traes-2026-privacy-policy-actually-say">What Does Trae&rsquo;s 2026 Privacy Policy Actually Say?</h2>
<p>Trae&rsquo;s policy is unusually important because it gives you enough detail to make a real security call. The policy explicitly lists user-provided content such as prompts, code or text, file uploads, embeddings, and metadata. It also covers technical data and usage data.</p>
<p>When building internal review checklists for AI tools, I do not start with &ldquo;does the vendor say it respects privacy?&rdquo; I start with a field list:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>Data category                 Ask before approval
</span></span><span style="display:flex;"><span>Prompts and chat history       Are they stored, logged, or used for training?
</span></span><span style="display:flex;"><span>Selected code context          Is it sent to a hosted service?
</span></span><span style="display:flex;"><span>Whole-file or repo context     Is it uploaded for indexing or embedding?
</span></span><span style="display:flex;"><span>Embeddings                     Who computes and stores them?
</span></span><span style="display:flex;"><span>Telemetry                      Can it be disabled globally and verified?
</span></span><span style="display:flex;"><span>Model provider requests        Which subprocessors receive code?
</span></span><span style="display:flex;"><span>Retention                      How long is raw content cached or stored?
</span></span><span style="display:flex;"><span>Training use                   Is opt-out default, plan-specific, or contractual?
</span></span></code></pre></div><p>Trae raises concern across several of those rows. The policy says codebase files may be temporarily uploaded for embedding computation and that plaintext code is deleted after embeddings are computed. That is better than indefinite plaintext storage, but it still means a privacy-sensitive team must approve the upload path, embedding provider, retention behavior, and logs around that process.</p>
<p>The policy language around improving and training technology is also a procurement issue. I would not interpret that as automatically proving every prompt trains a foundation model. Policies are broader than individual product implementations. But I would treat it as insufficient for proprietary code unless the company has a separate enterprise agreement that narrows use, retention, training, subprocessors, and audit rights.</p>
<h2 id="what-happened-in-the-trae-telemetry-controversy">What Happened in the Trae Telemetry Controversy?</h2>
<p>The Trae telemetry story has two separate evidence buckets, and mixing them together is where teams get sloppy.</p>
<p>The first bucket is official policy. That confirms categories of collected data, model-provider sharing for chatbot inputs, and temporary code upload for embeddings. Those are high-confidence facts because they come from Trae&rsquo;s own published policy.</p>
<p>The second bucket is third-party technical reporting. Unit 221B reported persistent ByteDance network connections, device identifiers, local WebSocket channels that handled full file content, and recurring telemetry transmissions. The Register later reported claims that telemetry continued after opt-out, along with a ByteDance clarification that the IDE telemetry toggle controlled VS Code framework telemetry rather than all Trae tooling.</p>
<p>I would not write a security exception that says &ldquo;Trae exfiltrates entire repositories&rdquo; unless I had verified packet captures and payload contents in my own environment. Some telemetry payloads are compressed or encrypted, and some traffic may represent local routing rather than external upload. But I also would not approve a tool for regulated source code just because the strongest external claim has nuance.</p>
<p>For developers, the practical issue is simpler: if a local service reads full file content, if background telemetry is recurring, and if opt-out semantics are ambiguous, then the tool requires a formal review before it touches company code. That review should include outbound DNS logs, proxy captures, process inspection, file access monitoring, and vendor answers in writing.</p>
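<p>The outbound DNS part of that review can start small. The sketch below groups DNS queries by process and flags domains that are not on an approved list. It assumes a hypothetical three-column log format (timestamp, process name, queried domain) and placeholder <code>vendor.example</code> domains; real DNS loggers use different formats, so the parsing would need to be adapted.</p>

```python
from collections import Counter, defaultdict

# Assumed (hypothetical) log format: "timestamp process-name queried-domain".
SAMPLE_LOG = """\
2026-04-13T10:00:01Z ide-helper telemetry.vendor.example
2026-04-13T10:00:02Z ide-helper api.vendor.example
2026-04-13T10:00:05Z git github.com
2026-04-13T10:00:31Z ide-helper telemetry.vendor.example
"""


def domains_by_process(log_text: str) -> dict[str, Counter]:
    """Count DNS queries per process and per domain."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _timestamp, process, domain = parts
        counts[process][domain.lower()] += 1
    return dict(counts)


def unapproved_queries(
    counts: dict[str, Counter], allowlist: set[str]
) -> list[tuple[str, str, int]]:
    """Return sorted (process, domain, query_count) for unapproved domains."""
    hits = [
        (process, domain, n)
        for process, domains in counts.items()
        for domain, n in domains.items()
        if domain not in allowlist
    ]
    return sorted(hits)


if __name__ == "__main__":
    approved = {"github.com", "api.vendor.example"}
    for process, domain, n in unapproved_queries(
        domains_by_process(SAMPLE_LOG), approved
    ):
        print(f"{process} queried {domain} {n} times (not approved)")
```

<p>Domain names alone do not show what was sent, so this only narrows down where to point the proxy capture and payload inspection.</p>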
<h2 id="are-open-source-ai-coding-agents-actually-private">Are Open-Source AI Coding Agents Actually Private?</h2>
<p>Open-source AI coding agents can be much more private, but only when configured that way. I have seen teams install an open-source extension, point it at a hosted API key, leave telemetry enabled, and then claim they have a local AI coding stack. They do not. They have a more inspectable client with cloud inference.</p>
<p>Cline is a good example of the right shape. It is open source and provider-agnostic, with support for Anthropic, OpenAI, Google, AWS Bedrock, Azure, Vertex, Ollama, OpenAI-compatible endpoints, and custom weights. Continue also remains relevant as a local-first pattern, even after being acquired by Cursor, because teams can still study and run the open-source codebase. Aider, OpenCode, and Hermes fit similar patterns depending on your preferred workflow.</p>
<p>The privacy boundary is not the extension name. The boundary is the endpoint:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">private-ish_local_setup</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">editor_agent</span>: <span style="color:#e6db74">&#34;Cline or Continue&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">model_endpoint</span>: <span style="color:#e6db74">&#34;Ollama on localhost or private OpenAI-compatible gateway&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">repo_indexing</span>: <span style="color:#e6db74">&#34;local only, no hosted sync&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">telemetry</span>: <span style="color:#e6db74">&#34;disabled and verified at network layer&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">secrets_policy</span>: <span style="color:#e6db74">&#34;blocked from prompt context&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">logs</span>: <span style="color:#e6db74">&#34;local, rotated, not shipped to SaaS&#34;</span>
</span></span></code></pre></div><p>That setup has trade-offs. Local models can be slower, weaker on large refactors, and expensive if you need GPU capacity. Running Qwen, DeepSeek Coder, Codestral, or Llama-family coding models locally may be fine for small edits, but not equivalent to a top hosted frontier model on large-context architecture work. Self-hosted inference also creates operational work: patching, model governance, access control, audit logs, and capacity planning.</p>
<p>For teams evaluating agent frameworks, the same lesson shows up in <a href="/posts/open-source-agent-eval-harness-comparison-2026/">Open Source Agent Eval Harness Comparison 2026</a>. You need repeatable checks, not just a tool preference. For privacy, that means measuring network egress, inspecting config, and testing whether disabled telemetry stays disabled after upgrades.</p>
<h2 id="how-do-enterprise-ai-coding-tools-handle-privacy-differently">How Do Enterprise AI Coding Tools Handle Privacy Differently?</h2>
<p>Enterprise tools compete on a different axis: not just features, but contracts, admin controls, security portals, and deployment architecture.</p>
<p>GitHub Copilot is a good example of plan-dependent privacy. GitHub states that Copilot Business and Enterprise customer data is not used to train AI models. For individual Free, Pro, Pro+, and Max users, GitHub&rsquo;s 2026 policy treats personal subscriptions differently, including use for training unless users opt out, effective April 24, 2026. That difference matters. A developer using Copilot Pro on personal billing does not have the same risk profile as a company-managed Copilot Enterprise seat with org policies.</p>
<p>Cursor&rsquo;s Privacy Mode is another useful case because it is specific but not magic. Cursor says customer data is not used for Cursor training when Privacy Mode is enabled, and model providers are covered by zero data retention agreements. It also documents two caveats security reviewers should care about: requests can still pass through Cursor&rsquo;s backend for final prompt construction, even with a user-provided API key, and codebase indexing can upload chunks to compute embeddings with temporary encrypted caches.</p>
<p>Tabnine positions itself more aggressively around privacy. Its code privacy materials claim no code storage, no code training, no code or usage-data sharing, proprietary models without third-party API sharing, and deployment options that include SaaS, VPC, on-premises, and fully air-gapped environments. For regulated teams, that deployment menu matters more than a marketing sentence about privacy. Air-gapped and on-prem options give security teams an architecture they can reason about.</p>
<p>Here is the comparison I would use in a security review:</p>
<table>
  <thead>
      <tr>
          <th>Tool category</th>
          <th>Training control</th>
          <th>Retention control</th>
          <th>Deployment control</th>
          <th>Main caveat</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Trae</td>
          <td>Policy language allows improvement/training uses</td>
          <td>Policy describes temporary code upload for embeddings</td>
          <td>Vendor-hosted IDE</td>
          <td>Telemetry concerns and broad collection language</td>
      </tr>
      <tr>
          <td>Cline/Continue/local</td>
          <td>Depends on selected endpoint</td>
          <td>Depends on local logs and provider</td>
          <td>Local, self-hosted, or cloud API</td>
          <td>Misconfiguration can destroy privacy</td>
      </tr>
      <tr>
          <td>GitHub Copilot Business/Enterprise</td>
          <td>Official no-training commitment for business customer data</td>
          <td>Enterprise policy-dependent</td>
          <td>Cloud SaaS</td>
          <td>Individual plans differ materially</td>
      </tr>
      <tr>
          <td>Cursor Privacy Mode</td>
          <td>No Cursor training; ZDR provider agreements</td>
          <td>Temporary caches and indexing behavior documented</td>
          <td>Cloud SaaS with enterprise controls</td>
          <td>Backend prompt construction still occurs</td>
      </tr>
      <tr>
          <td>Tabnine Enterprise</td>
          <td>Markets no code training</td>
          <td>Markets no code storage/sharing</td>
          <td>SaaS, VPC, on-prem, air-gapped</td>
          <td>Verify claims contractually for your plan</td>
      </tr>
  </tbody>
</table>
<h2 id="what-privacy-checklist-should-developers-use-before-adopting-an-ai-coding-tool">What Privacy Checklist Should Developers Use Before Adopting an AI Coding Tool?</h2>
<p>When building an approval checklist, I keep it concrete. The goal is not to create a 40-page governance document that developers route around. The goal is to answer the questions that actually change risk.</p>
<h3 id="what-data-can-leave-the-machine">What data can leave the machine?</h3>
<p>Require a documented answer for prompts, selected code, open files, workspace files, embeddings, terminal output, diagnostics, filenames, repo metadata, dependency manifests, and telemetry. If the tool has agent mode, include shell output and browser/session artifacts.</p>
<h3 id="can-training-be-disabled-by-default">Can training be disabled by default?</h3>
<p>Opt-out is weaker than opt-in. Individual user settings are weaker than admin-enforced policy. Marketing statements are weaker than a DPA, enterprise agreement, or trust-center document that names training, retention, subprocessors, and deletion.</p>
<h3 id="who-are-the-model-providers">Who are the model providers?</h3>
<p>If the product routes requests to OpenAI, Anthropic, Google, AWS, Azure, or another provider, that provider becomes part of your data flow. Zero data retention matters, but so does whether the agreement applies to your plan and region.</p>
<h3 id="are-embeddings-treated-as-sensitive">Are embeddings treated as sensitive?</h3>
<p>Some teams treat embeddings as harmless because they are not plaintext. I do not. Embeddings can encode meaningful semantic information about proprietary code, product names, APIs, and architecture. Store and transmit them as sensitive derived data unless your security team has explicitly classified them otherwise.</p>
<h3 id="can-telemetry-be-disabled-and-verified">Can telemetry be disabled and verified?</h3>
<p>The word &ldquo;disabled&rdquo; is not enough. In practice, I want to see config, admin policy, release notes, and a network-level verification run. A simple test is to open a representative repo in a clean environment, perform common actions, and capture outbound domains through a proxy or DNS logger.</p>
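<p>That verification run can be reduced to a comparison of two captures: one with telemetry enabled and one with it disabled, both performing the same actions. The sketch below assumes you have already extracted the set of outbound domains from each run. The function name, report keys, and <code>vendor.example</code> domains are hypothetical.</p>

```python
def compare_capture_runs(
    enabled_run: set[str],
    disabled_run: set[str],
    approved: set[str],
) -> dict[str, list[str]]:
    """Compare outbound domains seen with telemetry enabled and disabled."""
    return {
        # Domains that went quiet once the toggle was off.
        "silenced_by_toggle": sorted(enabled_run - disabled_run),
        # Domains still contacted with telemetry off and not approved.
        "still_contacted_unapproved": sorted(disabled_run - approved),
        # Domains that only appeared after the toggle was off.
        "new_after_toggle": sorted(disabled_run - enabled_run),
    }


if __name__ == "__main__":
    report = compare_capture_runs(
        enabled_run={
            "api.vendor.example",
            "telemetry.vendor.example",
            "metrics.vendor.example",
        },
        disabled_run={"api.vendor.example", "metrics.vendor.example"},
        approved={"api.vendor.example"},
    )
    for key, domains in report.items():
        print(key, domains)
```

<p>Rerunning the same comparison after each tool upgrade is what shows whether disabled telemetry stays disabled.</p>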
<h2 id="which-tool-should-each-team-type-choose">Which Tool Should Each Team Type Choose?</h2>
<p>For personal experiments, use whatever makes you productive, but assume prompts and code context may leave your machine unless you have verified otherwise. Do not paste secrets, private keys, <code>.env</code> files, customer exports, or unreleased partner code.</p>
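<p>A small pre-paste check helps enforce that habit. The sketch below scans text for a few obvious secret shapes before it goes into a prompt. The three patterns are illustrative assumptions, not a complete detector, and a dedicated secret scanner will catch far more.</p>

```python
import re

# Illustrative patterns only; real secret scanners cover many more formats.
PATTERNS = {
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "env_style_secret": re.compile(
        r"^\s*[A-Z0-9_]*(?:SECRET|TOKEN|PASSWORD|API_KEY)[A-Z0-9_]*\s*=\s*\S+",
        re.IGNORECASE | re.MULTILINE,
    ),
}


def find_secret_hits(text: str) -> list[str]:
    """Return the sorted names of the patterns that match the text."""
    return sorted(
        name for name, pattern in PATTERNS.items() if pattern.search(text)
    )


if __name__ == "__main__":
    snippet = "DEBUG=true\nDATABASE_PASSWORD=hunter2\n"
    hits = find_secret_hits(snippet)
    if hits:
        print("Do not paste this into a prompt:", ", ".join(hits))
```

<p>An empty result does not mean the text is safe to share. It only means none of these specific shapes appeared.</p>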
<p>For startups with proprietary code, I would choose either a managed enterprise assistant or a local/open-source stack. The managed path is easier if the team already uses GitHub Enterprise, Cursor Teams, or Tabnine Enterprise. The local path is attractive when the team has strong infrastructure skills and wants to control model routing, but it will cost engineering time.</p>
<p>For regulated enterprises, do not approve tools based on developer popularity. Require DPA terms, admin controls, retention documentation, audit logs, SOC 2 Type II or equivalent reports, subprocessors, region controls, and a no-training commitment. If the vendor cannot explain embeddings and telemetry clearly, pause the rollout.</p>
<p>For defense, classified, sovereign, or strict offline environments, the answer is much narrower: require self-hosted, VPC, on-premises, or air-gapped deployment with no outbound telemetry. A cloud IDE with privacy language is not equivalent to an offline deployment.</p>
<p>My default recommendation is straightforward: use Trae only for low-sensitivity projects unless your organization has reviewed and accepted its policy and telemetry behavior. Use open-source/local tools when you can operate them correctly. Use enterprise products when you need enforceable controls and support. Use air-gapped or fully self-hosted systems when the code cannot leave the environment.</p>
<h2 id="faq">FAQ</h2>
<h3 id="is-trae-safe-for-proprietary-source-code">Is Trae safe for proprietary source code?</h3>
<p>I would not use Trae for proprietary source code without a formal security review. Its 2026 policy describes collection of prompts, code/text, uploads, embeddings, metadata, technical data, and usage data, and third-party telemetry reports raise additional concerns that should be verified.</p>
<h3 id="does-open-source-mean-an-ai-coding-assistant-is-private">Does open source mean an AI coding assistant is private?</h3>
<p>No. Open source makes the client more inspectable, but privacy depends on configuration. If Cline, Continue, or Aider sends prompts to a hosted model API, that provider&rsquo;s terms and logs become part of your privacy boundary.</p>
<h3 id="is-github-copilot-data-used-for-training">Is GitHub Copilot data used for training?</h3>
<p>It depends on the plan. GitHub says Copilot Business and Enterprise customer data is not used to train AI models. Individual plan behavior differs, and 2026 policy changes make opt-out settings important for Free, Pro, Pro+, and Max users.</p>
<h3 id="does-cursor-privacy-mode-keep-all-code-local">Does Cursor Privacy Mode keep all code local?</h3>
<p>No. Cursor Privacy Mode provides no-training and zero-data-retention style controls, but Cursor documents backend prompt construction and codebase indexing behavior, including uploaded chunks for embeddings and temporary encrypted caches.</p>
<h3 id="what-is-the-most-private-ai-coding-setup">What is the most private AI coding setup?</h3>
<p>The most private practical setup is an audited open-source coding agent pointed at a local or self-hosted model endpoint, with telemetry disabled, no hosted sync, local-only indexing, and network egress blocked except for approved dependencies. For large enterprises, an air-gapped commercial deployment can be stronger operationally.</p>
]]></content:encoded></item></channel></rss>