Scoping AI Access to a Data Warehouse Without Exposing Raw Sensitive Data
A practical setup, demonstrated on Snowflake but applicable to most data warehouses, where an AI agent runs real analytics but never reads the raw sensitive data
When you connect an AI coding agent to a database, a quiet question sits underneath the setup. The agent can write SQL and read results, which is the point of connecting it. But what is it actually allowed to see, and where does that data go once it has been read?
A core goal of this piece is to protect the personally identifiable information (PII) and other sensitive company data that an AI agent can reach by default, and to reduce the risk that comes with that exposure. The question it investigates is whether an AI agent can run real analytics on company data without ever surfacing the raw customer PII behind it. I use Snowflake throughout, but the approach can be adapted to most other modern data warehouses.
What you'll find here that wasn't in the YouTube video
The YouTube video walks through the full build. This companion adds the research that did not fit in the runtime:
Why a plain hash of an email or phone number is trivially reversed by a lookup or rainbow table, and how salting forces an attacker to brute-force each value individually.
When tokenization or dynamic data masking is stronger than hashing, and when it isn’t, given the maintenance overhead of mapping tables, key rotation, and governance that come with it.
Which risks a database-layer boundary actually mitigates and which it doesn’t, framed by a recent standards-body white paper on agent security.
What the default setup actually exposes
Picture this: you connect Claude Code to your data warehouse using your personal user. It works. The agent can answer questions, write SQL, and pull back results. That situation is worth defining before you build against it, because what it exposes is often broader than people realize at first. An AI agent connected through an MCP server (Model Context Protocol, the standard that lets a tool like Claude Code talk to a database) does not receive its own permissions. The same is true if you connect through a command-line tool like Claude Code itself, which is what most of my videos use. It authenticates as whatever role the connection uses. If that role is your personal analyst role, the agent inherits everything that role can see, and anything it reads enters the model’s context window, which is sent to the AI provider.
I will be honest that this is a mistake I made at first. When I connected Claude Code to Snowflake, the role I used had access to sensitive data, and in some ways that is exactly what makes the setup powerful. It can read everything and take action. The setup worked, which is exactly why it can take a while to notice the new risk exposure.
There is a useful frame for the underlying risk. Simon Willison, who helped coin the term prompt injection, describes what he calls the lethal trifecta, a combination of an agent with access to private data, the ability to communicate externally, and exposure to untrusted content. A default database connection supplies the first leg of that on its own. The IBM Cost of a Data Breach Report 2025 found that, among the organizations it surveyed, 13% reported a breach of an AI model or application, and 97% of those lacked proper AI access controls. I walked through both of these in my last video, What Happens to Your Data When You Use AI, if you want the longer threat-model context. Closing that access-control gap is what this setup is for.
Three principles behind a scoped setup
The build rests on three principles, and the rest of the work follows from them.

The first is schema segregation. Sensitive data lives in its own schema, and the AI’s role is never granted USAGE on it, so it cannot reference those tables at all. The agent only sees a curated schema of views. This follows the same least-privilege pattern that Snowflake’s PII handling guidance is built around, with the curated schema acting as the security boundary.
The second is what the video calls hashing as irreversibility (meaning you cannot run the hash backwards to recover the email), though in practice it is closer to pseudonymization. Sometimes the agent still needs a PII column in order to join, group, or deduplicate. Running it through a hashing function like SHA-256 replaces the value with a fixed-length digest, which is strong protection but, as the next section shows, not absolute. The useful property, shown in the diagram above, is that the same input always produces the same digest, so the agent can still join customers on a hashed email without ever resolving the real one.
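To make the join-on-a-hash idea concrete, here is a minimal Python sketch. It is not the video's Snowflake build: the table contents, column names, and the `hash_email` helper are all hypothetical, and in the warehouse the hashing would happen inside the curated views rather than in application code.

```python
import hashlib


def hash_email(email: str) -> str:
    """Return the SHA-256 hex digest of a normalized email (hypothetical helper)."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical rows as the agent would see them: digests only, no raw emails.
customers = [
    {"email_hash": hash_email("ana@example.com"), "city": "Lisbon"},
    {"email_hash": hash_email("bo@example.com"), "city": "Oslo"},
]
purchases = [
    {"email_hash": hash_email("Ana@Example.com "), "amount": 40.0},
    {"email_hash": hash_email("bo@example.com"), "amount": 15.0},
    {"email_hash": hash_email("ana@example.com"), "amount": 10.0},
]

# Join purchases to customers on the digest, then aggregate by city.
city_by_hash = {c["email_hash"]: c["city"] for c in customers}
revenue_by_city = {}
for purchase in purchases:
    city = city_by_hash[purchase["email_hash"]]
    revenue_by_city[city] = revenue_by_city.get(city, 0.0) + purchase["amount"]

print(revenue_by_city)
```

The join only lines up because the email is normalized the same way before hashing on both sides. Any difference in case or whitespace would produce a different digest and silently drop the match.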
The third principle is identity selection, and it is the one easiest to underweight. Since a 2024 change, a Snowflake user whose secondary roles were left unset has every granted role active in every session, so setting a restrictive default role does not actually restrict anything for that user. The real boundary is the user’s full grant list. The cleaner approach, shown above as the middle pattern, is a dedicated AI user that has only the scoped agent role granted to it. If there is nothing else on the user, there is nothing else to inherit.
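The first and third principles can be sketched as a short list of grant statements. The Python below only builds the SQL text so the shape is easy to review. Every object name in it (`ANALYTICS`, `CURATED`, `AI_AGENT_ROLE`, `AI_AGENT_USER`, `AI_WH`) is a placeholder I made up, and the statements are my reading of Snowflake's grant syntax rather than the exact script from the video, so check them against the Snowflake documentation before running anything.

```python
def scoped_agent_statements(database, curated_schema, warehouse, role, user):
    """Build grant statements for a dedicated AI user (illustrative sketch only)."""
    curated = f"{database}.{curated_schema}"
    return [
        f"CREATE ROLE IF NOT EXISTS {role};",
        f"GRANT USAGE ON WAREHOUSE {warehouse} TO ROLE {role};",
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role};",
        # USAGE is granted on the curated schema only, never the sensitive one.
        f"GRANT USAGE ON SCHEMA {curated} TO ROLE {role};",
        f"GRANT SELECT ON ALL VIEWS IN SCHEMA {curated} TO ROLE {role};",
        # Dedicated user: one default role and an empty secondary-role list.
        f"CREATE USER IF NOT EXISTS {user} DEFAULT_ROLE = {role} "
        f"DEFAULT_SECONDARY_ROLES = ();",
        # The scoped role is the only role this user is ever granted.
        f"GRANT ROLE {role} TO USER {user};",
    ]


statements = scoped_agent_statements(
    database="ANALYTICS",
    curated_schema="CURATED",
    warehouse="AI_WH",
    role="AI_AGENT_ROLE",
    user="AI_AGENT_USER",
)
for statement in statements:
    print(statement)
```

The important part is what is absent: no statement mentions the sensitive schema, and only one role is granted to the user, so there is nothing for a secondary role to activate.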
Why hashing alone is not enough
That second principle, making PII unreadable to the agent, needs more than a plain hash function, which is what this section is about. A plain hash of a predictable value is reversible in practice. Email addresses follow common patterns, so someone can hash a dictionary of likely addresses and match the results against a hashed column. The fix shown in the video is per-row salting, where each value is combined with a random string from a separate keys table before hashing. A precomputed dictionary then matches nothing, and even an attacker who obtains the salts has to brute-force each value individually.
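The dictionary attack and the effect of a per-row salt are easy to reproduce in a few lines of Python. This is a toy sketch under stated assumptions: the customer ids, the emails, and the in-memory `salt_keys` dictionary standing in for the keys table are all invented for illustration.

```python
import hashlib
import secrets


def plain_hash(email: str) -> str:
    """Unsalted SHA-256 of an email."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


def salted_hash(email: str, salt: bytes) -> str:
    """SHA-256 of a per-row random salt followed by the email."""
    return hashlib.sha256(salt + email.encode("utf-8")).hexdigest()


# Hypothetical raw data and a stand-in for the separate keys table.
raw_customers = {101: "ana@example.com", 102: "bo@example.com"}
salt_keys = {cid: secrets.token_bytes(16) for cid in raw_customers}

plain_column = {cid: plain_hash(e) for cid, e in raw_customers.items()}
salted_column = {
    cid: salted_hash(e, salt_keys[cid]) for cid, e in raw_customers.items()
}

# The attacker hashes a list of likely addresses once and reuses it.
likely_emails = ["ana@example.com", "bo@example.com", "cy@example.com"]
dictionary = {plain_hash(guess): guess for guess in likely_emails}


def dictionary_attack(column):
    """Return every row whose digest appears in the precomputed dictionary."""
    return {cid: dictionary[h] for cid, h in column.items() if h in dictionary}


cracked_plain = dictionary_attack(plain_column)
cracked_salted = dictionary_attack(salted_column)
print(len(cracked_plain), "rows recovered from the plain column")
print(len(cracked_salted), "rows recovered from the salted column")
```

The precomputed dictionary recovers every row of the plain column and none of the salted one. That holds only while the salts are out of the attacker's reach, which is why the keys table belongs in a schema the agent's role cannot use.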
Here is the part the video does not get into. Salting improves the engineering, but it does not change the legal classification. Hashing is pseudonymization rather than anonymization. Under frameworks like GDPR, pseudonymized data is still personal data, because a link back to a person still exists somewhere.
Salting also interacts with joins, and the tradeoff is worth being precise about. An unsalted hash is deterministic, so the same email hashes to the same value everywhere and joins work globally. A salt defeats the dictionary attack, but a join then only works where the same value is hashed with the same salt. The video’s per-customer salt keys table keeps each customer’s hash stable, which covers customer-keyed analytics. The salt-plus-SHA-256 approach used in the build is enough for that case. A stronger production option for joining a value across tables is HMAC (Hash-based Message Authentication Code), a keyed hash that combines a secret key with the value being hashed. Without the key, the hash cannot be reproduced, which defeats the dictionary attack while still preserving the join. For a column the agent never joins on, dropping it is simpler, and where the value still has to exist somewhere, tokenization through a separate data privacy vault is the stronger pattern, since the sensitive value is isolated entirely and downstream systems hold only a token.
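The HMAC option can be sketched with Python's standard `hmac` module. The key below is a hard-coded placeholder for illustration only. In a real setup it would come from a secrets manager and would never be visible to the agent's role. The `keyed_hash` helper is my own name, not something from the video.

```python
import hashlib
import hmac


def keyed_hash(email: str, key: bytes) -> str:
    """HMAC-SHA-256 of a normalized email under a secret key (hypothetical helper)."""
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


# Placeholder key for the sketch; do not hard-code a real key like this.
SECRET_KEY = b"hypothetical-demo-key"

# The same email tokenized independently in two tables yields the same value,
# so a join on the token still works across tables.
customers_token = keyed_hash("ana@example.com", SECRET_KEY)
purchases_token = keyed_hash("Ana@Example.com", SECRET_KEY)
joins_across_tables = hmac.compare_digest(customers_token, purchases_token)

# An attacker who guesses the right email but lacks the key gets no match.
wrong_key_guess = keyed_hash("ana@example.com", b"attacker-guessed-key")
unkeyed_guess = hashlib.sha256(b"ana@example.com").hexdigest()

print("join preserved:", joins_across_tables)
print("wrong key matches:", wrong_key_guess == customers_token)
print("plain SHA-256 matches:", unkeyed_guess == customers_token)
```

Compared with per-row salts, there is one secret to protect instead of a table of them. The cost is that rotating the key changes every token at once, so the columns have to be rebuilt together.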
Where this build stops
The queries the video runs after the protective layer is in place are deliberately stopped from working. Either the underlying tables are not reachable from the AI user at all, or the request returns data that is properly hidden.

It is worth being precise about what this does and does not do. This is a data-access boundary. It keeps raw values away from the agent and the provider. It does not address prompt injection, where an attacker tricks the agent into running a query it is technically allowed to run but should not.
A recent white paper from the Coalition for Secure AI makes this point at the architectural level. Tool implementations, it argues, should not rely on the model to perform security-critical operations or enforce constraints. Enforcement has to live in the database, which is what schema segregation and a scoped role provide. Anthropic's own deputy chief information security officer, Jason Clinton, has framed agent security in similar terms, thinking in specific scopes and blast radius rather than a single decision to deploy or not.
There is also the question of what happens to data the agent legitimately reads. That is the focus of my previous video, What Happens to Your Data When You Use AI, so please check it out there. The point that matters for this build is simpler. The scoped setup keeps sensitive data off that path regardless of a provider’s training or retention terms, because the agent never reads the raw values in the first place.
Conclusion
The real question this piece is asking is whether you can scope an AI agent’s access tightly enough that the agent does useful analytical work without becoming a new risk vector for raw customer PII. The argument here is that you can, provided the AI’s user is well segregated from any human user. So, can an AI agent run real analytics on company data without exposing raw customer PII? For a database-access boundary, the answer is yes, and the Snowflake build I show in the video gives it a concrete shape. The agent can still answer a question like churn rate by city against the masked views, while the raw values stay in a schema it cannot reach.
A few considerations if you are weighing your own setup:
Identity is the single highest-impact change. Authenticate the agent as a user that holds only the scoped role; doing that first closes the secondary-role gap that a default role alone leaves open.
Treat hashing as a join-preserving tool, not as a privacy guarantee. Join-preserving means the same email always hashes to the same value, so the agent can still join purchases to customers on a masked email column without ever resolving the underlying value. Prefer dropping or tokenizing columns the agent never needs to join on.
With a per-row salt added, the hash for any individual value is still not literally impossible to reverse, but an attacker has to brute-force each value individually, which is impractical at scale as long as the salts stay out of reach. That is strong practical protection, not a formal guarantee.
This is one layer, and it is a good foundation. There is more to securing your data beyond this (key management, audit logging, row-level access), and those are topics for another article.
Thanks for reading. If you have wired an AI tool into your own data warehouse, I would love to hear how you are handling this. Leave a comment below.


