DeepSeek, OpenAI, and SQD
When DeepSeek released their open-source model, they didn’t just freak out the US-based big tech companies and tank NVIDIA’s stock, much to the dismay of your favorite tech stonk influencer.
They also likely created the conditions for a pivot away from spending millions on training custom models and toward leveraging open-source models and running them on quality data instead. Of course, every model initially had to be trained, and DeepSeek benefitted from the funds that went into developing ChatGPT. Nevertheless, the competitive advantage of the models themselves has now been diluted.
While the US can now figure out how to deploy its Stargate funds on something other than data centers for training AIs, projects developing advanced AI agents must grapple with the fact that UIs and models are now commoditized, and compete accordingly.
Our CEO predicted this shift a while ago: a move toward using whichever model provides value for a given use case. Inference then becomes the biggest question to solve.
Inference refers to the process by which a model draws conclusions from new data. For anyone looking for an edge in their agent, access to quality data becomes more important, while the need to spend resources on training decreases.
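To make the training/inference split concrete, here is a minimal toy sketch (the data and the "model" are entirely made up for illustration): training fits parameters once from labeled historical data, while inference cheaply applies those fixed parameters to new, unseen data.

```python
# Toy illustration of training vs. inference. All values are made up.

def train(samples):
    """'Training': fit a decision threshold from labeled historical data.
    Done once, and expensive at real-world scale."""
    positives = [x for x, label in samples if label == 1]
    negatives = [x for x, label in samples if label == 0]
    return (min(positives) + max(negatives)) / 2  # midpoint threshold

def infer(threshold, new_value):
    """'Inference': apply the already-trained model to new data.
    Cheap, and repeated for every fresh data point."""
    return 1 if new_value >= threshold else 0

history = [(0.2, 0), (0.4, 0), (0.7, 1), (0.9, 1)]
threshold = train(history)    # fitted once: 0.55
print(infer(threshold, 0.8))  # -> 1: a conclusion drawn from new data
```

The point of the sketch: once models are commoditized, the `train` step matters less, and the quality of the data flowing into `infer` matters more.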
Most Valuable Data is Onchain
But where is one to find this quality data?
Our theory is that financial and social data rank high on the scale of valuable data. In traditional systems, there is no way to access such data transparently, if at all. You can use the X API to retrieve information such as who follows whom, but there are clearly defined limits to what can and can’t be retrieved.
For the financial system, even less information is publicly available on the movement of individuals’ funds. In terms of privacy, that’s great, but not so much if you want to build an AI agent that can predict funding and spending habits.
Fortunately, there exists a system that stores all the financial data openly: blockchain. The fact that all financial transactions onchain are publicly available isn’t the only thing making it so valuable for inference purposes.
Even non-financial data written onchain is likely more significant than data stored elsewhere - simply because onchain storage is expensive. Additionally, by making information available onchain, market forces can determine its price, establishing a hierarchy of value.
This idea is further explored by Jacob from Zora in this post from 2024. In it, he recaps that:
- Crypto wants information to be onchain so that it can be valued and add value to the system.
- AI wants information to be onchain so that it can be freely accessed and utilized by the system.
This leads us to the next challenge for AI agents: permissions and access. The beauty of blockchain data is that it’s openly accessible.
Getting onchain data in practice isn’t as easy as it sounds
Unfortunately, it’s also messy - as anyone who has ever tried to retrieve information directly from a chain, without an explorer or a service provider’s API, will confirm. The problems with blockchain data are manifold, but they boil down to the following:
- Scale: more than a terabyte of data, and ever-growing
- Fragmentation: data is spread across too many chains, L2s, and complementary data systems such as IPFS & Arweave
- Lack of standards: every chain comes with its own approach to data.
The above are problems we’ve already solved with SQD for hundreds of dApps. The heaviest users of SQD’s data lake are projects with advanced data needs that want to flexibly design their entire data pipeline.
In practice, they write their own indexer (software to retrieve and structure data) and then combine it with the rest of their infrastructure for a smooth user experience. You can read more about examples of projects using SQD this way in our use case section.
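As a rough sketch of what such an indexer does, the snippet below takes a raw block and flattens it into structured transfer rows. The block’s shape here is simplified and hypothetical (field names and encodings vary across chains), but the pattern - retrieve raw data, then structure it for your application - is the one described above.

```python
# Hedged sketch of an indexer's core job: raw block in, structured rows out.
# The raw block layout below is illustrative, not any specific chain's format.

def index_block(raw_block):
    """Extract a flat list of transfer records from one raw block."""
    rows = []
    for tx in raw_block.get("transactions", []):
        rows.append({
            "block": raw_block["number"],
            "tx_hash": tx["hash"],
            "from": tx["from"],
            "to": tx["to"],
            # raw onchain quantities are commonly hex-encoded strings
            "value": int(tx["value"], 16),
        })
    return rows

raw = {
    "number": 19_000_000,
    "transactions": [
        {"hash": "0xabc", "from": "0x1", "to": "0x2",
         "value": "0xde0b6b3a7640000"},  # 10**18 base units
    ],
}
for row in index_block(raw):
    print(row)
```

A real indexer would run this in a loop over a stream of blocks and write the rows into a database that the dApp’s backend queries.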
At the moment, SQD’s data lake contains raw, unstructured data from more than 200 blockchains. That format, however, isn’t the best one for an AI agent to interact with. Builders with sophisticated data needs benefit from raw data, as it allows them to create their own data structures.
AI Agent teams don’t necessarily share the same ambition. Instead, what matters most for agents is fast access to reliable data at scale. Ideally, it should be in a format better suited to their complex needs.
A Data Lake Evolves
As part of SQD’s strategic pivot, we’re also evolving how SQD offers data. Instead of focusing only on raw data, we’re becoming a hybrid of a data lake (place to dump raw data) and a data warehouse (structured data suited for analytics and inference).
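To make the lake/warehouse distinction concrete, here is an illustrative sketch (not SQD’s actual architecture) using sqlite3 as a stand-in: raw JSON dumps play the role of the lake, and a typed, queryable table plays the warehouse.

```python
# Illustrative contrast: data lake (raw dumps, schema-on-read) vs.
# data warehouse (structured table, schema-on-write). Toy data throughout.
import json
import sqlite3

# "Data lake": records dumped as-is; every consumer must parse them itself.
lake = [
    json.dumps({"chain": "eth", "block": 1, "tx_count": 2}),
    json.dumps({"chain": "eth", "block": 2, "tx_count": 5}),
]

# "Data warehouse": the same data loaded once into a typed table,
# ready for analytical queries (and for an agent to consume directly).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blocks (chain TEXT, block INTEGER, tx_count INTEGER)")
for raw in lake:
    rec = json.loads(raw)  # parsing happens once, at load time
    db.execute("INSERT INTO blocks VALUES (?, ?, ?)",
               (rec["chain"], rec["block"], rec["tx_count"]))

total, = db.execute("SELECT SUM(tx_count) FROM blocks").fetchone()
print(total)  # -> 7
```

The hybrid described above keeps both halves: raw dumps for builders who want full control, structured tables for consumers who just want to query.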
Nothing will change for builders who already run their existing indexers on SQD. We’re expanding the offering to include more structured data ready for immediate consumption to support AI agent developers.
High-quality Web2 providers, such as Snowflake, offer data warehouses that could theoretically serve agents’ needs. However, they suffer from the same risks as any other centralized provider, chiefly exploits and data leaks. An added barrier for agents is the lack of accessibility without human intervention, along with potential bandwidth constraints.
None of these are a problem when AI agents are combined with SQD. We’re moving quickly to enable the vision of agents running inference directly on SQD.
Watch this space.
To ask questions, share feedback, or just hang out with others interested in data x AI, join the SQD Telegram chat.