The problem SQD solves today

If you’re new to SQD and wonder what the point of having such a platform is, this post is for you. Imagine you’re a skeptic, meeting our team at a conference. You’d probably ask: 

So what’s SQD? 

In short, a hyper-scalable, permissionless, cost-efficient data layer for onchain data. 

That’s quite a mouthful, and unless you spend a lot of time thinking about how to access web3 data, it might not immediately mean anything to you. So your next question might be:

And what’s that even good for? 

Thanks for asking. The real problem SQD solves is accessing web3 data. As you know, blockchain started with the promise of empowering individuals to be more sovereign, to trust but verify, and to not rely on centralized parties. That assumes that you can get hold of the information you need. 

More than ten years into the industry, though, trust but verify is increasingly difficult in a world of hundreds of chains and L2s, with no holistic solutions to support data retrieval. It’s paradoxical that, despite its open and public nature, onchain data is hard to access.

Why is onchain data hard to access? 

When Satoshi Nakamoto created Bitcoin, he (they) could not foresee the explosion in chains and projects built on top of the technology. It’s fair to assume that Vitalik didn’t see it coming either; hence, we have node software that’s optimized to deliver a blockchain’s other important functions, such as consensus, execution, and storage, but not to let others simply read the data stored on it.

Scale 

If you wanted to download Ethereum’s blockchain history, you’d need 1,140.89 GB of storage. For a higher-throughput chain like Solana, good luck trying to sync 300 terabytes on your home machine (don’t do this). Even if you managed to download all that, where are you gonna store it?

As an individual not looking to check all of the chains’ history, you’re probably fine using a light client to verify states. However, projects still need access to onchain data, and even if it’s not the entire ledger, the size is growing exponentially with every new chain and transaction. 

Currently, projects access onchain data via RPC nodes, which have mainly been architected to facilitate communication and write operations. As you can imagine, the more users a blockchain or dApp accumulates, the larger the scale at which data needs to be retrieved.

After all, even showing you a correct portfolio balance requires onchain data. Do this thousands of times at once, and RPC nodes can quickly become overwhelmed and slow down.
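To make the load concrete, here’s a minimal sketch of the kind of read request a portfolio page fires at an RPC node. eth_getBalance is a standard Ethereum JSON-RPC method; the endpoint URL and the way the result is used are placeholders.

```typescript
// Minimal sketch: one balance lookup against a (placeholder) RPC endpoint.
const RPC_URL = 'https://eth-mainnet.example.com' // hypothetical endpoint

async function getBalance(address: string): Promise<bigint> {
  const res = await fetch(RPC_URL, {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({
      jsonrpc: '2.0',
      id: 1,
      method: 'eth_getBalance', // standard JSON-RPC read method
      params: [address, 'latest'],
    }),
  })
  const {result} = await res.json()
  return BigInt(result) // balance in wei, returned hex-encoded by the node
}
```

A portfolio page repeats this for every asset and every user; multiply that by thousands of concurrent users and the RPC node becomes the bottleneck.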


Obviously, this isn’t a great user experience. Running your own RPC node isn’t necessarily a skill that blockchain teams have; after all, operating nodes is an entirely different ballgame from writing a smart contract.

Of course, data access can be outsourced to centralized providers, but that’s not really in the spirit of trustlessness either, and often expensive when done at scale. 

Fragmentation 

The sheer scale already makes retrieving all the onchain data hard. Matters are made worse by data in crypto being spread across hundreds of Layer-1s, Layer-2s, and other decentralized and non-decentralized data sources such as Arweave, IPFS, Farcaster Hubs, and Google Drive (if your favorite NFT project is hosting their metadata on a GDrive, run).

Big data is all fun and games until it’s spread across thousands of data sources that do not even have the same structure. 

While EVM chains have interoperable data schemas, the same can’t be said for newer virtual machines such as SVM, Move, Fuel, and others coming to market. As long as new VMs keep receiving millions of dollars in funding, we’ll see more of them emerge.

And while that is welcome in the spirit of innovation, it's a headache for anyone operating with data at a multichain scale. 


Lastly, fragmentation is further exacerbated by different account primitives such as smart contract wallets, hosted accounts, and account abstraction. Suddenly, a bundler is in the mix, and the user doesn’t even know that they have just operated on a blockchain. How do you interpret data that is mostly submitted via relayers and originates from a different address than the user’s?

Ok, stop there. So fragmentation and scale of data make it hard to access… But wasn’t it all supposed to be open and accessible? 

Yes, and for the first few years of Bitcoin, it was. It was designed to be relatively lightweight and easy to access. But then came the explosion, and now we’re in a state where data is open but not interoperable.

If you’re operating a bridge, you’d need to gather data from different sources, such as the source chain (say, Solana) and the destination chain (say, Arbitrum). Their account schemes differ, meaning the data is there, but to make it comparable and easy to work with, you need to put both into the same format, as sketched below. Doing that once manually is not a problem, but at scale, it’d be very resource-intensive and inefficient.
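To make that concrete, here’s a small, hypothetical sketch of such normalization. The record shapes and field names are invented for illustration and don’t come from any particular bridge or API.

```typescript
// Hypothetical raw records, shaped the way two different chains might expose them.
type SolanaTransfer = {signature: string; source: string; destination: string; lamports: number}
type ArbitrumTransfer = {txHash: string; from: string; to: string; valueWei: string}

// One shared format the bridge can actually reason about.
type NormalizedTransfer = {
  chain: 'solana' | 'arbitrum'
  txId: string
  from: string
  to: string
  amountBaseUnits: bigint // lamports or wei, kept in each chain's base unit
}

function fromSolana(t: SolanaTransfer): NormalizedTransfer {
  return {chain: 'solana', txId: t.signature, from: t.source, to: t.destination, amountBaseUnits: BigInt(t.lamports)}
}

function fromArbitrum(t: ArbitrumTransfer): NormalizedTransfer {
  return {chain: 'arbitrum', txId: t.txHash, from: t.from, to: t.to, amountBaseUnits: BigInt(t.valueWei)}
}
```

Two converters are trivial; writing and maintaining them for every chain, VM, and account scheme a project touches is where the real cost sits.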


And next, you tell me that SQD solves that… 

Indeed. It works by aggregating all of the blockchain data in Parquet files - a very efficient format for storing raw data - and distributing it across nodes in our decentralized data lake. Now, instead of needing to run their own RPC node or rely on centralized providers, devs can build their own indexer (which is just a way to grab exactly the data they need) and run it on the SQD network.
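For a flavor of what such an indexer looks like, here’s a rough sketch using the Squid SDK’s EvmBatchProcessor. The gateway URL, the placeholder addresses, and the handler logic are illustrative, and exact method names can differ between SDK versions.

```typescript
import {EvmBatchProcessor} from '@subsquid/evm-processor'
import {TypeormDatabase} from '@subsquid/typeorm-store'

// Illustrative setup: stream the logs of one contract from an SQD Network gateway.
const processor = new EvmBatchProcessor()
  .setGateway('https://v2.archive.subsquid.io/network/ethereum-mainnet')
  .setBlockRange({from: 6_000_000})
  .addLog({
    address: ['0x...'], // the contract you care about (placeholder)
    topic0: ['0x...'],  // e.g. the event signature hash you want (placeholder)
  })

processor.run(new TypeormDatabase(), async (ctx) => {
  for (const block of ctx.blocks) {
    for (const log of block.logs) {
      // Decode the log and persist whatever your application needs.
      ctx.log.info(`matched log from ${log.address} at block ${block.header.height}`)
    }
  }
})
```

The point is that the developer describes the data they need declaratively and lets the network deliver it, rather than paging through an RPC node block by block.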

Whenever they send a query, it goes directly to a worker node that has the desired data range. Next you probably wonder: How do they know which worker to send the query to? 

Without getting too deep: a network participant called the scheduler distributes the stored blockchain data and assigns it to the worker nodes. All workers continuously report which data they have stored, providing a detailed map of where to direct queries. Whenever multiple worker nodes hold the same data range, one of them is selected at random to ensure a fair distribution of query volume.
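As a purely illustrative model (not SQD’s actual implementation), you can think of routing as a lookup from announced data ranges to the workers holding them, with a random pick among candidates. All names below are hypothetical.

```typescript
// Hypothetical model of query routing; the real network is more involved.
type BlockRange = {from: number; to: number}
type WorkerId = string

// Built from the workers' own reports of which ranges they store.
const assignments = new Map<WorkerId, BlockRange[]>([
  ['worker-a', [{from: 0, to: 999_999}]],
  ['worker-b', [{from: 0, to: 999_999}, {from: 1_000_000, to: 1_999_999}]],
])

// Pick one of the workers that holds the requested block, at random,
// so query volume spreads evenly across replicas of the same range.
function pickWorker(block: number): WorkerId | undefined {
  const candidates = [...assignments.entries()]
    .filter(([, ranges]) => ranges.some((r) => block >= r.from && block <= r.to))
    .map(([id]) => id)
  if (candidates.length === 0) return undefined
  return candidates[Math.floor(Math.random() * candidates.length)]
}

console.log(pickWorker(500_000)) // 'worker-a' or 'worker-b', chosen at random
```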

Aggregating all the data in one place makes it easy to source otherwise fragmented multichain data. And because capacity grows with every node in the network (more than 900 at the time of writing), SQD can also scale bandwidth and data throughput. Both fragmentation and scalability are addressed.

Our measurements suggest that using SQD allows projects to sync blocks up to 100x faster than other similar solutions. 

Understood, SQD addresses scaling and fragmentation issues. So who’s using it? 

To mention just a few: PancakeSwap, Interface, Shiba Inu, Coin List, Aleph Zero, Catalog, Chainsafe, Hydration, and Deutsche Telekom. 


Note that this is an introductory post explaining the data issues that SQD addresses. For more insights on the product, check out our recent game plan announcement and the SQD Whitepaper. 

In the near future, we’ll also talk more about architecture, so stay tuned. 

For a more direct comparison with other solutions, we recommend David Attermann’s discussion on blockchain data solutions, including SQD.