Making Sense of Web3 Data

SQD Editor

28 May 2024 • 6 min read

While everyone is talking about AI these days, we’re doing something different. Since data is what we do (and data, by the way, is fundamental to AI), we organized an X Space to make sense of Web3 data tools, infra, and privacy instead.

We went live on May 7th to discuss all things data with amazing speakers:

Jasper, CTO & Co-Founder of Seda Protocol
Ahmad Mardeni, CEO of WeaveDB
Sanket Jain, Founder of Gateway

And our CEO, Dima Zhelezov. You can listen to a recording of the conversation here. Read on for the most critical points we covered in that space.

Who is who

It helps to know who focuses on what before reading their views on data, so here we go:

Seda Protocol: offers access to any off-chain data for any onchain consumer (rollups, apps, etc.)
Gateway: turns every user into their own database. That means users carry their own data, deciding when and to whom to reveal it.
WeaveDB: allows a new way to store data, turning every database into a permanently stored smart contract. Decentralized, immutable storage.
Subsquid: scaling access to onchain data

On Data as a Valuable Commodity

The phrase “Data is the New Oil” has been circulating ever since Web2 began farming users' data to provide them with more relevant content (aka spam them with highly targeted ads).

Another truism that comes to mind is that when the product is free, you are the product, which is the Web2 equivalent of “if you don’t know where the yield is coming from, you are the yield.”

Everyone in the space agreed that data is a valuable commodity. After all, Google is a billion-dollar company, and so is Facebook. Both are typical examples of Web2 behemoths that lure their users in with free-to-use products. They finance this by selling what they learn about you to advertisers (this is the short version), often without you even being aware of it or getting a cut.

This attention economy in its current form is problematic for all kinds of reasons. Take the perverse incentives it creates to build platforms that keep users on-site as long as possible, even if that’s with fake news. The ensuing enshittification might spell the eventual death of these platforms. Nevertheless, we’re still far from that.

“Attention is actually the most valuable; data is just the best proxy we have.” - Jasper, Seda Protocol

In the meantime, all speakers believe that Web3 and crypto can create a fairer, more transparent business model in which we are finally rewarded for our data.

“Data is one of the most valuable assets with a clear market. You should be the one getting the rewards and having sovereignty over it,” as Sanket highlights.

Picking up on the oil metaphor, Dima explains that it’s quite accurate for data. Mapping one onto the other gives:

Crude oil = raw data
Refineries & pipelines = infra services offering access
Petrol stations = how people ultimately consume the data.

Why care about onchain data?

At first glance, it might not be clear to a Web3 newcomer why data is even an important topic. In essence, though, blockchains are nothing but fancy databases. Decentralized, sure, but still “just” an aggregation of data.

“It’s only a blockchain when it comes from the blockchain region of France. Else, it’s just a sparkling database.” 🥂

Onchain data then contains everything that happens on a blockchain, from transactions to smart contract events. Pretty much everything that you see when connecting your wallet with a dApp is read from a blockchain and oracles.
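To make “smart contract events” a little more concrete, here is a minimal Python sketch of turning one raw event log into something readable. It assumes an EVM-style chain and the standard ERC-20 `Transfer(address,address,uint256)` event; the function name `decode_transfer` and the dictionary layout are our own illustration, not the API of any particular indexer.

```python
# Topic 0 of an ERC-20 Transfer event: the keccak-256 hash of
# "Transfer(address,address,uint256)".
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"


def decode_transfer(log: dict):
    """Decode a raw EVM log into a transfer record, or return None.

    `log` is assumed to carry hex strings under "address", "topics"
    and "data", which is how JSON-RPC nodes usually return logs.
    """
    topics = log["topics"]
    # An ERC-20 Transfer has exactly three topics: signature, from, to.
    if len(topics) != 3 or topics[0].lower() != TRANSFER_TOPIC:
        return None
    return {
        "token": log["address"],
        # Indexed addresses are left-padded to 32 bytes; keep the last 20.
        "from": "0x" + topics[1][-40:],
        "to": "0x" + topics[2][-40:],
        # The non-indexed amount sits in the data field as a uint256.
        "value": int(log["data"], 16),
    }
```

Indexing is essentially this step repeated over millions of logs, with the results stored so that a dApp can query them quickly.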

There are a few good reasons to care about data. Jasper points out that bad data can mean bad financial outcomes for users. When price feeds are corrupted, it can lead to liquidations, and manipulating prices in order to exploit vulnerabilities is indeed a solid hacker strategy. End-users might only care when things go wrong. That’s why, for devs, it’s paramount to lower that risk as much as possible.
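One common way to reduce the risk from a corrupted price feed is to avoid trusting a single source. The following is a minimal Python sketch, not any speaker's actual implementation: the function name `aggregate_price`, the 5% deviation threshold, and the minimum of two agreeing sources are all assumptions chosen for illustration. It takes the median of several independent quotes and refuses to report a price when too few sources agree with it.

```python
from statistics import median


def aggregate_price(quotes, max_deviation=0.05, min_agreeing=2):
    """Return the median of `quotes`, or None if the feed looks unreliable.

    quotes: prices reported by independent sources (hypothetical inputs).
    max_deviation: allowed relative distance from the median (assumed 5%).
    min_agreeing: how many sources must fall inside that band.
    """
    if not quotes:
        return None
    mid = median(quotes)
    if mid <= 0:
        return None
    # Count the sources that roughly agree with the median.
    agreeing = [q for q in quotes if abs(q - mid) / mid <= max_deviation]
    if len(agreeing) < min_agreeing:
        # Too much disagreement: better to report nothing than a bad price.
        return None
    return mid
```

With quotes of 100.0, 101.0, and a manipulated 5.0, the median stays at 100.0, so a single corrupted source cannot move the reported price on its own. With only two wildly different quotes, the function returns `None` instead of guessing.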

Public vs. Private Data

Most blockchains are currently fully public. This means all transaction and interaction data is stored and visible to anyone. While there is the notion of pseudonymity, meaning users don’t directly tie their real-world identity to their wallet, advances in analytics and people’s tendency to share what they buy have made it easier to identify who’s behind a wallet.

Blockchain data is truly public data and, as such, more accessible than any other public data will ever be. Take Twitter posts, for example. As Dima points out, they’re public in theory, but if you wanted to use them to build a quest platform, you’d still have to pay X for access to their (limited) API.

With decentralized social networks, we’re witnessing how users have relied on that data to train their bots, making them increasingly harder to distinguish from real users.

Still, despite the benefits of easy access to data, not all of it should be revealed to everyone at all times. If you think about it, it’s “insane” (Jasper) that everything is public. If Web3 is to reach mass adoption, this is unlikely to remain the state of affairs.

The solution the speakers propose is “public verifiability” (Sanket). This would mean that while actions in general are recorded on-chain, the details aren’t available. There should be a clear line between the fact that something happened and what exactly it entailed. Take KYC data or passports. If implemented correctly, users might only have to complete KYC once, and other entities would then simply accept the previous proofs and verify that this person is indeed an actual person, and who they claim to be.
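The split between recording that something happened and keeping its details private can be illustrated with a salted hash commitment. This is a deliberately simplified Python sketch and an assumption on our part, not how Gateway or any other speaker's product works; real systems typically rely on signed credentials or zero-knowledge proofs. The helper names `commit` and `verify` are hypothetical.

```python
import hashlib
import hmac
import secrets


def commit(details: bytes) -> tuple[str, bytes]:
    """Return (commitment, salt) for some private details.

    Only the commitment would be published (for example on-chain).
    The user keeps the salt and the details to themselves.
    """
    salt = secrets.token_bytes(16)  # random salt makes guessing the input impractical
    digest = hashlib.sha256(salt + details).hexdigest()
    return digest, salt


def verify(commitment: str, salt: bytes, details: bytes) -> bool:
    """Check that revealed details match a previously published commitment."""
    expected = hashlib.sha256(salt + details).hexdigest()
    return hmac.compare_digest(commitment, expected)
```

Anyone can see that a commitment exists, but only a party to whom the user reveals the salt and the details can check what it stands for. Unlike a zero-knowledge proof, this simple scheme requires revealing the full details to that verifier.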

After all, privacy is a human right, and who are we, as Web3, to dismantle that when our whole purpose is to empower people?

Data x AI

The topic came up briefly, with speakers foreseeing that much onchain activity will eventually come from agents working on our behalf. Considering that nearly 46% of web traffic is already made up of bots (statistically, as the reader, you might be the bot), that’s not a far-fetched idea.

In contrast to the closed-source approach, Web3 might help break up data silos and give models access to data at scale. What’s more, training AI this way would allow full transparency into what’s going into a model and why it might behave a certain way.

An interesting idea brought up was a fully AI-agent-controlled smart contract: one that could be fed via an oracle and then execute certain transactions accordingly. This might get even more fascinating if you combine it with funding models such as Contract Secured Revenue. The biggest risk would be hallucinations leading to adverse outcomes, with the damage depending on how deeply embedded the contract is in other ecosystems.

Modularity

Subsquid and WeaveDB have both built products that are fully modularized to optimize data access and storage. It’s clear that we’re in the unbundling phase of technological innovation.

But for the end-user, that might not matter much. In the end, there’ll be unified user experiences on top of the complex aggregation below. This might be powered by cross-chain communication protocols, data sharing across networks, and shared sequencing.

If you want to start indexing onchain data, check out our docs.