Last Updated on October 3, 2024 by mike
TL;DR: This week, indexers share what they'd like to cover in upcoming educational sessions on Kubernetes, plus Semiotic Labs shares an update on Timeline Aggregation Protocol, which is ready for testing on testnet and will be deployed to mainnet soon.
Opening remarks
Hello everyone, and welcome to episode 174 of Indexer Office Hours!
GRTiQ 185
Catch the GRTiQ Podcast with José Betancourt, Founder and CEO of Virtual Labs, a cutting-edge platform focused on decentralizing data and improving user experiences in web3.
Repo watch
The latest updates to important repositories
Execution Layer Clients
- Erigon: New release v2.60.7:
- This release changes the Erigon release process: the v prefix is removed from Docker tags and released artifacts, archives now include 10 binaries and Docker images 7, Docker image hosting has moved to erigontech/erigon, and multi-platform Docker images are available for both linux/amd64 and linux/arm64. Additionally, Docker images now carry the org.opencontainers.image.revision label, reflecting the commit ID used for the build.
- Breaking Change: Windows binaries are no longer published, pending improvements.
- sfeth/fireeth: New release v2.7.2:
- This release introduces a new, shorter form for the --advertise-block-id-encoding flag (e.g., hex, base64) while still supporting the older, longer form (see the example below).
- Substreams v1.10.2 includes several important bug fixes, such as correcting issues with handling modules that have multiple inputs and fixing stalling with stores and mappers using different start block numbers.
- Breaking Change: Due to a bug fix related to skipping blocks, any previously produced Substreams cache is considered potentially corrupted and should be replaced.
Matthew Darwin | Pinax posted: We’ll be rolling out the Firehose update later today. We’ll be re-processing the Substreams cache later.
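As a quick sketch of the v2.7.2 flag change (the exact invocation depends on your setup; the long value shown is the pre-existing encoding name):

```bash
# Older, longer form (still supported)
fireeth start --advertise-block-id-encoding=BLOCK_ID_ENCODING_HEX

# New, shorter form introduced in v2.7.2
fireeth start --advertise-block-id-encoding=hex
```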
- Avalanche: New release v1.11.11:
- This version is backward compatible with v1.11.0 but introduces a plugin update (v37) requiring all plugins to update for compatibility. Key API changes include new methods (info.upgrades, platform.getFeeConfig, platform.getFeeState), deprecations of subnet uptime and validator metrics, and various fixes such as improved memory management and dynamic fee configurations.
- Breaking Change: Plugins must update to v37, and deprecated APIs related to subnet uptimes should be replaced accordingly.
Matthew Darwin | Pinax posted: Any plans from E&N on when a new graph-node might be coming? Yaro | Pinax is making some fixes to graph-node.
Protocol watch
Forum Research
Core dev updates:
- Edge & Node’s August/September 2024 Update
- Pinax August/September 2024 Update
- Semiotic August 2024 Update
- GraphOps Update August 2024
- Geo August 2024 Update
- StreamingFast August/September 2024 Update
- Messari August 2024 Update
Open discussion
Some of the comments have been lightly edited and condensed.
Kubernetes [11:28]
Abel: Recently, we’ve had a series of PostgreSQL-themed sessions, and they were well received. Matthew suggested we cover other aspects of the stack, like Kubernetes (a.k.a. Kubes, K8s). So similar to what we did for PostgreSQL, we’d like to do a set of Kubernetes-themed sessions.
We’d like to hear from indexers what questions you have about Kubernetes so we can answer them over the next few weeks.
Matthew: At Pinax, we’re in the midst of converting stuff from LXC containers to Kubernetes. This is a long journey in terms of learning the Kubernetes stack, what the various options are, which pieces we need, what storage components we need, and what the heck is an ingress controller.
We started from zero knowledge and we’ve gotten to the point where we can now run some stuff in Kubernetes.
Thanks a lot to the GraphOps team and all their great work on Launchpad—that has helped.
So the question I have is, which parts of this do people want to learn?
- Do we want to start from the basics or focus on a specific area?
- What kinds of things do people want to know?
- How can we help indexers learn? We’ll all learn together as we embark on this journey.
Then we can put together some material to address these needs.
Jim: I don’t have a great deal to add because I’m not there yet, but Vince tries every day to convince me to use Kubes, and I think ultimately I will, but I’m not in the trenches with it yet. The thing that Vince says is when you decide to go down the route of using infrastructure as code and GitOps, every decision you make is a philosophical one, and everybody has a different opinion. There are multiple solutions for every single service that you might want to run. Obviously, there’s the Kubes paradigm that you have to learn, which is probably the best place to start. But once you start making decisions, there are a lot of decisions to make. What we might typically do in the bare metal world is Ansible and Redfish and PXE booting—that’s all up for grabs in the Kube space as well.
If I’m going to do this, I’m going to start with a machine that has no opinions, and I’m going to build my opinions on top of it programmatically using Talos. You can take a server from nothing to part of your cluster very quickly this way. The stack I’ve built recently for The Graph is a horizontally scalable set of Docker Compose layers, and going through that process has made it pretty obvious that the next step is Kubes. Further down the line, I’ll have loads of questions.
I’m really interested to hear about your journey, Matthew, because you’re coming from a similar place. It’s great that we’re not on this journey alone.
Vince | Nodeify posted: I made an auto cluster launcher for Talos with actions.
Matthew: Absolutely, I want it to be a journey where we don’t have to go through it alone, and we leave room for those discussions: indexers who want to do it this way, that way, or the other way. As you mentioned, there are 200 ways to do a single thing.
I heard a quote: “The main job of Kubernetes is to employ people to run Kubernetes.”
If you get into the depths of the details from managing hardware all the way up the stack, it can be a lot to learn.
If you’re starting from a cloud provider, then a lot of those decisions are already done for you. If you’re starting from bare metal, there’s a lot of stuff to take into account.
We started on LXC containers with a thing called “Matthew’s automation,” so we need to get away from that.
- Vince | Nodeify posted: Matthew’s automation is magical.
Alexis | Semiotic Labs posted: I feel like when you have all the K8s set up, it’s less work.
- Matthew: Sure, but it takes a while to get there.
Matthew: The key thing here is to have some good open discussions on people’s opinions about how they manage their unique infrastructure and what level of automation they want to achieve. If you’re a smaller indexer who wants to focus on a smaller set of chains and services, then maybe going to full automation on deploying things with PXE boot is not where you need to be. Other indexers may want to deploy every single blockchain on the planet and need to automate the stack from hardware acquisition all the way up. There’s a large scope for differentiation between all the different indexers.
Jim: For some of these core functions within The Graph stack, it doesn’t matter if you’re a big or small indexer; they’re still relevant.
Matthew: Once you deploy 2 of them [Kubernetes], you can deploy 100 of them, as long as you have the hardware.
Jim: Yeah, it doesn’t matter what scale you’re at for some of these core functions.
calinah | GraphOps posted: Interested in how you use Ansible because v1 of Launchpad was using Ansible, and it didn’t work out great for us.
Vince | Nodeify posted: I can spin up a cluster with Talos in 2 seconds with PXE boots 🙂 Or maybe people want to deploy on top of a Linux distribution.
Matthew: Vince really likes his Talos, but maybe some people are more comfortable with their normal Linux distribution and just want to deploy Kubernetes on top of that, which is certainly what we’ve done up until now. I don’t see that as the long-term solution, but it got us started, and it works for running The Graph stack.
Alexis | Semiotic Labs: I’m halfway, using Fedora CoreOS in prod.
- calinah | GraphOps: Yes, we are [using Fedora CoreOS] and it’s working quite well.
Matthew: Who are the little indexers here? How can we help you?
- If you’re on Payne’s Docker Compose stack, which is awesome to get started with, and you want to take it to the next level, what questions do you have?
- That will help us define what kind of resources we can bring. A good starting point is exploring what’s available and what is needed.
Jim: What are the benefits of using Docker, Docker Swarm, or bare metal?
Something I’ve learned in building my “rubbish” version of Kubes is that you can spend time in Docker writing or using Docker Swarm to deploy things, but if you don’t have common ingress and egress, you have to build proxies into every layer and configure them. If you don’t do that in a dynamic way, you’re going to have a nightmare managing it as you try to expand. Even doing it dynamically is difficult to manage as a bunch of Docker Compose files, VS Code with tons of parameters, and an environment file as long as your arm, whereas when you move into the Kubes world, those things are sort of taken care of for you.
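For readers new to the idea, a centralized Ingress is one resource that routes outside traffic to internal Services, rather than a proxy built into every layer. A minimal sketch, assuming an ingress controller such as ingress-nginx is already installed (the hostname and service names here are hypothetical):

```bash
# Route external traffic for one hostname to an internal Service.
# The ingress controller is the single shared entry point; Services
# behind it don't need their own proxy layers.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: graph-query            # hypothetical name
spec:
  ingressClassName: nginx      # assumes ingress-nginx is installed
  rules:
    - host: query.example.com  # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: graph-node-query   # hypothetical Service for graph-node's query port
                port:
                  number: 8000
EOF
```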
The other big change for me is taking the leap and acting like a developer. All of my code is in repos, and I run a private GitLab. If you don’t have those in place before you start, then it’s a non-starter. Would you agree, Matthew?
Matthew: I’m not suggesting this, but you can run Kubernetes with kubectl alone and do everything as a manual deployment, which is a good way to get started and learn the environment. But if you’re trying to scale to a big level, you can’t keep track of what you did, so you do need a Git environment to manage things.
Matthew’s magical environment, which doesn’t use Kubernetes, has been a Git environment from day one: everything is managed in Git and deployed as infrastructure as code even though we’re not using Docker or Kubernetes.
If you’re going to move toward infrastructure as code, you need a source control system and you need to act like a developer, not SSH into machines and deploy things manually.
That is a fundamental shift to think about from an automation perspective.
If you’re already using Ansible, maybe you’re already in that headspace. But if you’re used to directly interacting with machines, then it’s different. You need to switch to: I can’t just change this config file; I need to update source control.
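To make the contrast concrete, here is a small sketch (the image path and repo URL are hypothetical): the imperative command works but leaves no record, while the declarative flow keeps the manifests in Git.

```bash
# Imperative: quick to try, but nothing records what you did
kubectl create deployment indexer-service \
  --image=ghcr.io/graphprotocol/indexer-service-rs:latest   # image path assumed

# Declarative: manifests live in Git; the cluster is reconciled from source control
git clone https://git.example.com/infra/k8s-manifests.git    # hypothetical repo
kubectl diff  -f k8s-manifests/indexer-service/              # review changes against the live cluster
kubectl apply -f k8s-manifests/indexer-service/
```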
Colin | StreamingFast posted: SF’s indexer got off the ground with Kubernetes using these manifests: https://github.com/graphprotocol/indexer/tree/main/k8s
Got us 90% of the way to our production setup rather easily.
- Marc-André | Ellipfra posted: These manifests are the reason I learned K8s.
Matthew: I would also like to hear from anybody who’s thinking: “This Kubernetes thing is not for me.” I’d like to hear why.
calinah | GraphOps posted: Kubernetes is not for me 😅 I feel like Kubernetes is not for anyone buuut… there’s nothing better when you need such a large scale IMHO.
At this point, Alexis from Semiotic Labs shared the Timeline Aggregation Protocol, after which the conversation continued.
[48:37]
Abel: I’m curious, for folks who aren’t using Kubernetes, what their reasons and challenges are. It would be great to have some direction before we wrap up today’s conversation.
Is there anyone who is proficient with Kubernetes and willing to do a workshop or presentation on their setup?
Matthew: I would love to know more about other people’s setups. I’m happy to share ours. Maybe Vince could share what he’s doing for his indexer as well.
Vince: For my own indexer, I’m starting from scratch, which raises the question of what slate I’m working on, a.k.a. the religious question of which distribution you’re going to use. I decided on Talos for various reasons, especially in the bare metal world. If anybody’s ever loaded up an ISO: you boot it, it goes into maintenance mode, and it just waits for you to tell it what to do, which meant I could drive it from GitHub Actions through its API. It’s secure by default, it enforces security, and you can set machine configs at the base level before you do anything, run a GitHub Action off that, and spin up a cluster in two minutes with as few or as many machines as you want. Adding machines is incredibly easy.
Talos also comes with talosctl, which, if you’re familiar with Kubernetes and kubectl, lets you do a lot: at a higher level, you can do stuff to disks, apply new machine configs, and all sorts of cool stuff. That’s kind of where I’m at on my journey. I just got all that working, so now I can spin up a Talos cluster within 45 seconds by sending a single command. I just point at IP addresses and hit go, and I have a cluster with Cilium networking and everything.
I can do slides or a demo in a future session.
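For a flavor of what that looks like on the command line, here is a minimal sketch of bootstrapping a Talos cluster with talosctl (the cluster name and IP addresses are placeholders; Vince’s GitHub Actions setup automates steps like these):

```bash
# Generate cluster secrets and machine configs (controlplane.yaml, worker.yaml, talosconfig)
talosctl gen config demo-cluster https://192.168.1.10:6443

# Nodes booted from the Talos ISO wait in maintenance mode for a config
talosctl apply-config --insecure --nodes 192.168.1.10 --file controlplane.yaml
talosctl apply-config --insecure --nodes 192.168.1.11 --file worker.yaml

# Bootstrap etcd on the first control-plane node, then fetch a kubeconfig
talosctl bootstrap --nodes 192.168.1.10 --endpoints 192.168.1.10 --talosconfig ./talosconfig
talosctl kubeconfig --nodes 192.168.1.10 --endpoints 192.168.1.10 --talosconfig ./talosconfig
```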
Alexis | Semiotic Labs posted: My stack: bare metal > Fedora CoreOS customized with ZFS > kubeadm (OS managed with PyInfra instead of Ansible) > cilium + ZFS localpv + Helm controller + traefik + cert-manager + prometheus + thanos + loki + grafana
Matthew: I’d love to see Alexis’s setup too. ZFS, you say?
Jim: I’d be really interested to know how people find the experience of learning disk services at scale with Kubes.
Matthew: I think that is an excellent topic because I agree with you, there are probably 100 ways to do it.
In our Pinax journey, we started with deploying services that don’t need local disks or don’t need much disk to avoid that problem. So we started with Firehose and other services that don’t need a local disk. Blockchain nodes are kind of challenging when we’re trying to run the BSC or something. I think the GraphOps team has a lot of good experience here that hopefully they can share as well.
Ana: We can share what we’ve done.
Alexis | Semiotic Labs: Longhorn for distributed and ZFS localpv for high perf local stuff. Storage on K8s is a very deeeep rabbit hole subject.
Matthew: To me, the biggest thing with Kubernetes is to understand why people pick different options. We tried this, we didn’t try that, we need to have these requirements, whatever. Everybody’s got a different stack; why did you pick the stack that you picked?
Derek | Data Nexus: Also, “We tried ___ and it went really bad.”
Matthew: Absolutely. What went well; what didn’t go well.
Timeline Aggregation Protocol [32:42]
Alexis from Semiotic Labs is here to update us on Timeline Aggregation Protocol (TAP), a fast, efficient, and trustless unidirectional micro-payment system that facilitates high-throughput payments for The Graph. This enables indexers to efficiently aggregate millions of receipts.
Timeline Aggregation Protocol Repo
Alexis: TAP has been in the works for a while, so to refresh everyone’s memory, I will share some resources:
- Core Devs Call #22 [4:47]
- What is TAP, and how will it work?
- Indexer Office Hours #147 recording
- Indexer Office Hours #147 blog recap
- Indexer experience of TAP
Update
We’re now in the final stages of testing the new parts of the indexer stack: mostly a completely fresh indexer service, now written in Rust, that doesn’t leak memory and is faster and easier to maintain than the TypeScript version.
The TAP agent is the component that will keep track of payments and make sure that everything is correct. It blocks a customer if they’re not paying correctly.
The new indexer service scales horizontally. TAP agent is more like indexer agent in that you want a single instance running at all times; in fact, it’s more important to keep TAP agent running at all times than indexer agent.
Update indexer agent to the latest version to run this stuff. I’m running the latest agent and I’m not seeing any issues on my end.
The new indexer service is multi-threaded because in Rust it’s very easy, so we almost accidentally made it multi-threaded.
Call to Action
Start playing with it on testnet, so you’re ready when it’s on mainnet and the old system is deprecated.
Repos, with pre-built containers:
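The repo links didn’t survive in this recap; as an illustration only, pulling pre-built images could look like the following (image names assumed from the graphprotocol/indexer-rs packages; verify against the repo):

```bash
# Assumed image names; check the indexer-rs repo's packages for the real ones
docker pull ghcr.io/graphprotocol/indexer-service-rs:latest
docker pull ghcr.io/graphprotocol/indexer-tap-agent:latest
```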
calinah | GraphOps posted: We are also running on testnet and have released a new version on graph-network-indexer chart for it: graph-network-indexer-0.5.0-canary.2
Next week, we should be ready to announce the official mainnet deployment of the new payment system.
Send feedback or add issues to the indexer-rs repo.
Questions
Marc-André | Ellipfra: Is the new TAP agent and service backward compatible?
Alexis: TAP agent is not backward compatible because you don’t need it for Scalar. The new indexer service is not backward compatible either.
The gateway is backward compatible and has been updated to support the new payment system. There’s going to be a transition period where you can run the old stuff and then upgrade to the new. Both are going to work.
Indexer agent is backward compatible. The latest version is capable of redeeming TAP receipt aggregates on-chain (v0.21.4).
Marc-André: We’ll be expected to run legacy and TAP service in parallel? Or the gateway will gracefully handle it?
Alexis: You’re not going to be expected to run the legacy agent and indexer service in parallel. You’ll have to jump to the new stack; it’s one or the other. The gateway asks the service which version it is. If it’s version zero dot something, that means it’s Scalar, the old TypeScript version. If it’s version one or more, it’s the new Rust version, and the gateway will send the new payment receipts to it.
Jim: Let’s say I have allocations running. If I leave them running and swap from the old service stack to the new one, will the gateway still allow me to settle the existing vouchers I have from the pre-Rust service?
Alexis: Yes, because indexer agent supports both. As long as you have the latest version of indexer agent, you’re fine.
calinah | GraphOps posted: Worth mentioning that indexer agent is still on the old repo though, not sure if that was clear: https://github.com/graphprotocol/indexer/releases (v0.21.4)
Jim: Is this setting the groundwork for other services? Is there anything else coming service-wise in the short term, or is that all way down the line?
Alexis: I think it’s way down the line; some of the work on an indexer service framework for new services has been deprioritized. I don’t know the current status, but having this Rust code in place will make it cleaner and easier to build new services. I’m just not sure what the priorities are in terms of what’s coming.
What is significant is that TAP also enables trustless gateways. It will allow new gateway providers, ones that are not Edge & Node and maybe not under The Graph umbrella at all. With more gateways, reliance on the Edge & Node gateway will be reduced, and there will be minimal risk for indexers doing business with unknown entities.
Derek | Data Nexus posted: Has there been any exploration into non-subgraph query TAP-based payments?
Alexis: There has been exploration, but I don’t know how that’s going.
Jim: If I’m setting up on testnet, are there queries streaming on specific subgraphs or anything I should do to make sure I’m participating in the testing?
Alexis: I haven’t been running on testnet for a while, but Ana has been running on testnet recently.
Ana: I have been running on testnet, and all subgraphs should work. In terms of configuration, I can share the config map we’re running for TAP and the additional environment variables.
```yaml
config.toml: |
  [blockchain]
  chain_id = 421614
  receipts_verifier_address = "0xfC24cE7a4428A6B89B52645243662A02BA734ECF"

  [graph_node]
  query_url = "http://graph-node-query.graph-arbitrum-sepolia.svc.cluster.local.:8000"
  status_url = "http://graph-node-block-ingestor:8030/graphql"

  [indexer]
  indexer_address = "indexer_Address"

  [service]
  host_and_port = "0.0.0.0:7600"
  serve_escrow_subgraph = true
  serve_network_subgraph = true

  [service.tap]
  max_receipt_value_grt = 0.001

  [subgraphs.escrow]
  query_url = "https://gateway-arbitrum.network.thegraph.com/api/$YOUR_API_KEY/subgraphs/id/7ubx365MiqBH5iUz6XWXWT8PTof5BVAyEzdb8m17RvbD"
  syncing_interval_secs = 60

  [subgraphs.network]
  query_url = "https://gateway-arbitrum.network.thegraph.com/api/$YOUR_API_KEY/subgraphs/id/3xQHhMudr1oh69ut36G2mbzpYmYxwqCeU6wwqyCDCnqV"
  recently_closed_allocation_buffer_secs = 100
  syncing_interval_secs = 60

  [tap]
  max_amount_willing_to_lose_grt = 0.03

  [tap.rav_request]
  max_receipts_per_request = 1000
  request_timeout_secs = 5
  timestamp_buffer_secs = 60
  trigger_value_divisor = 100

  [tap.sender_aggregator_endpoints]
  0xC3dDf37906724732FfD748057FEBe23379b0710D = "https://tap-aggregator.testnet.thegraph.com/"

  [metrics]
  port = 7300
```
The same config file is used for both tap-agent and indexer-service.
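As a rough sketch of how the two components consume this file (the binary names and the --config flag are assumptions based on the indexer-rs repo layout, not confirmed in the call):

```bash
# Both components read the same TOML config (sketch; names assumed)
indexer-service-rs --config /etc/indexer/config.toml
indexer-tap-agent --config /etc/indexer/config.toml
```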