TL;DR: In this week’s Indexer Office Hours, special guest John Graber, Principal DBA at Edge & Node, returns to answer more indexer questions about PostgreSQL. The session covers index design for optimal performance, optimizing PostgreSQL on NVMe, sharding frameworks, key monitoring metrics, and more.
Opening remarks
Hello everyone, and welcome to episode 173 of Indexer Office Hours!
GRTiQ 184
Catch the GRTiQ Podcast with Max Scruffur, Head of Research at WWVentures, a venture capital firm with a unique focus on social capital and web3 investments.
Repo watch
The latest updates to important repositories
Execution Layer Clients
- sfeth/fireeth: New release v2.7.1:
- This release bumps Substreams and firehose-core versions.
- Arbitrum-nitro: New release v3.1.2:
- This release fixes a crash from v3.1.1 for chains imported from Arbitrum Classic, introduces default entrypoint flags that must be replicated if overridden, and includes a validator-specific image with a different entrypoint. It's crucial to run the Nitro node against trusted databases to avoid potential security risks.
Consensus Layer Clients
Information on the different clients
- Nimbus: New release v24.8.0:
- Nimbus v24.8.0 is a low-urgency release focused on performance improvements, including a 25% speed increase in processing blocks with deposits, and various fixes such as adding timeouts to failed execution layer requests and correcting fork digests in message broadcasts; users are advised to update within the regular two-week cycle.
Launchpad Stack
- New chart versions released with enhanced features and bug fixes:
- Proxyd
- Arbitrum-nitro
- New stable versions of Ethereum, Gnosis, Polygon, Arbitrum, Celo, Monitoring, Postgres-Operator, and Ingress
Issues:
- Launchpad Charts Issues: View or report issues
- Launchpad Namespaces Issues: View or report issues
Protocol watch
The latest updates on important changes to the protocol
Forum Governance
- Check the Chain Integrations Tracking Doc for updates
Contracts Repository
- Subgraph Service – Unit tests #1022 (open)
- test: wip refactor unit tests for staking fns #1021 (open)
- fix: make fisherman deposit a fixed value (OZ L-01) #1002 (open)
Open discussion [8:35]
John Graber, Principal DBA at Edge & Node, returns to answer more questions about PostgreSQL.
Some of the content below is directly from John’s Markdown presentation. Some of the comments have been lightly edited and condensed.
Questions collected from indexers (continued)
How should we approach index design for optimal performance? Is there a basic practice to step through? [9:54]
Indexers will dial this in as they gain more experience, but at the basic intro level, these are the steps to go through:
- Identify slow queries.
- Use explain analyze to understand what the planner did.
- Paste explain analyze output in: https://explain.depesz.com/.
- What step(s) of the plan are slowest / the highest cost?
- Try adding an index for the steps where one is not being used.
- Generally speaking, just start with a default b-tree on the column in question.
- As you become more advanced, you will find scenarios where different types of indexes, partial or compound indexes, etc. will prove useful.
- For example, if two columns are involved, then you may want to add a composite index on the two columns combined.
- Perhaps you're only querying the last X months of data on a particular column, like a timestamp. In that case you don't need a massive index on the entire table; you can add a partial index covering only the rows in that column that fit the criteria.
- One thing that sometimes confuses me is when I add an index on something and it refuses to use the index even though I’m pretty sure that’s the one it should use. In a recent example, I found the statistics for the column were set to the default for the entire database. I think by default the statistics target is 100, and so I started edging it up and I got the statistics set to 1,000 for this particular column. So that means when the background workers come along to do the statistics on this column, they’re going to look at it a lot more closely and in detail, and that’s expensive. But in this case, it turned out to be what needed to happen to take the query time down from 3 minutes to under a second (without a big overload on the database).
- Internally, Edge & Node adds the prefix manual_ to index names it has added, to differentiate them.
- Run explain analyze again to confirm the planner is using the new index.
- It's important to drop the index if it did not improve the query. (A rough SQL sketch of these steps follows this list.)
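A rough SQL sketch of those steps, using a hypothetical subgraph schema sgd42 and a hypothetical transfer table (all names and values are illustrative, not Edge & Node's actual schemas):

```sql
-- 1. See what the planner actually did for a slow query.
EXPLAIN ANALYZE
SELECT * FROM sgd42.transfer WHERE recipient = '0xabc';

-- 2. Start with a default b-tree on the filtered column.
--    CONCURRENTLY avoids blocking writes on a live table.
CREATE INDEX CONCURRENTLY manual_transfer_recipient
    ON sgd42.transfer (recipient);

-- 3. If two columns are always filtered together, a composite index may help.
CREATE INDEX manual_transfer_recipient_block
    ON sgd42.transfer (recipient, block_number);

-- 4. If only recent rows are queried, a partial index keeps the index small
--    (assumes "timestamp" is stored as an integer epoch here).
CREATE INDEX manual_transfer_recent
    ON sgd42.transfer ("timestamp")
    WHERE "timestamp" > 1704067200;

-- 5. If the planner still ignores a sensible index, raise the statistics
--    target on that column (the default is 100) and re-analyze.
ALTER TABLE sgd42.transfer ALTER COLUMN recipient SET STATISTICS 1000;
ANALYZE sgd42.transfer;

-- 6. Re-run EXPLAIN ANALYZE, and drop anything that did not help.
DROP INDEX IF EXISTS sgd42.manual_transfer_recipient_block;
```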
What are some advanced techniques for optimizing PostgreSQL performance on NVMe? [15:32]
Here's what the community is doing (an example of applying these settings follows the list):
effective_io_concurrency
- Set this to a higher value (e.g. 200–300) to allow Postgres to utilize the parallel I/O capabilities of NVMe.
random_page_cost
- Drop this from the default (4.0) to between 1.1 and 1.5 to reflect the faster random read performance of NVMe.
seq_page_cost
- Adjust this to a value close to random_page_cost to reflect the fast sequential read performance.
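Applied as settings, those suggestions could look like the sketch below; the values come from the list above and should be confirmed against your own hardware and benchmarks:

```sql
-- NVMe-oriented planner and I/O settings discussed above (verify for your hardware).
ALTER SYSTEM SET effective_io_concurrency = 200;
ALTER SYSTEM SET random_page_cost = 1.1;
ALTER SYSTEM SET seq_page_cost = 1.0;  -- keep close to random_page_cost on NVMe

-- These parameters can be picked up without a restart.
SELECT pg_reload_conf();
```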
What sharding frameworks have you played with and have any specific scaling strategies offered better consistency of sync and query performance over others? [16:28]
- At present, Edge & Node does not have a specific set of criteria for what subgraphs go on which shards.
- As things have grown quickly, the team fills a Postgres database with subgraphs, and when that database is full, creates a new one.
- Edge & Node intends to evaluate Citus (a Postgres extension for distributed architecture).
- Citus has built-in support for sharding by schema, which would potentially fit our use case very well.
- (For what it's worth, among many useful features, it should make moving subgraphs around for load-balancing purposes easier. A rough sketch of schema-based distribution follows this list.)
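As a very rough illustration of what schema-based sharding could look like, assuming Citus 12 or later and a hypothetical subgraph schema sgd42 (the function and setting names should be verified against the Citus documentation; this is not something Edge & Node runs today):

```sql
-- Assumes Citus 12+ on a coordinator node; sgd42 is a hypothetical subgraph schema.
CREATE EXTENSION IF NOT EXISTS citus;

-- Distribute one existing schema across the cluster as a unit.
SELECT citus_schema_distribute('sgd42');

-- Or make every newly created schema a distributed schema by default.
SET citus.enable_schema_based_sharding TO on;
```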
What techniques do you suggest for scaling PostgreSQL horizontally?
- See the above discussion on sharding.
- Monitor subgraphs to see which ones are experiencing heavy usage. Consider moving other subgraphs from that database, or setting up a dedicated shard for that subgraph.
How can we optimize PostgreSQL for our specific workload type? [21:09]
- To be determined.
- Only just beginning to understand the workload patterns at Edge & Node.
- Right now, I’m trying to dial in the basic settings that we talked about last week, getting that foundation in shape with PostgreSQL and my reporting with pgBadger. This can create a better understanding of read/write bandwidth to help tune databases for relevant workloads.
- Perhaps we will come up with tools for tables that are updated constantly throughout the day. For example, we could create some sort of scheduled job (e.g., a cron job) that specifically vacuums that table every 24 hours or every week. Maybe the job even does the heavy stuff like VACUUM FULL. (One possible shape of such a job is sketched after this list.)
- Beware: VACUUM FULL can lock up a database if used improperly. See "VACUUM FULL doesn't mean 'VACUUM, but better'" by Robert M. Haas.
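One possible shape for such a scheduled job, assuming the pg_cron extension is available; the schema and table names are hypothetical, and plain VACUUM (ANALYZE) is used rather than VACUUM FULL to avoid the locking caveat above:

```sql
-- Assumes the pg_cron extension is installed; sgd42.transfer is hypothetical.
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Vacuum and re-analyze a heavily updated table every night at 03:00.
-- Plain VACUUM does not take the exclusive lock that VACUUM FULL does.
SELECT cron.schedule(
    'nightly-vacuum-transfer',
    '0 3 * * *',
    'VACUUM (ANALYZE, VERBOSE) sgd42.transfer'
);
```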
Can we implement effective caching strategies in PostgreSQL? [23:11]
- PostgreSQL is doing caching of its own in terms of most recently used blocks, dirty blocks that need to be written to disk, buffer hits, and other items it's managing. So there aren't separate caching strategies from my perspective. I may learn something about how we use graph node later that changes that, but there's not really a caching strategy with PostgreSQL that I would try to use. My strategy is simply the parameters we talked about last week: work_mem, shared_buffers, effective_cache_size. (A quick way to check how well that built-in caching is working is shown after this list.)
- If I recall correctly, David Lutterkort previously discussed caching within graph node.
- Here’s the IOH #157 recap where he talked about query caching and performance.
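For reference, one quick way to see how well that built-in caching is behaving is PostgreSQL's own per-database counters; this is a generic check against pg_stat_database, not a graph node metric:

```sql
-- Rough buffer cache hit ratio per database; consistently low values can point
-- at shared_buffers / effective_cache_size being undersized for the workload.
SELECT datname,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_database
ORDER BY cache_hit_pct NULLS LAST;  -- worst hit ratios first
```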
What are the key metrics to monitor for PostgreSQL performance, both for syncing and serving queries? Do you have dashboards for them? [24:28]
If this question is referring to syncing a blockchain, I can’t answer that.
In terms of serving queries, this is the continuous improvement approach I was talking about last week where I’m using tools like pgBadger to chew through the logs and find my problem-child queries and see if I can improve them by adding indexes or changing the work_mem. I’ll make general observations and investigate if something is using up a lot of CPU or memory or the storage is growing too quickly.
For dashboards, AWS RDS and Google Cloud SQL each provide their own. I'm starting to go through what our SRE team has developed in-house in Grafana so they can monitor CPU, query time, etc. I'm also interested in trying tools like pgMonitor from the suite of tools I mentioned last week.
I'll use the cloud service dashboards to see if there are any long-running queries or vacuums, but what we need to start building now is more alerting around that, because I don't want to find out accidentally that I had an autovacuum running for 20 hours. I want to know if it took 30 minutes or even less. I want to know about that kind of stuff proactively.
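For the alerting piece, a couple of plain catalog queries along these lines can be polled from a dashboard or an alerting job; this is a generic sketch, not the Edge & Node Grafana setup:

```sql
-- Queries that have been running longer than 5 minutes.
SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 minutes'
ORDER BY runtime DESC;

-- Progress of any vacuums currently running (including autovacuum workers).
SELECT p.pid, a.query_start, p.relid::regclass AS table_name, p.phase
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a USING (pid);
```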
Vince | Nodeify posted: This is a cool tool but expensive: pganalyze
John | Edge & Node: I used pganalyze once. I wouldn’t dissuade anyone from taking a look at it. It is a paid tool, and I didn’t love it because I came from the pgBadger crew and pgBadger is free. But this was quite a few years ago and I was working as a contractor at the time, so I didn’t have the ability to influence how we used it and that might have added to my frustration.
Questions from the chat [29:44]
Vince | Nodeify posted: I’ve sharded up a Postgres. What should pool sizes look like for metadata, block cache shards? Are you running 100’s on those as well as the subgraph shards?
John confirmed Vince is talking about the connection pooling that graph node software does.
John: Graph node has connection pooling built in, and there have been some attempts in the past to use tools like pgBouncer in between, but it didn't work out, so all we've got right now is the setting in graph node. At any given time, across all of our shards, we have anywhere from 400 to 800 open connections to a given instance of Postgres, but all the instances have a max connection setting of 1,000. This concerns me. More connections is not better. I'm trying to get us to a point where we're at a ratio of 2-3 times the number of CPUs for max connections. Currently, out of 400 connections, only 30 will not be in the idle state. These connections remain open, doing nothing but consuming resources (around 14-16MB of RAM each).
David Lutterkort shared with me that graph node should be dropping unused connections automatically, but it looks like it's not doing that.
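A quick way to see how many of those connections are sitting idle is to group pg_stat_activity by state; this is a generic check, independent of graph node:

```sql
-- Count client connections per application and state; 'idle' rows are open
-- connections consuming memory while doing no work.
SELECT application_name, state, count(*) AS connections
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY application_name, state
ORDER BY connections DESC;
```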
Derek | Data Nexus posted: We end up getting a lot of idle sessions.
Mickey | E&N posted: On the flip side, some queries were constantly getting killed because our high traffic nodes thought the shard was struggling (when the connection pool was too small). It’s an art to find the right balance for sure. I think we picked that 1k connection number arbitrarily.
Vince: My main question with splitting up metadata and block cache, and then scaling horizontally with subgraph shards is related to being in a limited-resource world, not in a cloud environment. Where do I want to spend most of my time giving the horsepower to? The subgraph shards? The block cache shards? The metadata? What’s going to benefit me most?
John: On the block cache side, there may be other people on the call that can answer better than me. Mostly I’ve been focused on the subgraph shards because they seem to be where the heavy workload is, so that’s where I’ve been looking.
Derek: I think ultimately those probably do different things. Block cache is going to help with syncing speeds. If it’s slow to reference or read, that could be bad for your chainhead proximity. If your chainhead proximity isn’t a problem, I wouldn’t try beefing up the block cache much.
Vince posted: So metadata is the low hanging fruit?
Derek: When you say low hanging fruit, it really depends on what you’re trying to optimize because we can optimize for syncing subgraphs faster, for having really fresh data, or for faster reads. Each person is going to be different. I recommend looking at the quality of service subgraph and getting a sense of how your latency compares to others vs. how your chain head proximity compares to others. You’d need to choose what you want to focus on optimizing and then get back to your other improvements later.
Hau | Pinax posted: So block cache should be sharded + scaled horizontally too? (more chains in the future)
Jim | Wavefive 🌊 posted: They get big fast, so yes.
Vince posted: More chains + more cache. Plus they just get bigger.
Hau posted: Which one gets bigger faster? Subgraph shard vs. block cache shard?
Jim posted: Block cache is my biggest database.
John: Are there situations where you’re trying to write data really fast, but you don’t care if it gets lost because you can recreate it? For example, creating a table with no logging or turning off sync. These are riskier things, but you can make it write really quickly if you don’t care.
Derek: When we’re trying to sync a brand new subgraph that just got deployed, like for an application that’s been around for four or five years, you have to go through a whole bunch of the chain’s history in order to compile all that data. Until you get to the chain head, you’re not really expecting any reads. I’d be really curious to start looking at how we can expedite that initial sync.
I know there have been some community efforts looking into turning JIT off or on.
The other one that I’ve played with is turning off all indexes except for your primary key and the ID field, syncing it to the chain head, and then rebuilding all those indexes. I think that could be a way to speed things up. It would be really cool to have a trigger where, as soon as you change the sync column to true, it executes a script to go and rebuild all of your indexes.
Those are things that I like to play with, but one thing that I’ve had a lot of difficulty with is having good benchmarking.
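As a hedged sketch of those "sync fast, harden later" ideas, here is roughly what they look like in plain SQL; the table and index names are hypothetical, and this is not how graph node actually manages its schemas:

```sql
-- Risky settings: only for data you are willing to recreate by resyncing.

-- Skip waiting on WAL flushes for this session while bulk-writing.
SET synchronous_commit = off;

-- Or skip WAL entirely for a table you can rebuild from scratch.
ALTER TABLE sgd42.transfer SET UNLOGGED;

-- Drop secondary indexes before the initial sync to speed up writes...
DROP INDEX IF EXISTS sgd42.manual_transfer_recipient;

-- ...then restore durability and rebuild indexes once the subgraph reaches chain head.
ALTER TABLE sgd42.transfer SET LOGGED;
CREATE INDEX CONCURRENTLY manual_transfer_recipient
    ON sgd42.transfer (recipient);
```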
Testing results: account-like [44:21]
Derek | Data Nexus: Jim and I tested the account-like query that you can run against your database for potential performance gains. It looks for tables with over 3 million records with at least a thousand versions per entity.
Overall, it showed a pretty good improvement from what we've seen. I'm curious whether anyone is going to implement this; if so, send me the list of deployments, preferably as a distinct list, and I will monitor some stats on your specific indexer across the quality of service subgraph. We can see what day you applied it, what your average latency started at and what it went to, and whether it had a positive effect.
Feel free to give it a try and let me know any feedback.
account-like query
select s.schemaname as schema_name, c.relname as table_name
from pg_namespace n, pg_class c, pg_stats s
join subgraphs.subgraph_deployment sd on sd.id::text = substring(s.schemaname,4,length(s.schemaname))
left join subgraphs.table_stats ts on ts.deployment = sd.id and s.tablename = ts.table_name
where c.relnamespace = n.oid
and s.schemaname = n.nspname
and c.relname not in ('poi2$','data_sources$')
and s.attname = 'id'
and c.relname = s.tablename
and c.reltuples::int8 > 3000000
and coalesce(ts.is_account_like,false) = false
and case when c.reltuples = 0 then 0::float8 when s.n_distinct < 0 then (-s.n_distinct)::float8 else greatest(s.n_distinct, 1)::float8 / c.reltuples::float8 end < 0.001
order by s.schemaname, c.relname;