Substreams Sampler: How to Bulk-Load Blockchain Data into ClickHouse

Substreams By Jul 16, 2024 No Comments
TL;DR: Substreams and ClickHouse revolutionize blockchain data handling, as demonstrated by Pinax co-op students Maryam and Syed in their insightful bulk-loading project.

Handling and analyzing vast amounts of blockchain data is a challenge in a rapidly evolving digital landscape. Substreams and ClickHouse are two powerful technologies that promise to revolutionize the way we manage and analyze blockchain data.

In this post, we delve into an insightful video featuring Maryam and Syed, co-op students at Pinax, who walk us through their project of bulk-loading blockchain data into ClickHouse.

What is Substreams?

Substreams is a specialized indexing technology designed for blockchain data. It allows you to categorize and segment blockchain information, making it easier to handle and analyze.

With Substreams, developers can extract specific data types from blockchains, apply customized transformations, and send data to various destinations, known as sinks. This powerful tool is essential for effective blockchain data management and analysis.

A screenshot of a presentation slide that reads: What are Substreams? A way to get specific information from blockchain data (rather than processing an entire block). Developed by StreamingFast team. Role: Powerful tool for managing, analyzing, and using blockchain data effectively. Capabilities: Categorize and segment data, extract specific types, and apply custom transformations. Use: Data is sent to destinations like databases or files, known as sinks.

Understanding ClickHouse

ClickHouse is an open-source columnar database management system optimized for real-time analytics. It’s renowned for its high-speed query performance, scalability, and robust support for handling large amounts of data. Its architecture, designed for analytical queries, makes it an ideal choice for processing and analyzing blockchain data.

To learn more about ClickHouse and its capabilities, refer to our Quickstart Guide: ClickHouse 101.

A screenshot of a presentation slide that reads: What is ClickHouse? It is an open-source columnar database management system. It is known for high-speed query performance, scalability, and robust support for large data. It's optimized for analytical queries and has efficient compression and query execution.

Syncing Substreams with ClickHouse

The Substreams SQL sink lets you stream outputs from Substreams modules into ClickHouse. It reduces manual configuration and setup time by specifying the database connection details and setting up the necessary tables within ClickHouse, allowing for more focus on data analysis.

For an in-depth look at how ClickHouse can simplify real-time blockchain analytics, check out Simplify Real-Time Blockchain Analytics with ClickHouse.

A screenshot of a presentation slide with the title: Substreams-sink-sql. Syncs Substreams modules to a ClickHouse database. Sets up databases with necessary tables automatically. Optimizes data insertion with high throughput rates for efficient data handling.

The data pipeline

The data pipeline consists of several stages:

  1. Extract: Firehose extracts blockchain data into flat files.
  2. Transform: Substreams modules transform Firehose blocks and cache data on a server.
  3. Load: Substreams SQL sink loads data from Substreams backend into ClickHouse.
  4. Query: Execute SQL queries on the ClickHouse database for analysis.
A visual of the list above it with Extract: Firehose streams block data and creates flat files. Transform: Substreams streams transformed data and outputs cache files. Load: Substreams sink stores the data however it makes sense for the application, e.g., Graph Node. Query: query layer queries according to how it is stored, e.g., Graph Node or anything else.
This pipeline ensures smooth integration and effective utilization of blockchain data for analytics.

Project highlights

Maryam and Syed aimed to quickly insert large amounts of blockchain data into ClickHouse. After exploring various methods, they found the most efficient approach was to buffer data locally and send it as large inserts using the Substreams SQL sink. They achieved impressive throughput rates, demonstrating the scalability and efficiency of their approach.

Looking ahead

The team plans to optimize their methods further, focusing on chains like EOS, WAX, and Telos. They intend to expand to more chains and explore other blockchain tech stacks and APIs, such as ERC-20 tokens and inscriptions.

Getting started

For those interested in trying out these tools, the GitHub repository for the Substreams sink SQL tool provides an example setup. You’ll need an API key, but a free tier allows you to follow along with the example and run the Substreams sink SQL on your own data.

Bottom line

The combination of Substreams and ClickHouse offers a powerful solution for managing and analyzing blockchain data. Maryam and Syed’s project is proof of the potential of these technologies, and their work provides a valuable guide for others looking to leverage these tools. For more detailed insights, you can watch the full video above.

Stay tuned for more updates and advancements in the exciting field of blockchain data management!

💡 This article answers questions like:
- What is Substreams?
- What is ClickHouse used for?
- How can a developer use Substreams with ClickHouse?
- What is a Substreams sink?

Author

I am a dedicated member of the Graph Advocates DAO and proud to be a part of the Graphtronauts community. As a passionate crypto investor and enthusiast, I have delved into the world of decentralized technologies, with a strong focus on The Graph protocol. My journey includes writing insightful blogs for Graphtronauts and contributing to the development of subgraph documentation for various projects within The Graph ecosystem. Most recently, I have taken on the role of a Pinax technical writer, further expanding my commitment to advancing the adoption and understanding of blockchain and Graph-based technologies. /n https://twitter.com/PaulBarba12 https://github.com/PaulieB14 https://hey.xyz/u/paulieb https://medium.com/@paulieb.eth/about

No Comments

Leave a comment

Your email address will not be published. Required fields are marked *