Pinax Datasets Offer Superior Simplicity with Parquet and S3

Pinax By Nov 19, 2024 No Comments

Last Updated on December 19, 2024 by Pinax Team

Discover how we’re building our datasets for exceptional efficiency, ease of use, and performance.

TL;DR: Pinax's datasets simplify blockchain data access using Parquet files and S3. This combination provides efficient storage, accelerated query performance, and seamless integration with various analytics tools. Raw blockchain data is structured using SQL schema, making it readily available for analysis. Our datasets offer a user-friendly solution for accessing and analyzing complex blockchain information across multiple industries.

In a previous blog post, we introduced Pinax’s datasets and how they will simplify access to blockchain data. In this post, we’ll dive into the tools and structures that make our datasets so powerful, like Parquet files and SQL schema. We’ll discuss how Pinax uses Parquet files for efficient data storage and the comprehensive SQL schema that underpins our raw blockchain datasets. Find out how we plan to empower data analysts with user-friendly, cutting-edge technology.

Overview of Pinax datasets

Pinax datasets make blockchain data easily accessible to remove barriers for people who want to analyze blockchain information but may not have specialized tools or infrastructure. We leverage Parquet files hosted on S3 to enable professionals across various fields—from macro analytics to AI chatbots and finance—to focus on insights rather than data wrangling.

We’ve designed our datasets to be compatible with existing analytics tools, allowing analysts to easily plug and play with their current setups. This means you can seamlessly integrate our datasets into your existing workflow using Amazon S3 as external tables. External tables are widely supported across various databases so that you can query data directly from S3 without extensive setup or custom infrastructure, which makes data management less complex.

Try our Ethereum dataset: free to try as a 30-day trial on Snowflake.

Parquet files for efficiency

Accessing, processing, and analyzing blockchain data can be challenging, especially as the technology continues to grow across industries, and data volume and complexity increase. Pinax can help with easy-to-use datasets that simplify access, designed with both technical and non-technical users in mind. Central to our streamlined approach are Parquet files—a powerful columnar storage format that improves efficiency and scalability, especially for large datasets.

But what exactly are Parquet files, and why are they so critical for accessing blockchain data?

What are Parquet files?

Parquet is a columnar storage file format widely used in data processing for its efficient compression and performance. Unlike row-based storage formats, which store data sequentially, Parquet optimizes storage and query performance by organizing data in columns, making it particularly effective for analytics.

An illustration of a nested schema, then of logical table representation with sequential rows, then of columnar layout with each column stored in its entirety. So instead of row being a1, b1, c1, a2, b2, c2 it's a1, a2, a3, b1, b2, b3, etc.
This illustration contrasts the row layout, where each row is stored sequentially, with the column layout, where each column is stored in its entirety (source).

Why are Parquet files crucial for blockchain data?

The columnar format of Parquet files makes them ideal for managing the complex, high-volume datasets in blockchain. By enabling selective column queries, Parquet improves query speed and reduces storage costs, offering a practical solution for handling terabyte-scale blockchain data.

Key benefits include:

  1. Efficient storage: Parquet compresses data without losing quality, reducing storage costs—especially beneficial for large blockchain datasets.
  2. Faster queries: Column-based storage allows analysts to access only needed data, enhancing performance for tools like Messari.
  3. Broad compatibility: Parquet integrates seamlessly with various analytics tools, supporting workflows across SQL, Python, R, and more.

How Pinax uses Parquet files

Pinax delivers blockchain data in a format designed for easy use:

  • Seamless integration: We provide raw blockchain data in Parquet format, allowing users to integrate data smoothly into existing pipelines.
  • Up-to-date insights: Datasets are updated daily, ensuring users have access to the latest blockchain data.
  • Enhanced usability: Parquet’s familiar format simplifies the analysis process, enabling analysts to focus on insights rather than infrastructure.

Pinax Pilot Program Update: Messari Mainnet Testing

Our datasets pilot program continues to shape up well, with new advancements aimed at refining data access and performance. Here’s the latest on our setup:

  • Data segregation & access options: With the S3 external table setup, analysts can manage data access independently. Options include copying S3 data into custom tables, connecting directly to the S3 external table, or using our pre-configured Snowflake views. Additionally, data can be downloaded locally for on-disk processing, providing flexibility in data handling.
  • Minimal involvement in custom requirements: This setup keeps Pinax’s role light; we focus primarily on ensuring data completeness and accurate schema definitions while analysts retain control over specific data needs.
  • Data sync & formatting: Automated S3 sync uploads ensure data is formatted for readability, allowing for the creation of custom views on top of datasets to better match analysis needs.
  • Performance optimization: We continue to test configurations to improve query performance over large datasets, including reorganizing partitions to enhance speed.

Expanding use cases for Pinax datasets

Pinax datasets have the potential to serve a range of applications:

  • Macro analytics: Platforms like Messari and Snowflake benefit from our datasets by seamlessly incorporating blockchain insights into broader market analyses.
  • Accounting: Blockchain records provide a transparent and immutable ledger, making them invaluable for accounting and audit purposes.
  • Blockchain forensics: Datasets can aid forensic investigators in tracing transactions, uncovering fraud, and monitoring suspicious activities on the blockchain.
  • AI chatbots and LLMs: Access to structured blockchain data serves as high-quality training data for AI models, particularly large language models (LLMs) aimed at understanding blockchain-related queries or performing analytics.
  • Finance analytics: DeFi applications can integrate Pinax data to display historical trading stats, lending insights for better-informed financial decisions.

Understanding raw blockchain data

Blockchain data comes in raw form, capturing all essential transactions and interactions on the network. Pinax provides access to extended foundational Ethereum data tables, such as blocks, transactions, extended traces, and logs. Our datasets allow for granular analysis, from tracking individual transactions to understanding trends across blocks, making them ideal for applications across various sectors.

A screenshot of EVM raw blockchain data, showing blocks, logs, transactions, and traces.
Comprehensive EVM blockchain data covering blocks, transactions, logs, and extended traces for in-depth analysis.

The role of SQL schema in structuring blockchain data

For efficient access, Pinax uses a robust SQL schema to structure blockchain data. This schema defines data structure within Parquet files, enabling streamlined data access and analysis across various applications. The EVM SQL schema provides detailed tables for tracking EVM data, including blocks, transactions, logs, and state changes. Each table captures granular data for in-depth analysis, helping users explore blockchain interactions efficiently.

To explore the SQL schema used in structuring Ethereum data, check out Pinax’s EVM SQL Schema on GitHub.

Practical applications of the SQL schema

To illustrate how this SQL schema can be applied in real-world analysis, here are some sample queries that demonstrate how to retrieve specific insights from blockchain data.

Active Users

-- This query calculates the number of unique active users per minute on a specific date.
SELECT
    date_trunc('minute', block_time) AS minute,
    count(distinct "from") AS user
FROM ethereum.transactions
WHERE block_date = '2024-10-01'
GROUP BY minute
ORDER BY minute ASC;

Top Contracts

-- This query retrieves the top 10 contracts with the most transactions on a specific date, ordered by transaction count.
SELECT
    "to" AS contract,
    count(*) AS transactions
FROM ethereum.transactions
WHERE block_date = '2024-10-01'
GROUP BY contract
ORDER BY transactions DESC
LIMIT 10;

ERC-20 Transfers

-- This query counts the total number of successful ERC-20 token transfers (using Transfer and TransferFrom functions) per day within a specified date range.
SELECT
    block_date,
    count(*) as total
FROM ethereum.traces
WHERE
    tx_success = true AND
    SUBSTR(input, 1, 10) IN ('0xa9059cbb', '0x23b872dd')  -- Transfer and TransferFrom
    AND block_date >= '2024-09-01' AND block_date <= '2024-09-07'
GROUP BY block_date
ORDER BY block_date;

Hosting with Snowflake for enhanced accessibility

Pinax datasets are available on Snowflake, a cloud-based data platform known for its ease of use and powerful integration capabilities. Snowflake Marketplace allows you to integrate blockchain data into your workflow with minimal setup.

Snowflake Marketplace will make our datasets available to a wider audience, facilitating greater data accessibility across industries.

Interested in trying our datasets?

Blockchain data is often seen as challenging to access and analyze, but Pinax datasets will change this mindset. Our focus is on user-friendly formats and easy access to make blockchain data accessible for multiple uses, ranging from finance analytics to AI model training.

If you’re interested in exploring how we can simplify blockchain data access for your needs, visit our website or reach out to us to learn more. Request a demo to see how we can transform your blockchain data insights.

Remember, our Ethereum dataset is free to try as a 30-day trial on Snowflake!

💡 This article answers questions like:
- How is Pinax building its blockchain datasets for ease of use and optimal performance?
- What types of industries and use cases will benefit from Pinax's datasets?
- What are Parquet files?
- Why are Parquet files suited for blockchain data?
- What's captured in raw blockchain data?
- What is the role of SQL schema in structuring blockchain data?
- How can I try a Pinax dataset for free?

Author

I am a dedicated member of the Graph Advocates DAO and proud to be a part of the Graphtronauts community. As a passionate crypto investor and enthusiast, I have delved into the world of decentralized technologies, with a strong focus on The Graph protocol. My journey includes writing insightful blogs for Graphtronauts and contributing to the development of subgraph documentation for various projects within The Graph ecosystem. Most recently, I have taken on the role of a Pinax technical writer, further expanding my commitment to advancing the adoption and understanding of blockchain and Graph-based technologies. /n https://twitter.com/PaulBarba12 https://github.com/PaulieB14 https://hey.xyz/u/paulieb https://medium.com/@paulieb.eth/about

No Comments

Leave a comment

Your email address will not be published. Required fields are marked *