Tommaso Amici - How to load a local Parquet file in StarRocks


Published 2 years ago

A short guide on how to ingest data into StarRocks using Parquet files

TLDR at the bottom of the post

In the past few days, I've been investigating StarRocks, an open-source database for high-performance analytical workloads.

As a first step, I wanted to run some tests to validate that the performance claims actually hold true for our use case.

After trying and failing to insert roughly 75 million rows into the database, I concluded that the best approach would be to load a Parquet file.

If you don't know what Apache Parquet is, here's a summary from the project's website:

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

What this means in practice is that the SQLite database I had used to generate demo data takes up 7.9 GB, while the Parquet file that holds the exact same data is only 1.22 GB.

Unfortunately, the official documentation is not as detailed as I would like, and its search currently only returns pages about loading Parquet files from object storage services like AWS S3 or Google Cloud Storage.

TLDR

Finally, I managed to find this very helpful discussion on GitHub.

First, add a broker for loading local files:

alter system add broker local_load "127.0.0.1:8000";
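
To confirm that the broker was registered, a statement like the following should list it. This is a minimal sketch that assumes the `local_load` name used above; the exact output columns may differ between StarRocks versions.

```sql
-- List the registered brokers; local_load should appear with its host and port.
SHOW BROKER;
```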

Second, if you're running StarRocks with the official Docker image, as suggested in the quick start guide, copy the Parquet file into the running container:

docker cp demo.parquet <container_name>:/demo.parquet

Last, import the data into an existing table:

load label past_cabin_metrics_label (
    data infile(
        "file:///demo.parquet"
    ) into table demo_table format as "parquet"(
        col_1,
        col_2
    )
) with broker local_load properties("timeout" = "3600");
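
Broker loads run asynchronously, so the statement returning does not mean the data has landed in the table. A minimal sketch for checking on the job, assuming the label and table names from the example above:

```sql
-- Inspect the load job; its State should eventually read FINISHED
-- (or CANCELLED if the load failed).
SHOW LOAD WHERE LABEL = "past_cabin_metrics_label";

-- Once the job has finished, sanity-check the row count.
SELECT COUNT(*) FROM demo_table;
```

If the job ends up cancelled, the error details reported by `SHOW LOAD` are usually the first place to look.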

Since the documentation is open source, I will try to open a pull request to improve this section. In the meantime, maybe this can be useful to someone out there.