How to Compare Rows in Two CSV Files Using ClickHouse

Table of Contents

Introduction

Dealing with data often requires comparing rows across different CSV files. In this post, we will see how to compare rows in two CSV files using ClickHouse.

Before diving into the comparison, ensure you have clickhouse-local installed. If not, you can download it from the official website.

In this example, we will compare two CSV files: stores.csv and stores_2.csv. This files contain store metadata, including city, state, type, and cluster. You can download the files from Kaggle. The files are the same, but stores_2.csv has one row less than stores.csv.

Compare Two CSV Files

As always, we will use clickhouse-local for this kind of operation.

Number of Rows

First, let’s check the number of rows in each file:

WITH
    (
        SELECT count()
        FROM file('stores.csv')
    ) AS s1,
    (
        SELECT count()
        FROM file('stores_2.csv')
    ) AS s2
SELECT
    s1,
    s2,
    s1 = s2 AS equal
;

Output:

┌─s1─┬─s2─┬─equal─┐
54530└────┴────┴───────┘

As we can see, the number of rows is different.

Find the Differences

Now, let’s find the row that is in stores.csv but not in stores_2.csv:

SELECT *
FROM file('stores.csv') AS s1
LEFT JOIN file('stores_2.csv') AS s2 ON s1.store_nbr = s2.store_nbr
WHERE s2.store_nbr IS NULL

Output:

┌─store_nbr─┬─city──┬─state─────┬─type─┬─cluster─┬─s2.store_nbr─┬─s2.city─┬─s2.state─┬─s2.type─┬─s2.cluster─┐
10QuitoPichinchaC15 │         ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ    │ ᴺᵁᴸᴸ     │ ᴺᵁᴸᴸ    │       ᴺᵁᴸᴸ │
└───────────┴───────┴───────────┴──────┴─────────┴──────────────┴─────────┴──────────┴─────────┴────────────┘

Store the Differences

If needed, we can store the differences in a table: Copy

CREATE TABLE differences
ENGINE = Memory AS
SELECT *
FROM file('stores.csv') AS s1
LEFT JOIN file('stores_2.csv') AS s2 ON s1.store_nbr = s2.store_nbr
WHERE s2.store_nbr IS NULL
;

Or into a file:

SELECT *
FROM file('stores.csv') AS s1
LEFT JOIN file('stores_2.csv') AS s2 ON s1.store_nbr = s2.store_nbr
WHERE s2.store_nbr IS NULL
INTO OUTFILE 'differences.csv'
FORMAT CSV

That’s it, now you know how to compare rows in two CSV files using ClickHouse.