Summary

To reduce AWS infrastructure costs, we are migrating the ClickHouse database from an EKS cluster to on-premises servers. The migration covers the entire database, including all tables and data, to ensure continuity of service and data integrity. The AWS workloads that feed data into ClickHouse should also be migrated to the on-premises servers.

Database migration steps

  1. Set up a new ClickHouse instance on-premises, ensuring it is properly configured and optimized for performance.

  2. Set up the workloads on the on-premises servers, ensuring they are properly configured, have a working connection to the new ClickHouse instance, and are ready to handle the incoming data.

  3. Prepare traffic mirroring to replicate data from the AWS ClickHouse instance to the on-premises instance.

  4. Enable traffic mirroring and monitor the replication process to ensure data consistency and integrity.

  5. Wait for a couple of full days of traffic mirroring to ensure that the on-premises setup is capable of handling the load and that all data is being accurately replicated.

  6. Validate and compare NEW data between the AWS and on-premises ClickHouse instances to ensure they are in sync:

    • Take the first full day after traffic mirroring was enabled as the reference date (let’s call it {DATE}), since this is the first full day for which both databases should contain the same data.
    • Compare channels_stat_daily table on both databases:
    select toDate(datetime), sum(sent_count), sum(delivered_count), sum(unique_click_count), sum(subscribe_count), sum(unsubscribe_count) from channels_stat_daily where toDate(datetime) >= '{DATE}' group by toDate(datetime);
    • Compare page_events table on both databases:
    select toDate(timestamp), sum(visit_count) from page_events where toDate(timestamp) >= '{DATE}' group by toDate(timestamp);
    • Compare subscriber_sources_activities table on both databases:
    select toDate(datetime) as date, sum(delivered_count), sum(clicked_count) from subscriber_sources_activities where datetime >= '{DATE}' group by date;
    • Compare subscription_events table on both databases:
    select toDate(timestamp) as date, sum(subscribe_count), sum(unsubscribe_count) from subscription_events where toDate(timestamp) >= '{DATE}' group by date;
    • Compare sender_firebase_statistics table on both databases:
    select date, sum(delivered_count) as delivers, sum(clicked_count) as clicks from sender_firebase_statistics where date >= '{DATE}' group by date;
    • Compare firebase_errors table on both databases:
    select toDate(timestamp) as date, uuid, sum(count) from firebase_errors where toDate(timestamp) >= '{DATE}' group by date, uuid order by date;
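    The queries above are run on both instances and the results compared. Optionally, the AWS-side aggregates can be pulled from the on-premises server in a single query via the remote() table function; the host, port, database, and credentials below are placeholders:
    -- Sketch: read the AWS-side aggregates from the on-premises server through remote(),
    -- so both result sets can be compared in one place. Connection details are placeholders.
    select toDate(datetime) as date, sum(sent_count), sum(delivered_count)
    from remote('aws-clickhouse.internal:9000', 'default', 'channels_stat_daily', 'default', '{PASSWORD}')
    where toDate(datetime) >= '{DATE}'
    group by date
    order by date;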
  7. Build the list of tables that need historical data backfilled:

    • Get the list of tables (target_table) that are populated by materialized views (clickhouse-materialized-structure). Use the second query there, which excludes views that ingest data from Kafka streams (a sketch of such a query is shown at the end of this step).
    • Add the following 'raw' tables to the list:
      • ** page_events **
      • ** subscription_events **
      • subscriber_sources_activities
      • firebase_errors
      • firebase_subscription_errors
    • The highlighted tables should be backfilled with their materialized views disabled, so we don’t duplicate data (see step 8).
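    A minimal sketch of such a listing query against system.tables is shown below; it assumes the Kafka-fed views are exactly the ones registered as dependents of Kafka engine tables, and it is not a substitute for the query in clickhouse-materialized-structure:
    -- Sketch: materialized views and their target tables, excluding views that
    -- consume directly from Kafka engine tables (their raw target tables are added
    -- to the list explicitly above).
    select name as view_name, extract(create_table_query, '\\sTO\\s+(\\S+)') as target_table
    from system.tables
    where engine = 'MaterializedView'
      and database = currentDatabase()
      and name not in (
          select arrayJoin(dependencies_table)
          from system.tables
          where engine = 'Kafka' and database = currentDatabase()
      );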
  8. Now that the new data is validated and we have the list of tables to backfill, backfill the historical data from the AWS ClickHouse to the on-premises ClickHouse.

    • Data older than {DATE} should be deleted from these tables:
    ALTER TABLE {TABLE} DELETE WHERE toDate(datetime) < '{DATE}';
    • Create a snapshot of each table with FREEZE:
    ALTER TABLE {TABLE} FREEZE PARTITION {PARTITION};

    ** This approach is suitable for all tables that are partitioned by toYYYYMM(datetime), which should be the majority of them **
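    The partition key of each table can be checked up front via system.tables:
    -- Confirm which tables are partitioned by toYYYYMM(datetime) before relying on this approach.
    select name, partition_key
    from system.tables
    where database = currentDatabase() and engine like '%MergeTree%'
    order by name;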

    After the snapshot is created, it will be available in the /var/lib/clickhouse/shadow/ directory on the AWS ClickHouse server. We can use the WITH NAME clause to give the snapshot a custom name.
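    For example (the snapshot name 'onprem_migration' is arbitrary):
    -- With a custom name, the snapshot is written to /var/lib/clickhouse/shadow/onprem_migration/
    -- instead of an auto-incremented directory such as /var/lib/clickhouse/shadow/1/.
    ALTER TABLE {TABLE} FREEZE PARTITION {PARTITION} WITH NAME 'onprem_migration';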

    • Copy the snapshot files from the AWS ClickHouse server to the on-premises ClickHouse server using scp or any other tool
    • Copy the data from the data/database/table/ directory inside the backup to the /var/lib/clickhouse/data/database/table/detached/ directory on the on-premises ClickHouse server (https://clickhouse.com/docs/sql-reference/statements/alter/partition#attach-partitionpart)
    • Run ALTER TABLE {TABLE} ATTACH PARTITION for every table where it is safe to do so (i.e. all except the highlighted ones: page_events and subscription_events)
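    A sketch of this step for a single table (the partition value 202401 is only an example for a toYYYYMM partition key):
    -- List the copied parts waiting in the detached directory.
    select partition_id, name from system.detached_parts where table = '{TABLE}';
    -- Attach one partition; repeat for every partition present in the backup.
    ALTER TABLE {TABLE} ATTACH PARTITION 202401;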
    • The highlighted tables need to be backfilled with their materialized views disabled (a condensed sketch of the full sequence is shown after these sub-steps):
      • Find all materialized views that are populating from these tables:
        • page_events_analytics for page_events
        • subscription_events_analytics, subscription_events_channel_events, subscription_events_sender_firebase, subscription_events_window_events for subscription_events
      • Back up their definitions using the SHOW CREATE TABLE {TABLE} command
      • Also back up the definitions of the Kafka stream views:
        • page_events_stream_to_table for page_events
        • subscription_stream_to_table for subscription_events
      • Drop the Kafka stream views (page_events_stream_to_table and subscription_stream_to_table) to stop new data from being added to the base tables
      • Drop materialized views we found earlier
      • Run ALTER TABLE t ATTACH PARTITION for these tables
      • Recreate the materialized views using the definitions we backed up earlier
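      A condensed sketch of this sequence for page_events (subscription_events follows the same pattern with its own views; the partition value is only an example):
      -- 1. Save the view definitions (store the output somewhere safe).
      SHOW CREATE TABLE page_events_analytics;
      SHOW CREATE TABLE page_events_stream_to_table;
      -- 2. Stop Kafka ingestion and the downstream materialized view.
      DROP VIEW page_events_stream_to_table;
      DROP VIEW page_events_analytics;
      -- 3. Attach the backfilled partitions (repeat per partition).
      ALTER TABLE page_events ATTACH PARTITION 202401;
      -- 4. Recreate both views from the saved CREATE statements, recreating the Kafka stream view last
      --    so no rows reach page_events while page_events_analytics is missing.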
  9. Final validation. Execute step 6 again, but this time compare the full data sets (or at least a broader date range) on both databases to ensure they are fully in sync.
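    For example, the full-history version of the channels_stat_daily comparison:
    select toDate(datetime), sum(sent_count), sum(delivered_count), sum(unique_click_count), sum(subscribe_count), sum(unsubscribe_count) from channels_stat_daily group by toDate(datetime) order by toDate(datetime);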