Intermediate

Deduplicating data is one of the most common problems when dealing with analytical databases like ClickHouse. Here you’ll learn several strategies for it.

OLAP databases like ClickHouse are optimized for fast ingestion and, for that to work, some trade-offs have to be made. One of them is the lack of unique constraints, since enforcing them would add a big overhead and make ingestion speeds too slow for what’s expected from a database of this kind.
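
For instance, a plain MergeTree table will accept the same primary key value any number of times without complaint. Here is a minimal sketch of that behaviour (the table and column names are made up for illustration):

  CREATE TABLE events
  (
      id UInt64,
      value String
  )
  ENGINE = MergeTree
  ORDER BY id;

  -- Both inserts succeed: in ClickHouse the primary key is a sorting key,
  -- not a uniqueness constraint.
  INSERT INTO events VALUES (1, 'first');
  INSERT INTO events VALUES (1, 'second');

  -- Returns two rows for id = 1.
  SELECT * FROM events WHERE id = 1;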

Nonetheless, there will be plenty of times when you have duplicated data and want to get rid of it, or just want to access the latest data point available. There are several ways to go about it, but the general workflow is the same: ingest all of the data first, duplicates included, and deal with them later.

The problem

Typically, there are two use-cases where you’ll end up with duplicated data:

  • Upserts. An upsert is an operation that inserts rows into a database table if they do not already exist, or updates them if they do. On databases like Postgres that'd be accomplished with an INSERT ... ON CONFLICT DO UPDATE clause, but since ClickHouse doesn't enforce uniqueness of primary keys, such clauses aren't supported either. The way to do upserts on ClickHouse is with a ReplacingMergeTree engine; more on this later, and a minimal sketch follows this list.

  • Constant ingestion and historical data. Imagine you are periodically dumping data from your transactional database into ClickHouse to run analytical queries on it in real time, as described here. You'd end up with lots of rows inserted at different times with the same primary key. If you only want the latest data point, you can get rid of the rest and treat them like upserts. But there are also use-cases where you want to keep all the data, to have a historical record of how an object's attributes evolve over time.
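
To make the upsert case concrete, here is a minimal sketch of a ReplacingMergeTree table (the posts_views table and its columns are hypothetical, not the dataset used below). Rows that share the same ORDER BY key are collapsed during background merges, keeping the row with the highest value in the version column:

  CREATE TABLE posts_views
  (
      post_id    UInt64,
      views      UInt64,
      updated_at DateTime
  )
  ENGINE = ReplacingMergeTree(updated_at)
  ORDER BY post_id;

  -- Both rows are stored at first; the older one is dropped whenever
  -- the parts containing them are merged in the background.
  INSERT INTO posts_views VALUES (1, 100, '2023-01-01 00:00:00');
  INSERT INTO posts_views VALUES (1, 150, '2023-01-02 00:00:00');

  -- FINAL forces deduplication at query time, at a cost.
  SELECT * FROM posts_views FINAL WHERE post_id = 1;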

A real-world example

We've created a dataset that resembles what a social media analytics company would use to track page views of posts over time. It contains data for about 10,000 posts, for every day of a year. It looks like this: