Intermediate

Deduplicating data is one of the most common problems when dealing with analytical databases like ClickHouse. Here you’ll learn several strategies for it.

OLAP databases like ClickHouse are optimized for fast ingestion and, for that to work, some trade-offs have to be made. One of them is the lack of unique constraints, since enforcing them would add a big overhead and make ingestion speeds too slow for what’s expected from a database of this kind.
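
For instance, a plain MergeTree table will accept the same primary key value any number of times without complaint. Here is a minimal sketch of that behaviour (the table and column names are made up for illustration):

  CREATE TABLE events
  (
      id UInt64,
      value String
  )
  ENGINE = MergeTree
  ORDER BY id;

  -- Both inserts succeed: in ClickHouse the primary key is a sorting key,
  -- not a uniqueness constraint.
  INSERT INTO events VALUES (1, 'first');
  INSERT INTO events VALUES (1, 'second');

  -- Returns two rows for id = 1.
  SELECT * FROM events WHERE id = 1;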

Nonetheless, there will be plenty of times when you have duplicated data and want to get rid of it, or just want to access the latest data point available. There are several ways to go about it, but the general workflow is the same: ingest all of the data first, duplicates included, and deal with them later.

The problem

Typically, there are two use-cases where you’ll end up with duplicated data:

  • Upserts. An upsert is an operation that inserts rows into a database table if they do not already exist, or updates them if they do. On databases like Postgres that'd be accomplished with an INSERT ... ON CONFLICT DO UPDATE clause, but since ClickHouse doesn't enforce uniqueness of primary keys, such clauses aren't supported either. The way to do upserts on ClickHouse is with a ReplacingMergeTree engine; more on this later, and a minimal sketch follows this list.

  • Constant ingestion and historical data. Imagine you are periodically dumping data from your transactional database into ClickHouse to run analytical queries on it in real time, as described here. You'd end up with lots of rows inserted at different times with the same primary key. If you only want the latest data point, you can get rid of the rest and treat them like upserts. But there are also use-cases where you want to keep all the data, to have a historical record of how an object's attributes evolve over time.
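
To make the upsert case concrete, here is a minimal sketch of a ReplacingMergeTree table (the posts_views table and its columns are hypothetical, not the dataset used below). Rows that share the same ORDER BY key are collapsed during background merges, keeping the row with the highest value in the version column:

  CREATE TABLE posts_views
  (
      post_id    UInt64,
      views      UInt64,
      updated_at DateTime
  )
  ENGINE = ReplacingMergeTree(updated_at)
  ORDER BY post_id;

  -- Both rows are stored at first; the older one is dropped whenever
  -- the parts containing them are merged in the background.
  INSERT INTO posts_views VALUES (1, 100, '2023-01-01 00:00:00');
  INSERT INTO posts_views VALUES (1, 150, '2023-01-02 00:00:00');

  -- FINAL forces deduplication at query time, at a cost.
  SELECT * FROM posts_views FINAL WHERE post_id = 1;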

A real-world example

We've created a dataset that resembles what a social media analytics company would use to track page views of posts over time. It contains data for about 10,000 posts, for every day of a year. It looks like this: