I have a workflow that works nicely: it takes a stream of incremental data, transforms it, and then merges it with the DB Merge node into a SQLite DB table that contains previously pulled incremental data. Let’s call this Table One. It’s a great table. An excerpt of the workflow:

[Workflow excerpt image, 1363×565] (Credit to @mlauber71 for helping me through my initial DB questions.)

The dataset itself is a table of active real estate listings. However, a listing will have price updates throughout its lifecycle (the seller decides to raise or lower the list price) or other changes (new comments, updated dimensions, etc.). Each listing has a unique ID, called Listing Number, and I came to appreciate that I need a new table, which we should call Table Two.

Table Two, not yet created in any part of the workflow, should be a sort of “change tracker table” for every Listing Number. That way, I can pull up any Listing Number and see all the changes that have happened. The native dataset does include an “Updated Timestamp”. So Table Two could look like:

Listing#  Updated  Price  SqFt  Comments
001       Mon      $300   500
001       Tue      $350   500   Great!
002       Mon      $600   1000
003       Mon      $500
003       Wed      $500   800

Listing 002 is purposely left unchanged, in case that impacts the logic.
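
To be clear about the end goal: once Table Two exists, pulling the full change history for one listing should be a simple filtered query along these lines (the table and column names here are purely illustrative):

SELECT *
FROM change_tracker           -- hypothetical name for Table Two
WHERE ListingNumber = '001'
ORDER BY Updated;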

How would I go about efficiently creating a change tracker SQLite table?

@qdmt glad you liked the previous hints. You could just set up a SQLite database that collects all the entries, and then select the latest one per Listing with a window/rank function (the current SQLite driver bundled with KNIME supports this function). The latest results (per Listing) are then stored in a new table, my_tracking_table_02.
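
For reference, that “latest row per Listing” step would be a window/rank query roughly along these lines; the column names are assumptions based on the sample data, and my_tracking_table_01 is the append-only table created below:

DROP TABLE IF EXISTS my_tracking_table_02;
CREATE TABLE my_tracking_table_02 AS
SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (
               PARTITION BY ListingNumber        -- assumed listing key column
               ORDER BY my_Timestamp DESC        -- newest entry first
           ) AS rn
    FROM my_tracking_table_01 t
)
WHERE rn = 1;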

[Workflow screenshot, 1920×1256]

The key in this case is to set up the SQLite table so that it will have an automated PRIMARY KEY and an automated TIMESTAMP. The other columns are dynamically constructed from the sample file with the help of a KNIME Flow Variable (v_sql_string2).

CREATE TABLE IF NOT EXISTS `my_tracking_table_01` (
      ID INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL
 ,    $${Sv_sql_string2}$$
 ,    my_Timestamp DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL
);
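
For illustration only, once the flow variable is resolved the statement might expand to something like the following; the listing columns are assumptions taken from the sample data above:

CREATE TABLE IF NOT EXISTS `my_tracking_table_01` (
      ID INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL
 ,    ListingNumber TEXT        -- assumed column supplied via v_sql_string2
 ,    Price REAL                -- assumed column supplied via v_sql_string2
 ,    SqFt INTEGER              -- assumed column supplied via v_sql_string2
 ,    Comments TEXT             -- assumed column supplied via v_sql_string2
 ,    my_Timestamp DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL
);

Inserts into this table can then simply omit ID and my_Timestamp; SQLite fills both in automatically.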

This is also how big data systems handle their data operations: just append all data and then select the latest version. You might have to check whether SQLite can handle the volume; maybe you have to use one database per month/quarter/year. And also think about backups.
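
Should a single file ever grow too large, separate SQLite files per period can still be queried together via ATTACH; a minimal sketch with hypothetical file names:

-- hypothetical per-year database files attached into one session
ATTACH DATABASE 'tracking_2023.sqlite' AS y2023;
ATTACH DATABASE 'tracking_2024.sqlite' AS y2024;

SELECT * FROM y2023.my_tracking_table_01
UNION ALL
SELECT * FROM y2024.my_tracking_table_01;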

Maybe you can explore this example and see if it helps. Please note: SQLite has special handling of date and time values.
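
In short, SQLite has no dedicated date/time storage class; timestamps end up as TEXT, REAL, or INTEGER and are handled with the built-in date and time functions, for example (column names as above, purely illustrative):

-- filter on the auto-generated timestamp using SQLite's date/time functions
SELECT ListingNumber, my_Timestamp
FROM my_tracking_table_01
WHERE my_Timestamp >= datetime('now', '-30 days');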

Thanks very much @mlauber71 - made this my Sunday project and thought I almost cracked it, but ended up spinning my wheels for a good portion of the day and made a good mess of things (my fault). So if I may, I’d like to share a version of the workflow prior to starting with a historical table.

For Sharing.knwf (230.7 KB)

To recap, this workflow checks a folder for the most recent CSV (for the purpose of a shareable workflow, a “Table Creator” node provides a sample dataset), does some cleaning, then merges or inserts the new rows into the SQLite database. This database essentially holds the latest row for any given ML Number (i.e., the “Listing” ID described above). The “Updated” column is the timestamp of when the ML Number was last updated. Note that the CSV-producing client can have overlapping data (it can be set to produce incremental or overlapping output).
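
For context, the DB Merge step here is roughly an upsert. In plain SQLite (3.24+), assuming the “latest row” table is called latest_listings and has a UNIQUE key on ML_Number (both names are just placeholders), the equivalent would look something like:

-- roughly equivalent upsert; requires a UNIQUE constraint on ML_Number
INSERT INTO latest_listings (ML_Number, Updated, Price, SqFt, Comments)
VALUES ('001', 'Tue', 350, 500, 'Great!')
ON CONFLICT (ML_Number) DO UPDATE SET
    Updated  = excluded.Updated,
    Price    = excluded.Price,
    SqFt     = excluded.SqFt,
    Comments = excluded.Comments;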

I tried to replicate what you had done above, but then thought to at least try a minor change: sending the data simultaneously to the current DB Merge node and to another DB Writer/Insert node that appends to a new Historical table keyed on an ML Number/Updated pair, and then doing the ranking from there. I thought this would also have the added benefit of keeping the main (latest-update-only) table fast, with the historical table accessed separately only when required. Spun my wheels here.

Will let it be and step away for now - but any tips or even quick fixes to this would be super appreciated.

@qdmt I like your spirit in trying to crack this thing. What you might want to keep in mind: it is best to make some sort of plan for what your workflow should do. In the previous approach there were examples about H2 and UPDATE functions. There is also an example where only new lines, or the ones that have changed, are inserted.

In this thread we used SQLite and just appended data, like in a big data environment, while automatically assigning IDs and TIMESTAMPs. All these approaches have their benefits; you might want to think about which one fits your case.

H2 and SQLite are both great standalone platforms. SQLite might be a little bit more ‘stable’ but also has fewer functions. H2 is more advanced but I do not have that much information about how it behaves with large sizes.

In theory you could also use a (local) big data environment where you would have the data as single parquet files and manage them (1, 2, 3).

H2: do an UPDATE on the server using a VIEW of the existing and the new data