Wanted to use "--restore-jobs" parameter in the "pgcopydb clone". Tried using dimitri/pgcopydb:v0.14 docker image but it seems it does not support "--restore-jobs". So decided to pull the latest code(Branch: main) and build the docker image using the default Dockerfile available in the repository.
When using the docker image built from the latest code pull(main branch). I am getting the following error:
10:38:29.810 1 INFO Running pgcopydb version 0.14 from "/usr/local/bin/pgcopydb"
10:38:29.811 1 INFO [SOURCE] Copying database from "postgres://user1@host1:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
10:38:29.811 1 INFO [TARGET] Copying database into "postgres://user2@host2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
10:38:29.843 1 INFO Using work dir "/dbhome/tmp"
10:38:29.869 1 INFO Exported snapshot "00000005-002E1756-1" from the source database
10:38:29.879 1 NOTICE Wrote snapshot "00000005-002E1756-1" to file "/dbhome/tmp/snapshot"
10:38:29.880 15 NOTICE Starting the clone sub-process
10:38:29.881 15 INFO STEP 1: fetch source database tables, indexes, and sequences
10:38:30.089 15 NOTICE Registering catalog setup for source "postgres://user1@host1:5432/db1", target "postgres://user2@host2:5432/onecard_sbm", snapshot "00000005-002E1756-1"
10:38:38.417 15 INFO Fetched information for 253 tables (including 0 tables split in 0 partitions total), with an estimated total of 1186 million tuples and 411 GB on-disk
10:38:40.789 15 INFO Fetched information for 512 indexes (supporting 307 constraints)
10:38:41.184 15 INFO Fetching information for 145 sequences
10:38:41.882 15 INFO Fetched information for 6 extensions
10:38:42.043 15 INFO Found 0 indexes (supporting 0 constraints) in the target database
corrupted size vs. prev_size
Command run:
docker run -u root --mount type=bind,source="$(pwd)",target="$(pwd)" --rm -it 962570088078.dkr.ecr.ap-south-1.amazonaws.com/pgcopydb_custom:latest \
pgcopydb clone \
--source "host=${snapshot_db_host} port=5432 dbname=onecard_sbm user=${snapshot_user} password=${snapshot_password}" \
--target "host=${empty_db_host} port=5432 dbname=onecard_sbm user=${empty_user} password=${empty_password}" \
--no-role-passwords --no-owner --no-acl --no-comments --restore-jobs 1 --table-jobs 1 --index-jobs 1 --skip-vacuum --verbose --dir /dbhome/tmp
Please let me know. If any other information is required.
Hi @pratikshetty1994 ; thanks for the amount of details. That error message corrupted size vs. prev_size is new to me, and seems to be related with memory management (and maybe re-use after free?). Will investigate, but that might be hard to understand / track / solve, unless maybe Valgrind helps us there.
Cc @shubhamdhama who's been working on malloc/free/valgring and testing a Garbage Collector approach too I believe.
@shubhamdhama Unfortunately wont be able to share schema we are using. How can i help in getting more clearity for you.
Verbose.log:
08:18:34.810 8 SQL pgsql.c:1501 [SOURCE 8001] select e.oid, extname, extnamespace::regnamespace, extrelocatable, 0 as count, null as n, null as extconfig, null as nspname, null as relname, null as extcondition, null as relkind from pg_extension e where extconfig is null UNION ALL select e.oid, extname, extnamespace::regnamespace, extrelocatable, array_length(e.extconfig, 1) as count, extconfig.n, extconfig.extconfig, format('%I', n.nspname) as nspname, format('%I', c.relname) as relname, extcondition[extconfig.n], c.relkind as relkind from pg_extension e, unnest(extconfig) with ordinality as extconfig(extconfig, n) left join pg_class c on c.oid = extconfig.extconfig join pg_namespace n on c.relnamespace = n.oid where extconfig.extconfig is not null order by oid, n;
08:18:34.812 8 DEBUG schema.c:4210 getExtensionList: 1
08:18:34.814 8 DEBUG schema.c:4210 getExtensionList: 1
08:18:34.817 8 DEBUG schema.c:4210 getExtensionList: 1
08:18:34.819 8 DEBUG schema.c:4210 getExtensionList: 1
08:18:34.822 8 DEBUG schema.c:4210 getExtensionList: 1
08:18:34.824 8 DEBUG schema.c:4210 getExtensionList: 1
08:18:34.829 8 DEBUG schema.c:4210 getExtensionList: 1
08:18:34.832 8 DEBUG schema.c:4210 getExtensionList: 1
08:18:34.837 8 INFO copydb_schema.c:745 Fetched information for 6 extensions
08:18:34.843 8 SQL pgsql.c:1501 [SOURCE 8001] COMMIT;
08:18:34.845 8 SQL pgsql.c:404 Disconnecting from [source] "postgres://u1@h1:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
08:18:34.893 8 SQL pgsql.c:494 Connecting to [target] "postgres://u2@h2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
08:18:34.915 8 SQL pgsql.c:1501 [TARGET 14875] BEGIN;
08:18:34.916 8 SQL pgsql.c:1501 [TARGET 14875] select n.oid, n.nspname, format('- %s %s', regexp_replace(n.nspname, '[\n\r]', ' '), regexp_replace(auth.rolname, '[\n\r]', ' ')) from pg_namespace n join pg_roles auth ON auth.oid = n.nspowner where nspname <> 'information_schema' and nspname !~ '^pg_';
08:18:34.918 8 DEBUG schema.c:3815 getSchemaList: 1
08:18:34.921 8 SQL pgsql.c:1501 [TARGET 14875] select oid, format('%I', rolname) as rolname from pg_roles;
08:18:34.921 8 DEBUG schema.c:3903 getRoleList: 25
08:18:34.987 8 DEBUG schema.c:2976 listSourceIndexesSQL[SOURCE_FILTER_TYPE_NONE]
08:18:34.987 8 SQL pgsql.c:1501 [TARGET 14875] select i.oid, format('%I', n.nspname) as inspname, format('%I', i.relname) as irelname, r.oid, format('%I', rn.nspname) as rnspname, format('%I', r.relname) as rrelname, indisprimary, indisunique, (select string_agg(format('%I', attname), ',') from pg_attribute where attrelid = r.oid and array[attnum::integer] <@ indkey::integer[] ) as cols, pg_get_indexdef(indexrelid), c.oid, c.condeferrable, c.condeferred, case when conname is not null then format('%I', c.conname) end as conname, pg_get_constraintdef(c.oid), format('%s %s %s', regexp_replace(n.nspname, '[\n\r]', ' '), regexp_replace(i.relname, '[\n\r]', ' '), regexp_replace(auth.rolname, '[\n\r]', ' ')) from pg_index x join pg_class i ON i.oid = x.indexrelid join pg_class r ON r.oid = x.indrelid join pg_namespace n ON n.oid = i.relnamespace join pg_namespace rn ON rn.oid = r.relnamespace join pg_roles auth ON auth.oid = i.relowner left join pg_depend d on d.classid = 'pg_class'::regclass and d.objid = i.oid and d.refclassid = 'pg_constraint'::regclass and d.deptype = 'i' left join pg_constraint c ON c.oid = d.refobjid where r.relkind = 'r' and r.relpersistence in ('p', 'u') and n.nspname !~ '^pg_' and n.nspname <> 'information_schema' and n.nspname !~ 'pgcopydb' and not exists ( select 1 from pg_depend d where d.classid = 'pg_class'::regclass and d.objid = r.oid and d.deptype = 'e' ) order by n.nspname, r.relname, i.relname;
08:18:34.994 8 SQL pgsql.c:1501 [TARGET 14875] COMMIT;
08:18:34.997 8 SQL pgsql.c:404 Disconnecting from [target] "postgres://u2@h2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
08:18:34.997 8 INFO copydb_schema.c:1009 Found 0 indexes (supporting 0 constraints) in the target database
malloc(): invalid size (unsorted)
08:18:35.117 1 DEBUG cli_clone_follow.c:755 clone process 8 has terminated [0] (Aborted [6])
08:18:35.267 1 SQL pgsql.c:1501 [SOURCE 7999] COMMIT;
08:18:35.268 1 SQL pgsql.c:404 Disconnecting from [source] "postgres://u1@h1:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
08:18:35.269 1 DEBUG copydb.c:902 Waiting for sub-processes to finish
08:18:35.269 1 DEBUG copydb.c:918 copydb_wait_for_subprocesses: no more children
08:18:35.269 1 DEBUG queue_utils.c:60 iprm -q 32769 (create index)
08:18:35.269 1 DEBUG queue_utils.c:60 iprm -q 0 (vacuum)
08:18:35.269 1 DEBUG lock_utils.c:212 ipcrm -s 0
@shubhamdhama Also sharing Dockerfile used. Using debian 11 for build process and debian 12 for finql image.
Dockerfile
FROM --platform=linux/amd64 debian:11-slim as build
ARG TARGETPLATFORM
ARG TARGETOS
ARG TARGETARCH
RUN dpkg --add-architecture ${TARGETARCH:-amd64} && apt update \
&& apt install -qqy --no-install-recommends \
libncurses-dev \
libxml2-dev \
sudo \
valgrind \
build-essential \
libedit-dev \
libicu-dev \
libkrb5-dev \
liblz4-dev \
libncurses6 \
libpam-dev \
libpq-dev \
libpq5 \
libreadline-dev \
libselinux1-dev \
libsqlite3-dev \
libssl-dev \
libxslt1-dev \
lsof \
psmisc \
gdb \
strace \
tmux \
watch \
make \
openssl \
postgresql-common \
postgresql-client-common \
postgresql-server-dev-all \
psutils \
tmux \
watch \
zlib1g-dev
WORKDIR /usr/src/pgcopydb
COPY . .
RUN make -s clean && make -s -j$(nproc) install
FROM --platform=linux/amd64 debian:12-slim as run
ARG TARGETPLATFORM
ARG TARGETOS
ARG TARGETARCH
LABEL org.opencontainers.image.source https://github.com/dimitri/pgcopydb
RUN dpkg --add-architecture ${TARGETARCH:-amd64} && apt update \
&& apt install -qqy --no-install-suggests --no-install-recommends \
sudo \
passwd \
ca-certificates \
libpq5 \
libsqlite3-0 \
lsof \
tmux \
watch \
psmisc \
openssl \
postgresql-common \
postgresql-client \
postgresql-client-common \
&& apt clean \
&& rm -rf /var/lib/apt/lists/*
RUN useradd -rm -d /var/lib/postgres -s /bin/bash -g postgres -G sudo docker
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
COPY --from=build --chmod=755 /usr/lib/postgresql/13/bin/pgcopydb /usr/local/bin
USER docker
ENTRYPOINT []
CMD []
HEALTHCHECK NONE
Hi @pratikshetty1994 ; it seems you've been using v0.14 and we released v0.15 today. The new version includes a big internal refactoring/rewrite that impacts a lots of the internal memory handling, in such a way that I believe it's worth your time trying again with the new version. Please report success/failure here with the needed level of details if necessary.
If you manage to reproduce the error and have the time to work on a self-contained test case that hits the problem everytime, that would gold to help us fix the problem.
Hey @dimitri @shubhamdhama Tried v0.15 but it seems the issue still persists.Hopefully these logs can help you debug. Also let me know if you need me to run any tests to understand the issue better.
test.log:
4:17:17.634 15 INFO indexes.c:157 STEP 6: starting 1 CREATE INDEX processes
04:17:17.634 15 INFO indexes.c:158 STEP 7: constraints are built by the CREATE INDEX processes
04:17:17.635 15 DEBUG copydb.c:714 Waiting for sub-processes to finish
04:17:17.635 17 SQL pgsql.c:1501 [SOURCE 16366] BEGIN;
04:17:17.636 17 SQL pgsql.c:1501 [SOURCE 16366] SET TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ WRITE, DEFERRABLE;
04:17:17.636 17 SQL pgsql.c:1501 [SOURCE 16366] SET TRANSACTION SNAPSHOT '00000005-003EA70A-1';
04:17:17.637 17 SQL pgsql.c:1501 [SOURCE 16366] select current_setting('server_version'), current_setting('server_version_num')::integer;
04:17:17.638 17 DEBUG pgsql.c:1174 [SOURCE 16366] Postgres version 14.3 (140003)
04:17:17.638 17 SQL pgsql.c:1501 [SOURCE 16366] SET client_encoding TO 'UTF-8';
04:17:17.639 17 SQL pgsql.c:1501 [SOURCE 16366] SET extra_float_digits TO 3;
04:17:17.640 17 SQL pgsql.c:1501 [SOURCE 16366] SET statement_timeout TO 0;
04:17:17.640 17 SQL pgsql.c:1501 [SOURCE 16366] SET default_transaction_read_only TO off;
04:17:17.641 17 SQL pgsql.c:1501 [SOURCE 16366] SET idle_in_transaction_session_timeout TO 0;
04:17:17.642 17 SQL pgsql.c:494 Connecting to [target] "postgres://user2@host2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
04:17:17.645 19 NOTICE indexes.c:215 Started CREATE INDEX worker 19 [15]
04:17:17.645 19 DEBUG catalog.c:705 Skipping opening SQLite database "/dbhome/tmp/schema/source.db": already opened
04:17:17.645 19 DEBUG catalog.c:710 Opening SQLite database "/dbhome/tmp/schema/filter.db" with lib version 3.40.1
04:17:17.645 19 DEBUG catalog.c:710 Opening SQLite database "/dbhome/tmp/schema/target.db" with lib version 3.40.1
04:17:17.645 19 DEBUG catalog.c:582 Catalog has been setup for source "postgres://user1@host1:5432/db1", target "postgres://user2@host2:5432/db1", snapshot "00000005-003EA70A-1"
04:17:17.645 19 SQL pgsql.c:494 Connecting to [target] "postgres://user2@host2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
04:17:17.646 8 SQL pgsql.c:1501 [SOURCE 16367] BEGIN;
04:17:17.647 8 SQL pgsql.c:1501 [SOURCE 16367] SET TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ WRITE, DEFERRABLE;
04:17:17.647 8 SQL pgsql.c:1501 [SOURCE 16367] SET TRANSACTION SNAPSHOT '00000005-003EA70A-1';
04:17:17.648 8 SQL pgsql.c:1501 [SOURCE 16367] select current_setting('server_version'), current_setting('server_version_num')::integer;
04:17:17.649 8 DEBUG pgsql.c:1174 [SOURCE 16367] Postgres version 14.3 (140003)
04:17:17.649 8 SQL pgsql.c:1501 [SOURCE 16367] SET client_encoding TO 'UTF-8';
04:17:17.650 8 SQL pgsql.c:1501 [SOURCE 16367] SET extra_float_digits TO 3;
04:17:17.650 8 SQL pgsql.c:1501 [SOURCE 16367] SET statement_timeout TO 0;
04:17:17.651 8 SQL pgsql.c:1501 [SOURCE 16367] SET default_transaction_read_only TO off;
04:17:17.651 8 SQL pgsql.c:1501 [SOURCE 16367] SET idle_in_transaction_session_timeout TO 0;
04:17:17.652 8 SQL pgsql.c:1501 [SOURCE 16367] select exists(select 1 from pg_largeobject_metadata);
04:17:17.653 8 INFO blobs.c:74 Skipping large objects: none found.
04:17:17.656 8 INFO sequences.c:155 STEP 9: reset sequences values
04:17:17.657 8 DEBUG copydb.c:714 Waiting for sub-processes to finish
04:17:17.657 20 NOTICE sequences.c:215 Now starting setval process 20 [8]
04:17:17.657 8 ERROR copydb.c:774 Sub-process 13 exited with code 0 and signal Segmentation fault
04:17:17.657 20 DEBUG catalog.c:710 Opening SQLite database "/dbhome/tmp/schema/source.db" with lib version 3.40.1
04:17:17.657 8 DEBUG copydb.c:760 Sub-process 16 exited with code 0
04:17:17.660 20 INFO sequences.c:245 Reset sequences values on the target database
04:17:17.660 20 SQL pgsql.c:494 Connecting to [target] "postgres://user2@host2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
04:17:17.661 17 SQL pgsql.c:1501 [TARGET 27993] SET client_encoding TO 'UTF-8';
04:17:17.663 17 SQL pgsql.c:1501 [TARGET 27993] SET extra_float_digits TO 3;
04:17:17.663 17 SQL pgsql.c:1501 [TARGET 27993] SET statement_timeout TO 0;
04:17:17.664 19 SQL pgsql.c:1501 [TARGET 27994] SET client_encoding TO 'UTF-8';
04:17:17.664 17 SQL pgsql.c:1501 [TARGET 27993] SET default_transaction_read_only TO off;
04:17:17.665 19 SQL pgsql.c:1501 [TARGET 27994] SET extra_float_digits TO 3;
04:17:17.665 17 SQL pgsql.c:1501 [TARGET 27993] SET maintenance_work_mem TO '1 GB';
04:17:17.665 19 SQL pgsql.c:1501 [TARGET 27994] SET statement_timeout TO 0;
04:17:17.665 17 SQL pgsql.c:1501 [TARGET 27993] SET synchronous_commit TO 'off';
04:17:17.666 19 SQL pgsql.c:1501 [TARGET 27994] SET default_transaction_read_only TO off;
04:17:17.666 17 SQL pgsql.c:1501 [TARGET 27993] SET statement_timeout TO 0;
04:17:17.666 19 SQL pgsql.c:1501 [TARGET 27994] SET maintenance_work_mem TO '1 GB';
04:17:17.667 17 SQL pgsql.c:1501 [TARGET 27993] SET lock_timeout TO 0;
04:17:17.667 19 SQL pgsql.c:1501 [TARGET 27994] SET synchronous_commit TO 'off';
04:17:17.667 17 DEBUG catalog.c:710 Opening SQLite database "/dbhome/tmp/schema/source.db" with lib version 3.40.1
04:17:17.667 17 DEBUG catalog.c:710 Opening SQLite database "/dbhome/tmp/schema/filter.db" with lib version 3.40.1
04:17:17.668 17 DEBUG catalog.c:710 Opening SQLite database "/dbhome/tmp/schema/target.db" with lib version 3.40.1
04:17:17.668 19 SQL pgsql.c:1501 [TARGET 27994] SET statement_timeout TO 0;
04:17:17.668 17 DEBUG catalog.c:582 Catalog has been setup for source "postgres://user1@host1:5432/db1", target "postgres://user2@host2:5432/db2", snapshot "00000005-003EA70A-1"
04:17:17.668 17 DEBUG table-data.c:876 copydb_copy_data_by_oid: 4117045 public.tranlog, part 0
04:17:17.668 19 SQL pgsql.c:1501 [TARGET 27994] SET lock_timeout TO 0;
malloc(): invalid size (unsorted)
04:17:17.680 20 SQL pgsql.c:1501 [TARGET 27995] BEGIN;
04:17:17.681 20 SQL pgsql.c:1501 [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.681 20 SQL pgsql.c:1514 [TARGET 27995] 'public.accrual_aggregate_id_seq', '207346524', 'true'
04:17:17.682 20 SQL pgsql.c:1501 [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.682 20 SQL pgsql.c:1514 [TARGET 27995] 'public.accrual_plan_wise_id_seq', '188023299', 'true'
04:17:17.683 20 SQL pgsql.c:1501 [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.683 20 SQL pgsql.c:1514 [TARGET 27995] 'public.acs_authentication_id_seq', '1', 'false'
04:17:17.684 20 SQL pgsql.c:1501 [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.684 20 SQL pgsql.c:1514 [TARGET 27995] 'public.acs_message_authlog_id_seq', '1', 'false'
04:17:17.684 20 SQL pgsql.c:1501 [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.684 20 SQL pgsql.c:1514 [TARGET 27995] 'public.acs_request_id_seq', '1053578', 'true'
04:17:17.685 20 SQL pgsql.c:1501 [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.685 20 SQL pgsql.c:1514 [TARGET 27995] 'public.aggregate_entity_type_id_seq', '1', 'false'
04:17:17.686 20 SQL pgsql.c:1501 [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.686 20 SQL pgsql.c:1514 [TARGET 27995] 'public.apportionment_plan_wise_id_seq', '13036992', 'true'
04:17:17.686 20 SQL pgsql.c:1501 [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.687 20 SQL pgsql.c:1514 [TARGET 27995] 'public.auth_control_id_seq', '1', 'true'
04:17:17.687 20 SQL pgsql.c:1501 [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.687 20 SQL pgsql.c:1514 [TARGET 27995] 'public.auth_control_seq', '-1482359088', 'true'
04:17:17.688 20 DEBUG pgsql.c:2132 Postgres result status is PGRES_FATAL_ERROR
04:17:17.688 20 ERROR pgsql.c:2016 [TARGET 27995] ERROR: setval: value -1482359088 is out of bounds for sequence "auth_control_seq" (1..9223372036854775807)
04:17:17.688 20 ERROR pgsql.c:2016 [TARGET 27995] ERROR: setval: value -1482359088 is out of bounds for sequence "auth_control_seq" (1..9223372036854775807)
04:17:17.688 20 ERROR pgsql.c:2027 [TARGET 27995] SQL query: select pg_catalog.setval($1::regclass, $2, $3)
04:17:17.688 20 ERROR pgsql.c:2035 [TARGET 27995] SQL params: 'public.auth_control_seq', '-1482359088', 'true'
04:17:17.688 20 ERROR schema.c:2509 Failed to set sequence public.auth_control_seq last value to -1482359088
04:17:17.688 20 ERROR sequences.c:311 Failed to set sequence values for public.auth_control_seq
04:17:17.688 20 ERROR catalog.c:4423 Failed to iterate over list of seqs, see above for details
04:17:17.688 20 ERROR sequences.c:265 Failed to copy sequences values from our internal catalogs, see above for details
04:17:17.688 20 SQL pgsql.c:404 Disconnecting from [target] "postgres://user2@ohost2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
04:17:17.726 14 ERROR copydb.c:774 Sub-process 17 exited with code 0 and signal Aborted
04:17:17.726 14 DEBUG copydb.c:760 Sub-process 18 exited with code 0
04:17:17.726 14 DEBUG copydb.c:730 copydb_wait_for_subprocesses: no more children
04:17:17.726 14 ERROR table-data.c:341 Some COPY worker process(es) have exited with error, see above for details
04:17:17.726 14 DEBUG indexes.c:611 Send STOP message to CREATE INDEX queue 0
04:17:17.726 14 ERROR table-data.c:252 Failed to copy table data, see above for details
04:17:17.727 14 DEBUG queue_utils.c:60 iprm -q 32769 (copy table-data)
04:17:17.729 19 DEBUG indexes.c:266 Stop message received by create index worker
04:17:17.730 19 SQL pgsql.c:404 Disconnecting from [target] "postgres://user2@ohost2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
04:17:17.735 15 DEBUG copydb.c:760 Sub-process 19 exited with code 0
04:17:17.735 15 DEBUG copydb.c:730 copydb_wait_for_subprocesses: no more children
04:17:17.757 8 ERROR copydb.c:769 Sub-process 14 exited with code 12
04:17:17.757 8 DEBUG copydb.c:760 Sub-process 15 exited with code 0
04:17:17.757 8 ERROR copydb.c:769 Sub-process 20 exited with code 12
04:17:17.757 8 DEBUG copydb.c:730 copydb_wait_for_subprocesses: no more children
04:17:17.757 8 ERROR table-data.c:205 Some sub-processes have exited with error status, see above for details
04:17:17.757 8 ERROR table-data.c:212 Errors detected, see above for details
04:17:17.757 8 FATAL table-data.c:71 Failed to COPY the data, see above for details
04:17:17.757 8 ERROR cli_clone_follow.c:479 Failed to clone source database, see above for details
04:17:17.758 8 DEBUG lock_utils.c:212 ipcrm -s 98307
04:17:17.758 8 DEBUG lock_utils.c:212 ipcrm -s 65538
04:17:17.758 8 DEBUG lock_utils.c:212 ipcrm -s 32769
04:17:17.811 1 ERROR cli_clone_follow.c:741 clone process 8 has terminated [6]
04:17:17.961 1 SQL pgsql.c:1501 [SOURCE 16345] COMMIT;
04:17:17.962 1 SQL pgsql.c:404 Disconnecting from [source] "postgres://user1@host1:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
04:17:17.962 1 DEBUG copydb.c:714 Waiting for sub-processes to finish
04:17:17.962 1 DEBUG copydb.c:730 copydb_wait_for_subprocesses: no more children
04:17:17.963 1 DEBUG queue_utils.c:60 iprm -q 0 (create index)
04:17:17.963 1 DEBUG lock_utils.c:212 ipcrm -s 0
Seeing the same issue with AWS RDS Aurora Postgres 14. I won't have time to investigate today but I'll see if I have some time in a week or two to look more
$ pgcopydb clone --filters filters.ini
17:08:19.645 4187125 INFO Running pgcopydb version 0.15-1.pgdg22.04+1 from "/usr/bin/pgcopydb"
17:08:19.649 4187125 INFO [SOURCE] Copying database from "postgres://[email protected]/db?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
17:08:19.649 4187125 INFO [TARGET] Copying database into "postgres://[email protected]/db?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
17:08:19.700 4187125 INFO Using work dir "/tmp/pgcopydb"
17:08:19.755 4187125 WARN WARNING: unrecognized node type: 378
17:08:19.755 4187125 WARN WARNING: unrecognized node type: 378
17:08:19.755 4187125 WARN WARNING: unrecognized node type: 378
17:08:19.756 4187125 INFO Exported snapshot "00000005-018133DA-1" from the source database
17:08:19.761 4187127 INFO STEP 1: fetch source database tables, indexes, and sequences
17:08:20.823 4187127 INFO Fetched information for 15 tables (including 0 tables split in 0 partitions total), with an estimated total of 25 million tuples and 13 GB on-disk
17:08:21.237 4187127 INFO Fetched information for 48 indexes (supporting 25 constraints)
17:08:21.248 4187127 INFO Fetching information for 0 sequences
17:08:21.266 4187127 INFO Fetched information for 1 extensions
17:08:22.831 4187127 INFO Found 0 indexes (supporting 0 constraints) in the target database
17:08:22.843 4187127 INFO STEP 2: dump the source database schema (pre/post data)
17:08:22.848 4187127 INFO /usr/bin/pg_dump -Fc --snapshot 00000005-018133DA-1 --section pre-data --exclude-schema pganalyze --exclude-schema pglogical --file /tmp/pgcopydb/schema/pre.dump 'postgres://[email protected]/db?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60'
17:08:22.962 4187127 ERROR pg_dump: WARNING: unrecognized node type: 378
17:08:22.962 4187127 ERROR pg_dump: WARNING: unrecognized node type: 378
17:08:23.159 4187127 INFO /usr/bin/pg_dump -Fc --snapshot 00000005-018133DA-1 --section post-data --exclude-schema pganalyze --exclude-schema pglogical --file /tmp/pgcopydb/schema/post.dump 'postgres://[email protected]/db?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60'
17:08:23.259 4187127 ERROR pg_dump: WARNING: unrecognized node type: 378
17:08:23.260 4187127 ERROR pg_dump: WARNING: unrecognized node type: 378
17:08:23.457 4187127 INFO STEP 3: restore the pre-data section to the target database
17:08:24.351 4187127 INFO /usr/bin/pg_restore --dbname 'postgres://[email protected]/db?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --jobs 4 --exclude-schema pganalyze --exclude-schema pglogical --use-list /tmp/pgcopydb/schema/pre-filtered.list /tmp/pgcopydb/schema/pre.dump
17:08:34.717 4187200 INFO STEP 4: starting 4 table-data COPY processes
17:08:34.751 4187205 INFO STEP 8: starting 4 VACUUM processes
17:08:34.767 4187201 INFO STEP 6: starting 4 CREATE INDEX processes
17:08:34.767 4187201 INFO STEP 7: constraints are built by the CREATE INDEX processes
17:08:34.794 4187203 WARN WARNING: unrecognized node type: 378
17:08:34.794 4187203 WARN WARNING: unrecognized node type: 378
17:08:34.794 4187203 WARN WARNING: unrecognized node type: 378
17:08:34.797 4187206 WARN WARNING: unrecognized node type: 378
17:08:34.798 4187206 WARN WARNING: unrecognized node type: 378
17:08:34.799 4187206 WARN WARNING: unrecognized node type: 378
17:08:34.800 4187204 WARN WARNING: unrecognized node type: 378
17:08:34.800 4187204 WARN WARNING: unrecognized node type: 378
17:08:34.800 4187204 WARN WARNING: unrecognized node type: 378
17:08:34.801 4187127 WARN WARNING: unrecognized node type: 378
17:08:34.802 4187127 WARN WARNING: unrecognized node type: 378
17:08:34.802 4187127 WARN WARNING: unrecognized node type: 378
17:08:34.803 4187202 WARN WARNING: unrecognized node type: 378
17:08:34.803 4187202 WARN WARNING: unrecognized node type: 378
17:08:34.803 4187202 WARN WARNING: unrecognized node type: 378
17:08:34.806 4187127 INFO Skipping large objects: none found.
17:08:34.810 4187127 INFO STEP 9: reset sequences values
17:08:34.818 4187216 INFO Reset sequences values on the target database
malloc(): invalid size (unsorted)
free(): corrupted unsorted chunks
17:08:37.645 4187200 ERROR Sub-process 4187206 exited with code 0 and signal Aborted
corrupted size vs. prev_size in fastbins
17:08:38.646 4187200 ERROR Sub-process 4187204 exited with code 0 and signal Aborted
17:08:39.748 4187200 ERROR Sub-process 4187202 exited with code 0 and signal Aborted
17:08:44.653 4187200 ERROR Some COPY worker process(es) have exited with error, see above for details
17:08:44.653 4187200 ERROR Failed to copy table data, see above for details
17:08:44.724 4187127 ERROR Sub-process 4187200 exited with code 12
17:08:45.426 4187127 ERROR Some sub-processes have exited with error status, see above for details
17:08:45.426 4187127 ERROR Errors detected, see above for details
17:08:45.426 4187127 FATAL Failed to COPY the data, see above for details
17:08:45.426 4187127 ERROR Failed to clone source database, see above for details
17:08:45.435 4187125 ERROR clone process 4187127 has terminated [6]
This DB was about 10GB and it didn't end up copying any data
It looks like I did get a coredump here, hopefully this helps
Core was generated by `pgcopydb: copy public.mods_comment '.
Program terminated with signal SIGABRT, Aborted.
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=139826508401024) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) t a a bt
Thread 1 (Thread 0x7f2be55d0980 (LWP 4187206)):
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=139826508401024) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=139826508401024) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=139826508401024, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007f2be6242476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007f2be62287f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x00007f2be6289676 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f2be63dbb77 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#6 0x00007f2be62a0cfc in malloc_printerr (str=str@entry=0x7f2be63de838 "free(): corrupted unsorted chunks") at ./malloc/malloc.c:5664
#7 0x00007f2be62a2f16 in _int_free (av=0x7f2be641ac80 <main_arena>, p=0x558230396350, have_lock=<optimized out>) at ./malloc/malloc.c:4630
#8 0x00007f2be62a5453 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#9 0x00007f2be65b3e22 in sqlite3_free () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#10 0x00007f2be65c0edb in sqlite3PagerClose () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#11 0x00007f2be6546c08 in sqlite3BtreeClose () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#12 0x00007f2be6547595 in sqlite3BtreeCloseCursor () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#13 0x00007f2be6606b08 in sqlite3VdbeHalt () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#14 0x00007f2be65fd3f7 in sqlite3VdbeExec () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#15 0x00007f2be66059bf in sqlite3_step () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#16 0x000055822eeeb630 in catalog_sql_step (query=query@entry=0x55823039e6a0) at ./src/bin/pgcopydb/catalog.c:7319
#17 0x000055822eeec224 in catalog_iter_s_table_attrs_next (iter=0x55823039e690) at ./src/bin/pgcopydb/catalog.c:3155
#18 catalog_s_table_fetch_attrs (catalog=<optimized out>, table=<optimized out>) at ./src/bin/pgcopydb/catalog.c:3068
#19 0x000055822ef2dc6b in copydb_table_create_lockfile (isDone=<synthetic pointer>, tableSpecs=0x5582303cd880, specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/summary.c:402
#20 copydb_copy_data_by_oid (part=<optimized out>, oid=<optimized out>, dst=0x7ffc491e3f70, src=0x7ffc493ff8d8, specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:902
#21 copydb_table_data_worker (specs=specs@entry=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:771
#22 0x000055822ef2f7be in copydb_start_table_data_workers (specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:675
#23 copydb_copy_supervisor (specs=specs@entry=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:299
#24 0x000055822ef2fec4 in copydb_start_copy_supervisor (specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:250
#25 copydb_start_copy_supervisor (specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:227
#26 copydb_process_table_data (specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:135
#27 copydb_process_table_data (specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:109
#28 copydb_copy_all_table_data (specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:69
#29 0x000055822eef0e4b in cloneDB (copySpecs=0x7ffc493f7f80) at ./src/bin/pgcopydb/cli_clone_follow.c:579
#30 0x000055822ef35726 in start_clone_process (pid=<optimized out>, copySpecs=0x7ffc493f7f80) at ./src/bin/pgcopydb/cli_clone_follow.c:477
#31 start_clone_process.constprop.0 (copySpecs=0x7ffc493f7f80, pid=0x7ffc492ec288) at ./src/bin/pgcopydb/cli_clone_follow.c:457
#32 0x000055822eeeacd3 in cli_clone (argc=<optimized out>, argv=<optimized out>) at ./src/bin/pgcopydb/cli_clone_follow.c:157
#33 0x000055822eee5c53 in commandline_run (argv=0x7ffc49623a18, argc=0, command=0x7ffc49623860) at ./src/bin/pgcopydb/../lib/subcommands.c/commandline.c:71
#34 main (argc=<optimized out>, argv=<optimized out>) at ./src/bin/pgcopydb/main.c:133
04:17:17.688 20 ERROR pgsql.c:2016 [TARGET 27995] ERROR: setval: value -1482359088 is out of bounds for sequence "auth_control_seq" (1..9223372036854775807)
04:17:17.688 20 ERROR pgsql.c:2016 [TARGET 27995] ERROR: setval: value -1482359088 is out of bounds for sequence "auth_control_seq" (1..9223372036854775807)
This looks like this have been fixed in #633.
That said I think these are still relevant to this issue:
04:17:17.657 8 ERROR copydb.c:774 Sub-process 13 exited with code 0 and signal Segmentation fault
04:17:17.726 14 ERROR copydb.c:774 Sub-process 17 exited with code 0 and signal Aborted
17:08:34.794 4187203 WARN WARNING: unrecognized node type: 378
17:08:34.794 4187203 WARN WARNING: unrecognized node type: 378
17:08:34.794 4187203 WARN WARNING: unrecognized node type: 378
17:08:34.797 4187206 WARN WARNING: unrecognized node type: 378
17:08:34.798 4187206 WARN WARNING: unrecognized node type: 378
17:08:34.799 4187206 WARN WARNING: unrecognized node type: 378
17:08:34.800 4187204 WARN WARNING: unrecognized node type: 378
17:08:34.800 4187204 WARN WARNING: unrecognized node type: 378
17:08:34.800 4187204 WARN WARNING: unrecognized node type: 378
17:08:34.801 4187127 WARN WARNING: unrecognized node type: 378
17:08:34.802 4187127 WARN WARNING: unrecognized node type: 378
17:08:34.802 4187127 WARN WARNING: unrecognized node type: 378
17:08:34.803 4187202 WARN WARNING: unrecognized node type: 378
17:08:34.803 4187202 WARN WARNING: unrecognized node type: 378
17:08:34.803 4187202 WARN WARNING: unrecognized node type: 378
For future reviewers (or myself), that's a problem with Amazon RDS at the moment, see https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQLReleaseNotes/AuroraPostgreSQL.Updates.html#AuroraPostgreSQL.Updates.20180305.1530.
I found documentation/articles that lead the libc run-time error message "corrupted size vs. prev_size" to either problems with stack alignment or with zero-size memory allocation requests.
From what I have seen the stack alignment problems happen in build process that involves separate run-time (C/Python for instance), it seems to me we're not concerned with this aspect in pgcopydb.
I have now merged two PRs that clean-up our memory allocation and avoid zero-size requests entirely, so I believe it's worth trying again to see if the error happens again. Hopefully, that's all it takes to fix it. Otherwise more research is going to be needed: help welcome.
Hey @dimitri the "corrupted size vs. prev_size" issue seems to be fixed but now there seems to be an issue with copying partman extention. Sharing the logs below.
err.log
01:29:00.229 20 INFO Reset sequences values on the target database
01:29:00.247 13 INFO COPY extension "pg_partman" configuration table partman.part_config_sub
malloc(): invalid size (unsorted)
01:29:00.268 13 INFO COPY extension "pg_partman" configuration table partman.custom_time_partitions
01:29:00.297 14 ERROR Sub-process 17 exited with code 0 and signal Aborted
01:29:00.297 14 ERROR Some COPY worker process(es) have exited with error, see above for details
01:29:00.297 14 ERROR Failed to copy table data, see above for details
01:29:00.325 8 ERROR Sub-process 14 exited with code 12
01:29:00.425 8 ERROR Some sub-processes have exited with error status, see above for details
01:29:00.425 8 ERROR Errors detected, see above for details
01:29:00.425 8 FATAL Failed to COPY the data, see above for details
01:29:00.425 8 ERROR Failed to clone source database, see above for details
01:29:00.534 1 ERROR clone process 8 has terminated [6]