添加链接
link管理
链接快照平台
  • 输入网页链接,自动生成快照
  • 标签化管理网页链接

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement . We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Moving data from Postgres 14 to Postgres 14 on AWS RDS.

Problem:

Wanted to use "--restore-jobs" parameter in the "pgcopydb clone". Tried using dimitri/pgcopydb:v0.14 docker image but it seems it does not support "--restore-jobs". So decided to pull the latest code(Branch: main) and build the docker image using the default Dockerfile available in the repository.

When using the docker image built from the latest code pull(main branch). I am getting the following error:

Error.log:

10:38:29.810 1 INFO   Running pgcopydb version 0.14 from "/usr/local/bin/pgcopydb"
10:38:29.811 1 INFO   [SOURCE] Copying database from "postgres://user1@host1:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
10:38:29.811 1 INFO   [TARGET] Copying database into "postgres://user2@host2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
10:38:29.843 1 INFO   Using work dir "/dbhome/tmp"
10:38:29.869 1 INFO   Exported snapshot "00000005-002E1756-1" from the source database
10:38:29.879 1 NOTICE Wrote snapshot "00000005-002E1756-1" to file "/dbhome/tmp/snapshot"
10:38:29.880 15 NOTICE Starting the clone sub-process
10:38:29.881 15 INFO   STEP 1: fetch source database tables, indexes, and sequences
10:38:30.089 15 NOTICE Registering catalog setup for source "postgres://user1@host1:5432/db1", target "postgres://user2@host2:5432/onecard_sbm", snapshot "00000005-002E1756-1"
10:38:38.417 15 INFO   Fetched information for 253 tables (including 0 tables split in 0 partitions total), with an estimated total of 1186 million tuples and 411 GB on-disk
10:38:40.789 15 INFO   Fetched information for 512 indexes (supporting 307 constraints)
10:38:41.184 15 INFO   Fetching information for 145 sequences
10:38:41.882 15 INFO   Fetched information for 6 extensions
10:38:42.043 15 INFO   Found 0 indexes (supporting 0 constraints) in the target database
corrupted size vs. prev_size

Command run:

docker run -u root --mount type=bind,source="$(pwd)",target="$(pwd)" --rm -it 962570088078.dkr.ecr.ap-south-1.amazonaws.com/pgcopydb_custom:latest \
 pgcopydb clone \
  --source "host=${snapshot_db_host} port=5432 dbname=onecard_sbm user=${snapshot_user} password=${snapshot_password}" \
  --target "host=${empty_db_host} port=5432 dbname=onecard_sbm    user=${empty_user}    password=${empty_password}" \
  --no-role-passwords --no-owner --no-acl --no-comments --restore-jobs 1 --table-jobs 1 --index-jobs 1 --skip-vacuum --verbose --dir /dbhome/tmp

Please let me know. If any other information is required.

Hi @pratikshetty1994 ; thanks for the amount of details. That error message corrupted size vs. prev_size is new to me, and seems to be related with memory management (and maybe re-use after free?). Will investigate, but that might be hard to understand / track / solve, unless maybe Valgrind helps us there.

Cc @shubhamdhama who's been working on malloc/free/valgring and testing a Garbage Collector approach too I believe.

@shubhamdhama Unfortunately wont be able to share schema we are using. How can i help in getting more clearity for you.

Verbose.log:

08:18:34.810 8 SQL    pgsql.c:1501              [SOURCE 8001] select e.oid, extname, extnamespace::regnamespace, extrelocatable,        0 as count, null as n,        null as extconfig, null as nspname, null as relname,        null as extcondition,                 null as relkind   from pg_extension e  where extconfig is null  UNION ALL   select e.oid, extname, extnamespace::regnamespace, extrelocatable,          array_length(e.extconfig, 1) as count,          extconfig.n,          extconfig.extconfig,          format('%I', n.nspname) as nspname,         format('%I', c.relname) as relname,          extcondition[extconfig.n],          c.relkind as relkind     from pg_extension e,          unnest(extconfig) with ordinality as extconfig(extconfig, n)           left join pg_class c on c.oid = extconfig.extconfig           join pg_namespace n on c.relnamespace = n.oid    where extconfig.extconfig is not null order by oid, n;
08:18:34.812 8 DEBUG  schema.c:4210             getExtensionList: 1
08:18:34.814 8 DEBUG  schema.c:4210             getExtensionList: 1
08:18:34.817 8 DEBUG  schema.c:4210             getExtensionList: 1
08:18:34.819 8 DEBUG  schema.c:4210             getExtensionList: 1
08:18:34.822 8 DEBUG  schema.c:4210             getExtensionList: 1
08:18:34.824 8 DEBUG  schema.c:4210             getExtensionList: 1
08:18:34.829 8 DEBUG  schema.c:4210             getExtensionList: 1
08:18:34.832 8 DEBUG  schema.c:4210             getExtensionList: 1
08:18:34.837 8 INFO   copydb_schema.c:745       Fetched information for 6 extensions
08:18:34.843 8 SQL    pgsql.c:1501              [SOURCE 8001] COMMIT;
08:18:34.845 8 SQL    pgsql.c:404               Disconnecting from [source] "postgres://u1@h1:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
08:18:34.893 8 SQL    pgsql.c:494               Connecting to [target] "postgres://u2@h2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
08:18:34.915 8 SQL    pgsql.c:1501              [TARGET 14875] BEGIN;
08:18:34.916 8 SQL    pgsql.c:1501              [TARGET 14875] select n.oid, n.nspname,        format('- %s %s',                 regexp_replace(n.nspname, '[\n\r]', ' '),   regexp_replace(auth.rolname, '[\n\r]', ' '))   from pg_namespace n        join pg_roles auth ON auth.oid = n.nspowner  where nspname <> 'information_schema' and nspname !~ '^pg_';
08:18:34.918 8 DEBUG  schema.c:3815             getSchemaList: 1
08:18:34.921 8 SQL    pgsql.c:1501              [TARGET 14875] select oid, format('%I', rolname) as rolname from pg_roles;
08:18:34.921 8 DEBUG  schema.c:3903             getRoleList: 25
08:18:34.987 8 DEBUG  schema.c:2976             listSourceIndexesSQL[SOURCE_FILTER_TYPE_NONE]
08:18:34.987 8 SQL    pgsql.c:1501              [TARGET 14875]    select i.oid,           format('%I', n.nspname) as inspname,           format('%I', i.relname) as irelname,          r.oid,           format('%I', rn.nspname) as rnspname,           format('%I', r.relname) as rrelname,           indisprimary,          indisunique,          (select string_agg(format('%I', attname), ',')             from pg_attribute            where attrelid = r.oid              and array[attnum::integer] <@ indkey::integer[]          ) as cols,          pg_get_indexdef(indexrelid),          c.oid,          c.condeferrable,          c.condeferred,          case when conname is not null                then format('%I', c.conname)            end as conname,          pg_get_constraintdef(c.oid),          format('%s %s %s',                  regexp_replace(n.nspname, '[\n\r]', ' '),                  regexp_replace(i.relname, '[\n\r]', ' '),                  regexp_replace(auth.rolname, '[\n\r]', ' '))     from pg_index x          join pg_class i ON i.oid = x.indexrelid          join pg_class r ON r.oid = x.indrelid          join pg_namespace n ON n.oid = i.relnamespace          join pg_namespace rn ON rn.oid = r.relnamespace          join pg_roles auth ON auth.oid = i.relowner          left join pg_depend d                  on d.classid = 'pg_class'::regclass                and d.objid = i.oid                and d.refclassid = 'pg_constraint'::regclass                and d.deptype = 'i'          left join pg_constraint c ON c.oid = d.refobjid    where r.relkind = 'r' and r.relpersistence in ('p', 'u')       and n.nspname !~ '^pg_' and n.nspname <> 'information_schema'      and n.nspname !~ 'pgcopydb'      and not exists        (          select 1            from pg_depend d           where d.classid = 'pg_class'::regclass and d.objid = r.oid             and d.deptype = 'e'        )  order by n.nspname, r.relname, i.relname;
08:18:34.994 8 SQL    pgsql.c:1501              [TARGET 14875] COMMIT;
08:18:34.997 8 SQL    pgsql.c:404               Disconnecting from [target] "postgres://u2@h2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
08:18:34.997 8 INFO   copydb_schema.c:1009      Found 0 indexes (supporting 0 constraints) in the target database
malloc(): invalid size (unsorted)
08:18:35.117 1 DEBUG  cli_clone_follow.c:755    clone process 8 has terminated [0] (Aborted [6])
08:18:35.267 1 SQL    pgsql.c:1501              [SOURCE 7999] COMMIT;
08:18:35.268 1 SQL    pgsql.c:404               Disconnecting from [source] "postgres://u1@h1:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
08:18:35.269 1 DEBUG  copydb.c:902              Waiting for sub-processes to finish
08:18:35.269 1 DEBUG  copydb.c:918              copydb_wait_for_subprocesses: no more children
08:18:35.269 1 DEBUG  queue_utils.c:60          iprm -q 32769 (create index)
08:18:35.269 1 DEBUG  queue_utils.c:60          iprm -q 0 (vacuum)
08:18:35.269 1 DEBUG  lock_utils.c:212          ipcrm -s 0
          

@shubhamdhama Also sharing Dockerfile used. Using debian 11 for build process and debian 12 for finql image.

Dockerfile

FROM --platform=linux/amd64 debian:11-slim as build
ARG TARGETPLATFORM
ARG TARGETOS
ARG TARGETARCH
RUN dpkg --add-architecture ${TARGETARCH:-amd64} && apt update \
  && apt install -qqy --no-install-recommends \
    libncurses-dev \
    libxml2-dev \
    sudo \
    valgrind \
    build-essential \
    libedit-dev \
    libicu-dev \
    libkrb5-dev \
    liblz4-dev \
    libncurses6 \
    libpam-dev \
    libpq-dev \
    libpq5 \
    libreadline-dev \
    libselinux1-dev \
    libsqlite3-dev \
    libssl-dev \
    libxslt1-dev \
    lsof \
    psmisc \
    gdb \
    strace \
    tmux \
    watch \
    make \
    openssl \
    postgresql-common \
    postgresql-client-common \
    postgresql-server-dev-all \
    psutils \
    tmux \
    watch \
    zlib1g-dev
WORKDIR /usr/src/pgcopydb
COPY . .
RUN make -s clean && make -s -j$(nproc) install
FROM --platform=linux/amd64 debian:12-slim as run
ARG TARGETPLATFORM
ARG TARGETOS
ARG TARGETARCH
LABEL org.opencontainers.image.source https://github.com/dimitri/pgcopydb
RUN dpkg --add-architecture ${TARGETARCH:-amd64} && apt update \
  && apt install -qqy --no-install-suggests --no-install-recommends \
    sudo \
        passwd \
    ca-certificates \
    libpq5 \
        libsqlite3-0 \
        lsof \
    tmux \
    watch \
    psmisc \
    openssl \
    postgresql-common \
    postgresql-client \
    postgresql-client-common \
    && apt clean \
    && rm -rf /var/lib/apt/lists/*
RUN useradd -rm -d /var/lib/postgres -s /bin/bash -g postgres -G sudo docker
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
COPY --from=build --chmod=755 /usr/lib/postgresql/13/bin/pgcopydb /usr/local/bin
USER docker
ENTRYPOINT []
CMD []
HEALTHCHECK NONE
          

Hi @pratikshetty1994 ; it seems you've been using v0.14 and we released v0.15 today. The new version includes a big internal refactoring/rewrite that impacts a lots of the internal memory handling, in such a way that I believe it's worth your time trying again with the new version. Please report success/failure here with the needed level of details if necessary.

If you manage to reproduce the error and have the time to work on a self-contained test case that hits the problem everytime, that would gold to help us fix the problem.

Hey @dimitri @shubhamdhama Tried v0.15 but it seems the issue still persists.Hopefully these logs can help you debug. Also let me know if you need me to run any tests to understand the issue better.

test.log:

4:17:17.634 15 INFO   indexes.c:157             STEP 6: starting 1 CREATE INDEX processes
04:17:17.634 15 INFO   indexes.c:158             STEP 7: constraints are built by the CREATE INDEX processes
04:17:17.635 15 DEBUG  copydb.c:714              Waiting for sub-processes to finish
04:17:17.635 17 SQL    pgsql.c:1501              [SOURCE 16366] BEGIN;
04:17:17.636 17 SQL    pgsql.c:1501              [SOURCE 16366] SET TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ WRITE, DEFERRABLE;
04:17:17.636 17 SQL    pgsql.c:1501              [SOURCE 16366] SET TRANSACTION SNAPSHOT '00000005-003EA70A-1';
04:17:17.637 17 SQL    pgsql.c:1501              [SOURCE 16366] select current_setting('server_version'),        current_setting('server_version_num')::integer;
04:17:17.638 17 DEBUG  pgsql.c:1174              [SOURCE 16366] Postgres version 14.3 (140003)
04:17:17.638 17 SQL    pgsql.c:1501              [SOURCE 16366] SET client_encoding TO 'UTF-8';
04:17:17.639 17 SQL    pgsql.c:1501              [SOURCE 16366] SET extra_float_digits TO 3;
04:17:17.640 17 SQL    pgsql.c:1501              [SOURCE 16366] SET statement_timeout TO 0;
04:17:17.640 17 SQL    pgsql.c:1501              [SOURCE 16366] SET default_transaction_read_only TO off;
04:17:17.641 17 SQL    pgsql.c:1501              [SOURCE 16366] SET idle_in_transaction_session_timeout TO 0;
04:17:17.642 17 SQL    pgsql.c:494               Connecting to [target] "postgres://user2@host2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
04:17:17.645 19 NOTICE indexes.c:215             Started CREATE INDEX worker 19 [15]
04:17:17.645 19 DEBUG  catalog.c:705             Skipping opening SQLite database "/dbhome/tmp/schema/source.db": already opened
04:17:17.645 19 DEBUG  catalog.c:710             Opening SQLite database "/dbhome/tmp/schema/filter.db" with lib version 3.40.1
04:17:17.645 19 DEBUG  catalog.c:710             Opening SQLite database "/dbhome/tmp/schema/target.db" with lib version 3.40.1
04:17:17.645 19 DEBUG  catalog.c:582             Catalog has been setup for source "postgres://user1@host1:5432/db1", target "postgres://user2@host2:5432/db1", snapshot "00000005-003EA70A-1"
04:17:17.645 19 SQL    pgsql.c:494               Connecting to [target] "postgres://user2@host2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
04:17:17.646 8 SQL    pgsql.c:1501              [SOURCE 16367] BEGIN;
04:17:17.647 8 SQL    pgsql.c:1501              [SOURCE 16367] SET TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ WRITE, DEFERRABLE;
04:17:17.647 8 SQL    pgsql.c:1501              [SOURCE 16367] SET TRANSACTION SNAPSHOT '00000005-003EA70A-1';
04:17:17.648 8 SQL    pgsql.c:1501              [SOURCE 16367] select current_setting('server_version'),        current_setting('server_version_num')::integer;
04:17:17.649 8 DEBUG  pgsql.c:1174              [SOURCE 16367] Postgres version 14.3 (140003)
04:17:17.649 8 SQL    pgsql.c:1501              [SOURCE 16367] SET client_encoding TO 'UTF-8';
04:17:17.650 8 SQL    pgsql.c:1501              [SOURCE 16367] SET extra_float_digits TO 3;
04:17:17.650 8 SQL    pgsql.c:1501              [SOURCE 16367] SET statement_timeout TO 0;
04:17:17.651 8 SQL    pgsql.c:1501              [SOURCE 16367] SET default_transaction_read_only TO off;
04:17:17.651 8 SQL    pgsql.c:1501              [SOURCE 16367] SET idle_in_transaction_session_timeout TO 0;
04:17:17.652 8 SQL    pgsql.c:1501              [SOURCE 16367] select exists(select 1 from pg_largeobject_metadata);
04:17:17.653 8 INFO   blobs.c:74                Skipping large objects: none found.
04:17:17.656 8 INFO   sequences.c:155           STEP 9: reset sequences values
04:17:17.657 8 DEBUG  copydb.c:714              Waiting for sub-processes to finish
04:17:17.657 20 NOTICE sequences.c:215           Now starting setval process 20 [8]
04:17:17.657 8 ERROR  copydb.c:774              Sub-process 13 exited with code 0 and signal Segmentation fault
04:17:17.657 20 DEBUG  catalog.c:710             Opening SQLite database "/dbhome/tmp/schema/source.db" with lib version 3.40.1
04:17:17.657 8 DEBUG  copydb.c:760              Sub-process 16 exited with code 0
04:17:17.660 20 INFO   sequences.c:245           Reset sequences values on the target database
04:17:17.660 20 SQL    pgsql.c:494               Connecting to [target] "postgres://user2@host2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
04:17:17.661 17 SQL    pgsql.c:1501              [TARGET 27993] SET client_encoding TO 'UTF-8';
04:17:17.663 17 SQL    pgsql.c:1501              [TARGET 27993] SET extra_float_digits TO 3;
04:17:17.663 17 SQL    pgsql.c:1501              [TARGET 27993] SET statement_timeout TO 0;
04:17:17.664 19 SQL    pgsql.c:1501              [TARGET 27994] SET client_encoding TO 'UTF-8';
04:17:17.664 17 SQL    pgsql.c:1501              [TARGET 27993] SET default_transaction_read_only TO off;
04:17:17.665 19 SQL    pgsql.c:1501              [TARGET 27994] SET extra_float_digits TO 3;
04:17:17.665 17 SQL    pgsql.c:1501              [TARGET 27993] SET maintenance_work_mem TO '1 GB';
04:17:17.665 19 SQL    pgsql.c:1501              [TARGET 27994] SET statement_timeout TO 0;
04:17:17.665 17 SQL    pgsql.c:1501              [TARGET 27993] SET synchronous_commit TO 'off';
04:17:17.666 19 SQL    pgsql.c:1501              [TARGET 27994] SET default_transaction_read_only TO off;
04:17:17.666 17 SQL    pgsql.c:1501              [TARGET 27993] SET statement_timeout TO 0;
04:17:17.666 19 SQL    pgsql.c:1501              [TARGET 27994] SET maintenance_work_mem TO '1 GB';
04:17:17.667 17 SQL    pgsql.c:1501              [TARGET 27993] SET lock_timeout TO 0;
04:17:17.667 19 SQL    pgsql.c:1501              [TARGET 27994] SET synchronous_commit TO 'off';
04:17:17.667 17 DEBUG  catalog.c:710             Opening SQLite database "/dbhome/tmp/schema/source.db" with lib version 3.40.1
04:17:17.667 17 DEBUG  catalog.c:710             Opening SQLite database "/dbhome/tmp/schema/filter.db" with lib version 3.40.1
04:17:17.668 17 DEBUG  catalog.c:710             Opening SQLite database "/dbhome/tmp/schema/target.db" with lib version 3.40.1
04:17:17.668 19 SQL    pgsql.c:1501              [TARGET 27994] SET statement_timeout TO 0;
04:17:17.668 17 DEBUG  catalog.c:582             Catalog has been setup for source "postgres://user1@host1:5432/db1", target "postgres://user2@host2:5432/db2", snapshot "00000005-003EA70A-1"
04:17:17.668 17 DEBUG  table-data.c:876          copydb_copy_data_by_oid: 4117045 public.tranlog, part 0
04:17:17.668 19 SQL    pgsql.c:1501              [TARGET 27994] SET lock_timeout TO 0;
malloc(): invalid size (unsorted)
04:17:17.680 20 SQL    pgsql.c:1501              [TARGET 27995] BEGIN;
04:17:17.681 20 SQL    pgsql.c:1501              [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.681 20 SQL    pgsql.c:1514              [TARGET 27995] 'public.accrual_aggregate_id_seq', '207346524', 'true'
04:17:17.682 20 SQL    pgsql.c:1501              [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.682 20 SQL    pgsql.c:1514              [TARGET 27995] 'public.accrual_plan_wise_id_seq', '188023299', 'true'
04:17:17.683 20 SQL    pgsql.c:1501              [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.683 20 SQL    pgsql.c:1514              [TARGET 27995] 'public.acs_authentication_id_seq', '1', 'false'
04:17:17.684 20 SQL    pgsql.c:1501              [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.684 20 SQL    pgsql.c:1514              [TARGET 27995] 'public.acs_message_authlog_id_seq', '1', 'false'
04:17:17.684 20 SQL    pgsql.c:1501              [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.684 20 SQL    pgsql.c:1514              [TARGET 27995] 'public.acs_request_id_seq', '1053578', 'true'
04:17:17.685 20 SQL    pgsql.c:1501              [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.685 20 SQL    pgsql.c:1514              [TARGET 27995] 'public.aggregate_entity_type_id_seq', '1', 'false'
04:17:17.686 20 SQL    pgsql.c:1501              [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.686 20 SQL    pgsql.c:1514              [TARGET 27995] 'public.apportionment_plan_wise_id_seq', '13036992', 'true'
04:17:17.686 20 SQL    pgsql.c:1501              [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.687 20 SQL    pgsql.c:1514              [TARGET 27995] 'public.auth_control_id_seq', '1', 'true'
04:17:17.687 20 SQL    pgsql.c:1501              [TARGET 27995] select pg_catalog.setval($1::regclass, $2, $3);
04:17:17.687 20 SQL    pgsql.c:1514              [TARGET 27995] 'public.auth_control_seq', '-1482359088', 'true'
04:17:17.688 20 DEBUG  pgsql.c:2132              Postgres result status is PGRES_FATAL_ERROR
04:17:17.688 20 ERROR  pgsql.c:2016              [TARGET 27995] ERROR:  setval: value -1482359088 is out of bounds for sequence "auth_control_seq" (1..9223372036854775807)
04:17:17.688 20 ERROR  pgsql.c:2016              [TARGET 27995] ERROR:  setval: value -1482359088 is out of bounds for sequence "auth_control_seq" (1..9223372036854775807)
04:17:17.688 20 ERROR  pgsql.c:2027              [TARGET 27995] SQL query: select pg_catalog.setval($1::regclass, $2, $3)
04:17:17.688 20 ERROR  pgsql.c:2035              [TARGET 27995] SQL params: 'public.auth_control_seq', '-1482359088', 'true'
04:17:17.688 20 ERROR  schema.c:2509             Failed to set sequence public.auth_control_seq last value to -1482359088
04:17:17.688 20 ERROR  sequences.c:311           Failed to set sequence values for public.auth_control_seq
04:17:17.688 20 ERROR  catalog.c:4423            Failed to iterate over list of seqs, see above for details
04:17:17.688 20 ERROR  sequences.c:265           Failed to copy sequences values from our internal catalogs, see above for details
04:17:17.688 20 SQL    pgsql.c:404               Disconnecting from [target] "postgres://user2@ohost2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
04:17:17.726 14 ERROR  copydb.c:774              Sub-process 17 exited with code 0 and signal Aborted
04:17:17.726 14 DEBUG  copydb.c:760              Sub-process 18 exited with code 0
04:17:17.726 14 DEBUG  copydb.c:730              copydb_wait_for_subprocesses: no more children
04:17:17.726 14 ERROR  table-data.c:341          Some COPY worker process(es) have exited with error, see above for details
04:17:17.726 14 DEBUG  indexes.c:611             Send STOP message to CREATE INDEX queue 0
04:17:17.726 14 ERROR  table-data.c:252          Failed to copy table data, see above for details
04:17:17.727 14 DEBUG  queue_utils.c:60          iprm -q 32769 (copy table-data)
04:17:17.729 19 DEBUG  indexes.c:266             Stop message received by create index worker
04:17:17.730 19 SQL    pgsql.c:404               Disconnecting from [target] "postgres://user2@ohost2:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
04:17:17.735 15 DEBUG  copydb.c:760              Sub-process 19 exited with code 0
04:17:17.735 15 DEBUG  copydb.c:730              copydb_wait_for_subprocesses: no more children
04:17:17.757 8 ERROR  copydb.c:769              Sub-process 14 exited with code 12
04:17:17.757 8 DEBUG  copydb.c:760              Sub-process 15 exited with code 0
04:17:17.757 8 ERROR  copydb.c:769              Sub-process 20 exited with code 12
04:17:17.757 8 DEBUG  copydb.c:730              copydb_wait_for_subprocesses: no more children
04:17:17.757 8 ERROR  table-data.c:205          Some sub-processes have exited with error status, see above for details
04:17:17.757 8 ERROR  table-data.c:212          Errors detected, see above for details
04:17:17.757 8 FATAL  table-data.c:71           Failed to COPY the data, see above for details
04:17:17.757 8 ERROR  cli_clone_follow.c:479    Failed to clone source database, see above for details
04:17:17.758 8 DEBUG  lock_utils.c:212          ipcrm -s 98307
04:17:17.758 8 DEBUG  lock_utils.c:212          ipcrm -s 65538
04:17:17.758 8 DEBUG  lock_utils.c:212          ipcrm -s 32769
04:17:17.811 1 ERROR  cli_clone_follow.c:741    clone process 8 has terminated [6]
04:17:17.961 1 SQL    pgsql.c:1501              [SOURCE 16345] COMMIT;
04:17:17.962 1 SQL    pgsql.c:404               Disconnecting from [source] "postgres://user1@host1:5432/db1?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
04:17:17.962 1 DEBUG  copydb.c:714              Waiting for sub-processes to finish
04:17:17.962 1 DEBUG  copydb.c:730              copydb_wait_for_subprocesses: no more children
04:17:17.963 1 DEBUG  queue_utils.c:60          iprm -q 0 (create index)
04:17:17.963 1 DEBUG  lock_utils.c:212          ipcrm -s 0
          

Seeing the same issue with AWS RDS Aurora Postgres 14. I won't have time to investigate today but I'll see if I have some time in a week or two to look more

$ pgcopydb clone --filters filters.ini
17:08:19.645 4187125 INFO   Running pgcopydb version 0.15-1.pgdg22.04+1 from "/usr/bin/pgcopydb"
17:08:19.649 4187125 INFO   [SOURCE] Copying database from "postgres://[email protected]/db?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
17:08:19.649 4187125 INFO   [TARGET] Copying database into "postgres://[email protected]/db?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
17:08:19.700 4187125 INFO   Using work dir "/tmp/pgcopydb"
17:08:19.755 4187125 WARN   WARNING:  unrecognized node type: 378
17:08:19.755 4187125 WARN   WARNING:  unrecognized node type: 378
17:08:19.755 4187125 WARN   WARNING:  unrecognized node type: 378
17:08:19.756 4187125 INFO   Exported snapshot "00000005-018133DA-1" from the source database
17:08:19.761 4187127 INFO   STEP 1: fetch source database tables, indexes, and sequences
17:08:20.823 4187127 INFO   Fetched information for 15 tables (including 0 tables split in 0 partitions total), with an estimated total of 25 million tuples and 13 GB on-disk
17:08:21.237 4187127 INFO   Fetched information for 48 indexes (supporting 25 constraints)
17:08:21.248 4187127 INFO   Fetching information for 0 sequences
17:08:21.266 4187127 INFO   Fetched information for 1 extensions
17:08:22.831 4187127 INFO   Found 0 indexes (supporting 0 constraints) in the target database
17:08:22.843 4187127 INFO   STEP 2: dump the source database schema (pre/post data)
17:08:22.848 4187127 INFO    /usr/bin/pg_dump -Fc --snapshot 00000005-018133DA-1 --section pre-data --exclude-schema pganalyze --exclude-schema pglogical --file /tmp/pgcopydb/schema/pre.dump 'postgres://[email protected]/db?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60'
17:08:22.962 4187127 ERROR  pg_dump: WARNING:  unrecognized node type: 378
17:08:22.962 4187127 ERROR  pg_dump: WARNING:  unrecognized node type: 378
17:08:23.159 4187127 INFO    /usr/bin/pg_dump -Fc --snapshot 00000005-018133DA-1 --section post-data --exclude-schema pganalyze --exclude-schema pglogical --file /tmp/pgcopydb/schema/post.dump 'postgres://[email protected]/db?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60'
17:08:23.259 4187127 ERROR  pg_dump: WARNING:  unrecognized node type: 378
17:08:23.260 4187127 ERROR  pg_dump: WARNING:  unrecognized node type: 378
17:08:23.457 4187127 INFO   STEP 3: restore the pre-data section to the target database
17:08:24.351 4187127 INFO    /usr/bin/pg_restore --dbname 'postgres://[email protected]/db?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --jobs 4 --exclude-schema pganalyze --exclude-schema pglogical --use-list /tmp/pgcopydb/schema/pre-filtered.list /tmp/pgcopydb/schema/pre.dump
17:08:34.717 4187200 INFO   STEP 4: starting 4 table-data COPY processes
17:08:34.751 4187205 INFO   STEP 8: starting 4 VACUUM processes
17:08:34.767 4187201 INFO   STEP 6: starting 4 CREATE INDEX processes
17:08:34.767 4187201 INFO   STEP 7: constraints are built by the CREATE INDEX processes
17:08:34.794 4187203 WARN   WARNING:  unrecognized node type: 378
17:08:34.794 4187203 WARN   WARNING:  unrecognized node type: 378
17:08:34.794 4187203 WARN   WARNING:  unrecognized node type: 378
17:08:34.797 4187206 WARN   WARNING:  unrecognized node type: 378
17:08:34.798 4187206 WARN   WARNING:  unrecognized node type: 378
17:08:34.799 4187206 WARN   WARNING:  unrecognized node type: 378
17:08:34.800 4187204 WARN   WARNING:  unrecognized node type: 378
17:08:34.800 4187204 WARN   WARNING:  unrecognized node type: 378
17:08:34.800 4187204 WARN   WARNING:  unrecognized node type: 378
17:08:34.801 4187127 WARN   WARNING:  unrecognized node type: 378
17:08:34.802 4187127 WARN   WARNING:  unrecognized node type: 378
17:08:34.802 4187127 WARN   WARNING:  unrecognized node type: 378
17:08:34.803 4187202 WARN   WARNING:  unrecognized node type: 378
17:08:34.803 4187202 WARN   WARNING:  unrecognized node type: 378
17:08:34.803 4187202 WARN   WARNING:  unrecognized node type: 378
17:08:34.806 4187127 INFO   Skipping large objects: none found.
17:08:34.810 4187127 INFO   STEP 9: reset sequences values
17:08:34.818 4187216 INFO   Reset sequences values on the target database
malloc(): invalid size (unsorted)
free(): corrupted unsorted chunks
17:08:37.645 4187200 ERROR  Sub-process 4187206 exited with code 0 and signal Aborted
corrupted size vs. prev_size in fastbins
17:08:38.646 4187200 ERROR  Sub-process 4187204 exited with code 0 and signal Aborted
17:08:39.748 4187200 ERROR  Sub-process 4187202 exited with code 0 and signal Aborted
17:08:44.653 4187200 ERROR  Some COPY worker process(es) have exited with error, see above for details
17:08:44.653 4187200 ERROR  Failed to copy table data, see above for details
17:08:44.724 4187127 ERROR  Sub-process 4187200 exited with code 12
17:08:45.426 4187127 ERROR  Some sub-processes have exited with error status, see above for details
17:08:45.426 4187127 ERROR  Errors detected, see above for details
17:08:45.426 4187127 FATAL  Failed to COPY the data, see above for details
17:08:45.426 4187127 ERROR  Failed to clone source database, see above for details
17:08:45.435 4187125 ERROR  clone process 4187127 has terminated [6]

This DB was about 10GB and it didn't end up copying any data

It looks like I did get a coredump here, hopefully this helps

Core was generated by `pgcopydb: copy public.mods_comment  '.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139826508401024) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) t a a bt
Thread 1 (Thread 0x7f2be55d0980 (LWP 4187206)):
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139826508401024) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139826508401024) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139826508401024, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f2be6242476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f2be62287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007f2be6289676 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f2be63dbb77 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#6  0x00007f2be62a0cfc in malloc_printerr (str=str@entry=0x7f2be63de838 "free(): corrupted unsorted chunks") at ./malloc/malloc.c:5664
#7  0x00007f2be62a2f16 in _int_free (av=0x7f2be641ac80 <main_arena>, p=0x558230396350, have_lock=<optimized out>) at ./malloc/malloc.c:4630
#8  0x00007f2be62a5453 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#9  0x00007f2be65b3e22 in sqlite3_free () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#10 0x00007f2be65c0edb in sqlite3PagerClose () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#11 0x00007f2be6546c08 in sqlite3BtreeClose () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#12 0x00007f2be6547595 in sqlite3BtreeCloseCursor () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#13 0x00007f2be6606b08 in sqlite3VdbeHalt () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#14 0x00007f2be65fd3f7 in sqlite3VdbeExec () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#15 0x00007f2be66059bf in sqlite3_step () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#16 0x000055822eeeb630 in catalog_sql_step (query=query@entry=0x55823039e6a0) at ./src/bin/pgcopydb/catalog.c:7319
#17 0x000055822eeec224 in catalog_iter_s_table_attrs_next (iter=0x55823039e690) at ./src/bin/pgcopydb/catalog.c:3155
#18 catalog_s_table_fetch_attrs (catalog=<optimized out>, table=<optimized out>) at ./src/bin/pgcopydb/catalog.c:3068
#19 0x000055822ef2dc6b in copydb_table_create_lockfile (isDone=<synthetic pointer>, tableSpecs=0x5582303cd880, specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/summary.c:402
#20 copydb_copy_data_by_oid (part=<optimized out>, oid=<optimized out>, dst=0x7ffc491e3f70, src=0x7ffc493ff8d8, specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:902
#21 copydb_table_data_worker (specs=specs@entry=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:771
#22 0x000055822ef2f7be in copydb_start_table_data_workers (specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:675
#23 copydb_copy_supervisor (specs=specs@entry=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:299
#24 0x000055822ef2fec4 in copydb_start_copy_supervisor (specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:250
#25 copydb_start_copy_supervisor (specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:227
#26 copydb_process_table_data (specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:135
#27 copydb_process_table_data (specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:109
#28 copydb_copy_all_table_data (specs=0x7ffc493f7f80) at ./src/bin/pgcopydb/table-data.c:69
#29 0x000055822eef0e4b in cloneDB (copySpecs=0x7ffc493f7f80) at ./src/bin/pgcopydb/cli_clone_follow.c:579
#30 0x000055822ef35726 in start_clone_process (pid=<optimized out>, copySpecs=0x7ffc493f7f80) at ./src/bin/pgcopydb/cli_clone_follow.c:477
#31 start_clone_process.constprop.0 (copySpecs=0x7ffc493f7f80, pid=0x7ffc492ec288) at ./src/bin/pgcopydb/cli_clone_follow.c:457
#32 0x000055822eeeacd3 in cli_clone (argc=<optimized out>, argv=<optimized out>) at ./src/bin/pgcopydb/cli_clone_follow.c:157
#33 0x000055822eee5c53 in commandline_run (argv=0x7ffc49623a18, argc=0, command=0x7ffc49623860) at ./src/bin/pgcopydb/../lib/subcommands.c/commandline.c:71
#34 main (argc=<optimized out>, argv=<optimized out>) at ./src/bin/pgcopydb/main.c:133

04:17:17.688 20 ERROR pgsql.c:2016 [TARGET 27995] ERROR: setval: value -1482359088 is out of bounds for sequence "auth_control_seq" (1..9223372036854775807)
04:17:17.688 20 ERROR pgsql.c:2016 [TARGET 27995] ERROR: setval: value -1482359088 is out of bounds for sequence "auth_control_seq" (1..9223372036854775807)

This looks like this have been fixed in #633.

That said I think these are still relevant to this issue:

04:17:17.657 8 ERROR copydb.c:774 Sub-process 13 exited with code 0 and signal Segmentation fault
04:17:17.726 14 ERROR copydb.c:774 Sub-process 17 exited with code 0 and signal Aborted

17:08:34.794 4187203 WARN WARNING: unrecognized node type: 378
17:08:34.794 4187203 WARN WARNING: unrecognized node type: 378
17:08:34.794 4187203 WARN WARNING: unrecognized node type: 378
17:08:34.797 4187206 WARN WARNING: unrecognized node type: 378
17:08:34.798 4187206 WARN WARNING: unrecognized node type: 378
17:08:34.799 4187206 WARN WARNING: unrecognized node type: 378
17:08:34.800 4187204 WARN WARNING: unrecognized node type: 378
17:08:34.800 4187204 WARN WARNING: unrecognized node type: 378
17:08:34.800 4187204 WARN WARNING: unrecognized node type: 378
17:08:34.801 4187127 WARN WARNING: unrecognized node type: 378
17:08:34.802 4187127 WARN WARNING: unrecognized node type: 378
17:08:34.802 4187127 WARN WARNING: unrecognized node type: 378
17:08:34.803 4187202 WARN WARNING: unrecognized node type: 378
17:08:34.803 4187202 WARN WARNING: unrecognized node type: 378
17:08:34.803 4187202 WARN WARNING: unrecognized node type: 378

For future reviewers (or myself), that's a problem with Amazon RDS at the moment, see https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQLReleaseNotes/AuroraPostgreSQL.Updates.html#AuroraPostgreSQL.Updates.20180305.1530.

I found documentation/articles that lead the libc run-time error message "corrupted size vs. prev_size" to either problems with stack alignment or with zero-size memory allocation requests.

From what I have seen the stack alignment problems happen in build process that involves separate run-time (C/Python for instance), it seems to me we're not concerned with this aspect in pgcopydb.

I have now merged two PRs that clean-up our memory allocation and avoid zero-size requests entirely, so I believe it's worth trying again to see if the error happens again. Hopefully, that's all it takes to fix it. Otherwise more research is going to be needed: help welcome.

Hey @dimitri the "corrupted size vs. prev_size" issue seems to be fixed but now there seems to be an issue with copying partman extention. Sharing the logs below.

err.log

01:29:00.229 20 INFO   Reset sequences values on the target database
01:29:00.247 13 INFO   COPY extension "pg_partman" configuration table partman.part_config_sub
malloc(): invalid size (unsorted)
01:29:00.268 13 INFO   COPY extension "pg_partman" configuration table partman.custom_time_partitions
01:29:00.297 14 ERROR  Sub-process 17 exited with code 0 and signal Aborted
01:29:00.297 14 ERROR  Some COPY worker process(es) have exited with error, see above for details
01:29:00.297 14 ERROR  Failed to copy table data, see above for details
01:29:00.325 8 ERROR  Sub-process 14 exited with code 12
01:29:00.425 8 ERROR  Some sub-processes have exited with error status, see above for details
01:29:00.425 8 ERROR  Errors detected, see above for details
01:29:00.425 8 FATAL  Failed to COPY the data, see above for details
01:29:00.425 8 ERROR  Failed to clone source database, see above for details
01:29:00.534 1 ERROR  clone process 8 has terminated [6]