
1. Software Introduction

MARVEL performs genome assembly from PacBio (third-generation) data and is targeted in particular at assemblies from low sequencing depth.

2. Software Installation

Homepage: The MARVEL Assembler
After cloning, the directory contains no configure file, only configure.ac. So how do we install it?

2.1. Load the required environment

source /public/home/software/.bashrc
module load Autoconf/2.69
## added python3.6
export PATH="/public/home/zpxu/miniconda3/bin:$PATH"
export LD_LIBRARY_PATH="/public/home/zpxu/miniconda3/lib:$LD_LIBRARY_PATH"

Check the versions:

autoreconf --version
python --version

Then run autoreconf to generate the configure file.

2.2. Installation

## Note: this prefix directory is different from the git-cloned MARVEL source directory
./configure --prefix=/public/home/cotton/software/MARVEL-bin
make
make install

make install prints the following message:

---------------------------------------------------------------
Installation into /public/home/cotton/software/MARVEL-bin finished.
Don't forget to include /public/home/cotton/software/MARVEL-bin/lib.python in your PYTHONPATH.
---------------------------------------------------------------

2.3. Add to PYTHONPATH

See: Permanently add a directory to PYTHONPATH

export PYTHONPATH="$PYTHONPATH:/public/home/cotton/software/MARVEL-bin/lib.python"
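Alternatively, a script can extend the module search path at runtime instead of relying on the exported variable; a minimal sketch, using the lib.python path reported by make install above:

```python
import sys

# Path printed by `make install`; adjust to your own --prefix.
MARVEL_LIB = "/public/home/cotton/software/MARVEL-bin/lib.python"

# Prepending at runtime has the same effect as exporting PYTHONPATH
# before launching the interpreter.
if MARVEL_LIB not in sys.path:
    sys.path.insert(0, MARVEL_LIB)

print(MARVEL_LIB in sys.path)  # True
```

Exporting PYTHONPATH in ~/.bashrc is still the more convenient option when every MARVEL script needs the path.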

3. Software Usage

The overall workflow is:

  • overlap
  • patch reads
  • overlap (again)
  • scrubbing
  • assembly graph construction and touring
  • optional read correction
  • fasta file creation
    In practice the run consists of two phases, the READ PATCHING PHASE and the ASSEMBLY PHASE, i.e. a read-patching stage (DBprepare.py) and an assembly stage (do.py).

    3.1. Initialize the database: DBprepare.py

    /public/home/cotton/software/MARVEL-bin/scripts/DBprepare.py ECOL p6.25x.fasta

    DBprepare.py is an integrated Python script that bundles roughly four MARVEL programs (FA2db, DBsplit, DBdust, HPCdaligner).

    When it completes it generates ECOL.db, two hidden files .ECOL.idx and .ECOL.bps, and two files with the .plan suffix.

    3.2. Assembly: do.py

    #!/usr/bin/env python

    import multiprocessing
    import marvel
    import marvel.config
    import marvel.queue

    ###### settings

    DB = "ECOL" ## database name; change to match the name used with DBprepare.py
    COVERAGE = 25 ## coverage of the dataset; adjust to your data
    DB_FIX = DB + "_FIX" ## name of the database containing the patched reads
    PARALLEL = multiprocessing.cpu_count() ## number of available processors

    ###### patch raw reads

    q = marvel.queue.queue(DB, COVERAGE, PARALLEL)

    ###### run daligner to create initial overlaps
    q.plan("{db}.dalign.plan")

    ###### run LAmerge to merge overlap blocks
    q.plan("{db}.merge.plan")

    ## create quality and trim annotation (tracks) for each overlap block
    q.block("{path}/LAq -b {block} {db} {db}.{block}.las")
    ## merge quality and trim tracks
    q.single("{path}/TKmerge -d {db} q")
    q.single("{path}/TKmerge -d {db} trim")

    ## run LAfix to patch reads based on overlaps
    q.block("{path}/LAfix -g -1 {db} {db}.{block}.las {db}.{block}.fixed.fasta")
    ## join all fixed fasta files
    q.single("!cat {db}.*.fixed.fasta > {db}.fixed.fasta")

    ## create a new Database of fixed reads (-j numOfThreads, -g genome size)
    q.single("{path_scripts}/DBprepare.py -s 50 -r 2 -j 4 -g 4600000 {db_fixed} {db}.fixed.fasta", db_fixed = DB_FIX)

    ## run the commands
    q.process()

    ########## assemble patched reads

    q = marvel.queue.queue(DB_FIX, COVERAGE, PARALLEL)

    ###### run daligner to create overlaps
    q.plan("{db}.dalign.plan")
    ###### run LAmerge to merge overlap blocks
    q.plan("{db}.merge.plan")

    ###### start scrubbing pipeline

    ########## for larger genomes (> 100MB) LAstitch can be run with the -L option (preload reads)
    ########## with the -L option two passes over the overlap files are performed:
    ########## first to buffer all reads and a second time to stitch them
    ########## otherwise the random file access can make LAstitch pretty slow.
    ########## Another option would be, to keep the whole db in cache (/dev/shm/)
    q.block("{path}/LAstitch -f 50 {db} {db}.{block}.las {db}.{block}.stitch.las")

    ########## create quality and trim annotation (tracks) for each overlap block and merge them
    q.block("{path}/LAq -s 5 -T trim0 -b {block} {db} {db}.{block}.stitch.las")
    q.single("{path}/TKmerge -d {db} q")
    q.single("{path}/TKmerge -d {db} trim0")

    ########## create a repeat annotation (tracks) for each overlap block and merge them
    q.block("{path}/LArepeat -c {coverage} -b {block} {db} {db}.{block}.stitch.las")
    q.single("{path}/TKmerge -d {db} repeats")

    ########## detects "borders" in overlaps due to bad regions within reads that were not detected
    ########## in LAfix. Those regions can be quite big (several Kb). If gaps within a read are
    ########## detected LAgap chooses the longer part of the read as valid range. The other part(s) are
    ########## discarded
    ########## option -L (see LAstitch) is also available
    q.block("{path}/LAgap -t trim0 {db} {db}.{block}.stitch.las {db}.{block}.gap.las")

    ########## create a new trim1 track, (trim0 is kept)
    q.block("{path}/LAq -s 5 -u -t trim0 -T trim1 -b {block} {db} {db}.{block}.gap.las")
    q.single("{path}/TKmerge -d {db} trim1")

    ########## based on different filter criteria filter out: local alignments, repeat-induced alignments,
    ########## previously discarded alignments, ....
    ########## -r repeats, -t trim1 ... use repeats and trim1 track
    ########## -n 300 ... overlaps must be anchored by at least 300 bases (non-repeats)
    ########## -u 0 ... overlaps with unaligned bases according to the trim1 interval are discarded
    ########## -o 2000 ... overlaps shorter than 2k bases are discarded
    ########## -p ... purge overlaps, overlaps are not written into the output file
    ########## option -L (see LAstitch) is also available
    q.block("{path}/LAfilter -n 300 -r repeats -t trim1 -T -o 2000 -u 0 {db} {db}.{block}.gap.las {db}.{block}.filtered.las")

    ########## merge all filtered overlap files into one overlap file
    q.single("{path}/LAmerge -S filtered {db} {db}.filtered.las")

    ########## create overlap graph
    q.single("{path}/OGbuild -t trim1 {db} {db}.filtered.las {db}.graphml")

    ########## tour the overlap graph and create contigs paths
    q.single("{path_scripts}/OGtour.py -c {db} {db}.graphml")

    q.single("{path}/LAcorrect -j 4 -r {db}.tour.rids {db} {db}.filtered.las {db}.corrected")
    q.single("{path}/FA2db -c {db}_CORRECTED [expand:{db}.corrected.*.fasta]")

    ########## create contig fasta files
    q.single("{path_scripts}/tour2fasta.py -c {db}_CORRECTED -t trim1 {db} {db}.tour.graphml {db}.tour.paths")

    ###### optional: create a layout of the overlap graph which can viewed in a Browser (svg) or Gephi (dot)
    ## q.single("{path}/OGlayout -R {db}.tour.graphml {db}.tour.layout.svg")
    q.single("{path}/OGlayout -R {db}.tour.graphml {db}.tour.layout.dot")

    q.process()
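The q.block()/q.single() calls above build shell commands by substituting placeholders such as {path}, {db} and {block} before the plan is executed. A toy illustration of that substitution (marvel.queue's real implementation differs; the function names and the /opt/MARVEL/bin path here are hypothetical):

```python
# Toy stand-in for marvel.queue's placeholder expansion: a block command
# is instantiated once per database block, a single command only once.
def expand_block(template, db, nblocks, path="/opt/MARVEL/bin"):
    return [template.format(path=path, db=db, block=b)
            for b in range(1, nblocks + 1)]

def expand_single(template, db, path="/opt/MARVEL/bin"):
    return template.format(path=path, db=db)

cmds = expand_block("{path}/LAq -b {block} {db} {db}.{block}.las", "ECOL", 2)
print(cmds)
# → ['/opt/MARVEL/bin/LAq -b 1 ECOL ECOL.1.las',
#    '/opt/MARVEL/bin/LAq -b 2 ECOL ECOL.2.las']
print(expand_single("{path}/TKmerge -d {db} q", "ECOL"))
# → /opt/MARVEL/bin/TKmerge -d ECOL q
```

This is why the same do.py works for any database name and block count: only the template strings are fixed.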

    4. Run Examples

    4.1. Step-by-step run

    The assembly procedure for "The axolotl genome and the evolution of key tissue formation regulators", taken from MARVEL/examples/axolotl/README on the MARVEL GitHub page (the database initialization includes the LAfix read-patching step).

    This file contains the steps we took to assemble the axolotl genome.


    ########## READ PATCHING PHASE

    ## create initial database
    FA2db -v -x 2000 -b AXOLOTL -f axoFiles.txt

    ## split database into blocks
    DBsplit AXOLOTL

    ## create dust track
    DBdust AXOLOTL

    ## create daligner and merge plans, replace SERVER and PORT
    HPCdaligner -v -t 100 -D SERVER:PORT -r1 -j16 --dal 32 --mrg 32 -o AXOLOTL AXOLOTL
    ########## the four steps above can be run via the DBprepare.py script ##########
    ## start dynamic repeat masking server on SERVER:PORT, replace PORT (optional; can be skipped)
    DMserver -t 16 -p PORT -C AXOLOTL 40 AXOLOTL_CP
    ################################################################
    ## AXOLOTL.daligner.plan and AXOLOTL.merge.plan are generated by DBprepare.py; they contain the daligner and LAmerge commands to run
    ## run daligner to create initial overlaps
    AXOLOTL.daligner.plan

    ## after all daligner jobs are finished the dynamic repeat masker has to be shut down (optional; can be skipped)
    DMctl -h HOST -p PORT shutdown

    ###### run LAmerge to merge overlap blocks
    AXOLOTL.merge.plan
    ##############################################################################################################################################################################
    ## create quality and trim annotation (tracks) for each overlap block for each database block
    LAq -b <block> AXOLOTL AXOLOTL.<block>.las

    TKmerge -d AXOLOTL q
    TKmerge -d AXOLOTL trim

    ## run LAfix to patch reads based on overlaps for each database block
    LAfix -c -x 2000 AXOLOTL AXOLOTL.<block>.las AXOLOTL.<block>.fixed.fasta


    ########## ASSEMBLY PHASE

    ## create a new database with the fixed reads
    FA2db -v -x 2000 -c AXOLOTL_FIX AXOLOTL.*.fixed.fasta

    ## split database into blocks
    DBsplit AXOLOTL_FIX

    ## combine repeat tracks maskr and maskc that were created during read patching phase
    TKcombine AXOLOTL_FIX mask maskr maskc

    ## create daligner and merge plans, replace SERVER and PORT
    HPCdaligner -v -t 100 -D SERVER:PORT -m mask -r2 -j16 --dal 32 --mrg 32 -o AXOLOTL_FIX AXOLOTL_FIX

    ## start dynamic repeat masking server on SERVER:PORT, replace PORT
    DMserver -t 16 -p PORT -C AXOLOTL_FIX 40 AXOLOTL_CP

    ###### run daligner to create overlaps
    AXOLOTL_FIX.dalign.plan

    ## after all daligner jobs are finished the dynamic repeat masker has to be shut down
    DMctl -h HOST -p PORT shutdown

    ###### run LAmerge to merge overlap blocks
    AXOLOTL_FIX.merge.plan

    ###### SCRUBBING PHASE

    ## repair alignments that prematurely stopped due to left-over errors in the reads for each database block
    LAstitch -f 50 AXOLOTL_FIX AXOLOTL_FIX.<block>.las AXOLOTL_FIX.<block>.stitch.las

    ## create quality and trim annotation for each database block
    LAq -T trim0 -s 5 -b <block> AXOLOTL_FIX AXOLOTL_FIX.<block>.stitch.las

    TKmerge -d AXOLOTL_FIX q
    TKmerge -d AXOLOTL_FIX trim0

    ## create a repeat annotation for each database block
    LArepeat -c <coverage> -l 1.5 -h 2.0 -b <block> AXOLOTL_FIX AXOLOTL_FIX.<block>.stitch.las

    TKmerge -d AXOLOTL_FIX repeats

    ## merge duplicate & overlapping annotation repeat annotation and masking server output
    TKcombine {db} frepeats repeats maskc maskr

    ## remove gaps (ie. due to chimeric reads, ...) for each database block
    LAgap -s 100 -t trim0 AXOLOTL_FIX AXOLOTL_FIX.<block>.stitch.las AXOLOTL_FIX.<block>.gap.las

    ## recalculate the trim track based on the cleaned up gaps for each database block
    LAq -u -t trim0 -T trim1 -b <block> AXOLOTL_FIX AXOLOTL_FIX.<block>.gap.las

    TKmerge -d AXOLOTL_FIX trim1

    ## filter repeat induced alignments and try to resolve repeat modules for each database block
    LAfilter -p -s 100 -n 300 -r frepeats -t trim1 -o 1000 -u 0 AXOLOTL_FIX AXOLOTL_FIX.<block>.gap.las AXOLOTL_FIX.<block>.filtered.las

    ## not much is left now, so we can merge everything into a single las file
    LAmerge -S filtered AXOLOTL_FIX AXOLOTL_FIX.filtered.las

    ## create overlap graph
    mkdir -p components
    OGbuild -t trim1 -s -c 1 AXOLOTL_FIX AXOLOTL_FIX.filtered.las components/AXOLOTL_FIX

    ## tour the overlap graph and create contigs paths for each components/*.graphml
    OGtour.py -c AXOLOTL_FIX <component.graphml>

    ## create contig fasta files for each components/*.paths
    tour2fasta.py -t trim1 AXOLOTL_FIX <component.graphml> <component.paths>

    4.2. All-in-one run

    source /public/home/software/.bashrc
    export PATH="/public/home/cotton/software/MARVEL-bin/bin:$PATH"
    export PATH="/public/home/zpxu/miniconda3/bin:$PATH"
    export LD_LIBRARY_PATH="/public/home/zpxu/miniconda3/lib:$LD_LIBRARY_PATH"

    cd /public/home/zpxu/AS/results/MARVEL
    fa="/public/home/zpxu/AS/PacBio/bmk_nh_filtered_subreads.fasta"

    echo "1. initialize the database"
    DBprepare.py AS ${fa}
    cd /public/home/zpxu/AS/results/MARVEL
    ##sed -i "s/^/\/public\/home\/cotton\/software\/MARVEL-bin\/bin\//g" /public/home/zpxu/AS/results/MARVEL/AS.dalign.plan
    ##sed -i "s/^/\/public\/home\/cotton\/software\/MARVEL-bin\/bin\//g" /public/home/zpxu/AS/results/MARVEL/AS.merge.plan
    python /public/home/zpxu/AS/scripts/MARVEL_do.py

    5. Parameter Notes

    5.1. Thread count

    The -j parameter sets the number of threads; it must be a power of 2.
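Since -j must be a power of 2, the value can be validated before submitting a job; a small sketch of the standard bit trick:

```python
def is_power_of_two(n: int) -> bool:
    # A power of two has exactly one bit set, so n & (n - 1) clears it to 0.
    return n > 0 and (n & (n - 1)) == 0

print(is_power_of_two(16))  # True  -> -j 16 is valid
print(is_power_of_two(12))  # False -> -j 12 is not
```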

    5.2. HPCdaligner's -D SERVER:PORT

    See the explanation in HPCdaligner parameter <-D host:port> #7: this parameter is needed when assembling repeat-rich genomes; otherwise it can be omitted, along with the subsequent DMserver step. For example:

    HPCdaligner -v -t 100 -r1 -j16 --dal 32 --mrg 32 -o AXOLOTL AXOLOTL

    5.3. DMserver's -p PORT and expected.coverage

    The main purpose of the dynamic repeat masking server is to reduce the CPU and memory requirements of assembling large genomes:

    While preliminary calculations for computational time and storage space estimated over multiple millions of CPU hours and >2 PB of storage for one daligner run, the usage of a dynamic repeat masking server (below) reduced this dramatically to 150,000 CPU hours and 120 Tb of storage space for the complete pipeline.
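For scale, the reduction factors implied by the quote can be checked with quick arithmetic; since the text only says "multiple millions" of CPU hours, 2 million is an assumed baseline here:

```python
# Figures from the quote; the CPU-hour baseline is an assumption
# ("multiple millions" taken as 2 million).
baseline_storage_tb = 2 * 1000   # 2 PB expressed in TB
masked_storage_tb = 120          # with dynamic repeat masking

baseline_cpu_hours = 2_000_000   # assumed
masked_cpu_hours = 150_000

print(f"storage reduced ~{baseline_storage_tb / masked_storage_tb:.0f}x")
print(f"CPU time reduced ~{baseline_cpu_hours / masked_cpu_hours:.0f}x")
```

Either way, the masking server cuts both storage and compute by an order of magnitude, which is what makes daligner feasible on a 32 Gb-scale genome.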