Running QTL-seq & MutMap on Windows


This post simply records how to run QTL-seq & MutMap on Windows with as little coding as possible; it does not explain the principles of QTL-seq & MutMap or the reasoning behind each step in detail. For the theory and experimental design, see:
QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA from two bulked populations;
Genome sequencing reveals agronomically important loci in rice using MutMap;
High performance pipeline for MutMap and QTL-seq;
QTL-seq analysis: principles and workflow;
Rapid mapping of mutant genes with MutMap: the principles.
QTL-seq and MutMap both evolved from BSA (bulked segregant analysis). They are mature methods with modest sequencing costs, since only two or three bulked samples are needed. In short, a quantitative trait is converted into a qualitative one, and the delta SNP-index statistic is used to locate QTLs.
This post mainly uses the pipelines from the Iwate Biotechnology Research Center, which need a few small modifications to run on Windows.
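As a reminder of the statistic the pipeline computes: the SNP-index of a bulk is the fraction of reads carrying the alternative allele at a site, and delta SNP-index is the difference between the two bulks. A minimal illustration (not part of the pipeline; the function names are made up for this sketch):

```python
def snp_index(alt_depth, total_depth):
    """SNP-index = reads supporting the ALT allele / total read depth."""
    return alt_depth / total_depth

def delta_snp_index(bulk1_alt, bulk1_depth, bulk2_alt, bulk2_depth):
    """delta SNP-index = SNP-index(bulk2) - SNP-index(bulk1)."""
    return snp_index(bulk2_alt, bulk2_depth) - snp_index(bulk1_alt, bulk1_depth)

# bulk1 segregates at ~0.5 while bulk2 is nearly fixed for the ALT allele:
# a delta SNP-index far from 0 flags a candidate QTL region.
print(round(delta_snp_index(10, 20, 19, 20), 4))  # 0.45
```

Sites where the delta SNP-index consistently approaches 0.5 (or -0.5) across a sliding window are the candidate regions the pipeline reports.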

Required tools

conda (add it to PATH; Python 3.8 or later is recommended, and Miniconda is enough) Installing Miniconda3 on Windows 10
perl5 (add to PATH; I installed Strawberry Perl) Installing Perl on Windows
java (add to PATH) Java installation tutorial for Windows
SRA Toolkit (converts .sra files to reads) download link
bowtie2 (add to PATH; the latest release has not been built for Windows, so version 2.3.4 is used here; pick the zip whose name contains mingw. Running bowtie2 requires perl on the PATH) download link
snpEff (annotates VCF files; requires java) download link
Trimmomatic (trims read adapters; requires java) download link
samtools (add to PATH; version 1.3.1 is used here. The original author Heng Li wrote a makefile that builds with mingw/msys2, but building the latest 1.13 kept failing, so an older Windows-capable version is used instead) download link
QTL-seq download link (download it to your machine first, then install from the local copy; under Code, click Download ZIP). The package itself calls SNPs and computes the delta SNP-index, but it uses subprocess and multiprocessing, so a few places must be adapted for Windows, namely the .py files in the qtlseq folder inside QTL-seq.
1. trim.py and qtlplot.py: Trimmomatic and snpEff are launched with java -jar, so the related commands must be modified.
In trim.py, edit the commands on lines 44 and 65 so that java -jar is followed by the actual location of trimmomatic-0.39.jar. Note that, given this command, adapter trimming is only available for paired-end reads.

In qtlplot.py, edit line 36 so that java -jar is followed by the actual location of snpEff.jar.
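For example, the edited command strings might look like the lines below. The jar locations are placeholders for wherever you unpacked the tools, and only the java -jar part matters; the trailing arguments are abbreviated here:

```python
# trim.py (around lines 44 and 65): put the full path to the Trimmomatic jar
# after "java -jar". The path is a placeholder -- use your own install location.
trim_cmd = 'java -jar C:/tools/Trimmomatic-0.39/trimmomatic-0.39.jar PE ...'

# qtlplot.py (line 36): the same change for snpEff.
snpeff_cmd = 'java -jar C:/tools/snpEff/snpEff.jar ...'
```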

2. The original pipeline aligns reads with bwa, but bwa turned out to be very slow on Windows: aligning one 30x sample took 27 hours. It is swapped for bowtie2 here, which finishes the same sample in about 5 hours. This requires modifying refindex.py and alignment.py.
refindex.py only changes the index-building command.
Replacement refindex.py:

import sys
import subprocess as sbp
from qtlseq.utils import time_stamp, clean_cmd, call_log

class RefIndex(object):

    def __init__(self, args):
        self.args = args
        self.out = args.out

    def run(self):
        print(time_stamp(),
              'start to index reference fasta.',
              flush=True)

        cmd1 = 'bowtie2-build {0} {1}/10_ref/ref \
                >> {1}/log/bowtie2.log \
                2>&1'.format(self.args.ref, self.out)

        cmd2 = 'samtools faidx {0} \
                >> {1}/log/samtools.log \
                2>&1'.format(self.args.ref, self.out)

        cmd1 = clean_cmd(cmd1)
        cmd2 = clean_cmd(cmd2)

        try:
            sbp.run(cmd1,
                    stdout=sbp.DEVNULL,
                    stderr=sbp.DEVNULL,
                    shell=True,
                    check=True)
        except sbp.CalledProcessError:
            call_log(self.out, 'bowtie2', cmd1)
            sys.exit(1)

        try:
            sbp.run(cmd2,
                    stdout=sbp.DEVNULL,
                    stderr=sbp.DEVNULL,
                    shell=True,
                    check=True)
        except sbp.CalledProcessError:
            call_log(self.out, 'samtools', cmd2)
            sys.exit(1)

        print(time_stamp(),
              'indexing of reference successfully finished.',
              flush=True)

alignment.py changes the alignment commands; after all, a home Windows machine is unlikely to have 100 GB of RAM.
Replacement alignment.py:

import sys
import os
import subprocess as sbp
from qtlseq.utils import time_stamp, clean_cmd, call_log


class Alignment(object):

    def __init__(self, args):
        self.args = args
        self.out = args.out

    def run(self, fastq1, fastq2, index):
        print(time_stamp(),
              'start to align reads of {} by bowtie2.'.format(index),
              flush=True)

        cmd1 = 'bowtie2 -p {0} -x {3}/10_ref/ref -1 {1} -2 {2} -S {3}/20_bam/{4}.sam >> {3}/log/alignment.log '.format(self.args.threads,fastq1,fastq2,self.out,index)
        cmd2 = 'samtools fixmate -O bam {0}/20_bam/{1}.sam {0}/20_bam/{1}_fixmate.bam '.format(self.out,index)
        cmd3 = 'rm {0}/20_bam/{1}.sam '.format(self.out,index)
        cmd4 = 'samtools sort -O bam -m 1G -@ {0} -o {1}/20_bam/{2}_sort.bam {1}/20_bam/{2}_fixmate.bam '.format(self.args.threads,self.out,index)
        cmd5 = 'rm {0}/20_bam/{1}_fixmate.bam '.format(self.out,index)
        cmd6 = 'samtools rmdup -sS {0}/20_bam/{1}_sort.bam {0}/20_bam/{1}_rmdup.bam '.format(self.out,index)
        cmd7 = 'rm {0}/20_bam/{1}_sort.bam '.format(self.out,index)
        cmd8 = 'samtools view -b -f 2 -F 2048 -o {0}/20_bam/{1}.bam {0}/20_bam/{1}_rmdup.bam '.format(self.out,index)
        cmd9 = 'rm {0}/20_bam/{1}_rmdup.bam '.format(self.out,index)

        cmd = [cmd1, cmd2, cmd3, cmd4, cmd5, cmd6, cmd7, cmd8, cmd9]
        for i in cmd:
            tem = clean_cmd(i)
            if tem.startswith("rm"):
                target = tem.split(" ")[1]
                os.remove(target)
                continue
            try:
                sbp.run(tem,
                    stdout=sbp.DEVNULL,
                    stderr=sbp.DEVNULL,
                    shell=True,
                    check=True)
            except sbp.CalledProcessError:
                call_log(self.out, 'alignment', tem)
                sys.exit(1)

        print(time_stamp(),
              'alignment {} successfully finished.'.format(index),
              flush=True)

3. mpileup.py and vcf2index.py use pool.map, which may fail to open child processes, and the pipe character "|" may not work through subprocess. Both are changed here.
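The general workaround for the pipe problem, which the replacement mpileup.py applies to the samtools/bcftools stages, is to write each stage's output to an intermediate file and run the stages as separate subprocess calls. A tiny self-contained sketch of the pattern, using sort and uniq as stand-in commands (not the pipeline's actual ones):

```python
import subprocess

# Create a tiny input file for the demo.
with open("input.txt", "w") as f:
    f.write("b\na\nb\n")

# Instead of piping: sort input.txt | uniq > final.out
# run each stage separately, connected through an intermediate file --
# the same pattern the modified mpileup.py uses to avoid "|".
with open("stage1.out", "w") as out1:
    subprocess.run(["sort", "input.txt"], stdout=out1, check=True)
with open("stage1.out") as in2, open("final.out", "w") as out2:
    subprocess.run(["uniq"], stdin=in2, stdout=out2, check=True)
```

The temporary file (here stage1.out) can then be deleted, which is why the replacement scripts interleave rm-style cleanup steps between commands.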
Replacement mpileup.py:

import sys
import re
import os
import glob
import subprocess as sbp
import multiprocessing
import multiprocessing.pool
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
from qtlseq.vcf2index import Vcf2Index
from qtlseq.utils import time_stamp, clean_cmd, call_log

class Mpileup(object):
    def __init__(self, args):
        self.out = args.out
        self.args = args

    def get_bams(self, label):
        bams = glob.glob('{}/20_bam/{}*.bam'.format(self.out, label))
        return bams

    def merge(self):
        for label in ['parent', 'bulk1', 'bulk2']:
            bams = self.get_bams(label)
            if len(bams) == 1:
                path_to_bam = os.path.abspath(bams[0])
                cmd1 = 'ln -s {0} {1}/20_bam/{2}.unsorted.bam'.format(path_to_bam,
                                                                   self.out,
                                                                   label)
            else:
                cmd1 = 'samtools merge -f {0}/20_bam/{1}.unsorted.bam \
                                          {0}/20_bam/{1}*.bam \
                                          >> {0}/log/samtools.log \
                                          2>&1'.format(self.out, label)

            cmd2 = 'samtools sort -m {0} \
                                  -@ {1} \
                                  -o {2}/20_bam/{3}.bam \
                                  {2}/20_bam/{3}.unsorted.bam \
                                  >> {2}/log/samtools.log \
                                  2>&1'.format(self.args.mem,
                                               self.args.threads,
                                               self.out,
                                               label)

            cmd3 = 'samtools index {0}/20_bam/{1}.bam \
                                   >> {0}/log/samtools.log \
                                   2>&1'.format(self.out, label)

            cmd4 = 'rm {}/20_bam/{}.*.bam'.format(self.out, label)

            cmd = [cmd1, cmd2, cmd3, cmd4]
            for i in cmd:
                tem = clean_cmd(i)
                if tem.startswith("rm"):
                    target = glob.glob(tem.split(" ")[1])
                    for t in target:
                        os.remove(t)
                    continue
                if tem.startswith("ln"):
                    target = tem.split(" ")
                    os.symlink(target[2], target[3])
                    continue
                try:
                    sbp.run(tem,
                        stdout=sbp.DEVNULL,
                        stderr=sbp.DEVNULL,
                        shell=True,
                        check=True)
                except sbp.CalledProcessError:
                    call_log(self.out, 'samtools', tem)
                    sys.exit(1)

    def get_header(self):
        ref = open(self.args.ref, 'r')
        pattern = re.compile('>')
        chr_names = []
        for line in ref:
            if pattern.match(line):
                line = line.rstrip('\n')
                chr_name = re.split('[> ]',line)[1]
                chr_names.append(chr_name)
        return chr_names

    def mpileup(self, chr_name):
        print("call variants in {} ".format(chr_name))
        cmd1 = 'samtools mpileup -t AD,ADF,ADR -B -q {0} -Q {1} -C {2} -v -u -o {3}/30_vcf/tem_mpileup.{4}.vcf -r {4} -f {5} --ignore-RG {3}/20_bam/parent.bam {3}/20_bam/bulk1.bam {3}/20_bam/bulk2.bam >> {3}/log/bcftools.{4}.log 2>&1 '.format(self.args.min_MQ,self.args.min_BQ,self.args.adjust_MQ,self.out,chr_name,self.args.ref)
        cmd2 = 'bcftools call -vm -f GQ,GP -O z -o {1}/30_vcf/tem_call.{2}.vcf.gz {1}/30_vcf/tem_mpileup.{2}.vcf >> {1}/log/bcftools.{2}.log 2>&1 '.format(self.args.threads,self.out, chr_name) 
        cmd3 = 'bcftools filter -i "INFO/MQ>={3}" -O z -o {1}/30_vcf/qtlseq.{2}.vcf.gz {1}/30_vcf/tem_call.{2}.vcf.gz >> {1}/log/bcftools.{2}.log 2>&1 '.format(self.args.threads,self.out, chr_name, self.args.min_MQ) 
        cmd4 = 'tabix -f -p vcf {0}/30_vcf/qtlseq.{1}.vcf.gz >> {0}/log/tabix.{1}.log 2>&1 '.format(self.out, chr_name)
        cmd5 = 'rm {0}/30_vcf/tem_mpileup.{1}.vcf'.format(self.out, chr_name)
        cmd6 = 'rm {0}/30_vcf/tem_call.{1}.vcf.gz'.format(self.out, chr_name)

        cmd = [cmd1, cmd2, cmd3, cmd4, cmd5, cmd6]

        for i in cmd:
            tem = clean_cmd(i)
            if tem.startswith("rm"):
                target = glob.glob(tem.split(" ")[1])
                for t in target:
                    os.remove(t)
                continue
            try:
                sbp.run(tem,
                    stdout=sbp.DEVNULL,
                    stderr=sbp.DEVNULL,
                    shell=True,
                    check=True)
            except sbp.CalledProcessError:
                if tem.startswith("tabix"):
                    call_log(self.out, 'tabix.{0}'.format(chr_name), tem)
                if tem.startswith("bcftools") or tem.startswith("samtools"):
                    call_log(self.out, 'bcftools.{0}'.format(chr_name), tem)
                sys.exit(1)

    def concat(self):
        cmd1 = 'cat {0}/log/bcftools.*.log > {0}/log/bcftools.log'.format(self.out)
        cmd2 = 'cat {0}/log/tabix.*.log > {0}/log/tabix.log'.format(self.out)

        cmd3 = 'bcftools concat --threads {1} -O z \
                                -o {0}/30_vcf/qtlseq.vcf.gz \
                                {0}/30_vcf/qtlseq.*.vcf.gz \
                                >> {0}/log/bcftools.log \
                                2>&1'.format(self.out,self.args.threads)

        cmd4 = 'rm {}/30_vcf/qtlseq.*.vcf.gz'.format(self.out)
        cmd5 = 'rm {}/30_vcf/qtlseq.*.vcf.gz.tbi'.format(self.out)
        cmd6 = 'rm {}/log/bcftools.*.log'.format(self.out)
        cmd7 = 'rm {}/log/tabix.*.log'.format(self.out)

        cmd = [cmd1, cmd2, cmd3, cmd4, cmd5, cmd6, cmd7]

        for i in cmd:
            tem = clean_cmd(i)
            if tem.startswith("rm"):
                target = glob.glob(tem.split(" ")[1])
                for t in target:
                    os.remove(t)
                continue
            if tem.startswith("cat"):
                target = glob.glob(tem.split(" ")[1])
                outfile = open(tem.split(" ")[3], "a")
                for t in target:
                    for line in open(t):
                        outfile.writelines(line)
                        outfile.write('\n')
                outfile.close()
                continue
            try:
                sbp.run(tem,
                    stdout=sbp.DEVNULL,
                    stderr=sbp.DEVNULL,
                    shell=True,
                    check=True)
            except sbp.CalledProcessError:
                if tem.startswith("bcftools"):
                    call_log(self.out, 'bcftools', tem)
                sys.exit(1)

    def mkindex(self):
        cmd = 'tabix -f \
                     -p vcf \
                     {0}/30_vcf/qtlseq.vcf.gz \
                     >> {0}/log/tabix.log \
                     2>&1'.format(self.out)

        cmd = clean_cmd(cmd)

        try:
            sbp.run(cmd,
                    stdout=sbp.DEVNULL,
                    stderr=sbp.DEVNULL,
                    shell=True,
                    check=True)
        except sbp.CalledProcessError:
            call_log(self.out, 'tabix', cmd)
            sys.exit(1)

    def run(self):
        print(time_stamp(), 'start to merge BAMs.', flush=True)
        self.merge()
        print(time_stamp(), 'merging process successfully finished.', flush=True)

        print(time_stamp(), 'start to call variants.', flush=True)
        chr_names = self.get_header()
#        for chr_name in chr_names:
#            self.mpileup(chr_name)
        threads = self.args.threads
        if self.args.threads>=len(chr_names):
            threads = len(chr_names)
        p = ThreadPool(threads)
        p.map(self.mpileup, chr_names)
        p.close()
        self.concat()
        print(time_stamp(), 'variant calling successfully finished.', flush=True)

        print(time_stamp(), 'start to index VCF.', flush=True)
        self.mkindex()
        print(time_stamp(), 'indexing VCF successfully finished.', flush=True)


Replacement vcf2index.py:

import os
import re
import sys
import gzip
import threading
import multiprocessing
import multiprocessing.pool
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
from multiprocessing import Manager
lock = threading.Lock()
cache = Manager().dict()
import numpy as np
import pandas as pd
from functools import lru_cache
from qtlseq.snpfilt import SnpFilt
from qtlseq.smooth import Smooth
from qtlseq.utils import time_stamp


class Vcf2Index(object):

    def __init__(self, args):
        self.out = args.out
        self.vcf = args.vcf
        self.snpEff = args.snpEff
        self.filial = args.filial
        self.species = args.species
        self.N_bulk1 = args.N_bulk1
        self.N_bulk2 = args.N_bulk2
        self.N_replicates = args.N_rep
        self.min_SNPindex = args.min_SNPindex
        self.snp_index = '{}/snp_index.tsv'.format(self.out)
        self.sf = SnpFilt(args)
        self.args = args

        if self.snpEff is not None:
            self.ANN_re = re.compile(';ANN=(.*);*')

        if self.species is None:
            self.p99_index = int(0.99*self.N_replicates) - 1
            self.p95_index = int(0.95*self.N_replicates) - 1
        else:
            k = self.correct_threshold()
            corr_p99 = 0.01/k
            corr_p95 = 0.05/k
            self.p99_index = int((1 - corr_p99)*self.N_replicates) - 1
            self.p95_index = int((1 - corr_p95)*self.N_replicates) - 1

            if int(corr_p99*self.N_replicates) - 1 < 0:
                print(('!!WARNING!! Number of replicates for simulation is not '
                       'enough to consider multiple testing correction. '
                       'Therefore, the highest SNP-index and the second highest '
                       'SNP-index were selected for p99 and p95, respectively.'), 
                       file=sys.stderr)

                self.p99_index = self.N_replicates - 1
                self.p95_index = self.N_replicates - 2

    def correct_threshold(self):
        if self.filial == 2:
            l = 8.4
        elif self.filial == 3:
            l = 5.8
        elif self.filial == 4:
            l = 5.0
        else:
            l = 4.5

        if self.species == 'Arabidopsis':
            k = 5 + 600/l
        elif self.species == 'Cucumber':
            k = 7 + 1390/l
        elif self.species == 'Maize':
            k = 10 + 2060/l
        elif self.species == 'Rapeseed':
            k = 18 + 2520/l
        elif self.species == 'Rice':
            k = 12 + 1530/l
        elif self.species == 'Tobacco':
            k = 12 + 3270/l
        elif self.species == 'Tomato':
            k = 12 + 1470/l
        elif self.species == 'Wheat':
            k = 21 + 3140/l
        elif self.species == 'Yeast':
            k = 16 + 4900/l
            if self.filial != 2:
                print('!!WARNING!! Filial generation must be 2 in yeast.', file=sys.stderr)

        else:
            print('You specified not supported species.', file=sys.stderr)
            sys.exit(1)

        return k


    def get_field(self):
        root, ext = os.path.splitext(self.vcf)
        if ext == '.gz':
            vcf = gzip.open(self.vcf, 'rt')
        else:
            vcf = open(self.vcf, 'r')
        for line in vcf:
            if re.match(r'[^#]', line):
                fields = line.split()[8].split(':')
                try:
                    GT_pos = fields.index('GT')
                except ValueError:
                    sys.stderr.write(('{} No GT field'
                                      ' in your VCF!!\n').format(time_stamp()))
                    sys.exit(1)
                try:
                    AD_pos = fields.index('AD')
                except ValueError:
                    sys.stderr.write(('{} No AD field'
                                      ' in your VCF!!\n').format(time_stamp()))
                    sys.exit(1)

                if 'ADF' in fields and 'ADR' in fields:
                    ADF_pos = fields.index('ADF')
                    ADR_pos = fields.index('ADR')
                else:
                    ADF_pos = None
                    ADR_pos = None
                    sys.stderr.write(('{} no ADF or ADR field'
                                      ' in your VCF.\n').format(time_stamp()))
                    sys.stderr.write(('{} strand bias filter '
                                      'will be skipped.\n').format(time_stamp()))
                break
        vcf.close()
        return GT_pos, AD_pos, ADF_pos, ADR_pos

    def get_variant_impact(self, annotation):
        ANN = self.ANN_re.findall(annotation)[0]
        genes = ANN.split(',')
        impacts = [gene.split('|')[2] for gene in genes]
        if 'HIGH' in impacts:
            impact = 'HIGH'
        elif 'MODERATE' in impacts:
            impact = 'MODERATE'
        elif 'LOW' in impacts:
            impact = 'LOW'
        else:
            impact = 'MODIFIER'
        return impact

    def check_variant_type(self, REF, ALT):
        if (len(REF) == 1) and (len(ALT) == 1):
            variant = 'snp'
        else:
            variant = 'indel'
        return variant

    def check_depth(self, bulk1_depth, bulk2_depth):
        if bulk1_depth < bulk2_depth:
            depth1 = bulk1_depth
            depth2 = bulk2_depth
        else:
            depth1 = bulk2_depth
            depth2 = bulk1_depth
        return depth1, depth2

    def Fn_simulation(self, depth1, depth2):
        GT = [0, 1]
        replicates = []
        n = 0
        while n < self.N_replicates:
            bulk1_Fn_GT = np.ones(self.N_bulk1)
            bulk2_Fn_GT = np.ones(self.N_bulk2)
            for _ in range(self.args.filial - 1):
                bulk1_random_gt = np.random.choice([0, 1, 1, 2],
                                                   bulk1_Fn_GT.shape,
                                                   replace=True)

                bulk2_random_gt = np.random.choice([0, 1, 1, 2],
                                                   bulk2_Fn_GT.shape,
                                                   replace=True)

                bulk1_Fn_GT = np.where(bulk1_Fn_GT==1,
                                       bulk1_random_gt,
                                       bulk1_Fn_GT)

                bulk2_Fn_GT = np.where(bulk2_Fn_GT==1,
                                       bulk2_random_gt,
                                       bulk2_Fn_GT)

            bulk1_freq = sum(bulk1_Fn_GT)/(2*self.N_bulk1)
            bulk2_freq = sum(bulk2_Fn_GT)/(2*self.N_bulk2)

            bulk1_AD = np.random.choice([0, 1],
                                        depth1,
                                        p=[1 - bulk1_freq, bulk1_freq],
                                        replace=True)

            bulk2_AD = np.random.choice([0, 1],
                                        depth2,
                                        p=[1 - bulk2_freq, bulk2_freq],
                                        replace=True)

            bulk1_SNPindex = bulk1_AD.sum()/depth1
            bulk2_SNPindex = bulk2_AD.sum()/depth2

            if bulk1_SNPindex < self.min_SNPindex and \
               bulk2_SNPindex < self.min_SNPindex:
                continue
            else:
                delta_SNPindex = bulk2_SNPindex - bulk1_SNPindex
                replicates.append(abs(delta_SNPindex))
                n += 1

        replicates.sort()
        p99 = replicates[self.p99_index]
        p95 = replicates[self.p95_index]
        return p99, p95

    def calculate_SNPindex_sub(self, line):
        if re.match(r'[^#]', line):
            cols = line.split('\t')
            CHR = cols[0]
            POS = cols[1]
            REF = cols[3]
            ALT = cols[4]
            annotation = cols[7]
            GT_pos = self.field_pos[0]
            AD_pos = self.field_pos[1]
            ADF_pos = self.field_pos[2]
            ADR_pos = self.field_pos[3]

            parent_GT = cols[9].split(':')[GT_pos]
            parent_AD = cols[9].split(':')[AD_pos]
            bulk1_AD = cols[10].split(':')[AD_pos]
            bulk2_AD = cols[11].split(':')[AD_pos]

            if ADF_pos != None and ADR_pos != None:
                parent_ADF = cols[9].split(':')[ADF_pos]
                parent_ADR = cols[9].split(':')[ADR_pos]
                ADFR = (parent_ADF, parent_ADR)
            else:
                ADFR = None

            record = self.sf.filt(parent_GT, parent_AD, bulk1_AD, bulk2_AD, ADFR)
            if record['type'] == 'keep':
                variant = self.check_variant_type(REF, ALT)
                depth1, depth2 = self.check_depth(record['bulk1_depth'],
                                                  record['bulk2_depth'])

                if (depth1, depth2) in cache:
                    p99, p95 = cache[(depth1, depth2)]
                else:
                    p99, p95 = self.Fn_simulation(depth1, depth2)

                    cache[(depth1, depth2)] = (p99, p95)

                lock.acquire()
                snp_index = open(self.snp_index + ".temp", 'a')
                if self.snpEff is None:
                    snp_index.write(('{}\t{}\t{}\t{}\t{}\t{:.4f}\t{:.4f}\t'
                                     '{:.4f}\t{:.4f}\t{:.4f}\n').format(CHR,
                                                                        POS,
                                                                        variant,
                                                                        record['bulk1_depth'],
                                                                        record['bulk2_depth'],
                                                                        p99,
                                                                        p95,
                                                                        record['bulk1_SNPindex'],
                                                                        record['bulk2_SNPindex'],
                                                                        record['delta_SNPindex']))
                else:
                    impact = self.get_variant_impact(annotation)
                    snp_index.write(('{}\t{}\t{}\t{}\t{}\t{}\t{:.4f}\t{:.4f}\t'
                                     '{:.4f}\t{:.4f}\t{:.4f}\n').format(CHR,
                                                                        POS,
                                                                        variant,
                                                                        impact,
                                                                        record['bulk1_depth'],
                                                                        record['bulk2_depth'],
                                                                        p99,
                                                                        p95,
                                                                        record['bulk1_SNPindex'],
                                                                        record['bulk2_SNPindex'],
                                                                        record['delta_SNPindex']))
                snp_index.close()
                lock.release()

    def table_sort(self):
        if self.snpEff is None:
            snp_index = pd.read_csv('{}/snp_index.tsv.temp'.format(self.out),
                                    sep='\t',
                                    names=['CHROM',
                                           'POSI',
                                           'variant',
                                           'bulk1_depth',
                                           'bulk2_depth',
                                           'p99',
                                           'p95',
                                           'bulk1_SNPindex',
                                           'bulk2_SNPindex',
                                           'delta_SNPindex'])
        else:
            snp_index = pd.read_csv('{}/snp_index.tsv.temp'.format(self.out),
                                    sep='\t',
                                    names=['CHROM',
                                           'POSI',
                                           'variant',
                                           'impact',
                                           'bulk1_depth',
                                           'bulk2_depth',
                                           'p99',
                                           'p95',
                                           'bulk1_SNPindex',
                                           'bulk2_SNPindex',
                                           'delta_SNPindex'])

        snp_index = snp_index.sort_values(by=['CHROM', 'POSI'])

        snp_index.to_csv('{}/snp_index.tsv'.format(self.out),
                         sep='\t',
                         index=False,
                         header=False)

        os.remove('{}/snp_index.tsv.temp'.format(self.out))

    def calculate_SNPindex(self):
        if os.path.exists('{}/snp_index.tsv.temp'.format(self.out)):
            os.remove('{}/snp_index.tsv.temp'.format(self.out))

        root, ext = os.path.splitext(self.vcf)
        if ext == '.gz':
            vcf = gzip.open(self.vcf, 'rt')
        else:
            vcf = open(self.vcf, 'r')
#        for line in vcf:
#            self.calculate_SNPindex_sub(line)
        p = ThreadPool(self.args.threads)
        p.map(self.calculate_SNPindex_sub, vcf)
        p.close()

        vcf.close()

        self.table_sort()

    def run(self):
        print(time_stamp(), 'start to calculate SNP-index.', flush=True)
        self.field_pos = self.get_field()
        self.calculate_SNPindex()
        print(time_stamp(), 'SNP-index successfully finished.', flush=True)

        print(time_stamp(), 'start to smooth SNP-index.', flush=True)
        sm = Smooth(self.args)
        sm.run()
        print(time_stamp(), 'smoothing process successfully finished.', flush=True)

The process pool may still throw errors. If it does, comment out the lines that call pool.map (with #) and uncomment the for loop above them.
If the error is "permission denied", grant full permissions to miniconda's python.exe.
The package also depends on a few libraries,
which can be installed with conda:
conda install matplotlib numpy pandas seaborn
Once the files above are modified the package can be installed: open the QTL-seq folder and install it with pip. In a conda virtual environment, use python -m pip install -e .
For detailed usage, see QTL-seq.
In PowerShell, run:

cd QTL-seq    # enter the folder containing QTL-seq
pip install -e .    # install the package in this folder
qtlseq    # check that the installation succeeded

Use the bundled test files to check that everything runs:

cd test    # enter the test folder
qtlseq -o output -t 6 -n1 20 -n2 20 -w 100 -s 20 -r qtlseq_ref.fasta -p qtlseq_parent.1.fastq.gz,qtlseq_parent.2.fastq.gz -b1 qtlseq_bulk1.1.fastq.gz,qtlseq_bulk1.2.fastq.gz -b2 qtlseq_bulk2.1.fastq.gz,qtlseq_bulk2.2.fastq.gz


Result 1

Result 2

Result 3

The results differ slightly from the bundled reference output, mainly because bowtie2 and bwa align reads differently. This does not affect the main conclusions, especially once the sequencing depth is high enough.
MutMap download link. Its principle is similar to QTL-seq, and similar changes are needed; installation and testing are the same as above.
Replacement refindex.py:

import sys
import subprocess as sbp
from mutmap.utils import time_stamp, clean_cmd, call_log


class RefIndex(object):

    def __init__(self, args):
        self.args = args
        self.out = args.out

    def run(self):
        print(time_stamp(),
              'start to index reference fasta.',
              flush=True)

        cmd1 = 'bowtie2-build {0} {1}/10_ref/ref \
                >> {1}/log/bowtie2.log \
                2>&1'.format(self.args.ref, self.out)

        cmd2 = 'samtools faidx {0} \
                >> {1}/log/samtools.log \
                2>&1'.format(self.args.ref, self.out)

        cmd1 = clean_cmd(cmd1)
        cmd2 = clean_cmd(cmd2)

        try:
            sbp.run(cmd1,
                    stdout=sbp.DEVNULL,
                    stderr=sbp.DEVNULL,
                    shell=True,
                    check=True)
        except sbp.CalledProcessError:
            call_log(self.out, 'bowtie2', cmd1)
            sys.exit(1)

        try:
            sbp.run(cmd2,
                    stdout=sbp.DEVNULL,
                    stderr=sbp.DEVNULL,
                    shell=True,
                    check=True)
        except sbp.CalledProcessError:
            call_log(self.out, 'samtools', cmd2)
            sys.exit(1)

        print(time_stamp(),
              'indexing of reference successfully finished.',
              flush=True)

Replacement alignment.py:

import sys
import os
import subprocess as sbp
from mutmap.utils import time_stamp, clean_cmd, call_log


class Alignment(object):

    def __init__(self, args):
        self.args = args
        self.out = args.out

    def run(self, fastq1, fastq2, index):
        print(time_stamp(),
              'start to align reads of {} by bowtie2.'.format(index),
              flush=True)

        cmd1 = 'bowtie2 -p {0} -x {3}/10_ref/ref -1 {1} -2 {2} -S {3}/20_bam/{4}.sam >> {3}/log/alignment.log '.format(self.args.threads,fastq1,fastq2,self.out,index)
        cmd2 = 'samtools fixmate -O bam {0}/20_bam/{1}.sam {0}/20_bam/{1}_fixmate.bam '.format(self.out,index)
        cmd3 = 'rm {0}/20_bam/{1}.sam '.format(self.out,index)
        cmd4 = 'samtools sort -O bam -m 1G -@ {0} -o {1}/20_bam/{2}_sort.bam {1}/20_bam/{2}_fixmate.bam '.format(self.args.threads,self.out,index)
        cmd5 = 'rm {0}/20_bam/{1}_fixmate.bam '.format(self.out,index)
        cmd6 = 'samtools rmdup -sS {0}/20_bam/{1}_sort.bam {0}/20_bam/{1}_rmdup.bam '.format(self.out,index)
        cmd7 = 'rm {0}/20_bam/{1}_sort.bam '.format(self.out,index)
        cmd8 = 'samtools view -b -f 2 -F 2048 -o {0}/20_bam/{1}.bam {0}/20_bam/{1}_rmdup.bam '.format(self.out,index)
        cmd9 = 'rm {0}/20_bam/{1}_rmdup.bam '.format(self.out,index)

        cmd = [cmd1, cmd2, cmd3, cmd4, cmd5, cmd6, cmd7, cmd8, cmd9]
        for i in cmd:
            tem = clean_cmd(i)
            if tem.startswith("rm"):
                target = tem.split(" ")[1]
                os.remove(target)
                continue
            try:
                sbp.run(tem,
                    stdout=sbp.DEVNULL,
                    stderr=sbp.DEVNULL,
                    shell=True,
                    check=True)
            except sbp.CalledProcessError:
                call_log(self.out, 'alignment', tem)
                sys.exit(1)

        print(time_stamp(),
              'alignment {} successfully finished.'.format(index),
              flush=True)

Replacement mpileup.py:

import sys
import re
import os
import glob
import subprocess as sbp
import multiprocessing
import multiprocessing.pool
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
from mutmap.vcf2index import Vcf2Index
from mutmap.utils import time_stamp, clean_cmd, call_log


class Mpileup(object):
    def __init__(self, args):
        self.out = args.out
        self.args = args

    def get_bams(self, label):
        bams = glob.glob('{}/20_bam/{}*.bam'.format(self.out, label))
        return bams

    def merge(self):
        for label in ['cultivar', 'bulk']:
            bams = self.get_bams(label)
            if len(bams) == 1:
                path_to_bam = os.path.abspath(bams[0])
                cmd1 = 'ln -s {0} {1}/20_bam/{2}.unsorted.bam'.format(path_to_bam,
                                                                   self.out,
                                                                   label)
            else:
                cmd1 = 'samtools merge -f {0}/20_bam/{1}.unsorted.bam \
                                          {0}/20_bam/{1}*.bam \
                                          >> {0}/log/samtools.log \
                                          2>&1'.format(self.out, label)

            cmd2 = 'samtools sort -m {0} \
                                  -@ {1} \
                                  -o {2}/20_bam/{3}.bam \
                                  {2}/20_bam/{3}.unsorted.bam \
                                  >> {2}/log/samtools.log \
                                  2>&1'.format(self.args.mem,
                                               self.args.threads,
                                               self.out,
                                               label)

            cmd3 = 'samtools index {0}/20_bam/{1}.bam \
                                   >> {0}/log/samtools.log \
                                   2>&1'.format(self.out, label)

            cmd4 = 'rm {}/20_bam/{}.*.bam'.format(self.out, label)

            cmd = [cmd1, cmd2, cmd3, cmd4]
            for i in cmd:
                tem = clean_cmd(i)
                if tem.startswith("rm"):
                    target = glob.glob(tem.split(" ")[1])
                    for t in target:
                        os.remove(t)
                    continue
                if tem.startswith("ln"):
                    target = tem.split(" ")
                    os.symlink(target[2], target[3])
                    continue
                try:
                    sbp.run(tem,
                        stdout=sbp.DEVNULL,
                        stderr=sbp.DEVNULL,
                        shell=True,
                        check=True)
                except sbp.CalledProcessError:
                    call_log(self.out, 'samtools', tem)
                    sys.exit(1)

    def get_header(self):
        ref = open(self.args.ref, 'r')
        pattern = re.compile('>')
        chr_names = []
        for line in ref:
            if pattern.match(line):
                line = line.rstrip('\n')
                chr_name = re.split('[> ]',line)[1]
                chr_names.append(chr_name)
        return chr_names

    def mpileup(self, chr_name):
        print("call variants in {} ".format(chr_name))
        cmd1 = 'samtools mpileup -t AD,ADF,ADR -B -q {0} -Q {1} -C {2} -v -u -o {3}/30_vcf/tem_mpileup.{4}.vcf -r {4} -f {5} --ignore-RG {3}/20_bam/cultivar.bam {3}/20_bam/bulk.bam >> {3}/log/bcftools.{4}.log 2>&1 '.format(self.args.min_MQ,self.args.min_BQ,self.args.adjust_MQ,self.out,chr_name,self.args.ref)
        cmd2 = 'bcftools call -vm -f GQ,GP -O z -o {1}/30_vcf/tem_call.{2}.vcf.gz {1}/30_vcf/tem_mpileup.{2}.vcf >> {1}/log/bcftools.{2}.log 2>&1 '.format(self.args.threads,self.out,chr_name) 
        cmd3 = 'bcftools filter -i "INFO/MQ>={3}" -O z -o {1}/30_vcf/mutmap.{2}.vcf.gz {1}/30_vcf/tem_call.{2}.vcf.gz >> {1}/log/bcftools.{2}.log 2>&1 '.format(self.args.threads,self.out, chr_name, self.args.min_MQ) 
        cmd4 = 'tabix -f -p vcf {0}/30_vcf/mutmap.{1}.vcf.gz >> {0}/log/tabix.{1}.log 2>&1 '.format(self.out, chr_name)
        cmd5 = 'rm {0}/30_vcf/tem_mpileup.{1}.vcf'.format(self.out, chr_name)
        cmd6 = 'rm {0}/30_vcf/tem_call.{1}.vcf.gz'.format(self.out, chr_name)

        cmd = [cmd1, cmd2, cmd3, cmd4, cmd5, cmd6]

        for i in cmd:
            tem = clean_cmd(i)
            if tem.startswith("rm"):
                target = glob.glob(tem.split(" ")[1])
                for t in target:
                    os.remove(t)
                continue
            try:
                sbp.run(tem,
                    stdout=sbp.DEVNULL,
                    stderr=sbp.DEVNULL,
                    shell=True,
                    check=True)
            except sbp.CalledProcessError:
                if tem.startswith("tabix"):
                    call_log(self.out, 'tabix.{0}'.format(chr_name), tem)
                if tem.startswith("bcftools") or tem.startswith("samtools"):
                    call_log(self.out, 'bcftools.{0}'.format(chr_name), tem)
                sys.exit(1)

    def concat(self):
        cmd1 = 'cat {0}/log/bcftools.*.log > {0}/log/bcftools.log'.format(self.out)
        cmd2 = 'cat {0}/log/tabix.*.log > {0}/log/tabix.log'.format(self.out)

        cmd3 = 'bcftools concat --threads {1} -O z \
                                -o {0}/30_vcf/mutmap.vcf.gz \
                                {0}/30_vcf/mutmap.*.vcf.gz \
                                >> {0}/log/bcftools.log \
                                2>&1'.format(self.out,self.args.threads)

        cmd4 = 'rm {}/30_vcf/mutmap.*.vcf.gz'.format(self.out)
        cmd5 = 'rm {}/30_vcf/mutmap.*.vcf.gz.tbi'.format(self.out)
        cmd6 = 'rm {}/log/bcftools.*.log'.format(self.out)
        cmd7 = 'rm {}/log/tabix.*.log'.format(self.out)

        cmd = [cmd1, cmd2, cmd3, cmd4, cmd5, cmd6, cmd7]

        for i in cmd:
            tem = clean_cmd(i)
            if tem.startswith("rm"):
                target = glob.glob(tem.split(" ")[1])
                for t in target:
                    os.remove(t)
                continue
            if tem.startswith("cat"):
                target = glob.glob(tem.split(" ")[1])
                outfile = open(tem.split(" ")[3], "a")
                for t in target:
                    for line in open(t):
                        outfile.write(line)
                outfile.close()
                continue
            try:
                sbp.run(tem,
                    stdout=sbp.DEVNULL,
                    stderr=sbp.DEVNULL,
                    shell=True,
                    check=True)
            except sbp.CalledProcessError:
                if tem.startswith("bcftools"):
                    call_log(self.out, 'bcftools', tem)
                sys.exit(1)

    def mkindex(self):
        cmd = 'tabix -f \
                     -p vcf \
                     {0}/30_vcf/mutmap.vcf.gz \
                     >> {0}/log/tabix.log \
                     2>&1'.format(self.out)

        cmd = clean_cmd(cmd)

        try:
            sbp.run(cmd,
                    stdout=sbp.DEVNULL,
                    stderr=sbp.DEVNULL,
                    shell=True,
                    check=True)
        except sbp.CalledProcessError:
            call_log(self.out, 'tabix', cmd)
            sys.exit(1)

    def run(self):
        print(time_stamp(), 'start to merge BAMs.', flush=True)
        self.merge()
        print(time_stamp(), 'merging process successfully finished.', flush=True)

        print(time_stamp(), 'start to call variants.', flush=True)
        chr_names = self.get_header()
#        for chr_name in chr_names:
#            self.mpileup(chr_name)
        threads = self.args.threads
        if self.args.threads>=len(chr_names):
            threads = len(chr_names)
        p = ThreadPool(threads)
        p.map(self.mpileup, chr_names)
        p.close()
        self.concat()
        print(time_stamp(), 'variant calling successfully finished.', flush=True)

        print(time_stamp(), 'start to index VCF.', flush=True)
        self.mkindex()
        print(time_stamp(), 'indexing VCF successfully finished.', flush=True)
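As an aside, `get_header()` above derives the per-chromosome job list by parsing FASTA header lines; the same parsing can be checked standalone (toy FASTA content, names invented):

```python
import re

def chr_names_from_fasta(text):
    """Extract chromosome names: the first token after '>' on each header line."""
    names = []
    for line in text.splitlines():
        if line.startswith('>'):
            names.append(re.split('[> ]', line)[1])
    return names

example = ">chr01 whole genome\nACGT\n>chr02 whole genome\nTTAA\n"
print(chr_names_from_fasta(example))  # ['chr01', 'chr02']
```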

Replacement vcf2index.py

import os
import re
import sys
import gzip
import numpy as np
import pandas as pd
from functools import lru_cache
from mutmap.snpfilt import SnpFilt
from mutmap.smooth import Smooth
from mutmap.utils import time_stamp


class Vcf2Index(object):

    def __init__(self, args):
        self.out = args.out
        self.vcf = args.vcf
        self.snpEff = args.snpEff
        self.species = args.species
        self.N_bulk = args.N_bulk
        self.N_replicates = args.N_rep
        self.min_SNPindex = args.min_SNPindex
        self.snp_index = '{}/snp_index.tsv'.format(self.out)
        self.args = args
        if self.snpEff is not None:
            self.ANN_re = re.compile(';ANN=(.*);*')

        if self.species is None:
            self.p99_index = int(0.99*self.N_replicates) - 1
            self.p95_index = int(0.95*self.N_replicates) - 1
        else:
            k = self.correct_threshold()
            corr_p99 = 0.01/k
            corr_p95 = 0.05/k
            self.p99_index = int((1 - corr_p99)*self.N_replicates) - 1
            self.p95_index = int((1 - corr_p95)*self.N_replicates) - 1

            if int(corr_p99*self.N_replicates) - 1 < 0:
                print(('!!WARNING!! Number of replicates for simulation is not '
                       'enough to consider multiple testing correction. '
                       'Therefore, the highest SNP-index and the second highest '
                       'SNP-index were selected for p99 and p95, respectively.'), 
                       file=sys.stderr)

                self.p99_index = self.N_replicates - 1
                self.p95_index = self.N_replicates - 2

    def correct_threshold(self):
        l = 8.4

        if self.species == 'Arabidopsis':
            k = 5 + 600/l
        elif self.species == 'Cucumber':
            k = 7 + 1390/l
        elif self.species == 'Maize':
            k = 10 + 2060/l
        elif self.species == 'Rapeseed':
            k = 18 + 2520/l
        elif self.species == 'Rice':
            k = 12 + 1530/l
        elif self.species == 'Tobacco':
            k = 12 + 3270/l
        elif self.species == 'Tomato':
            k = 12 + 1470/l
        elif self.species == 'Wheat':
            k = 21 + 3140/l
        elif self.species == 'Yeast':
            k = 16 + 4900/l

        else:
            print('You specified not supported species.', file=sys.stderr)
            sys.exit(1)

        return k

    def get_field(self):
        root, ext = os.path.splitext(self.vcf)
        if ext == '.gz':
            vcf = gzip.open(self.vcf, 'rt')
        else:
            vcf = open(self.vcf, 'r')
        for line in vcf:
            if re.match(r'[^#]', line):
                fields = line.split()[8].split(':')
                try:
                    GT_pos = fields.index('GT')
                except ValueError:
                    sys.stderr.write(('{} No GT field'
                                      ' in your VCF!!\n').format(time_stamp()))
                    sys.exit(1)
                try:
                    AD_pos = fields.index('AD')
                except ValueError:
                    sys.stderr.write(('{} No AD field'
                                      ' in your VCF!!\n').format(time_stamp()))
                    sys.exit(1)

                if 'ADF' in fields and 'ADR' in fields:
                    ADF_pos = fields.index('ADF')
                    ADR_pos = fields.index('ADR')
                else:
                    ADF_pos = None
                    ADR_pos = None
                    sys.stderr.write(('{} no ADF or ADR field'
                                      ' in your VCF.\n').format(time_stamp()))
                    sys.stderr.write(('{} strand bias filter '
                                      'will be skipped.\n').format(time_stamp()))
                break
        vcf.close()
        return GT_pos, AD_pos, ADF_pos, ADR_pos

    def get_variant_impact(self, annotation):
        ANN = self.ANN_re.findall(annotation)[0]
        genes = ANN.split(',')
        impacts = [gene.split('|')[2] for gene in genes]
        if 'HIGH' in impacts:
            impact = 'HIGH'
        elif 'MODERATE' in impacts:
            impact = 'MODERATE'
        elif 'LOW' in impacts:
            impact = 'LOW'
        else:
            impact = 'MODIFIER'
        return impact

    def check_variant_type(self, REF, ALT):
        if (len(REF) == 1) and (len(ALT) == 1):
            variant = 'snp'
        else:
            variant = 'indel'
        return variant

    @lru_cache(maxsize=256)
    def F2_simulation(self, depth):
        GT = [0, 1]
        replicates = []
        n = 0
        while n < self.N_replicates:
            F2_GT = np.random.choice(GT, 2*self.N_bulk, p=[0.5, 0.5])
            mut_AD = np.random.choice(F2_GT, depth, replace=True)
            SNPindex = mut_AD.sum()/depth
            if SNPindex < self.min_SNPindex:
                continue
            else:
                replicates.append(SNPindex)
                n += 1
        replicates.sort()
        p99 = replicates[self.p99_index]
        p95 = replicates[self.p95_index]
        return p99, p95

    def calc_SNPindex(self, field_pos):
        root, ext = os.path.splitext(self.vcf)
        if ext == '.gz':
            vcf = gzip.open(self.vcf, 'rt')
        else:
            vcf = open(self.vcf, 'r')
        snp_index = open(self.snp_index, 'w')

        sf = SnpFilt(self.args)
        for line in vcf:
            if re.match(r'[^#]', line):
                cols = line.split('\t')
                CHR = cols[0]
                POS = cols[1]
                REF = cols[3]
                ALT = cols[4]
                annotation = cols[7]
                GT_pos = field_pos[0]
                AD_pos = field_pos[1]
                ADF_pos = field_pos[2]
                ADR_pos = field_pos[3]

                cultivar_GT = cols[9].split(':')[GT_pos]
                cultivar_AD = cols[9].split(':')[AD_pos]
                bulk_AD = cols[10].split(':')[AD_pos]

                if ADF_pos != None and ADR_pos != None:
                    cultivar_ADF = cols[9].split(':')[ADF_pos]
                    cultivar_ADR = cols[9].split(':')[ADR_pos]
                    ADFR = (cultivar_ADF, cultivar_ADR)
                else:
                    ADFR = None

                record = sf.filt(cultivar_GT, cultivar_AD, bulk_AD, ADFR)
                if record['type'] == 'keep':
                    variant = self.check_variant_type(REF, ALT)

                    p99, p95 = self.F2_simulation(record['bulk_depth'])

                    if self.snpEff is None:
                        snp_index.write(('{}\t{}\t{}\t{}\t'
                                         '{:.4f}\t{:.4f}\t{:.4f}\n').format(CHR,
                                                                            POS,
                                                                            variant,
                                                                            record['bulk_depth'],
                                                                            p99,
                                                                            p95,
                                                                            record['SNPindex']))
                    else:
                        impact = self.get_variant_impact(annotation)
                        snp_index.write(('{}\t{}\t{}\t{}\t{}\t'
                                         '{:.4f}\t{:.4f}\t{:.4f}\n').format(CHR,
                                                                            POS,
                                                                            variant,
                                                                            impact,
                                                                            record['bulk_depth'],
                                                                            p99,
                                                                            p95,
                                                                            record['SNPindex']))

        snp_index.close()
        vcf.close()

    def run(self):
        print(time_stamp(), 'start to calculate SNP-index.', flush=True)
        field_pos = self.get_field()
        self.calc_SNPindex(field_pos)
        print(time_stamp(), 'SNP-index successfully finished.', flush=True)

        print(time_stamp(), 'start to smooth SNP-index.', flush=True)
        sm = Smooth(self.args)
        sm.run()
        print(time_stamp(), 'smoothing process successfully finished.', flush=True)
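The `F2_simulation()` method above derives empirical p95/p99 significance thresholds by resampling reads from a simulated bulk. A simplified standalone version of the same idea (it skips the `min_SNPindex` rejection step, and the bulk size and depth are illustrative only):

```python
import numpy as np

def simulate_thresholds(n_bulk, depth, n_rep=2000, seed=1):
    """Empirical p95/p99 SNP-index thresholds for an F2 bulk (simplified)."""
    rng = np.random.default_rng(seed)
    indices = []
    for _ in range(n_rep):
        # genotypes of the 2*n_bulk chromosomes in the bulk (0/1, p = 0.5 each)
        gt = rng.choice([0, 1], size=2 * n_bulk)
        # sample `depth` reads from those chromosomes with replacement
        reads = rng.choice(gt, size=depth, replace=True)
        indices.append(reads.sum() / depth)
    indices.sort()
    return indices[int(0.95 * n_rep) - 1], indices[int(0.99 * n_rep) - 1]

p95, p99 = simulate_thresholds(n_bulk=20, depth=30)
```

A SNP whose observed SNP-index exceeds these thresholds at its depth is unlikely to arise from random segregation alone.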

Installation and testing
PowerShell commands:

cd MutMap
pip install -e .
cd test
mutmap -o output -t 6 -n 20 -w 100 -s 20 -r mutmap_ref.fasta -c mutmap_cultivar.1.fastq.gz,mutmap_cultivar.2.fastq.gz -b mutmap_bulk.1.fastq.gz,mutmap_bulk.2.fastq.gz


MutMap test result
It also differs a little.

Data download

Download the QTL-seq data (found directly on SRA) and convert it to reads with the SRA Toolkit.

The parents are a mutant and Nipponbare; two extreme-phenotype bulks were separated in the F3 generation, an early-flowering bulk of 37 samples and a late-flowering bulk of 35 samples. The sample descriptions show that each sample was sequenced to more than 30x.
After downloading the .sra files, convert them with the SRA Toolkit.
It is best to put them in an empty directory, then run the following in PowerShell:

cd F:\Mut_QTL
# --split-3 for paired-end data; --gzip writes fastq.gz, which saves space but makes alignment a bit slower
for($x=6; $x -lt 10; $x=$x+1) # x runs from 6 to 9
{
	fastq-dump --split-3 --gzip SRR1241296$x
}


Download the rice reference genome sequence in FASTA format; here I used only the chromosome sequences.
all.chrs.con

QTL-seq

Since my qtlseq is installed in a conda virtual environment:

conda activate ngs # activate the virtual environment
cd F:\Mut_QTL
qtlseq -n1 35 -n2 37 -o output -t 6 -r E:\IRGSP-1.0_representative\referent_fasta\all.chrs.fasta -p SRR12412966_1.fastq.gz,SRR12412966_2.fastq.gz -b1 SRR12412968_1.fastq.gz,SRR12412968_2.fastq.gz -b2 SRR12412969_1.fastq.gz,SRR12412969_2.fastq.gz 

-n1 number of individuals pooled in bulk1 (here sample SRR12412968, 35 individuals)
-n2 number of individuals pooled in bulk2 (here sample SRR12412969, 37 individuals)
-t number of threads to use
--mem memory available to each thread
-o name of the output directory; it must not already exist
-r path to the reference genome sequence
-p one of the parents, usually the one with the target phenotype (here the mutant parent)
-T enable read trimming with Trimmomatic
--trim-params Trimmomatic parameters, e.g. [33,:2:30:10,20,20,4:15,75]; 33 is the quality-score encoding (phred33) and the rest are adapter-removal and trimming parameters
-q filtering parameter for SNP calling: the minimum MQ (mapping quality)
-Q filtering parameter for SNP calling: the minimum BQ (base quality)
-b1 FASTQ or BAM files for bulk1
-b2 FASTQ or BAM files for bulk2
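For reference, the statistic qtlseq computes at each SNP is the SNP-index of each bulk (the fraction of reads carrying the alternative allele) and their difference, the delta SNP-index. A toy calculation with invented allele depths:

```python
def snp_index(ref_depth, alt_depth):
    """Fraction of reads at a site that carry the alternative allele."""
    return alt_depth / (ref_depth + alt_depth)

# Invented allele depths (ref, alt) at one site for each bulk:
bulk1 = (18, 2)   # e.g. early-flowering bulk, mostly reference reads
bulk2 = (3, 17)   # e.g. late-flowering bulk, mostly alternative reads

delta = snp_index(*bulk2) - snp_index(*bulk1)   # delta SNP-index
print(round(delta, 2))  # 0.75
```

A delta SNP-index far from 0 (toward +1 or -1) in a genomic region suggests a QTL where the two bulks inherited different parental alleles.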

结果

Clearly there are two QTLs with opposite effects on the short arm of chromosome 6, but the intervals are too wide: in the tsv file both loci span more than 2 Mb. This is probably related to how the extreme phenotypes were selected; perhaps the top and bottom 20% or so of phenotypes were taken. Of course, fewer samples is not automatically better either; it is a complicated trade-off. (One gripe about the plots this software produces: the chromosomes are in a scrambled order.)
If you only want the SNP-index of indels, want to inspect the results in IGV, or want other output formats, use the qtlplot command, which also lets you adjust the window size, step size, and so on.
The author of the software also gives concrete examples and explains the meaning of the results: QTL-seq User Guide
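The window and step options control a sliding-window average of SNP-index values along each chromosome. This is not the package's actual implementation, just a sketch of the idea with made-up positions and values:

```python
def sliding_window_mean(positions, values, window, step):
    """Average `values` in windows of width `window` bp, advanced by `step` bp."""
    out = []
    start = 0
    last = max(positions)
    while start <= last:
        in_win = [v for p, v in zip(positions, values)
                  if start <= p < start + window]
        if in_win:
            # report the window midpoint and the mean value inside it
            out.append((start + window / 2, sum(in_win) / len(in_win)))
        start += step
    return out

# Toy SNP-index values along one chromosome (positions in bp):
pos = [100, 300, 500, 700, 900]
val = [0.1, 0.3, 0.5, 0.7, 0.9]
smoothed = sliding_window_mean(pos, val, window=400, step=200)
```

Wider windows give smoother curves but blur the QTL boundaries; smaller steps give more points without adding information.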

MutMap

MutMap is used in the same way as above. It needs only two samples: the wild type, and a bulk of extreme-phenotype individuals selected from the F2 of a cross between the mutant and the wild type. Since both bulks here are F3, SRR12412968 is used just to stand in.

conda activate ngs # activate the virtual environment
cd F:\Mut_QTL
mutmap -o output -t 6 -n 35 -r E:\IRGSP-1.0_representative\referent_fasta\all.chrs.fasta -c SRR12412967_1.fastq.gz,SRR12412967_2.fastq.gz -b SRR12412968_1.fastq.gz,SRR12412968_2.fastq.gz

See the MutMap User Guide for details.
-c the wild type
-b the mutant bulk
-n the number of individuals in the mutant bulk

结果

The 0-3 Mb region of chromosome 6 appears to control late flowering.

最后

QTL-seq took 26 hours and MutMap took 12 hours; timings will differ on machines with different thread counts and so on. The intermediate SAM files are very large, so I suggest reserving roughly 100 GB of disk space per 30x sample.

Original source: http://outofmemory.cn/zaji/4019151.html
