Molecule name | GenBank sequence | RefSeq sequence | Unlocalized sequences count | |
---|---|---|---|---|
Chromosome 1 | CM000812.5 | = | NC_010443.5 | 0 |
Chromosome 2 | CM000813.5 | = | NC_010444.4 | 0 |
Chromosome 3 | CM000814.5 | = | NC_010445.4 | 0 |
$ cat ID_list
CM000812.5 NC_010443.5
CM000813.5 NC_010444.4
CM000814.5 NC_010445.4
CM000815.5 NC_010446.5
CM000816.5 NC_010447.5
CM000817.5 NC_010448.4
CM000818.5 NC_010449.5
...
2.2 pig_gene.bed:需要转化的bed文件,标准bed文件前6列格式,以 \t 作为分隔符,bed文件可以自己用awk手动从GFF中提取(推荐)或者利用python模块jcvi等其它方法进行提取。
# 此处利用awk从猪的参考基因组GFF文件中提取对应gene的bed文件
less GCF_000003025.6_Sscrofa11.1_genomic.gff.gz | grep -v "#" | \
awk '$3=="gene" {print$0}' | \
awk 'BEGIN{OFS=FS="\t"}{
split($9,x,";");
for(i in x){
n=split(x[i],y,"=");
if(n==2)INFO[y[1]]= y[2];
}
printf("%s\t%d\t%d\t%s\t%s\t%s\n",$1,$4-1,$5,INFO["Name"],$6,$7);
delete INFO;
}'|less > pig_gene.bed
$ cat pig_gene.bed
...
NC_010443.5 262243224 262258186 NDUFA8 . -
NC_010443.5 262258132 262295132 MORN5 . +
NC_010443.5 262297556 262322777 LHX6 . -
NC_010443.5 262333528 262356280 RBM18 . -
NC_010443.5 262355993 262416116 MRRF . +
...
NC_010460.4 7242377 7279345 KEL . +
NC_010460.4 7279559 7282874 C18H7orf34 . -
NC_010460.4 7285185 7341965 TRPV5 . +
NC_010460.4 7332252 7348383 TRPV6 . +
NC_010460.4 7346792 7419859 EPHB6 . -
...
2.3 Chr_Convert.py:自己写的转换的python脚本,可处理多个不同文件,用python的argpase包简单的包装了下,便于重复使用外部传参,写的比较简洁,还有待改进^-^。
# -*- coding: UTF-8 -*-
# import os
import pandas as pd
import argparse # 1、导入argpase包
def parse_args():
parse = argparse.ArgumentParser(
description='Convert GenBank sequence and RefSeq sequence of chr ID') # 2、创建参数对象
parse.add_argument('-b', '--bed', type=str,
help='bed file need to convert ') # 3、往参数对象添加参数1
parse.add_argument('-l', '--clist', type=str,
help='Chromosome conversion corresponding list') # 3、往参数对象添加参数2
parse.add_argument('-o', '--output', type=str,
help='output file path') # 3、往参数对象添加参数3
args = parse.parse_args() # 4、解析参数对象
return args
# 加载文件
def load_file(file):
data = pd.read_csv(str(file), sep='\t', header=None)
# print(data)
return data
# 保存文件
def save_file(dataframe, file_path):
dataframe.to_csv(file_path, sep='\t' ,index=False,header=None)
print("数据已保存:", file_path)
# 主函数
def main():
if args.clist:
data1 = load_file(args.clist) # 5、使用解析对象.参数获取使用命令行参数
if args.bed:
data2 = load_file(args.bed)
data3 = data2.copy()
ID_dict = dict((data1.iloc[n, 0], data1.iloc[n, 1]) #制作字典
for n in range(len(data1)))
ID_res_dict = {v: k for k, v in ID_dict.items()} #转换键和值,可根据情况调整
for n in range(len(data3)):
if data3.iloc[n, 0] in list(ID_res_dict.keys()):
data3.iloc[n, 0] = ID_res_dict[data3.iloc[n, 0]] #进行转化
if args.output:
save_file(data3, args.output)
if __name__ == '__main__':
args = parse_args()
main()
效果展示:python Chr_Convert.py -h
$ python Chr_Convert.py -h
usage: Chr_Convert.py [-h] [-b BED] [-l CLIST] [-o OUTPUT]
Convert GenBank sequence and RefSeq sequence of chr ID
optional arguments:
-h, --help show this help message and exit
-b BED, --bed BED bed file need to convert
-l CLIST, --clist CLIST
Chromosome conversion corresponding list
-o OUTPUT, --output OUTPUT
output file path
3.测试使用:共设置了三个可选参数:
-b 对应需要转换的bed文件
-l 对应转换的ID list
-o 对应输出文件的名称/路径
$ python Chr_Convert.py -l ID_list -b pig_gene.bed -o pig_gene_fix.bed
...
CM000812.5 513363 540948 ERMARD . -
CM000812.5 541009 552018 TCTE3 . +
CM000812.5 555648 576781 PHF10 . +
CM000812.5 575717 578441 C1H6orf120 . -
...
CM000830.5 87315354 87319580 RIPPLY1 . -
CM000830.5 87352694 87355724 CLDN2 . +
CM000830.5 87369310 87427296 MORC4 . -
CM000830.5 87512886 87575118 RBM41 . -
...
成功转化!自己写的小脚本,作为解决问题和代码练习,略微有些粗糙,其实还可以在此基础上再改进改进,加油研究生!
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)