WGDI软件（三）：两个不同物种共线性点阵图-dotplot_教程

我的上一篇博客介绍了WGDI软件画单个物种的dotplot图，参加 WGDI软件（二）：单个物种共线性点阵图-dotplot ,接下来我带你画两个不同物种的dotplot图。我选的是雷公藤和葡萄基因组，雷公藤在和葡萄发生分化后自己发生过一次WGT事件，文献原文见 Genome of Tripterygium wilfordii and identification of cytochrome P450 involved in triptolide biosynthesis 。

首先准备好雷公藤和葡萄的基因组和gff3文件，并处理得到Vvin.len、Twi.len、Vvin.gff、Twi.gff、Twi.pep.fa、Vvin.pep.fa六个文件（处理方法见上一篇博客）

然后准备雷公藤和葡萄蛋白比对文件Twi.Vvin.blastp

（码瞎1）图1为WGDI画出来的dotplot图，图2为quota软件画出来的dotplot图，两个软件都是基于blastp的结果进行画图

（2）对比可以发现，quota软件展示的结果在WGDI里基本上都能找着，能给得出与文章一致的结论（近期发生过三倍化），并且WGDI能展示更多的信息

画dotplot图的软件很多，我使用过mummer、JCVI还有quota，每个软悉弯件有各自的特点，总的来说，WGDI是最人性化的一款软件，可以通过简单修改配置文件对输入文件进行过滤画图，并能够很方便调整迟陆空图片的性状大小

先将向量写成标量形式然后解方程求出z,然后画图,下面是画图部分代码

[th,r] = meshgrid((0:5:360)*pi/180,0:1:100)

[X,Y] = pol2cart(th,r)

Z = -2.*X - 2.*Y + sqrt(3.*X.^2 + 4.*X.*Y + 3.*Y.^2)

contour3(X,Y,Z,30)

surface(X,Y,Z,'EdgeColor',[.8 .8 .8],'FaceColor','none')

grid off

view(-15,25)

//lecture 6

cd /Victor/stata

use "nei_sample.dta",clear

edit zipcode

//split默认根据空格拆分 stub前缀 prase on these strings根据什么拆分（通过观察）

split facilityname_origin, generate(varnew) parse(,)

split zipcode,generate(zipnew) parse(-)//在2894行有的知手洞没有破折号需要提取前五位

//按照某种符号拆分字符串

edit zipcode

help substr

//截取

gen zip5=substr(zipcode,1,5)

//生成zip5，表示截取zipcode的前五位从第一位截取五位

edit zipcode zip5 if length(zip5) ~=5

//展示长度不等于5的zip5和zipcode

edit zip5

gen len_cn = ustrlen(zipcode)

//生成中文字符串长度

edit fips

gen fips2 = substr(fips, 1,2)

edit fips2

gen fips3 = substr(fips, 3,3)

edit fips2 fips3

destring fips2, replace force

destring fips3, replace force

//字符变数值

tostring fips2 fips3, replace force

//数值变字符

edit fips2 fips3

replace fips2="0"+fips2 if length(fips2)==1

replace fips3="0"+fips3 if length(fips3)==2

replace fips3="00"+fips3 if length(fips3)==1

//前面用零补齐，补成五位

help duplicates

//重复观测值

sort newid

duplicates report newid year

//报告重复观测值

//copies代表这个数据一共有多少个 =1就代表没有重复第4541只有一个观测值newid

duplicates tag newid year, gen(dup)

//标注重复观测值

tab dup

//展示

edit new year if dup>=177

duplicates drop newid year, force

//去掉重复样本//两个都一样才丢掉

duplicates report newid year

ssc install unique

//安装unique

unique newid year

//展示有几个是唯一的

unique fips

use nei_sample.dta, clear

help collapse

//压缩

collapse (sum) so2 co nox nh3 voc (first) facilityname_origin fips zipcode , by(newid year)

//根据newid year重复的字符串变量（first）后面的三个只取第一个数据数值变量so2等等。。加总（sum）没涉及的变量就丢掉了

duplicates report newid year

collapse (sum) so2 co nox nh3 voc (count) newid, by(fips year)

//关于fips year 加总。数出newid（在fips year全都相同的情况下有几个newid（企业））

//每个地区每一年污染物的多少，企业有薯段多少

gen id = newid

//replace

//改变面板数据的结构

use nei_sample.dta, clear

help reshape

keep newid year so2

duplicates drop newid year, force

reshape wide so2 , i(newid) j(year)

reshape long so2 co nox voc nh3, i(newid) j(time)

//将宽表和长表相互搭枯转换

keep newid year co

reshape wide co,i(newid) j(year)

duplicates drop newid year,force

reshape wide co,i(newid) j(year)

reshape long co,i(newid) j(year)//观测值变成了999*12，转换两次之后，数据变成

*balanced data（平衡面板数据）了也是为了便于做可视化分析，计量分析

//lecture 7

cd /Victor/stata

use "nei_sample.dta",clear

keep newid year so2

//保留这三个

help reshape

//数据重排

duplicates drop newid year, force

reshape wide so2 , i(newid) j(year)

//不同问题下i不同这里的i是企业 j是时间

reshape long so2 co nox voc nh3, i(newid) j(time)

//reshape//long wide lecture7

use "nei_sample.dta",clear

keep newid year so2 co nox voc nh3 sic

duplicates drop newid year, force

reshape wide so2 co nox voc nh3, i(newid sic) j(year)

keep newid year so2 co nox voc nh3 sic

reshape wide so2 co nox voc nh3, i(newid sic) j(year)

//数据变少了是因为有的newid对应多个sic

reshape long so2 co nox voc nh3, i(newid sic) j(year)

//通过这种方式将它强行变成平衡面板先wide 后long（意义重大）

use nei_sample,clear

keep so2 co nox voc nh3 newid year

duplicates drop newid year,force

reshape wide so2 co nox voc nh3,i(newid) j(year)

reshape long so2 co nox voc nh3,i(newid) j(time)//三千多个变成了一万多个

*reshape之后每一个企业都在每一年1990——2011有观测值，强行将数据变为balanced

ren (so2 co nh3 nox voc) (pol1 pol2 pol3 pol4 pol5)

//更改变量名为了保证前缀都一样才能转换

*sample

rename so2 pu1

rename co pu2

rename nox pu3

rename voc pu4

rename nh3 pu5

reshape long pu,i(newid time) j(type)

tostring type,replace

replace type="so2" if type=="1"

replace type="co" if type=="2"

replace type="nox" if type=="3"

replace type="voc" if type=="4"

replace type="nh3" if type=="5"

keep newid year pol1 pol2 pol3 pol4 pol5

reshape long pol, i(newid year) j(type)

//??? 没有drop

tostring type, replace force

//???

replace type = "so2" if type == 1

//替代污染物名称

use "nei_sample.dta",clear

duplicates drop newid year, force

//去掉重复值

edit newid year so2

sort newid year

by newid: gen l1so2 = so2[_n-1]

//so2[1] so2[_N] n-1代表上一行的观测值通过企业来分每个n对于企业来说是不一样的

by newid: gen l2so2 = so2[_n-2]

//上两行

by newid: gen l0so2 = so2[_n]

by newid: gen f1so2 = so2[_n+1]

//滞后一期

bys newid: gen Nso2 = so2[_N]

//展示这个企业最后一年的数据

bys newid: gen n1so2 = so2[1]

//有时需要保证它是一个平衡面板：可利用以下命令

xtset newid year

//set panel variable 让他成为面板数据如果不告诉它它永远按上一行处理

gen lso2 = l.so2

//l.代表上一期的滞后变量（上一年）这个和上一行的数据不一样喔有时可能上一行不是上一年就没有上一期了

use "nei_sample.dta",clear

duplicates drop newid year, force

edit fips year newid

sort fips year newid

by fips year: egen id_sum = count(newid)

//通过fips year来分如果两个都相同就算一次

edit fips year newid so2

by fips year: egen so2_fips = sum(so2)

//missing values

//得到地区层面的数据用于变量的构造通过微观数据做加总数据又保留微观数据本身

use "nei_sample.dta",clear

help collapse

collapse (sum) so2 co nox nh3 voc (first) facilityname_origin fips zipcode , by(newid year)

//构造更高层面的行业数据微观数据全部损失了. 加总相同年份的污染量，（first）后面的是只保留第一行

duplicates report newid year

collapse (sum) so2 co nox nh3 voc (count) newid, by(fips year)

gen id = newid

//replace

//collapse by 2_digit sic and fips_stata (2_dight fips), and year,

use "nei_sample.dta",clear

gen fips3 = substr(fips,1,2)

gen sic2 = substr(sic,1,2)

collapse (sum) so2 co nox nh3 voc ,by( fips3 sic2 year)

//lecture 8

//图形的组成

sysuse uslifeexp2

decribe

scatter le year

//第一个是y 第二个是x轴

//connect(l) 表示以直线的方式连接相邻的两个点

//msymbol(i) 表示散点的显示方式为“看不见”

scatter le year, connect(l)

scatter le year, connect(l) msymbol(i)

scatter le year, connect(l) msymbol(smdiamond)

//散点形状改为棱形

scatter le year, connect(l) msymbol(smdiamond) mcolor(lime)

//标记间连线的方式，标记本身的形状，标记的颜色

help marker_options

//标记标签的选择

graph query symbolstyle

help marker_label_options

sysuse lifeexp.dta, clear

describe

list country lexp gnppc if region == 2

scatter lexp gnppc if region == 2, mlabel(country)

scatter lexp gnppc if region == 2, mlabel(country) mlabpos(9)

//将标签调整到九点钟方向这样美国就可以显示出来了

//下面尝试利用 mlabvposition(varname) 选项为某些特殊选项的观测值设定标签的位置为了单独为美国和洪都拉斯设定标签显示方向，

//需要生成一个指标方向的变量，命名为破碎，然后利用这个变量对每个案例的不同附值来调整各个散点的标签位置

generate pos = 3

//所有国家都是3

replace pos = 12 if country == "Honduras"

replace pos = 9 if country == "United States"

scatter lexp gnppc if region == 2, mlabel(country) mlabv(pos)

//下面尝试利用改变坐标轴的覆盖范围来设定标签

//方法一：利用xscale(range())指定作图的区域

scatter lexp gnppc if region ==2,mlabel(country) mlabv(pos) xscale(range(-500 3500))

//方法二：利用plotregion（ margin())来解决作图区域的微小变动

scatter lexp gnppc if region == 2,mlabel(country) mlabv(pos) plotregion(margin(l+9))

sysuse autornd, clear

descrbe

scatter mpg weight

scatter mpg weight, jitter(7)

//由于数据点太密集了，产生重叠，需要将数据点轻微地挪动位置，jitter（#）震荡选项

//二维绘图选项，help twoway

//标题选项，坐标，图例，增加线，by

sysuse lifeexp.dta, clear

scatter lexp gnppc

gen log_gnppc = log(gnppc)

//对数化，更线性

scatter lexp log_gnppc

//另一种方法：

scatter lexp gnppc, xscale(log)

//做散点图，并对比y轴刻度使用正常尺度与逆向尺度的异同

sysuse auto.dta, clear

scatter mpg weight

scatter mpg weight, yscale(rev)

//车重与油耗正相关

//下面绘制完全没有任何坐标的散点图和有坐标刻度但没有坐标线的散点图

scatter mpg weight, yscale(off)

//不要y轴

scatter mpg weight, yscale(noline) xscale(noline)

//去掉了坐标线，保留刻度

help axis_label_options

sysuse auto, clear

describe

sum

//下面分别绘值mpg、weight的标有大约坐标轴上5个10个刻度标识的mpg和weight散点图

scatter mpg weight

scatter mpg weight, ylabel(#5) xlabel(#5)

scatter mpg weight, ylabel(#10) xlabel(#10)

scatter mpg weight, ylabel(10(5)45) xlabel(1500 1970 2500(1000)4500)

//自定义规则

scatter mpg weight, ytick(#10) xtick(#15)

//绘制x轴大约有15个刻度，y大约10个刻度

scatter mpg weight, ymlabel(##5) xmtick(##10)

//把小刻度的标识也标上去，x轴主刻度之间有10个小刻度

scatter mpg weight , ymlabel(##5) xmlabel(##10)

//时间序列散点图时的轴线刻度标识问题

sysuse uslifeexp, clear

scatter le year, c(l)

scatter le year, c(l) xlabel(#10,grid)

//网格,用线连

scatter le year, c(l) xlabel(1900(10)2000,grid)

scatter le year, c(l) xlabel(1900 1918 1936 1950(20)2000,nogrid)

scatter le year, lcolor(yellow) c(l) xlabel(1900 1918 1936 1950(20)2000,nogrid)

line le year, lcolor(navy) lpattern(dot)

sysuse uslifeexp, clear

scatter le year, c(l) xlabel(1990 1918 1940(20)2000,grid) legend(on)

label var le_male "男人，人均寿命"

scatter le_male le_female year, legend(label(1 "Male") label(2 "Female"))

//绘制散点图并添加图例，将图例分别改为”male”和”female“

scatter le_male le_female year, c(l) legend(on)

//legend是图例

欢迎分享，转载请注明来源：内存溢出

原文地址: http://outofmemory.cn/tougao/12218995.html

WGDI软件（三）：两个不同物种共线性点阵图-dotplot

发表评论

评论列表（0条）