富集分析（GO、KEGG、GSEA）_Golang

library("clusterProfiler")
library("org.Hs.eg.db")

GO分析与KEGG分析

GO分析需要一个基因 symbol列表，列表中为差异表达基因。

一、读入数据

result<- read.csv(file = "Results/gleason high vs low_DESeq2差异分析/gleason high vs low_result.csv", header=T, row.names=1,check.names=FALSE)

t_index=result$Change %in% c('up','down')
DEG_symbol=rownames(result)[t_index]

这里以gleason high vs low_result.csv为例，得到的DEG_symbol即为所需要的 symbol列表

View(result)

二、symbol转换为entrezid

1 转换

DEG_entrezid = mapIds(x = org.Hs.eg.db,
                       keys = DEG_symbol,
                       keytype = "SYMBOL",
                       column = "ENTREZID") #存在NA

转换后的DEG_entrezid是character vector，其中有NA值

2 去除DEG_entrezid中的NA值

DEG_entrezid=na.omit(DEG_entrezid)

na.omit()函数能去除所有含有NA的行

三、GO分析与KEGG分析

GO_BP = enrichGO(gene = DEG_entrezid,
                 OrgDb = org.Hs.eg.db,
                 keyType = "ENTREZID",
                 ont = "BP",     #'BP','CC','MF'
                 pvalueCutoff = 0.5,
                 qvalueCutoff = 0.5)

KEGG <- enrichKEGG(DEG_entrezid, 
                   organism = 'hsa',  ## hsa为人的简写
                   keyType = 'kegg', 
                   pvalueCutoff = 0.05,
                   pAdjustMethod = 'BH', 
                   minGSSize = 3,
                   maxGSSize = 500,
                   qvalueCutoff = 0.5,
                   use_internal_data = FALSE)

四、可视化

dotplot(GO_BP,
        showCategory=5, #展示5个通路
        title="EnrichmentGO_BP")

barplot(GO_BP,
        showCategory=5,
        title="EnrichmentGO_BP")

GSEA分析

clusterProfiler|GSEA富集分析及可视化 - 云+社区 - 腾讯云

GSEA分析需要一个 gene_list vector，为foldchange（或logFC）的降序排列，同时标注ENTREZID

在Rstudio中显示如下：

library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)

一、读入数据

result<- read.csv(file = "Results/gleason high vs low_DESeq2差异分析/gleason high vs low_result.csv", header=T, row.names=1,check.names=FALSE)

DEG_fc=data.frame('Symbol'=rownames(result),'LogFC'=result[,2],stringsAsFactors=F) #合并rownames和log2Foldchange列

View(DEG_fc)

二、转换ENTREZID，获取gene_list vector

DEG_fc[,'ENTREZID'] = mapIds(x = org.Hs.eg.db,
                      keys = DEG_fc$Symbol,
                      keytype = "SYMBOL",
                      column = "ENTREZID") #存在NA
DEG_fc=na.omit(DEG_fc) #去除NA值

View(DEG_fc)

gene_list=DEG_fc$LogFC                       #提取logFC列
names(gene_list)=DEG_fc$ENTREZID             #加上ENTREZID
gene_list = sort(gene_list, decreasing = T)  #降序排列

三、GSEA分析

gsea_KEGG <- gseKEGG(gene_list,
                     organism = "hsa",
                     keyType = "kegg")

若将gsea_KEGG转化为dataframe格式，如下图：

四、可视化

1 path 为需要展示的pathway id，这里展示的是enrichment score最高的4条通路

t_index=order(gsea_KEGG_d$enrichmentScore,decreasing = T)
path=rownames(gsea_KEGG[t_index,])[1:4] #选择展示的pathway

2 作图

gseaplot2(gsea_KEGG,
          path, 
          subplots = 1:2, #展示前2个图（共3个）
          pvalue_table = T, #显示p值
          title = "Olfactory transduction",  #设置title
          base_size = 10, #字体大小
          #color="red") #线条颜色可选

GO分析

1 GO分析分3个层面

Cellular component，CC 细胞成分Biological process， BP 生物学过程Molecular function，MF 分子功能 Cellular component解释的是基因存在在哪里，在细胞质还是在细胞核？如果存在细胞质那在哪个细胞器上？如果是在线粒体中那是存在线粒体膜上还是在线粒体的基质当中？这些信息都叫Cellular component。Biological process是在说明该基因参与了哪些生物学过程，比如，它参与了rRNA的加工或参与了DNA的复制，这些信息都叫Biological processMolecular function在讲该基因在分子层面的功能是什么？它是催化什么反应的？
So, we will have a gene annotation infarmation.
立足于这三个方面，我们将得到基因的注释信息。

2 结果解读

气泡图（dotplot）

横坐标是GeneRatio，意思是说输入进去的基因，它每个term（纵坐标）占整体基因的百分之多少；圆圈的大小代表gene count的多少，图中给出了最大的圆圈代表11个基因；圆圈的颜色代表P-value；

总体来说，P-value越小gene count圈越大，富集就越可信

柱形图（barplot）

柱子长度代表gene count的多少，图中给出了最长柱子代表11个基因；柱子的颜色代表P-value；

总体来说，P-value越小gene count越大，富集就越可信

KEGG分析

1 KEGG富集分析，即Pathway富集分析

除了对基因本身功能的注释，我们也知道基因会参与人体的各个通路，基于人体通路而形成的数据库就是通路相关的数据库。而KEGG就是通路相关的数据库的一种。

2 结果解读

和GO分析的dotplot、barplot相同

GSEA分析

传统的基因功能富集分析的时候肯定遇到这样的情况，一条富集到的通路中，既有上调的差异表达基因，也有下调的差异表达基因，那么这条通路总体是被抑制还是被激活呢？那么这条通路中的基因表达水平在实验组相比于对照组究竟是上升了呢，还是下降了呢？

GSEA富集分析解决的就是 pathways 被上调或者下调的问题。

KEGG数据库

数据库|最全的KEGG使用教程在这里！

认识 KEGG PATHWAY 数据库_QFIUNE的博客-CSDN博客_pathway数据库

1 简介

KEGG是一个综合数据库，它们大致分为系统信息、基因组信息、化学信息和健康信息四大类。进一步可细分为15个主要的数据库。

2 查询 hsa pathways id 和 description

方法1

http://rest.kegg.jp/list/pathway/hsa

path:hsa00010	Glycolysis / Gluconeogenesis - Homo sapiens (human)
path:hsa00020	Citrate cycle (TCA cycle) - Homo sapiens (human)
path:hsa00030	Pentose phosphate pathway - Homo sapiens (human)
path:hsa00040	Pentose and glucuronate interconversions - Homo sapiens (human)
path:hsa00051	Fructose and mannose metabolism - Homo sapiens (human)
path:hsa00052	Galactose metabolism - Homo sapiens (human)
... ...

方法2

KEGG PATHWAY Database

物种选择hsa，keywords输入description

欢迎分享，转载请注明来源：内存溢出

原文地址: http://outofmemory.cn/langs/994367.html