Metware Biotechnology Co., Ltd.

How to understand the WGCNA analysis in publications? (1/2)

WGCNA is familiar to many of us, especially from journal articles that combine transcriptomics and metabolomics, where it serves as a useful data-mining tool. So what is WGCNA? How should WGCNA results be interpreted? What role does WGCNA play in association studies? How can WGCNA results be used to identify candidate genes? Let's discuss these questions.

I. What is WGCNA?

WGCNA is short for weighted gene co-expression network analysis. It aims to find co-expressed gene modules, to explore the association between the gene network and the phenotype of interest, and to find the core genes in the network. WGCNA is suitable for analyzing complex data patterns from a large number of samples. The official WGCNA website recommends at least 15 samples for analysis [1]. The general research directions to which it applies include: 1. developmental regulation of different organs or tissue types, 2. different developmental regulation in the same tissue, 3. different responses to abiotic stresses at different time points, and 4. different responses after pathogenic bacterial infection at different time points.

II. Principle of WGCNA analysis

WGCNA is an algorithm for mining gene modules from transcriptome data, and it is divided into two parts: expression clustering analysis and phenotype association. If a group of genes shows a similar expression pattern across different developmental stages or different time points of stress treatment, these genes can be grouped into a module and are considered to be functionally related.

WGCNA is carried out in four steps: 1. calculation of correlation coefficients between genes, 2. identification of gene modules, 3. association of modules with traits, and 4. extraction of key candidate genes.

1. Gene correlation coefficient calculation

The first step of WGCNA is to calculate the correlation coefficient (Pearson coefficient) between every pair of genes. To judge whether two genes have similar expression patterns, a screening threshold is generally set, and gene pairs above it are considered similar. Traditionally, the degree of association between two genes is described by computing the Pearson, Spearman, or another correlation coefficient and comparing it with that threshold. To construct an association network, a cutoff such as a correlation coefficient greater than 0.8 is usually specified as the criterion for a strong association between two genes. However, the disadvantage of a fixed-threshold method is that the threshold is defined artificially and ignores many potential associations. This one-size-fits-all approach loses information about the trend of gene expression changes and makes it difficult to distinguish degrees of correlation strength within the network.

To address these problems, the idea of "weighting" was proposed: the correlation coefficients between gene expression values are raised to the power of β, so that the connectivity of genes in the network follows a scale-free distribution. As a direct result, the differences in the strength of correlations between genes are magnified. This makes strong and weak relationships more distinct, which facilitates the subsequent identification of clusters (modules).
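The contrast between a fixed cutoff and weighting can be illustrated with a minimal Python sketch. It assumes an unsigned network in which the weight is the absolute Pearson correlation raised to the power β. The gene names, the toy expression values, and the choice β = 6 are hypothetical and serve only as an illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def hard_adjacency(r, cutoff=0.8):
    """Fixed-threshold network: an edge exists only if |r| exceeds the cutoff."""
    return 1 if abs(r) > cutoff else 0

def soft_adjacency(r, beta=6):
    """Weighted network: every pair keeps an edge of weight |r| ** beta."""
    return abs(r) ** beta

# Hypothetical expression values for three genes across five samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],  # closely tracks geneA
    "geneC": [3.0, 1.0, 2.0, 5.0, 4.0],  # loosely tracks geneA (r = 0.6)
}

r_ab = pearson(expression["geneA"], expression["geneB"])
r_ac = pearson(expression["geneA"], expression["geneC"])

for label, r in (("geneA-geneB", r_ab), ("geneA-geneC", r_ac)):
    print(label, round(r, 3), hard_adjacency(r), round(soft_adjacency(r), 3))
```

With the fixed cutoff, the geneA-geneC pair is dropped entirely. With weighting, it keeps a small non-zero weight (about 0.047), while the geneA-geneB weight stays close to 1 (about 0.99), so the gap between strong and weak relationships widens without any association being discarded.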

Determining the suitability of β:

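In a scale-free network, log(k), where k is the connectivity of a gene, and log(p(k)), where p(k) is the proportion of genes with that connectivity, are negatively correlated. Generally, the β value at which the squared correlation coefficient (R²) of this negative relationship reaches 0.8 is taken as the appropriate β. The parameter β takes values from 1 to 30 by default.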
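The selection of β can be sketched in Python as follows. This is a simplified illustration and not the WGCNA package's own implementation. It assumes that the criterion is the R² of a straight-line fit of log10(p(k)) against log10(k), that connectivities are grouped into equal-width bins, and that the smallest β reaching the target is chosen. All function names here are hypothetical.

```python
import math

def linear_fit_r2(xs, ys):
    """Least-squares slope and R^2 of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # no variation, so no meaningful fit
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

def connectivity(corr, beta):
    """k_i = sum over all other genes j of |r_ij| ** beta."""
    n = len(corr)
    return [
        sum(abs(corr[i][j]) ** beta for j in range(n) if j != i)
        for i in range(n)
    ]

def scale_free_fit(k_values, n_bins=10):
    """Bin the connectivities, then fit log10 p(k) against log10 k."""
    lo, hi = min(k_values), max(k_values)
    width = (hi - lo) / n_bins
    if width == 0:
        return 0.0, 0.0
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for k in k_values:
        idx = min(int((k - lo) / width), n_bins - 1)
        counts[idx] += 1
        sums[idx] += k
    log_k, log_p = [], []
    for count, total in zip(counts, sums):
        if count > 0 and total > 0:  # skip empty bins
            log_k.append(math.log10(total / count))          # mean k in the bin
            log_p.append(math.log10(count / len(k_values)))  # share of genes
    if len(log_k) < 3:
        return 0.0, 0.0  # too few points for a meaningful fit
    return linear_fit_r2(log_k, log_p)

def pick_beta(corr, betas=range(1, 31), target_r2=0.8):
    """Smallest beta with a negative slope and R^2 >= target, else None."""
    for beta in betas:
        slope, r2 = scale_free_fit(connectivity(corr, beta))
        if slope < 0 and r2 >= target_r2:
            return beta
    return None
```

Here `corr` is expected to be a square gene-by-gene correlation matrix given as a list of lists. A meaningful fit needs far more genes than a toy example can offer, so for very small inputs `pick_beta` simply returns `None`.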