WGCNA Explained: Everything You Need to Know
Multi-omics data analysis blog series
- How to understand the WGCNA analysis in publications? (1/2)
- Understanding WGCNA Analysis in Publications
- Harnessing the Power of WGCNA Analysis in Multi-Omics Data
01 What is WGCNA?
WGCNA, short for Weighted Gene Co-expression Network Analysis (also called weighted correlation network analysis), is a commonly used method for analyzing gene co-expression networks. It is usually run through the WGCNA R package and is applied to examine correlation structures in high-dimensional datasets, such as gene expression and proteomics data.
02 When is WGCNA used?
WGCNA is applied to transcriptome datasets with many samples, particularly in studies of developmental regulation across different organs, tissues, and stages. It is also used to investigate response mechanisms to biotic and abiotic stresses at various time points. When networks are built under different conditions, differential network analysis can identify changes in connectivity patterns between them.
03 How to interpret WGCNA results?
3.1 Identifying gene co-expression network sets
Based on the pairwise correlation of gene expression across all samples, genes with similar expression patterns are grouped into modules. This condenses thousands of differentially expressed genes into a manageable number of modules, typically a dozen or more. Using transcriptome data from tomato fruit development research as an example, WGCNA identified 12 gene modules. The representation commonly used in the literature is depicted in the figure below: the upper part is a hierarchical clustering dendrogram in which each leaf (terminal branch) represents a gene, and the colored bar below it marks the module to which each gene is assigned.
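The grouping idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene names, expression values, the soft-thresholding power `beta=6`, and the `cutoff` are hypothetical, and the connected-components grouping is a crude stand-in for the topological overlap matrix, hierarchical clustering, and tree cutting that the WGCNA R package performs.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def adjacency(expr, beta=6):
    """Unsigned weighted adjacency: a_ij = |cor(x_i, x_j)| ** beta."""
    genes = list(expr)
    return {
        (g, h): abs(pearson(expr[g], expr[h])) ** beta
        for g in genes for h in genes if g != h
    }

def group_genes(expr, beta=6, cutoff=0.5):
    """Stand-in for module detection (NOT the WGCNA algorithm): connected
    components of the graph that keeps edges with adjacency >= cutoff."""
    adj = adjacency(expr, beta)
    unassigned, modules = set(expr), []
    while unassigned:
        seed = unassigned.pop()
        module, frontier = {seed}, [seed]
        while frontier:
            g = frontier.pop()
            linked = {h for h in unassigned if adj[(g, h)] >= cutoff}
            unassigned -= linked
            module |= linked
            frontier.extend(linked)
        modules.append(sorted(module))
    return sorted(modules)

# Hypothetical toy data: gene -> expression across five samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
    "geneD": [9.8, 2.1, 8.1, 4.2, 5.9],
}
modules = group_genes(expr)
print(modules)
```

Raising the absolute correlation to a power keeps strong correlations close to 1 while pushing weak ones toward 0, which is what makes the network "weighted" rather than a hard-thresholded graph.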
3.2 Filtering key modules for functional enrichment analysis
Method 1: Filtering based on module eigengene expression patterns.
After identifying modules, WGCNA calculates a module eigengene (ME) for each module. The eigengene is the first principal component of the module's expression matrix and summarizes the expression pattern of all genes in the module. Comparing eigengene values across samples helps filter modules closely related to particular samples. For instance, in tomatoes, the "brown" module shows high (positive) eigengene values in samples from the first period, making it a key module for subsequent analysis. Computing eigengenes requires the gene expression data as input.
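As a rough sketch of how such a per-module summary profile can be obtained, the snippet below computes the first principal component across samples with power iteration, using only the standard library. The expression values are hypothetical, and the sign convention (orienting the component along the average profile) is an assumption of this sketch rather than a statement about the R package's internals.

```python
from math import sqrt

def standardize(profile):
    """Scale one gene's profile to mean 0 and unit sample variance."""
    n = len(profile)
    mean = sum(profile) / n
    sd = sqrt(sum((v - mean) ** 2 for v in profile) / (n - 1))
    return [(v - mean) / sd for v in profile]

def module_eigengene(profiles, iterations=200):
    """First principal component across samples, found by power iteration."""
    z = [standardize(p) for p in profiles]
    n = len(z[0])
    # Sample-by-sample matrix S = Z^T Z; its leading eigenvector is the
    # summary profile we are after.
    s = [[sum(g[i] * g[j] for g in z) for j in range(n)] for i in range(n)]
    # Arbitrary start vector; it must not be orthogonal to the leading
    # eigenvector, which holds for the toy data below.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(s[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Orient the result so it rises and falls with the average profile.
    avg = [sum(g[i] for g in z) / len(z) for i in range(n)]
    if sum(a * b for a, b in zip(avg, v)) < 0:
        v = [-x for x in v]
    return v

# Hypothetical module: three genes that are high in the first two samples.
profiles = [
    [9.0, 8.0, 3.0, 2.0, 1.0, 1.0],
    [7.0, 7.0, 2.0, 1.0, 1.0, 0.5],
    [10.0, 9.0, 4.0, 3.0, 2.0, 2.0],
]
eigengene = module_eigengene(profiles)
print([round(x, 3) for x in eigengene])
```

Positive values mark samples where the module as a whole is highly expressed, which is how a module such as "brown" can be tied to a particular period.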
Method 2: Filtering through module-sample or module-phenotype correlation analysis. Calculating the correlation coefficient between each module and the sample or phenotype data identifies modules highly correlated with specific samples or phenotypes. In the tomato data, a specific correlation between JS3 and the "pink" module suggests that this module deserves special attention. If phenotype measurements are available for tomato fruit development, such as lycopene content, the modules with the highest correlation to lycopene content can also be selected.
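A minimal sketch of the module-phenotype step, assuming each module is already summarized by one value per sample: correlate every module's profile with the trait and rank modules by absolute correlation. The module profiles and lycopene values below are made up for illustration, and no significance test is included.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def module_trait_correlations(module_profiles, trait):
    """Correlate each module's per-sample summary with a measured trait."""
    return {name: pearson(profile, trait)
            for name, profile in module_profiles.items()}

# Hypothetical per-sample module summaries and trait measurements.
module_profiles = {
    "pink": [-0.5, -0.3, 0.1, 0.3, 0.4],
    "brown": [0.6, -0.4, 0.2, -0.5, 0.1],
}
lycopene = [1.0, 2.5, 6.0, 9.0, 11.0]

cors = module_trait_correlations(module_profiles, lycopene)
best = max(cors, key=lambda name: abs(cors[name]))
print(best, round(cors[best], 3))
```

Ranking by absolute value keeps strongly negative relationships in view, since a module that falls as the trait rises can be as informative as one that rises with it.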
Method 3: Filtering through module gene function enrichment. Conducting functional enrichment analysis, such as Gene Ontology (GO) enrichment, for each module helps identify modules corresponding to biological processes related to the traits of interest. For example, in tomato fruit development, processes like carotenoid metabolism and ethylene signaling are relevant to fruit ripening, prompting a focus on modules enriched in the corresponding GO terms. Separately, the connectivity distribution of the entire network can be plotted, with the y-axis showing the logarithm of the frequency of each connectivity value; this plot is commonly used to check whether the network is approximately scale-free.
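Enrichment of a term in a module is commonly scored with a one-sided hypergeometric test. The sketch below shows that calculation with the standard library only; the gene counts are hypothetical, and real analyses would normally use a dedicated enrichment tool and correct the p-values for testing many terms.

```python
from math import comb

def enrichment_p_value(background, annotated, module_size, overlap):
    """P(X >= overlap) when module_size genes are drawn without replacement
    from `background` genes, of which `annotated` carry the term."""
    upper = min(annotated, module_size)
    total = comb(background, module_size)
    tail = sum(
        comb(annotated, k) * comb(background - annotated, module_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20000 background genes, 100 annotated to a term,
# a module of 200 genes, 8 of which carry the term (about 1 expected).
p = enrichment_p_value(20000, 100, 200, 8)
print(p)
```

A small p-value means the module contains more genes with that term than random sampling from the background would plausibly produce.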
Method 4: Filtering modules through target gene selection. Based on research objectives, previous findings, and published literature, modules containing target genes of interest can be selected directly for further analysis. In tomato fruit development, key pectin-degradation genes such as PG2A and PL1 are found in the "yellow" module, making it a candidate for further investigation.
3.3 Identifying key genes
After filtering down to candidate modules through the analyses above, the next step is to examine their internal composition and identify the key genes within them, often referred to as hub genes. This can be done by analyzing intra-modular gene connectivity, using topological overlap (TOM) values, eigengene-based connectivity (kME), or intramodular connectivity (kIM), and selecting the genes with the highest connectivity in the network. Attention can also be directed towards genes with regulatory functions, such as transcription factors, as they generally act as upstream regulators in the module's regulatory network. RNA-Seq datasets from the Gene Expression Omnibus (GEO) are a valuable resource for such transcriptomics research, covering a wide range of species and biological sample groups.
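Ranking genes by intramodular connectivity can be sketched as follows. Here kIM is taken as the sum of a gene's weighted adjacencies to the other genes in the same module; the power `beta=6`, the gene names, and the expression values are illustrative assumptions. A kME-style ranking would instead correlate each gene with the module eigengene.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def intramodular_connectivity(expr, beta=6):
    """kIM per gene: sum of |cor| ** beta to every other gene in the module."""
    return {
        g: sum(abs(pearson(expr[g], expr[h])) ** beta for h in expr if h != g)
        for g in expr
    }

def top_hubs(expr, n=1, beta=6):
    """Return the n genes with the highest intramodular connectivity."""
    kim = intramodular_connectivity(expr, beta)
    return sorted(kim, key=kim.get, reverse=True)[:n]

# Hypothetical module: geneA tracks every other gene closely, while the
# remaining genes agree less well with one another.
module_expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "geneB": [1.0, 2.0, 3.0, 4.0, 6.0, 5.0],
    "geneC": [2.0, 1.0, 3.0, 4.0, 5.0, 6.0],
    "geneD": [1.0, 3.0, 2.0, 5.0, 4.0, 6.0],
}
kim = intramodular_connectivity(module_expr)
hubs = top_hubs(module_expr, n=1)
print(hubs, {g: round(k, 3) for g, k in kim.items()})
```

Genes at the top of this ranking are hub candidates; cross-checking them against known transcription factors is one way to prioritize them further.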
MetwareBio has extensive experience in transcriptomics and metabolomics analysis, and our lab in Boston can provide services for transcriptomics, metabolomics, and multi-omics correlation analysis.