KEGG Pathway Analysis for Beginners: The Complete Guide

MetwareBio data analysis blog series


KEGG pathway annotation and enrichment analysis come up routinely in transcriptome, proteome, metabolome, and microbiome studies, and they have become some of the most commonly presented analyses in high-throughput sequencing, proteomics, and metabolomics. This article explains how to use the KEGG database effectively to improve the quality and efficiency of scientific data interpretation.


1. What is the KEGG database?

The KEGG database, short for Kyoto Encyclopedia of Genes and Genomes, was established by the Kanehisa laboratory in 1995. It has since grown into a comprehensive resource that is roughly divided into four categories: systems information, genomic information, chemical information, and health information. These are further subdivided into 15 major databases, of which KEGG PATHWAY and KEGG ORTHOLOGY are the core. KEGG PATHWAY is the most important and most commonly used database in KEGG. It is a large collection of pathway maps drawn manually by researchers on the basis of the published literature. Its pathways are grouped into seven categories: Metabolism, Genetic Information Processing, Environmental Information Processing, Cellular Processes, Organismal Systems, Human Diseases, and Drug Development.


2. Exploring the KEGG Website: Tips and Tricks

The KEGG homepage is divided into four areas, including a search box at the top, descriptions of the different modules on the left, and an introduction to the database with all of its sub-links at the bottom.


In KEGG, the two most commonly used links are "KEGG PATHWAY" and "KEGG COMPOUND". Next, let's take a closer look at these two sections.




3. Overview of KEGG Pathways: Decoding the Biological Roadmaps

Clicking "KEGG PATHWAY" on the homepage opens the KEGG PATHWAY page. Below the search box, this page gives a detailed description of the pathway classification, which comprises seven categories: ① Metabolism, ② Genetic Information Processing, ③ Environmental Information Processing, ④ Cellular Processes, ⑤ Organismal Systems, ⑥ Human Diseases, and ⑦ Drug Development.


In metabolomics or multi-omics research, the most commonly used category is Metabolism. Its pathways cover the metabolites themselves and the genes encoding the enzymes that act on them.




To retrieve metabolic pathways of interest from KEGG PATHWAY, it helps to understand how KEGG names its pathways. Each pathway identifier consists of a prefix of 2-4 letters followed by 5 digits. The specific encoding is shown in the following table.



In metabolomics or multi-omics studies, five entry types are used most often: pathway (prefix map, or an organism code such as hsa), ko (prefix K), gene (prefixes vg/vp/ag), compound (prefix C), and enzyme. The pathway and ko types are two forms of a pathway. Since these two are the most frequently used, we will focus on them.
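
The naming rule described above can be made concrete with a small Python sketch. It assumes only the patterns stated in the text: a pathway identifier is a prefix of 2-4 lowercase letters plus five digits, and KO and compound entries are K or C plus five digits. The function name `classify_kegg_id` and the labels it returns are illustrative choices, not part of KEGG.

```python
import re

# Patterns follow the naming rule described in the text (assumed, not exhaustive).
PATHWAY_RE = re.compile(r"^([a-z]{2,4})(\d{5})$")  # e.g. map00010, hsa04110
ENTRY_RE = re.compile(r"^([KC])(\d{5})$")          # e.g. K00844, C00031


def classify_kegg_id(identifier: str) -> tuple[str, str, str]:
    """Return (kind, prefix, number) for a KEGG-style identifier."""
    match = PATHWAY_RE.match(identifier)
    if match:
        prefix, number = match.groups()
        if prefix == "map":
            kind = "reference pathway"
        elif prefix == "ko":
            kind = "reference pathway (KO view)"
        else:
            # Treated here as an organism code such as hsa (an assumption).
            kind = "organism-specific pathway"
        return kind, prefix, number

    match = ENTRY_RE.match(identifier)
    if match:
        prefix, number = match.groups()
        kind = "KO entry" if prefix == "K" else "compound"
        return kind, prefix, number

    raise ValueError(f"not a recognized KEGG-style identifier: {identifier!r}")


if __name__ == "__main__":
    for example in ("map00010", "hsa04110", "K00844", "C00031"):
        print(example, "->", classify_kegg_id(example))
```

A check like this is handy for validating identifier columns in a results table before looking them up on the KEGG website.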


Pages for entries with the 'map' prefix mainly include seven modules: 'name', 'pathway description', 'pathway classification', 'pathway map link', 'module', 'other database link', and 'related article link'.


Among these modules, the one used most is the Pathway map. Clicking the corresponding blue link opens the pathway map. In the map, boxes represent enzymes, and clicking a box shows information on the genes that encode the enzyme. Circles represent metabolites, and clicking a circle shows information on the metabolite and its related genes.




The gene information for an enzyme mainly includes the following: the KEGG gene number (K), the gene symbol (Symbol), the gene name (Name), the pathways the gene belongs to (Pathway), modules (Module), the functional hierarchy (Brite), links to other databases (Other DBs), the corresponding and homologous genes identified so far in various species (Genes), and links to related articles (Reference, Authors, Title, Journal).




A metabolite entry contains the following main information: compound number (C), name (Name), molecular formula (Formula), exact mass (Exact mass), molecular weight (Mol weight), molecular structure (Structure), reactions involving the compound (Reaction), metabolic pathways (Pathway), enzymes (Enzyme), functional hierarchy (Brite), and links or identifiers in other databases (Other DBs).
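
The same fields are also available in KEGG's plain-text flat-file form, where the field name occupies the first 12 columns, continuation lines leave those columns blank, and a record ends with `///`. The sketch below parses that layout. The `SAMPLE` record is an abbreviated, hand-typed illustration in the style of a KEGG COMPOUND entry, so consult the live entry for authoritative values. The function name `parse_kegg_flat` is our own.

```python
def parse_kegg_flat(text: str) -> dict[str, list[str]]:
    """Parse one KEGG-style flat-file record into {FIELD: [value lines]}."""
    record: dict[str, list[str]] = {}
    field = None
    for line in text.splitlines():
        if line.startswith("///"):  # record terminator
            break
        if not line.strip():
            continue
        key, value = line[:12].strip(), line[12:].strip()
        if key:  # a new field name starts in columns 1-12
            field = key
            record.setdefault(field, [])
        if field is not None and value:
            record[field].append(value)
    return record


# Abbreviated, hand-typed sample for illustration only (not a full KEGG entry).
SAMPLE = """\
ENTRY       C00031                      Compound
NAME        D-Glucose;
            Grape sugar;
            Dextrose
FORMULA     C6H12O6
EXACT_MASS  180.0634
MOL_WEIGHT  180.1559
///
"""

if __name__ == "__main__":
    for field, values in parse_kegg_flat(SAMPLE).items():
        print(f"{field}: {values}")
```

Keeping every field as a list of lines avoids guessing which fields are single-valued, which varies between entry types.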




4. How to Read the KEGG Pathway Map in Transcriptome Analysis

Among transcriptome analysis results, the KEGG pathway map is the most intuitive database-derived display. A transcriptome analysis often involves thousands or even tens of thousands of genes, so it is useful to classify them and analyze genes with the same function together. This classification is achieved through gene annotation. For a transcriptome with a reference genome, annotation is generally taken from the genome's existing annotation. For a transcriptome without a reference, it is obtained by aligning sequences against specific databases. Once genes are annotated against KEGG, the differentially expressed genes of a given comparison group can be mapped onto KEGG pathways and displayed graphically, which makes them easy to classify and inspect.
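
The mapping step described above can be sketched in a few lines of Python. This is a minimal illustration that assumes two lookup tables are already available from annotation: gene to KO number, and KO number to pathway identifiers. The gene names and the table contents below are toy values chosen for the example, not results from a real annotation run.

```python
from collections import defaultdict


def group_degs_by_pathway(degs, gene_to_ko, ko_to_pathways):
    """Group differentially expressed genes (DEGs) by annotated pathway."""
    pathways = defaultdict(set)
    for gene in degs:
        ko = gene_to_ko.get(gene)
        if ko is None:
            continue  # gene has no KEGG annotation, so it cannot be mapped
        for pathway in ko_to_pathways.get(ko, ()):
            pathways[pathway].add(gene)
    return {pathway: sorted(genes) for pathway, genes in pathways.items()}


if __name__ == "__main__":
    # Toy annotation tables (hypothetical values for illustration).
    gene_to_ko = {"geneA": "K00844", "geneB": "K00001", "geneC": "K00844"}
    ko_to_pathways = {
        "K00844": ["map00010", "map00500"],
        "K00001": ["map00010"],
    }
    degs = ["geneA", "geneB", "geneD"]  # geneD is unannotated
    print(group_degs_by_pathway(degs, gene_to_ko, ko_to_pathways))
```

Unannotated genes are skipped silently here. In a real pipeline it is worth counting them, because only annotated genes enter the later enrichment statistics.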




In the KEGG pathway map, rectangular boxes represent enzymes (gene products), and circles represent metabolites. In a pathway map of differential genes, some boxes are marked in red, green, or blue. Red means that the genes annotated to that enzyme are up-regulated in the comparison group. Green means that they are down-regulated. Blue means that the enzyme has both up-regulated and down-regulated genes annotated to it.
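
The coloring rule can be written as a short function. This sketch assumes that each box is given the log2 fold changes of the differentially expressed genes annotated to it, with positive values meaning up-regulation and negative values meaning down-regulation. The function name `box_color` is illustrative.

```python
from typing import Iterable, Optional


def box_color(log2_fold_changes: Iterable[float]) -> Optional[str]:
    """Return the color for one enzyme box, or None if it has no DEGs.

    Input: log2 fold changes of the differential genes annotated to the box
    (assumed convention: > 0 is up-regulated, < 0 is down-regulated).
    """
    changes = list(log2_fold_changes)
    has_up = any(fc > 0 for fc in changes)
    has_down = any(fc < 0 for fc in changes)
    if has_up and has_down:
        return "blue"   # both up- and down-regulated genes map to this box
    if has_up:
        return "red"    # only up-regulated genes
    if has_down:
        return "green"  # only down-regulated genes
    return None         # no differential genes: the box keeps its default look


if __name__ == "__main__":
    print(box_color([1.8, 2.3]))   # up only
    print(box_color([-1.2]))       # down only
    print(box_color([2.1, -3.0]))  # mixed
```

A box can turn blue because one enzyme is often encoded by several genes, and those genes need not change in the same direction.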


5. Importance of KEGG Enrichment Analysis in Biological Research

Even after the differentially expressed genes have been classified by KEGG annotation, each comparison group usually still contains dozens of pathways. We therefore perform enrichment analysis on gene functions to find the biological pathways that play a key role, and so to reveal and understand the underlying molecular mechanisms. In addition, when comparing experimental conditions, a set of activated pathways is more convincing than a bare list of genes or proteins. Enrichment analysis is a statistical method that groups genes by shared function and tests whether any group is over-represented among the genes of interest. It is based on the hypergeometric distribution, and KEGG enrichment analysis uses a q-value below 0.05 as the threshold for significant enrichment. The calculation formula of the hypergeometric test is as follows:
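
In the notation defined just below, the enrichment p-value is usually written in the following standard form, which gives the probability of finding at least m differential genes in a pathway by chance. It is given here as the conventional hypergeometric expression, on the assumption that this is the formula the text refers to.

$$
P = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}
$$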


Here, N is the number of all genes annotated to the KEGG database, n is the number of differentially expressed genes annotated to the KEGG database, M is the number of genes annotated to a given pathway, and m is the number of differentially expressed genes annotated to that same pathway.
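
With these four counts, the test can be sketched using only the Python standard library. Two assumptions are worth flagging. First, the p-value is taken as the upper tail P(X ≥ m) of the hypergeometric distribution, which is the usual convention for enrichment. Second, the text does not say how the q-value is derived, so the Benjamini-Hochberg adjustment below is a common stand-in rather than a confirmed part of the pipeline. Both function names are illustrative.

```python
from math import comb


def enrichment_pvalue(N: int, n: int, M: int, m: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= m).

    N: genes annotated to KEGG (background)
    n: differentially expressed genes annotated to KEGG
    M: genes annotated to the pathway
    m: differentially expressed genes annotated to the pathway
    """
    if not (0 <= M <= N and 0 <= n <= N and 0 <= m <= min(n, M)):
        raise ValueError("inconsistent counts")
    # Summing the upper tail directly equals 1 - sum_{i < m} but avoids
    # floating-point cancellation when the p-value is very small.
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, min(n, M) + 1))
    return tail / comb(N, n)


def bh_adjust(pvalues: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    k = len(pvalues)
    order = sorted(range(k), key=lambda i: pvalues[i])
    adjusted = [0.0] * k
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(k, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * k / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy counts for three hypothetical pathways: (M, m) per pathway.
    N, n = 20000, 500
    pathways = {"pathway_1": (100, 12), "pathway_2": (250, 7), "pathway_3": (40, 1)}
    pvals = [enrichment_pvalue(N, n, M, m) for M, m in pathways.values()]
    for name, p, q in zip(pathways, pvals, bh_adjust(pvals)):
        print(f"{name}: p={p:.3g} q={q:.3g} significant={q < 0.05}")
```

In practice the adjustment is applied across all pathways tested in one comparison group, since the number of tests determines how strongly each p-value is corrected.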


6. Conclusion

Across transcriptome, proteome, metabolome, and microbiome studies, KEGG pathway annotation and enrichment analysis play a central role in making sense of complex biological processes. They give researchers a more precise and deeper reading of high-throughput data than gene or metabolite lists alone, and so help move from raw results to biological insight.


Discover the frontier of biological data with MetwareBio. Our cutting-edge tools and databases unlock new insights in metabolomics, proteomics, and multi-omics. Backed by leading-edge technologies and seasoned professionals, we're your partner for groundbreaking discoveries. Contact us to explore our innovative services as well as the Metware Cloud Platform for seamless analysis of your multi-omics data.


