This article gives an overview of batch effects in metabolomics research: what they are, how to minimize them, and which methods are available for batch correction. Understanding and addressing batch effects is crucial for accurate and reproducible data analysis in metabolomics studies.
Batch effects are unwanted variations in measured data caused by technical factors such as differences in sample collection, inconsistent pre-experimental processing, and instrument instability. They introduce differences unrelated to biological variation and can compromise the repeatability and accuracy of data analysis results.
Batch effects can be reduced at the experimental stage in the following ways:
a) Process all samples in a single batch where possible, to minimize the time between measurements.
b) For a large number of samples, randomize the run order across groups so that every group is represented throughout the run, from start to finish. If samples are processed in group order, the "inter-group differences" found in later calculations may be due mainly to batch effects rather than to biological differences.
c) If, for practical reasons, samples can only be tested in two or more batches separated by long intervals, retest some samples from the first batch in each subsequent batch. These repeated samples make later correction easier.
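The randomization in point b) can be sketched in a few lines. This is a minimal illustration, not a prescribed protocol: the function name `randomized_run_order`, the group names, and the sample IDs are all hypothetical. Samples are shuffled within each group and then dealt out in rounds, so every stretch of the run contains samples from every group.

```python
import random

def randomized_run_order(groups, seed=0):
    """Return a run order in which every group is spread over the whole run.

    groups: dict mapping group name -> list of sample IDs (hypothetical names).
    Samples are shuffled within each group, dealt out one per group per round,
    and each round is shuffled so that no group is always injected first.
    """
    rng = random.Random(seed)  # fixed seed so the order can be reproduced
    shuffled = {g: rng.sample(ids, len(ids)) for g, ids in groups.items()}
    longest = max(len(ids) for ids in shuffled.values())
    order = []
    for i in range(longest):
        block = [ids[i] for ids in shuffled.values() if i < len(ids)]
        rng.shuffle(block)
        order.extend(block)
    return order

# Hypothetical study with two groups of six samples each.
groups = {
    "control": [f"C{i}" for i in range(1, 7)],
    "treated": [f"T{i}" for i in range(1, 7)],
}
run_order = randomized_run_order(groups, seed=42)
print(run_order)
```

With equal group sizes, each consecutive pair of injections contains one control and one treated sample, so instrument drift over the run affects both groups in the same way.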
If batch effects are nonetheless present in the results, batch correction is necessary. Correction methods fall into three broad classes: those based on internal standards, those based on the samples themselves, and those based on QC mixed samples. Each class is introduced below.
Internal standards are isotopically labeled compounds added to each sample before testing. The response value of the target substance is divided by the response value of the internal standard to obtain a corrected response value for the target. The limitation of this method is that the internal standard must correspond to the target substance (ideally it is the isotopically labeled form of the same compound), so each target needs its own matched standard. This restricts the method's range of application.
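The ratio described above can be written out as a short sketch. The peak areas and sample IDs below are made-up illustrative numbers, and `normalize_to_internal_standard` is a hypothetical helper name.

```python
def normalize_to_internal_standard(target_response, standard_response):
    """Divide the target's response by the response of its internal standard."""
    if standard_response <= 0:
        raise ValueError("internal standard response must be positive")
    return target_response / standard_response

# Made-up peak areas for one metabolite and its labeled standard.
# The same amount of standard is spiked into every sample, so the lower
# standard signal in batch 2 points to a technical loss of sensitivity.
samples = [
    {"id": "S1", "batch": 1, "target": 2000.0, "standard": 1000.0},
    {"id": "S2", "batch": 2, "target": 1600.0, "standard": 800.0},
]
ratios = {
    s["id"]: normalize_to_internal_standard(s["target"], s["standard"])
    for s in samples
}
print(ratios)  # {'S1': 2.0, 'S2': 2.0}
```

The raw target responses differ between the two batches, but the ratios agree, because the standard and the target lose signal in the same proportion.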
Sample-based correction methods assume that the total amount of metabolites is the same or similar across samples. Several variants exist. In TIC normalization (total ion current, also called total ion count), for example, each metabolite's content is divided by the sum of all metabolite contents in the same sample, so the scaling factor is calculated independently for each sample. More correction methods are shown in the figure below, where the formula represents the scaling factor (denominator) of each correction method.
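As a minimal sketch of sample-based scaling, the code below implements TIC normalization and, for comparison, a median-based variant in which the denominator is the sample's median intensity. The metabolite names and intensities are made up for illustration, and the function names are hypothetical.

```python
from statistics import median

def tic_normalize(intensities):
    """Divide each metabolite by the sum of all metabolites in the sample."""
    total = sum(intensities.values())
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return {name: value / total for name, value in intensities.items()}

def median_normalize(intensities):
    """Divide each metabolite by the median intensity of the sample."""
    scale = median(intensities.values())
    if scale <= 0:
        raise ValueError("median intensity must be positive")
    return {name: value / scale for name, value in intensities.items()}

# Made-up data: sample B has the same composition as sample A, but every
# signal is doubled (for example, twice as much material was injected).
samples = {
    "A": {"glucose": 100.0, "lactate": 300.0, "alanine": 600.0},
    "B": {"glucose": 200.0, "lactate": 600.0, "alanine": 1200.0},
}
tic = {sid: tic_normalize(vals) for sid, vals in samples.items()}
med = {sid: median_normalize(vals) for sid, vals in samples.items()}
print(tic)
print(med)
```

After either normalization the two samples have the same profile, which is the intended behavior when the difference between them is purely a scaling effect. The assumption breaks down if a few abundant metabolites genuinely change between groups, since they then dominate the denominator.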
QC mixed samples (often called pooled QC samples) are made by mixing equal amounts of all samples, or of a randomly selected proportion of them. During testing, a QC mixed sample is tested after every fixed number of samples (e.g., 10), as shown in the figure below. Because every QC mixed sample has the same composition, any change in its measured signal over the run reflects technical drift rather than biology. Fitting this trend for each metabolite and removing it leaves the real change trend of the metabolites in the samples. There are many QC mixed sample-based correction methods. Here are a few common ones:
a) The Support Vector Regression (SVR) based correction method in the R package metaX.
b) The Robust Spline Correction (RSC) based method in the R package metaX.
c) The Random Forest-based QC-RFSC correction method in the R package statTarget.
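The idea shared by these QC-based methods can be illustrated with a much simpler stand-in. The sketch below does not reproduce SVR, RSC, or QC-RFSC, and it does not use the metaX or statTarget APIs. It assumes instead that the drift between two neighboring QC injections is linear, estimates the drift at each injection position by interpolation, and rescales every intensity to the median QC level. All names and numbers are hypothetical.

```python
from statistics import median

def interpolate_qc(position, qc_positions, qc_values):
    """Linearly interpolate the QC signal at a given injection position."""
    if position <= qc_positions[0]:
        return qc_values[0]
    if position >= qc_positions[-1]:
        return qc_values[-1]
    for i in range(len(qc_positions) - 1):
        x0, x1 = qc_positions[i], qc_positions[i + 1]
        if x0 <= position <= x1:
            y0, y1 = qc_values[i], qc_values[i + 1]
            return y0 + (y1 - y0) * (position - x0) / (x1 - x0)

def correct_drift(run):
    """Remove the QC-derived trend from one metabolite.

    run: list of (position, kind, intensity) with kind "qc" or "sample".
    Each intensity is divided by the interpolated QC signal at its position
    and multiplied by the median QC signal to keep the original scale.
    """
    qcs = sorted((pos, val) for pos, kind, val in run if kind == "qc")
    qc_positions = [pos for pos, _ in qcs]
    qc_values = [val for _, val in qcs]
    reference = median(qc_values)
    corrected = []
    for pos, kind, val in run:
        drift = interpolate_qc(pos, qc_positions, qc_values)
        corrected.append((pos, kind, val * reference / drift))
    return corrected

# Made-up run: sensitivity falls by 5% of its starting value per injection.
# A QC is injected at positions 0, 4, and 8; all study samples have the
# same true level, so only drift makes their raw intensities differ.
run = []
for pos in range(9):
    factor = 1 - 0.05 * pos
    if pos % 4 == 0:
        run.append((pos, "qc", 1000.0 * factor))
    else:
        run.append((pos, "sample", 500.0 * factor))

corrected = correct_drift(run)
for pos, kind, val in corrected:
    print(pos, kind, round(val, 3))
```

In this constructed example the drift really is linear, so the correction recovers identical values for all study samples. Real drift is rarely that regular, which is why the packages listed above fit more flexible models to the QC signal.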
Many correction methods exist, and no single best method has been established. We can, however, narrow the choice by eliminating methods that clearly perform poorly.
The figure below shows a simple evaluation of various methods using technical replicate samples: the correlation between technical replicates is calculated before and after correction. As the graph shows, the original data already have good correlation. SVR and the sample-based methods (median, mean, MAD) slightly improved the correlation, while RSC and QC-RFSC significantly worsened it. On the basis of this evaluation, RSC and QC-RFSC are not recommended. Which of SVR, median, mean, MAD, or other methods is better cannot yet be judged accurately. It is advisable to apply several types of correction methods and to validate the results experimentally after differential analysis.
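The replicate-based check described above can be sketched as follows. The intensities are invented purely to show the mechanics and do not reproduce the results in the figure; `method_A` and `method_B` are placeholder names, not real correction methods. A method is kept only if it does not lower the correlation between technical replicates.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def replicate_correlation(data, replicate_pairs):
    """Mean Pearson correlation over all pairs of technical replicates."""
    values = [pearson(data[a], data[b]) for a, b in replicate_pairs]
    return sum(values) / len(values)

# Invented intensities of four metabolites in two technical replicates.
pairs = [("rep_a", "rep_b")]
raw = {"rep_a": [100, 200, 300, 400], "rep_b": [110, 190, 320, 390]}
corrected = {
    "method_A": {"rep_a": [100, 200, 300, 400], "rep_b": [105, 195, 310, 395]},
    "method_B": {"rep_a": [100, 200, 300, 400], "rep_b": [150, 120, 330, 300]},
}

before = replicate_correlation(raw, pairs)
after = {name: replicate_correlation(d, pairs) for name, d in corrected.items()}
kept = [name for name, r in after.items() if r >= before]
print(f"raw: {before:.3f}")
for name, r in after.items():
    print(f"{name}: {r:.3f}")
print("kept:", kept)
```

This check can only rule methods out. A method that passes may still distort biological differences, which is why experimental validation after differential analysis remains necessary.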
This article has described the challenge of batch effects in metabolomics: their impact, strategies for reducing them, and the main correction methods. Correction techniques should be chosen carefully and the results validated to ensure the reliability of metabolomics data. While no single best correction method exists, combining several approaches with experimental validation may offer the most robust solution.