MethCP: Differentially Methylated Region Detection with Change Point Models (bioRxiv)

1 Citation

Boying Gong, Elizabeth Purdom, MethCP: Differentially Methylated Region Detection with Change Point Models, 2018, bioRxiv.

https://doi.org/10.1101/265116

2 Summary

A new approach (MethCP) is proposed for identifying differentially methylated regions (DMRs) in DNA from whole-genome bisulfite sequencing data. The approach is designed for experimental designs more complex than two-group comparisons, e.g. time-course experiments. For the two-group setup, the authors claim that MethCP outperforms existing approaches.

3 Study outcomes

3.1 Outcome O1

For simulated data, the following outcomes were obtained (ROC curves, i.e. TPR vs. FPR for different local precision or local recall):

  • Overall, metilene, MethCP-DSS and MethCP-MethylKit are superior to bsmooth, HMM-Fisher, DSS and methylKit.
  • DSS performs rather well when the local recall is controlled, but rather poorly when the local precision is increased.
  • MethCP controls the desired FPR better than metilene at a significance level of 0.05.

Outcome O1 is presented as Figure 2 in the original publication.
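
The TPR and FPR underlying such ROC curves can be illustrated with a minimal Python sketch. It simply counts CpGs inside and outside the true and predicted regions. This plain CpG-level counting is an assumption made for illustration and does not reproduce the paper's definitions of local precision and local recall. The function name and the toy regions are hypothetical.

```python
def cpg_level_rates(cpg_positions, true_regions, predicted_regions):
    """Return (TPR, FPR) counted per CpG.

    Regions are (start, end) tuples with inclusive coordinates.
    Assumption: a CpG is "positive" if it lies inside any region.
    """
    def covered(pos, regions):
        return any(start <= pos <= end for start, end in regions)

    tp = fp = fn = tn = 0
    for pos in cpg_positions:
        is_true = covered(pos, true_regions)
        is_pred = covered(pos, predicted_regions)
        if is_true and is_pred:
            tp += 1
        elif is_true:
            fn += 1
        elif is_pred:
            fp += 1
        else:
            tn += 1

    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Toy example: ten CpGs, one true DMR, one partially overlapping prediction.
tpr, fpr = cpg_level_rates(range(1, 11), [(3, 6)], [(5, 8)])
print(tpr, fpr)  # TPR = 2/4, FPR = 2/6
```

Sweeping the significance cutoff of a method and recomputing both rates at each cutoff would trace out one ROC curve.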

3.2 Outcome O2

For simulated small effect sizes (2.5%, 5%, 10%, 20%), the following results were obtained:

  • For effect sizes <10%, the authors claim that only MethCP can accurately predict DMRs (although only results for MethCP and metilene are plotted).
  • It is very surprising that metilene has an FPR of up to 0.75 while its TPR is close to zero. The outcome would be much more plausible if the curves for metilene were switched. In that case, metilene and MethCP would have similar performance.

Outcome O2 is presented as Figure 3 in the original publication.

3.3 Outcome O3

After randomly dividing six control samples into two groups of three replicates and randomly permuting across samples for each CpG (termed "1. permutation" below), the following performance was observed:

  • HMM-Fisher performs best. It yields almost no false-positive predictions.
  • MethCP-DSS and MethCP-MethylKit perform well (around 20-40 false-positive DMRs and a proportion of wrongly predicted CpGs below 0.0005).
  • Bsmooth, DSS, methylKit and metilene perform worst (more than 150 false DMRs and a proportion of wrongly predicted CpGs of around 0.008-0.0022).

Outcome O3 is presented as Figure 4 panels (c) and (e) in the original publication.

3.4 Outcome O4

After randomly dividing six control samples into two groups of three replicates and randomly permuting the CpG positions within each sample (termed "2. permutation" below), the following performance was observed:

  • MethCP-DSS and MethCP-MethylKit perform best and have no false-positive predictions.
  • Bsmooth, HMM-Fisher and methylKit have very few (or even no) wrong predictions.
  • DSS and metilene perform worst (around 60-90 wrong DMRs and a proportion of wrongly predicted CpGs of around 0.0005).

Outcome O4 is presented as Figure 4 panels (d) and (f) in the original publication.

3.5 Further outcomes

If intended, you can add further outcomes here.

4 Study design and evidence level

4.1 General aspects

  • The paper presents a new approach (MethCP) and at the same time provides several analyses comparing its performance with existing algorithms. Such a study setting is very common in the literature, although it carries a high risk of biased outcomes. One reason for such bias is that application examples are typically selected to demonstrate performance benefits. Moreover, new approaches are often developed because existing methods perform poorly in a new application setup. In such a setup, the new approach has a good chance of outperforming the others, and it remains rather unclear how the performance comparisons translate to other application settings.
  • The different methods usually apply a coverage filter, i.e. an observed methylation ratio is discarded if it is based on few reads. This filter step was removed entirely to obtain comparable outcomes that do not depend on method-specific filter thresholds. The drawback, however, is that the outcomes are less comparable to those obtained in the default setup (with coverage filter).
  • Only regions with at least 3 CpGs and at least 0.1 for the "mean methylation level" were considered as DMRs.
  • For bsmooth, the smoothing window was shortened from 1000 bp (default) to 500 bp because this yielded better results for the simulated data set.
  • DSS was applied with the "moving average smoothing" option.
  • For methylKit, adjacent DMCs were merged manually into DMRs.
  • "All other parameters other than the significance level (test statistics cutoffs) were left at the default values."
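
The manual merging of adjacent DMCs into DMRs, combined with the minimum of 3 CpGs per region, can be sketched as follows. This is only an illustration under stated assumptions: "adjacent" is taken to mean consecutive significant CpGs with no non-significant CpG in between, and the function name `merge_dmcs` and its parameters are hypothetical. The exact merging rule used in the paper is not given here, and the 0.1 criterion on the "mean methylation level" is omitted.

```python
def merge_dmcs(positions, is_dmc, min_cpgs=3):
    """Merge runs of consecutive significant CpGs (DMCs) into regions.

    positions: sorted CpG coordinates on one chromosome
    is_dmc:    booleans parallel to positions (True = significant CpG)
    Returns a list of (start, end, n_cpgs) for runs with >= min_cpgs CpGs.
    Assumption: a single non-significant CpG ends the current run.
    """
    regions = []
    run = []
    for pos, significant in zip(positions, is_dmc):
        if significant:
            run.append(pos)
            continue
        if len(run) >= min_cpgs:
            regions.append((run[0], run[-1], len(run)))
        run = []
    # Close a run that extends to the last CpG.
    if len(run) >= min_cpgs:
        regions.append((run[0], run[-1], len(run)))
    return regions


# Toy example: the middle run has only two CpGs and is dropped.
positions = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
flags = [True, True, True, False, True, True, False, True, True, True]
print(merge_dmcs(positions, flags))  # [(10, 30, 3), (80, 100, 3)]
```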

4.2 Design for Outcome O1 and O2

  • The outcome was generated from simulated data for a two-group comparison with 3 vs. 3 replicates.
  • Some details are provided about how the data were simulated (in supplement section B, page 14). However, no plots or other procedures for comparing simulated and real-world data are available. It is therefore difficult to assess how well the simulated data agree with real-world measurements and, consequently, how well the outcomes generalize to application settings.
  • It seems that only one random realisation of the data was analyzed.
  • To guarantee typical read coverage and methylation ratios, a publicly available human data set (GSE48580) was used.

4.3 Design for Outcome O3 and O4

  • Publicly available data for Arabidopsis thaliana [Coleman-Derr and Zilberman, 2012] with GEO accession number GSE39045 were analyzed.
  • Wildtype data were compared to the H2A.Z mutant.
  • The data had six replicates in both groups.
  • For assessing false positives, the six control replicates were randomly assigned to two groups of three replicates, and in addition one of the following two permutation approaches was applied:
  1. Outcome O3: The two counts for methylated and unmethylated reads were permuted across samples for each CpG. This breaks local correlations within a sample but preserves correlations that occur over all or several samples. It also removes global differences between the samples in the average methylation level.
  2. Outcome O4: The CpG positions within a sample were permuted, which breaks local correlations along the genome. This does not remove potential global differences between the methylation levels of the individual samples.
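
The random group assignment and the two permutation schemes described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy data layout (`counts[sample][cpg]` holding a `(methylated, unmethylated)` pair) are hypothetical, and scheme 2 assumes that each sample is shuffled independently.

```python
import random


def random_split(sample_ids, rng):
    """Randomly assign the samples to two groups of equal size."""
    ids = list(sample_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def permute_across_samples(counts, rng):
    """Scheme 1: for each CpG, shuffle the count pairs across samples."""
    n_samples, n_cpgs = len(counts), len(counts[0])
    permuted = [[None] * n_cpgs for _ in range(n_samples)]
    for i in range(n_cpgs):
        column = [counts[s][i] for s in range(n_samples)]
        rng.shuffle(column)
        for s in range(n_samples):
            permuted[s][i] = column[s]
    return permuted


def permute_positions_within_samples(counts, rng):
    """Scheme 2: for each sample, shuffle the count pairs along the genome.

    Assumption: every sample gets its own independent shuffle.
    """
    permuted = []
    for sample in counts:
        shuffled = list(sample)
        rng.shuffle(shuffled)
        permuted.append(shuffled)
    return permuted


# Toy data: 6 control samples, 5 CpGs, (methylated, unmethylated) counts.
rng = random.Random(1)
counts = [[(s + i, 10 - i) for i in range(5)] for s in range(6)]
group_a, group_b = random_split(range(6), rng)
scheme1 = permute_across_samples(counts, rng)
scheme2 = permute_positions_within_samples(counts, rng)
```

Scheme 1 keeps the set of counts observed at each CpG but detaches them from their sample, whereas scheme 2 keeps the set of counts of each sample but detaches them from their genomic position.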


5 Further comments and aspects

6 References

Coleman-Derr, D. and Zilberman, D. 2012. Deposition of histone variant H2A.Z within gene bodies regulates responsive genes. PLoS Genetics 8, e1002988.