Analyses
Summary results from pangenome analyses, including assembly completeness (BUSCO), orthogroup statistics (OrthoFinder), and functional annotation (GO / KEGG).
BUSCO Completeness
Effect of CD-HIT clustering on BUSCO scores: redundancy drops while completeness is preserved.
- Completeness remained at 96.5% before and after clustering.
- Duplicated BUSCOs dropped from 61.1% to 21.7%.
- Single-copy BUSCOs increased from 35.4% to 74.8%.
- Fragmented (1.7%) and Missing (1.8%) BUSCOs were unchanged.
CD-HIT removed redundancy without losing completeness, confirming a cleaner representative protein set for downstream analyses.
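As a quick consistency check on the figures above, the short Python sketch below recomputes completeness as single-copy plus duplicated BUSCOs for both stages. The dictionary layout and the helper name `complete` are illustrative assumptions, not part of the original pipeline; only the percentages come from this summary.

```python
# BUSCO percentages as reported in this summary (before and after CD-HIT).
# The dictionary structure is an illustrative assumption.
busco = {
    "before": {"single": 35.4, "duplicated": 61.1, "fragmented": 1.7, "missing": 1.8},
    "after": {"single": 74.8, "duplicated": 21.7, "fragmented": 1.7, "missing": 1.8},
}


def complete(stats):
    """Complete BUSCOs = single-copy + duplicated (rounded to one decimal)."""
    return round(stats["single"] + stats["duplicated"], 1)


for stage, stats in busco.items():
    # All four categories together should account for every BUSCO searched.
    total = round(sum(stats.values()), 1)
    print(f"{stage}: complete={complete(stats)}%, all categories={total}%")
```

Because clustering only moves BUSCOs from the duplicated to the single-copy category, the complete fraction stays the same while redundancy falls.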
CD-HIT: ORF Redundancy Reduction
Progressive reduction of protein sequences across CD-HIT clustering steps.
- Raw TransDecoder: 146,413
- After CD-HIT @98%: 62,862
- After CD-HIT @95%: 50,090
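CD-HIT clustering reduced the ORF count by about 66% (146,413 to 50,090). The final set (~50k) is of the same order as, though somewhat above, the expected mango gene number (30–40k), which is consistent with reasonable transcriptome-based gene prediction quality.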
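The reduction figure can be reproduced from the three counts listed above. The following is a minimal Python sketch; the helper name `percent_reduction` is hypothetical, and only the sequence counts are taken from this summary.

```python
# Protein sequence counts at each stage, as listed in this summary.
counts = [
    ("Raw TransDecoder", 146_413),
    ("After CD-HIT @98%", 62_862),
    ("After CD-HIT @95%", 50_090),
]


def percent_reduction(start, end):
    """Percentage of sequences removed when going from `start` to `end`."""
    return 100.0 * (start - end) / start


raw = counts[0][1]
for label, n in counts:
    # Report each stage relative to the raw TransDecoder output.
    print(f"{label}: {n:,} sequences ({percent_reduction(raw, n):.1f}% removed vs. raw)")
```

Most of the reduction happens at the 98% identity step; the 95% step removes a smaller additional share.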
Annotation Statistics Summary
Total predicted proteins (nr95 set): 50,090
Out of the 50,090 predicted proteins, 93.5% could be annotated using SwissProt and/or eggNOG, while only 6.5% remained unannotated. The majority (78.9%) were supported by both SwissProt and eggNOG, indicating strong functional evidence. SwissProt-only contributed a very small fraction (0.2%), while 14.6% were annotated only by eggNOG. This shows that the dataset has broad functional coverage with minimal unannotated sequences, confirming its reliability for downstream analyses.
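The four categories described above (both databases, SwissProt-only, eggNOG-only, unannotated) follow from simple set operations on protein IDs. The Python sketch below illustrates this with toy data; the function name `classify` and the protein IDs are hypothetical, and how hits are parsed from the actual SwissProt and eggNOG outputs is not covered here.

```python
def classify(all_ids, swissprot_hits, eggnog_hits):
    """Split protein IDs into annotation categories by database support."""
    all_ids = set(all_ids)
    sp = set(swissprot_hits) & all_ids  # ignore hits for IDs outside the set
    eg = set(eggnog_hits) & all_ids
    return {
        "both": sp & eg,
        "swissprot_only": sp - eg,
        "eggnog_only": eg - sp,
        "unannotated": all_ids - sp - eg,
    }


# Toy example with hypothetical protein IDs (not real data).
proteins = [f"prot{i}" for i in range(1, 11)]
swissprot = {f"prot{i}" for i in range(1, 7)}   # prot1..prot6
eggnog = {f"prot{i}" for i in range(2, 10)}     # prot2..prot9

categories = classify(proteins, swissprot, eggnog)
for name, ids in categories.items():
    share = 100.0 * len(ids) / len(proteins)
    print(f"{name}: {len(ids)} proteins ({share:.1f}%)")
```

Since the four categories are disjoint and cover every protein, their percentages should sum to 100%.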
Functional Annotation Statistics (SwissProt + eggNOG)
The combined annotation using SwissProt and eggNOG covered 46,824 proteins, spanning about 20 unique GO terms and 141 KEGG pathways. The Annotated set (supported by both SwissProt and eggNOG) dominates, contributing the majority of functional assignments (39,500 proteins, 20 GO terms, 141 KEGG pathways). The eggNOG-only set (labeled Unannotated, i.e. without a SwissProt hit) still contributes 7,324 proteins, but with fewer unique GO terms (18) and pathways (73). This highlights the value of combining both databases to maximize functional coverage.
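The protein counts in this paragraph can be tied back to the percentages in the summary above with a few lines of arithmetic. This Python sketch uses only numbers stated in this document; the variable names and the helper `pct` are illustrative, and the SwissProt-only count is left out because it is not given here.

```python
TOTAL_PROTEINS = 50_090  # nr95 set

both = 39_500         # supported by SwissProt and eggNOG
eggnog_only = 7_324   # supported by eggNOG only
annotated = both + eggnog_only


def pct(n, total=TOTAL_PROTEINS):
    """Share of the nr95 protein set, rounded to one decimal place."""
    return round(100.0 * n / total, 1)


print(f"both databases: {both:,} ({pct(both)}%)")
print(f"eggNOG only:    {eggnog_only:,} ({pct(eggnog_only)}%)")
print(f"combined:       {annotated:,} ({pct(annotated)}%)")
```

The two counts add up to the 46,824 proteins reported for the combined annotation.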
Orthogroups & Singletons
Interpretation:
Of 13,111 orthogroups, 3,375 are core (shared by all four cultivars), while the majority (9,736) are dispensable (shared by two or three cultivars). Notably, no cultivar-specific orthogroups were detected, highlighting the strong genetic overlap among the reference cultivars.
Among singleton genes (genes not assigned to any orthogroup), Alphonso (2,588) and Amrapali (2,239) show the highest counts, whereas Neelam (1,071) and Dashehari (725) contribute fewer. This indicates that while the genetic backbone is largely shared, each cultivar still retains a distinct set of unique genes that may underlie cultivar-specific traits.
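The core and dispensable categories used above can be derived from a per-cultivar gene count table, such as the one OrthoFinder writes for each orthogroup. The Python sketch below shows one way to do this under stated assumptions: the orthogroup IDs and counts are invented toy data, and `classify_orthogroup` is a hypothetical helper, not the method actually used for these results.

```python
from collections import Counter

CULTIVARS = ["Alphonso", "Amrapali", "Dashehari", "Neelam"]


def classify_orthogroup(gene_counts):
    """Label an orthogroup by how many cultivars contribute at least one gene."""
    present = sum(1 for c in CULTIVARS if gene_counts.get(c, 0) > 0)
    if present == len(CULTIVARS):
        return "core"
    if present >= 2:
        return "dispensable"
    if present == 1:
        return "cultivar-specific"
    return "empty"


# Toy gene-count table with hypothetical orthogroup IDs (not real data).
table = {
    "OG_A": {"Alphonso": 2, "Amrapali": 1, "Dashehari": 1, "Neelam": 3},
    "OG_B": {"Alphonso": 1, "Amrapali": 0, "Dashehari": 2, "Neelam": 0},
    "OG_C": {"Alphonso": 0, "Amrapali": 1, "Dashehari": 1, "Neelam": 1},
}

summary = Counter(classify_orthogroup(counts) for counts in table.values())
for category, n in summary.items():
    print(f"{category}: {n} orthogroup(s)")
```

Singletons are handled separately, since by definition they are not placed in any orthogroup and would not appear in such a table.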