Cancer Data Research via Bipartite Graph Modeling for Subtype Discovery, Therapeutic Linking, and Outcome Prediction
PI: Faisal N. Abu-Khzam
Project description
We use a graph-based framework that models cancer cohorts as bipartite networks connecting patients to molecular/clinical entities
(e.g., genes, variants, pathways, drugs/targets). This representation preserves the inherent two-mode structure of oncology data and supports
subtype discovery, link prediction for therapy mapping, and risk/outcome modeling. Building on established graph algorithms, we will develop, validate, and
release a toolkit and a secure analytics workflow suitable for multi-omics and clinical integration.
Research partners
We welcome partners who want to apply novel yet proven methods to high-dimensional data problems in cancer research and
to uncover new insights in diagnosis, risk, and therapy. We are particularly interested in organizations that hold clinical data and are open to
underwriting a project and/or partnering on research grants.
Our methodology
1. Data-to-graph integration
- P×G: patients ↔ genes/variants (SNVs, CNAs, fusions; optional methylation/expression binarization).
- P×Pth: patients ↔ pathways (edge weights from pathway activity scores).
- D×T: drugs ↔ targets; aligned to the patient side via shared target genes to enable therapeutic linking.
Deliver a harmonized schema with robust QC, missingness handling, and edge weighting (frequency, effect size, confidence).
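As a minimal sketch of the data-to-graph step, assuming edges arrive as (source, target, weight) triples: each two-mode layer becomes a weighted biadjacency matrix, and the drug–target layer is aligned to the patient side by multiplying through the shared gene axis. The helper `biadjacency` and the toy patients, genes, and drugs are hypothetical illustrations, not project data or the project's schema.

```python
import numpy as np

def biadjacency(edges, rows, cols):
    """Build a weighted biadjacency matrix from (row_id, col_id, weight) triples."""
    r_idx = {r: i for i, r in enumerate(rows)}
    c_idx = {c: j for j, c in enumerate(cols)}
    B = np.zeros((len(rows), len(cols)))
    for r, c, w in edges:
        B[r_idx[r], c_idx[c]] = w
    return B

# Toy, made-up inputs for illustration only.
patients = ["P1", "P2", "P3"]
genes = ["TP53", "EGFR", "BRAF"]
drugs = ["erlotinib", "vemurafenib"]

pg_edges = [("P1", "TP53", 1.0), ("P1", "EGFR", 1.0),
            ("P2", "BRAF", 1.0), ("P3", "TP53", 1.0)]
dt_edges = [("erlotinib", "EGFR", 1.0), ("vemurafenib", "BRAF", 1.0)]

PG = biadjacency(pg_edges, patients, genes)  # patients x genes
DT = biadjacency(dt_edges, drugs, genes)     # drugs x target genes

# Align drugs to patients through shared target genes:
# PD[i, j] = sum over genes of PG[i, g] * DT[j, g].
PD = PG @ DT.T
```

A nonzero entry of `PD` flags a candidate patient–drug link for downstream scoring. For real cohorts a sparse matrix format would likely replace the dense arrays used here.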
2. Bi-clustering & community detection for subtype discovery
Apply bipartite-aware methods (e.g., bi-Louvain, spectral bi-clustering, stochastic block models, NMF) to identify patient–feature bi-clusters
that correspond to molecular subtypes and co-alteration modules. Evaluate biological coherence (pathway enrichment), stability, and agreement with
known labels (e.g., PAM50 in breast cancer).
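One of the named options, spectral bi-clustering, can be sketched for the two-cluster case as follows. This is a simplified illustration, not the project's implementation: it applies the usual degree normalization, takes the second singular vectors, and uses a sign split in place of the k-means step, so it only separates two co-clusters. The function name `spectral_bipartition` and the toy matrix are assumptions made for the example.

```python
import numpy as np

def spectral_bipartition(A):
    """Split rows and columns of a nonnegative matrix into two co-clusters.

    Normalizes A_n = D1^{-1/2} A D2^{-1/2} and uses the second singular
    vectors. A sign split stands in for k-means (two clusters only).
    Assumes every row and column has a positive sum.
    """
    A = np.asarray(A, dtype=float)
    d1 = A.sum(axis=1)  # row (patient) weighted degrees
    d2 = A.sum(axis=0)  # column (feature) weighted degrees
    A_n = A / np.sqrt(np.outer(d1, d2))
    U, s, Vt = np.linalg.svd(A_n, full_matrices=False)
    row_embed = U[:, 1] / np.sqrt(d1)
    col_embed = Vt[1, :] / np.sqrt(d2)
    return row_embed > 0, col_embed > 0

# Toy patient x feature matrix with two planted blocks plus weak
# background weight so the graph stays connected (made-up values).
A = np.full((4, 6), 0.1)
A[:2, :3] += 1.0
A[2:, 3:] += 1.0

row_labels, col_labels = spectral_bipartition(A)
```

Patients and features that share a label form one bi-cluster. Which block receives `True` depends on the sign chosen by the SVD routine, so only the grouping is meaningful, not the label value.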
3. Link prediction for therapy mapping and novel associations
- Matrix completion (weighted low-rank, graph-regularized NMF)
- Bipartite GNNs (message passing on two-mode graphs; LightGCN/HeteroGNN variants)
- Calibrated probabilities for candidate patient–target or patient–trial links; prioritize hypotheses that can be tested against external resources (e.g., DepMap/CCLE; OncoKB/CIViC)
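To illustrate the matrix-completion idea, the sketch below scores unobserved patient–feature pairs with a plain truncated SVD. This is a deliberately simpler stand-in for the weighted low-rank and graph-regularized NMF methods listed above, and its scores are raw reconstruction values, not calibrated probabilities. The function names and the toy matrix, with one link held out on purpose, are assumptions for the example.

```python
import numpy as np

def low_rank_scores(A, rank):
    """Score every row-column pair with a rank-`rank` SVD reconstruction."""
    U, s, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

def top_candidates(A, scores, row, k=1):
    """Return the k highest-scoring columns not already linked to `row`."""
    observed = np.asarray(A)[row] > 0
    masked = np.where(observed, -np.inf, scores[row])
    return np.argsort(-masked)[:k].tolist()

# Toy patient x target matrix (made-up values).
A = np.array([
    [1, 1, 0, 0, 0, 0],  # link (0, 2) deliberately held out
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
])

scores = low_rank_scores(A, rank=2)
best = top_candidates(A, scores, row=0, k=1)
```

In this toy case the held-out column is ranked first for patient 0 because a similar patient carries that link. A realistic pipeline would add held-out evaluation and a calibration step before reporting probabilities.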
4. Prognostic modeling using graph-derived features
Engineer graph features (bipartite degrees, authority/hub scores, community memberships, graph embeddings) and test their value in survival/time-to-event
models (Cox, RSF, DeepSurv). Report C-index, time-dependent AUC, and calibration; compare against baselines without graph features.
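Two of the graph features named above, bipartite degree and hub/authority scores, can be computed directly from the biadjacency matrix. The sketch below is one possible approach, assuming a HITS-style power iteration on a connected graph. The function `bipartite_features` and the toy matrix are hypothetical.

```python
import numpy as np

def bipartite_features(A, n_iter=100):
    """Per-row degree and hub score, plus per-column authority score.

    Hub and authority scores come from HITS-style power iteration, which
    converges to the principal singular vectors of A. Requires n_iter >= 1.
    """
    A = np.asarray(A, dtype=float)
    degree = (A > 0).sum(axis=1)   # number of linked features per patient
    hub = np.ones(A.shape[0])
    for _ in range(n_iter):
        auth = A.T @ hub           # features gain weight from linked patients
        auth /= np.linalg.norm(auth)
        hub = A @ auth             # patients gain weight from linked features
        hub /= np.linalg.norm(hub)
    return degree, hub, auth

# Toy patient x gene matrix (made-up values).
A = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
])

degree, hub, auth = bipartite_features(A)
```

The per-patient columns (`degree`, `hub`) could then be joined to clinical covariates as extra inputs to the survival models, with the clinical-only model kept as the baseline.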
Data and cohorts
- Genomics/Clinical: TCGA (multi-omics + outcomes), ICGC, cBioPortal cohorts; METABRIC (breast).
- Knowledge bases: COSMIC (somatic variants), OncoKB/CIViC (actionability), DrugBank/DGIdb (drug–target), MSigDB/Reactome (pathways).
- Optional cell-line sensitivity: DepMap/PRISM/CCLE for orthogonal validation.
If institutional data are added, any protected health information will be handled under IRB approval, de-identification, and secure compute controls.
Evaluation and success criteria
- Subtype discovery: NMI/ARI vs. established labels; pathway enrichment FDR < 0.05; bootstrap stability.
- Link prediction: AUC-PR, hits@k for actionable targets/drugs, rate of predictions supported by the literature.
- Prognosis: C-index improvement Δ ≥ 0.03 over clinical-only baseline; well-calibrated risk.
- Reproducibility: results replicated across at least two cancers (e.g., BRCA, LUAD).
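Two of these criteria can be made concrete with a short, dependency-free sketch: Harrell's C-index for the prognosis criterion and hits@k for link prediction. The functions are simplified illustrations (for instance, pairs with tied times are skipped), and every number below is a made-up toy value, not a result.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i had an observed event and a
    strictly shorter time than j. Risk ties count 0.5; tied times are skipped.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

def hits_at_k(ranked, relevant, k):
    """Return 1 if any of the top-k ranked items is relevant, else 0."""
    return int(any(item in relevant for item in ranked[:k]))

# Toy follow-up data: times, event indicators (1 = event, 0 = censored).
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk_clinical = [0.3, 0.6, 0.4, 0.1]  # hypothetical clinical-only model
risk_graph = [0.9, 0.4, 0.6, 0.1]     # hypothetical model with graph features

c_clinical = concordance_index(times, events, risk_clinical)
c_graph = concordance_index(times, events, risk_graph)
delta = c_graph - c_clinical          # compared against the 0.03 threshold

hit = hits_at_k(["EGFR", "KRAS", "ALK"], {"ALK"}, k=3)
```

The nested loop is quadratic in cohort size, which is fine for a check like this. A production evaluation would more likely rely on a maintained survival-analysis library and report confidence intervals for the difference.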
Deliverables
- Open methods: code (Python; NetworkX/igraph, PyTorch Geometric (PyG), optional cuGraph), Docker/Conda environments.
- Documentation & notebooks for data-to-graph pipelines and analyses.
- Manuscript/preprint (subtype discovery + prognostic utility) and a short policy/clinical note on therapy-linking limits.
- Demo dashboard: query patient/module to see linked features, targets, and literature.
Ethics, privacy, and compliance
All human data work will follow IRB approval, data-use agreements, and GDPR/HIPAA-compliant storage. Only de-identified data will be used;
results are research-only, not for clinical decision-making.