Project Details
Summary:

Natural Language Processing in Cancer: Extracting Diagnostic Insights from Pathology Reports
This project applies natural language processing (NLP) to extract clinically relevant information from pathology reports. Participants will gain hands-on experience with text classification methods ranging from traditional TF-IDF models to large language models (LLMs). The project uses The Cancer Genome Atlas (TCGA) pathology report corpus to develop NLP methods that could ultimately support patient diagnosis, treatment selection, and cancer care.
Description:
This project focuses on using natural language processing (NLP) techniques to extract critical information from pathology reports. Participants will gain hands-on experience with a wide range of text classification methods, from traditional TF-IDF analysis to state-of-the-art large language models (LLMs). The Cancer Genome Atlas (TCGA) pathology report corpus used in this project supports the development of NLP methods that could ultimately improve patient diagnosis, treatment selection, and other aspects of cancer care.
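To make the TF-IDF end of that range concrete, here is a minimal, dependency-free sketch of a TF-IDF text classifier. It is an illustration only, not the method used in the project repository. The toy reports and labels are invented and are not drawn from the TCGA corpus. The function names (`tokenize`, `fit_idf`, `tfidf_vector`, `train`, `predict`) are hypothetical, and the nearest-centroid rule was chosen for brevity. In practice a library such as scikit-learn would typically replace the hand-written parts.

```python
import math
import re
from collections import Counter, defaultdict

# Invented toy reports (NOT taken from TCGA); labels mimic TCGA cancer type codes.
TOY_REPORTS = [
    ("Invasive ductal carcinoma of the left breast, ER positive.", "BRCA"),
    ("Breast lumpectomy specimen showing infiltrating lobular carcinoma.", "BRCA"),
    ("Lung, right upper lobe: adenocarcinoma with pleural invasion.", "LUAD"),
    ("Wedge resection of lung showing adenocarcinoma, acinar pattern.", "LUAD"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs_tokens):
    """Compute a smoothed inverse document frequency for every training token."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each token once per document
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def l2_normalize(vec):
    """Scale a sparse vector (dict) to unit length; an all-zero vector stays empty."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def tfidf_vector(tokens, idf):
    """Build a unit-length TF-IDF vector, ignoring tokens unseen during training."""
    counts = Counter(t for t in tokens if t in idf)
    return l2_normalize({t: c * idf[t] for t, c in counts.items()})


def train(texts, labels):
    """Fit IDF weights and one unit-length centroid vector per class."""
    docs_tokens = [tokenize(t) for t in texts]
    idf = fit_idf(docs_tokens)
    sums = defaultdict(lambda: defaultdict(float))
    for tokens, label in zip(docs_tokens, labels):
        for term, weight in tfidf_vector(tokens, idf).items():
            sums[label][term] += weight
    centroids = {label: l2_normalize(vec) for label, vec in sums.items()}
    return idf, centroids


def predict(text, idf, centroids):
    """Return the class whose centroid has the highest cosine similarity."""
    vec = tfidf_vector(tokenize(text), idf)

    def cosine(centroid):
        # Both vectors are unit length, so the dot product is the cosine.
        return sum(w * centroid.get(t, 0.0) for t, w in vec.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))


if __name__ == "__main__":
    texts = [text for text, _ in TOY_REPORTS]
    labels = [label for _, label in TOY_REPORTS]
    idf, centroids = train(texts, labels)
    print(predict("Core biopsy of breast with ductal carcinoma", idf, centroids))
```

The same `train` and `predict` steps would apply to real report text paired with cancer type labels, although a corpus of that size calls for a proper train/test split and a stronger classifier than nearest centroid.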
Datasets
- GDC Data Portal: https://portal.gdc.cancer.gov/
- TCGA pathology reports corpus: https://github.com/tatonetti-lab/tcga-path-reports/blob/main/TCGA_Reports.csv.zip
- TCGA cancer type: https://github.com/tatonetti-lab/tcga-path-reports/blob/main/data/tcga_metadata/tcga_patient_to_cancer_type.csv
- TNM stage: https://github.com/tatonetti-lab/tnm-stage-classifier/tree/main/TCGA_Metadata
- TCGA clinical endpoints: https://xenabrowser.net/datapages/?dataset=Survival_SupplementalTable_S1_20171025_xena_sp&host=https%3A%2F%2Fpancanatlas.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
- TCGA Pan-Cancer: https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
Code
- Code repository for this project: https://github.com/guilopgar/AI-Campus-Project-7-NLP
- TNM stage classifier: https://github.com/tatonetti-lab/tnm-stage-classifier/tree/main
- Cancer type classifier: https://github.com/tatonetti-lab/tcga-path-reports
Publications
- Kefeli, J., Berkowitz, J., Acitores Cortina, J.M. et al. Generalizable and automated classification of TNM stage from pathology reports with external validation. Nat Commun 15, 8916 (2024). https://doi.org/10.1038/s41467-024-53190-9
- Kefeli, J., Tatonetti, N. TCGA-Reports: A machine-readable pathology report resource for benchmarking text-based AI models. Patterns 5(3), 100933 (2024). https://doi.org/10.1016/j.patter.2024.100933
Project Prepared By:
Guillermo Lopez Garcia – Guillermo.LopezGarcia@cshs.org
Takeshi Onishi – Takeshi.Onishi@cshs.org