AI Campus @ Cedars-Sinai

Project Details

Summary:


Breast Cancer Prognosis: Leveraging Cancer Registry Data for Survival Prediction

This project focuses on a survival prediction task using cancer registry data. Participants will gain hands-on experience with key topics, including data cleaning, feature selection techniques, and machine learning modeling tailored to time-to-event data.

Description:

Breast cancer is one of the leading causes of cancer-related death, and identifying patients at higher risk of poor outcomes early is crucial for improving survival. This project focuses on predicting breast cancer survival using clinical data hosted on Kaggle, a data science competition platform. The data were derived and tailored from the Surveillance, Epidemiology, and End Results (SEER) cancer registry (November 2017 update). The dataset includes a variety of clinical features, such as age, tumor size, lymph node status, and hormone receptor status. Unlike traditional classification tasks, which predict a categorical outcome (e.g., whether a patient will survive or die), survival prediction estimates the time until an event occurs. Each patient's outcome is represented as an (outcome, time) pair: whether the event was observed, and how long the patient was followed. Patients who are still alive at last follow-up are censored, meaning their true survival time is known only to exceed the recorded time. Handling censoring is what makes survival modeling a distinct and more complex task within machine learning.
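
To make the (outcome, time) representation concrete, the sketch below computes a Kaplan-Meier survival curve in plain Python. It is an illustration only, not the project's demo script: the function name kaplan_meier and the five toy patients are made up for this example, and a real analysis would normally rely on an established survival library rather than hand-written code.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    times  : follow-up durations (for example, months)
    events : 1 if the event (death) was observed, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)  # everyone leaving the risk set at t (death or censoring)
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Multiply by the fraction of at-risk patients who survive past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Censored patients shrink the risk set without lowering the curve.
        at_risk -= exits[t]
    return curve


# Toy, made-up data (NOT from the SEER/Kaggle file): five patients.
toy_times = [5, 8, 8, 12, 20]
toy_events = [1, 0, 1, 1, 0]

km_curve = kaplan_meier(toy_times, toy_events)
for t, s in km_curve:
    print(f"S({t}) = {s:.3f}")
```

Note how the patient censored at month 8 never lowers the curve directly but still reduces the number at risk for later time points. A plain classifier on "alive vs. dead" would discard exactly this information.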

The primary objectives of this project are to: (1) develop a robust prognostic model using machine learning techniques, such as XGBoost with a survival objective and random survival forests, that accurately predicts survival probabilities at various time points from clinical features; and (2) interpret the models and findings from the analysis, explaining feature importance with methods such as SurvSHAP(t).
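
The project description does not name an evaluation metric, so the following is an assumption: one common way to check how well a survival model ranks patients is Harrell's concordance index (C-index). The plain-Python sketch below is a simplified version that skips pairs with tied times. The function name and the toy risk scores are hypothetical, chosen only for illustration.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C-index.

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's recorded time. The pair is concordant
    when the model gave patient i the higher risk score. Tied risk
    scores count as half; pairs with tied times are skipped.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy, made-up values: follow-up months, event flags, model risk scores.
toy_times = [5, 8, 12, 20]
toy_events = [1, 0, 1, 0]
toy_risks = [0.9, 0.4, 0.2, 0.5]

c_index = concordance_index(toy_times, toy_events, toy_risks)
print(f"C-index = {c_index:.2f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The same function can be applied to the risk scores of any model being compared, provided higher scores mean higher predicted risk.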

 

Dataset Sources:

The data file is available from Kaggle: https://www.kaggle.com/datasets/reihanenamdari/breast-cancer

The original data come from the SEER program of the National Cancer Institute. SEER provides comprehensive information on cancer statistics in the United States, including incidence and survival rates. For a more comprehensive dataset suitable for model training, testing, and validation, consider applying for access to the SEER program. The complete original data can also be downloaded from https://seerdataaccess.cancer.gov/seer-data-access

 

Dataset Information:

This dataset includes 4,024 patients and 14 features. The features cover demographic information (age, race, marital status) and prognostic factors related to tumor characteristics: primary tumor (T) stage, regional lymph node (N) stage, breast cancer stage, differentiation, tumor grade, regional metastasis (A) stage, tumor size, estrogen receptor (ER) status, progesterone receptor (PR) status, the number of regional nodes examined, and the number of positive regional nodes. The outcomes are survival time in months and vital status.
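
As a minimal sketch of turning the two outcome columns into (outcome, time) pairs, the snippet below parses a few rows with Python's standard csv module. The column names "Survival Months" and "Status" and the labels "Alive" and "Dead" are assumptions about the Kaggle file and should be checked against its actual header. The three rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Assumed column names and labels; verify them against the downloaded file.
TIME_COL = "Survival Months"
STATUS_COL = "Status"
EVENT_LABEL = "Dead"


def to_survival_pairs(rows, time_col=TIME_COL, status_col=STATUS_COL,
                      event_label=EVENT_LABEL):
    """Convert CSV rows (dicts) into a list of (event, months) tuples.

    event is 1 when the status equals event_label, otherwise 0 (censored).
    """
    pairs = []
    for row in rows:
        event = 1 if row[status_col].strip() == event_label else 0
        months = float(row[time_col])
        pairs.append((event, months))
    return pairs


# Made-up rows standing in for the real file.
sample_csv = """Age,Tumor Size,Survival Months,Status
61,22,70,Alive
47,40,18,Dead
55,15,95,Alive
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
pairs = to_survival_pairs(rows)
print(pairs)
```

With the real download, the io.StringIO object would be replaced by an opened file handle, for example open(path, newline=""), where path points to wherever the CSV was saved.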

 

Getting Started Guide:

  1. Download the dataset from the Kaggle platform: Breast Cancer Dataset. Familiarize yourself with the objectives and data included in this project.
  2. Skim or read the recommended papers for additional insights, such as:
    1. Efthimiou, Orestis, Michael Seo, Konstantina Chalkou, Thomas Debray, Matthias Egger, and Georgia Salanti. "Developing clinical prediction models: a step-by-step guide." BMJ 386 (2024).
    2. Phung, Minh Tung, Sandar Tin Tin, and J. Mark Elwood. "Prognostic models for breast cancer: a systematic review." BMC Cancer 19 (2019): 1-18.
  3. For beginners, start by learning the fundamentals of Python or R programming and exploring tools like Jupyter Notebook or Google Colab. (Refer to the 'Resources' page for educational links.)
  4. Download the demo script from the GitHub repository: 2025 Cedars AI Campus GitHub.
  5. Start working with the example notebook and experiment with the code by making modifications, such as testing different machine learning algorithms or different parameter settings.

References and Suggested Reading:

Original Publications

  • Grootes, Isabelle, Gordon C. Wishart, and Paul David Peter Pharoah. "An updated PREDICT breast cancer prognostic model including the benefits and harms of radiotherapy." NPJ Breast Cancer 10.1 (2024): 6.
  • Sonabend, Raphael, Franz J. Király, Andreas Bender, Bernd Bischl, and Michel Lang. "mlr3proba: an R package for machine learning in survival analysis." Bioinformatics 37, no. 17 (2021): 2789-2791.

 

Other Useful Publications/Materials

  • PREDICT Breast cancer website: https://breast.predict.cam/
  • The mlr3 R package: https://mlr3.mlr-org.com/
  • Phung, Minh Tung, Sandar Tin Tin, and J. Mark Elwood. "Prognostic models for breast cancer: a systematic review." BMC Cancer 19 (2019): 1-18.
  • Efthimiou, Orestis, Michael Seo, Konstantina Chalkou, Thomas Debray, Matthias Egger, and Georgia Salanti. "Developing clinical prediction models: a step-by-step guide." BMJ 386 (2024).
  • Spooner, Annette, et al. "A comparison of machine learning methods for survival analysis of high-dimensional clinical data for dementia prediction." Scientific Reports 10.1 (2020): 20410.
  • Krzyziński, Mateusz, Mikołaj Spytek, Hubert Baniecki, and Przemysław Biecek. "SurvSHAP(t): time-dependent explanations of machine learning survival models." Knowledge-Based Systems 262 (2023): 110234.

 

Project Authors: 

Pei-Chen Peng, PhD - Assistant Professor of Computational Biomedicine at Cedars-Sinai Medical Center

Yi-Wen Hsiao, PhD (he/him) - Postdoctoral Scientist of Computational Biomedicine at Cedars-Sinai Medical Center
