## Project Details

### Summary:

##### Modeling Complex Binary-Class Associations in Simulated Genomic Data

This project deals with binary classification tasks (case/control) in a variety of simulated single nucleotide polymorphism (SNP) datasets. Each dataset has a different form of underlying complex association (e.g. multivariate additive, epistatic, or genetic heterogeneity). This project will introduce participants to topics such as basic data preparation, feature selection methods, machine learning (ML) modeling algorithms, automated machine learning, and model explanation and interpretation.

### Description:

This project utilizes 4 simulated genomic single nucleotide polymorphism (SNP) datasets, each with a binary class outcome. This outcome is the ‘dependent variable’ that we want to predict accurately using a trained model. Each dataset has a different simulated underlying complex association driven by its respective predictive features: (A) 4 additive features, (B) 2 purely epistatic (2-way interacting) features, (C) 4 genetically heterogeneous features, and (D) a mix of both 2-way epistasis and genetic heterogeneity with a total of 4 predictive features. Each dataset includes 2 to 4 predictive features (here, features are simulated SNPs) along with a much larger number of non-predictive features (randomly simulated SNPs).

The primary goal of this project is to apply appropriate elements of an ML analysis pipeline to train and evaluate predictive model(s), aiming for the best prediction performance possible. In other words: Which ML algorithm(s) perform best when tackling different underlying patterns of association in data? Also, what is the most effective way to set up a data analysis/machine learning pipeline that adheres to best practices in data science?

A secondary goal of this project is to utilize strategies to correctly distinguish between predictive and non-predictive features within each respective dataset (e.g. model feature importance estimation). In other words: Can a given ML algorithm correctly use/prioritize predictive features, but successfully avoid ‘overfitting’ by not using/prioritizing non-predictive features?
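
One way to probe this secondary goal is permutation importance measured on held-out data. The following is a minimal sketch, assuming scikit-learn and a synthetic stand-in dataset (not one of the project datasets): the column names `M0`, `M1`, and `N0` to `N7` only mimic the project's naming convention, and the additive signal and noise level are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic stand-in (NOT project data): 2 predictive and 8 non-predictive
# SNP-like features coded as 0/1/2 minor-allele counts.
names = ["M0", "M1"] + [f"N{i}" for i in range(8)]
X = pd.DataFrame(rng.binomial(2, 0.3, size=(n, len(names))), columns=names)

# Hypothetical additive signal from M0 and M1 plus Gaussian noise.
y = ((X["M0"] + X["M1"] + rng.normal(0, 0.7, size=n)) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Shuffle one feature at a time on the held-out set and record the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
importances = pd.Series(
    result.importances_mean, index=X_test.columns
).sort_values(ascending=False)
print(importances)
```

If the model is using the right features, the `M` columns should rank above the `N` columns. Non-predictive features with clearly non-zero importance would be a hint of overfitting.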

Beyond ML modeling, this project can involve any or all of the following aspects of a data science pipeline (or beyond): (1) exploratory data analysis (e.g. univariate analysis, feature correlation analysis and clustering), (2) data cleaning and/or feature encoding, (3) data partitioning (e.g. k-fold cross validation), (4) feature transformation/scaling, (5) pre-modeling feature importance estimation, (6) feature selection, (7) machine learning modeling (with different algorithms, and a hyperparameter sweep for each algorithm), (8) model evaluation using a variety of classification metrics, ROC, and/or PRC plots, (9) model feature importance estimation, and (10) other model ‘explanation’ strategies.
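
Several of these steps can be chained so that everything learned from the data is fit on training folds only. The sketch below is one possible arrangement, assuming scikit-learn; the synthetic data, the chi-squared filter, logistic regression, and the small parameter grid are illustrative choices, not the project's prescribed method.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in (NOT a project dataset): 500 samples, 20 SNP-like
# features coded 0/1/2. The first 4 columns carry an additive signal.
X = rng.binomial(2, 0.3, size=(500, 20))
signal = X[:, :4].sum(axis=1) + rng.normal(0, 1.0, size=500)
y = (signal > np.median(signal)).astype(int)

# Step (3): hold out a test set before any fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Steps (6), (4), (7): feature selection, scaling, and modeling in one Pipeline,
# so each is re-fit inside every cross-validation fold.
pipe = Pipeline([
    ("select", SelectKBest(score_func=chi2)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameter sweep scored by ROC AUC with stratified 5-fold CV.
param_grid = {"select__k": [2, 4, 10], "clf__C": [0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="roc_auc")
search.fit(X_train, y_train)

# Step (8): evaluate the refit best pipeline on the untouched test set.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(test_auc, 3))
```

A univariate filter such as chi-squared suits an additive signal like the one simulated here, but it would be expected to miss purely epistatic features, which have no univariate association with outcome.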

By examining the different datasets, students are encouraged to consider the challenges presented by different patterns of association in data (i.e. additivity, epistasis, and genetic heterogeneity). Which modeling algorithms can detect these associations, and what processing steps can be applied to these datasets (which differ in their total number of features) to improve a model's ability to detect and interpret these effects?

This project also includes a corresponding set of ‘Challenge’ datasets. They are just like the first 4 described, except that each includes a total of 10,000 features (i.e. many more non-predictive features to deal with) and 1% missing values within each feature. These datasets add further dimensions of challenge to the target problem: feature selection can become a more important part of the analysis pipeline, some ML algorithms will be more likely than others to find certain underlying associations, and some strategy will likely be needed to deal with missing values in the data.
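
As one possible way to handle the missing values, imputation can be placed inside the modeling pipeline so that it is fit on training folds only. The sketch below assumes scikit-learn and uses synthetic data scaled down to 400 samples and 200 features for speed. The mode imputation, the chi-squared filter with `k=20`, and the random forest are illustrative choices, and the values are knocked out at random across the whole matrix, which only approximates 1% per feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(2)
n_samples, n_features = 400, 200  # scaled down from 1000 x 10,000 for speed

# Synthetic stand-in (NOT project data): SNP-like 0/1/2 features,
# with a hypothetical additive signal in the first 4 columns.
genotypes = rng.binomial(2, 0.3, size=(n_samples, n_features))
signal = genotypes[:, :4].sum(axis=1) + rng.normal(0, 1.0, size=n_samples)
y = (signal > np.median(signal)).astype(int)

# Knock out roughly 1% of all values at random.
X = genotypes.astype(float)
X[rng.random(X.shape) < 0.01] = np.nan

pipe = Pipeline([
    # Mode imputation keeps the data on the 0/1/2 genotype scale.
    ("impute", SimpleImputer(strategy="most_frequent")),
    # Reduce the feature space before modeling.
    ("select", SelectKBest(score_func=chi2, k=20)),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=2)),
])

# Imputer and selector are re-fit within each fold, avoiding leakage.
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```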

__Datasets:__

Datasets were simulated using the GAMETES software package. In each dataset, most features are non-predictive (i.e. randomly simulated based on a randomly chosen minor allele frequency between 0.01 and 0.5). The remaining (predictive) features have been simulated to have a unique association that is predictive of a binary outcome in the dataset column named ‘Class’. This outcome has been encoded as 0 or 1, representing control subjects or case subjects, respectively. Non-predictive features start with the letter ‘N’, and predictive features start with the letter ‘M’.
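
To make this layout concrete, the sketch below builds a small table with the same conventions and splits its columns by prefix. It is an illustration only: the column names `N0` to `N5`, `M0`, and `M1` are hypothetical, the 0/1/2 genotype coding and the Hardy-Weinberg sampling of non-predictive SNPs are assumptions rather than a description of how GAMETES works internally, and the `M` columns and `Class` are random placeholders here. With the real data, the table would instead be loaded from one of the downloaded files (for example with `pandas.read_csv`, using whatever delimiter those files use).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_samples, n_noise = 1000, 6

# One minor allele frequency per non-predictive SNP, drawn from [0.01, 0.5].
mafs = rng.uniform(0.01, 0.5, size=n_noise)

# Assumption: genotypes (0, 1, or 2 minor-allele copies) follow
# Hardy-Weinberg proportions, i.e. Binomial(2, maf) per SNP.
noise = rng.binomial(2, mafs, size=(n_samples, n_noise))
df = pd.DataFrame(noise, columns=[f"N{i}" for i in range(n_noise)])

# Placeholder predictive columns and outcome. Unlike the real data,
# these are random and carry no association with Class.
df["M0"] = rng.binomial(2, 0.3, size=n_samples)
df["M1"] = rng.binomial(2, 0.3, size=n_samples)
df["Class"] = rng.integers(0, 2, size=n_samples)

# Separate features from the outcome, then split features by name prefix.
X = df.drop(columns="Class")
y = df["Class"]
predictive = [c for c in X.columns if c.startswith("M")]
non_predictive = [c for c in X.columns if c.startswith("N")]

print(len(predictive), len(non_predictive))
print(y.value_counts().to_dict())
```

Because the names reveal which features are predictive, the prefix split is best reserved for checking results afterwards (e.g. whether a feature ranking puts the `M` columns first) rather than for selecting model inputs.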

8 datasets have been simulated in total, each with 1000 samples/instances and a degree of simulated ‘noise’, which should make it impossible to predict hold-out testing data with 100% accuracy. 4 ‘basic’ datasets have been simulated with a total of 100 features, and 4 corresponding ‘challenge’ datasets have been simulated with a total of 10,000 features as well as 1% missing values for every feature. The ‘basic’ and ‘challenge’ folders each include 4 datasets with different underlying ‘patterns of association’, i.e. the association between the predictive features and outcome is of a different nature in each.

- 4-wayAdditive: 4 predictive features that have been additively combined to predict outcome, such that all predictive features have univariate associations with outcome.
- 2-wayEpi: 2 predictive features that have a ‘pure’ epistatic interaction predicting outcome. Neither feature will have a univariate association with outcome.
- 2Additive_2-wayEpi: 4 predictive features total, representing two separately simulated 2-feature epistatic interactions that have been additively combined to predict outcome. Features in this dataset can potentially have both epistatic and univariate associations with outcome.
- 4-wayHeterogeneous: 4 predictive features that are each simulated to be predictive only within a respective ¼ of all instances in the data (i.e. they are heterogeneously associated with outcome). Each feature has a weaker overall univariate association with outcome.
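
To illustrate why a ‘pure’ epistatic interaction is hard for some algorithms, the sketch below hand-builds an XOR-style toy model. This is not a GAMETES model and not one of the project datasets: outcome depends on whether exactly one of two SNPs is heterozygous, all SNPs use a minor allele frequency of 0.5, and 10% of labels are flipped as noise. Under these assumptions neither SNP has a univariate association with outcome, so a main-effects logistic regression should hover near chance, while a tree ensemble can pick up the interaction.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 1000

# 6 SNP-like features coded 0/1/2; columns 0 and 1 interact, the rest are noise.
X = rng.binomial(2, 0.5, size=(n, 6))

# Toy pure interaction: case if exactly one of the two SNPs is heterozygous.
het = X[:, :2] == 1
y = (het[:, 0] ^ het[:, 1]).astype(int)

# Flip 10% of labels so that perfect accuracy is not attainable.
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - y, y)

# Main-effects-only linear model vs. a tree ensemble, 5-fold CV accuracy.
lr_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
rf_acc = cross_val_score(
    RandomForestClassifier(n_estimators=300, random_state=3), X, y, cv=5
).mean()
print(f"logistic regression: {lr_acc:.3f}  random forest: {rf_acc:.3f}")
```

The same comparison can be repeated on the project's 2-wayEpi dataset, where the gap between algorithms may differ because the simulated model, noise level, and number of non-predictive features are different.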

__Getting Started Guide:__

- Download the zipped project file (below), and unzip its contents to view: a pdf of this project summary, an example Jupyter Notebook, an HTML link to view the example notebook without having Jupyter Notebook installed, some relevant papers, and the project datasets.
- Familiarize yourself with the goals and included data in this project.
- Read or skim some of the included ‘Papers’.
    - In particular, check out the two papers on STREAMLINE that discuss the assembly of a machine learning analysis pipeline.
    - Both 2018 papers explain the challenge of epistatic interactions in data and focus on feature selection strategies that are sensitive to interactions as well as genetic heterogeneity.
    - The Woodward_2022 paper is a review covering the topic of ‘heterogeneity’ in data.
    - Both 2012 papers discuss the tool/strategy used to simulate the data for this project.

- (For Beginners) Take some time to learn the basics of Python programming and of using Jupyter Notebook and/or Google Colab notebooks (see the 'Resources' page for educational links).
    - Start by installing Anaconda, which comes with Python, Jupyter Notebook, and the standard Python packages used in ML/data science (e.g. pandas, numpy, scikit-learn).
    - Learn to open and start working with code in Jupyter Notebook and/or Google Colab.
    - Take some time to learn the basics of data science and machine learning. One starting point is the YouTube tutorial we created, titled “Machine Learning Essentials for Biomedical Data Science”.

- Open (and run) the included ‘Example_Jupyter_Notebook.ipynb’ file using Jupyter Notebook. This will both confirm that Anaconda and Jupyter Notebook are installed correctly and provide an example of loading one of the datasets included in this project, conducting a brief exploratory analysis, and then training and evaluating a machine learning model.
- Use this working example notebook as a starting point. Try making changes to the code: for example, changing the appearance of plots, adding new elements to the exploratory analysis, or changing the machine learning algorithm(s) used to train a model.
- (Optional) Try out the ‘STREAMLINE’ Automated Machine Learning software, located in the STREAMLINE repository on GitHub.
    - View the STREAMLINE video tutorials on what it is and how it works.
    - Download the repository, open the included demonstration Jupyter Notebook, and update the notebook (as indicated in the notebook instructions) to load and analyze the datasets included in one of the project data folders (start with the ‘basic’ datasets). You can update the STREAMLINE run parameters directly in this notebook as desired.
    - Review the STREAMLINE output PDF and folder for a comprehensive ML analysis of these datasets.

- Lay out some initial specific goals/objectives for yourself and/or your team, regarding what you want to accomplish in building your own analysis pipeline, or utilizing STREAMLINE as a starting point to build from.
- Try out and compare new strategies, methods, and tools (including AutoML tools) to answer your own questions, using the goals at the beginning of this project description as a guide.

__References and Suggested Reading:__

- Urbanowicz, R.J., Kiralis, J., Sinnott-Armstrong, N.A., Heberling, T., Fisher, J.M. and Moore, J.H., 2012. GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures. *BioData mining*, *5*, pp.1-14.
- Urbanowicz, R.J., Kiralis, J., Fisher, J.M. and Moore, J.H., 2012. Predicting the difficulty of pure, strict, epistatic models: metrics for simulated model selection. *BioData mining*, *5*(1), pp.1-13.
- Woodward, A.A., Urbanowicz, R.J., Naj, A.C. and Moore, J.H., 2022. Genetic heterogeneity: Challenges, impacts, and methods through an associative lens. *Genetic Epidemiology*, *46*(8), pp.555-571.
- Urbanowicz, R.J., Meeker, M., La Cava, W., Olson, R.S. and Moore, J.H., 2018. Relief-based feature selection: Introduction and review. *Journal of biomedical informatics*, *85*, pp.189-203.
- Urbanowicz, R.J., Olson, R.S., Schmitt, P., Meeker, M. and Moore, J.H., 2018. Benchmarking relief-based feature selection methods for bioinformatics data mining. *Journal of biomedical informatics*, *85*, pp.168-188.
- Urbanowicz, R., Zhang, R., Cui, Y. and Suri, P., 2023. STREAMLINE: A Simple, Transparent, End-To-End Automated Machine Learning Pipeline Facilitating Data Analysis and Algorithm Comparison. In *Genetic Programming Theory and Practice XIX* (pp. 201-231). Singapore: Springer Nature Singapore.

__Project Author:__

Ryan Urbanowicz, PhD (he/him)

Assistant Professor of Computational Biomedicine at the Cedars Sinai Medical Center

Adjunct Assistant Professor of Biostatistics, Epidemiology, and Informatics at the University of Pennsylvania

Director of Cedars AI Campus Program

Director of the URBS-Lab