A Study of a Breast Cancer Dataset

Binder

GitHub Pages here

Authors: Kshitij (TJ) Chauhan, Neha Haq, Wenhao Pan, Jiaji Wu

Introduction

This repository contains a study of a breast cancer dataset downloaded from Kaggle.

As a group, we wanted to focus on the health industry, and on cancer in particular, as it is one of the most widely studied diseases today. This motivated us to analyze the breast cancer dataset publicly available on Kaggle. We believe our analysis of this dataset will be helpful to doctors and patients who want to determine whether a tumor is benign or malignant.

Since our goal is to classify whether a tumor is benign or malignant, we chose a classification model. We trained several candidate models, including logistic regression, decision trees, and random forests, and then picked the best one based on its performance on the validation set. We hope our final model can serve as a strong basis for predicting whether a tumor is benign or malignant from its characteristics. Furthermore, we performed hypothesis testing using the parametric two-sample t-test and the non-parametric Wilcoxon rank-sum test to check whether our results are statistically significant at a significance level of 5%.
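
As a rough illustration of this workflow, the sketch below compares the three candidate models on the validation set and runs both two-sample tests on a single feature. The column names `diagnosis` and `radius_mean` and the "M"/"B" label encoding are assumptions based on the usual layout of the Kaggle dataset, not a verbatim excerpt from our notebooks:

```python
import pandas as pd
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("data/train.csv")
val = pd.read_csv("data/val.csv")

# Assumed layout: a "diagnosis" label column ("M" = malignant,
# "B" = benign) plus numeric feature columns.
X_train, y_train = train.drop(columns=["diagnosis"]), train["diagnosis"]
X_val, y_val = val.drop(columns=["diagnosis"]), val["diagnosis"]

models = {
    "logistic regression": LogisticRegression(max_iter=10000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

# Fit every candidate and keep the one with the best validation accuracy.
scores = {name: accuracy_score(y_val, m.fit(X_train, y_train).predict(X_val))
          for name, m in models.items()}
best = max(scores, key=scores.get)
print(f"best model: {best} (validation accuracy {scores[best]:.3f})")

# Two-sample tests on one feature, split by diagnosis: the parametric
# two-sample t-test (Welch's variant) and the non-parametric Wilcoxon
# rank-sum test.
malignant = train.loc[train["diagnosis"] == "M", "radius_mean"]
benign = train.loc[train["diagnosis"] == "B", "radius_mean"]
print(stats.ttest_ind(malignant, benign, equal_var=False))
print(stats.ranksums(malignant, benign))
```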

Installation

Run make env to set up the conda environment and install the required dependencies. Use the hw07 kernel to execute the Jupyter notebooks.

Repository Structure

  • data/ contains the datasets in CSV format

    • raw_data.csv is the original data downloaded from Kaggle

    • clean.csv is the cleaned version of raw_data.csv

    • train.csv is the training dataset

    • val.csv is the validation dataset

    • test.csv is the testing dataset

  • figures/ contains generated figures from running the notebooks in codes/

  • tables/ contains generated tables from running the notebooks in codes/

  • codes/ contains the Jupyter notebooks for data analysis

    • data_prepare.ipynb prepares the data for later analysis

    • data_visual.ipynb conducts data visualization

    • logistic_reg.ipynb conducts logistic regression analysis

    • decision_tree_and_random_forest.ipynb conducts decision tree and random forest modeling and comparison

    • final_model_selection.ipynb chooses the final model between logistic regression, decision tree, and random forest

    • two_populations_analysis.ipynb conducts two sample hypothesis testing

  • models/ contains the fitted models from running the notebooks in codes/ (see the loading sketch after this list)

    • dt_model.sav is the fitted decision tree model

    • rf_model.sav is the fitted random forest model

    • lg_model.sav is the fitted logistic regression model

  • diagnosis/ contains the files required to create the Python package.

    • README.md package information

    • setup.py required to create the Python package

    • pyproject.toml required to create the Python package

    • setup.cfg required to create the Python package

    • LICENSE license of the package

    • diagnosis/ contains content of package

      • tests/ tests for created methods

      • __init__.py required to create the Python package

      • modelmake.py methods for classification modeling

      • twosample.py methods for hypothesis testing

      • main.py methods for plotting figures

      • prepare.py methods for preparing the data

  • _config.yml required for JupyterBook

  • conf.py required for JupyterBook

  • _toc.yml is the table of contents for JupyterBook

  • book-requirements.txt packages for the book build in GitHub Actions

  • environment.yml hw07 conda environment installation

  • envsetup.sh utilized by make env

  • envupdate.sh utilized by make update

  • envremove.sh utilized by make remove

  • run_codes.sh utilized by make all

  • html_hub.sh builds the JupyterBook so it can be viewed on the hub with the URL proxy trick

  • Makefile make commands for easy execution

  • LICENSE contains the license used by the repo

  • README.md current document

  • requirements.txt contains the names of the packages installed through PyPI

  • main.ipynb summarizes and discusses the findings and outcomes of our analysis

  • hw07-description.ipynb Stat 159 HW7 assignment description
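
The fitted models in models/ are .sav files. Assuming they were serialized with pickle (a common convention for scikit-learn estimators; joblib.load works analogously if joblib was used instead), they can be reloaded and applied to new data as in the sketch below. The "diagnosis" column name is an assumption about the data layout:

```python
import pickle

import pandas as pd

# Assumes the .sav file was written with pickle; use joblib.load
# instead if the notebooks saved the model with joblib.
with open("models/rf_model.sav", "rb") as f:
    rf_model = pickle.load(f)

# The feature table must have the same columns the model was trained
# on; "diagnosis" is the assumed name of the label column to drop.
X_test = pd.read_csv("data/test.csv").drop(columns=["diagnosis"])
print(rf_model.predict(X_test))
```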

Makefile Commands

Run make followed by one of the following targets:

  • env creates and configures the environment

  • remove-env removes the environment

  • update-env updates the environment

  • html builds the JupyterBook normally

  • html-hub builds the JupyterBook so that you can view it on the hub with the URL proxy trick: https://stat159.datahub.berkeley.edu/user-redirect/proxy/8000/index.html

  • clean cleans up the generated figures, tables, data, and _build folders

  • all runs all the notebooks (*.ipynb in codes/ and main.ipynb)

Notes

  • When using pytest to test the functions in the package, call pytest diagnosis from the root directory, i.e., in hw07-Group26, run pytest diagnosis in the terminal. Also, since our testing functions use some generated data, make sure to run make all to generate all necessary files before testing.
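
For reference, a test in diagnosis/tests might look like the following minimal sketch. The file name test_generated_data.py and the assertions are illustrative, not the repo's actual tests; they only pass once make all has produced the data files:

```python
# diagnosis/tests/test_generated_data.py (hypothetical example)
import os

import pandas as pd


def test_clean_data_exists():
    # `make all` must have been run first to generate data/clean.csv.
    assert os.path.exists("data/clean.csv")


def test_clean_data_nonempty():
    # The cleaned dataset should contain at least one row.
    df = pd.read_csv("data/clean.csv")
    assert len(df) > 0
```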