A Study of a Breast Cancer Dataset
GitHub Pages here
Authors: Kshitij (TJ) Chauhan, Neha Haq, Wenhao Pan, Jiaji Wu
Introduction
This repository contains a study of a breast cancer dataset downloaded from Kaggle.
As a group, we wanted to focus on the health industry. We decided to look at cancer, as it is one of the most widely studied diseases today. This motivated us to analyze the breast cancer dataset publicly available on Kaggle. We believe that our analysis of this dataset would be helpful to doctors and patients who want to find out whether a cancer is benign or malignant.
We chose a classification model, as our goal is to classify whether the cancer is benign or malignant. To do this, we trained several models, including logistic regression, decision trees, and random forests, and then picked the best one based on its performance on the validation set. We hope that our final model can serve as a strong basis for predicting whether a cancer is benign or malignant based on its characteristics. Furthermore, we performed hypothesis testing using the parametric two-sample t-test and the non-parametric Wilcoxon rank-sum test to see whether our results are statistically significant at a significance level of 5%.
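The workflow above can be sketched as follows. This is a minimal illustration on synthetic data using scikit-learn and SciPy — the real analysis lives in the notebooks under `codes/` and uses the Kaggle dataset; all variable names here are illustrative, not the repository's actual code.

```python
# Sketch: compare three classifiers on a validation split, then run the
# two-sample tests described above on one feature split by class.
# Synthetic data stands in for the Kaggle breast cancer dataset.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Fit the three candidate models and keep the best validation accuracy.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}
scores = {name: model.fit(X_train, y_train).score(X_val, y_val)
          for name, model in candidates.items()}
best = max(scores, key=scores.get)

# Parametric two-sample t-test and non-parametric Wilcoxon rank-sum test
# on a single feature, split by class (benign vs. malignant).
feature = X[:, 0]
t_stat, t_p = stats.ttest_ind(feature[y == 0], feature[y == 1])
w_stat, w_p = stats.ranksums(feature[y == 0], feature[y == 1])
significant = t_p < 0.05  # 5% significance level
```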
Installation
Run `make env` to set up the conda environment and install the required dependencies. Use the `hw07` kernel to execute the Jupyter notebooks.
Repository Structure
- `data/` contains different datasets in CSV format
  - `raw_data.csv` is the original data downloaded from Kaggle
  - `clean.csv` is the cleaned version of `raw_data.csv`
  - `train.csv` is the training dataset
  - `val.csv` is the validation dataset
  - `test.csv` is the testing dataset
- `figures/` contains figures generated by running the notebooks in `codes/`
- `tables/` contains tables generated by running the notebooks in `codes/`
- `codes/` contains the Jupyter notebooks for data analysis
  - `data_prepare.ipynb` prepares the data for later analysis
  - `data_visual.ipynb` conducts data visualization
  - `logistic_reg.ipynb` conducts logistic regression analysis
  - `decision_tree_and_random_forest.ipynb` conducts decision tree and random forest modeling and comparison
  - `final_model_selection.ipynb` chooses the final model between logistic regression, decision tree, and random forest
  - `two_populations_analysis.ipynb` conducts two-sample hypothesis testing
- `models/` contains the fitted models produced by running the notebooks in `codes/`
  - `dt_model.sav` is the fitted decision tree model
  - `rf_model.sav` is the fitted random forest model
  - `lg_model.sav` is the fitted logistic regression model
- `diagnosis/` contains the files required for package creation
  - `README.md` info of the package
  - `setup.py` required to create the Python package
  - `pyproject.toml` required to create the Python package
  - `setup.cfg` required to create the Python package
  - `LICENSE` license of the package
  - `diagnosis/` contains the content of the package
    - `tests/` tests for the created methods
    - `__init__.py` required to create the Python package
    - `modelmake.py` methods for classification modeling
    - `twosample.py` methods for hypothesis testing
    - `main.py` methods for plotting figures
    - `prepare.py` methods for preparing the data
- `_config.yml` required for JupyterBook
- `conf.py` required for JupyterBook
- `_toc.yml` is the table of contents for JupyterBook
- `book-requirements.txt` packages for the book build in GitHub Actions
- `environment.yml` `hw07` conda environment installation
- `envsetup.sh` utilized by `make env`
- `envupdate.sh` utilized by `make update`
- `envremove.sh` utilized by `make remove`
- `run_codes.sh` utilized by `make all`
- `html_hub.sh` builds the JupyterBook so it can be viewed on the hub with the URL proxy trick
- `Makefile` make commands for easy execution
- `LICENSE` contains the license used by the repo
- `README.md` the current document
- `requirements.txt` contains the names of the packages installed through PyPI
- `main.ipynb` summarizes and discusses the findings and outcomes of our analysis
- `hw07-description.ipynb` Stat 159 HW7 assignment description
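The fitted models under `models/` are stored as `.sav` files. A common way to write and read such files is `joblib`; the sketch below shows a round trip with a stand-in model (the file name and model are illustrative, and we assume `joblib` serialization — check the notebooks in `codes/` for the exact mechanism used).

```python
# Hedged sketch: save and reload a fitted model the way the .sav files in
# models/ are presumably produced (joblib serialization is an assumption).
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Write the fitted model to disk, then load it back for reuse.
path = os.path.join(tempfile.mkdtemp(), "lg_model.sav")
joblib.dump(model, path)
loaded = joblib.load(path)
preds = loaded.predict(X)  # the reloaded model predicts like the original
```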
Makefile Commands

- `make env` creates and configures the environment
- `make remove-env` removes the environment
- `make update-env` updates the environment
- `make html` builds the JupyterBook normally
- `make html-hub` builds the JupyterBook so that you can view it on the hub with the URL proxy trick: https://stat159.datahub.berkeley.edu/user-redirect/proxy/8000/index.html
- `make clean` cleans up the generated figures, tables, data, and `_build` folders
- `make all` runs all the notebooks (`*.ipynb` in `codes/` and `main.ipynb`)
Notes

When using `pytest` to test the functions in the package, run `pytest diagnosis` from the repository root, i.e., in `hw07-Group26`, run `pytest diagnosis` in the terminal. Also, since our testing functions use some generated data, make sure to run `make all` to generate all necessary files before testing.