# A Study of a Breast Cancer Dataset
GitHub Pages here
Authors: Kshitij (TJ) Chauhan, Neha Haq, Wenhao Pan, Jiaji Wu
## Introduction
This repository contains a study of a breast cancer dataset downloaded from Kaggle.
As a group, we wanted to focus on the health industry. We decided to look at cancer because it is one of the most widely studied diseases today. This motivated us to analyze the breast cancer dataset publicly available on Kaggle. We believe our analysis would be helpful to doctors and patients who want to determine whether a tumor is benign or malignant.
Since our goal is to classify whether a tumor is benign or malignant, we chose a classification model. We trained several candidates, including logistic regression, decision trees, and random forests, and picked the best one based on performance on the validation set. We hope the final model can serve as a strong basis for predicting whether a tumor is benign or malignant from its characteristics. Furthermore, we performed hypothesis testing using the parametric two-sample t-test and the non-parametric Wilcoxon rank-sum test to check whether our results are statistically significant at a significance level of 5%.
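The model-selection and hypothesis-testing workflow described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: it uses scikit-learn's bundled breast cancer dataset as a stand-in for the Kaggle data, and the split sizes, model settings, and choice of feature for the two-sample tests are all illustrative assumptions.

```python
# Hypothetical sketch of the workflow: fit three classifiers, pick the one
# with the best validation accuracy, then run the two-sample tests.
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the Kaggle dataset (in sklearn's version, target 0 is
# malignant and 1 is benign).
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set, then carve a validation set out of the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Select the model with the highest validation accuracy.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = model.score(X_val, y_val)
best_name = max(val_scores, key=val_scores.get)
print(f"best model: {best_name} (val accuracy {val_scores[best_name]:.3f})")

# Two-sample tests on one feature (mean radius): does its distribution
# differ between the benign and malignant groups at the 5% level?
feature = X[:, 0]
benign, malignant = feature[y == 1], feature[y == 0]
t_stat, t_p = stats.ttest_ind(benign, malignant, equal_var=False)
w_stat, w_p = stats.ranksums(benign, malignant)
print(f"t-test p = {t_p:.2e}, Wilcoxon rank-sum p = {w_p:.2e}")
```

The validation set is used only for model selection, so the held-out test set still gives an unbiased estimate of the chosen model's performance.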
## Installation
Run `make env` to set up the conda environment and install the required dependencies. Use the `hw07` kernel to execute the Jupyter notebooks.
## Repository Structure
- `data/` contains the datasets in CSV format
  - `raw_data.csv` is the original data downloaded from Kaggle
  - `clean.csv` is the cleaned version of `raw_data.csv`
  - `train.csv` is the training dataset
  - `val.csv` is the validation dataset
  - `test.csv` is the testing dataset
- `figures/` contains generated figures from running the notebooks in `codes/`
- `tables/` contains generated tables from running the notebooks in `codes/`
- `codes/` contains the Jupyter notebooks for data analysis
  - `data_prepare.ipynb` prepares the data for later analysis
  - `data_visual.ipynb` conducts data visualization
  - `logistic_reg.ipynb` conducts logistic regression analysis
  - `decision_tree_and_random_forest.ipynb` conducts decision tree and random forest modeling and comparison
  - `final_model_selection.ipynb` chooses the final model among logistic regression, decision tree, and random forest
  - `two_populations_analysis.ipynb` conducts two-sample hypothesis testing
- `models/` contains the fitted models from running the notebooks in `codes/`
  - `dt_model.sav` is the fitted decision tree model
  - `rf_model.sav` is the fitted random forest model
  - `lg_model.sav` is the fitted logistic regression model
- `diagnosis/` contains the files required to build the package
  - `README.md` package info
  - `setup.py` required to create the Python package
  - `pyproject.toml` required to create the Python package
  - `setup.cfg` required to create the Python package
  - `LICENSE` package license info
  - `diagnosis/` contains the package source
    - `tests/` tests for the package's functions
    - `__init__.py` required to create the Python package
    - `modelmake.py` methods for classification modeling
    - `twosample.py` methods for hypothesis testing
    - `main.py` methods for plotting figures
    - `prepare.py` methods for preparing the data
- `_config.yml` required for the JupyterBook
- `conf.py` required for the JupyterBook
- `_toc.yml` the table of contents for the JupyterBook
- `book-requirements.txt` packages for the book build in GitHub Actions
- `environment.yml` `hw07` conda environment installation
- `envsetup.sh` used by `make env`
- `envupdate.sh` used by `make update`
- `envremove.sh` used by `make remove`
- `run_codes.sh` used by `make all`
- `html_hub.sh` builds the JupyterBook so it can be viewed on the hub with the URL proxy trick
- `Makefile` make commands for easy execution
- `LICENSE` contains the license used by the repo
- `README.md` this document
- `requirements.txt` contains the names of the packages installed through PyPI
- `main.ipynb` summarizes and discusses the findings and outcomes of our analysis
- `hw07-description.ipynb` Stat 159 HW7 assignment description
## Makefile Commands
- `make env` creates and configures the environment
- `make remove-env` removes the environment
- `make update-env` updates the environment
- `make html` builds the JupyterBook normally
- `make html-hub` builds the JupyterBook so that you can view it on the hub with the URL proxy trick: https://stat159.datahub.berkeley.edu/user-redirect/proxy/8000/index.html
- `make clean` cleans up the generated figures, tables, data, and `_build` folders
- `make all` runs all the notebooks (`*.ipynb` in `codes/` and `main.ipynb`)
## Notes
When using `pytest` to test the functions in the package, call `pytest diagnosis` from the root directory, i.e., in `hw07-Group26`, run `pytest diagnosis` in the terminal. Also, since our testing functions use some generated data, make sure to run `make all` to generate all necessary files before testing.