Two Populations Comparative Statistical Analysis

This notebook explores the differences between two populations: patients with malignant breast cancer and patients with benign breast cancer. Our sample contains 357 patients with benign cancer and 212 patients with malignant cancer, each described by 30 variables that may help distinguish the two populations and so help to better identify cancer in patients. We will use hypothesis testing to check whether the two groups differ in some key features by more than chance alone would explain.

import pandas as pd
import matplotlib.pyplot as plt
from diagnosis.twosample import plt_by_diagnosis, two_sample_t_test

dat = pd.read_csv('../data/clean.csv')
dat_ = dat.iloc[:, 1:]
dat.head()

	id	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	...	radius_worst	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst
0	842302	1.0	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	...	25.38	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
1	842517	1.0	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	...	24.99	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
2	84300903	1.0	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	...	23.57	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
3	84348301	1.0	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	...	14.91	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
4	84358402	1.0	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	...	22.54	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678

5 rows × 32 columns

Below we see all of the features recorded for each patient. We could conduct a hypothesis test for every feature, but we will instead choose a few to limit the multiple-testing problem (which can also be addressed with correction techniques). I specifically want to conduct a two sample t test on two features: the one whose group means differ the most between patients with malignant cancer and patients with benign cancer, and the one whose group means differ the least. We use the difference in the average observed value between the two samples as the measure of difference between the two populations. We also compute the standard deviation of each feature in each group, because the form of the two sample t test depends on whether the two populations are assumed to have the same variance.
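The multiple-testing concern mentioned above can be handled with a correction such as the Holm-Bonferroni step-down procedure. The following is a minimal sketch and is not part of the original analysis; the function name `holm_reject` and the example p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where Holm's procedure rejects the null."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank, starting at 0) is compared
        # with alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return reject


# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.04, 0.03, 0.20]
print(holm_reject(example_p))
```

With only a handful of tests, as in this notebook, the correction changes little; it matters more if all 30 features were tested.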

Two assumptions must be met in order to properly conduct a two sample t test: (1) the observations are independent and (2) the data are sampled from a Gaussian distribution. The first condition holds because each patient's measurements are not influenced by those of other patients. We will check below whether the second assumption is met. If it is not, we will conduct both the parametric two sample t test and the non-parametric Wilcoxon rank sum test.
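To make the non-parametric alternative concrete, here is a minimal standard-library sketch of the Wilcoxon rank sum test using the large-sample normal approximation without a tie correction. The helper names `midranks` and `rank_sum_test` are hypothetical, and the toy samples are made up; the notebook itself relies on `two_sample_t_test`, whose internals are not shown here.

```python
from math import sqrt
from statistics import NormalDist


def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (z statistic, two-sided p-value) for samples x and y."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    z = (rank_sum_x - expected) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Toy samples, for illustration only.
z, p = rank_sum_test([1, 2, 3, 4], [5, 6, 7, 8])
print(z, p)
```

Because the test works on ranks rather than raw values, it does not require the Gaussian assumption, which is why it serves as a cross-check on the t test.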

dat.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

dat__diagnosis_mean = dat_.groupby("diagnosis").mean()
dat__diagnosis_mean

	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	symmetry_mean	fractal_dimension_mean	...	radius_worst	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst
diagnosis
0.0	12.146524	17.914762	78.075406	462.790196	0.092478	0.080085	0.046058	0.025717	0.174186	0.062867	...	13.379801	23.515070	87.005938	558.899440	0.124959	0.182673	0.166238	0.074444	0.270246	0.079442
1.0	17.462830	21.604906	115.365377	978.376415	0.102898	0.145188	0.160775	0.087990	0.192909	0.062680	...	21.134811	29.318208	141.370330	1422.286321	0.144845	0.374824	0.450606	0.182237	0.323468	0.091530

2 rows × 30 columns

dat__diagnosis_std = dat_.groupby("diagnosis").std()
dat__diagnosis_std

	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	symmetry_mean	fractal_dimension_mean	...	radius_worst	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst
diagnosis
0.0	1.780512	3.995125	11.807438	134.287118	0.013446	0.033750	0.043442	0.015909	0.024807	0.006747	...	1.981368	5.493955	13.527091	163.601424	0.020013	0.092180	0.140368	0.035797	0.041745	0.013804
1.0	3.203971	3.779470	21.854653	367.937978	0.012608	0.053987	0.075019	0.034374	0.027638	0.007573	...	4.283569	5.434804	29.457055	597.967743	0.021870	0.170372	0.181507	0.046308	0.074685	0.021553

2 rows × 30 columns

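In the table below, we see that ‘area worst’, the average of the three highest values of area for each patient, has the minimum signed difference, and ‘texture_se’, the standard error of the texture variable for each patient, has the maximum signed difference. Because the differences are computed as the group 0 mean minus the group 1 mean, the minimum is the largest gap in magnitude (about -863), while the maximum is a gap close to zero (about 0.009). Moreover, the sample standard deviations suggest that ‘area worst’ has different variances in the two populations, while ‘texture se’ has roughly the same population variance.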

dat__diagnosis_mean.loc["diff"] = dat__diagnosis_mean.loc[0] - dat__diagnosis_mean.loc[1] 
dat__diagnosis_mean

	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	symmetry_mean	fractal_dimension_mean	...	radius_worst	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst
diagnosis
0.0	12.146524	17.914762	78.075406	462.790196	0.092478	0.080085	0.046058	0.025717	0.174186	0.062867	...	13.379801	23.515070	87.005938	558.899440	0.124959	0.182673	0.166238	0.074444	0.270246	0.079442
1.0	17.462830	21.604906	115.365377	978.376415	0.102898	0.145188	0.160775	0.087990	0.192909	0.062680	...	21.134811	29.318208	141.370330	1422.286321	0.144845	0.374824	0.450606	0.182237	0.323468	0.091530
diff	-5.316306	-3.690144	-37.289971	-515.586219	-0.010421	-0.065103	-0.114717	-0.062273	-0.018723	0.000187	...	-7.755010	-5.803138	-54.364392	-863.386881	-0.019886	-0.192152	-0.284368	-0.107793	-0.053222	-0.012088

3 rows × 30 columns

min_diff, max_diff = dat__diagnosis_mean.loc["diff"].min(), dat__diagnosis_mean.loc["diff"].max()

dat__diagnosis_mean.columns[dat__diagnosis_mean.loc["diff"] == min_diff][0]

'area_worst'

dat__diagnosis_mean.columns[dat__diagnosis_mean.loc["diff"] == max_diff][0]

'texture_se'
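Since `.min()` and `.max()` act on signed differences, it can be worth also ranking features by the absolute size of the gap. The sketch below uses three differences printed in this notebook (the other 27 features are omitted), so it illustrates the distinction rather than re-running the full selection; the variable names are hypothetical.

```python
# Differences (group 0 mean minus group 1 mean) copied from the notebook output.
# Only three of the 30 features are included, for illustration.
diffs = {
    "area_worst": -863.386881,
    "texture_se": 0.009465,
    "fractal_dimension_mean": 0.000187,
}

# Signed extremes, which is what .min() and .max() pick out.
most_negative = min(diffs, key=diffs.get)
most_positive = max(diffs, key=diffs.get)

# Extremes by magnitude of the gap.
largest_gap = max(diffs, key=lambda name: abs(diffs[name]))
smallest_gap = min(diffs, key=lambda name: abs(diffs[name]))

print(most_negative, most_positive)
print(largest_gap, smallest_gap)
```

Among these three features the signed and absolute rankings agree on ‘area_worst’ but not on the smallest gap. Raw differences also depend on each feature's scale, so a standardized difference may be a fairer basis for comparison.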

dat__diagnosis_mean["area_worst"]

diagnosis
0.0      558.899440
1.0     1422.286321
diff    -863.386881
Name: area_worst, dtype: float64

dat__diagnosis_std["area_worst"]

diagnosis
0.0    163.601424
1.0    597.967743
Name: area_worst, dtype: float64

dat__diagnosis_mean["texture_se"]

diagnosis
0.0     1.220380
1.0     1.210915
diff    0.009465
Name: texture_se, dtype: float64

dat__diagnosis_std["texture_se"]

diagnosis
0.0    0.589180
1.0    0.483178
Name: texture_se, dtype: float64
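The group means and standard deviations above are enough to compute both forms of the two sample t statistic by hand. The following sketch assumes that group 0 is the benign sample (357 patients) and group 1 the malignant sample (212 patients); the function names `welch_t` and `pooled_t` are hypothetical and are not taken from the `diagnosis.twosample` module.

```python
from math import sqrt


def welch_t(mean1, std1, n1, mean2, std2, n2):
    """t statistic when the two population variances are NOT assumed equal."""
    standard_error = sqrt(std1**2 / n1 + std2**2 / n2)
    return (mean1 - mean2) / standard_error


def pooled_t(mean1, std1, n1, mean2, std2, n2):
    """t statistic when the two populations are assumed to share one variance."""
    pooled_var = ((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / (n1 + n2 - 2)
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error


# Assumed group sizes: 357 benign (group 0) and 212 malignant (group 1).
n_benign, n_malignant = 357, 212

# area_worst: the standard deviations differ a lot, so use the Welch form.
t_area = welch_t(558.899440, 163.601424, n_benign,
                 1422.286321, 597.967743, n_malignant)

# texture_se: the standard deviations are similar, so use the pooled form.
t_texture = pooled_t(1.220380, 0.589180, n_benign,
                     1.210915, 0.483178, n_malignant)

print(t_area, t_texture)
```

The p-values additionally require the t distribution, which the Python standard library does not provide, so they are left to the notebook's own test function.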

The plot below depicts the count distribution of ‘Worst Area’ for patients with malignant cancer and for those with benign cancer. We see that the malignant group has both a higher variance and a higher mean than the benign group. The distribution for malignant cancer, shown in blue below, appears somewhat normally distributed with a bit of a right tail; for the benign cancer patients, however, the worst area variable is concentrated to the left with a right tail. Overall, neither distribution is normally distributed. To address this, we will perform both the parametric and the non-parametric hypothesis tests.

plt_by_diagnosis(dat, "area_worst", "Worst Area")

../_images/two_populations_analysis_18_0.png
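A right tail like the one described above can also be summarized numerically with the sample skewness, which is near zero for symmetric data and positive when the right tail is longer. This is a minimal sketch on made-up numbers, not on the notebook's data; `sample_skewness` is a hypothetical helper.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment over variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    # Second and third central moments (dividing by n, not n - 1).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2**1.5


# Toy data, for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 3, 4, 10]

print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```

Applied to each diagnosis group's column, a value well above zero would support the visual impression of a right-skewed distribution.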

The plot below depicts the count distribution of ‘Texture SE’ for patients with malignant cancer and for those with benign cancer. We see that the distribution for patients with malignant cancer is very similar to the one for patients with benign cancer, although the benign group has a slightly longer right tail. Both distributions appear to be approximately normally distributed, though each has a right-skewed tail. One may argue that the distributions are close enough to normal for the two sample t test to work; either way, we will perform both the parametric and the non-parametric hypothesis tests.

plt_by_diagnosis(dat, "texture_se", "Texture SE")

../_images/two_populations_analysis_20_0.png

We perform the hypothesis tests below with an alpha of 0.05.

We see that the two sample t test for ‘area worst’ is statistically highly significant at alpha = 0.05, with a p-value very close to zero. Moreover, the non-parametric Wilcoxon rank sum test agrees with the outcome of the parametric test. This means that we reject the null hypothesis that the two populations have the same area worst mean, which suggests that doctors may utilize this variable as an indicator of benign vs malignant cancer.

area_worst_ht_p = two_sample_t_test(dat, "diagnosis", "area_worst", False, True)
area_worst_ht_p

Statistically Highly Significant, Reject Null Hypohesis

Ttest_indResult(statistic=-20.570814251119344, pvalue=4.937923843586185e-54)

area_worst_ht_np = two_sample_t_test(dat, "diagnosis", "area_worst", False, False)
area_worst_ht_np

Statistically Highly Significant, Reject Null Hypohesis

RanksumsResult(statistic=-18.75402925190002, pvalue=1.7946452985715502e-78)

We see that the two sample t test for ‘texture se’ is not statistically significant at alpha = 0.05, with a p-value of approximately 0.84. Moreover, the non-parametric Wilcoxon rank sum test agrees with the outcome of the parametric test. This means that we fail to reject the null hypothesis that the two populations have the same texture se mean, which suggests that doctors may not find this variable very helpful as an indicator of benign vs malignant cancer.

texture_se_ht_p = two_sample_t_test(dat, "diagnosis", "texture_se", True, True)
texture_se_ht_p

Fail to Reject Null Hypohesis

Ttest_indResult(statistic=0.1977238031013334, pvalue=0.8433320287670163)

texture_se_ht_np = two_sample_t_test(dat, "diagnosis", "texture_se", True, False)
texture_se_ht_np

Fail to Reject Null Hypohesis

RanksumsResult(statistic=-0.46280525524255145, pvalue=0.6435039640045692)

ht_dt = pd.DataFrame({"area worst parametric": area_worst_ht_p, "area worst non-parametric": area_worst_ht_np, "texture se parametric": texture_se_ht_p, "texture se non-parametric": texture_se_ht_np}).T
ht_dt.columns = ["T Statistic", "P Value"]
ht_dt.to_pickle("../tables/ht_results.pkl")
