{ "cells": [ { "cell_type": "markdown", "id": "6e2ee970-8eab-48a7-95f7-83cbb8dd15f7", "metadata": {}, "source": [ "# Data Visualization\n", "In this notebook, we visualize and explore the cleaned data." ] }, { "cell_type": "markdown", "id": "5d7361fd-9210-480a-935a-94df0c186d19", "metadata": {}, "source": [ "## Load the data" ] }, { "cell_type": "code", "execution_count": 1, "id": "4da2939f-58ac-4265-8d9e-d60404230650", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id diagnosis radius_mean texture_mean perimeter_mean area_mean \\\n", "0 842302 1.0 17.99 10.38 122.80 1001.0 \n", "1 842517 1.0 20.57 17.77 132.90 1326.0 \n", "2 84300903 1.0 19.69 21.25 130.00 1203.0 \n", "3 84348301 1.0 11.42 20.38 77.58 386.1 \n", "4 84358402 1.0 20.29 14.34 135.10 1297.0 \n", "\n", " smoothness_mean compactness_mean concavity_mean concave points_mean \\\n", "0 0.11840 0.27760 0.3001 0.14710 \n", "1 0.08474 0.07864 0.0869 0.07017 \n", "2 0.10960 0.15990 0.1974 0.12790 \n", "3 0.14250 0.28390 0.2414 0.10520 \n", "4 0.10030 0.13280 0.1980 0.10430 \n", "\n", " ... radius_worst texture_worst perimeter_worst area_worst \\\n", "0 ... 25.38 17.33 184.60 2019.0 \n", "1 ... 24.99 23.41 158.80 1956.0 \n", "2 ... 23.57 25.53 152.50 1709.0 \n", "3 ... 14.91 26.50 98.87 567.7 \n", "4 ... 22.54 16.67 152.20 1575.0 \n", "\n", " smoothness_worst compactness_worst concavity_worst concave points_worst \\\n", "0 0.1622 0.6656 0.7119 0.2654 \n", "1 0.1238 0.1866 0.2416 0.1860 \n", "2 0.1444 0.4245 0.4504 0.2430 \n", "3 0.2098 0.8663 0.6869 0.2575 \n", "4 0.1374 0.2050 0.4000 0.1625 \n", "\n", " symmetry_worst fractal_dimension_worst \n", "0 0.4601 0.11890 \n", "1 0.2750 0.08902 \n", "2 0.3613 0.08758 \n", "3 0.6638 0.17300 \n", "4 0.2364 0.07678 \n", "\n", "[5 rows x 32 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "clean_data = pd.read_csv(\"../data/clean.csv\")\n", "clean_data.head()" ] }, { "cell_type": "markdown", "id": "86e1be47-4196-4190-93c7-473b9804c86f", "metadata": {}, "source": [ "Since the feature `id` is irrelevant, we drop it from our data." 
] }, { "cell_type": "code", "execution_count": 2, "id": "30427c0c-934f-4019-895b-9d68b8435813", "metadata": {}, "outputs": [], "source": [ "clean_data = clean_data.drop(\"id\", axis=1)" ] }, { "cell_type": "markdown", "id": "e9c3dad9-dc33-4ec4-935c-5b2b9b590cd6", "metadata": {}, "source": [ "## Analysis" ] }, { "cell_type": "markdown", "id": "8248692b-e748-4fd8-baf4-7b38a9f3ebc7", "metadata": {}, "source": [ "### General" ] }, { "cell_type": "markdown", "id": "20635049-75a9-4483-95fd-1340d6b7b542", "metadata": {}, "source": [ "Let us first find out what features we have." ] }, { "cell_type": "code", "execution_count": 3, "id": "4603f654-a987-4c92-b74e-8ae65aa14e22", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',\n", " 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',\n", " 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',\n", " 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',\n", " 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',\n", " 'fractal_dimension_se', 'radius_worst', 'texture_worst',\n", " 'perimeter_worst', 'area_worst', 'smoothness_worst',\n", " 'compactness_worst', 'concavity_worst', 'concave points_worst',\n", " 'symmetry_worst', 'fractal_dimension_worst'],\n", " dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clean_data.columns" ] }, { "cell_type": "markdown", "id": "bfb6b318-9853-46c3-ad39-b40615136f24", "metadata": {}, "source": [ "Apart from the dropped `id` column and the response variable `diagnosis`, there are 30 features in total, which can be split into three groups: mean (`_mean`), standard error (`_se`), and \"worst\" or largest (`_worst`). 
In each group, we have `radius`, `texture`, `perimeter`, `area`, `smoothness`, `compactness`, `concavity`, `concave points`, `symmetry`, and `fractal dimension`.\n", "\n", "Since the three groups of features describe similar things, we might want to inspect the collinearity between each pair of features by computing the correlation matrix." ] }, { "cell_type": "code", "execution_count": 4, "id": "0b5808db-010d-4902-8a3b-b8fbff5db5c5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " radius_mean texture_mean perimeter_mean area_mean \\\n", "radius_mean 1.000000 0.323782 0.997855 0.987357 \n", "texture_mean 0.323782 1.000000 0.329533 0.321086 \n", "perimeter_mean 0.997855 0.329533 1.000000 0.986507 \n", "area_mean 0.987357 0.321086 0.986507 1.000000 \n", "smoothness_mean 0.170581 -0.023389 0.207278 0.177028 \n", "compactness_mean 0.506124 0.236702 0.556936 0.498502 \n", "concavity_mean 0.676764 0.302418 0.716136 0.685983 \n", "concave points_mean 0.822529 0.293464 0.850977 0.823269 \n", "symmetry_mean 0.147741 0.071401 0.183027 0.151293 \n", "fractal_dimension_mean -0.311631 -0.076437 -0.261477 -0.283110 \n", "radius_se 0.679090 0.275869 0.691765 0.732562 \n", "texture_se -0.097317 0.386358 -0.086761 -0.066280 \n", "perimeter_se 0.674172 0.281673 0.693135 0.726628 \n", "area_se 0.735864 0.259845 0.744983 0.800086 \n", "smoothness_se -0.222600 0.006614 -0.202694 -0.166777 \n", "compactness_se 0.206000 0.191975 0.250744 0.212583 \n", "concavity_se 0.194204 0.143293 0.228082 0.207660 \n", "concave points_se 0.376169 0.163851 0.407217 0.372320 \n", "symmetry_se -0.104321 0.009127 -0.081629 -0.072497 \n", "fractal_dimension_se -0.042641 0.054458 -0.005523 -0.019887 \n", "radius_worst 0.969539 0.352573 0.969476 0.962746 \n", "texture_worst 0.297008 0.912045 0.303038 0.287489 \n", "perimeter_worst 0.965137 0.358040 0.970387 0.959120 \n", "area_worst 0.941082 0.343546 0.941550 0.959213 \n", "smoothness_worst 0.119616 0.077503 0.150549 0.123523 \n", "compactness_worst 0.413463 0.277830 0.455774 0.390410 \n", "concavity_worst 0.526911 0.301025 0.563879 0.512606 \n", "concave points_worst 0.744214 0.295316 0.771241 0.722017 \n", "symmetry_worst 0.163953 0.105008 0.189115 0.143570 \n", "fractal_dimension_worst 0.007066 0.119205 0.051019 0.003738 \n", "\n", " smoothness_mean compactness_mean concavity_mean \\\n", "radius_mean 0.170581 0.506124 0.676764 \n", "texture_mean -0.023389 0.236702 0.302418 \n", "perimeter_mean 0.207278 
0.556936 0.716136 \n", "area_mean 0.177028 0.498502 0.685983 \n", "smoothness_mean 1.000000 0.659123 0.521984 \n", "compactness_mean 0.659123 1.000000 0.883121 \n", "concavity_mean 0.521984 0.883121 1.000000 \n", "concave points_mean 0.553695 0.831135 0.921391 \n", "symmetry_mean 0.557775 0.602641 0.500667 \n", "fractal_dimension_mean 0.584792 0.565369 0.336783 \n", "radius_se 0.301467 0.497473 0.631925 \n", "texture_se 0.068406 0.046205 0.076218 \n", "perimeter_se 0.296092 0.548905 0.660391 \n", "area_se 0.246552 0.455653 0.617427 \n", "smoothness_se 0.332375 0.135299 0.098564 \n", "compactness_se 0.318943 0.738722 0.670279 \n", "concavity_se 0.248396 0.570517 0.691270 \n", "concave points_se 0.380676 0.642262 0.683260 \n", "symmetry_se 0.200774 0.229977 0.178009 \n", "fractal_dimension_se 0.283607 0.507318 0.449301 \n", "radius_worst 0.213120 0.535315 0.688236 \n", "texture_worst 0.036072 0.248133 0.299879 \n", "perimeter_worst 0.238853 0.590210 0.729565 \n", "area_worst 0.206718 0.509604 0.675987 \n", "smoothness_worst 0.805324 0.565541 0.448822 \n", "compactness_worst 0.472468 0.865809 0.754968 \n", "concavity_worst 0.434926 0.816275 0.884103 \n", "concave points_worst 0.503053 0.815573 0.861323 \n", "symmetry_worst 0.394309 0.510223 0.409464 \n", "fractal_dimension_worst 0.499316 0.687382 0.514930 \n", "\n", " concave points_mean symmetry_mean \\\n", "radius_mean 0.822529 0.147741 \n", "texture_mean 0.293464 0.071401 \n", "perimeter_mean 0.850977 0.183027 \n", "area_mean 0.823269 0.151293 \n", "smoothness_mean 0.553695 0.557775 \n", "compactness_mean 0.831135 0.602641 \n", "concavity_mean 0.921391 0.500667 \n", "concave points_mean 1.000000 0.462497 \n", "symmetry_mean 0.462497 1.000000 \n", "fractal_dimension_mean 0.166917 0.479921 \n", "radius_se 0.698050 0.303379 \n", "texture_se 0.021480 0.128053 \n", "perimeter_se 0.710650 0.313893 \n", "area_se 0.690299 0.223970 \n", "smoothness_se 0.027653 0.187321 \n", "compactness_se 0.490424 0.421659 \n", 
"concavity_se 0.439167 0.342627 \n", "concave points_se 0.615634 0.393298 \n", "symmetry_se 0.095351 0.449137 \n", "fractal_dimension_se 0.257584 0.331786 \n", "radius_worst 0.830318 0.185728 \n", "texture_worst 0.292752 0.090651 \n", "perimeter_worst 0.855923 0.219169 \n", "area_worst 0.809630 0.177193 \n", "smoothness_worst 0.452753 0.426675 \n", "compactness_worst 0.667454 0.473200 \n", "concavity_worst 0.752399 0.433721 \n", "concave points_worst 0.910155 0.430297 \n", "symmetry_worst 0.375744 0.699826 \n", "fractal_dimension_worst 0.368661 0.438413 \n", "\n", " fractal_dimension_mean ... radius_worst \\\n", "radius_mean -0.311631 ... 0.969539 \n", "texture_mean -0.076437 ... 0.352573 \n", "perimeter_mean -0.261477 ... 0.969476 \n", "area_mean -0.283110 ... 0.962746 \n", "smoothness_mean 0.584792 ... 0.213120 \n", "compactness_mean 0.565369 ... 0.535315 \n", "concavity_mean 0.336783 ... 0.688236 \n", "concave points_mean 0.166917 ... 0.830318 \n", "symmetry_mean 0.479921 ... 0.185728 \n", "fractal_dimension_mean 1.000000 ... -0.253691 \n", "radius_se 0.000111 ... 0.715065 \n", "texture_se 0.164174 ... -0.111690 \n", "perimeter_se 0.039830 ... 0.697201 \n", "area_se -0.090170 ... 0.757373 \n", "smoothness_se 0.401964 ... -0.230691 \n", "compactness_se 0.559837 ... 0.204607 \n", "concavity_se 0.446630 ... 0.186904 \n", "concave points_se 0.341198 ... 0.358127 \n", "symmetry_se 0.345007 ... -0.128121 \n", "fractal_dimension_se 0.688132 ... -0.037488 \n", "radius_worst -0.253691 ... 1.000000 \n", "texture_worst -0.051269 ... 0.359921 \n", "perimeter_worst -0.205151 ... 0.993708 \n", "area_worst -0.231854 ... 0.984015 \n", "smoothness_worst 0.504942 ... 0.216574 \n", "compactness_worst 0.458798 ... 0.475820 \n", "concavity_worst 0.346234 ... 0.573975 \n", "concave points_worst 0.175325 ... 0.787424 \n", "symmetry_worst 0.334019 ... 0.243529 \n", "fractal_dimension_worst 0.767297 ... 
0.093492 \n", "\n", " texture_worst perimeter_worst area_worst \\\n", "radius_mean 0.297008 0.965137 0.941082 \n", "texture_mean 0.912045 0.358040 0.343546 \n", "perimeter_mean 0.303038 0.970387 0.941550 \n", "area_mean 0.287489 0.959120 0.959213 \n", "smoothness_mean 0.036072 0.238853 0.206718 \n", "compactness_mean 0.248133 0.590210 0.509604 \n", "concavity_mean 0.299879 0.729565 0.675987 \n", "concave points_mean 0.292752 0.855923 0.809630 \n", "symmetry_mean 0.090651 0.219169 0.177193 \n", "fractal_dimension_mean -0.051269 -0.205151 -0.231854 \n", "radius_se 0.194799 0.719684 0.751548 \n", "texture_se 0.409003 -0.102242 -0.083195 \n", "perimeter_se 0.200371 0.721031 0.730713 \n", "area_se 0.196497 0.761213 0.811408 \n", "smoothness_se -0.074743 -0.217304 -0.182195 \n", "compactness_se 0.143003 0.260516 0.199371 \n", "concavity_se 0.100241 0.226680 0.188353 \n", "concave points_se 0.086741 0.394999 0.342271 \n", "symmetry_se -0.077473 -0.103753 -0.110343 \n", "fractal_dimension_se -0.003195 -0.001000 -0.022736 \n", "radius_worst 0.359921 0.993708 0.984015 \n", "texture_worst 1.000000 0.365098 0.345842 \n", "perimeter_worst 0.365098 1.000000 0.977578 \n", "area_worst 0.345842 0.977578 1.000000 \n", "smoothness_worst 0.225429 0.236775 0.209145 \n", "compactness_worst 0.360832 0.529408 0.438296 \n", "concavity_worst 0.368366 0.618344 0.543331 \n", "concave points_worst 0.359755 0.816322 0.747419 \n", "symmetry_worst 0.233027 0.269493 0.209146 \n", "fractal_dimension_worst 0.219122 0.138957 0.079647 \n", "\n", " smoothness_worst compactness_worst concavity_worst \\\n", "radius_mean 0.119616 0.413463 0.526911 \n", "texture_mean 0.077503 0.277830 0.301025 \n", "perimeter_mean 0.150549 0.455774 0.563879 \n", "area_mean 0.123523 0.390410 0.512606 \n", "smoothness_mean 0.805324 0.472468 0.434926 \n", "compactness_mean 0.565541 0.865809 0.816275 \n", "concavity_mean 0.448822 0.754968 0.884103 \n", "concave points_mean 0.452753 0.667454 0.752399 \n", "symmetry_mean 
0.426675 0.473200 0.433721 \n", "fractal_dimension_mean 0.504942 0.458798 0.346234 \n", "radius_se 0.141919 0.287103 0.380585 \n", "texture_se -0.073658 -0.092439 -0.068956 \n", "perimeter_se 0.130054 0.341919 0.418899 \n", "area_se 0.125389 0.283257 0.385100 \n", "smoothness_se 0.314457 -0.055558 -0.058298 \n", "compactness_se 0.227394 0.678780 0.639147 \n", "concavity_se 0.168481 0.484858 0.662564 \n", "concave points_se 0.215351 0.452888 0.549592 \n", "symmetry_se -0.012662 0.060255 0.037119 \n", "fractal_dimension_se 0.170568 0.390159 0.379975 \n", "radius_worst 0.216574 0.475820 0.573975 \n", "texture_worst 0.225429 0.360832 0.368366 \n", "perimeter_worst 0.236775 0.529408 0.618344 \n", "area_worst 0.209145 0.438296 0.543331 \n", "smoothness_worst 1.000000 0.568187 0.518523 \n", "compactness_worst 0.568187 1.000000 0.892261 \n", "concavity_worst 0.518523 0.892261 1.000000 \n", "concave points_worst 0.547691 0.801080 0.855434 \n", "symmetry_worst 0.493838 0.614441 0.532520 \n", "fractal_dimension_worst 0.617624 0.810455 0.686511 \n", "\n", " concave points_worst symmetry_worst \\\n", "radius_mean 0.744214 0.163953 \n", "texture_mean 0.295316 0.105008 \n", "perimeter_mean 0.771241 0.189115 \n", "area_mean 0.722017 0.143570 \n", "smoothness_mean 0.503053 0.394309 \n", "compactness_mean 0.815573 0.510223 \n", "concavity_mean 0.861323 0.409464 \n", "concave points_mean 0.910155 0.375744 \n", "symmetry_mean 0.430297 0.699826 \n", "fractal_dimension_mean 0.175325 0.334019 \n", "radius_se 0.531062 0.094543 \n", "texture_se -0.119638 -0.128215 \n", "perimeter_se 0.554897 0.109930 \n", "area_se 0.538166 0.074126 \n", "smoothness_se -0.102007 -0.107342 \n", "compactness_se 0.483208 0.277878 \n", "concavity_se 0.440472 0.197788 \n", "concave points_se 0.602450 0.143116 \n", "symmetry_se -0.030413 0.389402 \n", "fractal_dimension_se 0.215204 0.111094 \n", "radius_worst 0.787424 0.243529 \n", "texture_worst 0.359755 0.233027 \n", "perimeter_worst 0.816322 0.269493 \n", 
"area_worst 0.747419 0.209146 \n", "smoothness_worst 0.547691 0.493838 \n", "compactness_worst 0.801080 0.614441 \n", "concavity_worst 0.855434 0.532520 \n", "concave points_worst 1.000000 0.502528 \n", "symmetry_worst 0.502528 1.000000 \n", "fractal_dimension_worst 0.511114 0.537848 \n", "\n", " fractal_dimension_worst \n", "radius_mean 0.007066 \n", "texture_mean 0.119205 \n", "perimeter_mean 0.051019 \n", "area_mean 0.003738 \n", "smoothness_mean 0.499316 \n", "compactness_mean 0.687382 \n", "concavity_mean 0.514930 \n", "concave points_mean 0.368661 \n", "symmetry_mean 0.438413 \n", "fractal_dimension_mean 0.767297 \n", "radius_se 0.049559 \n", "texture_se -0.045655 \n", "perimeter_se 0.085433 \n", "area_se 0.017539 \n", "smoothness_se 0.101480 \n", "compactness_se 0.590973 \n", "concavity_se 0.439329 \n", "concave points_se 0.310655 \n", "symmetry_se 0.078079 \n", "fractal_dimension_se 0.591328 \n", "radius_worst 0.093492 \n", "texture_worst 0.219122 \n", "perimeter_worst 0.138957 \n", "area_worst 0.079647 \n", "smoothness_worst 0.617624 \n", "compactness_worst 0.810455 \n", "concavity_worst 0.686511 \n", "concave points_worst 0.511114 \n", "symmetry_worst 0.537848 \n", "fractal_dimension_worst 1.000000 \n", "\n", "[30 rows x 30 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clean_data.drop(\"diagnosis\", axis=1).corr()" ] }, { "cell_type": "markdown", "id": "0e03232b-9f68-4ca2-89a9-e4250c74f9d6", "metadata": {}, "source": [ "We can see some pairs of features have relatively high correlation, such as `radius_mean` vs `radius_worst` and `perimeter_mean` vs `radius_mean`. This can give us a warning about variability and stability for later computation and statistical analysis." ] }, { "cell_type": "markdown", "id": "1e049a79-7bd0-47fc-ae00-fe11a560baac", "metadata": {}, "source": [ "Then, we print out some statistics of each column or feature. 
Since all features are numeric, the `describe()` method works for each column." ] }, { "cell_type": "code", "execution_count": 5, "id": "37c6eec1-5896-43c5-9d60-84b99c4f5539", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " diagnosis radius_mean texture_mean perimeter_mean area_mean \\\n", "count 569.000000 569.000000 569.000000 569.000000 569.000000 \n", "mean 0.372583 14.127292 19.289649 91.969033 654.889104 \n", "std 0.483918 3.524049 4.301036 24.298981 351.914129 \n", "min 0.000000 6.981000 9.710000 43.790000 143.500000 \n", "25% 0.000000 11.700000 16.170000 75.170000 420.300000 \n", "50% 0.000000 13.370000 18.840000 86.240000 551.100000 \n", "75% 1.000000 15.780000 21.800000 104.100000 782.700000 \n", "max 1.000000 28.110000 39.280000 188.500000 2501.000000 \n", "\n", " smoothness_mean compactness_mean concavity_mean concave points_mean \\\n", "count 569.000000 569.000000 569.000000 569.000000 \n", "mean 0.096360 0.104341 0.088799 0.048919 \n", "std 0.014064 0.052813 0.079720 0.038803 \n", "min 0.052630 0.019380 0.000000 0.000000 \n", "25% 0.086370 0.064920 0.029560 0.020310 \n", "50% 0.095870 0.092630 0.061540 0.033500 \n", "75% 0.105300 0.130400 0.130700 0.074000 \n", "max 0.163400 0.345400 0.426800 0.201200 \n", "\n", " symmetry_mean ... radius_worst texture_worst perimeter_worst \\\n", "count 569.000000 ... 569.000000 569.000000 569.000000 \n", "mean 0.181162 ... 16.269190 25.677223 107.261213 \n", "std 0.027414 ... 4.833242 6.146258 33.602542 \n", "min 0.106000 ... 7.930000 12.020000 50.410000 \n", "25% 0.161900 ... 13.010000 21.080000 84.110000 \n", "50% 0.179200 ... 14.970000 25.410000 97.660000 \n", "75% 0.195700 ... 18.790000 29.720000 125.400000 \n", "max 0.304000 ... 
36.040000 49.540000 251.200000 \n", "\n", " area_worst smoothness_worst compactness_worst concavity_worst \\\n", "count 569.000000 569.000000 569.000000 569.000000 \n", "mean 880.583128 0.132369 0.254265 0.272188 \n", "std 569.356993 0.022832 0.157336 0.208624 \n", "min 185.200000 0.071170 0.027290 0.000000 \n", "25% 515.300000 0.116600 0.147200 0.114500 \n", "50% 686.500000 0.131300 0.211900 0.226700 \n", "75% 1084.000000 0.146000 0.339100 0.382900 \n", "max 4254.000000 0.222600 1.058000 1.252000 \n", "\n", " concave points_worst symmetry_worst fractal_dimension_worst \n", "count 569.000000 569.000000 569.000000 \n", "mean 0.114606 0.290076 0.083946 \n", "std 0.065732 0.061867 0.018061 \n", "min 0.000000 0.156500 0.055040 \n", "25% 0.064930 0.250400 0.071460 \n", "50% 0.099930 0.282200 0.080040 \n", "75% 0.161400 0.317900 0.092080 \n", "max 0.291000 0.663800 0.207500 \n", "\n", "[8 rows x 31 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clean_data.describe()" ] }, { "cell_type": "markdown", "id": "32457d1a-e211-4124-ba0b-755631ab5e73", "metadata": {}, "source": [ "The `diagnosis` column has mean $0.3726$. Since we encode benign as `0` and malignant as `1`, this means that about $37.26\\%$ of the tumors in our data are malignant. \n", "\n", "Since the features are numerical, we can also look at their distributions. By the central limit theorem, features that are averages of many underlying measurements can be expected to be approximately normally distributed, so some of our features may look roughly normal." ] }, { "cell_type": "code", "execution_count": 6, "id": "9eb796d3-e659-4423-9106-5b2de8bf086d", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEXCAYAAABCjVgAAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAYpUlEQVR4nO3deZhkdX3v8feHYXEBFGQgyDagJDr66Kgj6DUqER9BUcFEEa/oxOUSI25XvXGIBokJEU00UeOGiiAuSC6iqDdGJAKaKDggIEuIiKOMjMywiICKzPC9f5zTh6Ltnqlhuur00O/X89TTp06d5dunq+tTv9859atUFZIkAWzWdwGSpNnDUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUNCMSfKRJH81Q9vaPcmtSea1989O8sqZ2Ha7vX9NsmSmtrcB+/3bJNcn+fm49y0NI354TcNIshzYCVgDrAUuBz4FHF9Vd96Dbb2yqr6xAeucDXy6qj6+Iftq1z0GeGhVHb6h686kJLsB/w3sUVWr+qxlNklyIrCiqt7Wdy2ypaAN85yq2gbYAzgOeAvwiZneSZLNZ3qbs8QewA0zEQhpbBL/v/fiv+e9U1V587beG7AcePqkefsAdwKPbO+fCPxtO70D8BXgF8CNwLdo3oSc3K7za+BW4C+ABUABrwB+Cpw7MG/zdntnA+8EzgduBr4EbN8+th/NO83fqRc4EPgtcEe7v4sHtvfKdnoz4G3AT4BVNC2gB7SPTdSxpK3teuCt6zhOD2jXX91u723t9p/e/s53tnWcOMW627XHbDVwUzu968DjZwPHAv/RbuuhwMOAM9tjfCVw6MDyBwHfB34JXAMcM8Tf+STgTe30Lu3v/ur2/kPb/Uz0MPwv4Kp23hnAgwe2U8CRwA+BHwMB/rE9vjcDlwCPBI5o/za/bY/Ll/t+rs/12ybxTkOzU1WdD6wAnjzFw29qH5tP0+30l80q9RKaF9fnVNXWVfXugXWeCjwcOGCaXb4UeDnwYJpurPcPUePXgL8DPt/u79FTLPan7e2PgL2ArYF/nrTMHwJ/AOwPHJ3k4dPs8gM0wbBX+/u8FHhZNV1lzwSubev40ynW3Qz4JE2LYneaF/7JdbyE5oV0G5rwOBP4LLAj8CLgQ0ke0S57W7v/B9IExJ8nOWSauiecQxOytPVf3f4EeArwraqqJE+jCelDgZ1pAvCUSds6BNgXWAg8o13/99t6XkjTajoe+Azw7va4PGc99WnEDAVtrGuB7aeYfwfNi8UeVXVHVX2r2reQ63BMVd1WVb+e5vGTq+rSqroN+Cvg0IkT0RvpxcB7q+rqqroVOAo4bFK3x19X1a+r6mLgYuB3wqWt5YXAUVV1S1UtB95D80K+XlV1Q1WdVlW/qqpbaFoFT5202IlVdVlVraFpBS2vqk9W1ZqquhA4DXh+u72zq+oHVXVnVV0CfG6K7U12DvDktmvqKcC7gSe1jz21fRyaY3ZCVV1YVbfTHLMnJlkwsK13VtWN7d/zDpogexhNS+OKqlo5zHHReBkK2li70HQfTPb3NF0LX09ydZKlQ2zrmg14/CfAFjTdVBvrwe32Bre9OU0LZ8Lg1UK/omlNTLYDsOUU29plmCKS3C/JR5P8JMkvabrRHjgp+AaPwR7Avkl+MXGjebH+vXZ7+yb5ZpLVSW4GXsV6jldV/YimG2cRTQvwK8C1Sf6Au4fC3Y5ZG6Y3TPpdrxl4/N9pWj0fBK5LcnySbYc5LhovQ0H3WJLH07wIfHvyY+075TdV1V7Ac4A3Jtl/4uFpNrm+lsRuA9O707z7vJ6mm+R+A3XNo+m2Gna719K8wA5uew1w3XrWm+z6tqbJ2/rZkOu/iaaLat+q2pbmnTo0/fETB
n+Xa4BzquqBA7etq+rP28c/S9PXv1tVPQD4yKRtTeccmtbGllX1s/b+S2nOeVzULnO3Y5bk/sCDJv2udzvuVfX+qnoc8AiabqT/M9Vy6pehoA2WZNskz6bpQ/50Vf1gimWeneShSUJzonNte4PmxXave7Drw5MsTHI/4B3A/62qtTSXed4nyUFJtqA5ubvVwHrXAQvWcbXO54D/nWTPJFtz1zmINRtSXFvLqcCxSbZJsgfwRuDTQ25iG5rzCL9Isj3w9vUs/xXg95O8JMkW7e3xA+c7tgFurKrfJNkH+J9D1nEO8Bqalgo0J7hfC3y7/R2hCZyXJVmUZCuaY3Ze22X2O9q69m3/PrcBv2Hjnw8aAUNBG+LLSW6heYf6VuC9wMumWXZv4Bs0XRHfAT5UVWe3j70TeFvb5fHmDdj/yTRXOP0cuA/wOoCquhl4NfBxmneqt9Gc5J7wL+3PG5JcOMV2T2i3fS7NlTK/oXkRvCde2+7/apoW1Gfb7Q/jn4D70rQ4vgt8bV0Lt+cdngEcRvPO/efAu7grEF8NvKP9mx1NE1jDOIcmUCZC4ds0LbGJ+1TVWTTndU4DVgIPaeuYzrbAx2iuqvoJTVfTP7SPfQJY2D4fvjhkjRoRP7wmSerYUpAkdQwFaY5J8uJ2XKnJt8v6rk39s/tIktSxpSBJ6mzSA1XtsMMOtWDBgr7LkKRNygUXXHB9Vc2f6rFNOhQWLFjAsmXL+i5DkjYpSX4y3WN2H0mSOoaCJKljKEiSOoaCJKljKEiSOoaCJKljKEiSOoaCJKmzSX94bWMtWPrVXva7/LiDetmvJK2PLQVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUmdkoZBktyTfTHJFksuSvL6dv32SM5P8sP253cA6RyW5KsmVSQ4YVW2SpKmNsqWwBnhTVT0ceAJwZJKFwFLgrKraGzirvU/72GHAI4ADgQ8lmTfC+iRJk4wsFKpqZVVd2E7fAlwB7AIcDJzULnYScEg7fTBwSlXdXlU/Bq4C9hlVfZKk3zWWcwpJFgCPAc4DdqqqldAEB7Bju9guwDUDq61o50mSxmTkoZBka+A04A1V9ct1LTrFvJpie0ckWZZk2erVq2eqTEkSIw6FJFvQBMJnquoL7ezrkuzcPr4zsKqdvwLYbWD1XYFrJ2+zqo6vqsVVtXj+/PmjK16S5qBRXn0U4BPAFVX13oGHzgCWtNNLgC8NzD8syVZJ9gT2Bs4fVX2SpN+1+Qi3/STgJcAPklzUzvtL4Djg1CSvAH4KvACgqi5LcipwOc2VS0dW1doR1idJmmRkoVBV32bq8wQA+0+zzrHAsaOqSZK0bn6iWZLUGWX3kaaxYOlXe9nv8uMO6mW/kjYdthQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSZ2RhUKSE5KsSnLpwLxjkvwsyUXt7VkDjx2V5KokVyY5YFR1SZKmN8qWwonAgVPM/8eqWtTe/h9AkoXAYcAj2nU+lGTeCGuTJE1hZKFQVecCNw65+MHAKVV1e1X9GLgK2GdUtUmSptbHOYXXJLmk7V7arp23C3DNwDIr2nmSpDEadyh8GHgIsAhYCbynnZ8plq2pNpDkiCTLkixbvXr1SIqUpLlqrKFQVddV1dqquhP4GHd1Ea0AdhtYdFfg2mm2cXxVLa6qxfPnzx9twZI0x4w1FJLsPHD3ecDElUlnAIcl2SrJnsDewPnjrE2SBJuPasNJPgfsB+yQZAXwdmC/JItouoaWA
38GUFWXJTkVuBxYAxxZVWtHVZskaWojC4WqetEUsz+xjuWPBY4dVT2SpPXzE82SpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqDBUKSc4aZp4kadO2zu9TSHIf4H40X5SzHXd9l/K2wINHXJskaczW9yU7fwa8gSYALuCuUPgl8MHRlSVJ6sM6Q6Gq3ge8L8lrq+oDY6pJktSTob6Os6o+kOR/AAsG16mqT42oLklSD4YKhSQnAw8BLgLWtrMLMBQk6V5kqFAAFgMLq6pGWYwkqV/Dfk7hUuD3RlmIJKl/w7YUdgAuT3I+cPvEzKp67kiqkiT1YthQOGaURUiSZodhrz46Z9SFSJL6N+zVR7fQXG0EsCWwBXBbVW07qsIkSeM3bEthm8H7SQ4B9hlFQZKk/tyjUVKr6ovA02a2FElS34btPvrjgbub0Xxuwc8sSNK9zLBXHz1nYHoNsBw4eMarkST1athzCi8bdSGSpP4N+yU7uyY5PcmqJNclOS3JrqMuTpI0XsOeaP4kcAbN9yrsAny5nSdJuhcZNhTmV9Unq2pNezsRmD/CuiRJPRg2FK5PcniSee3tcOCGURYmSRq/YUPh5cChwM+BlcDzAU8+S9K9zLCXpP4NsKSqbgJIsj3wDzRhIUm6lxi2pfCoiUAAqKobgceMpiRJUl+GDYXNkmw3cadtKayzlZHkhPYS1ksH10tyZpIftj8Ht3lUkquSXJnkgA39RSRJG2/YUHgP8J9J/ibJO4D/BN69nnVOBA6cNG8pcFZV7Q2c1d4nyULgMOAR7TofSjJvyNokSTNkqFCoqk8BfwJcB6wG/riqTl7POucCN06afTBwUjt9EnDIwPxTqur2qvoxcBWOwipJYzfsiWaq6nLg8o3c305VtbLd3sokO7bzdwG+O7DcinaeJGmM7tHQ2SOQKeZNOQprkiOSLEuybPXq1SMuS5LmlnGHwnVJdgZof65q568AdhtYblfg2qk2UFXHV9Xiqlo8f74fqpakmTTuUDgDWNJOLwG+NDD/sCRbJdkT2Bs4f8y1SdKcN/Q5hQ2V5HPAfsAOSVYAbweOA05N8grgp8ALAKrqsiSn0pyzWAMcWVVrR1WbJGlqIwuFqnrRNA/tP83yxwLHjqoeSdL6zZYTzZKkWcBQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1Nu+7AI3PgqVf7W3fy487qLd9SxqeLQVJUsdQkCR1euk+SrIcuAVYC6ypqsVJtgc+DywAlgOHVtVNfdQnSXNVny2FP6qqRVW1uL2/FDirqvYGzmrvS5LGaDZ1Hx0MnNROnwQc0l8pkjQ39XX1UQFfT1LAR6vqeGCnqloJUFUrk+zYU20agb6ufPKqJ2nD9BUKT6qqa9sX/jOT/NewKyY5AjgCYPfddx9VfZI0J/XSfVRV17Y/VwGnA/sA1yXZGaD9uWqadY+vqsVVtXj+/PnjKlmS5oSxh0KS+yfZZmIaeAZwKXAGsKRdbAnwpXHXJklzXR/dRzsBpyeZ2P9nq+prSb4HnJrkFcBPgRf0UJskzWljD4Wquhp49BTzbwD2H3c9kqS7zKZLUiVJPTMUJEkdQ0GS1DEUJEkdQ0GS1DEUJEkdQ0GS1DEUJEkdQ0GS1DEUJEkdQ0GS1DEUJEkdQ0GS1DEUJEkdQ0GS1DEUJEkdQ0GS1DEUJEmdPr6jWRqbBUu/2tu+lx93UG/7lu4pWwqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSp
I6jpEoj0tcIrY7Oqo1hS0GS1DEUJEkdQ0GS1DEUJEkdQ0GS1Jl1Vx8lORB4HzAP+HhVHddzSdImpc/vpe6LV1zNnFnVUkgyD/gg8ExgIfCiJAv7rUqS5o7Z1lLYB7iqqq4GSHIKcDBwea9VSdIU+myVjap1NNtCYRfgmoH7K4B9BxdIcgRwRHv31iRXAjsA14+lwuFZ0/BmY12zsSaYnXX1XlPeNeXs3uuawozVNM3vPKw9pntgtoVCpphXd7tTdTxw/N1WSpZV1eJRFrahrGl4s7Gu2VgTzM66ZmNNMDvrmo01TTarzinQtAx2G7i/K3BtT7VI0pwz20Lhe8DeSfZMsiVwGHBGzzVJ0pwxq7qPqmpNktcA/0ZzSeoJVXXZEKsev/5Fxs6ahjcb65qNNcHsrGs21gSzs67ZWNPdpKrWv5QkaU6Ybd1HkqQeGQqSpM4mHQpJDkxyZZKrkiwd876XJ/lBkouSLGvnbZ/kzCQ/bH9uN7D8UW2dVyY5YAbrOCHJqiSXDszb4DqSPK79fa5K8v4kU10evDE1HZPkZ+3xuijJs8Zc025JvpnkiiSXJXl9O7/vYzVdXb0dryT3SXJ+kovbmv66nd/bsVpHTb0+rwa2OS/J95N8pb3f6/Nqo1TVJnmjORH9I2AvYEvgYmDhGPe/HNhh0rx3A0vb6aXAu9rphW19WwF7tnXPm6E6ngI8Frh0Y+oAzgeeSPNZkX8FnjnDNR0DvHmKZcdV087AY9vpbYD/bvfd97Garq7ejle7/tbt9BbAecAT+jxW66ip1+fVwP7eCHwW+Mps+B/cmNum3FLohsSoqt8CE0Ni9Olg4KR2+iTgkIH5p1TV7VX1Y+Aqmvo3WlWdC9y4MXUk2RnYtqq+U82z81MD68xUTdMZV00rq+rCdvoW4AqaT9D3faymq2s6I6+rGre2d7dob0WPx2odNU1nLH8/gCS7AgcBH5+0/96eVxtjUw6FqYbEWNc/00wr4OtJLkgz9AbATlW1Epp/dmDHdv64a93QOnZpp0dd32uSXJKme2miOT32mpIsAB5D825z1hyrSXVBj8er7Q65CFgFnFlVvR+raWqC/p9X/wT8BXDnwLxZ87zaUJtyKKx3SIwRe1JVPZZmRNcjkzxlHcv2XeuE6eoYR30fBh4CLAJWAu/po6YkWwOnAW+oql+ua9Ge6+r1eFXV2qpaRDOqwD5JHrmOxfusqdfjlOTZwKqqumDYVcZR18bYlEOh1yExqura9ucq4HSa7qDr2mYg7c9VPdW6oXWsaKdHVl9VXdf+U98JfIy7us/GVlOSLWheeD9TVV9oZ/d+rKaqazYcr7aOXwBnAwcyC47V5JpmwXF6EvDcJMtpurCfluTTzJJjdU9syqHQ25AYSe6fZJuJaeAZwKXt/pe0iy0BvtROnwEclmSrJHsCe9OcVBqVDaqjbd7ekuQJ7RUPLx1YZ0ZM/IO0nkdzvMZWU7uNTwBXVNV7Bx7q9VhNV1efxyvJ/CQPbKfvCzwd+C96PFbT1dT386qqjqqqXatqAc1r0L9X1eHMwv/BoY3yLPaob8CzaK7W+BHw1jHudy+aKwguBi6b2DfwIOAs4Iftz+0H1nlrW+eVzOBVBcDnaJrNd9C823jFPakDWEzzD/Uj4J9pP+0+gzWdDPwAuITmH2PnMdf0hzTN8UuAi9rbs2bBsZqurt6OF/Ao4Pvtvi8Fjr6nz+8x1NTr82pSjftx19VHvT6vNubmMBeSpM6m3H0kSZphhoIkqWMoSJI6hoIkqWMoSJugJIsyMPibNFMMBWmSJPP6rmFCkum+HXERzaWr0owyFDTnJPliO2bVZRPjViW5Nck7kpwHPDHJ4WmGar4oyUcngiLJh5Msy8DwzdPsY58kX2inD07y6yRbphkC+up2/qIk323H7Tl9YtyeJGcn+bsk5wCvT/KCJJemGTb63PbDmu8AXtjW98LRHjHNJYaC5qKXV9XjaD4s9LokDwLuTzPU977ADcALa
ca3WgSsBV7crvvWqlpM82GqpyZ51DT7uJBmcDuAJ9N8KOnxwL7cNeDdp4C3VNWjaD6A9faB9R9YVU+tqvcARwMHVNWjgedWMyrw0cDnq2pRVX1+Yw6GNGi6pql0b/a6JM9rp3ejGWpgLc34QwD7A48DvteMOMB9uWvsmkPb1sXmNN+FsJDm07R3U1Vr0nxZysNpxuN5L833TMwDvpXkATQv/Oe0q5wE/MvAJgZf6P8DODHJqcAXkEbIUNCckmQ/mnFznlhVv0pyNnAf4DdVtXZiMeCkqjpq0rp7Am8GHl9VNyU5sV13Ot+iGUX3DuAbwIk0ofDmIUq9bWKiql6VZF+aMfsvSrJoiPWle8TuI801DwBuagPhYTTf3jXZWcDzk+wI3Vcr7gFsS/NifXOSnWhe8NflXOANwHeqajXNeDgPAy6rqpuBm5I8uV32JcA5U20kyUOq6ryqOhq4nqZ1cwvNN7VJM8qWguaarwGvSnIJzYBk3528QFVdnuRtNF+itBnNO/0jq+q7Sb5PMwji1TTdOutyHrATTThA0820qu4acGwJ8JEk92u397JptvP3SfamacGcRTMQ40+BpWm+dOadnlfQTHFAPElSx+4jSVLH7iNpIyU5Hdhz0uy3VNW/9VGPtDHsPpIkdew+kiR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUuf/Awzrt0vnhx4VAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure()\n", "plt.hist(clean_data['smoothness_mean'])\n", "plt.xlabel(\"smoothness_mean\")\n", "plt.ylabel(\"count\")\n", "plt.title(\"Distribution of smoothness_mean\")\n", "plt.savefig(\"../figures/smoothness_mean_distr\")\n", "plt.show()\n", "\n", "plt.figure()\n", "plt.hist(clean_data['compactness_se'])\n", "plt.xlabel(\"compactness_se\")\n", "plt.ylabel(\"count\")\n", "plt.title(\"Distribution of compactness_se\")\n", "plt.savefig(\"../figures/compactness_se_distr\")\n", "plt.show()\n", "\n", "plt.figure()\n", "plt.hist(clean_data['area_worst'])\n", "plt.xlabel(\"area_worst\")\n", "plt.ylabel(\"count\")\n", "plt.title(\"Distribution of area_worst\")\n", "plt.savefig(\"../figures/area_worst_distr\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "e239912a-67ed-40c7-93fa-6f3fd5dce0b2", "metadata": {}, "source": [ "After plotting the distributions of all the features, we found that the majority of the `mean` features are roughly symmetric or normal whereas the majority of the `se` and `worst` features are relatively right-skewed. " ] }, { "cell_type": "markdown", "id": "4210932c-1779-43f3-93c5-6ce01037b717", "metadata": {}, "source": [ "### Belign vs Malignant" ] }, { "cell_type": "markdown", "id": "69365f0a-147f-419d-a494-4d5978e442fc", "metadata": {}, "source": [ "Since we are interested in studying the difference between belign and malignant cancers, it might be helpful to analyze and compute the statistics of two populations separately and then compare." 
] }, { "cell_type": "code", "execution_count": 7, "id": "299dd87f-6df5-4a6c-9eaf-699e040f2658", "metadata": {}, "outputs": [], "source": [ "belign = clean_data[clean_data['diagnosis'] == 0]\n", "malignant = clean_data[clean_data['diagnosis'] == 1]" ] }, { "cell_type": "markdown", "id": "b3053064-3acd-4c69-9488-7f6b2103fe85", "metadata": {}, "source": [ "We first compare the means of each feature in two populations." ] }, { "cell_type": "code", "execution_count": 8, "id": "efe12dc2-a0ab-4df4-83de-52b752ec6d98", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| diagnosis | 0.0 | 1.0 |
| --- | --- | --- |
| radius_mean | 12.146524 | 17.462830 |
| texture_mean | 17.914762 | 21.604906 |
| perimeter_mean | 78.075406 | 115.365377 |
| area_mean | 462.790196 | 978.376415 |
| smoothness_mean | 0.092478 | 0.102898 |
| compactness_mean | 0.080085 | 0.145188 |
| concavity_mean | 0.046058 | 0.160775 |
| concave points_mean | 0.025717 | 0.087990 |
| symmetry_mean | 0.174186 | 0.192909 |
| fractal_dimension_mean | 0.062867 | 0.062680 |
| radius_se | 0.284082 | 0.609083 |
| texture_se | 1.220380 | 1.210915 |
| perimeter_se | 2.000321 | 4.323929 |
| area_se | 21.135148 | 72.672406 |
| smoothness_se | 0.007196 | 0.006780 |
| compactness_se | 0.021438 | 0.032281 |
| concavity_se | 0.025997 | 0.041824 |
| concave points_se | 0.009858 | 0.015060 |
| symmetry_se | 0.020584 | 0.020472 |
| fractal_dimension_se | 0.003636 | 0.004062 |
| radius_worst | 13.379801 | 21.134811 |
| texture_worst | 23.515070 | 29.318208 |
| perimeter_worst | 87.005938 | 141.370330 |
| area_worst | 558.899440 | 1422.286321 |
| smoothness_worst | 0.124959 | 0.144845 |
| compactness_worst | 0.182673 | 0.374824 |
| concavity_worst | 0.166238 | 0.450606 |
| concave points_worst | 0.074444 | 0.182237 |
| symmetry_worst | 0.270246 | 0.323468 |
| fractal_dimension_worst | 0.079442 | 0.091530 |
" ], "text/plain": [ "diagnosis 0.0 1.0\n", "radius_mean 12.146524 17.462830\n", "texture_mean 17.914762 21.604906\n", "perimeter_mean 78.075406 115.365377\n", "area_mean 462.790196 978.376415\n", "smoothness_mean 0.092478 0.102898\n", "compactness_mean 0.080085 0.145188\n", "concavity_mean 0.046058 0.160775\n", "concave points_mean 0.025717 0.087990\n", "symmetry_mean 0.174186 0.192909\n", "fractal_dimension_mean 0.062867 0.062680\n", "radius_se 0.284082 0.609083\n", "texture_se 1.220380 1.210915\n", "perimeter_se 2.000321 4.323929\n", "area_se 21.135148 72.672406\n", "smoothness_se 0.007196 0.006780\n", "compactness_se 0.021438 0.032281\n", "concavity_se 0.025997 0.041824\n", "concave points_se 0.009858 0.015060\n", "symmetry_se 0.020584 0.020472\n", "fractal_dimension_se 0.003636 0.004062\n", "radius_worst 13.379801 21.134811\n", "texture_worst 23.515070 29.318208\n", "perimeter_worst 87.005938 141.370330\n", "area_worst 558.899440 1422.286321\n", "smoothness_worst 0.124959 0.144845\n", "compactness_worst 0.182673 0.374824\n", "concavity_worst 0.166238 0.450606\n", "concave points_worst 0.074444 0.182237\n", "symmetry_worst 0.270246 0.323468\n", "fractal_dimension_worst 0.079442 0.091530" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clean_data.groupby('diagnosis').mean().transpose()" ] }, { "cell_type": "markdown", "id": "01059826-5996-4c50-9cf8-763df1d9a896", "metadata": {}, "source": [ "Obviously, most of the features have different averages in two populations, but it is hard to tell whether such differences are significant or not. To assess this, we can compare the distributions of features in each population. " ] }, { "cell_type": "code", "execution_count": 9, "id": "f7528556-e8bf-436a-8b9c-e6483b303da0", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure()\n", "plt.hist(belign['smoothness_mean'], alpha=0.5, label=\"belign\")\n", "plt.hist(malignant['smoothness_mean'], alpha=0.5, label=\"malignant\")\n", "plt.xlabel('smoothness_mean')\n", "plt.ylabel('count')\n", "plt.legend()\n", "plt.title(\"Distributions of smoothness_mean in two populations\")\n", "plt.savefig(\"../figures/smoothness_mean_distr_two_popu\")\n", "plt.show()\n", "\n", "plt.figure()\n", "plt.hist(belign['compactness_se'], alpha=0.5, label=\"belign\")\n", "plt.hist(malignant['compactness_se'], alpha=0.5, label=\"malignant\")\n", "plt.xlabel('compactness_se')\n", "plt.ylabel('count')\n", "plt.legend()\n", "plt.title(\"Distributions of compactness_se in two populations\")\n", "plt.savefig(\"../figures/compactness_se_distr_two_popu\")\n", "plt.show()\n", "\n", "plt.figure()\n", "plt.hist(belign['area_worst'], alpha=0.5, label=\"belign\")\n", "plt.hist(malignant['area_worst'], alpha=0.5, label=\"malignant\")\n", "plt.xlabel('area_worst')\n", "plt.ylabel('count')\n", "plt.title(\"Distributions of area_worse in two populations\")\n", "plt.savefig(\"../figures/area_worst_distr_two_popu\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "93ee8fb4-1f01-413a-a71d-df4e305937d6", "metadata": {}, "source": [ "We chose the same three features we plotted before. We can see that although `belign` and `malignant` have different distributions in all the plots, it is still hard to judge whether it is due to the randomness since two populations have different size (as showed clearly in `smoothness_mean` feature). Thus, we should conduct some more rigorous statistical hypothesis testing, such as two-sample t-test, to judge this. And this will be done in a separate notebook called `two-populations-analysis.ipynb`. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "e40aa6a9-a042-4b5a-9b53-c9370d522555", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "hw07", "language": "python", "name": "hw07" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" } }, "nbformat": 4, "nbformat_minor": 5 }