Demographics in the United States

An exploration of the relationships that exist between demographic datapoints

Introduction

The United States is made up of many kinds of people and jobs, with widely varying incomes, backgrounds, and levels of education. Many systems in the country limit or guarantee certain future employment opportunities depending on qualities such as race, gender, and education.

In this report, we explore the relationships between the features in our dataset. We seek to understand imbalances in opportunity by gender by stratifying gender over factors such as job title and income. We also attempt to predict whether an individual’s income is over or under 50k.

The Data

The dataset we used is from Kaggle.com, and includes multiple demographic datapoints about individuals in the United States. The dataset contains the following columns:

  1. Age: continuous variable

  2. Workclass: (categorical) Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

  3. Fnlwgt: continuous variable, the number of people in the US with the same set of datapoints

  4. Education: (categorical) Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

  5. Education-num: continuous.

  6. Marital-status: (categorical) Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

  7. Occupation: (categorical) Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

  8. Relationship: (categorical) Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

  9. Race: (categorical) White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

  10. Sex: (categorical) Female, Male.

  11. Capital-gain: continuous variable

  12. Capital-loss: continuous variable

  13. Hours-per-week: continuous variable

  14. Native-country: (categorical) United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

  15. Label: (categorical) >50K, <=50K

Package Imports

import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import tree
from IPython.display import Image

from gender_functions import plot_gender_eda
from gender_functions import process_df
from gender_functions import plot_tree

Data Import

df = pd.read_csv('adult.data', header = None)
df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status','occupation',
              'relationship', 'race', 'sex','capital-gain','capital-loss','hours-per-week','native country','label']
df.head()
   age  workclass         fnlwgt  education  education-num  marital-status      occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native country  label
0  39   State-gov         77516   Bachelors  13             Never-married       Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States   <=50K
1  50   Self-emp-not-inc  83311   Bachelors  13             Married-civ-spouse  Exec-managerial    Husband        White  Male    0             0             13              United-States   <=50K
2  38   Private           215646  HS-grad    9              Divorced            Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States   <=50K
3  53   Private           234721  11th       7              Married-civ-spouse  Handlers-cleaners  Husband        Black  Male    0             0             40              United-States   <=50K
4  28   Private           338409  Bachelors  13             Married-civ-spouse  Prof-specialty     Wife           Black  Female  0             0             40              Cuba            <=50K
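
The counts printed later in this report show values such as ' <=50K' with a leading space, which suggests that the raw file separates fields with a comma followed by a space. Below is a minimal sketch, assuming that layout, of how the skipinitialspace option of pd.read_csv strips the padding at load time. The two sample rows are taken from the table above; raw_text, padded, and clean are names introduced here for illustration only.

```python
import io
import pandas as pd

# Two rows in the assumed ", "-separated layout of adult.data (illustrative only)
raw_text = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)
cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
        'marital-status', 'occupation', 'relationship', 'race', 'sex',
        'capital-gain', 'capital-loss', 'hours-per-week', 'native country', 'label']

# Default parsing keeps the space after each comma as part of string values
padded = pd.read_csv(io.StringIO(raw_text), header=None, names=cols)
# skipinitialspace=True drops whitespace that follows the delimiter
clean = pd.read_csv(io.StringIO(raw_text), header=None, names=cols,
                    skipinitialspace=True)

print(repr(padded.loc[0, 'label']), repr(clean.loc[0, 'label']))
```

With the padding removed, comparisons such as df['label'] == '>50K' work without needing to remember the leading space.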

Exploratory Data Analysis

Before modelling, we need to understand what our data looks like: whether there are empty or unreliable variables, and whether the dataset is sparse. We also want to find any outliers and remove them.

First, we can see that isnull() reports no missing values in any column.

df.isnull().sum()
age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native country    0
label             0
dtype: int64
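
Note that isnull() only counts true NaN entries. If a file marks unknown values with a placeholder string, those cells are not counted. Below is a minimal sketch on a toy frame, assuming '?' as the placeholder; this marker is an assumption for illustration, and whether adult.data uses one is not checked here.

```python
import pandas as pd

# Toy frame; the ' ?' marker is a hypothetical placeholder for unknown values
toy = pd.DataFrame({
    'workclass': [' Private', ' ?', ' State-gov'],
    'occupation': [' Sales', ' ?', ' ?'],
})

# isnull() sees ordinary strings, so it reports zero missing values
nan_counts = toy.isnull().sum()
# Strip padding, then count cells equal to the placeholder in each column
placeholder_counts = toy.apply(lambda col: col.str.strip().eq('?').sum())

print(nan_counts)
print(placeholder_counts)
```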

Age is correlated with income: older individuals tend to have higher incomes due to their experience in their fields. We can see that 43.7% of the records are for individuals aged 40 or older.

len(df[df['age'] >= 40]) / len(df)
0.437240870980621

Next, we want to explore income levels. In this dataset, 24720 individuals make 50k a year or less and 7841 individuals earn more than 50k a year. Therefore, around 24% of the individuals make more than 50k a year.

df.label.value_counts()
 <=50K    24720
 >50K      7841
Name: label, dtype: int64
# Percentage of individuals making more than 50k a year
7841 / len(df)
0.2408095574460244

We want to see whether a larger share of individuals aged 40 or older make more than 50k a year. Assuming one makes more money as one ages, we should see an increase. Among those aged 40 or older, 9216 individuals earn 50k or less and 5021 earn more than 50k a year. The share making over 50k is therefore 35.3%, which is higher than the 24% for the dataset as a whole.

df_over40 = df[df['age'] >= 40]
df_over40.label.value_counts()
 <=50K    9216
 >50K     5021
Name: label, dtype: int64
# Percentage of individuals aged 40 or older who make more than 50k a year
5021 / len(df_over40)
0.35267261361241836
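
The comparison above is simple arithmetic on the printed counts. As a worked check using only the counts reported in this section (share_over_50k is a helper name introduced here for illustration):

```python
def share_over_50k(n_under, n_over):
    """Fraction of a group earning more than 50k, from the two class counts."""
    return n_over / (n_under + n_over)

overall = share_over_50k(24720, 7841)      # all records
aged_40_plus = share_over_50k(9216, 5021)  # records with age >= 40

# Ratio of the two shares: how much more common >50K is in the older group
ratio = aged_40_plus / overall
print(round(overall, 3), round(aged_40_plus, 3), round(ratio, 2))
```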

To explore differences in education level by gender, we created a count plot. Males outnumber females at the same education level; for example, more males than females completed a bachelor’s degree. However, the dataset contains more males than females overall, so these raw counts show only the overall trend and do not by themselves show that males are more likely to reach a given level.

plt.figure(figsize = (15,8))
ax = sns.countplot(x = 'education', hue = 'sex', data = df)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 30)
plt.title('Education Level Count by Gender')
plt.show()
_images/main_22_0.png
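
Because the two groups differ in size, one way to compare them is to look at within-gender proportions rather than raw counts. Below is a minimal sketch on toy data, using pd.crosstab with normalize='columns'; the toy values are made up for illustration and are not drawn from the dataset.

```python
import pandas as pd

# Made-up records for illustration only
toy = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', 'HS-grad', 'Bachelors', 'HS-grad', 'Masters'],
    'sex': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female'],
})

# Each column sums to 1, so entries are shares within each gender
shares = pd.crosstab(toy['education'], toy['sex'], normalize='columns')
print(shares)
```

Applied to the real frame, the same call would show the share of each gender at each education level, which is not distorted by the male/female imbalance.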

We also want to explore whether some races are more likely to be in particular occupations. White individuals make up most of the dataset and dominate the counts, which makes it hard to see a trend directly from this plot.

plt.figure(figsize = (15,8))
ax = sns.countplot(x = 'occupation', hue = 'race', data = df)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 30)
plt.title('Occupation Count by Race')
plt.show()
_images/main_24_0.png

We want to exclude possible outliers in our data to obtain a more accurate analysis of the demographic data points. We exclude values above the 99th percentile or below the 1st percentile of age, which correspond to 74 and 17 years. Because the filter uses strict inequalities, individuals aged exactly 74 or 17 are also excluded, so only those older than 17 and younger than 74 remain in the analysis.

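def filter_outliers(df, column):
    # Cutoffs at the 1st and 99th percentiles of the column
    q_low = df[column].quantile(0.01)
    q_hi  = df[column].quantile(0.99)
    # Strict inequalities: rows equal to either cutoff are dropped as well
    filter_df = df[(df[column] < q_hi) & (df[column] > q_low)]
    return filter_df

# Apply the filter to the age column
df = filter_outliers(df, 'age')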
df.head()
   age  workclass         fnlwgt  education  education-num  marital-status      occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native country  label
0  39   State-gov         77516   Bachelors  13             Never-married       Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States   <=50K
1  50   Self-emp-not-inc  83311   Bachelors  13             Married-civ-spouse  Exec-managerial    Husband        White  Male    0             0             13              United-States   <=50K
2  38   Private           215646  HS-grad    9              Divorced            Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States   <=50K
3  53   Private           234721  11th       7              Married-civ-spouse  Handlers-cleaners  Husband        Black  Male    0             0             40              United-States   <=50K
4  28   Private           338409  Bachelors  13             Married-civ-spouse  Prof-specialty     Wife           Black  Female  0             0             40              Cuba            <=50K
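
Note that filter_outliers compares with strict inequalities, so rows whose value equals a cutoff are removed too. A small sketch on made-up ages shows the difference from an inclusive filter (toy_ages is illustrative data, not the dataset):

```python
import pandas as pd

# Made-up ages: several people at the minimum age, then one person per age
toy_ages = pd.Series([17] * 5 + list(range(18, 90)))
q_low = toy_ages.quantile(0.01)
q_hi = toy_ages.quantile(0.99)

# Same comparison as filter_outliers: values equal to a cutoff are dropped
strict = toy_ages[(toy_ages > q_low) & (toy_ages < q_hi)]
# Inclusive variant: values equal to a cutoff are kept
inclusive = toy_ages[(toy_ages >= q_low) & (toy_ages <= q_hi)]

print(q_low, strict.min(), inclusive.min())
```

When many records share the minimum value, as in this toy series, the strict version removes all of them, which can be far more than 1% of the rows.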

Gender Analysis

The following section explores gender differences across non-intrinsic qualities such as education level, income, occupation, working class, and marital status, to understand whether there are any predictive features or noteworthy imbalances. Intrinsic features such as race, age, or nationality are not considered: on their own they have no logical predictive power for gender, so they would not be valid split points if included in the model.

We first want to evaluate the ratio of men to women in this dataset, to give context to any imbalances seen in the variables.

df['sex'].value_counts()/len(df['sex'])
 Male      0.670835
 Female    0.329165
Name: sex, dtype: float64

We can see that there is about a 67/33 split between males and females. This is important context when plotting relationships.

plot_gender_eda(df)
_images/main_33_0.png _images/main_33_1.png _images/main_33_2.png _images/main_33_3.png

From these plots, we can see that certain variables show more stratification by gender than others. Income, for example, shows that men are more likely to be making >50k. The plots also show that men are more likely to be married in this dataset, and that females are more educated.

#Processing our dataframe for our model
X, target = process_df(df, 'sex')
#let's divide our data into a training and testing set, giving our test 15% of the data
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.15, random_state=42)

#using a Decision Tree
clf = DecisionTreeClassifier(max_depth = 3, random_state = 42)
clf = clf.fit(X_train,y_train)
clf.feature_names = list(X_train.columns.values)
y_pred = clf.predict(X_test)

Let’s see how accurate our model is.

metrics.accuracy_score(y_test, y_pred)
0.7275392670157068
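
Accuracy alone does not show how the errors are distributed between the two classes, which matters when one class is about twice as common as the other. Below is a minimal sketch with made-up labels (not our model's predictions) using metrics.confusion_matrix; y_true and y_hat are illustrative names.

```python
from sklearn import metrics

# Made-up labels for illustration only
y_true = ['Male', 'Male', 'Male', 'Male', 'Female', 'Female']
y_hat = ['Male', 'Male', 'Male', 'Male', 'Male', 'Female']

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = metrics.confusion_matrix(y_true, y_hat, labels=['Female', 'Male'])
acc = metrics.accuracy_score(y_true, y_hat)
# Share of true 'Female' rows that were predicted as 'Female'
female_recall = metrics.recall_score(y_true, y_hat, pos_label='Female')

print(cm)
print(acc, female_recall)
```

In this toy case the accuracy is high even though only half of the minority class is recovered, which is the kind of pattern a single accuracy figure can hide.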

Below is a visualization of the decision tree.

plot_tree(clf, target)
_images/main_39_0.png

We can see from the decision tree that there is still a lot of impurity after each split, meaning that no single factor fully determines gender.
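
For reference, scikit-learn's DecisionTreeClassifier measures impurity with the Gini index by default: 1 minus the sum of squared class proportions in a node. Below is a short sketch of that formula (gini is a helper name introduced here), applied to the overall male/female proportions printed earlier in this section.

```python
def gini(proportions):
    """Gini impurity of a node given its class proportions (which sum to 1)."""
    return 1.0 - sum(p * p for p in proportions)

root = gini([0.670835, 0.329165])  # class mix before any split
pure = gini([1.0, 0.0])            # a node containing a single class
even = gini([0.5, 0.5])            # the most impure two-class node

print(round(root, 3), pure, even)
```

A split that separates the classes well produces child nodes with impurity near 0; nodes that stay close to the starting value indicate that the split did little to distinguish the genders.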

Income Prediction

The following section explores the predictive power of our features on the income variable. We use a decision tree model, with ensemble methods to improve accuracy.

An important piece of context for this analysis is the dataset’s ratio of incomes over 50k to incomes of 50k or less:

df['label'].value_counts()/len(df['label'])
 <=50K    0.755506
 >50K     0.244494
Name: label, dtype: float64

As we can see, there is roughly a 24:76 split.

After some pre-processing, including the creation of dummy variables for all qualitative predictors in the dataset, five different decision tree models were applied to the dataset to predict income: a naïve “CART” decision tree; a bagged decision tree; a bagged decision tree with support-vector-machine feature selection; a bagged decision tree with symmetric-uncertainty feature selection; and a bagged decision tree with principal component analysis feature selection. These model types vary in complexity, but the best model could be selected from among them using 10-fold cross-validation. Please refer to income.ipynb for the computations involved in these steps.
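
Below is a minimal sketch of how a single tree and a bagged tree can be compared with 10-fold cross-validation in scikit-learn. It runs on synthetic data from make_classification and uses illustrative settings (25 estimators, fixed seeds); the actual models and settings are in income.ipynb and may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed features, with a roughly 76:24 class mix
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4,
    weights=[0.76, 0.24], random_state=0,
)

models = {
    'single tree': DecisionTreeClassifier(random_state=0),
    # Bagging: many trees fit on bootstrap samples, with predictions aggregated
    'bagged trees': BaggingClassifier(
        DecisionTreeClassifier(random_state=0), n_estimators=25, random_state=0
    ),
}

# One accuracy value per fold for each model
cv_scores = {name: cross_val_score(model, X_demo, y_demo, cv=10)
             for name, model in models.items()}
for name, scores in cv_scores.items():
    print(name, round(scores.mean(), 3))
```

Comparing the mean of the ten fold scores across candidate models is the selection rule described above.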

Image(filename='figures/accuracy_barchart.png')
_images/main_47_0.png

As the bar chart shows, the best performer among the five decision tree models, although only slightly, is the bagged decision tree with support-vector-machine feature selection, which obtains the highest 10-fold cross-validation accuracy. More specifically, its accuracy is close to 85%.
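
One common way to do support-vector-machine feature selection in scikit-learn is to fit an L1-penalised LinearSVC and keep the features with non-zero coefficients via SelectFromModel. The sketch below assumes that approach, which may not match what income.ipynb does, and uses synthetic data; C=0.1 and the other settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 20 features, of which 5 are informative
X_syn, y_syn = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

pipe = make_pipeline(
    StandardScaler(),
    # The L1 penalty drives some coefficients to zero; those features are dropped
    SelectFromModel(LinearSVC(C=0.1, penalty='l1', dual=False, random_state=0)),
    BaggingClassifier(DecisionTreeClassifier(random_state=0),
                      n_estimators=10, random_state=0),
)
pipe.fit(X_syn, y_syn)

# Boolean mask of the features that survived the selection step
kept_mask = pipe.named_steps['selectfrommodel'].get_support()
n_kept = int(kept_mask.sum())
print(n_kept, 'of', X_syn.shape[1], 'features kept')
```

Placing the selector inside the pipeline means it is refit on each training fold during cross-validation, so the held-out fold does not influence which features are kept.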

Image(filename='figures/income_tree.png')
_images/main_49_0.png

The above figure shows what the best model for income prediction looks like. Again, this is the bagged decision tree with support-vector-machine feature selection. Please refer to income.ipynb for the computations involved in these steps.

Results and Limitations

From the gender exploration, we can’t conclude much about the predictive nature of the features in our dataset. The accuracy of the model was not impressive, and furthermore we can’t assume it was splitting on anything truly predictive. In general, however, women are unequally represented across many factors, as the income gap shows. Additionally, our dataset may have been an unreliable source in some ways because of the unequal representation of gender in the data.

Overall, the data suggest that women may be more educated than men in general, yet are less likely to earn over 50k.

Moving on to our analysis of incomes, we conclude the following. As seen in the analysis, the best decision tree model is the bagged decision tree with support-vector-machine feature selection. It gives roughly 85% accuracy, which in many cases would be desirable. Nevertheless, this accuracy is less impressive than it looks, given that the categorical response variable for income is split 76:24. As a result, we have to conclude that decision tree modelling has not been impressive in predicting income levels using the Census Bureau data. At most, it does a slightly better job than predicting that everybody has an income of 50 thousand dollars or less. Perhaps a new dataset or another modelling method would be more insightful.
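
To make the comparison with the majority-class baseline concrete, the arithmetic can be written out using the class proportions printed in the Income Prediction section and the approximate 85% accuracy read from the bar chart (0.85 is a rounded figure, not an exact result).

```python
# Class proportions printed earlier (after outlier filtering)
p_under = 0.755506  # share with income <=50K
p_over = 0.244494   # share with income >50K

# Always predicting the majority class is right exactly as often as that class occurs
baseline_acc = max(p_under, p_over)
model_acc = 0.85    # approximate cross-validated accuracy of the best model

# Improvement over the baseline, in accuracy points
gain = model_acc - baseline_acc
print(round(baseline_acc, 3), round(gain, 3))
```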

The limitations of this dataset come from unrepresentativeness. The categories (race, occupation, native country) in this dataset are not exhaustive. The data consist primarily of white males, which biases the trained models for both the gender analysis and the income prediction. The models in this project capture only the big picture of demographics in the United States. The data record income only as over 50k or under 50k. If the data included the actual numerical income instead of a categorical label, income could be explored in more detail. It would also be beneficial if the data included geographic locations such as states or cities, which would help in understanding each state’s demographics.

Author Contributions

Midori - I created the EDA section before the gender analysis section. I helped with structuring the README file and with making the Jupyter Book.

Samantha - I helped the team create template files for the EDA and Main ipynb files. I completed the gender exploration section, along with the gender_functions package folder and function tests. Helped write the narrative on the main ipynb file. Helped team with git workflow.

Noam - I contributed to the income decision tree analyses and corresponding write-up, the Makefile, environment, README, and Jupyter Book, and to group discussions on Zoom and Slack.