For assignment 2 the task was to write some code to process the chosen variables and present:
- the script;
- the output that displays three of your variables as frequency tables;
- a few sentences describing your frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.
My selected variables were:
- the experience of animal phobia;
- the origin of the respondent;
- the respondent’s perceived health state.
The full version of my script (in Python) is here. Below are some of its parts to give an outline. To print all my output for frequencies and percentages, I wrote the following function:
```python
import pandas as pd


def label_print(series_freq, series_percent, series_map, rename=True, ind_sort=False):
    '''
    Print the output of an operation with all captions and labels needed

    :param series_freq: which series frequency to print out
    :param series_percent: which series percentages to print out
    :param series_map: set the dictionary describing the series
    :param rename: True by default; if True, replaces index with the given labels
    :param ind_sort: False by default; if True, sort index values
                     (good for sorting numeric index)
    :return: None
    '''
    print(TITLE.format(series_map[CODE], series_map[MEANING]))
    if rename:
        # Use labels for the output
        print(pd.concat(dict(Frequencies=series_freq.rename(series_map[VALUES]),
                             Percentages=series_percent.rename(series_map[VALUES])),
                        axis=1))
    if ind_sort:
        # Sort numeric labels for the output
        print(pd.concat(dict(Frequencies=series_freq.sort_index(),
                             Percentages=series_percent.sort_index()),
                        axis=1))
    elif not rename and not ind_sort:
        print(pd.concat(dict(Frequencies=series_freq,
                             Percentages=series_percent),
                        axis=1))
```
I use global variables to store all necessary string values (such as column names and their meanings). And I stored all the necessary code meanings as dictionaries in a separate file, reference.py, which I import into my script and use to label the output.
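For illustration, an entry in reference.py looks roughly like this (simplified; the real file holds maps for many more variables, and the exact numeric answer codes for No and Unknown shown here are just placeholders, as is the TITLE template):

```python
# reference.py -- a simplified sketch; the real file maps many more codes
CODE, MEANING, VALUES = 'code', 'meaning', 'values'
TITLE = '\nResults for {0} - {1}'  # template used by label_print()

ANIMALS_MAP = {
    CODE: 'S8Q1A1',
    MEANING: 'EVER HAD FEAR/AVOIDANCE OF INSECTS, SNAKES, BIRDS, OTHER ANIMALS',
    # Numeric answer codes are illustrative placeholders here;
    # only 1 = Yes is actually used in the subsetting code below
    VALUES: {1: 'Yes', 2: 'No', 9: 'Unknown'},
}
```

Keeping these maps in one module means the main script never hard-codes a column name or label.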
So, here is the piece of code to get the frequencies and percentages for my core variable regarding animal phobia:
```python
animals_freq = data[ANIMALS_MAP[CODE]].value_counts(sort=False)
animals_percent = data[ANIMALS_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(animals_freq, animals_percent, ANIMALS_MAP)
```
And here is the output:
```
Results for S8Q1A1 - EVER HAD FEAR/AVOIDANCE OF INSECTS, SNAKES, BIRDS, OTHER ANIMALS
         Frequencies  Percentages
Yes             9093     0.211009
No             32585     0.756155
Unknown         1415     0.032836
```
From this we see that a considerable number of the respondents (21%) have had some uneasy experience with animals.
My next variable was origin or descent. The thing is, it takes quite a number of distinct values (60 in total). By the way, I counted the unique values like this:
```python
unique_origins = data[ORIGIN_MAP[CODE]].unique()
print('num distinct origins:', len(unique_origins))
```
Some are really numerous (like African American: 7684 occurrences, about 18%); others are very few (like Malaysian: 11 occurrences, 0.025%). So here I will provide just the top by frequency (over 900 occurrences). The code was:
```python
origins_freq = data[ORIGIN_MAP[CODE]].value_counts()
origins_percent = data[ORIGIN_MAP[CODE]].value_counts(normalize=True)
label_print(origins_freq, origins_percent, ORIGIN_MAP)
```
And this is the top of the output:
```
Results for S1Q1E - ORIGIN OR DESCENT
                                                   Frequencies  Percentages
African American (Black, Negro, or Afro-American)         7684     0.178312
German                                                    5345     0.124034
English                                                   4455     0.103381
Irish                                                     3066     0.071148
Mexican                                                   2578     0.059824
Unknown                                                   1855     0.043046
Mexican-American                                          1758     0.040795
Other                                                     1739     0.040355
Italian                                                   1555     0.036085
French                                                    1048     0.024319
Puerto Rican                                               997     0.023136
American Indian (Native American)                          975     0.022625
...
```
Last, there is the health variable. The code snippet is nothing new:
```python
health_freq = data[HEALTH_MAP[CODE]].value_counts(sort=False)
health_percent = data[HEALTH_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(health_freq, health_percent, HEALTH_MAP)
```
```
Results for S1Q16 - SELF-PERCEIVED CURRENT HEALTH
           Frequencies  Percentages
Very good        12424     0.288307
Excellent        12316     0.285800
Good             10649     0.247117
Fair              5219     0.121110
Poor              2219     0.051493
Unknown            266     0.006173
```
And I also played with subsetting. I made a subset based on three conditions:
- The respondent should have experienced animal fear
- The respondent should be of one of the top origins (excluding Other and Unknown)
- The respondent should perceive their health as poor.
Here is the code snippet:
```python
condition_ap = data[ANIMALS_MAP[CODE]] == 1
condition_health = data[HEALTH_MAP[CODE]] == 5
condition_origin = data[ORIGIN_MAP[CODE]].isin([1, 19, 15, 18, 27, 29, 36, 35, 39, 3])
raw_subset = data[condition_ap & condition_health & condition_origin]
subset = raw_subset.copy()

print('Subset: top origins + poor perceived health + have AP')
origins_ap_freq = subset[ORIGIN_MAP[CODE]].value_counts(sort=False)
origins_ap_percent = subset[ORIGIN_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(origins_ap_freq, origins_ap_percent, ORIGIN_MAP)
```
Here is the output:
```
Subset: top origins + poor perceived health + have AP
Results for S1Q1E - ORIGIN OR DESCENT
                                                   Frequencies  Percentages
African American (Black, Negro, or Afro-American)          217     0.430556
American Indian (Native American)                           27     0.053571
English                                                     67     0.132937
French                                                      10     0.019841
German                                                      54     0.107143
Irish                                                       35     0.069444
Italian                                                     17     0.033730
Mexican                                                     22     0.043651
Mexican-American                                            20     0.039683
Puerto Rican                                                35     0.069444
```
To wrap up: I now have a frequency table for each of my three variables.
For animal fear/avoidance, a fair share of respondents (21%) have experienced it. It may be instructive to look at some other specific phobias for comparison, but for now I can just note that this share is by no means small.
For origin or descent, we see that the leaders among the respondents’ origins (60 in total) are:
- African American (Black, Negro, or Afro-American) (18%)
- German (12%)
- English (10%)
And the fewest are:
- Jordanian and Malaysian (0.025% each)
- Samoan (0.02%)
As to health, most of the respondents (82%) find their health good, very good, or excellent, and only 5% rated it as poor. I wondered whether this distribution changes for the subset of those with animal phobia, so I had a look. Here is the code snippet:
```python
raw_subset = data[condition_ap]
subset_health_ap = raw_subset.copy()

print('\nSubset: perceived health + have AP')
health_ap_freq = subset_health_ap[HEALTH_MAP[CODE]].value_counts(sort=True, dropna=False)
health_ap_percent = subset_health_ap[HEALTH_MAP[CODE]].value_counts(sort=True, dropna=False,
                                                                    normalize=True)
label_print(health_ap_freq, health_ap_percent, HEALTH_MAP)
```
And here is the output:
```
Subset: perceived health + have AP
Results for S1Q16 - SELF-PERCEIVED CURRENT HEALTH
           Frequencies  Percentages
Good              2468     0.271418
Very good         2424     0.266579
Excellent         2074     0.228088
Fair              1449     0.159353
Poor               661     0.072693
Unknown             17     0.001870
```
So in the subset of those with animal fears, the distribution is indeed a bit different. The pattern that health is at least good still predominates (about 77%), but the order of the top three has changed: Good is now the most frequent of the three (it used to be the least frequent), and Excellent is the least frequent (it used to be in the middle). The share of those who think their health is poor has also increased, to 7%. It is too early to jump to any conclusions, though; I do not even know whether this difference is significant.
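Out of curiosity about that significance question, here is a rough back-of-the-envelope check one could run: a Pearson chi-square test comparing the health distribution of the AP subset against the rest of the sample. This is only a sketch: the counts are read off the two tables above (Unknown excluded), and "the rest" is simply the full sample minus the AP subset, so it also includes respondents whose animal-fear status is itself Unknown.

```python
# Health counts (Excellent, Very good, Good, Fair, Poor), Unknown dropped
ap = [2074, 2424, 2468, 1449, 661]              # respondents with AP
rest = [12316 - 2074, 12424 - 2424,             # full sample minus AP subset
        10649 - 2468, 5219 - 1449, 2219 - 661]

n_ap, n_rest = sum(ap), sum(rest)
total = n_ap + n_rest

# Pearson chi-square statistic computed by hand (no scipy needed)
chi2 = 0.0
for a, r in zip(ap, rest):
    col = a + r
    for observed, row_total in ((a, n_ap), (r, n_rest)):
        expected = row_total * col / total
        chi2 += (observed - expected) ** 2 / expected

print('chi-square = %.1f with 4 degrees of freedom' % chi2)
```

With 4 degrees of freedom the 5% critical value is about 9.49, so a statistic far above that would suggest the two distributions genuinely differ. A proper analysis would of course need to account for the survey design, so I would not read too much into this.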
As to the subset (above) based on origin, poor perceived health, and animal phobia: it was purely technical. There is little to be concluded or observed from its frequency table; some other approach will be necessary. For now, I am just glad I discovered the handy `.isin` method for subsetting.
Working with my variables, I have not encountered any missing values, so I did not have to deal with them technically (although I used the `dropna` parameter a couple of times just in case). But there is one thing all three variables share: the `Unknown` category. I have not decided how to deal with it yet, but most probably I will just drop it in the future.
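If I do decide to drop it, a couple of straightforward pandas options would be filtering the rows out or recoding Unknown as a proper missing value. The snippet below uses toy data, and 9 as the Unknown code is purely illustrative:

```python
import pandas as pd

# Toy data standing in for the real DataFrame; 9 plays the role of 'Unknown'
data = pd.DataFrame({'S8Q1A1': [1, 2, 9, 1, 2, 2]})

# Option 1: filter the Unknown rows out entirely
cleaned = data[data['S8Q1A1'] != 9]

# Option 2: recode Unknown as a real missing value,
# so value_counts() skips it by default
recoded = data['S8Q1A1'].replace(9, pd.NA)

print(cleaned['S8Q1A1'].value_counts(sort=False))
```

Option 2 has the advantage of keeping the rows around, in case the other columns of those respondents are still useful.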