Coursera: Data Management and Visualization – Assignment 2

For assignment 2 the task was to write some code to process the chosen variables and present:

  • the script;
  • the output that displays three of your variables as frequency tables;
  • a few sentences describing your frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.

My selected variables were:

  • the experience of animal phobia;
  • the origin of the respondent;
  • the respondent’s perceived health state.

The full version of my script (in Python) is here. Below are some parts of it to give an outline. To print all my frequencies and percentages with captions and labels, I wrote the following function:

import pandas as pd

def label_print(series_freq, series_percent, series_map, rename=True, ind_sort=False):
    """
    Print the output of an operation with all captions and labels needed
    :param series_freq: which series frequency to print out
    :param series_percent: which series percentages to print out
    :param series_map: set the dictionary describing the series
    :param rename: True by default; if True, replaces index with the given labels
    :param ind_sort: False by default; if True, sort index values (good for sorting numeric index)
    :return: None
    """
    print(TITLE.format(series_map[CODE], series_map[MEANING]))
    if rename:
        # Use labels for the output
        print(pd.concat(dict(Frequencies=series_freq.rename(series_map[VALUES]),
                             Percentages=series_percent.rename(series_map[VALUES])), axis=1))
    if ind_sort:
        # Sort numeric labels for the output
        print(pd.concat(dict(Frequencies=series_freq.sort_index(),
                             Percentages=series_percent.sort_index()), axis=1))
    elif not rename and not ind_sort:
        # Print the series as they are
        print(pd.concat(dict(Frequencies=series_freq,
                             Percentages=series_percent), axis=1))

I use global variables to store all the necessary string values (such as column names, their meanings, etc.). I also keep the meanings of all the codes as dictionaries in a separate file, which I import into my script and use to label the output.
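A minimal sketch of how such constants and a map for one variable might look. The key strings, the title template, the column name 'ANIMALS_COL' and the codes 2 and 9 are assumptions made for illustration; only the constant names and the code 1 for Yes come from the snippets in this post.

```python
# Hypothetical constants: the names mirror those used in the script,
# but the string values are assumptions made for this illustration.
CODE = 'code'
MEANING = 'meaning'
VALUES = 'values'
TITLE = '\n{}: {}\n'

# Hypothetical map for one variable. 'ANIMALS_COL' is a placeholder,
# not the real column name; codes 2 and 9 are assumed.
ANIMALS_MAP = {
    CODE: 'ANIMALS_COL',
    MEANING: 'Ever had fear/avoidance of animals',
    VALUES: {1: 'Yes', 2: 'No', 9: 'Unknown'},
}

# The caption printed above a frequency table
header = TITLE.format(ANIMALS_MAP[CODE], ANIMALS_MAP[MEANING])
print(header)
```

Keeping one dictionary per variable means the printing function only needs the map to find the column, its caption and its value labels.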

So, here is the piece of code to get the frequencies and percentages for my core variable regarding animal phobia:

animals_freq = data[ANIMALS_MAP[CODE]].value_counts(sort=False)
animals_percent = data[ANIMALS_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(animals_freq, animals_percent, ANIMALS_MAP)

And here is the output:


         Frequencies  Percentages
Yes             9093     0.211009
No             32585     0.756155
Unknown         1415     0.032836

From this we see that a considerable share of the respondents (21%) have had some uneasy experience with animals.
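To see what the two value_counts calls and the labelling do on their own, here is a small sketch on toy data. The codes and labels (1 = Yes, 2 = No, 9 = Unknown) are assumed for the example and are not taken from the data set.

```python
import pandas as pd

# Toy data with assumed codes: 1 = Yes, 2 = No, 9 = Unknown
toy = pd.Series([1, 2, 2, 9, 2, 1, 2, 2])
labels = {1: 'Yes', 2: 'No', 9: 'Unknown'}

# Raw counts and proportions (normalize=True divides by the total)
freq = toy.value_counts(sort=False)
percent = toy.value_counts(sort=False, normalize=True)

# Replace the numeric codes in the index with labels and put
# the two series side by side as columns
table = pd.concat(dict(Frequencies=freq.rename(labels),
                       Percentages=percent.rename(labels)), axis=1)
print(table)
```

Note that normalize=True returns proportions between 0 and 1, so 0.21 in the Percentages column reads as 21%.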

My next variable was origin or descent. The thing is that there are quite a number of distinct values (60 in total). By the way, I counted the unique values like this:

unique_origins = data[ORIGIN_MAP[CODE]].unique()
print('num distinct origins:', len(unique_origins))
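As a side note, pandas also has nunique, which gives the count directly. A small sketch on toy data (the values are made up) shows the one difference to keep in mind: unique keeps a missing value as a separate entry, while nunique skips it by default.

```python
import pandas as pd

# Toy series with one missing value
toy_origins = pd.Series([1, 19, 15, 1, 19, 1, None])

n_unique = len(toy_origins.unique())             # NaN counts as a value
n_nunique = toy_origins.nunique()                # NaN is excluded by default
n_nunique_all = toy_origins.nunique(dropna=False)  # NaN is included

print(n_unique, n_nunique, n_nunique_all)
```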

Some are really numerous (like African American, 7684 occurrences, about 18%); others are very few (like Malaysian, 11 occurrences, about 0.025%). So here I will show only the most frequent ones (over 900 occurrences). The code was:

origins_freq = data[ORIGIN_MAP[CODE]].value_counts()
origins_percent = data[ORIGIN_MAP[CODE]].value_counts(normalize=True)
label_print(origins_freq, origins_percent, ORIGIN_MAP)

And this is the top of the output:


                                                    Frequencies  Percentages
African American (Black, Negro, or Afro-American)          7684     0.178312
German                                                     5345     0.124034
English                                                    4455     0.103381
Irish                                                      3066     0.071148
Mexican                                                    2578     0.059824
Unknown                                                    1855     0.043046
Mexican-American                                           1758     0.040795
Other                                                      1739     0.040355
Italian                                                    1555     0.036085
French                                                     1048     0.024319
Puerto Rican                                                997     0.023136
American Indian (Native American)                           975     0.022625
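The snippet above computes all the rows, and the cut at 900 occurrences is not shown. One possible way to make such a cut, sketched on toy data with a made-up threshold, is a boolean mask on the frequency series:

```python
import pandas as pd

# Toy data: 'A' occurs 5 times, 'B' 3 times, 'C' once
toy = pd.Series(['A'] * 5 + ['B'] * 3 + ['C'] * 1)

freq = toy.value_counts()
percent = toy.value_counts(normalize=True)

threshold = 2  # stands in for the 900 cut-off used in the post

# Keep only the values that occur more often than the threshold
top_freq = freq[freq > threshold]
# Select the matching rows from the proportions by label
top_percent = percent.loc[top_freq.index]

print(pd.concat(dict(Frequencies=top_freq, Percentages=top_percent), axis=1))
```

The proportions are still computed against the full sample, so they do not add up to 1 after the cut.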

Last, there is the health variable. The code snippet has nothing new:

health_freq = data[HEALTH_MAP[CODE]].value_counts(sort=False)
health_percent = data[HEALTH_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(health_freq, health_percent, HEALTH_MAP)

The output:


           Frequencies  Percentages
Very good        12424     0.288307
Excellent        12316     0.285800
Good             10649     0.247117
Fair              5219     0.121110
Poor              2219     0.051493
Unknown            266     0.006173

And I also played with subsetting. I made a subset based on three conditions:

  • The respondent should have experienced animal fear
  • The respondent should be of one of the top origins (excluding Other and Unknown)
  • The respondent should perceive their health as poor.

Here is the code snippet:

condition_ap = data[ANIMALS_MAP[CODE]] == 1
condition_health = data[HEALTH_MAP[CODE]] == 5
condition_origin = data[ORIGIN_MAP[CODE]].isin([1, 19, 15, 18, 27, 29, 36, 35, 39, 3])
raw_subset = data[(condition_ap & condition_health & condition_origin)]
subset = raw_subset.copy()
print('Subset: top origins + poor perceived health + have AP')
origins_ap_freq = subset[ORIGIN_MAP[CODE]].value_counts(sort=False)
origins_ap_percent = subset[ORIGIN_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(origins_ap_freq, origins_ap_percent, ORIGIN_MAP)

Here is the output:

Subset: top origins + poor perceived health + have AP


                                                   Frequencies  Percentages
African American (Black, Negro, or Afro-American)          217     0.430556
American Indian (Native American)                           27     0.053571
English                                                     67     0.132937
French                                                      10     0.019841
German                                                      54     0.107143
Irish                                                       35     0.069444
Italian                                                     17     0.033730
Mexican                                                     22     0.043651
Mexican-American                                            20     0.039683
Puerto Rican                                                35     0.069444

To wrap up: I now have a frequency table for each of my three variables.
For animal fear/avoidance, a fair share of the respondents have experienced it (21%). It may be instructive to look at some other specific phobias for comparison, but for now I can just note that this share is by no means small.

For origin or descent, we see that the leaders among the respondents’ origins (60 in total) are:

  • African American (Black, Negro, or Afro-American) (18%)
  • German (12%)
  • English (10%)

And the fewest are:

  • Jordanian and Malaysian (0.025% each)
  • Samoan (0.02%)

As to health, most of the respondents (82%) rate their health as good, very good or excellent, and only 5% rate it as poor. I wondered whether this distribution would change for the subset of those with animal phobia, so I had a look. Here is the code snippet:

raw_subset = data[(condition_ap)]
subset_health_ap = raw_subset.copy()
print('\nSubset: perceived health + have AP')
health_ap_freq = subset_health_ap[HEALTH_MAP[CODE]].value_counts(sort=True, dropna=False)
health_ap_percent = subset_health_ap[HEALTH_MAP[CODE]].value_counts(sort=True, dropna=False, normalize=True)
label_print(health_ap_freq, health_ap_percent, HEALTH_MAP)

And here is the output:

Subset: perceived health + have AP


           Frequencies  Percentages
Good              2468     0.271418
Very good         2424     0.266579
Excellent         2074     0.228088
Fair              1449     0.159353
Poor               661     0.072693
Unknown             17     0.001870

So in the subset of those with animal fears, the distribution is indeed a bit different. The view that one's health is at least good is still predominant (about 77%). But the order of the top three has changed: Good is now the most frequent of the three (it used to be the least frequent), and Excellent is the least frequent (it used to be in the middle). The share of those who think their health is poor has also increased, to 7%. It is too early to jump to any conclusions, though. I do not even know whether this difference is significant.
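To put the two health distributions side by side, here is a small sketch that takes the frequencies from the two tables above and recomputes the shares. It is only a descriptive comparison and says nothing about significance; the column names All and AnimalPhobia are made up for the example.

```python
import pandas as pd

order = ['Excellent', 'Very good', 'Good', 'Fair', 'Poor', 'Unknown']

# Frequencies copied from the two output tables above
all_resp = pd.Series([12316, 12424, 10649, 5219, 2219, 266], index=order)
with_ap = pd.Series([2074, 2424, 2468, 1449, 661, 17], index=order)

# Convert counts to shares and line them up as columns
comparison = pd.concat(dict(All=all_resp / all_resp.sum(),
                            AnimalPhobia=with_ap / with_ap.sum()), axis=1)
comparison['Difference'] = comparison['AnimalPhobia'] - comparison['All']
print(comparison.round(3))

# Combined share of Excellent, Very good and Good in each group
good_or_better = comparison.loc[['Excellent', 'Very good', 'Good'],
                                ['All', 'AnimalPhobia']].sum()
print(good_or_better.round(2))
```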

As to the subset (above) based on origin, poor perceived health and animal phobia – well, it was purely technical. There is little to be concluded or observed based on the frequency table. Some other approach will be necessary. For now, I am just glad I discovered that nice .isin method for subsetting.

Working with my variables, I have not encountered any missing values, so I did not have to deal with them technically (although I used the dropna parameter a couple of times just in case). But all three variables have something similar: the Unknown category. I have not decided how to deal with it yet, but most probably I will just drop it in future.
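One possible way to drop the Unknown category later is to turn its code into a missing value, so that pandas leaves it out of the counts by default. This is only a sketch on toy data, and the code 9 for Unknown is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data with assumed codes: 1 = Yes, 2 = No, 9 = Unknown
toy = pd.Series([1, 2, 9, 2, 1, 9, 2])

# Mark the assumed Unknown code as missing
cleaned = toy.replace(9, np.nan)

# dropna=False still shows how many values are missing
with_missing = cleaned.value_counts(dropna=False)
# By default missing values are left out, so the shares
# are computed over the known answers only
without_missing = cleaned.value_counts(normalize=True)

print(with_missing)
print(without_missing)
```

With this approach the shares of the remaining categories go up, because the denominator no longer includes the Unknown answers.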

