Coursera: Data Management and Visualization – Assignment 3

This assignment was presumably about learning some data management. It was not quite clear to me what exactly that is supposed to mean, but OK, I just played a bit further with my dataset. I also found the instructions regarding the blog post a bit vague. In fact, the presentation requirements were just like last time:

  • The script;
  • The output (“that displays at least 3 of your data managed variables as frequency distributions”);
  • Some comments (“describing these frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.”).

The full version of my Python script is here. And below is what I have done about my variables so far.

Like previously, I stick to my three basic variables from the NESARC dataset, which are:

  • The experience of animal phobia;
  • The origin (or descent);
  • Perceived health.

But this time I decided to use some other variables for comparison.

Animal phobia

In my literature review I mentioned a study that made a distinction between pure and mixed (that is, combined with some other specific phobias) animal phobia. So I thought it might be instructive to also look at these two types separately. This required a number of tricks, which are supposedly what is called data management. These steps were:

  • Take the values from all specific phobia variables (there are 11 of them including animal phobia) and store them in new columns having replaced all ‘No-s’ and ‘Unknowns’ with 0 (so that they only have 1 if there was this phobia experience and 0 if there was not or it is unknown);
  • Create a new column and fill it with the row-wise sum of those recoded specific phobia columns;
  • Take a subset of my dataset, which only includes respondents with animal phobia experience (that is, all values in the corresponding column == 1);
  • Recode the new column with summed results so that it only keeps 1 values and those greater than 1 (indicating there are other phobias apart from animals) are 0;
  • Use this recoded column to distinguish pure and mixed cases of animal phobia.

Here is the code snippet:

# Create new columns to store recoded values for different kinds of specific phobia
for phobia in ALL_SPECIFIC_PHOBIAS:
    data[phobia[CODE] + '_NEW'] = data[phobia[CODE]].replace([2, 9], 0)

# Sum up all values for phobias in new columns and store the result in a new column 'APPUREMIXED'
# (sp_new_list is the list of the '_NEW' column names built in the full script)
data[APPUREMIXED] = data.loc[:, sp_new_list].sum(axis=1)
condition_for_replace = data[APPUREMIXED] > 1
data.loc[condition_for_replace, APPUREMIXED] = 0  # replace values > 1 with 0

# Among respondents with animal phobia: 1 = pure, 0 = mixed
appuremixed_freq = data[data[ANIMALS_MAP[CODE]] == 1][APPUREMIXED].value_counts(sort=False, dropna=False)
appuremixed_percent = data[data[ANIMALS_MAP[CODE]] == 1][APPUREMIXED].value_counts(sort=False, dropna=False, normalize=True)

print('\nFrequencies, percentages for pure and mixed animal phobia')
print(pd.concat(dict(Frequencies=appuremixed_freq.rename({1: 'Pure', 0: 'Mixed'}),
                     Percentages=appuremixed_percent.rename({1: 'Pure', 0: 'Mixed'})), axis=1))

The resulting frequency distributions:

Frequencies, percentages for pure and mixed animal phobia
        Frequencies  Percentages
 Mixed         6836     0.751787
 Pure          2257     0.248213

I wonder if it could be done in a simpler way.
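One possibly simpler route, as a minimal sketch on toy data: count the "Yes" answers per row directly, without the intermediate recoded columns. The column names below (`ANIMALS`, `HEIGHTS`, `FLYING`) are made up for illustration and are not the NESARC codes; the values follow the same coding as above (1 = Yes, 2 = No, 9 = Unknown).

```python
import pandas as pd

# Toy data: 1 = Yes, 2 = No, 9 = Unknown (hypothetical column names)
toy = pd.DataFrame({
    'ANIMALS': [1, 1, 1, 2, 1],
    'HEIGHTS': [2, 1, 9, 1, 2],
    'FLYING':  [2, 2, 2, 2, 1],
})
phobia_cols = ['ANIMALS', 'HEIGHTS', 'FLYING']

# Number of specific phobias reported per respondent
# (No and Unknown both count as "not reported")
n_phobias = toy[phobia_cols].eq(1).sum(axis=1)

# Keep only respondents with animal phobia and label each as pure or mixed
has_ap = toy['ANIMALS'].eq(1)
kind = n_phobias[has_ap].eq(1).map({True: 'Pure', False: 'Mixed'})

freq = kind.value_counts()
summary = pd.DataFrame({'Frequencies': freq,
                        'Percentages': freq / freq.sum()})
print(summary)
```

Since `eq(1)` is False for both 2 and 9, this treats No and Unknown the same way the `replace([2, 9], 0)` step does, so the pure/mixed split should match, assuming the columns only contain those codes.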

Origin or descent

With origins it was even trickier. I wanted to see the percentage of respondents with animal phobia for each kind of origin separately, but I failed to find a way to do it directly on this dataset. I am sure there are better ways to handle this (maybe via grouping? but again, I still do not quite understand the mechanics). So I just created a new dataframe to store all the data necessary to calculate these percentages. Here are the steps:

  • Get frequencies for origins;
  • Get frequencies for origins for the respondents with animal phobia;
  • Combine these two results into a new dataframe with origin names as indices and two frequencies variables as columns;
  • Calculate percentages and store them in a new column.

I also replaced Unknown and Other origins with NaN values and dropped them when creating the new dataframe.

Code snippet:

# Convert Unknown and Other to NaN
data[ORIGIN_MAP[CODE]] = data[ORIGIN_MAP[CODE]].replace([98, 99], np.nan)

# Get frequencies by origin
origins = data[ORIGIN_MAP[CODE]].value_counts(sort=False, dropna=True)
# Get origin frequencies based on the condition that respondents have animal phobia
condition = data[ANIMALS_MAP[CODE]] == 1
origins_with_ap = data[condition][ORIGIN_MAP[CODE]].value_counts(sort=False, dropna=True)

# Create a new dataframe out of these two frequency series
origins_df = origins.rename(ORIGIN_MAP[VALUES]).to_frame(name='ORIGCOUNTS')
origins_df['ORIGAPCOUNTS'] = origins_with_ap.rename(ORIGIN_MAP[VALUES])

# Create a new column in this new df to store percentages
origins_df['APPERCENT'] = origins_df['ORIGAPCOUNTS'] / origins_df['ORIGCOUNTS']

And here are the top and the bottom of the sorted output (I printed only the column with percentages):

Turkish                                                                 0.315789
African American (Black, Negro, or Afro-American)                       0.301406
Other Caribbean or West Indian (Spanish Speaking)                       0.291667
Filipino                                                                0.269058
African (e.g., Egyptian, Nigerian, Algerian)                            0.264706
Guamanian                                                               0.263158
Vietnamese                                                              0.257426
Other Spanish                                                           0.253623
Other Caribbean or West Indian (Non-Spanish Speaking)                   0.252475
Canadian                                                                0.250000
...
Israeli                                                                 0.148936
Russian                                                                 0.138756
Indonesian                                                              0.137931
Chinese                                                                 0.133987
Other Eastern European (Romanian, Bulgarian, Albanian)                  0.114035
Iranian                                                                 0.106383
Iraqi                                                                   0.100000
Samoan                                                                  0.100000
Jordanian                                                               0.090909
Australian, New Zealander                                               0.078947
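As for the grouping idea mentioned above, here is a minimal sketch of how it might look with pandas `groupby`. It runs on made-up toy data, and the column names (`ORIGIN`, `ANIMALS`, `HAS_AP`) are hypothetical stand-ins rather than the NESARC codes, so treat it as an illustration of the mechanics only.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names; 1 = Yes, 2 = No for the phobia column
toy = pd.DataFrame({
    'ORIGIN': ['Dutch', 'Dutch', 'Irish', 'Irish', 'Irish', np.nan],
    'ANIMALS': [1, 2, 1, 1, 2, 1],
})

# Boolean flag: True where the respondent reported animal phobia
toy['HAS_AP'] = toy['ANIMALS'].eq(1)

# groupby skips rows whose origin is NaN by default
by_origin = toy.groupby('ORIGIN')['HAS_AP'].agg(
    ORIGCOUNTS='count',    # respondents per origin
    ORIGAPCOUNTS='sum',    # of those, how many reported animal phobia
    APPERCENT='mean',      # share = mean of a boolean column
)
print(by_origin.sort_values('APPERCENT', ascending=False))
```

The trick is that the mean of a True/False column is exactly the share of True values, so counts and percentages come out of a single grouping step instead of two separate frequency series.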

As was shown in my previous assignment, the overall percentage of animal phobia was 21%. On the origin level, though, these percentages vary considerably. There are much lower values for some (like Australian, New Zealander, about 8%) and higher values for others (e.g. African American, 30%).

However, these results do not all carry the same weight, so to speak. For example, the Turkish origin is at the very top with an animal phobia rate of about 32%. But there are only 19 respondents with this origin in the whole dataset, and 6 of them had this animal fear experience. One might doubt that a result based on such a tiny sample is trustworthy. On the other hand, there are African Americans, who are really numerous (7684).

That is why I decided to work only with the subset of origins that have 400 or more occurrences in the dataset. I chose 400 as a threshold because it is kind of a magic number in survey research: with about 400 respondents, the margin of error for a proportion is roughly ±5 percentage points at the 95% confidence level.
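To see where the 400 comes from, here is a small sketch of the usual worst-case margin of error for a proportion. It assumes simple random sampling and the normal approximation (which is itself shaky for very small groups), and the function name `margin_of_error` is my own.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the normal-approximation 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# Group sizes mentioned in this post: Turkish, the threshold, African American
for n in (19, 400, 7684):
    print(n, round(margin_of_error(n), 3))
```

For 400 respondents this gives about ±0.049, that is roughly five percentage points, while for the 19 respondents of Turkish origin it is more than ±0.22, which is why I would not read much into their 32%.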

Here is the code:

subset_orig_gte_400 = origins_df[origins_df['ORIGCOUNTS'] >= 400].copy()
print('Origins subset (gte 400 respondents)')
print(subset_orig_gte_400.sort_values(by=['APPERCENT'], ascending=False))

As a result I got a smaller subset (20 rows instead of 60) with the following animal phobia shares:

African American (Black, Negro, or Afro-American)       0.301406
American Indian (Native American)                       0.241026
South American (e.g., Brazilian, Chilean, Columbian)    0.232323
Central American (e.g., Nicaraguan, Guatemalan)         0.228361
Puerto Rican                                            0.220662
Dutch                                                   0.203980
French                                                  0.201336
Spanish (Spain) , Portugese                             0.198819
Italian                                                 0.198071
Irish                                                   0.187867
Scottish                                                0.187335
English                                                 0.185410
Cuban                                                   0.184444
Mexican-American                                        0.184300
Norwegian                                               0.184211
Mexican                                                 0.183088
German                                                  0.179607
Swedish                                                 0.178654
Polish                                                  0.176768
Russian                                                 0.138756

Here I recall a valuable input from a peer who commented on my Assignment 2 and, among other things, mentioned that some Native Americans may have a higher animal fear rate.

It is also worth noting that none of the origins showed any extraordinary animal phobia rate, like 50% or higher.

Perceived health

For the perceived health variable I also recoded all Unknowns into NaN, just in case, and then dropped them.

More interestingly, I had a look at the perceived health distribution for those with pure animal phobia. In my previous assignment, I compared the distribution across the whole dataset with the distribution for those with animal phobia. There was some difference (in particular, the percentage of those whose perceived health is poor was slightly higher, 7% vs. 5%).

So, this time I calculated perceived health distribution for pure animal phobia and compared it with previous calculations. Code:

data[HEALTH_MAP[CODE]] = data[HEALTH_MAP[CODE]].replace(9, np.nan)

# Get percentages for perceived health distribution (for all)
health_percent = data[HEALTH_MAP[CODE]].value_counts(sort=False, dropna=True, normalize=True)

# health perception vs. animal phobia
health_ap_percent = data[data[ANIMALS_MAP[CODE]] == 1][HEALTH_MAP[CODE]].value_counts(sort=False, dropna=True, normalize=True)

# health perception vs. pure animal phobia
health_pure_ap_percent = data[(has_ap & has_pure_ap)][HEALTH_MAP[CODE]].value_counts(sort=False, dropna=True, normalize=True)

print('\nCompared distribution percentages for Perceived Health')
print(pd.concat(dict(Dataset=health_percent.rename(HEALTH_MAP[VALUES]),
                     AnimalPhobia=health_ap_percent.rename(HEALTH_MAP[VALUES]),
                     PureAnimalPhobia=health_pure_ap_percent.rename(HEALTH_MAP[VALUES])), axis=1))

Result:

Compared distribution percentages for Perceived Health
           AnimalPhobia   Dataset  PureAnimalPhobia
Excellent      0.228515  0.287576          0.299777
Fair           0.159652  0.121862          0.095323
Good           0.271926  0.248652          0.249889
Poor           0.072829  0.051813          0.044989
Very good      0.267078  0.290097          0.310022

As we can see, the share of those who perceive their health as poor is the smallest in the case of pure animal phobia. These respondents also most often perceive their health as excellent or very good. Actually, this reminds me of the study “Pure animal phobia is more specific than other specific phobias” by Vladeta Ajdacic-Gross et al., which states that “Pure animal phobia showed no associations with any included risk factors and comorbid disorders, in contrast to numerous associations found in the mixed subtype and in other specific phobias”.

Wrap up

The attempt to distinguish pure and mixed animal phobia showed a proportion of roughly 25% (pure) vs. 75% (mixed). While processing the data on the other specific phobias, I faced a dilemma on how to treat missing values (or Unknown). I saw two ways:

  • Code all Unknowns as NaN to make sure I only count the results for those cases that are definitely true. This approach would imply that pure animal phobia is phobia that we are absolutely sure is not combined with any other specific phobias.
  • Code all No-s and Unknowns as 0 and treat them equally. This would imply that pure animal phobia is phobia for which we have no evidence that it is combined with any other specific phobias.

I chose the latter. First, the approach with NaN would lead to a messier picture with lots of uncertainties to be taken into consideration. Second, and even more important, I cannot be sure that this dataset lists all possible specific phobias. By the way, I failed to find something like pyrophobia there. So, even if I try to clean out all Unknowns on the dataset level, there will still be huge unknowns outside its scope. That is why I decided not to make any difference between No and Unknown when recoding these variables.
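To make the difference between the two codings concrete, here is a toy sketch with three respondents who all have animal phobia. The column names are made up, and the "strict" variant is only my illustration of what the NaN approach could look like, not something from the actual script.

```python
import numpy as np
import pandas as pd

# Toy data: 1 = Yes, 2 = No, 9 = Unknown; all three respondents have animal phobia
toy = pd.DataFrame({
    'ANIMALS': [1, 1, 1],
    'HEIGHTS': [2, 9, 1],
    'FLYING':  [2, 2, 2],
})
others = ['HEIGHTS', 'FLYING']  # hypothetical "other specific phobia" columns

# Second option (the coding used in this post): No and Unknown both become 0,
# so "pure" means no evidence of any other phobia
lenient = toy[others].replace([2, 9], 0)
pure_lenient = lenient.sum(axis=1).eq(0)

# First option: Unknown becomes NaN, so "pure" requires an explicit No
# for every other phobia
strict = toy[others].replace({2: 0, 9: np.nan})
pure_strict = strict.eq(0).all(axis=1)

print(pd.DataFrame({'lenient': pure_lenient, 'strict': pure_strict}))
```

The second respondent (Unknown for heights) is the only one where the two codings disagree: pure under the lenient coding, not pure under the strict one.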

With the origins variable I ended up with a subset of those with 400 or more occurrences as the most representative. The top 3 origins by animal phobia percentage were African American (30%), American Indian (24%) and South American (23%). The lowest percentages are for Russian (14%), Polish (18%) and Swedish (18%). It looks like there may be some geographic pattern indeed, although Cuban, Mexican and Mexican-American origins (that is, southern ones) are somewhere in the middle. I might also want to have a look later at the distribution of pure animal phobia across the origins.

As to the perceived health variable, I compared the results for the whole dataset with the results for those with animal phobia and with pure animal phobia. I was surprised to see that those with pure animal phobia tend to rate their health better than the others. Poor: 7% for the respondents with animal phobia, including mixed cases; 5% for the whole dataset; 4% for those with pure animal phobia. Excellent: 23% (animal phobia including mixed); 29% (whole dataset); 30% (pure animal phobia).
