Coursera: Data Management and Visualization – Assignment 4

For this assignment, the task was to visualize some variables and provide a brief description of the results.

Summary

The full Python script is here.

Univariate graph of animal phobia.

data[ANIMALS_MAP[CODE]] = data[ANIMALS_MAP[CODE]].astype('category')
data[ANIMALS_MAP[CODE]] = data[ANIMALS_MAP[CODE]].cat.rename_categories(ANIMALS_MAP[VALUES])
seaborn.countplot(x=ANIMALS_MAP[CODE], data=data)
plt.xlabel('Ever had fear/avoidance of insects, snakes, birds, other animals')
plt.title('Distribution of Animal Phobia Cases')
plt.show()

This graph shows that the number of animal phobia cases is small relative to the whole sample, but still considerable.

Univariate graph of respondents’ distribution by region of origin.

# Create subset for the 20 origins that have more than 400 occurrences (calculated in Assignment 3)
condition_origin = data[ORIGIN_MAP[CODE]].isin(REGIONS)
subset_origin = data[condition_origin].copy()

# Add a new column with regions based on origin values
subset_origin['REGION'] = subset_origin.apply(assign_region, axis=1)
# Pass the column by keyword; newer seaborn versions no longer accept it positionally
seaborn.countplot(x='REGION', data=subset_origin)
plt.xlabel('Regions of Origin')
plt.xticks(rotation=20)
plt.title('Distribution by Regions of Origin')
plt.tight_layout()
plt.show()

This graph groups the many original origin values into ‘bins’ so that they fit into the plot. We see that the predominant region of origin in this dataset is Western Europe, followed by Africa, Latin America and Central Europe. Note that the graph only represents respondents whose origin had 400 or more occurrences in the dataset. (If you are wondering why Native Americans are among the regions, see the explanation below.)

Univariate graph of respondents’ distribution by health perception.

data[HEALTH_MAP[CODE]] = data[HEALTH_MAP[CODE]].astype('category')
data[HEALTH_MAP[CODE]] = data[HEALTH_MAP[CODE]].cat.rename_categories(HEALTH_VALUES)
seaborn.countplot(x=HEALTH_MAP[CODE], data=data)
plt.xlabel('Self-Perceived Current Health')
plt.title('Distribution by Health Perception')
plt.show()

This graph shows that most respondents tend to describe their health as good, very good or excellent.

Bivariate graph categorical -> categorical (region of origin [explanatory] -> type of animal phobia [response]).

condition_ap_for_origin = subset_origin[ANIMALS_MAP[CODE]] == 1
subset_origin_ap = subset_origin[condition_ap_for_origin].copy()
subset_origin_ap[APPUREMIXED] = pd.to_numeric(subset_origin_ap[APPUREMIXED], errors='coerce')
seaborn.catplot(x='REGION', y=APPUREMIXED, kind='bar', ci=None, data=subset_origin_ap)
plt.xlabel('Regions of Origin')
plt.ylabel('Proportion of Pure Animal Phobia')
plt.title('Pure Animal Phobia vs. Region of Origin')
plt.xticks(rotation=20)
plt.show()

Again, the explanation regarding Native Americans is below.

That said, we see in this graph that Native Americans and respondents of African origin show the lowest proportions of pure animal phobia. Latin Americans are third lowest. That is interesting, because these particular groups (especially Native Americans and those of African origin) show the highest rates of animal phobia overall (mixed and pure together); see below.

Detail

Animal phobia

As the previous assignment showed, there may be a considerable difference between pure and mixed animal phobia. It was particularly striking when comparing perceived health for all respondents with animal phobia (mixed and pure together) against those with pure animal phobia alone. Briefly, those with pure animal phobia tend to rate their health better than those with mixed animal phobia (whose results were even slightly worse than for the whole dataset).

So, I decided to focus on this distinction. Here is the distribution of the types of animal phobia only for the respondents who had this experience.

condition_ap = data[ANIMALS_MAP[CODE]] == 1
subset_ap = data[condition_ap].copy()  # Make a subset of those with animal phobia
subset_ap[APPUREMIXED] = subset_ap[APPUREMIXED].astype('category')
subset_ap[APPUREMIXED] = subset_ap[APPUREMIXED].cat.rename_categories(['Mixed', 'Pure'])
seaborn.countplot(x=APPUREMIXED, data=subset_ap)
plt.xlabel('Types of Animal Phobia')
plt.title('Distribution of Pure and Mixed Animal Phobia')
plt.show()

The ratio of pure to mixed animal phobia is actually similar to the ratio of animal phobia to no animal phobia in the first graph.

Health perception -> Type of AP

So, getting to the health perception. If a person with animal phobia perceives their health as good, is it reasonable to expect that their AP is pure?

condition_ap = data[ANIMALS_MAP[CODE]] == 1
subset_ap = data[condition_ap].copy()  # Make a subset of those with animal phobia

# Group the health categories into two (good / not good)
subset_ap['HEALTHBINARY'] = subset_ap.apply(sort_health, axis=1)
seaborn.catplot(x=APPUREMIXED, y='HEALTHBINARY', kind='bar', ci=None, data=subset_ap)
plt.xlabel('Type of Animal Phobia')
plt.ylabel('Proportion of Good Perceived Health')
plt.title('Perceived Health -> Type of AP')
plt.xticks(rotation=20)
plt.tight_layout()
plt.show()

Well, not exactly. Although the proportion of good perceived health is somewhat higher among those with pure AP, it is still high among those with mixed AP: about 73% for mixed versus 85% for pure, nothing impressive.

By the way, to produce this graph I had to sort all the health categories into two. I chose Good (Excellent, Very good, Good) and Not good (Fair, Poor).
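The `sort_health` helper is not shown in this post. A minimal sketch of what such a function might look like, assuming the health column holds the labels listed above (the exact label strings are an assumption) and using a hypothetical column name `HEALTH` in place of `HEALTH_MAP[CODE]`:

```python
import math

# Hypothetical column name; the post uses HEALTH_MAP[CODE] instead.
HEALTH_COLUMN = 'HEALTH'

# Grouping described in the post: Good = Excellent, Very good, Good;
# Not good = Fair, Poor. The exact label strings are an assumption.
GOOD = {'Excellent', 'Very good', 'Good'}
NOT_GOOD = {'Fair', 'Poor'}


def sort_health(row):
    """Return 1 for good perceived health, 0 for not good, NaN otherwise."""
    value = row[HEALTH_COLUMN]
    if value in GOOD:
        return 1
    if value in NOT_GOOD:
        return 0
    return math.nan  # unknown or missing answers are left out of the mean


# Works on any mapping-like row, e.g. a dict or a pandas Series.
print(sort_health({HEALTH_COLUMN: 'Very good'}))  # 1
print(sort_health({HEALTH_COLUMN: 'Poor'}))       # 0
```

Because the result is coded 1/0, the mean that seaborn's bar plot computes by default is the proportion of good perceived health in each group.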

UPD: I messed up this part by confusing the explanatory and response variables. A corrected version of this analysis is in a later assignment (week two of Data Analysis Tools, the next course in the specialization).
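To illustrate the mix-up: the question above treats health perception as the explanatory variable and type of AP as the response, so the relevant figure is the share of pure AP within each health group, not the share of good health within each AP type. Here is a sketch with made-up counts (not the dataset's numbers, and not the version from the later assignment) showing that the two directions generally give different answers:

```python
# Made-up counts of respondents by (health, AP type); not real data.
counts = {
    ('Good', 'Pure'): 30,
    ('Good', 'Mixed'): 60,
    ('Not good', 'Pure'): 5,
    ('Not good', 'Mixed'): 25,
}


def share(counts, given_index, given_value, target_index, target_value):
    """Share of target_value among the cells where the given variable matches."""
    matching = {k: n for k, n in counts.items() if k[given_index] == given_value}
    total = sum(matching.values())
    hits = sum(n for k, n in matching.items() if k[target_index] == target_value)
    return hits / total


# What the plot above shows: share of good health among those with pure AP.
good_given_pure = share(counts, 1, 'Pure', 0, 'Good')
# What the question asks: share of pure AP among those with good health.
pure_given_good = share(counts, 0, 'Good', 1, 'Pure')
print(round(good_given_pure, 2), round(pure_given_good, 2))
```

In this toy table, most people with pure AP report good health, yet only a minority of those reporting good health have pure AP, so the direction of the comparison matters.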

Origin / Descent

This variable turned out to be challenging again. Out of 60 original categories I used only 20 (those that had 400 or more occurrences). Still, that was too many to fit properly into a plot, so I had to group them into ‘bins’, or generalized categories, based on some principle. I chose regional classification as this principle. Specifically, I tried to use the UN geoscheme. Along the way, I found out that this principle is not exactly robust, because there are numerous versions of such a classification. As a result, my sorting is very approximate, but I think it is good enough for this course’s purposes.

There was one origin, however, that I could not fit into this regional classification: the Native Americans. As I initially chose this variable as a potential marker of different cultural backgrounds, I wanted to keep this cultural distinction within my regional groupings as well. In most cases, I think, it is roughly reflected in political geography. But definitely not in this case, because otherwise Native Americans would be merged with Latin or Northern Americans. My solution was to keep them as a separate group.
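The `assign_region` helper that performs this grouping is not shown in the post. A rough sketch of one possible shape is a lookup table from origin to region. The origin labels, the `ORIGIN` column name, and the specific assignments below are illustrative assumptions, not the actual mapping used in the script:

```python
# Hypothetical column name; the post uses ORIGIN_MAP[CODE] instead.
ORIGIN_COLUMN = 'ORIGIN'

# Illustrative lookup table only; the real script maps 20 origins.
REGION_BY_ORIGIN = {
    'German': 'Western Europe',
    'Italian': 'Southern Europe',
    'Mexican': 'Latin America',
    'African American': 'Africa',
    # Kept as its own group rather than merged into a geographic region.
    'Native American': 'Native Americans',
}


def assign_region(row):
    """Return the region for a row's origin, or None if it is not mapped."""
    return REGION_BY_ORIGIN.get(row[ORIGIN_COLUMN])


# Works on any mapping-like row, e.g. a dict or a pandas Series.
print(assign_region({ORIGIN_COLUMN: 'Italian'}))          # Southern Europe
print(assign_region({ORIGIN_COLUMN: 'Native American'}))  # Native Americans
```

Keeping the mapping in a single dictionary makes it easy to revise when a different version of the regional classification is preferred.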

Now for the proportions. In my previous assignment, I found considerable variability in the animal phobia rate depending on origin. In the new regionalized version, this variability is still present.

    subset_origin[ANIMALS_MAP[CODE]] = subset_origin[ANIMALS_MAP[CODE]].replace(9, np.nan)
    subset_origin[ANIMALS_MAP[CODE]] = subset_origin[ANIMALS_MAP[CODE]].replace(2, 0)
    subset_origin[ANIMALS_MAP[CODE]] = subset_origin[ANIMALS_MAP[CODE]].astype('category')
    print(subset_origin[ANIMALS_MAP[CODE]].describe())
    subset_origin[ANIMALS_MAP[CODE]] = pd.to_numeric(subset_origin[ANIMALS_MAP[CODE]], errors='coerce')
    seaborn.catplot(x='REGION', y=ANIMALS_MAP[CODE], kind='bar', ci=None, data=subset_origin)
    plt.xlabel('Regions of Origin')
    plt.ylabel('Proportion of Animal Phobia')
    plt.title('Animal Phobia vs. Region of Origin')
    plt.xticks(rotation=20)
    plt.show()

We see the highest rates of animal phobia for Africa and Native Americans, followed by Southern Europe and Latin America.
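The recoding in the snippet above (9 to missing, 2 to 0) is what lets the bar heights be read as proportions: the mean of a 0/1 column is the share of 1s. A small plain-Python sketch of that computation, assuming the codes 1 = yes, 2 = no and 9 = unknown (implied by the recoding) and using made-up rows rather than the real data:

```python
from collections import defaultdict


def recode(answer):
    """Map 1 (yes) to 1, 2 (no) to 0, and anything else (e.g. 9) to None."""
    return {1: 1, 2: 0}.get(answer)


def proportion_by_group(rows):
    """Mean of the recoded answer per group, skipping missing values."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of 1s, valid count]
    for group, answer in rows:
        value = recode(answer)
        if value is None:
            continue
        totals[group][0] += value
        totals[group][1] += 1
    return {group: ones / count for group, (ones, count) in totals.items()}


# Made-up (region, answer) pairs purely for illustration.
toy_rows = [
    ('Region A', 1), ('Region A', 2), ('Region A', 2), ('Region A', 9),
    ('Region B', 1), ('Region B', 1), ('Region B', 2), ('Region B', 2),
]
result = proportion_by_group(toy_rows)
print(result)
```

Note that the unknown answer in Region A is dropped from both the numerator and the denominator, which mirrors how a mean over a column with missing values behaves in pandas.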

This picture changes drastically, however, if we look at the proportions of pure animal phobia by region. That graph was already posted above, in the Summary section, but I will reproduce it here.

Here we see that the proportion of pure animal phobia is lowest for exactly those groups that had the highest overall animal phobia rates. Namely, Native Americans, Africa and Latin America show the smallest proportions of pure animal phobia compared to the others.

I do not think I can conclude anything from this observation, but I find it interesting enough to keep in mind for the future.
