Coursera: Data Management and Visualization – Assignment 4

For this assignment the task was to visualize some variables and provide a brief description of the result.


The full Python script is here.

Univariate graph of animal phobia.

data[ANIMALS_MAP[CODE]] = data[ANIMALS_MAP[CODE]].astype('category')
data[ANIMALS_MAP[CODE]] = data[ANIMALS_MAP[CODE]].cat.rename_categories(ANIMALS_MAP[VALUES])
seaborn.countplot(x=ANIMALS_MAP[CODE], data=data)
plt.xlabel('Ever had fear/avoidance of insects, snakes, birds, other animals')
plt.title('Distribution of Animal Phobia Cases')

This graph shows that the number of the cases with animal phobia is rather small, but still considerable.

Univariate graph of respondents’ distribution by the region of origin.

# Create subset for the 20 origins that have more than 400 occurrences (calculated in Assignment 3)
condition_origin = data[ORIGIN_MAP[CODE]].isin(REGIONS)
subset_origin = data[condition_origin].copy()

# Add a new column with regions based on origin values
subset_origin['REGION'] = subset_origin.apply(assign_region, axis=1)
plt.xlabel('Regions of Origin')
plt.title('Distribution by Regions of Origin')

This graph is the product of sorting the many original values into ‘bins’ so that they fit into the image. We see that the predominant region of origin in this dataset is Western Europe, followed by Africa, Latin America and Central Europe. Note that the graph only represents the distribution for the respondents whose origin had 400 or more occurrences in the dataset. (If you are wondering why Native Americans are among the regions, please find the explanation below.)

Univariate graph of respondents’ distribution by the health perception.

data[HEALTH_MAP[CODE]] = data[HEALTH_MAP[CODE]].astype('category')
data[HEALTH_MAP[CODE]] = data[HEALTH_MAP[CODE]].cat.rename_categories(HEALTH_VALUES)
plt.xlabel('Self-Perceived Current Health')
plt.title('Distribution by Health Perception')

This graph shows that most respondents tend to describe their health as good, very good or excellent.

Bivariate graph categorical -> categorical (region of origin [explanatory] -> type of animal phobia [response]).

condition_ap_for_origin = subset_origin[ANIMALS_MAP[CODE]] == 1
subset_origin_ap = subset_origin[condition_ap_for_origin].copy()
subset_origin_ap[APPUREMIXED] = pd.to_numeric(subset_origin_ap[APPUREMIXED], errors='coerce')
seaborn.catplot(x='REGION', y=APPUREMIXED, kind='bar', ci=None, data=subset_origin_ap)
plt.xlabel('Regions of Origin')
plt.ylabel('Proportion of Pure Animal Phobia')
plt.title('Pure Animal Phobia vs. Region of Origin')

Again the explanations regarding Native Americans are below.

That said, we see in this graph that Native Americans and the respondents of African origin demonstrate the lowest proportion of pure animal phobia. Third lowest are Latin Americans. That is interesting, as these particular groups (especially Native Americans and those of African origin) demonstrate the highest rates of animal phobia in general (mixed and pure taken together), see below.


Animal phobia

As it turned out in the previous assignment, there may be a considerable difference between pure and mixed animal phobia. It was particularly striking when I compared perceived health for all respondents with animal phobia (mixed and pure together) and for those with pure animal phobia alone. Briefly, those with pure animal phobia tend to rate their health better than those with animal phobia overall, mixed cases included (whose results were even slightly worse than for the whole dataset).

So, I decided to focus on this distinction. Here is the distribution of the types of animal phobia only for the respondents who had this experience.

condition_ap = data[ANIMALS_MAP[CODE]] == 1
subset_ap = data[condition_ap].copy()  # Make a subset of those with animal phobia
subset_ap[APPUREMIXED] = subset_ap[APPUREMIXED].astype('category')
subset_ap[APPUREMIXED] = subset_ap[APPUREMIXED].cat.rename_categories(['Mixed', 'Pure'])
seaborn.countplot(x=APPUREMIXED, data=subset_ap)
plt.xlabel('Types of Animal Phobia')
plt.title('Distribution of Pure and Mixed Animal Phobia')

The share of pure cases among the respondents with animal phobia (about 25%) is actually close to the share of animal phobia in the whole dataset (about 21%).

Health perception -> Type of AP

So, getting to the health perception. If a person with animal phobia perceives their health as good, is it reasonable to expect that their AP is pure?

condition_ap = data[ANIMALS_MAP[CODE]] == 1
subset_ap = data[condition_ap].copy()  # Make a subset of those with animal phobia

# group health categories into two
subset_ap['HEALTHBINARY'] = subset_ap.apply(sort_health, axis=1)
seaborn.catplot(x=APPUREMIXED, y='HEALTHBINARY', kind='bar', ci=None, data=subset_ap)
plt.xlabel('Type of Animal Phobia')
plt.ylabel('Proportion of Good Perceived Health')
plt.title('Perceived Health -> Type of AP')

Well, it looks like not exactly. Although the proportion of good perceived health is somewhat higher for pure AP, it is still high for mixed AP: about 73% for mixed vs. 85% for pure, nothing impressive.

By the way, to produce this graph I had to sort all the health categories into two. I chose Good (Excellent, Very good, Good) and Not good (Fair, Poor).
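The grouping described above could look roughly like the sketch below. This is not the function from my full script, just a minimal illustration. It assumes the usual NESARC coding for S1Q16 (1 = Excellent, 2 = Very good, 3 = Good, 4 = Fair, 5 = Poor, 9 = Unknown) and uses a toy dataframe; the column name constant is hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed coding for S1Q16: 1=Excellent, 2=Very good, 3=Good, 4=Fair, 5=Poor, 9=Unknown
HEALTH_COL = 'S1Q16'  # hypothetical constant for this sketch


def sort_health(row):
    """Return 1 for Good (codes 1-3), 0 for Not good (codes 4-5), NaN otherwise."""
    value = row[HEALTH_COL]
    if value in (1, 2, 3):
        return 1
    if value in (4, 5):
        return 0
    return np.nan


# Toy data with one respondent per code
toy = pd.DataFrame({HEALTH_COL: [1, 2, 3, 4, 5, 9]})
toy['HEALTHBINARY'] = toy.apply(sort_health, axis=1)

# The mean of a 0/1 column is the proportion of 1s; NaN (Unknown) is skipped
good_share = toy['HEALTHBINARY'].mean()
```

Because the new column only holds 0 and 1, a bar plot of its mean per group shows the proportion of good perceived health directly.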

UPD: Messed up this part by confusing explanatory and response variables. A better version of the same stuff is in a later assignment (for week two, Data Analysis Tools, which is the next course in the specialization).

Origin / Descent

This variable turned out to be challenging again. Out of 60 original categories I used only 20 (those that had 400 or more occurrences). Still, they were too many to properly fit them into a plot. So, presumably, I had to group them into some ‘bins’, or generalized categories, based on some principle. I chose regional classification as this principle. Specifically, I tried to use the UN geoscheme. On the way, I found out that this principle was not exactly robust, because there are really numerous versions of such a classification. As a result, my sorting was extremely approximate, but I think it is good enough for this course’s purposes.

There was one origin, however, that I could not fit into this regional classification: the Native Americans. As I initially chose this variable as a potential marker of different cultural backgrounds, I wanted to keep this cultural distinction within my regional groupings as well. In most cases, I think, it is roughly reflected in political geography. But definitely not in this case, because otherwise Native Americans would be merged with Latin or Northern Americans. My solution was to actually keep them as is.
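To give an idea of how such a regional grouping can be coded, here is a minimal sketch. The origin labels come from the frequency tables in these posts, but the column name and the region assigned to each origin are assumptions made for illustration only; my script works with the numeric origin codes and its mapping may differ.

```python
import pandas as pd

# Illustrative mapping only: which region each origin belongs to is an assumption
REGION_BY_ORIGIN = {
    'German': 'Western Europe',
    'French': 'Western Europe',
    'Polish': 'Central Europe',
    'Italian': 'Southern Europe',
    'Mexican': 'Latin America',
    'Puerto Rican': 'Latin America',
    # Kept as a separate group, as explained above
    'American Indian (Native American)': 'Native Americans',
}


def assign_region(row, origin_col='ORIGIN_LABEL'):
    """Look up the region for a row's origin; unmapped origins become 'Other'."""
    return REGION_BY_ORIGIN.get(row[origin_col], 'Other')


toy = pd.DataFrame({'ORIGIN_LABEL': ['German', 'Mexican',
                                     'American Indian (Native American)', 'Samoan']})
toy['REGION'] = toy.apply(assign_region, axis=1)

# The same result without a row-wise apply
toy['REGION_VIA_MAP'] = toy['ORIGIN_LABEL'].map(REGION_BY_ORIGIN).fillna('Other')
```

The `map` version is usually faster on a dataset of this size, because it avoids calling a Python function for every row.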

So, getting to the proportions. In my previous assignment, I found considerable variability in animal phobia rate depending on origin. In my new regionalized version this variability is still in place.

subset_origin[ANIMALS_MAP[CODE]] = subset_origin[ANIMALS_MAP[CODE]].replace(9, np.nan)
subset_origin[ANIMALS_MAP[CODE]] = subset_origin[ANIMALS_MAP[CODE]].replace(2, 0)
subset_origin[ANIMALS_MAP[CODE]] = subset_origin[ANIMALS_MAP[CODE]].astype('category')
subset_origin[ANIMALS_MAP[CODE]] = pd.to_numeric(subset_origin[ANIMALS_MAP[CODE]], errors='coerce')
seaborn.catplot(x='REGION', y=ANIMALS_MAP[CODE], kind='bar', ci=None, data=subset_origin)
plt.xlabel('Regions of Origin')
plt.ylabel('Proportion of Animal Phobia')
plt.title('Animal Phobia vs. Region of Origin')

We see the highest rates of animal phobia for Africa and [Native Americans], followed by Southern Europe and Latin America.

This picture, however, changes drastically if we look at the proportions of pure animal phobia by region. That graph was already posted above, in the Summary section, but I will reproduce it here again.

Here we see that the proportion of pure animal phobia is actually the lowest exactly for those who had the highest rates in the general animal phobia. Namely, such groups as Native Americans, Africa and Latin America show the smallest proportions of pure animal phobia compared to the others.

I do not think I can conclude anything based on this observation, but I find it interesting enough to keep in mind for the future.


Coursera: Data Management and Visualization – Assignment 3

This assignment’s target was presumably to learn some data management. Not that it was quite clear to me what it is supposed to be, but OK, I just played a bit further with my dataset. The instructions regarding the blog post I also found a bit vague. In fact, the presentation requirements were just like last time:

  • The script;
  • The output (“that displays at least 3 of your data managed variables as frequency distributions”);
  • Some comments (“describing these frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.”).

The full version of my Python script is here. And below is what I have done about my variables so far.

Like previously, I stick to my three basic variables from NESARC dataset, which are:

  • The experience of animal phobia;
  • The origin (or descent);
  • Perceived health.

But this time I decided to use some other variables for comparison.

Animal phobia

In my literature review I mentioned a study that made a distinction between pure and mixed (that is combined with some other specific phobias) animal phobia. So I thought it may be instructive to also look at these two types separately. This required a number of tricks, which are supposedly called data management. These steps were:

  • Take the values from all specific phobia variables (there are 11 of them including animal phobia) and store them in new columns having replaced all ‘No-s’ and ‘Unknowns’ with 0 (so that they only have 1 if there was this phobia experience and 0 if there was not or it is unknown);
  • Create a new column and fill it with the row-wise sum of those recoded specific phobia columns;
  • Take a subset of my dataset, which only includes respondents with animal phobia experience (that is, all values in the corresponding column == 1);
  • Recode the new column with summed results so that it only keeps 1 values and those greater than 1 (indicating there are other phobias apart from animals) are 0;
  • Use this recoded column to distinguish pure and mixed cases of animal phobia.

Here is the code snippet:

# Create new columns to store recoded values for different kinds of specific phobia
    data[phobia[CODE] + '_NEW'] = data[phobia[CODE]].replace([2, 9], 0)

# Sum up all values for phobias in new columns and store the result in a new column 'APPUREMIXED'
data[APPUREMIXED] = data.loc[:, sp_new_list].sum(axis=1)
condition_for_replace = data[APPUREMIXED] > 1
data.loc[condition_for_replace, APPUREMIXED] = 0  # replace values > 1 with 0
appuremixed_freq = data[data[ANIMALS_MAP[CODE]] == 1][APPUREMIXED].value_counts(sort=False, dropna=False)
appuremixed_percent = data[data[ANIMALS_MAP[CODE]] == 1][APPUREMIXED].value_counts(sort=False, dropna=False, normalize=True)

print('\nFrequencies, percentages for pure and mixed animal phobia')
print(pd.concat(dict(Frequencies=appuremixed_freq.rename({1: 'Pure', 0: 'Mixed'}),
                     Percentages=appuremixed_percent.rename({1: 'Pure', 0: 'Mixed'})), axis=1))

The resulting frequency distributions:

Frequencies, percentages for pure and mixed animal phobia
        Frequencies  Percentages
 Mixed         6836     0.751787
 Pure          2257     0.248213

I wonder if it could be done in a simpler way.
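One possibly simpler variant is sketched below on toy data. It is not what my script does: the three column names are made up, and it assumes the same coding as above (1 = Yes, 2 = No, 9 = Unknown). The idea is to count the Yes answers per respondent directly instead of recoding into new columns first.

```python
import numpy as np
import pandas as pd

# Toy data: three hypothetical phobia columns (1 = Yes, 2 = No, 9 = Unknown)
toy = pd.DataFrame({
    'ANIMALS': [1, 1, 1, 2, 9],
    'HEIGHTS': [2, 1, 9, 1, 2],
    'WATER':   [2, 2, 2, 2, 1],
})
phobia_cols = ['ANIMALS', 'HEIGHTS', 'WATER']

# Keep only the respondents with animal phobia
subset_ap = toy[toy['ANIMALS'].eq(1)].copy()

# Count reported phobias per respondent; No and Unknown both count as "not reported"
n_phobias = subset_ap[phobia_cols].eq(1).sum(axis=1)

# Exactly one reported phobia means animal phobia is the only one, i.e. pure
subset_ap['AP_TYPE'] = np.where(n_phobias == 1, 'Pure', 'Mixed')

counts = subset_ap['AP_TYPE'].value_counts()
shares = subset_ap['AP_TYPE'].value_counts(normalize=True)
```

This treats No and Unknown equally, which matches the recoding with `replace([2, 9], 0)` used in the snippet above.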

Origin or descent

With origins it was even trickier. I wanted to see the percentages of the respondents with animal phobia for each kind of origin separately. But I failed to find a way to do it based on this dataset. I am sure there are better ways to handle this (maybe via grouping? but again, I still do not quite understand the mechanics). So I just created a new dataframe to store all necessary data to calculate these percentages. Here are the steps:

  • Get frequencies for origins;
  • Get frequencies for origins for the respondents with animal phobia;
  • Combine these two results into a new dataframe with origin names as indices and two frequencies variables as columns;
  • Calculate percentages and store them in a new column.

I also replaced Unknown and Other origins with NaN values and dropped them when creating the new dataframe.

Code snippet:

# Convert Unknown and Other to NaN
data[ORIGIN_MAP[CODE]] = data[ORIGIN_MAP[CODE]].replace([98, 99], np.nan)

# Get frequencies by origin
origins = data[ORIGIN_MAP[CODE]].value_counts(sort=False, dropna=True)
# Get origin frequencies based on the condition that respondents have animal phobia
condition = data[ANIMALS_MAP[CODE]] == 1
origins_with_ap = data[condition][ORIGIN_MAP[CODE]].value_counts(sort=False, dropna=True)

# Create a new dataframe out of these two frequency series
origins_df = origins.rename(ORIGIN_MAP[VALUES]).to_frame(name='ORIGCOUNTS')
origins_df['ORIGAPCOUNTS'] = origins_with_ap.rename(ORIGIN_MAP[VALUES])

# Create a new column in this new df to store percentages
origins_df['APPERCENT'] = origins_df['ORIGAPCOUNTS'] / origins_df['ORIGCOUNTS']

And here is the top and the bottom of the sorted output (only the column with percentages is printed):

Turkish                                                                 0.315789
African American (Black, Negro, or Afro-American)                       0.301406
Other Caribbean or West Indian (Spanish Speaking)                       0.291667
Filipino                                                                0.269058
African (e.g., Egyptian, Nigerian, Algerian)                            0.264706
Guamanian                                                               0.263158
Vietnamese                                                              0.257426
Other Spanish                                                           0.253623
Other Caribbean or West Indian (Non-Spanish Speaking)                   0.252475
Canadian                                                                0.250000
Israeli                                                                 0.148936
Russian                                                                 0.138756
Indonesian                                                              0.137931
Chinese                                                                 0.133987
Other Eastern European (Romanian, Bulgarian, Albanian)                  0.114035
Iranian                                                                 0.106383
Iraqi                                                                   0.100000
Samoan                                                                  0.100000
Jordanian                                                               0.090909
Australian, New Zealander                                               0.078947

As was shown in my previous assignment, the overall percentage of animal phobia was 21%. On the origin level though these percentages demonstrate considerable variety. There are much lower values for some (like Australian, New Zealander, about 8%) and higher values for others (e.g. African American, 30%).

However, these results may have different weight, so to speak. For example, we see that the Turkish origin is at the very top with an animal phobia rate of about 32%. But there are only 19 respondents with this origin in the whole dataset, and 6 of them had this animal fear experience. One might doubt that a result based on such a tiny sample is trustworthy. On the other hand, there are African Americans, who are really numerous (7684).

That is why I decided to work only with a subset of those origins that have 400 or more occurrences in the dataset. I chose 400 as a threshold because it is kind of a magic number in the research area (based on sample size calculations and confidence intervals).
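To make the sample size argument a bit more tangible, here is a small worked calculation. It uses the standard normal-approximation margin of error for a proportion at the 95% level; that this is the reasoning behind the "magic" 400 is my assumption, and the approximation is rough for a group as small as 19.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


# Worst case (p = 0.5) for a group of 400 respondents
moe_400 = margin_of_error(0.5, 400)

# Turkish origin: 6 of 19 respondents reported animal phobia
moe_turkish = margin_of_error(6 / 19, 19)

# African American origin: 7684 respondents with a rate of 0.301406
moe_african_american = margin_of_error(0.301406, 7684)
```

With 400 respondents the worst-case margin is about 5 percentage points, while for the 19 Turkish respondents it is about 21 points, so their 32% could plausibly be anywhere between roughly 11% and 53%.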

Here is the code for this subset:

subset_orig_gte_400 = origins_df[origins_df['ORIGCOUNTS'] >= 400].copy()
print('Origins subset (gte 400 respondents)')
print(subset_orig_gte_400.sort_values(by=['APPERCENT'], ascending=False))

As a result I got a smaller subset (20 rows instead of 60) with the following animal phobia shares:

African American (Black, Negro, or Afro-American)       0.301406
American Indian (Native American)                       0.241026
South American (e.g., Brazilian, Chilean, Columbian)    0.232323
Central American (e.g., Nicaraguan, Guatemalan)         0.228361
Puerto Rican                                            0.220662
Dutch                                                   0.203980
French                                                  0.201336
Spanish (Spain) , Portugese                             0.198819
Italian                                                 0.198071
Irish                                                   0.187867
Scottish                                                0.187335
English                                                 0.185410
Cuban                                                   0.184444
Mexican-American                                        0.184300
Norwegian                                               0.184211
Mexican                                                 0.183088
German                                                  0.179607
Swedish                                                 0.178654
Polish                                                  0.176768
Russian                                                 0.138756

Here I recall a valuable input by a peer who commented on my Assignment 2 and, among other things, mentioned that some Native Americans may have a higher animal fear rate.

It is also worth noting that none of the origins showed any extraordinary animal phobia rate, like 50% or higher.
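Coming back to the grouping idea mentioned at the start of this section: the counts and shares per origin can probably be obtained in one step with `groupby`. The sketch below uses toy data with made-up rows and a toy threshold of 2 instead of 400; the column names mirror the ones in my snippet, but this is an alternative, not the code I actually ran.

```python
import pandas as pd

# Toy data: origin label and animal phobia answer (1 = Yes, 2 = No, 9 = Unknown)
toy = pd.DataFrame({
    'ORIGIN': ['German', 'German', 'German', 'German', 'Irish', 'Irish', 'Turkish'],
    'ANIMALS': [1, 2, 2, 9, 1, 1, 2],
})

# Boolean flag: True for respondents with animal phobia
toy['HAS_AP'] = toy['ANIMALS'].eq(1)

# Per origin: number of respondents, number with AP, share with AP
origins_df = toy.groupby('ORIGIN')['HAS_AP'].agg(['count', 'sum', 'mean'])
origins_df.columns = ['ORIGCOUNTS', 'ORIGAPCOUNTS', 'APPERCENT']

# Keep only the origins with enough respondents (toy threshold)
subset = origins_df[origins_df['ORIGCOUNTS'] >= 2].sort_values(by='APPERCENT', ascending=False)
```

As in my original calculation, respondents with an Unknown animal phobia answer stay in the denominator here.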

Perceived health

For the perceived health variable I also recoded all Unknowns into NaN, just in case, and then dropped them.

More interestingly, I had a look at the perceived health distribution for those with pure animal phobia. In my previous assignment, I compared the distribution across the whole dataset with the distribution for those with animal phobia. There was some difference (in particular, the percentage of those whose perceived health is poor was slightly higher, 7% vs. 5%).

So, this time I calculated perceived health distribution for pure animal phobia and compared it with previous calculations. Code:

data[HEALTH_MAP[CODE]] = data[HEALTH_MAP[CODE]].replace(9, np.nan)

# Get percentages for perceived health distribution (for all)
health_percent = data[HEALTH_MAP[CODE]].value_counts(sort=False, dropna=True, normalize=True)

# health perception vs. animal phobia
health_ap_percent = data[data[ANIMALS_MAP[CODE]] == 1][HEALTH_MAP[CODE]].value_counts(sort=False, dropna=True, normalize=True)

# health perception vs. pure animal phobia
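# has_ap and has_pure_ap are boolean masks defined earlier in the full script
health_pure_ap_percent = data[(has_ap & has_pure_ap)][HEALTH_MAP[CODE]].value_counts(sort=False, dropna=True, normalize=True)

print('\nCompared distribution percentages for Perceived Health')
print(pd.concat(dict(AnimalPhobia=health_ap_percent.rename(HEALTH_MAP[VALUES]),
                     Dataset=health_percent.rename(HEALTH_MAP[VALUES]),
                     PureAnimalPhobia=health_pure_ap_percent.rename(HEALTH_MAP[VALUES])), axis=1))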


Compared distribution percentages for Perceived Health
           AnimalPhobia   Dataset  PureAnimalPhobia
Excellent      0.228515  0.287576          0.299777
Fair           0.159652  0.121862          0.095323
Good           0.271926  0.248652          0.249889
Poor           0.072829  0.051813          0.044989
Very good      0.267078  0.290097          0.310022

As we can see, the share of those who perceive their health as poor is the smallest in the case of pure animal phobia. These respondents also most often perceive their health as excellent or very good. Actually this reminds me of the study “Pure animal phobia is more specific than other specific phobias” by Vladeta Ajdacic-Gross et al., which states that “Pure animal phobia showed no associations with any included risk factors and comorbid disorders, in contrast to numerous associations found in the mixed subtype and in other specific phobias”.

Wrap up

The attempt to distinguish pure and mixed animal phobia showed the proportion of 25% (pure) vs. 75% (mixed). While processing the data on other specific phobias I faced a dilemma on how to treat missing values (or Unknown). I saw two ways:

  • Code all Unknowns as NaN to make sure I only count the results for those cases that are definitely true. This approach would imply that pure animal phobia is the one, about which we are absolutely sure that it is not combined with any other specific phobias.
  • Code all No-s and Unknowns as 0 and treat them equally. This would imply that pure animal phobia is the one, about which we have no evidence that it is combined with any other specific phobias.

I chose the latter. First, the approach with NaN would lead to a messier picture with lots of uncertainties to be taken into consideration. Second, and even more important, I cannot be sure that this dataset lists all possible specific phobias. For example, I failed to find something like pyrophobia there. So, even if I try and clean out all Unknowns on the dataset level, there will still be huge unknowns outside its scope. That is why I decided not to make any difference between No and Unknown when recoding these variables.
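The difference between the two options can be shown on a tiny example. This is only a sketch with one made-up "other phobia" column and three respondents who all have animal phobia; the coding (1 = Yes, 2 = No, 9 = Unknown) is the one used throughout these posts.

```python
import numpy as np
import pandas as pd

# Three respondents with animal phobia; HEIGHTS is a hypothetical second phobia
toy = pd.DataFrame({
    'ANIMALS': [1, 1, 1],
    'HEIGHTS': [2, 9, 1],   # No, Unknown, Yes
})

# Option 1: Unknown -> NaN. Pure requires a definite No for the other phobia.
strict = toy['HEIGHTS'].replace({2: 0, 9: np.nan})
pure_strict = toy['ANIMALS'].eq(1) & strict.eq(0)

# Option 2 (the one I chose): No and Unknown -> 0.
# Pure means there is no evidence of another phobia.
lenient = toy['HEIGHTS'].replace([2, 9], 0)
pure_lenient = toy['ANIMALS'].eq(1) & lenient.eq(0)
```

The second respondent is counted as pure only under the second option, so that option can never give a lower pure share than the first one.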

With the origins variable I ended up with a subset of those with 400 or more occurrences as most representative. The top-3 origins by animal phobia percentage were African American (30%), American Indian (24%), South American (23%). The lowest percentage is in the cases of Russian (14%), Polish (18%) and Swedish (18%). Looks like there may be some geographic pattern indeed. Although we can see that Cuban, Mexican and Mexican-American origins (that is southern) are somewhere in the middle. I might also want to later have a look at the distribution of pure animal phobia across the origins.

As to the perceived health variable, I compared the results for the whole dataset with the results for those with animal phobia and with pure animal phobia. I was surprised to see that those with pure animal phobia tend to estimate their health better than the others: Poor: 7% for the respondents with animal phobia, including mixed cases; 5% for the whole dataset; 4% for those with pure animal phobia. And excellent: 23% (animal phobia including mixed); 29% (whole dataset); 30% (pure animal phobia).

Coursera: Data Management and Visualization – Assignment 2

For assignment 2 the task was to write some code to process the chosen variables and present:

  • the script;
  • the output that displays three of your variables as frequency tables;
  • a few sentences describing your frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.

My selected variables were:

  • the experience of animal phobia;
  • the origin of the respondent;
  • the respondent’s perceived health state.

The full version of my script (in Python) is here. Below are some of its parts to provide some outline. To print out all my output for frequencies and percentages I wrote the following function:

import pandas as pd

def label_print(series_freq, series_percent, series_map, rename=True, ind_sort=False):
    """
    Print the output of an operation with all captions and labels needed
    :param series_freq: which series frequency to print out
    :param series_percent: which series percentages to print out
    :param series_map: set the dictionary describing the series
    :param rename: True by default; if True, replaces index with the given labels
    :param ind_sort: False by default; if True, sort index values (good for sorting numeric index)
    :return: None
    """
    print(TITLE.format(series_map[CODE], series_map[MEANING]))
    if rename:
        # Use labels for the output
        print(pd.concat(dict(Frequencies=series_freq.rename(series_map[VALUES]),
                             Percentages=series_percent.rename(series_map[VALUES])), axis=1))
    if ind_sort:
        # Sort numeric labels for the output
        print(pd.concat(dict(Frequencies=series_freq.sort_index(),
                             Percentages=series_percent.sort_index()), axis=1))
    elif not rename and not ind_sort:
        # Print the series as is
        print(pd.concat(dict(Frequencies=series_freq,
                             Percentages=series_percent), axis=1))

I use global variables to store all necessary string values (such as column names, their meanings, etc.). And I stored all the necessary code meanings as dictionaries in a separate file, which I import into my script and use to label the output.

So, here is the piece of code to get the frequencies and percentages for my core variable regarding animal phobia:

animals_freq = data[ANIMALS_MAP[CODE]].value_counts(sort=False)
animals_percent = data[ANIMALS_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(animals_freq, animals_percent, ANIMALS_MAP)

And here is the output:


         Frequencies  Percentages
Yes             9093     0.211009
No             32585     0.756155
Unknown         1415     0.032836

From this we see that a considerable number of the respondents (21%) have had some uneasy experience with the animals.

My next variable was origin or descent. The thing is that there are quite a number of distinct values (60 in total). By the way, I counted the unique values like this:

unique_origins = data[ORIGIN_MAP[CODE]].unique()
print('num distinct origins:', len(unique_origins))
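A slightly shorter alternative is `nunique()`. The sketch below, on a made-up series, shows the one difference worth knowing: `unique()` counts a missing value as a distinct value, while `nunique()` skips it by default.

```python
import numpy as np
import pandas as pd

# Toy series of origin codes with one missing value
origins = pd.Series([1, 19, 19, 98, np.nan])

n_via_unique = len(origins.unique())          # NaN is counted as a distinct value
n_via_nunique = origins.nunique()             # NaN is skipped by default
n_with_nan = origins.nunique(dropna=False)    # NaN is counted again
```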

Some are really numerous (like African American, 7684 occurrences, about 18%); others are very few (like Malaysian, 11 occurrences, .025%). So, here I will provide just the top by frequency (over 900 occurrences). The code was:

origins_freq = data[ORIGIN_MAP[CODE]].value_counts()
origins_percent = data[ORIGIN_MAP[CODE]].value_counts(normalize=True)
label_print(origins_freq, origins_percent, ORIGIN_MAP)

And this is the top of the output:


                                                    Frequencies  Percentages
African American (Black, Negro, or Afro-American)          7684     0.178312
German                                                     5345     0.124034
English                                                    4455     0.103381
Irish                                                      3066     0.071148
Mexican                                                    2578     0.059824
Unknown                                                    1855     0.043046
Mexican-American                                           1758     0.040795
Other                                                      1739     0.040355
Italian                                                    1555     0.036085
French                                                     1048     0.024319
Puerto Rican                                                997     0.023136
American Indian (Native American)                           975     0.022625

Last there is the health variable. Code snippet, nothing new:

health_freq = data[HEALTH_MAP[CODE]].value_counts(sort=False)
health_percent = data[HEALTH_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(health_freq, health_percent, HEALTH_MAP)

The output:


           Frequencies  Percentages
Very good        12424     0.288307
Excellent        12316     0.285800
Good             10649     0.247117
Fair              5219     0.121110
Poor              2219     0.051493
Unknown            266     0.006173

And I also played with subsetting. I made a subset based on three conditions:

  • The respondent should have experienced animal fear
  • The respondent should be of one of the top origins (excluding Other and Unknown)
  • The respondent should perceive their health as poor.

Here is the code snippet:

condition_ap = data[ANIMALS_MAP[CODE]] == 1
condition_health = data[HEALTH_MAP[CODE]] == 5
condition_origin = data[ORIGIN_MAP[CODE]].isin([1, 19, 15, 18, 27, 29, 36, 35, 39, 3])
raw_subset = data[(condition_ap & condition_health & condition_origin)]
subset = raw_subset.copy()
print('Subset: top origins + poor perceived health + have AP')
origins_ap_freq = subset[ORIGIN_MAP[CODE]].value_counts(sort=False)
origins_ap_percent = subset[ORIGIN_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(origins_ap_freq, origins_ap_percent, ORIGIN_MAP)

Here is the output:

Subset: top origins + poor perceived health + have AP


                                                   Frequencies  Percentages
African American (Black, Negro, or Afro-American)          217     0.430556
American Indian (Native American)                           27     0.053571
English                                                     67     0.132937
French                                                      10     0.019841
German                                                      54     0.107143
Irish                                                       35     0.069444
Italian                                                     17     0.033730
Mexican                                                     22     0.043651
Mexican-American                                            20     0.039683
Puerto Rican                                                35     0.069444

To wrap up: I now have a frequency table for each of my three variables. For animal fear/avoidance there is a fair share of those who have experienced this (21%). It may be instructive to have a look at some other specific phobias for comparison, but for now I can just note that this share is by no means small.

For origin or descent, we see that the leaders among the respondents’ origins (60 in total) are:

  • African American (Black, Negro, or Afro-American) (18%)
  • German (12%)
  • English (10%)

And the fewest are:

  • Jordanian and Malaysian (.025% each)
  • Samoan (.02%)

As to health, most of the respondents (82%) find their health state good, very good or excellent. And only 5% estimated it as poor. I wonder if this distribution is going to change for the subset of those with animal phobia. So I even had a look. Here is the code snippet:

raw_subset = data[(condition_ap)]
subset_health_ap = raw_subset.copy()
print('\nSubset: perceived health + have AP')
health_ap_freq = subset_health_ap[HEALTH_MAP[CODE]].value_counts(sort=True, dropna=False)
health_ap_percent = subset_health_ap[HEALTH_MAP[CODE]].value_counts(sort=True, dropna=False, normalize=True)
label_print(health_ap_freq, health_ap_percent, HEALTH_MAP)

And here is the output:

Subset: perceived health + have AP


           Frequencies  Percentages
Good              2468     0.271418
Very good         2424     0.266579
Excellent         2074     0.228088
Fair              1449     0.159353
Poor               661     0.072693
Unknown             17     0.001870

So in the subset of those with animal fears, the distribution is a bit different indeed. Still, the view that one's health is in at least a good state is predominant (about 77%). But now the order of the top three has changed: Good is the most frequent among the three (it used to be the least frequent), Excellent is the least frequent (used to be in the middle). The share of those who think their health is poor has also increased to 7%. Too early to jump to any conclusions though. I do not even know if this difference is significant.

As to the subset (above) based on origin, poor perceived health and animal phobia – well, it was purely technical. There is little to be concluded or observed based on the frequency table. Some other approach will be necessary. For now, I am just glad I discovered that nice .isin method for subsetting.
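For the record, here is a minimal self-contained illustration of `.isin` on toy data. The column names are made up and the lists of codes are shortened; only the pattern is the same as in my subset above.

```python
import pandas as pd

toy = pd.DataFrame({
    'ORIGIN':  [1, 19, 98, 3, 99],
    'HEALTH':  [5, 5, 1, 5, 5],    # 5 = Poor
    'ANIMALS': [1, 2, 1, 1, 1],    # 1 = Yes
})

top_origins = [1, 19, 3]

# Combine three boolean conditions; isin checks membership in a list of codes
mask = toy['ANIMALS'].eq(1) & toy['HEALTH'].eq(5) & toy['ORIGIN'].isin(top_origins)
subset = toy[mask].copy()

# Negating isin with ~ excludes codes instead (here 98 and 99, Other and Unknown)
without_other_unknown = toy[~toy['ORIGIN'].isin([98, 99])]
```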

Working with my variables, I have not encountered any missing values, so I did not have to deal with them technically (although I used the dropna parameter a couple of times just in case). But there is a similar thing in all three variables: the Unknown category. I have not decided how to deal with it yet. But most probably I will just drop it in the future.

Coursera: Data Management and Visualization – Assignment 1

Dataset, questions, hypothesis

I chose the NESARC dataset to explore. The main reasons were:
– the size of the dataset (over 40K rows, which makes it more interesting to operate programmatically);
– the high level of detail in the parameters, which provides great opportunities for asking questions.

I am going to focus on specific phobias (SP), particularly on animal phobia (AP). This kind of phobia appears to be rather widespread and, according to some studies (see below), differs in specific ways from other kinds of SP, such as fear of heights, water, dentists, etc. So, it might be a good reason to single out just one phobia in order to narrow down my analysis. The variable is S8Q1A1 (EVER HAD FEAR/AVOIDANCE OF INSECTS, SNAKES, BIRDS, OTHER ANIMALS).
My original question is if there is any particular association between this AP and the origin of a person (I would cautiously suggest that different origins probably mean different cultural backgrounds). The variable is S1Q1E (ORIGIN OR DESCENT).
Next I would like to have a look at whether there is any association between having AP and the self-perception of health. The variable is S1Q16 (SELF-PERCEIVED CURRENT HEALTH).

So here are the basic questions:
– Is there any association between animal phobia and the origin?
– Is there any association between animal phobia and self-perceived current health description?
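A first look at questions like these could be a cross-tabulation with row-wise shares. Below is a minimal sketch on toy data; the column names 'AP' and 'HEALTH' and all values are hypothetical stand-ins for S8Q1A1 and S1Q16.

```python
import pandas as pd

# Toy data; column names and values are made up for illustration.
toy = pd.DataFrame({
    'AP': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'HEALTH': ['Good', 'Good', 'Excellent', 'Poor',
               'Good', 'Poor', 'Good', 'Excellent'],
})

# normalize='index' converts each row into shares that sum to 1,
# so the health distribution can be compared between the two AP groups.
shares = pd.crosstab(toy['AP'], toy['HEALTH'], normalize='index')
print(shares)
```

A table like this only describes the sample; whether any difference is significant would need a proper test.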

There is one more question (just in case I have time for it). After taking a look at the national dimension of AP, it might be interesting to also consider possible cultural/perception changes over time and check out the percentage of people with AP across different age groups.

Based on the literature review (below), my hypothesis is:
– AP shows some association with national / cultural background or context and may be associated with the perception of health condition.

Literature review

AP in the course of other SPs or vs. other SPs, anxiety and various disorders/mental conditions.

[1] Vladeta Ajdacic-Gross, Stephanie Rodgers, Mario Müller, Michael P. Hengartner, Aleksandra Aleksandrowicz, Wolfram Kawohl, Karsten Heekeren, Wulf Rössler, Jules Angst, Enrique Castelao, Caroline Vandeleur, Martin Preisig
Pure animal phobia is more specific than other specific phobias: epidemiological evidence from the Zurich Study, the ZInEP and the PsyCoLaus
European Archives of Psychiatry and Clinical Neuroscience, September 2016, Volume 266, Issue 6, pp 567–577
This study states that pure animal phobia is principally different from other kinds of SP:
“Pure animal phobia and mixed animal/other specific phobias consistently displayed a low age at onset of first symptoms (8–12 years) and clear preponderance of females (OR > 3). Meanwhile, other specific phobias started up to 10 years later and displayed almost a balanced sex ratio. Pure animal phobia showed no associations with any included risk factors and comorbid disorders, in contrast to numerous associations found in the mixed subtype and in other specific phobias. Across the whole range of epidemiological parameters examined in three different samples, pure animal phobia seems to represent a different entity compared to other specific phobias. The etiopathogenetic mechanisms and risk factors associated with pure animal phobias appear less clear than ever”.
Based on this, I should probably also take into account the distinction between ‘pure’ (not combined with other SPs) and ‘mixed’ (goes in combination with other SPs) animal phobia.
So I may need to see the proportion of those who have had only AP symptoms (variable S8Q1A1, EVER HAD FEAR/AVOIDANCE OF INSECTS, SNAKES, BIRDS, OTHER ANIMALS) and those combining AP with other SP episodes.
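A rough sketch of how the pure/mixed split could be computed. Everything here is an assumption for illustration: 'AP' stands in for S8Q1A1, 'BII' for one other specific phobia variable, and the 1 = yes / 2 = no coding is assumed rather than taken from the codebook.

```python
import pandas as pd

# Toy data with hypothetical column names and an assumed 1 = yes / 2 = no coding.
toy = pd.DataFrame({
    'AP':  [1, 1, 2, 1, 2, 1],
    'BII': [2, 1, 1, 2, 2, 2],
})
# In practice this list would hold all the other specific phobia columns.
other_sp_columns = ['BII']

has_ap = toy['AP'] == 1
# True if the respondent reported at least one of the other phobias.
has_other_sp = (toy[other_sp_columns] == 1).any(axis=1)

pure_ap = has_ap & ~has_other_sp   # AP only
mixed_ap = has_ap & has_other_sp   # AP combined with another SP

print('Pure AP share among AP cases:', pure_ap.sum() / has_ap.sum())
print('Mixed AP share among AP cases:', mixed_ap.sum() / has_ap.sum())
```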

[2] Kevin Hilbert, Ricard Evens, Nina Isabel Maslowski, Hans-Ulrich Wittchen, Ulrike Lueken
Neurostructural correlates of two subtypes of specific phobia: A voxel-based morphometry study
Psychiatry Research: Neuroimaging
Volume 231, Issue 2, 28 February 2015, Pages 168-175
Abstract: “The animal and blood-injection-injury (BII) subtypes of specific phobia are both characterized by subjective fear but distinct autonomic reactions to threat. Previous functional neuroimaging studies have related these characteristic responses to shared and non-shared neural underpinnings. However, no comparative structural data are available. This study aims to fill this gap by comparing the two subtypes and also comparing them with a non-phobic control group”.
This study shows more complicated dependencies in the comparative analysis of SPs. To be taken into consideration while comparing. Particularly variable S8Q1A8 (EVER HAD FEAR/AVOIDANCE OF SEEING BLOOD/GETTING AN INJECTION) may be of interest.

[3] K. J. Wardenaar, C. C. W. Lim, A. O. Al-Hamzawi, J. Alonso et al.
The cross-national epidemiology of specific phobia in the World Mental Health Surveys
Psychological Medicine, Volume 47, Issue 10, July 2017, pp. 1744-1760
Results: “The cross-national lifetime and 12-month prevalence rates of specific phobia were, respectively, 7.4% and 5.5%, being higher in females (9.8 and 7.7%) than in males (4.9% and 3.3%) and higher in high- and higher-middle-income countries than in low-/lower-middle-income countries. The median age of onset was young (8 years). Of the 12-month patients, 18.7% reported severe role impairment (13.3–21.9% across income groups) and 23.1% reported any treatment (9.6–30.1% across income groups). Lifetime co-morbidity was observed in 60.5% of those with lifetime specific phobia, with the onset of specific phobia preceding the other disorder in most cases (72.6%). Interestingly, rates of impairment, treatment use and co-morbidity increased with the number of fear subtypes”.
This study indicates some association with age and sex. It also reports associations with other disorders. This means that variables such as sex, and probably age as well, should be taken into consideration. Luckily, the dataset provides SEX and AGE parameters.

AP in the context of culture / nationality

[4] Cultural Clinical Psychology Study Group, W.A. Arrindell, Martin Eisemann et al.
Phobic anxiety in 11 nations: Part I: Dimensional constancy of the five-factor model
Behaviour Research and Therapy, Volume 41, Issue 4, April 2003, Pages 461-479
(and Part 2 here)
Abstract: “The Fear Survey Schedule-III (FSS-III) was administered to a total of 5491 students in Australia, East Germany, Great Britain, Greece, Guatemala, Hungary, Italy, Japan, Spain, Sweden, and Venezuela, and submitted to the multiple group method of confirmatory analysis (MGM) in order to determine the cross-national dimensional constancy of the five-factor model of self-assessed fears originally established in Dutch, British, and Canadian samples. The model comprises fears of bodily injury–illness–death, agoraphobic fears, social fears, fears of sexual and aggressive scenes, and harmless animals fears. Close correspondence between the factors was demonstrated across national samples. In each country, the corresponding scales were internally consistent, were intercorrelated at magnitudes comparable to those yielded in the original samples, and yielded (in 93% of the total number of 55 comparisons) sex differences in line with the usual finding (higher scores for females). In each country, the relatively largest sex differences were obtained on harmless animals fears. The organization of self-assessed fears is sufficiently similar across nations to warrant the use of the same weight matrix (scoring key) for the FSS-III in the different countries and to make cross-national comparisons feasible. This opens the way to further studies that attempt to predict (on an a priori basis) cross-national variations in fear levels with dimensions of national cultures.”
And quoting the abstract for the other part: “Hofstede’s dimensions of national cultures termed Masculinity–Femininity (MAS) and Uncertainty Avoidance (UAI) (Hofstede, 2001) are proposed to be of relevance for understanding national-level differences in self-assessed fears. The potential predictive role of national MAS was based on the classical work of Fodor (Fodor, 1974). Following Fodor, it was predicted that masculine (or tough) societies in which clearer differentiations are made between gender roles (high MAS) would report higher national levels of fears than feminine (or soft/modest) societies in which such differentiations are made to a clearly lesser extent (low MAS). In addition, it was anticipated that nervous-stressful-emotionally-expressive nations (high UAI) would report higher national levels of fears than calm-happy and low-emotional countries (low UAI), and that countries high on both MAS and UAI would report the highest national levels of fears”.

So, to summarize:

  • National / cultural differences show up when it comes to animal fears (particularly harmless)
  • Such fears are more common for ‘masculine’ cultures with more rigid gender roles; and also more typical for ‘nervous/emotionally expressive’ countries.
  • So there is some cultural association with such animal fears.
    And here is where I am going to rely on S1Q1E (ORIGIN OR DESCENT) parameter.

[5] Eva Landová, Natavan Bakhshaliyeva et al.
Association Between Fear and Beauty Evaluation of Snakes: Cross-Cultural Findings
Front. Psychol., 16 March 2018
The study states that the fear of snakes has evolutionary reasons and is particularly connected to the geographical and natural conditions in which a country’s culture was formed. Well, just another case to show that researchers do establish some cultural association with fears (and ultimately phobias).

SPs (including AP) and physical conditions

[6] Cornelia Witthauer, Vladeta Ajdacic-Gross, et al.
Associations of specific phobia and its subtypes with physical diseases: an adult community study
BMC Psychiatry, 2016
Results: “Specific phobia was associated with cardiac diseases, gastrointestinal diseases, respiratory diseases, arthritic conditions, migraine, and thyroid diseases (odds ratios between 1.49 and 2.53). Among the subtypes, different patterns of associations with physical diseases were established”.

[7] Ella L. Oar, Lara J. Farrell et al.
Blood-Injection-Injury Phobia and Dog Phobia in Youth: Psychological Characteristics and Associated Features in a Clinical Sample
Behavior Therapy, Volume 47, Issue 3, May 2016, Pages 312-324
Abstract: “Blood-Injection-Injury (BII) phobia is a particularly debilitating condition that has been largely ignored in the child literature. The present study examined the clinical phenomenology of BII phobia in 27 youths, relative to 25 youths with dog phobia—one of the most common and well-studied phobia subtypes in youth. Children were compared on measures of phobia severity, functional impairment, comorbidity, threat appraisals (danger expectancies and coping), focus of fear, and physiological responding, as well as vulnerability factors including disgust sensitivity and family history. Children and adolescents with BII phobia had greater diagnostic severity. In addition, they were more likely to have a comorbid diagnosis of a physical health condition, to report more exaggerated danger expectancies, and to report fears that focused more on physical symptoms (e.g., faintness and nausea) in comparison to youth with dog phobia. The present study advances knowledge relating to this poorly understood condition in youth”.
Here I can note that Blood-Injection-Injury phobia is often mentioned (and explored) in combination with animal phobia (like here and in [2] for instance, but I have come across other cases as well).

To summarize:

  • SP (AP among them) may be an indication of some physical conditions.
  • Which makes me think there might be some reflection in self-perception.
  • Unfortunately, I failed to find any studies of association between SP and hypochondria, which would be more appropriate for my intention to check exactly subjective perception of health.


Well, looks like I am going to unfreeze this blog for a while. Just enrolled in yet another Coursera course on data analysis and there are assignments that require blog posts. I see no point in creating a special blog for that, especially as I already have this one, created exactly for course needs. Here’s the course, by the way. Cannot recommend it for now, as I’ve just started.

So, the next several posts here are hopefully (I do hope the course is worth it) going to be assignments.

Yet another UPD

Hey everyone whoever might read this.
I last updated more than a year ago and now I’m here for basically the same reason as last time. Namely, to say: it works. I mean open online education, which includes platform MOOCs, peer-learning, interactive platforms like Khan Academy, etc. Combined with the willingness to learn new stuff and apply the new knowledge, it apparently does work.
A year ago, my learning efforts and actual work were still mostly separate. Today, I’m happy to say that they’re not any longer. About a month ago, I first tried myself as an analyst. A very low-level analyst of course, but it’s huge progress for me. I started applying my coding skills (in Python) to searching and extracting data from a database (both directly and via API). By the way, this is an enormous database of government contracts and procurements managed by a great non-governmental project Clearspending.
Also, I’m currently working on a study of the situation with open data in Russia. Things have changed since 2013, when Russian governmental bodies first started to publish their data. So it’s high time we looked at the general picture and tried to describe it. I think we’re going to finish it by February.
There have been no data expeditions recently, although it doesn’t mean that we’re not thinking of renewing this good tradition one day.

Last, but not least, this year I started learning Armenian (սովորում եմ հայերեն), which is an extremely interesting and beautiful language. Last summer in Armenia I got a very nice, although rather old, textbook (until then I had trouble finding some decent learning material). Next challenge will be to find a decent Armenian dictionary (I’ve failed so far).

By the way, here’s a picture of the ancient Erebuni fortress in Yerevan.


Oh, and as usual, I’ve got some nice courses to recommend.

  • MongoDB University is a collection of nice online courses that teach how to use Mongo Database. Most of them are rather practical, so if there’s no need to handle MongoDB right away, they may not be awfully interesting. But if there is such a need, they are great.
  • However, one of those courses, which is published at Udacity, has a wider scope. It teaches the techniques of working with different data formats using Python. Nothing extremely complex, but there are lots of very helpful tips.
  • Also, there’s a nice specialization series by Charles Severance at Coursera. It’s Python again. The first two courses in the series are just basics, but they are followed by two courses on how to work with web data (including API) and databases. Again, rather simple, but nonetheless very helpful sometimes.


I haven’t posted anything for quite a while, but actually I keep learning. It’s always somewhat sad to see these abandoned blogs created for peer-learning with a couple of posts and then no updates, so you just don’t know if their authors are still learning or gave up on it. Well, I haven’t. True, I’m more into platform MOOCs at the moment, so I’m not using this blog for peer-learning purposes directly. But I generally like this international open peer-learning project and I’m going to update this blog from time to time.

There’s a good occasion for this post: I’ve just completed An Introduction to Interactive Programming in Python at Coursera. I’ve finally done it, having failed two previous attempts. It was challenging and I’m not sure I’d have made it if I hadn’t done some preparatory work at Codecademy and with the help of Zed Shaw’s ‘Learn Python: The Hard Way’ (a great educational project by the way).

Just for show, here are the links to the mini-projects I completed during the course. I’m providing the links to my code in CodeSkulptor, an online application created by one of the instructors for writing and running Python code. In case someone wants to have a look, the best way to do it is by using Chrome (using Mozilla and other browsers may lead to some bugs).

This is actually the first part in Fundamentals of Computing specialisation. Next course in this sequence, Principles of Computing, is going to start in February 2015 and I’m totally going to try it. Before it begins, I’m going to have some fun at Khan Academy.

Finally, some courses I’d like to take a closer look at some day. Maybe someone will find them fascinating as well. If somebody has already dealt with some of them, it would be great if you shared your opinion.


1. I’m still alive.

2. I keep working as a journalist. Recently I’ve actually tried applying my newly acquired skills to my real job. Still much to work on, but I seem to have learnt at least something. In the first case I tried to work with some data on the death penalty in the US; in the second case, I was visualising some aspects (namely, kidnapping) of the Global Terrorism Database. Both materials are in Russian of course. Moreover, the website does not currently allow for embedding interactive visualisations, so there are only screenshots, while the original interactive stuff is published on my Blogger account (but again, in Russian). Speaking of the Global Terrorism Database, there’s a whole course at Coursera based on this project. Don’t know much about it, so I can’t recommend it, but I’ll definitely have a look, as soon as I can.

3. I keep tracking the developments in the activity of Open Data School in Moscow. It’s an interesting project both as an educational initiative and as part of promoting openness. More on it later, as well as on DLMOOC, by the way, which is fascinating (sadly, I’ve been virtually unable to participate full-scale).

4. Meanwhile, I’m trying to keep up with Linear Algebra: Foundations to Frontiers and Statistical Learning.

5. Right now I’m in the middle of running yet another Russian-language data expedition (DE3), which began on 20 February. This one is a bit different from DE1 and DE2. First, this time we (Irina Radchenko and myself) worked in partnership with Aleksey Sidorenko from the NGO “Теплица социальных технологий” (Teplitsa/Greenhouse of Social Technologies). It is also the first time that we have taken a socially meaningful subject, which is orphan diseases. DE3 is going to finish on March 5. Soon after, I’ll be able to tell more about it, as well as about its findings (we’re digging the data on the situation in Russia in the first place). By the way, it will also be great to have some kind of feedback from people from other countries who are aware of the local situation (Jakes?).

6. Last, but not least, I’m currently involved (unfortunately in quite a hibernating way at the moment) in developing an international project on national informational resources. It all started with Team 10, but it’s going to grow. More on it later.

Deep Learning MOOC


As I have already mentioned, it’s not easy being me. In addition to my already formed nice and balanced ‘curriculum’ I have enrolled in yet another MOOC on Deep Learning, or DLMOOC. It begins in a week’s time, on 20 January, and it ends on 21 March. It is another instance of a so-called mechanical MOOC, similar to Python MOOC, and also created by P2PU. This one is for educators. Well, as a person who has already launched two data-expeditions and is totally resolved to keep doing it in the future, I thought it might be a good idea to kind of learn a bit more about education in general. And this seems to be a very nice chance, because this MOOC has already collected more than 600 participants, that is, educators from all over the world.

To be honest, I don’t think I’ll be able to be a very valuable contributor in terms of active participation, because I still have to work, learn pre-calculus and data analysis. And yes, we’ll have to launch our next expedition one day too (in spring I hope). But I’m sure I’ll still receive lots of valuable experience. I already have. I do like the communication system of DLMOOC with a G+ community as central platform. Although I’m not sure yet if it is appropriate for data-expeditions. It also has a flexible cooperation mechanism with an option to choose whether the participants want to work in ‘offline’ (friend-to-friend) groups or join into virtual groups. And it’s very interesting to see how it is going to develop and work. I will try to make notes on the way and share them here.

Big plans for my 2nd semester

As the previous experience has shown, it’s hard to cover more than one course in one semester (this way of measuring my learning time seems most appropriate), if you have to work at the same time. Or rather one course and a half. Last semester, these were an introduction to statistics and a bit of R. Initially, I had huge plans for the upcoming semester. While learning statistics and earlier some Python basics I got a bit tired of constant guesswork and having to learn separate pieces of underlying fundamentals, without getting the whole picture. So I totally felt like taking two basic courses in this semester, namely some refresher in math and some intro to computer science.

As to computer science, I really liked the description of CS50, a Harvard CS course by David Malan, which has its online representation both as a static archive and as a MOOC. The thing is that:

  • it lasts 10 weeks
  • it has 2 lectures every week, about an hour long each
  • it has 1 seminar a week, about an hour long
  • it includes 9 problem sets, estimated completion time 10 to 20 hours each
  • it includes 1 final project

Well, that’s definitely not what I’m likely to be able to cover before summer, especially if it is combined with a math course. Time for tough decisions. After some hesitation I decided that math comes first:

  • as a more basic subject
  • the thing I really needed while learning stats
  • more realistic to complete by the end of this semester.

There are actually two courses that seem quite appropriate for my needs (and I need to refresh some real basics):

I’m not sure about the latter, but Precalculus looks very promising in terms of at least answering some unresolved questions (simple, but very annoying) I already have after dealing with statistics.

So that’s what is going to be my core subject for the semester, just like Statistics was last semester. Now, what about the remaining ‘half a course’ to complete my schedule? Well, I failed to complete Data Analysis last semester and I also want to have some revision of what I learnt about statistics last year. That’s what I think I’m going to be dealing with for the rest of my learning time. Stanford is offering a course in statistical learning (as far as I understand this stands for statistics combined with some machine learning approaches). I hope it won’t be as challenging as it could have been, now that I have acquired some basic skills in handling R (and this course is based on R).

So these are my one and a half courses I’m going to take in this semester. As to CS, I do hope to take it in the summer.

A couple of links for those who also might need some school math refresher: