Coursera: Data Management and Visualization – Assignment 2

For assignment 2 the task was to write some code to process the chosen variables and present:

  • the script;
  • the output that displays three of your variables as frequency tables;
  • a few sentences describing your frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.

My selected variables were:

  • the experience of animal phobia;
  • the origin of the respondent;
  • the respondent’s perceived health state.

The full version of my script (in Python) is here. Below are some of its parts to provide some outline. To print out all my output for frequencies and percentages I wrote the following function:

import pandas as pd

def label_print(series_freq, series_percent, series_map, rename=True, ind_sort=False):
    """
    Print the output of an operation with all captions and labels needed
    :param series_freq: which series frequency to print out
    :param series_percent: which series percentages to print out
    :param series_map: set the dictionary describing the series
    :param rename: True by default; if True, replaces index with the given labels
    :param ind_sort: False by default; if True, sort index values (good for sorting numeric index)
    :return: None
    """
    print(TITLE.format(series_map[CODE], series_map[MEANING]))
    if rename:
        # Use labels for the output
        print(pd.concat(dict(Frequencies=series_freq.rename(series_map[VALUES]),
                             Percentages=series_percent.rename(series_map[VALUES])), axis=1))
    if ind_sort:
        # Sort numeric labels for the output
        print(pd.concat(dict(Frequencies=series_freq.sort_index(),
                             Percentages=series_percent.sort_index()), axis=1))
    elif not rename and not ind_sort:
        # Print the series as they are
        print(pd.concat(dict(Frequencies=series_freq,
                             Percentages=series_percent), axis=1))

I use global variables to store all the necessary string values (such as column names and their meanings). I keep the code meanings as dictionaries in a separate file, which I import into my script and use to label the output.
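To give an idea of what such a dictionary might look like, here is a minimal sketch. The key names (CODE, MEANING, VALUES), the TITLE template and the value coding (1 = Yes, 2 = No, 9 = Unknown) are assumptions made for illustration, and the data frame is a toy stand-in for the real dataset.

```python
import pandas as pd

# Hypothetical key names and title template (assumptions, not the real script's values).
CODE, MEANING, VALUES = 'code', 'meaning', 'values'
TITLE = '\n{} ({})'

# Assumed coding of the animal phobia variable: 1 = Yes, 2 = No, 9 = Unknown.
ANIMALS_MAP = {
    CODE: 'S8Q1A1',
    MEANING: 'EVER HAD FEAR/AVOIDANCE OF INSECTS, SNAKES, BIRDS, OTHER ANIMALS',
    VALUES: {1: 'Yes', 2: 'No', 9: 'Unknown'},
}

# Toy data standing in for the NESARC column.
data = pd.DataFrame({'S8Q1A1': [1, 2, 2, 9, 2, 1]})
freq = data[ANIMALS_MAP[CODE]].value_counts(sort=False)

# Series.rename with a dict relabels the index, turning numeric codes into labels.
labelled = freq.rename(ANIMALS_MAP[VALUES])
print(TITLE.format(ANIMALS_MAP[CODE], ANIMALS_MAP[MEANING]))
print(labelled)
```

Keeping the code, its meaning and its value labels in one dictionary means a single object can be passed around for each variable.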

So, here is the piece of code to get the frequencies and percentages for my core variable regarding animal phobia:

animals_freq = data[ANIMALS_MAP[CODE]].value_counts(sort=False)
animals_percent = data[ANIMALS_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(animals_freq, animals_percent, ANIMALS_MAP)

And here is the output:


         Frequencies  Percentages
Yes             9093     0.211009
No             32585     0.756155
Unknown         1415     0.032836

From this we see that a considerable share of the respondents (21%) have experienced fear or avoidance of animals.
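Note that the Percentages column actually holds proportions (values between 0 and 1), since that is what normalize=True returns. A small sketch of how they could be turned into rounded percentages for reporting, using the numbers from the output above:

```python
import pandas as pd

# Proportions copied from the output above.
animals_percent = pd.Series({'Yes': 0.211009, 'No': 0.756155, 'Unknown': 0.032836})

# Scale to percentages and round to one decimal place.
animals_pct = (animals_percent * 100).round(1)
print(animals_pct)
```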

My next variable was origin or descent. The thing is that it has quite a number of distinct values (60 in total). By the way, I counted the unique values like this:

unique_origins = data[ORIGIN_MAP[CODE]].unique()
print('num distinct origins:', len(unique_origins))
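As a side note, pandas also has nunique() for this. The sketch below uses a toy series (not the real data) to show the one difference worth knowing: unique() keeps NaN as a value, while nunique() skips it by default. With no missing values in the column, both give the same number.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value.
origins = pd.Series([1, 19, 19, 15, np.nan, 1])

n_unique_len = len(origins.unique())          # NaN counts as a distinct value here
n_nunique = origins.nunique()                 # NaN is excluded by default
n_nunique_all = origins.nunique(dropna=False) # NaN is included again

print(n_unique_len, n_nunique, n_nunique_all)
```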

Some are really numerous (like African American, 7684 occurrences, about 18%); others are very rare (like Malaysian, 11 occurrences, 0.025%). So here I will provide just the top by frequency (over 900 occurrences). The code was:

origins_freq = data[ORIGIN_MAP[CODE]].value_counts()
origins_percent = data[ORIGIN_MAP[CODE]].value_counts(normalize=True)
label_print(origins_freq, origins_percent, ORIGIN_MAP)

And this is the top of the output:


                                                    Frequencies  Percentages
African American (Black, Negro, or Afro-American)          7684     0.178312
German                                                     5345     0.124034
English                                                    4455     0.103381
Irish                                                      3066     0.071148
Mexican                                                    2578     0.059824
Unknown                                                    1855     0.043046
Mexican-American                                           1758     0.040795
Other                                                      1739     0.040355
Italian                                                    1555     0.036085
French                                                     1048     0.024319
Puerto Rican                                                997     0.023136
American Indian (Native American)                           975     0.022625
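The cut-off at 900 occurrences can also be applied in code rather than by trimming the printed output. A minimal sketch, using a handful of counts taken from this post instead of the full series:

```python
import pandas as pd

# A few counts copied from the post (not the full list of 60 origins).
origins_freq = pd.Series({
    'African American': 7684,
    'German': 5345,
    'American Indian': 975,
    'Malaysian': 11,
})

# Boolean indexing keeps only the entries above the threshold.
top = origins_freq[origins_freq > 900]
print(top)
```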

Last, there is the health variable. The code snippet has nothing new:

health_freq = data[HEALTH_MAP[CODE]].value_counts(sort=False)
health_percent = data[HEALTH_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(health_freq, health_percent, HEALTH_MAP)

The output:


           Frequencies  Percentages
Very good        12424     0.288307
Excellent        12316     0.285800
Good             10649     0.247117
Fair              5219     0.121110
Poor              2219     0.051493
Unknown            266     0.006173

And I also played with subsetting. I made a subset based on three conditions:

  • The respondent should have experienced animal fear
  • The respondent should be of one of the top origins (excluding Other and Unknown)
  • The respondent should perceive their health as poor.

Here is the code snippet:

condition_ap = data[ANIMALS_MAP[CODE]] == 1
condition_health = data[HEALTH_MAP[CODE]] == 5
condition_origin = data[ORIGIN_MAP[CODE]].isin([1, 19, 15, 18, 27, 29, 36, 35, 39, 3])
raw_subset = data[(condition_ap & condition_health & condition_origin)]
subset = raw_subset.copy()
print('Subset: top origins + poor perceived health + have AP')
origins_ap_freq = subset[ORIGIN_MAP[CODE]].value_counts(sort=False)
origins_ap_percent = subset[ORIGIN_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(origins_ap_freq, origins_ap_percent, ORIGIN_MAP)

Here is the output:

Subset: top origins + poor perceived health + have AP


                                                   Frequencies  Percentages
African American (Black, Negro, or Afro-American)          217     0.430556
American Indian (Native American)                           27     0.053571
English                                                     67     0.132937
French                                                      10     0.019841
German                                                      54     0.107143
Irish                                                       35     0.069444
Italian                                                     17     0.033730
Mexican                                                     22     0.043651
Mexican-American                                            20     0.039683
Puerto Rican                                                35     0.069444

To wrap up: I now have a frequency table for each of my three variables.
For animal fear/avoidance there is a fair share of those who have experienced this (21%). It may be instructive to have a look at some other specific phobias for comparison, but for now I can just note that this share is by no means small.

For origin or descent, we see that the leaders among the respondents’ origins (60 in total) are:

  • African American (Black, Negro, or Afro-American) (18%)
  • German (12%)
  • English (10%)

And the fewest are:

  • Jordanian and Malaysian (0.025% each)
  • Samoan (0.02%)

As to health, most of the respondents (82%) rate their health as good, very good or excellent, and only 5% rated it as poor. I wondered whether this distribution would change for the subset of those with animal phobia, so I had a look. Here is the code snippet:

raw_subset = data[(condition_ap)]
subset_health_ap = raw_subset.copy()
print('\nSubset: perceived health + have AP')
health_ap_freq = subset_health_ap[HEALTH_MAP[CODE]].value_counts(sort=True, dropna=False)
health_ap_percent = subset_health_ap[HEALTH_MAP[CODE]].value_counts(sort=True, dropna=False, normalize=True)
label_print(health_ap_freq, health_ap_percent, HEALTH_MAP)

And here is the output:

Subset: perceived health + have AP


           Frequencies  Percentages
Good              2468     0.271418
Very good         2424     0.266579
Excellent         2074     0.228088
Fair              1449     0.159353
Poor               661     0.072693
Unknown             17     0.001870

So in the subset of those with animal fears, the distribution is indeed a bit different. Still, the view that one's health is at least good is predominant (about 77%). But the order of the top three has changed: Good is now the most frequent of the three (it used to be the least frequent), and Excellent is the least frequent (it used to be in the middle). The share of those who think their health is poor has also increased, to 7%. It is too early to jump to any conclusions though. I do not even know whether this difference is significant.
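The comparison is easier to read side by side. Here is a small sketch that rebuilds both distributions from the counts printed above and shows the difference in percentage points. It only describes the two tables and says nothing about significance.

```python
import pandas as pd

order = ['Excellent', 'Very good', 'Good', 'Fair', 'Poor', 'Unknown']

# Counts copied from the two output tables above.
all_resp = pd.Series({'Very good': 12424, 'Excellent': 12316, 'Good': 10649,
                      'Fair': 5219, 'Poor': 2219, 'Unknown': 266})
with_ap = pd.Series({'Good': 2468, 'Very good': 2424, 'Excellent': 2074,
                     'Fair': 1449, 'Poor': 661, 'Unknown': 17})

# Turn counts into proportions and align the two series on their labels.
comparison = pd.concat(
    {'All': all_resp / all_resp.sum(), 'With AP': with_ap / with_ap.sum()},
    axis=1,
).reindex(order)
comparison['Difference'] = comparison['With AP'] - comparison['All']

# Show everything as percentages with one decimal place.
print((comparison * 100).round(1))
```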

As to the subset (above) based on origin, poor perceived health and animal phobia – well, it was purely technical. There is little to be concluded or observed based on the frequency table. Some other approach will be necessary. For now, I am just glad I discovered that nice .isin method for subsetting.

Working with my variables, I have not encountered any missing values, so I did not have to deal with them technically (although I used the dropna parameter a couple of times just in case). But all three variables have a similar thing: the Unknown category. I have not decided how to deal with it yet, but most probably I will just drop it in future.
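One possible way of dropping Unknown later would be to turn it into a proper missing value, so that value_counts leaves it out by default. This is only a sketch: the code 9 for Unknown is an assumption, and the series is toy data.

```python
import numpy as np
import pandas as pd

UNKNOWN_CODE = 9  # assumed code for the Unknown category

# Toy health answers with two Unknown responses.
health = pd.Series([1, 2, 9, 3, 9, 5])

# Replace the Unknown code with NaN so pandas treats it as missing.
health_clean = health.replace(UNKNOWN_CODE, np.nan)

counts_without_missing = health_clean.value_counts(sort=False)
counts_with_missing = health_clean.value_counts(sort=False, dropna=False)

print(counts_without_missing)
print(counts_with_missing)
```

With normalize=True the percentages would then be computed over the known answers only, which changes the denominators compared with the tables above.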


Coursera: Data Management and Visualization – Assignment 1

Dataset, questions, hypothesis

I chose the NESARC dataset to explore. The main reasons were:
– the size of the dataset (over 40K rows, which makes it more interesting to operate programmatically);
– the high level of detail in its parameters, which provides great opportunities for asking questions.

I am going to focus on specific phobias (SP), particularly on animal phobia (AP). This kind of phobia appears to be rather widespread and, according to some studies (see below), is quite distinct from other kinds of SP, such as fear of heights, water, dentists, etc. So it seems reasonable to single out just this one phobia in order to narrow down my analysis. The variable is S8Q1A1 (EVER HAD FEAR/AVOIDANCE OF INSECTS, SNAKES, BIRDS, OTHER ANIMALS).
My original question is whether there is any particular association between AP and the origin of a person (I would cautiously suggest that different origins probably mean different cultural backgrounds). The variable is S1Q1E (ORIGIN OR DESCENT).
Next I would like to have a look at whether there is any association between having AP and the self-perception of health. The variable is S1Q16 (SELF-PERCEIVED CURRENT HEALTH).

So here are the basic questions:
– Is there any association between animal phobia and the origin?
– Is there any association between animal phobia and self-perceived current health description?

There is one additional question (in case I have time for it). After taking a look at the national dimension of AP, it might be interesting to also consider possible cultural/perception changes over time and check the percentage of people with AP across different age groups.

Based on the literature review (below), my hypothesis is:
– AP shows some association with national / cultural background or context and may be associated with the perception of health condition.

Literature review

AP in the course of other SPs or vs. other SPs, anxiety and various disorders/mental conditions.

[1] Vladeta Ajdacic-Gross, Stephanie Rodgers, Mario Müller, Michael P. Hengartner, Aleksandra Aleksandrowicz, Wolfram Kawohl, Karsten Heekeren, Wulf Rössler, Jules Angst, Enrique Castelao, Caroline Vandeleur, Martin Preisig
Pure animal phobia is more specific than other specific phobias: epidemiological evidence from the Zurich Study, the ZInEP and the PsyCoLaus
European Archives of Psychiatry and Clinical Neuroscience, September 2016, Volume 266, Issue 6, pp 567–577
This study states that pure animal phobia is principally different from other kinds of SP:
“Pure animal phobia and mixed animal/other specific phobias consistently displayed a low age at onset of first symptoms (8–12 years) and clear preponderance of females (OR > 3). Meanwhile, other specific phobias started up to 10 years later and displayed almost a balanced sex ratio. Pure animal phobia showed no associations with any included risk factors and comorbid disorders, in contrast to numerous associations found in the mixed subtype and in other specific phobias. Across the whole range of epidemiological parameters examined in three different samples, pure animal phobia seems to represent a different entity compared to other specific phobias. The etiopathogenetic mechanisms and risk factors associated with pure animal phobias appear less clear than ever”.
Based on this, I should probably also take into account the distinction between ‘pure’ (not combined with other SPs) and ‘mixed’ (goes in combination with other SPs) animal phobia.
So I may need to see the proportion of those who have had only AP symptoms (variable S8Q1A1, EVER HAD FEAR/AVOIDANCE OF INSECTS, SNAKES, BIRDS, OTHER ANIMALS) and those combining AP with other SP episodes.

[2] Kevin Hilbert, Ricard Evens, Nina Isabel Maslowski, Hans-Ulrich Wittchen, Ulrike Lueken
Neurostructural correlates of two subtypes of specific phobia: A voxel-based morphometry study
Psychiatry Research: Neuroimaging
Volume 231, Issue 2, 28 February 2015, Pages 168-175
Abstract: “The animal and blood-injection-injury (BII) subtypes of specific phobia are both characterized by subjective fear but distinct autonomic reactions to threat. Previous functional neuroimaging studies have related these characteristic responses to shared and non-shared neural underpinnings. However, no comparative structural data are available. This study aims to fill this gap by comparing the two subtypes and also comparing them with a non-phobic control group”.
This study shows more complicated dependencies in the comparative analysis of SPs. To be taken into consideration while comparing. Particularly variable S8Q1A8 (EVER HAD FEAR/AVOIDANCE OF SEEING BLOOD/GETTING AN INJECTION) may be of interest.

[3] K. J. Wardenaar, C. C. W. Lim, A. O. Al-Hamzawi, J. Alonso et al.
The cross-national epidemiology of specific phobia in the World Mental Health Surveys
Psychological Medicine, Volume 47, Issue 10, July 2017, pp. 1744-1760
Results: “The cross-national lifetime and 12-month prevalence rates of specific phobia were, respectively, 7.4% and 5.5%, being higher in females (9.8 and 7.7%) than in males (4.9% and 3.3%) and higher in high- and higher-middle-income countries than in low-/lower-middle-income countries. The median age of onset was young (8 years). Of the 12-month patients, 18.7% reported severe role impairment (13.3–21.9% across income groups) and 23.1% reported any treatment (9.6–30.1% across income groups). Lifetime co-morbidity was observed in 60.5% of those with lifetime specific phobia, with the onset of specific phobia preceding the other disorder in most cases (72.6%). Interestingly, rates of impairment, treatment use and co-morbidity increased with the number of fear subtypes”.
This study indicates some association with age and sex. It also reports associations with other disorders. This means that variables such as sex, and probably age as well, should be taken into consideration. Luckily, the dataset provides SEX and AGE parameters.

AP in the context of culture / nationality

[4] Cultural Clinical Psychology Study Group, W.A. Arrindell, Martin Eisemann et al.
Phobic anxiety in 11 nations: Part I: Dimensional constancy of the five-factor model
Behaviour Research and Therapy, Volume 41, Issue 4, April 2003, Pages 461-479
(and Part 2 here)
Abstract: “The Fear Survey Schedule-III (FSS-III) was administered to a total of 5491 students in Australia, East Germany, Great Britain, Greece, Guatemala, Hungary, Italy, Japan, Spain, Sweden, and Venezuela, and submitted to the multiple group method of confirmatory analysis (MGM) in order to determine the cross-national dimensional constancy of the five-factor model of self-assessed fears originally established in Dutch, British, and Canadian samples. The model comprises fears of bodily injury–illness–death, agoraphobic fears, social fears, fears of sexual and aggressive scenes, and harmless animals fears. Close correspondence between the factors was demonstrated across national samples. In each country, the corresponding scales were internally consistent, were intercorrelated at magnitudes comparable to those yielded in the original samples, and yielded (in 93% of the total number of 55 comparisons) sex differences in line with the usual finding (higher scores for females). In each country, the relatively largest sex differences were obtained on harmless animals fears. The organization of self-assessed fears is sufficiently similar across nations to warrant the use of the same weight matrix (scoring key) for the FSS-III in the different countries and to make cross-national comparisons feasible. This opens the way to further studies that attempt to predict (on an a priori basis) cross-national variations in fear levels with dimensions of national cultures.”
And quoting the abstract for the other part: “Hofstede’s dimensions of national cultures termed Masculinity–Femininity (MAS) and Uncertainty Avoidance (UAI) (Hofstede, 2001) are proposed to be of relevance for understanding national-level differences in self-assessed fears. The potential predictive role of national MAS was based on the classical work of Fodor (Fodor, 1974). Following Fodor, it was predicted that masculine (or tough) societies in which clearer differentiations are made between gender roles (high MAS) would report higher national levels of fears than feminine (or soft/modest) societies in which such differentiations are made to a clearly lesser extent (low MAS). In addition, it was anticipated that nervous-stressful-emotionally-expressive nations (high UAI) would report higher national levels of fears than calm-happy and low-emotional countries (low UAI), and that countries high on both MAS and UAI would report the highest national levels of fears“.

So, to summarize:

  • National / cultural differences show up when it comes to animal fears (particularly harmless)
  • Such fears are more common in ‘masculine’ cultures with more rigid gender roles, and also more typical of ‘nervous/emotionally expressive’ countries.
  • So there is some cultural association with such animal fears.
    And here is where I am going to rely on S1Q1E (ORIGIN OR DESCENT) parameter.

[5] Eva Landová, Natavan Bakhshaliyeva et al.
Association Between Fear and Beauty Evaluation of Snakes: Cross-Cultural Findings
Front. Psychol., 16 March 2018
The study states that the fear of snakes has evolutionary reasons and is particularly connected to the geographical and natural conditions in which a country’s culture was formed. Well, this is just another case showing that researchers do establish some cultural association with fears (and ultimately phobias).

SPs (including AP) and physical conditions

[6] Cornelia Witthauer, Vladeta Ajdacic-Gross, et al.
Associations of specific phobia and its subtypes with physical diseases: an adult community study
BMC Psychiatry, 2016
Results: “Specific phobia was associated with cardiac diseases, gastrointestinal diseases, respiratory diseases, arthritic conditions, migraine, and thyroid diseases (odds ratios between 1.49 and 2.53). Among the subtypes, different patterns of associations with physical diseases were established”.

[7] Ella L. Oar, Lara J. Farrell et al.
Blood-Injection-Injury Phobia and Dog Phobia in Youth: Psychological Characteristics and Associated Features in a Clinical Sample
Behavior Therapy, Volume 47, Issue 3, May 2016, Pages 312-324
Abstract: “Blood-Injection-Injury (BII) phobia is a particularly debilitating condition that has been largely ignored in the child literature. The present study examined the clinical phenomenology of BII phobia in 27 youths, relative to 25 youths with dog phobia—one of the most common and well-studied phobia subtypes in youth. Children were compared on measures of phobia severity, functional impairment, comorbidity, threat appraisals (danger expectancies and coping), focus of fear, and physiological responding, as well as vulnerability factors including disgust sensitivity and family history. Children and adolescents with BII phobia had greater diagnostic severity. In addition, they were more likely to have a comorbid diagnosis of a physical health condition, to report more exaggerated danger expectancies, and to report fears that focused more on physical symptoms (e.g., faintness and nausea) in comparison to youth with dog phobia. The present study advances knowledge relating to this poorly understood condition in youth“.
Here I can note that Blood-Injection-Injury phobia is often mentioned (and explored) in combination with animal phobia (like here and in [2] for instance, but I have come across other cases as well).

To summarize:

  • SP (AP among them) may be an indication of some physical conditions.
  • Which makes me think there might be some reflection in self-perception.
  • Unfortunately, I failed to find any studies of association between SP and hypochondria, which would be more appropriate for my intention to check exactly subjective perception of health.


Well, looks like I am going to unfreeze this blog for a while. I have just enrolled in yet another Coursera course on data analysis, and there are assignments that require blog posts. I see no point in creating a special blog for that, especially as I already have this one, created exactly for course needs. Here’s the course, by the way. I cannot recommend it for now, as I’ve just started.

So, the next several posts here are hopefully (I do hope the course is worth it) going to be assignments.

Yet another UPD

Hey everyone whoever might read this.
I last updated more than a year ago and now I’m here for basically the same reason as last time. Namely, to say: it works. I mean open online education, which includes platform MOOCs, peer-learning, interactive platforms like Khan Academy, etc. Combined with the willingness to learn new stuff and apply the new knowledge, it apparently does work.
A year ago, my learning efforts and actual work were still mostly separate. Today, I’m happy to say that they no longer are. About a month ago, I first tried my hand as an analyst. A very low-level analyst of course, but it’s huge progress for me. I started applying my coding skills (in Python) to searching and extracting data from a database (both directly and via API). By the way, this is an enormous database of government contracts and procurements managed by the great non-governmental project Clearspending.
Also, I’m currently working on a study of the state of open data in Russia. Things have changed since 2013, when Russian governmental bodies first started to publish their data. So it’s high time we looked at the general picture and tried to describe it. I think we’re going to finish it by February.
There have been no data expeditions recently, although it doesn’t mean that we’re not thinking of renewing this good tradition one day.

Last, but not least, this year I started learning Armenian (սովորում եմ հայերեն), which is an extremely interesting and beautiful language. Last summer in Armenia I got a very nice, although rather old, textbook (until then I had trouble finding some decent learning material). Next challenge will be to find a decent Armenian dictionary (I’ve failed so far).

By the way, here’s a picture of the ancient Erebuni fortress in Yerevan.


Oh, and as usual, I’ve got some nice courses to recommend.

  • MongoDB University is a collection of nice online courses that teach how to use Mongo Database. Most of them are rather practical, so if there’s no need to handle MongoDB right away, they may not be awfully interesting. But if there is such a need, they are great.
  • However, one of those courses, which is published at Udacity, has a wider scope. It teaches the techniques of working with different data formats using Python. Nothing extremely complex, but there are lots of very helpful tips.
  • Also, there’s a nice specialization series by Charles Severance at Coursera. It’s Python again. The first two courses in the series are just basics, but they are followed by two courses on how to work with web data (including API) and databases. Again, rather simple, but nonetheless very helpful sometimes.


I haven’t posted anything for quite a while, but actually I keep learning. It’s always somewhat sad to see these abandoned blogs created for peer-learning with a couple of posts and then no updates, so you just don’t know if their authors are still learning or gave up on it. Well, I haven’t. True, I’m more into platform MOOCs at the moment, so I’m not using this blog for peer-learning purposes directly. But I generally like this international open peer-learning project and I’m going to update this blog from time to time.

There’s a good occasion for this post: I’ve just completed An Introduction to Interactive Programming in Python at Coursera. I’ve finally done it after failing two previous attempts. It was challenging, and I’m not sure I’d have made it if I hadn’t done some preparatory work at Codecademy and with the help of Zed Shaw’s ‘Learn Python: The Hard Way’ (a great educational project by the way).

Just for show, here are the links to the mini-projects I completed during the course. I’m providing the links to my code in CodeSkulptor, an online application created by one of the instructors for writing and running Python code. In case someone wants to have a look, the best way to do it is by using Chrome (using Mozilla and other browsers may lead to some bugs).

This is actually the first part in Fundamentals of Computing specialisation. Next course in this sequence, Principles of Computing, is going to start in February 2015 and I’m totally going to try it. Before it begins, I’m going to have some fun at Khan Academy.

Finally, here are some courses I’d like to take a closer look at some point. Maybe someone will find them fascinating as well. If somebody has already dealt with some of them, it would be great if you shared your opinion.


1. I’m still alive.

2. I keep working as a journalist. Recently I’ve actually tried applying my newly acquired skills to my real job. There is still much to work on, but I seem to have learnt at least something. In the first case I tried to work with some data on the death penalty in the US; in the second case, I was visualising some aspects (namely, kidnapping) of the Global Terrorism Database. Both materials are in Russian of course. Moreover, the website currently does not allow for embedding interactive visualisations, so there are only screenshots, while the original interactive stuff is published on my Blogger account (but again, in Russian). Speaking of the Global Terrorism Database, there’s a whole course at Coursera based on this project. I don’t know much about it, so I can’t recommend it, but I’ll definitely have a look as soon as I can.

3. I keep tracking the developments in the activity of Open Data School in Moscow. It’s an interesting project both as an educational initiative and as part of promoting openness. More on it later, as well as on DLMOOC, by the way, which is fascinating (sadly, I’ve been virtually unable to participate full-scale).

4. Meanwhile, I’m trying to keep up with Linear Algebra: Foundation to Frontiers and Statistical Learning.

5. Right now I’m in the middle of running yet another Russian-language data expedition (DE3), which began on 20 February. This one is a bit different from DE1 and DE2. First, this time we (Irina Radchenko and myself) are working in partnership with Aleksey Sidorenko from the NGO “Теплица социальных технологий” (Teplitsa/Greenhouse of Social Technologies). It is also the first time that we have taken a socially meaningful subject, which is orphan diseases. DE3 is going to finish on March 5. Soon after, I’ll be able to tell more about it, as well as about its findings (we’re digging into the data on the situation in Russia in the first place). By the way, it will also be great to have some kind of feedback from people from other countries who are aware of the local situation (Jakes?).

6. Last, but not least, I’m currently involved (unfortunately in quite a hibernating way at the moment) in developing an international project on national informational resources. It all started with Team 10, but it’s going to grow. More on it later.

Deep Learning MOOC


As I have already mentioned, it’s not easy being me. In addition to my already formed nice and balanced ‘curriculum’ I have enrolled in yet another MOOC on Deep Learning, or DLMOOC. It begins in a week’s time, on 20 January and it ends on 21 March. It is another instance of a so-called mechanical MOOC, similar to Python MOOC, and also created by P2PU. This one is for educators. Well, as a person who has already launched two data-expeditions and is totally resolved to keep doing it in the future, I thought it might be a good idea to kind of learn a bit more about education in general. And this seems to be a very nice chance, because this MOOC has already collected more than 600 participants, that is educators from all over the world.

To be honest, I don’t think I’ll be able to be a very valuable contributor in terms of active participation, because I still have to work, learn pre-calculus and data analysis. And yes, we’ll have to launch our next expedition one day too (in spring I hope). But I’m sure I’ll still receive lots of valuable experience. I already have. I do like the communication system of DLMOOC with a G+ community as central platform. Although I’m not sure yet if it is appropriate for data-expeditions. It also has a flexible cooperation mechanism with an option to choose whether the participants want to work in ‘offline’ (friend-to-friend) groups or join into virtual groups. And it’s very interesting to see how it is going to develop and work. I will try to make notes on the way and share them here.