Yet another UPD

Hey everyone who might read this.
I last updated more than a year ago and now I’m here for basically the same reason as last time. Namely, to say: it works. I mean open online education, which includes platform MOOCs, peer-learning, interactive platforms like Khan Academy, etc. Combined with the willingness to learn new stuff and apply the new knowledge, it apparently does work.
A year ago, my learning efforts and actual work were still mostly separate. Today, I’m happy to say that they no longer are. About a month ago, I first tried my hand as an analyst. A very low-level analyst of course, but it’s huge progress for me. I started applying my coding skills (in Python) to searching and extracting data from a database (both directly and via its API). By the way, this is an enormous database of government contracts and procurements managed by Clearspending, a great non-governmental project.
Also, I’m currently working on a study of the open data situation in Russia. Things have changed since 2013, when Russian governmental bodies first started to publish their data. So it’s high time we looked at the general picture and tried to describe it. I think we’re going to finish it by February.
There have been no data expeditions recently, although it doesn’t mean that we’re not thinking of renewing this good tradition one day.

Last, but not least, this year I started learning Armenian (սովորում եմ հայերեն), which is an extremely interesting and beautiful language. Last summer in Armenia I got a very nice, although rather old, textbook (until then I had trouble finding any decent learning material). The next challenge will be to find a decent Armenian dictionary (I’ve failed so far).

By the way, here’s a picture of the ancient Erebuni fortress in Yerevan.

IMG_1229

Oh, and as usual, I’ve got some nice courses to recommend.

  • MongoDB University is a collection of nice online courses that teach how to use Mongo Database. Most of them are rather practical, so if there’s no need to handle MongoDB right away, they may not be awfully interesting. But if there is such a need, they are great.
  • However, one of those courses, which is published at Udacity, has a wider scope. It teaches the techniques of working with different data formats using Python. Nothing extremely complex, but there are lots of very helpful tips.
  • Also, there’s a nice specialization series by Charles Severance at Coursera. It’s Python again. The first two courses in the series are just basics, but they are followed by two courses on how to work with web data (including APIs) and databases. Again, rather simple, but nonetheless very helpful sometimes.

Upd

I haven’t posted anything for quite a while, but actually I keep learning. It’s always somewhat sad to see these abandoned blogs created for peer-learning with a couple of posts and then no updates, so you just don’t know if their authors are still learning or gave up on it. Well, I haven’t. True, I’m more into platform MOOCs at the moment, so I’m not using this blog for peer-learning purposes directly. But I generally like this international open peer-learning project and I’m going to update this blog from time to time.

There’s a good occasion for this post: I’ve just completed An Introduction to Interactive Programming in Python at Coursera. I’ve finally done it, having failed two previous attempts. It was challenging and I’m not sure I’d have made it if I hadn’t done some preparatory work at Codecademy and with the help of Zed Shaw’s ‘Learn Python the Hard Way’ (a great educational project by the way).

Just for show, here are the links to the mini-projects I completed during the course. I’m providing the links to my code in CodeSkulptor, an online application created by one of the instructors for writing and running Python code. In case someone wants to have a look, the best way to do it is by using Chrome (using Mozilla and other browsers may lead to some bugs).

This is actually the first part of the Fundamentals of Computing specialisation. The next course in this sequence, Principles of Computing, is going to start in February 2015 and I’m totally going to try it. Before it begins, I’m going to have some fun at Khan Academy.

Finally, here are some courses I’d like to take a closer look at some time. Maybe someone will find them fascinating as well. If somebody has already dealt with some of them, it would be great if you shared your opinion.

Briefly

1. I’m still alive.

2. I keep working as a journalist. Recently I’ve actually tried applying my newly acquired skills to my real job. Still much to work on, but I seem to have learnt at least something. In the first case I tried to work with some data on the death penalty in the US; in the second case, I was visualising some aspects (namely, kidnapping) of the Global Terrorism Database. Both materials are in Russian of course. Moreover, the website does not currently allow for embedding interactive visualisations, so there are only screenshots, while the original interactive stuff is published on my Blogger account (but again, in Russian). Speaking of the Global Terrorism Database, there’s a whole course at Coursera based on this project. I don’t know much about it, so I can’t recommend it, but I’ll definitely have a look as soon as I can.

3. I keep tracking the developments in the activity of Open Data School in Moscow. It’s an interesting project both as an educational initiative and as part of promoting openness. More on it later, as well as on DLMOOC, by the way, which is fascinating (sadly, I’ve been virtually unable to participate full-scale).

4. Meanwhile, I’m trying to keep up with Linear Algebra: Foundation to Frontiers and Statistical Learning.

5. Right now I’m in the middle of running yet another Russian-language data expedition (DE3), which began on 20 February. This one is a bit different from DE1 and DE2. First, this time we (Irina Radchenko and myself from Datadrivenjournalism.ru) worked in partnership with Aleksey Sidorenko from the NGO “Теплица социальных технологий” (Teplitsa/Greenhouse of Social Technologies). It is also the first time that we have taken a socially meaningful subject, which is orphan diseases. DE3 is going to finish on March 5. Soon after, I’ll be able to tell more about it, as well as about its findings (we’re primarily digging into the data on the situation in Russia). By the way, it will also be great to have some kind of feedback from people from other countries who are aware of the local situation (Jakes?).

6. Last, but not least, I’m currently involved (unfortunately in quite a hibernating way at the moment) in developing an international project on national informational resources. It all started with Team 10, but it’s going to grow. More on it later.

Deep Learning MOOC


As I have already mentioned, it’s not easy being me. In addition to my already formed nice and balanced ‘curriculum’ I have enrolled in yet another MOOC on Deep Learning, or DLMOOC. It begins in a week’s time, on 20 January, and it ends on 21 March. It is another instance of a so-called mechanical MOOC, similar to Python MOOC, and also created by P2PU. This one is for educators. Well, as a person who has already launched two data-expeditions and is totally resolved to keep doing it in the future, I thought it might be a good idea to learn a bit more about education in general. And this seems to be a very nice chance, because this MOOC has already collected more than 600 participants, that is, educators from all over the world.

To be honest, I don’t think I’ll be able to be a very valuable contributor in terms of active participation, because I still have to work, learn pre-calculus and data analysis. And yes, we’ll have to launch our next expedition one day too (in spring I hope). But I’m sure I’ll still receive lots of valuable experience. I already have. I do like the communication system of DLMOOC with a G+ community as central platform. Although I’m not sure yet if it is appropriate for data-expeditions. It also has a flexible cooperation mechanism with an option to choose whether the participants want to work in ‘offline’ (friend-to-friend) groups or join into virtual groups. And it’s very interesting to see how it is going to develop and work. I will try to make notes on the way and share them here.

Big plans for my 2nd semester

As the previous experience has shown, it’s hard to cover more than one course in one semester (this way of measuring my learning time seems most appropriate), if you have to work at the same time. Or rather one course and a half. Last semester, these were an introduction to statistics and a bit of R. Initially, I had huge plans for the upcoming semester. While learning statistics and earlier some Python basics I got a bit tired of constant guesswork and having to learn separate pieces of underlying fundamentals, without getting the whole picture. So I totally felt like taking two basic courses in this semester, namely some refresher in math and some intro to computer science.

As to computer science, I really liked the description of CS50, a Harvard CS course by David Malan, which has its online representation both as a static archive and as a MOOC at edX.org. The thing is that:

  • it lasts 10 weeks
  • it has 2 lectures every week, about an hour long each
  • it has 1 seminar a week, about an hour long
  • it includes 9 problem sets, estimated completion time 10 to 20 hours each
  • it includes 1 final project

Well, that’s definitely not what I’m likely to be able to cover before summer, especially if it is combined with a math course. Time for tough decisions. After some hesitation I decided that math comes first:

  • as a more basic subject
  • the thing I really needed while learning stats
  • more realistic to complete by the end of this semester.

There are actually two courses that seem quite appropriate for my needs (and I need to refresh some real basics):

I’m not sure about the latter, but Precalculus looks very promising in terms of at least answering some unresolved questions (simple, but very annoying) I already have after dealing with statistics.

So that’s what’s going to be my core subject for the semester, just like Statistics was last semester. Now, what about the remaining ‘half a course’ to complete my schedule? Well, I failed to complete Data Analysis last semester and I also want to revise what I learnt about statistics last year. That’s what I think I’m going to be dealing with for the rest of my learning time. Stanford is offering a course in statistical learning (as far as I understand this stands for statistics combined with some machine learning approaches). I hope it won’t be as challenging as it could have been, now that I have acquired some basic skills in handling R (and this course is based on R).

So these are my one and a half courses I’m going to take in this semester. As to CS, I do hope to take it in the summer.

A couple of links for those who also might need some school math refresher:

Second Data Expedition in Russian: Mission Accomplished

Not long ago, we completed the second Russian-language data expedition (DE2, ДЭ2) and here’s how it was.

The Russian-language version of this report can be found here.

Our first data expedition (DE1) was launched in July 2013. While organising DE2 we took into account the previous experience.

Brief overview

  • DE2 was launched on 9 December and finished on 23 December 2013.
  • The idea of a data expedition and its principles are based on the projects developed by P2PU and School of Data, which actually coined the term data expedition, as far as I know.
  • Therefore, DE2 was an open P2P-learning project-based initiative, available for everyone, free of charge and based on the idea of mutual help and cooperation.
  • The declared purpose of DE2 was to go through the whole cycle of data processing, with the key emphasis on exploring the structure of the data and patterns within a data set.
  • Unlike DE1, DE2 offered a pre-planned scenario with a sequence of four tasks and instructions aimed at the facilitation of the research process. The tasks of this sequence actually reflected the approach described in the Data Analysis course by Jeff Leek at Coursera.
  • The scenario was based on a certain data set, namely the Online Video survey conducted by PSRAI Omnibus and provided by the PewInternet project.
  • However, the participants were absolutely free to come up with their own alternative projects. So the scenario part was first and foremost a framework for those who have a hard time elaborating their own research pathway.
  • By default, DE2 suggested using Google Spreadsheets and Google/Open Refine as working tools, but the participants were free to use any tools they preferred (which they did).
  • Its communication activity was mainly concentrated in a Google Group, which could be used both as a forum and a mailing list.
  • DE2 required no prerequisites in terms of data processing experience.
  • 20 people signed up for participation.
  • DE2 was organised by Irina Radchenko and myself as part of our larger informal learning Russian-language project Datadrivenjournalism.ru.

Results

Just like in the case of DE1, they were twofold:

Participants’ results:

  • a number of visualisations reflecting the associations and patterns within the data set;
  • some visualisations and spreadsheets reflecting the structure of the data;
  • a number of links to learning resources contributed by participants;
  • a published material (in Russian) based on the research conducted under an alternative (participant-initiated) project within DE2.

(The visualisations and links can be found at the Google Group forum)

Organisational results:

  • the messages at the Google Group forum;
  • two forms filled in by participants (initial and final surveys).

DE2 participants used various tools, including:

Process

1. While DE1 can be considered a relative success in terms of participants’ involvement, the main challenge in DE2 was to keep that involvement strategy and to supplement it with a better structuring solution, so that the participants feel more certain about what they should do, no matter whether they have any previous experience in working with data.

To this end, we prepared a number of relatively short introductory/reference texts that provided the details about both the project’s basic principles and the meaning of particular aspects of building a data-driven story. We began posting these texts at the Group’s forum 5 days before the official start of DE2. These texts, apart from the actual tasks, included:

  • a brief intro into using GoogleGroups
  • DE2 scenario
  • an intro into a data expedition learning format
  • a description of a possible presentable output structure
  • types of data analysis and possible types of conclusions that can be made based on various analysis procedures
  • an invitation for the participants to introduce themselves
  • an intro to the data set we offered to work with

The four tasks were (briefly):

  1. explore the data description and the data set; find the meaning of the variable names; think about possible questions;
  2. provide a general description of the data set (how many observations, missing data etc.); start exploring possible associations between variables; think about more questions;
  3. continue exploring the associations; build exploratory charts; possibly do some statistical modeling if there are such skills;
  4. create expository charts; write a story; publish it (for those who didn’t have any platform of their own we created a special DE2 blog).
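For anyone who wants to try the second task in code rather than in a spreadsheet, here is a minimal sketch of the "general description" step: counting observations and missing values per variable. It uses only Python's standard library. The sample rows, the column names and the describe helper are hypothetical illustrations, not the DE2 data set or anything the participants actually ran.

```python
import csv
import io

# Hypothetical sample standing in for a survey data set (not the DE2 data).
SAMPLE = """age,sex,watches_video
34,F,yes
,M,no
51,,yes
"""


def describe(rows):
    """Return (number of observations, missing-value count per column)."""
    rows = list(rows)
    missing = {}
    for row in rows:
        for name, value in row.items():
            missing.setdefault(name, 0)
            # Treat empty or whitespace-only cells (and absent cells) as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
    return len(rows), missing


n_obs, missing = describe(csv.DictReader(io.StringIO(SAMPLE)))
print("observations:", n_obs)
for name, count in missing.items():
    print(f"{name}: {count} missing")
```

With a real file, the same function could be fed csv.DictReader(open(path, newline="")) instead of the in-memory sample.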

Apparently, some of these texts did a kind of facilitator’s job by simply initiating a space where a discussion could develop.

2. The introduction of the scenario seems to have worked, as most of the participants were trying to follow the tasks and discuss their findings or at least were toying with the provided data in their own way. On the other hand, there was an alternative research initiative, carried out by one of the participants on his own. Although he didn’t have a whole team working on the same project, he managed to receive some informational support and feedback at the forum.

The main objective of the ready-made scenario was twofold:

  • To make sure that those who prefer to work on their own, but have trouble building their own research, still have something to do without necessarily having to get involved in the communication process;
  • To make sure that the participants with no experience have something to rely on, as mentioned above.

3. The choice of the data set for the scenario was a product of a compromise. We did realize that for the Russian-speaking audience data on Russia would be more interesting and probably easier to work with (we actually had to translate into Russian the data description provided at the data website to make sure everyone can understand it). But we also wanted the data set (a) to be rather clean to spare inexperienced participants from spending much time on cleaning, as they only had two weeks at their disposal; and (b) to contain lots of variables reflecting different parameters of measurement in order to provide plenty of opportunities to compare them. The data set we came up with in the end was satisfactory in terms of the latter two requirements, but was based on a US survey.

4. However, the alternative project was exploring Russian material. Namely, it was aimed at measuring the effectiveness of the Russian legislation on blocking websites that are regarded as harmful for children (those that are deemed to be promoting child porn, drug abuse, suicide, etc.). This law was passed in 2012 and was widely considered rather inefficient in terms of its declared objective, but very convenient as a censorship tool for blocking undesirable web resources. This is actually a very interesting direction of research, which could well be continued within our upcoming projects.

5. As to the collaboration activity during the expedition, we can mainly judge it by the forum messages. These messages do not reflect any activities outside the forum, which we cannot measure, but the forum seems somewhat representative as it is. Here is the activity chart, which shows that the communication was not evenly distributed, but that it covers the official DE2 period and actually goes beyond its official limits. The figures behind this chart include all forum messages, that is, both participants’ and organisers’ messages.

enActivityDistribution

The red frame shows the official dates of DE2.

And this is a chart showing the activity by day of the week, measured over the whole period of forum activity. The most active exchange apparently happens on the first working days and then gradually declines, almost ceasing at the weekend.

enActivityWeek

Figures can be found here.

It seems that during the working week people didn’t have much time to work on DE2, so most of the work was done at the weekend, and after that people shared their findings.
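A day-of-week breakdown like the one in the chart above can be sketched in a few lines of Python. The dates below are made up purely for illustration (they are not the DE2 forum figures), and messages_by_weekday is a hypothetical helper name.

```python
from collections import Counter
from datetime import date

# Hypothetical message dates, for illustration only (not the DE2 forum data).
message_dates = [
    date(2013, 12, 9),   # a Monday
    date(2013, 12, 9),
    date(2013, 12, 10),  # a Tuesday
    date(2013, 12, 14),  # a Saturday
]

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def messages_by_weekday(dates):
    """Count messages per day of the week, keeping days with zero messages."""
    counts = Counter(d.weekday() for d in dates)  # Monday is 0, Sunday is 6
    return {name: counts.get(i, 0) for i, name in enumerate(WEEKDAYS)}


by_day = messages_by_weekday(message_dates)
for name, count in by_day.items():
    print(name, count)
```

Keeping the zero-count days in the result matters here, because the quiet days are exactly what such a chart is meant to show.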

Participants

While in DE1 the participants could join the Google Group on their own, DE2 included one extra step. Those who wanted to participate had to first sign up by filling in a registration form. This helped us to collect more information about the participants. We also expected that this additional step could serve as a motivation filter. After sending the form, everyone was added to the DE2 Group. As a result, we had 20 completed forms, but only 14 participants added to the Group. We couldn’t add the rest, because the email addresses they provided were inactive.

Here’s a brief overview of everyone who registered, based on the form data. Organisers’ data are not included.

enDE2Participants

Figures can be found here.

DE1 vs. DE2

It is interesting to compare results of DE2 to the results of DE1. Here is the proportional comparison:

enDE1vsDE2

Figures can be found here.

In this chart, we can see that during DE1 more messages (124) were posted on the forum than during DE2 (107). Given that DE1 was only one week long and DE2 took a fortnight, this might look somewhat discouraging. On the other hand, we can also see that more people were involved in cooperation in DE2. This was measured by two parameters: the number of people who left at least one message on the forum (a self-introduction message in most cases, if it was the only one) and the number of people who participated in experience/information exchange (normally in the form of questions and answers, as well as sharing findings). In both cases DE2 shows better results (10 and 6 respectively) than DE1 (6 and 4), although in the former case fewer people had access to the forum.

Though it might seem somewhat confusing, the likely explanation is that the timing for DE2 was extremely counterproductive and was actually our huge mistake. Although it was planned to finish a week before the New Year, people were still exceptionally busy trying to meet their deadlines at work, preparing for exams or just dealing with all the fuss of the upcoming holidays. I think this was one of the most important reasons for those message-free days shown in the chart above, as well as the relatively low number of messages.

Meanwhile, the bigger number of people involved in the working and communication process shows that even though the participants were very busy they were still willing to proceed with the project. This makes me think that DE2 shows some progress compared to DE1, although the timing lesson should be taken into consideration in the future.

I must add that the DE1 report makes no distinction between organisers and participants in its measurements. In the case of DE2, I only counted the data regarding the participants, leaving the two organisers aside (with the exception of the cases where it is specially discussed). So when comparing the data on both expeditions in this report I also used only participants’ data for DE1. This explains the slight differences in figures between the two reports.
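Since DE1 and DE2 differed in length, the comparison above can also be put on a per-week footing. The sketch below uses only the figures quoted in this section (124 and 107 messages, 6 and 10 people who posted, 4 and 6 people who took part in the exchange, one week and two weeks); the dictionary layout and the messages_per_week helper are just one illustrative way to arrange them.

```python
# Figures quoted in the DE1 vs. DE2 comparison above.
expeditions = {
    "DE1": {"weeks": 1, "messages": 124, "posters": 6, "exchangers": 4},
    "DE2": {"weeks": 2, "messages": 107, "posters": 10, "exchangers": 6},
}


def messages_per_week(stats):
    """Average number of forum messages per week of the expedition."""
    return stats["messages"] / stats["weeks"]


for name, stats in expeditions.items():
    rate = messages_per_week(stats)
    print(f"{name}: {rate:.1f} messages per week, "
          f"{stats['posters']} people posted, "
          f"{stats['exchangers']} took part in the exchange")
```

On these numbers the weekly message rate in DE2 is less than half that of DE1 (53.5 against 124), while the number of people posting and exchanging is higher, which is the contrast discussed above.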

Conclusions

Success

  • More participants were involved in the working process.
  • The participants demonstrated a friendly and considerate attitude to each other (and the feedback provided by 5 of them pointed out that they appreciated the communication and cooperation component).
  • All the respondents in the final survey expressed their intention to participate in following expeditions.
  • The activity, although not very regular, persisted through the whole expedition period.

Problems

  • Timing. This is the mistake that should never be repeated. What is the best time for an expedition has yet to be tested, but it is already obvious that it should not be scheduled in such busy months as December, May, June and probably November.
  • Promotion strategy. Almost half of the registered (8 out of 20) signed up during the first week of DE2, after the expedition had officially begun (at the beginning of the second week the registration form was shut down). That means that the promotion should be more timely and efficient.
  • Relevance. Although something could be learnt even with the help of the provided data set, next time we hope to come up with a more relevant one.
  • The lack of final results presented. There was only one participant who published material based on his research during DE2. At least one published story is by all means great. But no one else came up with a story. Part of the reason might be that some participants were satisfied with what they had learnt and felt no need to publish a story. Another reason could be the lack of time. Still, I think that a better format for presenting results might motivate a bigger output.

Based on the results of DE2 and taking into account its lessons, we are going to proceed with organising Russian-language DEs. We might also consider launching alternative types of DEs alongside the regular ones, such as:

  • DEs for the participants with a particular level/kind of skills.
  • DEs with the emphasis on real research, rather than learning.
  • DEs aimed at mastering particular tools or working techniques.

All in all, DE2 was quite an experience and a great opportunity for getting in touch with wonderful people. I hope we will soon come up with more DE projects.

And happy New Year to everyone who somehow managed to make their way through this post up to this point.

Is it Christmas already?

It’s been quite an intensive period recently. First, I was having two parallel courses at Coursera – on data analysis and on statistics. Second, Irina Radchenko and I were preparing to launch a new Russian-language data expedition under our Datadrivenjournalism.ru project and then we were actually coordinating it for two weeks (9 – 23 December). Third, I suddenly had a huge task at work with a really tough deadline, which actually ruined my plans a bit, but thankfully not all of them. So here’s a brief account of the resulting layout:

I had to drop the data analysis course after its sixth week. Due to that sudden workload I couldn’t afford doing the second assignment, which was somewhat upsetting. But on the other hand, I think I’ll be able to do it later either on my own or within the course iteration (I’m almost sure it’s going to be launched soon again). Anyway, I’m glad I’ve done at least something, because it turned out to be rather helpful, especially in terms of structuring things and my mind. And yes, the previous course Computing for Data Analysis (on R) was extremely helpful. (For those who might be interested: the next iteration of this course starts on 6 January 2014.)

On the other hand, I triumphantly completed the Statistics One course and that’s really cool. There are contradictory reviews of this course online. Some of them claim that the course is inconsistent in terms of difficulty: sometimes too easy and even boring, sometimes too complicated. Well, after completing it, I can’t say that I’ve digested all the material provided. But now I have a better vision of what statistics is like and how it approaches data. Also I can apply some techniques for data analysis with the help of R, but I wouldn’t claim I completely understand the mechanisms underlying some of these operations. Next I’m actually going to focus on Open Intro Statistics, which is a great textbook, and revise the material in order to pack it into my head. To wrap up this segment, I’ll add that the material that had been provided within that course by the middle of the semester was enough to complete assignment one in the Data Analysis course.

As to the data expedition, it was luckily completed yesterday. Its organisation was considerably different from the previous experience and demanded quite a bit of advance preparation, apart from participation as such. Although I couldn’t participate in it myself as thoroughly as I would have wanted to, I still have to admit that the result somewhat exceeded my expectations. I’ll be writing about it in greater detail after I analyse the whole picture. For now I can say that the timing was horrible. So the lesson is: never launch learning projects right before Christmas or the New Year. But nonetheless there are some very inspiring results and the participants were simply great.

Also, here are some links as usual:

And merry Christmas to everyone who celebrates it now!