Yet another UPD

Hey everyone, whoever might read this.
I last updated more than a year ago and now I’m here for basically the same reason as last time. Namely, to say: it works. I mean open online education, which includes platform MOOCs, peer-learning, interactive platforms like Khan Academy, etc. Combined with the willingness to learn new stuff and apply the new knowledge, it apparently does work.
A year ago, my learning efforts and actual work were still mostly separate. Today, I’m happy to say that they no longer are. About a month ago, I first tried my hand as an analyst. A very low-level analyst of course, but it’s huge progress for me. I started applying my coding skills (in Python) to searching and extracting data from a database (both directly and via an API). By the way, this is an enormous database of government contracts and procurements managed by a great non-governmental project, Clearspending.
Also, I’m currently working on a study of the state of open data in Russia. Things have changed since 2013, when Russian governmental bodies first started to publish their data. So it’s high time we looked at the general picture and tried to describe it. I think we’re going to finish it by February.
There have been no data expeditions recently, although it doesn’t mean that we’re not thinking of renewing this good tradition one day.

Last, but not least, this year I started learning Armenian (սովորում եմ հայերեն), which is an extremely interesting and beautiful language. Last summer in Armenia I got a very nice, although rather old, textbook (until then I had trouble finding decent learning material). The next challenge will be to find a decent Armenian dictionary (I’ve failed so far).

By the way, here’s a picture of the ancient Erebuni fortress in Yerevan.

IMG_1229

Oh, and as usual, I’ve got some nice courses to recommend.

  • MongoDB University is a collection of nice online courses that teach how to use MongoDB. Most of them are rather practical, so if there’s no need to handle MongoDB right away, they may not be awfully interesting. But if there is such a need, they are great.
  • However, one of those courses, which is published at Udacity, has a wider scope. It teaches the techniques of working with different data formats using Python. Nothing extremely complex, but there are lots of very helpful tips.
  • Also, there’s a nice specialization series by Charles Severance at Coursera. It’s Python again. The first two courses in the series are just basics, but they are followed by two courses on how to work with web data (including APIs) and databases. Again, rather simple, but nonetheless very helpful sometimes.

Deep Learning MOOC

cropped-wheel-1000x150

As I have already mentioned, it’s not easy being me. In addition to my already formed nice and balanced ‘curriculum’ I have enrolled in yet another MOOC on Deep Learning, or DLMOOC. It begins in a week’s time, on 20 January, and ends on 21 March. It is another instance of a so-called mechanical MOOC, similar to Python MOOC, and also created by P2PU. This one is for educators. Well, as a person who has already launched two data-expeditions and is totally resolved to keep doing it in the future, I thought it might be a good idea to learn a bit more about education in general. And this seems to be a very nice chance, because this MOOC has already collected more than 600 participants, that is, educators from all over the world.

To be honest, I don’t think I’ll be able to be a very valuable contributor in terms of active participation, because I still have to work and learn pre-calculus and data analysis. And yes, we’ll have to launch our next expedition one day too (in spring I hope). But I’m sure I’ll still gain lots of valuable experience. I already have. I do like the communication system of DLMOOC, with a G+ community as the central platform, although I’m not sure yet if it is appropriate for data-expeditions. It also has a flexible cooperation mechanism with an option to choose whether the participants want to work in ‘offline’ (friend-to-friend) groups or join virtual groups. And it’s very interesting to see how it is going to develop and work. I will try to make notes on the way and share them here.

Second Data Expedition in Russian: Mission Accomplished

Not long ago, we completed the second Russian-language data expedition (DE2, ДЭ2) and here’s how it was.

The Russian-language version of this report can be found here.

Our first data expedition (DE1) was launched in July 2013. While organising DE2 we took into account the previous experience.

Brief overview

  • DE2 was launched on 9 December and finished on 23 December 2013.
  • The idea of a data expedition and its principles are based on the projects developed by P2PU and School of Data, which actually coined the term data expedition, as far as I know.
  • Therefore, DE2 was an open P2P-learning project-based initiative, available for everyone, free of charge and based on the idea of mutual help and cooperation.
  • The declared purpose of DE2 was to go through the whole cycle of data processing, with the key emphasis on exploring the structure of the data and patterns within a data set.
  • Unlike DE1, DE2 offered a pre-planned scenario with a sequence of four tasks and instructions aimed at the facilitation of the research process. The tasks of this sequence actually reflected the approach described in the Data Analysis course by Jeff Leek at Coursera.
  • The scenario was based on a certain data set, namely the Online Video survey conducted by PSRAI Omnibus and provided by the PewInternet project.
  • However, the participants were absolutely free to come up with their own alternative projects. So the scenario part was first and foremost a framework for those who have a hard time elaborating their own research pathway.
  • By default, DE2 suggested using Google Spreadsheets and Google/Open Refine as working tools, but the participants were free to use any tools they preferred (which they did).
  • Its communication activity was mainly concentrated in a Google Group, which could be used both as a forum and a mailing list.
  • DE2 required no prerequisites in terms of data processing experience.
  • 20 people signed up for participation.
  • DE2 was organised by Irina Radchenko and myself as part of our larger informal learning Russian-language project Datadrivenjournalism.ru.

Results

Just like in the case of DE1, they were twofold:

Participants’ results:

  • a number of visualisations reflecting the associations and patterns within the data set;
  • some visualisations and spreadsheets reflecting the structure of the data;
  • a number of links to learning resources contributed by participants;
  • a published article (in Russian) based on the research conducted under an alternative (participant-initiated) project within DE2.

(The visualisations and links can be found at the Google Group forum)

Organisational results:

  • the messages at the Google Group forum;
  • two forms filled by participants (initial and final surveys)

DE2 participants used various tools, including:

Process

1. While DE1 can be considered a relative success in terms of participants’ involvement, the main challenge in DE2 was to keep that involvement strategy and to supplement it with a better structuring solution, so that the participants would feel more certain about what they should do, whether or not they had any previous experience in working with data.

To this end, we prepared a number of relatively short introductory/reference texts that provided details about both the project’s basic principles and the meaning of particular aspects of building a data-driven story. We began posting these texts at the Group’s forum 5 days before the official start of DE2. These texts, apart from the actual tasks, included:

  • a brief intro into using GoogleGroups
  • DE2 scenario
  • an intro into a data expedition learning format
  • a description of a possible presentable output structure
  • types of data analysis and possible types of conclusions that can be made based on various analysis procedures
  • an invitation for the participants to introduce themselves
  • an intro to the data set we offered to work with

The four tasks were (briefly):

  1. explore the data description and the data set; find the meaning of the variable names; think about possible questions;
  2. provide a general description of the data set (how many observations, missing data etc.); start exploring possible associations between variables; think about more questions;
  3. continue exploring the associations; build exploratory charts; possibly do some statistical modeling if there are such skills;
  4. create expository charts; write a story; publish it (for those who didn’t have any platform of their own we created a special DE2 blog).
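Task 2 (a general description of the data set) can also be done in a few lines of code instead of a spreadsheet. What follows is only a rough sketch in Python: the sample rows and column names are hypothetical and stand in for a survey export, they are not taken from the actual DE2 data set. It counts the observations and the missing values in each column.

```python
import csv
import io

# Hypothetical sample standing in for a survey export; the real DE2 data set
# had different columns and many more rows.
SAMPLE_CSV = """respondent,age,watches_video,uploads_video
1,34,yes,no
2,,yes,
3,51,no,no
4,27,,yes
"""


def describe(csv_text):
    """Return the number of observations and the missing values per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    missing = {name: 0 for name in reader.fieldnames}
    for row in rows:
        for name, value in row.items():
            # An empty cell (or a cell absent from a short row) counts as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
    return len(rows), missing


n_rows, missing = describe(SAMPLE_CSV)
print("observations:", n_rows)
for name, count in missing.items():
    print(f"{name}: {count} missing")
```

With a real file, the same function could be fed the text read from disk; the column names would then come from the file's own header row.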

Apparently, some of these texts did a kind of facilitator’s job by simply initiating a space where a discussion could develop.

2. The introduction of the scenario seems to have worked, as most of the participants were trying to follow the tasks and discuss their findings or at least were toying with the provided data in their own way. On the other hand, there was an alternative research initiative, carried out by one of the participants on his own. Although he didn’t have a whole team working on the same project, he managed to receive some informational support and feedback at the forum.

The main objective of the ready-made scenario was twofold:

  • To make sure that those who prefer to work on their own, but have trouble building their own research, still have something to do without necessarily having to get involved in the communication process;
  • To make sure that the participants with no experience have something to rely on, as mentioned above.

3. The choice of the data set for the scenario was a product of a compromise. We did realize that for the Russian-speaking audience data on Russia would be more interesting and probably easier to work with (we actually had to translate into Russian the data description provided at the data website to make sure everyone could understand it). But we also wanted the data set (a) to be rather clean, to spare inexperienced participants spending much time on cleaning, as they only had two weeks at their disposal; and (b) to contain lots of variables reflecting different parameters of measurement, in order to provide plenty of opportunities to compare them. The data set we came up with in the end was satisfactory in terms of the latter two requirements, but was based on a US survey.

4. However, the alternative project was exploring Russian material. Namely, it was aimed at measuring the effectiveness of the Russian legislation on blocking websites that are regarded as harmful for children (those that are deemed to be promoting child porn, drug abuse, suicide, etc.). This law was passed in 2012 and was widely considered rather inefficient in terms of its declared objective, but very convenient as a censorship tool for blocking undesirable web resources. This is actually a very interesting direction of research, which could well be continued within our upcoming projects.

5. As to the collaboration activity during the expedition, we can mainly judge it by the forum messages. These messages do not reflect any activities outside the forum, which we cannot measure, but the forum seems reasonably representative as it is. Here is the activity chart, which shows that the communication was not evenly distributed, but that it covers the whole official DE2 period and actually goes beyond its official limits. The figures behind this chart include all forum messages, that is, both participants’ and organisers’ messages.

enActivityDistribution

The red frame marks the official period of DE2.

And this is a chart showing the activity dynamic by days of the week measured through all the forum activity period. The most active exchange apparently happens on the first working days and then gradually slides down to almost cease at the weekend.

enActivityWeek

Figures can be found here.

It seems that during the working week people didn’t have much time to work on DE2, so most of the work was done at the weekend, and after that people shared their findings.
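For anyone who wants to reproduce a day-of-the-week breakdown like this one, here is a minimal sketch in Python. The list of dates is made up for illustration and is not the real forum log; only the counting technique is the point.

```python
from collections import Counter
from datetime import date

# Hypothetical message dates, NOT the real DE2 forum log.
message_dates = [
    date(2013, 12, 9), date(2013, 12, 9), date(2013, 12, 10),
    date(2013, 12, 11), date(2013, 12, 14), date(2013, 12, 16),
    date(2013, 12, 16), date(2013, 12, 17),
]

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def messages_by_weekday(dates):
    """Count messages per day of the week (date.weekday(): Monday is 0)."""
    counts = Counter(d.weekday() for d in dates)
    return {name: counts.get(i, 0) for i, name in enumerate(WEEKDAYS)}


by_weekday = messages_by_weekday(message_dates)
for name, count in by_weekday.items():
    print(name, count)
```

Days without any messages are kept in the result with a zero, so the quiet days show up in a chart instead of silently disappearing.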

Participants

While in DE1 the participants could join the Google Group on their own, DE2 included one extra step. Those who wanted to participate had to first sign up by filling in a registration form. This helped us to collect more information about the participants. We also expected that this additional step could serve as a motivation filter. After sending the form, everyone was added to the DE2 Group. As a result, we had 20 completed forms, but only 14 participants added to the Group. We couldn’t add the rest, because the email addresses they provided were inactive.

Here’s a brief review of the whole number of those who registered based on the form data. Organisers’ data are not included.

enDE2Participants

Figures can be found here.

DE1 vs. DE2

It is interesting to compare results of DE2 to the results of DE1. Here is the proportional comparison:

enDE1vsDE2.

Figures can be found here.

In this chart, we can see that during DE1 more messages (124) were posted on the forum than during DE2 (107). Given that DE1 was only one week long and DE2 took a fortnight, this might look somewhat discouraging. On the other hand, we can also see that more people were involved in cooperation in DE2. This was measured by two parameters: the number of people who left at least one message on the forum (a self-introduction message in most cases, if it was the only one) and the number of people who participated in experience/information exchange (normally expressed in the form of questions and answers, as well as sharing findings). In both cases DE2 shows better results (10 and 6 respectively) than DE1 (6 and 4), although in DE2 fewer people had access to the forum.
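To put these numbers side by side, here is a small worked calculation in Python. It is only a sketch and rests on one assumption of mine: that the number of people with forum access was 20 in DE1 (those who joined the group) and 14 in DE2 (those added to the group). The other figures are the ones quoted above.

```python
# Figures quoted in this report; the "access" values are an assumption
# (20 people joined the DE1 group, 14 were added to the DE2 group).
expeditions = {
    "DE1": {"access": 20, "posted": 6, "exchanged": 4},
    "DE2": {"access": 14, "posted": 10, "exchanged": 6},
}


def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


shares = {
    name: {
        "posted": share(d["posted"], d["access"]),
        "exchanged": share(d["exchanged"], d["access"]),
    }
    for name, d in expeditions.items()
}

for name, s in shares.items():
    print(f"{name}: {s['posted']}% posted, {s['exchanged']}% exchanged")
```

Under that assumption, the share of people who posted at least once grows from 30.0% in DE1 to 71.4% in DE2, and the share of those who took part in the exchange grows from 20.0% to 42.9%.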

Though it might seem somewhat confusing, the likely explanation is that the timing for DE2 was extremely counterproductive and was actually our huge mistake. Although it was planned to finish a week before the New Year, people were still exceptionally busy trying to meet their deadlines at work, preparing for exams or just having a lot of fuss due to the upcoming holidays. I think this was one of the most important reasons for those message-free days shown in the chart above, as well as for the relatively low number of messages.

Meanwhile, the bigger number of people involved in the working and communication process shows that even though the participants were very busy they were still willing to proceed with the project. This makes me think that DE2 shows some progress compared to DE1, although the timing lesson should be taken into consideration in the future.

I must add that the DE1 report makes no distinction between organisers and participants in its measurements. In the case of DE2, I only counted the data regarding the participants, leaving the two organisers aside (with the exception of the cases where it is specially discussed). So when comparing the data on both expeditions in this report I also used only participants’ data for DE1. This explains the slight differences in figures between the two reports.

Conclusions

Success

  • More participants were involved in the working process.
  • The participants demonstrated a friendly and considerate attitude to each other (and the feedback provided by 5 of them pointed out that they appreciated the communication and cooperation component).
  • All the respondents in the final survey expressed their intention to participate in following expeditions.
  • The activity, although not very regular, persisted through the whole expedition period.

Problems

  • Timing. This is the mistake that should never be repeated. What is the best time for an expedition has yet to be tested, but it is already obvious that it should not be scheduled in such busy months as December, May, June and probably November.
  • Promotion strategy. Almost half of the registered (8 out of 20) signed up during the first week of DE2, after the expedition had officially begun (at the beginning of the second week the registration form was shut down). That means that the promotion should be more timely and efficient.
  • Relevance. Although something could be learnt even with the help of the provided data set, next time we hope to come up with a more relevant one.
  • The lack of final results presented. There was only one participant who published the material based on his research during DE2. At least one story published is by all means great. But no one else came up with a story. Partially the reason for it might be that some participants were satisfied with what they had learnt and felt no need to publish a story. Another reason could be the lack of time. Still I think that a better format for results presentation might become a motivation for a bigger output.

Based on the results of DE2 and taking into account its lessons, we are going to proceed with organising Russian-language DEs. We might also consider launching alternative types of DEs alongside the regular ones, such as:

  • DEs for the participants with a particular level/kind of skills.
  • DEs with the emphasis on real research, rather than learning.
  • DEs aimed at mastering particular tools or working techniques.

All in all, DE2 was quite an experience and a great opportunity for getting in touch with wonderful people. I hope we will soon come up with more DE projects.

And happy New Year to everyone who somehow managed to make their way through this post up to this point.

Is it Christmas already?

It’s been quite an intensive period recently. First, I was having two parallel courses at Coursera – on data analysis and on statistics. Second, Irina Radchenko and I were preparing to launch a new Russian-language data expedition under our Datadrivenjournalism.ru project and then we were actually coordinating it for two weeks (9 – 23 December). Third, I suddenly had a huge task at work with a really tough deadline, which actually ruined my plans a bit, but thankfully not all of them. So here’s a brief account of the resulting layout:

I had to drop the data analysis course after its sixth week. Due to that sudden workload I couldn’t afford doing the second assignment, which was somewhat upsetting. But on the other hand, I think I’ll be able to do it later either on my own or within the course iteration (I’m almost sure it’s going to be launched soon again). Anyway, I’m glad I’ve done at least something, because it turned out to be rather helpful, especially in terms of structuring things and my mind. And yes, the previous course Computing for Data Analysis (on R) was extremely helpful. (For those who might be interested: the next iteration of this course starts on 6 January 2014.)

On the other hand, I triumphantly completed the Statistics One course and that’s really cool. There are contradictory reviews of this course online. Some of them claim that the course is inconsistent in terms of difficulty: sometimes too easy and even boring, sometimes too complicated. Well, after completing it, I can’t say that I’ve digested all the material provided. But now I have a better vision of what statistics is like and how it approaches data. Also I can apply some techniques for data analysis with the help of R, but I wouldn’t claim I completely understand the mechanisms underlying some of these operations. Next I’m actually going to focus on Open Intro Statistics, which is a great textbook, and revise the material in order to pack it into my head. To wrap up this segment, I’ll add that the material that had been provided within that course by the middle of the semester was enough to complete assignment one in the Data Analysis course.

As to the data expedition, it was luckily completed yesterday. Its organisation was considerably different from the previous experience and demanded quite a bit of advance preparation, apart from participation as such. Although I couldn’t participate in it myself as thoroughly as I would have wanted to, I still have to admit that the result somewhat exceeded my expectations. I’ll be writing about it in greater detail after I analyse the whole picture. For now I can say that the timing was horrible. So the lesson is: never launch learning projects right before Christmas or the New Year. But nonetheless there are some very inspiring results and the participants were virtually great.

Also, here are some links as usual:

And merry Christmas to everyone who celebrates it now!

Links Links Links

A new bunch of links to the resources regarding statistics etc. that seem to me helpful:

Introduction to Statistics

This is an archive of an introductory statistics course at Coursera Statistics: Making Sense of Data by Alison Gibbs, Jeffrey Rosenthal (University of Toronto).

The authors of the course kindly provided a list of recommended literature. I don’t think it would be a crime to reproduce it here. So, they recommended three ‘traditional books’:

  • Introduction to the Practice of Statistics, by David S. Moore and George P. McCabe. (The book is currently in its fifth edition, but any edition will do.)
  • Stats: Data and Models, Canadian edition, by Richard D. De Veaux, Paul F. Velleman, David E. Bock, Augustin M. Vukov, and Augustine C.M. Wong. (The original version of the book, by the first three authors only, is also recommended.)
  • Statistics, by David Freedman, Robert Pisani, and Roger Purves.

And four online resources:

  • OpenIntro Statistics, by David M. Diez, Christopher D. Barr, and Mine Cetinkaya-Rundel. The cool thing about this one is that it’s not just a book, it’s a whole learning tool including labs and some instructions on using R.
  • Online Statistics Education, by David M. Lane, David Scott, Mikki Hebl, Rudy Guerra, Dan Osherson, and Heidi Zimmer
  • HyperStat Online, by David M. Lane
  • StatPrimer, by B. Burt Gerstman

R

Statistics and Python

And last, a couple of books kindly recommended by a great person at P2PU. These connect statistics to programming in Python:

First Data Expedition in Russian: Mission Complete

It’s been a while since I last posted here and there are actually two reasons for that. First, I’ve got a really heavy workload and it’s going to remain so for a while. What is most upsetting, I haven’t got enough time for doing the Python course, but I’m certainly going to make up for it as soon as possible. Second, we were busy organising and then participating in the first Russian-language experimental data-expedition, or data-MOOC. And this is the experience I want to share here as well, because it was extremely inspiring and rather instructive. Besides, it’s about p2p-learning, which is one of the subjects of this blog.

While writing this account I was using the model provided in the account of the School of Data/P2PU’s MOOC.

Now, some overview

  • This project was inspired by participating in Data-MOOC organised by School of Data and P2PU in April-May 2013. Also, I must say that the blog of the Python MOOC has been really helpful and instructive.
  • The project was based on p2p-learning principles and a mechanical MOOC model. For the sake of brevity and attractiveness, we used the term ‘data-expedition’ (экспедиция данных, дата-экспедиция) to describe it.
  • It was a week long: from July 22 to July 28.
  • Its declared objective was to learn how to look for datasets online. To focus the task, we suggested a topic, which was collecting data about universities all over the world. So, unlike the School of Data’s Data-MOOC, it wasn’t supposed to reproduce the complete data-processing cycle, but rather to perform its first stage.
  • The project was organised by Irina Radchenko and myself as part of our larger informal project Datadrivenjournalism.ru. Within the Expedition, we acted both as the support team and participants.
  • The goal of this Expedition was twofold. First, we wanted to see if this format works in the local environment. Second, well, I personally wanted to learn more about how to search for data.
  • We announced the upcoming data-expedition ten days before the start, and by the beginning 20 people had signed up for participation, which was actually more than we expected.
  • Participation was absolutely free and open and no special skills were required.
  • The participants’ main communication platform was a Google group set as a forum (with a possibility to turn on the mailing option).
  • Our main collaboration tool was Google Docs.
  • This expedition relied heavily on collaboration and p2p initiative. It had no prescribed plan or step-by-step tasks, apart from the initially formulated one. So the organisational messages were first and foremost aimed at facilitating people’s communication and introducing them to the specifics of the format.

Results

As expected, they are twofold.

Participants’ results:

  • 3 visualisations
  • 1 data-scraping tutorial for beginners
  • A collective Google Doc with a list of sources

Organisational results:

  • 2 surveys (preliminary and final)
  • The participants’ exchange documented on the Google group’s forum
  • The set of collective Google Docs

As to the participants’ results, here are some links:

But in this post, I’ll focus on some highlights of the process.

1. Speaking of the organisation, our main target was to help people get involved in cooperation and boost activity. To this end, we started introducing people to the format a few days before the expedition began. Judging by the previous experience, the lack of confidence and the uncertainty about where to start and what to do is one of the barriers to be overcome. In order to facilitate cooperation, we published a series of organisational messages, one after another:

  • Introduction to the objective of the expedition (explaining why searching for data is an important skill)
  • Introduction to the format of ‘expedition’ (or a mechanical MOOC) with some tips on what to do and how to react
  • List of tools for online-collaboration (and invitation to contribute participants’ own ideas)
  • Invitation for the participants to introduce themselves (several possible key-points of introduction were suggested)

By the beginning of the expedition the participants knew each other’s names and how to address each other; they also knew each other’s area of expertise. Moreover, they started communicating before the expedition officially began.

2. Some figures:

  • 20 people joined the expedition group
  • 10 people filled in the preliminary survey
  • 6 people actively communicated at the forum during the whole expedition
  • 7 people filled in the final survey

3. During the whole period of the Expedition, including the unofficial preliminary/introductory part (which began on 17 July), 124 messages were sent via the forum. Of course there were instances of bilateral communication, but we couldn’t register them for obvious reasons. Here’s the distribution of the forum activity (with the peak in the middle of expedition).

activity_distribution

4. The atmosphere of the communication was friendly and relaxed. The participants actively discussed each other’s initiatives and provided encouraging feedback.

Here are some facts about the participants (based on the preliminary survey):

data_exp_charts_eng

Figures can be found here.

Conclusions

Success

  • People got interested in the project and willingly joined
  • The participants demonstrated a friendly and considerate attitude to each other
  • The core of the group (6 people) were active during the whole period of the expedition
  • All the respondents in the final survey expressed their intention to participate in following expeditions
  • Most of the respondents in the final survey expressed their intention to complete the projects they started in the course of expedition, but failed to finish due to the lack of time

Problems

  • Obviously, the output of the project wasn’t confined to the declared objective. On the one hand, it’s natural: people are free to learn what they want to learn. On the other hand, the absence of a more precise schedule made some participants feel uncomfortable. From this we conclude that a more focused approach is needed.
  • All the respondents in the final survey said they didn’t have enough time to complete what they wanted. At the same time, most of them admitted that the terms were adequate to the task.
  • Most of the respondents felt some discomfort because of the lack of a coordinator or instructor and also said they didn’t always understand what they should do and how.

In the final survey, we asked how we could make the process more efficient and here’s the summary of the ideas:

  • More precise schedule of our activity would be good (like breaking the whole expedition period into specific phases)
  • Coordinator is needed
  • Some instruments for encouraging shy and unconfident participants would be helpful
  • It might be better if the output of the whole project is formulated more precisely
  • The topic should probably be narrower
  • Longer expedition terms would make it easier for self-organisation

Prospects

We are totally going to continue our experiments with this format. In the future, we are going to try something like:

  • One-day intensive online expeditions with fixed roles and distributed responsibilities
  • Long term (several weeks’ long) expeditions with a coordinator (constant or elected for a certain term)
  • Workshop-expeditions: online massive projects led by a volunteer instructor or mentor willing to share their skills
  • Expeditions based on the Data-MOOC scheme (with pre-planned tasks)

We are also going to develop a method to register participants’ achievements, even small ones, in order to encourage further efforts.

Also, we feel the need to create a way to proudly present major achievements. Here we should consider the experience of creating badge systems.

Well, that’s it for now. It was really cool! And there’s quite a bit of work ahead too!

The Russian version of this account can be found here.

Code sharing options

As I proceed with Python MOOC, I have had to choose a way to share my code with my peers. There are in fact many options. Here are some of them:

GitHub 

This was recommended by the MOOC instructions. It is a multifunctional platform that allows you to create repositories, gists and forks, publish privately or openly, download code and leave comments. What I also like about it is that you can follow users. For sharing homework, gists might be the best option.

github

There are two shortcomings though:

  • You can’t publish your code without registration
  • Some users complain that the interface is a bit too complicated, so it takes time to get used to it

Pastebin

This is an extremely easy to use sharing tool. Actually, what you first see at the main page is a box where you can paste your code. You don’t have to register to do it (so you simply have a link that you can later share). You can also set the expiration time for each publication (from 10 minutes to never). And you can make it public, unlisted or private (for members only). You can also register if you like (I did to keep my homework in order).

Pastebin

Shortcomings:

  • I haven’t seen any commenting option, which might be good for feedback and revision while learning
  • I also couldn’t find any option to follow other members.

DPaste 

This is a very minimalistic service. You can’t register, you only can paste your code and save it. After you do it, it will stay there for 30 days and then it’ll be automatically deleted. So it’s good for quick sharing purposes, but not for continuous and systematic use.

dpaste

Also I recently found Bitbucket 

But I haven’t explored it yet. If anyone has some experience, please share. There are some explanations as to how to use it though: Bitbucket 101.

As for me, I’m currently using GitHub and Pastebin, because GitHub looks like a wonderful working space and Pastebin is good for sharing with those who are scared of GitHub: