Briefly

1. I’m still alive.

2. I keep working as a journalist. Recently I’ve actually tried applying my newly acquired skills to my real job. There is still much to work on, but I seem to have learnt at least something. In the first case I tried to work with some data on the death penalty in the US; in the second case, I was visualising some aspects (namely, kidnapping) of the Global Terrorism Database. Both materials are in Russian, of course. Moreover, the website does not currently allow embedding interactive visualisations, so there are only screenshots, while the original interactive versions are published on my Blogger account (but again, in Russian). Speaking of the Global Terrorism Database, there’s a whole course at Coursera based on this project. I don’t know much about it, so I can’t recommend it, but I’ll definitely have a look as soon as I can.

3. I keep tracking the developments in the activity of Open Data School in Moscow. It’s an interesting project both as an educational initiative and as part of promoting openness. More on it later, as well as on DLMOOC, by the way, which is fascinating (sadly, I’ve been virtually unable to participate full-scale).

4. Meanwhile, I’m trying to keep up with Linear Algebra: Foundations to Frontiers and Statistical Learning.

5. Right now I’m in the middle of running yet another Russian-language data expedition (DE3), which began on 20 February. This one is a bit different from DE1 and DE2. First, this time we (Irina Radchenko and myself from Datadrivenjournalism.ru) are working in partnership with Aleksey Sidorenko from the NGO “Теплица социальных технологий” (Teplitsa/Greenhouse of Social Technologies). It is also the first time that we have taken a socially meaningful subject, which is orphan diseases. DE3 is going to finish on 5 March. Soon after, I’ll be able to tell more about it, as well as about its findings (we’re digging into the data on the situation in Russia in the first place). By the way, it would also be great to get some feedback from people in other countries who are aware of the local situation (Jakes?).

6. Last, but not least, I’m currently involved (unfortunately in a rather hibernating way at the moment) in developing an international project on national informational resources. It all started with Team 10, but it’s going to grow. More on it later.

Is it Christmas already?

It’s been quite an intensive period recently. First, I was taking two parallel courses at Coursera – on data analysis and on statistics. Second, Irina Radchenko and I were preparing to launch a new Russian-language data expedition under our Datadrivenjournalism.ru project, and then we coordinated it for two weeks (9 – 23 December). Third, I suddenly had a huge task at work with a really tough deadline, which ruined my plans a bit, but thankfully not all of them. So here’s a brief account of how it all turned out:

I had to drop the data analysis course after its sixth week. Due to that sudden workload I couldn’t afford to do the second assignment, which was somewhat upsetting. But on the other hand, I think I’ll be able to do it later, either on my own or within the next iteration of the course (I’m almost sure it’s going to be launched again soon). Anyway, I’m glad I’ve done at least something, because it turned out to be rather helpful, especially in terms of structuring things and my mind. And yes, the previous course, Computing for Data Analysis (on R), was extremely helpful. (For those who might be interested: the next iteration of this course starts on 6 January 2014.)

On the other hand, I triumphantly completed the Statistics One course, and that’s really cool. There are contradictory reviews of this course online. Some of them claim that the course is inconsistent in terms of difficulty: sometimes too easy and even boring, sometimes too complicated. Well, after completing it, I can’t say that I’ve digested all the material provided. But now I have a better vision of what statistics is like and how it approaches data. I can also apply some data analysis techniques with the help of R, but I wouldn’t claim I completely understand the mechanisms underlying some of these operations. Next I’m going to focus on OpenIntro Statistics, which is a great textbook, and revise the material in order to pack it into my head. To wrap up this segment, I’ll add that the material provided within that course by the middle of the semester was enough to complete assignment one in the Data Analysis course.

As to the data expedition, it was happily completed yesterday. Its organisation was considerably different from the previous experience and demanded quite a bit of advance preparation, apart from the participation itself. Although I couldn’t participate in it as thoroughly as I would have liked, I have to admit that the result somewhat exceeded my expectations. I’ll write about it in greater detail after I analyse the whole picture. For now I can say that the timing was horrible. So the lesson is: never launch learning projects right before Christmas or the New Year. Nonetheless, there are some very inspiring results, and the participants were simply great.

Also, here are some links as usual:

And merry Christmas to everyone who celebrates it now!

First completed course at Coursera

A week ago, I completed Computing for Data Analysis by Prof. Roger Peng at Coursera. This course was described as an introduction to the R language. Well, this might have been somewhat confusing, because it was indeed an introductory course for those who were totally new to R, but not for those who were total newbies in programming in general, which wasn’t directly mentioned in the course description. Judging by numerous complaints on the course discussion forum, some people were really having a hard time trying to figure out where to start with no programming experience whatsoever.

On the other hand, even a very distant familiarity with programming basics in Python made things a bit more tolerable for me than they would have been had I never seen things like an IDE or a for-loop before. So for me the course was rather challenging and even frustrating at times, but to my huge surprise I was able to complete the assignments. This doesn’t mean, of course, that I have perfectly understood, digested and mastered all the material provided. But after the course I really feel much more confident in the R environment. What is even more important, the course helped me to map my skills, so now I know what I need to learn better, where and how I can look for help, and which spots in my knowledge I can rely on. All in all, I’m glad I took this course. Thanks to Dr. Peng and his wonderful teaching assistants, who did a huge amount of work re-explaining the course material so that even total newbies could keep up.

By the way, I think the course is still available as an archive on Coursera. Its video lectures are also available on YouTube.

Also, I must admit, I have developed Stockholm Syndrome, that is, begun to like R.

And I’ve spent almost two notebooks on it, because I really feel more confident when I make notes on the way.

Now, as a follow-up, I played a bit with the dataset that was used for our last assignment, which focused on regular expressions. We worked with the homicide data from the Baltimore Sun site, which provides an interactive application for navigating these data but doesn’t offer them in a downloadable format. So Dr. Peng simply copied them from the page source and pasted them into a text file. Here it is.

For our assignment we had to write two functions. One had to count the number of victims given the cause of death. The other had to count the number of victims of a given age.
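To give a rough idea of what such counting functions involve, here is a minimal sketch in Python (the course itself used R, and this is not the assignment code). The record layout in `RECORDS` and the function names are assumptions made up for illustration; the real text file may be structured differently.

```python
import re

# Hypothetical records: the layout is an assumption, not the real file format.
RECORDS = [
    "<dd>black male, 17 years old</dd><dd>Cause: shooting</dd>",
    "<dd>white female, 41 years old</dd><dd>Cause: Asphyxiation</dd>",
    "<dd>black male, 25 years old</dd><dd>Cause: Shooting</dd>",
    "<dd>black male, 17 years old</dd><dd>Cause: stabbing</dd>",
]


def count_by_cause(records, cause):
    """Count records whose 'Cause:' field matches the given cause (case-insensitive)."""
    pattern = re.compile(r"Cause:\s*" + re.escape(cause) + r"\b", re.IGNORECASE)
    return sum(1 for record in records if pattern.search(record))


def count_by_age(records, age):
    """Count records that mention a victim of exactly the given age."""
    # The leading \b keeps age 7 from matching inside "17 years old".
    pattern = re.compile(r"\b" + str(int(age)) + r" years? old\b")
    return sum(1 for record in records if pattern.search(record))


if __name__ == "__main__":
    print(count_by_cause(RECORDS, "shooting"))
    print(count_by_age(RECORDS, 17))
```

The word boundaries matter more than they seem: without them, a search for one age or cause can silently match a longer one.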

I wanted to find out whether there are any preferred ways of murder depending on the victim’s gender. I also wanted to visualise my results. To this end, I first wrote a function that sorted victims by gender for a given cause and returned the result as a data frame. Then I wrote another function that joined the outputs of the first one into a general data frame for all the causes present in the dataset. I realize my code is not exactly neat and nice, but I’m glad that at least it works.
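The two-step idea described above (count by gender for each cause, then combine everything into one table) could look roughly like this in Python. This is only a sketch under assumptions, not my R code: the record layout, the regular expressions and the names `gender_by_cause` and `as_rows` are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical records: the layout is an assumption, not the real file format.
RECORDS = [
    "<dd>black male, 17 years old</dd><dd>Cause: shooting</dd>",
    "<dd>white female, 41 years old</dd><dd>Cause: asphyxiation</dd>",
    "<dd>black male, 25 years old</dd><dd>Cause: Shooting</dd>",
    "<dd>black female, 30 years old</dd><dd>Cause: shooting</dd>",
]

GENDER_RE = re.compile(r"\b(male|female)\b", re.IGNORECASE)
CAUSE_RE = re.compile(r"Cause:\s*(\w+)", re.IGNORECASE)


def gender_by_cause(records):
    """Count (cause, gender) pairs; records missing either field are skipped."""
    counts = Counter()
    for record in records:
        gender = GENDER_RE.search(record)
        cause = CAUSE_RE.search(record)
        if gender and cause:
            counts[(cause.group(1).lower(), gender.group(1).lower())] += 1
    return counts


def as_rows(counts):
    """Turn the pair counts into one row per cause, similar to a small data frame."""
    causes = sorted({cause for cause, _ in counts})
    return [
        {"cause": c, "male": counts[(c, "male")], "female": counts[(c, "female")]}
        for c in causes
    ]


if __name__ == "__main__":
    for row in as_rows(gender_by_cause(RECORDS)):
        print(row)
```

Rows in this shape are easy to feed into a bar chart, whether grouped or stacked.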

And well, I actually found out that the most common cause of violent death in Baltimore in the period from 2007 to 2012 was shooting, and that in 1126 of the 1245 observations the victim was male. It looks like this:

[Figure: bar chart of homicide victims by gender]

Also, the only category in which female victims prevail is asphyxiation. So speaking about preferences in killing tools given gender, this chart might be more instructive.

[Figure: stacked bar chart of victims by cause of death and gender]

Well, for more sophisticated data analysis I’ve yet to learn loads of Statistics. By the way, as to Statistics, I’m still taking Statistics One by Prof. Andrew Conway at Coursera. Although it seemed a bit boring at the beginning, now it’s getting more and more interesting.

Also I have completed the Python course at Codecademy. And immediately started a course in JavaScript. Because I like Codecademy. And because I don’t have enough time right now to focus on learning API with Python there. Never mind that I’m currently doing Introduction to Interactive Programming in Python at Coursera. I promise, I’ll quit it, as soon as it becomes too challenging to be combined with Statistics and Data Analysis, which starts on October 28th.

All this stuff is supposed to be completed by January. I must say, I now feel a very strong urge to get down to something a bit more fundamental, like maths and computer science basics.

It’s not easy being me

Just to complain a bit. But also probably somebody will be able to make more use of it than me.

By the end of summer I had a perfectly minimalistic learning plan for the autumn: R and Statistics. Isn’t it sweet? Just that and nothing else, at least till December. Well, and a tiny bit of Python (Codecademy) in the background.

And here are some of the courses that turned up out of the blue right as soon as I started implementing my Perfect Plan.

  • Learning from Data at edX, began on September 30
  • Social Network Analysis at Coursera, begins on October 7
  • An iteration of Introduction to Interactive Programming in Python at Coursera. A course I failed to finish last spring because I got enrolled in the School of Data Mission. Begins on October 7
  • The Future of Storytelling at Iversity (a new German MOOC platform, as far as I understand). Not sure it’s worth watching, but it might be worth having a look at. Begins on October 25. UPD. Has begun. No, definitely not worth spending time on. I don’t mean the course is bad (I don’t know), but it’s not what I think I need now.
  • Data Analysis at Coursera, begins on October 28

Not to mention the upcoming (not sure when exactly, but this autumn) new iteration of School of Data MOOC (Data Explorer Mission).

Feel like Horrid Henry.

[Image: Horrid Henry]

UPD. A new iteration of Python Mechanical MOOC is starting on October 21. Bingo!

Links Links Links

A new bunch of links to resources on statistics etc. that seem helpful to me:

Introduction to Statistics

This is an archive of an introductory statistics course at Coursera, Statistics: Making Sense of Data, by Alison Gibbs and Jeffrey Rosenthal (University of Toronto).

The authors of the course kindly provided a list of recommended literature. I don’t think it would be a crime to reproduce it here. So, they recommended three ‘traditional books’:

  • Introduction to the Practice of Statistics, by David S. Moore and George P. McCabe. (The book is currently in its fifth edition, but any edition will do.)
  • Stats: Data and Models, Canadian edition, by Richard D. De Veaux, Paul F. Velleman, David E. Bock, Augustin M. Vukov, and Augustine C.M. Wong. (The original version of the book, by the first three authors only, is also recommended.)
  • Statistics, by David Freedman, Robert Pisani, and Roger Purves.

And four online resources:

  • OpenIntro Statistics, by David M. Diez, Christopher D. Barr, and Mine Cetinkaya-Rundel. The cool thing about this one is that it’s not just a book, it’s a whole learning tool including labs and some instructions on using R.
  • Online Statistics Education, by David M. Lane, David Scott, Mikki Hebl, Rudy Guerra, Dan Osherson, and Heidi Zimmer
  • HyperStat Online, by David M. Lane
  • StatPrimer, by B. Burt Gerstman

R

Statistics and Python

And last, a couple of books kindly recommended by a great person at P2PU. These connect statistics to programming in Python:

Back to learning. Statistics, R

OK, here I am, back after a gap of over a month: two weeks of holidays, and the rest a huge lump of work, including tasks for my job as well as some work at Moscow Open Data School. But now I hope I’ll be able to afford to spend some time on just learning.

Unfortunately, I couldn’t finish my Python MOOC, because of that sudden workload again. But I’m totally going to get back to it as soon as I can. Following the recommendation of Zach Sims (Codecademy), I’m simply trying to gradually do the tasks at Codecademy to refresh things in my mind and to keep digesting Python.

Right now though I’m focused on the Statistics course that has just begun at Coursera (by the way, those who are interested are welcome to join). I wonder how helpful it’s going to be, but there’s one thing I know for sure: I’ve got to learn how to process data in R. And well, the R course is actually integrated into this one, which is great.

While working on the first assignment, which was actually a very simple drill exercise to memorize some R commands, I faced one problem: I couldn’t install and load a package in R (MS Windows 7) because of some trouble with administrator access to some saving functions (although I’m obviously the administrator). Or, to be more precise, R did download the package, but it refused to save it in the R directory. As far as I know, some students in the same course also had trouble at this point, but their problems were different. In my case the solution was very simple. I just manually moved the package from where it was saved by default to where I needed it (namely, the R library). There’s also a way to install packages from a .zip file downloaded manually from CRAN through the menu (Packages > Install package(s) from local zip files). Well, at this stage this works perfectly well for me.

And here are some helpful links:

First Data Expedition in Russian: Mission Complete

It’s been a while since I last posted here and there are actually two reasons for that. First, I’ve got a really heavy workload and it’s going to remain so for a while. What is most upsetting, I haven’t got enough time for doing the Python course, but I’m certainly going to make up for it as soon as possible. Second, we were busy organising and then participating in the first Russian-language experimental data-expedition, or data-MOOC. And this is the experience I want to share here as well, because it was extremely inspiring and rather instructive. Besides, it’s about p2p-learning, which is one of the subjects of this blog.

While writing this account I was using the model provided in the account of the School of Data/P2PU’s MOOC.

Now, some overview

  • This project was inspired by participating in Data-MOOC organised by School of Data and P2PU in April-May 2013. Also, I must say that the blog of the Python MOOC has been really helpful and instructive.
  • The project was based on p2p-learning principles and a mechanical MOOC model. For the sake of brevity and attractiveness, we used the term ‘data-expedition’ (экспедиция данных, дата-экспедиция) to describe it.
  • It was a week long: from July 22 to July 28.
  • Its declared objective was to learn how to look for datasets online. To focus the task, we suggested a topic, which was collecting data about universities all over the world. So, unlike the School of Data’s Data-MOOC, it wasn’t supposed to reproduce the complete data-processing cycle, but rather to perform its first stage.
  • The project was organised by Irina Radchenko and myself as part of our larger informal project Datadrivenjournalism.ru. Within the Expedition, we acted both as the support team and as participants.
  • The goal of this Expedition was twofold. First, we wanted to see if this format works in the local environment. Second, well, I personally wanted to learn more about how to search for data.
  • We announced the upcoming data-expedition ten days before the start, and by the beginning 20 people had signed up to participate, which was actually more than we expected.
  • Participation was absolutely free and open and no special skills were required.
  • The participants’ main communication platform was a Google group set up as a forum (with the option of turning on email delivery).
  • Our main collaboration tool was Google Docs.
  • This expedition relied heavily on collaboration and p2p initiative. It had no prescribed plan or step-by-step tasks, apart from the initially formulated one. So the organisational messages were first and foremost aimed at facilitating people’s communication and introducing them to the specifics of the format.

Results

As expected, they are twofold.

Participants’ results:

  • 3 visualisations
  • 1 data-scraping tutorial for beginners
  • A collective Google Doc with a list of sources

Organisational results:

  • 2 surveys (preliminary and final)
  • The participants’ exchange documented on the Google group’s forum
  • The set of collective Google Docs

As to the participants’ results, here are some links:

But in this post, I’ll focus on some highlights of the process.

1. Speaking of the organisation, our main target was to help people get involved in cooperation and to boost activity. To this end, we started introducing people to the format a few days before the expedition began. Judging by previous experience, the lack of confidence and the uncertainty about where to start and what to do are among the barriers to be overcome. In order to facilitate cooperation, we published a series of organisational messages, one after another:

  • Introduction to the objective of the expedition (explaining why searching for data is an important skill)
  • Introduction to the format of ‘expedition’ (or a mechanical MOOC) with some tips on what to do and how to react
  • List of tools for online-collaboration (and invitation to contribute participants’ own ideas)
  • Invitation for the participants to introduce themselves (several possible key-points of introduction were suggested)

By the beginning of the expedition the participants knew each other’s names and how to address each other; they also knew each other’s area of expertise. Moreover, they started communicating before the expedition officially began.

2. Some figures:

  • 20 people joined the expedition group
  • 10 people filled in the preliminary survey
  • 6 people actively communicated on the forum during the whole expedition
  • 7 people filled in the final survey

3. During the whole period of the Expedition, including the unofficial preliminary/introductory part (which began on 17 July), 124 messages were sent via the forum. Of course there were instances of bilateral communication, but we couldn’t register them for obvious reasons. Here’s the distribution of the forum activity (with the peak in the middle of the expedition).

[Figure: distribution of forum activity during the expedition]

4. The atmosphere of the communication was friendly and relaxed. The participants actively discussed each other’s initiatives and provided encouraging feedback.
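The participation figures in point 2 are easier to compare as shares of the 20 people who joined. Here is a minimal Python sketch of that arithmetic; the dictionary labels and the function name `shares_of_base` are just illustrative choices, and only the four counts come from the list above.

```python
# Counts taken from the figures in point 2; the labels are paraphrased.
FIGURES = {
    "joined the group": 20,
    "filled in the preliminary survey": 10,
    "active on the forum throughout": 6,
    "filled in the final survey": 7,
}


def shares_of_base(figures, base_key):
    """Express every count as a percentage of the count stored under base_key."""
    base = figures[base_key]
    return {label: round(100 * count / base, 1) for label, count in figures.items()}


if __name__ == "__main__":
    for label, pct in shares_of_base(FIGURES, "joined the group").items():
        print(f"{label}: {pct}%")
```

So half of those who joined answered the preliminary survey, about a third stayed active throughout, and slightly more than a third answered the final survey.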

Here are some facts about the participants (based on the preliminary survey):

[Figure: charts describing the participants, based on the preliminary survey]

Figures can be found here.

Conclusions

Success

  • People got interested in the project and willingly joined
  • The participants demonstrated a friendly and considerate attitude to each other
  • The core of the group (6 people) was active during the whole period of the expedition
  • All the respondents in the final survey expressed their intention to participate in future expeditions
  • Most of the respondents in the final survey expressed their intention to complete the projects they had started in the course of the expedition but failed to finish due to lack of time

Problems

  • Obviously, the output of the project wasn’t confined to the declared objective. On the one hand, it’s natural: people are free to learn what they want to learn. On the other hand, the absence of a more precise schedule made some participants feel uncomfortable. From this we conclude that a more focused approach is needed.
  • All the respondents in the final survey said they didn’t have enough time to complete what they wanted. At the same time, most of them admitted that the time frame was adequate to the task.
  • Most of the respondents felt some discomfort because of the lack of a coordinator or instructor, and also said they didn’t always understand what they should do and how.

In the final survey, we asked how we could make the process more efficient and here’s the summary of the ideas:

  • A more precise schedule of our activity would be good (like breaking the whole expedition period into specific phases)
  • A coordinator is needed
  • Some instruments for encouraging shy and unconfident participants would be helpful
  • It might be better if the expected output of the whole project were formulated more precisely
  • The topic should probably be narrower
  • A longer expedition would make self-organisation easier

Prospects

We are totally going to continue our experiments with this format. In the future, we are going to try something like:

  • One-day intensive online expeditions with fixed roles and distributed responsibilities
  • Long-term (several weeks long) expeditions with a coordinator (permanent or elected for a certain term)
  • Workshop-expeditions: massive online projects led by a volunteer instructor or mentor willing to share their skills
  • Expeditions based on the Data-MOOC scheme (with pre-planned tasks)

We are also going to develop a method to register participants’ achievements, even small ones, in order to encourage further efforts.

Also, we feel the need to create a way to proudly present major achievements. Here we should consider the experience of creating badge systems.

Well, that’s it for now. It was really cool! And there’s quite a bit of work ahead too!

The Russian version of this account can be found here.