First completed course at Coursera

A week ago, I completed Computing for Data Analysis by Prof. Roger Peng at Coursera. The course was described as an introduction to the R language, which might have been somewhat misleading: it was indeed introductory for those who were new to R, but not for those who were total newbies in programming in general, and this wasn’t actually mentioned in the course description. Judging by numerous complaints on the course’s discussion forum, some people really were having a hard time figuring out where to start with no programming experience whatsoever.

On the other hand, even a very distant familiarity with programming basics in Python made things a bit more tolerable for me than they would have been had I never seen things like an IDE or a for-loop before. So for me the course was rather challenging and even frustrating at times, but to my huge surprise I was able to complete the assignments. This doesn’t mean, of course, that I have perfectly understood, digested and mastered all the material. But after the course I really feel much more confident in the R environment. What is even more important, the course helped me to map my skills, so now I know what I need to learn better, where and how I can look for help, and which spots in my knowledge I can rely on. All in all, I’m glad I took this course. Thanks to Dr. Peng and his wonderful teaching assistants, who did a huge amount of work retelling the course material so that even total newbies could keep up.

By the way, I think the course is still available as an archive at Coursera. Its video lectures are also available on YouTube.

Also, I must admit, I have developed Stockholm Syndrome and begun to like R.

And I’ve spent almost two notebooks on it, because I really feel more confident when I take notes along the way.

[Photo: my notebooks]

Now, as a follow-up, I played a bit with the dataset we used for our last assignment, which focused on regular expressions. We worked with homicide data from the Baltimore Sun’s site, which provides an interactive application to navigate the data but doesn’t offer them in a downloadable format. So Dr. Peng simply copied them from the page source and pasted them into a text file. Here it is.

For our assignment we had to write two functions. One had to count the number of victims given a cause of death; the other had to count the number of victims of a given age.
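The assignment itself was in R, but since I’ve been practising Python too, here is roughly the same pair of counting functions sketched in Python. The dict-based records and field names are my own simplification for illustration; the real file had to be parsed with regular expressions first.

```python
# A simplified sketch of the two assignment functions (the real ones were in R).
# Each record is assumed to be a dict with 'cause' and 'age' keys.

def count_by_cause(records, cause):
    """Count victims whose cause of death matches `cause` (case-insensitive)."""
    return sum(1 for r in records if r["cause"].lower() == cause.lower())

def count_by_age(records, age):
    """Count victims of a given age."""
    return sum(1 for r in records if r["age"] == age)

# Tiny made-up sample, just to show the calls
sample = [
    {"cause": "shooting", "age": 25},
    {"cause": "Shooting", "age": 31},
    {"cause": "asphyxiation", "age": 25},
]
print(count_by_cause(sample, "shooting"))  # 2
print(count_by_age(sample, 25))            # 2
```

The case-insensitive match matters because, as in the real file, the same cause can appear with different capitalisation.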

I wanted to find out whether there were any preferred methods of murder given a gender, and I also wanted to visualise my results. To this end, I first wrote a function that counted victims by gender for a given cause and returned the result as a data frame. Then I wrote another function that joined the output of the first one into a general data frame for all the causes present in the dataset. I realise my code is not exactly neat and nice, but I’m glad that at least it works.
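My version used R data frames; purely as an illustration, the same grouping idea can be sketched in Python (which I’m also learning) with a Counter. Again, the field names and sample data are assumptions of mine, not the real dataset’s layout.

```python
from collections import Counter

def victims_by_cause_and_gender(records):
    """Tally victims for every (cause, gender) pair found in the data."""
    counts = Counter((r["cause"], r["gender"]) for r in records)
    # Reshape into {cause: {gender: count}} -- one "row" per cause
    table = {}
    for (cause, gender), n in counts.items():
        table.setdefault(cause, {})[gender] = n
    return table

# Made-up sample records, just to show the shape of the result
sample = [
    {"cause": "shooting", "gender": "male"},
    {"cause": "shooting", "gender": "male"},
    {"cause": "asphyxiation", "gender": "female"},
]
print(victims_by_cause_and_gender(sample))
# {'shooting': {'male': 2}, 'asphyxiation': {'female': 1}}
```

A nested dict like this is the closest plain-Python analogue to the general data frame my second R function produced.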

And, well, I actually found out that the most common cause of violent death in Baltimore in the period from 2007 to 2012 was shooting, and that in 1126 of the 1245 observations the victims were male, so it looks like this:

[Figure: bar chart of victims by gender]

Also, the only category in which female victims prevail is asphyxiation. So, speaking about preferred killing methods given gender, this chart might be more instructive.

[Figure: stacked bar chart by cause]

Well, for more sophisticated data analysis I’ve yet to learn loads of statistics. Speaking of which, I’m still taking Statistics One by Prof. Andrew Conway at Coursera. Although it seemed a bit boring at the beginning, it’s now getting more and more interesting.

Also, I have completed the Python course at Codecademy. And immediately started a course in JavaScript. Because I like Codecademy. And because I don’t have enough time right now to focus on learning APIs with Python there. Never mind that I’m currently doing Introduction to Interactive Programming in Python at Coursera. I promise I’ll quit it as soon as it becomes too challenging to combine with Statistics and Data Analysis, which starts on October 28.

All this stuff is supposed to be completed by January. I must say, I now feel the strongest urge to get down to something a bit more fundamental, like maths and computer science basics.


It’s not easy being me

Just to complain a bit. But probably somebody will be able to make more use of it than I did.

By the end of summer I had a perfectly minimalistic learning plan for the autumn: R and Statistics. Isn’t it sweet? Just that and nothing else, at least till December. Well, and a tiny bit of Python (Codecademy) in the background.

And here are some of the courses that turned up out of the blue as soon as I started implementing my Perfect Plan.

  • Learning from Data at edX, began on September 30
  • Social Network Analysis at Coursera, begins on October 7
  • An iteration of Introduction to Interactive Programming in Python at Coursera, a course I failed to finish last spring because I got enrolled in the School of Data Mission. Begins on October 7
  • The Future of Storytelling at Iversity (a new MOOC platform, as far as I understand). Not sure it’s worth watching, but might be worth having a look at. Begins on October 25. UPD. Has begun. No, definitely not worth wasting time on. I don’t mean the course is bad – I don’t know. But it’s not what I think I need now.
  • Data Analysis at Coursera, begins on October 28

Not to mention the upcoming (not sure when exactly, but this autumn) new iteration of School of Data MOOC (Data Explorer Mission).

Feel like Horrid Henry.

[Image: Horrid Henry]

UPD. A new iteration of Python Mechanical MOOC is starting on October 21. Bingo!

Links Links Links

A new bunch of links to resources on statistics etc. that seem helpful to me:

Introduction to Statistics

This is an archive of an introductory statistics course at Coursera, Statistics: Making Sense of Data, by Alison Gibbs and Jeffrey Rosenthal (University of Toronto).

The authors of the course kindly provided a list of recommended literature. I don’t think it would be a crime to reproduce it here. So, they recommended three ‘traditional books’:

  • Introduction to the Practice of Statistics, by David S. Moore and George P. McCabe. (The book is currently in its fifth edition, but any edition will do.)
  • Stats: Data and Models, Canadian edition, by Richard D. De Veaux, Paul F. Velleman, David E. Bock, Augustin M. Vukov, and Augustine C.M. Wong. (The original version of the book, by the first three authors only, is also recommended.)
  • Statistics, by David Freedman, Robert Pisani, and Roger Purves.

And three online resources:

  • OpenIntro Statistics, by David M. Diez, Christopher D. Barr, and Mine Cetinkaya-Rundel. The cool thing about this one is that it’s not just a book, it’s a whole learning tool including labs and some instructions on using R.
  • Online Statistics Education, by David M. Lane, David Scott, Mikki Hebl, Rudy Guerra, Dan Osherson, and Heidi Zimmer
  • HyperStat Online, by David M. Lane
  • StatPrimer, by B. Burt Gerstman

R

Statistics and Python

And last, a couple of books kindly recommended by a great person at P2PU. These connect statistics to programming in Python:

Back to learning. Statistics, R

OK, here I am, back from my gap of over a month, two weeks of which were holidays and the rest a huge lump of work, including tasks for my job as well as some work at the Moscow Open Data School. But now I hope I’ll be able to afford to spend some time on just learning.

Unfortunately, I couldn’t finish my Python MOOC, because of that sudden workload again. But I’m totally going to get back to it as soon as I can. Following Zach Sims’ (Codecademy) recommendation, I’m simply trying to gradually do the tasks at Codecademy to refresh stuff in my mind and keep digesting Python.

Right now though I’m focused on the Statistics course that has just begun at Coursera (by the way, those who are interested are welcome to join). I wonder how helpful it’s going to be, but there’s one thing I know for sure: I’ve got to learn how to process data in R. And well, the R course is actually integrated into this one, which is great.

While working on the first assignment, which was actually a very simple drill to memorise some R commands, I faced one problem: I couldn’t install and load a package in R (on MS Windows 7) because of trouble with administrator access to some saving functions (although I’m obviously the administrator). Or, better to say, R did download the package, but it would refuse to save it in the R directory. As far as I know, some students in the same course also had trouble at this point, though theirs was different. In my case the solution was very simple: I just manually moved the package from where it was saved by default to where I needed it (namely, into the R library). There’s also a way to install packages from a .zip file manually downloaded from CRAN, through the menu (Packages > Install package(s) from local zip files). Well, at this stage this works perfectly well for me.

And here are some helpful links:

First Data Expedition in Russian: Mission Complete

It’s been a while since I last posted here and there are actually two reasons for that. First, I’ve got a really heavy workload and it’s going to remain so for a while. What is most upsetting, I haven’t got enough time for doing the Python course, but I’m certainly going to make up for it as soon as possible. Second, we were busy organising and then participating in the first Russian-language experimental data-expedition, or data-MOOC. And this is the experience I want to share here as well, because it was extremely inspiring and rather instructive. Besides, it’s about p2p-learning, which is one of the subjects of this blog.

While writing this account I was using the model provided in the account of the School of Data/P2PU’s MOOC.

Now, some overview

  • This project was inspired by participating in Data-MOOC organised by School of Data and P2PU in April-May 2013. Also, I must say that the blog of the Python MOOC has been really helpful and instructive.
  • The project was based on p2p-learning principles and a mechanical MOOC model. For the sake of brevity and attractiveness, we used the term ‘data-expedition’ (экспедиция данных, дата-экспедиция) to describe it.
  • It was a week long: from July 22 to July 28.
  • Its declared objective was to learn how to look for datasets online. To focus the task, we suggested a topic, which was collecting data about universities all over the world. So, unlike the School of Data’s Data-MOOC, it wasn’t supposed to reproduce the complete data-processing cycle, but rather to perform its first stage.
  • The project was organised by Irina Radchenko and myself as part of our larger informal project Datadrivenjournalism.ru. Within the Expedition, we acted both as the support team and as participants.
  • The goal of this Expedition was twofold. First, we wanted to see if this format works in the local environment. Second, well, I personally wanted to learn more about how to search for data.
  • We announced the upcoming data-expedition ten days before the start, and by the beginning 20 people had signed up, which was actually more than we expected.
  • Participation was absolutely free and open and no special skills were required.
  • The participants’ main communication platform was a Google group set up as a forum (with the option to turn on email delivery).
  • Our main collaboration tool was Google Docs.
  • This expedition relied heavily on collaboration and p2p initiative. It had no prescribed plan or step-by-step tasks apart from the initially formulated one, so the organisational messages were first and foremost aimed at facilitating people’s communication and introducing them to the specifics of the format.

Results

As expected, they are twofold.

Participants’ results:

  • 3 visualisations
  • 1 data-scraping tutorial for beginners
  • A collective Google Doc with a list of sources

Organisational results:

  • 2 surveys (preliminary and final)
  • The participants’ exchange documented on the Google group’s forum
  • The set of collective Google Docs

As to the participants’ results, here are some links:

But in this post, I’ll focus on some highlights of the process.

1. Speaking of the organisation, our main aim was to help people get involved in cooperation and boost activity. To this end, we started introducing people to the format a few days before the expedition began. Judging by previous experience, lack of confidence and uncertainty about where to start and what to do are among the barriers to be overcome. In order to facilitate cooperation, we successively published a number of organisational messages:

  • Introduction to the objective of the expedition (explaining why searching for data is an important skill)
  • Introduction to the format of ‘expedition’ (or a mechanical MOOC) with some tips on what to do and how to react
  • List of tools for online collaboration (and an invitation for participants to contribute their own ideas)
  • Invitation for the participants to introduce themselves (several possible key-points of introduction were suggested)

By the beginning of the expedition the participants knew each other’s names and how to address each other; they also knew each other’s area of expertise. Moreover, they started communicating before the expedition officially began.

2. Some figures:

  • 20 people joined the expedition group
  • 10 people filled in the preliminary survey
  • 6 people actively communicated on the forum during the whole expedition
  • 7 people filled in the final survey

3. During the whole period of the Expedition, including the unofficial preliminary/introductory part (which began on 17 July), 124 messages were sent via the forum. Of course there were instances of bilateral communication, but we couldn’t register them for obvious reasons. Here’s the distribution of the forum activity (with the peak in the middle of the expedition).

[Figure: forum activity distribution]

4. The atmosphere of the communication was friendly and relaxed. The participants actively discussed each other’s initiatives and provided encouraging feedback.

Here are some facts about the participants (based on the preliminary survey):

[Figure: participant survey charts]

Figures can be found here.

Conclusions

Success

  • People got interested in the project and willingly joined
  • The participants demonstrated a friendly and considerate attitude towards each other
  • The core of the group (6 people) was active during the whole period of the expedition
  • All the respondents in the final survey expressed their intention to participate in future expeditions
  • Most of the respondents in the final survey expressed their intention to complete the projects they had started in the course of the expedition but failed to finish due to lack of time

Problems

  • Obviously, the output of the project wasn’t confined to the declared objective. On the one hand, that’s natural: people are free to learn what they want to learn. On the other hand, the absence of a more precise schedule made some participants feel uncomfortable. From this we conclude that a more focused approach is needed.
  • All the respondents in the final survey said they didn’t have enough time to complete what they wanted. At the same time, most of them admitted that the time frame was adequate to the task.
  • Most of the respondents felt some discomfort because of the lack of a coordinator or instructor, and also said they didn’t always understand what they should do and how.

In the final survey, we asked how we could make the process more efficient and here’s the summary of the ideas:

  • A more precise schedule of activity would be good (like breaking the whole expedition period into specific phases)
  • A coordinator is needed
  • Some instruments for encouraging shy and unconfident participants would be helpful
  • It might be better if the output of the whole project were formulated more precisely
  • The topic should probably be narrower
  • Longer expedition terms would make self-organisation easier

Prospects

We are totally going to continue our experiments with this format. In the future, we are going to try something like:

  • One-day intensive online expeditions with fixed roles and distributed responsibilities
  • Long term (several weeks’ long) expeditions with a coordinator (constant or elected for a certain term)
  • Workshop-expeditions: massive online projects led by a volunteer instructor or mentor willing to share their skills
  • Expeditions based on the Data-MOOC scheme (with pre-planned tasks)

We are also going to develop a method to register participants’ achievements, even small ones, in order to encourage further efforts.

Also, we feel the need to create a way to proudly present major achievements. Here we should consider the experience of creating badge systems.

Well, that’s it for now. It was really cool! And there’s quite a bit of work ahead too!

The Russian version of this account can be found here.

Still alive

I think this is the busiest summer I’ve ever had in my life. I’m trying hard to follow my schedule, but not always successfully. Thanks to the Python MOOC’s organisers, who have kindly included a week’s break in the middle of the sequence, and now I hope to cover week 4 before the next bunch of tasks arrives. I’ll soon post some updates on my findings and experiences.

For now I’ll just save a couple of links here:

This is where MIT OCW assignments can be downloaded. I just keep losing this page. Now I seem to have fixed that.

[Screenshot: Edward Tufte, The Visual Display of Quantitative Information]

And another link, which is not about Python, but I thought it might be interesting for some of my peers: The Visual Display of Quantitative Information by Edward R. Tufte. The shortcoming is that the book is not free. Well, at least it is not supposed to be. Anyway, it was recommended by a person whose judgement I trust.

Also (just boasting), we’re starting an experimental week-long data-MOOC (or data-expedition) in Russian in less than a week’s time. The subject will be very narrow: we’ll only learn different ways of searching for data. I really wonder what it’ll turn out to be like. What I know for sure is that it’s going to be a huge pile of information on top of Python and my job. And there’ll have to be some analytical work afterwards, because we’ll have to sum up our results and understand what to improve in future iterations. The question is how I’m going to find time for all this. But I’ll have to.

Code sharing options

As I proceed with the Python MOOC, I’ve had to choose a way to share my code with my peers. There are many options, in fact. Here are some of them:

GitHub 

This was recommended in the MOOC instructions. GitHub is a multifunctional platform that allows you to create repositories, gists and forks, publish privately or publicly, download code and leave comments. What I also like about it is that you can follow users. For sharing homework, gists might be the best option.

[Screenshot: GitHub]

There are two shortcomings though:

  • You can’t publish your code without registration
  • Some users complain that the interface is a bit too complicated, so it takes time to get used to it
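By the way, gists can also be created programmatically through GitHub’s REST API. Here’s a small Python sketch that only builds the request; the token and filename are placeholders of mine, and nothing is actually sent.

```python
import json
import urllib.request

def make_gist_request(filename, code, description="", public=True,
                      token="YOUR_TOKEN"):
    """Build a POST request for GitHub's gist API (api.github.com/gists).
    `token` stands in for a real personal access token."""
    payload = {
        "description": description,
        "public": public,
        "files": {filename: {"content": code}},
    }
    return urllib.request.Request(
        "https://api.github.com/gists",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": "token " + token,
                 "Accept": "application/vnd.github+json"},
        method="POST",
    )

# Build (but don't send) a request sharing a homework snippet
req = make_gist_request("week4.py", "print('hello')", "MOOC homework")
print(req.get_method())  # POST
```

To actually publish, you would pass `req` to `urllib.request.urlopen` with a real token; for homework sharing, though, the web interface is plenty.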

Pastebin

This is an extremely easy-to-use sharing tool. Actually, the first thing you see on the main page is a box where you can paste your code. You don’t have to register to do it (you simply get a link that you can later share). You can also set an expiration time for each paste (from 10 minutes to never). And you can make it public, unlisted or private (for members only). You can also register if you like (I did, to keep my homework in order).

[Screenshot: Pastebin]

Shortcomings:

  • I haven’t seen any commenting option, which would be useful for feedback and revision while learning
  • I also couldn’t find any option to follow other members

DPaste 

This is a very minimalistic service. You can’t register; you can only paste your code and save it. After that, it will stay there for 30 days and then be automatically deleted. So it’s good for quick sharing, but not for continuous, systematic use.

[Screenshot: dpaste]

Also, I recently found Bitbucket.

But I haven’t explored it yet. If anyone has some experience, please share. There are some explanations as to how to use it though: Bitbucket 101.

As for me, I’m currently using GitHub and Pastebin, because GitHub looks like a wonderful working space and Pastebin is good for sharing with those who are scared of GitHub.