Recently I ran across a mention of a certain blog that had been maliciously hacked and erased, with the personal information of the original author posted in its place. The blog was hosted at one of the big blog sites like Blogger, LiveJournal, TypePad, etc. The original author did not make backups.
In any case, I decided to see if I could get the contents back, and guess what: I managed to recover 89% of the blog content WITHOUT accessing the original site in any way, shape or form. Of course the author is very happy, and had he taken a few simple precautions he could have recovered everything. (Yes, he should have backed up, but that’s obvious.)
The first step is the Internet Archive. For some reason the Internet Archive did not have this specific blog. After searching for my own blog and some others, it seems that while the archive might be useful for some content, its update frequency might not be sufficient for blogs.
The key to recovering blog content is learning a little about search engines and RSS feeds. Two search engines, Google and Yahoo! Search, happen to cache content for about 30 days, and for less in the case of blogs. Searching through Google’s cache recovered about 3 weeks' worth of posts. Of course, this can only be done within a short time after the original erasure occurs. To use the cache, simply use this URL:
Now about the RSS feeds: a lot of blogs are indexed via their feeds in various blog search engines and online RSS readers. Two of them, Feedster and BlogLines, appear to cache content indefinitely, unlike Google (BlogDigger happened to be down at the time). Unfortunately, after searching for the blog’s name in Feedster, it turned out that this specific blog was not indexed there. As for BlogLines, which is an RSS reader, a search suggests that they only index a blog while it has at least one active subscriber. In this case, someone had been subscribed to the blog for about two weeks, which recovered another two weeks of content. If the blog author had added his feed to Feedster, more content would probably have been recovered. As for BlogLines, perhaps having an account with them that subscribes to your own feed would be a solution.
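Once a cached copy of a feed turns up, the posts still have to be pulled out of the XML. Here is a minimal sketch of that step in Python, assuming the cached copy is a plain RSS 2.0 document that has already been saved as a string. The function name extract_posts and the dictionary keys are my own choices for illustration, not anything Feedster or BlogLines provide, and Atom feeds or namespaced elements such as content:encoded would need extra handling.

```python
import xml.etree.ElementTree as ET


def extract_posts(rss_xml):
    """Return one dict per <item> found in an RSS 2.0 document.

    Hypothetical helper: assumes plain RSS 2.0 element names
    (title, link, pubDate, description). Missing elements become "".
    """
    root = ET.fromstring(rss_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "date": item.findtext("pubDate", default=""),
            "body": item.findtext("description", default=""),
        })
    return posts
```

Running the same function over your own live feed on a schedule and saving the result would also serve as a crude backup.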
Moving back to search engines, I happened to find that Yahoo Search has a caching feature similar to Google's. It is logical for them to offer it since the market leader does, but it takes a lot of computing power, which some of the smaller search engines do not have. Unfortunately, Yahoo does not have the same easily usable URL, so I had to search for the blog in question manually. After choosing the “more from same site” feature, I was able to bring the total recovered to 89% of the blog content. The missing part was three archive pages that showed up in Yahoo Search but did not have the “cached” link next to them.
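Since the same post can turn up in more than one cache, combining what each source returns could be automated along these lines. This is only a sketch under my own assumptions: each recovered post is a dict with "link" and "body" keys, the permalink identifies the post, and the longer body is taken to be the more complete copy. The names merge_recovered and coverage are hypothetical.

```python
def merge_recovered(sources):
    """Merge posts from several caches, keyed by permalink.

    sources is a list of (source_name, posts) pairs. When the same
    permalink appears more than once, the copy with the longer body wins.
    """
    merged = {}
    for name, posts in sources:
        for post in posts:
            key = post["link"].rstrip("/")  # treat /post and /post/ as the same
            current = merged.get(key)
            if current is None or len(post["body"]) > len(current["body"]):
                merged[key] = dict(post, source=name)
    return merged


def coverage(merged, expected_total):
    """Fraction of the blog recovered, given how many posts it had."""
    return len(merged) / expected_total if expected_total else 0.0
```

The expected total has to come from somewhere else, for example the author's memory or the post counts shown on a cached archive page.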
Given all of this information, it is interesting to see that a lot of web content may not be as susceptible to sabotage as people think. Perhaps this will discourage people from doing it.