29 September 2011

Some tips for downloading a wiki using WinHTTrack

I'm in the middle of updating several CHM releases: SketchUp 8, Blender 2.5 and others. I use WinHTTrack rather than wget to download more complex sites such as wikis. Here are some important scan rules for getting a relatively clean offline copy of a wiki:

+*.css +*.js -ad.doubleclick.net/* -mime:application/foobar -*title=* -*Category:* -*Org:* -*Meta:* -*Talk:* -*User:* -*Special:* -*File:* -*action=* -*section=* -*Dev:* -*Help:* -*Template:*
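To get a feel for what these rules keep and drop, here is a rough Python sketch that runs a URL through the same list. It is only an approximation and not HTTrack's actual matcher: it assumes `*` matches any run of characters, that a later rule overrides an earlier one, and it skips the `mime:` rule because that needs the server response. The `verdict` function and the example.org URLs are made up for illustration.

```python
from fnmatch import fnmatchcase

# The scan rules from above, split on whitespace.
RULES = (
    "+*.css +*.js -ad.doubleclick.net/* -mime:application/foobar "
    "-*title=* -*Category:* -*Org:* -*Meta:* -*Talk:* -*User:* "
    "-*Special:* -*File:* -*action=* -*section=* -*Dev:* -*Help:* "
    "-*Template:*"
).split()


def verdict(url, rules=RULES):
    """Return '+' (accepted), '-' (rejected) or None (no rule matched).

    Assumption: rules are checked in order and the last match wins.
    """
    result = None
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if pattern.startswith("mime:"):
            continue  # MIME rules depend on response headers; ignored here
        if fnmatchcase(url, pattern):
            result = sign
    return result


if __name__ == "__main__":
    for url in (
        "wiki.example.org/skins/main.css",
        "wiki.example.org/index.php?title=Manual&action=edit",
        "wiki.example.org/wiki/Talk:Manual",
        "wiki.example.org/wiki/Manual",
    ):
        print(verdict(url), url)
```

A result of None means none of the rules apply, so WinHTTrack's other project settings decide whether the page is fetched.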

Make sure you don't have a "+*.png" scan rule, or one for any other image type! Instead, enable the "Get non-html files related to a link" option in the Links tab to fetch the images. Wikis are known to use fake image links that actually point to an HTML page, which confuses the spider.

Edit: since we don't download the intermediate HTML pages, the links wrapped around the images end up broken. A regex like this can be used to strip those links from all images:
 <a href="http://[^>]*>(<img[^>]*>)</a>  replace with $1
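If your editor can't do regex replace across many files, the same cleanup can be sketched in Python. This is a minimal example, not a polished tool: the function name and the sample HTML are made up, and the pattern uses `[^>]*` so that each part of the match stays inside a single tag and ordinary text links earlier on the same line are left alone.

```python
import re

# Matches <a href="http://..."><img ...></a> and captures only the <img> tag.
# [^>]* keeps each part of the match inside a single tag.
LINKED_IMAGE = re.compile(
    r'<a href="http://[^>]*>(<img[^>]*>)</a>', re.IGNORECASE
)


def unwrap_linked_images(html):
    """Replace every linked image with the bare <img> tag."""
    return LINKED_IMAGE.sub(r"\1", html)


if __name__ == "__main__":
    sample = (
        '<p><a href="http://example.org/wiki/Main">Main</a> '
        '<a href="http://example.org/wiki/File:Logo.png" class="image">'
        '<img src="images/logo.png" alt="Logo"></a></p>'
    )
    print(unwrap_linked_images(sample))
```

To clean a whole mirror, you could loop over the saved .html files, read each one, pass the text through `unwrap_linked_images` and write it back. Work on a copy of the mirror first.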
