// Internet Duct Tape

Blog Housekeeping: Cleaning Up Dead Links on Your WordPress.com Blog

Posted in Perl, Technology, WordPress.com Tips by engtech on September 15, 2006

Lorelle recently did a one-year retrospective of the various articles she’s written on blogging. She brought up the topic of “blog housekeeping”, which is very important.

Blog maintenance can be broken into different categories:

  • Frequent maintenance
    1. Posting
    2. Linking back to older posts (“deep archives”)
    3. Answering comments
    4. Banning comment spam
  • Occasional maintenance
    1. Backing up the blog
    2. Generating buzz
  • Infrequent maintenance
    1. Pruning dead links
    2. Changing themes
    3. Upgrading software (scripts, plugins, etc.)

I haven’t checked my site for dead links yet, so I’ll do it for the first time. Here’s how:

Being a Unix geek, I’m going to check my site for dead links using a Perl script I’ve mentioned before called LinkLint. I’ll be doing this under Windows XP though, not Unix.

If you aren’t comfortable with creating directories, creating files and installing software then please disregard these instructions. I’m doing this in a manner that is more complicated than is necessary because I’m using a tool I’m familiar with rather than finding a Windows program that does automatic link-checking (if you know of a good freeware one, please leave a comment).

I’m also doing some specific things (like ignoring https://engtech.wordpress.com/tag and named anchors) to reduce the number of errors to strictly the interesting ones. Because finding dead links is interesting, if you’re severely anal retentive. It’s a nice break from organizing the socks.
These instructions will work on Windows and Unix/Linux.

  1. Download and install Perl for Windows.
    • Skip this step if you’re on Unix; Perl should already be installed.
  2. Download linklint and unzip it to a directory on your hard drive.
    • C:\linklint – I’ll refer to as LINKDIR for the rest of this guide.
    • The directory location doesn’t matter as long as you remember it.
  3. Create a command file for linklint.
    • I called the command file engtech.txt and saved it in LINKDIR.
    • These are the contents of the command file:
      • # linklint command file to check links on engtech.wordpress.com
        #---------------------------------------------------------------------------
        # -http tells Linklint to check via HTTP protocol
        #---------------------------------------------------------------------------
        -http
        
        #---------------------------------------------------------------------------
        # check links on engtech.wordpress.com
        #---------------------------------------------------------------------------
        -host engtech.wordpress.com
        
        #---------------------------------------------------------------------------
        # -doc is the directory where all the output files go.  It also tells
        #      Linklint to create complete documentation.
        #---------------------------------------------------------------------------
        -doc engtech
        
        #---------------------------------------------------------------------------
        # -htmlonly removes redundant *.txt files from the -doc directory
        #---------------------------------------------------------------------------
        -htmlonly
        
        #---------------------------------------------------------------------------
        # -limit increases the number of pages it will follow
        #---------------------------------------------------------------------------
        -limit 25000
        
        #---------------------------------------------------------------------------
        # -redirect checks for <meta> redirects in the headers of remote URLs.
        #---------------------------------------------------------------------------
        -redirect
        
        #---------------------------------------------------------------------------
        # -delay will wait n seconds before hitting wordpress.com again (be nice)
        #---------------------------------------------------------------------------
        -delay 3
        
        #---------------------------------------------------------------------------
        # -ignore the /tag/ directory, which does not exist itself (although it has subdirectories)
        #---------------------------------------------------------------------------
        -ignore /tag/
        
        #---------------------------------------------------------------------------
        # -skip makes sure the matching links exist, but does not follow them to find more links.
        # All of the files will be reached through the next/prev links anyway.
        #---------------------------------------------------------------------------
        -skip /tag/@
        
        #---------------------------------------------------------------------------
        # -no_anchors will skip testing named anchors. WordPress.com uses a lot of
        # named anchors that don't necessarily go anywhere
        #---------------------------------------------------------------------------
        -no_anchors
        
        #---------------------------------------------------------------------------
        # -cache creates a linklint.url cache file in the given directory to speed up later runs
        #---------------------------------------------------------------------------
        -cache engtech
        
        #---------------------------------------------------------------------------
        # do the entire site
        #---------------------------------------------------------------------------
        
        /@
        
        #=== END ===================================================================
  4. Create a batch file for calling linklint.
    • I called the batch file engtech.bat and saved it in LINKDIR.
    • These are the contents of the batch file:
      • perl linklint-2.3.5 @engtech.txt
        pause
    • The “pause” is only there to keep the console window from closing once the run has completed.
  5. Double-click on the batch file to run it.
  6. Look at the results, which are created in the doc directory specified in the command file (i.e. LINKDIR/engtech).
    • It will generate HTML files with information about your site.
    • The errors are stored in errorF.htm (i.e. LINKDIR/engtech/errorF.htm).
    • These are my results:
      • host: engtech.wordpress.com
        date: Tue, 05 Sep 2006 22:43:11 (local)
        Linklint version: 2.3.5
        
        #------------------------------------------------------------
        # ERROR   7 files had broken links
        #------------------------------------------------------------
        /2006/04/25/ten-things-every-microsoft-word-user-should-know-%c2%bb-general-disarray-%c2%bb-blog-archive/
        (/2006/04/25/ten-things-every-microsoft-word-user-should-know-%c2%bb-general-disarray-%c2%bb-blog-archive/trackback/)
            had 1 broken link
            /2006/05/16/become-an-excel-ninja-general-disarray-c2bb-blog-archive/
        
        /2006/06/01/emacs-emacsclient-and-gnuclient/
        (/2006/06/01/emacs-emacsclient-and-gnuclient/trackback/)
            had 1 broken link
            /2006/06/01/emacs-emacsclient-and-gnuclient/xemacs.org
        
        /2006/06/page/3/
            had 1 broken link
            /2006/06/page/3/xemacs.org
        
        /2006/07/
            had 1 broken link
            /2006/07/matchstick.ca
        
        /2006/07/28/update-on-matchstickca-and-nokia-6682-phones/
        (/2006/07/28/update-on-matchstickca-and-nokia-6682-phones/trackback/)
            had 1 broken link
            /2006/07/28/update-on-matchstickca-and-nokia-6682-phones/matchstick.ca
        
        /page/10/
            had 1 broken link
            /page/10/xemacs.org
        
        /page/5/
            had 1 broken link
            /page/5/matchstick.ca
  7. Go through the report and fix all the broken links.
    • There are probably some errors reported multiple times because of the way WP cross-references pages.
    • For example, I only had 3 broken links, not 7.
      • 1 of the broken links was because of a bug with WP and trackbacks (escaped characters not being converted properly — feedback sent).
      • The other two were because I forgot to put http:// in the link.
  8. Repeat these steps once every 2-3 months.
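
If the duplicates in step 7 get tedious, the grouping can be scripted. Here is a rough Python sketch, not part of linklint: it assumes the error report has been pasted as plain text in the layout shown in step 6 (page paths at the start of the line, broken links indented below them), and it groups broken links by their last path segment, which is only a heuristic. The function names are my own.

```python
import re
from collections import defaultdict

# The error listing from step 6, pasted as plain text.
REPORT = """\
/2006/04/25/ten-things-every-microsoft-word-user-should-know-%c2%bb-general-disarray-%c2%bb-blog-archive/
(/2006/04/25/ten-things-every-microsoft-word-user-should-know-%c2%bb-general-disarray-%c2%bb-blog-archive/trackback/)
    had 1 broken link
    /2006/05/16/become-an-excel-ninja-general-disarray-c2bb-blog-archive/

/2006/06/01/emacs-emacsclient-and-gnuclient/
(/2006/06/01/emacs-emacsclient-and-gnuclient/trackback/)
    had 1 broken link
    /2006/06/01/emacs-emacsclient-and-gnuclient/xemacs.org

/2006/06/page/3/
    had 1 broken link
    /2006/06/page/3/xemacs.org

/2006/07/
    had 1 broken link
    /2006/07/matchstick.ca

/2006/07/28/update-on-matchstickca-and-nokia-6682-phones/
(/2006/07/28/update-on-matchstickca-and-nokia-6682-phones/trackback/)
    had 1 broken link
    /2006/07/28/update-on-matchstickca-and-nokia-6682-phones/matchstick.ca

/page/10/
    had 1 broken link
    /page/10/xemacs.org

/page/5/
    had 1 broken link
    /page/5/matchstick.ca
"""

# Heuristic only: "name.tld" with nothing else. A file such as "page.html"
# would match too, so treat a hit as a hint, not a verdict.
BARE_DOMAIN = re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,}$", re.IGNORECASE)


def group_broken_links(report):
    """Map the last path segment of each broken link to the pages that contain it."""
    groups = defaultdict(list)
    page = None
    for line in report.splitlines():
        text = line.strip()
        if not text.startswith("/"):
            continue  # skips blanks, "#" comments, "(.../trackback/)" and "had N broken link"
        if line[0].isspace():
            if page is not None:
                # Indented path: a broken link found on the current page.
                target = text.rstrip("/").rsplit("/", 1)[-1]
                groups[target].append(page)
        else:
            page = text  # unindented path: the page being reported on
    return dict(groups)


def looks_like_missing_scheme(target):
    """Guess whether a broken relative link is really a domain missing http://."""
    return bool(BARE_DOMAIN.match(target))


if __name__ == "__main__":
    for target, pages in group_broken_links(REPORT).items():
        hint = " (missing http://?)" if looks_like_missing_scheme(target) else ""
        print(f"{target}{hint}: found on {len(pages)} page(s)")
```

On my report this collapses the 7 entries into 3 distinct targets, and flags xemacs.org and matchstick.ca as the two links where I forgot the http://.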

If you have multiple sites it is trivial to create batch/command files for each site. The technique I use for running linklint will work for any website, not just wordpress.com.
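
For the multi-site case, here is one way the files could be generated instead of copied by hand. This is only a sketch in Python: the function name, the site names and example.com are made up for illustration, and the template reuses the options from my command file minus the /tag/ rules, which are specific to WordPress.com.

```python
from pathlib import Path

# Same options as the engtech.txt command file above, without the
# WordPress.com-specific -ignore /tag/ and -skip /tag/@ rules.
COMMAND_TEMPLATE = """\
# linklint command file to check links on {host}
-http
-host {host}
-doc {name}
-htmlonly
-limit 25000
-redirect
-delay 3
-no_anchors
-cache {name}
/@
"""

# Same two lines as engtech.bat, pointing at the matching command file.
BATCH_TEMPLATE = "perl linklint-2.3.5 @{name}.txt\npause\n"


def write_site_files(linkdir, sites):
    """Write <name>.txt and <name>.bat into linkdir for each name -> host entry."""
    linkdir = Path(linkdir)
    linkdir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, host in sites.items():
        command_file = linkdir / f"{name}.txt"
        batch_file = linkdir / f"{name}.bat"
        command_file.write_text(COMMAND_TEMPLATE.format(host=host, name=name))
        batch_file.write_text(BATCH_TEMPLATE.format(name=name))
        written += [command_file, batch_file]
    return written


if __name__ == "__main__":
    # Hypothetical site list; "linklint-sites" stands in for LINKDIR.
    sites = {
        "engtech": "engtech.wordpress.com",
        "example": "example.com",
    }
    for path in write_site_files("linklint-sites", sites):
        print("wrote", path)
```

Each site then gets its own batch file to double-click and its own output directory, so the reports don't overwrite each other.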

I’ll reiterate one last time before the comments start: this method isn’t recommended for the non-geeky. Installing Perl and running scripts is beyond the abilities of most non-techy people out there, and there are probably much simpler solutions for Windows-only users.

2 Responses


  1. Lorelle VanFossen said, on September 15, 2006 at 4:23 am

    I’ve a list of a variety of link checking services here which will check one page or your entire site, and you can also use the Web Developer Extension with Firefox to check links, but it only checks the page you are viewing.

  2. Scott said, on September 14, 2006 at 11:18 am

    Man, this is geeky techy stuff:-) BTW I don’t like the new colour scheme. YKW

