Tutorial - How to quickly create a flat, local, rendered copy of your dynamic web site’s output files - its web imprint

This is a write-up of a problem that I recently solved. There are many ways to skin this cat - this is how I did it.

Summary

  1. Use a helper PHP script to scan the remote web root directory to create a list of files (web pages, images, etc) to download.
  2. Import that list into some software called Website Downloader which ‘visits’ each file and downloads it, thus collecting a rendered version of dynamically generated output.

Time:

Set-up: ~15-30 mins
(+ file download time)

Requirements:

  • Local Windows Machine - required to run Website Downloader
  • PHP running on Linux (where the site is hosted) - required to run a PHP helper script that collects the list of files to download.

The Problem

One of my clients has a semi-dynamic web site which I am converting into a fully dynamic site. The end goal is to have a website application that uses the Model-View-Controller (MVC) pattern to serve pages and be more manageable.

It's tempting to start from scratch - wipe the old site away and drop a new MVC one in its place. But there is a Sword of Damocles hanging over this project. The site is currently very well placed on Google, so I would hate to see the site’s PR plummet after the changes have been made. Thus, we need to maintain the web-imprint of the site on the front-end, whilst making radical changes to the way it is served up (the back-end).

Therefore, I wanted to take a snapshot of the web-imprint BEFORE the changes, then take another one AFTER. If these snapshots are placed under version control (e.g. Subversion), then spotting any differences is easy.
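To make the before/after idea concrete, here is a minimal Python sketch of the same comparison done without version control. It is only an illustration, not part of my actual workflow: the function name snapshot_differences and the two folder arguments are made up for this example.

```python
import filecmp
import os


def snapshot_differences(before_dir, after_dir):
    """Return (only_before, only_after, changed) lists of relative paths."""
    only_before, only_after, changed = [], [], []

    def walk(cmp, prefix):
        only_before.extend(os.path.join(prefix, n) for n in cmp.left_only)
        only_after.extend(os.path.join(prefix, n) for n in cmp.right_only)
        # Compare files present in both snapshots byte for byte
        # (shallow=False), rather than trusting size and timestamp.
        _, mismatch, errors = filecmp.cmpfiles(
            cmp.left, cmp.right, cmp.common_files, shallow=False
        )
        changed.extend(os.path.join(prefix, n) for n in mismatch + errors)
        # Recurse into folders that exist in both snapshots.
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(before_dir, after_dir), "")
    return sorted(only_before), sorted(only_after), sorted(changed)
```

Calling snapshot_differences("snapshot_before", "snapshot_after") gives three lists: files that disappeared, files that appeared, and files whose rendered output changed. A version control diff tells you the same thing, plus the line-by-line detail.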

This is why I needed to quickly create a flat, local, rendered copy of a dynamic web site’s output files - its web imprint.

The Solution

First - get the files:

Have a play with the Website Downloader App
Launch the Website Downloader application and play around with it (I did).

Adjust Settings and Upload files_list_maker script

Rename the PHP script so that it has a .php extension (files_list_maker.php); this allows it to be parsed by PHP.

Then, open it up in a text editor and adjust the hard-coded settings within the PHP script:

  • Set the “$dir_path” variable to the path of your web root folder, the directory that you wish to build a download list for.
  • Set the “$fully_qualified_url_to_root” variable to the web address of your home page.
  • [Optionally uncomment and add array element values to the “$lister_obj->ignore_list_arr” array to specify file and folder names to ignore.]
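For anyone curious about what a list maker of this kind does with those three settings, here is a rough Python sketch of the general technique. It is not the files_list_maker script itself; the function name build_download_list and its parameters are assumptions that simply mirror the settings above (directory path, root URL, ignore list).

```python
import os
from urllib.parse import quote


def build_download_list(dir_path, base_url, ignore_names=()):
    """Return a sorted list of URLs, one per file found under dir_path."""
    ignore = set(ignore_names)
    urls = []
    for current_dir, subdirs, files in os.walk(dir_path):
        # Prune ignored folders in place so os.walk does not descend into them.
        subdirs[:] = [d for d in subdirs if d not in ignore]
        for name in files:
            if name in ignore:
                continue
            rel_path = os.path.relpath(os.path.join(current_dir, name), dir_path)
            # URLs use forward slashes whatever the local path separator is,
            # and characters such as spaces need percent-encoding.
            url_path = quote(rel_path.replace(os.sep, "/"))
            urls.append(base_url.rstrip("/") + "/" + url_path)
    return sorted(urls)
```

Printing each entry of build_download_list("/path/to/web/root", "http://www.example.com") on its own line would give a list in the one-URL-per-line shape described in the next steps. Both arguments here are placeholders.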

Then, upload it to a password-protected directory of the website that you want to take a snapshot of.

Point your browser to the script

This should then output the list of files (web pages, images, etc) to download.

Copy and paste that list into a new text file and save it

Save as something like my_directory_list_todaysdate.txt

Import the list into the Website Downloader App

(See screen grabs of the Website Downloader App)

  • Open up the Website Downloader App (if not already open)
  • Click on the ‘Download List’ tab
  • Click on the ‘Open’ button (at the bottom of the screen), then choose the file to import (e.g. my_directory_list_todaysdate.txt) and click OK
  • Click on the ‘Download’ button - you will then be asked to select a folder to store the downloaded files
  • Select the download folder and Click OK
  • Have a cup of tea.

The Website Downloader Application will then get to work downloading the files.
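If you are wondering what "downloading from a list" boils down to, here is a minimal Python sketch of the idea: read one URL per line, request each one, and write the response to disk untouched. This is my own illustration and says nothing about how Website Downloader works internally. The names url_to_local_path and download_all, and the choice of index.html for directory-style URLs, are assumptions.

```python
import os
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def url_to_local_path(url, download_dir):
    """Map a URL onto a file path inside download_dir, mirroring the URL path."""
    path = unquote(urlsplit(url).path).lstrip("/")
    if path == "" or path.endswith("/"):
        path += "index.html"  # assumed file name for directory-style URLs
    return os.path.join(download_dir, *path.split("/"))


def download_all(list_file, download_dir, fetch=None):
    """Fetch every URL in list_file (one per line) and save the raw bytes."""
    if fetch is None:
        # Default fetcher: a plain HTTP GET, so dynamic pages arrive rendered.
        fetch = lambda url: urlopen(url).read()
    with open(list_file, encoding="utf-8") as handle:
        urls = [line.strip() for line in handle if line.strip()]
    saved = []
    for url in urls:
        target = url_to_local_path(url, download_dir)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as out:
            out.write(fetch(url))  # bytes are written exactly as received
        saved.append(target)
    return saved
```

The important design point is the last write: nothing is rewritten on the way to disk, so the local copy matches what the server actually sent. The fetch parameter is only there so the function can be tried out without a network connection.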

Caveats

By using a directory file list, this technique only gets files that actually exist on the server. So it will not get virtual dynamic files that rely on a query string or ‘mod rewrite+bootstrap’. I only needed to get files that existed - so that was not an issue for me.

But, if you want to collect these types of ‘virtual files’ then an alternative would be to use the Website Downloader App to scan the site to build the ‘download list’ by crawling link-to-link.

This, obviously, takes longer, and you run the risk of failing to get all the web pages (it only follows links - so you’d have to ensure that any orphan URLs are accounted for). Also, you have to be careful of any infinite URL loops or randomly generated URLs (like no-cache hashes appended to banner-ad URLs) that could dirty-up your beautiful local copy.
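To show where those two risks come from, here is a small Python sketch of a link-to-link crawl. It is a generic illustration, not how Website Downloader builds its list. The names LinkCollector, normalise and crawl are made up, as are the ignore_params and max_pages arguments; fetch_html is assumed to be any function that takes a URL and returns the page's HTML as a string.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def normalise(url, ignore_params=()):
    """Drop the fragment and any query parameters named in ignore_params."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in ignore_params]
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/",
                       urlencode(kept), ""))


def crawl(start_url, fetch_html, ignore_params=(), max_pages=500):
    """Follow same-host links from start_url; return the sorted URLs visited."""
    host = urlsplit(start_url).netloc
    seen = set()
    queue = [normalise(start_url, ignore_params)]
    # The seen set stops URL loops; max_pages caps randomly generated URLs.
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        parser = LinkCollector()
        parser.feed(fetch_html(url))
        for href in parser.links:
            target = normalise(urljoin(url, href), ignore_params)
            if urlsplit(target).netloc == host and target not in seen:
                queue.append(target)
    return sorted(seen)
```

Notice that a page nobody links to never enters the queue, which is exactly the orphan URL problem. The query string is kept on purpose, since the whole point of crawling is to reach the ‘virtual files’; only parameters you explicitly name (a no-cache hash, say) are stripped.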

I looked at a few offerings of software, on download.com, but I chose Website Downloader because:

  • it's really simple
  • it can accept an imported ‘download list’
  • it does not alter the files that it downloads

I spent a long time going down the wrong path and searching for “Offline Browser” software. “Offline browsers” and other “site grabber” software wasted a lot of my time by insisting on changing the links within the downloaded HTML pages. This is, of course, logical because their job is to let you browse offline. But that was not what I wanted.

Anyway - I hope all these scribblings come in handy if you are facing the same type of problem.

As always, add comments to air your thoughts and questions.
