CSpider - A Web Site Processor
Introduction
CSpider is an example application which
illustrates some possible techniques in developing
interactive event-driven web applications.
CSpider will crawl a web site while optionally
executing user-defined functions to enable custom processing
of the contents of the site.
Note: CSpider is supported by Netscape 7.0x, Mozilla and Internet Explorer on sites in the same domain as the one hosting the CSpider application. Netscape 7.0x and Mozilla can also spider sites from other domains if the cross-domain security checks are relaxed. See Bypassing Security Restrictions and Signing Code for more details. To enable extended privileges in Netscape 7.0x and Mozilla so that CSpider can access other domains, install user.xpi, which will automatically install user.js in your profile if you don't already have a copy.
Script
CSpider.js
implements a JavaScript object, CSpider, which
can be used to recursively visit (spider)
a web site. CSpider uses
CCallWrapper and
WDocumentLoader.
- Constructor
-
- CSpider(String aUrl, Boolean aRestrictUrl, Number aDepth, WDocumentLoader aPageLoader, Number aOnLoadTimeoutInterval)
-
Constructs an instance of a CSpider object which can be used to spider a site beginning at the URL aUrl to a maximum depth of aDepth. aPageLoader is a reference to a window object containing a WDocumentLoader, which is responsible for loading pages and notifying CSpider when each page has completely downloaded. If aRestrictUrl is false, CSpider will follow links which do not contain aUrl as a prefix. If any page does not load within aOnLoadTimeoutInterval seconds, the user-specified function mOnPageTimeout will be called and CSpider will enter the 'paused' state.
The following user-specified functions are called by CSpider to allow the customization of an application built using CSpider.
- mOnStart
- mOnBeforePage
- mOnAfterPage
- mOnPause
- mOnRestart
- mOnStop
- mOnPageTimeout
- Class Methods
-
- CSpider.handlePageLoad(CFormData aFormData)
-
CSpider.handlePageLoad is the callback function which WDocumentLoader uses to notify CSpider when a page has completed loading.
- Properties
-
- String mUrl
-
mUrl is the initial page where CSpider begins crawling a site.
- Boolean mRestrictUrl
-
When mRestrictUrl is true, CSpider will only follow links which begin with mUrl. Set mRestrictUrl to false to allow CSpider to follow links to other sites.
- Number mDepth
-
mDepth is the depth (number of links away from the starting page) to which CSpider will crawl.
- WDocumentLoader mPageLoader
-
mPageLoader is a reference to the instance of WDocumentLoader used to load pages.
- Number mOnLoadTimeoutInterval
-
If a page has not completed loading in mOnLoadTimeoutInterval seconds, the user-specified function mOnPageTimeout is called, then the spider enters the 'paused' state.
- Array mPagesVisited
-
mPagesVisited is an array of all pages visited by CSpider while crawling the site.
- Object mPageHash
-
mPageHash is a hash which is used to prevent visiting the same page more than once.
- String mState
-
mState records the current 'state' of the CSpider.
- 'initialized' - initial state.
- 'running' - is running.
- 'paused' - in paused state (can be restarted).
- 'stopped' - finished run.
- HTMLDocument mDocument
-
mDocument is a reference to the currently loaded document.
- Function mOnStart
-
mOnStart is a user-defined function which will be called when CSpider's run() method is called.
- Function mOnBeforePage
-
mOnBeforePage is a user-defined function which will be called just before a page is loaded. It can be used to initialize page-dependent data structures.
- Function mOnAfterPage
-
mOnAfterPage is a user-defined function which will be called just after a page is loaded. It can be used to process a page's content.
- Function mOnPause
-
mOnPause is a user-defined function which is called when CSpider enters the 'paused' state, either as a result of the method pause() or after a page load timeout has occurred and the user-specified function mOnPageTimeout has been called.
- Function mOnRestart
-
mOnRestart is a user-defined function which is called when the method restart() is called.
- Function mOnStop
-
mOnStop is a user-defined function which will be called after CSpider has completed crawling the site.
- Function mOnPageTimeout
-
mOnPageTimeout is a user-defined function which will be called if a page is not loaded within the interval defined by mOnLoadTimeoutInterval.
- Function mOnCallWrapperOnLoadPage
-
Internal property used to manage asynchronous calls.
- Function mOnCallWrapperOnLoadPageTimeout
-
Internal property used to manage asynchronous calls.
- Function mOnCallWrapperLoadPage
-
Internal property used to manage asynchronous calls.
- Function mOnCallWrapperPause
-
Internal property used to manage asynchronous calls.
- Methods
-
- init()
-
init() is a convenience method which resets the CSpider to its initial conditions.
- run()
-
run() begins crawling the specified site. It also calls the user-defined mOnStart() function.
- pause()
-
pause() pauses the CSpider and calls the user-defined mOnPause() function.
- restart()
-
restart() restarts a paused CSpider and calls the user-defined mOnRestart() function.
- stop()
-
stop() stops crawling the specified site. It also calls the user-defined mOnStop() function.
- addPage(String href)
-
addPage() is an internal method used to queue pages for visiting.
- loadPage()
-
loadPage() is an internal method used to invoke the page loader. loadPage() calls the user-defined function mOnBeforePage.
- onLoadPage()
-
onLoadPage() is an internal method used to handle page load events. onLoadPage() calls the user-defined function mOnAfterPage.
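The bookkeeping described by mDepth, mRestrictUrl, mPageHash and mPagesVisited can be illustrated with a small stand-alone sketch. This is not the CSpider.js source: SketchCrawler, its two-argument addPage(), and the aLinkMap parameter (an object mapping each URL to the URLs it links to, used here in place of real page loads) are hypothetical names introduced for illustration, and treating the starting page as depth 0 is an assumption.

```javascript
// Hypothetical stand-in for CSpider's bookkeeping; not the CSpider.js source.
function SketchCrawler(aUrl, aRestrictUrl, aDepth) {
  this.mUrl = aUrl;
  this.mRestrictUrl = aRestrictUrl;
  this.mDepth = aDepth;
  this.mPagesVisited = [];  // URLs in the order they were visited
  this.mPagesPending = [];  // queue of { mUrl, mDepth } entries
  this.mPageHash = {};      // URLs already queued, to avoid repeats
}

// Queue a page unless it is too deep, already seen, or outside the prefix.
SketchCrawler.prototype.addPage = function (href, depth) {
  if (depth > this.mDepth) { return false; }
  if (Object.prototype.hasOwnProperty.call(this.mPageHash, href)) { return false; }
  if (this.mRestrictUrl && href.indexOf(this.mUrl) !== 0) { return false; }
  this.mPageHash[href] = true;
  this.mPagesPending.push({ mUrl: href, mDepth: depth });
  return true;
};

// Visit pages breadth-first, reading each page's links from aLinkMap.
SketchCrawler.prototype.run = function (aLinkMap) {
  this.addPage(this.mUrl, 0);  // assumption: the starting page is depth 0
  while (this.mPagesPending.length > 0) {
    var page = this.mPagesPending.shift();
    this.mPagesVisited.push(page.mUrl);
    var links = Object.prototype.hasOwnProperty.call(aLinkMap, page.mUrl)
      ? aLinkMap[page.mUrl] : [];
    for (var i = 0; i < links.length; i++) {
      this.addPage(links[i], page.mDepth + 1);
    }
  }
  return this.mPagesVisited;
};
```

With mRestrictUrl set to true, a link to another site is never queued; with it set to false, the same link is queued as long as it is within mDepth links of the starting page.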
CSpider Application
Launch the CSpider application.
<html>
<head>
<title>CSpider</title>
<script type="text/javascript" src="CCallWrapper.js"></script>
<script type="text/javascript" src="CSpider.js"></script>
<script type="text/javascript">
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
*
* Software distributed under the License is distributed on an "AS IS" basis,
* WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
* for the specific language governing rights and limitations under the
* License.
*
* The Original Code is Netscape code.
*
* The Initial Developer of the Original Code is
* Netscape Corporation.
* Portions created by the Initial Developer are Copyright (C) 2003
* the Initial Developer. All Rights Reserved.
*
* Contributor(s): Bob Clary <bclary@netscape.com>
*
* ***** END LICENSE BLOCK ***** */
var gOutput;
var gSpider;
var gPageLoader;
var gPageCount = 0;
function main(form)
{
  gPageCount = 0;
  gPageLoader = window.frames.pageLoader;
  gOutput = document.getElementById('output');

  var url = form.url.value;
  var depth = parseInt(form.depth.value, 10);
  var restrict = form.restrict.checked;
  var timeout = parseFloat(form.timeout.value);

  gSpider = new CSpider(url, restrict, depth, gPageLoader, timeout);

  // CSpider follows the strategy pattern. You customize its
  // behavior by specifying the following functions which
  // will be called by CSpider on your behalf.
  gSpider.mOnStart = function()
  {
    var form = document.forms.spiderForm;
    form.run.disabled = true;
    form.pause.disabled = false;
    form.restart.disabled = true;
    form.stop.disabled = false;
    msg('Starting...');
    return true;
  };

  gSpider.mOnBeforePage = function()
  {
    msg('Starting to load ' + this.mCurrentUrl.mUrl + '<br>' +
        'Depth : ' + this.mCurrentUrl.mDepth + '<br>' +
        'Remaining : ' + this.mPagesPending.length);
    return true;
  };

  gSpider.mOnAfterPage = function()
  {
    // If you wish to process the DOM of the loaded page,
    // use this.mDocument in this user-defined function.
    ++gPageCount;
    msg('Page loaded: ' + this.mCurrentUrl.mUrl + '<br>' +
        'Depth : ' + this.mCurrentUrl.mDepth + '<br>' +
        'Remaining : ' + this.mPagesPending.length);
    return true;
  };

  gSpider.mOnStop = function()
  {
    var form = document.forms.spiderForm;
    form.run.disabled = false;
    form.pause.disabled = true;
    form.restart.disabled = true;
    form.stop.disabled = true;
    msg('Stopped... loaded ' + gPageCount + ' pages');
    return true;
  };

  gSpider.mOnPause = function()
  {
    var form = document.forms.spiderForm;
    form.run.disabled = true;
    form.pause.disabled = true;
    form.restart.disabled = false;
    form.stop.disabled = false;
    msg('Paused... click Restart to continue');
    return true;
  };

  gSpider.mOnRestart = function()
  {
    var form = document.forms.spiderForm;
    form.run.disabled = true;
    form.pause.disabled = false;
    form.restart.disabled = true;
    form.stop.disabled = false;
    msg('Restarting...');
    return true;
  };

  gSpider.mOnPageTimeout = function()
  {
    msg('Page Load Timed out...');
    return true;
  };

  gSpider.run();
}

// Display a status message in the output element.
function msg(s)
{
  gOutput.innerHTML = '<pre>' + s + '<\/pre>';
}
</script>
</head>
<body>
<h1>CSpider</h1>
<p>
Enter the URL of a web site you wish to process and the depth to which
you wish to process the site.
</p>
<p>
Note that Internet Explorer can only spider DevEdge due to same-domain
security restrictions. However, Netscape 7.0x and Mozilla can process
other web sites if you have enabled the appropriate security bypasses.
See the <a href="./">Example</a> for more details.
</p>
<form name="spiderForm">
<fieldset>
<label>
URL <input name="url" type="text" size="80"
value="http://devedge.netscape.com/">
</label>
<br />
<label>
Depth <input name="depth" type="text" size="4" value="1">
</label>
<label>
Restrict Urls <input name="restrict" type="checkbox" value="on" checked>
</label>
<label>
Page timeout <input name="timeout" type="text" size="4" value="120">
</label>
</fieldset>
<fieldset>
<legend>Controls</legend>
<button name="run" type="button" onclick="main(this.form)">Run</button>
<button name="pause" type="button" onclick="gSpider.pause()" disabled>Pause</button>
<button name="restart" type="button" onclick="gSpider.restart()" disabled>Restart</button>
<button name="stop" type="button" onclick="gSpider.stop()" disabled>Stop</button>
</fieldset>
</form>
<div id="output"></div>
<iframe id="pageLoader" name="pageLoader"
width="100%" height="80%" border="0" src="WDocumentLoader.html"></iframe>
</body>
</html>
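The comment in mOnAfterPage above notes that this.mDocument can be used to process the DOM of each loaded page. As a minimal sketch of that idea, the following collects the distinct link targets of a page. collectLinkHrefs and installLinkCollector are hypothetical helper names, not part of CSpider.js; the only DOM feature assumed is the standard document.links collection.

```javascript
// Hypothetical helper: returns the distinct link targets of a document.
function collectLinkHrefs(aDocument) {
  var seen = {};
  var hrefs = [];
  var links = aDocument.links;  // DOM collection of <a>/<area> elements with href
  for (var i = 0; i < links.length; i++) {
    var href = links[i].href;
    if (!Object.prototype.hasOwnProperty.call(seen, href)) {
      seen[href] = true;
      hrefs.push(href);
    }
  }
  return hrefs;
}

// Hypothetical wiring: 'spider' stands for a CSpider instance such as gSpider,
// and onLinks is a caller-supplied function that receives the array of hrefs.
function installLinkCollector(spider, onLinks) {
  spider.mOnAfterPage = function () {
    onLinks(collectLinkHrefs(this.mDocument));
    return true;  // returning true lets the spider continue normally
  };
}
```

In the application above, this might be used as installLinkCollector(gSpider, function (hrefs) { msg(hrefs.join('<br>')); }) before calling gSpider.run(), replacing the mOnAfterPage handler shown in the listing.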
Change Log
- 2003-07-08
-
-
Fixed bug in depth calculations.
-
Added ability either to restrict the spider to URLs containing the original URL as a prefix or to let it follow any link.
-
Added ability to handle page load timeouts.
-
Added ability to pause and restart the spider; replaced Boolean mRunning with String mState.
-
Added ability for user-specified CSpider "event handlers" to return true to continue normal operation or false to cause the CSpider to enter the 'paused' state.
-
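The handler return convention in the last entry above can be sketched as follows. This is an illustration only and not code from CSpider.js: callHandler is a hypothetical name, and two details are assumptions, namely that a missing handler counts as "continue" and that only an explicit false (not undefined) pauses the spider. The real CSpider also calls mOnPause when it enters the 'paused' state, which this sketch omits.

```javascript
// Hypothetical dispatcher illustrating the true/false handler convention.
function callHandler(spider, name) {
  var handler = spider[name];
  if (typeof handler !== 'function') {
    return true;  // assumption: no handler installed means "continue"
  }
  var result = handler.call(spider);  // handlers run with 'this' set to the spider
  if (result === false) {             // assumption: only an explicit false pauses
    spider.mState = 'paused';
    return false;
  }
  return true;
}
```

A handler such as mOnAfterPage could therefore return false when it finds something on a page that needs attention, leaving the spider paused until it is restarted.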
