Galt Global Review

QFS 360

July 8, 2003
That web full of spiders!
by Tatiana Andronache, I.S.P.


While biologists and philosophers have yet to give a clear answer to the old question of which came first, the chicken or the egg, we can take comfort in the fact that things are much easier to explain when it comes to spiders and webs: the first to come was the web - the World Wide Web, that is. Only after it was built did we have to create the spiders.

What are spiders?
A spider (a.k.a. crawler, robot, or wanderer) is a program that periodically reads the content of website pages and indexes it. A database is automatically built and continuously maintained: index entries pointing to new content are added, and dead links are removed. These indexes allow search engines (the likes of Excite, AltaVista, Inktomi, etc.) to retrieve the pages that best match the keywords used in a query submitted by a user. Spidering (or crawling) is the term used to describe the work these tireless spiders perform.

Spiders come with various capabilities, which differ in terms of the platforms they are able to function on, how they explore web pages, and how many servers they span. They can be customized for the frequency of spidering, the types of documents and domains targeted, and the statistics they provide, so that they suit the purpose they are used for. Spiders are able to index based on directory structure or hypertext links (or both), and build reports on new, changed or broken content.
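As an illustration of the reporting idea, here is a minimal Python sketch of how a spider might classify content as new, changed or gone between two crawls. It assumes each crawl yields a simple mapping from URL to page text; the function names, the sample pages and the use of a SHA-256 digest to detect changes are choices made for this example, not a description of any particular product.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a stable digest of a page's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def change_report(previous: dict, current: dict) -> dict:
    """Compare two crawl snapshots, each mapping URL -> page text."""
    # Pages seen for the first time in the current crawl.
    new = sorted(url for url in current if url not in previous)
    # Pages that were indexed before but could not be found this time.
    removed = sorted(url for url in previous if url not in current)
    # Pages present in both crawls whose content digest differs.
    changed = sorted(
        url
        for url in current
        if url in previous and fingerprint(current[url]) != fingerprint(previous[url])
    )
    return {"new": new, "changed": changed, "removed": removed}


# Hypothetical snapshots from two successive crawls of an intranet.
last_week = {"/home": "Welcome", "/news": "Old story", "/jobs": "Hiring"}
today = {"/home": "Welcome", "/news": "Fresh story", "/products": "Catalogue"}

report = change_report(last_week, today)
print(report)
```

The "removed" list is what a real spider would use to prune dead links from its index.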

While the capability of spidering technology is bewildering, so is the purpose that some use it for.

What we make them do...
Spiders are very useful for content management of corporate intranets: they free up administrators from the tedious tasks of manually detecting and classifying new content, and updating the information catalogues. On the Internet at large, spidering is what ensures that a page is retrievable by a search engine, so its role is essential to commercial websites and businesses that need to be visible and current on the web. Hence the idea – and lucrative practice – of paid spidering: for a fee, your website will be regularly indexed in the database of the search engine.

The quality of their spiders is what distinguishes search engines. (Ever made a search in which a page you know exists was not retrieved by one search engine, but was found by another? Or had to sift through an endless list of results with little relevance to your query?) Because it is essential for a website or document to be listed at the top of a result list, and for its content to be relevant to the keywords used in the search, good spiders require a high level of sophistication. They should also be able to overcome problems such as search engine spamming (repeated submission of the same web page for indexing, too many pages submitted too often, repeated use of a keyword in a page in order to increase its chances of being indexed). Well-behaved spiders should also know how to keep off designated areas of a website. In fact, they should have a lot of built-in business ethics.
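The usual way for a site to mark such designated areas is a robots.txt file, the convention known as the Robots Exclusion Protocol. The short Python sketch below shows how a polite spider might consult it before fetching a page, using the standard library's urllib.robotparser module. The rules, the URLs and the "ExampleSpider" name are all made up for illustration; a real spider would download the file from the site it is about to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, agent: str = "ExampleSpider") -> bool:
    """Return True if the site's rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


for url in (
    "http://www.example.com/index.html",
    "http://www.example.com/private/report.html",
):
    print(url, "->", "fetch" if allowed(url) else "skip")
```

Note that nothing forces a spider to run this check; honouring the file is a matter of good manners on the part of whoever writes the program.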

But this is not always the case. Some spiders are built specifically to go out and come back with information from competitors’ sites, with complete disregard for spidering restriction rules, conventions and copyright laws. Others are designed to harvest email addresses and build email lists for use in unsolicited bulk emailing – or spam.

This ugly side of spidering has accounted for some landmark lawsuits (eBay vs. the now defunct Bidder’s Edge, for example). Although information posted on a site is considered publicly available, as it was in this case, the unauthorized collection of that data for commercial purposes is considered equivalent to trespassing: the data resides on a server, and the server is private property. Spiders are the main culprit in other cases related to intellectual property, trade secrets and infringement of copyright, as well as in lawsuits launched – and won – on grounds of potentially destructive consequences (such as that of a European railway company that forced the removal of a link to a site providing information on how to derail trains).

Don't blame the spiders!
Just as with their biological counterparts, there are all sorts of spiders out there, and some are really dangerous. However, let’s not forget that spiders are just programs designed by humans, and they will be as “good” or “bad” as those who design them and “feed” them with parameters. Unfortunately, it is perfectly possible to be a good spider (technically) while at the same time being a bad one (from an ethics perspective). Whenever technical prowess overtakes ethical and legal frames, you can be sure it’s more than just a bad spider at work.

How do spiders work?
A common misconception about spider operations is that crawling the Web is the same as searching the Web. A spider does NOT search the entire Internet or intranet for all documents relevant to a term at the moment you ask. Given the magnitude of information on the Internet, such a query could take days! When you use an Internet search engine to find information, you are in fact querying the index of URLs that the search engine has created by spidering the Internet – much more efficient than querying each URL on the Web one by one.
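To make the distinction concrete, here is a toy Python sketch of the kind of index a search engine queries. It assumes the spider has already collected the text of three pages (the URLs and text are invented), builds a simple word-to-URL lookup, often called an inverted index, and answers a query without revisiting any page. Real search engines are far more elaborate; this only illustrates the principle.

```python
from collections import defaultdict


def build_index(pages: dict) -> dict:
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index: dict, query: str) -> list:
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical pages already gathered by a spider.
pages = {
    "http://example.com/a": "spiders crawl the web",
    "http://example.com/b": "search engines query an index",
    "http://example.com/c": "spiders build the index",
}

index = build_index(pages)
print(search(index, "spiders index"))
```

The expensive work (reading every page) happens once, at crawl time; each query afterwards is only a few dictionary lookups.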

Intranet spiders operate on the same basic principle. In fact, most spiders start the same way – the administrator feeds the spider an address (URL) or points it to a file directory. The spider then gathers information about the resources (including full or abstracted text, hypertext links, etc.), and follows any hyperlinks to retrieve additional documents. The spider supplies this information to the catalog for indexing, automatically building a catalog of information that users can then query. This process repeats indefinitely, without any human intervention. Users are essentially querying the index of information, which allows results to be returned faster.
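The steps just described (start from a seed address, gather the page, follow its links, hand everything to the catalog) can be sketched in a few lines of Python. To keep the example runnable without network access, the "site" below is a made-up dictionary of pages and the fetch function simply looks a URL up in it; a real spider would download pages over HTTP, respect each site's restrictions, and repeat the crawl on a schedule.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site"; a missing key stands for a dead link.
SITE = {
    "http://example.com/": '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "http://example.com/about.html": '<a href="/">Home</a>',
    "http://example.com/news.html": '<a href="/about.html">About</a> <a href="/missing.html">Gone</a>',
}


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch):
    """Visit pages breadth-first from the seed; return (catalog, dead links)."""
    catalog, dead = {}, []
    queue, seen = deque([seed]), {seed}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # the page could not be retrieved
            dead.append(url)
            continue
        catalog[url] = html  # hand the content over for indexing
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return catalog, dead


catalog, dead_links = crawl("http://example.com/", SITE.get)
print(sorted(catalog))
print(dead_links)
```

The set of already-seen addresses is what keeps the spider from going round in circles when pages link back to each other.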


