| While biologists and philosophers have yet to give a clear
answer to the old question of which came first, the chicken
or the egg, we can take comfort in the fact that things
are much easier to explain when it comes to spiders and
webs:
the first to come was the web - the World Wide Web, that
is. Only after it was built did we have to create the spiders.
What are spiders?
A spider (a.k.a. crawler, robot, or wanderer) is a program
that periodically reads the content of website pages
and indexes them. A database is built automatically and
maintained continuously: index entries pointing to new
content are added and dead links are removed. These indexes
allow search engines (the likes of Excite, AltaVista,
Inktomi, etc.) to retrieve the pages that best match
the keywords used in a query submitted by a user. Spidering
(or crawling) is the term used to describe the work these
tireless spiders perform.
Spiders come with various capabilities. They differ in the
platforms they can run on, how they explore web
pages, and how many servers they span. They can be customized
for frequency of spidering, type of documents and domains
targeted, and statistics provided, so that they
suit the purpose they are used for. Spiders can
index based on directory structure or hypertext
links (or both), and build reports on new, changed, or broken
content.
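To make the idea of a report on new, changed, or broken content
concrete, here is a minimal sketch in Python. It assumes the spider
keeps a snapshot (URL mapped to page text) from each run; the function
names, the snapshot format, and the choice to treat a URL that has
disappeared as "broken" are all assumptions made for illustration, not
features of any particular product.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a stable digest of a page's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def change_report(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two crawl snapshots (URL -> page text).

    Assumption: a URL present in the previous snapshot but absent
    from the current one is reported as "broken".
    """
    report = {"new": [], "changed": [], "broken": []}
    for url, text in current.items():
        if url not in previous:
            report["new"].append(url)
        elif fingerprint(text) != fingerprint(previous[url]):
            report["changed"].append(url)
    for url in previous:
        if url not in current:
            report["broken"].append(url)
    # Sort each list so the report is deterministic.
    for urls in report.values():
        urls.sort()
    return report
```

Comparing digests rather than full text means a real spider would only
need to store one short string per page between runs.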
While the capability of spidering technology is bewildering,
so are the purposes that some use it for.
What we make them do...
Spiders are very useful for content management of corporate
intranets: they free up administrators from the tedious
tasks of manually detecting and classifying new content,
and updating the information catalogues. On the Internet
at large, spidering is what ensures that a page is retrievable
by a search engine, so its role is essential to commercial
websites and businesses that need to be visible and current
on the web. Hence the idea, and the lucrative practice, of
paid spidering: for a fee, your website will be regularly
indexed in the database of the search engine.
The quality of their spiders is what distinguishes search
engines. (Have you ever run a search in which a page you know
exists was not retrieved by one search engine but was found by
another? Or had to sift through an endless list of
results with little relevance to your query?) Because it is
essential for websites and documents to be listed at the
top of a result list, and for their content to be relevant
to the keywords used in the search, good spiders require
a high level of sophistication. They should also be able
to overcome problems such as search engine spamming (repeated
submission of the same web page for indexing, too many
pages submitted too often, repeated use of a keyword in
a page in order to increase its chances of being indexed).
Well-behaved spiders should also know how to keep off
designated areas of a website. In fact, they should have
a lot of built-in business ethics.
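One widely used way for a site to designate off-limits areas is a
robots.txt file (the Robots Exclusion Protocol, which this article
does not name but which is the usual convention). The sketch below
shows how a spider might check a URL against such rules using
Python's standard urllib.robotparser module. The robots.txt content,
the spider name "ExampleSpider", and the example.com URLs are
hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set the spider can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(rules: RobotFileParser, agent: str, url: str) -> bool:
    """Return True if the named spider may fetch the given URL."""
    return rules.can_fetch(agent, url)


rules = build_rules(ROBOTS_TXT)
```

A well-behaved spider would call allowed() before every request and
skip any URL for which it returns False. Nothing in the protocol
enforces this; compliance is voluntary on the spider's side.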
But this is not always the case. Some spiders are built
specifically to go out and come back with information
from competitors’ sites, with complete disregard
for spidering restriction rules, conventions, and copyright
laws. Others are designed to harvest email addresses and build
email lists for use in unsolicited bulk emailing – or
spam.
This ugly side of spidering has accounted for some landmark
lawsuits (eBay v. the now defunct Bidder’s Edge,
for example). Although information posted on a site is
publicly accessible, as it was in this case, the unauthorized
collection of that data for commercial purposes was treated as
equivalent to trespassing: the data resided on a server,
and the server is private property. Spiders are the main
culprit in other cases related to intellectual property,
trade secrets, and infringement of copyright, as well as in
lawsuits launched – and
won – on grounds of potentially destructive consequences
(such as a European railway company that forced the removal
of a link to a site providing information on how to derail
trains).
Don't blame the spiders!
Just as with their biological counterparts, there are all sorts
of spiders out there, and some are really dangerous.
However, let’s not forget that spiders are just programs
designed by humans, and they will be as “good” or “bad” as
those who design them and “feed” them parameters.
Unfortunately, it is perfectly possible to be a good
spider (technically) while at the same time being a bad
one (from an ethics perspective). Whenever technical
prowess overtakes ethical and legal frames, you can be
sure it’s more than just a bad spider at work.
How Do Spiders Work?
A common misconception about spider operations
is that crawling the Web is the same as searching
the Web. A spider does NOT search the entire Internet
or intranet for all documents relevant to a term.
Given the magnitude of information on the Internet,
such a query could take days! When you use an Internet
search engine to find information, you are in fact
querying the index of URLs that the search engine
has created by spidering the Internet, which is much
more efficient than querying each URL on the Web
one by one.
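The difference between crawling and searching can be illustrated
with a small sketch of an inverted index, a common structure for
this kind of lookup (the article does not say which structure any
given search engine uses). Pages are processed once, when the index
is built; a query then touches only the index. The function names
and sample pages are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)
```

Note that search() never reads a page: it only intersects sets of
URLs already stored in the index, which is why results come back
quickly.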
Intranet spiders operate on the same basic principle.
In fact, most spiders start the same way – the
administrator feeds the spider an address (URL) or
points it to a file directory. The spider then gathers
information about the resources (including full or
abstracted text, hypertext links, etc.), and follows
any hyperlinks to retrieve additional documents.
The spider supplies this information to the catalog
for indexing, automatically building a catalog of
information that users can then query. This process
repeats indefinitely, without any human intervention
required. Users essentially are querying the index
of information, which allows results to be returned
faster. |
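The process described above (start from a seed address, gather each
page, follow its hyperlinks, and hand the results to the catalog)
can be sketched as a single breadth-first pass. This is a simplified
illustration, not any vendor's implementation: the fetch function is
passed in as a parameter, the SITE dictionary is a hypothetical
in-memory stand-in for a real website, and a real spider would also
schedule repeat runs and respect access restrictions.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Crawl breadth-first from seed; fetch(url) returns HTML or None."""
    catalog = {}          # URL -> page content, handed on for indexing
    broken = []           # URLs that could not be retrieved
    queue = deque([seed])
    seen = {seed}         # avoids visiting the same URL twice
    while queue and len(catalog) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            broken.append(url)
            continue
        catalog[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return catalog, broken


# Hypothetical three-page site used in place of real HTTP requests.
SITE = {
    "http://example.com/": '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "http://example.com/about.html": '<a href="/">Home</a> <a href="/missing.html">Gone</a>',
    "http://example.com/news.html": "<p>No links here.</p>",
}

catalog, broken = crawl("http://example.com/", SITE.get)
```

Here catalog ends up holding the three reachable pages, and broken
lists the one link whose target does not exist.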
Tatiana Andronache is a member of the IT technical staff
at a large information technology company in Toronto, Canada.
She can be reached at tatiana.andronache@sympatico.ca
|