indiWiz.com
Escaping Search Engines!

You can ask search engines not to index certain pages of your site. There are two ways of doing this, and the two may be used to complement each other. The methods are:

  1. Using META tags.
  2. Using "robots.txt" file.

Using META Tags

On each page that you do not want indexed, add the following META tag inside the page's <head> section:

<meta name="robots" content="noindex, nofollow">

A brief explanation:

noindex: The current page should not be indexed. The opposite of this value is "index".

nofollow: The links in the current page should not be followed. The opposite of this value is "follow".

Thus, if you do not want the current page indexed but do want robots to follow the links on it, use the following META tag:

<meta name="robots" content="noindex, follow">
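
To see how a robot might read these values, here is a minimal Python sketch that extracts the robots META tag from a page and reports whether indexing and link-following are permitted. The names RobotsMetaParser and robots_policy are made up for this illustration, and the sketch assumes that a missing directive means "index" and "follow"; real crawlers may apply further rules.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            # The content value is a comma-separated list, e.g. "noindex, follow".
            for part in attr_map.get("content", "").split(","):
                directive = part.strip().lower()
                if directive:
                    self.directives.add(directive)


def robots_policy(html_text):
    """Return (may_index, may_follow); both default to True (an assumption)."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


page = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
print(robots_policy(page))  # (False, True)
```
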

Using "robots.txt" File

You can also place a plain ASCII text file named "robots.txt" in the root of your HTML directory to tell search engines which pages and directories not to index. Because robots look for this file only at the root of the host, members of free hosting services that give each user a sub-folder, such as Geocities, cannot use this method. Thus, this is not a valid location for a "robots.txt" file:

http://www.comp-name.com/user-name/robots.txt

But this one is correct:

http://www.comp-name.com/robots.txt
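
The rule can be sketched in Python: whatever page a robot starts from, the only "robots.txt" it asks for is the one at the root of the same host. The function name robots_txt_url is hypothetical, and the sketch uses only the standard urllib.parse module.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; discard the page's path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.comp-name.com/user-name/index.html"))
# http://www.comp-name.com/robots.txt
```
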

Syntax of "robots.txt"

Any text following "#" on a line is treated as a comment (UNIX shell programmers might feel at home!). The keywords used in this file are:

  1. User-agent: This names the robot to which the rules that follow apply. For example, if you do not want Google to index particular pages or folders, set the value of User-agent to Googlebot:
    User-agent: Googlebot
    To make the rules apply to every robot, use:
    User-agent: *
    For a list of active User-agents, visit www.robotstxt.org.

  2. Disallow: This specifies the paths that are not to be indexed. Paths are given relative to the site root. For example, if I have a folder "admin" in the root HTML directory and I don't want the contents of that folder to be indexed, I would set this value thus:
    Disallow: /admin/
    If you do not want a particular file indexed, say "mail.html", use:
    Disallow: /mail.html
    Suppose you don't want search engines to index a particular script, say mail.cgi placed within your "/cgi-bin" directory, which is called with different arguments at different times, like:
    /cgi-bin/mail.cgi?id=subwiz&dom=indiwiz
    /cgi-bin/mail.cgi?id=sudhir&dom=yahoo
    You can control such pages thus:
    Disallow: /cgi-bin/mail.cgi?
    A rule matches every URL that begins with the given path, so no trailing wildcard is needed. Some robots, Googlebot among them, also understand "*" as a wildcard, but it was not part of the original standard, and robots that follow only that standard treat it as a literal character.
    If you do not want any robot to index any of your pages, use:
    Disallow: /

  3. Allow: This is the opposite of "Disallow" and follows the same syntax, with the "Allow" keyword in place of "Disallow". It was not part of the original "robots.txt" standard, but the major search engines support it.
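
To see how User-agent groups and Disallow rules interact, the following minimal sketch feeds a small rule set to Python's standard urllib.robotparser module and asks which paths each robot may fetch. The rule set and the robot name "OtherBot" are invented for illustration, and Python's parser is only one implementation; individual search engines may resolve some rules differently.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: Googlebot is kept out of /admin/ only,
# while every other robot is kept out of the whole site.
RULES = """\
# Rules for one named robot
User-agent: Googlebot
Disallow: /admin/

# Rules for every other robot
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

checks = [
    ("Googlebot", "/admin/login.html"),
    ("Googlebot", "/mail.html"),
    ("OtherBot", "/mail.html"),
]
for agent, path in checks:
    verdict = "allowed" if parser.can_fetch(agent, path) else "disallowed"
    print(f"{agent:10} {path:20} {verdict}")
```

A robot uses the group that names it and falls back to the "*" group only when no group names it, which is why Googlebot may still fetch /mail.html here.
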

Example

Suppose you have the site http://comp-name.com, with the following contents:

http://comp-name.com/index.html
http://comp-name.com/admin/
http://comp-name.com/private.html

You want to tell robots not to index the pages within the "admin" folder or the page "private.html". Your "robots.txt" file will look like this:

#Author: Your_Name
#Last-Modified-Date: Date

User-agent: *
Disallow: /admin/
Disallow: /private.html
Allow: /index.html  #Optional
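
If the list of private paths changes often, a file like this can be generated rather than written by hand. The sketch below builds the rules for the example site and checks them with Python's standard urllib.robotparser module. The helper name build_robots_txt and the robot name "AnyBot" are assumptions made for this illustration, and the comment lines of the example file are left out.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallow, allow=(), agent="*"):
    """Return robots.txt text with one User-agent group (hypothetical helper)."""
    lines = [f"User-agent: {agent}"]
    lines += [f"Disallow: {path}" for path in disallow]
    lines += [f"Allow: {path}" for path in allow]
    return "\n".join(lines) + "\n"


text = build_robots_txt(["/admin/", "/private.html"], allow=["/index.html"])
print(text)

# Read the generated rules back and test the three URLs of the example site.
checker = RobotFileParser()
checker.parse(text.splitlines())
for url in (
    "http://comp-name.com/index.html",
    "http://comp-name.com/admin/",
    "http://comp-name.com/private.html",
):
    print(url, checker.can_fetch("AnyBot", url))
```

Only index.html should be reported as fetchable; the other two URLs fall under the Disallow rules.
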

To Conclude...

Both techniques are widely recognized conventions, and most popular search engines honor them. Some robots ignore them, however, so be careful: neither technique actually restricts access, and "robots.txt" is itself publicly readable, so it reveals the paths it lists. The best way to protect your private files is to place them within password-protected folders, whether or not they are served over SSL.

- Subhash.


The contents of this site are copyright© 2000-2008, indiWiz.com. All Rights Reserved.