Copyright ©1996, Que Corporation. All rights reserved. No part of this book may be used or reproduced in any form or by any means, or stored in a database or retrieval system without prior written permission of the publisher except in the case of brief quotations embodied in critical articles and reviews. Making copies of any part of this book for any purpose other than your own personal use is a violation of United States copyright laws. For information, address Que Corporation, 201 West 103rd Street, Indianapolis, IN 46290 or at support@mcp.com.

Notice: This material is excerpted from Special Edition Using HTML, 2nd Edition, ISBN: 0-7897-0758-6. This material has not yet been through the final proofreading stage that it will pass through before being published in printed form. Some errors may exist here that will be corrected before the book is published. This material is provided "as is" without any warranty of any kind.

Chapter 23 All about CGI Scripts

In previous chapters you have learned how to mark up content for your Web site using the HTML standard. Now, we will begin our exploration of the CGI (Common Gateway Interface), which will greatly enhance the level of interactivity on your site. With the use of CGI scripts, you can make your Web presentations more responsive to your users' needs by allowing them to have a more powerful means of interaction with your material.

In this chapter, you will learn:

How the CGI works.
Uses for CGI scripts.
Whether you can write CGI scripts.
Common CGI scripting languages.
How to find CGI Resources.

What is CGI?

Here is the answer to the hundred dollar question. What is the CGI anyway? Well, in order to answer that, you are going to need a little background information first.

Each time you sit down in your favorite chair (I hope it is anyway) and start surfing the WWW, you are a client from the Internet's point of view. Each time you click on a link to request a new Web document, you are sending a request to the document's server. The server then receives the request, gets the document, and sends it back to your browser for you to view.

The client/server relationship that is set up between your browser and a Web server works very well for serving up HTML and image files from the server's Web directories. Unfortunately, there is a large flaw with this simple system. The Web server is still not equipped to handle information from your favorite database program or from other applications that require more work than simply transmitting a static document.

One option the designers of the first Web server could have chosen was to build in an interface for each external application from which a client may want to get information. It is hard to imagine trying to program a server to interact with every known application and then trying to keep the server current on each new application as it is developed. Needless to say, it would be impossible. So they developed a better way.

These wise developers anticipated this problem and solved it by designing the Common Gateway Interface or CGI. This gateway provides a common environment and a set of protocols for external applications to use while interfacing with the Web server. Thus, any application engineer (including yourself) can use the CGI to allow an application to interface with the server. This extends the range of functions the Web server has to include the features provided by a potentially limitless number of external applications.

How the CGI works

Now that you have read a little background, you should have a basic idea of what the CGI is, and why it is needed. The next step in furthering your understanding of the CGI is to learn the basics of how it works. To help you achieve this goal, I will break down this material into the following sections:

The Process.
Characteristics.
The output Header and MIME Types.
Environment Variables.

The Process

The CGI is the common gateway or door that is used by the server to interface -- or communicate -- with applications other than the browser. Thus, CGI scripts act as a link between whatever application is needed and the server while the server is responsible for receiving information from, and sending data back to, the browser.

As a technical note, you should be aware that some people like to use the term program to refer to longer, usually compiled, code and applications written in languages like C and C++. When this is the case, the term script is then used to indicate shorter, noncompiled code written with languages like SH and PERL. However, for the purpose of this and the following chapter, the terms program and script will be used interchangeably as the divisions between them are being rapidly broken down.

For example, when you enter a search request at your favorite search engine, a request is made by the browser to the server to execute a CGI script. At this time, the browser passes the information that was contained in the online form plus the current environment to the server. From here, the server passes the information to the script. This script provides an interface with the database archive and finds the information that you have requested. Once this information is retrieved, the script sends it to the server which feeds it back to your browser as a list of matches to your query.
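To make this flow a little more concrete, here is a minimal sketch of the script's side of the exchange, written in SH. It assumes a UNIX server that hands the query to the script in the QUERY_STRING environment variable. The function name handle_search is made up for illustration, and a real search script would query a database rather than simply echo the request back.

```sh
#!/bin/sh
# Sketch only: the server passes the request to the script through
# environment variables, and the script answers on standard output
# with a header, a blank line, and a body.

handle_search() {
    # QUERY_STRING is set by the server for GET requests.
    query=${QUERY_STRING:-}

    printf 'Content-type: text/plain\n\n'
    if [ -z "$query" ]; then
        printf 'No query was supplied.\n'
    else
        printf 'You searched for: %s\n' "$query"
    fi
}

handle_search
```

The server reads whatever the script prints and relays it to the browser, which is the last step of the round trip described above.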

There is a very nice online description of the CGI at The Common Gateway Interface:

URL address: http://hoohoo.ncsa.uiuc.edu/cgi/

Characteristics of the CGI

Another way of looking at the CGI is to see it as a socket that attaches an extra arm on your server. This new arm, the CGI script, adds new features and abilities to the server that it was previously lacking.

The most common use for these new features is to give the server the ability to dynamically respond to the client. One of the most often seen examples of this is allowing the client to send a search query to a CGI script which then queries a database and returns a list of matching topics from the database. Besides information retrieval, another common theme for using CGI scripts is to customize the user interface on the Web site. This commonly takes the form of counters and animations.

If you see bin or cgi-bin in the path-names of images or links, it is a good indication that the given effect was produced by a CGI script.

These and some of the other common uses for CGI scripts will be discussed in more detail later in this chapter, so stay tuned.

The MIME Content-type output header

It won't be long into your CGI programming career before you want to write a script that sends information to the server for it to process. Each file that is sent to the server must begin with an output header. This header contains the information the server and other applications need to transmit and handle the file properly.

The use of output headers in CGI scripts is an expansion of a system of protocols called MIME (Multipurpose Internet Mail Extensions). Its use for e-mail began in 1992 when the Network Working Group published RFC (Request For Comments) 1341, which defined this new type of e-mail system. This system greatly expanded the ability of Internet e-mail to send and receive various non-text file formats.

Since the release of RFC 1341, a series of improvements has been made to the MIME conventions. You can find some additional information about this by looking at RFC 1521 and RFC 1522. A list of all the RFC documents can be found online at http://ds0.internic.net/rfc/. These documents contain a lot of useful information published by the Network Working Group relating to the function and structure of the Internet backbone.

Each time you, as a client, send a request to the server, it is sent in the form of a MIME message with a specially formatted header. Most of the information in the header is part of the client's protocol for interfacing with the server. This includes the request method, a URI (Universal Resource Identifier), the protocol version, and then a MIME message. The server then responds to this request with its own message which usually includes the server's protocol version, a status code, and a different MIME message.

The bulk of this client/server communication process is handled automatically by the WWW client application -- usually your Web browser -- and the server. This makes it easier for everyone, since you don't have to know how to format each message in order to access the server and get information. You just need a WWW client. However, to write your own CGI scripts, you will need to know how to format the Content-type line of the MIME header in order for the server to know what type of document your script is sending. Also, you will need to know how to access the server's environment variables so you can use that information in your CGI scripts. In the following sections, you will learn everything necessary to accomplish both of these tasks.

If you decide to write your own WWW client, then you will need to understand the client/server communication process before you can begin. A good place to start your search for more information about this is the W3C Reference Library at http://www.w3.org/hypertext/WWW/Library/.

Using a Content-type output header

Each document that is sent via a CGI script to the server, whether it was created "on-the-fly" or is simply being opened by the script, must contain a Content-type output header as the first part of the document so the server can process it accordingly. In table 23.1 you will see examples of some of the more commonly used MIME Content-types and their associated extensions.

Table 23.1 Examples of MIME types and Extensions

Content-type:	Extensions
application/octet-stream	bin exe
application/postscript	ai eps ps
application/pdf	pdf
application/x-csh	csh
application/x-sh	sh
application/x-wais-source	src
application/x-gtar	gtar
application/x-gzip	gz
application/x-tar	tar
application/zip	zip
audio/x-wav	wav
image/gif	gif
image/jpeg	jpeg jpg jpe
text/html	html htm
text/plain	txt
text/richtext	rtx
video/mpeg	mpeg mpg mpe
video/quicktime	qt mov
video/x-msvideo	avi
video/x-sgi-movie	movie
x-world/x-vrml	wrl
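Table 23.1 can be turned into a small lookup, which is handy when one script has to serve several kinds of files. The following SH sketch covers only a few rows of the table. The function name content_type_for is hypothetical, and falling back to application/octet-stream for unknown extensions is my own assumption rather than a rule taken from the table.

```sh
#!/bin/sh
# Sketch only: map a file name to a MIME Content-type using a few of
# the rows from table 23.1. content_type_for is a hypothetical helper.

content_type_for() {
    case "$1" in
        *.gif)               echo "image/gif" ;;
        *.jpeg|*.jpg|*.jpe)  echo "image/jpeg" ;;
        *.html|*.htm)        echo "text/html" ;;
        *.txt)               echo "text/plain" ;;
        *.pdf)               echo "application/pdf" ;;
        *.zip)               echo "application/zip" ;;
        # Assumed default for anything not listed above.
        *)                   echo "application/octet-stream" ;;
    esac
}

content_type_for "/file/path/your.gif"   # prints image/gif
```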

To help you better understand how to properly use Content-types within a CGI script, let's work through an example. Suppose you have decided to write a CGI script that will display a GIF each time it is executed by a browser.

The first line of code you need is a special comment that contains the path to the interpreter for the scripting language you are using to write the program. In this case it is PERL 4. The comment symbol "#" must be followed by an exclamation point "!" and then the path. This special combination of "#!" on the first line of the file is the standard format for letting a UNIX server know which interpreter to use to execute the script. Other types of server systems have alternate methods of specifying the interpreter's location. However, since this line of code starts with a "#" symbol, it is still a valid PERL comment and does not cause problems on non-UNIX servers.

You should double check to make sure you include the correct path-name to your language's interpreter.

#!/usr/local/bin/perl

The next line you will need simply sets the variable "$gif" to the full path name of the image you wish to display.

$gif = "/file/path/your.gif";

Now it is time to let the server know that it will be receiving an image file from this script to display on the client's browser. This is done using the MIME Content-type line. The print statement sends the information between the quotation marks to the server. Each "\n" that you see on this line produces a newline: the first one ends the Content-type line, and the second one gives you the required blank line that must occur after the Content-type information. A blank line lets the server know where the MIME header stops and where the body of information, in this case the gif, starts.

print "Content-type: image/gif\n\n";

The next line creates a file handle named IMAGE that forms a link from this script to the file contained in the variable "$gif" which we set earlier.

open(IMAGE, $gif) || die "Cannot open $gif: $!\n";

Now, we create a loop that sends the entire contents of the gif to the server as the body of the MIME message we began with the Content-type line.

while(<IMAGE>) { print $_; }

To avoid being sloppy, we will close the file handle to the gif now that we are done sending the image.

close(IMAGE);

Finally, we let the PERL interpreter know that the CGI script is finished running and can be stopped.

exit;

This type of script can be modified into something a little more useful. For example, you could turn it into a random image viewer. Each time someone clicks on the link to the script, it executes and feeds a random gif to the client's browser.
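Here is one way the random image viewer idea might look, sketched in SH rather than PERL. The directory in IMAGE_DIR and the function names are hypothetical, and the sketch assumes that awk is available on the server to pick the random number. Answering with a 404 status when the directory holds no gifs is my own choice for the sketch.

```sh
#!/bin/sh
# Sketch only: send one randomly chosen gif from a directory.
# IMAGE_DIR is a hypothetical location; change it to suit your server.
IMAGE_DIR=${IMAGE_DIR:-/file/path/images}

pick_random_gif() {
    # $1 is a directory; prints the path of one .gif file found in it.
    dir=$1
    set -- "$dir"/*.gif
    # If nothing matched, the shell leaves the pattern unexpanded.
    [ -e "$1" ] || return 1
    index=$(awk -v n="$#" 'BEGIN { srand(); print int(rand() * n) + 1 }')
    shift $((index - 1))
    printf '%s\n' "$1"
}

send_random_gif() {
    gif=$(pick_random_gif "$1") || {
        printf 'Status: 404 Not Found\nContent-type: text/plain\n\n'
        printf 'No images available.\n'
        return 0
    }
    # Same header and body layout as the PERL example above.
    printf 'Content-type: image/gif\n\n'
    cat "$gif"
}

send_random_gif "$IMAGE_DIR"
```

Because awk seeds its random number generator from the clock, two requests that arrive within the same second may receive the same image.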

Environment Variables

Hopefully, you now have a little better understanding of what is involved as the client and server communicate with each other. Along with the information that I discussed earlier, a host of environment variables are sent during the client/server communications. Although each server can have its own set of environment variables, for the most part, they are all subsets of a large set of standard variables described by the Internet community to help promote uniform standards.

If you have access to a cgi-bin directory on a UNIX server, then you can use the following script to easily determine which environment variables your server supports. In addition, this script should also work on other server types such as Microsoft Windows NT server if you properly configure the server to recognize and execute PERL scripts.

Once again, this is the magic line that lets the server know which type of CGI script this is so it can launch the appropriate interpreter.

#!/usr/local/bin/perl

This next line, as was described above, is the MIME output header that lets the server know to expect an HTML document to follow.

print "Content-type: text/html\n\n";

Now that the server is expecting to receive an HTML document, we will send it a list of each environment variable's name and current value by using a "foreach" loop.

foreach $key (keys(%ENV)){
        print "\$ENV{$key} = \"$ENV{$key}\"<br>\n";
}

Finally, we need to tell the interpreter that the script is finished.

exit;

Fig. 23.1

Using the CGI script environment.pl from a browser will generate a screen similar to this one.

If the browser you use doesn't support an environment variable, the value of the variable is set to null and is left empty.

As you can see from the example, most of the variables contain protocol version information, and location information such as the client's IP address and the server's domain. However, if you are creative, you can put some of these variables to good use in your CGI scripts.

The best example I have seen so far is the use of the environment variable "HTTP_USER_AGENT". This contains the name and version number of the client application, which is usually a Web browser. As you can see from figure 23.1, the Netscape 2.0 browser that I used when running this script has an HTTP_USER_AGENT value of Mozilla/2.0 (Win95; I).

Once you know what the values are for various browsers, it is possible to write a CGI script to serve different Web documents based on browser type. Thus, a text-only browser might receive a text version of your Web page, while image-capable browsers will receive the full version.
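A rough sketch of this idea in SH follows. The function name page_for_agent and both file names are placeholders I made up for the example. It assumes that the text-only Lynx browser identifies itself with a value beginning with "Lynx", and it treats any browser it does not recognize as text-only to be on the safe side.

```sh
#!/bin/sh
# Sketch only: choose a page based on the HTTP_USER_AGENT value.
# page_for_agent and the file names are hypothetical.

page_for_agent() {
    case "$1" in
        Lynx*)    echo "index-text.html" ;;   # text-only browser
        Mozilla*) echo "index-full.html" ;;   # image-capable browser
        *)        echo "index-text.html" ;;   # assumed safe default
    esac
}

page=$(page_for_agent "${HTTP_USER_AGENT:-}")
printf 'Content-type: text/html\n\n'
printf '<P>This browser would receive %s.</P>\n' "$page"
```

A working script would send the chosen file itself instead of just naming it.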

Uses for CGI scripts

Web sites are interactive by their very nature. Every time you click on a hyper link, you are actively involved in the site, rather than passively reading information. Most users enjoy this added level of interactivity and the feeling of participation it brings. However, hyper links are just the beginning. With CGI scripts, you have access to a whole new set of tools to make your Web site more interactive and dynamic.

The list of uses for CGI scripts is always growing. Here are but a few of the more common ones.

Processing forms
Image maps
Animations
HTML "on the fly"
Counters
Search Engines
WAIS servers
Spiders, Robots, & WebCrawlers

As you can see, you probably have already interacted with many CGI scripts, possibly without even realizing it.

Processing forms

Processing the information entered into a form is by far the most common use of CGI scripts. These scripts are activated when you press the submit/send button, which is usually found near the bottom of the form. Once the script is executed, the server sends it the information that was entered. Then, the script processes this information and, if appropriate, sends some information back to the browser via the server. This information is then displayed on your monitor.

If you write a script that sends nothing back to the browser, let the server and browser know this by printing the following line, followed by a blank line, in place of the Content-type line.

Status: 204 No response

You can take a look at the following URL to see an example of a simple form on the Web for adding a response to a guestbook.

URL Address: http://www.missouri.edu/~bchemkm/guestbook.htm

If you use the browser's "View Source" command (with Netscape, pull down the View menu and select the View Source option), you should be able to find a line in the HTML document that looks something like this.

<FORM ACTION="http://absolute_path_name/CGI-bin/scriptname.type" METHOD="POST or GET">

The "ACTION" attribute tells the browser which script to request each time the information from the form is sent to the Web server. By using the absolute pathname for the script, you provide a means for the Web server to find the desired script. It is important to remember that you should always use the absolute pathname when indicating the location of scripts on a server.

The "METHOD" attribute tells the browser how to send the form's information (either GET or POST), which in turn determines how the script must read it. For more information on the METHOD attribute, you can look in chapter 21 on forms.

See "Form Layout and Design"
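The difference between GET and POST shows up on the script's side in where the form data arrives. The following SH sketch assumes the usual CGI conventions: with GET the data is in the QUERY_STRING environment variable, and with POST it arrives on standard input with its length in CONTENT_LENGTH. The function name read_form_data is hypothetical, and the data it returns is still URL-encoded.

```sh
#!/bin/sh
# Sketch only: fetch the raw (still URL-encoded) form data for either
# method. read_form_data is a hypothetical helper.

read_form_data() {
    case "${REQUEST_METHOD:-GET}" in
        POST)
            # POST: read exactly CONTENT_LENGTH bytes from standard input.
            dd bs=1 count="${CONTENT_LENGTH:-0}" 2>/dev/null
            ;;
        *)
            # GET: the data travels in the URL after the "?".
            printf '%s' "${QUERY_STRING:-}"
            ;;
    esac
}

printf 'Content-type: text/plain\n\n'
printf 'Raw form data: %s\n' "$(read_form_data)"
```

Reading only CONTENT_LENGTH bytes matters because the server is not required to close the script's input after the form data.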

Fig. 23.2

Notice that you can create a nice looking form by inserting the form fields within table tags.

Fig. 23.3

Here is a sample of the source code that is used to produce the table in figure 23.2.

Fig. 23.4

You can use borderless tables, as with this response page, to nicely lay out the script's output.

The script that processes this form has several common features that you can find in other forms as you explore the Web.

Contains one or more levels of error checking to ensure that the form is filled out properly.
Provides an opportunity for the user to double check the information they have entered.
Notifies you that the information was sent correctly, with a brief thank-you and then points out what you should do next.
Processes the form's information. In this case, the information is added to a response page and the owner of the guestbook -- me -- is notified via e-mail that the guestbook was signed.

CGI scripts are also commonly used to collect survey information, or update the contents of a database. Later, in Chapter 24, you will learn exactly how each of these features works as you learn to write your own guestbook script, much like this one.

Image maps

CGI scripts are commonly used, as is discussed in detail in Chapter 12, for running image maps. Each time you use one of these clickable images, you are executing a CGI script that comes packaged with the Web server. This script compares the coordinates of your "click" with those in the image map's configuration file to determine which URL to send to the server. The server then transmits the information to the browser.

See "Imagemaps: From Browser to Server and Back"
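The coordinate comparison at the heart of an image map script can be sketched in a few lines of SH and awk. The map format used here (one "URL left top right bottom" entry per line, rectangles only) is invented for this illustration and is not the configuration format of any particular server. The function name lookup_click is also hypothetical.

```sh
#!/bin/sh
# Sketch only: find which rectangle contains a click.
# $1 = x, $2 = y, $3 = map file ("-" for standard input), $4 = default URL
# Map lines look like: URL left top right bottom (an invented format).

lookup_click() {
    awk -v x="$1" -v y="$2" -v fallback="$4" '
        x >= $2 && x <= $4 && y >= $3 && y <= $5 { print $1; found = 1; exit }
        END { if (!found) print fallback }
    ' "$3"
}

# A click at (40,25) falls inside the first rectangle.
lookup_click 40 25 - /default.html <<'MAP'
/products.html 0 0 99 49
/contact.html 100 0 199 49
MAP
```

The example call prints /products.html. A complete script would then hand the chosen URL back to the server, for example in a Location header, so the browser can be sent to it.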

Animations

Think back to when you were a kid in grade school. Do you remember drawing stick men, one on a page, and then flipping the pages quickly to animate it, (instead of listening to what the teacher was saying)? Well, this same kind of sequential image animation is done on Web sites using a simple CGI script.

At http://www.missouri.edu/~bchemkm/guestbook.htm you will find an example I created to demonstrate what this type of animation looks like. Each image is one in a series of 10 gifs from the well known Duke JAVA animation. This sequence is repeated so that the actual animation plays several times.

The Duke animation that is described above was originally designed by Sun Microsystems for use with their JAVA animation applet called ImageLoop. You can see their original version of this animation at http://java.sun.com/applets/applets/ImageLoop/index.html if you have a browser that supports JAVA, such as Netscape 2.0.

By using a JAVA applet to perform the animation instead of a CGI script, they are able to add several key features. First, the JAVA applet downloads onto the client's system and runs using that system's resources. This removes some of the processing overhead from the remote server. Also, since the animation applet runs locally, there is no delay in the animation while each image is downloaded to the client's system. Thus, the animation is a lot smoother.

To give you a better feel for how an animation script works, you will need to have a basic understanding of the concept of a boundary. When the script runs, it happily creates the HTML document until it comes to the boundary -- another way of saying an artificial divider. Then, the script inserts the graphic for the first animation. Once the first image is accounted for, the script generates the rest of the HTML document. However, the script remembers where the boundary is in the document and overlays each new image on top of the previous one, creating the animation. This is done using the MIME Content-type for multi-part documents.
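To show how the boundary is used, here is a small SH sketch of the technique. It assumes the multipart/x-mixed-replace Content-type that Netscape browsers understand, where each new part replaces the one before it. The boundary string, the FRAME_DELAY setting, and the function name send_animation are all made up for this example, and it is not the PERL script presented in the next chapter.

```sh
#!/bin/sh
# Sketch only: send a series of gifs as one multi-part document.
# BOUNDARY, FRAME_DELAY, and send_animation are hypothetical names.
BOUNDARY="FrameBoundary"

send_animation() {
    # Each argument is the path to one gif frame, sent in order.
    printf 'Content-type: multipart/x-mixed-replace;boundary=%s\n\n' "$BOUNDARY"
    for frame in "$@"; do
        # The boundary line marks the start of each new part.
        printf '%s\n' "--$BOUNDARY"
        printf 'Content-type: image/gif\n\n'
        cat "$frame"
        printf '\n'
        # Pause so that each frame stays on the screen for a moment.
        sleep "${FRAME_DELAY:-1}"
    done
    # A boundary with two trailing hyphens ends the whole document.
    printf '%s\n' "--$BOUNDARY--"
}

if [ "$#" -gt 0 ]; then
    send_animation "$@"
fi
```

Every part carries its own Content-type line and blank line, just like a complete document would.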

Would you like to have this type of simple CGI animation on your own Web site? If so, all you need to do is keep reading. I have provided a very simple PERL animation script to produce these for your own pages in the next chapter. Along with this script is a more detailed discussion of how animation scripts work.

See "Sample of using CGI for Animation"

HTML "on the fly"

Another nifty trick using simple CGI scripts is to generate customized HTML pages. These pages produced "on the fly" by the script can include such things as the current time and date, the name and version of the user's browser or even the user's name.

You can use a simple SH shell script, for example, to generate a little clock (with the date) and indicate which browser the client is using to view your site. To make everything look better, the output can be displayed using table formatting.

See "HTML Table 101" p.[Ch. 13]

Now, I will walk you through this short SH CGI script.

The first line of code is the special comment line that lets the server know what language interpreter to use as it tries to execute the script. In this case, it is the SH shell scripting language usually located in the bin directory on the server.

#!/bin/sh

The SH command "cat << top" appears in the next line. The cat command (short for concatenate) copies its input to its output, and "<< top" begins what is called a here document. Everything up to a line containing only the word "top" is fed to cat, which prints it to the server for delivery to the browser.

cat << top

Now, we tell the server what type of document it is receiving so that it can notify the browser. This is done using the MIME Content-type output header discussed earlier in this chapter.

Content-type: text/HTML

As a reminder, you must leave at least one blank line below the Content-type line for the command to work properly. Basically, the blank line lets the server know that the header information is finished and that the rest of the information is the message body.

These are standard HTML structural tags.

<HTML>
<HEAD>

The next line is a META tag. As you learned in chapter 5, this tag can be used to reload a page after an indicated amount of time, in this case one minute. Thus, after each minute elapses, the script is executed again and the page is rebuilt on the fly. This way, the clock maintains the current time.

If the browser you use does not support META tags, then you will need to reload the page each time you wish to update the time.

<META HTTP-EQUIV="refresh" CONTENT="60; URL=http://www.missouri.edu/bchemkm-bin/timescript.sh">

Some more vanilla HTML.

<TITLE>Sample Time Script</TITLE>
</HEAD>
<BODY TEXT="#000000" BGCOLOR="#FFFFFF">
<HR><P>
<CENTER>
<TABLE BORDER=5 CELLSPACING=10 CELLPADDING=2>
<TR>
<TD>
top

Here, we execute the standard UNIX command "date" and pass it several formatting options. The "+" at the start of the argument tells the date command that a format string follows. Within that string, the "%" symbol followed by a character is a format code that tells the date command what to include in the output.

/bin/date "+ %I:%M %p %Z"

You can get a full list of formatting switches for the date command using the UNIX command "man". This will display the manual pages for the requested command. For the date command just type the following on a UNIX command line.

$ man date

The echo commands used here print the information contained within the quotation marks to the browser. Also, we see another use of the "date" command with a different formatted request.

echo "<BR></TD>"
echo "<TD>"
/bin/date "+%A %B %d, %Y"
echo "<BR></TD>"
echo "</TR><TR>"
echo "<TD COLSPAN=2>"

Now, here is an example of incorporating an environment variable to tell the client which browser he is using to view your page.

echo "$HTTP_USER_AGENT"

Now that you have created the clock and let the user know which browser she is using, it is time to finish off the HTML page. This is done with the "cat" command again, sandwiching the desired HTML between two identical parameters, this time "bottom".

cat << bottom
<BR></TD>
</TR>
</TABLE>
</CENTER><P>
<HR>
<P>The rest of your page's content goes here.<P>
<HR>
</BODY>
</HTML>
bottom

If you have copied everything correctly, and are using a browser that supports META tags, you should see something that looks like figure 23.5.

Fig. 23.5

This is an example of a simple clock produced by using a CGI script.

Counters

If you surf the Web much, you have probably seen several pages that tell you what number visitor you are to the site. The way these sites keep track of the number of visitors is by using a counter. This is a CGI script that increments an internal counter each time the page is requested from the server and then displays the appropriate series of graphics to indicate the current "count".
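The counting half of such a script is short enough to sketch in SH. This version keeps the count in a plain text file and prints it as text rather than as a series of graphics. The file location in COUNT_FILE and the function name increment_counter are hypothetical, and the sketch does no file locking, so two visits at the same instant could be counted as one.

```sh
#!/bin/sh
# Sketch only: a text counter kept in a file. No locking is done.
# COUNT_FILE is a hypothetical location; change it to suit your server.
COUNT_FILE=${COUNT_FILE:-${TMPDIR:-/tmp}/count.txt}

increment_counter() {
    # $1 is the file that stores the current count.
    file=$1
    count=0
    if [ -r "$file" ]; then
        read -r count < "$file" || count=0
    fi
    # Start over from zero if the file is empty or damaged.
    case "$count" in
        ''|*[!0-9]*) count=0 ;;
    esac
    count=$((count + 1))
    printf '%s\n' "$count" > "$file"
    printf '%s\n' "$count"
}

printf 'Content-type: text/plain\n\n'
increment_counter "$COUNT_FILE"
```

The counter packages mentioned in this section add the graphics and handle many visitors at once, which this sketch does not attempt.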

If you would like to have a counter on your Web site, there are several ways you can go about setting one up. If you have root access to your server, you can install a counter that is accessible by any user on the server. With this option, you will use fewer system resources than if everyone on the system has his own counter script. A nice choice for this type of script is WWW Homepage Access Counter [Counter Release 2.2] which can be found at http://www.semcor.com/~muquit/Count.html.

If you have a working CGI-bin directory, there are several counter scripts you can install for your use. By placing the script in your bin directory, you will be the only user on the system who will have access to it, but if you don't have root access on the server, then this is your best bet. One such script is HTML Access Counter - Counter 4.0 located at http://www.webtools.org/counter/.

Unfortunately, your site may be hosted on a server that is not configured for CGI use. If you find yourself in this situation, you can still have an access counter, but you will need to use one that is hosted by a remote site. Each time someone visits your site, a CGI script is executed on the remote server that exports the count information back to the client's browser. One of the most popular hosted access counters for Web sites is The Web Counter at http://www.digits.com.

There is a lot of information available about access counters on the Internet already. The FAQ - How do I set up an HTML Counter at http://pantheon.cis.yale.edu/~nakamura/counterfaq.html is an excellent source for further information. Also, if you are running a WinNT server, you can take a look at ED Counters, counters... at http://charon.assert.ee/counters.htm. If you're operating a Mac server then you can try Simple Counters at http://cy-mac.welc.cam.ac.uk/CGI-simplecounter.html for more information.

Once you have your counter set up on your site, you should take a look at Counter Digits at http://www.issi.com/people/russ/digits/digits.html. Here you will find a nice collection of images for use with counter scripts.

Fig. 23.6

Two of my favorite image sets from Digit Mania's counter archive.

Search Engines

A common stopping point on the Web is the search engine. These massive information repositories are easily searched thanks to CGI scripts that allow you to interface with them.

Some of the most well known search engines include:

Yahoo at http://www.yahoo.com/
Lycos at http://www.lycos.com/
WebCrawler at http://webcrawler.com/
WWW Yellow Pages at http://www.mcp.com/nrp/wwwyp/

For example, if you enter "search engine" into the Lycos search engine, as in Fig. 23.7, you should get back a list of hits. Each hit in the list is formatted as in Fig. 23.8.

Fig. 23.7

The Lycos search engine's front page.

Fig. 23.8

The first match of the search query "search engine".

Some of the more advanced search engines, like Lycos, will allow you to use the logical operators "and" and "or" to help widen or narrow your search. You can even control the amount of information listed for each site in the search results and the number of matches that are returned.

If one search engine fails to meet your needs, try another. No one search engine can keep a complete list of all web sites.

If your site has a large amount of information to present, then you might want to look into getting your own search engine. This allows people using your site to quickly and efficiently locate the information they need. If you feel that a search engine is what your site needs to improve its presentation of information, then you should consider the following options:

If you are a confident programmer, you can write your own search engine CGI script.
If programming is not your strong point at the moment, you can always port an existing search engine to your site from the Web. Here is a list of links to more information about some of the better freeware and shareware packages:

WILLOW at http://www.washington.edu:1180/willow/home.html
GLIMPSE at http://glimpse.cs.arizona.edu:1994/glimpsehelp.html
HIDX at http://mall.turnpike.net/~jc/hidxq.html
SWISH 1.1.1 at http://www.eit.com/software/swish/swish.html

Finally, if the previous options fail to meet your needs, you can always buy a commercially available search engine.

Interface with WAIS servers

If these search engines are not enough to satisfy your site's information distribution needs, you might want to consider implementing a version of WAIS (Wide Area Information Server, pronounced "ways") like freeWAIS on your site. One of the best features of this system is that it catalogues many more types of information than the standard HTML documents that are collected by the web wanderers for use with the standard search engines. A WAIS server keeps track of gifs and other image documents as well as several types of audio and video files. If you have a lot of information in formats other than HTML, then this is a great means of allowing clients to search your site for the information they need.

The WAIS server was originally designed to let multinational corporations and other organizations search their internal databases. Each WAIS server forwards incoming queries to the next server on a list. As the request passes along the chain of servers, the amount of collected information grows until all the server locations have been searched and one large summary document is sent back to the client.

Recently, the WAIS server has been successfully put to use on stand-alone systems. So, you do not need multiple server and database locations before you start considering a WAIS server as a means of giving clients quick and easy access to your site's information.

If you are interested in having these search capabilities on your site, consider getting a current version of freeWAIS (a version of WAIS in the public domain). For more information, you can consult the online FAQ at http://www.cis.ohio-state.edu/hypertext/faq/usenet/wais-faq/freeWAIS-sf/faq.html. Also, you should definitely take a look at the information on the WAIS homepage at http://kaos.erin.gov.au/technical/retrieval/wais/wais.html. Finally, if you would rather have a proprietary version of WAIS software, you should visit WAIS Inc.'s homepage at http://www.wais.com/ for more information. WAIS Inc. is now a part of AOL Productions, Inc.

Spiders, Robots, & WebCrawlers

As you have seen earlier, search engines are used to search vast archives of information on the Web. But how does all that information get compiled? The answer is with CGI scripts called Web wanderers, Web robots, spiders, or webcrawlers. These robots are constantly moving from server to server, site to site, methodically searching for links and pages to process.

You can think of a robot as an automated Web browser. In fact, these programs use the same protocols to access servers and retrieve Web documents that browsers do. They just do it much faster. Each time a robot moves to a new server, it proceeds to systematically archive each Web document's title and URL directory by directory. It may even note the outgoing links and use them to hunt down the next server to visit.
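To make the idea of archiving a title and noting outgoing links a little more concrete, here is a minimal sketch in Python. It works on a single document that has already been retrieved and uses only the standard html.parser module. The names PageSummary and summarize are made up for this illustration, and a real robot would also have to fetch pages, follow the links politely, and store what it finds.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects a page's title and the targets of its outgoing links."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs; keep only href values.
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html_text):
    """Return (title, links) for one document, as a robot might record them."""
    parser = PageSummary()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.links
```

Calling summarize on the text of a page gives back the two pieces of information described above: the title to archive and the list of links that could lead the robot to its next stop.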

These programs are usually written for one of three major purposes. The most obvious one is to attempt to maintain a single archive that contains information on every document on the Web. However, it is currently taking the fastest robots more than half a year to travel the entire Web. So, it appears that a complete, up-to-date archive of Web documents will become increasingly difficult to maintain. For this reason most newer robots are only looking for information on a specific topic. This helps these archives stay more current than the larger global search sites. Finally, some robots are built to synchronize mirrored sites.

For a well kept listing of all the currently known (more than 50) robots on the Internet and a nice starting point for finding more information, see Martijn Koster's site on web wanderers at:

URL Address: http://info.webcrawler.com/mak/projects/robots/robots.html

Can you write CGI scripts?

Hopefully, you now have a good idea of some of the more common uses for CGI scripts. As you can see, many of them provide helpful tools that you can incorporate into your personal Web site. If you would like to use some of these tools to make your site more dynamic, then you will need to consider a few things before you start.

Can you write CGI scripts?
Choosing a CGI scripting language.

Can you write CGI scripts?

Before you can get started writing your own CGI scripts, you need to find out if your server is specially configured to allow you to use them. The best thing to do is contact your system administrator and find out if you are allowed to run CGI scripts on the server. If you can, you also need to ask what you need to do to use them, and where you should put the scripts once they are written.

In some cases, system administrators do not allow clients to use CGI scripts because they feel they cannot afford the added security risks. In that case, you will have to find another means of making your site more interactive.

If you find that you can use CGI scripts and are using a UNIX server, then you will probably have to put your scripts into a specially configured directory, which is usually called cgibin or cgi-bin. If you are using Microsoft's Internet Server, then you will probably put your CGI programs in a directory called scripts. This allows the system administrator to configure the server to recognize that the files placed in that directory are executable. If you are using an NCSA version of HTTPD on a UNIX system, then this is done by adding a ScriptAlias line to the conf/srm.conf file on the server.
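Once you know where your scripts belong, a very small test script is a handy way to confirm that the server really does execute files in that directory. The following is only a sketch, written in Python and assuming a Python 3 interpreter is installed on the server (ask your administrator). The function name build_response and the wording of the page are invented for this example.

```python
#!/usr/bin/env python3
# The interpreter path on the line above is an assumption; yours may differ.
import sys


def build_response(message):
    """Return a complete CGI response: header, blank line, then the body."""
    header = "Content-type: text/html\n\n"
    body = (
        "<HTML><HEAD><TITLE>CGI Test</TITLE></HEAD>\n"
        "<BODY><P>" + message + "</P></BODY></HTML>\n"
    )
    return header + body


if __name__ == "__main__":
    # The server captures standard output and relays it to the browser.
    sys.stdout.write(build_response("The server ran this script."))
```

If the browser shows the source of the script instead of the short test page, the directory is most likely not configured for execution yet.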

It is important to remember that although CGI scripts are not necessarily complex, you need to have some basic understanding of the programming language you wish to use and the server you plan to run the scripts on. Poorly written scripts can easily become more trouble than they are worth. For example, you could delete entire directories of information or shut down your server if your script were to start forking off new processes in a geometric fashion.

Before starting down the road to becoming a CGI scripter, you should do the following:

Get a programming book on the scripting language you plan to learn.
Ask the network administrator of your local server how to run scripts on your system and what security features she wants you to implement in them.
Subscribe to a listserve and read the appropriate newsgroups on the language you plan to use. These are wonderful resources for programming information and good places to ask for help if you are stuck.
Find a friend who has experience programming in your scripting language and who can help you smoothly overcome some of the early hurdles you will face.

Which language should you use?

Now that you know what a CGI script is, how it works, and what it can do, the next thing you need to consider is which language you should use. You can write a CGI script in almost any language. So, if you can program in a language already, there is a good chance you can use it to write your scripts. This is usually the best way to start learning how to write CGI scripts, since you are already familiar with the basic syntax of the language. However, you still need to know which languages your Web server is configured to support.

UNIX based NCSA and CERN Web servers are by far the most common. These platforms are easily configured to support most of the major scripting languages including C, C++, JAVA, PERL, and the basic shell scripting languages like SH. On the other hand, if your Web server is using the Mac server then you might be limited to using AppleScript as your scripting language. Likewise, if you are using Windows NT server, then you might need to use Visual Basic as your scripting language. However, it is possible to configure both these systems to support other scripting languages like C and PERL, or even Pascal.

If you are interested in finding out which scripting languages your server is configured to support, you should ask your system administrator to give you a listing of what is available on your server.

Also, if you have access to a UNIX based server and can log into a shell account, then you can find out which languages your system supports by using the UNIX command "which".

If you are using the SH shell, you should see something like the following (the exact paths vary from system to system):

$ which sh

/usr/bin/sh

$ which perl5

/usr/local/bin/perl5
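If you would rather check several languages in one pass, the same lookup can be scripted. This is a small sketch in Python that relies on the standard library function shutil.which, which searches your PATH much as the which command does. The function name find_interpreters and the list of candidate names are just examples.

```python
import shutil


def find_interpreters(names):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # The candidate names below are only examples; adjust them for your server.
    for name, path in find_interpreters(["sh", "csh", "perl", "perl5"]).items():
        print(name, "->", path if path else "not found")
```

Keep in mind that finding an interpreter on your own PATH does not guarantee that the Web server is allowed to run scripts with it, so it is still worth checking with your administrator.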

Many scripting languages are freely distributable and fairly easy for an experienced administrator to install. As a last resort, you can always request that a new language be considered for addition to your local system.

If you are lucky, you may find that your server is already configured to support several CGI scripting languages. In this case, you just need to compare the strengths and weaknesses of each language you have available with the programming tasks you anticipate writing the scripts for. Once you do this, you should have a good idea of which programming language is best suited to your specific needs.

Common CGI Scripting Languages

When it comes to the CGI, anything goes. Of the vast numbers of programming languages out there, many more than you could possibly learn in a lifetime, most can work with the CGI. So, you will have to spend a little time sifting through the long list to find the one that will work best for you.

Even though there are a lot of different languages available, they tend to fall into several categories based on the way they are processed (compiled, interpreted, and compiled/interpreted) and on the logic behind how the source is written (procedural and object-oriented).

This chapter will discuss the most common scripting languages that are available for use on a UNIX server. All of the major languages presented here should eventually be available for both MacHTTPD and WinHTTPD, if they are not available already. You should note that MacHTTPD comes with AppleScript as its built-in scripting language, while WinHTTPD comes with Visual Basic.

If you would like some more information on either AppleScript or Visual Basic, you can consult the following:

For information on AppleScript books, see http://www.ultranet.com/~mfenner/applescript.html. Also, for additional information see http://www.mtt.com/theSource/mtt/appleScript.html.
For a fairly comprehensive listing of Visual Basic Resources on the Web you should take a look at http://www.qns.com/~robinson/vb/vb.html.

Shell languages are easier to learn than robust programming languages like C or PERL. At the other extreme, object-oriented languages like C++, PERL 5, and JAVA are the hardest to get used to.

Compiled Languages

Some of the available programming languages are compiled rather than being interpreted. The two most commonly used are C and C++. When using a compiled language, the program as it appears when you write it is referred to as the source code. This source code is then processed by the language's compiler into a much smaller version that is in the machine's native language and is usually referred to as object code. Once the source code is successfully compiled, the object code can be run by the server without fear of syntax errors. In this more compact form, the object code usually executes much faster than code from scripting languages that are compiled at runtime. Unfortunately, this does mean that you have to recompile the source code each time a change is made in the script.

C

One of the most popular CGI scripting languages is C. It was developed by Dennis Ritchie at Bell Labs in 1972 and popularized by the book he later wrote with Brian Kernighan. This procedural language is already familiar to a large number of programmers and thus is their scripting language of choice. As such, there are many large archives of existing C source code that you can adapt to fit your specific programming needs.

Since C is a compiled language, it must be processed into a small binary object code before it can be executed. As was mentioned earlier, this allows these scripts to execute very quickly. So, if a quick response from the script is your primary consideration for picking a scripting language, you should stick with a compiled language like C. The best use for CGI scripts coded in C is for processing large amounts of numeric information quickly and efficiently.

Unfortunately, most of the CGI scripts written today focus on complex regular expressions and string data. These types of programs can be very awkward to write in C. This is one major reason why many CGI programmers are using PERL instead.

All UNIX based servers come equipped with C, C++ and at least one shell language such as SH.

C++

Like its predecessor C, C++ (developed by Bjarne Stroustrup at AT&T) is a compiled language that executes small binary object code very quickly. However, C++ is not as similar to C as you might anticipate from the name. While C is a procedural language, C++ is part of the object-oriented paradigm. What this means is that as an object-oriented language, C++ is much more concerned with the function, interaction and reusability of its objects than it is with the actual steps it takes to get the job done.

Since C++ is object-oriented, it will take quite an adjustment if you aren't already familiar with this type of programming. So, expect a steep learning curve if you will be writing your first object-oriented source. However, if you do take the time to learn it, you will find that C++ objects are much easier to reuse and extend than source written in a procedural language.

The only other major drawback for using C++ for your CGI scripting is that there is not a lot of public domain source. Only recently have software engineers started to program object-oriented solutions for CGI scripting needs. Thus, you might have to wait awhile before you start to see large archives of code for public use. However, as time goes on, this will become much less of an issue.

A good source for more information on C++ is the Usenet group comp.lang.c++.moderated.

Interpreted Languages

Unlike C and C++, some languages are not compiled into tight binary code before they are executed. Some, like the shell language SH, are interpreted during execution. This means that any syntax errors in the script will not be detected until the program has already started to run. This, coupled with the limited power of the shell languages, means that they are not as useful for larger scripting jobs as some of the other languages dealt with in this chapter.

PERL, along with several other interpreted languages, avoids this problem by being compiled at runtime. What this means is that the PERL interpreter checks each line of code for proper syntax before the code is compiled. Then, the code is compiled and executed. However, unlike C, this doesn't result in a truly compiled object that can then be reused. PERL scripts are interpreted and compiled each time they are executed. Thus, there is no need to keep track of separate source and object files for the same script.
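The difference between checking a whole script first and discovering errors part way through can be seen in a short experiment. The sketch below uses Python rather than PERL, on the assumption that Python is available to you; like PERL, it compiles an entire script before running any of it. The function run_checked and the helper emit are invented for this demonstration.

```python
def run_checked(source):
    """Return the list of steps that ran, or None if the script failed to compile."""
    steps = []
    try:
        # The whole script is checked and compiled before anything executes.
        code = compile(source, "<script>", "exec")
    except SyntaxError:
        return None
    # 'emit' is a hypothetical helper handed to the script so that we can
    # see which statements actually ran.
    exec(code, {"emit": steps.append})
    return steps


good = 'emit("step 1")\nemit("step 2")\n'
bad = 'emit("step 1")\nemit("step 2"\n'  # missing closing parenthesis
```

Running run_checked(good) returns both steps, while run_checked(bad) returns None without having executed even the first, perfectly valid line. A script interpreted line by line would have carried out the first step before stumbling over the second.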

SH and C shell

There are several commonly available shell scripting languages, or command interpreters as they are sometimes called. The most common ones are SH and C shell. Although these are among the most important user interfaces for the UNIX environment, they are not the best choice for a CGI scripting language.

These shell languages are designed as UNIX tools and thus lack much of the power and features of true programming languages. However, they can be put to good use when writing simple, rather disposable CGI scripts or when you need a little job done in a hurry.

If you do decide to write a script using one of these languages, you should remember that they are not compiled. Rather, they are interpreted line by line, each line of code being executed before the next is read into the command interpreter. Thus, if you have any syntax errors in your script, you won't find them until the script has already executed part way. At that time, your application will crash and could cause serious problems with your system.

PERL 4.036

One of the most commonly used languages for CGI scripting is PERL 4.036. PERL, which stands for "Practical Extraction and Report Language," was developed by Larry Wall, who still maintains it. All the versions of PERL except the newest one are procedural. The newest release, version 5, is object-oriented and represents a major restructuring of the PERL language. Even so, most PERL 4 programs should run fine using PERL 5. This latest version will be discussed briefly later in this chapter.

A key feature of PERL is that it is very open ended. It doesn't confine the user to a certain rigorous set of syntax. Instead, PERL usually provides several methods of doing each task, which makes it easier to program using your own personal style. Also, PERL supports almost all the common features of C, so a C programmer can write PERL code that looks very much like the C they are used to.

Another key feature of PERL is its powerful handling of strings and regular expressions. Using the built in string manipulation functions of PERL, many scripts are easily written that would be much harder to program in C. Since the overwhelming majority of all CGI scripts handle string data, it is no wonder that so many CGI scripts are written in PERL.
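To give you a feel for the kind of string work a typical CGI script does, here is a brief sketch that decodes URL-encoded form data and applies a regular expression to one of the fields. It is written in Python only so that it can be run and tested as it stands; the same job is equally natural in PERL. The function names, the sample field names, and the deliberately loose e-mail pattern are all assumptions made for this illustration.

```python
import re
from urllib.parse import parse_qs


def parse_form(query_string):
    """Decode URL-encoded form data into a dict of single values."""
    fields = parse_qs(query_string, keep_blank_values=True)
    # parse_qs returns a list per field; keep the first value of each.
    return {name: values[0] for name, values in fields.items()}


def looks_like_email(text):
    """Very loose pattern check: something@something.something."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", text) is not None


if __name__ == "__main__":
    # A made-up query string of the sort a form submission might produce.
    form = parse_form("name=Jane+Doe&email=jane%40example.com")
    print(form["name"], looks_like_email(form["email"]))
```

Doing the same decoding by hand in C means walking the string one character at a time and managing the buffers yourself, which is exactly the awkwardness described earlier.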

Another thing to keep in mind is that PERL is completely interpreted and compiled at runtime. This means that you won't get a syntax error after the program is already running like you might programming in a shell language. At the same time, it means that you can simply make a change in your source code and it will take effect. You don't have to pre-compile your source into object code each time you make a change like you do using C.

Since PERL 4 is currently the most widely used CGI scripting language on the Web, and as it can be run on a wide variety of server types, I have chosen to use it for the majority of the CGI scripting examples used in both this and the following chapter. If you would like more information about this scripting language you should take a look at the PERL Language Home Page at http://www.perl.com/perl/index.html.

PERL 5.000

At this point, you may be asking yourself why is this guy telling me about PERL 5 when he just got finished making PERL 4 seem like the perfect CGI scripting language? Well, the answer, my friend, is simple. PERL 5 is to PERL 4 what C++ is to C. What this means is that while PERL 4 is procedural, PERL 5 is object-oriented. Also, while PERL 4 is forced mostly to go it alone, PERL 5 comes equipped to handle reusable modules along with a lot of other new features.

PACKAGE - A package is a programming context in which local variables are defined and used, as in a subroutine.

This description of the PERL 5 modules comes directly from the hypertext version of the PERL 5 manual, which can be found at http://www.phlab.missouri.edu/perl/perl5man/.

PERL Modules

In PERL 5, the notion of packages has been extended into the notion of modules. A module is a package that is defined in a library file of the same name, and is designed to be reusable. It may do this by providing a mechanism for exporting some of its symbols into the symbol table of any package using it. Or it may function as a class definition and make its semantics available implicitly through method calls on the class and its objects, without explicit exportation of any symbols. Or it can do a little of both.

For a very up-to-date list of all the PERL 5 modules, see the PERL 5 Module List at ftp://rtfm.mit.edu/pub/usenet/news.answers/perl-faq/module-list

As it stands, PERL 5 represents a total renovation of this language. Almost every line in the original code has been redone. This, coupled with the transition from a procedural to an object-oriented language with a lot of new bells and whistles, will make PERL 5 a very popular CGI scripting language for a long time to come.

For more information on this new version of PERL, see the PERL 5 WWW Page at http://www.metronet.com/1h/perlinfo/perl5.html. Or, you can subscribe to the PERL Usenet group at comp.lang.perl.

Compiled/Interpreted Languages

So far you have been given some examples of compiled and interpreted languages. Recently, though, a language has been developed that is both compiled and interpreted. This programming language is JAVA, which is first compiled into a platform independent binary bytecode. Then, when the script is executed, the pre-compiled bytecode is interpreted by the local platform into a platform-specific machine code. Thus, as long as there is a JAVA interpreter for the platform you are using, you can use any JAVA bytecode regardless of the platform it was written for. This design allows these programs to become truly platform independent. Thus, programmers will no longer have to grapple with porting their software across platforms.

JAVA

The JAVA language is being hailed on the Internet as the scripting language of the future and a possible replacement for the CGI. When Sun Microsystems first started developing JAVA, they intended to write it entirely in C++. However, as time went on, they decided that there were too many limitations within the language for it to be optimally suited for Internet programming. So, they struck out on their own. However, they have endeavored to stick closely to C++ while designing the language. As a result, JAVA is a member of the object-oriented programming paradigm and should be fairly easy for experienced C++ programmers to pick up.

The object-oriented structure of JAVA is what makes its applications modular, while its platform independence makes it very portable. JAVA was defined by Sun Microsystems in its first white paper as follows:

JAVA: A simple, object-oriented, distributed, interpreted, robust, secure, architecture-neutral, portable, high-performance, multi-threaded, and dynamic language.

If JAVA can actually live up to this description, then it might very well become the dominant scripting language on the Internet.

See "Java and JavaScript"

Finding CGI Resources

As you advance down the path to mastery (or at least proficiency) in your favorite CGI scripting language, you need to know where to look for help and the latest online information.

Listserves

My personal favorite is using listserves. These are mailing lists for groups of people who share a common interest. Each time someone posts a message to the list, everyone who is subscribed will get a copy. Then, any of the hundreds or even thousands of people who received your post may choose to answer your mail and give you the information you requested. The fastest way to find a list that is right for you is to check out L-Soft's search engine for their listserves at http://www.lsoft.com/lists/LIST_Q.html. Just pick a topic like HTML, CGI, or JAVA and you will get a series of mailing lists with information on how to subscribe to each one.

Newsgroups

If you like the idea of a listserve, but don't want your mailbox filled with mail every day, then a newsgroup may be for you. These are similar to a listserve except that you read the posts from a news spool rather than out of your inbox. Also, many newsgroup applications allow you to search the posts by subject, author, or keyword. Here is a list of some of my favorite newsgroups on CGI programming:

comp.infosystems.www.authoring.cgi
comp.lang.perl
comp.lang.c++
comp.lang.java
comp.lang.javascript

You will avoid upsetting others on listserves and newsgroups if you remember to always try to figure out problems on your own before asking for help.

Individual Archives

Another great source of online CGI information is personal Web sites. Many individuals have amassed a mountain of links to key information archives on the net for their favorite scripting language. Finding a couple of these gems can save hours of surfing the Web for information.

Beyond the CGI

As is inevitable with most technology, the CGI, for all it is worth, is already starting to show its age as new and exciting alternatives to CGI scripting are developed. In this, the final section of this chapter, I will discuss a few of these alternatives, including SSI (Server Side Includes) as well as JavaScript and Visual Basic Script.

SSI (Server Side Includes)

If you are using an NCSA server on a UNIX system, then you have access to a special feature of this server commonly referred to as Server Side Includes (SSI). If you turn on this feature of the server, the server will recognize .shtml files as HTML documents that need to be treated specially. When the server sends a .shtml file, it doesn't passively send the requested document to the browser, but rather actively parses it. This means that the server looks at the HTML document line by line as it is sending it to see if the HTML page includes any special instructions that the server should carry out while it is sending the page. Usually these instructions take one of the following forms:

Adding the current date or time.
Adding a file like a standard header or footer.
Adding the output from a script.

For example, if you have a standard footer that you need to place on every page of your Web site, with SSI you can simply place one of the following lines of code at the bottom of each document where you want the footer to appear.

<!--#include file="footer.html"-->

<!--#include virtual="/footer.html"-->

Just remember that if you use file, then you must give the path of the included file relative to the current document, and that the file must be in the same directory as the main document or in a subdirectory of it. Alternatively, you can use virtual and specify the file's virtual path, that is, the path portion of its URL on the same server. Or, if you have a script that generates a custom footer for each page, then you can include the output from that script by placing the following line where you would like the script's output to appear within the document.

<!--#exec cgi="/cgibin/footer.pl"-->

The main advantage of using SSI's within your Web pages is that they allow your documents to display current information like the date and time without the use of a CGI script. They also let you maintain a single copy of information that you would otherwise have to repeat on many pages.

However, there is one drawback of using SSI's that you should be aware of. By forcing the server to parse each document it sends to the browser, line by line, a lot of processing time is required which both slows down the server and makes the Web pages take longer to load. If a high traffic site were to parse every page that it sent out to check for SSI's, the server would very likely experience a very marked decrease in efficiency.
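To see why this parsing costs the server extra work, it helps to picture what happens to each page before it goes out. The following is a rough sketch in Python of how an include directive might be expanded; it is not how the NCSA server is actually implemented. The function expand_includes, the bracketed messages it returns, and the decision to handle only the file form of the directive are all assumptions made for this example.

```python
import re
from pathlib import Path

# Matches directives of the form <!--#include file="name"-->.
INCLUDE = re.compile(r'<!--#include\s+file="([^"]+)"\s*-->')


def expand_includes(text, base_dir):
    """Replace each include directive with the contents of the named file."""
    base = Path(base_dir).resolve()

    def replace(match):
        target = (base / match.group(1)).resolve()
        # Refuse anything outside the document's directory tree, echoing the
        # rule that 'file' may only name the same directory or a subdirectory.
        if base not in target.parents:
            return "[include refused]"
        if not target.is_file():
            return "[include failed]"
        return target.read_text()

    # Every page must be scanned in full, even if it holds no directives.
    return INCLUDE.sub(replace, text)
```

Even in this simplified form, every page has to be scanned from top to bottom and every included file has to be opened and read, which is work the server skips entirely for an ordinary HTML document.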

For a more detailed discussion of SSI's you should refer to NCSA's online SSI tutorial at http://hoohoo.ncsa.uiuc.edu/docs/tutorials/includes.html.

JavaScript

Along with the development of the new programming language JAVA that was briefly introduced earlier, JavaScript is providing Web authors with alternatives to more traditional CGI programming. By embedding the JavaScript code directly into the Web page, newer browsers like Netscape 2.0 are able to execute these scripts directly on the client's machine without the need to make a call to the server. This can greatly increase the speed at which the client gets feedback from their actions and reduce the load on the Web server at the same time. It is hoped by many that this new scripting language will reduce the heavy server load imposed by many traditional CGI programs by moving much of the processing overhead to the client's machine.

JavaScript is a simpler version of the object based JAVA language that is interpreted at runtime much like PERL rather than having to be compiled before it can be executed. Although JavaScript is a simpler version of the JAVA language, it still retains much of its power. Also, JavaScripts can be written to recognize and react to such things as mouse clicks, form field data, and the use of page navigation.

The complete JavaScript Authoring Guide by Netscape can be found at http://cgi.netscape.com/eng/mozilla/Gold/handbook/javascript/index.html and is an excellent place to start your exploration of this alternative to CGI programming.

Visual Basic (VB) Scripting

Another very promising alternative to CGI will be Visual Basic Script or VBScript which is a cross-platform subset of Visual Basic 4.0 by Microsoft. This scripting language will be in direct competition with JavaScript and will provide much the same functionality as a similar scripting language embedded within the HTML pages themselves.

Like JavaScript, VBScript's major function will be to reduce server overhead by moving the processing load to the client's machine and in the process greatly speed up the response to client's actions. VBScripts will be able to link and automate many types of objects including OLE objects and JAVA applets. Currently, Microsoft plans for their VBScripting language to be fully implemented in the 3.0 release of Microsoft Internet Explorer.

You can find the latest information on VBScript from the Visual Basic Microsoft Web site at http://www.microsoft.com/VBASIC/vbscript/vbscript.htm.


Internet & New Technologies Home Page - Que Home Page For technical support for our books and software contact support@mcp.com © 1996, Que Corporation
