NAME
    page-stats.pl - Check WWW page accesses (v1.3)

SYNOPSIS
    page-stats.pl -h
    page-stats.pl [ -b ] [ -i identfile ] [ -l logfile ]

DESCRIPTION
    page-stats.pl examines the access_log of an http daemon and searches
    it for occurrences of certain references. These references are then
    counted and written to an HTML file that is ready to be displayed to
    the outside world as a "Page Statistics" page. Each page listed can
    be selected from the statistics page.

    The identfile contains the references that should be counted. A line
    in this file should be in the following format:

        URL@title@reference[@reference...]

    which could look like this:

        ~gnu/index.html@Gnu's pages@/gnu.html@~gnu*

    Comments are allowed, and should be preceded by a "#". Everything
    following that character will be ignored.

    Each line should contain at least the following:

    URL The URL of the page, as it should be referenced from the "Page
        Statistics" page.

    title
        The title of the page, as you want visitors to see it. Note that
        leading spaces are significant, so it is possible to use
        indentation for different levels of documents.

    reference
        A reference of how the page might be accessed. For instance, if
        a directory contains a file index.html, it can be accessed by
        leaving out the "index.html" part, or even the "/" before it.
        If this is the case, list all references one after another,
        separated by "@". You may use a wildcard "*" at the end of a
        string to match only the beginning of a URL.

    The order of the lines in the identfile matters. Only the first
    match will be taken into account. Be careful when using wildcards,
    as they might filter out hits for lines below. Take a look at the
    (faulty) example below:

        # Wrong; second line will never be reached!
        ~gnu/index.html@Gnu's pages@~gnu*
        ~gnu/info/index.html@Gnu's info files@~gnu/info*

    The first line matches every URL that begins with "~gnu", which
    automatically means that URLs that would match "~gnu/info*" are
    matched by it as well.

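    The first-match rule and the shadowing effect of the faulty example
    can be reproduced with a small sketch. It is written in Python for
    illustration and is not the Perl code of page-stats.pl. The function
    names (parse_ident, matches, count_hits) and the sample request
    paths are hypothetical. How page-stats.pl normalises a logged
    request before comparing it is not described here, so the sketch
    assumes an exact comparison for plain references and a prefix
    comparison for references ending in "*".

```python
def parse_ident(text):
    """Parse ident-file text into (url, title, references) tuples, in file order."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # everything after "#" is a comment
        fields = line.split("@")
        if len(fields) < 3:                   # blank, comment-only or incomplete line
            continue
        url, title, refs = fields[0], fields[1], fields[2:]
        entries.append((url, title, refs))    # leading spaces in title are kept
    return entries


def matches(reference, path):
    """Assumed rule: a trailing "*" means prefix match, otherwise exact match."""
    if reference.endswith("*"):
        return path.startswith(reference[:-1])
    return path == reference


def count_hits(entries, paths):
    """Count each path for the first entry that matches it, then stop looking."""
    counts = {url: 0 for url, _, _ in entries}
    for path in paths:
        for url, _, refs in entries:
            if any(matches(ref, path) for ref in refs):
                counts[url] += 1
                break                         # only the first match is counted
    return counts


FAULTY = """\
# Wrong; second line will never be reached!
~gnu/index.html@Gnu's pages@~gnu*
~gnu/info/index.html@Gnu's info files@~gnu/info*
"""

# Made-up request paths, already reduced to the form used in the ident file.
requests = ["~gnu/index.html", "~gnu/info/emacs.html", "~gnu/info/index.html"]

counts = count_hits(parse_ident(FAULTY), requests)
print(counts)  # {'~gnu/index.html': 3, '~gnu/info/index.html': 0}
```

    All three requests are counted for the first line, because "~gnu*"
    is tried first and already matches them.
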
    Place the second line above the first to solve the problem:

        # Right!
        ~gnu/info/index.html@Gnu's info files@~gnu/info*
        ~gnu/index.html@Gnu's pages@~gnu*

    Currently page-stats.pl skips lines in the access_log that contain
    references to ".gif", ".jpg" or ".jpeg" files, even if you specify
    matching URLs. If you need the program to handle references to
    those pictures, comment out the lines as indicated in the code.

    Note that once the first matching reference is found, the search
    for matches ends. Only the first matching page is recognized, and
    only its counter is increased.

    The HTML "Page Statistics" file is created from two files: the
    ident file with the references to check, and a source file that
    contains the basic HTML page as desired. The name of the source
    file is determined by replacing the mandatory ".ident" ending of
    the ident file with ".source". The HTML file that is created is
    named in the same way, ending in ".html".

    It is possible to use certain variables in the source file. These
    variables are replaced by page-stats.pl as it processes the file.

    $date
        The current date and time will be inserted for this variable.

    $firstrequest
        The date and time of the first request logged in the access_log
        will be inserted for this variable.

    $lastrequest
        The date and time of the last request logged in the access_log
        will be inserted for this variable.

    $list
        This will be replaced by the complete list of references and
        their number of hits.

    $topN
        This will insert a sorted list of the N most visited pages,
        where N can be any number. Of course, setting a number greater
        than the number of references is pointless. There must be no
        space between "$top" and the number.

OPTIONS
    -b  Benchmark; print the user and system times used when finished.

    -h  Displays this manual page.

    -i identfile
        Specify the file that determines which references to look for
        in the logfile. This defaults to 'page-stats.ident'.

    -l logfile
        Specify the access_log of the http daemon.
        The default location is '/usr/local/httpd/logs/access_log'.

FILES
    access_log (generated by httpd)
    .ident
    .source (optional)
    .html (generated by page-stats.pl)

SEE ALSO
    httpd(1).

    http://www.sci.kun.nl/thalia/guide/#page-stats
        For the latest version.

    http://www.sci.kun.nl/thalia/page-stats/
        For a working example.

CHANGES
    03-01-1995: (v1.0)
        First draft of the program.

    03-17-1995: (v1.1)
        Added 'total number of requests' at the bottom of the page.

    05-26-1995: (v1.2)
        Added '$topN' and '$list'; juggled with the code. Improved
        performance by skipping images in access_log. Allowed comments
        in the ident file. Also moved the external README into the
        code.

    07-17-1995: (v1.3)
        You can now use wildcards to define URLs to recognize. URLs
        are now administered in arrays instead of strings.

BUGS
    If the access_log is big, and there are many references to check,
    this program can take very long to complete. It is recommended
    that both the size of the access_log and the number of references
    are kept to acceptable levels.

    The program might not work because the path to Perl in the first
    line of page-stats.pl is wrong. Check the path by running 'which
    perl' at your Unix prompt. If it is not correct, you will have to
    edit the first line.

AUTHOR
    Mark Koenen, changes by Patrick Atoon
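
EXAMPLES
    The file naming and the source-file variables described under
    DESCRIPTION can be illustrated with a short sketch. It is written
    in Python and is not the Perl code of page-stats.pl. The function
    names (derived_names, render), the <li> markup of the generated
    list, the tie-breaking order in $topN, and all sample counts and
    dates are assumptions made for the illustration.

```python
import html
import re

# One pattern for all variables; "$top" only counts when digits follow directly.
VARIABLE = re.compile(r"\$(top(\d+)|list|firstrequest|lastrequest|date)")


def derived_names(ident_path):
    """Return the .source and .html names that belong to an .ident file."""
    if not ident_path.endswith(".ident"):
        raise ValueError("the ident file name must end in .ident")
    stem = ident_path[: -len(".ident")]
    return stem + ".source", stem + ".html"


def render(source_text, counts, titles, values):
    """Replace $date, $firstrequest, $lastrequest, $list and $topN in one pass."""

    def rows(items):
        # Hypothetical markup: one list item per page with its hit count.
        return "\n".join(
            f'<li><a href="{url}">{html.escape(titles[url], quote=False)}</a>: {hits}</li>'
            for url, hits in items
        )

    def replace(match):
        if match.group(2) is not None:        # $topN
            ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
            return rows(ranked[: int(match.group(2))])
        if match.group(1) == "list":          # $list, in ident-file order
            return rows(counts.items())
        return values[match.group(1)]         # $date, $firstrequest, $lastrequest

    return VARIABLE.sub(replace, source_text)


source_name, html_name = derived_names("page-stats.ident")

# Made-up sample data, not taken from a real access_log.
counts = {"~gnu/index.html": 1, "~gnu/info/index.html": 2}
titles = {"~gnu/index.html": "Gnu's pages", "~gnu/info/index.html": "Gnu's info files"}
values = {
    "date": "17/Jul/1995:12:00:00",
    "firstrequest": "01/Jul/1995:00:00:05",
    "lastrequest": "17/Jul/1995:11:59:58",
}

page = render("Updated $date\n$top1\n", counts, titles, values)
print(source_name, html_name)
print(page)
```

    Handling every variable in a single pass means that text inserted
    for one variable is never scanned again for further variables, and
    "$top 1" (with a space) is left untouched.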