degraeve.com


Features
BloodyIcons.com
T-Shirts

Fun
Image of the Day

CGI & Stuff
GIF2TXT ««
TXT2GIF
Biorhythms
Calendar
Crapola Web Translator
H4x0r Translator
Hangman
Horoscope (daily)
Magic 8 Ball
Mastermind
Rainbow Words ««
ROT13
Subwords!
Weather Forecasts
Webutil
Word Association

Downloads
AIM Buddy Icons
Screen Savers
Star Wars Costumes

Points of Interest
Boston Rumors
Frank's Icons
Make Buddy Icons
GeorgeWGirls.com
SUV Interiors

Reference
ASCII Table
Color Picker
Jedi Robe Pattern ««
Recipes
Special Characters
URL Encoded Chars

Contact
Email Me
Resume

Misc
Archive
Dev
Links
Live cam

Tutorial02: Parsing HTML with Perl

» Download the source code

» Comment on this tutorial

#####################################################################
# Author:     Steve DeGraeve
# URL:        www.degraeve.com
#
# What?
#   This is a stripped down version of how to retrieve and then 
#   parse through an HTML document.
#
#   The script breaks the returned HTML into the HTML code and 
#   HTML content.  This is to facilitate translating only the 
#   words in the content of a web page.  You wouldn't want to 
#   replace every instance of 'img' in a page  because that 
#   would destroy all of your img tags.
#
#   It uses the LWP library that can be found at:
#   http://www.perl.com/CPAN-local/modules/by-module/LWP/
#   It makes retrieving the HTML content MUCH easier.
#
#   All this script does is take the file contents in and spit it
#   back out.  It doesn't fix urls, or paths to images.  
#   i.e.: relative images and links won't work.  Fixing these is 
#   the hard part  --  and the FUN part.
#
#   For a complete working copy, see:
#   http://www.degraeve.com/translator.shtml
#
# Associated Files?
#   stripped.cgi  (this file)
#   test.htm
#
#####################################################################

use LWP::Simple;

&ReadParse(*input);
$url = $input{'url'};
$content = get($url);  # RETRIEVE THE HTML DOCUMENT

print "Content-type: text/html\n\n";   

$len = length($content);
$html = "0";          # THE FLAG
$htmlcode = "";       # THE HTML CODE
$htmlcontent = "";    # THE CONTENT OF THE PAGE

$i = 0;
for ($i = 0; $i < $len; $i++){
  
  $char = substr($content, $i, 1);

  if ($char eq "<"){
    # YOU CAN TRANSLATE THIS BUFFER :)
    print $htmlcontent;   
    $htmlcontent = "";
    $html = "1";
  }

  if ($html eq "1"){
    $htmlcode = $htmlcode . $char;
  }

  if ($html eq "0"){
    $htmlcontent = $htmlcontent . $char ;
  }

  if ($char eq ">"){
    $html = "0";
    # YOU CAN TRANSLATE THIS BUFFER :)
    print $htmlcode;   
    $htmlcode = "";
  }
  
}


##########################################################################
#    THE FOLLOWING IS TO GET THE PARAMS PASSED 
#    TO THE SCRIPT THRU THE GET/POST METHODS
#    I DIDN'T WRITE THIS PART...
#

sub ReadParse {
  local (*in) = @_ if @_;
  local ($i, $key, $val);

  # Read in text
  if (&MethGet) {
    $in = $ENV{'QUERY_STRING'};
  } elsif (&MethPost) {
    read(STDIN,$in,$ENV{'CONTENT_LENGTH'});
  }

  @in = split(/[&;]/,$in);

  foreach $i (0 .. $#in) {
    # Convert plus's to spaces
    $in[$i] =~ s/\+/ /g;
    # Split into key and value.
    ($key, $val) = split(/=/,$in[$i],2); # splits on the first =.

    # Convert %XX from hex numbers to alphanumeric
    $key =~ s/%(..)/pack("c",hex($1))/ge;
    $val =~ s/%(..)/pack("c",hex($1))/ge;

    # Associate key and value
    $in{$key} .= "\0" if (defined($in{$key})); # \0 is the multiple separator
    $in{$key} .= $val;
# Get rid of any weird characters.
    $in{$key} =~ s/\*|\$|\<|\>|\#|\%//gi;
        }

  return scalar(@in);
}

sub MethGet {
  return ($ENV{'REQUEST_METHOD'} eq "GET");
}

sub MethPost {
  return ($ENV{'REQUEST_METHOD'} eq "POST");
}