Encoding and Decoding URLs via perl (including decimal to hex conversion)
Version history:
5/24/00: Thanks to 'campbeln' for improving the script - catching a couple small bugs - and turning this into a resource for CPAN.
12/00: Jesús Quiroga provided me with an internationalization of the encoding that handles a wider variety of accented and related characters. Thanks!
11/7/01: Dan Black sent in a very nifty compression of the whole routine into a single line, which I've incorporated below. Thanks!
Wednesday, November 7, 2001
I was trying to do something extraordinarily simple - so simple, my pea brain obviously couldn't figure it out. I wanted to find every instance of a restricted character in a prospective URL and turn it into the "url encoded" equivalent inside a perl script.
Restricted characters include most punctuation marks; to transmit these as part of a URL without causing an error or the wrong interpretation on the receiving end, you need to convert each character into its ASCII code. (ASCII is a long-time standard for numbering letters, control characters, and other symbols.)
The character code gets represented in the URL as a percent sign followed by the two-digit hexadecimal (base 16) number for the ASCII code. For instance, an exclamation point is decimal 33 in ASCII, or hex 21. To include this in a URL, you use %21. Spaces can be represented as plus signs (+) or %20 (ASCII 32).
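To make the arithmetic concrete, here is a minimal sketch of encoding one character at a time. The helper name percent_encode_char is made up for this illustration; it is not part of perl or of the scripts on this page.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this example): look up the
# character's ASCII code with ord(), then format that code as a percent
# sign followed by two uppercase hex digits.
sub percent_encode_char {
    my ($char) = @_;
    return sprintf("%%%02X", ord($char));
}

for my $char ("!", " ", "&") {
    printf "'%s' is decimal %d, encoded as %s\n",
        $char, ord($char), percent_encode_char($char);
}
# '!' is decimal 33, encoded as %21
# ' ' is decimal 32, encoded as %20
# '&' is decimal 38, encoded as %26
```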
You'd think this would be simple, right? But in my search of the Web and perl documentation, I found a lot of information on decoding URLs and turning hex into decimal, but not the reverse of either.
For instance, if you want to decode a URL, you use a very simple search pattern:
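sub URLDecode {
    my $theURL = $_[0];
    # A plus sign stands for a space.
    $theURL =~ tr/+/ /;
    # Turn each %XX escape back into the character with that hex code.
    $theURL =~ s/%([a-fA-F0-9]{2})/chr(hex($1))/eg;
    # Strip HTML comments (greedy: removes everything from the first
    # <!-- to the last -->, across line breaks).
    $theURL =~ s/<!--.*-->//gs;
    return $theURL;
}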
This pattern takes the hex codes and decodes them back into real characters. The function hex() turns a string of hex digits into a decimal number; there is no dec() that does the reverse in perl. The "e" at the end of the regexp means "evaluate the replacement as a perl expression." (The last substitution strips out anything that looks like an HTML comment.)
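As a quick illustration of hex() and the "e" modifier working together, here is a small standalone sketch. The sample string is made up for the example.

```perl
use strict;
use warnings;

# Sample input invented for this example.
my $encoded = "Hello%2C+world%21";

# hex() reads a string of hex digits and returns the number it represents.
my $code = hex("21");

# Copy the string, then turn plus signs back into spaces.
(my $decoded = $encoded) =~ tr/+/ /;

# With /e, the replacement side is run as perl code for every match,
# so each %XX becomes chr(hex("XX")).
$decoded =~ s/%([a-fA-F0-9]{2})/chr(hex($1))/eg;

print "$code\n";       # 33
print "$decoded\n";    # Hello, world!
```

Without the "e", the replacement would be inserted as the literal text chr(hex(...)) rather than being evaluated.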
After a lot of hunting around and installing the URI module for perl - which I could never get to do proper encoding despite man page instructions - I finally figured out how to do this myself:
Here's the revised version as of 11/7/01, incorporating the help noted at the top of this page:
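sub URLEncode {
    my $theURL = $_[0];
    # Replace every non-word character (anything but letters, digits, and
    # underscore) with a percent sign and its two-digit uppercase hex code.
    $theURL =~ s/([\W])/"%" . uc(sprintf("%2.2x",ord($1)))/eg;
    return $theURL;
}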
The missing piece was the sprintf formatting. The string "%x" means, "take the input and turn it into a hexadecimal character string." The ord() function converts a character into its ASCII code in decimal; the %2.2x format turns that into a two-digit hex number, padded with a leading zero if needed.
The reason there's no dec() function in perl is, ostensibly, because %x exists in printf. However, given perl's lifeblood of multiple ways to accomplish everything, it's surprising that a simple feature exists in one specific way that's hard to find.
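For anyone who wants the decimal-to-hex conversion on its own, here is a minimal sketch. The dec2hex name is made up for this example; it is only a thin wrapper around sprintf, not a perl built-in.

```perl
use strict;
use warnings;

# Hypothetical dec2hex() helper (not a perl built-in): the reverse of hex().
sub dec2hex {
    my ($number) = @_;
    return sprintf("%x", $number);
}

print dec2hex(255), "\n";             # ff
print sprintf("%2.2x", 10), "\n";     # 0a  (padded to two digits)
print sprintf("%02X", 10), "\n";      # 0A  (same idea, uppercase digits)
print hex(dec2hex(33)), "\n";         # 33  (round trip back to decimal)
```

The %02X form does the zero padding and the uppercasing in one step, so it could stand in for the uc(sprintf("%2.2x", ...)) combination.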
I hope this page helps someone save some time and sanity.
(For historical reasons, the original script looked like this:
sub URLEncode {
    my $theURL = $_[0];
    my $MetaChars = quotemeta( ';,/?\|=+)(*&^%$#@!~`:');
    $theURL =~ s/([$MetaChars\"\'\x80-\xFF])/"%" . uc(sprintf("%2.2x", ord($1)))/eg;
    $theURL =~ s/ /\+/g;
    return $theURL;
}
)
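To see how the two encoders differ in practice, here is a rough side-by-side sketch. The subs are renamed URLEncodeOriginal and URLEncodeRevised (names invented here so both can live in one file), and the sample string is made up.

```perl
use strict;
use warnings;

# The original script from this page, renamed for the comparison.
sub URLEncodeOriginal {
    my $theURL = $_[0];
    my $MetaChars = quotemeta(';,/?\|=+)(*&^%$#@!~`:');
    $theURL =~ s/([$MetaChars\"\'\x80-\xFF])/"%" . uc(sprintf("%2.2x", ord($1)))/eg;
    $theURL =~ s/ /\+/g;
    return $theURL;
}

# The revised one-line version from this page, renamed for the comparison.
sub URLEncodeRevised {
    my $theURL = $_[0];
    $theURL =~ s/([\W])/"%" . uc(sprintf("%2.2x", ord($1)))/eg;
    return $theURL;
}

my $sample = "a b&c=d-e.txt";    # sample input invented for this example

print URLEncodeOriginal($sample), "\n";    # a+b%26c%3Dd-e.txt
print URLEncodeRevised($sample), "\n";     # a%20b%26c%3Dd%2De%2Etxt
```

The original encodes only a listed set of characters and turns spaces into plus signs; the revised version encodes every non-word character, so the hyphen and period get converted too and spaces come out as %20.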
Copyright ©1997-2004 Glenn Fleishman except as noted otherwise. All rights reserved. For permission to reprint, contact Glenn Fleishman at glenn at glennf.com. Replace the "at" with an @.