How can I generate URL slugs in Perl?
Web frameworks such as Rails and Django have built-in support for "slugs", which are used to generate readable and SEO-friendly URLs:
- Slugs in Rails
- Slugs in Django
A slug string typically contains only the characters a-z, 0-9 and -, and can hence be written without URL-escaping (think "foo%20bar").
I'm looking for a Perl slug function that, given any valid Unicode string, will return a slug representation (a-z, 0-9 and -).
A super trivial slug function would be something along the lines of:
$input = lc($input);
$input =~ s/[^a-z0-9-]//g;
However, this implementation would not handle internationalization and accents (I want ë to become e). One way around this would be to enumerate all special cases, but that would not be very elegant. I'm looking for something more well thought out and general.
My question:
- What is the most general/practical way to generate Django/Rails type slugs in Perl? This is how I solved the same problem in Java.
The slugify filter currently used in Django translates (roughly) to the following Perl code:
use Unicode::Normalize;
sub slugify($) {
my ($input) = @_;
$input = NFKD($input); # Normalize (decompose) the Unicode string
$input =~ tr/\000-\177//cd; # Strip non-ASCII characters (>127)
$input =~ s/[^\w\s-]//g; # Remove all characters that are not word characters (includes _), spaces, or hyphens
$input =~ s/^\s+|\s+$//g; # Trim whitespace from both ends
$input = lc($input);
$input =~ s/[-\s]+/-/g; # Replace all occurrences of spaces and hyphens with a single hyphen
return $input;
}
Since you also want to change accented characters to unaccented ones, throwing in a call to unidecode (defined in Text::Unidecode) before stripping the non-ASCII characters seems to be your best bet (as pointed out by phaylon).
In that case, the function could look like:
use Unicode::Normalize;
use Text::Unidecode;
sub slugify_unidecode($) {
my ($input) = @_;
$input = NFC($input); # Normalize (recompose) the Unicode string
$input = unidecode($input); # Convert non-ASCII characters to closest equivalents
$input =~ s/[^\w\s-]//g; # Remove all characters that are not word characters (includes _), spaces, or hyphens
$input =~ s/^\s+|\s+$//g; # Trim whitespace from both ends
$input = lc($input);
$input =~ s/[-\s]+/-/g; # Replace all occurrences of spaces and hyphens with a single hyphen
return $input;
}
The former (slugify) works well for strings that are primarily ASCII, but falls short when the entire string consists of non-ASCII characters: they all get stripped out, leaving you with an empty string.
Sample output:
string        | slugify       | slugify_unidecode
-------------------------------------------------
hello world   | hello-world   | hello-world
北亰          |               | bei-jing
liberté       | liberta       | liberte
Note how 北亰 gets slugified to nothing with the Django-inspired implementation. Note also the difference the NFC normalization makes: liberté becomes 'liberta' with NFKD after stripping out the second part of the decomposed character, but would become 'libert' after stripping out the re-assembled 'é' with NFC.
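To see how the normalization form interacts with the "strip non-ASCII" step, here is a minimal sketch that uses only core modules. The helper name strip_non_ascii is hypothetical and exists only for this illustration. It assumes the input is a decoded character string (here via `use utf8`), and it also shows what happens when the same text arrives as undecoded UTF-8 bytes.

```perl
use strict;
use warnings;
use utf8;                                  # literals below are decoded characters
use Unicode::Normalize qw(NFKD NFC);
use Encode qw(encode);

# Hypothetical helper: apply a normalization function, then delete
# every character outside the ASCII range (same tr/// as in slugify).
sub strip_non_ascii {
    my ($normalize, $text) = @_;
    my $out = $normalize->($text);
    $out =~ tr/\000-\177//cd;
    return $out;
}

my $word  = "liberté";                     # é is the single code point U+00E9
my $bytes = encode('UTF-8', $word);        # the same text as raw UTF-8 bytes

# NFKD splits é into "e" + U+0301; only the combining accent is dropped.
print strip_non_ascii(\&NFKD, $word), "\n";    # liberte

# NFC keeps é as one code point, so the whole character is dropped.
print strip_non_ascii(\&NFC, $word), "\n";     # libert

# Undecoded bytes are treated as Latin-1: 0xC3 is "Ã" (A + U+0303), 0xA9 is "©".
print strip_non_ascii(\&NFKD, $bytes), "\n";   # libertA
```

The last case lowercases to 'liberta', so if you see that result it may be worth checking whether the input was decoded (for example with Encode's decode) before it reached the slug function.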
Are you looking for something like Text::Unidecode?
String::Dirify is used for making slugs in the blogging software Movable Type/Melody.
Adding Text::Unaccent to the beginning of the chain looks like it will do what you want.
The most turn-key solution is Text::Slugify, which does what you need. It's a trivial amount of code that provides a slugify function for you.
It relies on Text::Unaccent::PurePerl to remove accents from characters.