Using Perl and Regular Expressions to Process HTML Files - Part 1
Like many web content authors, over the past few years I've had many occasions when I've needed to clean up a bunch of HTML files that have been generated by a word processor or publishing package. Initially, I used to clean up the files manually, opening each one in turn, and making the same set of updates to each one. This works fine when you only have a few files to fix, but when you have hundreds or even thousands to do, you can very quickly be looking at weeks or even months of work. A few years ago someone put me on to the idea of using Perl and regular expressions to perform this 'cleaning up' process.
"Why write an article about Perl and regular expressions?" I hear you say. Well, that's a good point. After all, the web is full of tutorials on Perl and regular expressions. When I was trying to work out how to process HTML files, though, I struggled to find tutorials that met my criteria. I'm not saying they don't exist, I just couldn't find them. Sure, I could find tutorials that explained everything I needed to know about regular expressions, and I could find plenty of tutorials about how to program in Perl, and even how to use regular expressions within Perl scripts. What I couldn't find was a tutorial that explained how to open one or more HTML or text files, make updates to those files using regular expressions, and then save and close the files.
The Goal
When converting documents to HTML, the goal is always a seamless transition from the source document (e.g. a word processing document) to HTML. The last thing you need is for your content authors to spend hours, or even days, fixing messy HTML code after the conversion.
Many applications offer excellent tools for converting documents to HTML and, in combination with a well-designed cascading style sheet (CSS), can often produce perfect results. Sometimes, though, there are little bits of HTML code that are a bit messy, normally caused by authors not applying paragraph tags or styles correctly in the source document.
Why Perl?
Because, let's face it, all HTML files are text files, and handling text is what Perl does best, which makes it a good language to use for this task. Perl's regular expressions, which have become the de facto standard, can be used to search for and replace bits of text or code in a file.
What is Perl?
Perl (sometimes expanded as "Practical Extraction and Report Language") is a general purpose programming language, which means it can be used for a very wide range of programming tasks. Having said that, Perl is very good at doing certain things, and not so good at others. Although you could do it, you wouldn't normally develop a user interface in Perl, as it would be much easier to use a language like Visual Basic to do this. What Perl is really good at is processing text. This makes it a great choice for manipulating HTML files.
What are regular expressions?
A regular expression is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are not unique to Perl - you can use them in many languages, including PHP and JavaScript - but Perl handles them better than most other languages.
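To give a flavour of this before the full script in part 2, here is a minimal sketch of a regular expression search and replace in Perl. The sample HTML string and the variable names are made up for illustration, and the patterns are only an assumption about the sort of mess a word processor might leave behind. Treat it as a rough idea of the technique rather than a finished cleanup routine.

```perl
use strict;
use warnings;

# Hypothetical sample of messy word-processor output (not from a real file).
my $html = '<P CLASS="MsoNormal"><FONT FACE="Arial">Hello world</FONT></P>';

my $cleaned = $html;

# Remove opening and closing font tags, whatever their attributes or case.
$cleaned =~ s{</?font\b[^>]*>}{}gi;

# Reduce any opening paragraph tag to a plain lowercase <p>.
$cleaned =~ s{<p\b[^>]*>}{<p>}gi;

# Lowercase the closing paragraph tag to match.
$cleaned =~ s{</p>}{</p>}gi;

print "$cleaned\n";   # prints <p>Hello world</p>
```

The s{pattern}{replacement} form is Perl's substitution operator. The g modifier replaces every match rather than just the first, and i makes the match case-insensitive. Patterns like these work well for predictable, repetitive cleanup, but they do not understand HTML structure, so they are best kept simple and checked against a few sample files first.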
In part 2, we'll look at our first example Perl script.