We were working on a project recently that required us to refine some 3rd-party code. Among other things, we needed to add alt attributes to any images which were missing them. (Missing alt attributes are a bane of screen readers, which then resort to reading an image’s filename instead; just imagine hearing “graphic shim dot gif graphic shim dot gif graphic shim dot gif graphic shim dot gif …”.) I have an editor which supports extended file searches with regular expressions, so a regular expression seemed like a sensible path to take.
In case you’re not familiar with them, regular expressions are “special text strings for describing a search pattern”. Remember the asterisk operator from the DOS days? Supposing that you wanted to search a directory for PNG files that started with “candy”, you could type “dir candy*.png”. Now, that’s not an actual regular expression, but regular expressions are along those lines.
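To make the wildcard comparison concrete, here is a small Python sketch. The filenames are made up for illustration, and the regex “candy.*\.png” is just one plausible equivalent of the wildcard, not something taken from the project described here.

```python
import fnmatch
import re

# Hypothetical filenames, invented for this illustration.
filenames = ["candy.png", "candy_cane.png", "candy.gif", "cotton_candy.png"]

# DOS-style wildcard: "*" stands for any run of characters.
wildcard_hits = [name for name in filenames if fnmatch.fnmatchcase(name, "candy*.png")]

# Roughly equivalent regular expression: ".*" is "any run of characters",
# and the dot before "png" is escaped so that it matches a literal dot.
regex_hits = [name for name in filenames if re.fullmatch(r"candy.*\.png", name)]

print(wildcard_hits)
print(regex_hits)
```

Both lists should contain the same two files, which shows how the wildcard’s single “*” corresponds to the regex’s “.*”.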
I’ve used regular expressions before, here and there, but I was a bit stumped by this one. I could get as far as devising a regular expression to search for image tags, but I got a bit stuck after that. For what it’s worth, the regular expression for an image tag (assuming HTML 4.x, which is what we were dealing with this time) would be “<img[^>]+>”. Granted, that doesn’t definitively find only well-formed image tags; there’re plenty of ways to “fake out” that regex if you really wanted to. However, even though the code we were dealing with was largely invalid, it wasn’t maliciously munged.
If you’re curious about that previous regular expression for img tags, here’s how it breaks down:
- “<img”: The less-than character has no special meaning in regular expressions, so this literally searches for “<img”.
- “[^>]”: Brackets denote “choose any one of these characters” while the caret (“^”) negates that. So, “[^>]” means “any character which isn’t a greater-than”.
- “+”: A plus sign means “one or more of the preceding”, which in this case is “any character but greater-than”. Put together, this extends the string matching right up to the greater-than sign at the end of the img tag.
- “>”: Like the less-than character, the greater-than character has no special meaning in regular expressions, either. So, this just marks the end of the image tag.
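Here is a minimal Python sketch of the pattern just broken down, “<img[^>]+>”, run against a line of sample markup. The markup and the variable names are invented for illustration.

```python
import re

# "<img", then one or more characters that are not ">", then the closing ">".
IMG_TAG = re.compile(r"<img[^>]+>")

# Hypothetical sample markup, made up for this illustration.
sample_html = (
    '<p><img src="shim.gif" width="1" height="1">'
    '<b>Hi</b><img src="logo.gif" alt="Logo"></p>'
)

# findall returns every non-overlapping match, in order of appearance.
img_tags = IMG_TAG.findall(sample_html)
for tag in img_tags:
    print(tag)
```

Note that this sketch only finds lowercase “<img”. If the markup may contain “<IMG”, passing re.IGNORECASE to re.compile is one way to cover that.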
That much (searching for an image tag) I had working perfectly. But I needed a “not” operator in there, a means of saying “search for img tags without the string ‘alt’ in the middle”; however, as much as I looked around on the Interweb, I couldn’t find such an operator. (Sure, the caret might seem tempting, but that can only act against single character matches, as far as I know.) Then it dawned on me that instead of creating a “blacklist” of attributes not to find, I could create a “whitelist” of allowable attributes (excluding “alt” from the list, in this case).
Here’s the regex I came up with for that: “<img\s+((width|height|border|class|id|src|usemap|hspace|vspace)="[^"]+"\s*)+>”. And, to save you the suspense, it worked out pretty well for us.
Here’s how that one breaks down:
- “<img\s+”: Similar to the first regular expression above (for just finding image tags), this searches for “<img” followed by one or more (“+”) characters of whitespace (“\s”).
- “( … )+”: Once again, the plus sign signifies “one or more of the preceding”, which here means one or more attribute/value pairs (the bits inside the parentheses, which I’ll go over next).
- “width|height|border|…”: The vertical bar (“|”) just means “or”, so this allows for the img’s attributes to be “width” or “height” or “border” (and so on).
- “=”: The equals sign doesn’t have a special meaning in regular expressions, so this literally signifies an equals sign (between the attribute and its value, in this case).
- “"[^"]+”: Quotation marks don’t have a special meaning within regular expressions, either, so this part searches for a quotation mark (") followed by one or more (+) characters which aren’t quotation marks.
- “"”: And, this is just the closing quotation mark. Together with the previous bit (“"[^"]+”), this set matches the value assigned to an attribute (such as the "42" part of the attribute/value pair height="42").
- “\s*”: Once again, “\s” matches any whitespace character, and the asterisk (“*”) specifies zero or more matches. This accounts for whitespace between attribute/value pairs.
- “>”: Finally, the “>” denotes the end of the img tag.
So, essentially, all of this together means “find img tags for which every attribute appears on this list: width, height, border, class, id, src, usemap, hspace, vspace”. Because “alt” isn’t included on our “whitelist”, images which have an alt attribute don’t match our regular expression. However, images which are missing an alt attribute easily match this regex.
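The whitelist idea can be sketched in Python as follows. The pattern here is assembled from the pieces described in the breakdown above, so treat it as a reconstruction rather than a verbatim copy of what we ran; the sample tags are made up.

```python
import re

# Whitelisted attribute names; "alt" is deliberately left out.
ALLOWED = "width|height|border|class|id|src|usemap|hspace|vspace"

# Assumption: pattern reconstructed from the breakdown in the text.
MISSING_ALT = re.compile(r'<img\s+((' + ALLOWED + r')="[^"]+"\s*)+>')

# Hypothetical sample tags, invented for this illustration.
samples = [
    '<img src="shim.gif" width="1" height="1">',   # no alt attribute
    '<img src="logo.gif" alt="Logo" width="80">',  # has an alt attribute
    '<img src="photo.jpg" alt="">',                # has an empty alt attribute
]

# A tag is flagged only if every one of its attributes is on the whitelist.
flagged = [tag for tag in samples if MISSING_ALT.search(tag)]
for tag in flagged:
    print("missing alt:", tag)
```

Only the first tag is flagged. The other two contain “alt”, which is not on the list, so the repeated attribute group cannot reach the closing “>”.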
Now that you have the regex, how would you go about using it? Well, if you’re running a Unix or OS X system, you could make use of “grep” at the command line. Or, if you’re on Windows, a regex-supporting file-finding utility such as Agent Ransack (freeware) might just do the trick. Then again, your editor may also have an Extended Find function with regex support built in.
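If none of those tools is handy, a short script can do the same job. The following is a rough Python sketch, not the tool we actually used: the function name find_missing_alt, the file suffixes and the UTF-8 encoding are all assumptions, and the pattern is the whitelist regex as reconstructed from the breakdown above.

```python
import re
from pathlib import Path

# Assumption: whitelist regex reconstructed from the breakdown in the text.
MISSING_ALT = re.compile(
    r'<img\s+((width|height|border|class|id|src|usemap|hspace|vspace)="[^"]+"\s*)+>'
)


def find_missing_alt(root, suffixes=(".html", ".htm")):
    """Yield (path, line_number, tag) for each img tag matching the regex."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # Assumption: files are UTF-8; undecodable bytes are replaced.
        text = path.read_text(encoding="utf-8", errors="replace")
        for match in MISSING_ALT.finditer(text):
            # Count the newlines before the match to get a 1-based line number.
            line_number = text.count("\n", 0, match.start()) + 1
            yield path, line_number, match.group(0)
```

Looping over find_missing_alt("some_directory") and printing each result gives grep-style output with a file, a line number and the offending tag.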
A few caveats to this regex:
- This regex was designed to match against HTML 4.x code, since that’s what we were dealing with at the time. If you need to find img tags with missing alt attributes within XHTML, then you’ll need to modify the regex. I haven’t tested it, but perhaps adding “/?” before the final greater-than may do the trick. (Adding that sub-expression would allow for an optional slash before the closing greater-than.)
- Yes, I’m also aware that there’re many more allowable attributes on img tags, such as onclick, onmouseover and others. If you’ll be dealing with code that might contain such attributes, feel free to add them to the “or list” above. (The code we were dealing with was known not to have those attributes on any of the image tags.)
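Both caveats can be tried out with a small Python sketch that builds the regex from a list of attribute names and optionally adds the “/?” before the final greater-than. The helper name build_missing_alt_regex and its parameters are hypothetical, and the pattern it builds follows the breakdown above rather than a tested production regex.

```python
import re

BASE_ATTRIBUTES = ["width", "height", "border", "class", "id",
                   "src", "usemap", "hspace", "vspace"]


def build_missing_alt_regex(attributes, allow_self_closing=False):
    """Build the whitelist regex from a list of allowed attribute names."""
    names = "|".join(re.escape(name) for name in attributes)
    # "/?" allows an optional slash before the closing greater-than (XHTML).
    closing = r"/?>" if allow_self_closing else r">"
    return re.compile(r'<img\s+((' + names + r')="[^"]+"\s*)+' + closing)


html4_regex = build_missing_alt_regex(BASE_ATTRIBUTES)
xhtml_regex = build_missing_alt_regex(
    BASE_ATTRIBUTES + ["onclick", "onmouseover"], allow_self_closing=True
)

print(bool(html4_regex.search('<img src="a.gif" />')))               # HTML 4 form only
print(bool(xhtml_regex.search('<img src="a.gif" />')))               # slash allowed
print(bool(xhtml_regex.search('<img src="a.gif" onclick="go()">')))  # extra attribute
print(bool(xhtml_regex.search('<img src="a.gif" alt="A" />')))       # has alt
```

In this sketch the “/?” variant still matches the plain HTML 4 form, because the slash is optional.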
If you end up using this regex, leave a comment below and let us know how it worked out for you. Or, if you’re a regex guru, any optimizations to the regex are welcome as well.