We were working on a project recently that required us to refine some third-party code. Among other things, we needed to add alt attributes to any images that were missing them. (Missing alt attributes are a bane of screen readers, which then resort to reading an image’s filename instead. Just imagine hearing “graphic shim dot gif graphic shim dot gif graphic shim dot gif graphic shim dot gif …”.) I have an editor which supports extended file searches with regular expressions, so a regular expression seemed like a sensible path to take.
In case you’re not familiar with them, regular expressions are “special text strings for describing a search pattern”. Remember the asterisk wildcard from the DOS days? Suppose you wanted to search a directory for PNG files whose names started with “candy”: you could type “dir candy*.png”. Now, that’s not an actual regular expression, but regular expressions are along those lines.
I’ve used regular expressions before, here and there, but I was a bit stumped by this one. I could get as far as devising a regular expression to search for image tags, but I got a bit stuck after that. For what it’s worth, the regular expression for an image tag (assuming HTML 4.x, which is what we were dealing with this time) would be “<img[^>]+>”. Granted, that doesn’t definitively find only well-formed image tags; there are plenty of ways to “fake out” that regex if you really wanted to. However, even though the code we were dealing with was largely invalid, it wasn’t maliciously munged.
If you’re curious about that regular expression for img tags (“<img[^>]+>”), here’s how it breaks down:
- “<img”: The less-than character has no special meaning in regular expressions, so this literally searches for “<img”.
- “[^>]”: Brackets denote “choose any one of these characters”, while the caret (“^”) negates that. So, “[^>]” means “any character which isn’t a greater-than”.
- “+”: A plus sign means “one or more of the preceding”, which in this case is “any character but greater-than”. Put together, this extends the match right up to the greater-than sign at the end of the img tag.
- “>”: Like the less-than character, the greater-than character has no special meaning in regular expressions, either. So, this just marks the end of the image tag.
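As a quick way to see the pattern in action, here is a minimal Python sketch. The sample markup is invented for illustration, and Python's `re` module is only one of many engines that accept this syntax.

```python
import re

# Pattern from the article: "<img", then one or more non-">" characters, then ">".
IMG_TAG = re.compile(r"<img[^>]+>")

# Hypothetical sample markup, made up for this illustration.
sample = (
    '<p>Hi</p><img src="shim.gif" width="1">'
    '<b>x</b><img src="logo.png" alt="Logo">'
)

# findall returns the full text of each match because the pattern has no groups.
tags = IMG_TAG.findall(sample)
for tag in tags:
    print(tag)
```

Both image tags are reported, with or without an alt attribute, which is why a second, stricter pattern is needed.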
That much (searching for an image tag) I had working perfectly. But I needed a “not” operator in there, a means of saying “search for img tags without the string ‘alt’ in the middle”; however, as much as I looked around on the Interweb, I couldn’t find such an operator. (Sure, the caret might seem tempting, but that can only act against single-character matches, as far as I know.) Then it dawned on me that instead of creating a “blacklist” of attributes not to find, I could create a “whitelist” of allowable attributes (excluding “alt” from the list, in this case).
Here’s the regex I came up with for that (and, to save you the suspense, it worked out pretty well for us):
<img\s+((width|height|border|class|id|src|usemap|hspace|vspace)="[^"]+"\s*)+>
Here’s how that one breaks down:
- “<img\s+”: Similar to the first regular expression above (for just finding image tags), this searches for “<img” followed by one or more (“+”) whitespace characters (“\s”).
- “( … )+”: Once again, the plus sign signifies “one or more of the preceding”, which here means one or more attribute/value pairs (the bits inside the outer parentheses, which I’ll go over next).
- “(width|height|border|class|id|src|usemap|hspace|vspace)”: The vertical bar (“|”) just means “or”, so this allows for the img’s attributes to be “width” or “height” or “border” (and so on).
- “=”: The equals sign doesn’t have a special meaning in regular expressions, so this literally signifies an equals sign (between the attribute and its value, in this case).
- “"[^"]+”: Quotation marks don’t have a special meaning within regular expressions, either, so this part searches for a quotation mark (") followed by one or more (“+”) characters which aren’t quotation marks.
- “"”: And this is just the closing quotation mark. Together with the previous bit, it matches the value assigned to an attribute (such as the "42" part of the attribute/value pair height="42").
- “\s*”: Once again, “\s” matches any whitespace character, and the asterisk (“*”) specifies zero or more matches. This accounts for whitespace between attribute/value pairs.
- “>”: Finally, the “>” denotes the end of the img tag.
So, essentially, all of this together means “find img tags for which every attribute appears on this list: width, height, border, class, id, src, usemap, hspace, vspace”. Because “alt” isn’t included on our “whitelist”, images which have an alt attribute don’t match our regular expression. However, images which are missing an alt attribute (and whose other attributes are all on the list) match this regex.
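To make the whitelist behaviour concrete, here is a small Python sketch. The regex is the one described above; the sample tags and variable names are invented for illustration.

```python
import re

# Attribute names allowed on the tag. "alt" is deliberately left out.
ALLOWED = "width|height|border|class|id|src|usemap|hspace|vspace"

# "<img", whitespace, then one or more whitelisted name="value" pairs, then ">".
NO_ALT_IMG = re.compile(r'<img\s+((' + ALLOWED + r')="[^"]+"\s*)+>')

# Hypothetical sample tags, made up for this illustration.
samples = [
    '<img src="shim.gif" width="1" height="1">',   # no alt: should match
    '<img src="logo.png" alt="Logo" width="80">',  # has alt: should not match
    '<img id="hero" alt="Hero" src="hero.jpg">',   # has alt: should not match
]

missing_alt = [tag for tag in samples if NO_ALT_IMG.search(tag)]
for tag in missing_alt:
    print(tag)
```

Only the first tag is reported. Note that a tag with any attribute outside the whitelist is skipped too, whether or not it has an alt attribute.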
Now that you have the regex, how would you go about using it? Well, if you’re running a Unix or OS X system, you could make use of “grep” at the command line. Or, if you’re on Windows, a regex-supporting file-finding utility such as Agent Ransack (freeware) might just do the trick. Then again, your editor may also have an Extended Find function with regex support built in.
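If none of those tools is handy, a short script can do the same job. The following is a rough Python sketch, not a polished utility: the function name, the "*.html" file filter, and the UTF-8 assumption are all choices made for this illustration. It reads each file whole rather than line by line, so a tag that is broken across lines can still match (in Python, “\s” also matches newlines).

```python
import re
from pathlib import Path

NO_ALT_IMG = re.compile(
    r'<img\s+((width|height|border|class|id|src|usemap|hspace|vspace)="[^"]+"\s*)+>'
)

def find_imgs_missing_alt(root, glob="*.html"):
    """Return (file name, line number, matched tag) for each hit under root."""
    hits = []
    for path in sorted(Path(root).rglob(glob)):
        # Assumes UTF-8; undecodable bytes are replaced rather than raising.
        text = path.read_text(encoding="utf-8", errors="replace")
        for match in NO_ALT_IMG.finditer(text):
            # Line number of the start of the match, counted from 1.
            line_no = text.count("\n", 0, match.start()) + 1
            hits.append((path.name, line_no, match.group(0)))
    return hits
```

Calling `find_imgs_missing_alt("some/directory")` returns a list that can be printed or fed into whatever fix-up step comes next.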
A few caveats to this regex:
- This regex was designed to match against HTML 4.x code, since that’s what we were dealing with at the time. If you need to find img tags with missing alt attributes within XHTML, then you’ll need to modify the regex. I haven’t tested it, but perhaps adding “/?” before the final greater-than may do the trick. (Adding that sub-expression would allow for an optional slash before the closing greater-than.)
- Yes, I’m also aware that there are many more allowable attributes on img tags, such as onclick, onmouseover and others. If you’ll be dealing with code that might contain such attributes, feel free to add them to the “or list” above. (The code we were dealing with was known not to have those attributes on any of the image tags.)
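Both caveats can be tried out with a small helper that builds the pattern from a list. This is only a sketch under assumptions: the function name and its parameters are invented here, and the “/?” tweak is the untested suggestion from the first caveat, not a verified XHTML solution.

```python
import re

def build_no_alt_pattern(extra_attrs=(), allow_self_closing=True):
    """Build the whitelist regex, optionally extended (hypothetical helper)."""
    attrs = ["width", "height", "border", "class", "id", "src",
             "usemap", "hspace", "vspace", *extra_attrs]
    names = "|".join(re.escape(name) for name in attrs)
    # "/?" allows an optional slash before the closing greater-than.
    closing = "/?>" if allow_self_closing else ">"
    return re.compile(r'<img\s+((' + names + r')="[^"]+"\s*)+' + closing)

pattern = build_no_alt_pattern(extra_attrs=["onclick", "onmouseover"])

xhtml_hit = pattern.search('<img src="a.gif" width="1" />')
onclick_hit = pattern.search('<img src="a.gif" onclick="go()">')
alt_miss = pattern.search('<img src="a.gif" alt="A" />')
```

With these made-up samples, the first two searches find a match and the third returns `None`, because “alt” is still not on the list.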
If you end up using this regex, leave a comment below and let us know how it worked out for you. Or, if you’re a regex guru, any optimizations to the regex are welcome as well.
I have a query regarding regular expressions.
I want to append some text after the img src by matching through a regular expression,
like
suppose
i want make it like
Or if you don’t want to bother thinking of every attribute you want to accept, you could use something like this, based on the fact that in order for a word to not be “alt”, it must be either shorter or longer than 3 letters, or if it is 3 letters then at least one of them must be different from the letter in the corresponding position in the word “alt”.
This also allows for non-quoted attribute values. Adjust if you are dealing with non-strictly-lowercase attribute names. And of course, like the regexp above, this can still fail to match a valid img tag if it contains an attribute value with internal escaped quotes.
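The pattern from this comment did not survive on the page, so what follows is not the commenter's regex. It is a hypothetical Python sketch of the same idea (any lowercase attribute name except “alt”, with quoted or unquoted values); all names in it are invented for illustration.

```python
import re

# Any lowercase word that is not exactly "alt" (illustrative reconstruction).
NOT_ALT_NAME = (
    r"(?:[a-z]{1,2}"        # shorter than three letters
    r"|[a-z]{4,}"           # longer than three letters
    r"|[b-z][a-z]{2}"       # three letters, first is not "a"
    r"|[a-z][a-km-z][a-z]"  # three letters, second is not "l"
    r"|[a-z]{2}[a-su-z])"   # three letters, third is not "t"
)

# A double-quoted value, or an unquoted run without whitespace, quotes or ">".
ATTR_VALUE = r'(?:"[^"]*"|[^\s">]+)'

IMG_WITHOUT_ALT = re.compile(
    r"<img\s+(?:" + NOT_ALT_NAME + "=" + ATTR_VALUE + r"\s*)+/?>"
)

unquoted_hit = IMG_WITHOUT_ALT.search("<img src=a.gif width=1>")
align_hit = IMG_WITHOUT_ALT.search('<img src="a.gif" align="left">')
alt_miss = IMG_WITHOUT_ALT.search('<img src="a.gif" alt="A">')
```

In this sketch the first two searches match and the third does not. Attribute names containing uppercase letters, digits or hyphens are not covered.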
Apparently HTML in comments, or stuff that looks like HTML, is interpreted . . . let me try again.
So what if I want only the src attribute? I need to get the value of the src attribute only; how can I modify this regex? It will be of great help if you can find me a way.
Kudos to you. I’ve needed this exact code and was learning as I went along. The whole negative phrase thing I assumed existed but couldn’t figure out. You made my day
Can u pls post working example in php or in javascript?
Please give the regular expression for
a height like 5″2′, i.e. 5 feet 2 inches.
Please reply soon.
My id is :
Thank you for such good site. It is sorry that before him did not find
Very clever. I was trying to find a way to negate the alt in the regexp, but manually specifying everything but it is a wonderful work around. Thanks!
I’ve been looking for/trying to create a regex to look for all images without width, height, or alt attributes (so I could then make them XHTML Strict compliant). This regex definitely helps!
I’ve been trying to use the caret to negate the attributes, but didn’t realize it only negates individual characters.
Thanks again.
This is a very good site on regular expressions. I am trying to use a regular expression to sift through all JSP source files to find all image tags (standard HTML img tags or Struts img tags), regardless of whether they have alt attributes or not. But my requirement is that the entire code for each image tag must be returned.
This is not a problem if all the code for an image tag is on one line like . The problem shows up if the img tag code spans multiple lines like
In this case I can only get the first line. But I need to have all the code returned.
Any help is highly appreciated.
Thanks,
John
You can do the same thing (but for img tags that end with the more XHTML-compliant “/>”) by modifying the expression to look like this:
<img[^>]+/>
It worked for me in jEdit.
Do I have this right? He said attempting to pingback…
http://haloscan.com/tb/leeand00/Removing_Proprietary_Attributes_from_html_tags_in_Awful_3rd_Old_Third_Party_HTML_Code_Using_Regular_Expressions