Regular Expression Cookbook
From DPCanadaWiki
This is a resource for regular expression patterns that can be used to automate common but tedious post-processing tasks. Post-processors, please consider adding your favourite regex patterns. Also, please help stress-test the expressions and report any circumstances under which they fail.
Many of these patterns come from a study of the Regular Expressions Clinic at the DP-Int forum and the CSS Cookbook at the DP-Int wiki. The contributions of the many DPers who provided or inspired these patterns are gratefully acknowledged.
Uppercase text between <sc> tags
Used for producing the plain text version where words between <sc> tags need to be converted to uppercase.
Search for : <sc>([^<]*)</sc>
Replace with: \U$1\E
Notes: Search for <sc>, then capture anything not "<" to variable $1 until you hit </sc>. Replace with an uppercase transformation of variable $1.
This pattern fails when there are nested HTML tags (e.g. italic or bold tags) within the <sc> string, because it stops at the first "<". In the plain text version this is acceptable, since by then the other tags will have been replaced with other symbols.
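For anyone testing the pattern outside the search-and-replace dialog, here is a minimal Python sketch of the same idea. Python's re module has no \U...\E case transformation, so a callback function uppercases the captured group instead. The function name uppercase_small_caps is hypothetical and only for illustration.

```python
import re

# Same search pattern as above: <sc>, then anything that is not "<", then </sc>.
SC_PATTERN = re.compile(r"<sc>([^<]*)</sc>")

def uppercase_small_caps(text: str) -> str:
    """Replace each <sc>...</sc> span with its contents in uppercase."""
    # Python has no \U...\E, so a callback uppercases group 1 instead.
    return SC_PATTERN.sub(lambda m: m.group(1).upper(), text)

print(uppercase_small_caps("By <sc>John Smith</sc>, Esq."))
# prints: By JOHN SMITH, Esq.
```

As with the original pattern, a span containing a nested tag is left untouched, because [^<]* stops at the first "<".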
Simple Math Tasks
Page number markers need to be renumbered: [Pg 3] should become [Pg 1], [Pg 4] should become [Pg 2], and so on.
Search for : \[Pg (\d+)\]
Replace with: [Pg \C$1-2\E]
Because the replacement computes $1-2, there will be a warning that Perl code is going to be run. Select 'ok' and the conversion will proceed.
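The same renumbering can be tried out in Python as a rough equivalent. The sketch below assumes a fixed offset of -2, as in the example above. The names renumber_pages and offset are hypothetical. A callback does the arithmetic that \C...\E performs in the replace field.

```python
import re

def renumber_pages(text: str, offset: int = -2) -> str:
    """Shift every [Pg N] marker by a fixed offset (default -2)."""
    def shift(match: re.Match) -> str:
        # Convert the captured digits to an integer, apply the offset,
        # and rebuild the marker.
        return f"[Pg {int(match.group(1)) + offset}]"
    return re.sub(r"\[Pg (\d+)\]", shift, text)

print(renumber_pages("[Pg 3] Once upon a time [Pg 4] there was"))
# prints: [Pg 1] Once upon a time [Pg 2] there was
```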
Inserting Page Links in an Index
Detailed patterns are given in the CSS Cookbook at the DP-Int wiki. This one is simplified, and modified for a specific style of index. The pattern needs to be adapted to what the original index looks like.
We want to search for numbers ending with [,.;] or a linebreak, or a range of numbers e.g. "24-30" ending with the same punctuation. We want to insert links to page anchors. In the case of a range of numbers, we only want to link to the first number, e.g. for "24-30", link only to page 24. We also want the links enclosed in square brackets.
Search for : (\d+)((?:\-\d+)?)([,\n;\.])
Replace with: <a href="#Page_$1">[$1$2]</a>$3
Notes: Search for one or more digits, and capture them to $1. Next, search for an optional group comprising a "-" and one or more digits, and capture that to $2 (empty if there is no range). Last, search for one of the possible terminators (a comma, full stop, semicolon or linebreak) and capture it to $3.
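To check the pattern against a sample index line before running it on a whole file, it can be sketched in Python as follows. Python writes backreferences as \1, \2, \3 rather than $1, $2, $3. The function name link_index_pages and the sample entry are made up for illustration.

```python
import re

# Digits, an optional "-digits" range, then one of , ; . or a linebreak.
INDEX_PATTERN = re.compile(r"(\d+)((?:-\d+)?)([,\n;.])")

def link_index_pages(text: str) -> str:
    """Wrap each page number (or range) in a link to its first page."""
    return INDEX_PATTERN.sub(r'<a href="#Page_\1">[\1\2]</a>\3', text)

print(link_index_pages("Apples, 12, 24-30.\n"), end="")
# prints: Apples, <a href="#Page_12">[12]</a>, <a href="#Page_24">[24-30]</a>.
```

Note that the pattern links any number followed by one of those terminators, so numbers inside an index entry's own text (a year, for example) would be linked too and need checking by hand.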
Creating HTML Tables
Given text like:
Chapter 1    5
Chapter 2    12
We want to grab the first column of text as one variable and the second column as a second variable, and insert HTML table cell tags between them.
Search for : ^(.*)\b\s\s+(.*)$
Replace with: <tr><td>$1</td><td>$2</td></tr>
Notes: Ensure there are at least two whitespace characters between the two columns. If there are three columns, just repeat the bit \b\s\s+(.*) again. The $ only goes at the end.
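A small Python sketch of the two-column case is given below for experimenting with sample rows. It assumes the columns are separated by at least two spaces, as the notes require. The re.MULTILINE flag makes ^ and $ match at every line, and the name rows_to_table is hypothetical.

```python
import re

# First column, a word boundary, two or more whitespace characters,
# then the second column up to the end of the line.
ROW_PATTERN = re.compile(r"^(.*)\b\s\s+(.*)$", re.MULTILINE)

def rows_to_table(text: str) -> str:
    """Turn each two-column text line into an HTML table row."""
    return ROW_PATTERN.sub(r"<tr><td>\1</td><td>\2</td></tr>", text)

sample = "Chapter 1    5\nChapter 2    12"
print(rows_to_table(sample))
# prints:
# <tr><td>Chapter 1</td><td>5</td></tr>
# <tr><td>Chapter 2</td><td>12</td></tr>
```

The single space inside "Chapter 1" is not treated as a column break, because the pattern demands at least two whitespace characters in a row.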
Grabbing Text between tags
This is a more robust pattern than the one for grabbing text between <sc> tags above. It will grab everything within the span tags including linebreaks and nested styling tags such as <i>, </i> etc. Change the "class" specification to whatever you need to search for.
Search for : <span class="smcap">([\w\s\p{IsPunct}\n]+?)</span>
