Thursday, May 31, 2018

Java: Search and replace with regular expressions (2)

Many simple search and replace operations can be performed using the String.replaceAll() method. Sometimes, more flexibility is required: for example, if not every instance of the expression needs replacing, or if the replacement string is not fixed. In this case, instaces of Matcher provide a find() method which we will look at here. The idiom introduced here can also be used simply to find and process instances of a pattern in a string, without necessarily appending anything to another string.

The `Matcher.find()` method

Using Matcher.find() shares some similarity to Matcher.matches(). We first need to compile a Pattern representing our regular expression and then from this construct a Matcher around the string that we want to process. But unlike when we use matches(), our expression is now the pattern that we want to find as a portion of the string, rather than as the whole string. And since the pattern can occur multiple times in the string being matched, we will sit in a loop calling the find() method. The find() method will return true as long as there's another match.
To perform the "replacement", as we go along, we actually build up a new StringBuffer that will contain the new version of the string with the replacements made. A couple of methods of the Matcher object will help us with this.
So keeping with our example of removing HTML 'bold' tags, the code now looks like this:

public String removeBoldTags(CharSequence htmlString) {
  Pattern patt = Pattern.compile("<b>([^<]*)</b>");
  Matcher m = patt.matcher(htmlString);
  StringBuffer sb = new StringBuffer(htmlString.length());
  while (m.find()) {
    String text = m.group(1);
    // ... possibly process 'text' ...
    m.appendReplacement(sb, Matcher.quoteReplacement(text));
  }
  m.appendTail(sb);
  return sb.toString();
}

You'll notice that the parameter passed in is not specifically a String but actually just any old CharSequence. The CharSequence interface introduced in Java 1.4 is implemented by String and by a few other classes (such as StringBuffer and CharBuffer) that can hold a 'sequence of characters'. On the other hand, the appendX() methods work only with StringBuffers– it would have been nice if they'd worked with any old Appendable, but the latter interface did not exist when the regular expressions API was added (in Java 1.4; Appendable was added in Java 5).

Group 0

You may recall from our discussion of capturing groups that there is always a group 0, which refers to the entire string when using the matches() method. When using the find() method, group 0 refers to the entire portion of the string found to match the expression on the previous call to find().

Find without the replace

Of course, you can use Matcher.find() without actually using the replace. This can be used, for example, if you just want to count or process instances of a particular pattern within a string.

Java: Search and replace with regular expressions

It is possible to perform search and replace operations on strings using regular expressions. How complicated this is naturally depends on how much flexibility you need:

to replace instances of an expression in a string with a fixed string, then you can use a simple call to String.replaceAll();
if the replacement string isn't fixed, then you can use a loop with a Pattern and Matcher in which you have complete control over the replacement string.

Replacing with a fixed string

If you just want to replace all instances of a given expression within a string with another fixed string, then things are fairly straightforward. For example, the following replaces all instances of digits with a letter X:

str = str.replaceAll("[0-9]", "X");

The following replaces all instances of multiple spaces with a single space:

str = str.replaceAll(" {2,}", " ");

We'll see in the next section that we should be careful about passing "raw" strings as the second paramter, since a couple of characters in this string actually have special meanings.

Replacing with a sub-part of the matched portion

In the replacement string, we can refer to captured groups from the regular expression. For example, the following expression removes instances of the HTML 'bold' tag from a string, but leaves the text inside the tag intact:

str = str.replaceAll("<b>([^<]*)</b>", "$1");

In the expression <b>([^<]*)</b>, we capture the text between the open and close tags as group 1. Then, in the replacement string, we can refer to the text of group 1 with the expression $1. (The second group would be $2 etc.)

Including a dollar sign in the replacement string

To actually include a dollar in the replacement string, we need to put a backslash before the dollar symbol:

str = str.replaceAll("USD", "\\$");

The static method Matcher.quoteReplacement() will replace instances of dollar signs and backslashes in a given string with the correct form to allow them to be used as literal replacements:

str = str.replaceAll("USD",
  Matcher.quoteReplacement("$"));

In general:

If there is a chance that the replacement string will include a dollar sign or a backslash character, then you should wrap it in Matcher.quoteReplacement()¹.

More flexible find and replacement operations

The replaceAll() method is suitable for cases where the replacement string is fixed or of a fixed format. For more flexibility, the Matcher.find() method can be used.

Wednesday, May 30, 2018

Named

Named Capturing Groups and Backreferences

Nearly all modern regular expression engines support numbered capturing groups and numbered backreferences. Long regular expressions with lots of groups and backreferences may be hard to read. They can be particularly difficult to maintain as adding or removing a capturing group in the middle of the regex upsets the numbers of all the groups that follow the added or removed group.
Python's re module was the first to offer a solution: named capturing groups and named backreferences. (?P<name>group) captures the match of group into the backreference "name". name must be an alphanumeric sequence starting with a letter. group can be any regular expression. You can reference the contents of the group with the named backreference (?P=name). The question mark, P, angle brackets, and equals signs are all part of the syntax. Though the syntax for the named backreference uses parentheses, it's just a backreference that doesn't do any capturing or grouping. The HTML tags example can be written as <(?P<tag>[A-Z][A-Z0-9]*)\b[^>]*>.*?</(?P=tag)>.
The .NET framework also supports named capture. Microsoft's developers invented their own syntax, rather than follow the one pioneered by Python and copied by PCRE (the only two regex engines that supported named capture at that time). (?<name>group) or (?'name'group) captures the match of group into the backreference "name". The named backreference is \k<name> or \k'name'. Compared with Python, there is no P in the syntax for named groups. The syntax for named backreferences is more similar to that of numbered backreferences than to what Python uses. You can use single quotes or angle brackets around the name. This makes absolutely no difference in the regex. You can use both styles interchangeably. The syntax using angle brackets is preferable in programming languages that use single quotes to delimit strings, while the syntax using single quotes is preferable when adding your regex to an XML file, as this minimizes the amount of escaping you have to do to format your regex as a literal string or as XML content.
Because Python and .NET introduced their own syntax, we refer to these two variants as the "Python syntax" and the ".NET syntax" for named capture and named backreferences. Today, many other regex flavors have copied this syntax.
Perl 5.10 added support for both the Python and .NET syntax for named capture and backreferences. It also adds two more syntactic variants for named backreferences: \k{one} and \g{two}. There's no difference between the five syntaxes for named backreferences in Perl. All can be used interchangeably. In the replacement text, you can interpolate the variable $+{name} to insert the text matched by a named capturing group.
PCRE 7.2 and later support all the syntax for named capture and backreferences that Perl 5.10 supports. Old versions of PCRE supported the Python syntax, even though that was not "Perl-compatible" at the time. Languages like PHP, Delphi, and R that implement their regex support using PCRE also support all this syntax. Unfortunately, neither PHP or R support named references in the replacement text. You'll have to use numbered references to the named groups. PCRE does not support search-and-replace at all.
Java 7 and XRegExp copied the .NET syntax, but only the variant with angle brackets. Ruby 1.9 and supports both variants of the .NET syntax. The JGsoft flavor supports the Python syntax and both variants of the .NET syntax.
Boost 1.42 and later support named capturing groups using the .NET syntax with angle brackets or quotes and named backreferences using the \g syntax with curly braces from Perl 5.10. Boost 1.47 additionally supports backreferences using the \k syntax with angle brackets and quotes from .NET. Boost 1.47 allowed these variants to multiply. Boost 1.47 allows named and numbered backreferences to be specified with \g or \k and with curly braces, angle brackets, or quotes. So Boost 1.47 and later have six variations of the backreference syntax on top of the basic \1 syntax. This puts Boost in conflict with Ruby, PCRE, PHP, R, and JGsoft which treat \g with angle brackets or quotes as a subroutine call.

Numbers for Named Capturing Groups

Mixing named and numbered capturing groups is not recommended because flavors are inconsistent in how the groups are numbered. If a group doesn't need to have a name, make it non-capturing using the (?:group) syntax. In .NET you can make all unnamed groups non-capturing by setting RegexOptions.ExplicitCapture. In Delphi, set roExplicitCapture. With XRegExp, use the /n flag. Perl supports /n starting with Perl 5.22. With PCRE, set PCRE_NO_AUTO_CAPTURE. The JGsoft flavor and .NET support the (?n) mode modifier. If you make all unnamed groups non-capturing, you can skip this section and save yourself a headache.
Most flavors number both named and unnamed capturing groups by counting their opening parentheses from left to right. Adding a named capturing group to an existing regex still upsets the numbers of the unnamed groups. In .NET, however, unnamed capturing groups are assigned numbers first, counting their opening parentheses from left to right, skipping all named groups. After that, named groups are assigned the numbers that follow by counting the opening parentheses of the named groups from left to right.
The JGsoft regex engine copied the Python and the .NET syntax at a time when only Python and PCRE used the Python syntax, and only .NET used the .NET syntax. Therefore it also copied the numbering behavior of both Python and .NET, so that regexes intended for Python and .NET would keep their behavior. It numbers Python-style named groups along unnamed ones, like Python does. It numbers .NET-style named groups afterward, like .NET does. These rules apply even when you mix both styles in the same regex.
As an example, the regex (a)(?P<x>b)(c)(?P<y>d) matches abcd as expected. If you do a search-and-replace with this regex and the replacement \1\2\3\4 or $1$2$3$4 (depending on the flavor), you will get abcd. All four groups were numbered from left to right, from one till four.
Things are a bit more complicated with the .NET framework. The regex (a)(?<x>b)(c)(?<y>d) again matches abcd. However, if you do a search-and-replace with $1$2$3$4 as the replacement, you will get acbd. First, the unnamed groups (a) and (c) got the numbers 1 and 2. Then the named groups "x" and "y" got the numbers 3 and 4.
In all other flavors that copied the .NET syntax the regex (a)(?<x>b)(c)(?<y>d) still matches abcd. But in all those flavors, except the JGsoft flavor, the replacement \1\2\3\4 or $1$2$3$4 (depending on the flavor) gets you abcd. All four groups were numbered from left to right.
In PowerGREP, which uses the JGsoft flavor, named capturing groups play a special role. Groups with the same name are shared between all regular expressions and replacement texts in the same PowerGREP action. This allows captured by a named capturing group in one part of the action to be referenced in a later part of the action. Because of this, PowerGREP does not allow numbered references to named capturing groups at all. When mixing named and numbered groups in a regex, the numbered groups are still numbered following the Python and .NET rules, like the JGsoft flavor always does.

Multiple Groups with The Same Name

The .NET framework and the JGsoft flavor allow multiple groups in the regular expression to have the same name. All groups with the same name share the same storage for the text they match. Thus, a backreference to that name matches the text that was matched by the group with that name that most recently captured something. A reference to the name in the replacement text inserts the text matched by the group with that name that was the last one to capture something.
Perl and Ruby also allow groups with the same name. But these flavors only use smoke and mirrors to make it look like the all the groups with the same name act as one. In reality, the groups are separate. In Perl, a backreference matches the text captured by the leftmost group in the regex with that name that matched something. In Ruby, a backreference matches the text captured by any of the groups with that name. Backtracking makes Ruby try all the groups.
So in Perl and Ruby, you can only meaningfully use groups with the same name if they are in separate alternatives in the regex, so that only one of the groups with that name could ever capture any text. Then backreferences to that group sensibly match the text captured by the group.
For example, if you want to match "a" followed by a digit 0..5, or "b" followed by a digit 4..7, and you only care about the digit, you could use the regex a(?<digit>[0-5])|b(?<digit>[4-7]). In these four flavors, the group named "digit" will then give you the digit 0..7 that was matched, regardless of the letter. If you want this match to be followed by c and the exact same digit, you could use (?:a(?<digit>[0-5])|b(?<digit>[4-7]))c\k<digit>
PCRE does not allow duplicate named groups by default. PCRE 6.7 and later allow them if you turn on that option or use the mode modifier (?J). But prior to PCRE 8.36 that wasn't very useful as backreferences always pointed to the first capturing group with that name in the regex regardless of whether it participated in the match. Starting with PCRE 8.36 (and thus PHP 5.6.9 and R 3.1.3) and also in PCRE2, backreferences point to the first group with that name that actually participated in the match. Though PCRE and Perl handle duplicate groups in opposite directions the end result is the same if you follow the advice to only use groups with the same name in separate alternatives.
Boost allows duplicate named groups. Prior to Boost 1.47 that wasn't useful as backreferences always pointed to the last group with that name that appears before the backreference in the regex. In Boost 1.47 and later backreferences point to the first group with that name that actually participated in the match just like in PCRE 8.36 and later.
Python, Java, and XRegExp 3 do not allow multiple groups to use the same name. Doing so will give a regex compilation error. XRegExp 2 allowed them, but did not handle them correctly.
In Perl 5.10, PCRE 8.00, PHP 5.2.14, and Boost 1.42 (or later versions of these) it is best to use a branch reset group when you want groups in different alternatives to have the same name, as in (?|a(?<digit>[0-5])|b(?<digit>[4-7]))c\k<digit>. With this special syntax—group opened with (?| instead of (?:—the two groups named "digit" really are one and the same group. Then backreferences to that group are always handled correctly and consistently between these flavors. (Older versions of PCRE and PHP may support branch reset groups, but don't correctly handle duplicate names in branch reset groups.)

Match any character [\s\S]

The Dot Matches (Almost) Any Character

In regular expressions, the dot or period is one of the most commonly used metacharacters. Unfortunately, it is also the most commonly misused metacharacter.
The dot matches a single character, without caring what that character is. The only exception are line break characters. In all regex flavors discussed in this tutorial, the dot does not match line breaks by default.
This exception exists mostly because of historic reasons. The first tools that used regular expressions were line-based. They would read a file line by line, and apply the regular expression separately to each line. The effect is that with these tools, the string could never contain line breaks, so the dot could never match them.
Modern tools and languages can apply regular expressions to very large strings or even entire files. Except for JavaScript and VBScript, all regex flavors discussed here have an option to make the dot match all characters, including line breaks.
In PowerGREP, tick the checkbox labeled "dot matches line breaks" to make the dot match all characters. In EditPad Pro, turn on the "Dot" or "Dot matches newline" search option.
In Perl, the mode where the dot also matches line breaks is called "single-line mode". This is a bit unfortunate, because it is easy to mix up this term with "multi-line mode". Multi-line mode only affects anchors, and single-line mode only affects the dot. You can activate single-line mode by adding an s after the regex code, like this: m/^regex$/s;.
Other languages and regex libraries have adopted Perl's terminology. When using the regex classes of the .NET framework, you activate this mode by specifying RegexOptions.Singleline, such as in Regex.Match("string", "regex", RegexOptions.Singleline).
JavaScript and VBScript do not have an option to make the dot match line break characters. In those languages, you can use a character class such as [\s\S] to match any character. This character matches a character that is either a whitespace character (including line break characters), or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character.
In all of Boost's regex grammars the dot matches line breaks by default. Boost's ECMAScript grammar allows you to turn this off with regex_constants::no_mod_m.

Line Break Characters

While support for the dot is universal among regex flavors, there are significant differences in which characters they treat as line break characters. All flavors treat the newline \n as a line break. UNIX text files terminate lines with a single newline. All the scripting languages discussed in this tutorial do not treat any other characters as line breaks. This isn't a problem even on Windows where text files normally break lines with a \r\n pair. That's because these scripting languages read and write files in text mode by default. When running on Windows, \r\n pairs are automatically converted into \n when a file is read, and \n is automatically written to file as \r\n.
std::regex, XML Schema and XPath also treat the carriage return \r as a line break character. JavaScript adds the Unicode line separator \u2028 and page separator \u2029 on top of that. Java includes these plus the Latin-1 next line control character \u0085. Boost adds the form feed \f to the list. Only Delphi and the JGsoft flavor supports all Unicode line breaks, completing the mix with the vertical tab.
.NET is notably absent from the list of flavors that treat characters other than \n as line breaks. Unlike scripting languages that have their roots in the UNIX world, .NET is a Windows development framework that does not automatically strip carriage return characters from text files that it reads. If you read a Windows text file as a whole into a string, it will contain carriage returns. If you use the regex abc.* on that string, without setting RegexOptions.SingleLine, then it will match abc plus all characters that follow on the same line, plus the carriage return at the end of the line, but without the newline after that.
Some flavors allow you to control which characters should be treated as line breaks. Java has the UNIX_LINES option which makes it treat only \n as a line break. PCRE has options that allow you to choose between \n only, \r only, \r\n, or all Unicode line breaks.
On POSIX systems, the POSIX locale determines which characters are line breaks. The C locale treats only the newline \n as a line break. Unicode locales support all Unicode line breaks.

\N Never Matches Line Breaks

Perl 5.12 and PCRE 8.10 introduced \N which matches any single character that is not a line break, just like the dot does. Unlike the dot, \N is not affected by "single-line mode". (?s)\N. turns on single-line mode and then matches any character that is not a line break followed by any character regardless of whether it is a line break.
PCRE's options that control which characters are treated as line breaks affect \N in exactly the same way as they affect the dot.
PHP 5.3.4 and R 2.14.0 also support \N as their regex support is based on PCRE 8.10 or later. JGsoft V2 also supports \N.

Use The Dot Sparingly

The dot is a very powerful regex metacharacter. It allows you to be lazy. Put in a dot, and everything matches just fine when you test the regex on valid data. The problem is that the regex also matches in cases where it should not match. If you are new to regular expressions, some of these cases may not be so obvious at first.
Let's illustrate this with a simple example. Say we want to match a date in mm/dd/yy format, but we want to leave the user the choice of date separators. The quick solution is \d\d.\d\d.\d\d. Seems fine at first. It matches a date like 02/12/03 just fine. Trouble is: 02512703 is also considered a valid date by this regular expression. In this match, the first dot matched 5, and the second matched 7. Obviously not what we intended.
\d\d[- /.]\d\d[- /.]\d\d is a better solution. This regex allows a dash, space, dot and forward slash as date separators. Remember that the dot is not a metacharacter inside a character class, so we do not need to escape it with a backslash.
This regex is still far from perfect. It matches 99/99/99 as a valid date. [01]\d[- /.][0-3]\d[- /.]\d\d is a step ahead, though it still matches 19/39/99. How perfect you want your regex to be depends on what you want to do with it. If you are validating user input, it has to be perfect. If you are parsing data files from a known source that generates its files in the same way every time, our last attempt is probably more than sufficient to parse the data without errors. You can find a better regex to match dates in the example section.

Use Negated Character Classes Instead of the Dot

A negated character class is often more appropriate than the dot. The tutorial section that explains the repeat operators star and plus covers this in more detail. But the warning is important enough to mention it here as well. Again let's illustrate with an example.
Suppose you want to match a double-quoted string. Sounds easy. We can have any number of any character between the double quotes, so ".*" seems to do the trick just fine. The dot matches any character, and the star allows the dot to be repeated any number of times, including zero. If you test this regex on Put a "string" between double quotes, it matches "string" just fine. Now go ahead and test it on Houston, we have a problem with "string one" and "string two". Please respond.
Ouch. The regex matches "string one" and "string two". Definitely not what we intended. The reason for this is that the star is greedy.
In the date-matching example, we improved our regex by replacing the dot with a character class. Here, we do the same with a negated character class. Our original definition of a double-quoted string was faulty. We do not want any number of any character between the quotes. We want any number of characters that are not double quotes or newlines between the quotes. So the proper regex is "[^"\r\n]*".