Wednesday, 25th April 2012

Using the Pattern Matcher with regular expressions

From WikiJava



This short article shows how to match regular expressions against strings in Java, and how to extract any part of a string that a regular expression matches.


Description

The core of this example is the java.util.regex.Matcher class, an instance of which is obtained from a java.util.regex.Pattern.

As shown in the code below:

  1. The regular expression is compiled into a Pattern object (method Pattern.compile(String regex, int flags)).
  2. The pattern is then used to create a Matcher for the specific input string (method Pattern.matcher(CharSequence input)).
  3. The matches are extracted sequentially in a while loop whose condition calls matcher.find().
  4. The groups captured in each match (the parts of the regex enclosed in "()" parentheses) are extracted with the method matcher.group(int group).

The flag modifiers

The possible flags:

UNIX_LINES
Only the "\n" character is treated as a line terminator by ".", "^" and "$".
CASE_INSENSITIVE
Ignores capitalisation when matching. By default only US-ASCII characters are compared case-insensitively.
COMMENTS
Permits whitespace and comments in the pattern: whitespace is ignored, and so is everything from a "#" sign to the end of the line.
MULTILINE
Modifies the behavior of the ^ and $ metacharacters. By default they match only at the beginning and the end of the entire input; with this flag they also match at the beginning and the end of each line.
LITERAL
The pattern string is treated as a sequence of literal characters, so metacharacters and escape sequences in it are given no special meaning.
DOTALL
By default the "." metacharacter matches any character except line terminators; with this flag, "." matches those as well.
UNICODE_CASE
This flag works together with CASE_INSENSITIVE. When both are set, case-insensitive matching follows the Unicode Standard instead of being limited to US-ASCII characters.
CANON_EQ
With this flag two characters match if, and only if, their full canonical decompositions match (canonical equivalence). With this flag enabled, the pattern "Anc\u00f2ra" matches "Anco\u0300ra", which is the same word where the accented "o" is written as a plain "o" followed by a combining grave accent (see wikipedia:Combining_character for details on how this works).
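
To make the effect of some of these flags concrete, here is a minimal sketch that counts matches with and without MULTILINE, DOTALL, LITERAL and COMMENTS. The class name FlagExamples, the helper countMatches and the sample strings are made up for this illustration and are not part of the original example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagExamples {

    // Hypothetical helper: counts how often the regex is found in the input
    // when compiled with the given flags (0 means "no flags").
    public static int countMatches(String regex, int flags, String input) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";

        // Without MULTILINE, ^ matches only at the very start of the input.
        System.out.println(countMatches("^\\w+", 0, text));                 // 1
        // With MULTILINE, ^ also matches after each line terminator.
        System.out.println(countMatches("^\\w+", Pattern.MULTILINE, text)); // 3

        // Without DOTALL, "." does not match the newline between the words.
        System.out.println(countMatches("one.two", 0, text));               // 0
        System.out.println(countMatches("one.two", Pattern.DOTALL, text));  // 1

        // With LITERAL, the "." in the pattern is an ordinary dot.
        System.out.println(countMatches("a.c", 0, "abc a.c"));               // 2
        System.out.println(countMatches("a.c", Pattern.LITERAL, "abc a.c")); // 1

        // With COMMENTS, whitespace and "# ..." inside the pattern are ignored,
        // so this pattern is equivalent to \d+\s*px.
        String commented = "\\d+   # one or more digits\n"
                + "\\s*px # optional spaces, then the unit";
        System.out.println(countMatches(commented, Pattern.COMMENTS, "12px 7 px")); // 2
    }
}
```

Passing 0 as the flags argument behaves like the single-argument Pattern.compile(String regex).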

Several flags can be used at the same time by combining them with the bitwise OR operator "|". For example:

Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE 
	| Pattern.UNICODE_CASE);
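
As a rough illustration of what this combination changes, the sketch below compares an accented word against its upper-case form, first with CASE_INSENSITIVE alone and then with UNICODE_CASE added. The class name UnicodeCaseExample, the helper matchesWith and the sample word are assumptions chosen for this illustration.

```java
import java.util.regex.Pattern;

public class UnicodeCaseExample {

    // Hypothetical helper: true if the whole input matches the regex
    // when the regex is compiled with the given flags.
    public static boolean matchesWith(int flags, String regex, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String regex = "\u00e9t\u00e9"; // lower-case word with accented e
        String input = "\u00c9T\u00c9"; // the same word in upper case

        // CASE_INSENSITIVE alone folds case only for US-ASCII letters,
        // so the accented letters do not match their upper-case forms.
        System.out.println(matchesWith(Pattern.CASE_INSENSITIVE, regex, input)); // false

        // Adding UNICODE_CASE makes the case folding Unicode-aware.
        System.out.println(matchesWith(
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, regex, input)); // true
    }
}
```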

The matcher.group(int group) method returns the characters captured by a group. Capturing groups are numbered from 1, counting their opening parentheses from left to right; group 0 stands for the entire match.
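
The numbering can be checked with a small sketch that collects group 0 up to matcher.groupCount() for the first match. The class name GroupNumbering, the helper groupsOfFirstMatch and the sample sentence are invented for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumbering {

    // Hypothetical helper: returns groups 0..groupCount() of the first match,
    // or an empty array if the regex is not found in the input.
    public static String[] groupsOfFirstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find()) {
            return new String[0];
        }
        // groupCount() does not include group 0, hence the + 1.
        String[] groups = new String[matcher.groupCount() + 1];
        for (int g = 0; g <= matcher.groupCount(); g++) {
            groups[g] = matcher.group(g);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = groupsOfFirstMatch("(\\d+):(\\d+)", "Lunch at 12:30 today");
        // Index 0 is the whole match, 1 and 2 are the parenthesised parts.
        System.out.println(Arrays.toString(groups)); // [12:30, 12, 30]
    }
}
```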

The code

import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class TestHarness {
 
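	/**
	 * Finds every meal mentioned in the sample text and prints what was
	 * eaten, the preposition (at/on) and the text that follows it.
	 *
	 * @param args not used
	 */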
	public static void main(String[] args) {
		String s = "Yesterday I had a good 'breakfast' at 7 am.\n"
				+ "Yesterday I had 'lunch' on the table.\n"
				+ "Yesterday evening I had a long 'dinner' at 21.\n";
 
		String regex = "I had[^']+'(.*)'\\s+(at|on)\\s+(.*)\\s*\\.";
		Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
 
		Matcher matcher = pattern.matcher(s);
		int i = 0;
		while (matcher.find()) {
			String what = matcher.group(1);
			String atOn = matcher.group(2);
			String secondString = matcher.group(3);
			System.out.printf("%s: %s %s \n", what,
					atOn, secondString);
			i++;
		}
		// i is an int, so %d is the matching format specifier
		System.out.printf("found %d matches%n", i);
 
	}
 
}

