From WikiJava
This short article shows how to match regular expressions against strings in Java, and how to extract any part of a string that a regular expression matches.
Description
The core of this example is the java.util.regex.Matcher class, which is obtained from a java.util.regex.Pattern.
As shown in the code below:
- The regular expression is compiled into a Pattern object (method Pattern.compile(String regex, int flags)).
- The pattern is then used to create a Matcher for a specific input string (method Pattern.matcher(CharSequence input)).
- The matches can be extracted sequentially in a while loop that has matcher.find() as its condition.
- The groups of data found in a match (the substrings matched by the parts of the regex enclosed in "()" parentheses) can be extracted with the method matcher.group(int group).
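The four steps above can be sketched in a few lines. This is only a minimal illustration: the class name StepsSketch, the helper firstGroups and the sample input are made up for this sketch and are not part of the article's example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StepsSketch {

    // Hypothetical helper: collects the first capturing group of every match.
    static List<String> firstGroups(String regex, String input) {
        // Step 1: compile the regular expression into a Pattern.
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        // Step 2: create a Matcher for the input string.
        Matcher matcher = pattern.matcher(input);
        List<String> found = new ArrayList<>();
        // Step 3: find() moves to the next match until none is left.
        while (matcher.find()) {
            // Step 4: read the text captured by the first "()" group.
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // "Id=x" has no digits after "=", so it is not a match.
        System.out.println(firstGroups("id=(\\d+)", "ID=7, id=42, Id=x")); // prints [7, 42]
    }
}
```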
The flag modifiers
The possible flags:
- UNIX_LINES
- Only the Unix line terminator "\n" is recognized as the end of a line by ".", "^" and "$".
- CASE_INSENSITIVE
- Ignores the case of the characters when matching (by default only for characters in the US-ASCII charset).
- COMMENTS
- Ignores whitespace in the pattern, and everything from a "#" sign to the end of the line.
- MULTILINE
- Modifies the behavior of the ^ and $ metacharacters. With this flag they match at the beginning and the end of each line, instead of only at the beginning and the end of the whole input.
- LITERAL
- The pattern will be parsed as a sequence of literal characters: metacharacters and escape sequences in it are given no special meaning.
- DOTALL
- By default the "." metacharacter matches any character except line terminators; with this flag, "." matches these as well.
- UNICODE_CASE
- This flag goes together with CASE_INSENSITIVE. When both flags are set, case-insensitive matching follows the Unicode Standard instead of being limited to US-ASCII.
- CANON_EQ
- With this flag two characters are considered to match only if their full canonical decompositions match. With this flag enabled, the pattern "Anc\u00f2ra" matches "Anco\u0300ra", which is the same word where the accented "o" is written as a plain "o" followed by a combining accent (see wikipedia:Combining_character for details on how this works).
Several flags can be used at the same time by combining them with the bitwise OR operator "|". For example:
Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE
| Pattern.UNICODE_CASE);
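To see what some of these flags change in practice, the sketch below counts matches with and without a flag. The class name FlagSketch, the helper count and the sample strings are assumptions made for this illustration only; the flags and methods are the standard java.util.regex ones.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagSketch {

    // Hypothetical helper: counts how many times the regex is found in the input.
    static int count(String regex, int flags, String input) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";

        // MULTILINE: ^ and $ match at every line, not only at the ends of the input.
        System.out.println(count("^\\w+$", 0, text));                 // prints 0
        System.out.println(count("^\\w+$", Pattern.MULTILINE, text)); // prints 3

        // DOTALL: "." also matches the line terminator between "one" and "two".
        System.out.println(count("one.two", 0, text));              // prints 0
        System.out.println(count("one.two", Pattern.DOTALL, text)); // prints 1

        // LITERAL: the "." in the pattern is an ordinary character.
        System.out.println(count("a.c", 0, "abc a.c"));               // prints 2
        System.out.println(count("a.c", Pattern.LITERAL, "abc a.c")); // prints 1

        // COMMENTS: whitespace and the "#" comment in the pattern are ignored.
        System.out.println(count("a . c  # any char between a and c",
                Pattern.COMMENTS, "abc")); // prints 1

        // CASE_INSENSITIVE alone only folds US-ASCII letters, so the accented
        // capital letter is matched only when UNICODE_CASE is added.
        String regex = "\u00e9cole";
        String input = "\u00c9COLE";
        System.out.println(count(regex, Pattern.CASE_INSENSITIVE, input)); // prints 0
        System.out.println(count(regex,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, input));  // prints 1
    }
}
```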
The matcher.group(int group) method returns the characters captured by a group. Groups are numbered starting from 1, counting their opening parentheses from left to right; group 0 stands for the entire match.
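A small sketch of the group numbering may help here. The class name GroupSketch, the helper groupsOfFirstMatch and the sample input are hypothetical and only serve this illustration; group 0 is the whole match, and matcher.groupCount() gives the number of capturing groups in the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSketch {

    // Hypothetical helper: returns groups 0..groupCount() of the first match,
    // or an empty array when the regex is not found in the input.
    static String[] groupsOfFirstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find()) {
            return new String[0];
        }
        String[] groups = new String[matcher.groupCount() + 1];
        for (int g = 0; g <= matcher.groupCount(); g++) {
            groups[g] = matcher.group(g); // index 0 is the entire match
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = groupsOfFirstMatch("(\\d+)-(\\d+)", "range 10-20 km");
        System.out.println(String.join(" | ", groups)); // prints 10-20 | 10 | 20
    }
}
```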
The code
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestHarness {

    public static void main(String[] args) {
        String s = "Yesterday I had a good 'breakfast' at 7 am.\n"
                + "Yesterday I had 'lunch' on the table.\n"
                + "Yesterday evening I had a long 'dinner' at 21.\n";
        // Group 1: the quoted meal, group 2: "at" or "on",
        // group 3: the rest of the sentence up to the final dot.
        String regex = "I had[^']+'(.*)'\\s+(at|on)\\s+(.*)\\s*\\.";
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(s);
        int i = 0;
        // find() moves to the next match and returns false when none is left.
        while (matcher.find()) {
            String what = matcher.group(1);
            String atOn = matcher.group(2);
            String secondString = matcher.group(3);
            System.out.printf("%s: %s %s%n", what, atOn, secondString);
            i++;
        }
        System.out.printf("found %d matches%n", i);
    }
}