Saturday, 6 April 2013

Regular Expression

We use regular expressions frequently in text processing: for searching, parsing, validation, or checking XML document integrity. Java provides a package called java.util.regex to make working with regular expressions easier. Below I have summarized the things I need most, since I use Java regex so frequently :)
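
As a rough illustration of the search, parse and validation uses mentioned above, here is a minimal sketch built on Pattern and Matcher from java.util.regex. The class name RegexBasics, its helper methods and the date pattern are assumptions made for this example, not something from the package itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexBasics {
    // Hypothetical pattern for this sketch: a date written like 2013-04-06.
    private static final Pattern DATE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    // Validation: matches() succeeds only if the whole input is a date.
    static boolean isDate(String s) {
        return DATE.matcher(s).matches();
    }

    // Search and parse: find() locates the first date anywhere in the text,
    // group(1) is the first capturing group (the year). Returns -1 if none.
    static int firstYear(String text) {
        Matcher m = DATE.matcher(text);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(isDate("2013-04-06"));            // true
        System.out.println(isDate("posted 2013-04-06"));     // false
        System.out.println(firstYear("posted 2013-04-06"));  // 2013
    }
}
```

Note that every backslash in a regex has to be doubled inside a Java string literal, so the regex \d is written "\\d".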

Common matching symbols:

Regular Expression    Description
.                     Matches any single character (by default, any character except a line terminator)
^regex                regex must match at the beginning of the line
regex$                regex must match at the end of the line
[abc]                 Set definition; matches the letter a, b or c
[abc[vz]]             Union of two sets; matches a single character that is a, b, c, v or z (to match a, b or c followed by either v or z, write [abc][vz])
[^abc]                A “^” as the first character inside [] negates the set; matches any character except a, b or c
[a-d1-7]              Ranges; matches one letter between a and d or one digit from 1 to 7, so it does not match the two-character string d1
X|Z                   Finds X or Z
XZ                    Finds X directly followed by Z
$                     Checks whether a line end follows
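
The symbols above can be tried out with a small sketch like the following. The class name MatchingSymbolsDemo and the helper found are made up for this example; String.matches requires the whole string to match, while the helper uses Matcher.find to look for a match anywhere in the input.

```java
import java.util.regex.Pattern;

class MatchingSymbolsDemo {
    // Returns true if the regex matches somewhere inside the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^Java", "Java regex"));   // true: anchored at the start
        System.out.println(found("regex$", "Java regex"));  // true: anchored at the end
        System.out.println("v".matches("[abc[vz]]"));       // true: union of [abc] and [vz]
        System.out.println("av".matches("[abc][vz]"));      // true: a, b or c followed by v or z
        System.out.println("x".matches("[^abc]"));          // true: anything except a, b, c
        System.out.println("d1".matches("[a-d1-7]"));       // false: the set matches one character only
        System.out.println(found("X|Z", "--Z--"));          // true: X or Z
    }
}
```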

Metacharacters:

Regular Expression    Description
\d                    Any digit, short for [0-9]
\D                    A non-digit, short for [^0-9]
\s                    A whitespace character, short for [ \t\n\x0b\r\f]
\S                    A non-whitespace character, short for [^\s]
\w                    A word character, short for [a-zA-Z_0-9]
\W                    A non-word character, short for [^\w]
\S+                   Several (one or more) non-whitespace characters
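
A minimal sketch of these metacharacters in use is shown below. The class name MetacharacterDemo and the helper findAll are assumptions for illustration. Each backslash is doubled because the regex lives inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetacharacterDemo {
    // Collects every non-overlapping match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "a1 b22"));         // [1, 22]
        System.out.println(findAll("\\S+", "a1 b22\tc"));      // [a1, b22, c]
        System.out.println(findAll("\\w+", "foo_bar-baz 9"));  // [foo_bar, baz, 9]
        System.out.println("_".matches("\\W"));                // false: underscore is a word character
    }
}
```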

Characters:

Characters    Description
x             The character x
\\            The backslash character
\0n           The character with octal value 0n (0 <= n <= 7)
\0nn          The character with octal value 0nn (0 <= n <= 7)
\0mnn         The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh          The character with hexadecimal value 0xhh
\uhhhh        The character with hexadecimal value 0xhhhh
\t            The tab character ('\u0009')
\n            The newline (line feed) character ('\u000A')
\r            The carriage-return character ('\u000D')
\f            The form-feed character ('\u000C')
\a            The alert (bell) character ('\u0007')
\e            The escape character ('\u001B')
\cx           The control character corresponding to x
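
The following sketch checks a few of these escapes against the letter A and the tab character. The class name CharacterEscapeDemo and its helper are hypothetical. It relies on Pattern.matches, which requires the whole input to match.

```java
import java.util.regex.Pattern;

class CharacterEscapeDemo {
    // True only if the regex matches the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\x41", "A"));    // true: hexadecimal 41 is 'A'
        System.out.println(matches("\\u0041", "A"));  // true: the same character as a 4-digit hex escape
        System.out.println(matches("\\0101", "A"));   // true: octal 101 is 'A'
        System.out.println(matches("\\t", "\t"));     // true: tab
        System.out.println(matches("\\cI", "\t"));    // true: control-I is the tab character
        System.out.println(matches("\\\\", "\\"));    // true: one literal backslash
    }
}
```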

Quantifier:

Regular Expression    Description                                                     Examples
*                     Occurs zero or more times; short for {0,}                       X* finds zero or more letters X; .* finds any character sequence
+                     Occurs one or more times; short for {1,}                        X+ finds one or more letters X
?                     Occurs zero or one time; short for {0,1}                        X? finds zero or exactly one letter X
{X}                   Occurs exactly X times; {} applies to the preceding literal     \d{3} finds three digits; .{10} finds any character sequence of length 10
{X,Y}                 Occurs between X and Y times                                    \d{1,4} means \d must occur at least once and at most four times
*?                    A ? after a quantifier makes it a “reluctant quantifier”; it tries to find the smallest match

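
The difference between a greedy and a reluctant quantifier is easiest to see side by side. Below is a minimal sketch; the class name QuantifierDemo and the helper firstMatch are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the text of the first match, or null if the regex does not match.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        System.out.println(firstMatch("<.*>", html));    // <b>bold</b>  (greedy: longest match)
        System.out.println(firstMatch("<.*?>", html));   // <b>          (reluctant: smallest match)
        System.out.println(firstMatch("X+", "aXXXb"));   // XXX
        System.out.println("".matches("X*"));            // true: zero occurrences are allowed
        System.out.println("123".matches("\\d{3}"));     // true: exactly three digits
        System.out.println("2013".matches("\\d{1,4}"));  // true: between one and four digits
        System.out.println("20130".matches("\\d{1,4}")); // false: five digits
    }
}
```
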
A simple example of case-insensitive URL matching using Java regex is given below:
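
One way such a match could be sketched is with the Pattern.CASE_INSENSITIVE flag. The class name UrlMatcher, the method isUrl and the deliberately simple http/https pattern are all assumptions made for this illustration; a pattern for production use would need to cover far more of the URL syntax.

```java
import java.util.regex.Pattern;

class UrlMatcher {
    // Hypothetical, simplified pattern: http or https scheme, a host made of
    // letters, digits, dots and hyphens, an optional port, and an optional
    // path that contains no whitespace. CASE_INSENSITIVE lets HTTP, Http and
    // http (and upper-case host names) all match.
    private static final Pattern URL = Pattern.compile(
            "^https?://[a-z0-9.-]+(:\\d{1,5})?(/\\S*)?$",
            Pattern.CASE_INSENSITIVE);

    static boolean isUrl(String s) {
        return URL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUrl("HTTP://Example.COM/Index.html"));   // true
        System.out.println(isUrl("https://example.com:8080/a?b=c"));  // true
        System.out.println(isUrl("ftp://example.com"));               // false: scheme not covered
        System.out.println(isUrl("http://"));                         // false: host is missing
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which matters when the check runs often.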
