Java Regular Expressions

Java provides the java.util.regex package for pattern matching with regular expressions. Java's regular expression syntax is very similar to that of the Perl programming language. A regular expression is a special sequence of characters that describes a pattern, and it helps the programmer match or find other strings or sets of strings. Regular expressions can be used to search, edit, or manipulate text and data.

The java.util.regex package consists primarily of three classes, each introduced briefly below:

  • Pattern Class − A Pattern object is a compiled representation of a regular expression. The Pattern class provides no public constructors, so to create a pattern the programmer invokes one of its public static compile() methods, which returns a Pattern object.
  • Matcher Class − A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher defines no public constructors. The programmer obtains a Matcher object by invoking the matcher() method on a Pattern object.
  • PatternSyntaxException − A PatternSyntaxException object is an unchecked exception that indicates a syntax error in a regular expression pattern.

Capturing Groups

Capturing groups are a way of treating multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For instance, the regular expression (god) creates a single group containing the letters "g", "o", and "d". Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((D)(E(F))), for example, there are four such groups:

  • ((D)(E(F)))
  • (D)
  • (E(F))
  • (F)

To find out how many groups are present in an expression, call the groupCount method on a matcher object. The groupCount method returns an int giving the number of capturing groups in the matcher's pattern. There is also a special group, group 0, which always represents the entire expression. This group is not included in the total reported by groupCount.
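
The numbering rule can be checked with a short sketch. The class name GroupCountDemo and the helper countGroups are hypothetical names chosen for this illustration; only Pattern.compile, matcher, matches, group, and groupCount come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original tutorial.
class GroupCountDemo {

   // Returns the number of capturing groups in the pattern (group 0 is not counted).
   static int countGroups(String regex) {
      return Pattern.compile(regex).matcher("").groupCount();
   }

   public static void main(String[] args) {
      Matcher m = Pattern.compile("((D)(E(F)))").matcher("DEF");
      System.out.println("groupCount(): " + m.groupCount());   // prints 4

      if (m.matches()) {
         // Group 0 is the whole match; groups 1..groupCount() follow
         // the order of the opening parentheses, left to right.
         for (int i = 0; i <= m.groupCount(); i++) {
            System.out.println("group(" + i + "): " + m.group(i));
         }
      }
   }
}
```

For the input "DEF" the loop prints DEF, DEF, D, EF, and F for groups 0 through 4, which matches the list of groups given earlier.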

Here is an example that demonstrates how to find a digit string in a given alphanumeric string:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {

   public static void main( String args[] ) {
      // String to be scanned to find the pattern.
      String line = "This order was placed for QT3000! OK?";
      String pattern = "(.*)(\\d+)(.*)";

      // Create a Pattern object
      Pattern r = Pattern.compile(pattern);

      // Now create matcher object.
      Matcher m = r.matcher(line);
      if (m.find( )) {
         System.out.println("Found value: " + m.group(0) );
         System.out.println("Found value: " + m.group(1) );
         System.out.println("Found value: " + m.group(2) );
      }else {
         System.out.println("NO MATCH");
      }
   }
}
Output:
Found value: This order was placed for QT3000! OK?
Found value: This order was placed for QT300
Found value: 0
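
Note that group 2 holds only the final "0": the first (.*) is greedy, so it consumes as much as it can and leaves a single digit for (\d+). If the goal is to capture the whole number, one possible variation (an assumption about the intent, not part of the original example) is to make the first group reluctant with (.*?). The class name ReluctantGroupDemo and the helper firstNumber are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original tutorial.
class ReluctantGroupDemo {

   // Returns the first run of digits in the input, or null if there is none.
   static String firstNumber(String input) {
      // (.*?) is reluctant: it consumes as little as possible,
      // so (\d+) receives every digit of the number.
      Matcher m = Pattern.compile("(.*?)(\\d+)(.*)").matcher(input);
      return m.find() ? m.group(2) : null;
   }

   public static void main(String[] args) {
      System.out.println(firstNumber("This order was placed for QT3000! OK?"));   // prints 3000
   }
}
```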

Regular Expression Syntax

The table below lists the regular expression metacharacter syntax available in Java:

Subexpression Matches
^ Matches the beginning of the line.
$ Matches the end of the line.
. Matches any single character except a line terminator. With the DOTALL flag (Pattern.DOTALL, or (?s) inside the pattern) it matches line terminators as well.
[...] Matches any single character in the brackets.
[^...] Matches any single character not in the brackets.
\A Matches the beginning of the entire input.
\z Matches the end of the entire input.
\Z Matches the end of the entire input, except for an allowable final line terminator.
re* Matches 0 or more occurrences of the preceding expression.
re+ Matches 1 or more occurrences of the preceding expression.
re? Matches 0 or 1 occurrence of the preceding expression.
re{n} Matches exactly n occurrences of the preceding expression.
re{n,} Matches n or more occurrences of the preceding expression.
re{n,m} Matches at least n and at most m occurrences of the preceding expression.
a|b Matches either a or b.
(re) Groups regular expressions and remembers the matched text.
(?:re) Groups regular expressions without remembering the matched text.
(?>re) Matches the independent pattern without backtracking.
\w Matches word characters.
\W Matches nonword characters.
\s Matches whitespace. Equivalent to [ \t\n\x0B\f\r].
\S Matches nonwhitespace.
\d Matches digits. Equivalent to [0-9].
\D Matches nondigits.
\G Matches the point where the last match finished.
\n Back-reference to capture group number n, where n is a digit (for example \1).
\b Matches a word boundary.
\B Matches a nonword boundary.
\n, \t, etc. Match newlines, tabs, and other escaped control characters.
\Q Escapes (quotes) all characters up to \E.
\E Ends quoting begun with \Q.
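
A few of these constructs are easier to follow in code. The sketch below combines word boundaries, a back-reference, and \Q...\E quoting. The class SyntaxDemo and its two helper methods are hypothetical names used only for this illustration. Remember that each backslash must be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original tutorial.
class SyntaxDemo {

   // \b(\w+)\s+\1\b : a whole word, whitespace, then the same word again
   // (\1 is a back-reference to capture group 1).
   static boolean hasRepeatedWord(String text) {
      return Pattern.compile("\\b(\\w+)\\s+\\1\\b").matcher(text).find();
   }

   // \Q...\E quotes everything in between, so the + in "1+1" is a literal
   // plus sign rather than the "one or more" quantifier.
   static boolean containsOnePlusOne(String text) {
      return Pattern.compile("\\Q1+1\\E").matcher(text).find();
   }

   public static void main(String[] args) {
      System.out.println(hasRepeatedWord("this is is a test"));   // true
      System.out.println(hasRepeatedWord("this is a test"));      // false
      System.out.println(containsOnePlusOne("1+1=2"));            // true
      System.out.println(containsOnePlusOne("11=2"));             // false
   }
}
```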

Methods of the Matcher Class

Here is a list of useful instance methods of the Matcher class:

1. Index Methods

Index methods provide index values that show precisely where the match was found in the input string:

Method Description
public int start() Returns the start index of the previous match.
public int start(int group) Returns the start index of the subsequence captured by the given group during the previous match operation.
public int end() Returns the offset just after the last character matched.
public int end(int group) Returns the offset just after the last character of the subsequence captured by the given group during the previous match operation.
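
As a minimal sketch of the group-based variants, the following code reports where a whole match and one of its groups sit in the input. The class GroupIndexDemo, the helper groupSpan, and the sample input are hypothetical and exist only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original tutorial.
class GroupIndexDemo {

   // Returns {start, end} of the given group in the first match,
   // or null if the pattern is not found.
   static int[] groupSpan(String regex, String input, int group) {
      Matcher m = Pattern.compile(regex).matcher(input);
      if (!m.find()) {
         return null;
      }
      return new int[] { m.start(group), m.end(group) };
   }

   public static void main(String[] args) {
      String regex = "([A-Z]+)(\\d+)";
      String input = "Order QT3000 shipped";

      int[] whole = groupSpan(regex, input, 0);    // group 0 is the entire match
      int[] digits = groupSpan(regex, input, 2);   // group 2 is the digit run

      System.out.println("match:  " + whole[0] + " to " + whole[1]);     // 6 to 12
      System.out.println("digits: " + digits[0] + " to " + digits[1]);   // 8 to 12
      // end() is exclusive, so it can be passed straight to substring().
      System.out.println(input.substring(digits[0], digits[1]));         // 3000
   }
}
```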

2. Study Methods

Study methods review the input string and return a boolean indicating whether or not the pattern is found:

Method Description
public boolean lookingAt() Attempts to match the input sequence, starting at the beginning of the region, against the pattern.
public boolean find() Attempts to find the next subsequence of the input sequence that matches the pattern.
public boolean find(int start) Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index.
public boolean matches() Attempts to match the entire region against the pattern.
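
The find(int start) variant is the least obvious of the four, so here is a small sketch of it. The class FindFromDemo, the helper indexOfMatch, and the sample input are hypothetical names introduced for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original tutorial.
class FindFromDemo {

   // Returns the start index of the first match at or after 'from',
   // or -1 if there is no such match.
   static int indexOfMatch(String regex, String input, int from) {
      Matcher m = Pattern.compile(regex).matcher(input);
      // find(int) resets the matcher and begins searching at index 'from'.
      return m.find(from) ? m.start() : -1;
   }

   public static void main(String[] args) {
      String input = "cat bat cat";
      System.out.println(indexOfMatch("cat", input, 0));   // 0
      System.out.println(indexOfMatch("cat", input, 1));   // 8
      System.out.println(indexOfMatch("cat", input, 9));   // -1
   }
}
```

Keep in mind that find(int start) throws an IndexOutOfBoundsException when the start index is negative or greater than the length of the input.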

3. Replacement Methods

Replacement methods are useful for replacing text in an input string:

Method Description
public Matcher appendReplacement(StringBuffer sb, String replacement) Implements a non-terminal append-and-replace step.
public StringBuffer appendTail(StringBuffer sb) Implements a terminal append-and-replace step.
public String replaceAll(String replacement) Replaces every subsequence of the input sequence that matches the pattern with the given replacement string.
public String replaceFirst(String replacement) Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.
public static String quoteReplacement(String s) Returns a literal replacement String for the specified String. The returned String works as a literal replacement s in the appendReplacement method of the Matcher class.
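
The quoteReplacement method matters when the replacement text contains a dollar sign or a backslash, because replacement strings treat $ as the start of a group reference. The sketch below assumes a hypothetical class QuoteReplacementDemo with a hypothetical helper replaceLiterally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original tutorial.
class QuoteReplacementDemo {

   // Replaces every match of the regex with the replacement text taken literally.
   static String replaceLiterally(String regex, String input, String replacement) {
      // Without quoteReplacement, "$10" would be read as a reference to a
      // capturing group instead of the literal text "$10".
      String literal = Matcher.quoteReplacement(replacement);
      return Pattern.compile(regex).matcher(input).replaceAll(literal);
   }

   public static void main(String[] args) {
      System.out.println(replaceLiterally("PRICE", "Total: PRICE", "$10"));   // Total: $10
   }
}
```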

The start and end Methods

Here is an example that counts the number of times the word "PHP" appears in the input string and reports where each match starts and ends:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {

   private static final String REGEX = "\\bPHP\\b";
   private static final String INPUT = "PHP PHP PHPTPOINT PHP Python PHP";

   public static void main( String args[] ) {
      Pattern p = Pattern.compile(REGEX);
      Matcher m = p.matcher(INPUT);   // get a matcher object
      int count = 0;

      while(m.find()) {
         count++;
         System.out.println("Match number "+count);
         System.out.println("start(): "+m.start());
         System.out.println("end(): "+m.end());
      }
   }
}
Output:
Match number 1
start(): 0
end(): 3
Match number 2
start(): 4
end(): 7
Match number 3
start(): 18
end(): 21
Match number 4
start(): 29
end(): 32

The matches and lookingAt Methods

The matches and lookingAt methods both attempt to match an input sequence against a pattern. The difference is that matches requires the entire input sequence to be matched, while lookingAt does not. Both methods always start at the beginning of the input string.

Here is an example that demonstrates both methods:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {

   private static final String REGEX = "PHP";
   private static final String INPUT = "PHPTPOINT";
   private static Pattern pattern;
   private static Matcher matcher;

   public static void main( String args[] ) {
      pattern = Pattern.compile(REGEX);
      matcher = pattern.matcher(INPUT);

      System.out.println("Current REGEX is: "+REGEX);
      System.out.println("Current INPUT is: "+INPUT);

      System.out.println("lookingAt(): "+matcher.lookingAt());
      System.out.println("matches(): "+matcher.matches());
   }
}
Output:
Current REGEX is: PHP
Current INPUT is: PHPTPOINT
lookingAt(): true
matches(): false

The replaceFirst and replaceAll Methods

The replaceFirst and replaceAll methods replace the text that matches a given regular expression. As their names indicate, replaceFirst replaces the first occurrence, while replaceAll replaces all occurrences.

Here is an example that uses replaceAll:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {

   private static String REGEX = "Python";
   private static String INPUT = "Python is the best language. " + "I love Python.";
   private static String REPLACE = "Java";

   public static void main(String[] args) {
      Pattern p = Pattern.compile(REGEX);
      
      // get a matcher object
      Matcher m = p.matcher(INPUT); 
      INPUT = m.replaceAll(REPLACE);
      System.out.println(INPUT);
   }
}
Output:
Java is the best language. I love Java.
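
For comparison, here is a minimal sketch of replaceFirst on the same input. The class ReplaceFirstDemo and the helper replaceFirstMatch are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original tutorial.
class ReplaceFirstDemo {

   // Replaces only the first match of the regex in the input.
   static String replaceFirstMatch(String regex, String input, String replacement) {
      return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
   }

   public static void main(String[] args) {
      String input = "Python is the best language. I love Python.";
      // Only the first "Python" is replaced; the second one is left as it is.
      System.out.println(replaceFirstMatch("Python", input, "Java"));
      // prints: Java is the best language. I love Python.
   }
}
```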

The appendReplacement and appendTail Methods

The Matcher class also provides the appendReplacement and appendTail methods for text replacement.

Here is an example that uses both methods:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {

   private static String REGEX = "a*b";
   private static String INPUT = "aabJerryaaabJerryabJerrybb";
   private static String REPLACE = "*";
   public static void main(String[] args) {

      Pattern p = Pattern.compile(REGEX);
      
      // get a matcher object
      Matcher m = p.matcher(INPUT);
      StringBuffer sb = new StringBuffer();
      while(m.find()) {
         m.appendReplacement(sb, REPLACE);
      }
      m.appendTail(sb);
      System.out.println(sb.toString());
   }
}
Output:
*Jerry*Jerry*Jerry**
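
One reason to use this pair of methods rather than replaceAll is that the replacement can be computed separately for each match. The sketch below doubles every number in a string. The class AppendReplacementDemo, the helper doubleNumbers, and the sample sentence are hypothetical and serve only as an illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original tutorial.
class AppendReplacementDemo {

   // Returns the input with every run of digits replaced by twice its value.
   static String doubleNumbers(String input) {
      Matcher m = Pattern.compile("\\d+").matcher(input);
      StringBuffer sb = new StringBuffer();
      while (m.find()) {
         int doubled = Integer.parseInt(m.group()) * 2;
         // Appends the text between the previous match and this one,
         // followed by the replacement computed for this match.
         m.appendReplacement(sb, String.valueOf(doubled));
      }
      // Appends whatever remains after the last match.
      m.appendTail(sb);
      return sb.toString();
   }

   public static void main(String[] args) {
      System.out.println(doubleNumbers("2 apples and 10 pears"));   // 4 apples and 20 pears
   }
}
```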

PatternSyntaxException Class Methods

A PatternSyntaxException is an unchecked exception that indicates a syntax error in a regular expression pattern. The PatternSyntaxException class provides the following methods to help determine what went wrong:

Method Description
public String getDescription() Retrieves the description of the error.
public int getIndex() Retrieves the error index.
public String getPattern() Retrieves the erroneous regular expression pattern.
public String getMessage() Returns a multi-line string containing the description of the syntax error and its index, the erroneous regular expression pattern, and a visual indication of the error index within the pattern.
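
As a minimal sketch of how these methods might be used, the code below tries to compile a pattern with an unclosed parenthesis and reports the error. The class SyntaxErrorDemo and the helper describeError are hypothetical names. The exact wording of the description comes from the Java runtime and is not shown here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not part of the original tutorial.
class SyntaxErrorDemo {

   // Returns null if the regex compiles, otherwise a one-line summary of the error.
   static String describeError(String regex) {
      try {
         Pattern.compile(regex);
         return null;
      } catch (PatternSyntaxException e) {
         return e.getDescription() + " (index " + e.getIndex() + ") in pattern " + e.getPattern();
      }
   }

   public static void main(String[] args) {
      // "(abc" is missing its closing parenthesis, so compile() throws.
      System.out.println(describeError("(abc"));
      // A valid pattern produces no error summary.
      System.out.println(describeError("(abc)"));   // prints null
   }
}
```

Because the exception is unchecked, the try/catch is optional. It is most useful when the pattern comes from user input or a configuration file.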