OROMatcher™ 1.0 FAQ

  1. Why are my backslashed metacharacters not recognized?
  2. Why doesn't the matches() method work?
  3. Why doesn't the . metacharacter work?
  4. How do I make ^ and $ match at the beginning and end of every line instead of just the beginning and end of the input?
  5. I want to use parentheses to denote a group, but I don't want the group to be saved. How do I do it?
  6. What is a backreference?
  7. How do I access a saved group?
  8. Why does the compiler say my pattern isn't initialized?
  9. How do I find all the matches in a string?
  10. How do I search an InputStream?
  11. Which methods are synchronized?

1. Why are my backslashed metacharacters not recognized?

Inside a Java string literal, a backslash introduces an escape sequence that the Java compiler interprets. By chance, some of these escapes are the same as in regular expression grammars, so you can use them without modification (e.g., \n and \r). Others are not legal Java escape sequences (e.g., \s and \w), and the compiler will not recognize them. These metacharacters must be written with a double backslash when encoded as Java strings. The first backslash escapes the second, so the Java compiler places a single literal backslash in the string instead of looking for a Java escape sequence. For example, the expression \w+ must be written in Java as "\\w+".
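
The doubling rule can be checked with a minimal sketch. It assumes java.util.regex as a stand-in for OROMatcher (which may not be on your classpath); the string-literal rule is the same for any regular expression library. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Sketch only: java.util.regex stands in for OROMatcher here.
final class EscapeDemo {
    // Four characters between the quotes in source, three in the string: \ w +
    static final String WORD_REGEX = "\\w+";

    // True when the whole input consists of word characters.
    static boolean isWord(String input) {
        return Pattern.matches(WORD_REGEX, input);
    }

    public static void main(String[] args) {
        System.out.println(WORD_REGEX);            // prints \w+
        System.out.println(WORD_REGEX.length());   // prints 3
        System.out.println(isWord("hello_42"));
    }
}
```

Printing the string is a quick way to see the expression exactly as the regular expression compiler will receive it.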

2. Why doesn't the matches() method work?

A common mistake is to confuse the behavior of the matches() and contains() methods. matches() tests whether an entire string matches a pattern, whereas contains() searches for the first pattern match anywhere within the string. When used with a PatternMatcherInput instance, contains() lets you find every pattern match within a string by calling it in a while loop.
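
The distinction can be illustrated with a small sketch. It is not OROMatcher code: it assumes java.util.regex as a stand-in, where Matcher.matches() plays the role of matches() and Matcher.find() plays the role of contains(). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Sketch only: java.util.regex stands in for OROMatcher here.
final class MatchesVsContainsDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Analogous to matches(): the entire input must match the pattern.
    static boolean matchesWhole(String input) {
        return DIGITS.matcher(input).matches();
    }

    // Analogous to contains(): a match anywhere in the input is enough.
    static boolean containsMatch(String input) {
        return DIGITS.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("abc123def"));
        System.out.println(containsMatch("abc123def"));
    }
}
```

If a whole-string test unexpectedly fails, trying the search form on the same input usually shows whether the pattern matches only part of the string.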

3. Why doesn't the . metacharacter work?

Some regular expression grammars allow the . metacharacter to always match any character. The Perl5 regular expression grammar defines . as matching any character except newline, unless the input is to be interpreted as a single line (the Perl s option). To preserve complete compatibility with Perl5 expressions, OROMatcher™ maintains this behavior: by default, the . metacharacter matches any character except \n. If an expression is compiled with the Perl5Compiler.SINGLELINE_MASK option, or if the setMultiline() method of a Perl5Matcher instance is called with an argument of false, then . matches all characters, including newline. If you want to match any 8-bit character (a range that includes all of ASCII) in multiline mode (the default), you can use the somewhat cumbersome [\x00-\xff]. Alternatively, if you want to match any Unicode character while in multiline mode, you can use (.|\n).
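
The following sketch compares the workarounds described above. It assumes java.util.regex as a stand-in for OROMatcher, with Pattern.DOTALL playing the role that SINGLELINE_MASK plays here; note that java.util.regex excludes a few other line terminators besides \n from . by default. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Sketch only: java.util.regex stands in for OROMatcher here.
final class DotDemo {
    // Default dot: does not match a newline.
    static boolean defaultDot(String s) {
        return Pattern.compile("a.b").matcher(s).matches();
    }

    // Single-line style dot (DOTALL in java.util.regex): matches newline too.
    static boolean singleLineDot(String s) {
        return Pattern.compile("a.b", Pattern.DOTALL).matcher(s).matches();
    }

    // Explicit 8-bit range: covers \n but nothing above \xff.
    static boolean byteRange(String s) {
        return Pattern.compile("a[\\x00-\\xff]b").matcher(s).matches();
    }

    // Alternation: any character the dot accepts, or a newline.
    static boolean dotOrNewline(String s) {
        return Pattern.compile("a(.|\\n)b").matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(defaultDot("a\nb"));
        System.out.println(singleLineDot("a\nb"));
        System.out.println(byteRange("a\nb"));
        System.out.println(dotOrNewline("a\nb"));
    }
}
```

The range form rejects characters above \xff, which is why the alternation form is the one to use for arbitrary Unicode input.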

4. How do I make ^ and $ match at the beginning and end of every line instead of just the beginning and end of the input?

Perl5 defines ^ and $ as matching at the beginning and end of a string. Perl lets you use the m modifier to change the behavior of these metacharacters so that they match at the beginning and end of every line. You can achieve this result in one of two ways using OROMatcher™. First, you can compile your regular expression using the Perl5Compiler.MULTILINE_MASK flag. This causes your pattern to always treat ^ and $ as matching at the beginning and end of a line. However, if you want your pattern to be interpreted this way only some of the time, you can call the setMultiline() method of the Perl5Matcher class with an argument of true. When you want to revert to the original behavior, call the method with an argument of false.
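
The effect of the m modifier can be sketched as follows. This assumes java.util.regex as a stand-in for OROMatcher, with Pattern.MULTILINE playing the role of the compile-time multiline flag; it does not model the per-matcher setMultiline() switch. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: java.util.regex stands in for OROMatcher here.
final class MultilineDemo {
    // Counts matches of ^\w+$ with or without line-by-line anchoring.
    static int countAnchoredWords(String input, boolean multiline) {
        int flags = multiline ? Pattern.MULTILINE : 0;
        Matcher matcher = Pattern.compile("^\\w+$", flags).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";
        System.out.println(countAnchoredWords(text, false));
        System.out.println(countAnchoredWords(text, true));
    }
}
```

Without the flag, the anchors only see the two ends of the whole input, so a three-line input yields no anchored single-word match.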

5. I want to use parentheses to denote a group, but I don't want the group to be saved. How do I do it?

The Perl5 regular expression syntax provides for this. To prevent a parenthesized group from being saved (and therefore from being available as a backreference), use the Perl5 extended syntax construct (?:regexp). For example, in the expression (?:foo)*(bar)\1, the \1 backreference refers to the (bar) group, and the (?:foo) group is simply not saved.
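
As a rough check of the numbering, the sketch below compiles the same expression with java.util.regex, used here as a stand-in for OROMatcher since it accepts the same (?:regexp) construct. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: java.util.regex stands in for OROMatcher here.
final class NonCapturingDemo {
    private static final Pattern PATTERN = Pattern.compile("(?:foo)*(bar)\\1");

    // Number of saved groups; the (?:foo) group is not counted.
    static int savedGroups() {
        return PATTERN.matcher("").groupCount();
    }

    // Text saved as group 1, or null if the whole input does not match.
    static String firstGroup(String input) {
        Matcher matcher = PATTERN.matcher(input);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(savedGroups());
        System.out.println(firstGroup("foofoobarbar"));
    }
}
```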

6. What is a backreference?

A backreference is an element in a regular expression that refers to whatever was matched by a previously occurring parenthesized group. A backreference is represented by a backslash followed by a number n, which refers to the group started by the nth open parenthesis in the expression, counting from left to right and starting from 1. For example, the expression (\d+):\1 would match the string 19:19. For more details on backreferences and other advanced regular expression constructs, we recommend you consult the book "Mastering Regular Expressions" by Jeffrey Friedl, published by O'Reilly & Associates.
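
The (\d+):\1 example can be tried with the short sketch below. It assumes java.util.regex as a stand-in for OROMatcher; the backreference syntax is the same, but the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Sketch only: java.util.regex stands in for OROMatcher here.
final class BackreferenceDemo {
    // \1 must repeat exactly the text that group 1 matched.
    private static final Pattern SAME_NUMBER_TWICE = Pattern.compile("(\\d+):\\1");

    static boolean sameOnBothSides(String input) {
        return SAME_NUMBER_TWICE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(sameOnBothSides("19:19"));
        System.out.println(sameOnBothSides("19:20"));
    }
}
```

The backreference compares matched text, not the subpattern, so 19:20 is rejected even though 20 also satisfies \d+.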

7. How do I access a saved group?

The MatchResult interface defines a method called group(), which returns the string matched by a subgroup of a regular expression. The beginOffset() and endOffset() methods return offset information relative to the beginning of the input, and the begin() and end() methods return offset information relative to the beginning of the complete match. See the API documentation for MatchResult for more information.
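
The two kinds of offset can be illustrated with a sketch. It assumes java.util.regex as a stand-in for OROMatcher: Matcher.group(n) and Matcher.start(n) are real java.util.regex calls, but they are not the MatchResult methods, and the offset relative to the match has to be computed by hand. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: java.util.regex stands in for OROMatcher here.
final class GroupAccessDemo {
    private static final Pattern YEAR_MONTH = Pattern.compile("(\\d{4})-(\\d{2})");

    private static Matcher firstMatch(String input) {
        Matcher matcher = YEAR_MONTH.matcher(input);
        if (!matcher.find()) {
            throw new IllegalArgumentException("no match in: " + input);
        }
        return matcher;
    }

    // Text matched by saved group n (group 0 is the complete match).
    static String group(String input, int n) {
        return firstMatch(input).group(n);
    }

    // Where group n starts, counted from the beginning of the input.
    static int offsetInInput(String input, int n) {
        return firstMatch(input).start(n);
    }

    // Where group n starts, counted from the beginning of the complete match.
    static int offsetInMatch(String input, int n) {
        Matcher matcher = firstMatch(input);
        return matcher.start(n) - matcher.start();
    }

    public static void main(String[] args) {
        String text = "Released: 1997-06";
        System.out.println(group(text, 1));
        System.out.println(offsetInInput(text, 2));
        System.out.println(offsetInMatch(text, 2));
    }
}
```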

8. Why does the compiler say my pattern isn't initialized?

Sometimes you'll write a piece of code like the following and wonder why the Java compiler produces an error:

  String regex, input;
  Pattern pattern;
  PatternCompiler compiler;
  PatternMatcher matcher;

  // Initialization of input, regex, compiler and matcher omitted

  try {
    pattern = compiler.compile(regex);
  } catch(MalformedPatternException e) {
    System.err.println("Bad pattern.");
    System.err.println(e.getMessage());
    System.exit(1);
  }

  // The Java compiler reports the error here: pattern might not be initialized.
  if(matcher.contains(input, pattern)) {
    // Do something useful
  }

The problem here is that the compiler cannot tell that the program will exit if pattern is not properly initialized. In this case, to avoid the compiler error, simply initialize your pattern variable to null. Only do this when you can guarantee that the pattern will be used only after it has been assigned a compiled regular expression. If you can't guarantee this, you should modify your program so that you can. Otherwise, you run the risk of using a null pattern and having your program terminate with a NullPointerException at run time.
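
One way to obtain that guarantee without a null placeholder is to move compilation into a helper that either returns a pattern or throws. The sketch below is an alternative to the null initialization described above, not the approach the text recommends, and it assumes java.util.regex as a stand-in for OROMatcher (PatternSyntaxException plays the role of MalformedPatternException). The helper name compileOrThrow is hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Sketch only: java.util.regex stands in for OROMatcher here.
final class CompileOrThrowDemo {
    // Either returns a compiled pattern or throws; it never returns null.
    static Pattern compileOrThrow(String regex) {
        try {
            return Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            throw new IllegalArgumentException("Bad pattern: " + e.getMessage(), e);
        }
    }

    static boolean contains(String regex, String input) {
        // Assigned exactly once, so the compiler can see it is initialized.
        final Pattern pattern = compileOrThrow(regex);
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(contains("a+b", "xaab"));
    }
}
```

Because every path through the helper ends in a return or a throw, the compiler accepts the variable as definitely assigned and no NullPointerException can arise from it.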

9. How do I find all the matches in a string?

You can find all the matches in a string by creating a PatternMatcherInput instance and using the contains() method in a while loop as follows:

  input = new PatternMatcherInput(someStringInput);

  while(matcher.contains(input, pattern)) {
    result = matcher.getMatch();
    // Perform whatever processing on the result you want.
  }

  // Suppose we want to start searching from the beginning again with
  // a different pattern.
  // Just set the current offset to the begin offset.
  input.setCurrentOffset(input.getBeginOffset());
  // Second search omitted

  // Suppose we're done with this input, but want to search another string.
  // There's no need to create another PatternMatcherInput instance.
  // We can just use the setInput() method.
  input.setInput(aNewInputString);
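
For comparison, here is a self-contained sketch of the same loop that collects every match into a list. It assumes java.util.regex as a stand-in for OROMatcher: Matcher.find() plays the role of contains() on a PatternMatcherInput, and Matcher.reset(newInput) is only loosely analogous to setInput(). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: java.util.regex stands in for OROMatcher here.
final class FindAllDemo {
    // Collects every non-overlapping match, scanning left to right.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    // Reuses one matcher for a second input instead of creating another.
    static int countInSecond(String regex, String first, String second) {
        Matcher matcher = Pattern.compile(regex).matcher(first);
        while (matcher.find()) {
            // First search; results ignored in this sketch.
        }
        matcher.reset(second);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "a1b22c333"));
        System.out.println(countInSecond("\\d+", "a1", "7 8 9"));
    }
}
```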

10. How do I search an InputStream?

You can use the Perl5StreamInput class. See the API documentation for the class for more details. Here's a brief example:

  input = new Perl5StreamInput(new FileInputStream("filename"));

  // We need to put the search loop in a try block because when searching
  // a Perl5StreamInput instance, an IOException may occur, and it
  // must be caught.
  try {
    // Loop until there are no more matches left.
    while(matcher.contains(input, pattern)) {
      result = matcher.getMatch();
      // Perform whatever processing on the result you want.
    }
  } catch(IOException e) {
    System.err.println("Error occurred while reading file.");
    System.err.println(e.getMessage());
    System.exit(1);
  }
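
A runnable approximation of stream searching is sketched below. It does not use Perl5StreamInput: it assumes java.util.Scanner and java.util.regex as stand-ins, searching the stream with Scanner.findWithinHorizon() and surfacing any read failure through Scanner.ioException(). The class and method names are hypothetical, and the pattern is assumed never to match the empty string.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Sketch only: Scanner and java.util.regex stand in for OROMatcher here.
final class StreamSearchDemo {
    // Finds every match in the stream; a horizon of 0 means no search limit.
    static List<String> findAll(InputStream in, Pattern pattern) throws IOException {
        List<String> matches = new ArrayList<>();
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            String match;
            while ((match = scanner.findWithinHorizon(pattern, 0)) != null) {
                matches.add(match);
            }
            // Scanner swallows read errors; rethrow the last one, if any.
            IOException failure = scanner.ioException();
            if (failure != null) {
                throw failure;
            }
        }
        return matches;
    }

    // Convenience wrapper for in-memory text, used for demonstration.
    static List<String> findAllInText(String regex, String text) {
        InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        try {
            return findAll(in, Pattern.compile(regex));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(findAllInText("\\d+", "port 80\nport 443\n"));
    }
}
```

To search a file, pass a FileInputStream to findAll() in place of the in-memory stream.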

11. Which methods are synchronized?

None of the methods of the OROMatcher™ classes is synchronized. The principal reason for this is efficiency: synchronized method calls incur a sizeable overhead, which we don't want to impose on everyone. The secondary reason is that there is little need to use the same compiler and matcher instances in different threads when you can simply instantiate a separate instance for each thread. Pattern and MatchResult instances that need to be shared among threads can be synchronized on explicitly.
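
The one-instance-per-thread approach can be sketched as follows. It assumes java.util.regex as a stand-in for OROMatcher, and the thread-safety rules differ: java.util.regex documents its compiled Pattern as safe to share without locking, which is not something this FAQ claims for OROMatcher patterns. Only the per-task matcher mirrors the advice above. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: java.util.regex stands in for OROMatcher here.
final class PerThreadMatcherDemo {
    // Counts matches across all inputs, one task and one matcher per input.
    static int totalMatches(String regex, String... inputs) {
        Pattern shared = Pattern.compile(regex);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String input : inputs) {
                futures.add(pool.submit(() -> {
                    // Each task creates its own matcher; none is ever shared.
                    Matcher own = shared.matcher(input);
                    int count = 0;
                    while (own.find()) {
                        count++;
                    }
                    return count;
                }));
            }
            int total = 0;
            for (Future<Integer> future : futures) {
                total += future.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(totalMatches("\\d+", "a1b2", "33", "none"));
    }
}
```
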
Copyright © 1997 ORO, Inc. All rights reserved. Original Reusable Objects, ORO, the ORO logo, and "Component software for the Internet" are trademarks or registered trademarks of ORO, Inc. in the United States and other countries.
Java is a trademark of Sun Microsystems. All other trademarks are the property of their respective holders.