Monday, November 20, 2017

Escaping Regex to get Valid JSON

In my schema, I want to recognize certain patterns to restrict the type of data that a user can enter. I use regex to restrict what a user can enter, but the regex get flagged when I try to validate the JSON using an online validator like this one.
Is there a way to make the validator ignore the regex special chars that are disagreeing with it, but still keep the regex?
The weird thing is that the validator only trips up on certain instances. For instance, it flags the second and not the first instance of regex despite them being identical here:
            "institutionname": {
                "type": "string",
                "description": "institution name",
                "label": "name",
                "input-type": "text",
                "pattern": "^[A-Za-z0-9\s]+$"
            },
            "bio": {
                "type": "string",
                "label": "bio",
                "input-type": "text",
                "pattern": "^[A-Za-z0-9\s]+$",
                "help-box": "tell us about yourself"
            },

-------------
Answer:
Its just the slashes that are messing up the validation you could encode them using %5C which is the hex encoding of \ or what Mike W says you could double escape like \\ and then you could just decode them when you want to use them

Regex to validate JSON

(This regex was brought to you from the proving-the-naysayers-wrong department.)

Yes, a complete regex validation is possible.

Most modern regex implementations allow for recursive regexpressions, which can verify a complete JSON serialized structure. The json.org specification makes it quite straightforward.
$pcre_regex = '
  /
  (?(DEFINE)
     (?<number>   -? (?= [1-9]|0(?!\d) ) \d+ (\.\d+)? ([eE] [+-]? \d+)? )    
     (?<boolean>   true | false | null )
     (?<string>    " ([^"\\\\]* | \\\\ ["\\\\bfnrt\/] | \\\\ u [0-9a-f]{4} )* " )
     (?<array>     \[  (?:  (?&json)  (?: , (?&json)  )*  )?  \s* \] )
     (?<pair>      \s* (?&string) \s* : (?&json)  )
     (?<object>    \{  (?:  (?&pair)  (?: , (?&pair)  )*  )?  \s* \} )
     (?<json>   \s* (?: (?&number) | (?&boolean) | (?&string) | (?&array) | (?&object) ) \s* )
  )
  \A (?&json) \Z
  /six   
';
It works quite well in PHP with the PCRE functions . Should work unmodified in Perl; and can certainly be adapted for other languages. Also it succeeds with the JSON test cases.

Simpler RFC4627 verification

A simpler approach is the minimal consistency check as specified in RFC4627, section 6. It's however just intended as security test and basic non-validity precaution:
  var my_JSON_object = !(/[^,:{}\[\]0-9.\-+Eaeflnr-u \n\r\t]/.test(
         text.replace(/"(\\.|[^"\\])*"/g, ''))) &&
     eval('(' + text + ')');

Thursday, November 16, 2017

Regex Named Groups in Java

(Update: August 2011)
As geofflane mentions in his answer, Java 7 now support named groups.
tchrist points out in the comment that the support is limited.
He details the limitations in his great answer "Java Regex Helper"
Java 7 regex named group support was presented back in September 2010 in Oracle's blog.
In the official release of Java 7, the constructs to support the named capturing group are:
  • (?<name>X) to define a named group name"
  • \k<name> to backreference a named group "name"
  • ${name} to reference to captured group in Matcher's replacement string
  • Matcher.group(String name) to return the captured input subsequence by the given "named group".

Other alternatives for pre-Java 7 were:

(Original answer: Jan 2009, with the next two links now broken)
You can not refer to named group, unless you code your own version of Regex...
That is precisely what Gorbush2 did in this thread.
Regex2
(limited implementation, as pointed out again by tchrist, as it looks only for ASCII identifiers. tchrist details the limitation as:
only being able to have one named group per same name (which you don’t always have control over!) and not being able to use them for in-regex recursion.
Note: You can find true regex recursion examples in Perl and PCRE regexes, as mentioned in Regexp Power, PCRE specs and Matching Strings with Balanced Parentheses slide)
Example:
String:
"TEST 123"
RegExp:
"(?<login>\\w+) (?<id>\\d+)"
Access
matcher.group(1) ==> TEST
matcher.group("login") ==> TEST
matcher.name(1) ==> login
Replace
matcher.replaceAll("aaaaa_$1_sssss_$2____") ==> aaaaa_TEST_sssss_123____
matcher.replaceAll("aaaaa_${login}_sssss_${id}____") ==> aaaaa_TEST_sssss_123____ 

(extract from the implementation)
public final class Pattern
    implements java.io.Serializable
{
[...]
    /**
     * Parses a group and returns the head node of a set of nodes that process
     * the group. Sometimes a double return system is used where the tail is
     * returned in root.
     */
    private Node group0() {
        boolean capturingGroup = false;
        Node head = null;
        Node tail = null;
        int save = flags;
        root = null;
        int ch = next();
        if (ch == '?') {
            ch = skip();
            switch (ch) {

            case '<':   // (?<xxx)  look behind or group name
                ch = read();
                int start = cursor;
[...]
                // test forGroupName
                int startChar = ch;
                while(ASCII.isWord(ch) && ch != '>') ch=read();
                if(ch == '>'){
                    // valid group name
                    int len = cursor-start;
                    int[] newtemp = new int[2*(len) + 2];
                    //System.arraycopy(temp, start, newtemp, 0, len);
                    StringBuilder name = new StringBuilder();
                    for(int i = start; i< cursor; i++){
                        name.append((char)temp[i-1]);
                    }
                    // create Named group
                    head = createGroup(false);
                    ((GroupTail)root).name = name.toString();

                    capturingGroup = true;
                    tail = root;
                    head.next = expr(tail);
                    break;
                }

Tuesday, November 14, 2017

Can regex match a pattern only if that pattern appears exactly one time?

Take this example:
string1 = "aaxadfxasdf"

string2 = "asdfxadf"

string3 = "sadfas"
i need a pattern that matches x such that it returns a match for x only for string2.
-------------
^[a-wy-z]+x[a-wy-z]+$
I'm not sure if you are only expecting letters, or if symbols and numbers might be included before and after the x you want to match. If so, use [^x] instead of [a-wy-z], like so:
^[^x]+x[^x]+$
If you are also wanting to match x by itself, replace the + symbols with * like so:
^[a-wy-z]*x[a-wy-z]*$

------------
^[^x]*x[^x]*$
where the first ^ is the beginning of the line ($ is the end), but the other ^ mean the complement of the character set

PatternSyntaxException

A PatternSyntaxException is an unchecked exception that indicates a syntax error in a regular expression pattern. The PatternSyntaxException class provides the following methods to help you determine what went wrong:
The following source code, RegexTestHarness2.java, updates our test harness to check for malformed regular expressions:
import java.io.Console;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.regex.PatternSyntaxException;

public class RegexTestHarness2 {

    public static void main(String[] args){
        Pattern pattern = null;
        Matcher matcher = null;

        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {
            try{
                pattern = 
                Pattern.compile(console.readLine("%nEnter your regex: "));

                matcher = 
                pattern.matcher(console.readLine("Enter input string to search: "));
            }
            catch(PatternSyntaxException pse){
                console.format("There is a problem" +
                               " with the regular expression!%n");
                console.format("The pattern in question is: %s%n",
                               pse.getPattern());
                console.format("The description is: %s%n",
                               pse.getDescription());
                console.format("The message is: %s%n",
                               pse.getMessage());
                console.format("The index is: %s%n",
                               pse.getIndex());
                System.exit(0);
            }
            boolean found = false;
            while (matcher.find()) {
                console.format("I found the text" +
                    " \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(),
                    matcher.start(),
                    matcher.end());
                found = true;
            }
            if(!found){
                console.format("No match found.%n");
            }
        }
    }
}

To run this test, enter ?i)foo as the regular expression. This mistake is a common scenario in which the programmer has forgotten the opening parenthesis in the embedded flag expression (?i). Doing so will produce the following results:
Enter your regex: ?i)
There is a problem with the regular expression!
The pattern in question is: ?i)
The description is: Dangling meta character '?'
The message is: Dangling meta character '?' near index 0
?i)
^
The index is: 0
From this output, we can see that the syntax error is a dangling metacharacter (the question mark) at index 0. A missing opening parenthesis is the culprit.

Matcher

This section describes some additional useful methods of the Matcher class. For convenience, the methods listed below are grouped according to functionality.

Index Methods

Index methods provide useful index values that show precisely where the match was found in the input string:
  • public int start(): Returns the start index of the previous match.
  • public int start(int group): Returns the start index of the subsequence captured by the given group during the previous match operation.
  • public int end(): Returns the offset after the last character matched.
  • public int end(int group): Returns the offset after the last character of the subsequence captured by the given group during the previous match operation.

Study Methods

Study methods review the input string and return a boolean indicating whether or not the pattern is found.

Replacement Methods

Replacement methods are useful methods for replacing text in an input string.

Using the start and end Methods

Here's an example, MatcherDemo.java, that counts the number of times the word "dog" appears in the input string.
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class MatcherDemo {

    private static final String REGEX =
        "\\bdog\\b";
    private static final String INPUT =
        "dog dog dog doggie dogg";

    public static void main(String[] args) {
       Pattern p = Pattern.compile(REGEX);
       //  get a matcher object
       Matcher m = p.matcher(INPUT);
       int count = 0;
       while(m.find()) {
           count++;
           System.out.println("Match number "
                              + count);
           System.out.println("start(): "
                              + m.start());
           System.out.println("end(): "
                              + m.end());
      }
   }
}

OUTPUT:

Match number 1
start(): 0
end(): 3
Match number 2
start(): 4
end(): 7
Match number 3
start(): 8
end(): 11
You can see that this example uses word boundaries to ensure that the letters "d" "o" "g" are not merely a substring in a longer word. It also gives some useful information about where in the input string the match has occurred. The start method returns the start index of the subsequence captured by the given group during the previous match operation, and end returns the index of the last character matched, plus one.

Using the matches and lookingAt Methods

The matches and lookingAt methods both attempt to match an input sequence against a pattern. The difference, however, is that matches requires the entire input sequence to be matched, while lookingAt does not. Both methods always start at the beginning of the input string. Here's the full code, MatchesLooking.java:
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class MatchesLooking {

    private static final String REGEX = "foo";
    private static final String INPUT =
        "fooooooooooooooooo";
    private static Pattern pattern;
    private static Matcher matcher;

    public static void main(String[] args) {
   
        // Initialize
        pattern = Pattern.compile(REGEX);
        matcher = pattern.matcher(INPUT);

        System.out.println("Current REGEX is: "
                           + REGEX);
        System.out.println("Current INPUT is: "
                           + INPUT);

        System.out.println("lookingAt(): "
            + matcher.lookingAt());
        System.out.println("matches(): "
            + matcher.matches());
    }
}

Current REGEX is: foo
Current INPUT is: fooooooooooooooooo
lookingAt(): true
matches(): false

Using replaceFirst(String) and replaceAll(String)

The replaceFirst and replaceAll methods replace text that matches a given regular expression. As their names indicate, replaceFirst replaces the first occurrence, and replaceAll replaces all occurences. Here's the ReplaceDemo.java code:
import java.util.regex.Pattern; 
import java.util.regex.Matcher;

public class ReplaceDemo {
 
    private static String REGEX = "dog";
    private static String INPUT =
        "The dog says meow. All dogs say meow.";
    private static String REPLACE = "cat";
 
    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // get a matcher object
        Matcher m = p.matcher(INPUT);
        INPUT = m.replaceAll(REPLACE);
        System.out.println(INPUT);
    }
}

OUTPUT: The cat says meow. All cats say meow.
In this first version, all occurrences of dog are replaced with cat. But why stop here? Rather than replace a simple literal like dog, you can replace text that matches any regular expression. The API for this method states that "given the regular expression a*b, the input aabfooaabfooabfoob, and the replacement string -, an invocation of this method on a matcher for that expression would yield the string -foo-foo-foo-."
Here's the ReplaceDemo2.java code:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
 
public class ReplaceDemo2 {
 
    private static String REGEX = "a*b";
    private static String INPUT =
        "aabfooaabfooabfoob";
    private static String REPLACE = "-";
 
    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // get a matcher object
        Matcher m = p.matcher(INPUT);
        INPUT = m.replaceAll(REPLACE);
        System.out.println(INPUT);
    }
}

OUTPUT: -foo-foo-foo-
To replace only the first occurrence of the pattern, simply call replaceFirst instead of replaceAll. It accepts the same parameter.

Using appendReplacement(StringBuffer,String) and appendTail(StringBuffer)

The Matcher class also provides appendReplacement and appendTail methods for text replacement. The following example, RegexDemo.java, uses these two methods to achieve the same effect as replaceAll.
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexDemo {
 
    private static String REGEX = "a*b";
    private static String INPUT = "aabfooaabfooabfoob";
    private static String REPLACE = "-";
 
    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        Matcher m = p.matcher(INPUT); // get a matcher object
        StringBuffer sb = new StringBuffer();
        while(m.find()){
            m.appendReplacement(sb,REPLACE);
        }
        m.appendTail(sb);
        System.out.println(sb.toString());
    }
}


OUTPUT: -foo-foo-foo- 

Matcher Method Equivalents in java.lang.String

For convenience, the String class mimics a couple of Matcher methods as well:
  • public String replaceFirst(String regex, String replacement): Replaces the first substring of this string that matches the given regular expression with the given replacement. An invocation of this method of the form str.replaceFirst(regex, repl) yields exactly the same result as the expression Pattern.compile(regex).matcher(str).replaceFirst(repl)
  • public String replaceAll(String regex, String replacement): Replaces each substring of this string that matches the given regular expression with the given replacement. An invocation of this method of the form str.replaceAll(regex, repl) yields exactly the same result as the expression Pattern.compile(regex).matcher(str).replaceAll(repl)