Saturday, March 22, 2014

Microsoft 70-480: Validate user input by using JavaScript

Exam Objectives


Evaluate a regular expression to validate the input format; validate that you are getting the right kind of data type by using built-in functions; prevent code injection

Quick Overview of Training Materials



YouTube videos on JS Form Validation
Regexpal.com
MozDN - typeof
MozDN - Regular Expressions
MSDN - Data Types (JavaScript)
Stackoverflow - type checking in JavaScript
JavaScript spec - RegExp Object
JavaScript Form Validation - w3schools.com (grooooaaaan)



Regex in Javascript


Regular expressions are objects in JavaScript that are used to match a string against a specified pattern.  These patterns can be something very simple like /abc/ which matches the string 'abc', or they can use a collection of special characters to make more sophisticated matches. A comprehensive list of these special characters can be found in the Mozilla Dev Net article about regular expressions. I'm pretty sure detailed knowledge of Regex is what the test is after...

In order to use regular expressions in JavaScript, we have to specify a pattern variable. This is done by either assigning the pattern directly into the variable, or by calling the RegExp() constructor:

var rx = /abc/;
var rx = new RegExp("abc");

var rx = /\d*?/;
var rx = new RegExp("\\d*?");
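
As a quick check that the two notations describe the same pattern, here is a minimal sketch comparing a literal with a constructor call. Note the doubled backslash in the constructor string, because the string literal consumes one level of escaping. The variable names (literal, constructed, word, dynamic) are illustrative only.

```javascript
// Literal notation: the pattern is written between slashes.
const literal = /\d*?/;
// Constructor notation: the backslash must be escaped inside the string.
const constructed = new RegExp("\\d*?");

// Both objects hold the same pattern text.
console.log(literal.source === constructed.source); // true

// The constructor is handy when the pattern is only known at runtime.
const word = "abc"; // hypothetical value, e.g. read from configuration
const dynamic = new RegExp(word, "i");
console.log(dynamic.test("xxABCxx")); // true
```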

Common fields that may need to be validated in a form include names, email, userid, phone number, zip code, and password. While it is generally considered a bad idea to try to validate email addresses with regular expressions because of the idiosyncrasies of the email specification, it is simple to validate the rest. The patterns might look something like these (these are probably a bit naive but they illustrate the point):

  • /\b([0-9]{3}-)?[0-9]{3}-[0-9]{4}\b/  -  Matches phone numbers, with or without area code. The first sub-expression ([0-9]{3}-)?  matches exactly three {3} digits [0-9] followed by a hyphen -.  The expression is grouped with the parentheses and made optional with the question mark following it.  We then look for a series of three digits, a hyphen, followed by four digits. This whole expression is bounded by \b, which is a special character meaning "word boundary", so that 555-5555 will be matched, but 5555-5555 will not. If we wanted to make it more permissive, we could create a version that excludes the hyphens and then match on either pattern. Such a pattern would look like this:
    (\b([0-9]{3}-)?[0-9]{3}-[0-9]{4}\b)|(\b([0-9]{3})?[0-9]{3}[0-9]{4}\b)
  • /\b[0-9]{5}(-[0-9]{4})?\b/ - Matches zip code with or without extension.  
  • /(?=.*\d)(?=.*[a-z])(?=.*[!@#$%&*])(?=.*[A-Z])[\w!@#$%&*]{10,}/ - Matches a password that is at least 10 characters long, contains at least one uppercase letter, one lowercase letter, one number, and one special character from the group !@#$%&* This expression makes use of the lookahead special character (?=). Basically it means that a position is only matched if the pattern inside the lookahead can also be matched from that position onward. Each lookahead is essentially looking for a type of character somewhere in the rest of the string captured with [\w!@#$%&*]{10,}. This expression also uses \d, which is a digit character, and essentially the same as [0-9]. 
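  • /\b\w{6,16}\b/ - This pattern could be used to validate username. This matches any run of word characters (letters, digits, and underscore) between 6 and 16 characters long. The \b characters are necessary to enforce the maximum length, otherwise the pattern would match a substring of a longer name. Note that \b still allows a match inside a larger string that contains spaces or punctuation; anchoring the pattern with ^ and $ requires the whole value to match.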
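
The patterns above can be exercised with test() to see what they accept and reject. This is only a sketch: the sample inputs are made up, and the anchored strictUsername variant at the end is an addition for comparison, not one of the patterns listed above.

```javascript
const phone = /\b([0-9]{3}-)?[0-9]{3}-[0-9]{4}\b/;
const zip = /\b[0-9]{5}(-[0-9]{4})?\b/;
const password = /(?=.*\d)(?=.*[a-z])(?=.*[!@#$%&*])(?=.*[A-Z])[\w!@#$%&*]{10,}/;
const username = /\b\w{6,16}\b/;

console.log(phone.test("555-5555"));          // true
console.log(phone.test("800-555-5555"));      // true
console.log(phone.test("5555-5555"));         // false
console.log(zip.test("12345-6789"));          // true
console.log(zip.test("1234"));                // false
console.log(password.test("Passw0rd!!xyz"));  // true
console.log(password.test("password"));       // false
console.log(username.test("user_01"));        // true
console.log(username.test("abc"));            // false

// \b only guards against longer runs of word characters;
// a match inside a larger string is still accepted.
console.log(username.test("no spaces allowed")); // true ("spaces" matches)

// Anchoring with ^ and $ (an assumption, not from the list above)
// forces the whole value to match.
const strictUsername = /^\w{6,16}$/;
console.log(strictUsername.test("no spaces allowed")); // false
console.log(strictUsername.test("user_01"));           // true
```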
Regexpal.com is a great place to experiment with regular expressions.  In addition to the pattern itself, there are several flags that can be appended to the end to change the behavior:
  • g - global (spec) Specifies whether to test the regex against all possible matches, or only against the first. 
  • i - ignoreCase (spec) Whether to ignore case when testing a string
  • m - multiline (spec) Whether ^ and $ also match at the start and end of each line, rather than only at the start and end of the whole string
  • y - sticky (proposal) Whether the search is "sticky", meaning it only attempts a match at the index indicated in the lastIndex property. At the time of writing this was only supported in FF; it has since been standardized in ES2015.
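
The effect of each flag is easiest to see side by side. The following is a small sketch with a made-up sample string; it assumes an engine that supports the y flag.

```javascript
const text = "Abc abc\nabc";

// Without g, match() stops at the first hit; with g it collects every hit.
console.log(text.match(/abc/).length);   // 1
console.log(text.match(/abc/g).length);  // 2

// i ignores case, so "Abc" is now included as well.
console.log(text.match(/abc/gi).length); // 3

// m lets ^ and $ match at line breaks, not only at the ends of the string.
console.log(/^abc$/.test(text));  // false
console.log(/^abc$/m.test(text)); // true

// y anchors the match at lastIndex instead of scanning forward.
const sticky = /abc/y;
sticky.lastIndex = 4;
console.log(sticky.test(text)); // true: "abc" starts exactly at index 4
sticky.lastIndex = 5;
console.log(sticky.test(text)); // false: no match starts at index 5
```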
Once we have defined a regular expression, we can use it for matching. This can be done from either the regular expression object, or from a string:
  • regex.exec(string) - Searches for a match in a string and returns a result array. Results include last match and captured substrings.
  • regex.test(string) - Tests for a match in a string, returns true or false.
  • string.match(regex) - Searches string using regex pattern. With the g flag it returns an array of all matches; without it, the result is the same as regex.exec (first match plus captured substrings).
  • string.search(regex || string) - Tests for a match in the string and returns the index of the match, or -1 if the search fails.
  • string.replace(regex || string, string) - Finds a substring and replaces it with the string specified in the second parameter. Only the first occurrence is replaced unless a regex with the g flag is used.
  • string.split(regex || string) - Splits string into an array of substrings, using regex matches as separators.
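
To make the differences between these methods concrete, here is a short sketch that runs each of them. The pattern and sample strings are invented for illustration.

```javascript
const rx = /(\d{3})-(\d{4})/;
const input = "call 555-1234 now";

// exec() returns the full match, the captured groups, and the match position.
const result = rx.exec(input);
console.log(result[0]);    // "555-1234"
console.log(result[1]);    // "555"
console.log(result.index); // 5

console.log(rx.test(input));             // true
console.log(input.search(rx));           // 5
// $1 and $2 in the replacement refer to the captured groups.
console.log(input.replace(rx, "$2-$1")); // "call 1234-555 now"

// With the g flag, match() returns every match and drops the capture groups.
console.log("a1b22c333".match(/\d+/g));  // [ '1', '22', '333' ]
console.log("a, b ,c".split(/\s*,\s*/)); // [ 'a', 'b', 'c' ]
```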

The easiest way to play with these functions is to open up the developer console in your browser. While not really that germane to this exam, I found the text analytics courses (essentials 1 and case study) over at IBM's Big Data University really good practice for learning regular expressions.

Validating form input with Regex


One strategy used to employ regular expressions for form validation is to create a validation function that returns true or false, and assign that function to the onsubmit property of the <form> element, like so:

<form method="post" onsubmit="return validate()"  action="formProcessing.php">


The validate function should be relatively straightforward. Since forms and inputs generally have the name attribute set, it is possible to easily access their values via the DOM HTMLCollection:

function validate() {
     var valid = true;
     var errorMsg = document.getElementById("errorMsg");
     var usernameValue = document["form1"]["username"].value;
     // alternatively, can use selection function
     var passwordValue = document.getElementById("password").value;

However the values from the inputs are retrieved, they are then matched against the regex patterns. If a value does not match, it is probably good practice to indicate to the user which field failed validation and what is required to pass:

if(!regex.test(usernameValue)){
     valid = false;
     errorMsg.innerHTML += "Username not valid<br>";
}

At the end of the function, simply return valid. If any of the validation checks failed (and thus set it equal to false) then the validation will fail and the form will not be submitted.
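
Pulling the pieces together, the checking logic might be sketched as a plain function that takes the field values and reports what failed. Everything named here (validateFields, USERNAME_RX, PASSWORD_RX, the anchored patterns) is a hypothetical choice for illustration; keeping the function free of DOM calls makes it possible to try it in a console before wiring it to a form.

```javascript
// Hypothetical patterns, anchored with ^ and $ so the whole value must match.
const USERNAME_RX = /^\w{6,16}$/;
const PASSWORD_RX = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%&*])[\w!@#$%&*]{10,}$/;

// Takes an object of field values and returns the overall result
// together with one message per failed field.
function validateFields(fields) {
  const errors = [];
  if (!USERNAME_RX.test(fields.username)) {
    errors.push("Username not valid");
  }
  if (!PASSWORD_RX.test(fields.password)) {
    errors.push("Password not valid");
  }
  return { valid: errors.length === 0, errors: errors };
}

console.log(validateFields({ username: "user_01", password: "Passw0rd!!xyz" }).valid); // true
console.log(validateFields({ username: "bob", password: "short" }).errors.length);     // 2
```

In a page, the onsubmit handler would read the values from the inputs, call a function like this, write the messages into the error element, and return the valid flag.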


Validating with data type 


Because JavaScript is dynamically typed, nailing down the type of a variable is more involved than in statically typed languages. One method is to use the typeof operator; however, this runs into problems with form input because all values read from inputs are of type string.  One way to work around this for numeric values is to explicitly convert the string value to a number and then use loose equality (==) to compare the original value with the converted value:

foo == Number(foo) // if foo is "abc" then Number() returns NaN, which does not equal foo
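
Wrapped in a function, the comparison might look like the sketch below. The name isNumeric is hypothetical, and the blank-string guard is an addition: an empty string converts to 0, so without the guard it would pass the equality test.

```javascript
function isNumeric(value) {
  // Reject blank input first: "" == Number("") is true because both sides become 0.
  if (value.trim() === "") {
    return false;
  }
  // Loose equality converts the string to a number before comparing.
  // NaN is never equal to anything, so non-numeric text fails here.
  return value == Number(value);
}

console.log(typeof "42");       // "string": form values always start out as strings
console.log(isNumeric("42"));   // true
console.log(isNumeric("4.2"));  // true
console.log(isNumeric("abc"));  // false
console.log(isNumeric("12px")); // false
console.log(isNumeric(""));     // false
```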

Checking for integer and floating point values is a little trickier, especially for cases like the number 5.0, which should be a float but is easy to mistake for an int because of simplification in parsing. The following test will return true if the string is a floating point:

a == Number(a) && (String(parseInt(a)).length < String(Number(a)).length || a.length > String(Number(a)).length)

First it checks whether "a" is a number. If not, the expression short circuits to false because of the && operator.  Next, the length of the parseInt result (as a string) is compared to the length of the Number result (as a string). For values like 5.01, which parseInt truncates to 5, this test will be true and the overall expression will short circuit to true because of the || operator. If this test is false (as it would be with 5.0, which both parseInt and Number reduce to 5), then the last test compares the original string length to the Number string length. If the original string is longer, it means Number simplified it and thus the value should be treated as a float, in which case the expression returns true. These techniques have their limitations: the value 5.000000000000000001 is rounded to 5 by Number, so its fractional part is silently lost, and an integer string with leading zeros such as "007" is reported as a float because it is longer than its converted form. Math functions such as Math.floor can also be used to test for numeric types.
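
To see the float test in action, the expression can be wrapped in a function and run against a few inputs. The name looksLikeFloat is hypothetical; the body is the expression from above, unchanged.

```javascript
function looksLikeFloat(a) {
  return a == Number(a) &&
    (String(parseInt(a)).length < String(Number(a)).length ||
     a.length > String(Number(a)).length);
}

console.log(looksLikeFloat("5.01")); // true: parseInt gives 5, Number gives 5.01
console.log(looksLikeFloat("5.0"));  // true: the original string is longer than "5"
console.log(looksLikeFloat("5"));    // false
console.log(looksLikeFloat("abc"));  // false: fails the numeric check

// Limitation: exponent notation is a whole number here, but parseInt stops
// at the "e", so the length comparison reports a float.
console.log(looksLikeFloat("1e3"));  // true
```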

Checking for blank values is as simple as checking equality with the empty string "", e.g.:

a == ""

Boolean values will most likely be handled by checkboxes or radio buttons, which just require testing the checked property of the control.


Avoiding JavaScript injection attacks 


JavaScript injection is more commonly known as cross-site scripting, or XSS.  While it is tempting to try to use regex to filter out HTML tags in form input, a better solution is to use an existing library (like HTML Purifier, a PHP library that runs on the server) to do any filtering.  Another possibility is to transform all the inputs into HTML entities (i.e. <p> becomes &lt;p&gt;), which is accomplished by assigning the value of the input to the textContent of a temporary element (like a textarea) and then assigning the innerHTML of this temporary element back to the input. While this will create safe code, it will be ugly. Here is a quick demo on JSFiddle:
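
Independent of the JSFiddle demo, the same entity-encoding idea can be sketched without touching the DOM. The helper name escapeHtml and its entity table are assumptions for illustration. Unlike the textContent/innerHTML round trip, this version also encodes quote characters. It is a sketch only, not a substitute for server-side filtering.

```javascript
// Replaces the characters that are significant in HTML with their entities.
function escapeHtml(text) {
  const entities = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;"
  };
  return String(text).replace(/[&<>"']/g, function (ch) {
    return entities[ch];
  });
}

console.log(escapeHtml("<p>"));
// &lt;p&gt;
console.log(escapeHtml('<p onclick="x()">Hi & bye</p>'));
// &lt;p onclick=&quot;x()&quot;&gt;Hi &amp; bye&lt;/p&gt;
```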


