Form Design Guide
Form Design 4
A well-designed form speeds up the data recognition process and makes it more reliable. Efficient data recognition therefore begins with a correctly designed master form. A form works well if it has the following characteristics:
The general layout of the form is very important to both human users and the recognition system. The general principle is to design a form that is user friendly. Many of the requirements for making forms easy for humans to use are compatible with the requirements of image-based recognition systems.
In order to achieve good form design, three distinct aspects need to be considered:
The first phase in designing a form consists of analyzing your specific needs and adapting the form to meet those needs. To start the process, determine exactly what information needs to be included on the form. A good starting point is to review an existing form you may already have.
To begin the analysis:
Note: A date field should have six characters representing day, month, and year. Other fields, such as phone numbers, fax numbers, and zip codes, require an exact number of characters. For name or address fields, use the maximum number of characters that you would expect to see in that field.
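The field-length rule in this note can be written down as a small specification before any layout work starts. The following Python sketch is illustrative only: the `FieldSpec` class, the field names, and every length except the six-character date are assumptions, not values taken from this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    length: int   # number of character positions to print on the form
    fixed: bool   # True when every entry must fill all positions


# Hypothetical field list. Only the six-character date comes from the note;
# the other lengths are assumptions chosen for illustration.
FIELDS = [
    FieldSpec("date", 6, True),          # day, month, and year
    FieldSpec("zip_code", 5, True),      # assumption: five-digit zip code
    FieldSpec("phone", 10, True),        # assumption: ten-digit phone number
    FieldSpec("last_name", 20, False),   # assumption: longest expected name
]


def fits(spec: FieldSpec, value: str) -> bool:
    """Check whether a sample entry fits the planned field length."""
    if spec.fixed:
        return len(value) == spec.length
    return len(value) <= spec.length
```

Running sample entries from an existing form through `fits` is one way to confirm that the planned lengths are wide enough before the form goes to print.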
Note: Keep the instructions and examples simple and use plain language. A short instruction such as "PLEASE PRINT USING UPPER-CASE LETTERS" can prove invaluable. Something like "PLEASE PRINT WITH BLACK INK" can also be helpful.
Once you have completed the steps above, you can begin designing the physical layout of the form.
Once you have determined what your needs are, the form needs to be physically designed according to your specifications.
A spreadsheet or a drawing application can be used to create a form document. Mitek Systems recommends the following software package which has been proven to be effective in the form layout process:
Company Name Product
This program is available at most retail software outlets. In addition, word processing programs such as Word for Windows and WordPerfect can also be used to lay out simple forms. However, the package listed above is recommended for any serious form design that requires the production of graphics and two-color artwork.
The size of the page used in the form design may be determined by the distribution method or by other considerations such as the number of fields on the form. The size of the form, weight of the paper, and the anticipated volume of forms to be processed will play a role in the determination of which scanner is used to process the forms.
Design the size of the form so that the forms can be automatically fed into the scanner you have chosen. Letter size documents are readily handled by most scanners. If the form design requires the use of smaller documents, special paper handling capability may be required of the scanner.
The weight of the paper selected can also influence paper handling. Most scanners specify the weight range that they can handle. A typical range is 20 lb. to 30 lb. bond paper or a similar range of rag paper. A good selection for most purposes is 24 lb. bond.
White space provides a "buffer" around meaningful data, ensuring that the recognition system can locate data easily even if the document is scanned at a slight angle or offset.
A well-designed form will also include registration marks, or targets, to help the recognition system detect and correct image skew or distortions caused by the scanner.
To optimize results, design the form with two colors to allow for the use of drop-out ink. This allows you to provide information that is visible to the human reader but is not needed for the recognition process. The information printed in a drop-out color is removed during the scanning process.
Many forms also include information (lines, barcodes) that identifies the form. The recognition system can use this information, known as a document identifier or Form ID, to identify the form it is currently processing. It can then locate the appropriate template necessary to read the data.
Finally, one of the most important steps in laying out the form is the design of the data-entry fields. First, the data-entry fields direct the user as to what data to enter and how to fill in that data. Second, the design of the fields can help or hinder the system in locating meaningful data. The better the system can locate and identify the data in the image, the faster and more accurately it can read the data.
Special consideration is needed if you plan to use fax rather than a scanner for input. Faxed images are not only subject to shift and skew but also tend to contain a high level of non-linear distortion. Non-linear distortions are best described as "unpredictable" distortions. Some fax machines stretch or shrink the paper as it goes through the machine. Fax machines also often process faxes at different resolutions or scales.
To optimize results for faxed-in forms, it is best not to use drop-out ink, due to non-linear distortions. Instead, it is best to use all black or a 5% to 8% gray fill for boxes and lines. Most pre-processing software can easily remove black lines or black boxes and gray fill.
Using targets is also very important on faxed-in forms. If your pre-processing software cannot correct non-linear distortions, targets are the only way of repairing the image. It is best to use 8 to 10 targets (half on each side of the form) to correct non-linear distortion.
Using Registration Marks (also called Targets)
All scanners skew forms as they travel through the feeder mechanism. These distortions can cause recognition problems. Registration marks (also called targets) are special markings on the page that aid the recognition system in deskewing scanned images.
Each target should be fairly wide and well-defined. Mitek recommends the use of a solid black circle, about 3/8" in diameter, with a drop-out color ring around the black circle separated by a thin ring of white space. This target mark also allows you to check the printer's registration of the two colors on the form.
Note: The drop-out ring is intended only as a printing test. This way you can guarantee that the printer is placing the drop-out ink in the correct areas. Do not try to perform ICR or registration on a form with a drop-out ring. If for some reason the ring does not drop out, registration will be altered.
Other symbols which make good target marks include 3/8 inch solid black circles or squares, thick angle brackets, and T bars. Satisfactory symbols that can be generated from a word processor include the character ‘O’ (a capital O preferably in bold) and the character ‘X’ (a capital X preferably in bold).
Place the targets on the form in such a way that the box which they define encloses all fields which you want to read. Be sure that the box defined by the targets is both wide and high enough to allow for the detection of image distortion. For example, if you define two targets that are both located in the left-hand section of the image, the scale and slope calculations for fields on the right-hand side of the form become less accurate.
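To illustrate why widely separated targets help, the following Python sketch estimates rotation and scale from one pair of targets. It is a generic geometric calculation, not the method used by any particular recognition product; the function name and the coordinate convention (x, y in pixels) are assumptions.

```python
import math


def skew_and_scale(template_a, template_b, scanned_a, scanned_b):
    """Estimate (rotation in degrees, scale factor) from one pair of targets.

    template_a/template_b are the target centers on the master form;
    scanned_a/scanned_b are the same targets as found in the scanned image.
    """
    tdx, tdy = template_b[0] - template_a[0], template_b[1] - template_a[1]
    sdx, sdy = scanned_b[0] - scanned_a[0], scanned_b[1] - scanned_a[1]

    # Rotation is the difference between the directions of the two baselines.
    angle = math.degrees(math.atan2(sdy, sdx) - math.atan2(tdy, tdx))
    angle = (angle + 180.0) % 360.0 - 180.0  # normalize to [-180, 180)

    # Scale is the ratio of the baseline lengths.
    scale = math.hypot(sdx, sdy) / math.hypot(tdx, tdy)
    return angle, scale
```

A one-pixel error in locating a target changes the angle estimate by roughly 1/baseline radians, so doubling the distance between the targets roughly halves the error. This is why targets clustered on one side of the form give weaker estimates for fields on the other side.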
For best results:
Drop-out ink is used to provide information that is visible to the human reader but is removed during the scanning process. It refers to the color ink used on a form which cannot be detected by the scanner because of its high reflective value (generally reflectance greater than 60%). Any item or print using drop-out ink is not an item for the recognition system to process. Drop-out ink may cost a little more, but the benefits will significantly offset or negate the extra charge: the form design versus performance trade-off is very cost effective.
Drop-out colors are a function of the scanner used. Most scanners available today are insensitive to pastel blue, green, and yellow. Some scanners have special lenses available which allow them to drop out red. Since each scanner is different, the only way to be sure which color will drop out on your scanner is to run a test.
To test the scanner for drop-out colors:
By completing the above process, you should find several colors that are acceptable to you and your scanner.
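One way to evaluate a test scan is to compare each color swatch against the 60% reflectance figure mentioned earlier. The Python sketch below assumes that the scanner produces 8-bit grayscale values and that gray level is roughly proportional to reflectance. Both are simplifying assumptions, and the swatch names and readings are made up for illustration.

```python
# Threshold taken from the guide: drop-out colors generally reflect more than 60%.
DROP_OUT_REFLECTANCE = 0.60


def drops_out(mean_gray: float, white_level: float = 255.0) -> bool:
    """Treat a swatch as drop-out when its scanned brightness exceeds the threshold.

    mean_gray is the average gray level of the swatch in the test scan;
    white_level is the gray level of blank paper (assumed to be 255 here).
    """
    return mean_gray / white_level > DROP_OUT_REFLECTANCE


# Hypothetical readings from a test scan of printed swatches.
swatches = {"pastel blue": 214, "pastel green": 201, "dark red": 96}
usable = sorted(name for name, gray in swatches.items() if drops_out(gray))
```

A check like this only ranks candidates. The final decision should still rest on whether the color actually disappears in the scanned image.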
Mitek suggests you try the following colors for drop-out ink:
PMS 100 Yellow - (remember yellow is hard to read on white paper)
Note: These are all colors that are mixed at a ratio of 30 parts white to 1 part color.
If more than one type of form is going to be read by the system in the same batch, include a form ID feature. There are two mechanisms that PFP uses to perform form identification. One is barcode form ID and the other is line-based form ID.
Using Bar Codes as Form ID
The most accurate and effective way to identify different forms is through the use of barcodes. To use bar codes for form identification purposes, each of your forms must have a unique bar code printed on the form. The first four digits of the bar code also have to be unique to each form. This barcode will be read when the forms are scanned into the system, effectively differentiating one form from another. Among the bar-code types that may be used are:
For best results follow these additional guidelines:
If for some reason barcodes cannot be used, forms can be distinguished by the location and length of lines on the form. This is only effective with forms that have significantly different line structures. It is important to have both horizontal and vertical lines on the form. For best results, the forms should contain 3-8 lines in both directions.
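Both form ID rules above can be checked before a set of forms goes to print. The Python sketch below is a minimal illustration: the function names, the form names, and the bar code values are hypothetical, and the checks only restate the two rules from the text (unique first four digits, and 3-8 lines in each direction).

```python
from collections import Counter


def duplicate_prefixes(barcodes: dict, prefix_len: int = 4) -> list:
    """Return the bar code prefixes that are shared by more than one form.

    barcodes maps a form name to the bar code value printed on it.
    """
    counts = Counter(code[:prefix_len] for code in barcodes.values())
    return sorted(prefix for prefix, n in counts.items() if n > 1)


def line_counts_ok(horizontal: int, vertical: int, low: int = 3, high: int = 8) -> bool:
    """Check the 3-8 lines per direction guideline for line-based form ID."""
    return low <= horizontal <= high and low <= vertical <= high


# Hypothetical form set: "claim" and "renewal" clash on the prefix 1001.
forms = {"claim": "10010001", "enroll": "10020001", "renewal": "10010002"}
clashes = duplicate_prefixes(forms)
```

An empty result from `duplicate_prefixes` means every form in the batch can be told apart by its first four digits.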
Image-based recognition is the process of identifying objects in an image as characters or numbers and representing them in a form that a computer can process. Recognition systems can read a wide variety of data and are not usually limited to special printing or character sensing technologies. The different types of data that can be read by recognition systems include mark sense boxes, barcodes, MICR fonts, OCR fonts, machine print, and hand print.
Machine print includes any data printed by a laser, dot-matrix, or impact printer, a typewriter, or a typesetter. The most common uses for machine print fields are preprinted items such as names, addresses, ID numbers, tracking codes, or serial numbers.
The QuickStrokes API can recognize a variety of machine print styles using any of the field types (isolated, semi-constrained, or unconstrained). Although the machine print networks have been trained on several font types, they are not "omni-font": they cannot read every font. Machine print fields can also have multiple lines, although the system is not designed to do free-format text conversion. The best fonts, in decreasing order of accuracy, are as follows:
Font Name      Typical Accuracy
OCR B          Excellent
OCR A          Very Good
Helvetica      Very Good
Times Roman    Good
Characters are naturally grouped in fields. A field is a set of data located in a particular region of the form that is to be read as a whole, such as a name or a telephone number. The length of the field is determined by the number of characters contained in it. In a well-designed form, the data fields are clearly defined to encourage answers that are correctly formatted. In addition, the better the system can locate and identify meaningful data in the image, the faster and more accurately it can read the data.
To make the fields easy to find:
Check boxes or circles can be used for multiple choice selections or to indicate that a given item is relevant. The recognition system uses "mark sense recognition" to determine whether the box has been checked. The system treats any data within the mark sense box as a "yes" response. Therefore, the user can indicate a choice by filling in the entire box or simply marking it with an 'X' or a check mark. A check box can be almost any size, and can be used for applications such as checking an option or verifying that a signature is present. A well-designed form will contain as many yes/no or multiple choice questions as possible.
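The idea that any data inside the box counts as a "yes" can be sketched as a dark-pixel count. The Python example below is a simplified illustration, not the algorithm of any specific recognition engine. It assumes a binarized image stored as rows of 0 (white) and 1 (dark), and the small `min_dark_fraction` threshold is an assumption added to tolerate speckle noise.

```python
def is_checked(image, box, min_dark_fraction=0.02):
    """Return True when the box region contains enough dark pixels.

    image is a list of rows, each a list of 0 (white) or 1 (dark) pixels.
    box is (left, top, right, bottom) in pixel coordinates, with right and
    bottom exclusive.
    """
    left, top, right, bottom = box
    region = [row[left:right] for row in image[top:bottom]]
    total = sum(len(row) for row in region)
    dark = sum(sum(row) for row in region)
    return total > 0 and dark / total >= min_dark_fraction


# A blank 10x10 image, then the same image with a diagonal stroke in the box.
blank = [[0] * 10 for _ in range(10)]
marked = [row[:] for row in blank]
for i in range(2, 8):
    marked[i][i] = 1
```

Because the test only counts dark pixels, a box outline printed in black would itself register as a mark unless it is excluded from the region.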
The following is an example of check boxes:
Guidelines in designing check boxes:
Field constraints are lines or boxes in a form that guide (or constrain) the user in entering data. They ensure that the data is in the correct location, is formatted correctly, and does not overlap other data. Because individual handwriting varies so widely, the more constraint you impose on the user, the more likely the characters will be distinct and consistent. Forms to be filled out with hand-printed information should be designed so that each letter or number is written in a specifically designated area. Individual character boxes are highly recommended. It is also a good idea to print on your form the recommended character set that gives the best results.
Note: It is extremely important that only drop-out ink is used within the designated field areas. If black ink is used within the fields, the system will confuse preprinted information with the information filled out by the user.
The most common types of character fields are Isolated Character Fields, Semi-Constrained Character Fields, and Unconstrained Character Fields.
An isolated character field is a field type where each character position is clearly defined and is clearly separate from the other characters in that field. Isolated character fields yield the best results for forms which are to be filled out with hand-printing. Isolated character fields promote faster processing of characters with higher accuracy.
The following illustration is an example of an isolated character field:
A semi-constrained character field is a field type where each character position is well defined but not necessarily isolated. Semi-constrained character fields typically provide the best results in most practical situations. They are very similar to isolated fields, but there is a high potential for a character's printing to leave its own area or "leak" into the next character's area. The best results occur when the character boxes are drawn separated from one another, just as with an isolated field. It is also possible to draw the character boxes so that they are touching, although the resulting accuracy may be lower.
The following illustrations are examples of semi-constrained character fields:
An unconstrained character field is a field which does not contain any lines or boxes restricting the position of each character entered. Unconstrained fields are more difficult to recognize and require more processing time, but are invaluable where field designs cannot be controlled. Hand, machine, numeric, and alpha fields can be unconstrained although the best results are realized for numeric fields. Alpha fields where characters are broken or touching are the most difficult to recognize.
For the best design of unconstrained fields, provide ample white space for the field with no lines binding the field. If you want to delineate the field (e.g., the courtesy amount on a check), surround the white space for the field with borders printed in drop-out color. The following illustrations are examples of unconstrained character fields:
If a form has multiple pages, or is double-sided, it is necessary to include page indicators on each page. This will, in most cases, be a page number. PFP can perform pre-recognition on the page number to determine which page is being processed and, therefore, what data to expect.
An alternative page indication method uses rectangles, filled in according to the binary number system, to signal to the recognition system which page is being read.
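The binary rectangle scheme can be sketched as follows. This Python example assumes that a filled rectangle stands for 1, that the leftmost rectangle is the most significant bit, and that page numbering starts at 1. None of these conventions is specified in this guide, and the function names are hypothetical.

```python
def page_pattern(page: int, boxes: int) -> list:
    """Return one boolean per rectangle (True = filled) encoding the page number."""
    if not 0 < page < 2 ** boxes:
        raise ValueError("page number does not fit in the given number of boxes")
    # format(..., "0Nb") gives the binary digits, zero-padded to N places.
    return [bit == "1" for bit in format(page, f"0{boxes}b")]


def read_page(pattern: list) -> int:
    """Decode a list of filled/empty rectangles back into a page number."""
    return int("".join("1" if filled else "0" for filled in pattern), 2)
```

Under these assumptions two rectangles are enough for a three-page form: page 1 is empty-filled, page 2 is filled-empty, and page 3 is filled-filled.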
The following is an example of the page indicators used on a three-page form:
Character - A letter or number that is to be read and converted into data by the recognition system.
Character boxes - A data field design where a box is designated for each character in the field.
Check boxes - A box used for yes/no or multiple choice selections whereby any data within the box is treated as a "yes" response by the system. Check boxes use mark sense recognition to determine whether the box has been checked or not. Also known as mark sense boxes.
Constraints - Limitations placed on the data allowed in a field. Physical constraints are lines or boxes placed on a document that confine the position and size of characters on the page. The more constrained the data, the faster and more accurate the recognition.
Data Field - A set of data located in a particular region of the form that is to be read as a whole, such as a name or telephone number.
Drop-out Ink - A color ink (typically 30 parts white to 1 part color) which is invisible to the scanner and therefore is not data to be picked up by the recognition system for processing. Recommended for character boxes, check boxes, punctuation, and other items on a form in or near an area to be recognized.
Duplex scanner - A scanner that reads double-sided forms.
Field - A natural grouping of characters (i.e. letters or numbers); a set of data located in a particular region of the form that is to be read as a whole, such as a name or telephone number.
Field label - Words or phrases next to a field that explain what type of data should be entered in the field; a field label can also act as a field locator for the recognition system.
Form ID - A unique code used by some recognition systems that identifies a form; the recognition system uses that code to identify what form it is currently reading when different forms are intermixed in a single batch.
Form orientation - The size of the form and the direction in which it is laid out and manually placed in the scanner, i.e. landscape or portrait.
Identifier - The ID code which identifies the form; the code which identifies the page number in multiple page forms.
Intelligent Character Recognition (ICR) - A recognition engine which is capable of reading hand-printed and machine printed information.
Isolated - A field type where each character position is clearly defined and is clearly separated from the other characters in that field.
Mark-sense box - Typically used to indicate a choice with an 'X' or a check mark within the field area; also known as a check box.
Non-linear distortion - Used to describe distortion that cannot be corrected by de-skewing or de-shifting. Typically this distortion is caused by the stretching or pulling of paper as it is processed by a fax machine. It is also used to describe "unpredictable" distortion, such as water marks, ink splotches, or misprinted forms.
Page ID - Codes using the binary number system that identify the page number in a multi-page form.
Recognition field - A data field which is to be read by a recognition system.
Recognition zone - An area around a recognition data field that is free of other data. Recognition zones can contain more than one line (or field) of data.
Registration marks - Marks or symbols preprinted on a form to help the scanning software ensure that the form was correctly positioned in the scanner. Also known as targets.
Semi-constrained - A field type where each character position is well defined but not necessarily isolated.
Simplex scanner - A scanner that reads single-sided forms.
Skew - Crooked, distorted; all scanners skew and rotate forms as they travel through the feeder mechanism causing distortions in the scanned image.
Targets - Marks used by the system to straighten out distortions in the scanned image caused by skewing and rotation of forms as they travel through the feeder mechanism; also known as registration marks or registration targets.
Unconstrained - A field which does not contain any lines or boxes restricting the position of each character entered.
White Space - A white area around meaningful data which provides a "buffer", ensuring that the recognition system can locate data easily even if the document is scanned at a slight angle or offset.
Copyright ©1999-2087 P.C. Networks, Inc. Last Modified: 05/13/2008 12:35:58 PM