So you want to implement automatic categorization?

Automatic categorization can be a powerful tool despite its limitations, but it is still important to test and evaluate it before committing to its use.

Author: Lubbes, R. Kirk
Position: TechTrends - Records managers play important role in implementing automatic categorization

At the Core

This article

* presents a high-level overview of automatic categorization technology

* explains the benefits and limitations of various automatic categorization approaches

* describes the key role records managers play in successfully implementing automatic categorization

The massive amounts of data available through the Internet, extranets, and internal corporate databases have created the need for new techniques to organize information. An excellent example is automatic categorization, an information management tool designed to assist enterprises in filing and retrieving the vast numbers of electronic records that they generate or use today. Automatic categorization attempts to assign electronic records to either predefined file structures or to self-defined categories through computer-based processes.

Successful use of automatic categorization requires the melding of technical and records management perspectives. Insight into the theory behind the various vendor implementations and the benefits and limitations of each, along with an understanding of what is "under the covers," will help records and information managers make intelligent decisions when selecting and implementing automatic categorization.

Automatic categorization technology has two principal approaches: pattern matching and rule-based systems. Pattern matching systems use word patterns and concepts within the electronic record to associate the record with a predefined file structure. Pattern matching systems can be further divided based on the technique used to associate patterns with a given category. The four principal pattern matching techniques used by vendors today are k-nearest neighbor, Bayesian, neural networks, and support vector machines.

Rule-based systems depend on a user-specified set of rules to associate the occurrence (or exclusion) of names, phrases, or concepts contained in documents with specific file plan subject headings. The computer parses the document, identifies the user-specified entities, and assigns the document to the appropriate category based upon the rule set.
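To make the rule-based idea concrete, the following is a minimal sketch in Python, not any vendor's implementation. The `Rule` class, the `categorize` function, and the sample headings and phrases are all hypothetical names invented for illustration. It assumes a rule fires when every required phrase occurs in the document and no excluded phrase does, and it uses plain case-insensitive substring matching in place of real entity recognition.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical rule: file under `heading` when every required phrase
    occurs in the document and no excluded phrase occurs."""
    heading: str
    required: list[str]
    excluded: list[str] = field(default_factory=list)


def categorize(text: str, rules: list[Rule]) -> list[str]:
    """Return the headings of all rules that the document satisfies."""
    lowered = text.lower()
    matches = []
    for rule in rules:
        has_required = all(p.lower() in lowered for p in rule.required)
        has_excluded = any(p.lower() in lowered for p in rule.excluded)
        if has_required and not has_excluded:
            matches.append(rule.heading)
    return matches


# Illustrative rule set; the headings and phrases are invented for this example.
RULES = [
    Rule("Contracts", required=["agreement", "party"]),
    Rule("Invoices", required=["invoice"], excluded=["draft"]),
]

print(categorize("This Agreement binds each party to the terms.", RULES))
print(categorize("DRAFT invoice for March services", RULES))
```

The first document satisfies the "Contracts" rule. The second matches the required phrase for "Invoices" but is blocked by the exclusion, so it is left unfiled. Substring matching is a deliberate simplification: it would also match "party" inside "counterparty", which a real parser would need to avoid.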

Pattern Matching Approaches

Pattern matching categorization requires providing the system with representative sample documents for each subject heading in the file plan. Using the sample documents, the categorization software generates an internal representation for each subject heading. The software compares any new documents entering the system to internal subject heading representations and assigns the new document to the subject where it fits best. There are two phases to this process: the training phase, which consists of providing sample documents, and the classification phase, in which new documents are assigned.

The training phase requires the records manager to identify document sets that represent each subject heading in the file plan. This is the training set. Identifying a training set is an empirical problem, one in which the records manager's knowledge of existing records and the current file structure is critical to automatic categorization's success. The manner in which the training set is used differentiates each of the four pattern matching techniques.

In the classification phase, new documents entering the system are assigned to one or more categories using algorithms fine-tuned during the training phase. This category assignment is equivalent to the records manager's indexing the document in order to assign it to a specific subject heading.
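The two phases can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of how any product works: the internal representation is assumed to be a simple word-count profile per subject heading, "best fit" is assumed to mean the highest cosine similarity, and the function names (`train`, `classify`) and the sample file plan are hypothetical.

```python
import math
from collections import Counter


def word_counts(text: str) -> Counter:
    """Represent a document as lowercase word counts (a simplifying assumption)."""
    return Counter(text.lower().split())


def train(training_set: dict[str, list[str]]) -> dict[str, Counter]:
    """Training phase: merge each heading's sample documents into one profile."""
    profiles = {}
    for heading, samples in training_set.items():
        profile = Counter()
        for sample in samples:
            profile.update(word_counts(sample))
        profiles[heading] = profile
    return profiles


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text: str, profiles: dict[str, Counter]) -> str:
    """Classification phase: pick the heading whose profile fits best."""
    doc = word_counts(text)
    return max(profiles, key=lambda heading: cosine(doc, profiles[heading]))


# Invented training set: two headings with two sample documents each.
TRAINING_SET = {
    "Personnel": ["employee leave request form", "employee salary review"],
    "Procurement": ["purchase order for office supplies", "vendor purchase contract"],
}

profiles = train(TRAINING_SET)
print(classify("new purchase order from vendor", profiles))
```

The quality of the result depends entirely on how representative the sample documents are, which is why the choice of training set falls to the records manager.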

Software developers currently use four primary methods to assign documents to subject headings (categories). The methods are drawn from various mathematics and computer science disciplines. The k-nearest neighbor algorithm is based upon algebra and geometry. Bayesian modeling uses probability theory. Neural networks are an outgrowth of the computer science field of artificial intelligence. Support vector machines (SVM) are founded on machine learning theory.

K-nearest Neighbor

K-nearest neighbor is the easiest categorization approach to understand because the mathematics it uses has a physical analog in the real world.

In k-nearest neighbor, the records manager constructs a training set. The categorization software produces an internal representation in which each document in the training set is a point on a graph. The training set clusters graphic below shows the three-dimensional graph of a training set as produced by the product SERprivateBrain Learnset Viewer. The viewer allows humans to visualize a training set the same way that the software does. The points, representing the documents in each category, form groupings called clusters.

In the training set clusters example, the file plan has five categories, each containing documents created from one of five books (The Age of Reason, Holy Bible, Dracula, Moby Dick, and Zarathustra). There are five clusters--one for each book title--represented by different colors as shown in the legend. A cluster exists for each category (as it would for each file plan heading). In the k-nearest neighbor approach, the software generates a sphere that contains the documents (points) in the subject heading (cluster) and calculates the center of the sphere. The simulated spherical k-nearest neighbor boundaries graphic above shows spheres drawn around the clusters representing Moby Dick and The Age of Reason to illustrate this concept. The center of the sphere is called the centroid. The centroid represents the subject heading in the file plan to the computer.

New documents are filed (categorized) during the classification phase. The radius of each sphere defines the maximum distance that a new document's representative point can be from that subject heading's centroid and still be filed under it. When the point associated with a new document falls within a cluster's sphere, the document is filed under the associated subject heading.
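The centroid-and-sphere procedure described in this section can be sketched as follows. The sketch follows the description given here rather than the neighbor-voting formulation of k-nearest neighbor found in textbooks. Several details are assumptions: the coordinates are made up, since the way documents are mapped to points is not specified; the radius is assumed to be the distance from the centroid to the farthest training point; and the function names are hypothetical.

```python
import math

Point = tuple[float, float, float]


def centroid(points: list[Point]) -> Point:
    """Average the coordinates of a cluster's points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def build_spheres(training_points: dict[str, list[Point]]) -> dict[str, tuple[Point, float]]:
    """Training: compute one (centroid, radius) sphere per subject heading.

    Assumption: the radius reaches the farthest training point in the cluster.
    """
    spheres = {}
    for heading, points in training_points.items():
        center = centroid(points)
        radius = max(math.dist(center, p) for p in points)
        spheres[heading] = (center, radius)
    return spheres


def file_document(point: Point, spheres: dict[str, tuple[Point, float]]) -> list[str]:
    """Classification: return every heading whose sphere contains the point."""
    return [
        heading
        for heading, (center, radius) in spheres.items()
        if math.dist(center, point) <= radius
    ]


# Invented coordinates for two of the five clusters in the example.
TRAINING_POINTS = {
    "Moby Dick": [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)],
    "The Age of Reason": [(10.0, 10.0, 10.0), (12.0, 10.0, 10.0), (10.0, 12.0, 10.0), (12.0, 12.0, 10.0)],
}

spheres = build_spheres(TRAINING_POINTS)
print(file_document((1.5, 1.0, 0.0), spheres))  # inside the Moby Dick sphere
print(file_document((5.0, 5.0, 5.0), spheres))  # outside both spheres
```

Returning a list rather than a single heading reflects two edge cases in this scheme: a point that falls inside overlapping spheres would be filed under more than one heading, and a point outside every sphere is left unfiled.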

Bayesian Modeling

The Bayesian modeling approach is based on the concept that knowledge about the distribution of previous outcomes helps determine the probability of current outcomes. In automatic categorization, Bayesian modeling asserts that if the assignment of a set of documents from the corpus (the training set)...
