Automatic categorization: how it works, related issues, and impacts on records management.

Author: Lubbes, R. Kirk
Position: Cover Story - Statistical Data Included

A records manager's primary responsibility has always been to process unstructured data. The increase in unstructured documents and the rise in the portion of the material that is electronic have created an environment where the records manager can no longer manage records without having new, automated tools at their command.

Automatic categorization is currently being applied to electronic records management. Anyone hoping to effectively apply categorization needs to understand how automatic categorization works, its benefits, its limitations, and the potential impact it has on recordkeeping operations. Ultimately, automatic categorization and other text analytical tools will provide potential new career opportunities for records managers.

In order to better understand the process of automatic categorization, these key terms should be defined:

Categorization: The assigning of an object to a pre-existing subject heading in a file plan or assigning it to a given class within the taxonomy (also called classification)

Cluster: A group of objects with members that are more similar to each other than to members of any other group

Data visualization: A visual representation of corpus contents, often a topographical map or network of linked nodes

Structured data: Fielded data, or data that is generally contained in a relational database

Summarization: An abstract or a synopsis of a document

Unstructured data: Data not contained in fields (e.g., free text, audio, video, and images)
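The definition of a cluster can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions, not the method of any product discussed in this article: it measures similarity as the overlap between the word sets of two documents (Jaccard similarity) and places each document in the first-best group whose members it resembles. The function names, the 0.3 threshold, and the sample documents are all hypothetical.

```python
def tokens(text):
    """Split a document into a set of lowercase words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(docs, threshold=0.3):
    """Greedy single-pass grouping: a document joins the existing cluster
    containing its most similar member, provided that similarity reaches
    the (assumed) threshold; otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of document indexes
    for i, doc in enumerate(docs):
        doc_tokens = tokens(doc)
        best, best_sim = None, threshold
        for members in clusters:
            sim = max(jaccard(doc_tokens, tokens(docs[j])) for j in members)
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


# Made-up sample documents for illustration only.
docs = [
    "invoice payment due vendor",
    "vendor invoice payment overdue",
    "employee leave request form",
    "leave request employee approval",
]
groups = cluster(docs)
print(groups)
```

In this sketch the two invoice documents end up in one group and the two leave-request documents in another, because members of each group share more words with each other than with members of the other group. No subject headings are supplied in advance, which is what distinguishes clustering from categorization as defined above.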

Over the last two decades, the computer's ability to process data has expanded from the domain of structured data to unstructured data. Structured data can include a series of tables with rows and columns. A formal mathematical model, the relational model, defines the table structures and the complete set of operations that can be performed on the data.

Structured data represents less than 20 percent of the information available. More than 80 percent of all information resides in unstructured documents. Initially, this data could not be processed in its native form. Data elements contained in documents had to be extracted and entered into structured databases before they could be processed. The primary raison d'être for forms is to be able to easily enter data into database management systems (DBMS). Products exist today, including intelligent character recognition (ICR), that "read" data from forms, but this technology is actually processing structured data. For example, ICR depends upon the position of information in the form to determine the DBMS field into which the data is to be entered.
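The point that forms processing is really structured-data processing can be illustrated with a short sketch. The Python below is a simplified assumption of how position-driven extraction might look, not a description of any actual ICR product: the form layout, field names, and sample page are hypothetical, and the recognized text is modeled as plain lines of characters rather than a scanned image.

```python
# Hypothetical form layout: field name -> (row, start column, end column).
# The position on the page, not the meaning of the text, selects the field.
FORM_LAYOUT = {
    "last_name": (0, 0, 15),
    "first_name": (0, 15, 30),
    "date_filed": (1, 0, 10),
}


def extract_fields(page_lines, layout):
    """Copy the text found at each fixed position into its database field."""
    record = {}
    for field, (row, start, end) in layout.items():
        line = page_lines[row] if row < len(page_lines) else ""
        record[field] = line[start:end].strip()
    return record


# Made-up recognized text from one filled-in form.
page = [
    "Doe".ljust(15) + "Jane".ljust(15),
    "2001-05-14",
]
record = extract_fields(page, FORM_LAYOUT)
print(record)
```

Because the layout table fixes where every value lives, the same text shifted a few columns to the right would land in the wrong field or be cut off. Free text such as a letter or an e-mail message offers no such positions to rely on.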

A records manager's primary responsibility has always been to process unstructured data, generally hardcopy documents. To create a file plan, the records manager analyzes a collection of documents and creates a taxonomy that adds structure to the collection. Assigning documents to a file requires an indexing clerk to extract keywords from the document. The creation of a records control schedule requires the records manager to extract the business and legal relevance of a file series. According to a report from Autonomy Corp., the volume of unstructured information is estimated to double every three months. The rise in the portion of the material that is electronic has created an environment where the records manager can no longer manage records without having new tools at their command. These tools, as well as their advantages and limitations, are discussed later in the article. The focus will be primarily on text-based, electronic records, including e-mail, Web URL documents, Word documents, PDF files, and text documents.

The Electronic Records Environment

An organization that has implemented a standard electronic file structure that is universally followed is extremely fortunate. In most organizations, each person has their own directory structure and e-mail folder structure. Some companies have implemented electronic records management systems (RMS), but most have used a day-forward approach, in which all newly generated and received records are placed under the automated RMS on a certain date. Electronic files on existing servers and electronic records in off-line storage (back-office files) are rarely addressed. Generally, metadata does not exist to place the back-office files under RMS control, and surveying the corpus is cost prohibitive. However, these documents are just as vulnerable to discovery as the newer documents.

Automatic categorization attempts to associate electronic records with either a predefined taxonomy or self-defining categories. An understanding of the strengths and potential limitations of automatic categorization in managing records is important if it is to be used successfully. A number of text analysis tools act as a suite to assist in this process. These include feature extraction, clustering, visualization, and summarization tools. Commercial off-the-shelf (COTS) products often combine these tools into a single categorization product...
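As a rough illustration of categorization against a predefined taxonomy, the Python sketch below extracts word counts from a record (a very simple form of feature extraction) and assigns the record to the subject heading whose keywords it mentions most often. Everything here is an assumption made for illustration: the taxonomy headings, keyword lists, and function names are hypothetical, and commercial products generally rely on far richer statistical or linguistic models.

```python
import re
from collections import Counter

# Hypothetical file plan: subject heading -> keywords that signal it.
TAXONOMY = {
    "Finance/Invoices": {"invoice", "payment", "vendor"},
    "HR/Leave": {"leave", "vacation", "employee"},
}


def extract_features(text):
    """Feature extraction: count each lowercase word in the record."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def categorize(text, taxonomy):
    """Return the heading with the highest keyword count, or None when no
    keyword appears (the record is left for manual review)."""
    features = extract_features(text)
    scores = {
        heading: sum(features[word] for word in keywords)
        for heading, keywords in taxonomy.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


filed = categorize("Please approve the vendor invoice; payment is due.", TAXONOMY)
unfiled = categorize("Lunch on Friday?", TAXONOMY)
print(filed, unfiled)
```

The second record matches no keyword and is returned as uncategorized rather than forced into a heading. Even this toy version shows one practical limitation: the result is only as good as the taxonomy and the keywords chosen to represent it.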
