SGML::ElementMap


Perl module for hierarchical, context-oriented processing of SGML and XML data


Synopsis

Description

The ElementMap module lets the programmer set up an event-based processing system for SGML/XML data that follows the element structure of the document. A single function can then handle both the setup and the cleanup for a given element's content, which makes the overall process more readable.

To use the module, simply make an instance, install handler functions and pass it a grove object. Each instance is capable of processing any number of SGML documents, but can only handle a single document at a time. Handler functions may be shared across any number of instances.

When processing a grove, the engine walks the document structure and, at each grove node, attempts to look up an appropriate handler function. The first handler found is passed the node information. That handler will eventually call a content processing function, passing control back to the engine for any content nodes. After the content has been processed, the engine returns to the handler. When the handler is finished, it returns to the engine, which continues on to the next content node at that level.

The Perl module Data::Locations is very useful for more complex processing, provided you can afford to build your output in memory.

Modes

A mode is a group of handlers. The engine instance maintains a list of modes and a list of currently active modes, which it searches in order for a matching event handler during the lookup process. Installing a handler places it in each of the currently active modes. Modes may be changed at any time during event processing. Note that a less specific handler in one mode will override more specific handlers in later modes. A mode is referenced with a string scalar, which is intuitively the name or ID of the mode. The engine initializes the mode list to a single entry named DEFAULT, and creates new modes automatically for new names.

@PREV_MODES = $obj->mode_set(MODE,MODE,...);

This function replaces the current list of modes with a new list, and returns the previous list of modes (for later reinstatement, perhaps).

$obj->mode_set_push(MODE [,MODE,...] );

This function takes a group of modes and makes it the current list, like mode_set does, but saves the previous mode list on a stack.

$obj->mode_set_pop( [ MODE [ , MODE [ ,... ] ] ] );

This function removes the current list of modes and replaces it with the modes that were in effect before the corresponding call to mode_set_push. If mode names are supplied, then they must match exactly the mode list being removed or the module will issue a "die" error message. Returns the removed list of modes.

$obj->mode_push(MODE [,MODE,...] );

This function takes a group of modes and adds them to the end of the current mode list.

$obj->mode_pop( [ MODE [ , MODE [ ,... ] ] ] );

This function removes and returns the last mode from the current mode list.
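The bookkeeping described above can be mimicked in a few lines of plain Perl. This is only a toy model under stated assumptions, not the module's implementation: the ModeList package, its internal fields, and the modes helper are hypothetical, and only the method names and their described behaviour come from the text above.

```perl
use strict;
use warnings;

# Toy model of the mode-list bookkeeping described above.
# "ModeList" is a hypothetical package, not part of SGML::ElementMap.
package ModeList;

sub new {
    my ($class) = @_;
    # The current list starts with a single mode named DEFAULT.
    return bless { current => ['DEFAULT'], saved => [] }, $class;
}

# Replace the current list and return the previous one.
sub mode_set {
    my ($self, @modes) = @_;
    my @prev = @{ $self->{current} };
    $self->{current} = [@modes];
    return @prev;
}

# Replace the current list, saving the previous one on a stack.
sub mode_set_push {
    my ($self, @modes) = @_;
    push @{ $self->{saved} }, $self->{current};
    $self->{current} = [@modes];
    return;
}

# Restore the list saved by the matching mode_set_push.
# If names are given, they must match the list being removed.
sub mode_set_pop {
    my ($self, @expect) = @_;
    my @removed = @{ $self->{current} };
    die "mode_set_pop: no saved mode list\n" unless @{ $self->{saved} };
    die "mode_set_pop: mode list mismatch\n"
        if @expect && join("\0", @expect) ne join("\0", @removed);
    $self->{current} = pop @{ $self->{saved} };
    return @removed;
}

# Append modes to the end of the current list.
sub mode_push {
    my ($self, @modes) = @_;
    push @{ $self->{current} }, @modes;
    return;
}

# Remove and return the last mode of the current list.
sub mode_pop {
    my ($self) = @_;
    return pop @{ $self->{current} };
}

# Hypothetical accessor, used here only to inspect the model.
sub modes { return @{ $_[0]{current} } }

package main;

my $m = ModeList->new;
$m->mode_set_push('toc');                 # current list: toc
$m->mode_push('index');                   # current list: toc index
my $last    = $m->mode_pop;               # 'index'
my @removed = $m->mode_set_pop('toc');    # current list: DEFAULT
```

Pairing mode_set_push with mode_set_pop around a handler's content processing keeps a temporary mode change from leaking into sibling elements.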

Variable Stack

The module manages a Data::LayeredHash object to give the programmer an explicit storage mechanism linked to the SGML document tree. This capability is similar to the Perl built-in construct local. Use the module's stack method to access the LayeredHash object. The module implicitly pushes the stack before entering a handler and pops it afterwards.

Inside Handler Functions

Handlers take at least two arguments: the engine object and an object for the event data. This data object is probably a bare hash ref.

A handler function should normally call exactly one of the process_content, suppress_content, or process_content_filter methods to deal with the content of the current SGML/XML object. The first two are self-explanatory. process_content_filter takes a list of arguments suitable for assignment to a hash, creates a separate layer in the variable stack containing those names and values, and then calls process_content. After the content processing, the layer is popped and the resulting values of the named variables are returned in the same order they were originally passed. Note that only the values are returned, not the names.
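The layer-and-return behaviour of process_content_filter can be pictured with a standalone sketch. Everything here is an assumption made for illustration: the @layers array stands in for the variable stack, the code reference stands in for content processing, and the function is a free-standing imitation rather than the module's method.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the variable stack: a list of hash layers.
my @layers = ( {} );

# Imitation of the described behaviour: push a layer holding the given
# names and values, run the content step, pop the layer, and return the
# final values in the order the names were passed.
sub process_content_filter {
    my ($content, @pairs) = @_;
    my @names = @pairs[ grep { $_ % 2 == 0 } 0 .. $#pairs ];
    push @layers, { @pairs };
    $content->();
    my $layer = pop @layers;
    return @{$layer}{@names};
}

# The content step counts three items and records the last one seen.
my ($count, $last) = process_content_filter(
    sub {
        for my $item (qw(a b c)) {
            $layers[-1]{count}++;
            $layers[-1]{last} = $item;
        }
    },
    count => 0,
    last  => '',
);
```

Only the values come back, so the caller relies on position to tell them apart.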

The reprocess_pseudo_element and insert_pseudo_element methods are alternatives to normal handling of an SGML object's content. Both functions take one scalar argument, which is the name of the pseudo-element to create; the rest of the arguments are assigned to the element's attribute value hash. Both calls result in the module doing a handler lookup as though the pseudo-element had just been reached in the processing. The insert call sets the pseudo-element as the content of the current element, and places the original content into the pseudo-element. The reprocess call inserts the pseudo-element as the parent of the object currently being processed, so that the current element becomes the pseudo-element's content. Be wary of creating loops when using reprocess_pseudo_element. Returning from the pseudo-element handler will return from the ..._pseudo_element call just as if it were a process_content call. This functionality is somewhat experimental.

Arguments passed to process_content are in turn passed as extra arguments to all immediate-child handlers. Return values from immediate-child handlers are collected together in a list (all handlers are called in list context) and that collected list is returned from process_content. Note that only directly calling process_content passes arguments to handlers or collects return values.
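A small standalone walker shows the flow of arguments down and return values up. It is a sketch under assumptions, not the engine itself: the node layout (name, text, children), the %handlers table, and the free-standing process_content function are all hypothetical, and the handlers here omit the engine object for brevity.

```perl
use strict;
use warnings;

# Hypothetical handler table keyed by element name.
my %handlers;

# Imitation of process_content: call the handler of each immediate
# child with the extra arguments and collect everything returned.
sub process_content {
    my ($node, @args) = @_;
    my @collected;
    for my $child (@{ $node->{children} || [] }) {
        my $handler = $handlers{ $child->{name} } or next;
        # Handlers are called in list context; results are concatenated.
        push @collected, $handler->($child, @args);
    }
    return @collected;
}

%handlers = (
    list => sub {
        my ($node) = @_;
        # Pass a prefix down to the immediate children, gather results.
        my @items = process_content($node, '* ');
        return join("\n", @items);
    },
    item => sub {
        my ($node, $prefix) = @_;
        return $prefix . $node->{text};
    },
);

my $doc = {
    name     => 'list',
    children => [
        { name => 'item', text => 'first' },
        { name => 'item', text => 'second' },
    ],
};

my $out = $handlers{list}->($doc);
print "$out\n";
```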

Installing Handler Functions

The module provides a set of functions to register handlers. Each function takes two arguments: a context pattern and a code reference or registered name. The context pattern may be a single string or a reference to a list of strings, which is a shortcut for installing several similar handlers. Also, an array reference may be passed in place of a handler. In this case, the first element of the array must be the handler code reference or name and the rest of the list is passed to the handler as the third and following arguments (these arguments precede any arguments passed down from parent handlers).
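The argument order for the array reference form can be illustrated without the module. In this sketch call_handler is a hypothetical dispatcher, and the strings standing in for the engine and event objects are placeholders; only the ordering rule (fixed arguments first, then arguments passed down from the parent) comes from the text above.

```perl
use strict;
use warnings;

# Hypothetical dispatcher: accepts a code reference, or an array
# reference whose first element is the code reference and whose
# remaining elements are fixed extra arguments.
sub call_handler {
    my ($spec, $engine, $event, @parent_args) = @_;
    my ($code, @fixed) = ref $spec eq 'ARRAY' ? @$spec : ($spec);
    # Fixed arguments precede anything passed down from the parent.
    return $code->($engine, $event, @fixed, @parent_args);
}

# A handler that reports the extra arguments it received.
my $show = sub {
    my ($engine, $event, @extra) = @_;
    return join(',', @extra);
};

my $plain = call_handler($show, 'engine', 'event', 'from-parent');
my $fixed = call_handler([ $show, 'bold', 'red' ],
                         'engine', 'event', 'from-parent');
```

Here $plain holds only the parent's argument, while $fixed holds the two fixed arguments followed by the parent's argument.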

The context pattern is used in a regular expression match against the currently open element names separated by slashes, with a couple of caveats: a "//" in a pattern will match any (zero or more) open elements, and the pattern must end with a simple name or "/" (that is, no regular expression special characters may appear after the last slash). A pattern which does not begin with a slash is taken to imply beginning with a double slash.
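One way to picture these rules is as a translation from a context pattern to an anchored regular expression. The following is a minimal sketch, assuming the context string looks like /book/chapter/title; pattern_to_regex is a hypothetical helper, and the module's actual translation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a context pattern into an anchored regex.
sub pattern_to_regex {
    my ($pattern) = @_;
    # A pattern without a leading slash implies a leading "//".
    $pattern = "//$pattern" unless $pattern =~ m{^/};
    # "//" stands for zero or more intervening open elements.
    $pattern =~ s{//}{/(?:[^/]+/)*}g;
    return qr{\A$pattern\z};
}

my $context = '/book/chapter/title';

my $any_title    = $context =~ pattern_to_regex('title')        ? 1 : 0;
my $direct_child = $context =~ pattern_to_regex('/book/title')  ? 1 : 0;
my $descendant   = $context =~ pattern_to_regex('/book//title') ? 1 : 0;
```

With this context, title and /book//title both match, while /book/title does not because chapter sits between the two names.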

A pattern that does not end with a name specifies a "default" handler. Default handlers match only if no non-default handlers match.

When two patterns can match, the handler function installed later will override any handler functions installed earlier, regardless of pattern specificity.

Note that only element handlers should use patterns which end with a name. Other handler types will use other data for the terminal name, and may not behave as expected.

Using Named Handler Functions (a.k.a. built-ins)

Calling the register method with a string and a code reference will save that coderef in a hash in the object. The module checks this hash whenever a handler search returns a string instead of a code reference. This means that a registered name may be used in place of a code reference in any handler installation calls. Two simple handlers are always provided: suppress and process (these may be overridden).
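The name lookup amounts to one extra level of indirection. The sketch below imitates it with plain hashes; %registry, %installed, resolve_handler, and the free-standing register function are hypothetical stand-ins for the object's internals, shown only to make the string-versus-coderef check concrete.

```perl
use strict;
use warnings;

my %registry;    # registered name => code reference

# Imitation of the register call: remember a code reference by name.
sub register {
    my ($name, $code) = @_;
    $registry{$name} = $code;
    return;
}

# A search result is either a code reference or a registered name.
sub resolve_handler {
    my ($found) = @_;
    return $found if ref $found eq 'CODE';
    my $code = $registry{$found}
        or die "no handler registered under '$found'\n";
    return $code;
}

register(shout => sub { return uc $_[0] });

# Hypothetical installation table mixing a name and a code reference.
my %installed = (
    title => 'shout',
    para  => sub { return $_[0] },
);

my $title_out = resolve_handler($installed{title})->('intro');
my $para_out  = resolve_handler($installed{para})->('text');
```

Because the name is resolved at lookup time, re-registering a name changes the behaviour of every pattern that was installed with it.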

Modes do not affect the handlers associated with names, although that may change in the future.

Registered handlers may someday be given privileges over some internal mechanisms, such as stack layer creation, etc.

Debugging with the Module

There are several package level debugging flags that will produce extra output on the module's machinations. Several of these flags are also useful for debugging handler calls. Note that these are package globals, so they do not vary by object instance.

Driver Modules

To facilitate different uses, code managing the interface to the parser or grove resides in interchangeable Driver modules. The module can implicitly select the Grove driver for SGML documents and the XML::Parser driver for XML documents, so the Driver interface can be safely ignored, but note that the Grove driver loads the whole document into memory. There is no separate driver for PerlSAX, because it has no support for incremental parsing. PerlSAX parsers must use the Grove driver.

Context Expressions

The environment string for non-elements ends in a slash to distinguish it from the environment string of the enclosing element.

Environment

This module does not check any environment variables.

Future

Add flag to make the default mode implicitly handled.

Factor out the stack code in favor of general tree walking hooks.

History

Originally written (in April 1998) to use SGMLSpm, soon modified to use my hacked SPGrove module, which has an XS interface to libSP. Updated to use new Perl XML tools in April 2000. While updating, the stack functionality was factored out into a separate module. Development after that point is listed in the Changes file.

Author

Robert Braddock, robert@concordant-thought.com

See Also

perl(1), XML::Grove(3pm), Data::Locations(3pm), Hash::Layered(3pm), XML::ESISParser(3pm), XML::Parser(3pm)