SGML::ElementMap Implementation

for DC Perl Mongers, 4 February 2003

Documentation, Code, Samples
  • ElementMap.pm: The main module object
  • Layered.pm: The stack-hash data structure
  • test.pl: main test code
  • layered-test.pl: Hash::Layered specific tests
  • Driver.pm: the Driver base class (does little)
  • Grove.pm: the Grove driver (it's the simplest of the drivers because it has the whole document in memory as a tree)
  • EventQueue.pm: common base class for the event drivers: SGMLS and XMLParser
  • SGMLS.pm: the SGMLSpm driver
  • XMLParser.pm: the XML::Parser::Expat driver (it has to be Expat because it feeds the document through piecewise in order to get control back from the parser; it isn't supposed to be pretty)
  • identity.pm: implementation of code to output a document just as it was input. It shows how to access all the parser data you can get
  • identity-run: command to actually run the identity.pm module
  • calc.pm: implementation of the simple calculator example
  • calc-run: command to actually run the calc.pm code
  • calc.dtd: The DTD for the simple calculator document
  • test-simple.calc: input document for the simple calculator example
  • test-simple.calc-out: output for running the simple calculator on the test-simple.calc document
  • trace-calc-1: output for running the simple calculator on the test-simple.calc document with calc's debug tracing
  • trace-calc-2: output for running the simple calculator on the test-simple.calc document with debug tracing active for calc.pm and SGML::ElementMap
  • trace-calc-3: output for running the simple calculator on the test-simple.calc document with debug tracing for all of calc.pm, SGML::ElementMap and Hash::Layered
  • desc2dtd.pl: older file for generating a DTD from a DTD description document (like a simple XMLSchema)
  • htmltrans: older but substantial example for formatting ASM International article documents into HTML
  • character-old.pm: an old formatting script showing a different style of interacting with the parser (which is still better than directly managing the parser events)
  • character-new.pm: the modern version of the above script

If you are looking at these on your own, note that several of the sample files refer to older versions of this module, some going all the way back to its original incarnation as a helper built on SGML::SPGrove (which is now XML::Grove). The main differences are in how the parsed data is accessed, but there are some cosmetic changes here and there (and, of course, the older files cannot use newer capabilities).

You can get the SGML::ElementMap module and its respective submodules, examples and documentation from this directory. Don't forget to get the highest version. That directory also contains these notes and files compressed into a single archive.

Mention:


The processing model

The same code sample in OmniMark and in ElementMap. This fragment rewrites elements from "<ext.xref pointer='ARTICLE_ID'>content</ext.xref>" to "<ext.xref vol.no='NUM' collection='NUM'>content</ext.xref>"


;; omnimark code
element ext.xref
  local counter junk
    and stream volnum
    and stream colnum
    and switch successful
  output "<%lq"
  repeat over specified attributes as spec-attr
    output " "
    output key of attribute spec-attr
    output "=%"%hv(spec-attr)%""
  again
  activate successful
  reset junk to system-call "%g(idcommand) --format='vol.no=%%v col.no=%%c' --save-output=%g(TempFile) %v(pointer)"
  do unless file "%g(TempFile)" exists
    deactivate successful
    put #error "Warning: auto-generated file %g(TempFile) not found%n"
    increment ErrorCount
  else
    do scan file "%g(TempFile)"
      match "vol.no=" (letter or digit)+ => vol white-space+ "col.no=" (letter or digit)+ => col
         set buffer volnum to "%x(vol)"
         set buffer colnum to "%x(col)"
      else
        deactivate successful
        put #error "Warning: auto-generated file %g(TempFile) is invalid: [%v(pointer)]%n"
        increment ErrorCount
    done
    ;reset junk to system-call "rm %g(TempFile)"
  done
  do when not active successful
    set buffer volnum to "unknown"
    set buffer colnum to "unknown"
  done
  output " vol.no=%"%g(volnum)%" collection=%"%g(colnum)%""
  output ">%c"
  output "</%lq>"

Omnimark


# this is just thrown together from the omnimark code: there might be errors
$e->element('EXT.XREF', sub {
    my ($engine, $element) = @_;
    my ($successful, $line, $volnum, $colnum);
    my $attrs = $element->{'Attributes'};
    # copy the start tag and its existing attributes
    $output->print("<" . $element->{'Name'});
    foreach (keys %$attrs) {
        $output->print(" " . $_ . '="' . $attrs->{$_} . '"');
    }
    # list-form system() bypasses the shell, so the format needs no quoting
    system($idcommand, "--format=vol.no=%v col.no=%c",
           "--save-output=".$TempFile, $attrs->{'pointer'});
  OK: {
      $successful = 0;
      if (! -f $TempFile) {
          warn "Warning: auto-generated file ".$TempFile." not found\n";
          $ErrorCount += 1;
          last OK;
      }
      # read the first line of the generated file
      if (open(my $fh, '<', $TempFile)) {
          $line = <$fh>;
          close($fh);
      }
      if ($line && $line =~ m/vol\.no=(\w+)\s+col\.no=(\w+)/) {
          $volnum = $1;
          $colnum = $2;
      } else {
          warn "Warning: auto-generated file " . $TempFile .
              " is invalid: [" . $attrs->{'pointer'} . "]\n";
          $ErrorCount += 1;
          last OK;
      }
      $successful = 1;
  }
    if (!$successful) {
        $volnum = $colnum = 'unknown';
    }
    $output->print(' vol.no="'.$volnum.'" collection="'.$colnum.'">');
    $engine->process_content;
    $output->print("</" . $element->{'Name'} . ">");
});

Other processors


Should read the SGML::ElementMap documentation and start looking at the ElementMap.pm code

Why use constants for object data reference?

What do we do with handlers?

What do the main objects look like? (Notice the colons. This is a kind of structure-describing pseudo-Perl. Nothing formal or correct.)


mode : {
  'handler_type' => handler_set : {
          'NAME' or '' => handler_pair : [ pattern, handler_ref ] } }

$mode = { '_ MODENAME ' => 'FOO',
          '_ FINALIZE ' => '',
          'Element' => {
              'PARA' => [ '.*/SECTION/.*', \&section_para ],
              '' => [ '', \&no_handler_warning ] },
          'CData' => {
              '' => [ '', \&data_accumulate ] },
      };

Mode
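
One way a mode of this shape could be consulted at dispatch time is sketched below. The function name find_handler, the fall-back to the '' entry, and the choice to match the pattern as a regular expression against the whole context path (including the current element) are all assumptions made for illustration; the real lookup in ElementMap.pm may differ.

```perl
use strict;
use warnings;

# Hypothetical handlers standing in for real ones.
sub section_para       { return 'section_para' }
sub no_handler_warning { return 'no_handler_warning' }

my $mode = {
    'Element' => {
        'PARA' => [ '.*/SECTION/.*', \&section_para ],
        ''     => [ '',              \&no_handler_warning ],
    },
};

# Assumed rule: try the entry for the node name and use it only if its
# pattern matches the whole context path; otherwise use the '' entry.
sub find_handler {
    my ($mode, $type, $name, $path) = @_;
    my $set = $mode->{$type} or return;
    my $pair = $set->{$name};
    if ($pair) {
        my ($pattern, $handler) = @$pair;
        return $handler if $pattern eq '' || $path =~ /\A(?:$pattern)\z/;
    }
    $pair = $set->{''} or return;
    return $pair->[1];
}

my $h1 = find_handler($mode, 'Element', 'PARA', 'DOC/SECTION/PARA');
my $h2 = find_handler($mode, 'Element', 'PARA', 'DOC/PARA');
print $h1->(), "\n";    # the PARA handler: the path contains /SECTION/
print $h2->(), "\n";    # the '' fall-back handler
```

Keeping the pattern next to the handler in one pair means a single hash lookup by name finds both the test and the code to run.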


$main = [
     $state_data,
     $all_modes,
     $global_vars,
     $stack_vars
 ];

$state_data = [
     driver : SGML::ElementMap::Driver
     node_path : ''
     handler_modes : [ $mode, $mode_2, $mode_3, ... ]
     handler_mode_stack : [ $mode_set_1, $mode_set_2, ... ]
     named_handlers : { 'NAME' => \&handler }
     last_gen_name : 'aaa'
 ];

$all_modes = { 'MODE_NAME_1' => $mode,
               'MODE_NAME_2' => $mode_2 };

$global_vars = { 'NAME' => $some_value };

$stack_vars = Hash::Layered;

Why global variable support?

Why stack variable support?

Drivers

Different processors need different interfaces to work with them


# these can default to Driver methods
$d->input($type);  # 'file' 'literal' 'handle' etc.
$d->markup($type); # 'xml' or 'sgml'
$d->parser($parser_object);
$d->process_xml_file($elementmap, $file, @handler_args);
$d->process_sgml_file($elementmap, $file, @handler_args);
# these must be implemented in Driver sub-classes
$d->process(...);
$d->reparent_current_subtree($new_el_name, @attribute_pairs);
$d->reparent_subtree($new_el_name, @attribute_pairs);
$d->dispatch_subtrees($elementmap, $pattern, @handler_args);
$d->skip_subtrees();
$d->context_path();

The simple event-based drivers have a lot in common, so they share a base class: Driver::EventQueue
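
The split between methods that can default to the base class and methods a driver must supply could look roughly like the sketch below. The package names Sketch::Driver and Sketch::Driver::Toy, the accessor bodies, and the arguments passed to process are hypothetical; only the method names come from the list above.

```perl
use strict;
use warnings;

package Sketch::Driver;    # hypothetical stand-in for the Driver base class

sub new {
    my ($class) = @_;
    return bless { markup => 'xml', input => 'file' }, $class;
}

# Defaults a subclass can inherit: simple get/set accessors.
sub markup { my $self = shift; $self->{markup} = shift if @_; return $self->{markup} }
sub input  { my $self = shift; $self->{input}  = shift if @_; return $self->{input} }

# Convenience wrapper built on whatever process() the subclass provides.
sub process_xml_file {
    my ($self, $elementmap, $file, @handler_args) = @_;
    $self->markup('xml');
    $self->input('file');
    return $self->process($elementmap, $file, @handler_args);
}

# Must be overridden: the base class has no parser to drive.
sub process      { die ref(shift) . " must implement process\n" }
sub context_path { die ref(shift) . " must implement context_path\n" }

package Sketch::Driver::Toy;    # hypothetical subclass
our @ISA = ('Sketch::Driver');

sub process {
    my ($self, $elementmap, $file, @handler_args) = @_;
    return "processed $file as " . $self->markup;
}

package main;

my $d = Sketch::Driver::Toy->new;
print $d->process_xml_file(undef, 'doc.xml'), "\n";
my $ok = eval { $d->context_path; 1 };
print "context_path not implemented\n" unless $ok;
```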


Hash::Layered

Sample execution:


$h->set_default('cascade');
$h->{'a'} = 31;
$h->{'b'} = 32;
$h->{'c'} = 33;
$h->push;
cascade      a   b   c   d   e
default     31  32  33
default

assert($h->{'a'} == 31);
$h->set_layer('opaque');
assert(! defined $h->{'a'});
$h->{'c'} = 34;
$h->{'d'} = 35;
cascade      a   b   c   d   e
default     31  32  33
opaque              34  35

$h->set_layer('default');
assert($h->{'a'} == 31);
assert($h->{'b'} == 32);
assert($h->{'c'} == 34);
assert($h->{'d'} == 35);
$h->push;
$h->{'a'} = 36;
$h->set_layer('oneway');
$h->{'e'} = 37;
$h->{'a'} = 38;
cascade      a   b   c   d   e
default     36  32  33
default             34  35
oneway      38              37

assert($h->{'a'} == 38);
assert($h->{'b'} == 32);
assert($h->{'e'} == 37);
$h->pop;
assert(! defined $h->{'e'});
$h->pop;
assert($h->{'a'} == 36);
assert($h->{'b'} == 32);
assert($h->{'c'} == 33);
assert(! defined $h->{'d'});

I want to use the object as a hash reference but still have access to object methods. I initially tried this with a single object; however, that did not work. I don't have notes, unfortunately, but I think the issue was getting at the underlying data structure to work with. With a single object, it is harder for a method to tell, when it is called, whether it needs to call tied() (note that the hash ref and the object ref would be blessed into the same class, so ref() won't help). Using two objects makes this very easy.
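
A minimal sketch of the two-object arrangement, assuming nothing about the real Layered.pm beyond what the paragraph above says: the package names Sketch::Layers and Sketch::Handle and the methods push_layer, pop_layer, add_layer and drop_layer are hypothetical, and the layering is reduced to plain top-down reads with writes always landing in the top layer.

```perl
use strict;
use warnings;

package Sketch::Layers;    # hypothetical tie class: holds the real data

sub TIEHASH { my ($class) = @_; return bless { layers => [ {} ] }, $class }

# Reads search from the top layer down (plain cascade only in this sketch).
sub FETCH {
    my ($self, $key) = @_;
    for my $layer (reverse @{ $self->{layers} }) {
        return $layer->{$key} if exists $layer->{$key};
    }
    return undef;
}

# Writes always land in the top layer here, which is simpler than the
# behaviors shown in the sample execution.
sub STORE { my ($self, $key, $val) = @_; $self->{layers}[-1]{$key} = $val }

sub add_layer  { my ($self) = @_; push @{ $self->{layers} }, {}; return }
sub drop_layer { my ($self) = @_; pop @{ $self->{layers} } if @{ $self->{layers} } > 1; return }

package Sketch::Handle;    # hypothetical front object: blessed ref to the tied hash

sub new {
    my ($class) = @_;
    my %hash;
    tie %hash, 'Sketch::Layers';
    return bless \%hash, $class;
}

# A method here always receives the outer object, so it always calls tied().
sub push_layer { my ($self) = @_; tied(%$self)->add_layer;  return $self }
sub pop_layer  { my ($self) = @_; tied(%$self)->drop_layer; return $self }

package main;

my $h = Sketch::Handle->new;
$h->{a} = 31;          # goes through STORE
$h->push_layer;
$h->{a} = 36;          # shadows 31 in the new top layer
print $h->{a}, "\n";   # 36
$h->pop_layer;
print $h->{a}, "\n";   # 31
```

Because the two classes are distinct, hash access always reaches the tie class and method calls always reach the front class, so no method has to guess which kind of reference it was handed.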

(Note: Haven't converted to use sub constants for object fields.)

Have two places for behavior settings

Layers have IDs

intervening_layer($target_index, $is_write)

behaviors for the hash

OK, OK, how does it work?


$layered_hash = [
   $default_layer_state : 'cascade'
   $layer_data_count : -1
   $layer_data_list : [ $layer_data_1, $layer_data_2, ... ]
   $var_val_stack_hash : { }
   $iter_data : [ [ keys], key_index, intervening_layer_for_reads]
];

$layer_data = [
   $sub_id,
   $behavior,
   'VAR1',   # list of all variables that have values in this layer
   'VAR2',
   ...
];

$var_val_stack_hash = {
  'VAR1' => [ $layer_index_1, $val_1,  $layer_index_2, $val_2,  ... ]
  ...
};

Huh?

Lookup of a key
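
A rough sketch of a read against the structures above. The read rule is inferred from the sample execution (an opaque layer hides values stored in the layers beneath it), and the helper name blocking_layer is hypothetical; it only illustrates the kind of check intervening_layer presumably performs. The data is simplified to one behavior per layer and the flat index/value list per key.

```perl
use strict;
use warnings;

# Assumed shapes, simplified from the structures above:
#   @behavior : behavior name per layer index
#   %stack    : key => [ layer_index, value, layer_index, value, ... ]
my @behavior = ('cascade', 'opaque');
my %stack = (
    a => [ 0, 31 ],
    b => [ 0, 32 ],
    c => [ 0, 33, 1, 34 ],
    d => [ 1, 35 ],
);

# Hypothetical helper: index of the lowest opaque layer above $index, or -1.
sub blocking_layer {
    my ($index) = @_;
    for my $i ($index + 1 .. $#behavior) {
        return $i if $behavior[$i] eq 'opaque';
    }
    return -1;
}

# Assumed read rule: the newest entry for the key wins, unless an opaque
# layer sits between that entry and the top of the stack.
sub lookup {
    my ($key) = @_;
    my $pairs = $stack{$key} or return undef;
    my ($index, $value) = @{$pairs}[-2, -1];
    return undef if blocking_layer($index) >= 0;
    return $value;
}

for my $key (qw(a c d)) {
    my $value = lookup($key);
    print "$key=", (defined $value ? $value : 'undef'), "\n";
}
```

With the second layer opaque, this prints a=undef, c=34 and d=35, matching the state shown after set_layer('opaque') in the sample execution.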

Iteration