Design Notes

Overview

The NIST PRISE indexer creates an index from text files marked up in SGML, for use by the PRISE search engine. PRISE indexing consists of 8 individual programs that together produce a set of data to be used with the Z39.50 interface. At its core is a two-step process that does not need an explicit sorting step. The first step (rel.build.tmm) produces the basic inverted file; it avoids an explicit sort by using a right-threaded binary tree. The second step (rebuild.tmm) adds the term weights to the inverted file and reorganizes it for maximum efficiency. Below is a description of the 8 programs used in creating a PRISE index.

Verify the input files conform to SGML.
sgmls, an SGML verifier not developed by NIST, checks the input.
Create tokens/terms from the sgmls output.
sgmls.parser returns tokens which are not SGML tags.
Create the basic inverted file from the first set of terms.
rel.build.tmm creates the term tree and temporary posting files, omitting tokens that are common words.
Compute the term weights and create the dictionary.
rebuild.tmm uses the term tree and temporary postings to create the final postings file ("postings"), which contains the term weights, and the ASCII dictionary ("dictionary").
Build the binary dictionary.
prep builds the binary dictionary.
Create a document map and table for displaying results.
docmap creates the document map and document table.
Create a title map and table for displaying titles.
doctitles creates the titles map and titles table.

Create a numbering sequence used only by the command line search tool "search.small".
docmapseq creates the document map sequence (for use with search.small).

Processing Details and Location of Source Code by Step

This section describes in greater detail the major steps executed by the PRISE indexer. Each major step is implemented as a single program. What follows is a short description of each program.

sgmls
- Input: DTD file, SGML-formatted file.
- Output: permuted version of the original text on standard out.
- Source: see sgmls source.
- Program Flow: not available (see Manual pages)

sgmls.parser

Usage: sgmls.parser [-i] -w work_dir [-d data_dir] -p stdin

-i	initialize parser for new collection
-w work_dir	directory area where index will be created
-d data_dir	directory area where input text resides (default: work_dir)
-p pattern	file wildcard pattern for input text files

- Input: tagged text from sgmls, sgmls.actions, options.spec.
- Output: a line for each term containing: the term, its record number, its section number, its word position.
- Source: DISTRIBUTION_ROOT/prise_index/src/bin/sgmls.parser.
- Program Flow:
  - close_log_files: <>
  - write_temp_stats: <>
  - free_context: <>
  - print_versions: <>
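
The output described above (one line per term carrying the term, record number, section number, and word position) could be consumed along the following lines. This is only a sketch: the whitespace-separated field layout, the TermLine struct, and the parse_term_line function are assumptions made for illustration and are not taken from the PRISE sources. The 99-character term limit matches the word size listed under the constraints.

```c
#include <stdio.h>

#define MAX_WORD 99   /* word size limit given in these notes */

/* Hypothetical in-memory form of one sgmls.parser output line. */
typedef struct {
    char term[MAX_WORD + 1];
    long record;     /* record (document) number */
    int  section;    /* section number within the record */
    long position;   /* word position */
} TermLine;

/* Parse "term record section position" (ASSUMED whitespace-separated).
 * Returns 0 on success, -1 if the line does not have all four fields. */
int parse_term_line(const char *line, TermLine *out)
{
    int n = sscanf(line, "%99s %ld %d %ld",
                   out->term, &out->record, &out->section, &out->position);
    return n == 4 ? 0 : -1;
}
```

A consumer such as rel.build.tmm would read these lines from standard input one at a time; the actual field separator and order should be checked against the sgmls.parser source before relying on this.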

rel.build.tmm
- Input: stdout from sgmls.parser, options.spec.
- Output: tpost??, tcollstats, tree.
- Source: DISTRIBUTION_ROOT/prise_index/src/bin/rel.build.tmm.
- Program Flow:
  - rel.build.tmm.c
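
The overview states that the basic inverted file is built without an explicit sort by using a right-threaded binary tree. The following is a minimal sketch of that data structure in general, not the code in rel.build.tmm.c: the TermNode layout and the function names tree_insert, tree_first, and tree_next are hypothetical. In a right-threaded tree, a node with no right child stores a pointer to its in-order successor instead, so the terms can be read back in sorted order with a simple loop and no recursion or stack.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical node layout; the real term tree may differ. */
typedef struct TermNode {
    char *term;
    long count;               /* occurrences seen so far */
    struct TermNode *left;
    struct TermNode *right;   /* right child, or successor if is_thread */
    int is_thread;            /* 1 if right is a thread, 0 otherwise */
} TermNode;

static TermNode *new_node(const char *term)
{
    TermNode *n = malloc(sizeof *n);
    if (n == NULL) return NULL;
    n->term = malloc(strlen(term) + 1);
    if (n->term == NULL) { free(n); return NULL; }
    strcpy(n->term, term);
    n->count = 1;
    n->left = n->right = NULL;
    n->is_thread = 0;
    return n;
}

/* Insert a term, or bump its count if already present. Returns the root.
 * On allocation failure the tree is left unchanged. */
TermNode *tree_insert(TermNode *root, const char *term)
{
    if (root == NULL) return new_node(term);
    TermNode *cur = root;
    for (;;) {
        int cmp = strcmp(term, cur->term);
        if (cmp == 0) { cur->count++; return root; }
        if (cmp < 0) {
            if (cur->left == NULL) {
                TermNode *n = new_node(term);
                if (n == NULL) return root;
                n->right = cur;        /* successor of a left child is its parent */
                n->is_thread = 1;
                cur->left = n;
                return root;
            }
            cur = cur->left;
        } else {
            if (cur->is_thread || cur->right == NULL) {
                TermNode *n = new_node(term);
                if (n == NULL) return root;
                n->right = cur->right; /* inherit the parent's thread, if any */
                n->is_thread = cur->is_thread;
                cur->right = n;
                cur->is_thread = 0;
                return root;
            }
            cur = cur->right;
        }
    }
}

static TermNode *leftmost(TermNode *n)
{
    while (n != NULL && n->left != NULL) n = n->left;
    return n;
}

/* Smallest term in the tree, or NULL for an empty tree. */
TermNode *tree_first(TermNode *root) { return leftmost(root); }

/* In-order successor of n, or NULL when n is the largest term. */
TermNode *tree_next(TermNode *n)
{
    return n->is_thread ? n->right : leftmost(n->right);
}
```

Walking the tree with `for (n = tree_first(root); n != NULL; n = tree_next(n))` visits the terms in ascending order, which is why no separate sort pass is required before the dictionary is written.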
rebuild.tmm
- Input: tpost, tree, tcollstats, docstats, options.spec.
- Output: dictionary, postings, collstats, $files; doctitles.
- Program Flow:
  - rebuild.tmm.c
prep
- Input: dictionary.
- Output: bsdict.
- Source: DISTRIBUTION_ROOT/prise_index/src/bin/prep.
- Program Flow:
  - prep.c
    - size255: <>
    - to_base255: <>
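
The routine names size255 and to_base255 suggest that prep stores numbers in the binary dictionary as base-255 digits. What those routines actually do is not documented here, so the following is only an illustration of base-255 encoding in general. The functions digits255, encode255, and decode255 are hypothetical, as is the choice of most-significant-digit-first order. One property that holds for any such encoding is that every digit lies in the range 0 to 254, so the byte value 255 never appears in an encoded number and remains free for other uses.

```c
#include <stddef.h>

/* Number of base-255 digits needed to represent value (at least 1). */
size_t digits255(unsigned long value)
{
    size_t n = 1;
    while (value >= 255UL) {
        value /= 255UL;
        n++;
    }
    return n;
}

/* Write value into out as base-255 digits, most significant first.
 * Returns the number of digits written, or 0 if cap is too small. */
size_t encode255(unsigned long value, unsigned char *out, size_t cap)
{
    size_t n = digits255(value);
    if (n > cap) return 0;
    for (size_t i = n; i > 0; i--) {
        out[i - 1] = (unsigned char)(value % 255UL);  /* always 0..254 */
        value /= 255UL;
    }
    return n;
}

/* Inverse of encode255: rebuild the value from n digits. */
unsigned long decode255(const unsigned char *in, size_t n)
{
    unsigned long v = 0;
    for (size_t i = 0; i < n; i++)
        v = v * 255UL + in[i];
    return v;
}
```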
docmap
- Input: documents, sgmls.actions.
- Output: docmap_params, docmap_table.
- Source: DISTRIBUTION_ROOT/prise_index/src/bin/docmap.
- Program Flow:
doctitles
- Input: documents, title_tags.
- Output: titles, titles_table.
- Source: DISTRIBUTION_ROOT/prise_index/src/bin/doctitles.
- Program Flow:
docmapseq
- Input: docmap_params.
- Output: docnos, docno_table.
- Source: DISTRIBUTION_ROOT/prise_index/src/bin/docmapseq.
- Program Flow:
  - docmapseq.c
    - extract_doc: <>

Constraints or Boundaries:

Number of Documents: 2²⁰
Word Size: 99 characters (see symb_defs.h)
Weight: 4095
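
The document and weight limits fit a simple bit budget: 2²⁰ documents need 20 bits, a maximum weight of 4095 (2¹² − 1) needs 12 bits, and together they fill exactly one 32-bit word. Whether PRISE actually stores a posting this way is not stated in these notes; the sketch below only illustrates that reading. The names pack_posting, posting_doc, and posting_weight, and the choice to put the document number in the high bits, are assumptions.

```c
#include <stdint.h>

#define DOC_BITS    20u
#define WEIGHT_BITS 12u
#define MAX_DOCS    (1u << DOC_BITS)            /* 1048576 documents */
#define MAX_WEIGHT  ((1u << WEIGHT_BITS) - 1u)  /* 4095 */

/* Pack a document number and a term weight into one 32-bit value.
 * Returns 0 on success, -1 if either field exceeds its limit. */
int pack_posting(uint32_t doc, uint32_t weight, uint32_t *out)
{
    if (doc >= MAX_DOCS || weight > MAX_WEIGHT)
        return -1;
    *out = (doc << WEIGHT_BITS) | weight;
    return 0;
}

/* Extract the document number (high 20 bits). */
uint32_t posting_doc(uint32_t p)    { return p >> WEIGHT_BITS; }

/* Extract the weight (low 12 bits). */
uint32_t posting_weight(uint32_t p) { return p & MAX_WEIGHT; }
```

Under this layout, a collection larger than 2²⁰ documents or a weight above 4095 cannot be represented, which would explain why both appear as hard limits.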