Extractors

Extractors are the components responsible for turning raw HTML or file content into rich, searchable data. Every time a page is added to the index or a document preview is requested, Hister runs the content through a chain of extractors until one succeeds.

The chain design means specialist extractors run first; generic ones act as a safety net for any content that no specialist handles.

Purpose

Generic HTML-to-text conversion loses a lot of signal. A Stack Overflow answer, a Go package reference, a local Markdown note, and a news article are all structured differently, and a one-size-fits-all parser cannot take advantage of that structure.

Extractors exist so that each kind of source can be handled in the most domain-specific way possible. A specialist extractor for a particular website or file format can:

  • pull out the parts of the page that actually matter and discard noise (ads, navigation, boilerplate)
  • produce richer plain text that makes search results more relevant
  • surface structured details (answers, code snippets, documentation sections) that a generic parser would flatten or miss entirely
  • supply a custom front-end template for the document preview panel, giving each content type its own layout and presentation

The goal is always to capture more specialised, higher-quality information about the content being processed, so that search results and the document preview are as useful as possible for the source in question.

When a page is fetched by the browser extension, the CLI, or the crawler, Hister receives its raw HTML (or file bytes). That content must then be processed into a full Document object.

Extractor chain

Extractors are tried in registration order. Each call to Extract or Preview returns an ExtractorState value that signals how the chain should proceed:

State              Meaning
ExtractorStop      The extractor handled the document successfully; stop the chain and return a successful result.
ExtractorContinue  The extractor was inconclusive; try the next matching extractor in the chain.
ExtractorAbort     A fatal error occurred; stop the chain immediately and propagate the error to the caller without trying further extractors.

If the chain is exhausted without any extractor returning ExtractorStop, ErrNoExtractor is returned.
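The dispatch rules above can be sketched as a small loop. This is a minimal illustration, not Hister's actual implementation: Doc, step, and runChain are hypothetical stand-ins, and only the three states and ErrNoExtractor come from this page. The sketch assumes that an error returned alongside ExtractorContinue is ignored, which the text above does not specify.

```go
package main

import (
    "errors"
    "fmt"
)

// ExtractorState mirrors the three states described above.
type ExtractorState int

const (
    ExtractorStop ExtractorState = iota
    ExtractorContinue
    ExtractorAbort
)

// ErrNoExtractor is returned when no extractor stops the chain.
var ErrNoExtractor = errors.New("no extractor handled the document")

// Doc is a hypothetical stand-in for document.Document.
type Doc struct {
    URL  string
    Text string
}

// step is a hypothetical, trimmed-down extractor: a name, a match
// predicate, and an extract function.
type step struct {
    name    string
    match   func(*Doc) bool
    extract func(*Doc) (ExtractorState, error)
}

// runChain tries the steps in registration order and returns the name of
// the step that handled the document.
func runChain(chain []step, d *Doc) (string, error) {
    for _, s := range chain {
        if !s.match(d) {
            continue // this extractor does not apply to the document
        }
        state, err := s.extract(d)
        switch state {
        case ExtractorStop:
            return s.name, nil
        case ExtractorAbort:
            if err == nil {
                err = errors.New("extractor aborted without an error")
            }
            return "", fmt.Errorf("%s: %w", s.name, err)
        case ExtractorContinue:
            // Inconclusive: move on to the next matching step.
        }
    }
    return "", ErrNoExtractor
}

func main() {
    chain := []step{
        {
            name:  "specialist",
            match: func(d *Doc) bool { return d.URL == "https://example.com/qa" },
            extract: func(d *Doc) (ExtractorState, error) {
                d.Text = "specialist text"
                return ExtractorStop, nil
            },
        },
        {
            name:  "generic",
            match: func(*Doc) bool { return true },
            extract: func(d *Doc) (ExtractorState, error) {
                d.Text = "generic text"
                return ExtractorStop, nil
            },
        },
    }
    name, err := runChain(chain, &Doc{URL: "https://example.com/other"})
    fmt.Println(name, err) // prints "generic <nil>"
}
```

Because the specialist is registered first, it wins for the URL it targets; every other document falls through to the generic step.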

The Extractor interface

A custom extractor must implement the following Go interface (defined in server/extractor/extractor.go):

type Extractor interface {
    // Name returns a human-readable identifier used in logs and config.
    Name() string

    // Match reports whether this extractor applies to the given document.
    // Extract and Preview are only called when Match returns true.
    Match(*document.Document) bool

    // Extract rewrites the document before it is added to the index.
    // Return ExtractorStop on success, ExtractorContinue to fall through to
    // the next extractor, or ExtractorAbort to stop with a fatal error.
    Extract(*document.Document) (types.ExtractorState, error)

    // Preview returns a rendered representation suitable for display.
    // Return ExtractorStop on success, ExtractorContinue to fall through to
    // the next extractor, or ExtractorAbort to stop with a fatal error.
    Preview(*document.Document) (types.PreviewResponse, types.ExtractorState, error)

    // GetConfig returns the extractor's current configuration.
    // Must return sensible defaults before SetConfig is called.
    GetConfig() *config.Extractor

    // SetConfig applies user-supplied configuration on top of defaults.
    // Return an error for any unrecognised option key.
    SetConfig(*config.Extractor) error
}
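As an illustration of the three content-facing methods, here is a minimal sketch of a specialist for local Markdown notes. Everything in it is an assumption made for the example: the MarkdownNotes type, the Document fields (URL, Title, Text), and the "markdown" template name are hypothetical, and the local type declarations stand in for the real document and types packages. GetConfig and SetConfig are left out here for brevity.

```go
package main

import (
    "fmt"
    "strings"
)

// Stand-ins for types.ExtractorState and types.PreviewResponse.
type ExtractorState int

const (
    ExtractorStop ExtractorState = iota
    ExtractorContinue
    ExtractorAbort
)

type PreviewResponse struct {
    Content  string
    Template string
}

// Document is a stand-in for document.Document; all fields are hypothetical.
type Document struct {
    URL   string
    Title string
    Text  string // raw content on input
}

// MarkdownNotes is a hypothetical specialist for local Markdown files.
type MarkdownNotes struct{}

func (MarkdownNotes) Name() string { return "MarkdownNotes" }

// Match only inspects the URL, so it stays cheap.
func (MarkdownNotes) Match(d *Document) bool {
    return strings.HasSuffix(strings.ToLower(d.URL), ".md")
}

// Extract uses the first level-one heading as the document title.
func (MarkdownNotes) Extract(d *Document) (ExtractorState, error) {
    for _, line := range strings.Split(d.Text, "\n") {
        if strings.HasPrefix(line, "# ") {
            d.Title = strings.TrimSpace(strings.TrimPrefix(line, "# "))
            return ExtractorStop, nil
        }
    }
    // No heading found: let a more generic extractor try.
    return ExtractorContinue, nil
}

// Preview returns the raw text and asks for a (hypothetical) custom template.
func (MarkdownNotes) Preview(d *Document) (PreviewResponse, ExtractorState, error) {
    if strings.TrimSpace(d.Text) == "" {
        return PreviewResponse{}, ExtractorContinue, nil
    }
    return PreviewResponse{Content: d.Text, Template: "markdown"}, ExtractorStop, nil
}

func main() {
    e := MarkdownNotes{}
    d := &Document{URL: "file:///notes/todo.md", Text: "# Todo\n\n- write docs"}
    if e.Match(d) {
        state, err := e.Extract(d)
        fmt.Println(d.Title, state == ExtractorStop, err) // prints "Todo true <nil>"
    }
}
```

Returning ExtractorContinue when no heading is found keeps the specialist from claiming documents it cannot improve, so the generic fallbacks still get a chance.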

ExtractorState

types.ExtractorState is defined in the server/types package:

type ExtractorState int

const (
    ExtractorStop     ExtractorState = iota // success, stop the chain
    ExtractorContinue                       // inconclusive, try next extractor
    ExtractorAbort                          // fatal error, stop immediately
)

Document

The whole document.Document struct is passed to Match, Extract, and Preview.

PreviewResponse

types.PreviewResponse carries the output of Preview:

type PreviewResponse struct {
    Content  string // HTML or plain text to render
    Template string // optional custom front-end template name; leave blank for default
}

Registering a new extractor

Add an instance of your extractor to the extractors slice in server/extractor/extractor.go. Place it before the generic fallbacks so that it takes priority for the pages it targets.

Configuration

Each extractor can be enabled or disabled, and may expose custom options, through the extractors section of the config file.

extractors:
  <extractor-name>:
    enable: true | false
    options:
      key: value

The <extractor-name> key is the lowercased value returned by the extractor’s Name() method.

Only entries you want to change from the default need to be specified. If an extractor is omitted from the config, its built-in defaults apply.

enable

Controls whether the extractor participates in the chain.

Value  Effect
true   Extractor is active (the default for all built-ins)
false  Extractor is skipped for both indexing and preview

options

A free-form map of extractor-specific settings. The available keys depend on the extractor implementation; each extractor validates its options in SetConfig and returns an error for any unrecognised key.

Implementing GetConfig and SetConfig

GetConfig must return the extractor’s current configuration (or a default when no config has been applied yet):

func (e *MyExtractor) GetConfig() *config.Extractor {
    // No config has been applied yet: return the built-in defaults.
    if e.cfg == nil {
        return &config.Extractor{
            Enable:  true,
            Options: map[string]any{},
        }
    }
    return e.cfg
}

SetConfig should validate that no unknown option keys are present, then store the config:

func (e *MyExtractor) SetConfig(c *config.Extractor) error {
    // Option keys this extractor understands.
    allowed := map[string]bool{"timeout": true}
    // Reject any key outside the allowed set.
    for k := range c.Options {
        if !allowed[k] {
            return fmt.Errorf("unknown option %q", k)
        }
    }
    e.cfg = c
    return nil
}

Config merging (default → user-supplied) is performed automatically by extractor.Init before SetConfig is called, so SetConfig always receives the fully resolved configuration.
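To make "default → user-supplied" concrete, the sketch below shows one plausible way the options part of such a merge could work: user keys override default keys, and defaults survive where the user says nothing. It is an assumption for illustration only, not the code of extractor.Init; mergeOptions and the max_length option are hypothetical names, and how the enable flag is merged is not covered.

```go
package main

import "fmt"

// mergeOptions is a hypothetical helper: it copies the defaults into a new
// map, then lets user-supplied values override them key by key.
func mergeOptions(defaults, user map[string]any) map[string]any {
    out := make(map[string]any, len(defaults)+len(user))
    for k, v := range defaults {
        out[k] = v
    }
    for k, v := range user {
        out[k] = v
    }
    return out
}

func main() {
    defaults := map[string]any{"timeout": 10, "max_length": 5000}
    user := map[string]any{"timeout": 30}
    fmt.Println(mergeOptions(defaults, user)) // prints "map[max_length:5000 timeout:30]"
}
```

Building a fresh map rather than mutating the defaults means the built-in values stay intact if the configuration is loaded again.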

Development guidelines

Avoid additional HTTP requests. Work with the HTML and metadata already available in the Document struct wherever possible. Making extra requests inside an extractor adds latency, increases network traffic, and can fail silently in offline or restricted environments. More importantly, outbound requests expose the user’s IP address and browsing activity to external servers, which is a privacy concern. Additional requests are not forbidden, but they should only be made when there is no reasonable alternative.

Avoid embedding third-party content. Strip or discard remote images, videos, iframes, and other externally hosted media before returning content from Extract or Preview wherever possible. Embedding such content causes the browser to contact third-party servers whenever a preview is opened, leaking the user’s IP address without their knowledge. Third-party content is not forbidden, but it should be avoided unless it is essential to the extractor’s purpose. When multimedia must be surfaced, the preferred approach is to render a placeholder button that the user can click to load the video, audio, or embed on demand, so external contact only happens with explicit user intent.
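As a rough sketch of the stripping step, the example below replaces remotely hosted images and all iframes with plain-text placeholders. It is deliberately simplistic and rests on assumptions: stripThirdParty is a hypothetical helper, the placeholder strings are invented, and regular expressions are a fragile way to process HTML, so a real extractor would more likely walk a parsed DOM. Any absolute or protocol-relative image URL is treated as remote, including URLs on the document's own host.

```go
package main

import (
    "fmt"
    "regexp"
)

var (
    // <img> tags whose src is an absolute or protocol-relative URL.
    remoteImg = regexp.MustCompile(`(?is)<img\b[^>]*\bsrc\s*=\s*["']?(?:https?:)?//[^>]*>`)
    // Whole <iframe> elements, including any fallback content inside them.
    iframe = regexp.MustCompile(`(?is)<iframe\b[^>]*>.*?</iframe>`)
)

// stripThirdParty swaps remote images and iframes for inert placeholders so
// that opening a preview does not contact external servers.
func stripThirdParty(html string) string {
    html = remoteImg.ReplaceAllString(html, "[remote image removed]")
    html = iframe.ReplaceAllString(html, "[embedded frame removed]")
    return html
}

func main() {
    in := `<p>Hi</p><img src="https://cdn.example.com/a.png" alt="x">` +
        `<img src="/local.png"><iframe src="https://video.example.com/e"></iframe>`
    fmt.Println(stripThirdParty(in))
}
```

In a preview template, the placeholder text could instead be a button that loads the original media on click, in line with the on-demand approach described above.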

Use custom preview templates when they add value. If the extracted content has a well-defined structure (code documentation, Q&A threads, recipes, and so on), return a non-empty Template in PreviewResponse and build a dedicated Svelte template for it. A tailored layout is almost always more readable than the generic one.