Software Secret Weapons™


 
Lapis
by Pavel Simakov on 2005-11-23 22:35:30 under Text Mining
 


LAPIS is a lightweight structured text processor. Under the hood, it is a pattern evaluator built on syntax trees whose leaf nodes are regular expressions. Its biggest advantage over plain regular expression matching is that regular expressions can be combined to form new patterns. LAPIS was developed in 1999 by Robert C. Miller as part of his PhD thesis proposal at Carnegie Mellon University. Here is the abstract from that proposal:

Web pages, source code, and other text documents contain structured data worthy of automatic processing -- searching, sorting, reformatting, calculating, and so on. Unfortunately, generic text-processing tools limit their input and output to generic formats that may not match the format the user wants to process. The usual solution to this problem is a custom program that parses the source text and processes it, but custom parsers are hard to build and often discard useful information from the source document.

I propose a new approach, lightweight structured text processing, in which users describe relevant document structure interactively and manipulate documents directly with generic tools. This approach will be embodied in LAPIS, a system for displaying and processing web pages, source code, and text files. LAPIS makes several contributions: (1) text constraints, a new pattern language for identifying regions of text in a simple, readable, composable, expressive, and robust manner; (2) algorithms and representations for implementing text constraints with reasonable efficiency; (3) an architecture for composing and reusing structure detectors, such as external C++ or HTML parsers; (4) a user interface that integrates pattern matching, manual selection, learning from examples, and external parsers, allowing the user to combine these techniques for convenient structure description; (5) the ability to refer to "presentation-level" structure that may not be directly reflected in the linear text, such as page layout, typesetting, and table rows and columns; and (6) the ability to handle variations and exceptions in document structure by specifying fuzzy patterns.

My thesis is that lightweight structured text processing lets users describe and manipulate many kinds of text structure, from implicit to explicit, formal to informal, and presentation-level to logical-level, allowing the user to manipulate the text in its original format, and delivering the convenience, speed, and scalability of automation without the cost or difficulty of writing a custom program.
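To make the "syntax tree with regular expressions at the leaves" idea more concrete, here is a minimal Python sketch. It is not LAPIS code and does not use the LAPIS pattern language or its algorithms. The class names (Regex, Inside, Containing) and the helper texts are hypothetical, chosen only for illustration. Each node returns a list of text regions, and inner nodes combine the regions found by their children.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[int, int]  # half-open [start, end) region of the text


class Pattern:
    """Base class for nodes of the (hypothetical) pattern tree."""

    def find(self, text: str) -> List[Span]:
        raise NotImplementedError


@dataclass
class Regex(Pattern):
    """Leaf node: a plain regular expression."""
    expr: str

    def find(self, text: str) -> List[Span]:
        return [m.span() for m in re.finditer(self.expr, text)]


@dataclass
class Inside(Pattern):
    """Regions of `inner` that lie entirely within some region of `outer`."""
    inner: Pattern
    outer: Pattern

    def find(self, text: str) -> List[Span]:
        outers = self.outer.find(text)
        return [(s, e) for (s, e) in self.inner.find(text)
                if any(o_s <= s and e <= o_e for (o_s, o_e) in outers)]


@dataclass
class Containing(Pattern):
    """Regions of `outer` that enclose at least one region of `inner`."""
    outer: Pattern
    inner: Pattern

    def find(self, text: str) -> List[Span]:
        inners = self.inner.find(text)
        return [(s, e) for (s, e) in self.outer.find(text)
                if any(s <= i_s and i_e <= e for (i_s, i_e) in inners)]


def texts(pattern: Pattern, text: str) -> List[str]:
    """Return the matched substrings instead of their positions."""
    return [text[s:e] for (s, e) in pattern.find(text)]


sample = "name: Alice\n# name: Bob\nname: Carol\n"
word = Regex(r"[A-Z][a-z]+")   # capitalized words
comment = Regex(r"#[^\n]*")    # '#' up to the end of the line
line = Regex(r"[^\n]+")        # any non-empty line

print(texts(Inside(word, comment), sample))              # ['Bob']
print(texts(Containing(line, Regex(r"Carol")), sample))  # ['name: Carol']
```

The point of the sketch is that "capitalized word inside a comment" is built from two small, reusable patterns rather than one hard-to-read regular expression, and the combined pattern can itself be used as a child of a larger one.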


Copyright © 2004-2010 by Pavel Simakov
Any conclusions, recommendations, ideas, thoughts, or source code presented on this site are my own and do not reflect an official opinion of my current or past employers, partners, or clients.