Software Secret Weapons™
Lapis posted by Pavel Simakov on 2005-11-23 22:35:30 under Text Mining
LAPIS is a lightweight structured text processor. Under the hood, it is a regular expression evaluator that uses syntax trees with regular expressions as leaf nodes. Its biggest advantage over simple regular expression matching is the ability to combine regular expressions into new patterns. LAPIS was developed in 1999 as part of a PhD thesis proposal by Robert C. Miller at Carnegie Mellon University. Here is the abstract from his proposal:
Web pages, source code, and other text documents contain structured data worthy of automatic processing -- searching, sorting, reformatting, calculating, and so on. Unfortunately, generic text-processing tools limit their input and output to generic formats that may not match the format the user wants to process. The usual solution to this problem is a custom program that parses the source text and processes it, but custom parsers are hard to build and often discard useful information from the source document.
I propose a new approach, lightweight structured text processing, in which users describe relevant document structure interactively and manipulate documents directly with generic tools. This approach will be embodied in LAPIS, a system for displaying and processing web pages, source code, and text files. LAPIS makes several contributions: (1) text constraints, a new pattern language for identifying regions of text in a simple, readable, composable, expressive, and robust manner; (2) algorithms and representations for implementing text constraints with reasonable efficiency; (3) an architecture for composing and reusing structure detectors, such as external C++ or HTML parsers; (4) a user interface that integrates pattern matching, manual selection, learning from examples, and external parsers, allowing the user to combine these techniques for convenient structure description; (5) the ability to refer to "presentation-level" structure that may not be directly reflected in the linear text, such as page layout, typesetting, and table rows and columns; and (6) the ability to handle variations and exceptions in document structure by specifying fuzzy patterns.
My thesis is that lightweight structured text processing lets users describe and manipulate many kinds of text structure, from implicit to explicit, formal to informal, and presentation-level to logical-level, allowing the user to manipulate the text in its original format, and delivering the convenience, speed, and scalability of automation without the cost or difficulty of writing a custom program.
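To make the "syntax tree with regular expressions as leaf nodes" idea more concrete, here is a minimal Python sketch. It is not LAPIS code and does not use the LAPIS pattern language; the class names (`Leaf`, `Inside`, `Either`) and the `extract` helper are hypothetical, made up for illustration only. Each node returns a list of text regions, and the inner nodes combine the regions produced by their children, which is roughly what combining regular expressions into new patterns could look like.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# A region is a half-open (start, end) pair of character offsets.
Region = Tuple[int, int]


class Pattern:
    """Base class for nodes of the (hypothetical) pattern tree."""

    def match(self, text: str) -> List[Region]:
        raise NotImplementedError


@dataclass
class Leaf(Pattern):
    """Leaf node: an ordinary regular expression."""
    regex: str

    def match(self, text: str) -> List[Region]:
        return [m.span() for m in re.finditer(self.regex, text)]


@dataclass
class Inside(Pattern):
    """Regions of `inner` that lie entirely within some region of `outer`."""
    inner: Pattern
    outer: Pattern

    def match(self, text: str) -> List[Region]:
        outers = self.outer.match(text)
        return [
            (start, end)
            for (start, end) in self.inner.match(text)
            if any(o_start <= start and end <= o_end for (o_start, o_end) in outers)
        ]


@dataclass
class Either(Pattern):
    """Union of the regions matched by both children, in document order."""
    left: Pattern
    right: Pattern

    def match(self, text: str) -> List[Region]:
        return sorted(set(self.left.match(text)) | set(self.right.match(text)))


def extract(node: Pattern, text: str) -> List[str]:
    """Return the matched substrings for a pattern tree."""
    return [text[start:end] for (start, end) in node.match(text)]


# Two simple leaves, then a new pattern built from them.
number = Leaf(r"\d+")
string_literal = Leaf(r'"[^"]*"')
number_in_string = Inside(number, string_literal)

sample = 'x = 10; print("call 911"); y = 20'
print(extract(number, sample))            # ['10', '911', '20']
print(extract(number_in_string, sample))  # ['911']
```

The point of the sketch is that "a number inside a string literal" is awkward to write as one regular expression, but easy to state as a small tree over two simple ones. The real system goes well beyond this, as the abstract describes (external parsers, learning from examples, fuzzy patterns).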
Copyright © 2004-2007 by Pavel Simakov
Comment by Doug — July 5, 2007 @ 4:31 pm
Very interesting. Is there an example? Is it open source?
Comment by Pavel Simakov — July 9, 2007 @ 11:29 am
The download link is here:
http://groups.csail.mit.edu/uid/lapis/doc/download.html