Infogistics' NLProcessor online demo


download evaluation version

download integrator-level documentation.

see Modified Penn Treebank Tag-set

see Normalization demo

NLProcessor Interactive Demo
About Tagging

tTAG is a part-of-speech tagger that handles plain ASCII text and XML marked-up text. It incorporates a tokenizer (tNORM) that segments text into words and sentences; the internal tokenizer can also be switched off so that tTAG runs with your own tokenizer. As a morphological classifier, tTAG uses a lexicon that can easily be extended to accommodate new words. For words not found in the lexicon, tTAG incorporates an advanced unknown-word guesser, which can also be retrained for new sublanguages. tTAG achieves 96% to 98% accuracy when all the words in the text are found in the lexicon, and 88% to 92% accuracy on unknown words.
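As a rough illustration of how the two accuracy ranges combine, the sketch below weights them by the share of unknown words in a text. The 5% unknown-word rate is a hypothetical value, and the calculation assumes that known words in a mixed text are tagged as accurately as in a text with no unknown words, which may not hold exactly.

```python
def blended_accuracy(known_acc: float, unknown_acc: float, unknown_rate: float) -> float:
    """Expected per-token accuracy for a text with the given share of unknown words."""
    if not 0.0 <= unknown_rate <= 1.0:
        raise ValueError("unknown_rate must be between 0 and 1")
    # Weighted average of the accuracy on known and on unknown words.
    return (1.0 - unknown_rate) * known_acc + unknown_rate * unknown_acc


# Hypothetical text in which 5% of the tokens are missing from the lexicon.
low = blended_accuracy(0.96, 0.88, 0.05)   # pessimistic ends of both ranges
high = blended_accuracy(0.98, 0.92, 0.05)  # optimistic ends of both ranges
print(f"{low:.3f} to {high:.3f}")  # prints "0.956 to 0.977"
```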

tTAG comes with resources pre-trained on publicly available corpora using the Modified Penn Treebank Tag-set. tTAG also allows you to develop your own resources from your own corpora, using your own tag-set. All components of tTAG can be automatically re-trained to your specific needs. When running tTAG, you simply specify which resource file to use during tagging.

When tagging SGML/XML marked-up text, you can tell tTAG to tag the entire text or to tag only certain XML sections (for example, only paragraphs, and not headers or captions). It is also possible to ask tTAG to output the tagged text as XML:

<SENTENCE>
<W TAG="PPS">He</W> <W TAG="VBZ">books</W> <W TAG="NNS">tickets</W>
</SENTENCE>
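The idea of tagging only selected sections and emitting the result as XML can be sketched with Python's standard library. This is not tTAG's implementation or interface: the `DOC`, `HEAD` and `P` element names, the `tag_sections` function, and the three-word toy lexicon are all assumptions made for illustration, and the sketch assumes each selected section contains plain text only.

```python
import xml.etree.ElementTree as ET

# Toy lexicon standing in for a real tagger; these are NOT tTAG's resources.
TOY_LEXICON = {"He": "PPS", "books": "VBZ", "tickets": "NNS"}


def tag_sections(xml_text, section="P"):
    """Wrap each word of the chosen sections in a <W TAG="..."> element."""
    root = ET.fromstring(xml_text)
    # Materialise the matches first, because the loop adds child elements.
    for elem in list(root.iter(section)):
        words = (elem.text or "").split()
        elem.text = None
        for word in words:
            # Words missing from the toy lexicon get the placeholder tag "UNK".
            w = ET.SubElement(elem, "W", TAG=TOY_LEXICON.get(word, "UNK"))
            w.text = word
            w.tail = " "
    return root


doc = "<DOC><HEAD>Travel notes</HEAD><P>He books tickets</P></DOC>"
root = tag_sections(doc)
print(ET.tostring(root, encoding="unicode"))
```

Only the `P` element receives `W` children here; the `HEAD` text passes through untouched, which mirrors the "paragraphs but not headers" case described above.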

About Syntactic Chunking

tCHUNK is a syntactic chunker, or partial parser. It uses the part-of-speech information provided by tTAG and employs mildly context-sensitive grammars to detect the boundaries of syntactic groups. The chunker leaves all previously added information in the text and creates structural elements that enclose the words of each chunk:

<NG>This man</NG> <VG>is singing</VG>.

Currently it is capable of recognizing boundaries of simple noun and verb groups. Noun groups do not include prepositional or clausal post-modifiers.

The chunker itself is a combination of a finite state transducer over SGML/XML elements with a grammar for the detection of the syntactic groups. This grammar can employ all the fields of XML elements. For instance, a rule can say: "If there is an element of type 'W' with character data 'book' and the 'pos' attribute set to 'NN', followed by zero or more elements of type 'W' with the 'pos' attribute set to 'NN', create an 'NG' element and put this sequence under it."
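The example rule can be approximated in a few lines of Python. This is a minimal sketch using a plain loop over sibling elements, not tCHUNK's transducer or its grammar notation; the function name `apply_book_rule` and the `S` wrapper element are hypothetical.

```python
import xml.etree.ElementTree as ET


def apply_book_rule(sentence):
    """Group a 'book'/NN word plus any following NN words under a new NG element."""
    children = list(sentence)
    rebuilt = []
    i = 0
    while i < len(children):
        el = children[i]
        if el.tag == "W" and el.text == "book" and el.get("pos") == "NN":
            # Extend the match over zero or more following W elements tagged NN.
            j = i + 1
            while (j < len(children) and children[j].tag == "W"
                   and children[j].get("pos") == "NN"):
                j += 1
            group = ET.Element("NG")
            group.extend(children[i:j])
            rebuilt.append(group)
            i = j
        else:
            rebuilt.append(el)
            i += 1
    # Replace the sentence's children with the regrouped sequence.
    for child in children:
        sentence.remove(child)
    sentence.extend(rebuilt)
    return sentence


s = ET.fromstring(
    '<S><W pos="DT">the</W><W pos="NN">book</W><W pos="NN">shop</W>'
    '<W pos="NN">owner</W><W pos="VBZ">sleeps</W></S>'
)
apply_book_rule(s)
print(ET.tostring(s, encoding="unicode"))
```

In this input the three consecutive NN words starting at "book" end up inside a single `NG` element, while "the" and "sleeps" stay where they were, with their attributes unchanged.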
