TextoKit

TextoKit is a set of components for Natural Language Processing based on the Apache UIMA platform.

This project is maintained by Textocat.

Getting Started

Installation

TextoKit is available as a set of Maven artifacts. They are published to the Textocat repository, which you can use by specifying the following in your project POM:

<repository>
    <id>textocat.artifactory</id>
    <url>http://corp.textocat.com/artifactory/oss-repo</url>
    <name>Textocat Open-Source Repository</name>
</repository>

There are three main types of modules: API modules, analyzer implementations, and resource modules (models, dictionaries, etc.).

Depending on which analyzers you need, you include a corresponding set of dependencies in your application.

For example, if you need lemmatization capability, you start from the textokit-lemmatizer-api module and its implementation textokit-lemmatizer-dictionary-sim. Then you provide analyzer implementations for all the preliminary steps: a tokenizer, a sentence splitter and a Part-of-Speech tagger. Quite reasonable choices are textokit-tokenizer-simple, textokit-sentence-splitter-heuristic and textokit-pos-tagger-opennlp.

The latter requires two artifacts to be provided: an implementation of the morphological dictionary API and a trained model. The dictionary implementation, in turn, requires an actual dictionary to be provided; the simplest option is to add another dependency on the compiled dictionary. Consequently, you will end up with the following:

<!-- API that you will use in your app -->
<dependency>
    <groupId>com.textocat.textokit.core</groupId>
    <artifactId>textokit-lemmatizer-api</artifactId>
    <version>${textokit.version}</version>
</dependency>
<!-- analyzer implementations -->
<dependency>
    <groupId>com.textocat.textokit.core</groupId>
    <artifactId>textokit-tokenizer-simple</artifactId>
    <version>${textokit.version}</version>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>com.textocat.textokit.core</groupId>
    <artifactId>textokit-sentence-splitter-heuristic</artifactId>
    <version>${textokit.version}</version>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>com.textocat.textokit.core</groupId>
    <artifactId>textokit-morph-dictionary-opencorpora</artifactId>
    <version>${textokit.version}</version>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>com.textocat.textokit.core</groupId>
    <artifactId>textokit-pos-tagger-opennlp</artifactId>
    <version>${textokit.version}</version>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>com.textocat.textokit.core</groupId>
    <artifactId>textokit-lemmatizer-dictionary-sim</artifactId>
    <version>${textokit.version}</version>
    <scope>runtime</scope>
</dependency>
<!-- models, dictionaries, etc. -->
<dependency>
    <groupId>com.textocat.textokit.core</groupId>
    <artifactId>textokit-dictionary-opencorpora-resource</artifactId>
    <classifier>rnc</classifier>
    <version>0.1-20140407-1</version>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>com.textocat.textokit.core</groupId>
    <artifactId>textokit-pos-tagger-opennlp-model</artifactId>
    <classifier>rnc1M-8cat</classifier>
    <scope>runtime</scope>
    <version>0.1-20151116-1</version>
</dependency>
...
<properties>
  <!-- define in properties block of POM -->
  <textokit.version>0.1-SNAPSHOT</textokit.version>
</properties>

Preliminary reading - UIMA basics

UIMA Tutorial – chapters 1, 3 and 5.

UIMA Reference – chapters 2, 4, 5.

UIMAfit Guide – chapters 1 through 6.

Running

To process texts with UIMA you need to (1) define an input, (2) compose a processing pipeline and (3) consume the output. The following sections explain each of these stages in detail.

How to define an input

UIMA has the concept of a collection reader. Basically, it is an interface similar to an iterator over documents, where each document is represented by a UIMA CAS. Check the UIMA documentation for details.

TextoKit provides several collection reader implementations.

UIMA needs a CollectionReaderDescription to produce an instance of a collection reader at runtime. You can either write an XML descriptor (as described in the UIMA documentation) or build it programmatically; the latter approach is facilitated by UIMAfit’s CollectionReaderFactory. A common convention is to provide a static factory method in a collection reader class that produces a description instance.

The main purpose of a collection reader is to set the document text and some initial feature structures in an empty CAS. TextoKit’s collection readers add a single annotation of type DocumentMetadata (from textokit-commons) and set its sourceUri feature (and optionally some others). The value of sourceUri is supposed to be a reference to the source of the text, e.g., a file URL, a record identifier, etc.
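For illustration, here is a minimal sketch of building a reader description with UIMAfit’s CollectionReaderFactory; the reader class and its parameter below are hypothetical placeholders, so substitute one of TextoKit’s readers or your own implementation:

import org.apache.uima.collection.CollectionReaderDescription;
import org.apache.uima.fit.factory.CollectionReaderFactory;
...
// NOTE: MyFileReader and its PARAM_INPUT_DIR are illustrative placeholders,
// not actual TextoKit classes
CollectionReaderDescription readerDesc = CollectionReaderFactory.createReaderDescription(
        MyFileReader.class,
        MyFileReader.PARAM_INPUT_DIR, "data/input");

A readerDesc built this way is reused in the running examples below.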

How to compose a text processing pipeline

A text processing pipeline is called an aggregate analysis engine in UIMA. It is composed of a set of smaller analysis engines and a flow controller; the latter defines the route of CASes through the constituents. Here we assume that the default flow controller implementation provided by the UIMA SDK is good enough for introduction purposes: it implements a linear ordering of the constituent analysis engines, so you just list them in the appropriate order.

UIMA needs an AnalysisEngineDescription to produce an instance of an analysis engine at runtime. You can either write an XML descriptor (as described in the UIMA documentation) or build it programmatically; the latter approach is facilitated by UIMAfit’s AnalysisEngineFactory. When you assemble a description for an aggregate AE, there are two ways to define an inner AE: embed its description directly or reference it with an import.

In a UIMA description, an import can be specified by location or by fully-qualified name. Import by location has a lot of pitfalls and generally should not be used. Import by fully-qualified name is resolved against the application classpath and the UIMA datapath; it is the preferred way to reference analysis engines in TextoKit.

Fully-qualified names of TextoKit analyzers for the basic text processing steps are defined as constants in the corresponding facade classes of the API modules, e.g., TokenizerAPI.AE_TOKENIZER, SentenceSplitterAPI.AE_SENTENCE_SPLITTER, PosTaggerAPI.AE_POSTAGGER and LemmatizerAPI.AE_LEMMATIZER.

Using these names you can assemble an aggregate description with UIMAfit’s AnalysisEngineFactory:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.*;
...
AnalysisEngineDescription aeDesc = createEngineDescription(
        createEngineDescription(TokenizerAPI.AE_TOKENIZER),
        createEngineDescription(SentenceSplitterAPI.AE_SENTENCE_SPLITTER),
        createEngineDescription(PosTaggerAPI.AE_POSTAGGER),
        createEngineDescription(LemmatizerAPI.AE_LEMMATIZER)
);

Some PoS-tagger and lemmatizer implementations expect an instance of MorphDictionaryHolder (from textokit-morph-dictionary-api) to be injected as a UIMA external resource. This is true for the implementations chosen above. For such cases, PosTaggerAPI and LemmatizerAPI define the expected name of the external resource holding a morphological dictionary. TextoKit’s MorphDictionaryAPI has factory methods that produce descriptions for such an external resource. Consequently, you should add the external resource description to the aggregate description:

// obtain a description of an external resource holding a cached dictionary instance
ExternalResourceDescription morphDictDesc =
        MorphDictionaryAPIFactory.getMorphDictionaryAPI().getResourceDescriptionForCachedInstance();
// bind it under the name expected by the PoS tagger
morphDictDesc.setName(PosTaggerAPI.MORPH_DICTIONARY_RESOURCE_NAME);
// register the resource in the aggregate description
PipelineDescriptorUtils.getResourceManagerConfiguration(aeDesc).addExternalResource(morphDictDesc);

How to consume an output

There are several ways to extract data from a processed CAS, depending on the deployment. The first approach is to implement an analysis engine (starting from JCasAnnotator_ImplBase or CasAnnotator_ImplBase) and write the extraction logic in its process method, which receives a CAS instance. Then you simply add this AE to the end of the pipeline. Here is an example where each word with its lemma and PoS tag is written to standard output:

public class WordPosLemmaWriter extends JCasAnnotator_ImplBase {
    @Override
    public void process(JCas jCas) throws AnalysisEngineProcessException {
        for (Word w : JCasUtil.select(jCas, Word.class)) {
            String src = w.getCoveredText();
            String lemma = MorphCasUtils.getFirstLemma(w);
            String posTag = MorphCasUtils.getFirstPosTag(w);
            System.out.print(String.format("%s/%s/%s ", src, lemma, posTag));
        }
        // mark the end of a document
        System.out.println("\n");
    }
}

Another approach is shown in a later section.

How to run a pipeline

UIMA provides quite a few options for pipeline deployment. The easiest one is arguably UIMAfit’s SimplePipeline utility. Continuing our example, the invocation is as follows:

AnalysisEngineDescription writerDesc = createEngineDescription(WordPosLemmaWriter.class);

SimplePipeline.runPipeline(readerDesc, aeDesc, writerDesc);

This approach is good for quick experiments and testing: it uses a single thread for analysis and only a single CAS instance, which is reset before each document is processed.

One way to enable multi-threaded processing is to use the Collection Processing Engine (CPE) machinery for deployment. Fortunately, UIMAfit has utilities that minimize the effort of configuring and launching a CPE: CpeBuilder and CpePipeline. Note that they are provided in a separate module:

<dependency>
    <groupId>org.apache.uima</groupId>
    <artifactId>uimafit-cpe</artifactId>
    <version>2.1.0</version>
</dependency>
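
With this module on the classpath, switching the example above to multi-threaded processing is a one-line change. The sketch below assumes readerDesc, aeDesc and writerDesc are defined as in the previous sections; by default CpePipeline uses as many worker threads as there are available processors (an overload taking an explicit parallelism argument exists as well):

import org.apache.uima.fit.cpe.CpePipeline;
...
// runs the same descriptors inside a CPE: the collection reader stays in a
// single thread while the analysis engines are replicated across workers
CpePipeline.runPipeline(readerDesc, aeDesc, writerDesc);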

Although UIMA CPE provides some limited scalability and remote deployment capabilities, it is a middle-ground solution. There are at least two more advanced tools: UIMA Asynchronous Scaleout and UIMA DUCC.

The complete example

You can see the complete example described above in the project on GitHub.

How to consume an output - 2

Sometimes writing a separate annotator class just to do something with the processing output might be excessive. If you don’t need parallelization of a pipeline (by CPE, AS or DUCC), you can inspect CAS instances right in your application code after they are processed by the pipeline. For such cases, UIMAfit’s JCasIterable does all the preparation for you: each iterator instantiates a collection reader and an analysis engine, and creates a CAS instance that is reused for each document in the collection. This CAS instance is returned by the iterator’s next method, once per document. Consequently, we can rewrite the example above without WordPosLemmaWriter as follows:

JCasIterable jCasIterable = new JCasIterable(readerDesc, aeDesc);
for (JCas jCas : jCasIterable) {
    for (Word w : JCasUtil.select(jCas, Word.class)) {
        String src = w.getCoveredText();
        String lemma = MorphCasUtils.getFirstLemma(w);
        String posTag = MorphCasUtils.getFirstPosTag(w);
        System.out.print(String.format("%s/%s/%s ", src, lemma, posTag));
    }
    // mark the end of a document
    System.out.println("\n");
}