Corpus Importer
Prerequisites
This section requires that you have already set up the PostgreSQL DB and, preferably, the Web Portal. If not, refer to the respective documentation.
The Corpus-Importer is a Java application that transforms and imports UIMA-annotated data from a local path into the UCE environment. Depending on the configuration, it also performs post-processing of the data, such as the creation of embedding spaces.
UIMA
If your data is not yet available in UIMA format, refer to the respective documentation, which uses the Docker Unified UIMA Interface to transform, process, and annotate the data in UIMA format. Once your data has been transformed, continue here.
Folder Structure
With the database and the web portal set up (locally or via Docker), all that is left is to tell the importer where to import from and to start it.
For this, the importer always requires the following folder structure:
corpus_a
├── corpusConfig.json
└── input
    ├── uima_doc_1.xmi
    ├── uima_doc_2.xmi
    ├── ...
    └── uima_doc_n.xmi
where corpusConfig.json holds metadata, and the input folder contains the actual UIMA files for a single corpus.
Input Structure
As of now, the importer recursively walks through the input folder, so every .xmi file in any subfolder is considered.
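Before starting an import, it can help to check that a corpus folder matches this layout. The following is a minimal sketch, not part of the importer: the helper name check_corpus_folder is hypothetical, and it only assumes the layout described above (a corpusConfig.json next to an input folder that is searched recursively for .xmi files).

```python
from pathlib import Path


def check_corpus_folder(corpus_dir):
    """Return the sorted .xmi files under <corpus_dir>/input, searched recursively.

    Hypothetical helper: raises FileNotFoundError if corpusConfig.json
    or the input folder is missing.
    """
    corpus = Path(corpus_dir)
    config = corpus / "corpusConfig.json"
    input_dir = corpus / "input"
    if not config.is_file():
        raise FileNotFoundError(f"missing {config}")
    if not input_dir.is_dir():
        raise FileNotFoundError(f"missing {input_dir}")
    # rglob descends into every subfolder, mirroring the recursive walk.
    return sorted(p for p in input_dir.rglob("*.xmi") if p.is_file())
```

Calling check_corpus_folder("./path/to/my_corpora/corpus_a") returns the list of files the importer would be expected to pick up, or fails early if the folder is incomplete.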
User Setup
Open the docker-compose.yaml file (if you haven't created the .env file yet, see here) and locate the uce-importer service. Within it, mount all local paths to the corpora you want to import using the structure described above, and map them like so:
volumes:
  - "./path/to/my_corpora/corpus_a:/app/input/corpora/corpus_a"
  - "./path/to/other_corpora/corpus_b:/app/input/corpora/corpus_b"
  - "..."
You can mount as many corpora as you like using the same structure. Remember that you can adjust the number of threads used through the .env file.
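When many corpora need to be mounted, the volume entries can be generated instead of typed by hand. This is a small sketch under stated assumptions: the function name volume_entries is hypothetical, and the container root /app/input/corpora is taken from the mapping example in this section. The local paths are placeholders.

```python
from pathlib import PurePosixPath

# Container-side root, taken from the volume mapping example in this section.
CONTAINER_ROOT = "/app/input/corpora"


def volume_entries(local_corpus_paths):
    """Build volume strings of the form '<local path>:<container root>/<folder name>'."""
    entries = []
    for local in local_corpus_paths:
        name = PurePosixPath(local).name  # last path component, e.g. 'corpus_a'
        entries.append(f"{local}:{CONTAINER_ROOT}/{name}")
    return entries


# Print the entries in the list form used under 'volumes:'.
for entry in volume_entries(["./path/to/my_corpora/corpus_a",
                             "./path/to/other_corpora/corpus_b"]):
    print(f'- "{entry}"')
```

The printed lines can be pasted under the volumes key of the uce-importer service.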
Afterwards, start the uce-importer service through Docker Compose.
Developer Setup
In the source code, identify the module uce.corpus-importer and set up your IDE:
Setup
- Add a new Application configuration (UCE is developed in Java 21)
- Set -cp corpus-importer
- Main class: org.texttechnologylab.App
- CLI arguments are required: -src "./path/to/your_corpus/" -num 1 -t 1
- Maven should automatically download and index the dependencies. If, for some reason, it does not, you can force an update via mvn clean install -U (in IntelliJ, open Execute Maven Goal, then enter the command).
Open the common.conf file and adjust the database connection parameters (host, port, etc.) to match your database. Now start the importer and import your corpus. Refer to CLI Arguments for a full list of possible parameters.
Logs
The importer logs to both the PostgreSQL database (tables uceimport and importlog) and the local logs directory within the container. Both logs also appear in the standard output of the console.
CLI Arguments
Argument | Description |
---|---|
-src --importSrc | The path to the corpus source where the UIMA-annotated files are stored. |
-srcDir --importDir | Unlike -src, -srcDir is the path to a directory that holds multiple importable src paths. The importer checks for folders within this directory, where each folder should be an importable corpus with a corpusConfig.json and its input UIMA files. Those are then imported. |
-num --importerNumber | When starting multiple importers, assign an ID to each instance by counting up from 1 to n (not relevant as of now; just set it to 1). |
-t --numThreads | UCE imports asynchronously. Set the number of threads to use, e.g. 4, 8, or 16. By default, the import is single-threaded. |
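To illustrate the difference between -src and -srcDir, the sketch below lists which subfolders of a directory look like importable corpora and assembles the arguments for a single corpus. It is an illustration only, not the importer's actual logic: the helper names importable_corpora and importer_args are hypothetical, and how the importer treats incomplete folders is not specified here.

```python
from pathlib import Path


def importable_corpora(src_dir):
    """Return direct subfolders of src_dir that hold corpusConfig.json and an input folder."""
    return sorted(
        child for child in Path(src_dir).iterdir()
        if child.is_dir()
        and (child / "corpusConfig.json").is_file()
        and (child / "input").is_dir()
    )


def importer_args(src, importer_number=1, num_threads=1):
    """Assemble the -src, -num and -t arguments from the table for one corpus path."""
    return ["-src", str(src), "-num", str(importer_number), "-t", str(num_threads)]
```

For example, importer_args("./path/to/your_corpus/", num_threads=4) yields the argument list for a four-threaded import of a single corpus, while importable_corpora shows which folders a -srcDir run would be expected to consider.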