Corpus Importer
Prerequisites
This section requires that you have already set up the PostgreSQL DB and, preferably, the Web Portal. If not, refer to the respective documentation.
The Corpus Importer is a Java application that transforms UIMA-annotated data from a local path and imports it into the UCE environment. Depending on the configuration, it also post-processes the data, for example by creating embedding spaces.
UIMA
If your data is not yet available in UIMA format, refer to the respective documentation, which uses the Docker Unified UIMA Interface to transform, process, and annotate data in UIMA format. Once your data has been transformed, continue here.
Folder Structure
Having set up the database and the Web Portal (locally or via Docker), all that is left is to tell the importer where to import from and to start it.
For this, the importer always requires the following folder structure:
📁 corpus_a
│   📄 corpusConfig.json
└───📁 input
    │   📄 uima_doc_1.xmi
    │   📄 uima_doc_2.xmi
    │   📄 ...
    │   📄 uima_doc_n.xmi
where corpusConfig.json holds metadata, and the input folder contains the actual UIMA files for a single corpus.
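Before starting an import, it can help to check that a corpus folder matches this layout. The following is a minimal sketch in Python and not part of UCE: the function name `check_corpus_folder` is hypothetical, and it only verifies what is stated above (a corpusConfig.json file and an input folder), not the contents of either.

```python
from pathlib import Path


def check_corpus_folder(corpus_dir):
    """Return a list of layout problems for one corpus folder (empty if none).

    Hypothetical helper, not part of UCE. It checks only that
    corpusConfig.json and the input folder exist, not what they contain.
    """
    root = Path(corpus_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []
    if not (root / "corpusConfig.json").is_file():
        problems.append("missing corpusConfig.json")
    if not (root / "input").is_dir():
        problems.append("missing input folder")
    return problems
```

Calling `check_corpus_folder("./path/to/my_corpora/corpus_a")` returns an empty list when both entries are present, so the result can be used directly in a condition before mounting the folder.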
Input Structure
As of now, the importer will recursively walk through the input folder, so every .xmi file in any subfolder will be considered.
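To preview which files a recursive walk of this kind would pick up, the same traversal can be approximated in Python. This is a sketch under assumptions: the helper name `find_xmi_files` is hypothetical, and the importer's exact matching rules (for example, how it treats upper-case extensions) are not documented here, so the sketch simply matches the pattern `*.xmi`.

```python
from pathlib import Path


def find_xmi_files(input_dir):
    """Return every .xmi file below input_dir, including all subfolders.

    Hypothetical helper that approximates the recursive walk described
    above; it is not the importer's own implementation.
    """
    return sorted(p for p in Path(input_dir).rglob("*.xmi") if p.is_file())
```

For example, `len(find_xmi_files("./path/to/my_corpora/corpus_a/input"))` gives a quick count of the documents to expect in an import.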
User Setup
Open the docker-compose.yaml file (if you haven't created the .env file yet, see here) and locate the uce-importer service. Within it, mount all local paths to the corpora you want to import using the structure described above, and map them like so: 
volumes:
    - "./path/to/my_corpora/corpus_a:/app/input/corpora/corpus_a"
    - "./path/to/other_corpora/corpus_b:/app/input/corpora/corpus_b"
    - "..."
You can mount as many corpora as you like using the same structure. Remember that you can adjust the number of threads used through the .env file.
Afterwards, start the uce-importer service through Docker Compose.
The importer will automatically import all corpora that are mounted to its local /app/input/corpora/ volume.
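When many corpora are mounted, the volumes entries shown above can be generated instead of typed by hand. The following Python sketch is an illustration only: the helper name `volume_entries` is hypothetical, and mounting each corpus under its own folder name follows the example above rather than a documented requirement.

```python
from pathlib import PurePosixPath

# Container-side root taken from the volume examples above.
CONTAINER_ROOT = PurePosixPath("/app/input/corpora")


def volume_entries(host_corpus_dirs):
    """Build one docker-compose volume entry per host corpus folder.

    Hypothetical helper: each corpus is mapped to
    /app/input/corpora/<folder name>, as in the example above.
    """
    entries = []
    seen = set()
    for host_dir in host_corpus_dirs:
        name = PurePosixPath(host_dir).name
        if name in seen:
            # Two corpora with the same folder name would share one mount point.
            raise ValueError(f"duplicate corpus folder name: {name}")
        seen.add(name)
        entries.append(f'- "{host_dir}:{CONTAINER_ROOT / name}"')
    return entries
```

Printing the returned strings one per line yields entries that can be pasted under volumes of the uce-importer service, indented to match the rest of the file.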
Developer Setup
In the source code, identify the module uce.corpus-importer and set up your IDE:
Setup
- Add a new Application configuration
- UCE is developed in Java 21
- Set -cp corpus-importer
- Main class: org.texttechnologylab.App
- CLI arguments are obligatory:
    - -src "./path/to/your_corpus/"
    - -num 1
    - -t 1
- Maven should automatically download and index the dependencies. If, for some reason, it does not, you can force an update via mvn clean install -U (in IntelliJ, open Execute Maven Goal, then enter the command).
Open the common.conf file and adjust the database connection parameters to match your database (port, host, etc.). Now start the importer and import your corpus. Refer to CLI Arguments for a full list of possible parameters.
Logs
The importer logs to both the PostgreSQL database (tables uceimport and importlog) and the local logs directory within the container. Both logs also appear in the standard output of the console.
CLI Arguments
| Argument | Description | 
|---|---|
| -src, --importSrc | The path to the corpus source where the UIMA-annotated files are stored. | 
| -srcDir, --importDir | Unlike -src, -srcDir is the path to a directory that holds multiple importable src paths. The importer checks for folders within this directory, where each folder should be an importable corpus with a corpusConfig.json and its input UIMA files. Those are then imported. | 
| -num, --importerNumber | When starting multiple importers, assign an id to each instance by counting up from 1 to n (not relevant as of now, just set it to 1). | 
| -t, --numThreads | UCE imports asynchronously. Sets the number of threads to use, e.g. 4, 8, or 16. By default, the import is single-threaded. | 
| -view, --casView | Name of the CAS view to import from. If not set, the default view (initial view) is used. Adjust this only if you're familiar with CAS views and UIMA. Otherwise, you probably don't need this. |
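If the importer is launched from a script, the arguments in the table can be assembled programmatically. The Python sketch below is a hedged illustration: the function `importer_args` and its parameter names are hypothetical, it uses only the short flags listed above, and treating -src and -srcDir as mutually exclusive is an assumption of this sketch rather than documented behavior.

```python
def importer_args(src=None, src_dir=None, importer_number=1,
                  num_threads=1, cas_view=None):
    """Assemble importer CLI arguments as a list of strings.

    Hypothetical helper. Assumes exactly one of src / src_dir is given;
    the importer itself may or may not enforce this.
    """
    if (src is None) == (src_dir is None):
        raise ValueError("pass exactly one of src or src_dir")

    if src is not None:
        args = ["-src", str(src)]
    else:
        args = ["-srcDir", str(src_dir)]

    args += ["-num", str(importer_number), "-t", str(num_threads)]

    # -view is optional; without it the default (initial) view is used.
    if cas_view is not None:
        args += ["-view", str(cas_view)]
    return args
```

The returned list can be appended to whatever command starts the importer in your setup; that launch command depends on how the module is built and run, so it is left out here.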