DUUI Importer — Configuration Reference

The DUUI Importer reads UIMA CAS files (XMI or gzip-compressed XMI) and writes their annotation data into the UDAV PostgreSQL database using bulk COPY operations.

All settings are passed via environment variables (.env file or Docker environment).



Core Importer Settings

DUUI_IMPORTER

   
| Property | Value |
| --- | --- |
| Type | boolean |
| Default | false |
| Required | Yes (to activate the importer) |

Enables or disables the DUUI importer on application startup. When false, the importer is completely inactive and adds no overhead.

DUUI_IMPORTER=true

Input & TypeSystem

DUUI_IMPORTER_PATH

   
| Property | Value |
| --- | --- |
| Type | string (path) |
| Default | src/main/resources/input |
| Required | Yes |

Path to the directory containing the XMI or GZ input files.

Docker Compose: This must be an absolute path on the host machine. Docker Compose mounts it into the container at /app/data/input (read-only).

DUUI_IMPORTER_PATH=/data/my-corpus/xmi

DUUI_IMPORTER_FILE_ENDING

   
| Property | Value |
| --- | --- |
| Type | string |
| Default | .xmi |
| Allowed values | .xmi, .gz |

File extension used to discover input files in DUUI_IMPORTER_PATH. Only files matching this suffix are processed.

DUUI_IMPORTER_FILE_ENDING=.gz
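
To preview which files a given ending would select before starting an import, a small script along these lines may help. It is only a sketch: the helper name `discover_inputs` is hypothetical, and searching subdirectories is an assumption, since this reference does not say whether the real reader recurses.

```python
from pathlib import Path


def discover_inputs(root: str, file_ending: str = ".xmi") -> list[Path]:
    """Return the files under root whose name ends with file_ending, sorted."""
    if file_ending not in (".xmi", ".gz"):
        raise ValueError(f"unsupported file ending: {file_ending!r}")
    base = Path(root)
    # Assumption: subdirectories are searched too; the real reader may not recurse.
    return sorted(
        p for p in base.rglob("*")
        if p.is_file() and p.name.endswith(file_ending)
    )
```

Note that a file named `doc.xmi.gz` matches `.gz` but not `.xmi`, because the match is on the suffix of the full file name.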

DUUI_IMPORTER_TYPE_SYSTEM_PATH

   
| Property | Value |
| --- | --- |
| Type | string (path) |
| Default | src/main/resources/types/PlenumTypeSystem.xml (bare JAR) / auto-detected (Docker) |
| Required | No |

Path to an external TypeSystem XML file (or the directory containing it). When set, UDAV loads this type system instead of auto-detecting it from the XMI files.

When to set:

Docker Compose: Set this to an absolute host path pointing to the folder or file. The path is mounted into the container at /app/data/types (read-only).

# Leave blank for auto-detection (recommended for most setups)
DUUI_IMPORTER_TYPE_SYSTEM_PATH=

# Or point to your type system file
DUUI_IMPORTER_TYPE_SYSTEM_PATH=/data/my-corpus/typesystem

Note: If the path is set but the file does not exist, startup will fail with an explicit error.
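
The blank-versus-set behaviour described above can be checked ahead of time with a pre-flight script. The following is a minimal sketch, not the importer's actual logic; the function name `resolve_type_system` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_type_system(path: str) -> Optional[Path]:
    """Return the configured path, or None when auto-detection should be used."""
    if not path.strip():
        return None  # blank value: fall back to auto-detection
    candidate = Path(path)
    if not candidate.exists():
        # Mirrors the documented behaviour: a set but missing path is an error.
        raise FileNotFoundError(
            f"DUUI_IMPORTER_TYPE_SYSTEM_PATH does not exist: {candidate}"
        )
    return candidate  # may be a file or the directory containing it
```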


Parallelism & Performance

DUUI_IMPORTER_WORKERS

   
| Property | Value |
| --- | --- |
| Type | integer |
| Default | 4 |
| Min | 1 |

Number of parallel UIMA worker threads that process CAS objects. Each worker holds a CAS from the pool.

Rule of thumb: set equal to the number of available CPU cores. Higher values can improve throughput on I/O-bound corpora but increase memory usage.

DUUI_IMPORTER_WORKERS=8

DUUI_IMPORTER_CAS_POOL_SIZE

   
| Property | Value |
| --- | --- |
| Type | integer |
| Default | workers × 2 (minimum 1) |

Size of the UIMA CAS object pool. A larger pool allows more documents to be in-flight simultaneously but increases heap memory usage.

Rule of thumb: workers × 2 is a safe starting point. Increase if workers are often idle waiting for a CAS, decrease if memory is tight.

DUUI_IMPORTER_CAS_POOL_SIZE=16
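
The default rule (workers × 2, never below 1) can be written out as a short sketch. The function name is hypothetical, and clamping an explicitly configured value to at least 1 is an assumption rather than documented behaviour.

```python
def effective_cas_pool_size(workers: int, configured: str = "") -> int:
    """Resolve the CAS pool size from the worker count and an optional override."""
    workers = max(1, workers)  # DUUI_IMPORTER_WORKERS has a minimum of 1
    if configured.strip():
        # Assumption: an explicit value is also clamped to at least 1.
        return max(1, int(configured))
    return max(1, workers * 2)  # documented default: workers x 2, minimum 1
```

With the default of 4 workers this yields a pool of 8, which matches the standard import example further down.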

DUUI_IMPORTER_READER_BATCH_SIZE

   
| Property | Value |
| --- | --- |
| Type | integer |
| Default | 10 |

Number of documents read per batch by DUUIFileReaderLazy. Larger batches reduce file-system overhead but increase per-batch memory usage.

DUUI_IMPORTER_READER_BATCH_SIZE=20

DUUI_IMPORTER_DB_WORKERS

   
| Property | Value |
| --- | --- |
| Type | integer |
| Default | 1 |

Number of parallel JooqDatabaseWriter instances (COPY writer stage). Each writer opens its own database connection and writes independently.

Increasing this can improve throughput when the database is the bottleneck, but adds connection overhead and may cause contention on the same tables.

DUUI_IMPORTER_DB_WORKERS=2

DUUI_IMPORTER_PREPARE_DB_SCHEMA

   
| Property | Value |
| --- | --- |
| Type | boolean |
| Default | true |

When true, a dedicated single-threaded schema-preparation stage runs before the parallel COPY writers. This stage creates all necessary tables and columns based on the type system but does not insert any data.

Why this matters: DDL operations (CREATE TABLE, ALTER TABLE) are not safe to run from multiple parallel workers simultaneously. By separating schema setup from data writing, the parallel COPY stage can run with allowDdl=false, avoiding lock contention.

Set to false only if you are certain the schema already exists and is complete.

DUUI_IMPORTER_PREPARE_DB_SCHEMA=true
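
The two-phase idea (DDL once on a single thread, then data-only parallel writes) can be illustrated as follows. This is a conceptual sketch in Python, not the importer's Java pipeline; `run_import`, `create_tables`, and `copy_batch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_import(batches, db_workers, prepare_schema, create_tables, copy_batch):
    """Run schema preparation once, then write batches in parallel without DDL."""
    if prepare_schema:
        # Phase 1: single-threaded, so no two workers issue DDL at the same time.
        create_tables()
    with ThreadPoolExecutor(max_workers=db_workers) as pool:
        # Phase 2: parallel writers only insert data (allow_ddl=False).
        return list(pool.map(lambda b: copy_batch(b, allow_ddl=False), batches))
```

Because phase 1 finishes before the pool is created, every writer can assume the tables and columns it needs already exist.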

Database Settings

These settings configure the database connection used by the DUUI importer’s JooqDatabaseWriter. They are shared with the main UDAV application.

DB_URL

   
| Property | Value |
| --- | --- |
| Type | string (JDBC URL) |
| Default | jdbc:postgresql://localhost:5432/postgres |

JDBC connection URL. In Docker Compose, the host is the service name (postgres).

DB_URL=jdbc:postgresql://postgres:5432/udav

DB_USER / DB_PASS

   
| Property | Value |
| --- | --- |
| Type | string |
| Default | postgres / postgres |

Database username and password.

DB_USER=postgres
DB_PASS=postgres

DB_SCHEMA

   
| Property | Value |
| --- | --- |
| Type | string |
| Default | public |

PostgreSQL schema in which UDAV creates all its tables (documents, sofas, uima_type_registry, and all annotation tables). The schema is created automatically if it does not exist.

DB_SCHEMA=public

DB_BATCH_SIZE

   
| Property | Value |
| --- | --- |
| Type | integer |
| Default | 3000 (application default) / 10000 (writer default) |
| Range | 1000 to 15000 recommended |

Number of rows accumulated in a COPY buffer before flushing to PostgreSQL. The buffer also flushes automatically when it exceeds 16 MB regardless of row count.

DB_BATCH_SIZE=10000
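
The two flush triggers (row count and the 16 MB size cap) can be sketched as a small buffer class. This is an illustration of the rule only; the class `CopyBuffer` and its callback are hypothetical and do not reflect the writer's real internals.

```python
class CopyBuffer:
    """Accumulates rows and flushes on row count or byte size, whichever comes first."""

    MAX_BYTES = 16 * 1024 * 1024  # documented size cap: 16 MB

    def __init__(self, batch_size: int = 10000, on_flush=None):
        self.batch_size = batch_size
        self.on_flush = on_flush or (lambda rows: None)
        self.rows: list[str] = []
        self.size = 0

    def add(self, row: str) -> None:
        self.rows.append(row)
        self.size += len(row.encode("utf-8"))
        # Flush when either limit is reached, regardless of the other.
        if len(self.rows) >= self.batch_size or self.size > self.MAX_BYTES:
            self.flush()

    def flush(self) -> None:
        if self.rows:
            self.on_flush(self.rows)
            self.rows, self.size = [], 0
```

A final `flush()` call after the last document writes whatever is left in the buffer.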

DB_MAX_IDENT

   
| Property | Value |
| --- | --- |
| Type | integer |
| Default | 255 (application) / 63 (writer hard-cap for PostgreSQL) |

Maximum length for generated SQL identifiers (table and column names). For PostgreSQL, the hard limit is 63 bytes regardless of this setting. The writer enforces max(16, min(configured, dialect_hard_limit)).

| Database | Hard limit |
| --- | --- |
| PostgreSQL | 63 |
| MySQL / MariaDB | 64 |

DB_MAX_IDENT=63
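
The clamping formula max(16, min(configured, dialect_hard_limit)) is easy to evaluate by hand, but a short sketch makes the edge cases explicit. The function name and the dictionary keys other than POSTGRES are assumptions for illustration.

```python
# Hard limits from the table above; the MYSQL and MARIADB keys are assumed names.
HARD_LIMITS = {"POSTGRES": 63, "MYSQL": 64, "MARIADB": 64}


def effective_max_ident(configured: int, dialect: str = "POSTGRES") -> int:
    """Apply max(16, min(configured, dialect_hard_limit))."""
    return max(16, min(configured, HARD_LIMITS[dialect]))
```

So the application default of 255 becomes 63 on PostgreSQL, and a value below 16 is raised to 16.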

DB_DIALECT

   
| Property | Value |
| --- | --- |
| Type | string |
| Default | POSTGRES |

SQL dialect for jOOQ. Currently only POSTGRES is fully supported by the DUUI importer (the writer validates this on startup and will fail for other dialects).

DB_DIALECT=POSTGRES

Advanced / Debug

DUUI_IMPORTER_SKIP_VERIFICATION

   
| Property | Value |
| --- | --- |
| Type | boolean |
| Default | false |

Passes withSkipVerification(true) to the DUUIComposer. Skips component readiness checks during initialization, which can speed up startup for known-good setups.

DUUI_IMPORTER_SKIP_VERIFICATION=false

DUUI_IMPORTER_DEBUG_XMI

   
| Property | Value |
| --- | --- |
| Type | boolean |
| Default | false |

When true, adds an XmiWriter stage to the pipeline that writes each processed CAS back to disk as a gzip-compressed XMI file. Useful for inspecting what the importer sees after RemoveMetaInformation has run.

This adds significant I/O overhead. Only enable for debugging.

DUUI_IMPORTER_DEBUG_XMI=false

DUUI_IMPORTER_DEBUG_XMI_PATH

   
| Property | Value |
| --- | --- |
| Type | string (path) |
| Default | /tmp/export |
| Only used when | DUUI_IMPORTER_DEBUG_XMI=true |

Output directory for debug XMI files written by the XmiWriter stage. The directory must be writable by the application process.

DUUI_IMPORTER_DEBUG_XMI_PATH=/tmp/udav-debug-xmi

Pipeline Hash

DUUI_IMPORTER_PIPELINE_HASH_EXTRA

   
| Property | Value |
| --- | --- |
| Type | string |
| Default | (empty) |

An optional string appended as extra=<value> when computing the pipeline hash. The pipeline hash is a SHA-256 digest derived from:

The pipeline hash is stored alongside each document in the documents table. When the hash changes, documents are considered outdated and are re-imported on the next run. Use this setting to force a full re-import without changing any other option.

DUUI_IMPORTER_PIPELINE_HASH_EXTRA=reimport-2026-05-11
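
To see why changing the extra value invalidates every stored hash, consider the sketch below. The inputs to the real hash and the way they are serialized are not specified here, so `parts`, the newline separator, and the function name are all assumptions; only the SHA-256 digest and the extra=<value> suffix come from this reference.

```python
import hashlib


def pipeline_hash(parts: list[str], extra: str = "") -> str:
    """Hash a list of pipeline descriptors, optionally with an extra=<value> entry."""
    items = list(parts)  # hypothetical inputs; the real list is not documented here
    if extra:
        items.append(f"extra={extra}")
    # Assumption: entries are joined with newlines before hashing.
    return hashlib.sha256("\n".join(items).encode("utf-8")).hexdigest()
```

Any change to the extra string produces a different digest, so documents stored under the old digest no longer match and are re-imported.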

Configuration Reference Table

| Environment Variable | Type | Default | Description |
| --- | --- | --- | --- |
| DUUI_IMPORTER | boolean | false | Enable the importer |
| DUUI_IMPORTER_PATH | path | src/main/resources/input | Host path to input XMI/GZ files |
| DUUI_IMPORTER_FILE_ENDING | string | .xmi | File extension: .xmi or .gz |
| DUUI_IMPORTER_TYPE_SYSTEM_PATH | path | (auto-detect) | Path to external TypeSystem XML |
| DUUI_IMPORTER_WORKERS | integer | 4 | Parallel UIMA processing workers |
| DUUI_IMPORTER_CAS_POOL_SIZE | integer | workers × 2 | UIMA CAS object pool size |
| DUUI_IMPORTER_READER_BATCH_SIZE | integer | 10 | Documents per reader batch |
| DUUI_IMPORTER_DB_WORKERS | integer | 1 | Parallel DB COPY writer instances |
| DUUI_IMPORTER_PREPARE_DB_SCHEMA | boolean | true | Run schema-prep stage before COPY writers |
| DUUI_IMPORTER_STORE_COVERED_TEXT | boolean | false | Store annotation covered text in DB |
| DUUI_IMPORTER_SKIP_VERIFICATION | boolean | false | Skip DUUI composer verification |
| DUUI_IMPORTER_DEBUG_XMI | boolean | false | Write processed CAS to disk as XMI |
| DUUI_IMPORTER_DEBUG_XMI_PATH | path | /tmp/export | Output dir for debug XMI files |
| DUUI_IMPORTER_PIPELINE_HASH_EXTRA | string | (empty) | Extra value appended to pipeline hash |
| DB_URL | string | jdbc:postgresql://localhost:5432/postgres | JDBC connection URL |
| DB_USER | string | postgres | Database username |
| DB_PASS | string | postgres | Database password |
| DB_SCHEMA | string | public | PostgreSQL schema |
| DB_BATCH_SIZE | integer | 3000 | COPY buffer row limit |
| DB_MAX_IDENT | integer | 255 (capped at 63 for PostgreSQL) | Max SQL identifier length |
| DB_DIALECT | string | POSTGRES | SQL dialect (only POSTGRES supported) |
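
Since all settings arrive as plain strings, it can be useful to sanity-check a .env file against the types in this table before starting the application. The following is a rough sketch; `parse_env` and `validate` are hypothetical helpers, and requiring every listed integer to be at least 1 is an assumption (only DUUI_IMPORTER_WORKERS documents that minimum).

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        env[key.strip()] = value.strip()
    return env


def validate(env: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for key in ("DUUI_IMPORTER", "DUUI_IMPORTER_PREPARE_DB_SCHEMA",
                "DUUI_IMPORTER_SKIP_VERIFICATION", "DUUI_IMPORTER_DEBUG_XMI"):
        if key in env and env[key].lower() not in ("true", "false"):
            problems.append(f"{key} must be true or false")
    # Assumption: all of these must be >= 1; only WORKERS documents that minimum.
    for key in ("DUUI_IMPORTER_WORKERS", "DUUI_IMPORTER_DB_WORKERS",
                "DUUI_IMPORTER_READER_BATCH_SIZE", "DB_BATCH_SIZE"):
        if key in env and not (env[key].isdigit() and int(env[key]) >= 1):
            problems.append(f"{key} must be an integer >= 1")
    if env.get("DUUI_IMPORTER_FILE_ENDING", ".xmi") not in (".xmi", ".gz"):
        problems.append("DUUI_IMPORTER_FILE_ENDING must be .xmi or .gz")
    return problems
```

Unknown keys such as JAVA_OPTS are parsed but not checked, so the sketch can be run against complete .env files.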

Example Configurations

Minimal — browse existing data, no import

DUUI_IMPORTER=false
DB_URL=jdbc:postgresql://postgres:5432/udav
DB_USER=postgres
DB_PASS=postgres

Standard import — uncompressed XMI, auto-detected type system

DUUI_IMPORTER=true
DUUI_IMPORTER_PATH=/data/my-corpus/xmi
DUUI_IMPORTER_FILE_ENDING=.xmi
DUUI_IMPORTER_WORKERS=4
DUUI_IMPORTER_CAS_POOL_SIZE=8
DUUI_IMPORTER_DB_WORKERS=1
DUUI_IMPORTER_PREPARE_DB_SCHEMA=true

DB_URL=jdbc:postgresql://postgres:5432/udav
DB_USER=postgres
DB_PASS=postgres
DB_BATCH_SIZE=10000
JAVA_OPTS=-Xmx10G -Xms1024m

High-throughput import — gzip corpus, explicit type system

DUUI_IMPORTER=true
DUUI_IMPORTER_PATH=/data/my-corpus/gz
DUUI_IMPORTER_FILE_ENDING=.gz
DUUI_IMPORTER_TYPE_SYSTEM_PATH=/data/my-corpus/typesystem
DUUI_IMPORTER_WORKERS=8
DUUI_IMPORTER_CAS_POOL_SIZE=16
DUUI_IMPORTER_READER_BATCH_SIZE=20
DUUI_IMPORTER_DB_WORKERS=2
DUUI_IMPORTER_PREPARE_DB_SCHEMA=true

DB_URL=jdbc:postgresql://postgres:5432/udav
DB_USER=postgres
DB_PASS=postgres
DB_BATCH_SIZE=10000
JAVA_OPTS=-Xmx20G -Xms2G

Force full re-import of an existing corpus

Change only DUUI_IMPORTER_PIPELINE_HASH_EXTRA — this invalidates the stored pipeline hash for every document, causing all documents to be re-processed on the next run.

DUUI_IMPORTER_PIPELINE_HASH_EXTRA=reimport-2026-05-11