Configuration Reference

All configuration is handled through AlligatorConfig, composed of six sub-configuration groups. Every field can be passed directly to the Alligator constructor.

Data Configuration

Controls input/output data handling.

Field	Type	Default	Description
`input_csv`	`str \| Path \| DataFrame`	required	Path to input CSV or a pandas DataFrame
`output_csv`	`str`	auto	Output CSV path (derived from input filename)
`dataset_name`	`str`	UUID hex	Auto-generated dataset identifier
`table_name`	`str`	filename stem	Auto-derived from input filename
`target_rows`	`list[int]`	`[]`	Row indices to process (empty = all rows)
`target_columns`	`ColType \| None`	`None`	Dict with `NE`/`LIT`/`IGNORED` column assignments
`column_types`	`dict`	`{}`	Map `col_idx` → `[QID, ...]` for type-constrained search
`save_output`	`bool`	`False`	Enable result persistence to MongoDB
`save_output_to_csv`	`bool`	`False`	Write results back to a CSV file
`correct_qids`	`dict`	`{}`	Ground-truth QIDs for evaluation purposes
`dry_run`	`bool`	`False`	Skip actual processing (for testing)
`candidate_retrieval_only`	`bool`	`False`	Stop after Phase 2 (skip ML ranking)
`csv_separator`	`str`	`","`	CSV column separator character
`csv_header`	`str`	`"infer"`	CSV header row handling

Worker Configuration

Controls parallel retrieval workers.

Field	Type	Default	Description
`num_workers`	`int`	`cpu_count // 2`	Number of parallel retrieval workers
`worker_batch_size`	`int`	`64`	Number of rows per worker batch

Retrieval Configuration

Controls LAMAPI endpoint connections.

Field	Type	Default	Description
`entity_retrieval_endpoint`	`str`	`$ENTITY_RETRIEVAL_ENDPOINT`	Entity lookup API URL
`entity_retrieval_token`	`str`	`$ENTITY_RETRIEVAL_TOKEN`	Auth token for the retrieval API
`object_retrieval_endpoint`	`str`	`$OBJECT_RETRIEVAL_ENDPOINT`	Object relationship endpoint URL
`literal_retrieval_endpoint`	`str`	`$LITERAL_RETRIEVAL_ENDPOINT`	Literal values endpoint URL
`candidate_retrieval_limit`	`int`	`20`	Max candidates fetched per entity
`max_candidates_in_result`	`int`	`5`	Max candidates kept in the final output
`http_session_limit`	`int`	`32`	Max concurrent HTTP connections per worker
`http_session_ssl_verify`	`bool`	`False`	Verify SSL certificates for API calls

ML Configuration

Controls the machine learning ranking pipeline.

Field	Type	Default	Description
`ranker_model_path`	`str`	`alligator/models/default.h5`	Path to the Keras ranking model
`reranker_model_path`	`str`	same as ranker	Path to the Keras reranking model
`num_ml_workers`	`int`	`2`	Number of ML worker processes
`ml_worker_batch_size`	`int`	`256`	Document batch size for ML prediction
`ml_processor_id`	`str`	`"ml-processor"`	Identifier prefix for atomic batch claiming
`selected_features`	`list[str]`	27 default features	Feature names used in ML scoring

Feature Configuration

Field	Type	Default	Description
`top_n_cta_cpa_freq`	`int`	`3`	Top-N type/predicate frequencies injected as global features
`doc_percentage_type_features`	`float`	`1.0`	Fraction of documents used to compute global type frequencies

Database Configuration

Field	Type	Default	Description
`mongo_uri`	`str`	`mongodb://gator-mongodb:27017/`	MongoDB connection URI
`db_name`	`str`	`"alligator_db"`	Database name
`input_collection`	`str`	`"input_data"`	Collection for input row documents
`cache_collection`	`str`	`"candidate_cache"`	TTL cache for candidate API responses
`object_cache_collection`	`str`	`"object_cache"`	Cache for object relationship responses
`literal_cache_collection`	`str`	`"literal_cache"`	Cache for literal value responses
`error_log_collection`	`str`	`"error_logs"`	Collection for error logging

Match Threshold Variables

Configured via environment variables; control how ML output is translated into binary match decisions.

Variable	Default	Description
`RAW_MIN_CONFIDENCE`	`0.1`	Minimum raw ML score for a cell to be eligible for auto-matching
`MATCH_THRESHOLD`	`0.5`	Minimum normalised score for the top candidate to be auto-matched
`MATCH_MARGIN_DELTA`	`0.1`	Also accept if the top candidate leads the second by at least this delta

See Scoring & Thresholds for a full explanation of how these interact.

Data Configuration​

Worker Configuration​

Retrieval Configuration​

ML Configuration​

Feature Configuration​

Database Configuration​

Match Threshold Variables​