Configuration Reference

All configuration is handled through AlligatorConfig, composed of six sub-configuration groups. Every field can be passed directly to the Alligator constructor.
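
As a rough illustration of passing fields directly, the sketch below collects a few of the documented fields into keyword arguments. The `alligator` import path and the input file path are assumptions, since this page does not state them, so the constructor call is left commented out.

```python
# Field names come from the tables on this page; the values are illustrative.
settings = {
    "input_csv": "tables/movies.csv",  # hypothetical input file
    "target_rows": [],                 # empty list = process all rows
    "save_output_to_csv": True,
    "num_workers": 4,
    "worker_batch_size": 64,
    "candidate_retrieval_limit": 20,
    "max_candidates_in_result": 5,
}

# Assumed import path and call shape (not confirmed by this page):
# from alligator import Alligator
# gator = Alligator(**settings)
```

Keeping the settings in a plain dict makes it easy to log or diff a run's configuration before handing it to the constructor.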

Data Configuration

Controls input/output data handling.

| Field | Type | Default | Description |
|---|---|---|---|
| input_csv | str \| Path \| DataFrame | required | Path to input CSV or a pandas DataFrame |
| output_csv | str | auto | Output CSV path (derived from input filename) |
| dataset_name | str | UUID hex | Auto-generated dataset identifier |
| table_name | str | filename stem | Auto-derived from input filename |
| target_rows | list[int] | [] | Row indices to process (empty = all rows) |
| target_columns | ColType \| None | None | Dict with NE/LIT/IGNORED column assignments |
| column_types | dict | {} | Maps col_idx to [QID, ...] for type-constrained search |
| save_output | bool | False | Enable result persistence to MongoDB |
| save_output_to_csv | bool | False | Write results back to a CSV file |
| correct_qids | dict | {} | Ground-truth QIDs for evaluation purposes |
| dry_run | bool | False | Skip actual processing (for testing) |
| candidate_retrieval_only | bool | False | Stop after Phase 2 (skip ML ranking) |
| csv_separator | str | "," | CSV column separator character |
| csv_header | str | "infer" | CSV header row handling |

Worker Configuration

Controls parallel retrieval workers.

| Field | Type | Default | Description |
|---|---|---|---|
| num_workers | int | cpu_count // 2 | Number of parallel retrieval workers |
| worker_batch_size | int | 64 | Number of rows per worker batch |
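
To make the two defaults concrete, here is a minimal sketch of how a `cpu_count // 2` worker count and 64-row batches could be computed. The helper names are hypothetical, the floor of one worker is an added safeguard, and the way Alligator actually splits rows is not described on this page.

```python
import os


def default_num_workers() -> int:
    # os.cpu_count() can return None on some platforms; fall back to 1.
    cpus = os.cpu_count() or 1
    # Assumption: never go below one worker on single-core machines.
    return max(1, cpus // 2)


def row_batches(row_indices, batch_size=64):
    # Yield consecutive slices of at most batch_size row indices.
    for start in range(0, len(row_indices), batch_size):
        yield row_indices[start:start + batch_size]


batch_lengths = [len(b) for b in row_batches(list(range(150)))]
print(default_num_workers(), batch_lengths)
```

With 150 rows and the default batch size, the last batch is simply shorter than the others.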

Retrieval Configuration

Controls LAMAPI endpoint connections.

| Field | Type | Default | Description |
|---|---|---|---|
| entity_retrieval_endpoint | str | $ENTITY_RETRIEVAL_ENDPOINT | Entity lookup API URL |
| entity_retrieval_token | str | $ENTITY_RETRIEVAL_TOKEN | Auth token for the retrieval API |
| object_retrieval_endpoint | str | $OBJECT_RETRIEVAL_ENDPOINT | Object relationship endpoint URL |
| literal_retrieval_endpoint | str | $LITERAL_RETRIEVAL_ENDPOINT | Literal values endpoint URL |
| candidate_retrieval_limit | int | 20 | Max candidates fetched per entity |
| max_candidates_in_result | int | 5 | Max candidates kept in the final output |
| http_session_limit | int | 32 | Max concurrent HTTP connections per worker |
| http_session_ssl_verify | bool | False | Verify SSL certificates for API calls |
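
The endpoint and token defaults are read from environment variables. The sketch below shows one way such defaults could be resolved; `retrieval_defaults` is a hypothetical helper rather than part of Alligator, and the example URL is a placeholder.

```python
import os


def retrieval_defaults(env=None):
    # Accept an explicit mapping so the lookup can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    return {
        "entity_retrieval_endpoint": env.get("ENTITY_RETRIEVAL_ENDPOINT"),
        "entity_retrieval_token": env.get("ENTITY_RETRIEVAL_TOKEN"),
        "object_retrieval_endpoint": env.get("OBJECT_RETRIEVAL_ENDPOINT"),
        "literal_retrieval_endpoint": env.get("LITERAL_RETRIEVAL_ENDPOINT"),
        # Literal defaults copied from the table above.
        "candidate_retrieval_limit": 20,
        "max_candidates_in_result": 5,
        "http_session_limit": 32,
        "http_session_ssl_verify": False,
    }


cfg = retrieval_defaults({"ENTITY_RETRIEVAL_ENDPOINT": "https://example.invalid/lookup"})
print(cfg["entity_retrieval_endpoint"])
```

Unset variables come back as `None` here; what Alligator does when an endpoint is missing is not stated on this page.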

ML Configuration

Controls the machine learning ranking pipeline.

| Field | Type | Default | Description |
|---|---|---|---|
| ranker_model_path | str | alligator/models/default.h5 | Path to the Keras ranking model |
| reranker_model_path | str | same as ranker | Path to the Keras reranking model |
| num_ml_workers | int | 2 | Number of ML worker processes |
| ml_worker_batch_size | int | 256 | Document batch size for ML prediction |
| ml_processor_id | str | "ml-processor" | Identifier prefix for atomic batch claiming |
| selected_features | list[str] | 27 default features | Feature names used in ML scoring |

Feature Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| top_n_cta_cpa_freq | int | 3 | Top-N type/predicate frequencies injected as global features |
| doc_percentage_type_features | float | 1.0 | Fraction of documents used to compute global type frequencies |
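
One way to picture these two settings: sample a fraction of the documents, count how often each type occurs, and keep the top N. The sketch below is an illustration under that assumption only; the function name, the document shape (one list of type IDs per document), and the sampling strategy are all hypothetical and may differ from what Alligator does.

```python
import random
from collections import Counter


def top_n_type_frequencies(docs, top_n=3, doc_fraction=1.0, seed=0):
    """Return [(type_id, share_of_sampled_docs), ...] for the top_n types."""
    if not docs:
        return []
    # Number of documents to sample, with a floor of one.
    k = max(1, int(len(docs) * doc_fraction))
    sample = docs if k >= len(docs) else random.Random(seed).sample(docs, k)
    # Count each type at most once per document.
    counts = Counter(t for doc in sample for t in set(doc))
    return [(t, c / len(sample)) for t, c in counts.most_common(top_n)]


docs = [["Q5", "Q215627"], ["Q5"], ["Q515"], ["Q5", "Q515"]]
top_types = top_n_type_frequencies(docs, top_n=2)
print(top_types)
```

Lowering `doc_fraction` trades accuracy of the frequency estimate for less work on large tables.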

Database Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| mongo_uri | str | mongodb://gator-mongodb:27017/ | MongoDB connection URI |
| db_name | str | "alligator_db" | Database name |
| input_collection | str | "input_data" | Collection for input row documents |
| cache_collection | str | "candidate_cache" | TTL cache for candidate API responses |
| object_cache_collection | str | "object_cache" | Cache for object relationship responses |
| literal_cache_collection | str | "literal_cache" | Cache for literal value responses |
| error_log_collection | str | "error_logs" | Collection for error logging |

Match Threshold Variables

These thresholds are set via environment variables and control how ML output is translated into binary match decisions.

| Variable | Default | Description |
|---|---|---|
| RAW_MIN_CONFIDENCE | 0.1 | Minimum raw ML score for a cell to be eligible for auto-matching |
| MATCH_THRESHOLD | 0.5 | Minimum normalised score for the top candidate to be auto-matched |
| MATCH_MARGIN_DELTA | 0.1 | Also accept if the top candidate leads the second by at least this delta |

See Scoring & Thresholds for a full explanation of how these interact.
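
As a reading aid, the sketch below encodes one plausible interpretation of the three variables: the raw score acts as an eligibility gate, and an eligible top candidate is matched if it clears the threshold or leads the runner-up by the margin. The function names are hypothetical, and treating the margin as an alternative computed on normalised scores is an assumption; the Scoring & Thresholds page remains the authority.

```python
import os


def load_thresholds(env=None):
    # Read the three variables, falling back to the documented defaults.
    env = os.environ if env is None else env
    return {
        "raw_min": float(env.get("RAW_MIN_CONFIDENCE", 0.1)),
        "threshold": float(env.get("MATCH_THRESHOLD", 0.5)),
        "margin": float(env.get("MATCH_MARGIN_DELTA", 0.1)),
    }


def is_auto_match(raw_top, norm_top, norm_second, t):
    # Gate: cells whose raw ML score is too low are never auto-matched.
    if raw_top < t["raw_min"]:
        return False
    # Rule 1: the normalised top score clears the threshold.
    if norm_top >= t["threshold"]:
        return True
    # Rule 2 (assumed alternative): the top candidate leads by the margin.
    return (norm_top - norm_second) >= t["margin"]


t = load_thresholds({})
print(is_auto_match(0.3, 0.6, 0.55, t))
```

Under this reading, a top candidate with a normalised score of 0.4 still matches when the second candidate sits at 0.25, because the lead exceeds the 0.1 margin.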