Python API Reference

Alligator

The primary entry point for programmatic use.

from alligator import Alligator

gator = Alligator(input_csv="tables/my_table.csv", **kwargs)
gator.run()

Constructor Parameters

All AlligatorConfig fields can be passed as keyword arguments. Commonly used ones:

Parameter                  Type                    Default                         Description
input_csv                  str | Path | DataFrame  required                        Input data source
num_workers                int                     cpu_count // 2                  Parallel retrieval workers
worker_batch_size          int                     64                              Rows per worker batch
candidate_retrieval_limit  int                     20                              Max candidates fetched per entity
max_candidates_in_result   int                     5                               Max candidates in output
num_ml_workers             int                     2                               ML pipeline workers
ml_worker_batch_size       int                     256                             ML batch size
mongo_uri                  str                     mongodb://gator-mongodb:27017/  MongoDB URI
target_columns             ColType | None          None                            Manual column type overrides
column_types               dict                    {}                              Wikidata type constraints per column
candidate_retrieval_only   bool                    False                           Stop after Phase 2
save_output                bool                    False                           Persist results to MongoDB
save_output_to_csv         bool                    False                           Write results to CSV

See Configuration Reference for the full parameter list.
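
As a rough sketch of how a few of the parameters above might be combined, the snippet below collects them into a dict and passes them to the constructor. The helpers `build_kwargs` and `run_pipeline` are illustrative names, not part of the library, and the `num_workers` value mirrors the documented `cpu_count // 2` default on the assumption that `cpu_count` refers to `os.cpu_count()`.

```python
import os


def build_kwargs(csv_path: str) -> dict:
    """Collect a few documented constructor parameters (illustrative helper)."""
    cpu_count = os.cpu_count() or 2  # os.cpu_count() may return None
    return {
        "input_csv": csv_path,
        "num_workers": max(1, cpu_count // 2),  # assumed meaning of the default
        "candidate_retrieval_limit": 20,
        "max_candidates_in_result": 5,
        "save_output_to_csv": True,
    }


def run_pipeline(csv_path: str) -> None:
    """Construct Alligator with the kwargs above and run it (hypothetical wrapper)."""
    from alligator import Alligator  # imported lazily; requires the package

    gator = Alligator(**build_kwargs(csv_path))
    gator.run()
```

Calling `run_pipeline("tables/my_table.csv")` would then run the full pipeline, provided the package is installed and the configured MongoDB instance is reachable.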

Methods

run() → None

Runs the complete pipeline: onboard data → worker processing → ML ranking → output.

gator.run()

onboard_data() → None

Runs only Phase 1: column classification and MongoDB ingestion.

gator.onboard_data()

save_output() → None

Manually triggers Phase 4: output assembly and optional CSV write.

gator.save_output()

close_mongo_connection() → None

Closes the MongoDB connection pool. Call this when done if managing the lifecycle manually.

gator.close_mongo_connection()

Key Types

ColType

Used for manual column type assignment via target_columns.

from alligator.types import ColType

target_columns: ColType = {
    "NE": {0: "OTHERS", 2: "LOC"},      # col_idx → NER label
    "LIT": {1: "NUMBER", 3: "STRING"},  # col_idx → literal type
    "IGNORED": [4, 5],
}

NER Labels: "LOC", "ORG", "PERS", "OTHERS"

Literal Types: "NUMBER", "STRING", "DATE", "BOOLEAN"
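
Since the labels are plain strings, a typo in `target_columns` is easy to miss. The following sketch checks a mapping against the label sets listed above before it is passed to the constructor. `validate_target_columns` is a hypothetical helper, not part of the library, and the rule that a column index belongs to only one group is an assumption.

```python
NER_LABELS = {"LOC", "ORG", "PERS", "OTHERS"}
LITERAL_TYPES = {"NUMBER", "STRING", "DATE", "BOOLEAN"}


def validate_target_columns(target_columns: dict) -> list[str]:
    """Return the problems found; an empty list means the mapping looks valid."""
    problems = []
    for idx, label in target_columns.get("NE", {}).items():
        if label not in NER_LABELS:
            problems.append(f"NE column {idx}: unknown NER label {label!r}")
    for idx, label in target_columns.get("LIT", {}).items():
        if label not in LITERAL_TYPES:
            problems.append(f"LIT column {idx}: unknown literal type {label!r}")
    # Assumption: a column index should appear in only one of the three groups.
    seen = {}
    for group in ("NE", "LIT", "IGNORED"):
        for idx in target_columns.get(group, []):  # dicts iterate over their keys
            if idx in seen:
                problems.append(f"column {idx} listed under both {seen[idx]} and {group}")
            seen[idx] = group
    return problems
```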

Entity

Represents a named entity extracted from an NE cell.

@dataclass
class Entity:
    value: str               # cell text value
    row_index: int
    col_index: int
    correct_qids: list[str]  # ground-truth QIDs (for evaluation)
    fuzzy: bool              # whether fuzzy retrieval was used
    ner_type: str            # NER label

Candidate

Represents a Wikidata entity candidate for a cell.

@dataclass
class Candidate:
    id: str           # Wikidata QID (e.g. "Q12345")
    name: str         # entity label
    description: str  # entity description
    score: float      # normalised ML score [0, 1]
    features: dict    # 27-feature vector
    types: list[str]  # Wikidata type QIDs
    predicates: dict  # predicate map for CPA
    matches: bool     # True for the auto-matched candidate

    def to_dict(self) -> dict: ...

    @classmethod
    def from_dict(cls, data: dict) -> 'Candidate': ...
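
To illustrate what a `to_dict` / `from_dict` round trip looks like, here is a self-contained sketch using a stand-in dataclass with the same field names. `CandidateSketch`, its default values, and the `asdict`-based bodies are assumptions made for the example; the library's own implementation may differ.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class CandidateSketch:
    """Stand-in mirroring the documented fields; not the library's class."""

    id: str
    name: str
    description: str = ""  # defaults below are assumptions for brevity
    score: float = 0.0
    features: dict = field(default_factory=dict)
    types: list = field(default_factory=list)
    predicates: dict = field(default_factory=dict)
    matches: bool = False

    def to_dict(self) -> dict:
        # Convert every field to plain Python containers.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "CandidateSketch":
        # Rebuild the instance from the keys produced by to_dict().
        return cls(**data)


original = CandidateSketch(id="Q12345", name="example label", score=0.9, matches=True)
restored = CandidateSketch.from_dict(original.to_dict())
```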

Feature Names

The 27 default features used in ML scoring are accessible via:

from alligator.feature import DEFAULT_FEATURES

print(DEFAULT_FEATURES)
# ['ed_score', 'jaccard', 'jaro_winkler', ..., 'cta_t1', 'cpa_t1', ...]

Feature categories:

  • String similarity: exact match, Levenshtein, Jaro-Winkler, Jaccard
  • N-gram overlap: character and token level
  • Description similarity: desc, descNgram
  • Type indicators: ntype_LOC, ntype_ORG, ntype_PERS, ntype_OTHERS
  • Column relationship features (rerank only): cta_t1…t5, cpa_t1…t5, lit_*
  • Retrieval score: ed_score (score from LAMAPI)
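
The categories above can be partly recovered from the feature names themselves. The sketch below groups names by the prefixes mentioned in the list; `feature_category` is a hypothetical helper, and names it cannot place by prefix (such as the string-similarity features) fall into "other" because their exact spellings are not all listed here.

```python
def feature_category(name: str) -> str:
    """Map a feature name to one of the categories listed above (illustrative)."""
    if name == "ed_score":
        return "retrieval score"
    if name.startswith("ntype_"):
        return "type indicator"
    if name.startswith(("cta_t", "cpa_t", "lit_")):
        return "column relationship"
    if name.startswith("desc"):
        return "description similarity"
    return "other"


# Sample names taken from the text above, not the full DEFAULT_FEATURES list.
sample = ["ed_score", "jaccard", "ntype_LOC", "cta_t1", "cpa_t1", "descNgram"]
grouped = {}
for feature in sample:
    grouped.setdefault(feature_category(feature), []).append(feature)
```

With the package installed, the same loop could be run over `DEFAULT_FEATURES` instead of `sample`.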