Python API Reference

`Alligator`

The primary entry point for programmatic use.

from alligator import Alligator

gator = Alligator(input_csv="tables/my_table.csv", **kwargs)
gator.run()

Constructor Parameters

All `AlligatorConfig` fields can be passed as keyword arguments. The most commonly used are:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `input_csv` | `str \| Path \| DataFrame` | required | Input data source |
| `num_workers` | `int` | `cpu_count // 2` | Parallel retrieval workers |
| `worker_batch_size` | `int` | `64` | Rows per worker batch |
| `candidate_retrieval_limit` | `int` | `20` | Max candidates fetched per entity |
| `max_candidates_in_result` | `int` | `5` | Max candidates in output |
| `num_ml_workers` | `int` | `2` | ML pipeline workers |
| `ml_worker_batch_size` | `int` | `256` | ML batch size |
| `mongo_uri` | `str` | `mongodb://gator-mongodb:27017/` | MongoDB URI |
| `target_columns` | `ColType \| None` | `None` | Manual column type overrides |
| `column_types` | `dict` | `{}` | Wikidata type constraints per column |
| `candidate_retrieval_only` | `bool` | `False` | Stop after Phase 2 |
| `save_output` | `bool` | `False` | Persist results to MongoDB |
| `save_output_to_csv` | `bool` | `False` | Write results to CSV |

See Configuration Reference for the full parameter list.
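
As a rough illustration of how these keyword arguments might be assembled, the sketch below keeps a few of the documented defaults in a plain dictionary and layers per-run overrides on top. The names `DOCUMENTED_DEFAULTS` and `build_kwargs` are hypothetical helpers, not part of the alligator package. The `Alligator(...)` call is left commented out so the snippet runs without the library or a MongoDB instance.

```python
# Subset of defaults copied from the parameter table above (illustrative only).
DOCUMENTED_DEFAULTS = {
    "worker_batch_size": 64,
    "candidate_retrieval_limit": 20,
    "max_candidates_in_result": 5,
    "candidate_retrieval_only": False,
    "save_output": False,
    "save_output_to_csv": False,
}


def build_kwargs(input_csv, **overrides):
    """Hypothetical helper: merge per-run overrides over the documented defaults."""
    return {"input_csv": input_csv, **DOCUMENTED_DEFAULTS, **overrides}


# Example: a retrieval-only run that writes its results to CSV.
kwargs = build_kwargs(
    "tables/my_table.csv",
    candidate_retrieval_only=True,
    save_output_to_csv=True,
)

# With the package installed and MongoDB reachable, this would become:
# from alligator import Alligator
# gator = Alligator(**kwargs)
# gator.run()
```

Passing the defaults explicitly is not required. The dictionary is only there to make the effective configuration visible in one place.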

Methods

`run() → None`

Runs the complete four-phase pipeline: data onboarding (Phase 1) → worker processing (Phase 2) → ML ranking (Phase 3) → output (Phase 4).

gator.run()

`onboard_data() → None`

Runs only Phase 1: column classification and MongoDB ingestion.

gator.onboard_data()

`save_output() → None`

Manually triggers Phase 4: output assembly and optional CSV write.

gator.save_output()

`close_mongo_connection() → None`

Closes the MongoDB connection pool. Call this when done if managing the lifecycle manually.

gator.close_mongo_connection()
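
When the lifecycle is managed manually, one way to make sure the connection is always closed is a small context manager. The sketch below is an assumption about usage, not something the package is stated to provide. `managed` is a hypothetical helper, and `FakeGator` is a stand-in that only mimics the documented method names so the example runs without MongoDB.

```python
from contextlib import contextmanager


@contextmanager
def managed(gator):
    """Hypothetical helper: yield the object, then always close its connection."""
    try:
        yield gator
    finally:
        gator.close_mongo_connection()


class FakeGator:
    """Stand-in with the documented method names; not the real Alligator class."""

    def __init__(self):
        self.calls = []

    def onboard_data(self):
        self.calls.append("onboard_data")

    def close_mongo_connection(self):
        self.calls.append("close_mongo_connection")


fake = FakeGator()
with managed(fake) as gator:
    gator.onboard_data()

print(fake.calls)  # ['onboard_data', 'close_mongo_connection']
```

`contextlib.closing` is not a drop-in here because it expects a method named `close()`, which is why the sketch wraps `close_mongo_connection()` explicitly.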

Key Types

`ColType`

Used for manual column type assignment via `target_columns`.

from alligator.types import ColType

target_columns: ColType = {
    "NE":      {0: "OTHERS", 2: "LOC"},   # col_idx → NER label
    "LIT":     {1: "NUMBER", 3: "STRING"}, # col_idx → literal type
    "IGNORED": [4, 5],
}

NER Labels: "LOC", "ORG", "PERS", "OTHERS"

Literal Types: "NUMBER", "STRING", "DATE", "BOOLEAN"
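
Because a mistyped label is easy to miss in a hand-written mapping, it can help to check the dictionary before passing it on. The following is a minimal sketch: `check_target_columns` is a hypothetical helper, and the rule that a column index should appear in only one group is an assumption rather than documented behaviour. The label sets are the ones listed above.

```python
NER_LABELS = {"LOC", "ORG", "PERS", "OTHERS"}
LITERAL_TYPES = {"NUMBER", "STRING", "DATE", "BOOLEAN"}


def check_target_columns(target_columns):
    """Hypothetical helper: return a list of problems found (empty if none)."""
    problems = []
    owner = {}  # column index -> first group that claimed it

    def claim(col, group):
        # Assumption: each column index belongs to at most one group.
        if col in owner:
            problems.append(f"column {col} is in both {owner[col]} and {group}")
        else:
            owner[col] = group

    for group, allowed in (("NE", NER_LABELS), ("LIT", LITERAL_TYPES)):
        for col, label in target_columns.get(group, {}).items():
            if label not in allowed:
                problems.append(f"{group} column {col}: unknown label {label!r}")
            claim(col, group)

    for col in target_columns.get("IGNORED", []):
        claim(col, "IGNORED")

    return problems


ok = {
    "NE": {0: "OTHERS", 2: "LOC"},
    "LIT": {1: "NUMBER", 3: "STRING"},
    "IGNORED": [4, 5],
}
bad = {"NE": {0: "CITY"}, "LIT": {0: "NUMBER"}, "IGNORED": []}

print(check_target_columns(ok))   # []
for problem in check_target_columns(bad):
    print(problem)
```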

`Entity`

Represents a named entity extracted from an NE cell.

@dataclass
class Entity:
    value: str          # cell text value
    row_index: int
    col_index: int
    correct_qids: list[str]   # ground-truth QIDs (for evaluation)
    fuzzy: bool               # whether fuzzy retrieval was used
    ner_type: str             # NER label

`Candidate`

Represents a Wikidata entity candidate for a cell.

@dataclass
class Candidate:
    id: str             # Wikidata QID (e.g. "Q12345")
    name: str           # entity label
    description: str    # entity description
    score: float        # normalised ML score [0, 1]
    features: dict      # 27-feature vector
    types: list[str]    # Wikidata type QIDs
    predicates: dict    # predicate map for CPA
    matches: bool       # True for the auto-matched candidate

    def to_dict(self) -> dict: ...

    @classmethod
    def from_dict(cls, data: dict) -> 'Candidate': ...
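
The bodies of `to_dict` and `from_dict` are not shown above. To illustrate the round trip, the sketch below uses a stand-in dataclass, `CandidateSketch`, with the same field names. Its default values, the `asdict`-based serialisation, and the `top_candidates` helper are all assumptions made for the example, not the library's implementation. The QIDs are placeholders.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class CandidateSketch:
    """Stand-in mirroring the documented fields; defaults are assumed."""

    id: str
    name: str
    description: str = ""
    score: float = 0.0
    features: dict = field(default_factory=dict)
    types: list = field(default_factory=list)
    predicates: dict = field(default_factory=dict)
    matches: bool = False

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "CandidateSketch":
        return cls(**data)


def top_candidates(candidates, limit=5):
    """Hypothetical helper: keep the `limit` highest-scoring candidates."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:limit]


raw = [
    {"id": "Q1", "name": "first", "score": 0.2},
    {"id": "Q2", "name": "second", "score": 0.9, "matches": True},
    {"id": "Q3", "name": "third", "score": 0.5},
]
candidates = [CandidateSketch.from_dict(item) for item in raw]

best = top_candidates(candidates, limit=2)
print([c.id for c in best])  # ['Q2', 'Q3']

# Serialising and parsing again yields an equal object.
restored = CandidateSketch.from_dict(best[0].to_dict())
print(restored == best[0])  # True
```

The default `limit=5` mirrors the documented default of `max_candidates_in_result`. Whether the library orders its output by `score` in exactly this way is not confirmed by this page.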

Feature Names

The 27 default features used in ML scoring are accessible via:

from alligator.feature import DEFAULT_FEATURES

print(DEFAULT_FEATURES)
# ['ed_score', 'jaccard', 'jaro_winkler', ..., 'cta_t1', 'cpa_t1', ...]

Feature categories:

String similarity: exact match, Levenshtein, Jaro-Winkler, Jaccard
N-gram overlap: character and token level
Description similarity: desc, descNgram
Type indicators: ntype_LOC, ntype_ORG, ntype_PERS, ntype_OTHERS
Column relationship features (rerank only): cta_t1…t5, cpa_t1…t5, lit_*
Retrieval score: ed_score (score from LAMAPI)
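
To make the split between base and rerank-only features concrete, the sketch below partitions a list of feature names by prefix. The prefix rule (`cta_`, `cpa_`, `lit_`) is inferred from the category list above. `split_features` is a hypothetical helper, and `sample` is a partial, hand-written list of names mentioned on this page, not the actual contents of `DEFAULT_FEATURES`.

```python
# Prefixes of the column relationship features listed as rerank-only above.
RERANK_PREFIXES = ("cta_", "cpa_", "lit_")


def split_features(names):
    """Hypothetical helper: return (base, rerank_only) lists, preserving order."""
    rerank = [n for n in names if n.startswith(RERANK_PREFIXES)]
    base = [n for n in names if not n.startswith(RERANK_PREFIXES)]
    return base, rerank


# Partial sample built from names that appear in this document.
sample = [
    "ed_score", "jaccard", "jaro_winkler",
    "desc", "descNgram", "ntype_LOC",
    "cta_t1", "cpa_t1",
]

base, rerank = split_features(sample)
print(rerank)     # ['cta_t1', 'cpa_t1']
print(len(base))  # 6
```

With the package installed, the same helper could be applied to the imported `DEFAULT_FEATURES` list instead of `sample`.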
