Skip to main content

Examples

Basic Usage

from alligator import Alligator

gator = Alligator(
input_csv="tables/imdb_top_100.csv",
num_workers=4,
num_ml_workers=2,
worker_batch_size=64,
candidate_retrieval_limit=16,
mongo_uri="mongodb://localhost:27017/",
)
gator.run()

With Wikidata Type Constraints

Constrain candidate search per NE column to specific Wikidata entity types. This significantly improves precision when you know the expected entity types in each column.

from alligator import Alligator

# Map column index (as string) → list of Wikidata QIDs
column_types = {
"0": ["Q11424"], # film
"3": ["Q483394"], # music genre
"7": ["Q5"], # human
"8": ["Q5"], # human
}

gator = Alligator(
input_csv="tables/imdb_top_100.csv",
column_types=column_types,
num_workers=4,
num_ml_workers=2,
worker_batch_size=64,
candidate_retrieval_limit=16,
mongo_uri="mongodb://localhost:27017/",
)
gator.run()

Manual Column Type Assignment

Override automatic column classification with explicit NE/LIT/IGNORED assignments:

from alligator import Alligator
from alligator.types import ColType

target_columns: ColType = {
"NE": {0: "OTHERS", 2: "LOC"},
"LIT": {1: "NUMBER", 3: "STRING"},
"IGNORED": [4, 5],
}

gator = Alligator(
input_csv="tables/my_table.csv",
target_columns=target_columns,
mongo_uri="mongodb://localhost:27017/",
)
gator.run()

Candidate Retrieval Only

Stop the pipeline after Phase 2 to inspect raw candidates before committing to ML scoring:

gator = Alligator(
input_csv="tables/my_table.csv",
candidate_retrieval_only=True,
mongo_uri="mongodb://localhost:27017/",
)
gator.run()

Process a Subset of Rows

gator = Alligator(
input_csv="tables/my_table.csv",
target_rows=[0, 1, 2, 10, 11],
mongo_uri="mongodb://localhost:27017/",
)
gator.run()

Save Results to CSV

gator = Alligator(
input_csv="tables/my_table.csv",
save_output=True,
save_output_to_csv=True,
mongo_uri="mongodb://localhost:27017/",
)
gator.run()

The output CSV will be written to the same directory as the input file.

CLI Examples

Basic run

python3 -m alligator.cli --gator.input_csv tables/my_table.csv

With column types and increased parallelism

python3 -m alligator.cli \
--gator.input_csv tables/imdb_top_100.csv \
--gator.column_types '{"0": ["Q11424"], "7": ["Q5"]}' \
--gator.num_workers 8 \
--gator.num_ml_workers 4

Using a YAML config file

config.yaml
gator:
input_csv: tables/my_table.csv
num_workers: 8
num_ml_workers: 4
worker_batch_size: 64
candidate_retrieval_limit: 20
save_output: true
save_output_to_csv: true
python3 -m alligator.cli --config config.yaml