Leaderboard Instructions
This guide walks you through testing a model and contributing its results to the MammoTab Leaderboard.
Testing is performed on a sample of 870 tables containing a total of 84,907 cells. The sample was selected to represent the diverse characteristics of the MammoTab dataset and the key challenges of Semantic Table Interpretation (STI).
Dataset Characteristics
The test sample includes:
- Entity Annotations:
  - 71,500 entities
  - 14,856 NIL mentions
- Type Annotations:
  - 266,703 generic types
  - 1,125,199 specific types
- Additional Features:
  - 3,518 acronyms
  - 12,135 typos
  - 7,117 aliases
- Domain Distribution:
  - 435 single-domain tables
  - 435 multi-domain tables
- Table Dimensions:
  - Total Columns: 5,252
  - Total Rows: 37,820
  - Rows per table: min=4, max=253, avg=43.47, median=33.0
  - Columns per table: min=1, max=36, avg=6.04, median=4.0
  - Cells per table: min=4, max=264
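The per-table averages can be reproduced from the totals. The following sketch uses only the figures listed above (870 tables, 37,820 rows, 5,252 columns); the avg helper is a hypothetical name introduced for illustration.

```shell
# Recompute the per-table averages from the totals listed above.
tables=870
total_rows=37820
total_cols=5252

avg() {
  # Print $1 / $2 rounded to two decimals (LC_ALL=C keeps "." as the decimal mark).
  LC_ALL=C awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d }'
}

avg_rows=$(avg "$total_rows" "$tables")
avg_cols=$(avg "$total_cols" "$tables")
echo "rows per table: $avg_rows"      # rows per table: 43.47
echo "columns per table: $avg_cols"   # columns per table: 6.04
```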
Prerequisites
Before starting, ensure you have:
- Access to the MammoTab execution repository
- A Hugging Face account with API token
- Docker installed on your system
- Sufficient hardware resources for model execution
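Before moving on, it can help to confirm that the command-line tools this guide invokes are installed. The sketch below is not part of the MammoTab repository; missing_tools is a hypothetical helper, and it only checks that git and docker are on the PATH, not that the Docker daemon is running.

```shell
# Print the name of every given tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# git and docker are the tools this guide invokes directly.
missing=$(missing_tools git docker)
if [ -n "$missing" ]; then
  echo "Missing tools: $missing" >&2
else
  echo "git and docker are available"
fi
```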
Step 1: Model Selection and Setup
Select Your Model
- Access the Model Spreadsheet
- Choose a model from the available list in the "Model" sheet
- Search for it on Hugging Face, where you can copy the exact model name (e.g. "Qwen/Qwen2.5-1.5B", "Qwen/Qwen2.5-0.5B", "microsoft/phi-2")
- Update the "Group in charge" column with your affiliation information
- Set the model's status to "In Progress" when you begin testing
Step 2: Environment Setup
Clone the Repository
# Clone the repository and navigate to it
git clone git@github.com:unimib-datAI/mammotab_execution.git
cd mammotab_execution
Configure Environment Variables
- Create a .env file in the main directory:
# From the project directory (already the current directory after the clone step), create the .env file
nano .env
- Add the following configuration:
# MongoDB Configuration
MONGO_VERSION="6.0"
MONGO_PORT="27017"
MONGO_INITDB_ROOT_USERNAME="root"
MONGO_INITDB_ROOT_PASSWORD="mammotab_execution"
MONGO_INITDB_DATABASE="mammotab"
# Model Configuration
BATCH_SIZE=4
MODEL_NAME="your-model-name"
TOKENIZER_NAME="your-tokenizer-name"
HF_TOKEN="your-huggingface-token"
note
- MODEL_NAME and TOKENIZER_NAME should be obtained from the Hugging Face model documentation
- HF_TOKEN is your personal Hugging Face API token
- Adjust BATCH_SIZE based on your hardware capabilities
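A missing or empty variable is easy to overlook before a multi-day run. The following sketch lists any key from the configuration above that is absent or empty in a given env file. check_env_file is a hypothetical helper, not a script shipped with the repository, and it checks only for presence, not whether the values are correct.

```shell
# Print every required key that is missing or empty in an env file.
# Usage: check_env_file path/to/.env
check_env_file() {
  env_file="$1"
  for key in MONGO_VERSION MONGO_PORT MONGO_INITDB_ROOT_USERNAME \
             MONGO_INITDB_ROOT_PASSWORD MONGO_INITDB_DATABASE \
             BATCH_SIZE MODEL_NAME TOKENIZER_NAME HF_TOKEN; do
    # A key counts as set when KEY= is followed (after an optional opening
    # quote) by at least one character that is neither a quote nor whitespace.
    if ! grep -Eq "^${key}=\"?[^\"[:space:]]" "$env_file"; then
      echo "$key"
    fi
  done
}
```

Calling check_env_file .env from the project directory prints nothing when every key is set, and one key name per line otherwise.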
Step 3: Dataset and Execution
Initialize the Dataset
# Run the initialization script
./init.sh
Start the Annotation Process
# Start the Docker containers
docker compose up
caution
The annotation process may take several days to complete, depending on the model size and your hardware specifications. See Hardware Considerations for more details.
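Because a run can last for days, it may be convenient to start the containers in the background instead of keeping a terminal attached. This is an optional variation on the command above, not part of the official instructions; it relies only on standard Docker Compose subcommands, and the function names are hypothetical.

```shell
# Start the containers in detached mode so the run is not tied to the terminal.
start_annotation() {
  docker compose up -d
}

# Show the last 100 log lines and keep following new output.
# Ctrl+C stops following; the containers keep running.
follow_annotation_logs() {
  docker compose logs -f --tail 100
}

# List the project's containers and their current state.
annotation_status() {
  docker compose ps
}
```

Run start_annotation once from the project directory, then use follow_annotation_logs or annotation_status whenever you want to check on progress.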
Step 4: Results Submission
Export Results
- Once the process completes, locate the [model-name].json file in the main directory
- Send the results file to the MammoTab team for leaderboard updates
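Before sending the file, a quick sanity check can catch a truncated or empty export. The sketch below is an assumption-laden helper (check_results is a hypothetical name): it takes the path of the results file as an argument, checks that the file exists and is not empty, and validates the JSON syntax only if python3 happens to be installed. It does not check the content against any schema, since the expected structure is not described here.

```shell
# Basic sanity checks on a results file before sending it.
# Usage: check_results path/to/results.json
check_results() {
  results_file="$1"
  # -s is true only for an existing file with a size greater than zero.
  if [ ! -s "$results_file" ]; then
    echo "missing or empty: $results_file" >&2
    return 1
  fi
  # Validate JSON syntax when python3 is available (assumption about the host).
  if command -v python3 >/dev/null 2>&1; then
    if ! python3 -m json.tool "$results_file" >/dev/null 2>&1; then
      echo "not valid JSON: $results_file" >&2
      return 1
    fi
  fi
  echo "ok: $results_file ($(wc -c < "$results_file" | tr -d ' ') bytes)"
}
```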
Troubleshooting
Common Issues
- Docker Issues: Ensure Docker is running and you have sufficient permissions
- Memory Errors: Try reducing the BATCH_SIZE in the .env file
- API Token Issues: Verify your Hugging Face token is valid and has necessary permissions
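For memory errors, the batch size can be lowered without opening an editor. The sketch below rewrites only the BATCH_SIZE line of an env file. set_batch_size is a hypothetical helper, not a repository script, and it assumes the variable is already present on its own line in the file.

```shell
# Rewrite the BATCH_SIZE line of an env file, leaving all other lines untouched.
# Usage: set_batch_size path/to/.env 2
set_batch_size() {
  env_file="$1"
  new_size="$2"
  # Reject anything that is not a plain integer.
  case "$new_size" in
    ''|*[!0-9]*) echo "batch size must be an integer" >&2; return 1 ;;
  esac
  tmp_file=$(mktemp) || return 1
  # Write the edited copy to a temp file, then copy it back so the original
  # file keeps its permissions.
  sed "s/^BATCH_SIZE=.*/BATCH_SIZE=${new_size}/" "$env_file" > "$tmp_file" &&
    cat "$tmp_file" > "$env_file"
  status=$?
  rm -f "$tmp_file"
  return "$status"
}
```

For example, set_batch_size .env 2 halves the default value of 4. Containers that are already running will most likely need to be restarted before the new value takes effect.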
Support
For additional support, please contact the MammoTab team or open an issue in the repository.
Best Practices
- Hardware Considerations
  - Use GPU-enabled machines for faster processing
  - Monitor system resources during execution
  - Consider using cloud services for large models
- Data Management
  - Keep regular backups of your results
  - Document any issues or observations during testing
  - Maintain clear communication with the MammoTab team
- Performance Optimization
  - Start with a smaller batch size and increase gradually
  - Monitor memory usage and adjust accordingly
  - Consider using model quantization for large models
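One way to follow the monitoring advice is to take periodic snapshots of container and GPU usage. The sketch below is an illustration, not a repository script: snapshot_resources is a hypothetical helper that prints a timestamp, a one-off docker stats reading, and, only if nvidia-smi is installed, the GPU utilization and memory figures.

```shell
# Print a timestamped snapshot of container and (if available) GPU usage.
snapshot_resources() {
  date
  # --no-stream prints a single reading instead of a live view.
  docker stats --no-stream
  # GPU figures are reported only on machines where nvidia-smi is installed.
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
  fi
}
```

Appending the output of snapshot_resources to a log file at regular intervals (for example once a minute) gives a record of memory usage that can guide later BATCH_SIZE adjustments.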