# Leaderboard Instructions
This guide will walk you through the process of testing a model and contributing to the MammoTab Leaderboard.
Testing is performed on a sample of 870 tables containing a total of 85,565 cells. This sample has been carefully selected to represent the diverse characteristics of the MammoTab dataset and the key Semantic Table Interpretation (STI) challenges.
## Dataset Characteristics
The test sample includes:
- Entity Annotations:
  - 71,500 entities
  - 14,856 NIL mentions
- Type Annotations:
  - 266,703 generic types
  - 1,125,199 specific types
- Additional Features:
  - 3,518 acronyms
  - 12,135 typos
  - 7,117 aliases
- Domain Distribution:
  - 435 single-domain tables
  - 435 multi-domain tables
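For reference, a couple of derived figures (average table size and the NIL share among annotated mentions) can be cross-checked from the numbers above with a few lines of Python:

```python
# Headline statistics of the MammoTab test sample, as listed in this guide.
sample = {
    "tables": 870,
    "cells": 85565,
    "entities": 71500,
    "nil_mentions": 14856,
}

# Average cells per table, and NIL mentions as a share of all annotated mentions.
avg_cells = sample["cells"] / sample["tables"]
nil_share = sample["nil_mentions"] / (sample["entities"] + sample["nil_mentions"])
print(f"{avg_cells:.1f} cells/table, {nil_share:.1%} NIL mentions")
# → 98.4 cells/table, 17.2% NIL mentions
```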
## Prerequisites
Before starting, ensure you have:
- Access to the MammoTab execution repository
- A Hugging Face account with API token
- Docker installed on your system
- Sufficient hardware resources for model execution
## Step 1: Model Selection and Setup

### Select Your Model
- Access the Model Spreadsheet
- Choose a model from the available list in the "Model" sheet
- Update the "Group in charge" column with your team's information
- Set the model's status to "In Progress" when you begin testing
## Step 2: Environment Setup

### Clone the Repository
```shell
git clone git@github.com:unimib-datAI/mammotab_execution.git
cd mammotab_execution
```
### Configure Environment Variables
- Create a `.env` file in the main directory:

```shell
nano .env
```
- Add the following configuration:
```shell
# MongoDB Configuration
MONGO_VERSION="6.0"
MONGO_PORT="27017"
MONGO_INITDB_ROOT_USERNAME="root"
MONGO_INITDB_ROOT_PASSWORD="mammotab_execution"
MONGO_INITDB_DATABASE="mammotab"

# Model Configuration
BATCH_SIZE=4
MODEL_NAME="your-model-name"
TOKENIZER_NAME="your-tokenizer-name"
HF_TOKEN="your-huggingface-token"
```
> **Note:**
> - `MODEL_NAME` and `TOKENIZER_NAME` should be obtained from the Hugging Face model documentation
> - `HF_TOKEN` is your personal Hugging Face API token
> - Adjust `BATCH_SIZE` based on your hardware capabilities
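Before launching a multi-day run, it can be worth confirming that the required variables are actually filled in. A minimal sketch (the actual pipeline reads these values via Docker Compose; `parse_env` here is just an illustrative helper):

```python
# Sanity-check the .env file described above before starting a long run.
from pathlib import Path

REQUIRED = ["MODEL_NAME", "TOKENIZER_NAME", "HF_TOKEN", "BATCH_SIZE"]

def parse_env(path: str) -> dict:
    """Parse simple KEY="value" lines, skipping comments and blank lines."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env

if Path(".env").exists():
    config = parse_env(".env")
    # Flag variables that are empty or still hold the placeholder values.
    missing = [k for k in REQUIRED if not config.get(k) or config[k].startswith("your-")]
    if missing:
        print(f"Fill in these variables before running: {missing}")
```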
## Step 3: Dataset and Execution

### Initialize the Dataset
```shell
./init.sh
```
### Start the Annotation Process
```shell
docker compose up
```
> **Caution:** The annotation process may take several days to complete, depending on the model size and your hardware specifications.
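To plan ahead, a back-of-envelope estimate from early throughput can help. The per-table time below is purely illustrative — measure it on your own hardware after the first few tables:

```python
# Rough runtime estimate for the full sample (illustrative numbers only;
# actual throughput depends on the model and hardware).
tables = 870
seconds_per_table = 60  # hypothetical: one minute per table

hours = tables * seconds_per_table / 3600
print(f"~{hours:.1f} hours at {seconds_per_table}s/table")
# → ~14.5 hours at 60s/table
```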
## Step 4: Results Submission

### Export Results
- Once the process completes, locate the `[model-name].json` file in the main directory
- Send the results file to the MammoTab team for leaderboard updates
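Before sending the file, it is worth verifying that it parses as valid, non-empty JSON. A minimal sketch (the exact schema is defined by the execution pipeline; this only checks well-formedness):

```python
# Check that a results file is well-formed, non-empty JSON before submission.
import json
from pathlib import Path

def check_results(path: str) -> bool:
    p = Path(path)
    if not p.exists():
        print(f"{path} not found")
        return False
    data = json.loads(p.read_text())  # raises JSONDecodeError if malformed
    print(f"{path}: parsed OK ({len(data)} top-level entries)")
    return bool(data)

# Example (hypothetical filename): check_results("your-model-name.json")
```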
## Troubleshooting

### Common Issues
- Docker Issues: Ensure Docker is running and you have sufficient permissions
- Memory Errors: Try reducing the `BATCH_SIZE` in the `.env` file
- API Token Issues: Verify your Hugging Face token is valid and has the necessary permissions
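If memory errors recur, a common pattern is to halve the batch size and retry rather than restarting manually each time. A generic sketch (`run_batch` is a stand-in for the pipeline's annotation step, not part of the repository):

```python
# Retry a batch-processing function with progressively halved batch sizes
# whenever it runs out of memory.
def run_with_fallback(run_batch, batch_size: int, min_size: int = 1):
    while batch_size >= min_size:
        try:
            return run_batch(batch_size)
        except MemoryError:
            batch_size //= 2  # halve and retry
    raise RuntimeError("Out of memory even at the minimum batch size")
```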
### Support
For additional support, please contact the MammoTab team or open an issue in the repository.
## Best Practices
1. Hardware Considerations
   - Use GPU-enabled machines for faster processing
   - Monitor system resources during execution
   - Consider using cloud services for large models
2. Data Management
   - Keep regular backups of your results
   - Document any issues or observations during testing
   - Maintain clear communication with the MammoTab team
3. Performance Optimization
   - Start with a smaller batch size and increase gradually
   - Monitor memory usage and adjust accordingly
   - Consider using model quantization for large models