MammoTab
MammoTab is a dataset composed of 1M Wikipedia tables extracted from over 20M Wikipedia pages and annotated through Wikidata. The lack of datasets of this kind in the state of the art makes MammoTab a good resource for testing and training Semantic Table Interpretation (STI) approaches. The dataset has been designed to cover several key challenges, such as disambiguation, homonymy, and NIL-mentions.
Mammotab 2024
V2
Introducing the enhanced MammoTab 2.0! This latest version features a fully refactored codebase, resulting in a smaller, streamlined set of tables. Thanks to improved data cleaning, the annotations are of higher quality than in the previous version. Additionally, each table comes with metadata that describes its features with respect to the key challenges of STI.
Download: Coming soon...
N. of tables: 888.372
Entities: 40.702.248
Classes: 4.937.828
Properties: 24.193
NIL: 4.121.995
Total rows: 21.731.092
Min rows: 4
Max rows: 24.193
Total cols: 5.030.655
Min cols: 1
Max cols: 1.000
Evaluation
Dataset: 511 tables, 9741 mentions.
APPROACHES       CEA
Zang 2023        0.86
Deng 2022        0.31
Avogadro 2023    0.62
Upgrades
Add table metadata
Column classification (NIL, NE)
Domain classification
Add table context
Add classification for key STI challenges
Add export material
Mammotab 2024
V2-alpha (SemTab)
This version was created for SemTab 2024 and is derived from the preliminary version of V2. The annotations within MammoTab 24 are derived from Wikidata v. 20240401 and follow the structure used in the SemTab challenge. Each table is stored in a separate CSV file, where each line in the file corresponds to a row in the table. The target columns for annotation and the CTA and CEA annotations are saved in separate CSV files.
Download: Coming soon after SemTab 2024
Evaluation
Dataset: 511 tables, 9741 mentions.
APPROACHES       CEA
Zang 2023        0.86
Deng 2022        0.31
Avogadro 2023    0.62
Upgrades
Greater accuracy in annotations
New annotations for CPA (Column Predicate Annotation)
Mammotab 2022
V1
The annotations within MammoTab 22 are derived from Wikidata v. 20220511 and follow the structure used in the SemTab challenge. Each table is stored in a separate CSV file, where each line in the file corresponds to a row in the table. The target columns for annotation and the CTA and CEA annotations are saved in separate CSV files.
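The file layout described above can be read with nothing more than a CSV parser. The following is a minimal sketch, not the official loader: the table name `TABLE01`, the sample cells, and the target-file column order (table id, row index, column index, no header, row 0 being the header row) are assumptions made for illustration, so check the downloaded files for the actual layout.

```python
import csv
import io

# Hypothetical contents of one table file; each line is one table row.
TABLE_CSV = """Film,Director,Year
Alien,Ridley Scott,1979
Heat,Michael Mann,1995
"""

# Hypothetical CEA target file: table id, row index, column index (assumed order).
CEA_TARGETS_CSV = """TABLE01,1,0
TABLE01,1,1
TABLE01,2,1
"""


def read_rows(text):
    """Parse CSV text into a list of rows, each row a list of cell strings."""
    return list(csv.reader(io.StringIO(text)))


def target_mentions(tables, targets_text):
    """Yield (table_id, row, col, cell_text) for every targeted cell."""
    for table_id, row, col in csv.reader(io.StringIO(targets_text)):
        row, col = int(row), int(col)
        yield table_id, row, col, tables[table_id][row][col]


# In practice each table would be read from its own file on disk.
tables = {"TABLE01": read_rows(TABLE_CSV)}
mentions = list(target_mentions(tables, CEA_TARGETS_CSV))
for mention in mentions:
    print(mention)
```

Keeping the targets separate from the tables means a system only needs to annotate the listed cells, which is why the sketch resolves each target to its cell text before any entity linking takes place.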
Download
N. of tables: 980.254
Entities: 43.661.125
Classes: 5.541.283
Total rows: 23.229.899
Min rows: 4
Max rows: 14.436
Total cols: 5.638.191
Min cols: 1
Max cols: 1.0100.012
Evaluation
DATASET          CEA      CTA      CPA
Semtab2019 R4    0.983    -        0.832
Semtab2020 R4    0.907    0.993    0.997
Semtab2020 2T    0.907    0.728    -
Semtab2021 R3    0.968    0.984    0.993
MammoTab 22      0.853    0.659    -
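The page does not say which metric the scores above report. As an illustration only, the sketch below computes CEA precision, recall, and F1 in the way the SemTab challenge defines them (correct annotations over submitted annotations, and over target annotations). The function name `cea_scores`, the dictionary layout, and the identifiers `Q1` to `Q5` are placeholders chosen for this example, not part of MammoTab.

```python
def cea_scores(gold, predicted):
    """Return (precision, recall, f1) for cell-entity annotations.

    gold and predicted map (table_id, row, col) to an entity identifier.
    Predictions for cells that are not targets are ignored.
    """
    submitted = {cell: qid for cell, qid in predicted.items() if cell in gold}
    correct = sum(1 for cell, qid in submitted.items() if gold[cell] == qid)
    precision = correct / len(submitted) if submitted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder ground truth for four target cells.
gold = {
    ("TABLE01", 1, 0): "Q1",
    ("TABLE01", 1, 1): "Q2",
    ("TABLE01", 2, 0): "Q3",
    ("TABLE01", 2, 1): "Q4",
}

# A system that answers three cells and gets two of them right.
predicted = {
    ("TABLE01", 1, 0): "Q1",
    ("TABLE01", 1, 1): "Q2",
    ("TABLE01", 2, 0): "Q5",
}

precision, recall, f1 = cea_scores(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Leaving a cell unanswered lowers recall but not precision, so a system that abstains on uncertain mentions is scored differently from one that guesses; this sketch does not model any special treatment of NIL mentions.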