
MammoTab

MammoTab is a dataset of 1M Wikipedia tables extracted from over 20M Wikipedia pages and annotated through Wikidata. Datasets of this kind are scarce in the state of the art, which makes MammoTab a good resource for training and testing Semantic Table Interpretation (STI) approaches. The dataset has been designed to cover several key challenges, such as disambiguation, homonymy, and NIL mentions.

  • MammoTab 2024

    V2

    Introducing the enhanced MammoTab 2.0! This latest version features a fully refactored codebase, resulting in a smaller, streamlined set of tables. Thanks to advanced data cleaning techniques, the annotations are of higher quality. In addition, each table comes with comprehensive metadata describing its features, addressing the key challenges of STI.

    Download
    N. of tables: 888.372
    Entities: 40.702.248
    Classes: 4.937.828
    Properties: 24.193
    NIL: 4.121.995
    Total rows: 21.731.092
    Min rows: 4
    Max rows: 24.193
    Total cols: 5.030.655
    Min cols: 1
    Max cols: 1.000

    Evaluation

    Dataset: 511 tables, 9741 mentions.

    Approach        CEA
    Zang 2023       0.86
    Deng 2022       0.31
    Avogadro 2023   0.62

    Find on Zenodo
    Find on GitHub
  • Upgrades

    metadata

    Add table metadata

    column classification

    Column classification (NIL, NE)

    Domain classification

    Domain classification

    table context

    Add table context

    keys

    Add classification for key STI challenges

    export

    Add export material

  • MammoTab 2024

    V2-alpha (SemTab)

    This version was created for SemTab 2024 and is derived from the preliminary version of V2. The annotations within MammoTab 24 are derived from Wikidata v. 20240401 and follow the structure used in the SemTab challenge. Each table is stored in a separate CSV file, where each line in the file corresponds to a row in the table. Target columns for annotation, CTA annotations, and CEA annotations are saved in separate CSV files.
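
The file layout described above can be read with a few lines of Python. This is a minimal sketch, not the project's own loader: it assumes the CEA file uses the column order commonly seen in SemTab releases (table id, row id, column id, entity), that row 0 is the header row, and that the table id `T1` and the inline sample rows are purely illustrative. The helper names `read_table`, `read_cea`, and `annotated_cells` are hypothetical.

```python
import csv
import io


def read_table(fileobj):
    """Read one table CSV; each line of the file is one table row."""
    return list(csv.reader(fileobj))


def read_cea(fileobj):
    """Read CEA annotations.

    Assumed line layout (not confirmed by the page):
    table_id, row_id, col_id, entity
    """
    annotations = {}
    for table_id, row_id, col_id, entity in csv.reader(fileobj):
        annotations[(table_id, int(row_id), int(col_id))] = entity
    return annotations


def annotated_cells(table_id, table, annotations):
    """Pair each annotated cell's text with its entity, in row/column order."""
    cells = []
    for (tid, row, col), entity in sorted(annotations.items()):
        if tid == table_id:
            cells.append((table[row][col], entity))
    return cells


# Illustrative in-memory data standing in for the real CSV files.
table = read_table(io.StringIO("City,Country\nParis,France\n"))
cea = read_cea(io.StringIO(
    "T1,1,0,http://www.wikidata.org/entity/Q90\n"
    "T1,1,1,http://www.wikidata.org/entity/Q142\n"
))
cells = annotated_cells("T1", table, cea)
for mention, entity in cells:
    print(mention, "->", entity)
```

With real data, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="", encoding="utf-8")`; the actual file names and column order should be checked against the downloaded archive.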

    Download

    Evaluation

    Dataset: 511 tables, 9741 mentions.

    Approach        CEA
    Zang 2023       0.86
    Deng 2022       0.31
    Avogadro 2023   0.62

    Find on Zenodo
    Find on GitHub
  • Upgrades

    arcs

    Greater accuracy in annotations

    annotations

    New annotations for CPA (Column Predicate Annotation)

  • MammoTab 2022

    V1

    The annotations within MammoTab 22 are derived from Wikidata v. 20220511 and follow the structure used in the SemTab challenge. Each table is stored in a separate CSV file, where each line in the file corresponds to a row in the table. Target columns for annotation, CTA annotations, and CEA annotations are saved in separate CSV files.

    Download
    N. of tables: 980.254
    Entities: 43.661.125
    Classes: 5.541.283
    Total rows: 23.229.899
    Min rows: 4
    Max rows: 14.436
    Total cols: 5.638.191
    Min cols: 1
    Max cols: 1.0100.012

    Evaluation

    MTab performance

    Dataset          CEA     CTA     CPA
    SemTab 2019 R4   0.983   -       0.832
    SemTab 2020 R4   0.907   0.993   0.997
    SemTab 2020 2T   0.907   0.728   -
    SemTab 2021 R3   0.968   0.984   0.993
    MammoTab 22      0.853   0.659   -
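
The page does not state which metric the scores above report. As a hedged illustration only, the sketch below assumes SemTab-style scoring for CEA: precision is the share of submitted annotations that are correct, recall is the share of target cells annotated correctly, and F1 is their harmonic mean. The function name `cea_scores` and the sample cell keys and entity ids are hypothetical.

```python
def cea_scores(gold, predicted):
    """Return (precision, recall, f1) for cell-entity annotations.

    gold and predicted map a cell key, e.g. (table_id, row, col),
    to an entity identifier. Assumes SemTab-style definitions:
    precision = correct / submitted, recall = correct / targets.
    """
    correct = sum(1 for key, entity in predicted.items()
                  if gold.get(key) == entity)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data: 4 target cells, 3 submitted annotations, 2 correct.
gold = {
    ("T1", 1, 0): "Q90",
    ("T1", 1, 1): "Q142",
    ("T1", 2, 0): "Q64",
    ("T1", 2, 1): "Q183",
}
predicted = {
    ("T1", 1, 0): "Q90",
    ("T1", 1, 1): "Q142",
    ("T1", 2, 0): "Q1",  # wrong entity for this cell
}
precision, recall, f1 = cea_scores(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Leaving a target cell unannotated lowers recall but not precision, which is why the two are reported separately before being combined into F1.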