Data Cleaning
The MammoTab dataset comprises tables extracted from Wikipedia, enriched with semantic annotations. To ensure the dataset`s quality and usability, various data-cleaning tasks are applied. This document outlines the essential data-cleaning rules performed in this version.
Rules applied to table cells
- CE1 remove the following html tags from the cell:
sup,ref,span,sub,code,small,poem - CE2 convert
brtags into spaces removing the newlines - CE3 remove all tags and join different parts of the text with a space
- CE4 remove
{{formatnum:and}}from string - CE5 remove
† - CE6 remove text between brackets and manage the case
(number) text - CE7 replace common html artifacts
\xa0->,\ ->,\&->&,\–->- - CE8 remove extra parenthesis like
{{}},[[]],(), and" - CE9 remove leading and trailing spaces
- CE10 handle specific page types
file:,help:,wikipedia:wikiproject
Rules applied to table columns
- CO1 remove all columns containing only empty strings
,- - CO2 remove all columns containing only one repeated value
- CO3 remove columns containing only QIDs
- CO4 remove first word when repeated in the column
Rules applied to table rows
- TR1 remove all rows containing only empty strings
,- - TR2 remove all rows containing only one repeated value
- TR3 remove rows where total appears at least twice
- TR4 remove rows with most of the cells empty