Version: 2.0.0

Data Cleaning

The MammoTab dataset comprises tables extracted from Wikipedia, enriched with semantic annotations. To ensure the dataset's quality and usability, several data-cleaning rules are applied to cells, columns, and rows. This document lists the rules applied in this version.

Rules applied to table cells

  • CE1 remove the following HTML tags from the cell: sup, ref, span, sub, code, small, poem
  • CE2 convert br tags into spaces, removing the line breaks
  • CE3 remove all remaining tags and join the separate parts of the text with a space
  • CE4 remove {{formatnum: and }} from the string
  • CE5 remove
  • CE6 remove text between brackets, with special handling for cells of the form (number) text
  • CE7 replace common HTML artifacts: \xa0 -> space, &nbsp; -> space, &amp; -> &, &ndash; -> -
  • CE8 remove leftover brackets and quotes such as {{}}, [[]], (), and "
  • CE9 remove leading and trailing spaces
  • CE10 handle specific page types: file:, help:, wikipedia:wikiproject
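
As an illustration, a few of the cell rules can be sketched in Python. This is a minimal sketch, not the MammoTab implementation: the function name clean_cell is hypothetical, only CE2, CE3, CE4, CE7 and CE9 are covered, and the standard-library html.unescape call is used here as a stand-in for the explicit replacement list in CE7.

```python
import html
import re


def clean_cell(raw: str) -> str:
    """Hypothetical helper approximating rules CE2, CE3, CE4, CE7 and CE9."""
    text = raw
    # CE4: drop the {{formatnum: ... }} wrapper but keep what it wraps.
    text = re.sub(r"\{\{formatnum:(.*?)\}\}", r"\1", text, flags=re.IGNORECASE)
    # CE2: turn <br>, <br/> and <br /> into a single space.
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    # CE3: replace every remaining tag with a space so adjacent parts stay apart.
    text = re.sub(r"<[^>]+>", " ", text)
    # CE7: decode HTML entities, then normalise non-breaking spaces and en dashes.
    text = html.unescape(text)
    text = text.replace("\xa0", " ").replace("\u2013", "-")
    # CE9: strip both ends. Collapsing inner whitespace runs is an extra
    # assumption of this sketch, not something the rule list states.
    return re.sub(r"\s+", " ", text).strip()
```

For example, clean_cell("  Rome<br>Italy ") yields "Rome Italy" and clean_cell("{{formatnum:1234}}") yields "1234". The order of the steps matters in this sketch: tags are removed before entities are decoded, so an escaped angle bracket in the text is not mistaken for markup.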

Rules applied to table columns

  • CO1 remove all columns whose cells are all empty strings or -
  • CO2 remove all columns containing only one repeated value
  • CO3 remove columns containing only QIDs
  • CO4 remove first word when repeated in the column
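
The column rules CO1 to CO3 could be sketched as follows. This is a sketch under stated assumptions rather than the dataset's actual code: drop_columns is a hypothetical name, the table is assumed to be a rectangular list of body rows (no header row), and a QID is assumed to be the letter Q followed by digits. CO4 is left out because its exact behaviour is not specified here.

```python
import re

EMPTY_VALUES = {"", "-"}            # CO1: values treated as empty
QID_PATTERN = re.compile(r"Q\d+")   # CO3: assumed QID shape, e.g. Q42


def drop_columns(rows):
    """Hypothetical helper: return the rows without columns matched by CO1-CO3."""
    if not rows:
        return []
    keep = []
    for idx in range(len(rows[0])):
        column = [row[idx].strip() for row in rows]
        # CO1: every cell is an empty string or a lone dash.
        if all(value in EMPTY_VALUES for value in column):
            continue
        # CO2: one value repeated in every row (only meaningful with 2+ rows).
        if len(column) > 1 and len(set(column)) == 1:
            continue
        # CO3: every cell is a bare QID.
        if all(QID_PATTERN.fullmatch(value) for value in column):
            continue
        keep.append(idx)
    return [[row[i] for i in keep] for row in rows]
```

Given the rows ["Rome", "-", "Q220", "IT", "1"] and ["Milan", "", "Q490", "IT", "2"], the sketch keeps only the first and last columns.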

Rules applied to table rows

  • TR1 remove all rows whose cells are all empty strings or -
  • TR2 remove all rows containing only one repeated value
  • TR3 remove rows in which the word total appears at least twice
  • TR4 remove rows in which most of the cells are empty
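
The row rules can be read as a single keep-or-drop predicate. The sketch below is illustrative only: keep_row and max_empty_ratio are hypothetical names, "most of the cells" is assumed to mean more than half, and TR3 is assumed to count cells that contain the word total, ignoring case.

```python
EMPTY_VALUES = {"", "-"}  # values treated as empty, as in TR1


def keep_row(row, max_empty_ratio=0.5):
    """Hypothetical predicate: False when the row matches TR1, TR2, TR3 or TR4."""
    cells = [cell.strip() for cell in row]
    # TR1: every cell is empty (this also covers a row with no cells at all).
    if all(cell in EMPTY_VALUES for cell in cells):
        return False
    # TR2: a single value repeated across the whole row.
    if len(cells) > 1 and len(set(cells)) == 1:
        return False
    # TR3: "total" occurs in at least two cells (case-insensitive assumption).
    if sum("total" in cell.lower() for cell in cells) >= 2:
        return False
    # TR4: more than max_empty_ratio of the cells are empty (assumed threshold).
    empty = sum(cell in EMPTY_VALUES for cell in cells)
    if empty / len(cells) > max_empty_ratio:
        return False
    return True
```

A table would then be filtered with a list comprehension such as [row for row in rows if keep_row(row)]. The threshold is exposed as a parameter because the rule list does not say how many empty cells count as "most".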