Skip to main content
Version: 2.0.0

4. Filter the dump

This step removes meaningless columns/rows (e.g. all cells are the empty string). To process a single dump run:

python mammotab_filter.py wiki_tables [enwiki...bz2] wiki_tables_filtered

or run

bash parallel_filter.sh # set NPROC in the file as it fits your needs

to process multiple dumps in parallel. Once finished you should see a folder named wiki_tables_filtered.