5. NIL Detection using Wikipedia titles
To detect which mentions are NIL, we collect all the titles/links actually present in Wikipedia; a mention whose title/link is not among them is NIL. For each dump, run:
python mammotab_entity_titles.py [dump]
or parallelize it, for example:
NPROC=4
ls enwiki-20220520*.bz2 | \
xargs -I {} -n 1 -P $NPROC bash -c 'python mammotab_entity_titles.py {}'
This should create a folder named wiki_entities_titles. Then run
python merge_title_dicts.py wiki_entities_titles
to merge everything into a single pickle file (all_titles.pickle).
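As a rough illustration of how the merged titles could be used for the NIL check described above, here is a minimal sketch. It is not the repository's code: the helper names (normalize_title, load_titles, is_nil) are hypothetical, and it assumes all_titles.pickle holds a set of titles or a dict keyed by title, which this section does not specify. The normalization follows the English Wikipedia convention that underscores are equivalent to spaces and the first character is case-insensitive.

```python
import pickle
from pathlib import Path


def normalize_title(title):
    """Normalize a link target: underscores become spaces and the
    first character is upper-cased (English Wikipedia convention)."""
    title = title.replace("_", " ").strip()
    return title[:1].upper() + title[1:]


def load_titles(path):
    """Load the merged pickle.

    Assumption: the file holds a set of titles or a dict keyed by title;
    set() keeps only the keys in the dict case.
    """
    with Path(path).open("rb") as fh:
        return set(pickle.load(fh))


def is_nil(link_target, known_titles):
    """A mention is NIL when its link target is not a known title."""
    return normalize_title(link_target) not in known_titles


# Toy data standing in for the contents of all_titles.pickle.
known = {"Milan", "Alan Turing"}
print(is_nil("alan_Turing", known))        # False
print(is_nil("Some Missing Page", known))  # True
```

With the real file, known would instead come from load_titles("all_titles.pickle"), provided the stored object matches the assumption above.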