Auxiliary files
Wikidata classes ontology
Required to sort the types from generic to specific:
- Download and filter subclass relationships from a Wikidata dump, e.g.:
sudo apt install bzip2  # or the equivalent for non-Debian-based systems
wget <dump url>
bzcat latest-all.nt.bz2 | awk '$2 == "<http://www.wikidata.org/prop/direct/P279>" {print $0}'| gzip -c > ontology_all.gz
where P279 is the Wikidata property "subclass of".
- Run
cd utilities
python prepare_ontology.py
Once finished, you should have two pickle files:
- ontology_complete.pickle  # dictionary of superclasses: superclasses[wikidata_class]
- depth.pickle  # dictionary of depths (max depth from a top-level Wikidata class): depth[wikidata_class]
Move them to the main folder to proceed.
mv *.pickle ..
cd ..
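As a rough illustration of what the two dictionaries hold, the sketch below builds them from a few toy "subclass of" triples and uses the depths to order types from generic to specific. It is not the code of prepare_ontology.py: the function names, the Q1/Q2/Q3 identifiers, the cycle handling, and the choice to treat a smaller depth as "more generic" are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Extracts item ids (Q...) from Wikidata entity IRIs in an N-Triples line.
ENTITY = re.compile(r"<http://www\.wikidata\.org/entity/(Q\d+)>")


def parse_subclass_lines(lines):
    """Return superclasses[child] = set of direct parents (hypothetical helper)."""
    superclasses = defaultdict(set)
    for line in lines:
        ids = ENTITY.findall(line)
        if len(ids) == 2:  # subject and object are both Wikidata items
            child, parent = ids
            superclasses[child].add(parent)
    return dict(superclasses)


def compute_depth(superclasses):
    """Return depth[c] = longest chain of parent links from c up to a parentless class."""
    depth = {}
    for start in superclasses:
        stack = [(start, False)]
        on_path = set()  # classes whose parents are still being processed
        while stack:
            cls, expanded = stack.pop()
            if expanded:
                on_path.discard(cls)
                parents = superclasses.get(cls, ())
                # A parentless class gets depth 0; parents caught in a cycle count as 0.
                depth[cls] = 1 + max((depth.get(p, 0) for p in parents), default=-1)
                continue
            if cls in depth or cls in on_path:
                continue  # already done, or a cycle back to a class in progress
            on_path.add(cls)
            stack.append((cls, True))
            for parent in superclasses.get(cls, ()):
                stack.append((parent, False))
    return depth


def sort_generic_to_specific(types, depth):
    """Order types by depth, assuming a smaller depth means a more generic class."""
    return sorted(types, key=lambda t: depth.get(t, 0))


# Toy input: Q3 is a subclass of Q2 and Q1, and Q2 is a subclass of Q1.
P279 = "<http://www.wikidata.org/prop/direct/P279>"
toy_lines = [
    f"<http://www.wikidata.org/entity/Q3> {P279} <http://www.wikidata.org/entity/Q2> .",
    f"<http://www.wikidata.org/entity/Q2> {P279} <http://www.wikidata.org/entity/Q1> .",
    f"<http://www.wikidata.org/entity/Q3> {P279} <http://www.wikidata.org/entity/Q1> .",
]
superclasses = parse_subclass_lines(toy_lines)
depth = compute_depth(superclasses)
ordered = sort_generic_to_specific(["Q3", "Q1", "Q2"], depth)
print(ordered)
```

On real data the lines would come from ontology_all.gz (for example via gzip.open in text mode), and the actual pickles may use a different key format than bare Q ids.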
Most common types (generic types)
To decide whether a given type is generic or specific, the most common types across Wikidata are identified.
The following bash command extracts all "instance of" (P31) statements from the dump:
bzcat latest-all.nt.bz2 | awk '$2 == "<http://www.wikidata.org/prop/direct/P31>" {print $0}' | gzip -c > InstanceOf.gz
Then run the following script:
python types_counter.py
This creates a JSON file called most_common.json, which contains an ordered dictionary of Wikidata type frequencies.
The threshold between generic and specific types was set empirically: the 5000 most frequent types are considered generic (currently, types with >= 250 entity instances).
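The counting and thresholding step can be sketched as follows. This is only an illustration under stated assumptions, not the contents of types_counter.py: the helper names, the toy identifiers, and the use of collections.Counter are choices made for the example, and the cutoff is shrunk to 1 so the toy data shows a split.

```python
import re
from collections import Counter

# Extracts item ids (Q...) from Wikidata entity IRIs in an N-Triples line.
ENTITY = re.compile(r"<http://www\.wikidata\.org/entity/(Q\d+)>")


def count_types(lines):
    """Count how many entities are an instance of each type (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        ids = ENTITY.findall(line)
        if len(ids) == 2:  # subject entity, then the type it is an instance of
            counts[ids[1]] += 1
    return counts


def split_generic(counts, n_generic=5000):
    """Treat the n_generic most frequent types as generic and the rest as specific."""
    ranked = [type_id for type_id, _ in counts.most_common()]
    return set(ranked[:n_generic]), set(ranked[n_generic:])


# Toy input: two entities are instances of Q5, one is an instance of Q7.
P31 = "<http://www.wikidata.org/prop/direct/P31>"
toy_lines = [
    f"<http://www.wikidata.org/entity/Q10> {P31} <http://www.wikidata.org/entity/Q5> .",
    f"<http://www.wikidata.org/entity/Q11> {P31} <http://www.wikidata.org/entity/Q5> .",
    f"<http://www.wikidata.org/entity/Q12> {P31} <http://www.wikidata.org/entity/Q7> .",
]
counts = count_types(toy_lines)
generic, specific = split_generic(counts, n_generic=1)  # tiny cutoff for the toy data
print(counts.most_common())
print(generic, specific)
```

With the real InstanceOf.gz file, the default n_generic=5000 would mirror the cutoff described above.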