Table of Contents
- Download Wikidata Dumps
- Decompress The Archive
- Loading Wikidata in a Database
- Updating Wikidata Dumps
Wikidata is a knowledge base powered by the Wikimedia Foundation, the same non-profit behind Wikipedia. Wikidata is growing at an incredible rate thanks to heavy processing of external datasets such as Google’s Freebase. Although Wikidata can be consumed in the browser, it is a great source for data mining, especially in an era of natural language understanding.
Just like with Wikipedia, the team running Wikidata releases a snapshot of the entire dataset every few days, for free. However, these are huge files that take time to download and decompress, and then a lot of compute power to parse. Remember that if you do not want to go through all of this, you can query Wikidata using its SPARQL endpoint or its REST API.
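If you go the SPARQL route, a request can be sketched roughly as below. This is a minimal sketch, not an official client: the function names and the example query are mine, and the User-Agent string is something you have to supply yourself. Only `build_sparql_url` runs offline; `run_sparql` needs network access.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Example query: five items that are instances of (P31) house cat (Q146).
EXAMPLE_QUERY = "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 5"


def build_sparql_url(query: str) -> str:
    """Return a GET URL asking the query service for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{SPARQL_ENDPOINT}?{params}"


def run_sparql(query: str, user_agent: str) -> list[dict]:
    """Send the query and return the result bindings (needs network access)."""
    request = urllib.request.Request(
        build_sparql_url(query),
        # Identify your tool; anonymous default agents may be throttled.
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]
```

Calling `run_sparql(EXAMPLE_QUERY, "my-tool/0.1 (me@example.com)")` should return a list of bindings, one per matching item, as long as the service is reachable.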
If you want to play around with it, use the API or Query Service, but if you need to use it heavily in your own product, you probably want to load it on your own infrastructure or database. Trust me, this is the easy part.
Every few days, Wikidata will produce a set of very large files publicly downloadable by anybody interested. These files are the dumps of the entire Wikidata dataset including entities, properties, claims, references, qualifiers, labels, descriptions, sitelinks, and so on. Make sure you are aware of this lexicon while working with the dumps; if not, check Wikidata’s data modeling page, and specifically the pages about your favorite format of consumption (JSON, for example).
Wikidata dumps are available in different formats:
- JSON dumps (.json): each line in the document is an entity as a JSON object (30GB)
- RDF dumps (.nt): structured using triples and loved by semantic web fanciers (35GB)
- XML dumps (.xml): a good old XML structure, also with daily incremental dumps (20GB)
All dumps contain the entirety of Wikidata’s brain, with the JSON dumps being the most used and downloaded because they are easy to process in most programming languages. JSON is also less verbose.
Downloading these files seems to be capped at 2MB/sec for me, which is not helpful, but it makes sense in order to save some valuable resources. We are not paying a dime for this incredibly well-structured data, so let’s not complain too much.
The entire Wikidata dump, once decompressed, came to about 500 gigabytes for the JSON version. The dumps are compressed using bzip2 or gzip. Do not worry too much, these programs come preinstalled on most modern computers. Double-click to decompress the file, or go to your terminal and use the appropriate command:
- gunzip latest-all.json.gz for the gzipped archive
- bzip2 -d latest-all.json.bz2 for the bzipped archive
The unzipping can easily take around an hour depending on your hardware. Easily double or triple that time if you are on an old laptop. While decompressing it, no progress status is shown on the terminal so let it run and come back in an hour or two to see if it is decompressed, or not.
Bzip2 takes longer to decompress but the file is lighter to download; gzip is a bigger file but takes less time to be unzipped. I tend to go with the gzip file latest-all.json.gz to speed things up a tad.
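Since the command-line tools stay silent while they work, here is a minimal sketch of decompressing the gzip archive from a script that reports progress as it goes. It is not what I ran at the time; the function name and the `report` callback are made up for the illustration. Progress is measured against the compressed file, so it is an approximation.

```python
import gzip
import os


def gunzip_with_progress(src, dest, chunk_size=1 << 20, report=print):
    """Decompress src (.gz) into dest, reporting each new whole percent."""
    total = os.path.getsize(src)
    last_percent = -1
    with open(src, "rb") as raw, \
            gzip.GzipFile(fileobj=raw) as unzipped, \
            open(dest, "wb") as out:
        while True:
            chunk = unzipped.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            # raw.tell() is the read position inside the compressed archive,
            # so this is the share of the archive consumed so far.
            percent = int(100 * raw.tell() / total)
            if percent != last_percent:
                report(percent)
                last_percent = percent
```

Something like `gunzip_with_progress("latest-all.json.gz", "latest-all.json")` would then print a number from 0 to 100 every time another percent of the archive has been read.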
So far, I have only worked with the JSON dumps from Wikidata but what follows goes for all. The input formats matter very little so pick the one you are most comfortable working with because you have a lot of work ahead. Basically, you cannot realistically have your application consume these dumps in production as-is so you must do some preprocessing, processing, and eventually loading into your own database of choice.
First of all, what exactly do you need from Wikidata’s dumps? Some are only interested in specific types (people) or specific chunks (sports-related items) of data or relationships (instances of, subclasses of). This is where things get complicated and compute power is welcome.
Let me be honest… I’ve worked with these dumps for a couple of years, and it is a challenge because a mistake in this processing may only be discovered weeks down the line and then requires you to redo everything. Mind you, run on my MacBook Pro with 8GB of RAM, processing the entire JSON dump took four entire days, and that is after I optimized my Node.js tool to reach that sort of timeframe. Running it in the cloud on a 64GB multicore machine would obviously be a lot faster.
Processing Wikidata dumps is just like any ETL job:
- Extract — we’ve got the raw data by now, just ingest it in your processing tool
- Transform — discard entities you are not interested in, and transform the entities you need into the format required for the loading phase
- Load — import your new clean data into your database or system of choice
Personally, I was building a comprehensive knowledge graph for an English audience (yes you, the customers of topicseed). We do content idea generation so we need as many topics as possible, with as many relationships as possible.
Therefore, we at first only discarded entities that:
- have no English label,
- have no English description, or
- have no P31 (instance of) or P279 (subclass of) statements.
Makes sense, right? We only want entities (topics, for us) that are instances of or subclasses of other topics, and they also had to have an English label and description. If we open up to new languages, it will be easy to import only those, but let’s not over-optimize too early.
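The three discard rules above can be sketched as a single predicate over one parsed entity from the JSON dump. This is an illustration rather than my actual Node.js code; the function names are made up, and the field layout (`labels`, `descriptions`, `claims`) follows the Wikidata JSON format.

```python
def has_english(entity, field):
    """True if the entity has a non-empty English value for the field."""
    # "or {}" also covers entities where an empty field is serialized as [].
    values = entity.get(field) or {}
    return bool(values.get("en", {}).get("value"))


def keep_entity(entity):
    """Keep entities with an English label, an English description,
    and at least one P31 (instance of) or P279 (subclass of) statement."""
    claims = entity.get("claims") or {}
    return (
        has_english(entity, "labels")
        and has_english(entity, "descriptions")
        and ("P31" in claims or "P279" in claims)
    )
```

Keeping the rules in one small function makes it cheap to add a new condition later, which matters when every full run takes days.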
Once I finished this test run, I ran some checks on the database where I had imported these entities. One entity had hundreds of thousands of other entities attached to it as instances. Worse, these instances were not actual topics. Welcome to the world of trial and error. It turns out Wikidata also stores a huge list of scientific and scholarly articles (entities whose P31 includes Q13442814). So I went into my app and added some other conditions to be fulfilled for an entity to be considered, and four days later I had my output file.
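A check like the one that caught the scholarly articles can be sketched as follows: count which entities are most often the target of a P31 statement, then exclude the offenders. The helper names are hypothetical, and the nested `mainsnak` / `datavalue` layout is the one used by the Wikidata JSON format.

```python
from collections import Counter

SCHOLARLY_ARTICLE = "Q13442814"


def instance_of_ids(entity):
    """Return the set of item IDs this entity is an instance of (P31)."""
    ids = set()
    for statement in (entity.get("claims") or {}).get("P31", []):
        snak = statement.get("mainsnak", {})
        # "somevalue" and "novalue" snaks carry no datavalue, so skip them.
        if snak.get("snaktype") == "value":
            ids.add(snak["datavalue"]["value"]["id"])
    return ids


def is_scholarly_article(entity):
    """True if the entity is an instance of scholarly article."""
    return SCHOLARLY_ARTICLE in instance_of_ids(entity)


def count_instance_targets(entities):
    """Count how many entities point at each P31 target."""
    counts = Counter()
    for entity in entities:
        counts.update(instance_of_ids(entity))
    return counts
```

Running `count_instance_targets(...)` over a sample of the dump and looking at `.most_common(20)` is a cheap way to spot this kind of surprise before committing to a multi-day run.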
Some people like intermediary phases with several tools doing very specific jobs during that processing phase:
- A discarder to remove unwanted entities
- A transformer to morph the JSON entities and relationships into SQL or any other format needed for the loading phase
My tool did it all on the fly with an array of statements that was flushed every few minutes into a file for persistence in case of error. Memory is faster than I/O so the goal was to touch the disk as little as possible. The discarder was included in my tool and hundreds of thousands of entities were discarded.
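The buffer-then-flush idea can be sketched like this. It is not my Node.js implementation, just a small illustration; the class name, the five-minute default, and the injectable `clock` (there only to make the behaviour easy to test) are all assumptions.

```python
import time


class BufferedStatementWriter:
    """Collect statements in memory and append them to a file periodically."""

    def __init__(self, path, flush_every_seconds=300.0, clock=time.monotonic):
        self.path = path
        self.flush_every_seconds = flush_every_seconds
        self.clock = clock
        self.buffer = []
        self.last_flush = clock()

    def add(self, statement):
        """Buffer one statement; flush if the interval has elapsed."""
        self.buffer.append(statement)
        if self.clock() - self.last_flush >= self.flush_every_seconds:
            self.flush()

    def flush(self):
        """Append buffered statements to the file and empty the buffer."""
        if self.buffer:
            with open(self.path, "a", encoding="utf-8") as out:
                out.write("\n".join(self.buffer) + "\n")
            self.buffer.clear()
        self.last_flush = self.clock()
```

Remember to call `flush()` one last time when the run ends, otherwise the statements buffered since the last interval never reach the disk.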
With such large files, you cannot afford memory leaks in your application, so launch it while coding just to make sure the memory and CPU numbers stay on track. My first version had my fan going insane. I rewrote parts of the tool and it is now totally sane. Remember that the transform takes hours or even days, so you still want to be able to use your laptop during that time.
Obviously, it is impossible to have your application load the entire dump file in memory; it is half a terabyte after all! Therefore, stream the file through your tool piece by piece to avoid bottlenecks. On Node.js, I personally used event-stream packages. If you need to perform some light processing, you may want to do the transformation in two stages to speed up the clean-up phase.
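For readers not on Node.js, the same streaming idea can be sketched in Python. The JSON dump is one big array with one entity per line, so each line can be parsed on its own once the surrounding brackets and the trailing comma are dealt with. The function name is made up, and reading the .gz directly (instead of decompressing first) is an optional shortcut of this sketch, not what I did.

```python
import gzip
import json


def iter_entities(path):
    """Yield one parsed entity at a time from a Wikidata JSON dump."""
    # Read the gzipped archive directly if given one, else the plain file.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            # Each entity line ends with a comma, except the last one.
            line = line.strip().rstrip(",")
            # Skip the opening and closing brackets of the outer array.
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)
```

Because this is a generator, only one entity is held in memory at a time, whatever the size of the dump.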
At topicseed, I decided early on to work with a graph database called Dgraph, developed by a lovely team in San Francisco (started in India). If you are interested in this amazing NoSQL approach, check them out; it’s totally free and open source. Their docs are great and they move fast after listening to customer feedback.
Dgraph requires a gzipped RDF file for its bulk loading feature. So our tool took as input the huge JSON dump, and the output it produces is a very large RDF file full of triples. I then ran a tool from Dgraph to preprocess the data so it could simply be uploaded to my Google Cloud disks where the database servers run. Basically, the data was processed twice, once by me for Dgraph, and then by Dgraph for its workers, so I just have to upload the data and it’s all up and running.
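To give an idea of what the JSON-to-RDF step looks like, here is a rough sketch that turns one entity into a few triples and writes them to a gzipped file. The predicate names (`label`, `description`) and the use of blank nodes such as `_:Q42` as subjects are assumptions of this sketch, not the topicseed schema, so check the Dgraph docs for the exact format your version expects.

```python
import gzip


def escape_literal(text):
    """Escape a string so it can sit inside a double-quoted RDF literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def entity_to_triples(entity):
    """Return RDF lines for the English label, description, P31 and P279."""
    subject = f"_:{entity['id']}"
    triples = []
    for field, predicate in (("labels", "label"), ("descriptions", "description")):
        text = (entity.get(field) or {}).get("en", {}).get("value")
        if text:
            triples.append(f'{subject} <{predicate}> "{escape_literal(text)}"@en .')
    for prop in ("P31", "P279"):
        for statement in (entity.get("claims") or {}).get(prop, []):
            snak = statement.get("mainsnak", {})
            if snak.get("snaktype") == "value":
                target = snak["datavalue"]["value"]["id"]
                triples.append(f"{subject} <{prop}> _:{target} .")
    return triples


def write_rdf_gz(entities, path):
    """Write the triples of all entities to a gzipped RDF file."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for entity in entities:
            for triple in entity_to_triples(entity):
                out.write(triple + "\n")
```

Writing straight into a gzip stream avoids ever having the uncompressed RDF file on disk, which helps when the output is itself very large.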
If you want to import data to MySQL, MongoDB, Neo4J, or DynamoDB, you will have to build your own processing pipeline so the data can be consumed by an actual app. Obviously, do not leave thinking about the loading until after you have processed the data: you will need to make engineering choices at the transform phase in order to facilitate the data loading.
Loading the data took A LOT OF hours (perhaps because I was greedy with my instances). But depending on how aggressive your discard function is during the transform step, you will load a potentially much smaller dataset.
Unfortunately, this is the hard part and will need a new article once I have figured it out. There are no incremental JSON dumps, so it is hard to keep track of changes. There is the RecentChanges API, which allows the public to query for recent edits to each entity over the last thirty days. Currently, this is your best bet, but it is indeed not great since there are hundreds of edits every minute.
My method, as of now and until I find a better way, is to trigger a Google Cloud Function every minute. The function grabs all recent changes, creations and deletions. Then, it queries the API again to get all entities updated, created or deleted, in order to check that they would not be discarded (by the same conditions listed above).
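The two API round trips can be sketched as below. This only builds the request URLs and extracts entity IDs from a RecentChanges response; the HTTP calls, the scheduling, and the function names are left out or made up. The batch size of 50 reflects the usual `wbgetentities` limit for regular users, which is worth double-checking against the API docs.

```python
import urllib.parse

API = "https://www.wikidata.org/w/api.php"


def recent_changes_url(limit=500):
    """URL listing recent edits, creations and log events on items."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcnamespace": "0",  # items live in the main namespace
        "rctype": "edit|new|log",
        "rcprop": "title|timestamp|comment|loginfo",
        "rclimit": str(limit),
        "format": "json",
    }
    return f"{API}?{urllib.parse.urlencode(params)}"


def changed_entity_ids(response):
    """Return the unique item IDs (Q...) in a response, in order seen."""
    seen = {}
    for change in response.get("query", {}).get("recentchanges", []):
        title = change.get("title", "")
        if title.startswith("Q") and title[1:].isdigit():
            seen.setdefault(title, None)
    return list(seen)


def entity_urls(ids, batch_size=50):
    """URLs fetching the current state of the entities, batch by batch."""
    urls = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        params = {"action": "wbgetentities", "ids": "|".join(batch), "format": "json"}
        urls.append(f"{API}?{urllib.parse.urlencode(params)}")
    return urls
```

The entities fetched in the second step can then go through the same keep-or-discard conditions as the dump entities before anything is stored.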
Once we have a list of updates affecting entities relevant to our domain-specific application, then I store these in a database as such (with some extra metadata I will spare you):
- Subject — the entity that is changed
- Predicate — the field or property that is changed (English label, property P31, etc)
- Value — the new value as displayed in the RecentChanges API’s comment
- Timestamp — when it was updated on Wikidata
There is a unique constraint on the Subject-Predicate pair and an SQL INSERT ... ON DUPLICATE KEY UPDATE, so I only store the last state of a Subject-Predicate pair, with the timestamp also updated. That way, I can query for entities (subjects) that have not been changed for a week and then merge these changes into the topicseed knowledge graph.
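Here is a small sketch of that table and the two queries. To keep it runnable without a server it uses SQLite, whose `ON CONFLICT ... DO UPDATE` clause plays the role of MySQL’s `INSERT ... ON DUPLICATE KEY UPDATE`. The table name, column names, and helper functions are made up for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS pending_changes (
    subject    TEXT NOT NULL,
    predicate  TEXT NOT NULL,
    value      TEXT,
    changed_at INTEGER NOT NULL,  -- Unix timestamp of the edit on Wikidata
    UNIQUE (subject, predicate)
)
"""

# Insert a change, or overwrite the stored state for the same pair.
UPSERT = """
INSERT INTO pending_changes (subject, predicate, value, changed_at)
VALUES (?, ?, ?, ?)
ON CONFLICT (subject, predicate)
DO UPDATE SET value = excluded.value, changed_at = excluded.changed_at
"""

WEEK_SECONDS = 7 * 24 * 3600


def record_change(db, subject, predicate, value, changed_at):
    """Store only the latest state of a subject-predicate pair."""
    db.execute(UPSERT, (subject, predicate, value, changed_at))


def settled_subjects(db, now, quiet_seconds=WEEK_SECONDS):
    """Subjects whose most recent change is at least quiet_seconds old."""
    rows = db.execute(
        "SELECT subject FROM pending_changes "
        "GROUP BY subject HAVING MAX(changed_at) <= ? ORDER BY subject",
        (now - quiet_seconds,),
    )
    return [row[0] for row in rows]
```

With `db = sqlite3.connect(":memory:")` and `db.execute(SCHEMA)`, recording two changes for the same pair leaves a single row holding the newer value, and `settled_subjects` only returns a subject once all of its pending changes are at least a week old.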
Why do I not update my official version of the graph database immediately? Simply because Wikidata is a collaborative tool like Wikipedia, and therefore I only want to update a topic once the storm has passed. If a defacer comes along and updates an entity (Donald Trump, Barack Obama, etc), by waiting a week I can be fairly sure that a moderator or other member will have corrected the mistake. This way, I also save myself database updates (which would add up to thousands per day at the rate edits happen on Wikidata).
This syncing pipeline is not ideal, but it works for now. I shall let you know if I find a better way to keep my imported Wikidata data updated.