
I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process.

It's not clear which files you need, and the dump itself is (or at least was, when I tried) "shipped" as gigantic SQL scripts that rebuild the database, with so many lines that the SQL servers I tried gave up reading them, requiring another script to split them into chunks.

Then when you finally do have the database, you still don't have a local copy of Wikipedia. You're missing several more files; category information, for example, is in a separate dump. You also need the wiki software itself to use the dump and host the site. After a weekend of fucking around with SQL, this is the point where I gave up and just curled the 200 or so pages I was interested in.

I'm pretty sure they want you to "just" download the database dump and go to town, but it's such a pain in the ass that I can see why someone else would just crawl it.



> I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process.

More recently they started putting the data up on Kaggle in a format that's supposed to be easier to ingest.

https://enterprise.wikimedia.com/blog/kaggle-dataset/


"More recently" is very recently; there hasn't been enough time yet for data collectors to evaluate changing their processes.


Good timing to learn about this, given that it's Friday. Thanks! I'll check it out


I think there are engineers working for crawler companies who are paid well enough to figure out how to do this without kneecapping the most well-known noncommercial projects still surviving on the capitalized internet.


Have you tried any of the ZIM file exports?

https://dumps.wikimedia.org/kiwix/zim/wikipedia/


Yeah, it's a bit confusing at first to navigate. Luckily, they offer XML dumps that aren't too bad to work with:

1. Go to https://dumps.wikimedia.org/enwiki/latest/ (or a date of your choice in /enwiki)

2. Download https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page... and https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.... The first file is a bz2-multistream-compressed dump of an XML file containing all of English Wikipedia's text, while the second file is an index that makes it easier to find specific articles.

3. You can either:

  a. unpack the first file
  b. use the second file to locate specific articles within the first file; it maps page title -> file offset for the relevant bz2 stream (see the sketch after this list)
  c. use a streaming decoder to process the entire Wiki without ever decompressing it wholly
4. Once you have the XML, getting at the actual text isn't too difficult; you should use a streaming XML decoder to avoid as much allocation as possible when processing this much data.
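
Option (b), for example, only needs the index and a couple of seeks. Here's a minimal Python sketch, assuming you've decompressed the index to plain text first; the file names are just placeholders for whatever you downloaded, and each index line is `offset:page_id:title`:

    import bz2

    DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"    # placeholder names
    INDEX = "enwiki-latest-pages-articles-multistream-index.txt"

    def find_offset(title):
        # Each index line is "offset:page_id:title"; titles can contain ':',
        # so split at most twice.
        with open(INDEX, encoding="utf-8") as f:
            for line in f:
                offset, _page_id, page_title = line.rstrip("\n").split(":", 2)
                if page_title == title:
                    return int(offset)
        raise KeyError(title)

    def read_stream(offset):
        # Seek to the bz2 stream starting at `offset` and decompress just that
        # stream; the decompressor stops on its own at the end of the stream.
        with open(DUMP, "rb") as f:
            f.seek(offset)
            decomp = bz2.BZ2Decompressor()
            chunks = []
            while not decomp.eof:
                data = f.read(64 * 1024)
                if not data:
                    break
                chunks.append(decomp.decompress(data))
            return b"".join(chunks).decode("utf-8")

    # The result is an XML fragment containing a handful of <page> elements;
    # scan it for the <page> whose <title> matches.
    fragment = read_stream(find_offset("AccessibleComputing"))

Each stream holds a small batch of pages rather than one, so you still have to pick your article out of the decompressed fragment, but you never touch the rest of the dump.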

The XML contains pages like this:

    <page>
      <title>AccessibleComputing</title>
      <ns>0</ns>
      <id>10</id>
      <redirect title="Computer accessibility" />
      <revision>
        <id>1219062925</id>
        <parentid>1219062840</parentid>
        <timestamp>2024-04-15T14:38:04Z</timestamp>
        <contributor>
          <username>Asparagusus</username>
          <id>43603280</id>
        </contributor>
        <comment>Restored revision 1002250816 by [[Special:Contributions/Elli|Elli]] ([[User talk:Elli|talk]]): Unexplained redirect breaking</comment>
        <origin>1219062925</origin>
        <model>wikitext</model>
        <format>text/x-wiki</format>
        <text bytes="111" sha1="kmysdltgexdwkv2xsml3j44jb56dxvn" xml:space="preserve">#REDIRECT [[Computer accessibility]]

    {{rcat shell|
    {{R from move}}
    {{R from CamelCase}}
    {{R unprintworthy}}
    }}</text>
        <sha1>kmysdltgexdwkv2xsml3j44jb56dxvn</sha1>
      </revision>
    </page>
so all you need to do is get at the `text`.
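
If you'd rather stream the whole thing (option c / step 4), Python's `bz2` plus `xml.etree.ElementTree.iterparse` is enough to walk the dump without ever holding it in memory. A rough sketch, with the same placeholder file name and a small helper to strip the MediaWiki export namespace:

    import bz2
    import xml.etree.ElementTree as ET

    def local(tag):
        # The dump wraps everything in a MediaWiki export namespace;
        # match on bare tag names instead.
        return tag.rsplit("}", 1)[-1]

    # BZ2File understands the concatenated streams, so this reads the compressed
    # multistream dump directly.
    with bz2.open("enwiki-latest-pages-articles-multistream.xml.bz2", "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if local(elem.tag) != "page":
                continue
            title = next(e.text for e in elem.iter() if local(e.tag) == "title")
            text = next((e.text for e in elem.iter() if local(e.tag) == "text"), None)
            # ... do something with (title, text) ...
            elem.clear()  # drop the finished subtree so memory stays flat

It's not as lean as a hand-rolled SAX handler, but memory stays roughly constant, which is the part that matters at this scale.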


The bigger problem is that this is wikitext markup. It would be helpful if they also provided HTML and/or plain text.

I know there are now a couple of pretty-good wikitext parsers, but for years it was a bigger problem: the only "official" one was the huge PHP app itself.
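
For what it's worth, things are better now; on the Python side, for example, `mwparserfromhell` handles the common cases. A minimal sketch, with made-up sample wikitext:

    # pip install mwparserfromhell
    import mwparserfromhell

    wikitext = "'''Computer accessibility''' is covered at [[Computer accessibility|this article]]."

    code = mwparserfromhell.parse(wikitext)
    plain = code.strip_code()                                # markup stripped to plain text
    links = [str(w.title) for w in code.filter_wikilinks()]  # [[...]] targets, handy for a link graph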


Oh, it's godawful; the format is a crime against all things structured. I use `parse-wiki-text-2` [0], which is a fork of `parse-wiki-text`, a Rust library by an author who has now disappeared into the wind. (Every day that I parse Wikipedia, I thank him for his contributions, wherever he may be.)

I wrote another Rust library [1] that wraps `parse-wiki-text-2` and offers a simplified AST that takes care of matching tags for you. It's designed to be bound to WASM [2], which is how I'm pretty reliably parsing Wikitext for my web application. (The existing JS libraries aren't fantastic, if I'm being honest.)

[0]: https://github.com/soerenmeier/parse-wiki-text-2

[1]: https://github.com/philpax/wikitext_simplified

[2]: https://github.com/genresinspace/genresinspace.github.io/blo...


What they need to do is have 'major edits' push out an updated static rendered file, the way old-school publishing processes did. Then either host those somewhere as-is, or also in a compressed format (e.g. a compressed weekly snapshot retained for a year?).

Also, make a CNAME from bots.wikipedia.org to that site.



