Wikipedia dump extractor
Wikipedia dump extractor Documentation

System requirements

The following components and software are required to run or build the application.

Running

Building

Usage

wikipedia-dump-extractor is a command-line tool that extracts the revision metadata of one or more Wikipedia pages from a dump file. It supports only so-called "stub" dump files, which contain only history metadata.

Building wikipedia-dump-extractor

There are several ways to build and run this Java application. The following sections only discuss building class files. You are also free to create an executable jar file, but that is not covered here.

With javac

Navigate to the directory of the source files and execute the following command:

javac Main.java [arguments]

The compiled class files are placed in the same directory. The application can then be executed as described below.

With Eclipse IDE

The whole project was created with Eclipse, so to compile it you just need to import it via "Import->General->Existing Projects into Workspace". Afterwards, start the build by hitting "Run" and select "Main" as the main class if asked. This builds and runs the application with no arguments. Usage information should be written to the console tab, and the class files can be found in the /bin/ folder of the project.

Executing wikipedia-dump-extractor

Using the class files

First, change your working directory to the directory where the class files reside. Then run the application with the following command:

java Main [arguments]

Replace [arguments] with the actual arguments you want to pass to the application, which are discussed below.

Using the executable jar

If an executable jar file has been created, running the application is fairly easy. Just run the following command:

java -jar path/to/jar/jarfile.jar [arguments]

You must replace path/to/jar/jarfile.jar with the correct path and file name, and [arguments] with the actual arguments (discussed in the next section) you want to pass to the application.

Allowed arguments

Running the application with no arguments prints the allowed arguments. Currently, two options exist: list and extract.

"List" will output the full list of page titles contained in the dump file. The list command must be specified in the following way:

<dumpfile> –list
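To illustrate what listing page titles involves, here is a minimal sketch that streams a MediaWiki XML export with the standard StAX API and collects the text of each title element found inside a page element. It is not taken from the tool's source code. The class name TitleLister and the method listTitles are hypothetical, and the sample input is a tiny hand-written fragment rather than a real dump.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of wikipedia-dump-extractor.
final class TitleLister {

    /** Returns the text of every <title> element that appears inside a <page> element. */
    static List<String> listTitles(Reader xml) throws XMLStreamException {
        List<String> titles = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        boolean inPage = false;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if ("page".equals(name)) {
                    inPage = true;
                } else if (inPage && "title".equals(name)) {
                    // getElementText() consumes the element up to its end tag.
                    titles.add(reader.getElementText());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "page".equals(reader.getLocalName())) {
                inPage = false;
            }
        }
        reader.close();
        return titles;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Hand-written sample input, assumed to mimic the page/title nesting of a dump.
        String sample = "<mediawiki>"
                + "<page><title>Alpha</title></page>"
                + "<page><title>Beta</title></page>"
                + "</mediawiki>";
        for (String title : listTitles(new StringReader(sample))) {
            System.out.println(title);
        }
    }
}
```

A streaming reader is used here because dump files can be very large. The sketch never holds more than one element in memory at a time.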

"Extract" will read all revision metadata of the given pages from the dump file. This command must be specified in the following way:

<dumpfile> –extract <pagetitle> [<pagetitle2>...]
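As a rough illustration of what extracting revision metadata for selected pages can look like, the following sketch streams the XML with StAX and records the id and timestamp of each revision whose page title is in a requested set. This is an assumption-laden example, not the tool's actual implementation. The class RevisionExtractor, the method extract, and the tab-separated output format are all hypothetical. It relies only on the page, title, revision, id and timestamp elements of the MediaWiki export format.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of wikipedia-dump-extractor.
final class RevisionExtractor {

    /** Returns one "title<TAB>revisionId<TAB>timestamp" row per revision of a wanted page. */
    static List<String> extract(Reader xml, Set<String> wanted) throws XMLStreamException {
        List<String> rows = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        String title = null;      // title of the page currently being read
        String revId = null;      // id of the revision currently being read
        String timestamp = null;  // timestamp of the revision currently being read
        boolean inRevision = false;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if ("title".equals(name) && !inRevision) {
                    title = reader.getElementText();
                } else if ("revision".equals(name)) {
                    inRevision = true;
                    revId = null;
                    timestamp = null;
                } else if (inRevision && "id".equals(name) && revId == null) {
                    // Only the first <id> in a revision is the revision id;
                    // a later <id> inside <contributor> is skipped.
                    revId = reader.getElementText();
                } else if (inRevision && "timestamp".equals(name)) {
                    timestamp = reader.getElementText();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "revision".equals(reader.getLocalName())) {
                if (wanted.contains(title)) {
                    rows.add(title + "\t" + revId + "\t" + timestamp);
                }
                inRevision = false;
            }
        }
        reader.close();
        return rows;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Hand-written sample input, assumed to mimic the structure of a stub dump.
        String sample = "<mediawiki><page><title>Beta</title>"
                + "<revision><id>7</id><timestamp>2010-01-01T00:00:00Z</timestamp>"
                + "<contributor><username>u</username><id>99</id></contributor>"
                + "</revision></page></mediawiki>";
        Set<String> wanted = new HashSet<>(Arrays.asList("Beta"));
        for (String row : extract(new StringReader(sample), wanted)) {
            System.out.println(row);
        }
    }
}
```

Matching titles against a set keeps the lookup cost constant per revision, which matters when several page titles are requested at once.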