Wikipedia dump extractor
ExtractPageSAXHandler Class Reference

A SAX handler that exports revision metadata from a Wikipedia stub dump file into a single XML file.

Inherits DefaultHandler.

Public Member Functions

 ExtractPageSAXHandler (File exportFile, String pageTitle) throws ParserConfigurationException
 Creates a new ExtractPageSAXHandler instance.
void startDocument () throws SAXException
void startElement (String uri, String localName, String qName, Attributes attributes) throws SAXException
void characters (char[] ch, int start, int length) throws SAXException
void endElement (String uri, String localName, String qName) throws SAXException
void endDocument () throws SAXException

Private Attributes

File exportFile
 the file to export to
DocumentBuilderFactory docFactory
 the document builder factory used to create the DOM document builder
DocumentBuilder docBuilder
 the document builder used to create the DOM document
Document currentDoc
 the current DOM document instance.
Element currentRoot
 the current root element instance.
Stack< Element > elements = new Stack<Element>()
 a stack containing the current parent element.
boolean export = false
 indicates if the parser is inside a page element that should be exported
boolean inTitle = false
 indicates if the parser is inside a title tag
boolean inRevision = false
 indicates if the parser is inside a revision tag
String pageTitle = ""
 the title of the page which should be exported

Detailed Description

A SAX handler that exports revision metadata from a Wikipedia stub dump file into a single XML file.

This handler throws a SAXException with the message "Finished extraction" after the page has been found and exported. Callers should treat that exception as the normal end of a successful extraction, not as a parse error.
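
Because the handler ends a successful run by throwing, the calling code has to tell that signal apart from a genuine parse error. The sketch below shows one way a caller might do this. It does not use the real class: StopAfterPageHandler is a hypothetical stand-in that only mimics the documented behavior (match the title, then throw "Finished extraction" at the end of the page), and the class and method names are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EarlyStopSketch {

    /** Hypothetical stand-in for ExtractPageSAXHandler; it exports nothing. */
    static class StopAfterPageHandler extends DefaultHandler {
        private final String pageTitle;
        private final StringBuilder titleText = new StringBuilder();
        private boolean inTitle = false;
        private boolean export = false;
        int pagesSeen = 0;

        StopAfterPageHandler(String pageTitle) {
            this.pageTitle = pageTitle;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if ("page".equals(qName)) {
                pagesSeen++;
            } else if ("title".equals(qName)) {
                inTitle = true;
                titleText.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so accumulate it.
            if (inTitle) {
                titleText.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if ("title".equals(qName)) {
                inTitle = false;
                export = pageTitle.equals(titleText.toString());
            } else if ("page".equals(qName) && export) {
                // Same message as documented for the real handler.
                throw new SAXException("Finished extraction");
            }
        }
    }

    /** Returns true if parsing stopped early because the page was found. */
    static boolean parse(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return false; // reached the end of the document without a match
        } catch (SAXException e) {
            if ("Finished extraction".equals(e.getMessage())) {
                return true; // expected stop signal, not an error
            }
            throw new IllegalStateException("Parse error", e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><title>A</title></page>"
                + "<page><title>B</title></page></mediawiki>";
        System.out.println(parse(xml, new StopAfterPageHandler("A")));
    }
}
```

Comparing the exception message is fragile, but it is the only signal the class description documents, so the sketch relies on it.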

See Also
https://meta.wikimedia.org/wiki/Data_dumps/Dump_format
Author
Florian Zoubek zoubek@bitandart.at

Constructor & Destructor Documentation

ExtractPageSAXHandler.ExtractPageSAXHandler ( File  exportFile,
String  pageTitle 
) throws ParserConfigurationException

Creates a new ExtractPageSAXHandler instance.

For details and behavior of this handler see the class description.

Parameters
exportFile  the file to export to
pageTitle  the title of the page which should be exported
Exceptions
ParserConfigurationException

Member Function Documentation

void ExtractPageSAXHandler.characters ( char[]  ch,
int  start,
int  length 
) throws SAXException

void ExtractPageSAXHandler.endDocument ( ) throws SAXException

void ExtractPageSAXHandler.endElement ( String  uri,
String  localName,
String  qName 
) throws SAXException

void ExtractPageSAXHandler.startDocument ( ) throws SAXException

void ExtractPageSAXHandler.startElement ( String  uri,
String  localName,
String  qName,
Attributes  attributes 
) throws SAXException

Member Data Documentation

Document ExtractPageSAXHandler.currentDoc
private

the current DOM document instance.

null until the page has been found.

Element ExtractPageSAXHandler.currentRoot
private

the current root element instance.

null until the page has been found.

DocumentBuilder ExtractPageSAXHandler.docBuilder
private

the document builder used to create the DOM document

DocumentBuilderFactory ExtractPageSAXHandler.docFactory
private

the document builder factory used to create the DOM document builder

Stack<Element> ExtractPageSAXHandler.elements = new Stack<Element>()
private

a stack containing the current parent element.
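
As an illustration of how a stack of parent elements can turn SAX events into a DOM tree, here is a minimal sketch. It is not taken from the class's source: DomBuildingHandler and toDom are hypothetical names, and the sketch copies every element, whereas the documented handler only exports the matching page.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Stack;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ElementStackSketch {

    /** Hypothetical handler that mirrors SAX events into a DOM document. */
    static class DomBuildingHandler extends DefaultHandler {
        private final Document doc;
        private final Stack<Element> elements = new Stack<Element>();

        DomBuildingHandler(Document doc) {
            this.doc = doc;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            Element element = doc.createElement(qName);
            if (elements.isEmpty()) {
                doc.appendChild(element);             // first element becomes the root
            } else {
                elements.peek().appendChild(element); // top of the stack is the parent
            }
            elements.push(element);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (!elements.isEmpty()) {
                elements.peek().appendChild(doc.createTextNode(new String(ch, start, length)));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            elements.pop(); // the closed element is no longer a parent
        }
    }

    /** Parses the XML string with SAX and returns the DOM built by the handler. */
    static Document toDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new DomBuildingHandler(doc));
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Pushing in startElement and popping in endElement keeps the top of the stack equal to the element whose children are currently being read, which is why no explicit parent lookup is needed.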

boolean ExtractPageSAXHandler.export = false
private

indicates if the parser is inside a page element that should be exported

File ExtractPageSAXHandler.exportFile
private

the file to export to
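
This page does not say how the DOM document is written to the export file. One common approach with the standard JAXP API is an identity Transformer, sketched below under that assumption. ExportFileSketch, samplePage, write and roundTrip are hypothetical names used only for this example.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ExportFileSketch {

    /** Builds a tiny document of the form <page><title>...</title></page>. */
    static Document samplePage(String title) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element page = doc.createElement("page");
            Element titleElement = doc.createElement("title");
            titleElement.setTextContent(title);
            page.appendChild(titleElement);
            doc.appendChild(page);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes the document to the given file with an identity transform. */
    static void write(Document doc, File exportFile) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.transform(new DOMSource(doc), new StreamResult(exportFile));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Writes a sample page to a temporary file and returns the file's content. */
    static String roundTrip(String title) {
        try {
            File exportFile = Files.createTempFile("export", ".xml").toFile();
            write(samplePage(title), exportFile);
            String content = new String(Files.readAllBytes(exportFile.toPath()), StandardCharsets.UTF_8);
            exportFile.delete();
            return content;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```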

boolean ExtractPageSAXHandler.inRevision = false
private

indicates if the parser is inside a revision tag

boolean ExtractPageSAXHandler.inTitle = false
private

indicates if the parser is inside a title tag

String ExtractPageSAXHandler.pageTitle = ""
private

the title of the page which should be exported

