A quick little tool for extracting sets of pages from a MediaWiki dump file.
It can read MediaWiki XML export dumps (version 0.3, minus uploads), perform
optional filtering, and write the result back out as XML or as SQL statements
for direct import into a database using the 1.4, 1.5, or 1.25 schema.

Still very much under construction.

MIT-style license like our other Java/C# tools; boilerplate to be added.

Contains code from the Apache Commons Compress project for cross-platform
bzip2 input/output support (Apache License 2.0).

If strange XML errors are encountered under Java 1.4, try 1.5:
* http://java.sun.com/j2se/1.5.0/download.jsp
* http://www.apple.com/downloads/macosx/apple/java2se50release1.html

USAGE:

Sample command line for a direct database import:

  java -jar mwdumper.jar --format=mysql:1.5 pages_full.xml.bz2 |
    mysql -u <username> -p <databasename>

You can also do complex filtering to produce multiple output files:

  java -jar mwdumper.jar \
    --output=bzip2:pages_public.xml.bz2 \
    --format=xml \
    --filter=notalk \
    --filter=namespace:\!NS_USER \
    --filter=latest \
    --output=bzip2:pages_current.xml.bz2 \
    --format=xml \
    --filter=latest \
    --output=gzip:pages_full_1.25.sql.gz \
    --format=mysql:1.25 \
    --output=gzip:pages_full_1.4.sql.gz \
    --format=mysql:1.4 \
    pages_full.xml.gz

A bare parameter will be interpreted as a file to read XML input from; if none
is given, or "-" is given, input will be read from stdin. Input files with
".gz" or ".bz2" extensions will be decompressed as gzip and bzip2 streams,
respectively.
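
For example, one way to feed a dump in on stdin and write filtered XML back to
a file (a sketch; bzcat and the output filename pages_latest.xml are just
illustrative, not part of mwdumper itself):

  bzcat pages_full.xml.bz2 |
    java -jar mwdumper.jar --format=xml --filter=latest > pages_latest.xml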

Internal decompression of 7-zip .7z files is not yet supported; you can
pipe such files through p7zip's 7za:

  7za e -so pages_full.xml.7z |
    java -jar mwdumper.jar --format=mysql:1.5 |
    mysql -u <username> -p <databasename>

Defaults if no parameters are given:
* read uncompressed XML from stdin
* write uncompressed XML to stdout
* no filtering

Output sinks:

  --output=stdout
      Send uncompressed XML or SQL output to stdout for piping.
      (May have charset issues.) This is the default if no output
      is specified.
  --output=file:<filename.xml>
      Write uncompressed output to a file.
  --output=gzip:<filename.xml.gz>
      Write gzip-compressed output to a file.
  --output=bzip2:<filename.xml.bz2>
      Write bzip2-compressed output to a file.
  --output=mysql:<jdbc url>
      Valid only for SQL format output; opens a connection to the
      MySQL server and sends commands to it directly.
      This will look something like:
      mysql://localhost/databasename?user=<username>&password=<password>
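
For example, a direct-to-database run might look like the following (a sketch;
the database name and credentials are placeholders to replace with your own,
and the argument is quoted so the shell does not interpret the "&"):

  java -jar mwdumper.jar \
    '--output=mysql://localhost/wikidb?user=wikiuser&password=secret' \
    --format=mysql:1.5 \
    pages_full.xml.bz2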

Output formats:

  --format=xml
      Output back to MediaWiki's XML export format; use this for
      filtering dumps for limited import. Output should be idempotent.
  --format=xml:0.3
      Output in the legacy 0.3 XML format; use with tools that can't
      handle anything newer.
  --format=mysql:1.4
      SQL statements formatted for bulk import in MediaWiki 1.4's schema.
      (MySQL output format.)
  --format=mysql:1.5
      SQL statements formatted for bulk import in MediaWiki 1.5's schema.
      Both SQL schema versions currently require that the table structure
      be already set up in an empty database; use maintenance/tables.sql
      from the MediaWiki distribution (see the example after this list).
      (MySQL output format.)
  --format=mysql:1.25
      SQL statements formatted for bulk import in MediaWiki 1.25's schema.
  --format=pgsql:1.4
      SQL statements formatted for bulk import in MediaWiki 1.4's schema.
      (PostgreSQL output format.)
  --format=pgsql:1.5
      SQL statements formatted for bulk import in MediaWiki 1.5's schema.
      (PostgreSQL output format.)

For backwards compatibility, "sql" is an alias for "mysql". PostgreSQL output
requires superuser access to temporarily disable foreign key checks during the
import.
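
As a rough example of preparing that empty database and then loading the
generated SQL (a sketch; "wikidb" is a placeholder database name, and
maintenance/tables.sql comes from your MediaWiki checkout, not from mwdumper):

  mysqladmin -u <username> -p create wikidb
  mysql -u <username> -p wikidb < maintenance/tables.sql
  java -jar mwdumper.jar --format=mysql:1.5 pages_full.xml.bz2 |
    mysql -u <username> -p wikidb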

Filter actions:

  --filter=latest
      Skips all but the last revision listed for each page.
      FIXME: currently this pays no attention to the timestamp or
      revision number, but simply the order of items in the dump.
      This may or may not be strictly correct.
  --filter=before:<timestamp>
      Skips all revisions after the specified timestamp.
      The timestamp must match the format used in XML dumps
      (e.g. 2005-12-31T23:59:59Z).
  --filter=after:<timestamp>
      Skips all revisions before the specified timestamp.
  --filter=list:<list-filename>
      Excludes all pages whose titles do not appear in the given file.
      Use one title per line; blank lines and lines starting with # are
      ignored. Talk and subject pages of the given titles are both matched.
  --filter=exactlist:<list-filename>
      As above, but does not try to match associated talk/subject pages.
  --filter=revlist:<list-filename>
      Includes only the revisions specified by ID in the given file.
  --filter=namespace:[!]<NS_KEY,NS_OTHERKEY,100,...>
      Includes only pages in (or not in, with "!") the given namespaces.
      You can use the NS_* constant names or the raw numeric keys.
  --filter=notalk
      Excludes all talk pages from output (including custom namespaces).
  --filter=titlematch:<regex>
      Excludes all pages whose titles do not match the regex.
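
Filters can be combined; for instance, a sketch that keeps only the latest
revisions of main-namespace pages named in a list file (titles.txt and the
output filename are placeholders; namespace 0 is the main article namespace
in a default MediaWiki setup):

  java -jar mwdumper.jar \
    --output=bzip2:subset_current.xml.bz2 \
    --format=xml \
    --filter=exactlist:titles.txt \
    --filter=namespace:0 \
    --filter=latest \
    pages_full.xml.bz2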

Misc options:

  --progress=<n>
      Change the progress reporting interval from the default of 1000
      revisions.
  --quiet
      Don't send any progress output to stderr.
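
For example, a long conversion with less frequent progress reporting (a
sketch; the interval is arbitrary and the redirect target is a placeholder):

  java -jar mwdumper.jar --progress=10000 --format=mysql:1.5 \
    pages_full.xml.bz2 > pages_full.sql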

BUILDING:

To help develop mwdumper, get the source from:

  git clone https://gerrit.wikimedia.org/r/mediawiki/tools/mwdumper

Make sure you have Maven 2 or later installed on your system, then:

  mvn package

This will compile and build target/mwdumper-1.16.jar, which can be run
as in the USAGE section above.
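
A quick way to smoke-test the freshly built jar is to run it just like the
pre-built one (a sketch; the input and output filenames are placeholders):

  java -jar target/mwdumper-1.16.jar --format=xml --filter=latest \
    pages_full.xml.bz2 > pages_latest.xml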

PERFORMANCE TIPS:

To speed up importing into a database, you might try:

* Java's -server option may significantly increase performance on some
  versions of Sun's JVM for large files. (Not all installations will
  have this available.)
* Increase MySQL's innodb_log_file_size. The default is as little as 5 MB,
  but you can improve performance dramatically by increasing this to reduce
  the number of disk writes. (See the my-huge.cnf sample config, and the
  sketch after this list.)
* If you don't need it, disable the binary log (log-bin option) during the
  import. On a standalone machine this is just wasteful, writing a second
  copy of every query that you'll never use.
* Various other wacky tips in the MySQL reference manual at
  http://dev.mysql.com/mysql/en/innodb-tuning.html
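
A minimal my.cnf sketch combining the two MySQL tips above (the 256M figure is
an arbitrary example, not a recommendation; tune it to your hardware and
restart the MySQL server after changing it):

  [mysqld]
  # A larger InnoDB redo log means fewer forced flushes during the bulk import
  innodb_log_file_size = 256M
  # Leaving log-bin unset/commented out keeps the binary log disabled
  # log-bin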

TODO:
* Add some more junit tests
* Include table initialization in SQL output
* Allow use of table prefixes in SQL output
* Ensure that titles and other bits are validated correctly.
* Test XML input for robustness
* Provide filter to strip ID numbers
* <siteinfo> is technically optional; live without it and use default namespaces
* GUI frontend(s)
* Port to Python? ;)

Change history (abbreviated):
2007-07-06: Fixed namespace filter inversion (brion)
2007-03-09: Added PostgreSQL support (river)
2006-10-25: Added before and after filters by Aurimas Fischer
2005-10-25: Switched SqlWriter.sqlEscape back to less memory-hungry StringBuffer
2005-10-24: Fixed SQL output in non-UTF-8 locales
2005-10-21: Applied more speedup patches from Folke
2005-10-11: SQL direct connection, GUI work begins
2005-10-10: Applied speedup patches from Folke Behrens
2005-10-05: Use bulk inserts in SQL mode
2005-09-29: Converted from C# to Java
2005-08-27: Initial extraction code