# Warcbase
Warcbase is an open-source platform for managing web archives built on HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing.
## Getting Started
Clone the repo:

```
git clone git@github.com:lintool/warcbase.git
```

You can then build Warcbase:

```
mvn clean package appassembler:assemble
```

For the impatient, to skip tests:

```
mvn clean package appassembler:assemble -DskipTests
```

To create Eclipse project files:

```
mvn eclipse:clean
mvn eclipse:eclipse
```

You can then import the project into Eclipse.
## Ingesting Content
Somewhat ironically (given the name of the project), Warcbase currently supports only ARC files. Don't worry, we're on it.
You can find some sample data here. Ingesting data into Warcbase is fairly straightforward:
```
$ setenv CLASSPATH_PREFIX "/etc/hbase/conf/"
$ sh target/appassembler/bin/IngestWarcFiles \
   -dir /path/to/warc/dir/ -name archive_name -create
```
Command-line options:
+ Use the `-dir` option to specify the directory containing the data files.
+ Use the `-name` option to specify the name of the archive (will correspond to the HBase table name).
+ Use the `-create` option to create a new table (and drop the existing table if a table with the same name exists already). Alternatively, use `-append` to add to an existing table.
That should do it. The data should now be in Warcbase.
## Wayback/Warcbase Integration
Warcbase comes with a browser exposed as a REST API that conforms to Wayback's schema of collection/YYYYMMDDHHMMSS/targetURL. Here's how you start the browser:
```
$ setenv CLASSPATH_PREFIX "/etc/hbase/conf/"
$ sh target/appassembler/bin/WarcBrowser -port 8080
```
You can now use http://myhost:8080/ to browse the archive. For example:
+ http://myhost:8080/mycollection/*/http://mysite.com/ will give you a list of available versions of http://mysite.com/.
+ http://myhost:8080/mycollection/19991231235959/http://mysite.com/ will give you the record of http://mysite.com/ just before Y2K.
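As a quick illustration of the scheme, the timestamp component is just a 14-digit `yyyyMMddHHmmss` string. A minimal Java sketch (the class and method names here are hypothetical, not part of Warcbase):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper illustrating the collection/YYYYMMDDHHMMSS/targetURL
// scheme; not part of Warcbase itself.
public class WaybackUrl {
    static String queryPath(String collection, LocalDateTime ts, String target) {
        // Wayback timestamps are 14 digits: yyyyMMddHHmmss
        return collection + "/"
            + ts.format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss"))
            + "/" + target;
    }

    public static void main(String[] args) {
        System.out.println(queryPath("mycollection",
                LocalDateTime.of(1999, 12, 31, 23, 59, 59),
                "http://mysite.com/"));
        // prints mycollection/19991231235959/http://mysite.com/
    }
}
```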
Note that this API serves up raw records, so the HTML pages don't look pretty, and images don't render properly (since the browser gets confused by record headers). So how do you actually navigate through the archive? This is where Wayback/Warcbase integration comes in.
As it turns out, the Wayback code has the ability to separate rendering/browsing from data storage. More details can be found in this technical overview. In short, we can customize a Wayback instance to point at the Warcbase REST API and have Wayback fetch records from HBase. This is accomplished by custom implementations of ResourceIndex and ResourceStore here.
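The delegation idea can be sketched roughly as follows. Note these are simplified stand-in interfaces invented for illustration; Wayback's actual ResourceIndex/ResourceStore signatures are richer:

```java
// Simplified stand-ins for the delegation idea: a store implementation
// translates a (collection, timestamp, url) request into a call against
// the Warcbase REST API instead of reading local ARC files. These
// interfaces are NOT Wayback's real ResourceIndex/ResourceStore.
public class WarcbaseStoreSketch {
    interface ResourceStore {  // hypothetical, simplified
        String fetchUrlFor(String collection, String timestamp, String target);
    }

    static class RestBackedStore implements ResourceStore {
        private final String host;
        private final int port;

        RestBackedStore(String host, int port) {
            this.host = host;
            this.port = port;
        }

        // Builds the Warcbase REST URL that Wayback would fetch from.
        public String fetchUrlFor(String collection, String timestamp, String target) {
            return "http://" + host + ":" + port + "/"
                + collection + "/" + timestamp + "/" + target;
        }
    }

    public static void main(String[] args) {
        ResourceStore store = new RestBackedStore("myhost", 8080);
        System.out.println(store.fetchUrlFor(
            "mycollection", "19991231235959", "http://mysite.com/"));
        // prints http://myhost:8080/mycollection/19991231235959/http://mysite.com/
    }
}
```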
Here's how to install the integration:
- Make sure you already have Wayback installed. See this installation guide and configuration guide.
- Add the Warcbase jar to Wayback's WAR deployment. In a standard setup, you would copy warcbase-0.1.0-SNAPSHOT.jar to TOMCAT_ROOT/webapps/ROOT/WEB-INF/lib/.
- Replace the BDBCollection.xml configuration in TOMCAT_ROOT/webapps/ROOT/WEB-INF/ with the version in [src/main/resources/](https://github.com/lintool/warcbase/tree/master/src/main/resources).
- Open up BDBCollection.xml and specify the correct HOST, PORT, and TABLE.
- Shut down and restart Tomcat.
Now navigate to your Wayback as before. Enjoy browsing your web archive!
## Extracting the Webgraph
First, use a MapReduce tool to extract all URLs from ARC data:
```
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar \
   org.warcbase.analysis.demo.MapReduceArcDemo \
   -input inputDir -output outputDir
```
Next, build an FST for the URLs (using Lucene's FST package). The two relevant classes in Warcbase are:
+ UriMappingBuilder: takes a list of URLs as input and builds the FST mapping.
+ UriMapping: loads the mapping file generated by UriMappingBuilder and provides an API for accessing the FST.
To build the Lucene FST:
```
$ sh target/appassembler/bin/UriMappingBuilder inputDirectory outputFile
```
This command will read all files under inputDirectory as input, build an FST, and write the data file to outputFile. The UriMapping class provides a simple command-line interface:
```
# Lookup by URL, fetches the integer id
$ sh target/appassembler/bin/UriMapping -data fst.dat -getId http://www.foo.com/

# Lookup by id, fetches the URL
$ sh target/appassembler/bin/UriMapping -data fst.dat -getUrl 42

# Fetches all URLs with the prefix
$ sh target/appassembler/bin/UriMapping -data fst.dat -getPrefix http://www.foo.com/
```
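One way to picture the id/URL contract: with the URLs sorted lexicographically, each URL's id is its ordinal position in that order. A sorted array gives the same bidirectional mapping, minus the FST's compression and prefix traversal; this is an illustrative sketch with hypothetical names, not Warcbase's actual API:

```java
import java.util.Arrays;

// Illustrative stand-in for the id <-> URL contract; NOT the real
// UriMapping class, and no FST is involved here.
public class UriMappingSketch {
    private final String[] sortedUrls;

    UriMappingSketch(String[] urls) {
        sortedUrls = urls.clone();
        Arrays.sort(sortedUrls);  // ids are positions in sorted order
    }

    // URL -> integer id (or -1 if absent)
    int getId(String url) {
        int i = Arrays.binarySearch(sortedUrls, url);
        return i >= 0 ? i : -1;
    }

    // integer id -> URL
    String getUrl(int id) {
        return sortedUrls[id];
    }

    public static void main(String[] args) {
        UriMappingSketch m = new UriMappingSketch(new String[] {
            "http://www.foo.com/", "http://www.foo.com/a", "http://www.bar.com/" });
        System.out.println(m.getId("http://www.foo.com/"));  // prints 1
        System.out.println(m.getUrl(0));  // prints http://www.bar.com/
    }
}
```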
Then, we can use the FST mapping data to extract the webgraph and at the same time map URLs to unique integer ids. This is accomplished by a Hadoop program:
```
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar \
   org.warcbase.data.ExtractLinks \
   -input inputDir -output outputDir -uriMapping fstData -numReducers 1
```
Finally, instead of extracting links between individual URLs, we can extract the site-level webgraph by merging all URLs with a common prefix into a "supernode". Link counts between supernodes represent the total number of links between their sub-URLs. To do this, the following input files are required:
+ a prefix file providing URL prefixes for each supernode (comma-delimited: id, URL prefix);
+ an FST mapping file to map individual URLs to unique integer ids (from above);
+ a directory containing the original link graph (which can be extracted from the previous step).
Then run this program:
```
$ sh target/appassembler/bin/ExtractSiteLinks -prefixfile prefix.data \
   -fstfile fst.data -linkdir extract-links-data -output sitelinks.data
```
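The supernode merge can be sketched in a few lines: each URL id is collapsed to its supernode id (looked up via the prefix file), and link counts between supernode pairs are summed. The data and names below are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the site-level merge described above; data and names are
// made up for illustration, not Warcbase's actual implementation.
public class SiteLinksSketch {
    static Map<String, Integer> merge(int[] urlToSite, int[][] links) {
        Map<String, Integer> siteLinks = new HashMap<>();
        for (int[] link : links) {
            // Collapse both endpoints to their supernode ids, then count.
            String edge = urlToSite[link[0]] + "->" + urlToSite[link[1]];
            siteLinks.merge(edge, 1, Integer::sum);
        }
        return siteLinks;
    }

    public static void main(String[] args) {
        // URL ids 0-1 belong to supernode 0; ids 2-3 to supernode 1.
        int[] urlToSite = {0, 0, 1, 1};
        int[][] links = {{0, 2}, {1, 2}, {1, 3}, {2, 0}};
        Map<String, Integer> m = merge(urlToSite, links);
        System.out.println(m.get("0->1"));  // prints 3
        System.out.println(m.get("1->0"));  // prints 1
    }
}
```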