diff --git a/README.md b/README.md
index 9360a72..078533e 100644
--- a/README.md
+++ b/README.md
@@ -1,144 +1,155 @@
Warcbase
========

Warcbase is an open-source platform for managing web archives built on HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing.

Getting Started
---------------

Clone the repo:

```
git clone git@github.com:lintool/warcbase.git
```

You can then build Warcbase:

```
mvn clean package appassembler:assemble
```

For the impatient, to skip tests:

```
mvn clean package appassembler:assemble -DskipTests
```

To create Eclipse project files:

```
mvn eclipse:clean
mvn eclipse:eclipse
```

You can then import the project into Eclipse.

Ingesting Content
-----------------

Somewhat ironically (given the name of the project), Warcbase currently supports only ARC files. Don't worry, we're [on it](https://github.com/lintool/warcbase/issues/64). You can find some sample data [here](https://archive.org/details/ExampleArcAndWarcFiles).

Ingesting data into Warcbase is fairly straightforward:

```
$ setenv CLASSPATH_PREFIX "/etc/hbase/conf/"
$ sh target/appassembler/bin/IngestWarcFiles \
    -dir /path/to/warc/dir/ -name archive_name -create
```

Command-line options:

+ Use the `-dir` option to specify the directory containing the data files.
+ Use the `-name` option to specify the name of the archive (will correspond to the HBase table name).
+ Use the `-create` option to create a new table (and drop the existing table if a table with the same name exists already). Alternatively, use `-append` to add to an existing table.

That should do it. The data should now be in Warcbase.

Wayback/Warcbase Integration
----------------------------

Warcbase comes with a browser exposed as a REST API that conforms to Wayback's schema of `collection/YYYYMMDDHHMMSS/targetURL`. Here's how you start the browser:

```
$ setenv CLASSPATH_PREFIX "/etc/hbase/conf/"
$ sh target/appassembler/bin/WarcBrowser -port 8080
```

You can now use `http://myhost:8080/` to browse the archive. For example:

+ `http://myhost:8080/mycollection/*/http://mysite.com/` will give you a list of available versions of `http://mysite.com/`.
+ `http://myhost:8080/mycollection/19991231235959/http://mysite.com/` will give you the record of `http://mysite.com/` just before Y2K.

Note that this API serves up raw records, so the HTML pages don't look pretty, and images don't render properly (since the browser gets confused by record headers).
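
The raw records are still handy for scripting, though. As a quick sketch, assuming the host, port, and collection name from the examples above, you can pull them directly with `curl`:

```
# list the available versions of a page
$ curl "http://myhost:8080/mycollection/*/http://mysite.com/"

# fetch the raw record of a particular capture
$ curl "http://myhost:8080/mycollection/19991231235959/http://mysite.com/" > record.html
```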

So how do you actually navigate through the archive? This is where Wayback/Warcbase integration comes in. As it turns out, the Wayback code has the ability to separate rendering/browsing from data storage. More details can be found in this [technical overview](https://github.com/iipc/openwayback/wiki/Technical-overview). In short, we can customize a Wayback instance to point at the Warcbase REST API, and have Wayback fetch records from HBase. This is accomplished by custom implementations of `ResourceIndex` and `ResourceStore`, found [here](https://github.com/lintool/warcbase/tree/master/src/main/java/org/warcbase/wayback).

Here's how to install the integration:

1. Make sure you already have Wayback installed. See this [installation guide](https://github.com/iipc/openwayback/wiki/How-to-install) and [configuration guide](https://github.com/iipc/openwayback/wiki/How-to-configure).
2. Add the Warcbase jar to Wayback's WAR deployment. In a standard setup, you would copy `warcbase-0.1.0-SNAPSHOT.jar` to `TOMCAT_ROOT/webapps/ROOT/WEB-INF/lib/`.
3. Replace the `BDBCollection.xml` configuration in `TOMCAT_ROOT/webapps/ROOT/WEB-INF/` with the version in [`src/main/resources/`](https://github.com/lintool/warcbase/tree/master/src/main/resources).
4. Open up `BDBCollection.xml` and specify the correct `HOST`, `PORT`, and `TABLE`.
5. Shut down and restart Tomcat.

Now navigate to your Wayback as before. Enjoy browsing your web archive!

-Extracting the Webgraph
------------------------
+Building the URL Mapping
+------------------------

-First, use a MapReduce tool to extract all URLs from ARC data:
+It's convenient for a variety of tasks to map every URL to a unique integer id. Lucene's FST package provides a nice API for this task.
+
+There are two ways to build the URL mapping, the first of which is via a MapReduce job:

```
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar \
-    org.warcbase.analysis.demo.MapReduceArcDemo \
-    -input inputDir -output outputDir
+    org.warcbase.data.UrlMappingMapReduceBuilder \
+    -input /hdfs/path/to/data -output fst
```

-Next, build an FST for the URLs (using Lucene's FST package). The two relevant classes in Warcbase are:
-
-+ `UriMappingBuilder`: takes a list of URLs as input and builds the FST mapping.
-+ `UriMapping`: load the mapping file generated by Builder and provides an API for accessing the FST.
+The FST data in this case will be written to HDFS. The potential issue with this approach is that building the FST is relatively memory-hungry, and cluster memory is sometimes scarce.

-To build the Lucene FST :
+The alternative is to build the mapping locally on a machine with ample memory. To do this, first run a MapReduce job to extract all the unique URLs:

```
-$ sh target/appassembler/bin/UriMappingBuilder inputDirectory outputFile
+$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar \
+    org.warcbase.analysis.ExtractUniqueUrls \
+    -input /hdfs/path/to/data -output urls
```
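+
+The `urls/` output of this job lives in HDFS. As a rough sketch (assuming the `-output urls` path used above), you can inspect it and pull it down to the local filesystem with the standard Hadoop CLI:
+
+```
+$ hadoop fs -ls urls     # sanity-check the job output
+$ hadoop fs -get urls .  # copy the directory to the local machine
+```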

-This command will read all files under `inputDirectory` as input, build an FST, and write the data file to `outputFile`. The `UriMapping` class provides a simple command-line interface:
+Once the `urls/` directory has been copied out of HDFS, run the following program:
+```
+$ sh target/appassembler/bin/UrlMappingBuilder -input urls -output fst
+```
+
+Here, `urls` is the output directory from above and `fst` is the name of the FST data file. We can examine the FST data with the following utility program:

```
# Lookup by URL, fetches the integer id
$ sh target/appassembler/bin/UriMapping -data fst.dat -getId http://www.foo.com/

# Lookup by id, fetches the URL
$ sh target/appassembler/bin/UriMapping -data fst.dat -getUrl 42

# Fetches all URLs with the prefix
$ sh target/appassembler/bin/UriMapping -data fst.dat -getPrefix http://www.foo.com/
```

-Then, we can use the FST mapping data to extract the webgraph and at the same time map URLs to unique integer ids. This is accomplished by a Hadoop program:
+
+Extracting the Webgraph
+-----------------------
+
+We can use the mapping data (from above) to extract the webgraph and at the same time map URLs to unique integer ids. This is accomplished by a Hadoop program:

```
-$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar org.warcbase.data.ExtractLinks
-    -input inputDir -output outputDir -uriMapping fstData -numReducers 1
+$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar \
+    org.warcbase.data.ExtractLinks \
+    -input inputDir -output outputDir -uriMapping fstData -numReducers 1
```

Finally, instead of extracting links between individual URLs, we can extract the site-level webgraph by merging all URLs with a common prefix into a "supernode". Link counts between supernodes represent the total number of links between their sub-URLs. In order to do this, the following input files are required:

+ a prefix file providing URL prefixes for each supernode (comma-delimited: id, URL prefix; see the example at the end of this section);
+ an FST mapping file to map individual URLs to unique integer ids (from above);
+ a directory containing the original link graph (which can be extracted from the previous step).

Then run this program:

```
$ sh target/appassembler/bin/ExtractSiteLinks -prefixfile prefix.data -fstfile fst.data -linkdir extract-links-data -output sitelinks.data
```
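
For reference, a minimal prefix file in the comma-delimited `id,URL prefix` format described above might look like this (the entries here are made up):

```
1,http://www.foo.com/
2,http://www.bar.com/
3,http://www.baz.org/
```

All URLs sharing a listed prefix are then collapsed into the supernode with the corresponding id.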