Warcbase
========

Warcbase is an open-source platform for managing web archives built on HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing.

Getting Started
---------------

Once you've checked out the repo, build Warcbase:

```
mvn clean package appassembler:assemble
```

Ingesting Content
-----------------

To ingest archive content from ARC or WARC files:

```
$ setenv CLASSPATH_PREFIX "/etc/hbase/conf/"
$ sh target/appassembler/bin/IngestWarcFiles -dir /path/to/warc/ -name archive_name -create
```

Command-line options:

+ Use the `-dir` option to specify the directory containing WARC files.
+ Use the `-name` option to specify the name of the archive (this will correspond to the HBase table name).
+ Use the `-create` option to create a new table (dropping any existing table with the same name). Alternatively, use `-append` to add to an existing table.

To start the browser:

```
$ setenv CLASSPATH_PREFIX "/etc/hbase/conf/"
$ sh target/appassembler/bin/WarcBrowser -port 9191 -server http://myhost:9191/
```

Navigate to `http://myhost:9191/` to browse the archive.

Extracting the Webgraph
-----------------------

First, use a MapReduce tool to extract all URLs from the ARC data:

```
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar \
    org.warcbase.analysis.demo.MapReduceArcDemo \
    -input inputDir -output outputDir
```

Next, build an FST for the URLs (using Lucene's FST package). The two relevant classes in Warcbase are:

+ `UriMappingBuilder`: takes a list of URLs as input and builds the FST mapping.
+ `UriMapping`: loads the mapping file generated by the builder and provides an API for accessing the FST.

To build the Lucene FST:

```
$ sh target/appassembler/bin/UriMappingBuilder inputDirectory outputFile
```

This command reads all files under `inputDirectory` as input, builds an FST, and writes the data file to `outputFile`.

The `UriMapping` class provides a simple command-line interface:

```
# Look up a URL and fetch its integer id
$ sh target/appassembler/bin/UriMapping -data fst.dat -getId http://www.foo.com/

# Look up an id and fetch its URL
$ sh target/appassembler/bin/UriMapping -data fst.dat -getUrl 42

# Fetch all URLs with the given prefix
$ sh target/appassembler/bin/UriMapping -data fst.dat -getPrefix http://www.foo.com/
```

Then we can use the FST mapping data to extract the webgraph, mapping URLs to unique integer ids at the same time. This is accomplished by a Hadoop program:

```
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar \
    org.warcbase.data.ExtractLinks \
    -input inputDir -output outputDir -uriMapping fstData -numReducers 1
```

Finally, instead of extracting links between individual URLs, we can extract site-level links by merging all URLs with a common prefix into a supernode. The link count between two supernodes is the total number of links between their constituent URLs. This program requires the following inputs: a prefix file providing the URL prefix for each supernode, the FST mapping file that maps URLs to unique integer ids, and a directory containing the URL-level links extracted in the previous step. To run the program:

```
$ sh target/appassembler/bin/ExtractSiteLinks -prefixfile prefix.data -fstfile fst.data -linkdir extract-links-data -output sitelinks.data
```
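The layout of the prefix file is not specified above. As a purely hypothetical illustration (the delimiter and column order here are assumptions, not the tool's documented format), it could map a supernode id to its URL prefix, one entry per line:

```
# Hypothetical prefix file (format assumed, not documented):
# supernode id, URL prefix
1,http://www.foo.com/
2,http://www.bar.com/
```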
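As an appendix: readers curious about what `UriMappingBuilder` and `UriMapping` do internally can experiment with Lucene's FST package directly. Below is a minimal, self-contained sketch (assuming a Lucene 4.x-era API; this illustrates the technique, not Warcbase's actual implementation) that builds a URL-to-id FST in memory and performs lookups in both directions:

```
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstSketch {
  public static void main(String[] args) throws Exception {
    // FST inputs must be added in sorted (byte) order; here each URL's id is
    // simply its position in the sorted list.
    String[] urls = {"http://www.bar.com/", "http://www.foo.com/", "http://www.foo.com/page1"};

    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRef scratch = new IntsRef();
    for (int id = 0; id < urls.length; id++) {
      builder.add(Util.toIntsRef(new BytesRef(urls[id]), scratch), (long) id);
    }
    FST<Long> fst = builder.finish();

    // URL -> id lookup.
    System.out.println(Util.get(fst, new BytesRef("http://www.foo.com/")));  // 1

    // id -> URL lookup; possible because the outputs are monotonically increasing.
    IntsRef path = Util.getByOutput(fst, 1);
    StringBuilder url = new StringBuilder();
    for (int i = 0; i < path.length; i++) {
      url.append((char) path.ints[path.offset + i]);  // BYTE1 input type: one byte per arc
    }
    System.out.println(url);  // http://www.foo.com/
  }
}
```

Because an FST shares common prefixes and suffixes across entries, the full URL-to-id mapping stays compact enough to hold in memory even for large crawls.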