Warcbase
========

Warcbase is an open-source platform for managing web archives built on HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing.

Getting Started
---------------

Once you've checked out the repo, build Warcbase:

```
mvn clean package appassembler:assemble
```

Ingesting Content
-----------------

To ingest archive content from ARC or WARC files:

```
$ setenv CLASSPATH_PREFIX "/etc/hbase/conf/"
$ sh target/appassembler/bin/IngestWarcFiles -dir /path/to/warc/ -name archive_name -create
```

Command-line options:

+ Use the `-dir` option to specify the directory containing WARC files.
+ Use the `-name` option to specify the name of the archive (this will correspond to the HBase table name).
+ Use the `-create` option to create a new table (dropping any existing table with the same name). Alternatively, use `-append` to add to an existing table.

To start the browser:

```
$ setenv CLASSPATH_PREFIX "/etc/hbase/conf/"
$ sh target/appassembler/bin/WarcBrowser -port 9191 -server http://myhost:9191/
```

Navigate to `http://myhost:9191/` to browse the archive.

Extracting the Webgraph
-----------------------

First, use a MapReduce tool to extract all URLs from the ARC data:

```
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar \
    org.warcbase.analysis.demo.MapReduceArcDemo \
    -input inputDir -output outputDir
```

Next, build an FST for the URLs (using Lucene's FST package). The two relevant classes in Warcbase are:

+ `UriMappingBuilder`: takes a list of URLs as input and builds the FST mapping.
+ `UriMapping`: loads the mapping file generated by the builder and provides an API for accessing the FST.

To build the Lucene FST:

```
$ sh target/appassembler/bin/UriMappingBuilder inputDirectory outputFile
```

This command reads all files under `inputDirectory` as input, builds an FST, and writes the data file to `outputFile`.

The `UriMapping` class provides a simple command-line interface:

```
# Look up a URL and fetch its integer id
$ sh target/appassembler/bin/UriMapping -data fst.dat -getId http://www.foo.com/

# Look up an id and fetch its URL
$ sh target/appassembler/bin/UriMapping -data fst.dat -getUrl 42

# Fetch all URLs with the given prefix
$ sh target/appassembler/bin/UriMapping -data fst.dat -getPrefix http://www.foo.com/
```

Then we can use the FST mapping data to extract the webgraph, mapping URLs to unique integer ids at the same time. This is accomplished by a Hadoop program:

```
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar \
    org.warcbase.data.ExtractLinks \
    -input inputDir -output outputDir -uriMapping fstData -numReducers 1
```

Finally, instead of extracting links between individual URLs, we can extract site-level links by merging all URLs with a common prefix into a supernode. The link count between two supernodes is the total number of links between their constituent URLs. This program requires the following inputs: a prefix file providing the URL prefix for each supernode, the FST mapping file that maps URLs to unique integer ids, and a directory containing the URL-level links extracted in the previous step. To run the program:

```
$ sh target/appassembler/bin/ExtractSiteLinks -prefixfile prefix.data -fstfile fst.data -linkdir extract-links-data -output sitelinks.data
```
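The layout of the prefix file is not specified above. As a purely hypothetical illustration (the delimiter and column order here are assumptions, not the tool's documented format), it could map a supernode id to its URL prefix, one entry per line:

```
# Hypothetical prefix file (format assumed, not documented):
# supernode id, URL prefix
1,http://www.foo.com/
2,http://www.bar.com/
```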
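As an appendix: readers curious about what `UriMappingBuilder` and `UriMapping` do internally can experiment with Lucene's FST package directly. Below is a minimal, self-contained sketch (assuming a Lucene 4.x-era API; this illustrates the technique, not Warcbase's actual implementation) that builds a URL-to-id FST in memory and performs lookups in both directions:

```
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstSketch {
  public static void main(String[] args) throws Exception {
    // FST inputs must be added in sorted (byte) order; here each URL's id is
    // simply its position in the sorted list.
    String[] urls = {"http://www.bar.com/", "http://www.foo.com/", "http://www.foo.com/page1"};

    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRef scratch = new IntsRef();
    for (int id = 0; id < urls.length; id++) {
      builder.add(Util.toIntsRef(new BytesRef(urls[id]), scratch), (long) id);
    }
    FST<Long> fst = builder.finish();

    // URL -> id lookup.
    System.out.println(Util.get(fst, new BytesRef("http://www.foo.com/")));  // 1

    // id -> URL lookup; possible because the outputs are monotonically increasing.
    IntsRef path = Util.getByOutput(fst, 1);
    StringBuilder url = new StringBuilder();
    for (int i = 0; i < path.length; i++) {
      url.append((char) path.ints[path.offset + i]);  // BYTE1 input type: one byte per arc
    }
    System.out.println(url);  // http://www.foo.com/
  }
}
```

Because an FST shares common prefixes and suffixes across entries, the full URL-to-id mapping stays compact enough to hold in memory even for large crawls.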