Warcbase
Warcbase is an open-source platform for managing web archives built on HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing.
Getting Started
After checking out the repo, build Warcbase:
mvn clean package appassembler:assemble
Ingesting WARC files:
$ setenv CLASSPATH_PREFIX "/etc/hbase/conf/"
$ sh target/appassembler/bin/IngestWarcFiles -dir /path/to/warc/ -name archive_name -create
Command-line options:
+ Use the -dir option to specify the directory containing the WARC files.
+ Use the -name option to specify the name of the archive (this becomes the HBase table name).
+ Use the -create option to create a new table (an existing table with the same name is dropped first). Alternatively, use -append to add to an existing table.
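The options above can be combined as in the following sketch. The `ingest_cmd` function is a hypothetical helper, not part of Warcbase: it only prints the command line so it can be inspected before being run. The `export` line is the bash/sh counterpart of the csh `setenv` command shown earlier, for readers who are not using csh.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Warcbase): prints the IngestWarcFiles
# command line for a directory, an archive name, and a mode.
# The mode is either "create" (new table) or "append" (existing table).
ingest_cmd() {
  local dir="$1" name="$2" mode="${3:-create}"
  case "$mode" in
    create|append) ;;
    *) echo "mode must be 'create' or 'append'" >&2; return 1 ;;
  esac
  printf 'sh target/appassembler/bin/IngestWarcFiles -dir %s -name %s -%s\n' \
    "$dir" "$name" "$mode"
}

# bash/sh equivalent of the csh `setenv` line
export CLASSPATH_PREFIX="/etc/hbase/conf/"

# Print the command for a fresh ingest, then for adding to an existing table
ingest_cmd /path/to/warc/ archive_name create
ingest_cmd /path/to/warc/ archive_name append
```

Printing the command instead of executing it keeps the sketch safe to try, since -create drops an existing table of the same name.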
Starting the browser:
$ setenv CLASSPATH_PREFIX "/etc/hbase/conf/"
$ sh target/appassembler/bin/WarcBrowser -port 9191 -server http://myhost:9191/
Navigate to http://myhost:9191/ to browse the archive.
Building an FST for URLs
Two main classes implement the Lucene FST mapping:
+ UriMappingBuilder: takes a list of URLs as input and builds the FST mapping.
+ UriMapping: loads the mapping file generated by UriMappingBuilder and provides an API for accessing the FST.
To build the Lucene FST:
$ sh target/appassembler/bin/UriMappingBuilder inputDirectory outputFile
This command reads all files under inputDirectory as input, builds the FST mapping, and stores it in outputFile.