Mirror of the warcbase library, used to build and locally publish the artifact.
Recent Commits
Commit | Author | Details | Committed
---|---|---|---
b59e3f21afb0 | lintool | Minor tweak to README. | Oct 1 2016
ec426771d44e | lintool | Fixed Issue #251: java.lang.NullPointerException on Collection | Sep 30 2016
abc6d5bef700 | lintool | Fix fo issue #244: NPE from link extraction. | Sep 30 2016
a83eed067bb5 | Jimmy Lin/GitHub | Merge pull request #253 from ukwa/master | Sep 30 2016
8a34bd047020 | lintool | Fixed issue #244: java.util.zip.ZipException: invalid distance code | Sep 30 2016
ae517ff3096e | lintool | Better error trapping for issue #244: java.util.zip.ZipException: invalid… | Sep 30 2016
a190ceace8bf | Andrew Jackson | Also allow XHTML through, as per #252. | Sep 29 2016
5938479cfab4 | ianmilligan1 | minor change to run tests again | Sep 23 2016
205e9c179eba | ianmilligan1 | added https to better play with github pages and others | Aug 5 2016
35e440d9a761 | Ian Milligan/GitHub | Merge pull request #242 from yb1/checksum | Aug 2 2016
7f9de3dbcbf8 | Youngbin Kim | Multiple partitions | Aug 2 2016
fe306339e3af | Ian Milligan/GitHub | Merge pull request #241 from yb1/checksum | Jul 31 2016
71069610d8b4 | Youngbin Kim | Changed output type to rdd | Jul 31 2016
ab72ae4ce231 | lintool | Add UDF for computing MD5 checksum. Issue #211 | Jul 28 2016
3b4ebe297a21 | Youngbin Kim | checksum | Jul 27 2016
README.md
Warcbase [![Build Status](https://travis-ci.org/lintool/warcbase.svg?branch=master)](https://travis-ci.org/lintool/warcbase)
Warcbase is an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing via Spark.
There are two main ways of using Warcbase:
+ The first and most common is to analyze web archives using Spark: these functionalities are contained in the warcbase-core module.
+ The second is to take advantage of HBase to provide random access as well as analytics capabilities. Random access allows Warcbase to provide temporal browsing of archived content (i.e., "wayback" functionality): these functionalities are contained in the warcbase-hbase module.
You can use Warcbase without HBase, and since HBase requires more extensive setup, we recommend that if you're just starting out, you play with the Spark analytics and don't worry about HBase.
Other helpful links:
+ Detailed documentation is available here.
+ Supporting files can be found in the warcbase-resources repository.
Getting Started
Clone the repo:
$ git clone https://github.com/lintool/warcbase.git
You can then build Warcbase. If you are just interested in the analytics function, you can run the following:
$ mvn clean package -pl warcbase-core
For the impatient, to skip tests:
$ mvn clean package -pl warcbase-core -DskipTests
If you are interested in the HBase functionality as well, you can build everything using:
$ mvn clean package
Warcbase is built against CDH 5.7.1:
+ Hadoop version: 2.6.0-cdh5.7.1
+ Spark version: 1.6.0-cdh5.7.1
+ HBase version: 1.2.0-cdh5.7.1
The Hadoop ecosystem is evolving rapidly, so there may be incompatibilities with other versions.
Spark Quickstart
For the impatient, let's do a simple analysis with Spark. Within the repo there's already a sample ARC file stored at warcbase-core/src/test/resources/arc/example.arc.gz. Our supporting resources repository also has larger ARC and WARC files as real-world examples.
If you need to install Spark, we have a walkthrough here. This page also has instructions on how to install and run Spark Notebook, an interactive web-based editor.
Once you've got Spark installed, go ahead and fire up the Spark shell:
$ spark-shell --jars warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar
Here's a simple script that extracts and counts the top-level domains (i.e., number of pages for each top-level domain) in the sample ARC data:
```scala
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
```
Tip: By default, commands in the Spark shell must fit on one line. To run multi-line commands, type :paste in the Spark shell; you can then copy and paste the script above directly. Press Ctrl-D to finish the command.
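To build intuition for what the ExtractDomain step does, here is a simplified stand-in written in plain Scala. This is an illustrative sketch only, not warcbase's actual implementation: the `extractDomain` helper below is hypothetical, and it simply pulls the host out of a URL string.

```scala
import java.net.URL

// Simplified stand-in for warcbase's ExtractDomain UDF (illustrative only):
// extract the host from a URL string, returning "" if the URL is malformed.
def extractDomain(url: String): String =
  try new URL(url).getHost catch { case _: Exception => "" }

// Counting domains over a plain Scala collection, mirroring the
// map-then-countItems pattern in the Spark script above.
val urls = Seq(
  "http://www.example.com/page",
  "http://www.example.com/other",
  "https://archive.org/about",
  "not a url"
)
val counts = urls.map(extractDomain).groupBy(identity).map { case (d, us) => (d, us.size) }
```

In the real pipeline the same map-and-count happens over an RDD of archive records rather than a local collection.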
Want to learn more? Check out our detailed documentation.
Visualizations
The results of analyses using Warcbase can serve as input to visualizations that help scholars interactively explore the data. Examples include:
+ Basic crawl statistics from the Canadian Political Parties and Political Interest Groups collection.
+ Interactive graph visualization using Gephi.
+ Named entity visualization for exploring relative frequencies of people, places, and locations.
+ Shine interface for faceted full-text search.
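Visualization tools generally consume flat files, so the per-domain counts computed in the Quickstart can be exported as CSV. The `writeCsv` helper below is a minimal sketch of our own (not part of warcbase's API) for writing (domain, count) pairs to disk:

```scala
import java.io.PrintWriter

// Hypothetical helper: write (label, count) pairs as CSV rows,
// suitable as input to a visualization tool such as Gephi.
def writeCsv(counts: Seq[(String, Int)], path: String): Unit = {
  val pw = new PrintWriter(path)
  try counts.foreach { case (d, n) => pw.println(s"$d,$n") }
  finally pw.close()
}
```

For example, after running the Quickstart script, `writeCsv(r, "domains.csv")` would dump the top-ten domain counts to a file you can load into a spreadsheet or graphing tool.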
Next Steps
+ Ingesting content into HBase: loading ARC and WARC data into HBase
+ Warcbase/Wayback integration: guide to provide temporal browsing capabilities
+ Warcbase Java tools: building the URL mapping, extracting the webgraph
License
Licensed under the Apache License, Version 2.0.
Acknowledgments
This work is supported in part by the U.S. National Science Foundation, the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Ontario Ministry of Research and Innovation's Early Researcher Award program, and the Mellon Foundation (via Columbia University). Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.