diff --git a/README.md b/README.md
index dfe7307..5a3bfd2 100644
--- a/README.md
+++ b/README.md
@@ -1,120 +1,119 @@
Warcbase
========
Warcbase is an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing via Spark.
There are two main ways of using Warcbase:
-+ The first and most common is to analyze web archives using Spark (the preferred approach) or Pig (which is in the process of being deprecated).
++ The first and most common is to analyze web archives using [Spark](http://spark.apache.org/).
+ The second is to take advantage of HBase to provide random access as well as analytics capabilities. Random access allows Warcbase to provide temporal browsing of archived content (i.e., "wayback" functionality).
You can use Warcbase without HBase. Since HBase requires more extensive setup, we recommend that if you're just starting out, you play with the Spark analytics and don't worry about HBase.
Warcbase is built against CDH 5.4.1:
+ Hadoop version: 2.6.0-cdh5.4.1
+ HBase version: 1.0.0-cdh5.4.1
-+ Pig version: 0.12.0-cdh5.4.1
+ Spark version: 1.3.0-cdh5.4.1
The Hadoop ecosystem is evolving rapidly, so there may be incompatibilities with other versions.
Detailed documentation is available in [this repository's wiki](https://github.com/lintool/warcbase/wiki).
Getting Started
---------------
Clone the repo:
```
$ git clone http://github.com/lintool/warcbase.git
```
You can then build Warcbase:
```
$ mvn clean package appassembler:assemble
```
For the impatient, to skip tests:
```
$ mvn clean package appassembler:assemble -DskipTests
```
To create Eclipse project files:
```
$ mvn eclipse:clean
$ mvn eclipse:eclipse
```
You can then import the project into Eclipse.
Spark Quickstart
----------------
For the impatient, let's do a simple analysis with Spark. Within the repo there's already a sample ARC file stored at `src/test/resources/arc/example.arc.gz`.
If you need to install Spark, [we have a walkthrough here for installation on OS X](https://github.com/lintool/warcbase/wiki/Installing-and-Running-Spark-under-OS-X). This page also has instructions on how to get Spark Notebook, an interactive web-based editor, running.
Once you've got Spark installed, you can go ahead and fire up the Spark shell:
```
$ spark-shell --jars target/warcbase-0.1.0-SNAPSHOT-fatjar.jar
```
Here's a simple script that extracts and counts the top-level domains (i.e., number of pages for each top-level domain) in the sample ARC data:
```
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
val r = RecordLoader.loadArc("src/test/resources/arc/example.arc.gz", sc)
.keepValidPages()
.map(r => ExtractTopLevelDomain(r.getUrl))
.countItems()
.take(10)
```
**Tip:** By default, commands in the Spark shell must fit on one line. To run multi-line commands, type `:paste` in the Spark shell; you can then copy and paste the script above directly. Press Ctrl-D to finish the command.
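If you want to keep the full tally rather than just the top ten results, the same pipeline can write its output to disk. This is a sketch assuming the same matchbox imports as in the snippet above; `saveAsTextFile` is a standard Spark RDD method, and the output directory name is just an illustrative placeholder.

```scala
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Same pipeline as above, but instead of taking the top ten entries,
// save the complete (domain, count) tally to a directory of text files.
val counts = RecordLoader.loadArc("src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractTopLevelDomain(r.getUrl))
  .countItems()

// Writes part files under ./tld-counts (path is a placeholder).
counts.saveAsTextFile("tld-counts")
```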
Want to learn more? Check out [analyzing web archives with Spark](https://github.com/lintool/warcbase/wiki/Analyzing-Web-Archives-with-Spark).
What About Pig?
---------------
Warcbase was originally conceived with Pig for analytics, but we have transitioned over to Spark as the language of choice for scholarly interactions with web archive data. Spark has several advantages, including a cleaner interface, easier-to-write user-defined functions (UDFs), and integration with different "notebook" frontends.
Visualizations
--------------
The results of analyses using Warcbase can serve as input to visualizations that help scholars interactively explore the data. Examples include:
+ [Basic crawl statistics](http://lintool.github.io/warcbase/vis/crawl-sites/index.html) from the Canadian Political Parties and Political Interest Groups collection.
+ [Interactive graph visualization](https://github.com/lintool/warcbase/wiki/Gephi:-Converting-Site-Link-Structure-into-Dynamic-Visualization) using Gephi.
+ [Shine interface](http://webarchives.ca/) for faceted full-text search.
Next Steps
----------
+ [Ingesting content into HBase](https://github.com/lintool/warcbase/wiki/Ingesting-Content-into-HBase): loading ARC and WARC data into HBase
+ [Warcbase/Wayback integration](https://github.com/lintool/warcbase/wiki/Warcbase-Wayback-Integration): guide to provide temporal browsing capabilities
+ [Warcbase Java tools](https://github.com/lintool/warcbase/wiki/Warcbase-Java-Tools): building the URL mapping, extracting the webgraph
License
-------
Licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).
Acknowledgments
---------------
This work is supported in part by the National Science Foundation and by the Mellon Foundation (via Columbia University). Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
diff --git a/pom.xml b/pom.xml
index 49cda9d..47fb08a 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1,546 +1,517 @@
4.0.0
org.warcbase
warcbase
jar
0.1.0-SNAPSHOT
Warcbase
An open-source platform for managing web archives built on Hadoop and HBase
http://warcbase.org/
The Apache Software License, Version 2.0
http://www.apache.org/licenses/LICENSE-2.0.txt
repo
scm:git:git@github.com:lintool/warcbase.git
scm:git:git@github.com:lintool/warcbase.git
git@github.com:lintool/warcbase.git
lintool
Jimmy Lin
jimmylin@umd.edu
milad621
Milad Gholami
mgholami@cs.umd.edu
jeffyRao
Jinfeng Rao
jinfeng@cs.umd.edu
org.sonatype.oss
oss-parent
7
UTF-8
UTF-8
8.1.12.v20130726
2.6.0-cdh5.4.1
1.0.0-cdh5.4.1
3.4.5-cdh5.4.1
- 0.12.0-cdh5.4.1
1.3.0-cdh5.4.1
2.10.4
maven-clean-plugin
2.6.1
src/main/solr/lib
false
org.apache.maven.plugins
maven-compiler-plugin
3.2
1.7
1.7
org.apache.maven.plugins
maven-shade-plugin
2.3
package
shade
META-INF/services/org.apache.lucene.codecs.Codec
*:*
META-INF/*.SF
META-INF/*.DSA
META-INF/*.RSA
true
fatjar
org.apache.hadoop:*
org.apache.maven.plugins
maven-dependency-plugin
2.4
copy
package
copy-dependencies
src/main/solr/lib
org.codehaus.mojo
appassembler-maven-plugin
1.9
-Xms512M -Xmx24576M
org.warcbase.WarcbaseAdmin
WarcbaseAdmin
org.warcbase.data.UrlMappingBuilder
UrlMappingBuilder
org.warcbase.data.UrlMapping
UrlMapping
org.warcbase.data.ExtractLinks
ExtractLinks
org.warcbase.data.ExtractSiteLinks
ExtractSiteLinks
org.warcbase.ingest.IngestFiles
IngestFiles
org.warcbase.ingest.SearchForUrl
SearchForUrl
org.warcbase.browser.WarcBrowser
WarcBrowser
org.warcbase.analysis.DetectDuplicates
DetectDuplicates
org.warcbase.browser.SeleniumBrowser
SeleniumBrowser
org.scala-tools
maven-scala-plugin
2.15.2
process-resources
add-source
compile
scala-test-compile
process-test-resources
testCompile
${scala.version}
true
-target:jvm-1.7
-g:vars
-deprecation
-dependencyfile
${project.build.directory}/.scala_dependencies
maven
http://repo.maven.apache.org/maven2/
cloudera
https://repository.cloudera.com/artifactory/cloudera-repos/
internetarchive
Internet Archive Maven Repository
http://builds.archive.org:8080/maven2
junit
junit
4.12
test
org.scalatest
scalatest_2.10
2.2.4
test
commons-codec
commons-codec
1.8
commons-io
commons-io
2.4
org.jsoup
jsoup
1.7.3
com.google.guava
guava
14.0.1
tl.lin
lintools-datatypes
1.0.0
org.apache.hbase
hbase-client
${hbase.version}
org.apache.hadoop hadoop-core
org.apache.hbase
hbase-server
${hbase.version}
org.apache.hadoop hadoop-core
org.mortbay.jetty servlet-api-2.5
javax.servlet servlet-api
asm asm
org.apache.hadoop
hadoop-client
${hadoop.version}
javax.servlet servlet-api
org.apache.zookeeper
zookeeper
${zookeeper.version}
-
- org.apache.pig
- pig
- ${pig.version}
-
- org.mortbay.jetty servlet-api-2.5
- javax.servlet servlet-api
-
-
-
- org.apache.pig
- pigunit
- ${pig.version}
-
- commons-lang commons-lang
- commons-logging commons-logging
-
-
-
org.netpreserve.openwayback
openwayback-core
2.0.0.BETA.2
org.apache.hadoop hadoop-core
ch.qos.logback logback-classic
org.netpreserve.openwayback openwayback-cdx-server
org.netpreserve.openwayback openwayback-access-control-core
it.unimi.dsi dsiutils
fastutil fastutil
org.netpreserve.commons
webarchive-commons
1.1.4
org.apache.hadoop hadoop-core
commons-lang commons-lang
fastutil fastutil
it.unimi.dsi
dsiutils
2.2.0
ch.qos.logback logback-classic
commons-lang commons-lang
it.unimi.dsi
fastutil
6.5.15
commons-lang commons-lang
org.eclipse.jetty
jetty-server
${jettyVersion}
org.eclipse.jetty
jetty-webapp
${jettyVersion}
true
org.slf4j
slf4j-log4j12
1.6.4
org.apache.commons
commons-lang3
3.0
commons-cli
commons-cli
1.2
net.sf.opencsv
opencsv
2.3
org.apache.tika
tika-core
1.9
org.apache.tika
tika-parsers
1.9
org.antlr
antlr
3.5.2
-
-
-
org.seleniumhq.selenium
selenium-java
2.42.2
org.seleniumhq.selenium selenium-htmlunit-driver
org.seleniumhq.selenium selenium-ie-driver
org.webbitserver webbit
org.scala-lang
scala-library
2.10.4
org.apache.spark
spark-core_2.10
${spark.version}
com.typesafe config
org.xerial.snappy snappy-java
com.fasterxml.jackson.core
jackson-core
2.6.3
com.typesafe
config
1.2.1
org.xerial.snappy
snappy-java
1.0.5
edu.stanford.nlp
stanford-corenlp
3.4.1
com.syncthemall
boilerpipe
1.2.2
xerces
xercesImpl
2.11.0
org.apache.lucene
lucene-core
4.7.2
org.apache.solr
solr-core
4.7.2
slf4j-api org.slf4j
org.apache.hadoop hadoop-annotations
org.apache.hadoop hadoop-common
org.apache.hadoop hadoop-hdfs
com.typesafe config
uk.bl.wa.discovery
warc-hadoop-indexer
2.2.0-BETA-5
asm asm
com.typesafe config
diff --git a/src/main/java/org/warcbase/pig/ArcLoader.java b/src/main/java/org/warcbase/pig/ArcLoader.java
deleted file mode 100644
index 61a55e4..0000000
--- a/src/main/java/org/warcbase/pig/ArcLoader.java
+++ /dev/null
@@ -1,121 +0,0 @@
-package org.warcbase.pig;
-
-import java.io.IOException;
-import java.util.List;
-
-import org.apache.hadoop.io.LongWritable;
-import org.apache.hadoop.mapreduce.Job;
-import org.apache.hadoop.mapreduce.RecordReader;
-import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
-import org.apache.log4j.Logger;
-import org.apache.pig.Expression;
-import org.apache.pig.FileInputLoadFunc;
-import org.apache.pig.LoadMetadata;
-import org.apache.pig.PigException;
-import org.apache.pig.ResourceSchema;
-import org.apache.pig.ResourceStatistics;
-import org.apache.pig.backend.executionengine.ExecException;
-import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
-import org.apache.pig.data.DataByteArray;
-import org.apache.pig.data.DataType;
-import org.apache.pig.data.Tuple;
-import org.apache.pig.data.TupleFactory;
-import org.archive.io.arc.ARCRecordMetaData;
-import org.warcbase.data.ArcRecordUtils;
-import org.warcbase.io.ArcRecordWritable;
-import org.warcbase.mapreduce.WacArcInputFormat;
-
-import com.google.common.collect.Lists;
-
-public class ArcLoader extends FileInputLoadFunc implements LoadMetadata {
- private static final Logger LOG = Logger.getLogger(ArcLoader.class);
-
- private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance();
-
- private RecordReader in;
-
- public ArcLoader() {
- }
-
- @Override
- public WacArcInputFormat getInputFormat() throws IOException {
- return new WacArcInputFormat();
- }
-
- @Override
- public Tuple getNext() throws IOException {
- try {
- if ( !in.nextKeyValue() ) {
- return null;
- }
-
- ArcRecordWritable r = in.getCurrentValue();
- ARCRecordMetaData meta = r.getRecord().getMetaData();
-
- List protoTuple = Lists.newArrayList();
- protoTuple.add(meta.getUrl());
- protoTuple.add(meta.getDate()); // These are the standard 14-digit dates.
- protoTuple.add(meta.getMimetype());
-
- try {
- protoTuple.add(new DataByteArray(ArcRecordUtils.getBodyContent(r.getRecord())));
- } catch (OutOfMemoryError e) {
- // When we get a corrupt record, this will happen...
- // Try to recover and move on...
- LOG.error("Encountered OutOfMemoryError ingesting " + meta.getUrl());
- LOG.error("Attempting to continue...");
- }
-
- return TUPLE_FACTORY.newTupleNoCopy(protoTuple);
- } catch (InterruptedException e) {
- int errCode = 6018;
- String errMsg = "Error while reading input";
- throw new ExecException(errMsg, errCode, PigException.REMOTE_ENVIRONMENT, e);
- }
- }
-
- @SuppressWarnings({ "unchecked", "rawtypes" })
- @Override
- public void prepareToRead(RecordReader reader, PigSplit split) {
- in = reader;
- }
-
- @Override
- public void setLocation(String location, Job job) throws IOException {
- FileInputFormat.setInputPaths(job, location);
- }
-
- @Override
- public String[] getPartitionKeys(String location, Job job) throws IOException {
- return null;
- }
-
- @Override
- public ResourceSchema getSchema(String location, Job job) throws IOException {
- // Schema is (url:chararray, date:chararray, mime:chararray, content:bytearray)
- ResourceSchema schema = new ResourceSchema();
-
- ResourceSchema.ResourceFieldSchema[] fields = new ResourceSchema.ResourceFieldSchema[4];
- fields[0] = new ResourceSchema.ResourceFieldSchema();
- fields[0].setName("url").setType(DataType.CHARARRAY);
- fields[1] = new ResourceSchema.ResourceFieldSchema();
- fields[1].setName("date").setType(DataType.CHARARRAY);
- fields[2] = new ResourceSchema.ResourceFieldSchema();
- fields[2].setName("mime").setType(DataType.CHARARRAY);
- fields[3] = new ResourceSchema.ResourceFieldSchema();
- fields[3].setName("content").setType(DataType.BYTEARRAY);
-
- schema.setFields(fields);
-
- return schema;
- }
-
- @Override
- public ResourceStatistics getStatistics(String location, Job job) throws IOException {
- return null;
- }
-
- @Override
- public void setPartitionFilter(Expression partitionFilter) throws IOException {
- }
-}
diff --git a/src/main/java/org/warcbase/pig/WarcLoader.java b/src/main/java/org/warcbase/pig/WarcLoader.java
deleted file mode 100644
index 298a75e..0000000
--- a/src/main/java/org/warcbase/pig/WarcLoader.java
+++ /dev/null
@@ -1,149 +0,0 @@
-package org.warcbase.pig;
-
-import com.google.common.collect.Lists;
-import org.apache.hadoop.io.LongWritable;
-import org.apache.hadoop.mapreduce.Job;
-import org.apache.hadoop.mapreduce.RecordReader;
-import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
-import org.apache.log4j.Logger;
-import org.apache.pig.*;
-import org.apache.pig.backend.executionengine.ExecException;
-import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
-import org.apache.pig.data.DataByteArray;
-import org.apache.pig.data.DataType;
-import org.apache.pig.data.Tuple;
-import org.apache.pig.data.TupleFactory;
-import org.archive.io.ArchiveRecordHeader;
-import org.archive.io.warc.WARCRecord;
-import org.archive.util.ArchiveUtils;
-import org.warcbase.data.WarcRecordUtils;
-import org.warcbase.io.WarcRecordWritable;
-import org.warcbase.mapreduce.WacWarcInputFormat;
-
-import java.io.IOException;
-import java.text.DateFormat;
-import java.text.ParseException;
-import java.text.SimpleDateFormat;
-import java.util.Date;
-import java.util.List;
-
-public class WarcLoader extends FileInputLoadFunc implements LoadMetadata {
- private static final Logger LOG = Logger.getLogger(WarcLoader.class);
-
- private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance();
- private static final DateFormat ISO8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssX");
-
- private RecordReader in;
-
- public WarcLoader() {
- }
-
- @Override
- public WacWarcInputFormat getInputFormat() throws IOException {
- return new WacWarcInputFormat();
- }
-
- @Override
- public Tuple getNext() throws IOException {
- try {
- WARCRecord record;
- ArchiveRecordHeader header;
-
- // We're going to continue reading WARC records from the underlying input format
- // until we reach a "response" record.
- while (true) {
- if (!in.nextKeyValue()) {
- return null;
- }
-
- record = in.getCurrentValue().getRecord();
- header = record.getHeader();
-
- if (header.getHeaderValue("WARC-Type").equals("response")) {
- break;
- }
- }
-
- String url = header.getUrl();
- byte[] content = null;
- String type = null;
-
- try {
- content = WarcRecordUtils.getContent(record);
- type = WarcRecordUtils.getWarcResponseMimeType(content);
- } catch (OutOfMemoryError e) {
- // When we get a corrupt record, this will happen...
- // Try to recover and move on...
- LOG.error("Encountered OutOfMemoryError ingesting " + url);
- LOG.error("Attempting to continue...");
- }
-
- Date d = null;
- String date = null;
- try {
- d = ISO8601.parse(header.getDate());
- date = ArchiveUtils.get14DigitDate(d);
- } catch (ParseException e) {
- LOG.error("Encountered ParseException ingesting " + url);
- }
-
- List protoTuple = Lists.newArrayList();
- protoTuple.add(url);
- protoTuple.add(date);
- protoTuple.add(type);
- protoTuple.add(new DataByteArray(content));
-
- return TUPLE_FACTORY.newTupleNoCopy(protoTuple);
- } catch (InterruptedException e) {
- int errCode = 6018;
- String errMsg = "Error while reading input";
- throw new ExecException(errMsg, errCode, PigException.REMOTE_ENVIRONMENT, e);
- }
- }
-
- @SuppressWarnings({ "unchecked", "rawtypes" })
- @Override
- public void prepareToRead(RecordReader reader, PigSplit split) {
- in = reader;
- }
-
- @Override
- public void setLocation(String location, Job job) throws IOException {
- FileInputFormat.setInputPaths(job, location);
- }
-
-
- @Override
- public String[] getPartitionKeys(String location, Job job) throws IOException {
- return null;
- }
-
- @Override
- public ResourceSchema getSchema(String location, Job job) throws IOException {
- // Schema is (url:chararray, date:chararray, mime:chararray, content:bytearray)
- ResourceSchema schema = new ResourceSchema();
-
- ResourceSchema.ResourceFieldSchema[] fields = new ResourceSchema.ResourceFieldSchema[4];
- fields[0] = new ResourceSchema.ResourceFieldSchema();
- fields[0].setName("url").setType(DataType.CHARARRAY);
- fields[1] = new ResourceSchema.ResourceFieldSchema();
- fields[1].setName("date").setType(DataType.CHARARRAY);
- fields[2] = new ResourceSchema.ResourceFieldSchema();
- fields[2].setName("mime").setType(DataType.CHARARRAY);
- fields[3] = new ResourceSchema.ResourceFieldSchema();
- fields[3].setName("content").setType(DataType.BYTEARRAY);
-
- schema.setFields(fields);
-
- return schema;
- }
-
- @Override
- public ResourceStatistics getStatistics(String location, Job job) throws IOException {
- return null;
- }
-
- @Override
- public void setPartitionFilter(Expression partitionFilter) throws IOException {
- }
-}
diff --git a/src/main/java/org/warcbase/pig/piggybank/DetectLanguage.java b/src/main/java/org/warcbase/pig/piggybank/DetectLanguage.java
deleted file mode 100644
index ccddd54..0000000
--- a/src/main/java/org/warcbase/pig/piggybank/DetectLanguage.java
+++ /dev/null
@@ -1,18 +0,0 @@
-package org.warcbase.pig.piggybank;
-
-import org.apache.pig.EvalFunc;
-import org.apache.pig.data.Tuple;
-import org.apache.tika.language.LanguageIdentifier;
-
-import java.io.IOException;
-
-public class DetectLanguage extends EvalFunc {
- @Override
- public String exec(Tuple input) throws IOException {
- if (input == null || input.size() == 0 || input.get(0) == null) {
- return null;
- }
- String text = (String) input.get(0);
- return new LanguageIdentifier(text).getLanguage();
- }
-}
\ No newline at end of file
diff --git a/src/main/java/org/warcbase/pig/piggybank/DetectMimeTypeMagic.java b/src/main/java/org/warcbase/pig/piggybank/DetectMimeTypeMagic.java
deleted file mode 100644
index 6e8aeee..0000000
--- a/src/main/java/org/warcbase/pig/piggybank/DetectMimeTypeMagic.java
+++ /dev/null
@@ -1,38 +0,0 @@
-package org.warcbase.pig.piggybank;
-
-import org.apache.pig.EvalFunc;
-import org.apache.pig.data.Tuple;
-
-import java.io.ByteArrayInputStream;
-import java.io.IOException;
-import java.io.InputStream;
-
-public class DetectMimeTypeMagic extends EvalFunc {
-
-
- @Override
- public String exec(Tuple input) throws IOException {
- String mimeType = null;
-
- if (input == null || input.size() == 0 || input.get(0) == null) {
- return "N/A";
- }
- String magicFile = (String) input.get(0);
- String content = (String) input.get(1);
-
- InputStream is = new ByteArrayInputStream(content.getBytes());
- if (content.isEmpty()) return "EMPTY";
-
- // I'm commenting this out because the jar isn't actually published anywhere...
- // @lintool 2014/08/12
-
- //org.opf_labs.LibmagicJnaWrapper jnaWrapper = new LibmagicJnaWrapper();
- //jnaWrapper.load(magicFile);
-
- //jnaWrapper.load("/usr/local/Cellar/libmagic/5.15/share/misc/magic.mgc"); // Mac OS X with Homebrew
- //jnaWrapper.load("/usr/share/file/magic.mgc"); // CentOS
-
- //mimeType = jnaWrapper.getMimeType(is);
- return mimeType;
- }
-}
diff --git a/src/main/java/org/warcbase/pig/piggybank/DetectMimeTypeTika.java b/src/main/java/org/warcbase/pig/piggybank/DetectMimeTypeTika.java
deleted file mode 100644
index 538002c..0000000
--- a/src/main/java/org/warcbase/pig/piggybank/DetectMimeTypeTika.java
+++ /dev/null
@@ -1,33 +0,0 @@
-package org.warcbase.pig.piggybank;
-
-import org.apache.pig.EvalFunc;
-import org.apache.pig.data.DataByteArray;
-import org.apache.pig.data.Tuple;
-import org.apache.tika.Tika;
-import org.apache.tika.detect.DefaultDetector;
-import org.apache.tika.parser.AutoDetectParser;
-
-import java.io.ByteArrayInputStream;
-import java.io.IOException;
-import java.io.InputStream;
-
-public class DetectMimeTypeTika extends EvalFunc {
-
- @Override
- public String exec(Tuple input) throws IOException {
- String mimeType;
-
- if (input == null || input.size() == 0 || input.get(0) == null) {
- return "N/A";
- }
- DataByteArray content = (DataByteArray) input.get(0);
-
- InputStream is = new ByteArrayInputStream(content.get());
-
- DefaultDetector detector = new DefaultDetector();
- AutoDetectParser parser = new AutoDetectParser(detector);
- mimeType = new Tika(detector, parser).detect(is);
-
- return mimeType;
- }
-}
diff --git a/src/main/java/org/warcbase/pig/piggybank/ExtractBoilerpipeText.java b/src/main/java/org/warcbase/pig/piggybank/ExtractBoilerpipeText.java
deleted file mode 100644
index f3c9b75..0000000
--- a/src/main/java/org/warcbase/pig/piggybank/ExtractBoilerpipeText.java
+++ /dev/null
@@ -1,32 +0,0 @@
-package org.warcbase.pig.piggybank;
-
-import java.io.IOException;
-import org.apache.commons.lang.StringUtils;
-
-import org.apache.pig.EvalFunc;
-import org.apache.pig.data.Tuple;
-import de.l3s.boilerpipe.extractors.DefaultExtractor;
-// Could also use tika.parser.html.BoilerpipeContentHandler, which uses older version of boilerpipe
-
-/**
- * UDF for extracting raw text content from an HTML page, minus "boilerplate"
- * content (using boilerpipe).
- */
-public class ExtractBoilerpipeText extends EvalFunc {
- public String exec(Tuple input) throws IOException {
- if (input == null || input.size() == 0 || input.get(0) == null) {
- return null;
- }
-
- try {
- // Other available extractors: https://boilerpipe.googlecode.com/svn/trunk/boilerpipe-core/javadoc/1.0/de/l3s/boilerpipe/extractors/package-summary.html
- String text = DefaultExtractor.INSTANCE.getText((String) input.get(0)).replaceAll("[\\r\\n]+", " ").trim();
- if (text.isEmpty())
- return null;
- else
- return text;
- } catch (Exception e) {
- throw new IOException("Caught exception processing input row ", e);
- }
- }
-}
diff --git a/src/main/java/org/warcbase/pig/piggybank/ExtractLinks.java b/src/main/java/org/warcbase/pig/piggybank/ExtractLinks.java
deleted file mode 100644
index 4ebb9e1..0000000
--- a/src/main/java/org/warcbase/pig/piggybank/ExtractLinks.java
+++ /dev/null
@@ -1,58 +0,0 @@
-package org.warcbase.pig.piggybank;
-
-import com.google.common.collect.Lists;
-import org.apache.pig.EvalFunc;
-import org.apache.pig.data.BagFactory;
-import org.apache.pig.data.DataBag;
-import org.apache.pig.data.Tuple;
-import org.apache.pig.data.TupleFactory;
-import org.jsoup.Jsoup;
-import org.jsoup.nodes.Document;
-import org.jsoup.nodes.Element;
-import org.jsoup.select.Elements;
-
-import java.io.IOException;
-import java.util.List;
-
-/**
- * UDF for extracting links from a webpage given the HTML content (using Jsoup). Returns a bag of
- * tuples, where each tuple consists of the URL and the anchor text.
- */
-public class ExtractLinks extends EvalFunc {
- private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance();
- private static final BagFactory BAG_FACTORY = BagFactory.getInstance();
-
- public DataBag exec(Tuple input) throws IOException {
- if (input == null || input.size() == 0 || input.get(0) == null) {
- return null;
- }
-
- try {
- String html = (String) input.get(0);
- String base = input.size() > 1 ? (String) input.get(1) : null;
-
- DataBag output = BAG_FACTORY.newDefaultBag();
- Document doc = Jsoup.parse(html);
- Elements links = doc.select("a[href]");
-
- for (Element link : links) {
- if (base != null) {
- link.setBaseUri(base);
- }
- String target = link.attr("abs:href");
- if (target.length() == 0) {
- continue;
- }
-
- // Create each tuple (URL, anchor text)
- List linkTuple = Lists.newArrayList();
- linkTuple.add(target);
- linkTuple.add(link.text());
- output.add(TUPLE_FACTORY.newTupleNoCopy(linkTuple));
- }
- return output;
- } catch (Exception e) {
- throw new IOException("Caught exception processing input row ", e);
- }
- }
-}
\ No newline at end of file
diff --git a/src/main/java/org/warcbase/pig/piggybank/ExtractRawText.java b/src/main/java/org/warcbase/pig/piggybank/ExtractRawText.java
deleted file mode 100644
index 5ef49d3..0000000
--- a/src/main/java/org/warcbase/pig/piggybank/ExtractRawText.java
+++ /dev/null
@@ -1,25 +0,0 @@
-package org.warcbase.pig.piggybank;
-
-import java.io.IOException;
-
-import org.apache.pig.EvalFunc;
-import org.apache.pig.data.Tuple;
-import org.jsoup.Jsoup;
-
-/**
- * UDF for extracting raw text content from an HTML page (using Jsoup).
- */
-public class ExtractRawText extends EvalFunc {
- public String exec(Tuple input) throws IOException {
- if (input == null || input.size() == 0 || input.get(0) == null) {
- return null;
- }
-
- try {
- // Use Jsoup for cleanup.
- return Jsoup.parse((String) input.get(0)).text().replaceAll("[\\r\\n]+", " ");
- } catch (Exception e) {
- throw new IOException("Caught exception processing input row ", e);
- }
- }
-}
\ No newline at end of file
diff --git a/src/main/java/org/warcbase/pig/piggybank/ExtractTextFromPDFs.java b/src/main/java/org/warcbase/pig/piggybank/ExtractTextFromPDFs.java
deleted file mode 100644
index 7f6231a..0000000
--- a/src/main/java/org/warcbase/pig/piggybank/ExtractTextFromPDFs.java
+++ /dev/null
@@ -1,46 +0,0 @@
-package org.warcbase.pig.piggybank;
-
-import java.io.ByteArrayInputStream;
-import java.io.IOException;
-import java.io.InputStream;
-
-import org.apache.pig.EvalFunc;
-import org.apache.pig.data.DataByteArray;
-import org.apache.pig.data.Tuple;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.parser.Parser;
-import org.apache.tika.parser.pdf.PDFParser;
-import org.apache.tika.sax.BodyContentHandler;
-import org.xml.sax.ContentHandler;
-
-public class ExtractTextFromPDFs extends EvalFunc {
- private Parser pdfParser = new PDFParser();
-
- @Override
- public String exec(Tuple input) throws IOException {
- try {
- if (input == null || input.size() == 0 || input.get(0) == null) {
- return "N/A";
- }
-
- DataByteArray dba = (DataByteArray) input.get(0);
- InputStream is = new ByteArrayInputStream(dba.get());
-
- ContentHandler contenthandler = new BodyContentHandler(Integer.MAX_VALUE);
- Metadata metadata = new Metadata();
-
- pdfParser.parse(is, contenthandler, metadata, new ParseContext());
-
- if (is != null) {
- is.close();
- }
-
- return contenthandler.toString();
- } catch (Throwable t) {
- // Basically, catch everything...
- t.printStackTrace();
- return null;
- }
- }
-}
\ No newline at end of file
diff --git a/src/main/java/org/warcbase/pig/piggybank/ExtractTopLevelDomain.java b/src/main/java/org/warcbase/pig/piggybank/ExtractTopLevelDomain.java
deleted file mode 100644
index 23a1dce..0000000
--- a/src/main/java/org/warcbase/pig/piggybank/ExtractTopLevelDomain.java
+++ /dev/null
@@ -1,38 +0,0 @@
-package org.warcbase.pig.piggybank;
-
-import java.io.IOException;
-import java.net.URL;
-
-import org.apache.pig.EvalFunc;
-import org.apache.pig.data.Tuple;
-
-/**
- * UDF for extracting the top-level domain from an URL. Extracts the hostname from the first
- * argument; if it's null, extracts the hostname from the second argument. The second
- * argument is typically a source page, e.g., if the first URL is a relative URL, take the host from
- * the source page.
- */
-public class ExtractTopLevelDomain extends EvalFunc {
- public String exec(Tuple input) throws IOException {
- if (input == null || input.size() == 0 || input.get(0) == null) {
- return null;
- }
-
- String host = null;
- try {
- host = (new URL((String) input.get(0))).getHost();
- } catch (Exception e) {
- // It's okay, just fall through here.
- }
-
- if (host != null || (host == null && input.size() == 0)) {
- return host;
- }
-
- try {
- return (new URL((String) input.get(1))).getHost();
- } catch (Exception e) {
- return null;
- }
- }
-}
\ No newline at end of file
diff --git a/src/main/java/org/warcbase/pig/piggybank/NER3ClassUDF.java b/src/main/java/org/warcbase/pig/piggybank/NER3ClassUDF.java
deleted file mode 100644
index 08a4a38..0000000
--- a/src/main/java/org/warcbase/pig/piggybank/NER3ClassUDF.java
+++ /dev/null
@@ -1,132 +0,0 @@
-/*
- * Copyright 2014 Internet Archive
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you
- * may not use this file except in compliance with the License. You
- * may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
- * implied. See the License for the specific language governing
- * permissions and limitations under the License.
- */
-
-package org.warcbase.pig.piggybank;
-
-import java.io.IOException;
-import org.apache.pig.EvalFunc;
-import org.apache.pig.data.Tuple;
-import java.util.regex.*;
-import java.io.*;
-import java.net.*;
-import org.apache.pig.PigException;
-import org.apache.pig.backend.executionengine.ExecException;
-import org.apache.pig.data.TupleFactory;
-import org.apache.pig.data.DataBag;
-import org.apache.pig.data.DataType;
-import org.apache.pig.data.Tuple;
-import org.apache.pig.builtin.MonitoredUDF;
-import java.util.ArrayList;
-import java.util.List;
-import java.util.Iterator;
-import java.util.Map;
-import java.util.EnumMap;
-import edu.stanford.nlp.ie.AbstractSequenceClassifier;
-import edu.stanford.nlp.ie.crf.*;
-import edu.stanford.nlp.io.IOUtils;
-import edu.stanford.nlp.ling.CoreLabel;
-import edu.stanford.nlp.ling.CoreAnnotations;
-import java.io.IOException;
-import java.io.ObjectInputStream;
-import java.io.ObjectOutputStream;
-import java.lang.Integer;
-import java.util.concurrent.TimeUnit;
-
-/**
- * UDF which reads in a text string, and returns entities identified by the configured Stanford NER classifier
- * @author vinay
- * @author jrwiebe
- */
-
-//@MonitoredUDF(timeUnit = TimeUnit.MILLISECONDS, duration = 120000, stringDefault = "{PERSON=[], ORGANIZATION=[], LOCATION=[]}")
-public class NER3ClassUDF extends EvalFunc {
-
- String serializedClassifier;
- AbstractSequenceClassifier classifier = null;
-
- public NER3ClassUDF(String file) {
- serializedClassifier = file;
- }
-
- public enum NERClassType { PERSON, ORGANIZATION, LOCATION, O }
-
- public String exec(Tuple input) throws IOException {
-
- String emptyString = "{PERSON=[], ORGANIZATION=[], LOCATION=[]}";
- Map> entitiesByType = new EnumMap>(NERClassType.class);
- for (NERClassType t : NERClassType.values()) {
- if(t != NERClassType.O)
- entitiesByType.put(t, new ArrayList<String>());
- }
-
- NERClassType prevEntityType = NERClassType.O;
- String entityBuffer = "";
-
- if(input == null || input.size() == 0) {
- return emptyString;
- }
-
- try {
- String textString = (String)input.get(0);
- if(textString == null) {
- return emptyString;
- }
-
- if(classifier == null) {
- //initialize
- classifier = CRFClassifier.getClassifier(serializedClassifier);
- }
-
- List<List<CoreLabel>> out = classifier.classify(textString);
- for (List<CoreLabel> sentence : out) {
- for (CoreLabel word : sentence) {
- String wordText = word.word();
- String classText = word.get(CoreAnnotations.AnswerAnnotation.class);
- NERClassType currEntityType = NERClassType.valueOf(classText);
- if (prevEntityType != currEntityType) {
- if(prevEntityType != NERClassType.O && !entityBuffer.equals("")) {
- //time to commit
- entitiesByType.get(prevEntityType).add(entityBuffer);
- entityBuffer = "";
- }
- }
- prevEntityType = currEntityType;
- if(currEntityType != NERClassType.O) {
- if(entityBuffer.equals(""))
- entityBuffer = wordText;
- else
- entityBuffer+= " " + wordText;
- }
- }
- //end of sentence
- //apply commit and reset
- if(prevEntityType != NERClassType.O && !entityBuffer.equals("")) {
- entitiesByType.get(prevEntityType).add(entityBuffer);
- entityBuffer = "";
- }
- //reset
- prevEntityType = NERClassType.O;
- entityBuffer = "";
- }
- return entitiesByType.toString();
-
- } catch(Exception e) {
- if(classifier == null)
- throw new IOException("Unable to load classifier ", e);
- return emptyString;
- }
- }
-}
diff --git a/src/main/scala/org/warcbase/spark/matchbox/ExtractTextFromPDFs.scala b/src/main/scala/org/warcbase/spark/matchbox/ExtractTextFromPDFs.scala
index badd233..d6072b3 100644
--- a/src/main/scala/org/warcbase/spark/matchbox/ExtractTextFromPDFs.scala
+++ b/src/main/scala/org/warcbase/spark/matchbox/ExtractTextFromPDFs.scala
@@ -1,32 +1,34 @@
package org.warcbase.spark.matchbox
import java.io.ByteArrayInputStream
-import org.apache.pig.data.DataByteArray
+//import org.apache.pig.data.DataByteArray
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.pdf.PDFParser
import org.apache.tika.sax.BodyContentHandler;
object ExtractTextFromPDFs {
val pdfParser = new PDFParser()
+/*
def apply(dba: DataByteArray): String = {
if (dba.get.isEmpty) "N/A"
else {
try {
val is = new ByteArrayInputStream(dba.get)
val contenthandler = new BodyContentHandler(Integer.MAX_VALUE)
val metadata = new Metadata()
pdfParser.parse(is, contenthandler, metadata, new ParseContext())
is.close()
contenthandler.toString
}
catch {
case t: Throwable =>
t.printStackTrace()
""
}
}
}
+*/
}
\ No newline at end of file
diff --git a/src/test/java/org/warcbase/pig/PigArcLoaderTest.java b/src/test/java/org/warcbase/pig/PigArcLoaderTest.java
deleted file mode 100644
index ec1daf3..0000000
--- a/src/test/java/org/warcbase/pig/PigArcLoaderTest.java
+++ /dev/null
@@ -1,173 +0,0 @@
-package org.warcbase.pig;
-
-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertFalse;
-
-import java.io.File;
-import java.util.Iterator;
-
-import org.apache.commons.io.FileUtils;
-import org.apache.commons.logging.Log;
-import org.apache.commons.logging.LogFactory;
-import org.apache.pig.data.Tuple;
-import org.apache.pig.pigunit.PigTest;
-import org.junit.After;
-import org.junit.Before;
-import org.junit.Test;
-
-import com.google.common.io.Files;
-import com.google.common.io.Resources;
-
-public class PigArcLoaderTest {
- private static final Log LOG = LogFactory.getLog(PigArcLoaderTest.class);
- private File tempDir;
-
- @Test
- public void testArcLoaderCount() throws Exception {
- String arcTestDataFile = Resources.getResource("arc/example.arc.gz").getPath();
-
- String pigFile = Resources.getResource("scripts/TestArcLoaderCount.pig").getPath();
- String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on windows
-
- PigTest test = new PigTest(pigFile, new String[] { "testArcFolder=" + arcTestDataFile,
- "experimentfolder=" + location });
-
- Iterator<Tuple> parses = test.getAlias("b");
-
- Tuple tuple = parses.next();
- assertEquals(300L, tuple.get(0));
-
- // There should only be one record.
- assertFalse(parses.hasNext());
- }
-
- @Test
- public void testArcCountLinks() throws Exception {
- String arcTestDataFile = Resources.getResource("arc/example.arc.gz").getPath();
-
- String pigFile = Resources.getResource("scripts/TestArcCountLinks.pig").getPath();
- String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on windows
-
- PigTest test = new PigTest(pigFile, new String[] { "testArcFolder=" + arcTestDataFile,
- "experimentfolder=" + location });
-
- Iterator<Tuple> parses = test.getAlias("a");
-
- int cnt = 0;
- while (parses.hasNext()) {
- LOG.info("link and anchor text: " + parses.next());
- cnt++;
- }
- assertEquals(664, cnt);
- }
-
- @Test
- public void testDetectLanguage() throws Exception {
- String arcTestDataFile;
- arcTestDataFile = Resources.getResource("arc/example.arc.gz").getPath();
-
- String pigFile = Resources.getResource("scripts/TestDetectLanguage.pig").getPath();
- String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on windows
-
- PigTest test = new PigTest(pigFile, new String[] { "testArcFolder=" + arcTestDataFile, "experimentfolder=" + location });
-
- Iterator<Tuple> parses = test.getAlias("d");
-
- while (parses.hasNext()) {
- Tuple tuple = parses.next();
- String lang = (String) tuple.get(0);
- switch (lang) {
- case "en": assertEquals(57L, (long) tuple.get(1)); break;
- case "et": assertEquals( 6L, (long) tuple.get(1)); break;
- case "it": assertEquals( 1L, (long) tuple.get(1)); break;
- case "lt": assertEquals(66L, (long) tuple.get(1)); break;
- case "no": assertEquals( 6L, (long) tuple.get(1)); break;
- case "ro": assertEquals( 4L, (long) tuple.get(1)); break;
- }
- System.out.println("language test: " + tuple.getAll());
- }
-
- }
-
- /*
- * The two tests of MIME type detection are dependent on the versions of the corresponding Tika and magiclib libraries
- */
-
- //@Test
- // Commenting out this test case for now since it requires a 3rd party lib to be installed.
- public void testDetectMimeTypeMagic() throws Exception {
- String arcTestDataFile;
- arcTestDataFile = Resources.getResource("arc/example.arc.gz").getPath();
-
- String pigFile = Resources.getResource("scripts/TestDetectMimeTypeMagic.pig").getPath();
- String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on windows ?
-
- PigTest test = new PigTest(pigFile, new String[] { "testArcFolder=" + arcTestDataFile,
- "experimentfolder=" + location });
-
- Iterator<Tuple> ts = test.getAlias("magicMimeBinned");
- while (ts.hasNext()) {
- Tuple t = ts.next(); // t = (mime type, count)
- String mime = (String) t.get(0);
- System.out.println(mime + ": " + t.get(1));
- if (mime != null) {
- switch (mime) {
- case "EMPTY": assertEquals( 7L, (long) t.get(1)); break;
- case "text/html": assertEquals(139L, (long) t.get(1)); break;
- case "text/plain": assertEquals( 80L, (long) t.get(1)); break;
- case "image/gif": assertEquals( 29L, (long) t.get(1)); break;
- case "application/xml": assertEquals( 11L, (long) t.get(1)); break;
- case "application/rss+xml": assertEquals( 2L, (long) t.get(1)); break;
- case "application/xhtml+xml": assertEquals( 1L, (long) t.get(1)); break;
- case "application/octet-stream": assertEquals( 26L, (long) t.get(1)); break;
- case "application/x-shockwave-flash": assertEquals( 8L, (long) t.get(1)); break;
- }
- }
- }
- }
-
- @Test
- public void testDetectMimeTypeTika() throws Exception {
- String arcTestDataFile;
- arcTestDataFile = Resources.getResource("arc/example.arc.gz").getPath();
-
- String pigFile = Resources.getResource("scripts/TestDetectMimeTypeTika.pig").getPath();
- String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on windows ?
-
- PigTest test = new PigTest(pigFile, new String[] { "testArcFolder=" + arcTestDataFile, "experimentfolder=" + location});
-
- Iterator<Tuple> ts = test.getAlias("tikaMimeBinned");
- while (ts.hasNext()) {
- Tuple t = ts.next();
-
- String mime = (String) t.get(0);
- switch (mime) {
- case "image/gif": assertEquals( 29L, (long) t.get(1)); break;
- case "image/png": assertEquals( 8L, (long) t.get(1)); break;
- case "image/jpeg": assertEquals( 18L, (long) t.get(1)); break;
- case "text/html": assertEquals(132L, (long) t.get(1)); break;
- case "text/plain": assertEquals( 86L, (long) t.get(1)); break;
- case "application/xml": assertEquals( 1L, (long) t.get(1)); break;
- case "application/rss+xml": assertEquals( 9L, (long) t.get(1)); break;
- case "applicaiton/xhtml+xml": assertEquals( 1L, (long) t.get(1)); break;
- case "application/octet-stream": assertEquals( 7L, (long) t.get(1)); break;
- case "application/x-shockwave-flash": assertEquals( 8L, (long) t.get(1)); break;
- }
- System.out.println(t.get(0) + ": " + t.get(1));
- }
- }
-
- @Before
- public void setUp() throws Exception {
- // create a random file location
- tempDir = Files.createTempDir();
- LOG.info("Output can be found in " + tempDir.getPath());
- }
-
- @After
- public void tearDown() throws Exception {
- // cleanup
- FileUtils.deleteDirectory(tempDir);
- LOG.info("Removing tmp files in " + tempDir.getPath());
- }
-}
diff --git a/src/test/java/org/warcbase/pig/PigWarcLoaderTest.java b/src/test/java/org/warcbase/pig/PigWarcLoaderTest.java
deleted file mode 100644
index c77a480..0000000
--- a/src/test/java/org/warcbase/pig/PigWarcLoaderTest.java
+++ /dev/null
@@ -1,57 +0,0 @@
-package org.warcbase.pig;
-
-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertFalse;
-
-import java.io.File;
-import java.util.Iterator;
-
-import org.apache.commons.io.FileUtils;
-import org.apache.commons.logging.Log;
-import org.apache.commons.logging.LogFactory;
-import org.apache.pig.data.Tuple;
-import org.apache.pig.pigunit.PigTest;
-import org.junit.After;
-import org.junit.Before;
-import org.junit.Test;
-
-import com.google.common.io.Files;
-import com.google.common.io.Resources;
-
-public class PigWarcLoaderTest {
- private static final Log LOG = LogFactory.getLog(PigWarcLoaderTest.class);
- private File tempDir;
-
- @Test
- public void testWarcLoaderCount() throws Exception {
- String arcTestDataFile = Resources.getResource("warc/example.warc.gz").getPath();
-
- String pigFile = Resources.getResource("scripts/TestWarcLoaderCount.pig").getPath();
- String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on windows
-
- PigTest test = new PigTest(pigFile, new String[] { "testWarcFolder=" + arcTestDataFile,
- "experimentfolder=" + location });
-
- Iterator<Tuple> parses = test.getAlias("b");
-
- Tuple tuple = parses.next();
- assertEquals(299L, tuple.get(0));
-
- // There should only be one record.
- assertFalse(parses.hasNext());
- }
-
- @Before
- public void setUp() throws Exception {
- // create a random file location
- tempDir = Files.createTempDir();
- LOG.info("Output can be found in " + tempDir.getPath());
- }
-
- @After
- public void tearDown() throws Exception {
- // cleanup
- FileUtils.deleteDirectory(tempDir);
- LOG.info("Removing tmp files in " + tempDir.getPath());
- }
-}
diff --git a/src/test/java/org/warcbase/pig/piggybank/ExtractLinksTest.java b/src/test/java/org/warcbase/pig/piggybank/ExtractLinksTest.java
deleted file mode 100644
index e235df1..0000000
--- a/src/test/java/org/warcbase/pig/piggybank/ExtractLinksTest.java
+++ /dev/null
@@ -1,59 +0,0 @@
-package org.warcbase.pig.piggybank;
-
-import static org.junit.Assert.assertEquals;
-
-import java.io.IOException;
-import java.util.Arrays;
-import java.util.Iterator;
-
-import org.apache.pig.data.DataBag;
-import org.apache.pig.data.Tuple;
-import org.apache.pig.data.TupleFactory;
-import org.junit.Test;
-
-public class ExtractLinksTest {
- private TupleFactory tupleFactory = TupleFactory.getInstance();
-
- @Test
- public void test1() throws IOException {
- ExtractLinks udf = new ExtractLinks();
-
- String fragment = "Here is <a href=\"http://www.google.com\"> a search engine</a>.\n" +
- "Here is <a href=\"http://www.twitter.com/\">Twitter</a>.\n";
-
- DataBag bag = udf.exec(tupleFactory.newTuple(fragment));
- assertEquals(2, bag.size());
-
- Tuple tuple = null;
- Iterator<Tuple> iter = bag.iterator();
- tuple = iter.next();
- assertEquals("http://www.google.com", (String) tuple.get(0));
- assertEquals("a search engine", (String) tuple.get(1));
-
- tuple = iter.next();
- assertEquals("http://www.twitter.com/", (String) tuple.get(0));
- assertEquals("Twitter", (String) tuple.get(1));
- }
-
- @Test
- public void test2() throws IOException {
- ExtractLinks udf = new ExtractLinks();
-
- String fragment = "Here is <a href=\"http://www.google.com\"> a search engine</a>.\n" +
- "Here is <a href=\"page.html\">a relative URL</a>.\n";
-
- DataBag bag = udf.exec(tupleFactory.newTuple(Arrays.asList(fragment, "http://www.foobar.org/index.html")));
- assertEquals(2, bag.size());
-
- Tuple tuple = null;
- Iterator<Tuple> iter = bag.iterator();
- tuple = iter.next();
- assertEquals("http://www.google.com", (String) tuple.get(0));
- assertEquals("a search engine", (String) tuple.get(1));
-
- tuple = iter.next();
- assertEquals("http://www.foobar.org/page.html", (String) tuple.get(0));
- assertEquals("a relative URL", (String) tuple.get(1));
- }
-
-}
diff --git a/src/test/java/org/warcbase/pig/piggybank/ExtractTopLevelDomainTest.java b/src/test/java/org/warcbase/pig/piggybank/ExtractTopLevelDomainTest.java
deleted file mode 100644
index b2ce5a8..0000000
--- a/src/test/java/org/warcbase/pig/piggybank/ExtractTopLevelDomainTest.java
+++ /dev/null
@@ -1,44 +0,0 @@
-package org.warcbase.pig.piggybank;
-
-import static org.junit.Assert.assertEquals;
-
-import java.io.IOException;
-import java.util.Arrays;
-
-import org.apache.pig.data.TupleFactory;
-import org.junit.Test;
-
-public class ExtractTopLevelDomainTest {
- private TupleFactory tupleFactory = TupleFactory.getInstance();
-
- private static final String[][] CASES1 = {
- {"http://www.umiacs.umd.edu/~jimmylin/", "www.umiacs.umd.edu"},
- {"https://github.com/lintool", "github.com"},
- {"http://ianmilligan.ca/2015/05/04/iipc-2015-slides-for-warcs-wats-and-wgets-presentation/", "ianmilligan.ca"},
- {"index.html", null},
- };
-
- private static final String[][] CASES2 = {
- {"index.html","http://www.umiacs.umd.edu/~jimmylin/", "www.umiacs.umd.edu"},
- {"index.html","lintool/", null},
- };
-
- @Test
- public void test1() throws IOException {
- ExtractTopLevelDomain udf = new ExtractTopLevelDomain();
-
- for (int i = 0; i < CASES1.length; i++) {
- assertEquals(CASES1[i][1], udf.exec(tupleFactory.newTuple(CASES1[i][0])));
- }
- }
-
- @Test
- public void test2() throws IOException {
- ExtractTopLevelDomain udf = new ExtractTopLevelDomain();
-
- for (int i = 0; i < CASES2.length; i++) {
- assertEquals(CASES2[i][2],
- udf.exec(tupleFactory.newTuple(Arrays.asList(CASES2[i][0], CASES2[i][1]))));
- }
- }
-}
diff --git a/src/test/resources/scripts/TestArcCountLinks.pig b/src/test/resources/scripts/TestArcCountLinks.pig
deleted file mode 100644
index 905f3d8..0000000
--- a/src/test/resources/scripts/TestArcCountLinks.pig
+++ /dev/null
@@ -1,8 +0,0 @@
--- Counts up number of links
-
-DEFINE ArcLoader org.warcbase.pig.ArcLoader();
-
-raw = load '$testArcFolder' using ArcLoader();
-a = foreach raw generate FLATTEN(org.warcbase.pig.piggybank.ExtractLinks((chararray) content));
-
-store a into '$experimentfolder/a';
diff --git a/src/test/resources/scripts/TestArcLoader.pig b/src/test/resources/scripts/TestArcLoader.pig
deleted file mode 100644
index f98a285..0000000
--- a/src/test/resources/scripts/TestArcLoader.pig
+++ /dev/null
@@ -1,14 +0,0 @@
--- Simple word count example to tally up dates when pages are crawled
-
-DEFINE ArcLoader org.warcbase.pig.ArcLoader();
-
-raw = load '$testArcFolder' using ArcLoader();
--- schema is (url:chararray, date:chararray, mime:chararray, content:bytearray);
-
-store raw into '$experimentfolder/raw' using PigStorage();
-
-a = foreach raw generate SUBSTRING(date, 0, 8) as date;
-b = group a by date;
-c = foreach b generate group, COUNT(a);
-
-store c into '$experimentfolder/c' using PigStorage();
\ No newline at end of file
diff --git a/src/test/resources/scripts/TestArcLoaderCount.pig b/src/test/resources/scripts/TestArcLoaderCount.pig
deleted file mode 100644
index c8a8819..0000000
--- a/src/test/resources/scripts/TestArcLoaderCount.pig
+++ /dev/null
@@ -1,9 +0,0 @@
--- Counts up number of total records
-
-DEFINE ArcLoader org.warcbase.pig.ArcLoader();
-
-raw = load '$testArcFolder' using ArcLoader();
-a = group raw all;
-b = foreach a generate COUNT(raw);
-
-store b into '$experimentfolder/counts' using PigStorage();
diff --git a/src/test/resources/scripts/TestDetectLanguage.pig b/src/test/resources/scripts/TestDetectLanguage.pig
deleted file mode 100644
index 9b7acd0..0000000
--- a/src/test/resources/scripts/TestDetectLanguage.pig
+++ /dev/null
@@ -1,16 +0,0 @@
--- Simple language detection example
-
-DEFINE ArcLoader org.warcbase.pig.ArcLoader();
-DEFINE ExtractRawText org.warcbase.pig.piggybank.ExtractRawText();
-DEFINE DetectLanguage org.warcbase.pig.piggybank.DetectLanguage();
-
-raw = load '$testArcFolder' using ArcLoader();
--- schema is (url:chararray, date:chararray, mime:chararray, content:bytearray);
-
-a = filter raw by mime == 'text/html';
-b = foreach a generate url, mime,
- DetectLanguage(ExtractRawText((chararray) content)) as lang;
-c = group b by lang;
-d = foreach c generate group, COUNT(b);
-
-dump d;
diff --git a/src/test/resources/scripts/TestDetectMimeTypeMagic.pig b/src/test/resources/scripts/TestDetectMimeTypeMagic.pig
deleted file mode 100644
index 95bb0af..0000000
--- a/src/test/resources/scripts/TestDetectMimeTypeMagic.pig
+++ /dev/null
@@ -1,37 +0,0 @@
-
--- Combined mime type check and language detection on an arc file
---register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
-
-define ArcLoader org.warcbase.pig.ArcLoader();
-define DetectMimeTypeMagic org.warcbase.pig.piggybank.DetectMimeTypeMagic();
-
--- Load arc file properties: url, date, mime, content
-raw = load '$testArcFolder' using org.warcbase.pig.ArcLoader() as (url: chararray, date:chararray, mime:chararray, content:chararray);
-
--- Detect the mime type of the content using magic lib
--- On CentOS the magic file is located at /usr/share/file/magic.mgc
--- On MacOS X using Homebrew the magic file is located at /usr/local/Cellar/libmagic/5.15/share/misc/magic.mgc
-a = foreach raw generate url,mime, DetectMimeTypeMagic('/usr/local/Cellar/libmagic/5.15/share/misc/magic.mgc', content) as magicMime;
-
-
--- magic lib includes "; " in which we are not interested
-b = foreach a {
- magicMimeSplit = STRSPLIT(magicMime, ';');
- GENERATE url, mime, magicMimeSplit.$0 as magicMime;
-}
-
--- httpMimes = foreach b generate mime;
--- httpMimeGroups = group httpMimes by mime;
--- httpMimeBinned = foreach httpMimeGroups generate group, COUNT(httpMimes);
-
-magicMimes = foreach b generate magicMime;
-magicMimeGroups = group magicMimes by magicMime;
-magicMimeBinned = foreach magicMimeGroups generate group, COUNT(magicMimes);
-
---dump httpMimeBinned;
---dump tikaMimeBinned;
---dump magicMimeBinned;
-
--- store httpMimeBinned into '$experimentfolder/httpMimeBinned';
-store magicMimesBinned into '$experimentfolder/magicMimeBinned';
-
diff --git a/src/test/resources/scripts/TestDetectMimeTypeTika.pig b/src/test/resources/scripts/TestDetectMimeTypeTika.pig
deleted file mode 100644
index 364eeee..0000000
--- a/src/test/resources/scripts/TestDetectMimeTypeTika.pig
+++ /dev/null
@@ -1,19 +0,0 @@
--- Combined mime type check and language detection on an arc file
-
-define ArcLoader org.warcbase.pig.ArcLoader();
-define DetectMimeTypeTika org.warcbase.pig.piggybank.DetectMimeTypeTika();
-
-raw = load '$testArcFolder' using ArcLoader();
--- schema is (url:chararray, date:chararray, mime:chararray, content:bytearray);
-
-- Detect the mime type of the content using Tika
-a = foreach raw generate url,mime, DetectMimeTypeTika(content) as tikaMime;
-
-tikaMimes = foreach a generate tikaMime;
-tikaMimeGroups = group tikaMimes by tikaMime;
-tikaMimeBinned = foreach tikaMimeGroups generate group, COUNT(tikaMimes);
-
-dump tikaMimeBinned;
-
-store tikaMimeBinned into '$experimentfolder/tikaMimeBinned';
-
diff --git a/src/test/resources/scripts/TestWarcLoaderCount.pig b/src/test/resources/scripts/TestWarcLoaderCount.pig
deleted file mode 100644
index d070b1a..0000000
--- a/src/test/resources/scripts/TestWarcLoaderCount.pig
+++ /dev/null
@@ -1,9 +0,0 @@
--- Counts up number of total records
-
-DEFINE WarcLoader org.warcbase.pig.WarcLoader();
-
-raw = load '$testWarcFolder' using WarcLoader();
-a = group raw all;
-b = foreach a generate COUNT(raw);
-
-store b into '$experimentfolder/counts' using PigStorage();