diff --git a/README.md b/README.md
index dfe7307..5a3bfd2 100644
--- a/README.md
+++ b/README.md
@@ -1,120 +1,119 @@
Warcbase
========

Warcbase is an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing via Spark.

There are two main ways of using Warcbase:

-+ The first and most common is to analyze web archives using Spark (the preferred approach) or Pig (which is in the process of being deprecated).
++ The first and most common is to analyze web archives using [Spark](http://spark.apache.org/).
+ The second is to take advantage of HBase to provide random access as well as analytics capabilities. Random access allows Warcbase to provide temporal browsing of archived content (i.e., "wayback" functionality).

You can use Warcbase without HBase; since HBase requires more extensive setup, we recommend that if you're just starting out, you play with the Spark analytics and not worry about HBase.

Warcbase is built against CDH 5.4.1:

+ Hadoop version: 2.6.0-cdh5.4.1
+ HBase version: 1.0.0-cdh5.4.1
-+ Pig version: 0.12.0-cdh5.4.1
+ Spark version: 1.3.0-cdh5.4.1

The Hadoop ecosystem is evolving rapidly, so there may be incompatibilities with other versions.

Detailed documentation is available in [this repository's wiki](https://github.com/lintool/warcbase/wiki).

Getting Started
---------------

Clone the repo:

```
$ git clone http://github.com/lintool/warcbase.git
```

You can then build Warcbase:

```
$ mvn clean package appassembler:assemble
```

For the impatient, to skip tests:

```
$ mvn clean package appassembler:assemble -DskipTests
```

To create Eclipse project files:

```
$ mvn eclipse:clean
$ mvn eclipse:eclipse
```

You can then import the project into Eclipse.

Spark Quickstart
----------------

For the impatient, let's do a simple analysis with Spark. Within the repo there's already a sample ARC file stored at `src/test/resources/arc/example.arc.gz`.

If you need to install Spark, [we have a walkthrough here for installation on OS X](https://github.com/lintool/warcbase/wiki/Installing-and-Running-Spark-under-OS-X). That page also has instructions on how to get Spark Notebook, an interactive web-based editor, running.

Once you've got Spark installed, go ahead and fire up the Spark shell:

```
$ spark-shell --jars target/warcbase-0.1.0-SNAPSHOT-fatjar.jar
```

Here's a simple script that extracts and counts the top-level domains (i.e., the number of pages for each top-level domain) in the sample ARC data:

```
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArc("src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractTopLevelDomain(r.getUrl))
  .countItems()
  .take(10)
```

**Tip:** By default, commands in the Spark shell must be one line. To run multi-line commands, type `:paste` in the Spark shell; you can then copy-paste the script above directly. Use Ctrl-D to finish the command.

Want to learn more? Check out [analyzing web archives with Spark](https://github.com/lintool/warcbase/wiki/Analyzing-Web-Archives-with-Spark).

What About Pig?
---------------

Warcbase was originally conceived with Pig for analytics, but we have transitioned over to Spark as the platform of choice for scholarly interactions with web archive data. Spark has several advantages: a cleaner interface, easier-to-write user-defined functions (UDFs), and integration with various "notebook" frontends.
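To make the UDF point concrete, here is a minimal sketch (illustrative only, not code shipped in this repo; the `stripWww` helper is invented for the example). In Spark, a UDF is just an ordinary Scala function, so custom logic drops straight into the quickstart pipeline above:

```
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// In Spark a "UDF" is a plain Scala function -- no EvalFunc subclass,
// schema declaration, or DEFINE statement as in Pig.
// (stripWww is a hypothetical helper, invented for this illustration.)
def stripWww(domain: String): String =
  if (domain != null && domain.startsWith("www.")) domain.substring(4) else domain

val counts = RecordLoader.loadArc("src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => stripWww(ExtractTopLevelDomain(r.getUrl)))
  .countItems()
  .take(10)
```

Compare this with the Pig `EvalFunc` implementations deleted below, each of which requires its own Java class, schema handling, and jar registration.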
Visualizations
--------------

The results of analyses using Warcbase can serve as input to visualizations that help scholars interactively explore the data. Examples include:

+ [Basic crawl statistics](http://lintool.github.io/warcbase/vis/crawl-sites/index.html) from the Canadian Political Parties and Political Interest Groups collection.
+ [Interactive graph visualization](https://github.com/lintool/warcbase/wiki/Gephi:-Converting-Site-Link-Structure-into-Dynamic-Visualization) using Gephi.
+ [Shine interface](http://webarchives.ca/) for faceted full-text search.

Next Steps
----------

+ [Ingesting content into HBase](https://github.com/lintool/warcbase/wiki/Ingesting-Content-into-HBase): loading ARC and WARC data into HBase
+ [Warcbase/Wayback integration](https://github.com/lintool/warcbase/wiki/Warcbase-Wayback-Integration): a guide to providing temporal browsing capabilities
+ [Warcbase Java tools](https://github.com/lintool/warcbase/wiki/Warcbase-Java-Tools): building the URL mapping, extracting the webgraph

License
-------

Licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).

Acknowledgments
---------------

This work is supported in part by the National Science Foundation and by the Mellon Foundation (via Columbia University). Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.

diff --git a/pom.xml b/pom.xml
index 49cda9d..47fb08a 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1,546 +1,517 @@
4.0.0 org.warcbase warcbase jar 0.1.0-SNAPSHOT Warcbase An open-source platform for managing web archives built on Hadoop and HBase http://warcbase.org/ The Apache Software License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0.txt repo scm:git:git@github.com:lintool/warcbase.git scm:git:git@github.com:lintool/warcbase.git git@github.com:lintool/warcbase.git lintool Jimmy Lin jimmylin@umd.edu milad621 Milad Gholami mgholami@cs.umd.edu jeffyRao Jinfeng Rao jinfeng@cs.umd.edu org.sonatype.oss oss-parent 7 UTF-8 UTF-8 8.1.12.v20130726 2.6.0-cdh5.4.1 1.0.0-cdh5.4.1 3.4.5-cdh5.4.1
- 0.12.0-cdh5.4.1
1.3.0-cdh5.4.1 2.10.4 maven-clean-plugin 2.6.1 src/main/solr/lib false org.apache.maven.plugins maven-compiler-plugin 3.2 1.7 1.7 org.apache.maven.plugins maven-shade-plugin 2.3 package shade META-INF/services/org.apache.lucene.codecs.Codec *:* META-INF/*.SF META-INF/*.DSA META-INF/*.RSA true fatjar org.apache.hadoop:* org.apache.maven.plugins maven-dependency-plugin 2.4 copy package copy-dependencies src/main/solr/lib org.codehaus.mojo appassembler-maven-plugin 1.9 -Xms512M -Xmx24576M org.warcbase.WarcbaseAdmin WarcbaseAdmin org.warcbase.data.UrlMappingBuilder UrlMappingBuilder org.warcbase.data.UrlMapping UrlMapping org.warcbase.data.ExtractLinks ExtractLinks org.warcbase.data.ExtractSiteLinks ExtractSiteLinks org.warcbase.ingest.IngestFiles IngestFiles org.warcbase.ingest.SearchForUrl SearchForUrl org.warcbase.browser.WarcBrowser WarcBrowser org.warcbase.analysis.DetectDuplicates DetectDuplicates org.warcbase.browser.SeleniumBrowser SeleniumBrowser org.scala-tools maven-scala-plugin 2.15.2 process-resources add-source compile scala-test-compile process-test-resources testCompile ${scala.version} true -target:jvm-1.7 -g:vars -deprecation -dependencyfile
${project.build.directory}/.scala_dependencies maven http://repo.maven.apache.org/maven2/ cloudera https://repository.cloudera.com/artifactory/cloudera-repos/ internetarchive Internet Archive Maven Repository http://builds.archive.org:8080/maven2 junit junit 4.12 test org.scalatest scalatest_2.10 2.2.4 test commons-codec commons-codec 1.8 commons-io commons-io 2.4 org.jsoup jsoup 1.7.3 com.google.guava guava 14.0.1 tl.lin lintools-datatypes 1.0.0 org.apache.hbase hbase-client ${hbase.version} org.apache.hadoophadoop-core org.apache.hbase hbase-server ${hbase.version} org.apache.hadoophadoop-core org.mortbay.jettyservlet-api-2.5 javax.servletservlet-api asmasm org.apache.hadoop hadoop-client ${hadoop.version} javax.servletservlet-api org.apache.zookeeper zookeeper ${zookeeper.version} - - org.apache.pig - pig - ${pig.version} - - org.mortbay.jettyservlet-api-2.5 - javax.servletservlet-api - - - - org.apache.pig - pigunit - ${pig.version} - - commons-langcommons-lang - commons-loggingcommons-logging - - - org.netpreserve.openwayback openwayback-core 2.0.0.BETA.2 org.apache.hadoophadoop-core ch.qos.logbacklogback-classic org.netpreserve.openwaybackopenwayback-cdx-server org.netpreserve.openwaybackopenwayback-access-control-core it.unimi.dsidsiutils fastutilfastutil org.netpreserve.commons webarchive-commons 1.1.4 org.apache.hadoophadoop-core commons-langcommons-lang fastutilfastutil it.unimi.dsi dsiutils 2.2.0 ch.qos.logbacklogback-classic commons-langcommons-lang it.unimi.dsi fastutil 6.5.15 commons-langcommons-lang org.eclipse.jetty jetty-server ${jettyVersion} org.eclipse.jetty jetty-webapp ${jettyVersion} true org.slf4j slf4j-log4j12 1.6.4 org.apache.commons commons-lang3 3.0 commons-cli commons-cli 1.2 net.sf.opencsv opencsv 2.3 org.apache.tika tika-core 1.9 org.apache.tika tika-parsers 1.9 org.antlr antlr 3.5.2 - - - org.seleniumhq.selenium selenium-java 2.42.2 org.seleniumhq.seleniumselenium-htmlunit-driver org.seleniumhq.seleniumselenium-ie-driver org.webbitserverwebbit org.scala-lang scala-library 2.10.4 org.apache.spark spark-core_2.10 ${spark.version} com.typesafeconfig org.xerial.snappysnappy-java com.fasterxml.jackson.core jackson-core 2.6.3 com.typesafe config 1.2.1 org.xerial.snappy snappy-java 1.0.5 edu.stanford.nlp stanford-corenlp 3.4.1 com.syncthemall boilerpipe 1.2.2 xerces xercesImpl 2.11.0 org.apache.lucene lucene-core 4.7.2 org.apache.solr solr-core 4.7.2 slf4j-apiorg.slf4j org.apache.hadoophadoop-annotations org.apache.hadoophadoop-common org.apache.hadoophadoop-hdfs com.typesafeconfig uk.bl.wa.discovery warc-hadoop-indexer 2.2.0-BETA-5 asmasm com.typesafeconfig diff --git a/src/main/java/org/warcbase/pig/ArcLoader.java b/src/main/java/org/warcbase/pig/ArcLoader.java deleted file mode 100644 index 61a55e4..0000000 --- a/src/main/java/org/warcbase/pig/ArcLoader.java +++ /dev/null @@ -1,121 +0,0 @@ -package org.warcbase.pig; - -import java.io.IOException; -import java.util.List; - -import org.apache.hadoop.io.LongWritable; -import org.apache.hadoop.mapreduce.Job; -import org.apache.hadoop.mapreduce.RecordReader; -import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; -import org.apache.log4j.Logger; -import org.apache.pig.Expression; -import org.apache.pig.FileInputLoadFunc; -import org.apache.pig.LoadMetadata; -import org.apache.pig.PigException; -import org.apache.pig.ResourceSchema; -import org.apache.pig.ResourceStatistics; -import org.apache.pig.backend.executionengine.ExecException; -import 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit; -import org.apache.pig.data.DataByteArray; -import org.apache.pig.data.DataType; -import org.apache.pig.data.Tuple; -import org.apache.pig.data.TupleFactory; -import org.archive.io.arc.ARCRecordMetaData; -import org.warcbase.data.ArcRecordUtils; -import org.warcbase.io.ArcRecordWritable; -import org.warcbase.mapreduce.WacArcInputFormat; - -import com.google.common.collect.Lists; - -public class ArcLoader extends FileInputLoadFunc implements LoadMetadata { - private static final Logger LOG = Logger.getLogger(ArcLoader.class); - - private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance(); - - private RecordReader in; - - public ArcLoader() { - } - - @Override - public WacArcInputFormat getInputFormat() throws IOException { - return new WacArcInputFormat(); - } - - @Override - public Tuple getNext() throws IOException { - try { - if ( !in.nextKeyValue() ) { - return null; - } - - ArcRecordWritable r = in.getCurrentValue(); - ARCRecordMetaData meta = r.getRecord().getMetaData(); - - List protoTuple = Lists.newArrayList(); - protoTuple.add(meta.getUrl()); - protoTuple.add(meta.getDate()); // These are the standard 14-digit dates. - protoTuple.add(meta.getMimetype()); - - try { - protoTuple.add(new DataByteArray(ArcRecordUtils.getBodyContent(r.getRecord()))); - } catch (OutOfMemoryError e) { - // When we get a corrupt record, this will happen... - // Try to recover and move on... - LOG.error("Encountered OutOfMemoryError ingesting " + meta.getUrl()); - LOG.error("Attempting to continue..."); - } - - return TUPLE_FACTORY.newTupleNoCopy(protoTuple); - } catch (InterruptedException e) { - int errCode = 6018; - String errMsg = "Error while reading input"; - throw new ExecException(errMsg, errCode, PigException.REMOTE_ENVIRONMENT, e); - } - } - - @SuppressWarnings({ "unchecked", "rawtypes" }) - @Override - public void prepareToRead(RecordReader reader, PigSplit split) { - in = reader; - } - - @Override - public void setLocation(String location, Job job) throws IOException { - FileInputFormat.setInputPaths(job, location); - } - - @Override - public String[] getPartitionKeys(String location, Job job) throws IOException { - return null; - } - - @Override - public ResourceSchema getSchema(String location, Job job) throws IOException { - // Schema is (url:chararray, date:chararray, mime:chararray, content:bytearray) - ResourceSchema schema = new ResourceSchema(); - - ResourceSchema.ResourceFieldSchema[] fields = new ResourceSchema.ResourceFieldSchema[4]; - fields[0] = new ResourceSchema.ResourceFieldSchema(); - fields[0].setName("url").setType(DataType.CHARARRAY); - fields[1] = new ResourceSchema.ResourceFieldSchema(); - fields[1].setName("date").setType(DataType.CHARARRAY); - fields[2] = new ResourceSchema.ResourceFieldSchema(); - fields[2].setName("mime").setType(DataType.CHARARRAY); - fields[3] = new ResourceSchema.ResourceFieldSchema(); - fields[3].setName("content").setType(DataType.BYTEARRAY); - - schema.setFields(fields); - - return schema; - } - - @Override - public ResourceStatistics getStatistics(String location, Job job) throws IOException { - return null; - } - - @Override - public void setPartitionFilter(Expression partitionFilter) throws IOException { - } -} diff --git a/src/main/java/org/warcbase/pig/WarcLoader.java b/src/main/java/org/warcbase/pig/WarcLoader.java deleted file mode 100644 index 298a75e..0000000 --- a/src/main/java/org/warcbase/pig/WarcLoader.java +++ /dev/null @@ -1,149 +0,0 @@ 
-package org.warcbase.pig; - -import com.google.common.collect.Lists; -import org.apache.hadoop.io.LongWritable; -import org.apache.hadoop.mapreduce.Job; -import org.apache.hadoop.mapreduce.RecordReader; -import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; -import org.apache.log4j.Logger; -import org.apache.pig.*; -import org.apache.pig.backend.executionengine.ExecException; -import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit; -import org.apache.pig.data.DataByteArray; -import org.apache.pig.data.DataType; -import org.apache.pig.data.Tuple; -import org.apache.pig.data.TupleFactory; -import org.archive.io.ArchiveRecordHeader; -import org.archive.io.warc.WARCRecord; -import org.archive.util.ArchiveUtils; -import org.warcbase.data.WarcRecordUtils; -import org.warcbase.io.WarcRecordWritable; -import org.warcbase.mapreduce.WacWarcInputFormat; - -import java.io.IOException; -import java.text.DateFormat; -import java.text.ParseException; -import java.text.SimpleDateFormat; -import java.util.Date; -import java.util.List; - -public class WarcLoader extends FileInputLoadFunc implements LoadMetadata { - private static final Logger LOG = Logger.getLogger(WarcLoader.class); - - private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance(); - private static final DateFormat ISO8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssX"); - - private RecordReader in; - - public WarcLoader() { - } - - @Override - public WacWarcInputFormat getInputFormat() throws IOException { - return new WacWarcInputFormat(); - } - - @Override - public Tuple getNext() throws IOException { - try { - WARCRecord record; - ArchiveRecordHeader header; - - // We're going to continue reading WARC records from the underlying input format - // until we reach a "response" record. - while (true) { - if (!in.nextKeyValue()) { - return null; - } - - record = in.getCurrentValue().getRecord(); - header = record.getHeader(); - - if (header.getHeaderValue("WARC-Type").equals("response")) { - break; - } - } - - String url = header.getUrl(); - byte[] content = null; - String type = null; - - try { - content = WarcRecordUtils.getContent(record); - type = WarcRecordUtils.getWarcResponseMimeType(content); - } catch (OutOfMemoryError e) { - // When we get a corrupt record, this will happen... - // Try to recover and move on... 
- LOG.error("Encountered OutOfMemoryError ingesting " + url); - LOG.error("Attempting to continue..."); - } - - Date d = null; - String date = null; - try { - d = ISO8601.parse(header.getDate()); - date = ArchiveUtils.get14DigitDate(d); - } catch (ParseException e) { - LOG.error("Encountered ParseException ingesting " + url); - } - - List protoTuple = Lists.newArrayList(); - protoTuple.add(url); - protoTuple.add(date); - protoTuple.add(type); - protoTuple.add(new DataByteArray(content)); - - return TUPLE_FACTORY.newTupleNoCopy(protoTuple); - } catch (InterruptedException e) { - int errCode = 6018; - String errMsg = "Error while reading input"; - throw new ExecException(errMsg, errCode, PigException.REMOTE_ENVIRONMENT, e); - } - } - - @SuppressWarnings({ "unchecked", "rawtypes" }) - @Override - public void prepareToRead(RecordReader reader, PigSplit split) { - in = reader; - } - - @Override - public void setLocation(String location, Job job) throws IOException { - FileInputFormat.setInputPaths(job, location); - } - - - @Override - public String[] getPartitionKeys(String location, Job job) throws IOException { - return null; - } - - @Override - public ResourceSchema getSchema(String location, Job job) throws IOException { - // Schema is (url:chararray, date:chararray, mime:chararray, content:bytearray) - ResourceSchema schema = new ResourceSchema(); - - ResourceSchema.ResourceFieldSchema[] fields = new ResourceSchema.ResourceFieldSchema[4]; - fields[0] = new ResourceSchema.ResourceFieldSchema(); - fields[0].setName("url").setType(DataType.CHARARRAY); - fields[1] = new ResourceSchema.ResourceFieldSchema(); - fields[1].setName("date").setType(DataType.CHARARRAY); - fields[2] = new ResourceSchema.ResourceFieldSchema(); - fields[2].setName("mime").setType(DataType.CHARARRAY); - fields[3] = new ResourceSchema.ResourceFieldSchema(); - fields[3].setName("content").setType(DataType.BYTEARRAY); - - schema.setFields(fields); - - return schema; - } - - @Override - public ResourceStatistics getStatistics(String location, Job job) throws IOException { - return null; - } - - @Override - public void setPartitionFilter(Expression partitionFilter) throws IOException { - } -} diff --git a/src/main/java/org/warcbase/pig/piggybank/DetectLanguage.java b/src/main/java/org/warcbase/pig/piggybank/DetectLanguage.java deleted file mode 100644 index ccddd54..0000000 --- a/src/main/java/org/warcbase/pig/piggybank/DetectLanguage.java +++ /dev/null @@ -1,18 +0,0 @@ -package org.warcbase.pig.piggybank; - -import org.apache.pig.EvalFunc; -import org.apache.pig.data.Tuple; -import org.apache.tika.language.LanguageIdentifier; - -import java.io.IOException; - -public class DetectLanguage extends EvalFunc { - @Override - public String exec(Tuple input) throws IOException { - if (input == null || input.size() == 0 || input.get(0) == null) { - return null; - } - String text = (String) input.get(0); - return new LanguageIdentifier(text).getLanguage(); - } -} \ No newline at end of file diff --git a/src/main/java/org/warcbase/pig/piggybank/DetectMimeTypeMagic.java b/src/main/java/org/warcbase/pig/piggybank/DetectMimeTypeMagic.java deleted file mode 100644 index 6e8aeee..0000000 --- a/src/main/java/org/warcbase/pig/piggybank/DetectMimeTypeMagic.java +++ /dev/null @@ -1,38 +0,0 @@ -package org.warcbase.pig.piggybank; - -import org.apache.pig.EvalFunc; -import org.apache.pig.data.Tuple; - -import java.io.ByteArrayInputStream; -import java.io.IOException; -import java.io.InputStream; - -public class DetectMimeTypeMagic extends 
EvalFunc { - - - @Override - public String exec(Tuple input) throws IOException { - String mimeType = null; - - if (input == null || input.size() == 0 || input.get(0) == null) { - return "N/A"; - } - String magicFile = (String) input.get(0); - String content = (String) input.get(1); - - InputStream is = new ByteArrayInputStream(content.getBytes()); - if (content.isEmpty()) return "EMPTY"; - - // I'm commenting this out because the jar isn't actually published anywhere... - // @lintool 2014/08/12 - - //org.opf_labs.LibmagicJnaWrapper jnaWrapper = new LibmagicJnaWrapper(); - //jnaWrapper.load(magicFile); - - //jnaWrapper.load("/usr/local/Cellar/libmagic/5.15/share/misc/magic.mgc"); // Mac OS X with Homebrew - //jnaWrapper.load("/usr/share/file/magic.mgc"); // CentOS - - //mimeType = jnaWrapper.getMimeType(is); - return mimeType; - } -} diff --git a/src/main/java/org/warcbase/pig/piggybank/DetectMimeTypeTika.java b/src/main/java/org/warcbase/pig/piggybank/DetectMimeTypeTika.java deleted file mode 100644 index 538002c..0000000 --- a/src/main/java/org/warcbase/pig/piggybank/DetectMimeTypeTika.java +++ /dev/null @@ -1,33 +0,0 @@ -package org.warcbase.pig.piggybank; - -import org.apache.pig.EvalFunc; -import org.apache.pig.data.DataByteArray; -import org.apache.pig.data.Tuple; -import org.apache.tika.Tika; -import org.apache.tika.detect.DefaultDetector; -import org.apache.tika.parser.AutoDetectParser; - -import java.io.ByteArrayInputStream; -import java.io.IOException; -import java.io.InputStream; - -public class DetectMimeTypeTika extends EvalFunc { - - @Override - public String exec(Tuple input) throws IOException { - String mimeType; - - if (input == null || input.size() == 0 || input.get(0) == null) { - return "N/A"; - } - DataByteArray content = (DataByteArray) input.get(0); - - InputStream is = new ByteArrayInputStream(content.get()); - - DefaultDetector detector = new DefaultDetector(); - AutoDetectParser parser = new AutoDetectParser(detector); - mimeType = new Tika(detector, parser).detect(is); - - return mimeType; - } -} diff --git a/src/main/java/org/warcbase/pig/piggybank/ExtractBoilerpipeText.java b/src/main/java/org/warcbase/pig/piggybank/ExtractBoilerpipeText.java deleted file mode 100644 index f3c9b75..0000000 --- a/src/main/java/org/warcbase/pig/piggybank/ExtractBoilerpipeText.java +++ /dev/null @@ -1,32 +0,0 @@ -package org.warcbase.pig.piggybank; - -import java.io.IOException; -import org.apache.commons.lang.StringUtils; - -import org.apache.pig.EvalFunc; -import org.apache.pig.data.Tuple; -import de.l3s.boilerpipe.extractors.DefaultExtractor; -// Could also use tika.parser.html.BoilerpipeContentHandler, which uses older version of boilerpipe - -/** - * UDF for extracting raw text content from an HTML page, minus "boilerplate" - * content (using boilerpipe). 
- */ -public class ExtractBoilerpipeText extends EvalFunc { - public String exec(Tuple input) throws IOException { - if (input == null || input.size() == 0 || input.get(0) == null) { - return null; - } - - try { - // Other available extractors: https://boilerpipe.googlecode.com/svn/trunk/boilerpipe-core/javadoc/1.0/de/l3s/boilerpipe/extractors/package-summary.html - String text = DefaultExtractor.INSTANCE.getText((String) input.get(0)).replaceAll("[\\r\\n]+", " ").trim(); - if (text.isEmpty()) - return null; - else - return text; - } catch (Exception e) { - throw new IOException("Caught exception processing input row ", e); - } - } -} diff --git a/src/main/java/org/warcbase/pig/piggybank/ExtractLinks.java b/src/main/java/org/warcbase/pig/piggybank/ExtractLinks.java deleted file mode 100644 index 4ebb9e1..0000000 --- a/src/main/java/org/warcbase/pig/piggybank/ExtractLinks.java +++ /dev/null @@ -1,58 +0,0 @@ -package org.warcbase.pig.piggybank; - -import com.google.common.collect.Lists; -import org.apache.pig.EvalFunc; -import org.apache.pig.data.BagFactory; -import org.apache.pig.data.DataBag; -import org.apache.pig.data.Tuple; -import org.apache.pig.data.TupleFactory; -import org.jsoup.Jsoup; -import org.jsoup.nodes.Document; -import org.jsoup.nodes.Element; -import org.jsoup.select.Elements; - -import java.io.IOException; -import java.util.List; - -/** - * UDF for extracting links from a webpage given the HTML content (using Jsoup). Returns a bag of - * tuples, where each tuple consists of the URL and the anchor text. - */ -public class ExtractLinks extends EvalFunc { - private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance(); - private static final BagFactory BAG_FACTORY = BagFactory.getInstance(); - - public DataBag exec(Tuple input) throws IOException { - if (input == null || input.size() == 0 || input.get(0) == null) { - return null; - } - - try { - String html = (String) input.get(0); - String base = input.size() > 1 ? (String) input.get(1) : null; - - DataBag output = BAG_FACTORY.newDefaultBag(); - Document doc = Jsoup.parse(html); - Elements links = doc.select("a[href]"); - - for (Element link : links) { - if (base != null) { - link.setBaseUri(base); - } - String target = link.attr("abs:href"); - if (target.length() == 0) { - continue; - } - - // Create each tuple (URL, anchor text) - List linkTuple = Lists.newArrayList(); - linkTuple.add(target); - linkTuple.add(link.text()); - output.add(TUPLE_FACTORY.newTupleNoCopy(linkTuple)); - } - return output; - } catch (Exception e) { - throw new IOException("Caught exception processing input row ", e); - } - } -} \ No newline at end of file diff --git a/src/main/java/org/warcbase/pig/piggybank/ExtractRawText.java b/src/main/java/org/warcbase/pig/piggybank/ExtractRawText.java deleted file mode 100644 index 5ef49d3..0000000 --- a/src/main/java/org/warcbase/pig/piggybank/ExtractRawText.java +++ /dev/null @@ -1,25 +0,0 @@ -package org.warcbase.pig.piggybank; - -import java.io.IOException; - -import org.apache.pig.EvalFunc; -import org.apache.pig.data.Tuple; -import org.jsoup.Jsoup; - -/** - * UDF for extracting raw text content from an HTML page (using Jsoup). - */ -public class ExtractRawText extends EvalFunc { - public String exec(Tuple input) throws IOException { - if (input == null || input.size() == 0 || input.get(0) == null) { - return null; - } - - try { - // Use Jsoup for cleanup. 
- return Jsoup.parse((String) input.get(0)).text().replaceAll("[\\r\\n]+", " "); - } catch (Exception e) { - throw new IOException("Caught exception processing input row ", e); - } - } -} \ No newline at end of file diff --git a/src/main/java/org/warcbase/pig/piggybank/ExtractTextFromPDFs.java b/src/main/java/org/warcbase/pig/piggybank/ExtractTextFromPDFs.java deleted file mode 100644 index 7f6231a..0000000 --- a/src/main/java/org/warcbase/pig/piggybank/ExtractTextFromPDFs.java +++ /dev/null @@ -1,46 +0,0 @@ -package org.warcbase.pig.piggybank; - -import java.io.ByteArrayInputStream; -import java.io.IOException; -import java.io.InputStream; - -import org.apache.pig.EvalFunc; -import org.apache.pig.data.DataByteArray; -import org.apache.pig.data.Tuple; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.parser.pdf.PDFParser; -import org.apache.tika.sax.BodyContentHandler; -import org.xml.sax.ContentHandler; - -public class ExtractTextFromPDFs extends EvalFunc { - private Parser pdfParser = new PDFParser(); - - @Override - public String exec(Tuple input) throws IOException { - try { - if (input == null || input.size() == 0 || input.get(0) == null) { - return "N/A"; - } - - DataByteArray dba = (DataByteArray) input.get(0); - InputStream is = new ByteArrayInputStream(dba.get()); - - ContentHandler contenthandler = new BodyContentHandler(Integer.MAX_VALUE); - Metadata metadata = new Metadata(); - - pdfParser.parse(is, contenthandler, metadata, new ParseContext()); - - if (is != null) { - is.close(); - } - - return contenthandler.toString(); - } catch (Throwable t) { - // Basically, catch everything... - t.printStackTrace(); - return null; - } - } -} \ No newline at end of file diff --git a/src/main/java/org/warcbase/pig/piggybank/ExtractTopLevelDomain.java b/src/main/java/org/warcbase/pig/piggybank/ExtractTopLevelDomain.java deleted file mode 100644 index 23a1dce..0000000 --- a/src/main/java/org/warcbase/pig/piggybank/ExtractTopLevelDomain.java +++ /dev/null @@ -1,38 +0,0 @@ -package org.warcbase.pig.piggybank; - -import java.io.IOException; -import java.net.URL; - -import org.apache.pig.EvalFunc; -import org.apache.pig.data.Tuple; - -/** - * UDF for extracting the top-level domain from an URL. Extracts the hostname from the first - * argument; if it's null, extracts the hostname from the second argument. The second - * argument is typically a source page, e.g., if the first URL is a relative URL, take the host from - * the source page. - */ -public class ExtractTopLevelDomain extends EvalFunc { - public String exec(Tuple input) throws IOException { - if (input == null || input.size() == 0 || input.get(0) == null) { - return null; - } - - String host = null; - try { - host = (new URL((String) input.get(0))).getHost(); - } catch (Exception e) { - // It's okay, just fall through here. 
- } - - if (host != null || (host == null && input.size() == 0)) { - return host; - } - - try { - return (new URL((String) input.get(1))).getHost(); - } catch (Exception e) { - return null; - } - } -} \ No newline at end of file diff --git a/src/main/java/org/warcbase/pig/piggybank/NER3ClassUDF.java b/src/main/java/org/warcbase/pig/piggybank/NER3ClassUDF.java deleted file mode 100644 index 08a4a38..0000000 --- a/src/main/java/org/warcbase/pig/piggybank/NER3ClassUDF.java +++ /dev/null @@ -1,132 +0,0 @@ -/* - * Copyright 2014 Internet Archive - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. - */ - -package org.warcbase.pig.piggybank; - -import java.io.IOException; -import org.apache.pig.EvalFunc; -import org.apache.pig.data.Tuple; -import java.util.regex.*; -import java.io.*; -import java.net.*; -import org.apache.pig.PigException; -import org.apache.pig.backend.executionengine.ExecException; -import org.apache.pig.data.TupleFactory; -import org.apache.pig.data.DataBag; -import org.apache.pig.data.DataType; -import org.apache.pig.data.Tuple; -import org.apache.pig.builtin.MonitoredUDF; -import java.util.ArrayList; -import java.util.List; -import java.util.Iterator; -import java.util.Map; -import java.util.EnumMap; -import edu.stanford.nlp.ie.AbstractSequenceClassifier; -import edu.stanford.nlp.ie.crf.*; -import edu.stanford.nlp.io.IOUtils; -import edu.stanford.nlp.ling.CoreLabel; -import edu.stanford.nlp.ling.CoreAnnotations; -import java.io.IOException; -import java.io.ObjectInputStream; -import java.io.ObjectOutputStream; -import java.lang.Integer; -import java.util.concurrent.TimeUnit; - -/** - * UDF which reads in a text string, and returns entities identified by the configured Stanford NER classifier - * @author vinay - * @author jrwiebe - */ - -//@MonitoredUDF(timeUnit = TimeUnit.MILLISECONDS, duration = 120000, stringDefault = "{PERSON=[], ORGANIZATION=[], LOCATION=[]}") -public class NER3ClassUDF extends EvalFunc { - - String serializedClassifier; - AbstractSequenceClassifier classifier = null; - - public NER3ClassUDF(String file) { - serializedClassifier = file; - } - - public enum NERClassType { PERSON, ORGANIZATION, LOCATION, O } - - public String exec(Tuple input) throws IOException { - - String emptyString = "{PERSON=[], ORGANIZATION=[], LOCATION=[]}"; - Map> entitiesByType = new EnumMap>(NERClassType.class); - for (NERClassType t : NERClassType.values()) { - if(t != NERClassType.O) - entitiesByType.put(t, new ArrayList()); - } - - NERClassType prevEntityType = NERClassType.O; - String entityBuffer = ""; - - if(input == null || input.size() == 0) { - return emptyString; - } - - try { - String textString = (String)input.get(0); - if(textString == null) { - return emptyString; - } - - if(classifier == null) { - //initialize - classifier = CRFClassifier.getClassifier(serializedClassifier); - } - - List> out = classifier.classify(textString); - for (List sentence : out) { - for (CoreLabel word : sentence) { - String wordText = word.word(); - String classText = 
word.get(CoreAnnotations.AnswerAnnotation.class); - NERClassType currEntityType = NERClassType.valueOf(classText); - if (prevEntityType != currEntityType) { - if(prevEntityType != NERClassType.O && !entityBuffer.equals("")) { - //time to commit - entitiesByType.get(prevEntityType).add(entityBuffer); - entityBuffer = ""; - } - } - prevEntityType = currEntityType; - if(currEntityType != NERClassType.O) { - if(entityBuffer.equals("")) - entityBuffer = wordText; - else - entityBuffer+= " " + wordText; - } - } - //end of sentence - //apply commit and reset - if(prevEntityType != NERClassType.O && !entityBuffer.equals("")) { - entitiesByType.get(prevEntityType).add(entityBuffer); - entityBuffer = ""; - } - //reset - prevEntityType = NERClassType.O; - entityBuffer = ""; - } - return entitiesByType.toString(); - - } catch(Exception e) { - if(classifier == null) - throw new IOException("Unable to load classifier ", e); - return emptyString; - } - } -} diff --git a/src/main/scala/org/warcbase/spark/matchbox/ExtractTextFromPDFs.scala b/src/main/scala/org/warcbase/spark/matchbox/ExtractTextFromPDFs.scala index badd233..d6072b3 100644 --- a/src/main/scala/org/warcbase/spark/matchbox/ExtractTextFromPDFs.scala +++ b/src/main/scala/org/warcbase/spark/matchbox/ExtractTextFromPDFs.scala @@ -1,32 +1,34 @@ package org.warcbase.spark.matchbox import java.io.ByteArrayInputStream -import org.apache.pig.data.DataByteArray +//import org.apache.pig.data.DataByteArray import org.apache.tika.metadata.Metadata import org.apache.tika.parser.ParseContext import org.apache.tika.parser.pdf.PDFParser import org.apache.tika.sax.BodyContentHandler; object ExtractTextFromPDFs { val pdfParser = new PDFParser() +/* def apply(dba: DataByteArray): String = { if (dba.get.isEmpty) "N/A" else { try { val is = new ByteArrayInputStream(dba.get) val contenthandler = new BodyContentHandler(Integer.MAX_VALUE) val metadata = new Metadata() pdfParser.parse(is, contenthandler, metadata, new ParseContext()) is.close() contenthandler.toString } catch { case t: Throwable => t.printStackTrace() "" } } } +*/ } \ No newline at end of file diff --git a/src/test/java/org/warcbase/pig/PigArcLoaderTest.java b/src/test/java/org/warcbase/pig/PigArcLoaderTest.java deleted file mode 100644 index ec1daf3..0000000 --- a/src/test/java/org/warcbase/pig/PigArcLoaderTest.java +++ /dev/null @@ -1,173 +0,0 @@ -package org.warcbase.pig; - -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertFalse; - -import java.io.File; -import java.util.Iterator; - -import org.apache.commons.io.FileUtils; -import org.apache.commons.logging.Log; -import org.apache.commons.logging.LogFactory; -import org.apache.pig.data.Tuple; -import org.apache.pig.pigunit.PigTest; -import org.junit.After; -import org.junit.Before; -import org.junit.Test; - -import com.google.common.io.Files; -import com.google.common.io.Resources; - -public class PigArcLoaderTest { - private static final Log LOG = LogFactory.getLog(PigArcLoaderTest.class); - private File tempDir; - - @Test - public void testArcLoaderCount() throws Exception { - String arcTestDataFile = Resources.getResource("arc/example.arc.gz").getPath(); - - String pigFile = Resources.getResource("scripts/TestArcLoaderCount.pig").getPath(); - String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on windows - - PigTest test = new PigTest(pigFile, new String[] { "testArcFolder=" + arcTestDataFile, - "experimentfolder=" + location }); - - Iterator parses = test.getAlias("b"); - - Tuple tuple 
= parses.next(); - assertEquals(300L, tuple.get(0)); - - // There should only be one record. - assertFalse(parses.hasNext()); - } - - @Test - public void testArcCountLinks() throws Exception { - String arcTestDataFile = Resources.getResource("arc/example.arc.gz").getPath(); - - String pigFile = Resources.getResource("scripts/TestArcCountLinks.pig").getPath(); - String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on windows - - PigTest test = new PigTest(pigFile, new String[] { "testArcFolder=" + arcTestDataFile, - "experimentfolder=" + location }); - - Iterator parses = test.getAlias("a"); - - int cnt = 0; - while (parses.hasNext()) { - LOG.info("link and anchor text: " + parses.next()); - cnt++; - } - assertEquals(664, cnt); - } - - @Test - public void testDetectLanguage() throws Exception { - String arcTestDataFile; - arcTestDataFile = Resources.getResource("arc/example.arc.gz").getPath(); - - String pigFile = Resources.getResource("scripts/TestDetectLanguage.pig").getPath(); - String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on windows - - PigTest test = new PigTest(pigFile, new String[] { "testArcFolder=" + arcTestDataFile, "experimentfolder=" + location }); - - Iterator parses = test.getAlias("d"); - - while (parses.hasNext()) { - Tuple tuple = parses.next(); - String lang = (String) tuple.get(0); - switch (lang) { - case "en": assertEquals(57L, (long) tuple.get(1)); break; - case "et": assertEquals( 6L, (long) tuple.get(1)); break; - case "it": assertEquals( 1L, (long) tuple.get(1)); break; - case "lt": assertEquals(66L, (long) tuple.get(1)); break; - case "no": assertEquals( 6L, (long) tuple.get(1)); break; - case "ro": assertEquals( 4L, (long) tuple.get(1)); break; - } - System.out.println("language test: " + tuple.getAll()); - } - - } - - /* - * The two tests of MIME type detection is dependent on the version of the corresponding Tika and magiclib libraries - */ - - //@Test - // Commenting out this test case for now since it requires a 3rd party lib to be installed. - public void testDetectMimeTypeMagic() throws Exception { - String arcTestDataFile; - arcTestDataFile = Resources.getResource("arc/example.arc.gz").getPath(); - - String pigFile = Resources.getResource("scripts/TestDetectMimeTypeMagic.pig").getPath(); - String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on windows ? 
- - PigTest test = new PigTest(pigFile, new String[] { "testArcFolder=" + arcTestDataFile, - "experimentfolder=" + location }); - - Iterator ts = test.getAlias("magicMimeBinned"); - while (ts.hasNext()) { - Tuple t = ts.next(); // t = (mime type, count) - String mime = (String) t.get(0); - System.out.println(mime + ": " + t.get(1)); - if (mime != null) { - switch (mime) { - case "EMPTY": assertEquals( 7L, (long) t.get(1)); break; - case "text/html": assertEquals(139L, (long) t.get(1)); break; - case "text/plain": assertEquals( 80L, (long) t.get(1)); break; - case "image/gif": assertEquals( 29L, (long) t.get(1)); break; - case "application/xml": assertEquals( 11L, (long) t.get(1)); break; - case "application/rss+xml": assertEquals( 2L, (long) t.get(1)); break; - case "application/xhtml+xml": assertEquals( 1L, (long) t.get(1)); break; - case "application/octet-stream": assertEquals( 26L, (long) t.get(1)); break; - case "application/x-shockwave-flash": assertEquals( 8L, (long) t.get(1)); break; - } - } - } - } - - @Test - public void testDetectMimeTypeTika() throws Exception { - String arcTestDataFile; - arcTestDataFile = Resources.getResource("arc/example.arc.gz").getPath(); - - String pigFile = Resources.getResource("scripts/TestDetectMimeTypeTika.pig").getPath(); - String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on windows ? - - PigTest test = new PigTest(pigFile, new String[] { "testArcFolder=" + arcTestDataFile, "experimentfolder=" + location}); - - Iterator ts = test.getAlias("tikaMimeBinned"); - while (ts.hasNext()) { - Tuple t = ts.next(); - - String mime = (String) t.get(0); - switch (mime) { - case "image/gif": assertEquals( 29L, (long) t.get(1)); break; - case "image/png": assertEquals( 8L, (long) t.get(1)); break; - case "image/jpeg": assertEquals( 18L, (long) t.get(1)); break; - case "text/html": assertEquals(132L, (long) t.get(1)); break; - case "text/plain": assertEquals( 86L, (long) t.get(1)); break; - case "application/xml": assertEquals( 1L, (long) t.get(1)); break; - case "application/rss+xml": assertEquals( 9L, (long) t.get(1)); break; - case "applicaiton/xhtml+xml": assertEquals( 1L, (long) t.get(1)); break; - case "application/octet-stream": assertEquals( 7L, (long) t.get(1)); break; - case "application/x-shockwave-flash": assertEquals( 8L, (long) t.get(1)); break; - } - System.out.println(t.get(0) + ": " + t.get(1)); - } - } - - @Before - public void setUp() throws Exception { - // create a random file location - tempDir = Files.createTempDir(); - LOG.info("Output can be found in " + tempDir.getPath()); - } - - @After - public void tearDown() throws Exception { - // cleanup - FileUtils.deleteDirectory(tempDir); - LOG.info("Removing tmp files in " + tempDir.getPath()); - } -} diff --git a/src/test/java/org/warcbase/pig/PigWarcLoaderTest.java b/src/test/java/org/warcbase/pig/PigWarcLoaderTest.java deleted file mode 100644 index c77a480..0000000 --- a/src/test/java/org/warcbase/pig/PigWarcLoaderTest.java +++ /dev/null @@ -1,57 +0,0 @@ -package org.warcbase.pig; - -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertFalse; - -import java.io.File; -import java.util.Iterator; - -import org.apache.commons.io.FileUtils; -import org.apache.commons.logging.Log; -import org.apache.commons.logging.LogFactory; -import org.apache.pig.data.Tuple; -import org.apache.pig.pigunit.PigTest; -import org.junit.After; -import org.junit.Before; -import org.junit.Test; - -import com.google.common.io.Files; -import 
com.google.common.io.Resources; - -public class PigWarcLoaderTest { - private static final Log LOG = LogFactory.getLog(PigWarcLoaderTest.class); - private File tempDir; - - @Test - public void testWarcLoaderCount() throws Exception { - String arcTestDataFile = Resources.getResource("warc/example.warc.gz").getPath(); - - String pigFile = Resources.getResource("scripts/TestWarcLoaderCount.pig").getPath(); - String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on windows - - PigTest test = new PigTest(pigFile, new String[] { "testWarcFolder=" + arcTestDataFile, - "experimentfolder=" + location }); - - Iterator parses = test.getAlias("b"); - - Tuple tuple = parses.next(); - assertEquals(299L, tuple.get(0)); - - // There should only be one record. - assertFalse(parses.hasNext()); - } - - @Before - public void setUp() throws Exception { - // create a random file location - tempDir = Files.createTempDir(); - LOG.info("Output can be found in " + tempDir.getPath()); - } - - @After - public void tearDown() throws Exception { - // cleanup - FileUtils.deleteDirectory(tempDir); - LOG.info("Removing tmp files in " + tempDir.getPath()); - } -} diff --git a/src/test/java/org/warcbase/pig/piggybank/ExtractLinksTest.java b/src/test/java/org/warcbase/pig/piggybank/ExtractLinksTest.java deleted file mode 100644 index e235df1..0000000 --- a/src/test/java/org/warcbase/pig/piggybank/ExtractLinksTest.java +++ /dev/null @@ -1,59 +0,0 @@ -package org.warcbase.pig.piggybank; - -import static org.junit.Assert.assertEquals; - -import java.io.IOException; -import java.util.Arrays; -import java.util.Iterator; - -import org.apache.pig.data.DataBag; -import org.apache.pig.data.Tuple; -import org.apache.pig.data.TupleFactory; -import org.junit.Test; - -public class ExtractLinksTest { - private TupleFactory tupleFactory = TupleFactory.getInstance(); - - @Test - public void test1() throws IOException { - ExtractLinks udf = new ExtractLinks(); - - String fragment = "Here is a search engine.\n" + - "Here is Twitter.\n"; - - DataBag bag = udf.exec(tupleFactory.newTuple(fragment)); - assertEquals(2, bag.size()); - - Tuple tuple = null; - Iterator iter = bag.iterator(); - tuple = iter.next(); - assertEquals("http://www.google.com", (String) tuple.get(0)); - assertEquals("a search engine", (String) tuple.get(1)); - - tuple = iter.next(); - assertEquals("http://www.twitter.com/", (String) tuple.get(0)); - assertEquals("Twitter", (String) tuple.get(1)); - } - - @Test - public void test2() throws IOException { - ExtractLinks udf = new ExtractLinks(); - - String fragment = "Here is a search engine.\n" + - "Here is a relative URL.\n"; - - DataBag bag = udf.exec(tupleFactory.newTuple(Arrays.asList(fragment, "http://www.foobar.org/index.html"))); - assertEquals(2, bag.size()); - - Tuple tuple = null; - Iterator iter = bag.iterator(); - tuple = iter.next(); - assertEquals("http://www.google.com", (String) tuple.get(0)); - assertEquals("a search engine", (String) tuple.get(1)); - - tuple = iter.next(); - assertEquals("http://www.foobar.org/page.html", (String) tuple.get(0)); - assertEquals("a relative URL", (String) tuple.get(1)); - } - -} diff --git a/src/test/java/org/warcbase/pig/piggybank/ExtractTopLevelDomainTest.java b/src/test/java/org/warcbase/pig/piggybank/ExtractTopLevelDomainTest.java deleted file mode 100644 index b2ce5a8..0000000 --- a/src/test/java/org/warcbase/pig/piggybank/ExtractTopLevelDomainTest.java +++ /dev/null @@ -1,44 +0,0 @@ -package org.warcbase.pig.piggybank; - -import static 
org.junit.Assert.assertEquals; - -import java.io.IOException; -import java.util.Arrays; - -import org.apache.pig.data.TupleFactory; -import org.junit.Test; - -public class ExtractTopLevelDomainTest { - private TupleFactory tupleFactory = TupleFactory.getInstance(); - - private static final String[][] CASES1 = { - {"http://www.umiacs.umd.edu/~jimmylin/", "www.umiacs.umd.edu"}, - {"https://github.com/lintool", "github.com"}, - {"http://ianmilligan.ca/2015/05/04/iipc-2015-slides-for-warcs-wats-and-wgets-presentation/", "ianmilligan.ca"}, - {"index.html", null}, - }; - - private static final String[][] CASES2 = { - {"index.html","http://www.umiacs.umd.edu/~jimmylin/", "www.umiacs.umd.edu"}, - {"index.html","lintool/", null}, - }; - - @Test - public void test1() throws IOException { - ExtractTopLevelDomain udf = new ExtractTopLevelDomain(); - - for (int i = 0; i < CASES1.length; i++) { - assertEquals(CASES1[i][1], udf.exec(tupleFactory.newTuple(CASES1[i][0]))); - } - } - - @Test - public void test2() throws IOException { - ExtractTopLevelDomain udf = new ExtractTopLevelDomain(); - - for (int i = 0; i < CASES2.length; i++) { - assertEquals(CASES2[i][2], - udf.exec(tupleFactory.newTuple(Arrays.asList(CASES2[i][0], CASES2[i][1])))); - } - } -} diff --git a/src/test/resources/scripts/TestArcCountLinks.pig b/src/test/resources/scripts/TestArcCountLinks.pig deleted file mode 100644 index 905f3d8..0000000 --- a/src/test/resources/scripts/TestArcCountLinks.pig +++ /dev/null @@ -1,8 +0,0 @@ --- Counts up number of links - -DEFINE ArcLoader org.warcbase.pig.ArcLoader(); - -raw = load '$testArcFolder' using ArcLoader(); -a = foreach raw generate FLATTEN(org.warcbase.pig.piggybank.ExtractLinks((chararray) content)); - -store a into '$experimentfolder/a'; diff --git a/src/test/resources/scripts/TestArcLoader.pig b/src/test/resources/scripts/TestArcLoader.pig deleted file mode 100644 index f98a285..0000000 --- a/src/test/resources/scripts/TestArcLoader.pig +++ /dev/null @@ -1,14 +0,0 @@ --- Simple word count example to tally up dates when pages are crawled - -DEFINE ArcLoader org.warcbase.pig.ArcLoader(); - -raw = load '$testArcFolder' using ArcLoader(); --- schema is (url:chararray, date:chararray, mime:chararray, content:bytearray); - -store raw into '$experimentfolder/raw' using PigStorage(); - -a = foreach raw generate SUBSTRING(date, 0, 8) as date; -b = group a by date; -c = foreach b generate group, COUNT(a); - -store c into '$experimentfolder/c' using PigStorage(); \ No newline at end of file diff --git a/src/test/resources/scripts/TestArcLoaderCount.pig b/src/test/resources/scripts/TestArcLoaderCount.pig deleted file mode 100644 index c8a8819..0000000 --- a/src/test/resources/scripts/TestArcLoaderCount.pig +++ /dev/null @@ -1,9 +0,0 @@ --- Counts up number of total records - -DEFINE ArcLoader org.warcbase.pig.ArcLoader(); - -raw = load '$testArcFolder' using ArcLoader(); -a = group raw all; -b = foreach a generate COUNT(raw); - -store b into '$experimentfolder/counts' using PigStorage(); diff --git a/src/test/resources/scripts/TestDetectLanguage.pig b/src/test/resources/scripts/TestDetectLanguage.pig deleted file mode 100644 index 9b7acd0..0000000 --- a/src/test/resources/scripts/TestDetectLanguage.pig +++ /dev/null @@ -1,16 +0,0 @@ --- Simple language detection example - -DEFINE ArcLoader org.warcbase.pig.ArcLoader(); -DEFINE ExtractRawText org.warcbase.pig.piggybank.ExtractRawText(); -DEFINE DetectLanguage org.warcbase.pig.piggybank.DetectLanguage(); - -raw = load '$testArcFolder' using 
ArcLoader(); --- schema is (url:chararray, date:chararray, mime:chararray, content:bytearray); - -a = filter raw by mime == 'text/html'; -b = foreach a generate url, mime, - DetectLanguage(ExtractRawText((chararray) content)) as lang; -c = group b by lang; -d = foreach c generate group, COUNT(b); - -dump d; diff --git a/src/test/resources/scripts/TestDetectMimeTypeMagic.pig b/src/test/resources/scripts/TestDetectMimeTypeMagic.pig deleted file mode 100644 index 95bb0af..0000000 --- a/src/test/resources/scripts/TestDetectMimeTypeMagic.pig +++ /dev/null @@ -1,37 +0,0 @@ - --- Combined mime type check and language detection on an arc file ---register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar'; - -define ArcLoader org.warcbase.pig.ArcLoader(); -define DetectMimeTypeMagic org.warcbase.pig.piggybank.DetectMimeTypeMagic(); - --- Load arc file properties: url, date, mime, content -raw = load '$testArcFolder' using org.warcbase.pig.ArcLoader() as (url: chararray, date:chararray, mime:chararray, content:chararray); - --- Detect the mime type of the content using magic lib --- On CentOS the magic file is located at /usr/share/file/magic.mgc --- On MacOS X using Homebrew the magic file is located at /usr/local/Cellar/libmagic/5.15/share/misc/magic.mgc -a = foreach raw generate url,mime, DetectMimeTypeMagic('/usr/local/Cellar/libmagic/5.15/share/misc/magic.mgc', content) as magicMime; - - --- magic lib includes "; " in which we are not interested -b = foreach a { - magicMimeSplit = STRSPLIT(magicMime, ';'); - GENERATE url, mime, magicMimeSplit.$0 as magicMime; -} - --- httpMimes = foreach b generate mime; --- httpMimeGroups = group httpMimes by mime; --- httpMimeBinned = foreach httpMimeGroups generate group, COUNT(httpMimes); - -magicMimes = foreach b generate magicMime; -magicMimeGroups = group magicMimes by magicMime; -magicMimeBinned = foreach magicMimeGroups generate group, COUNT(magicMimes); - ---dump httpMimeBinned; ---dump tikaMimeBinned; ---dump magicMimeBinned; - --- store httpMimeBinned into '$experimentfolder/httpMimeBinned'; -store magicMimesBinned into '$experimentfolder/magicMimeBinned'; - diff --git a/src/test/resources/scripts/TestDetectMimeTypeTika.pig b/src/test/resources/scripts/TestDetectMimeTypeTika.pig deleted file mode 100644 index 364eeee..0000000 --- a/src/test/resources/scripts/TestDetectMimeTypeTika.pig +++ /dev/null @@ -1,19 +0,0 @@ --- Combined mime type check and language detection on an arc file - -define ArcLoader org.warcbase.pig.ArcLoader(); -define DetectMimeTypeTika org.warcbase.pig.piggybank.DetectMimeTypeTika(); - -raw = load '$testArcFolder' using ArcLoader(); --- schema is (url:chararray, date:chararray, mime:chararray, content:bytearray); - --- Detect the mime type of the content using and Tika -a = foreach raw generate url,mime, DetectMimeTypeTika(content) as tikaMime; - -tikaMimes = foreach a generate tikaMime; -tikaMimeGroups = group tikaMimes by tikaMime; -tikaMimeBinned = foreach tikaMimeGroups generate group, COUNT(tikaMimes); - -dump tikaMimeBinned; - -store tikaMimeBinned into '$experimentfolder/tikaMimeBinned'; - diff --git a/src/test/resources/scripts/TestWarcLoaderCount.pig b/src/test/resources/scripts/TestWarcLoaderCount.pig deleted file mode 100644 index d070b1a..0000000 --- a/src/test/resources/scripts/TestWarcLoaderCount.pig +++ /dev/null @@ -1,9 +0,0 @@ --- Counts up number of total records - -DEFINE WarcLoader org.warcbase.pig.WarcLoader(); - -raw = load '$testWarcFolder' using WarcLoader(); -a = group raw all; -b = foreach a generate 
COUNT(raw); - -store b into '$experimentfolder/counts' using PigStorage();
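
Closing note on the deleted Pig test scripts: the record-counting logic in TestArcLoaderCount.pig and TestWarcLoaderCount.pig reduces to a one-liner with the Spark loaders described in the README. A minimal sketch, assuming the `RecordLoader.loadArc` API shown in the quickstart (a parallel WARC loader is assumed, not shown in this diff):

```
import org.warcbase.spark.matchbox._

// Count all records in the sample ARC file
// (cf. the deleted TestArcLoaderCount.pig test, which expects 300 records).
val total = RecordLoader.loadArc("src/test/resources/arc/example.arc.gz", sc).count()
```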