Added WARC support to analysis:

Authored by Jeremy Wiebe <jwiebe@gmail.com> on Dec 10 2014, 15:54.

Description

Added WARC support to analysis:

WARC counterparts created for the Count*.java classes
WARC support added to ExtractUniqueUrls (can handle a mix of WARC and ARC files)
FindUrls split into FindArcUrls and FindWarcUrls

Using MultipleInputs (i.e. handling WARC and ARC in a single class) would have
required restructuring the Count* code. FindUrls does not use MultipleInputs
either, because doing so makes it impossible to obtain the input filename without
hacky code (see JIRA ticket MAPREDUCE-1743 and
http://stackoverflow.com/questions/11130145/hadoop-multipleinputs-fails-with-classcastexception).
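
For context, the "hacky code" referred to above is usually a reflection-based unwrap: with MultipleInputs the mapper receives a wrapper split rather than a plain file split, so a direct cast fails and the inner split has to be pulled out reflectively. The following is only a minimal sketch of that pattern, not code from this commit. It does not depend on Hadoop; FileSplitStandIn, TaggedSplitStandIn, Wrappers and inputPath are hypothetical stand-ins invented for illustration, and the method name getInputSplit is assumed to mirror the accessor on Hadoop's wrapper class.

```java
import java.lang.reflect.Method;

public class SplitUnwrapDemo {
    /** Hypothetical stand-in for a file-based split that knows its path. */
    static class FileSplitStandIn {
        private final String path;
        FileSplitStandIn(String path) { this.path = path; }
        String getPath() { return path; }
    }

    /** Hypothetical stand-in for the non-public wrapper handed to the mapper. */
    static class Wrappers {
        private static class TaggedSplitStandIn {
            private final Object inner;
            private TaggedSplitStandIn(Object inner) { this.inner = inner; }
            private Object getInputSplit() { return inner; }
        }
        static Object wrap(Object split) { return new TaggedSplitStandIn(split); }
    }

    /** Returns the file path of a split, unwrapping it reflectively if needed. */
    static String inputPath(Object split) throws ReflectiveOperationException {
        if (!(split instanceof FileSplitStandIn)) {
            // The wrapper type cannot be named here, so look the accessor up by name.
            Method getter = split.getClass().getDeclaredMethod("getInputSplit");
            getter.setAccessible(true);
            split = getter.invoke(split);
        }
        return ((FileSplitStandIn) split).getPath();
    }

    public static void main(String[] args) throws Exception {
        Object plain = new FileSplitStandIn("crawl/part-0001.arc.gz");
        Object wrapped = Wrappers.wrap(new FileSplitStandIn("crawl/part-0002.warc.gz"));
        System.out.println(inputPath(plain));
        System.out.println(inputPath(wrapped));
    }
}
```

Keeping separate FindArcUrls and FindWarcUrls classes avoids this kind of lookup-by-name entirely, at the cost of some duplicated code.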

Details

Committed
Jeremy Wiebe <jwiebe@gmail.com> on Dec 10 2014, 15:54
Pushed
dportabella on Oct 19 2016, 16:29
Parents
R1473:65837e09aa3a: Added WARC support to UrlMappingMapReduceBuilder. It can now accept a path…
Branches
Unknown
Tags
Unknown

Event Timeline

Jeremy Wiebe <jwiebe@gmail.com> committed R1473:ec4285807f39: Added WARC support to analysis: (authored by Jeremy Wiebe <jwiebe@gmail.com>). Dec 10 2014, 15:54