History Graph
Commit | Author | Details | Committed
---|---|---|---
f972206db516 | lintool | Created warcbase-core module. | Jun 16 2016
eb4c0e313a67 | ianmilligan1 | fixed ArcTest | Apr 21 2016
fc0d11495cf1 | ianmilligan1 | renaming ExtractTopLevelDomain to ExtractDomain | Mar 29 2016
40025b169e93 | lintool | Add UDF for extracting stuff from tweets, Issue #210 | Mar 27 2016
b80c970f2e50 | lintool | Added test cases. | Mar 27 2016
2da8f0abb983 | lintool | Added ExtractImageLinks test cases. | Mar 27 2016
f69c7900f12e | Jeremy Wiebe | Added tests for generic ARC/WARC classes | Feb 18 2016
8a4c55019413 | Jeremy Wiebe | Added discardUrlPatterns | Feb 16 2016
232397a26bf1 | Jeremy Wiebe | keepUrlPatterns test | Feb 16 2016
a7b0e0b07682 | Jeremy Wiebe | Merge branch 'format' for ExtractDate (#154) and TupleFormatter | Feb 13 2016
4b8f0fb7b482 | Jeremy Wiebe | Use shapeless to flatten tuples of any arity | Feb 13 2016
1b6d83c4a7fe | Alice-Z | Add RemovePrefixWWW method | Dec 25 2015
4042cac0ad8b | Alice-Z | add method to filter date by component | Dec 25 2015
d1c3f5d49783 | Alice-Z | Cleanup and add comments | Dec 25 2015
1ec6e05d7540 | Alice-Z | Add Date utils to clarify date extraction | Dec 25 2015
c773fd242ff5 | Alice-Z | Add Formatter to un-nest tuples and print in tab-delimited format | Dec 25 2015
2adce498927d | Alice-Z | Refactor Record API (#189) | Dec 10 2015
90616dd7b3bd | Alice-Z | Fix WARecord getContentString | Dec 3 2015
83d4d7232405 | Alice-Z | Revision API to be more descriptive | Nov 25 2015
5e316b1bb129 | lintool | Killed RecordUtils, ExtractLinksAndText, JwatArcLoaderTest | Nov 25 2015
2e88c1b19afb | lintool | Slapped Apache License boilerplate -- now we're a *real* open-source project :) | Nov 25 2015
f5d8edf50506 | lintool | Killed all the Pig stuff. | Nov 24 2015
2dde7b822a5c | Alice-Z | Tiny fix: remove unneeded part of test | Nov 23 2015
4a3d31b8babf | Alice-Z | Port named entities extractor Pig script over to Spark as per issue #158 | Nov 23 2015
f9422c23efae | Alice-Z | ExtractEntities takes a classifier file path | Nov 23 2015
99583f793bc5 | Alice-Z | Revert to object version of NER3Classifier | Nov 23 2015
533b4152d534 | Alice-Z | Turn test off; classifier is too large to be included | Nov 21 2015
582b21adefe6 | Alice-Z | Pass classifier class to ExtractEntities UDF | Nov 21 2015
86733707e5a1 | Alice-Z | add test for ner3classifier | Nov 21 2015
14e521794754 | Alice-Z | Clean up, fix tests changed by new keepValidPages | Nov 21 2015
758288fb2bd9 | Alice-Z | Fix warcloader bug (issue #166) | Nov 19 2015
cc274ed73b60 | Alice-Z | Use Jackson JSON serializers to write to String | Nov 12 2015
e9a3965e2389 | Alice-Z | Clean up string formatting | Nov 12 2015
9f7fa9f26b54 | Alice-Z | Working extract entities with correct output string | Nov 12 2015
c553ef0d853f | Alice-Z | Refactor ExtractLinks to be called with the src url | Nov 11 2015
de66d390f4ae | Alice-Z | tiny formatting change | Nov 11 2015
2c2607867ef1 | Alice-Z | Add keepValidPages transformation and layer for counting in Spark | Nov 11 2015
e5668e3e3251 | Alice-Z | Rename tests for JUnitTestRunner | Nov 11 2015
9140519a9d50 | Alice-Z | Fix imports and enable tests on JUnitRunner | Nov 9 2015
3eb11b04e499 | Alice-Z | Commit clean-up | Nov 8 2015
8057c46945d0 | Alice-Z | Add Spark support | Nov 3 2015
151242d2d988 | lintool | Upgrade CDH; fixed broken tests due classpath conflict issue and Tika upgrade. | Jul 13 2015
9b8e85db50e3 | lintool | removed JWAT testcase. | Jun 7 2015
195f4af7474e | lintool | revamped tests. | Jun 7 2015
84ebe4f68bbd | lintool | Added test case for PigWarcLoader. | Jun 6 2015
13619567667f | lintool | Changed over to WacWarcInputFormatTest as the underlying InputFormat. | Jun 6 2015
a3a2cd7853c8 | lintool | tries to fix issues with relative links | May 13 2015
0a781f364480 | lintool | Added test case for ExtractLinks UDF. | May 6 2015
5b10497c2d53 | lintool | Updated UDF to handle relative paths (with source page). Added test case. | May 6 2015
0145d74ed35e | lintool | Refactoring to create method that extracts MIME from WARC response records. | Oct 22 2014
d491020c8287 | lintool | Merge branch 'pig' into warc | Oct 19 2014
410cfd81a069 | lintool | WARC-related Hadoop bindings. | Oct 19 2014
f4a249469cbc | lintool | Minor refactoring. | Oct 19 2014
0c52caed50cf | lintool | Pig ArcLoader exports its own ResourceSchema. | Oct 19 2014
3af097ea655d | lintool | Fixed test cases. | Oct 19 2014
4798a4314b6e | lintool | Figured out how to extract MIME type and date from WARC. | Aug 30 2014
f3516c7fd7f0 | lintool | Added test cases to try loading WARC records from a stream; back-ported same… | Aug 29 2014
28c5c007f4fd | lintool | Added simple test case. | Aug 28 2014
8e67d49d44b3 | lintool | WARC sample from https://archive.org/details/ExampleArcAndWarcFiles | Aug 28 2014
ccee8fd2204a | lintool | Implemented ArcRecordWritable. | Aug 23 2014
846ebb216100 | lintool | Minor refactoring. | Aug 23 2014
47f3c46c099d | lintool | Added Hadoop bindings for webarchive-commons ARC readers, demo, test cases. | Aug 22 2014
b07376aba3fe | lintool | Added/refactored test cases for JWAT. | Aug 22 2014
bf606c07d453 | lintool | Refactoring UrlUtil and related classes. | Aug 14 2014
93f6d42f4ff4 | lintool | Fixed compile and broken test issues. | Aug 12 2014
accd1978862d | lintool | Merge branch 'master' of github.com:cneud/warcbase into cneud-integration | Aug 12 2014
f3015cd7ba4b | lintool | issue #50 | Jun 18 2014
02c26d6d8ea4 | lintool | Started working on issue #46 cleanup of org.warcbase.data.Util | Jun 17 2014
a6be4375e0c7 | lintool | ExtractLinks using HBase appears to be working. | Jun 13 2014
60b827d1c174 | lintool | Refactored getIdRange method signature, add more test cases to UriMapping. | Jun 5 2014
7fc1b92d60a7 | Jeffyrao | fix issue 40 that UriMapping prefix search should return empty result when no… | Jun 5 2014
a8953616248c | lintool | Refactoring, added test case (currently broken). | Jun 4 2014
758948b463b9 | lintool | Light refactoring, fixed a few errors. | Mar 31 2014
71c68ecae0fa | Jeffyrao | add prefix search feature to UriMapping class; add test class for UriMapping… | Mar 31 2014
db4ebe9f4826 | pmd | Added a null pointer check and a more Pig friendly return value from the UDFs | Dec 19 2013
ba201c27e210 | pmd | Refactored the configuration of the magic lib into the Pig script. | Dec 19 2013
71b90c81859e | pmd | Improved the DetectMimeTypeTika Pig script. | Dec 19 2013
bedac9288080 | pmd | Corrected an error in the DetectMimeTypeMagic Pig script and the corresponding… | Dec 19 2013
5d312a3f80ec | pmd | Removed warnings | Dec 19 2013
b3fdd488fdcb | pmd | Refactored the DetectMimeType into two seperate methods: one for each detection… | Dec 19 2013
842012e0de5f | pmd | Corrected a comment | Dec 10 2013
e71f20141a36 | pmd | Changed unit test to match the change in ArcLoader that removed the filter for… | Dec 10 2013
10b02ec31780 | pmd | Changed unit test to match the change in ArcLoader that removed the filter for… | Dec 10 2013
b8f63da4d685 | pmd | Use the provided ARC file for the unit test. | Dec 9 2013
25ca4df65a50 | pmd | Improving the unit test of the DetectMimeType by using the two identification… | Dec 9 2013
49f50dd95873 | pmd | Add the magic lib UDF to the Pig script | Dec 9 2013
1a209cf5b04a | pmd | First version of a magic lib UDF | Dec 4 2013
c0e996ec2e61 | pmd | Added unit test for the language detection UDF. | Dec 3 2013
a4282d81d6e9 | lintool | Cleaned up Pig test cases, added JWAT test case. | Dec 2 2013
17bc9616a180 | graemon | added a pig unit test | Dec 2 2013
8885f5db4ded | graemon | added a pig unit test | Dec 2 2013