History Graph
Commit | Author | Details | Committed | |||
---|---|---|---|---|---|---|
d1c3f5d49783 | Alice-Z | Cleanup and add comments | Dec 25 2015 | |||
1ec6e05d7540 | Alice-Z | Add Date utils to clarify date extraction | Dec 25 2015 | |||
c773fd242ff5 | Alice-Z | Add Formatter to un-nest tuples and print in tab-delimited format | Dec 25 2015 | |||
0d78b9bb2b5b | Alice-Z | Use SerializableWritable wrapper for serialization | Dec 10 2015 | |||
042253b579cc | Alice-Z | Use SerializableWritable wrapper for serialization | Dec 10 2015 | |||
2adce498927d | Alice-Z | Refactor Record API (#189) | Dec 10 2015 | |||
3e9dceac5766 | Jeremy Wiebe | Fixed newBufferedWriter() call for JRE < 1.8. | Dec 9 2015 | |||
b9f4b0ffa199 | Jeremy Wiebe | Write directly to local file instead of HDFS | Dec 9 2015 | |||
76d82afa6c35 | Jeremy Wiebe | Added GDF export function to matchbox | Dec 8 2015 | |||
90616dd7b3bd | Alice-Z | Fix WARecord getContentString | Dec 3 2015 | |||
43b9b1f84ef0 | Alice-Z | Add example scripts for crawl statistics [Issue #182] | Dec 1 2015 | |||
ba91869cd370 | Alice-Z | Add example scripts for crawl statistics and social media links | Dec 1 2015 | |||
c44118522844 | Jeremy Wiebe | Added matchbox function to NER-classify and generate JSON for visualizer, per… | Nov 26 2015 | |||
32f3e4ba7d6c | Jeremy Wiebe | Modified structure of JSON output | Nov 26 2015 | |||
8e8f072aca5f | Alice-Z | Revision API to be more descriptive as in issue #179 | Nov 26 2015 | |||
58491a2b8a57 | Alice-Z | Add Apache license header | Nov 26 2015 | |||
183b31d74831 | Alice-Z | Merge branch 'refactor-api' of github.com:lintool/warcbase into refactor-api | Nov 25 2015 | |||
83d4d7232405 | Alice-Z | Revision API to be more descriptive | Nov 25 2015 | |||
15ad4bb50c83 | Alice-Z | Revision API to be more descriptive | Nov 25 2015 | |||
d0facdcd382f | Jeremy Wiebe | Make final output a single file containing single JSON array. | Nov 25 2015 | |||
5e316b1bb129 | lintool | Killed RecordUtils, ExtractLinksAndText, JwatArcLoaderTest | Nov 25 2015 | |||
0b0839bdbaa9 | lintool | Fixed CR/LF issues (i.e., DOS formatting). | Nov 25 2015 | |||
2e88c1b19afb | lintool | Slapped Apache License boilerplate -- now we're a *real* open-source project :) | Nov 25 2015 | |||
dd643f9fae5c | Jeremy Wiebe | Added jackson-databind to pom.xml | Nov 25 2015 | |||
3e634fa43920 | Jeremy Wiebe | Relocated file, converted to class. | Nov 24 2015 | |||
f5d8edf50506 | lintool | Killed all the Pig stuff. | Nov 24 2015 | |||
201cfba2d832 | Jeremy Wiebe | Fixed emptyString; deleted dupe line | Nov 23 2015 | |||
70c55839d0c3 | Jeremy Wiebe | Moved initialization of NER3Classifier into map closure; switched from map() to… | Nov 23 2015 | |||
4a3d31b8babf | Alice-Z | Port named entities extractor Pig script over to Spark as per issue #158 | Nov 23 2015 | |||
f9422c23efae | Alice-Z | ExtractEntities takes a classifier file path | Nov 23 2015 | |||
99583f793bc5 | Alice-Z | Revert to object version of NER3Classifier | Nov 23 2015 | |||
533b4152d534 | Alice-Z | Turn test off; classifier is too large to be included | Nov 21 2015 | |||
582b21adefe6 | Alice-Z | Pass classifier class to ExtractEntities UDF | Nov 21 2015 | |||
86733707e5a1 | Alice-Z | add test for ner3classifier | Nov 21 2015 | |||
14e521794754 | Alice-Z | Clean up, fix tests changed by new keepValidPages | Nov 21 2015 | |||
2fd98dbfdb67 | Alice-Z | Make keepValidPages smarter as in issue #163 | Nov 19 2015 | |||
758288fb2bd9 | Alice-Z | Fix warcloader bug (issue #166) | Nov 19 2015 | |||
cc274ed73b60 | Alice-Z | Use Jackson JSON serializers to write to String | Nov 12 2015 | |||
e9a3965e2389 | Alice-Z | Clean up string formatting | Nov 12 2015 | |||
d56d62cf1415 | Alice-Z | Merge branch 'ner3classifier' of github.com:lintool/warcbase into ner3classifier | Nov 12 2015 | |||
9f7fa9f26b54 | Alice-Z | Working extract entities with correct output string | Nov 12 2015 | |||
c9ccf559fafc | Alice-Z | WIP: port NER3 Classifier over to Java with example usage | Nov 12 2015 | |||
31269caa44b6 | Alice-Z | WIP: port NER3 Classifier over to Java with example usage | Nov 12 2015 | |||
c553ef0d853f | Alice-Z | Refactor ExtractLinks to be called with the src url | Nov 11 2015 | |||
229d274d6c13 | Alice-Z | Remove extract method as per referenced in issue #146 | Nov 11 2015 | |||
a599db1f45b9 | Alice-Z | add documentation | Nov 11 2015 | |||
1ee0455efdf6 | Alice-Z | remove extract methods | Nov 11 2015 | |||
2c2607867ef1 | Alice-Z | Add keepValidPages transformation and layer for counting in Spark | Nov 11 2015 | |||
e1be481cd782 | Alice-Z | Clean up extracting code, use pattern matching | Nov 10 2015 | |||
3957b2cce7fb | Alice-Z | Clean up enum to function mapping | Nov 9 2015 | |||
9140519a9d50 | Alice-Z | Fix imports and enable tests on JUnitRunner | Nov 9 2015 | |||
3eb11b04e499 | Alice-Z | Commit clean-up | Nov 8 2015 | |||
8057c46945d0 | Alice-Z | Add Spark support | Nov 3 2015 | |||
35e39da6e72d | Alice-Z | add extractCrawldateDomainUrlBody | Oct 23 2015 | |||
e5821091ecca | Alice-Z | update ArcRecords interface | Oct 21 2015 | |||
cfe76f600c7d | Alice-Z | extend RDD | Oct 15 2015 | |||
e77201892789 | Alice-Z | first commit | Oct 15 2015 | |||
258a159c796e | lintool | Fixed issue #137 Integrate with warc-hadoop-indexer for Shine | Jul 22 2015 | |||
17393c207934 | lintool | Complete revamped cmdline parameter handling. | Jul 21 2015 | |||
0be900c474ef | lintool | Simplified specification of configs. | Jul 21 2015 | |||
243004be2ded | lintool | Config for WARCIndexer. | Jul 21 2015 | |||
b520faa88283 | lintool | Refactoring to simplify code. | Jul 21 2015 | |||
7936dd172aae | lintool | Fixed: https://github.com/ukwa/webarchive-discovery/issues/64 | Jul 14 2015 | |||
38ac5fb5cba5 | lintool | Rename. | Jul 13 2015 | |||
a3eb65567484 | lintool | More code simplification and refactoring. | Jul 13 2015 | |||
46d7cef43ee3 | lintool | Refactoring to simplify code. | Jul 13 2015 | |||
befb3f9b39fb | lintool | Reformat code. | Jul 13 2015 | |||
04f416e56781 | lintool | Getting rid of external dependencies by checking in Solr configs. | Jul 13 2015 | |||
335d751b425c | lintool | Starting to incorporate Hadoop WARC indexing code. | Jul 13 2015 | |||
11dfe152d4cc | Jeremy Wiebe | Added ExtractBoilerpipeText UDF | Jun 30 2015 | |||
c9f27489b093 | lintool | Merge branch 'master' into pyspark | Jun 14 2015 | |||
51abf7e86696 | ianmilligan1 | testing fork | Jun 9 2015 | |||
7cb0c2e8d2cc | Jeremy Wiebe | added Python scripts dir | Jun 9 2015 | |||
d9b553c76ced | lintool | Removed JWAT; removed unneeded files in org.warcbase.analysis; switched… | Jun 7 2015 | |||
caf87fbecf45 | lintool | Added schema metadata in loader. | Jun 7 2015 | |||
1f9d853f38a2 | lintool | better trapping of errors. | Jun 7 2015 | |||
13619567667f | lintool | Changed over to WacWarcInputFormatTest as the underlying InputFormat. | Jun 6 2015 | |||
7c1ebde17fb4 | lintool | Better integration of ARC readers in pyspark. | Jun 4 2015 | |||
b220b0cc175a | lintool | Converter to extract all metadata. | May 27 2015 | |||
649f1674352b | lintool | Initial experiments with Python converters to use Hadoop InputFormat from… | May 27 2015 | |||
1c013a7b064f | Jeremy Wiebe | Added Stanford NER UDF | May 26 2015 | |||
eec01a41b18e | lintool | Cleanup. | May 24 2015 | |||
1c9b25027e73 | lintool | Wraps a try block around everything to catch all errors. | May 24 2015 | |||
9acc8ca74c0a | lintool | Merge branch 'master' into extract-pdf-udf | May 24 2015 | |||
a3a2cd7853c8 | lintool | tries to fix issues with relative links | May 13 2015 | |||
5b10497c2d53 | lintool | Updated UDF to handle relative paths (with source page). Added test case. | May 6 2015 | |||
fb95f3ae562e | lintool | UDF for extracting top-level domain from URL. | May 6 2015 | |||
f263769c86b7 | lintool | Added option to change MAX_CONTENT_SIZE in IngestFiles, Issues #112 | Dec 25 2014 | |||
52b9696fba87 | lintool | Fixed typo. | Dec 24 2014 | |||
5d471280e7db | lintool | Add ingestion option to select either Snappy or GZ compression, Issue #110 | Dec 24 2014 | |||
fb3edb4e40c7 | lintool | Renaming. | Dec 23 2014 | |||
f6caf0f6f226 | lintool | Fixed OOM issues. | Dec 23 2014 | |||
b1c6995aa034 | lintool | Fixed warnings. | Dec 23 2014 | |||
59949b3d7af4 | lintool | Fixed OOM errors. | Dec 23 2014 | |||
e6b21cc557d6 | lintool | Fixed Issue #109: OOM when running UrlMappingBuilder | Dec 23 2014 | |||
6acf52f320cf | lintool | Merge branch 'hbase-api-refactoring' into dev | Dec 22 2014 | |||
7f6d571424a9 | lintool | Changed compression back from GZ to Snappy. | Dec 21 2014 | |||
12a00be93209 | lintool | Fixed minor code formatting issues. | Dec 21 2014 | |||
5c2540099370 | lintool | Reformatting. | Dec 18 2014 | |||
dbaa917eb8b2 | Jeremy Wiebe | Updated regex for "Content-Type" (RFC2616 sec. 4.2 says HTTP header field names… | Dec 11 2014 |