History Graph
Commit | Author | Details | Committed
---|---|---|---
ba201c27e210 | pmd | Refactored the configuration of the magic lib into the Pig script. | Dec 19 2013 | |||
5d312a3f80ec | pmd | Removed warnings | Dec 19 2013 | |||
b3fdd488fdcb | pmd | Refactored the DetectMimeType into two separate methods: one for each detection… | Dec 19 2013 | |||
a80e87ca6f74 | milad621 | new hbase structure for ingest files | Dec 11 2013 | |||
d6c0cec7efb4 | pmd | Added a TODO comment | Dec 11 2013 | |||
259f0057f400 | pmd | Add identification engine as a parameter to the DetectMimeType UDF | Dec 9 2013 | |||
b69563d53c83 | pmd | Enable the ArcLoader to load all types of files | Dec 9 2013 | |||
2b49ec425597 | Jeffyrao | check text/html type and modify Jsoup.parse charset as ISO-8859-1 | Dec 8 2013 | |||
2e7e0cb81775 | Jeffyrao | modify UriMappingBuilder to read all files under given directory | Dec 7 2013 | |||
c4248d61b8b7 | lintool | Simple MapReduce program to count number of unique URLs. | Dec 7 2013 | |||
8e1cadb7e70e | Jeffyrao | Extract links and Lucene FST for URLs. | Dec 6 2013 | |||
bbee2e38f787 | milad621 | removed dead code | Dec 5 2013 | |||
1a209cf5b04a | pmd | First version of a magic lib UDF | Dec 4 2013 | |||
ccb9ea44f90a | cneud | use tika for mime type detection | Dec 3 2013 | |||
fbf902fcd066 | cneud | use tika for language detection | Dec 2 2013 | |||
eb848477370a | lintool | Added WarcLoader | Dec 2 2013 | |||
1c57edb9e225 | lintool | Added ExtractRawText UDF, tweaked ExtractLinks. | Dec 2 2013 | |||
1fd881ddf836 | lintool | Loader now materializes actual text, added ExtractLinks UDF. | Dec 2 2013 | |||
02317746e1b2 | lintool | Added simple Pig Loader for Arc files, returns (url, time, mime) currently. | Dec 2 2013 | |||
19735bbbbd86 | milad621 | Started a new branch to extract text from warcbase data and organize the data… | Dec 1 2013 | |||
944b138ee0d2 | milad621 | started a new branch to extract text from warcbase data and organize the data… | Dec 1 2013 | |||
7ee8dbb88c3c | lintool | Improved error checking for dates. | Nov 25 2013 | |||
7e76e19f2547 | milad621 | PrintAllUris add to appassembler which will output a urls.html file with all… | Nov 23 2013 | |||
22dd0757c01e | lintool | Tweaked browser; added MR programs for simple content analysis. | Nov 23 2013 | |||
b5b4e211ea05 | lintool | Hadoop InputFormats for ARC and WARC files + simple demos. | Nov 22 2013 | |||
bf51f6b75bd8 | milad621 | Code refactoring after pair coding. | Nov 18 2013 | |||
43d12364a70f | milad621 | Fixed java.lang.NegativeArraySizeException at org.apache.commons.io.output. | Nov 8 2013 | |||
c3e37d50e856 | milad621 | Created a new runnable to find a uri inside warc/arc files. | Nov 7 2013 | |||
33966c708354 | milad621 | Uses HTablePool instead of creating a new connection each time. | Nov 7 2013 | |||
64ef175db2fd | milad621 | Some code cleanup in servlet. DetectDuplicates fixed with new hbase table… | Nov 7 2013 | |||
bf1b3616e4c0 | milad621 | updated servlet. | Nov 7 2013 | |||
76949eacabd5 | milad621 | Added a separate class to manage HBase connection and addRecord | Nov 6 2013 | |||
01dff2a1f2a5 | milad621 | One runnable to process both arc and warc files in a folder. | Nov 5 2013 | |||
1354c68684d9 | milad621 | fixed some issues with the servlet. content and types might not follow each other | Oct 30 2013 | |||
e9e3d1c9d940 | milad621 | IngestWarcFiles fixed. Now it uses jwat-warc to ingest warcfiles to hbase. | Oct 30 2013 | |||
70c00addd99e | milad621 | Updated WarcBrowser to work with the new structure of hbase table (supports… | Oct 24 2013 | |||
e8f7af4176f8 | milad621 | IngestArc updated. | Oct 23 2013 | |||
735b77efcad4 | milad621 | IngestWarcFiles updated. Uses jwat and stores content type in htable | Oct 23 2013 | |||
b1bee7dc3a8e | milad621 | Arc Processing tools added. Not working with hbase yet. | Oct 17 2013 | |||
803daecd43ef | milad621 | No need for name arg | Oct 9 2013 | |||
b321544c8a1f | milad621 | url style fixed. Can capture table names from url and home page changed to http… | Oct 9 2013 | |||
4b7ec6fab139 | milad621 | URL style changed | Oct 2 2013 | |||
3770fab2cafe | milad621 | close button fixed | Sep 27 2013 | |||
127a533cf731 | milad621 | fixed a baseurl problem | Sep 24 2013 | |||
c27fb0fb72d9 | milad621 | A bug with swf files fixed | Sep 20 2013 | |||
dc20e4fc805f | milad621 | header (navigation bar?) added | Sep 17 2013 | |||
24cf7be01bea | milad621 | Some url bugs with senate data fixed. A few code refactoring done | Aug 17 2013 | |||
392b5d139973 | milad621 | Some urls fixed. House is OK now | Aug 16 2013 | |||
e7a88b5211bb | milad621 | urls | Aug 15 2013 | |||
55b82f3dcb61 | milad621 | urls | Aug 15 2013 | |||
142aa0c8a34f | milad621 | urls | Aug 15 2013 | |||
1331620ce670 | milad621 | urls | Aug 15 2013 | |||
ecb3d1dd7e6d | milad621 | urls | Aug 15 2013 | |||
bb3985aec41a | milad621 | urls | Aug 15 2013 | |||
9e729d1fc54f | milad621 | urls | Aug 15 2013 | |||
4f7ad0cbd161 | milad621 | urls | Aug 15 2013 | |||
45b62820e8ca | milad621 | urls | Aug 15 2013 | |||
d0d442b95fad | milad621 | urls | Aug 15 2013 | |||
b08e35a91dc0 | milad621 | urls | Aug 15 2013 | |||
24d59470fd5e | milad621 | urls | Aug 15 2013 | |||
8c2223aca070 | milad621 | urls | Aug 15 2013 | |||
211d4536f4c2 | milad621 | urls | Aug 15 2013 | |||
b293db10c799 | milad621 | urls for house fixed | Aug 14 2013 | |||
3a2aec9c5698 | lintool | Cleaned up analysis code. Removed dead code. Added README. | Aug 13 2013 | |||
4fffd3905e9e | lintool | Refactoring of browser; each archive is now stored in its own separate table… | Aug 12 2013 | |||
0f580edb6186 | lintool | Interim check-in, refactoring of ingestion program. | Aug 12 2013 | |||
17e566d63370 | milad621 | duplicate count | Aug 8 2013 | |||
52dec81548a6 | milad621 | duplicate count | Aug 7 2013 | |||
e5f69f95e7ea | milad621 | duplicate count | Aug 7 2013 | |||
ae13087a3ffc | milad621 | duplicate count | Aug 7 2013 | |||
efb7cebae1b6 | milad621 | duplicate count | Aug 7 2013 | |||
992379b830c3 | milad621 | duplicate count | Aug 7 2013 | |||
020c97770fe2 | milad621 | duplicate count | Aug 7 2013 | |||
42dbb7faf5bb | milad621 | duplicate count | Aug 7 2013 | |||
f33edee2d50f | milad621 | duplicate count | Aug 7 2013 | |||
94d4c056456e | milad621 | duplicate count | Aug 7 2013 | |||
44061ad84f9f | milad621 | duplicate count | Aug 7 2013 | |||
06cefdf9191f | milad621 | duplicate count | Aug 7 2013 | |||
a61852dfcea3 | milad621 | duplicate count | Aug 7 2013 | |||
c9ae5b9ff356 | milad621 | duplicate count | Aug 7 2013 | |||
c24dba8cb858 | milad621 | duplicate count | Aug 7 2013 | |||
5bb4160f39a6 | milad621 | duplicate count | Aug 7 2013 | |||
cd1c983fd0c9 | milad621 | duplicate count | Aug 7 2013 | |||
176c9170ee3f | milad621 | duplicate count | Aug 7 2013 | |||
ca2f7c1f0995 | milad621 | duplicate count | Aug 7 2013 | |||
5c4aad52d667 | milad621 | duplicate count | Aug 7 2013 | |||
b679deabebf7 | milad621 | duplicate count | Aug 7 2013 | |||
5adb7197f665 | milad621 | duplicate count | Aug 7 2013 | |||
36da127d3c9f | milad621 | duplicate count | Aug 7 2013 | |||
c8906c4bb619 | milad621 | duplicate count | Aug 7 2013 | |||
fcdd21ba1de4 | milad621 | duplicate count | Aug 7 2013 | |||
9f74d062a1f0 | milad621 | duplicate count | Aug 7 2013 | |||
f67d4cd2b850 | milad621 | Duplicates | Aug 6 2013 | |||
e9e904fbd89f | milad621 | Dashboard | Jul 30 2013 | |||
117d39b7fee0 | milad621 | Dashboard | Jul 30 2013 | |||
33f60647635e | milad621 | Dashboard | Jul 29 2013 | |||
4ccea04e83ab | milad621 | Dashboard | Jul 29 2013 | |||
5981632ecfbb | milad621 | Dashboard | Jul 29 2013 | |||
fcbee4abf787 | milad621 | Dashboard | Jul 29 2013 | |||
499af6b86454 | milad621 | Dashboard | Jul 29 2013 |