BibClassify: UI improvements and refactoring
- Refactoring changes:
- instead of trying to load components ourselves, we leave the job to Python; just make sure the correct Python path is set
- write into the etc directory if running in a standalone mode
- making import relative
- improving bibclassify_config.py to find config dir and test writability
- also refactoring, moving the acronym extraction call into the core function
- fixing inconsistent arguments: _output_marc was expecting a dictionary of kws, while _output_html and _output_text were expecting a list of kws
- marc is able to work with different types of keywords (stages possible); needs some workflow
- brought more api calls inside bibclassify engine
- Output changes:
- author keywords were not formatted properly for the output, now fixed
- display the count for the found core keywords - helpful for human decisions
- fieldcodes are now printed during output
- added bibclassify signature to the output
- simplified and refactored the output for html, txt, marc
- core keywords are now printed regardless of whether they fall within the limit range
- outputting core keywords, if they are part of the composites
- web interface now distinguishes between different types of keywords
- DESY keywords (or other fields) are now displayed alongside the other keywords
- moved the local css from bibclassify_config to webstyle/css/invenio.css
- (closes #149)
- Cache reloading and invalidation fixes:
- fixed the bug where the cache was always being rebuilt
- incremented version number
- invalidate generated docs (var/tmp/bibclassify/bibclassify_*.xml) by lazy-deleting the files if already saved in the database
- checked reuse of the same cache between threads - using thread.Lock()
- (closes #49)
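
The thread-safety item above can be illustrated with a minimal sketch of a lock-guarded cache. It assumes the standard-library `threading.Lock`; the names `get_cache` and `_build_cache` are hypothetical and are not the actual BibClassify functions.

```python
import threading

_CACHE = {}
_CACHE_LOCK = threading.Lock()


def _build_cache(taxonomy_name):
    """Hypothetical stand-in for the expensive cache build."""
    return {"taxonomy": taxonomy_name, "keywords": []}


def get_cache(taxonomy_name):
    """Return the cache for a taxonomy, building it at most once.

    The lock ensures that two threads asking for the same taxonomy
    do not both rebuild the cache.
    """
    with _CACHE_LOCK:
        if taxonomy_name not in _CACHE:
            _CACHE[taxonomy_name] = _build_cache(taxonomy_name)
        return _CACHE[taxonomy_name]
```

Holding the lock for the whole lookup is the simplest choice; it serializes readers, which is acceptable only if lookups are cheap compared to the build.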
- Making bibclassify more secure and other small changes:
- kw generation now goes through bibsched
- generation might be associated with certain user roles
- escape added to kw args received from web
- added docstrings to tests
- relative path resolution (automatic) to microtests
- limiting number of keywords output
- use task_low_level_submission to upload keywords
- removed mysqldb.escape_string
- ui messages improvements
- Workflow improvements and various prettifications:
- before extraction is scheduled, we now check many more conditions before allowing the run; appropriate messages are generated
- improved css, moved css to invenio
- make config use 6531_ syntax for main and other marc fields
- when exporting, use data from the database rather than from the generated files (if they still exist)
- when no weights are available, make tagcloud use minimal size, rather than maximum size fonts
- fixed bug for searching inside DB for taxonomy name
- improved local_config options, to override settings locally
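
The tag cloud fallback mentioned above (minimal font size when no weights are available) could look roughly like the sketch below. The function name, the percentage sizes, and the linear scaling are all assumptions made for illustration, not the actual web interface code.

```python
def tag_font_size(weight, max_weight, min_size=80, max_size=200):
    """Return a font size (in percent) for a tag, scaled linearly by weight.

    When no weight information is available, fall back to the minimal
    size rather than the maximal one.
    """
    if not weight or not max_weight:
        return min_size
    ratio = min(float(weight) / max_weight, 1.0)
    return int(round(min_size + ratio * (max_size - min_size)))
```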
- Bugs fixed:
- Fixing bug where KeywordToken.output() spits out label instead of prefLabel; it was a problem in instance.spires initialization
- option extract-acronyms was not honoured
- fixed bug where only single keywords were considered as parts of a composite kw, although a composite kw can also be made of other composite kws (now, if bibc reports missing kws, it means the concepts are defined neither as single nor as composite kws, i.e. there is an error in the taxonomy)
- fixed bibclassify_engine.py when generating marcxml
- prettified interface
- moving css to <style>
- prevent whitespace breaks for kws
- fixed bug: exchanged skw with ckw
- added hash method to the KeywordToken which is important to ensure kw objects are identified by their concept string
- text_extractor was never using pdftotext in one call
- bibclassify_ontology_reader.py:
- single extracted keywords were truncated before being sent to composite kw extraction; this resulted in fewer comp kws being found
- also added more checks for cache existence, writability, readability etc
- fixed an invisible slow-down caused by the rdflib graph when the store object is used and evaluated as a boolean (which is in fact not a fast operation)
- bibclassify_tests.py:
- a lot of new tests checking for cache (re)build-up
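
The hash method added to KeywordToken (see the bug fixes above) makes keyword objects compare and hash by their concept string. A minimal sketch of that idea follows; the `Keyword` class and its attributes are hypothetical and simplified, not the real KeywordToken.

```python
class Keyword(object):
    """Hypothetical, simplified keyword object identified by its concept."""

    def __init__(self, concept, label=None):
        self.concept = concept
        self.label = label or concept

    def __eq__(self, other):
        return isinstance(other, Keyword) and self.concept == other.concept

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # Hash on the concept string only, so two objects built for the
        # same concept land on the same dictionary key.
        return hash(self.concept)


# Two separately created objects for one concept share a dictionary entry.
matches = {}
for token in (Keyword("Higgs particle"), Keyword("Higgs particle", "Higgs")):
    matches[token] = matches.get(token, 0) + 1
```

Defining `__eq__` together with `__hash__` matters: a hash alone would still leave the two objects as distinct keys.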
- Various:
- make bibclassify_acronym_analyzer.py accept lower-case letters as acronyms; this is a somewhat controversial change, because an acronym is now any run of letters inside brackets, e.g. (dS), (DS), (Ds), that is preceded by an expansion following the "d-s" pattern, i.e. dynamic syntax (ds)
- updated bibclassify_tests.py to use pdfs from the demo collection
- inherit fieldcode information from components
- added info about the number of matches (for inherited core keywords)
- fixed a bug where composite keywords were not found because single kws were filtered/truncated before being passed to the comp-kw search
- moved the regex compilation (20% speed increase on bulk processing)
- fixed two old regex patterns that introduced a lot of noise into the normalized text
- (re)added --only-core-tags option
- changed config of the patterns to match multi-kws that span several lines
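
The two regex-related items above (compiling patterns outside the hot path, and matching multi-word keywords that span several lines) can be sketched together as follows. This is only an illustration under assumptions: `keyword_pattern` and the cache dictionary are hypothetical names, and the actual BibClassify patterns are configured differently.

```python
import re

# Patterns are compiled once per keyword and reused across documents,
# instead of being recompiled inside the per-document loop.
_PATTERN_CACHE = {}


def keyword_pattern(keyword):
    """Return a compiled pattern for a (possibly multi-word) keyword.

    Whitespace between the words matches any run of whitespace,
    including newlines, so a keyword broken across lines is still found.
    """
    pattern = _PATTERN_CACHE.get(keyword)
    if pattern is None:
        words = [re.escape(word) for word in keyword.split()]
        pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
        _PATTERN_CACHE[keyword] = pattern
    return pattern


text = "Searches for dark\nmatter continue; Dark  matter remains elusive."
count = len(keyword_pattern("dark matter").findall(text))
```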