diff --git a/TODO b/TODO index 22b9e3772..cccae1993 100644 --- a/TODO +++ b/TODO @@ -1,606 +1,702 @@ ;;; -*- mode: outline; coding: utf-8; outline-regexp: "[*\f]+"; -*- ;;; ;;; CDSware TODO and WISH list ;;; ========================== ;;; $Id$ ;;; ;;; ;;; This TODO and WISH list of the CDSware project is formatted in a ;;; way suitable for editing with Emacs outline mode; e.g. use C-c C-t ;;; to see only headings and hide the text, C-c C-a to make everything ;;; visible again, C-c C-d to hide one item, C-c C-s to show one item, ;;; etc. See Emacs help for more. * BibConvert ** BibConvert-20031217: MEDLINE, BibTeX examples Progress: BibTeX example done, MEDLINE pending * BibEdit ** BibEdit-20041103: text or GUI tool * BibFormat ** BibFormat-20041018: forall() On Mon, 18 Oct 2004, pelzer@hbz-nrw.de wrote: > why is it impossible to nest forall constructions in your format > definitions like: > > forall ($6531.a) > { > forall($710.a) > { > .... > } > .... > } Because of BibFormat's internal implementation of the forall() function. The BibFormat Admin guide mentions this limitation. The BibFormat module will probably be rewritten in Python some day, and we'll then try to lift this limitation. (It's nothing imminent though.) ** BibFormat-20041019: language-dependent behaviour > - the first one concerns the link() function in the format > definitions (BibFormat Admin). The language parameter "ln" is not > passed on in the URL. This means that when you click, for > example, on an author to run an author search, the URL does not > contain the "ln" parameter. Yes, BibFormat has a multilingualism problem, because it was developed before... we will have to add the language as a parameter in several places. ** BibFormat-20041103: foresee electronic issue publication Introduce a possibility to easily publish issues of electronic journals, like Weekly Bulletin. Access control needed. 
** BibFormat-20041104: hb_fly for CDSware distro ** BibFormat-20041105: make hd and other detailed formats on the fly for CDSware too Just as we do it for CERN. ** BibFormat-20041106: foresee script for downloading cfg, editing in Emacs, uploading cfg To ease the editing of BibFormat formats and friends, a little script is needed that would download cfg into a file that would be Emacs-editable and that one could upload back into DB. On Wed, 24 Mar 2004, Tibor Simko wrote: > When you play with BibFormat formats outside of BibFormat Web Admin, > there is an annoyance with the ``serialized'' column of the > ``flxFORMATS'' table. For example, to fill the demo site > automatically upon installation I have to resort to fancy SQL > statements like the one appended below[1]. > > Would it be easily possible if: > > 1. BibFormat core ``FormatRetriever.inc.php'' would not require > ``serialized'' but would check for its existence, and if this > column is NULL, then it would recreate it upon first load. IOW, > ``getSerializedFormat()'' should not fail but should call > something like ``setSerializedFormat($plaintextformat)''. (You > probably have such a function somewhere in the Web Admin; I have > not looked.) > > 2. We could even think of writing a more complex external CLI tool > that would provide bibformatcfgdump/bibformatcfgimport > functionality. People could then edit formats or link rules or > extraction rules or whatever outside of the BibFormat Web Admin, > in a text editor. It would enable for example an easy way to do > global changes like replacing ``100.a'' with ``110.a'' etc. > > What do you think? Would you have time for option 1, either to advise > me or to do it yourself? ** BibFormat-20041214: bibreformat format type lookup is wrong When I launch bibreformat -oHD and the bibfmt table already contains an HB format for a record, the record is not processed. BibReformat should look for HD, not HB, when it is deciding whether to process records or not. 
+** BibFormat-20050527: better display of photo galleries + +> When you look at photos in CDS it would be useful to be able to +> navigate simply (forward and back) when you are looking at the full +> size (or even half-size) versions. Take the case of the relay race; +> there are 16 photos for this 'event' +> (http://cdsweb.cern.ch/search.py?of=hd&f=970__a&p=000035128MMD +> _06.jpg> ). It would be nice just to browse the 144 dpi versions +> one-by-one rather than having to go back every time to the page of +> thumbnails. + +Yes, it would indeed be interesting. For the time being we only point +to JPEGs sitting on another server, so we cannot easily add +next/previous links there. But for the future we can rework the +formats in order to allow it... I've put it into our TODO list. + * BibIndex ** BibIndex-20031217: make our own ACC indexes Study phrase index generation from XML MARC. Related to bibXXx table abolition. Progress: table structure prepared for v0.3.0 ** BibIndex-20041112: index conference title with individual contribution records When LKR is defined for an article, lookup the conference title and index it together with the article metadata, for the title and the global indexes. (Beware of phrase indexes, modification times of the conference record WRT contribution records, etc.) ** BibIndex-20041113: --reindex > I suspect my fix of course. Is there a safe way to start the indexes > afresh ? dropping the content of all the idx tables ? Yes, exactly. Your fix wasn't sufficient, it would have been necessary to do similar things in more places... at the expense of speed of the searching and indexing engine. Which is why it's better to reindex all records from scratch. You can use the following technique: $ echo "TRUNCATE idxWORD01F;" | /path/to/cdsware/bin/dbexec $ echo "TRUNCATE idxWORD01R;" | /path/to/cdsware/bin/dbexec $ echo "TRUNCATE idxWORD02F;" | /path/to/cdsware/bin/dbexec $ echo "TRUNCATE idxWORD02R;" | /path/to/cdsware/bin/dbexec [...] 
$ echo "TRUNCATE idxWORD10F;" | /path/to/cdsware/bin/dbexec $ echo "TRUNCATE idxWORD10R;" | /path/to/cdsware/bin/dbexec $ echo "UPDATE idxINDEX SET last_updated='0000-00-00 00:00:00';" | /path/to/cdsware/bin/dbexec $ /path/to/cdsware/bin/bibindex We should invent a nice option to ``bibindex'' that would take care of all these steps. ** BibIndex-20041114: index into tmp table When reindexing everything from scratch, don't wipe out the existing index but rather build the new index in a temporary table and copy it over the current index at the end of the process. This lets end users keep seeing the existing index while the new one is being built from scratch. * BibRank ** BibRank20041103: rebalancing should read old records When the bibrank -R option is used, no attention should be paid to the last modification dates of either the records or rnkMETHODDATA and friends; rather, the ranking indexation should go into empty tables. ** BibRank-20050504: similar records in the demo installation give traceback gives: [' File "/log/cdsware-DEMOPLUS/lib/python/cdsware/bibrank_record_sorter.py", line 230, in rank_records\n result = find_similar(rank_method_code, pattern[0][6:], hitset, rank_limit_relevance, verbose)\n', ' File "/log/cdsware-DEMOPLUS/lib/python/cdsware/bibrank_record_sorter.py", line 381, in find_similar\n if len(tf_values) <= methods[rank_method_code]["max_nr_words_lower"] or (len(term_recs)>= methods[rank_method_code]["min_nr_words_docs"] and (((float(len(term_recs)) / float(methods[rank_method_code]["col_size"])) <= methods[rank_method_code]["max_word_occurence"]) and ((float(len(term_recs)) / float(methods[rank_method_code]["col_size"]))>= methods[rank_method_code]["min_word_occurence"]))): #too complicated...something must be done\n'] * BibSched ** BibSched-20031217: enable multiple hosts ** BibSched-20040302: task sleeping sometimes cannot be done directly Sometimes when you try to put a BibSched task to sleep, you cannot do it because MySQL is active for the
task elsewhere: _mysql_exceptions.ProgrammingError: (2014, "Commands out of sync;You can't run this command now") ** BibSched-20040303: task numbering > 2) can you prefix the task number with 0, e.g. _task_0088.err We cannot foretell how many zeros to put there... it depends on the total number of jobs. (Maybe we can put some sufficiently large number of leading zeros.) ** BibSched-20040304: start/stop > Another patch is for modules/bibsched/bin/bibsched.wml: "ps -C %s o > '%%p%%a'" does not exist on FreeBSD, and I have replaced it by "ps > -o pid,command | grep %s" We'll rewrite the offending part. We've actually been thinking about replacing the bibsched daemon behaviour with a more traditional ``apachectl start/stop'' kind of approach. ** BibSched-20040305: ERROR task queue policy > Note that the BibSched daemon automatic mode stops as soon as some > of the tasks ends with an error. It is therefore a good idea to > inspect BibSched queue from time to time. This can be done by > running the BibSched command-line admin interface > > Wouldn't it be better that it continued in auto mode and issued a > warning (by e-mail) to the admin? that way it would not be necessary > to check on it from time to time. Indeed. The original workflow assumed a chain of actions that we wanted to stop and manually fix as soon as a problem appeared. I think we can safely change this policy now. ** BibSched-20041216: memory efficiency of the daemon for big schTASK tables When there are a lot of DONE tasks in the schTASK table, ``bibsched -d'' eats up a lot of memory during the execution. It shouldn't. * BibUpload ** BibUpload-20040116: must use run_sql() to avoid connection dropping problems ** BibUpload-20041103: table bibxEP was wanted at some point During a recent demo, when a new submission type was created, the system at some point looked for a bibxEP table. Check tag creation rules. 
** BibUpload-20050512: XML MARC not stored when bibupload -c is used XML MARC not stored in bibfmt when bibupload -c is used. * Miscellaneous ** Miscellaneous-20040315: Personalization part is not I18N-ized yet The Personalization part is not I18N-ized yet and there are not very many personalization options. We plan to expand it at some point in the future, like a possibility to select default language, default number of hits per page, default sorting, etc. ** Miscellaneous-20041129: introduce several record modification times (bib/ref/pdf) Need to distinguish between several modification times: metadata modif, reference modif, fulltext modif. Touches BibUpload/BibReference/BibIndex and friends. ** Miscellaneous-20041130: investigate usage of SQLRelay ** Miscellaneous-20041201: INSTALL file Add comments on PHP and Python linking to the same MySQL library. (e.g. people should not use PHP internal MySQL library). ** Miscellaneous-20041213: version numbering CDSware/0.3.3.20040929 bibindex/1.12 Each module should have its own version numbering, to easily track what changed. For example: $ /soft/cdsware-PCDH23/bin/bibindex -V CDSware/0.3.3 bibindex/1.2 $ /soft/cdsware-PCDH23/bin/bibformat -V CDSware/0.3.3 bibformat/2.3 After a new release: $ /soft/cdsware-PCDH23/bin/bibindex -V CDSware/0.3.4 bibindex/1.8 $ /soft/cdsware-PCDH23/bin/bibformat -V CDSware/0.3.4 bibformat/2.3 indicating that bibformat didn't change while bibindex did a lot in between the two releases. ** Miscellaneous-20041217: INSTALL file upgrade instructions Updated Wed May 12 12:00:38 2004 - update sql targets, plus release announcements ** Miscellaneous-20041218: test suite Make version testing there too. ** Miscellaneous-20041219: backup script shutdown DB, put warning message, hotbackup tables, start DB, remove warning message ** Miscellaneous-20041220: MySQLdb 1.0.0 API has changed for BLOBs/arrays. Should adapt our interfaces. 
** Miscellaneous-20041221: INSTALL file and root/non-root > (o) You forgot to mention that many steps are to be done as > root. Some are obvious, others are not. But for clean install file > it should be stated, especially when you have before (using the > "sudo" shortcut) - make install (obviously) - make XXX-demo-XXX > (most likely the /var/www/ directly you propose is not writable for > all users!) Actually, most of the steps shouldn't be done as root. The /var/www was listed only for illustration purposes. I agree that we may describe the installation process in a more howto manner, e.g. how to create first a special ``cdsware'' user that will be used later to run the system, or how to share permissions with the Apache user via groups, etc. The INSTALL file is silent on this now, each site has their own preferences. I've added a task to our TODO file to improve this. ** Miscellaneous-20050413: prettify config.py The config variables in config.py are dirty, as they are badly named and use different naming styles. Harmonize them to a schema like cfg_general_for, cfg_webaccess_bar, etc. * OAI ** OAI-20040119: RTdata dir should be created during `make install` ? Mon Jan 19 10:41:18 2004 ** OAI-20040928: provenance information On Tue, 28 Sep 2004, pelzer@hbz-nrw.de wrote: > have a question about your oai_repository.py: > > do you plan for OAI the output of provenance information in the > "about" part of a record? e.g. > http://www.openarchives.org/OAI/2.0/guidelines-provenance.htm ** OAI-20041004: periodical harvesting On Mon, 4 Oct 2004, pelzer@hbz-nrw.de wrote: > reply of martin. my new question: are you ready with OAI data > harvestor? do you have any experiences with periodical harvesting? > what do you do with doublets? For the time being we only provide command-line `bibharvest' tool without any periodical harvesting admin facility. We haven't had time to develop BibHarvest Admin yet. 
* WebAccess ** WebAccess-20041103: restriction by IP ** WebAccess-20050222: switching CFG_ACCESS_CONTROL_LEVEL_ACCOUNTS deactivates accounts Switching CFG_ACCESS_CONTROL_LEVEL_ACCOUNTS from 0 to 3 deactivates the accounts. How to fix the problem: $ echo 'update user set note=1 where id=1' | //bin/dbexec The reason is that you used an access level other than 3 before, and when you switched to 3 the system considered all existing accounts inactive, including the superuser one. We are going to fix this. * WebAlert ** WebAlert-20041124: nice and simple interface to set up an alert The form we used to have to easily create alerts is gone. The only way now is via search history. Put it back. See also . ** WebAlert-20041103: manage alerts for a mailing list Wed Nov 3 11:46:24 2004 Instead of having to create an account with the email address of a mailing list (in order to be able to send alerts to that mailing list), introduce a possibility to define alert mailing list management by an individual user. +** WebAlert-20050812: Similar records should not be in the email body +Fri Aug 12 15:46:59 2005 + +When sending email with records found, the body also contains a link to +find Similar records. It is not necessary there and should be hidden. + * WebBasket ** WebBasket-20041103: output formats in XML etc Properly support many output formats for a basket, quite like the search engine does. * WebSearch ** WebSearch-20031217: cross-searching of various CDSware installations ** WebSearch-20040121: when searching for ``title: goo'' the space should be ignored ** WebSearch-20040622: of=nn for the search engine On Tue, 22 Jun 2004, pelzer@hbz-nrw.de wrote: >> If this is not feasible (e.g. huuuge result sets), then we may >> invent another output behaviour, e.g. ``of=nn'' that would return >> you only the number of hits, that is ``12'' for the example above. > > another output behaviour would be the best solution. 
don't you think it's difficult for a user to search with xml output. the user don't see in xml - "mode", how many records are found. i think, it's helpful to write the hit number at the beginning of the first xml data record. what do you think about this? ** WebSearch-20041015: multiple search logs On Fri, 15 Oct 2004, Frederic Gobry wrote: > For the file log, wouldn't it be useful to keep a track of queries > with no matches? (to discover systematic errors or problems) The > current implementation seems to discard these queries. Yes, it would. A log of slow queries as well. I'll add them as new log files before the next release. ** WebSearch-20041103: indicate when cfg_max_recID is going to be exhausted ** WebSearch-20041213: treat stemming properly ** WebSearch-20041214: treat stopword search properly ** WebSearch-20041215: introduce possibility to search for basketid:333 ** WebSearch-20041216: introduce recommended terms lookup ** WebSearch-20041217: introduce `advertized records' like Google ** WebSearch-20041218: introduce NEAR operator For example, we could approximate NEAR by doing a regexp search for two words less than 20 characters apart. ** WebSearch-20041219: put search cache back ** WebSearch-20041220: add RSS output for recent additions to the collections. ** WebSearch-20041221: collection cache > URI: > http://cdsweb.cern.ch/search.py?sc=1&ln=en&p=slow+ejection&f=title&action=Search+&cc=Articles+%26+Preprints&c=Published+Articles&c=Preprints&c=Theses&c=Reports&c=CERN+Internal+Notes > Time: 24/Jun/2004:18:16:53 +0200 Browser: Mozilla/4.0 (compatible; > MSIE 6.0; Windows NT 5.0) Client: 137.138.169.154 The problem is connected to temporary cache. If it happens again, please just try to reload the page after a while. We'll fix the problem to prevent it from happening. ** WebSearch-20041222: safer wildcard treatment > okay, just began to wonder when CERN* never returned an answer :) Yup. 
I wanted to plug a generic timeouter into the whole search engine to make sure that queries finish within 10 seconds or so. But this is not done yet. At the moment, the wildcards are simply refused for words with fewer than three letters, and accepted for longer words. But this does not work well for words like `CERN'. While waiting for that generic timeouter, I should rather check how many indexed terms are returned by a wildcard word, and refuse to take the wildcard into account in case of e.g. more than 20 terms or so... > Looks like you are sending me 'terror*', and not each word that > includes in 'terror*' Currently `cern*' could lead to hundreds of thousands of words, so it's hard to . I'll rewrite the wildcard handling part in order to retain cases with <200 words, say, and then I'll pass you the full list. ** WebSearch-20050110: clearer output messages In phrases like ``Search term dark mass inside title index did not match any record'' we should distinguish word searches from phrase searches more clearly for the end user. ** WebSearch-20050111: investigate doing partial phrase match when exact failed Investigate doing partial phrase match when exact phrase match failed, exactly as we were doing some years ago. ** WebSearch-20050207: better no match messages for phrase queries Whenever a search using '...' gives zero, your diagnostic messages all have % in place of the '. This is extremely confusing. Surely it would be a simple matter to display the diagnostic in the form the query was typed in? ** WebSearch-20050208: wau=ellis -> author:ellis Yes, the equal sign has a special meaning in order to translate our old syntax such as ``wau=ellis'' into the new one ``author:ellis''. As you noticed, currently the search engine only blindly replaces ``='' with ``:'' everywhere, which is no good. I'll make it replace the equal sign only in ``wau='', ``wti='', and similar legacy prefixes. 
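A minimal sketch of the fix proposed in WebSearch-20050208, assuming a small table of legacy prefixes. Only the ``wau'' to ``author'' mapping is stated in the note above; the ``wti'' to ``title'' mapping and the function name translate_legacy_syntax are hypothetical, chosen for illustration, and this is not the actual search engine code.

```python
import re

# Legacy prefix -> new field name. Only ``wau'' -> ``author'' comes from
# the TODO item; ``wti'' -> ``title'' is an assumption for illustration.
LEGACY_PREFIXES = {"wau": "author", "wti": "title"}

# Match a known legacy prefix immediately followed by ``=''. The leading
# \b keeps the prefix from matching inside a longer word.
_LEGACY_RE = re.compile(r"\b(%s)=" % "|".join(sorted(LEGACY_PREFIXES)))


def translate_legacy_syntax(pattern):
    """Rewrite ``wau=ellis'' as ``author:ellis'', leaving any other ``='' alone."""
    return _LEGACY_RE.sub(lambda m: LEGACY_PREFIXES[m.group(1)] + ":", pattern)
```

With this approach an equal sign that is part of the search term itself, as in ``title:E=mc2'', is left untouched instead of being turned into a colon.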
** WebSearch-20050415: search engine sort options should be collection-dependent > I just wondered: the admin UI websearch treats the sort options as > collection-dependent. However, in the search_engine, it seems that > they are not. > > Do I need to take another coffee break? (say yes, please :-)) No, you don't. On the search interface pages (and hence in the WebSearch Admin interface too) the search fields, sort options, output formats, etc are rightly collection-dependent. On the search results pages the current collection choice (cc, the one from which the search was initiated) should be respected too and the search engine should display only those search fields, sort options, output formats, etc that are associated with it. This is indeed not implemented yet. Put into TODO as WebSearch-20050415. +** WebSearch-20050617: sorting limit behaviour + +> b) Currently an attempt to sort a set of more than 1500 records +> gives a warning message and reverts to the default order "latest +> first". I think it would be much better to cut off the set at 1500 +> records by system number (take highest 1500 system numbers) and to +> sort these 1500 records as requested. This could be much more useful +> than just refusing the sort. + +Yes, it could be interesting to offer this option too. Thanks for the +suggestion, I've added it to our TODO list. + +** WebSearch-20050624: search in nonexisting access index + +On Fri, 17 Jun 2005, David Dallman wrote: +> What is being searched for in the "not773" case? + +You can append ``verbose=1'' argument to the URL to get some internal +information. For example: + +1. + + gives [['|', 'a2004', '916__y', 'a'], ['-', 'a-->zz', '773__p', 'a']] + +and + +2. + + gives [['|', 'a2004', '916__y', 'a'], ['+', 'a-->zz', 'not773__p', 'a']] + +The tuples should be interpreted as follows: + + [boolean_operator, pattern, tag, search_type] + +where search_type a=access file, w=word file, etc. 
+ +This means that the search is done in the ``not773__p'' field that +obviously doesn't exist. So the search engine ignores it and searches +by default in the ``author'' field. (I cannot recall why the +``author'' index was chosen for this. We have better choices, like +returning an error for example.) + +Note also that a->zz should be written with one dash only. So you +were searching for authors with names between a- and zz. Note: + + a-->zz ... 646,515 hits + a->zz .... 646,530 hits + 0->zzzz .. 646,548 hits + +I think I'll just return an empty result in case the `chosen' tag +(not773__p) does not exist. It's good that you caught this! + * WebSession ** WebSession-20041102: detect cookies and inform user if not available upon login We should detect whether cookies are disabled, and print a message on the login page. Otherwise the user types valid access credentials but stays a guest, not knowing what went wrong. See e.g.: On Tue, 2 Nov 2004, RAMSTEIN Beatrice wrote: > I was working with Konqueror (KDE navigator). I tried now with > netscape, which is in fact my usual navigator and it works. It still > doesn't work with Konqueror, but it doesn't matter, since I usually > use Netscape anyway. Good. Note that Konqueror works perfectly fine for me. I guess that you have probably configured it not to accept cookies, which is why our session ID is rejected on your end so that user authentication cannot work. Please try to enable cookies for our domain and Konqueror should start to work just fine. * WebSubmit ** WebSubmit-20041103: MBI login link uses bad `referer' argument ** WebSubmit-20041104: call elements `title' not `TI' ** WebSubmit-20041105: traceback when `brique' element was edited When `brique' element was edited during a recent demo, a Python traceback was obtained. ** WebSubmit-20041106: admin interface should use [?] links WebSubmit Admin should use [?] links to point to help pages, e.g. to the list of available functions etc. 
** WebSubmit-20041107: Slovak language abbreviation is not `slo', that is Slovene ** WebSubmit-20041108: MBI for hep-th/00000 looked for hep-th_00000 ** WebSubmit-20041109: publiline report number link gave traceback publiline report number link gave Python traceback during recent demo. ** WebSubmit-20041110: `your approvals' link not needed everywhere On the personal account page, the Your Approvals link should not be displayed if I am not a referee for any document. Verify! ** WebSubmit-20041111: integrate submit new record / submit new file Do not separate out submitting new record bibliographic information and new fulltext file, but rather merge the latter into the former as page N. See also ``End submission'' and ``Finish submission'' problems. MBI and SRV button names should be user-friendly. ** WebSubmit-20041112: after adding PCV action, php mysql bad result After PCV action was added, php mysql bad result was obtained for some page. ** WebSubmit-20041113: create icons for submitted videos ** WebSubmit-20041114: delete unwanted fields > A quick question about the submission interface: when one > clears the content of a field (of text input type, for example) that > was previously filled in during the submission of a document, the > data of this field are not deleted. They are still > visible on the server. Is this a bug? Thanks for your reply. Yes, it does seem to be a problem. One cannot delete a field by doing a ``correct'' on an empty field (say, one without subfields), because BibUpload will ignore empty fields. We'll look into it and consider a suitable protocol for this. Meanwhile, what you can do is a complete ``replace'' of the whole record, as follows: # download the record recID=123: $ wget -O z_z.xml 'http://pcdh23.cern.ch/search.py?recid=123&of=xm' # edit the record and remove the unwanted field: $ vi z_z.xml # submit the record in replace mode: $ bibupload -r z_z.xml which will do the job. 
+** WebSubmit-20050628: sending emails not UTF-8 clean + +> it tries to use outgoing address "Kovárna VIVA Zlín, spol. s. +> r. o. " with all unicode +> characters. This string is passed to MTA, which rejects it, because +> headers must be in 7-bit ascii. + +Indeed, we should use quoted printables or something. Added to our +TODO list. + +** WebSubmit-20050713: bad To: header in approval emails + + From: Atlantis Institute of Fictive Science Submission Engine + Subject: Request for approval of TEST-PREPRINT-2005-001 + To: + Date: Wed, 13 Jul 2005 10:52:12 +0200 (CEST) + +Note the empty To: header field. It looks like the approval emails do +not properly check stuff. Can you please have a look at it when you +have time? + * End of file ;;; End of file.