[Soks] Searching attached files

Copied from Email on Fri, 4 Mar 2005 08:30:09 +0000 from Tom Counsell (tamc2@cam.ac.uk)

On 4 Mar 2005, at 05:57, Bil Kleb wrote:
> So I now have pdftotext, catdoc, and xls2csv text conversion
> scripts available for PDF, Word, and Excel files.  How would
> I hook the existing search capability up to them to handle
> attached files?  (I'm stumped.)

OK, as I mentioned in my first email on the topic, I reckon this might
be a couple of days' work. Getting a separate search engine like
webglimpse going might be quicker and, in the long run, more powerful.

But anyway, to have a go at a brute-force search, I would:
1) Write something in lib/soks-servlet.rb/WikiServlet#doUpload that,
when an attachment is uploaded, runs it through the appropriate script
to convert it into text and writes that text to disk, either in
attachments or in content.
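
A minimal sketch of the conversion part of step 1, not tested against Soks itself. The names TEXT_CONVERTERS, text_version_path, converter_command and write_text_version are hypothetical, as is the choice to store the text next to the attachment with a ".txt" suffix. It assumes pdftotext, catdoc and xls2csv are on the PATH and write plain text to standard output when called as shown; check the flags against your installed versions. How this gets called from WikiServlet#doUpload is left open.

```ruby
require 'open3'

# Hypothetical table: file extension => builder for the command line of a
# converter that prints plain text on standard output.
TEXT_CONVERTERS = {
  '.pdf' => lambda { |path| ['pdftotext', path, '-'] }, # '-' means stdout
  '.doc' => lambda { |path| ['catdoc', path] },
  '.xls' => lambda { |path| ['xls2csv', path] }
}

# Hypothetical convention: the text version sits beside the attachment.
def text_version_path(attachment_path)
  attachment_path + '.txt'
end

# Returns the converter command as an argument array, or nil when no
# converter is registered for this file extension.
def converter_command(attachment_path)
  builder = TEXT_CONVERTERS[File.extname(attachment_path).downcase]
  builder && builder.call(attachment_path)
end

# Runs the converter and writes its output to disk. Returns the path of
# the text version, or nil if nothing could be converted.
def write_text_version(attachment_path)
  command = converter_command(attachment_path)
  return nil unless command
  text, status = Open3.capture2(*command)
  return nil unless status.success?
  File.write(text_version_path(attachment_path), text)
  text_version_path(attachment_path)
rescue Errno::ENOENT
  nil # converter not installed, or the target directory is missing
end
```

Passing the command as an argument array avoids going through a shell, so attachment names with spaces or quotes need no escaping.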

2) Modify lib/soks-view.rb/View#find(pagename), replacing this:

text_results = @wiki.select { |name, page| page.content =~ search_term }

with something like

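text_results = @wiki.select { |name, page|
    case page
    when UploadPage
        # Search the converted text version, if one has been written.
        # path_to_text_version is a helper that still needs writing.
        text_path = path_to_text_version( page )
        File.exist?( text_path ) && File.read( text_path ) =~ search_term
    else
        page.content =~ search_term
    end
}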

3) If that works, it will probably be pretty slow, so I would think
about caching the text versions of the pages (which might take up quite
a lot of memory, I guess).
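
One way the caching in step 3 might look, as a rough sketch only. TextVersionCache is a hypothetical class, not part of Soks. It assumes the text versions are ordinary files on disk, keeps each one in memory after the first read, and re-reads a file only when its modification time changes.

```ruby
# Hypothetical in-memory cache for the text versions of attachments.
class TextVersionCache
  def initialize
    @entries = {} # path => [mtime, text]
  end

  # Returns the text stored at path, or nil if the file does not exist.
  # The file is read from disk only on first use or after it changes.
  def text_for(path)
    return nil unless File.exist?(path)
    mtime = File.mtime(path)
    cached = @entries[path]
    return cached[1] if cached && cached[0] == mtime
    text = File.read(path)
    @entries[path] = [mtime, text]
    text
  end

  # Number of text versions currently held in memory.
  def size
    @entries.size
  end
end
```

In the search block, cache.text_for( text_path ) =~ search_term would then stand in for reading the file on every query. Memory use grows with the total size of the cached text, so a large wiki may need a size limit on top of this.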

Again, this code is from memory. I'm sorry I can't work on this today
and therefore provide you with running code. I will take a look over
the weekend, so if you want to send me anything you have hacked
together at the end of the day, I'll do what I can to improve it for
Monday (although this is perhaps too late for your current deadline?
In which case, sorry...)

Tom