[Soks] Searching attached files
Copied from Email on Fri, 4 Mar 2005 08:30:09 +0000 from Tom Counsell (tamc2@cam.ac.uk)
On 4 Mar 2005, at 05:57, Bil Kleb wrote:
> So I now have pdftotext, catdoc, and xls2csv text conversion
> scripts available for PDF, Word, and Excel files. How would
> I hook the existing search capability up to them to handle
> attached files? (I'm stumped.)
OK, as I mentioned in my first email on the topic, I reckon this might
be a couple of days' work, and getting a separate search engine like
webglimpse going might be quicker and, in the long run, more powerful.
But anyway, to have a go at a brute-force search I would:
1) Write something in
lib/soks-servlet.rb/WikiServlet#doUpload that, when a page is uploaded,
runs it through the appropriate script to convert it into text and
writes that to disk either in attachments or in content.
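A minimal sketch of step 1, not tested against Soks itself. The tool names come from Bil's message; the `CONVERTERS` table, `converter_command`, `text_version_path` and `write_text_version` are hypothetical names, and the sketch assumes each tool writes its text to standard output when invoked as shown. How this would be called from WikiServlet#doUpload is left open.

```ruby
require 'open3'

# Hypothetical table: file extension => command line that prints plain text.
CONVERTERS = {
  '.pdf' => ->(src) { ['pdftotext', src, '-'] }, # '-' sends the text to stdout
  '.doc' => ->(src) { ['catdoc', src] },
  '.xls' => ->(src) { ['xls2csv', src] }
}.freeze

# Returns the converter command for an attachment, or nil if none applies.
def converter_command(path)
  builder = CONVERTERS[File.extname(path).downcase]
  builder && builder.call(path)
end

# Assumption: the text version is stored next to the attachment.
def text_version_path(path)
  "#{path}.txt"
end

# Converts the attachment and writes the text version to disk.
# Returns the path of the text version, or nil if nothing was written.
def write_text_version(path)
  command = converter_command(path)
  return nil unless command

  text, status = Open3.capture2(*command)
  return nil unless status.success?

  File.write(text_version_path(path), text)
  text_version_path(path)
rescue Errno::ENOENT
  nil # the converter is not installed
end
```

Passing the command as an array avoids shell quoting problems with attachment names that contain spaces.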
2) Modify lib/soks-view.rb/View#find(pagename),
replacing this:
    text_results = @wiki.select { |name, page| page.content =~ search_term }
with something like:
    text_results = @wiki.select { |name, page|
      case page
      when UploadPage
        # Search the converted text, if step 1 wrote one for this upload.
        # path_to_text_version is a helper that still has to be written.
        text_path = path_to_text_version( page )
        File.read( text_path ) =~ search_term if File.exist?( text_path )
      else
        page.content =~ search_term
      end
    }
3) If that works, it will probably be pretty slow, so I would think
about caching the text versions of the pages (which might take up quite
a lot of memory, I guess).
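One way the caching in step 3 might look, as a sketch only: the `TextCache` class and its `fetch` method are hypothetical names, not part of Soks. It keeps each text version in memory and re-reads the file only when its modification time changes, so the memory cost is roughly the total size of the converted text.

```ruby
# Hypothetical in-memory cache of converted attachment text.
class TextCache
  def initialize
    @entries = {} # path => [mtime, text]
  end

  # Returns the text stored at path, reading from disk only when the file
  # is new to the cache or has changed. Returns nil if the file is missing.
  def fetch(path)
    return nil unless File.exist?(path)

    mtime = File.mtime(path)
    cached = @entries[path]
    return cached[1] if cached && cached[0] == mtime

    text = File.read(path)
    @entries[path] = [mtime, text]
    text
  end
end
```

The search block could then match against the cached text for each upload instead of reading the file on every query.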
Again, this code is from memory. I'm sorry I can't work on this today
and therefore can't provide you with running code. I will take a look over
the weekend, so if you want to send me anything you have hacked together
at the end of the day, I'll do what I can to improve it for
Monday (although this is perhaps too late for your current deadline, in
which case, sorry...).
Tom