Categories
Java Web Development

POI / TextMining.org Error when extracting text from a Word File

My client was experiencing difficulties when trying to index Word files into Lucene.
I am using the text extraction library from TextMining.org, but the issue also occurs with Apache POI (which TextMining.org is related to).

The exception being thrown is:

Exception while extracting Word file: Invalid header signature

After opening one of the questionable files I found out that they were actually RTF files saved with a Word .doc extension. Only after I saved the file under a different name (using Save As…) and explicitly specified the format as Word Document was the file properly saved, and its text was then extracted successfully.

Also, make sure that Word is not using the Fast Save option as it will also cause issues when extracting text.
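To spot the RTF-in-disguise files before handing them to the extractor, you can look at the first few bytes of each file: a real binary Word document is an OLE2 compound file and starts with the bytes D0 CF 11 E0 A1 B1 1A E1, while an RTF file starts with the text {\rtf. Here is a minimal sketch of that check using only the standard library. The class name DocSignature and the labels it returns are my own for illustration and are not part of POI or TextMining.org.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: classifies a file by its leading bytes.
final class DocSignature {
    // Magic number of an OLE2 compound document (binary Word .doc files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // RTF is plain text and begins with this control word.
    private static final byte[] RTF_MAGIC =
        "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    // Returns "OLE2", "RTF" or "UNKNOWN" for the given leading bytes.
    static String detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return "OLE2";
        if (startsWith(header, RTF_MAGIC)) return "RTF";
        return "UNKNOWN";
    }

    // Reads up to 8 bytes from the file and classifies them.
    static String detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = new byte[8];
            int total = 0;
            int n;
            while (total < header.length
                    && (n = in.read(header, total, header.length - total)) > 0) {
                total += n;
            }
            return detect(Arrays.copyOf(header, total));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        // Print the detected type of every file named on the command line.
        for (String name : args) {
            System.out.println(name + ": " + detect(Paths.get(name)));
        }
    }
}
```

Any .doc file that comes back as RTF is a candidate for the Save As treatment described above, or for being routed to an RTF-aware extractor instead.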

Categories
Computing Java Web Development

Setting up jEdit for remote development using SSH and sFTP

I am taking a class about XML and XSLT at the Harvard Extension.
We are essentially supposed to log into the system remotely through SSH and use emacs or any other editor of our liking to develop our projects and solutions.

I am totally down with emacs but the fact that it does not have XSlide or any other context-sensitive helper functionality enabled on the server is disappointing. So I tried NetBeans, Eclipse and finally jEdit as alternatives with the following requirements:

1. Must be able to save and read files remotely using sFTP (FTP is not enabled on the server).
2. Must be useful with XML.
3. If possible, also allow logging in to the system using an internal SSH client.
4. Anything I use must be free and legal.

NetBeans is supposedly very good with XSLT. I did not get that far because it does not have internal sFTP or SSH.

Eclipse has a really cool sFTP plugin that allows you to synchronize to a remote folder using the ‘Team’ functionality. But its XSLT support is way limited and the free version of XML Buddy does not support it. It also can only SSH when it involves CVS. I could not find anything to decouple the two… probably need to write one myself.

jEdit, my favorite left-field option and definitely not as chi-chi as the other two, had all three.
It has so-so XSLT support.
It has internal sFTP support that virtually mounts the remote file system into jEdit’s own file manager. Very cool.
Finally, it has an available (surprisingly not through its plugin manager) SSH client that you need to download and install into the $home/.jedit/jars folder. If you do follow this advice, note that when you connect, the window that should pop up to request your user name and password does not really pop up but pops under, so watch for any new windows appearing behind the main one.

jEdit wins.

Categories
Java Web Development

Deleting a document with Lucene

Lucene keeps on blowing my mind, but finding out how to do rudimentary things with it is not too simple.
Suppose you want the index to no longer show a document that you deleted. As far as I understand, after some research pain, this involves six steps: [of course, there is definitely more than one way to do this, and I am by no means a Lucene expert]

1. Find the document’s id. That is the id Lucene, not you, gave the document.
2. Get a Directory object for the index directory.
3. Get an IndexReader for that directory
4. Unlock that directory
5. Delete the document
6. Close the IndexReader object

Each step is almost its own procedure.
1. Find the document’s id
This is the more elaborate step. You need to search your index for the document you wish to delete. To do so, I ran a query against the index.
(This sample query will show you the names and ids of documents that match on a field called “contents”):

Directory fsDir = FSDirectory.getDirectory(indexDir, false);
IndexSearcher is = new IndexSearcher(fsDir);
Query query = QueryParser.parse(search_term, "contents", new StandardAnalyzer());
Hits hits = is.search(query);
System.out.println("Found " + hits.length() + " document(s) that matched query '" + search_term + "':");
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    // hits.id(i) is the internal Lucene id needed for the deletion below
    System.out.println(doc.get("filename") + " score: " + hits.score(i) + " id: " + hits.id(i));
}
is.close();

Finding the id, as you see, involves the Hits object, which holds the precious id(int hit_position) method that returns you the id.

Now that you have the id, you can proceed and start the real deletion process:

2. Get a Directory object
Similar to what we did above, you get a Directory object from the FSDirectory. That is easy enough.

3. Get an IndexReader object
The IndexReader is an abstract class, so in order to get the concrete implementation for it, you instantiate it using a call like:
IndexReader ir = IndexReader.open(fsDir);
where fsDir is the Directory object we created in step 2.

4. Unlock the Directory
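Lucene uses file locks to secure the index and the updates happening to it. To delete a document, I had to first unlock the directory, and the IndexReader class will be happy to do that for you through its static unlock method. Be careful: this forcibly removes the lock, so only do it when you are sure nothing else is writing to the index at that moment:
IndexReader.unlock(fsDir);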

5. Delete the document
Finally, we ask the IndexReader to delete the document using the id we found in step 1 - which we intuitively put in a variable called docId:
ir.delete(docId);

6. Close the IndexReader object
Nothing will happen unless you close the IndexReader object - the document will not be deleted. Easy enough, close it then:
ir.close();

Voila.
