OpenOffice Parser: Extracting text from OpenOffice documents

With OpenDocument formats gaining widespread acceptance, the lack of a simple text extractor for OpenOffice documents was my main motivation for developing this one. The code below extracts text from OpenOffice documents (odt, odp, etc.). I have used the JDOM XML APIs for easier processing of the OpenOffice XML. I hope this makes life a bit easier.

/*
 * OpenOfficeParser.java
 *
 * Created on September 12, 2007, 4:24 PM
 *
 * To change this template, choose Tools | Template Manager
 * and open the template in the editor.
 */

/**
 *
 * @author prasanna
 */

import java.io.InputStream;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.Text;
import org.jdom.input.SAXBuilder;
import java.util.zip.ZipFile;
import java.util.zip.ZipEntry;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;

public class OpenOfficeParser {
   
    StringBuffer TextBuffer;
   
    /** Creates a new instance of OpenOfficeParser */
   
    public OpenOfficeParser() {}
   
    //Process text elements recursively
    public void processElement(Object o) {
        
        if (o instanceof Element) {
           
            Element e = (Element) o;
            String elementName = e.getQualifiedName();
           
            if (elementName.startsWith("text")) {
               
                if (elementName.equals("text:tab")) // add tab for text:tab
                    TextBuffer.append("\t");
                else if (elementName.equals("text:s"))  // add space for text:s
                    TextBuffer.append(" ");
                else {
                    List children = e.getContent();
                    Iterator iterator = children.iterator();
                   
                    while (iterator.hasNext()) {
                       
                        Object child = iterator.next();
                        //If Child is a Text Node, then append the text
                        if (child instanceof Text) { 
                            Text t = (Text) child;
                            TextBuffer.append(t.getValue());
                        }
                        else
                            processElement(child); // Recursively process the child element
                    }                   
                }
                if (elementName.equals("text:p"))
                    TextBuffer.append("\n");
            }
            else {
                List non_text_list = e.getContent();
                Iterator it = non_text_list.iterator();
                while (it.hasNext()) {
                    Object non_text_child = it.next();
                    processElement(non_text_child);                   
                }
            }               
        }
    }
   
    public String getText(String fileName) throws Exception {
        TextBuffer = new StringBuffer();

        //Unzip the openOffice Document
        ZipFile zipFile = new ZipFile(fileName);
        try {
            Enumeration entries = zipFile.entries();
            ZipEntry entry;

            while(entries.hasMoreElements()) {
                entry = (ZipEntry) entries.nextElement();

                //The document body is stored in content.xml
                if (entry.getName().equals("content.xml")) {

                    TextBuffer = new StringBuffer();
                    SAXBuilder sax = new SAXBuilder();
                    Document doc = sax.build(zipFile.getInputStream(entry));
                    Element rootElement = doc.getRootElement();
                    processElement(rootElement);
                    break;
                }
            }
        } finally {
            //Release the file handle even if parsing fails
            zipFile.close();
        }
        System.out.println("The text extracted from the OpenOffice document = " + TextBuffer.toString());
        return TextBuffer.toString();
    }
   
   
    public static void main(String args[]) throws Exception
    {
        new OpenOfficeParser().getText("OpenDocumentFile.odt");
    }
}
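
For anyone who would rather not put JDOM on the classpath, the same recursive walk can probably be done with the XML parser that ships with the JDK. What follows is only a rough sketch under that assumption, not a replacement for the class above: the class name OdfTextSketch and the methods extractText and walk are invented for this illustration. It keeps the same rules: a tab for text:tab, a space for text:s, a newline after each text:p, and character data taken only from text:* elements.

```java
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical name, used only for this sketch.
class OdfTextSketch {

    // Recursively collect text, following the same rules as processElement.
    static void walk(Node element, StringBuilder out) {
        // The default (non namespace-aware) parser reports qualified names such as "text:p".
        String name = element.getNodeName();
        if (name.equals("text:tab")) {
            out.append('\t');
            return;
        }
        if (name.equals("text:s")) {
            out.append(' ');
            return;
        }
        boolean isTextElement = name.startsWith("text:");
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                // Character data is kept only when its parent is a text:* element.
                if (isTextElement) {
                    out.append(child.getNodeValue());
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                walk(child, out);
            }
        }
        if (name.equals("text:p")) {
            out.append('\n');
        }
    }

    // Open the document as a zip archive and parse its content.xml entry.
    static String extractText(String fileName) throws Exception {
        try (ZipFile zip = new ZipFile(fileName)) {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                return ""; // No content.xml, so there is nothing to extract.
            }
            try (InputStream in = zip.getInputStream(entry)) {
                DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
                Document doc = builder.parse(in);
                StringBuilder out = new StringBuilder();
                walk(doc.getDocumentElement(), out);
                return out.toString();
            }
        }
    }
}
```

Calling OdfTextSketch.extractText("OpenDocumentFile.odt") would then return the extracted text as a String. The try-with-resources blocks need Java 7 or later, and ZipFile.getEntry looks up content.xml directly instead of looping over every entry.
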
Comments:

Thank you. Exactly what I need.

Posted by jim on March 13, 2008 at 08:04 PM IST #

No problem, glad to know that the code is useful to someone.

Posted by Prasanna S on March 15, 2008 at 06:01 AM IST #

Hello,
the parser doesn't work in my case.
This is the report:

C:\Program Files\Java\jdk>javac openofficeparser.java
openofficeparser.java:26: cannot find symbol
symbol : variable textBuffer
location: class OpenOfficeParser
textBuffer.append("\t");
^
openofficeparser.java:28: cannot find symbol
symbol : variable textBuffer
location: class OpenOfficeParser
textBuffer.append(" ");
^
openofficeparser.java:39: cannot find symbol
symbol : variable textBuffer
location: class OpenOfficeParser
textBuffer.append(t.getValue());
^
openofficeparser.java:46: cannot find symbol
symbol : variable textBuffer
location: class OpenOfficeParser
textBuffer.append("\n");
^
openofficeparser.java:60: cannot find symbol
symbol : variable textBuffer
location: class OpenOfficeParser
textBuffer = new StringBuffer();
^
openofficeparser.java:72: cannot find symbol
symbol : variable textBuffer
location: class OpenOfficeParser
textBuffer = new StringBuffer();
^
openofficeparser.java:80: cannot find symbol
symbol : variable textBuffer
location: class OpenOfficeParser
System.out.println("The text extracted from the OpenOffice document = " + textBuffer.toString());
^
openofficeparser.java:81: cannot find symbol
symbol : variable textBuffer
location: class OpenOfficeParser
return textBuffer.toString();
^
8 errors

The classpath for jdom is installed.
I hope someone can help me! :-)

Posted by ffg on August 05, 2008 at 01:07 PM IST #

It's solved!
I forgot this one:

StringBuffer TextBuffer;

I don't know how this could happen... :-)

Posted by ffd on August 05, 2008 at 02:01 PM IST #

Hey, great.
Exactly what I needed.
Makes my life easier and saves a lot of my scarce time. ^^

Thanks a lot.

Posted by Jonas on March 27, 2009 at 07:13 PM IST #

Hi Jonas,

Thanks for saying so.

Posted by Prasanna on March 30, 2009 at 05:24 PM IST #

This helped me so much with my work!!

I converted all of my MS Office documents to ODT by using their Converter Wizard and then I accessed them with your tool.

Thank you so much!!

Posted by Katsi on May 19, 2009 at 10:34 AM IST #

Baran!

Posted by Gora on February 04, 2010 at 09:52 PM IST #

Awesome!

Posted by tisho on November 05, 2010 at 10:38 AM IST #
