A few weeks ago I got a project wherein I had to get various statistical data about a document. One of the important parameters was to get a precise count of grammatical errors in the document. I tried to go the open source way and tested out LanguageTool, PorseLink among others. None of these were up to the mark of the good old MS-Word. So I had to now work on a way to problematically access Word, load a document in it and get the statistics.
Python being versatile and great at memory management, we would be using it to control COM object and do what we need to do.
This is the Part 1 of the two series post.
Part 1 : Introduction to
COMobject, initialization, examples etc
Part 2 : Setting up a server environment. Making use of threads and opening using thread for each process. This is how to communicate with the
COMobject in a multi-thread environment as it cannot be passed to a thread directly.
The Microsoft Component Object Model (COM) is a platform-independent, distributed, object-oriented system for creating binary software components that can interact. COM is the foundation technology for Microsoft's OLE (compound documents), ActiveX (Internet-enabled components), as well as others. read more...
In a nutshell it can be used to control various aspects of Windows. We would be focusing on using the
COMobject to control MS-Office applications like Word and Excel. Although it is not limited to just MS-Office applications, it can be used to communicate with other softwares like IE (if you're still into it...)
You can do things like :
The possibilities are limitless. I guess you already have a use case in mind and landing to this page to know the process, if not it's always good to know if there is possibility to get a certain functionality done.
We will need the following tools to communicate with MS-Word :
PyWin32 is a great library that gives us the same set of methods/properties that are exposed by COM which are natively in Visual Basic for Applications (VBA).
NOTE: I'm using Python3 in the code samples in this guide.
To get hold of a COM object
import win32com.client # for MS-Word word = win32.gencache.EnsureDispatch('Word.Application') # OR word = win32com.client.DispatchEx('Word.Application') # for MS-Excel excel = win32.gencache.EnsureDispatch('Excel.Application') # OR excel = win32com.client.DispatchEx('Excel.Application')
There are some key differences to be noted here
EnsureDispatch : In dead simple terms, it makes sure that a object which references the specified application is returned. If the application is already open, it will return the same instance, if not it will start a instance and return it.
It is useful in cases where you would want to make use of single application that holds certain meta-data, over the life of the process.
DispatchEx : For each call, you would get a new object returned. This would mean that if you call it 7 times, you would see 7 instances of MS-Word in task manager.
It makes sense to get independent instances when you would want to handle each process independently and close the application when one process completes.
Understand that if you run a web-server that takes in data, puts it in an Excel, runs a macro and return a result. You would want to have a dedicated instance of Excel for each process so that you can close it as soon as the response is ready. If you used
Dispatch, that would mean that at the end of each process, you would have to check if any other process is in progress, if not shut down the application.
It's requires a bit extra memory but helps to keep memory leak in check as using same application instance to load up multiple files that do not require each others data, means that with each load, some meta-data is loaded which is not cleared when closing the process and it builds up over time. We will cover this aspect in detail in the Part-2 of the post.
Coming back to the point, let's go ahead.
Now that we have the
COMobject we can use it a open a Word Document.
doc = word.Documents.Open( os.path.join( os.getcwd(), 'path/to/files/', filename), ReadOnly=True)
NOTE : As you might have had guessed, you can use the same
wordobject to open multiple
docobjects. This is what I was referring to using same application to open multiple documents in the last paragraph
At this point, we can refer to Microsoft's documentation to get gist of various properties & functions available.
Make the application window invisible :
word.Visible = False
To add a new Document
doc = word.Documents.Add()
To get all the grammatical or spelling errors
grammaticalErrors = doc.GrammaticalErrors spellingErrors = doc.doc.SpellingErrors #or to simple get the count grammaticalErrorsCount = doc.GrammaticalErrors.Count spellingErrorsCount = doc.SpellingErrors.Count
If you have a look at the Word BuiltInProperty, you'll see you can do the following
wordCount = int(doc.BuiltInDocumentProperties(15)) characterCount = int(doc.BuiltInDocumentProperties(16)) paragraphCount = int(doc.BuiltInDocumentProperties(24))
or do something much more complex as
characterPerWord = doc.ReadabilityStatistics("Characters per Word").Value
or something as simple as
shapeCount = doc.Shapes.Count tableCount = doc.Tables.Count sentenceCount = doc.Sentences.Count tableOfContentCount = doc.TablesOfContents.Count #loop through each paragraphs for paras in doc.Paragraphs: # paras will now be each individual paragraph
As you can see the possibilities are endless. MS-Office tools are pretty powerful in what they do. We did not even dig into the excel territory. Majority of the questions on stackoverflow are based on excel as automating tasks there makes can be of great use.
In the next post we will focus on how to wrap all this up into dedicated thread for each process and word with COMobject in threads.