Handling .doc files with Python

A guide to performing operations on .doc files using Python

Arthur Fortes
Jan 30, 2020

This week I started a project in which I need to read a Word document (.doc) and extract relevant information from it. I had never worked with .doc files before, so I started researching. I found plenty of information on reading .docx (e.g. textract), but much less on .doc.

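So I’m writing this article to share a solution that works out of the box for anyone using Anaconda, with no extra installations (with plain Python, you only need to install one library). The magic library is win32com, part of the pywin32 package, which provides access to many of the Windows APIs from Python and can also handle other Office formats, such as .ppt. Keep in mind that it runs only on Windows and works by automating an installed copy of Microsoft Word.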

1. Installing via PIP

  • If you are using Anaconda, please skip this step.
pip install pywin32

For more information about installation, see the pywin32 repository on GitHub.
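
Before going further, it can help to confirm that the library is importable on the machine you are using. The sketch below is one possible check; `pywin32_available` is a name made up for this illustration and is not part of pywin32.

```python
import importlib.util
import sys


def pywin32_available():
    """Return True if we are on Windows and win32com can be imported."""
    on_windows = sys.platform == "win32"
    # find_spec returns None when a top-level package is not installed
    has_win32com = importlib.util.find_spec("win32com") is not None
    return on_windows and has_win32com


if __name__ == "__main__":
    if pywin32_available():
        print("win32com is ready to use")
    else:
        print("win32com is not available; on Windows, try: pip install pywin32")
```

The check only looks the package up; it does not start Word, so it is safe to run anywhere.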

2. Import and read file

Here’s a script that converts every Word document in a given directory and its subdirectories to plain text. Pass the directory as the first command-line argument.

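import fnmatch
import os
import sys

import win32com.client

# Start Word through COM (requires Windows with Microsoft Word installed)
wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")

try:
    # Walk the directory given as the first command-line argument
    for path, dirs, files in os.walk(sys.argv[1]):
        for filename in fnmatch.filter(files, '*.doc'):
            doc = os.path.abspath(os.path.join(path, filename))
            print("processing %s" % doc)
            wordapp.Documents.Open(doc)
            # Swap the extension for .txt (rstrip('doc') would mishandle names such as FILE.DOC)
            docastxt = os.path.splitext(doc)[0] + '.txt'
            wordapp.ActiveDocument.SaveAs(docastxt, FileFormat=win32com.client.constants.wdFormatTextLineBreaks)
            wordapp.ActiveWindow.Close()
finally:
    # Always shut Word down, even if a document fails to convert
    wordapp.Quit()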
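
The file discovery and the output naming in the script do not depend on Word, so they can be tried out on their own. Here is a minimal sketch of both steps; `find_doc_files` and `txt_path_for` are hypothetical helper names used only for this illustration.

```python
import fnmatch
import os


def find_doc_files(root):
    """Yield absolute paths of *.doc files in root and its subdirectories."""
    for path, _dirs, files in os.walk(root):
        # The pattern must match the whole name, so report.docx is skipped
        for filename in fnmatch.filter(files, "*.doc"):
            yield os.path.abspath(os.path.join(path, filename))


def txt_path_for(doc_path):
    """Replace the file extension with .txt, whatever its letter case."""
    base, _ext = os.path.splitext(doc_path)
    return base + ".txt"
```

Using `os.path.splitext` instead of `rstrip('doc')` matters because `rstrip` removes a set of characters rather than a suffix: `"REPORT.DOC".rstrip("doc") + "txt"` yields `"REPORT.DOCtxt"`, while `txt_path_for("REPORT.DOC")` yields `"REPORT.txt"`. Uppercase extensions can show up on Windows, where `fnmatch` ignores case.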

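To work with the content of a document directly instead of saving it, keep the object that Documents.Open returns and read its Content.Text property:

document = wordapp.Documents.Open(doc)
text = document.Content.Text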
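
The text that Word hands back is not quite plain text: paragraphs end with a carriage return, manual line breaks are vertical tabs, and table cells end with a carriage return followed by a bell character. The sketch below shows one way to read a document and tidy those characters. Both function names are assumptions made for this example, and replacing cell marks with tabs is just one possible choice.

```python
import os


def normalize_word_text(raw):
    """Turn Word's control characters into ordinary tabs and newlines."""
    text = raw.replace("\r\x07", "\t")  # end-of-cell marks inside tables
    text = text.replace("\x07", "")     # any leftover cell marks
    text = text.replace("\x0b", "\n")   # manual line breaks
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # paragraph marks
    return text.strip()


def read_doc_text(doc_path):
    """Return the text of a .doc file (needs Windows, Word and pywin32)."""
    # Imported here so that normalize_word_text can be used on any platform
    import win32com.client

    wordapp = win32com.client.Dispatch("Word.Application")
    wordapp.Visible = False
    try:
        # Word resolves paths on its own, so pass an absolute one
        document = wordapp.Documents.Open(os.path.abspath(doc_path), ReadOnly=True)
        try:
            return normalize_word_text(document.Content.Text)
        finally:
            document.Close(False)  # False = do not save changes
    finally:
        wordapp.Quit()
```

Calling `read_doc_text` with the path of a .doc file (for example one of the paths collected by the script above) returns a string you can search or split like any other.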

3. Create and write text to a Word Document

When using win32com, bear in mind that you are talking to the Word object model. You don’t need to know much VBA to adapt the samples to Python; you just need to figure out which parts of the object model are being used. Here is a simple example of how to create a Word document with some content.

import win32com.client

# Create a new Word application object
wordapp = win32com.client.Dispatch("Word.Application")
# The Word application shouldn't be visible
wordapp.Visible = 0
# Create a new document object
worddoc = wordapp.Documents.Add()
# Set up the page (1 = landscape; margins are in points)
worddoc.PageSetup.Orientation = 1
worddoc.PageSetup.LeftMargin = 20
worddoc.PageSetup.TopMargin = 20
worddoc.PageSetup.BottomMargin = 20
worddoc.PageSetup.RightMargin = 20
worddoc.Content.Font.Size = 11
worddoc.Content.Paragraphs.TabStops.Add(100)
worddoc.Content.Text = "Hello, I am a text!"
worddoc.Content.MoveEnd()
# Close the Word document (a save dialog pops up)
worddoc.Close()
# Close the Word application
wordapp.Quit()
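
Word measures margins in points (72 points to the inch), which is easy to forget when you think in centimeters. The following sketch wraps the steps above in a function that converts the margin and saves the file explicitly instead of waiting for the save dialog. `cm_to_points` and `create_doc` are hypothetical names, and the 2 cm default margin is an arbitrary choice for the example.

```python
import os

POINTS_PER_INCH = 72.0
CM_PER_INCH = 2.54
WD_FORMAT_DOCUMENT = 0  # value of Word's wdFormatDocument constant (.doc)


def cm_to_points(cm):
    """Convert centimeters to the points Word expects for margins."""
    return cm / CM_PER_INCH * POINTS_PER_INCH


def create_doc(text, save_path, margin_cm=2.0):
    """Write text to a new .doc file (needs Windows, Word and pywin32)."""
    # Imported here so that cm_to_points can be used on any platform
    import win32com.client

    wordapp = win32com.client.Dispatch("Word.Application")
    wordapp.Visible = False
    try:
        worddoc = wordapp.Documents.Add()
        margin = cm_to_points(margin_cm)
        worddoc.PageSetup.LeftMargin = margin
        worddoc.PageSetup.RightMargin = margin
        worddoc.PageSetup.TopMargin = margin
        worddoc.PageSetup.BottomMargin = margin
        worddoc.Content.Text = text
        # Save explicitly, then close without asking to save again
        worddoc.SaveAs(os.path.abspath(save_path), FileFormat=WD_FORMAT_DOCUMENT)
        worddoc.Close(False)
    finally:
        wordapp.Quit()
```

A call such as `create_doc("Hello, I am a text!", "hello.doc")` would then write the file next to the script without any dialog.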

4. Extra

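Here is a simple example of how to create a small table in Word and fill it with data. The data comes from a load flow calculation in PowerFactory, read through its powerfactory Python module, so this script needs PowerFactory in addition to Word.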

from win32com import client
import powerfactory as pf

# Run a load flow calculation in PowerFactory
app = pf.GetApplication()
lines = app.GetCalcRelevantObjects('*.ElmLne')
ldf = app.GetFromStudyCase('ComLdf')
ldf.Execute()
nr_lines = len(lines)

# Start Word and create an empty document
wordapp = client.Dispatch("Word.Application")
wordapp.Visible = True
worddoc = wordapp.Documents.Add()

# Insert a table with 2 rows and 1 column, then split the second row into 3 cells
rang = worddoc.Range(Start=0, End=0)
worddoc.Tables.Add(rang, NumRows=2, NumColumns=1)
index = 2 + nr_lines
worddoc.Tables(1).Rows(2).Cells(1).Split(1, 3)
width = worddoc.Tables(1).Rows(1).Cells(1).Width

# Title row
worddoc.Tables(1).Rows(1).Cells(1).Range.Bold = True
worddoc.Tables(1).Rows(1).Cells(1).Range.Font.Size = 15
worddoc.Tables(1).Rows(1).Cells(1).Range.Text = 'Report of LoadFlow Calculations from PowerFactory'

# Column headers
worddoc.Tables(1).Cell(2, 1).Range.Text = 'Name of the line'
worddoc.Tables(1).Cell(2, 2).Range.Text = 'Loading'
worddoc.Tables(1).Cell(2, 3).Range.Text = 'Comment'
app.PrintPlain(worddoc.Tables(1).Rows(2).Cells)

# One table row per line; flag loadings above 60 %
for i, line in enumerate(lines):
    worddoc.Tables(1).Rows.Add()
    worddoc.Tables(1).Cell(i+3, 1).Range.Text = line.loc_name
    worddoc.Tables(1).Cell(i+3, 2).Range.Text = str(
        round(line.GetAttribute('c:loading'), 2)) + ' %'
    if line.GetAttribute('c:loading') > 60:
        worddoc.Tables(1).Cell(i+3, 3).Range.Font.Color = 225
        worddoc.Tables(1).Cell(i+3, 3).Range.Text = 'Loading over 60%'
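
If you don't have PowerFactory at hand, the table-filling part can be separated from the data source. The sketch below assumes the loadings are already available as (name, percent) pairs; `build_report_rows` and `fill_table` are hypothetical helpers written for this illustration, and `table` is expected to be a Word table object such as `worddoc.Tables(1)`.

```python
def build_report_rows(loadings, threshold=60.0):
    """Turn (name, loading) pairs into (name, loading text, comment) rows."""
    rows = []
    for name, loading in loadings:
        comment = "Loading over %g%%" % threshold if loading > threshold else ""
        rows.append((name, str(round(loading, 2)) + " %", comment))
    return rows


def fill_table(table, rows, first_row=3, warn_color=225):
    """Append one table row per entry, starting below the two header rows."""
    for offset, (name, loading, comment) in enumerate(rows):
        table.Rows.Add()
        row = first_row + offset
        table.Cell(row, 1).Range.Text = name
        table.Cell(row, 2).Range.Text = loading
        if comment:
            # Color the comment cell only when there is something to flag
            table.Cell(row, 3).Range.Font.Color = warn_color
            table.Cell(row, 3).Range.Text = comment
```

Keeping the row preparation in plain Python means it can be checked without opening Word, and the same rows could later feed a different kind of report.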

To copy the contents of a Word document and paste them into a new Outlook email, just follow the code below.

import win32com.client

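# Hypothetical path used as a placeholder: replace it with your own document
word_path = r"C:\path\to\document.doc"

word = win32com.client.Dispatch("Word.Application")
doc = word.Documents.Open(word_path)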
doc.Content.Copy()
doc.Close()

outlook = win32com.client.Dispatch("Outlook.Application")
# Create a new MailItem object
msg = outlook.CreateItem(0)
msg.GetInspector.WordEditor.Range(Start=0, End=0).Paste()

msg.Display(False)
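
Because Word resolves file names on its own, a relative or mistyped path tends to fail with a COM error that is hard to read. A possible refinement, sketched below, checks the path in Python first and only then starts Word and Outlook. `resolve_doc_path` and `word_to_outlook_draft` are names invented for this example.

```python
import os


def resolve_doc_path(path):
    """Return the absolute path of an existing document, or raise early."""
    full_path = os.path.abspath(path)
    if not os.path.isfile(full_path):
        raise FileNotFoundError("No such Word document: %s" % full_path)
    return full_path


def word_to_outlook_draft(word_path):
    """Paste a document into a new Outlook draft (needs Windows and Office)."""
    full_path = resolve_doc_path(word_path)

    # Imported here so that resolve_doc_path can be used on any platform
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    doc = word.Documents.Open(full_path, ReadOnly=True)
    try:
        doc.Content.Copy()
    finally:
        doc.Close(False)  # False = do not save changes

    outlook = win32com.client.Dispatch("Outlook.Application")
    msg = outlook.CreateItem(0)  # 0 = a new mail item
    msg.GetInspector.WordEditor.Range(Start=0, End=0).Paste()
    msg.Display(False)
```

With this in place, a missing file raises a plain `FileNotFoundError` before any Office application is launched.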

Final Remark

File handling in Python is usually easy because most basic operations take a single line of code, as we have seen in this article. However, when you need to deal with proprietary software, some difficulties begin to appear. In this article, I introduced the win32com library, which can manipulate .doc files, the older Microsoft Word format, by driving Word itself. I hope you find this content useful.


Arthur Fortes

Data scientist and Python developer with experience in research and industrial projects. Innovation Enthusiast.