Handling .doc files with Python

A guide to performing operations on .doc files using Python

Arthur Fortes
3 min read · Jan 30, 2020

This week I started a project in which I need to read a Word document (.doc) and extract relevant information from it. I had never worked with .doc files, so I started researching. I found lots of information on reading .docx (e.g. textract) but much less on .doc.

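So I’m writing this article to share a solution that works out of the box for anyone using Anaconda on Windows, with no extra installation (with plain Python, you only need to install one library). The library is win32com, part of the pywin32 package, which provides access to many of the Windows APIs from Python. Because it drives Microsoft Word itself, it needs Windows with Word installed, and the same approach works for other Office formats such as .ppt.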

1. Installing via PIP

  • If you are using Anaconda, please skip this step.
pip install pywin32

For more information about installation, see the pywin32 GitHub repository.
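Since pywin32 only works on Windows, it can help to check for it before a script tries to start Word. The following is a minimal sketch; the function name word_automation_available is made up for this illustration, and it only checks that the package can be imported, not that Word itself is installed.

```python
import importlib.util
import sys


def word_automation_available():
    """Return True if this looks like Windows with pywin32 installed."""
    # win32com is Windows-only, so skip the lookup on other platforms
    if sys.platform != "win32":
        return False
    # find_spec returns None when the package cannot be imported
    return importlib.util.find_spec("win32com") is not None


if __name__ == "__main__":
    if word_automation_available():
        print("win32com is available")
    else:
        print("win32com is not available; run 'pip install pywin32' on Windows")
```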

2. Import and read file

Here’s a script that saves every Word document in and below a given directory as a text file.

import fnmatch, os, sys, win32com.client

wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")

for path, dirs, files in os.walk(sys.argv[1]):
    for filename in fnmatch.filter(files, '*.doc'):
        doc_path = os.path.abspath(os.path.join(path, filename))
        print("processing %s" % doc_path)
        # Open the document first so that Word has something to save
        doc = wordapp.Documents.Open(doc_path)
        # Replace only the extension (.doc -> .txt)
        docastxt = os.path.splitext(doc_path)[0] + '.txt'
        doc.SaveAs(docastxt, FileFormat=win32com.client.constants.wdFormatTextLineBreaks)
        doc.Close(False)

# Quit Word so that no hidden instance keeps running
wordapp.Quit()
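The file-discovery part of such a script does not depend on Word, so it can be tried on any machine. Below is a minimal sketch under that assumption; the helper names find_doc_files and txt_path_for are hypothetical. It matches the extension case-insensitively and builds the output name with os.path.splitext, which removes only the final extension.

```python
import fnmatch
import os


def find_doc_files(root):
    """Yield absolute paths of .doc files in and below root."""
    for path, _dirs, files in os.walk(root):
        for filename in files:
            # Lowercase first so REPORT.DOC matches on every platform;
            # the pattern '*.doc' does not match names ending in '.docx'.
            if fnmatch.fnmatchcase(filename.lower(), "*.doc"):
                yield os.path.abspath(os.path.join(path, filename))


def txt_path_for(doc_path):
    """Return the path of the text file to write next to doc_path."""
    # splitext drops only the last extension, whatever its case
    return os.path.splitext(doc_path)[0] + ".txt"
```

Each path returned by find_doc_files can then be passed to Documents.Open, and txt_path_for gives the name to save under.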

To get the text of a document directly, without saving it to a file, you can use the following command, where doc is a document object returned by wordapp.Documents.Open:

text = doc.Content.Text
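The text Word returns uses its own control characters: paragraphs end with a carriage return ('\r'), table cells end with '\r\x07', and manual line breaks are vertical tabs ('\x0b'). Here is a small sketch for turning that raw string into a list of paragraphs; the function name word_text_to_paragraphs is made up for this example.

```python
def word_text_to_paragraphs(raw):
    """Split the raw text of a Word range into non-empty paragraphs."""
    # Table cells end with '\r\x07'; drop the bell character
    raw = raw.replace("\x07", "")
    # A manual line break (Shift+Enter) is stored as a vertical tab
    raw = raw.replace("\x0b", "\n")
    # Paragraph marks are carriage returns
    return [p.strip() for p in raw.split("\r") if p.strip()]


sample = "Title\rFirst line\x0bsecond line\r\rCell\r\x07"
print(word_text_to_paragraphs(sample))
```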

3. Create and write text to a Word Document

When using win32com, bear in mind that you are talking to the Word object model. You don’t need to know much VBA or any other language to adapt VBA samples to Python; you just need to figure out which parts of the object model they use. Here is a simple example of how to create a Word document with some content.

import win32com.client

# Create a new Word application object
wordapp = win32com.client.Dispatch("Word.Application")
# The Word application shouldn't be visible
wordapp.Visible = 0
# Create a new document object
worddoc = wordapp.Documents.Add()
# Set up the page (1 = landscape orientation; margins are in points)
worddoc.PageSetup.Orientation = 1
worddoc.PageSetup.LeftMargin = 20
worddoc.PageSetup.TopMargin = 20
worddoc.PageSetup.BottomMargin = 20
worddoc.PageSetup.RightMargin = 20
worddoc.Content.Font.Size = 11
worddoc.Content.Paragraphs.TabStops.Add(100)
worddoc.Content.Text = "Hello, I am a text!"
# Close the Word document (a save dialog pops up)
worddoc.Close()
# Close the Word application
wordapp.Quit()
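If a script fails halfway, an invisible Word process can be left running. One way to guard against that is a context manager that always calls Quit. This is only a sketch: the name word_session is hypothetical, and the dispatch function is passed in as a parameter (an assumption of this example) so the helper can be tried without Word installed.

```python
from contextlib import contextmanager


@contextmanager
def word_session(dispatch, visible=False):
    """Start Word through dispatch("Word.Application") and always quit it."""
    app = dispatch("Word.Application")
    app.Visible = visible
    try:
        yield app
    finally:
        # Runs even if the body raises, so no hidden Word process is left behind
        app.Quit()
```

With pywin32 installed, it would be used as `with word_session(win32com.client.Dispatch) as wordapp:` followed by the document code.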

4. Extra

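Here is a simple example of how to create a small table in Word and fill it with data. The data comes from DIgSILENT PowerFactory through its powerfactory module, so this script needs a PowerFactory installation; any other data source would work the same way.

from win32com import client
import powerfactory as pf

app = pf.GetApplication()
lines = app.GetCalcRelevantObjects('*.ElmLne')
ldf = app.GetFromStudyCase('ComLdf')
ldf.Execute()  # run the load flow so that 'c:loading' is available
nr_lines = len(lines)

wordapp = client.Dispatch("Word.Application")
wordapp.Visible = True
worddoc = wordapp.Documents.Add()

rang = worddoc.Range(Start=0, End=0)
# Title row + header row + one row per line, three columns
# (the original listing was incomplete here, so the table size is an assumption)
index = 2 + nr_lines
table = worddoc.Tables.Add(rang, NumRows=index, NumColumns=3)
table.Cell(1, 1).Range.Text = 'Report of LoadFlow Calculations from PowerFactory'
table.Cell(2, 1).Range.Text = 'Name of the line'
for i, line in enumerate(lines):
    loading = line.GetAttribute('c:loading')
    # Columns 1 and 2 are reconstructed from a truncated line (assumption)
    table.Cell(i + 3, 1).Range.Text = line.loc_name
    table.Cell(i + 3, 2).Range.Text = str(round(loading, 2)) + ' %'
    if loading > 60:
        table.Cell(i + 3, 3).Range.Text = 'Loading over 60%'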
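The table contents can be prepared as plain strings before Word is involved at all, which makes that logic easy to check without PowerFactory or Word. A minimal sketch, assuming the data is available as (name, loading) pairs; the function name build_report_rows, the "Loading" and "Remark" headers, and the 60 % default threshold are choices made for this illustration.

```python
def build_report_rows(loadings, threshold=60.0):
    """Turn (name, loading_percent) pairs into rows of table cell strings."""
    rows = [("Name of the line", "Loading", "Remark")]
    for name, loading in loadings:
        # Flag only the lines whose loading exceeds the threshold
        remark = "Loading over %g%%" % threshold if loading > threshold else ""
        rows.append((name, "%.2f %%" % loading, remark))
    return rows


for row in build_report_rows([("Line A", 45.678), ("Line B", 72.5)]):
    print(row)
```

Each row can then be written into Word cell by cell with Cell(row, column).Range.Text, keeping in mind that Word counts rows and columns from 1.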

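To copy the contents of a Word document and paste them into a new Outlook message, follow the code below.

import win32com.client

word_path = r"C:\path\to\document.doc"  # placeholder: absolute path to your file

word = win32com.client.Dispatch("Word.Application")
doc = word.Documents.Open(word_path)
# Copy the whole document to the clipboard
doc.Content.Copy()

outlook = win32com.client.Dispatch("Outlook.Application")
# Create a new MailItem object (0 = olMailItem)
msg = outlook.CreateItem(0)
msg.GetInspector.WordEditor.Range(Start=0, End=0).Paste()
# Show the draft so it can be reviewed and sent
msg.Display()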
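The 0 passed to CreateItem and the 1 assigned to PageSetup.Orientation earlier are values of Office enumerations. win32com.client.constants provides these names only when the type library has been loaded (for example through gencache.EnsureDispatch), so one simple alternative is to define the few values a script needs as named constants. The Python names below are my own; the numeric values are the ones defined by the Word and Outlook object models.

```python
# Word enumeration values
WD_ORIENT_PORTRAIT = 0          # wdOrientPortrait
WD_ORIENT_LANDSCAPE = 1         # wdOrientLandscape
WD_FORMAT_TEXT_LINE_BREAKS = 3  # wdFormatTextLineBreaks

# Outlook enumeration values
OL_MAIL_ITEM = 0                # olMailItem

# Intended use (requires Windows with Office, so shown as comments):
#   worddoc.PageSetup.Orientation = WD_ORIENT_LANDSCAPE
#   msg = outlook.CreateItem(OL_MAIL_ITEM)
```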


Final Remark

File handling in Python is pretty easy because most basic operations take a single line of code, as we have seen in this article. However, when you need to handle a proprietary format, some difficulties begin to appear. In this article, I introduced the win32com library, which can manipulate .doc files, the classic Microsoft Word format. I hope this content is useful to you.


