Handling .doc Files with Python
A guide to performing operations on .doc files using Python
This week I started a project in which I need to read a Word document (.doc) and extract relevant information from it. I had never worked with .doc files before, so I started researching. I found lots of information on reading .docx (e.g. textract), but much less on .doc.
So I’m writing this article to share a native solution for anyone working with Anaconda, without needing extra installations (if you use pure Python, you just need to install one library). The magic library is called win32com, which provides access to many of the Windows APIs from Python. The same library can also be used with other Office formats, such as .ppt.
1. Installing via PIP
- If you are using Anaconda, please skip this step.
pip install pywin32
For more information about installation, see the pywin32 repository on GitHub (win32com is part of the pywin32 package).
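A quick way to confirm that the library is available before going further is to ask Python whether the win32com package can be found. This is only a sketch; the helper name has_win32com is made up for illustration.

```python
import importlib.util


def has_win32com():
    """Return True if the win32com package can be found in this environment."""
    return importlib.util.find_spec("win32com") is not None


if __name__ == "__main__":
    if has_win32com():
        print("win32com is available")
    else:
        print("win32com not found; try: pip install pywin32")
```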
2. Import and read file
Here’s a script to save Word documents in and below a given directory to text.
import fnmatch, os, sys
import win32com.client

# Start Word through COM; EnsureDispatch also makes the wdFormat... constants available
wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")
try:
    for path, dirs, files in os.walk(sys.argv[1]):
        for filename in fnmatch.filter(files, '*.doc'):
            doc = os.path.abspath(os.path.join(path, filename))
            print("processing %s" % doc)
            wordapp.Documents.Open(doc)
            # splitext replaces the extension safely, including upper-case .DOC
            docastxt = os.path.splitext(doc)[0] + '.txt'
            wordapp.ActiveDocument.SaveAs(docastxt, FileFormat=win32com.client.constants.wdFormatTextLineBreaks)
            wordapp.ActiveWindow.Close()
finally:
    # Always shut Word down, even if a document fails to convert
    wordapp.Quit()
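One detail when deriving the output name: str.rstrip('doc') removes trailing d, o and c characters rather than a suffix, so a file named REPORT.DOC (which the '*.doc' pattern matches on Windows, where matching is case-insensitive) would end up as REPORT.DOCtxt. A minimal sketch of a safer alternative, using a hypothetical helper txt_name built on os.path.splitext:

```python
import os


def txt_name(doc_path):
    """Return doc_path with its extension replaced by .txt."""
    root, _ext = os.path.splitext(doc_path)
    return root + ".txt"


if __name__ == "__main__":
    for name in ("report.doc", "REPORT.DOC"):
        print(name, "->", txt_name(name))
```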
To get other elements of the document, you can use the following command, where doc is the document object returned by wordapp.Documents.Open() (here, the text):
text = doc.Content.Text
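As a sketch of how reading the text fits together, the functions below open a document, read Content.Text, and split the result into paragraphs. This assumes Windows with Word installed; the helper names read_doc_text and split_paragraphs are my own, and the splitting relies on Word ending each paragraph with a carriage return in the text it returns.

```python
import os


def split_paragraphs(text):
    """Split text from Word into non-empty paragraphs (Word ends paragraphs with a carriage return)."""
    return [p.strip() for p in text.split("\r") if p.strip()]


def read_doc_text(path):
    """Open a .doc file in Word and return its full text (Windows + Word only)."""
    import win32com.client  # imported here so the rest of the module loads anywhere

    wordapp = win32com.client.Dispatch("Word.Application")
    try:
        doc = wordapp.Documents.Open(os.path.abspath(path))
        try:
            return doc.Content.Text
        finally:
            doc.Close(False)  # False: close without saving changes
    finally:
        wordapp.Quit()
```

For example, split_paragraphs(read_doc_text(some_path)) would give one string per paragraph, where some_path stands for the path to your own document.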
3. Create and write text to a Word Document
When using win32com, bear in mind that you are talking to the Word object model. You don’t need to know a lot of VBA or other languages to adapt existing samples to Python; you just need to figure out which parts of the object model are being used. Here is a simple example of how to create a Word document with content.
import win32com.client

# Create new Word Object
wordapp = win32com.client.Dispatch("Word.Application")
# Word Application shouldn't be visible
wordapp.Visible = False
# Create new Document Object
worddoc = wordapp.Documents.Add()
# Make some setup to the Document (Orientation 1 = landscape; margins are in points)
worddoc.PageSetup.Orientation = 1
worddoc.PageSetup.LeftMargin = 20
worddoc.PageSetup.TopMargin = 20
worddoc.PageSetup.BottomMargin = 20
worddoc.PageSetup.RightMargin = 20
worddoc.Content.Font.Size = 11
worddoc.Content.Paragraphs.TabStops.Add(100)
worddoc.Content.Text = "Hello, I am a text!"
worddoc.Content.MoveEnd()
# Close the Word Document (a save dialog pops up)
worddoc.Close()
# Close the Word Application
wordapp.Quit()
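The numbers in the page setup are easier to read with names attached. In the Word object model, margins are measured in points (72 per inch), and Orientation takes 0 for portrait and 1 for landscape. A small sketch, with helper and constant names of my own choosing:

```python
# Values of Word's WdOrientation enumeration
WD_ORIENT_PORTRAIT = 0
WD_ORIENT_LANDSCAPE = 1

POINTS_PER_INCH = 72.0
CM_PER_INCH = 2.54


def cm_to_points(cm):
    """Convert centimetres to points, the unit Word uses for margins."""
    return cm * POINTS_PER_INCH / CM_PER_INCH


if __name__ == "__main__":
    # A 2 cm margin is about 56.7 points
    print(round(cm_to_points(2.0), 1))
```

With this, worddoc.PageSetup.LeftMargin = cm_to_points(2.0) would set a 2 cm margin, and worddoc.PageSetup.Orientation = WD_ORIENT_LANDSCAPE reads more clearly than a bare 1.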
4. Extra
Here you will find a simple example of how to create a small table in Word and fill it with data, in this case load-flow results from PowerFactory.
from win32com import client
import powerfactory as pf

# Run the load flow in PowerFactory and collect the lines
app = pf.GetApplication()
lines = app.GetCalcRelevantObjects('*.ElmLne')
ldf = app.GetFromStudyCase('ComLdf')
ldf.Execute()
nr_lines = len(lines)

# Create a visible Word document with a 2x1 table at the start
wordapp = client.Dispatch("Word.Application")
wordapp.Visible = True
worddoc = wordapp.Documents.Add()
rang = worddoc.Range(Start=0, End=0)
worddoc.Tables.Add(rang, NumRows=2, NumColumns=1)
index = 2 + nr_lines
# Split the second row into three columns
worddoc.Tables(1).Rows(2).Cells(1).Split(1, 3)
width = worddoc.Tables(1).Rows(1).Cells(1).Width
# Title row
worddoc.Tables(1).Rows(1).Cells(1).Range.Bold = True
worddoc.Tables(1).Rows(1).Cells(1).Range.Font.Size = 15
worddoc.Tables(1).Rows(1).Cells(1).Range.Text = 'Report of LoadFlow Calculations from PowerFactory'
# Column headers
worddoc.Tables(1).Cell(2, 1).Range.Text = 'Name of the line'
worddoc.Tables(1).Cell(2, 2).Range.Text = 'Loading'
worddoc.Tables(1).Cell(2, 3).Range.Text = 'Comment'
app.PrintPlain(worddoc.Tables(1).Rows(2).Cells)

# One table row per line; flag loadings above 60 %
for i, line in enumerate(lines):
    worddoc.Tables(1).Rows.Add()
    worddoc.Tables(1).Cell(i+3, 1).Range.Text = line.loc_name
    worddoc.Tables(1).Cell(i+3, 2).Range.Text = str(round(line.GetAttribute('c:loading'), 2)) + ' %'
    if line.GetAttribute('c:loading') > 60:
        worddoc.Tables(1).Cell(i+3, 3).Range.Font.Color = 225
        worddoc.Tables(1).Cell(i+3, 3).Range.Text = 'Loading over 60%'
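The per-line logic in the loop can be separated from the Word calls, which makes it possible to check it without PowerFactory or Word. A minimal sketch, assuming the results are available as (name, loading) pairs; the function build_rows is hypothetical, and its default threshold of 60 simply mirrors the loop above:

```python
def build_rows(results, threshold=60.0):
    """Turn (name, loading_percent) pairs into (name, loading text, comment) rows."""
    rows = []
    for name, loading in results:
        # Only lines above the threshold get a comment
        if loading > threshold:
            comment = "Loading over {:g}%".format(threshold)
        else:
            comment = ""
        rows.append((name, "{} %".format(round(loading, 2)), comment))
    return rows


if __name__ == "__main__":
    for row in build_rows([("Line A", 45.5), ("Line B", 72.25)]):
        print(row)
```

Each resulting row could then be written into the table cell by cell, in the same way as in the loop.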
To copy the contents of a Word document and paste them into a new Outlook message, just follow the code below.
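import win32com.client

# Copy the whole document to the clipboard
# (word_path must hold the path of the document to copy)
word = win32com.client.Dispatch("Word.Application")
doc = word.Documents.Open(word_path)
doc.Content.Copy()
doc.Close()

# Create a new MailItem object and paste the clipboard into its body
outlook = win32com.client.Dispatch("Outlook.Application")
msg = outlook.CreateItem(0)
msg.GetInspector.WordEditor.Range(Start=0, End=0).Paste()
# Show the message window without blocking the script
msg.Display(False)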
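Note that this snippet never calls word.Quit(), so the Word process keeps running in the background. One way to guard against that is a small context manager that always quits the application, even if an error occurs. This is a sketch: word_session and its dispatch parameter are names I made up, and the parameter exists only so the helper can be exercised without Word installed.

```python
from contextlib import contextmanager


@contextmanager
def word_session(dispatch=None, visible=False):
    """Yield a Word application object and always quit it afterwards."""
    if dispatch is None:
        import win32com.client  # only needed when talking to the real Word

        dispatch = win32com.client.Dispatch
    app = dispatch("Word.Application")
    try:
        app.Visible = visible
        yield app
    finally:
        # Runs on normal exit and on exceptions alike
        app.Quit()
```

On Windows, the body of a "with word_session() as word:" block could then open, copy and close the document, and Word would be shut down when the block ends.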
Final Remark
File handling in Python is pretty easy because most basic operations take just a single line of code, as we have seen in this article. However, when you need to deal with a proprietary format, some difficulties begin to appear. In this article, I introduced the win32com library, which can manipulate .doc files, the main Microsoft Word format. I hope you find this content useful.