Keeping the World at Peace: Extracting Outlines and Tables from PDFs with Python

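I spent the last couple of days helping Nan with a small job that involved extracting tables from PDFs. I used two open-source tools, PDFMiner and PyPDF2, and this post records the extraction process.

Specifically, the job was to download more than 28,000 PDFs from a website, each about 150 pages long, and extract the information from just one table in each PDF.

Downloading the PDFs can be handled with a crawler, but extracting data from them is more troublesome. I found no tool that exports a table from a PDF directly, so I fell back on the clumsiest approach: extract the text, then rebuild the table from it.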

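Getting page numbers from the built-in outline

Extracting text means locating it first; otherwise, walking through a PDF of 150-odd pages is laborious. PDFMiner does support extracting the outline, but the destination attached to each outline entry is of little use for locating text.

Outline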

Some PDF documents use page numbers as destinations, while others use page numbers and the physical location within the page. Since PDF does not have a logical structure, and it does not provide a way to refer to any in-page object from the outside, there’s no way to tell exactly which part of text these destinations are referring to.
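For reference, PDFMiner's `PDFDocument.get_outlines()` yields `(level, title, dest, a, se)` tuples. The sketch below only shows how such entries might be listed. The `format_outline` helper and the sample tuples are made up for illustration, and the assumption that top-level entries have level 1 is mine.

```python
def format_outline(entries):
    """Render outline entries as indented lines, one per title.

    `entries` is any iterable of (level, title, dest, a, se) tuples, the
    shape yielded by PDFMiner's PDFDocument.get_outlines().
    """
    lines = []
    for level, title, dest, a, se in entries:
        # Assumption: top-level entries have level 1, so they get no indent.
        lines.append("  " * max(level - 1, 0) + title)
    return lines


# Hypothetical entries standing in for document.get_outlines().
sample_entries = [
    (1, "Section 1", None, None, None),
    (2, "Section 1.1", None, None, None),
    (1, "Section 2", None, None, None),
]
outline_lines = format_outline(sample_entries)
```

With a real file you would pass `document.get_outlines()` instead of `sample_entries`. The titles come through fine; it is the `dest` field that is hard to turn into a page number.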

So I looked for another approach and found an answer on Stack Overflow. I have pasted the code here directly.

# Get page numbers from the outline; returns a dict of {title: page number}.
# Python 2 code using the PyPDF2 1.x API (PdfFileReader, getOutlines, getObject).
from PyPDF2 import PdfFileReader
from PyPDF2.generic import Destination, TextStringObject


class PdfOutline(PdfFileReader):

    def getDestinationPageNumbers(self):
        def _setup_outline_page_ids(outline, _result=None):
            # Walk the (possibly nested) outline depth-first and record the
            # object id of the page each destination points to.
            if _result is None:
                _result = {}
            for obj in outline:
                if isinstance(obj, Destination):
                    _result[(id(obj), obj.title)] = obj.page.idnum
                elif isinstance(obj, list):
                    _setup_outline_page_ids(obj, _result)
            return _result

        def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None):
            # Walk the page tree and map each page's object id to its
            # 0-based page number.
            if _result is None:
                _result = {}
            if pages is None:
                _num_pages = []
                pages = self.trailer["/Root"].getObject()["/Pages"].getObject()
            t = pages["/Type"]
            if t == "/Pages":
                for page in pages["/Kids"]:
                    _result[page.idnum] = len(_num_pages)
                    _setup_page_id_to_num(page.getObject(), _result, _num_pages)
            elif t == "/Page":
                _num_pages.append(1)
            return _result

        try:
            outline_page_ids = _setup_outline_page_ids(self.getOutlines())
            page_id_to_page_numbers = _setup_page_id_to_num()
        except Exception:
            # Missing or malformed outline: signal failure to the caller.
            return None

        result = {}
        for (_, title), page_idnum in outline_page_ids.iteritems():
            if isinstance(title, TextStringObject):
                #title = title.encode('utf-8')
                pass
            result[title] = page_id_to_page_numbers.get(page_idnum, '???')
        return result

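This code returns a mapping from each outline title to its page number, with the outline traversed depth-first. Sort the page numbers into a list, find the index of the entry you want, and you have its page.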
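As a sketch of that lookup, assume the method returned a dict like the hypothetical `destinations` below (titles and page numbers are invented). Sorting the page numbers gives the list, and the entry after the wanted one bounds the range of pages to scan. `page_range_for` is an illustrative helper, not part of the original script.

```python
def page_range_for(destinations, wanted_title):
    """Return (first_page, next_section_page) for an outline title.

    `destinations` maps outline titles to page numbers, the shape returned
    by getDestinationPageNumbers(). next_section_page is None when the title
    is the last entry. Returns None if the title is missing or unresolved.
    """
    first = destinations.get(wanted_title)
    if not isinstance(first, int):
        # Title absent, or its page could not be resolved (e.g. '???').
        return None
    # Keep only resolved page numbers and sort them so neighbours bound
    # each section.
    pages = sorted({p for p in destinations.values() if isinstance(p, int)})
    idx = pages.index(first)
    next_page = pages[idx + 1] if idx + 1 < len(pages) else None
    return first, next_page


# Invented data for illustration.
destinations = {"Chapter 1": 3, "Chapter 2": 10, "Chapter 3": 42, "Broken": "???"}
chapter2_range = page_range_for(destinations, "Chapter 2")
```

Knowing where the next section starts lets the text-extraction loop stop early instead of walking all 150 pages.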

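Extracting the table

Extracting content from the body text is not just a matter of grabbing it. The way PDFs are generated and rendered (for want of a better term) is odd: the rows and columns of a table can come out in a seemingly random order, so extracting them by pattern or regular expression is also very difficult.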

PDF is evil. Although it is called a PDF “document”, it’s nothing like Word or HTML document. PDF is more like a graphic representation. PDF contents are just a bunch of instructions that tell how to place the stuff at each exact position on a display or paper. In most cases, it has no logical structure such as sentences or paragraphs and it cannot adapt itself when the paper size changes. PDFMiner attempts to reconstruct some of those structures by guessing from its positioning, but there’s nothing guaranteed to work. Ugly, I know. Again, PDF is evil.

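Coordinates

Fortunately there is always a way out, and this blog post gave me the idea. The principle is that every run of text in a PDF is an object that carries its content along with coordinates marking its position. Each PDF page is a plane whose origin (0, 0) is the lower-left corner, with the x-axis extending to the right and the y-axis extending upward. First, use the content of certain pieces of text to determine the table's coordinates. Which text to pick depends on your own documents. For example, my table looks like this: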
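To make the coordinate model concrete, here is a small sketch that uses a stand-in record rather than real PDFMiner objects. Real layout objects expose `x0`, `y0`, `x1`, `y1` and `get_text()`. The `TextBox` tuple, `find_anchor`, and all coordinates below are hypothetical.

```python
from collections import namedtuple

# Stand-in for a layout object: a bounding box plus its text.
# (x0, y0) is the lower-left corner and (x1, y1) the upper-right corner,
# measured from the page's lower-left origin.
TextBox = namedtuple("TextBox", "x0 y0 x1 y1 text")


def find_anchor(boxes, keyword):
    """Return the first box whose text contains `keyword`, else None."""
    for box in boxes:
        if keyword in box.text:
            return box
    return None


# Invented coordinates for illustration only.
boxes = [
    TextBox(50, 700, 120, 715, "Name"),
    TextBox(130, 700, 200, 715, "Meetings due"),
    TextBox(50, 680, 120, 695, "Alice"),
]
header = find_anchor(boxes, "Meetings")
# The header's left edge marks a column boundary, its top edge a row boundary.
left_edge, top_edge = header.x0, header.y1
```

Because y grows upward, a box higher on the page has a larger `y1` than one below it.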

Table

I picked 8 x values and 4 y values, then sorted the content by y in descending order and picked it out column by column according to x.
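The sort-then-bucket idea can be sketched in a few lines. This is a variation rather than the author's code: it assigns each cell to a column by its left edge using `bisect`, and starts a new row whenever the top edge drops by more than a tolerance. `Cell`, `rebuild_rows`, `row_tolerance`, and the sample data are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import namedtuple

# Minimal stand-in for a text object: left edge, top edge, content.
Cell = namedtuple("Cell", "x0 y1 text")


def rebuild_rows(cells, column_edges, row_tolerance=2.0):
    """Group cells into rows (top to bottom) and columns (left to right).

    `column_edges` holds the sorted x boundaries; N edges define N - 1 columns.
    """
    n_cols = len(column_edges) - 1
    rows = []
    last_y = None
    # Larger y1 means higher on the page, so sort by y1 descending.
    for cell in sorted(cells, key=lambda c: -c.y1):
        if last_y is None or last_y - cell.y1 > row_tolerance:
            rows.append([""] * n_cols)
            last_y = cell.y1
        col = bisect_right(column_edges, cell.x0) - 1
        if 0 <= col < n_cols:  # ignore text outside the table
            rows[-1][col] = cell.text
    return rows


# Cells deliberately listed out of order, as a PDF might emit them.
cells = [
    Cell(130, 695, "9"), Cell(50, 675, "Bob"), Cell(210, 695, "7"),
    Cell(50, 695, "Alice"), Cell(210, 675, "8"), Cell(130, 675, "9"),
]
rows = rebuild_rows(cells, column_edges=[50, 130, 210, 290])
```

The tolerance is there because text in the same visual row does not always share exactly the same top coordinate.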

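#x0          x1       x2          x3        x4       x5   x6         x7
#-------------------------------------------------------------------------y0
#|独立董事姓名|本报告期应 |现场出席次数|以通讯方式|委托出席|缺席|是否连续两次|
#| |董事会次数 | | | | |未亲自参加 |
#-------------------------------------------------------------------------y1
#明
#-------------------------------------------------------------------------
#楠
#-------------------------------------------------------------------------y2
#独立董事列席股东大会
#-------------------------------------------------------------------------y3
#
# Header glosses (the Chinese strings are kept because the code searches for them):
#   独立董事姓名 = independent director's name
#   本报告期应 / 董事会次数 = board meetings due to attend in the reporting period
#   现场出席次数 = attended in person; 以通讯方式 = attended remotely
#   委托出席 = attended by proxy; 缺席 = absent
#   是否连续两次未亲自参加 = missed two consecutive meetings in person?
#   独立董事列席股东大会 = shareholder meetings attended by independent directors
#   明, 楠 = sample data rows (director names)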

The method is crude and is offered for reference only:

# Python 2 code, as in the original. The names log, result, secCode, secName
# and announcementTitle are not defined in this excerpt; they are presumably
# module-level objects set up elsewhere in the author's script.
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine


def pdfHandler(filetitle):
    out = False    # whether the file could be processed
    pageIndex = 0  # page number to start scanning from
    print "parsing "+filetitle+" extracting page number"
    # Extract page numbers
    fp = open(filetitle, 'rb')
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure.
    # Supply the password for initialization.
    password = ''
    try:
        document = PDFDocument(parser, password)
    except Exception:
        fp.close()
        log.write('open pdf document error ')
        return False
    # Check if the document allows text extraction. If not, abort.
    if not document.is_extractable:
        fp.close()
        log.write('pdf document not extractable error ')
        return False

    pdf = PdfOutline(fp)
    pageslist = []
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # getDestinationPageNumbers() returns None when the outline cannot be read
    destinations = pdf.getDestinationPageNumbers() or {}
    for p, t in sorted([(v, k) for k, v in destinations.iteritems()]):
        pageslist.append(p)

    #get pageIndex

    ...

    print "parsing "+filetitle+" extracting table"

    start = False
    end = False

    firstpage = 0
    lastpage = 0
    namepage = 0  # page containing the 独立董事姓名 (director name) header
    pages = 0
    tableObjs = []
    for page in PDFPage.create_pages(document):
        pages += 1
        if end:
            break
        if pages < pageIndex:
            continue
        interpreter.process_page(page)
        # layout is the analysed result for the current page
        layout = device.get_result()

        # Note: tables split across pages must also be excluded
        for obj in layout:
            if isinstance(obj, (LTTextBox, LTTextLine)):
                text = obj.get_text()
                if start:
                    if u"独立董事姓名" in text:
                        namepage = pages
                    elif u"独立董事列席股东大会次数" in text:
                        # Last line of the table: keep it and stop scanning
                        end = True
                        lastpage = pages
                        tableObjs.append(obj)
                        break
                    else:
                        tableObjs.append(obj)
                        out = True
                else:
                    # Section title that precedes the table
                    if u"独立董事出席董事会及股东大会的情况" in text:
                        start = True  # start collecting from the next line
                        firstpage = pages

    print 'firstpage,lastpage,namepage', firstpage, lastpage, namepage
    if namepage == 0 or lastpage == 0 or namepage != lastpage:
        print 'table across two page, abort'
        fp.close()
        return False

    # Derive the column (x) and row (y) boundaries from the header cells
    x0, x1, x2, x3, x4, x5, x6, x7 = 0, 0, 0, 0, 0, 0, 0, 0
    y0, y1, y2, y3 = 0, 0, 0, 0
    for i in tableObjs:
        text = i.get_text()
        if u"本报告期应" in text:
            y0 = i.y1
            x1 = i.x0
        elif u"董事会次数" in text:
            y1 = i.y0
        elif u"独立董事列席股东大会" in text:
            y2 = i.y1
            x0 = i.x0
            y3 = i.y0
        elif u"现场出席次数" in text:
            x2 = i.x0
        elif u"以通讯方式" in text:
            x3 = i.x0
        elif u"委托出席" in text:
            x4 = i.x0
        elif u"缺席次数" in text:
            x5 = i.x0
        elif u"是否连续" in text:
            x6 = i.x0
            x7 = i.x1
        else:
            continue

    if 0 in (x0, x1, x2, x3, x4, x5, x6, x7):
        print 'extract x error '
        fp.close()
        return False
    if 0 in (y0, y1, y2, y3):
        print 'extract y error '
        fp.close()
        return False

    # Sort by y coordinate, top of the page first
    tableObjs.sort(key=lambda x: x.y1, reverse=True)

    dudong = []  # flattened cell values for the independent directors
    conj = False
    for item in tableObjs:
        if item.y0 >= y1:
            # Still inside the header rows
            continue
        # Independent director's name
        elif item.x1 < x1:
            dudong.append(item.get_text().replace('\n', ''))
        # Board meetings due to attend in the reporting period
        elif item.x0 > x1 and item.x1 < x2:
            dudong.append(item.get_text().replace('\n', ''))
        # Meetings attended in person
        elif item.x0 > x2 and item.x1 < x3:
            dudong.append(item.get_text().replace('\n', ''))
        # Meetings attended by remote communication
        elif item.x0 > x3 and item.x1 < x4:
            dudong.append(item.get_text().replace('\n', ''))
        # Meetings attended by proxy
        elif item.x0 > x4 and item.x1 < x5:
            dudong.append(item.get_text().replace('\n', ''))
        elif item.y1 == y2:
            break
        else:
            ......

    # Write 7 cells per row, followed by the security's metadata
    for (ii, item) in enumerate(dudong):
        result.write(item+',')
        if ii != 0 and (ii+1) % 7 == 0:
            result.write(secCode+','+secName+','+announcementTitle+'\n')

    fp.close()
    result.flush()
    return out
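The final loop joins cells with commas by hand, which breaks if a cell (a company name, say) itself contains a comma. A hedged alternative is to chunk the flat list into fixed-width rows and let the standard `csv` module do the quoting. `write_rows` and the sample values are hypothetical, and unlike the loop above this sketch drops a trailing partial row.

```python
import csv
import io


def write_rows(flat_values, row_width, extra_columns, stream):
    """Write `flat_values` as CSV rows of `row_width` cells plus `extra_columns`."""
    writer = csv.writer(stream, lineterminator="\n")
    # Only complete rows are written; a trailing partial row is dropped.
    for start in range(0, len(flat_values) - row_width + 1, row_width):
        row = list(flat_values[start:start + row_width])
        writer.writerow(row + list(extra_columns))


# Invented values; a row width of 3 keeps the example short.
buffer = io.StringIO()
write_rows(
    ["Alice", "9", "7", "Bob", "9", "8"],
    3,
    ["000001", "Example Co., Ltd."],
    buffer,
)
csv_text = buffer.getvalue()
```

For the table in this post the row width would be 7, and the extra columns would be the security code, security name, and announcement title.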

Anyway, in the end I kept the world at peace…