Contributors mailing list archives

Re: Module to read and extract information from PDF's

MoaHub, Graeme Gellatly
- 06/09/2023 23:25:12
We are finding invoice2data a bit less stable under its new maintainers. For years it was fairly static and largely dedicated to Odoo; now it is more generalised. For this purpose it would also need a bit of customisation, and it really only suits cases where you know the document beforehand. We still use it, but wouldn't for a requirement like this.

For our recent requirement to integrate with DMS and the Enterprise Documents module, automatically receiving documents and attaching them to the correct record, we went with what is listed below plus a simple custom frontend model to define patterns. This was for a backscanning project of some 1m pages, with multipage detection and multiple document types. Basically, you scan 150 pages on a scanner, the batch comes in, gets parsed and split at page breaks into separate files, each with a copy of its extracted text, and each file is then automatically attached to the correct record.
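
A minimal sketch of the pattern-driven splitting described above, assuming each document type can be recognised by a regex that matches only on its first page. The names `START_PATTERNS`, `detect_doctype` and `split_documents`, and the example regexes, are hypothetical and not taken from the actual implementation.

```python
import re

# Hypothetical patterns: each maps a document type to a regex that is expected
# to match only on the first page of a document of that type.
START_PATTERNS = {
    "invoice": re.compile(r"^\s*TAX INVOICE\b", re.MULTILINE | re.IGNORECASE),
    "delivery": re.compile(r"^\s*DELIVERY NOTE\b", re.MULTILINE | re.IGNORECASE),
}


def detect_doctype(page_text, patterns=START_PATTERNS):
    """Return the first doctype whose pattern matches the page, else None."""
    for doctype, pattern in patterns.items():
        if pattern.search(page_text):
            return doctype
    return None


def split_documents(page_texts, patterns=START_PATTERNS):
    """Group consecutive pages (0-based indexes) into documents.

    A page that matches a start pattern opens a new document; any other page
    is treated as a continuation of the current one. Pages seen before the
    first match are collected under doctype None.
    """
    documents = []
    for index, text in enumerate(page_texts):
        doctype = detect_doctype(text, patterns)
        if doctype is not None or not documents:
            documents.append({"doctype": doctype, "pages": [index]})
        else:
            documents[-1]["pages"].append(index)
    return documents
```

Each resulting group of page indexes can then be written out as its own file and matched to a record. In practice the patterns would come from a user-editable model rather than a hard-coded dict.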

pdftotext works as advertised. tesseract has some dependencies and quirks, which is fine; it just needs some handling for errors and ambiguous results. To do really well, you would also want OpenCV or similar to adjust contrast and deskew images from scanned files, but we found that for the documents we were doing it didn't add enough value to justify the overhead. We offered to clean this work up and contribute it to OCA, but were refused on the basis that no one does OCR anymore.

Alternatively, you can just push images to something like GVision. That was our first implementation. It is maybe 1/3 of the code, but harder to test in an isolated dev environment, and the results, while much more comprehensive, weren't really value for money for our use case.

import pdftotext
import pytesseract
from pdf2image import convert_from_bytes
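
One way these three libraries might be combined is sketched below: use the embedded text layer where a page has one, and fall back to OCR for pages that look scanned. This is an illustration under assumptions, not the implementation described in this thread. The function names and the `min_chars` threshold are hypothetical, and the broad `except` is a placeholder for the error handling mentioned above.

```python
def extract_text_layer(pdf_bytes):
    """Return one string per page from the PDF's embedded text layer."""
    import io
    import pdftotext  # imported lazily so extract_pages can be tested without it

    return list(pdftotext.PDF(io.BytesIO(pdf_bytes)))


def ocr_page(pdf_bytes, page_number):
    """Render a single 1-based page to an image and OCR it with tesseract."""
    import pytesseract
    from pdf2image import convert_from_bytes

    images = convert_from_bytes(
        pdf_bytes, first_page=page_number, last_page=page_number
    )
    return pytesseract.image_to_string(images[0])


def extract_pages(pdf_bytes, text_layer=extract_text_layer, ocr=ocr_page,
                  min_chars=20):
    """Use the text layer where it looks usable, OCR the page otherwise.

    min_chars is an assumed threshold: pages whose text layer is shorter
    than this (after stripping whitespace) are treated as scanned images.
    """
    pages = []
    for number, text in enumerate(text_layer(pdf_bytes), start=1):
        if len(text.strip()) < min_chars:
            try:
                text = ocr(pdf_bytes, number)
            except Exception:
                # Simplification: on any OCR failure keep the (possibly
                # empty) text layer so the page is not lost.
                pass
        pages.append(text)
    return pages
```

Passing the extractors in as parameters keeps the fallback logic testable in an isolated dev environment without poppler or tesseract installed.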

On Thu, Sep 7, 2023 at 8:51 AM Enric Tobella Alomar wrote:
You can try the invoice2data extractor.

It can extract data from PDFs (not only invoice info).

On Wed, Sep 6, 2023 at 10:42 PM Samuel Macias Oropeza wrote:
Hello everyone. 

We have a client using Odoo 16 that needs to extract information from a PDF file and update a res.partner record with this info. The PDF contains data like name, address, ZIP code, VAT number, etc. Does anyone know of any module or Python library that could help us with this?

Thank you!
