Closed
Description
related #3281
Create a read_pdf method in IO tools for reading tables from PDF documents. Many data sets are released in PDF form.
For example:
- Walmart's Historical Unit Count and Square Footage: http://az204679.vo.msecnd.net/media/documents/unit-counts-q1-fy14_130131488115936836.pdf
- Las Vegas Visitor Statistics: http://www.lvcva.com/includes/content/images/media/docs/ES-YTD20128.pdf
- Coffee export data: http://www.ico.org/historical/2010-19/PDF/EXPCALY.pdf
There are a number of standalone tools and projects for this:
- http://ieg.ifs.tuwien.ac.at/projects/pdf2table/ (an academic project/paper; Java)
- https://pypi.python.org/pypi/pdfquery (Python lib)
- https://pypi.python.org/pypi/pdftable (Python lib)
- http://www.unixuser.org/~euske/python/pdfminer/ (Python lib)
- https://github.com/ashima/pdf-table-extract (Python lib)
There are also a number of sites/projects that convert PDF to HTML:
- https://github.com/coolwanglu/pdf2htmlEX/wiki (open source)
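To make the request a bit more concrete, here is a minimal sketch of the table-parsing step that could sit behind a `read_pdf`. It assumes the text has already been extracted from the PDF with its layout preserved (the extraction itself, e.g. with one of the libraries listed above, is not shown) and that columns are separated by runs of two or more spaces. The helper name `parse_layout_text`, the `min_gap` parameter, and the sample text are all made up for illustration; none of this is an existing pandas API.

```python
import re


def parse_layout_text(text, min_gap=2):
    """Split layout-preserved text into rows of cells.

    Hypothetical helper: assumes columns are separated by runs of at
    least `min_gap` spaces, so single spaces inside a cell are kept.
    """
    splitter = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(splitter.split(line))
    return rows


# Made-up sample standing in for the output of a PDF text extractor.
sample = """\
Region        2010    2011
North         10.0    11.5
South East     4.2     7.7
"""

rows = parse_layout_text(sample)
header, body = rows[0], rows[1:]
```

The resulting `header` and `body` could then be handed to `pandas.DataFrame(body, columns=header)`. Real PDFs will need more than this (merged cells, wrapped lines, multi-page tables), so treat it only as a starting point for discussion.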