How To Get Text From Document

How to Extract Text From PDF Files in All Formats.

Sort documents by type and extract text from PDF files

Do you need to extract text from different files such as PDFs and Word files?

This quick tutorial shows how to sort files by type and then extract text from PDF files. I downloaded two fake resumes in PDF format from Overleaf to demonstrate how this code works. I am not going to cover how to extract text from Word documents. You can download the docxpy Python package and use it to extract text from Word files. Feel free to contact me at anna@sakura-ai.com if you have any questions or need help parsing documents.

The main challenge in extracting text from PDF files is that they have different formats:

PDF files are either 8-bit binary files or 7-bit ASCII text files (using ASCII-85 encoding).
Every line in a PDF can contain up to 255 characters.
Every line ends with a carriage return, a line feed, or a carriage return followed by a line feed (depending upon the application or platform used to create the PDF file).
PDF is case sensitive.
The file format is completely independent of the platform that it is viewed or created on. Files can be moved back and forth between Macs, Windows systems, Linux systems, and so on. When FTP-ing a PDF file, it does make sense to compress it, to avoid data corruption by some outdated network system that the file needs to pass through.
Scanned PDFs are stored as images.

You can learn more about PDF files here: https://www.prepressure.com/pdf/basics/fileformat

My solution to this problem is to convert all PDF files into one format, images, using the pdf2image Python package, and then use an optical character recognition (OCR) Python package to extract text from the images.

First, import all packages. You need pdf2image to convert PDFs to ppm image files. We will do some path manipulation to join and rename text files, so we import the os and sys packages. The next part is calling the PIL library and importing Image along with pytesseract. You can see full pytesseract import and usage instructions here: https://pypi.org/project/pytesseract/
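As a rough illustration of the convert-then-OCR idea, here is a minimal sketch, not the exact code from this tutorial. The function name pdf_to_text and its convert and ocr parameters are hypothetical hooks added for illustration. By default they resolve to pdf2image's convert_from_path and pytesseract's image_to_string, which require poppler and the Tesseract binary to be installed.

```python
def pdf_to_text(pdf_path, convert=None, ocr=None):
    """Convert each page of a PDF to an image, then OCR every page.

    `convert` and `ocr` are hypothetical hooks for this sketch; leaving them
    as None uses pdf2image and pytesseract.
    """
    if convert is None:
        # Imported lazily: pdf2image relies on the poppler utilities.
        from pdf2image import convert_from_path as convert
    if ocr is None:
        # Imported lazily: pytesseract relies on the Tesseract binary.
        import pytesseract
        ocr = pytesseract.image_to_string

    pages = convert(pdf_path)  # one image object per page
    # OCR each page in order and join the page texts with newlines.
    return "\n".join(ocr(page) for page in pages)
```

Calling pdf_to_text("resume.pdf") with both packages installed would return the recognized text as one string. The file name here is only an example.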

Initialize the path to your documents and the counter that the pdf_extract function will use later to count the documents in the folder.

After all text is extracted from the image files, we want these image files erased so that the document folder does not fill up and text files do not get mixed up. In addition, on Apple's macOS, .DS_Store is a file that stores custom attributes of its containing folder, such as the position of icons or the choice of a background image. The name is an abbreviation of Desktop Services Store, reflecting its purpose. It can appear in the folder alongside the converted ppm image files. Since we are going to sort files by extension, these files can prevent our code from running, so we are just going to erase them as well. The delete_ppms function cleans up all unnecessary files from the document folder. It uses the os Python package, which provides a portable way of using operating system dependent functionality. You can see more documentation on the os package here: https://docs.python.org/3/library/os.html
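A cleanup helper along these lines could look like the sketch below. It assumes the folder is passed in as an argument, and the returned list of removed names is an addition for illustration.

```python
import os


def delete_ppms(folder):
    """Remove leftover .ppm page images and .DS_Store from `folder`."""
    removed = []
    for name in os.listdir(folder):
        if name.endswith(".ppm") or name == ".DS_Store":
            try:
                os.remove(os.path.join(folder, name))
                removed.append(name)
            except FileNotFoundError:
                # The file disappeared between listing and removal.
                pass
    # Return the removed names so the caller can log or check them.
    return sorted(removed)
```

PDF and Word files in the folder are left untouched because only the two listed patterns are matched.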

Now we need to sort files by type. We will use the file extension to determine its type. Since I have Word files and PDF files in my folder, I will initialize two lists, one for each extension type, where I will store the names of the files. In this for-loop, I first join the path and file names together to ensure their accessibility. Then I split each file name into its name and extension, which enables me to append the file names to two different lists based on their extension type.
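The sorting step described above can be sketched as follows. The function name sort_by_extension is hypothetical, and lowercasing the extension is an assumption added so that names like "CV.PDF" are also picked up.

```python
import os


def sort_by_extension(folder):
    """Split the file names in `folder` into PDF and Word lists."""
    pdf_files, docx_files = [], []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        if not os.path.isfile(full_path):
            continue  # skip sub-folders
        _, ext = os.path.splitext(name)
        ext = ext.lower()
        if ext == ".pdf":
            pdf_files.append(name)
        elif ext == ".docx":
            docx_files.append(name)
    return pdf_files, docx_files
```

Files with any other extension are ignored, so stray files do not end up in either list.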

Now we can finally extract text from our documents with the pdf_extract function. First, it prints the name of each file from which the text is being extracted. Depending on the size of the document, text extraction can take some time. This print statement will help you see which file is being extracted at the moment.

Since this function is going to be used in a for-loop for each file, it is important to call the delete_ppms function each time before extraction. It cleans up the image files left over from each document page, which prevents text from two different documents from being written into the same text file.

Then all files are converted to images, sorted, and the images are renamed. The images are named in the following format: image1-2.ppm. The first number is the document number and the second number is the page number. The index [i], which we initialized earlier outside of this function, keeps track of each document in the folder. The index [j], initialized inside the function, keeps track of each page in the document. The files are sorted to preserve the order in which the image files are renamed. This helps write each page into the text file in the same order as in the original document.
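The renaming scheme can be sketched like this. The function name rename_page_images is hypothetical, and the sketch assumes that the converter's generated file names already sort in page order and that page numbering starts at 0.

```python
import os


def rename_page_images(folder, doc_index):
    """Rename fresh .ppm files to image<doc>-<page>.ppm in sorted order."""
    # Only pick up images that have not been renamed yet.
    pending = sorted(
        name for name in os.listdir(folder)
        if name.endswith(".ppm") and not name.startswith("image")
    )
    new_names = []
    for page, name in enumerate(pending):
        new_name = f"image{doc_index}-{page}.ppm"
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
        new_names.append(new_name)
    return new_names
```

Running it for document 0 on a folder with two freshly converted pages would produce image0-0.ppm and image0-1.ppm.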

Next, a text file is created for each document. I chose to name the text files result followed by the document index. This naming procedure helps me quickly check that all files were extracted and combine all pages from the same document into the same text file. You can play with the os package to rename the text files to your liking.

Then all ppm image files are sorted again. The lambda function is created to sort the files based on their names and page numbers without using the extension.
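To see why a numeric sort key matters, here is a small sketch, assuming file names in the image<doc>-<page>.ppm format. A plain string sort would place page 10 before page 2, while the lambda compares the page numbers as integers.

```python
import os

names = ["image0-10.ppm", "image0-2.ppm", "image0-1.ppm"]

# Drop the extension, then take the number after the last hyphen.
ordered = sorted(
    names,
    key=lambda name: int(os.path.splitext(name)[0].rsplit("-", 1)[1]),
)
print(ordered)  # ['image0-1.ppm', 'image0-2.ppm', 'image0-10.ppm']
```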

And finally, the text from the images is written into the text files created earlier.
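The writing step might look like the sketch below. The function name write_document_text and the ocr parameter are hypothetical. When ocr is left as None, the sketch opens each image with PIL and passes it to pytesseract.image_to_string.

```python
import os


def write_document_text(folder, doc_index, ocr=None):
    """OCR the pages of one document into result<doc_index>.txt."""
    if ocr is None:
        # Imported lazily so the sketch loads without the OCR stack.
        from PIL import Image
        import pytesseract

        def ocr(image_path):
            return pytesseract.image_to_string(Image.open(image_path))

    prefix = f"image{doc_index}-"
    pages = [
        name for name in os.listdir(folder)
        if name.startswith(prefix) and name.endswith(".ppm")
    ]
    # Sort by the page number after the hyphen, ignoring the extension.
    pages.sort(key=lambda n: int(os.path.splitext(n)[0].rsplit("-", 1)[1]))

    result_path = os.path.join(folder, f"result{doc_index}.txt")
    with open(result_path, "w", encoding="utf-8") as out:
        for name in pages:
            out.write(ocr(os.path.join(folder, name)))
    return result_path
```

Because only images with the matching document prefix are read, pages from another document cannot leak into the same text file.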

At last, we can run our pdf_extract function on all the PDF files appended earlier.

Now if you go to your folder, you should see two text files named result0.txt and result1.txt, one for each resume.

Thank you for reading my tutorial! Please leave comments below with suggestions on how to edit and format this tutorial.