Streamlining Page Extraction from PDFs: A Python Script

Handling multiple PDF files is a common task in academia. Whether it’s extracting specific pages for citation, bundling first pages of research papers for applications, or organizing research materials, a flexible tool can save valuable time. This blog post introduces a Python script that simplifies this task, providing flexibility to extract individual pages, comma-separated lists of pages, or a range of pages, with an option to bundle them into a single document.

How to Use the Script

Step 1: Installation

Ensure that Python is installed, along with the PyPDF2 library. You can install PyPDF2 using the following command:

pip install PyPDF2

Step 2: Download and Run the Script

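Place the script in the directory containing the PDF files you want to work with and run it from that directory. The script processes every .pdf file it finds there, most recently modified first.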

Step 3: Select Pages

You will be prompted to enter the page numbers or ranges to extract. You can enter individual numbers, comma-separated lists, or ranges using a hyphen (e.g., “1,3,5-7”). Page numbers start at 1, and the same selection is applied to every PDF in the directory.
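To make the selection syntax concrete, here is a minimal standalone sketch of how a string such as "1,3,5-7" can be expanded into individual page numbers. The function name expand_page_spec is chosen for this illustration only; the script's own parser appears in the full listing further down.

```python
from typing import List


def expand_page_spec(spec: str) -> List[int]:
    """Expand a selection like "1,3,5-7" into a list of 1-based page numbers."""
    pages: List[int] = []
    for part in spec.split(','):
        part = part.strip()
        if '-' in part:
            # A range such as "5-7" includes both endpoints
            start, end = (int(x) for x in part.split('-'))
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


print(expand_page_spec("1,3,5-7"))  # [1, 3, 5, 6, 7]
```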

Step 4: Choose Bundling Option

Enter y to bundle the extracted pages into a single file (bundled_pages.pdf), or n to save each extracted page as a separate file.
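As a rough illustration of what each choice produces, the sketch below lists the output file names for a given set of inputs. The naming pattern follows the script's own code; the helper planned_outputs and the example file names are hypothetical and not part of the script.

```python
from typing import List


def planned_outputs(pdf_files: List[str], pages: List[int], bundle: bool) -> List[str]:
    """Return the output file names the script would write (illustration only)."""
    if bundle:
        # Bundling collects every extracted page into one file
        return ['bundled_pages.pdf']
    # Otherwise: one file per input PDF and page, with ".pdf" stripped from the input name
    return [f'{name[:-4]}_page_{page}.pdf' for name in pdf_files for page in pages]


print(planned_outputs(['paper_a.pdf', 'paper_b.pdf'], [1, 2], bundle=False))
# ['paper_a_page_1.pdf', 'paper_a_page_2.pdf', 'paper_b_page_1.pdf', 'paper_b_page_2.pdf']
```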

Step 5: Press Enter to Exit

When extraction finishes, press Enter to exit the script.

Code Snippet:

The complete code is available below:

import sys
import os
import glob
from typing import List

try:
    from PyPDF2 import PdfWriter, PdfReader
except ImportError:
    print("PyPDF2 is not installed. Please install it by running the following command:")
    print("pip install PyPDF2")
    sys.exit()

def get_pdf_files() -> List[str]:
    return sorted(glob.glob('*.pdf'), key=lambda x: os.path.getmtime(x), reverse=True)

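def parse_page_numbers(input_str: str) -> List[int]:
    """Expand a string such as "1,3,5-7" into a list of 1-based page numbers."""
    pages = []
    for part in input_str.split(','):
        part = part.strip()
        if not part:
            continue  # Ignore empty entries such as a trailing comma
        if '-' in part:
            start, end = map(int, part.split('-'))
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    # Page 0 would otherwise become index -1 and silently select the last page
    if any(page < 1 for page in pages):
        raise ValueError("Page numbers start at 1.")
    return pages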

def extract_pages(pdf_files: List[str], pages_to_extract: List[int], bundle=True) -> None:
    if bundle:
        pdf_writer = PdfWriter()
        for pdf_file in pdf_files:
            try:
                if pdf_file != 'bundled_pages.pdf': # Avoid reading the output file itself
                    with open(pdf_file, 'rb') as file:
                        pdf_reader = PdfReader(file)
                        for page_number in pages_to_extract:
                            # Subtracting 1 to convert from human-readable page numbers to zero-based indexing
                            pdf_writer.add_page(pdf_reader.pages[page_number - 1])
            except Exception as e:
                print(f"An error occurred with file {pdf_file}: {e}")

        with open('bundled_pages.pdf', 'wb') as file:
            pdf_writer.write(file)
    else:
        for pdf_file in pdf_files:
            try:
                # Use a context manager so the input file is closed after processing
                with open(pdf_file, 'rb') as file:
                    pdf_reader = PdfReader(file)
                    for page_number in pages_to_extract:
                        pdf_writer = PdfWriter()
                        pdf_writer.add_page(pdf_reader.pages[page_number - 1])
                        with open(f'{pdf_file[:-4]}_page_{page_number}.pdf', 'wb') as out_file:
                            pdf_writer.write(out_file)
            except Exception as e:
                print(f"An error occurred with file {pdf_file}: {e}")

if __name__ == "__main__":
    pdf_files = get_pdf_files()
    page_numbers_input = input("Enter the page numbers or ranges to extract (e.g., 1,3,5-7): ")
    pages_to_extract = parse_page_numbers(page_numbers_input)
    bundle_option = input("Would you like to bundle the extracted pages into one file? (y/n): ").lower() == 'y'
    extract_pages(pdf_files, pages_to_extract, bundle=bundle_option)
    print("Extraction finished. Any errors are listed above.")
    input("Press Enter to exit...")
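
In the script above, a page number beyond the end of a document raises an error for that file, which is reported before the script moves on to the next file. One possible refinement, not part of the original script, is to filter the requested pages against each document's page count before extracting (with PyPDF2, the count is available as len(pdf_reader.pages)). A minimal sketch, where split_valid_pages is a hypothetical helper:

```python
from typing import List, Tuple


def split_valid_pages(requested: List[int], page_count: int) -> Tuple[List[int], List[int]]:
    """Separate 1-based page numbers into those that exist and those that do not."""
    valid = [p for p in requested if 1 <= p <= page_count]
    skipped = [p for p in requested if not 1 <= p <= page_count]
    return valid, skipped


# Example: a 10-page document and a request that includes page 12
valid, skipped = split_valid_pages([1, 3, 12], page_count=10)
print(valid)    # [1, 3]
print(skipped)  # [12]
```

With a helper like this, the valid pages of a short document could still be extracted, and the skipped ones reported to the user.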

Download Link:

pdf_page_extractor Download

For researchers and academics, this script provides a convenient way to quickly gather specific pages from many documents. Whether it’s for a job application, scholarship submission, or research overview, this automation saves time and reduces manual effort.
