Handling multiple PDF files is a common task in academia. Whether it’s extracting specific pages for citation, bundling first pages of research papers for applications, or organizing research materials, a flexible tool can save valuable time. This blog post introduces a Python script that simplifies this task, providing flexibility to extract individual pages, comma-separated lists of pages, or a range of pages, with an option to bundle them into a single document.
How to Use the Script
Step 1: Installation
Ensure that Python is installed, along with the PyPDF2 library. You can install PyPDF2 using the following command:
pip install PyPDF2
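PyPDF2 has since been superseded by the `pypdf` package, which exposes the same `PdfReader` and `PdfWriter` names. If it is unclear which of the two is installed, a small check along these lines can help. This is only a sketch: `first_available` is a hypothetical helper, not part of either library.

```python
import importlib.util
from typing import Iterable, Optional


def first_available(module_names: Iterable[str]) -> Optional[str]:
    """Return the first module name that can be imported, or None.

    Hypothetical helper for illustration; it only looks the module up
    and does not import it.
    """
    for name in module_names:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


if __name__ == "__main__":
    backend = first_available(["pypdf", "PyPDF2"])
    if backend is None:
        print("No PDF library found. Try: pip install PyPDF2")
    else:
        print(f"Found {backend}")
```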
Step 2: Download and Run the Script
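Place the script in the directory containing the PDF files you want to work with and run it. It processes every PDF file in that directory, starting with the most recently modified, and applies the same page selection to each one.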
Step 3: Select Pages
You will be prompted to enter the page numbers or ranges to extract. You can enter individual numbers, comma-separated lists, or ranges using a hyphen (e.g., “1,3,5-7”).
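As a rough illustration of how such a page specification expands, here is a minimal sketch of a stricter parser. The name `parse_page_spec` and its validation rules (no empty entries, pages start at 1, ranges must ascend) are assumptions made for this example, not the behavior of the script itself.

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a string such as "1,3,5-7" into 1-based page numbers.

    Hypothetical helper: the validation rules are illustrative assumptions.
    """
    pages: List[int] = []
    for part in spec.split(','):
        part = part.strip()
        if not part:
            raise ValueError("empty entry in page specification")
        if '-' in part:
            # A range such as "5-7" includes both endpoints
            start_text, end_text = part.split('-', 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"page numbers start at 1: {part!r}")
            pages.append(page)
    return pages


if __name__ == "__main__":
    print(parse_page_spec("1,3,5-7"))  # [1, 3, 5, 6, 7]
```

Malformed input such as "7-5" or "0" raises `ValueError` here, so a caller could catch that one exception and ask the user to try again.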
Step 4: Choose Bundling Option
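Answer “y” to bundle the extracted pages into a single file named bundled_pages.pdf. Any other answer saves each extracted page as its own file, named after the source document (for example, page 3 of paper.pdf becomes paper_page_3.pdf).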
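To make the two modes concrete, the sketch below lists the files each mode would write, without touching any PDF. The file-name patterns (`bundled_pages.pdf` and `<name>_page_<n>.pdf`) come from the script; `planned_outputs` and the sample file names are hypothetical, introduced only for illustration.

```python
import os
from typing import List


def planned_outputs(pdf_files: List[str], pages: List[int], bundle: bool) -> List[str]:
    """Return the output file names for a run (hypothetical dry-run helper)."""
    if bundle:
        # Bundled mode: every selected page from every input goes into one file
        return ['bundled_pages.pdf']
    outputs = []
    for pdf_file in pdf_files:
        stem = os.path.splitext(pdf_file)[0]
        # Separate mode: one single-page file per input file and page number
        outputs.extend(f'{stem}_page_{page}.pdf' for page in pages)
    return outputs


if __name__ == "__main__":
    # Sample names are made up for the example
    print(planned_outputs(['paper_a.pdf', 'paper_b.pdf'], [1, 3], bundle=False))
```

In bundled mode the pages appear grouped by source file, in the order the files are processed, and within each file in the order the page numbers were entered.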
Step 5: Press Enter to Exit
When the script reports that the pages have been extracted, press Enter to exit.
Code Snippet:
The complete code is available below:
import glob
import os
import sys
from typing import List

try:
    from PyPDF2 import PdfWriter, PdfReader
except ImportError:
    print("PyPDF2 is not installed. Please install it by running the following command:")
    print("pip install PyPDF2")
    sys.exit(1)


def get_pdf_files() -> List[str]:
    # Every PDF in the current directory, most recently modified first
    return sorted(glob.glob('*.pdf'), key=os.path.getmtime, reverse=True)


def parse_page_numbers(input_str: str) -> List[int]:
    # Expand input such as "1,3,5-7" into [1, 3, 5, 6, 7]
    pages = []
    for part in input_str.split(','):
        if '-' in part:
            start, end = map(int, part.split('-'))
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


def extract_pages(pdf_files: List[str], pages_to_extract: List[int], bundle=True) -> None:
    if bundle:
        pdf_writer = PdfWriter()
        for pdf_file in pdf_files:
            if pdf_file == 'bundled_pages.pdf':  # Avoid reading the output file itself
                continue
            try:
                # Given a path, PdfReader opens and closes the file itself
                pdf_reader = PdfReader(pdf_file)
                for page_number in pages_to_extract:
                    # Subtract 1 to convert human-readable page numbers to zero-based indexing
                    pdf_writer.add_page(pdf_reader.pages[page_number - 1])
            except Exception as e:
                print(f"An error occurred with file {pdf_file}: {e}")
        with open('bundled_pages.pdf', 'wb') as file:
            pdf_writer.write(file)
    else:
        for pdf_file in pdf_files:
            try:
                pdf_reader = PdfReader(pdf_file)
                for page_number in pages_to_extract:
                    # A fresh writer per page produces one single-page file each
                    pdf_writer = PdfWriter()
                    pdf_writer.add_page(pdf_reader.pages[page_number - 1])
                    output_name = f'{os.path.splitext(pdf_file)[0]}_page_{page_number}.pdf'
                    with open(output_name, 'wb') as file:
                        pdf_writer.write(file)
            except Exception as e:
                print(f"An error occurred with file {pdf_file}: {e}")


if __name__ == "__main__":
    pdf_files = get_pdf_files()
    page_numbers_input = input("Enter the page numbers or ranges to extract (e.g., 1,3,5-7): ")
    try:
        pages_to_extract = parse_page_numbers(page_numbers_input)
    except ValueError:
        print("Could not read the page numbers. Use a format such as 1,3,5-7.")
        sys.exit(1)
    bundle_option = input("Would you like to bundle the extracted pages into one file? (y/n): ").strip().lower() == 'y'
    extract_pages(pdf_files, pages_to_extract, bundle=bundle_option)
    print("Pages extracted successfully.")
    input("Press Enter to exit...")
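One case the script does not guard against is a page number that a given document does not have: indexing past the end raises an error, the `except` clause reports it, and the remaining pages of that file are skipped. A possible pre-check, sketched under the assumption that the page count is known (in the script it would be `len(pdf_reader.pages)`), could look like this. `split_valid_pages` is a hypothetical helper, not part of the script or of PyPDF2.

```python
from typing import List, Tuple


def split_valid_pages(requested: List[int], page_count: int) -> Tuple[List[int], List[int]]:
    """Separate 1-based page numbers into those a document has and those it lacks.

    Hypothetical helper for illustration only.
    """
    valid = [page for page in requested if 1 <= page <= page_count]
    skipped = [page for page in requested if not 1 <= page <= page_count]
    return valid, skipped


if __name__ == "__main__":
    # A 4-page document asked for pages 1, 3, 5, 6 and 7
    valid, skipped = split_valid_pages([1, 3, 5, 6, 7], page_count=4)
    print(valid, skipped)  # [1, 3] [5, 6, 7]
```

With a check like this, the pages that do exist could still be extracted, and the skipped ones reported per file instead of aborting the rest of that document.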
For researchers and academics, this script provides a convenient way to quickly gather specific pages from numerous documents. Whether it’s for a job application, scholarship submission, or research overview, this automation saves time and reduces manual effort.