Using PyMuPDF to work with PDF files



  1. #1
    Expert Member
    Software developer (industry sector), France
    Joined February 2003, 1,605 posts
    Hello everyone.

    I'd like to present pdf_analyzer, a program that came out of a request at work and fully covers the requirements specification it was written for.

    Developed in Python 3.10.10 on Windows, it can also be used on other platforms. Built on the PyMuPDF library, it lets the user:

    • search for words or phrases in a page of a PDF file,
    • compute a ratio between the text and the images present on the page,
    • check for the presence of layers or markings.
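
To make the second item concrete, here is a minimal sketch of how the ratio can be worked out from per-page counts. It follows the arithmetic of the readability_rate method in the source further down. PageSummary and its fields are hypothetical stand-ins introduced for illustration, so the example runs without PyMuPDF or a real PDF.

```python
from dataclasses import dataclass


@dataclass
class PageSummary:
    """Hypothetical stand-in for what the analyzer reads from one page."""

    has_text: bool
    image_count: int
    has_xobjects: bool = False


def readability_rate(pages: list[PageSummary]) -> float:
    """Average of text / (text + images) over the pages, rounded to 2 decimals."""
    rates: list[float] = []
    for page in pages:
        if page.has_xobjects:
            # Same shortcut as the posted class: any XObject gives 0.0 overall.
            return 0.0
        text = 1 if page.has_text else 0
        total = text + page.image_count
        rates.append(text / total if total else 0.0)
    # Extra guard (an assumption, not in the posted class) for an empty list.
    return round(sum(rates) / len(rates), 2) if rates else 0.0


pages = [
    PageSummary(has_text=True, image_count=0),   # 1 / 1 = 1.0
    PageSummary(has_text=True, image_count=1),   # 1 / 2 = 0.5
    PageSummary(has_text=False, image_count=2),  # 0 / 2 = 0.0
]
print(readability_rate(pages))  # 0.5
```

Text counts as a single unit per page however much of it there is, so one image on a text page is enough to halve that page's rate.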


    All of the criteria above come from the requirements specification and may seem odd, even absurd, to anyone reading the source code.

    Each type of search can be applied to the first n pages and/or the last n pages, or to the whole PDF.
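
The page selection described above can be sketched as a standalone function. This is an illustrative rewrite, not the posted method: pages_to_scan here takes the page count directly (a hypothetical signature) and assumes non-negative arguments.

```python
def pages_to_scan(total_pages: int, first_pages: int = 0, last_pages: int = 0) -> list[int]:
    """Return the sorted, de-duplicated 0-based page indexes to analyze."""
    if not first_pages and not last_pages:
        # No range requested: scan the whole document.
        return list(range(total_pages))
    # Clamp both counts so that short documents do not overflow.
    first = min(first_pages, total_pages)
    last = min(last_pages, total_pages)
    head = set(range(first))
    tail = set(range(total_pages - last, total_pages))
    # The union removes pages that belong to both ranges.
    return sorted(head | tail)


print(pages_to_scan(149, 3, 3))  # [0, 1, 2, 146, 147, 148]
print(pages_to_scan(1, 10, 10))  # [0]
```

Using a set union means overlapping head and tail ranges never produce the same page twice, which matters for short documents.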

    All methods and variables are typed. That, too, was required by the specification.

    Here are the source files:

    pdf_analyzer.py :

    from dataclasses import dataclass
    from pathlib import Path
    from types import TracebackType
    from typing_extensions import Self
    import fitz
    from .exceptions import UnreadablePDF
     
     
    @dataclass
    class PDFAnalyzer:
        """PDF analyzer class.
     
        Attributes:
            filename: Path of the PDF file to analyze.
        """
     
        filename: Path
        document: fitz.Document | None = None
        error_msg: str | None = None
     
        def _load_document(self) -> None:
            """Instantiate the fitz.Document."""
            try:
                self.document = fitz.Document(self.filename)
            except (fitz.fitz.FileDataError, fitz.fitz.FileNotFoundError) as err:
                self.error_msg = str(err)
     
        def _pages_to_scan(self, first_pages: int = 0, last_pages: int = 0) -> list[int]:
            """Determines the list of pages to analyze according to the requested page
            ranges.
     
            Args:
                first_pages: The number of first pages to be analyzed.
                last_pages: The number of last pages to be analyzed.
     
            Returns:
                List of pages to be analyzed.
            """
            if self.document is None:
                return []
            total_pages: int = self.document.page_count
            if not first_pages and not last_pages:
                return list(range(total_pages))
            first_pages = first_pages if first_pages < total_pages else total_pages
            last_pages = last_pages if last_pages < total_pages else total_pages
            if first_pages and not last_pages:
                return list(range(first_pages))
            if last_pages and not first_pages:
                return list(range(total_pages))[-last_pages:total_pages]
            pages: list[int] = list(range(first_pages))
            pages.extend(list(range(total_pages)[-last_pages:total_pages]))
            return sorted(list(set(pages)))
     
        def readability_rate(self, first_pages: int = 0, last_pages: int = 0) -> float:
            """Calculates the readability rate of the pages to analyze.
     
            Args:
                first_pages: The number of first pages to be analyzed.
                last_pages: The number of last pages to be analyzed.
     
            Returns:
                The readability rate with 2 decimals.
            """
            if self.document is None:
                return 0.0
            rates: list[float] = []
            for page in self._pages_to_scan(first_pages=first_pages, last_pages=last_pages):
                text: int = 0
                content: fitz.Page = self.document.load_page(page)
                if content.get_xobjects():
                    return 0.0
                if content.get_text().strip():
                    text = 1
                image: int = len(content.get_images())
                try:
                    rates.append(text / (text + image))
                except ZeroDivisionError:
                    rates.append(0.0)
            return round(sum(rates) / len(rates), 2)
     
        def layer(self, first_pages: int = 0, last_pages: int = 0) -> bool:
            """Search for a layer in the pages to analyze.
     
            Args:
                first_pages: The number of first pages to be analyzed.
                last_pages: The number of last pages to be analyzed.
     
            Returns:
                True if a layer is found.
                False if no layer found.
            """
            if self.document is None:
                return False
            for page in self._pages_to_scan(first_pages=first_pages, last_pages=last_pages):
                content: fitz.Page = self.document.load_page(page)
                if content.get_xobjects():
                    return True
            return False
     
        @property
        def corrupted_file(self) -> bool:
            """Determines whether the file is corrupted or not.
     
            Returns:
                True if corrupted file.
                False otherwise.
            """
            return bool(self.error_msg)
     
        def terms_found(
            self,
            first_pages: int = 0,
            last_pages: int = 0,
            words: list[str] | None = None,
            case_sensitive: bool = False,
        ) -> bool:
            """Look for one of the terms on each page.
     
            Args:
                first_pages: The number of first pages to be analyzed.
                last_pages: The number of last pages to be analyzed.
                words: List of words to find.
                case_sensitive: Whether the search for words is case-sensitive.
     
            Raises:
                UnreadablePDF if the file is unknown or corrupted.
     
            Returns:
                True if a term is found.
                False otherwise.
            """
            if self.document is None:
                raise UnreadablePDF(self.error_msg)
            if words is None:
                return False
            for page in self._pages_to_scan(first_pages=first_pages, last_pages=last_pages):
                content: fitz.Page = self.document.load_page(page)
                for sentence in words:
                    if case_sensitive:
                        if sentence in content.get_text():
                            return True
                    else:
                        if content.search_for(sentence):
                            return True
            return False
     
        def is_readable(
            self, first_pages: int = 0, last_pages: int = 0, acceptance: float = 0.0
        ) -> bool:
            """Determines if a PDF has sufficient readability to be processed.
     
            Args:
                first_pages: The number of first pages to be analyzed.
                last_pages: The number of last pages to be analyzed.
                acceptance: Minimum readability rate required to consider the PDF readable.
     
            Returns:
                True if the PDF is readable enough.
                False otherwise.
            """
            return (
                self.readability_rate(first_pages=first_pages, last_pages=last_pages)
                >= acceptance
            )
     
        def __enter__(self) -> Self:
            """__enter__ method for use with 'with' instantiation of the class.
     
            Returns:
                The instance.
            """
            self._load_document()
            return self
     
        def __exit__(
            self,
            exc_type: type[BaseException] | None,
            exc_val: BaseException | None,
            exc_tb: TracebackType | None,
        ) -> None:
            """__exit__ method for use with 'with' at the end of the instance."""
            if self.document is not None:
                self.document.close()
    exceptions.py :

    class UnreadablePDF(Exception):
        """Exception for damaged, unreadable PDF files or other problems."""
    For the tests (which rely on real PDF files not included here), test_pdf_analyzer.py:

    from pathlib import Path
    import pytest
    import fitz
    from pdf_analyzer import PDFAnalyzer, UnreadablePDF
     
     
    class TestSearch:
        def test_instance_with_minimal_argv(self) -> None:
            with PDFAnalyzer(filename=Path("tests/samples/match.pdf")) as search:
                assert search.filename == Path("tests/samples/match.pdf")
                assert isinstance(search.document, fitz.Document)
                assert search.error_msg is None
     
        @pytest.mark.parametrize(
            "filename,expected_type,expected_error",
            [
                (
                    Path("tests/samples/damaged_pdf.pdf"),
                    None,
                    "cannot open broken document",
                ),
                (
                    Path("unknown_file.pdf.pdf"),
                    None,
                    "no such file: 'unknown_file.pdf.pdf'",
                ),
            ],
        )
        def test_instance_with_bad_pdf_file(
            self, filename: Path, expected_type: None, expected_error: str
        ) -> None:
            with PDFAnalyzer(filename=filename) as search:
                assert search.document == expected_type
                assert search.error_msg == expected_error
     
        @pytest.mark.parametrize(
            "filename,first_pages,last_pages,expected",
            [
                (Path("tests/samples/match.pdf"), 1, 0, [0]),
                (Path("tests/samples/match.pdf"), 10, 0, [0]),
                (Path("tests/samples/match.pdf"), 10, 10, [0]),
                (Path("tests/samples/match.pdf"), 0, 5, [0]),
                (Path("tests/samples/watermark.pdf"), 0, 5, [144, 145, 146, 147, 148]),
                (Path("tests/samples/watermark.pdf"), 3, 3, [0, 1, 2, 146, 147, 148]),
                (Path("tests/samples/watermark.pdf"), 0, 0, list(range(149))),
                (Path("tests/samples/damaged_pdf.pdf"), 5, 10, []),
            ],
        )
        def test_pages_to_scan(
            self, filename: Path, first_pages: int, last_pages: int, expected: list[int]
        ) -> None:
            with PDFAnalyzer(filename=filename) as doc:
                assert (
                    doc._pages_to_scan(first_pages=first_pages, last_pages=last_pages)
                    == expected
                )
     
        @pytest.mark.parametrize(
            "filename,first_pages,expected",
            [
                (Path("tests/samples/full_pdfi.pdf"), 0, 0.0),
                (Path("tests/samples/match.pdf"), 0, 1.0),
                (Path("tests/samples/watermark.pdf"), 40, 0.0),
                (Path("tests/samples/blank.pdf"), 0, 0.0),
                (Path("unknown_file.pdf"), 0, 0.0),
            ],
        )
        def test_readability_rate(
            self, filename: Path, first_pages: int, expected: float
        ) -> None:
            with PDFAnalyzer(filename=filename) as doc:
                assert doc.readability_rate(first_pages=first_pages) == expected
     
        @pytest.mark.parametrize(
            "filename,expected",
            [
                (Path("tests/samples/match.pdf"), False),
                (Path("tests/samples/watermark.pdf"), True),
                (Path("tests/samples/damaged_pdf.pdf"), False),
            ],
        )
        def test_layer(self, filename: Path, expected: bool) -> None:
            with PDFAnalyzer(filename=filename) as doc:
                assert doc.layer() is expected
     
        @pytest.mark.parametrize(
            "filename,expected",
            [
                (Path("tests/samples/match.pdf"), False),
                (Path("tests/samples/watermark.pdf"), False),
                (Path("tests/samples/damaged_pdf.pdf"), True),
            ],
        )
        def test_corrupted_file(self, filename: Path, expected: bool) -> None:
            with PDFAnalyzer(filename=filename) as doc:
                assert doc.corrupted_file is expected
     
        def test_terms_found_raises_an_exception(self) -> None:
            with PDFAnalyzer(filename=Path("tests/samples/damaged_pdf.pdf")) as doc:
                with pytest.raises(UnreadablePDF):
                    doc.terms_found()
     
        @pytest.mark.parametrize(
            "filename,first_pages,words,case_sensitive,expected",
            [
                (Path("tests/samples/match.pdf"), 1, None, True, False),
                (Path("tests/samples/match.pdf"), 1, None, False, False),
                (Path("tests/samples/match.pdf"), 1, ["Annexe Nationale"], True, True),
                (Path("tests/samples/match.pdf"), 1, ["ANNEXE NATIONALE"], False, True),
                (Path("tests/samples/match.pdf"), 1, ["ANNEXE NATIONALE"], True, False),
                (Path("tests/samples/not_match.pdf"), 1, ["Annexe Nationale"], True, False),
                (
                    Path("tests/samples/not_match.pdf"),
                    1,
                    ["Annexe Nationale"],
                    False,
                    False,
                ),
                (Path("tests/samples/watermark.pdf"), 1, ["Eurocode 6"], True, True),
                (Path("tests/samples/watermark.pdf"), 1, ["EUROCODE 6"], True, False),
                (Path("tests/samples/watermark.pdf"), 1, ["EUROCODE 6"], False, True),
                (
                    Path("tests/samples/watermark.pdf"),
                    1,
                    [
                        "La présente Norme européenne a été adoptée par le CEN le 3 janvier "
                        + "2022.\n\nLes membres du CEN sont tenus de se soumettre au "
                        + "Règlement Intérieur du CEN/CENELEC"
                    ],
                    True,
                    False,
                ),
                (
                    Path("tests/samples/watermark.pdf"),
                    1,
                    [
                        "la présente norme européenne a été adoptée par le cen le 3 janvier "
                        + "2022.\n\nLes membres du cen sont tenus de se soumettre au "
                        + "Règlement Intérieur du cen/cenelec"
                    ],
                    False,
                    True,
                ),
            ],
        )
        def test_terms_found(
            self,
            filename: Path,
            first_pages: int,
            words: list[str],
            case_sensitive: bool,
            expected: bool,
        ) -> None:
            with PDFAnalyzer(filename=filename) as doc:
                assert (
                    doc.terms_found(
                        first_pages=first_pages,
                        words=words,
                        case_sensitive=case_sensitive,
                    )
                    is expected
                )
     
        @pytest.mark.parametrize(
            "filename,acceptance,expected",
            [
                (Path("tests/samples/match.pdf"), 1.0, True),
                (Path("tests/samples/blank.pdf"), 0.5, False),
                (Path("tests/samples/damaged_pdf.pdf"), 0.5, False),
                (Path("tests/samples/full_pdfi.pdf"), 1, False),
                (Path("tests/samples/not_match.pdf"), 1.0, True),
                (Path("tests/samples/watermark.pdf"), 0.1, False),
            ],
        )
        def test_is_readable(
            self, filename: Path, acceptance: float, expected: bool
        ) -> None:
            with PDFAnalyzer(filename=filename) as doc:
                assert doc.is_readable(acceptance=acceptance) is expected
    The tests cover 100% of the code (again, a requirement of the specification).

    I'm posting this code here because I had a great time writing it, and I figure it might help some people discover the PyMuPDF library and its Document and Page classes, as well as pytest and its powerful parametrize marker.

    If you think this is worthless, no problem. I'll delete the topic.

  2. #2
    A short usage example:

    from pathlib import Path
    from pdf_analyzer import PDFAnalyzer, UnreadablePDF
     
    def main() -> None:
        first_pages_to_scan: int = 3
        last_pages_to_scan: int = 1
        words: list[str] = ["foo", "bar", "hello", "world"]
        sensitive_case: bool = False
        for pdf_file in Path("path/of/directory/contains/PDF files").glob("*.pdf"):
            # ALWAYS use the with statement!
            with PDFAnalyzer(filename=pdf_file) as job:
                # displays the readability rate of the requested pages
                print(job.readability_rate(first_pages=first_pages_to_scan, last_pages=last_pages_to_scan))
                if job.layer(first_pages=first_pages_to_scan, last_pages=last_pages_to_scan):
                    # at least one page contains a layer or a marking, do something
                    ...
                if job.corrupted_file:
                    # the PDF is damaged or not found, do something
                    ...
                if not job.is_readable(first_pages=first_pages_to_scan, last_pages=last_pages_to_scan, acceptance=0.5):
                    # the readability of the analyzed pages is below the requested 50% rate, do something
                    ...
                try:
                    if job.terms_found(first_pages=first_pages_to_scan, last_pages=last_pages_to_scan, words=words, case_sensitive=sensitive_case):
                        # at least one term found, do something
                        ...
                except UnreadablePDF as err:
                    # oops, file corrupted or does not exist, do raise or something else
                    ...
     
     
    if __name__ == '__main__':
        main()
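
As for why the with statement matters here: in the class above, the document is only opened in __enter__ and closed in __exit__. The sketch below reproduces that pattern with a hypothetical FakeDocument stand-in (not part of PyMuPDF), so it runs without any PDF. Analyzer is a simplified illustration, not the posted class.

```python
from typing import Optional


class FakeDocument:
    """Hypothetical stand-in for fitz.Document."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class Analyzer:
    """Minimal context manager following the same pattern as PDFAnalyzer."""

    def __init__(self) -> None:
        self.document: Optional[FakeDocument] = None

    def __enter__(self) -> "Analyzer":
        # The document is loaded here, not in __init__.
        self.document = FakeDocument()
        return self

    def __exit__(self, *exc_info: object) -> None:
        # Always release the document, even if the block raised.
        if self.document is not None:
            self.document.close()


without_with = Analyzer()
print(without_with.document)  # None: nothing was loaded

with Analyzer() as job:
    doc = job.document
    print(doc.closed)  # False
print(doc.closed)  # True
```

Instantiating the class without with leaves document set to None. With the posted class, that makes readability_rate return 0.0, layer return False, and terms_found raise UnreadablePDF.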
