Advanced Unstructured Data Processing for ESG Reports
Jiahui Peng1, Jing Gao1, Xin Tong1,*, Jing Guo2, Hang Yang2, Jianchuan Qi2, Ruiqiao Li2,
Nan Li2,✝, Ming Xu2
1 Introduction
In the field of data management, the dichotomy between "Structured Data" and "Unstructured Data"
is critical. Structured data, typified by its presence in databases and spreadsheets, is readily processed
via ETL (Extract, Transform, Load) procedures. Unstructured data, however, which includes
documents, PDF files, emails, and social media posts, presents complexities due to its non-
standardized formats, making context comprehension and key information extraction challenging.
By 2025, unstructured data is estimated to constitute about 80% of the 179.6 ZB of global
data generated (IDC, 2023). This data is not only massive in volume but also varied in origin, and
its rapid generation and complex structure pose significant challenges to extracting valuable
insights.
Particularly within this domain, ESG (Environmental, Social, and Governance) reports,
predominantly in PDF format, are a notable challenge. These reports, with their blend of text, tables,
and charts, offer insights into an enterprise's sustainability practices but are difficult to process due
to their unstructured nature. In industrial ecology and ESG report analysis, these unstructured data
are invaluable for understanding a company's environmental, social, and governance impacts. Hence,
effectively processing these reports for in-depth research is imperative.
Our research explores a methodology for processing unstructured ESG report texts, making them
more amenable to analysis by advanced Natural Language Processing (NLP) technologies like GPT.
From intricate table data extraction to intelligent text chunking, our goal is to refine large language
model responses, thereby gaining deeper insights into corporate performance in environmental,
social, and governance realms. We adopt an innovative document processing strategy for Retrieval-
Augmented Generation (RAG) applications (Lewis et al., 2020), using a content-aware chunking
method built on the Unstructured toolkit. This approach allows for coherent and logical paragraph segmentation,
enhancing the precision of vector databases for similarity search and large language model prompts.
Our novel text processing method, based on the unstructured open-source toolkit, includes PDF
content partitioning, multimodal elements integration, content segmentation, and textual content
extraction. This paper details these methods and their application in processing ESG reports, aiming
to improve processing results and to support advanced analysis and decision-making in industrial
ecology and sustainable business, thereby offering new perspectives and tools for managing complex
unstructured data in corporate settings.
2 Related Work
Structured data, characterized by its organized and definable structure, is inherently more suitable
for direct use in computer applications, as it adheres to a specific format that computer programs
can easily process (Sharma and Bala, 2014). In contrast, unstructured data usually consists of
information that either lacks a specific data model or has a model not readily accessible to computer
programs, making it more challenging to process and analyze (Dhuria et al., 2016).
Natural Language Processing (NLP), a field at the intersection of Artificial Intelligence (AI) and
linguistics, has evolved significantly since the 1960s (Lewis et al., 2020). It primarily focuses on
generating and understanding natural languages, aiming to enhance human-computer
interaction (Gharehchopogh and Khalifelu, 2011). In the realm of ESG reports, a predominant portion
of the data is unstructured. This scenario poses a significant challenge and, concurrently, an area of
intense research interest: the effective application of NLP techniques to the analysis and
interpretation of unstructured data in ESG reports.
In recent advancements, the integration of sophisticated NLP models like BERT and GPT has
catalyzed a surge in the application of diverse NLP methodologies for analyzing sustainability
reports (Qiu and Jin, 2024). Amini et al. (2018) pioneered the use of Leximancer, an advanced content
analysis tool, to delineate and quantify the conceptual and thematic frameworks within sustainability
reports. Kang and Kim (2022) introduced an innovative methodology that addresses the limitations
identified in prior works. By employing sentence similarity techniques coupled with sentiment
analysis, their approach provides a nuanced understanding of thematic practices and trends in
sustainability reporting, elucidating the disparity in the prevalence of positive versus negative
disclosures among different corporations.
Presently, academic research in applying NLP to ESG report analysis predominantly emphasizes the
development and enhancement of analytical models and methodologies. However, this emphasis
often leaves underexamined the inherent complexities of the unstructured data that dominates ESG
reports, even though such data necessitates advanced preprocessing to avoid degrading the accuracy
of subsequent analyses. Kang and Kim (2022) selectively employed the "sent_tokenize"
function from the Natural Language Toolkit (NLTK) for the dual purpose of segmenting sentences
and eliminating characters not recognized by standard keyboards, thereby streamlining the textual
data for subsequent analyses. However, this method can lead to instances where text is improperly
segmented, resulting in a loss of accuracy in the conveyed information. Smeuninx et al. (2020)
adopted ABBYY FineReader, an advanced OCR tool, to transmute report PDFs into analyzable,
structured plaintext. However, this intricate process entailed manual identification and exclusion of
numerical and heterogeneously composed tables, thus filtering and forwarding only relevant textual
narratives to the NLP system. Reznic and Omrani began to pay attention to the unstructured data in
ESG reports, but performed only basic text cleaning in the pre-processing stage.
In summation, existing methods display a marked lack of efficiency in maintaining textual sequence
and coherence, coupled with a notable absence of automated, advanced techniques for extracting
data from tables. Within this framework, our research has introduced a groundbreaking
methodological advancement by employing the Unstructured Core Library. This innovative
approach enables the effective and accurate segmentation of unstructured data within ESG reports,
subsequently reorganizing it into an organized structure. This advancement effectively bridges the
gaps in current methodologies, providing a refined solution that markedly improves the accuracy
and efficiency of data processing in ESG reports.
3 Core Functions
The framework diagram in Fig. 1 shows a process for organizing information from an ESG report
PDF into a structured format. Initially, the PDF is split into text, images, tables, headers, and footers.
The headers and footers are filtered out, the text is cleaned, images are textualized, and tables are
converted to HTML. These elements are then grouped into sections, each with a title and
body, creating an organized list of the report's contents in a clear, structured form.
Fig. 1 Unstructured processing framework
3.1 Partitioning
During the partitioning process, the ESG report PDF is divided into multiple elements, including
texts, images, tables, headers, and footers, with the "hi_res" strategy (Unstructured, 2023), utilizing
detectron2, a deep learning-based object detection system. This strategy is particularly advantageous
for its ability to leverage layout information to gain additional insights about the document
elements, covering not just the text but also formatting and structure such as headings, paragraphs, and tables.
A notable limitation of the "hi_res" strategy is its difficulty in correctly ordering elements in multi-
column documents, potentially affecting the proper sequencing of content. Overall, the "hi_res"
strategy offers a comprehensive approach to PDF processing, particularly effective for complex
layouts but dependent on the availability of specific technologies like detectron2.
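A minimal sketch of this partitioning step is shown below, assuming the open-source unstructured Python package is installed with its PDF extras; the file name is illustrative:

# Partition an ESG report PDF into typed elements, as described above.
from unstructured.partition.pdf import partition_pdf

# "hi_res" uses a layout-detection model (detectron2) so that each element
# carries its detected category (Title, NarrativeText, Table, Image, ...)
# together with layout metadata such as page numbers and coordinates.
elements = partition_pdf(
    filename="esg_report.pdf",   # illustrative file name
    strategy="hi_res",
    infer_table_structure=True,  # keep an HTML rendering of detected tables
)

for element in elements[:5]:
    print(type(element).__name__, "->", element.text[:60])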
3.2 Cleaning
The post-partitioning elements require subsequent refinement, purification, and structuring. This
process involves the elimination of ancillary components such as "Headers" and "Footers", while
retaining standardized "Text", "Image", and "Table" data.
Initially, the procedure selectively excludes "Headers" and "Footers" from the elements, preserving
solely three elements: "Text", "Image", and "Table". For the "Text" elements, meticulous cleansing
is imperative prior to their integration into the NLP model, to mitigate potential detrimental impacts
on model efficiency caused by superfluous content. To address this challenge, the cleaning functions
from the Unstructured documentation are employed (Unstructured, 2023). These utilities proficiently
consolidate paragraphs demarcated by newlines, excise bullets and dashes at text commencement,
and eradicate excessive whitespace, thereby enhancing the clarity and integrity of the textual data.
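A sketch of the filtering and cleaning step, reusing the element list from the partitioning sketch above; the exact combination of cleaners shown here is illustrative rather than the study's precise configuration:

# Filter out headers/footers, then clean the remaining text elements.
from unstructured.cleaners.core import clean, group_broken_paragraphs

# Drop ancillary elements; keep text, images, and tables.
kept = [el for el in elements if el.category not in ("Header", "Footer")]

for el in kept:
    if el.category not in ("Image", "Table"):
        # Re-join paragraphs that were split by hard newlines.
        el.apply(group_broken_paragraphs)
        # Strip bullets, leading dashes, and excess whitespace.
        el.apply(lambda text: clean(
            text, bullets=True, dashes=True, extra_whitespace=True
        ))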
For "Table" elements, the raw text of the table will be stored in the text attribute for the element, and
4
the HTML representation of the table will be available in the element's metadata under
"element.metadata.text_as_html", so it is output in that form to preserve the formatting of the
table(Unstructured, 2023).
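A brief sketch of retrieving a table's HTML rendering, assuming infer_table_structure was enabled during partitioning:

# Access each detected table's flattened text and its HTML markup.
from unstructured.documents.elements import Table

for el in kept:
    if isinstance(el, Table):
        raw_text = el.text                # flattened cell text
        html = el.metadata.text_as_html   # <table>...</table> markup
        print(html)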
3.3 Chunking
The "Unstructured Core Library" chunking functions are crucial for document processing,
particularly in applications like RAG (Unstructured, 2023). A key function, "chunk_by_title",
segments documents into subsections by detecting titles, each title commencing a new section.
Notably, specific elements such as tables and non-text components (e.g., page breaks, images)
inherently constitute separate sections.
Activating the "multipage_sections" parameter facilitates the creation of sections spanning multiple
pages. This feature is essential for preserving thematic or contextual continuity across page breaks,
thus ensuring the segmented document mirrors the structure of the original text.
The "new_after_n_chars" parameter remains unspecified (set to None), indicating reliance on the
function's inherent setting for initiating new sections. Meanwhile, the "max_characters" parameter,
fixed at 4096, implies an adaptation to include larger sections within each chunk, catering to the
document’s specific needs.
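A sketch of this chunking configuration, using the chunk_by_title function from the library and the filtered element list from above:

# Chunk the cleaned elements by detected titles.
from unstructured.chunking.title import chunk_by_title

chunks = chunk_by_title(
    kept,
    multipage_sections=True,  # allow a section to continue across page breaks
    new_after_n_chars=None,   # rely on the function's default soft limit
    max_characters=4096,      # hard cap per chunk, as used in this study
)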
This process entails the systematic handling of segmented document chunks to classify and structure
text and table elements into an organized data format. The procedure iterates over each chunk,
extracting text from instances of "CompositeElement" and accumulating it in a list. For elements
categorized as "Table", their HTML representations are either integrated into this list or merged with
the preceding text to maintain a uniform compilation of related content. This operation yields a series
of text strings, each potentially encompassing a title and its corresponding content body.
Subsequently, these text strings are dissected into title-body duos using a specific delimiter. Each
duo is cataloged in a dictionary, with distinct keys assigned for the title and body. These dictionaries
are then assembled into a list, representing the methodically organized content. The final phase
involves converting this structured content into a pandas DataFrame, which is subsequently exported to an
Excel file. This conversion is pivotal for enabling further data analysis and manipulation, effectively
transforming the original unstructured document segments into an efficiently organized, tabular
representation.
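A sketch of this structuring step follows; the title-body delimiter (a blank line) and the output file name are assumptions, as the paper does not specify them:

# Classify chunks, split them into title-body pairs, and export to Excel.
import pandas as pd
from unstructured.documents.elements import CompositeElement, Table

texts = []
for chunk in chunks:
    if isinstance(chunk, CompositeElement):
        texts.append(chunk.text)
    elif isinstance(chunk, Table):
        html = chunk.metadata.text_as_html
        if texts:
            texts[-1] += "\n" + html  # merge the table with the preceding text
        else:
            texts.append(html)

records = []
for text in texts:
    # Split each string on the first blank line (assumed delimiter):
    # the first segment is treated as the title, the rest as the body.
    title, _, body = text.partition("\n\n")
    records.append({"Title": title, "Cleaned Text": body})

# Requires an Excel writer backend such as openpyxl.
pd.DataFrame(records).to_excel("esg_report_structured.xlsx", index=False)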
4 Experiments and Discussion
Three Fortune 500 companies, each representing a distinct industry classification (Walmart, Apple,
and Toyota), were selected for analysis (Table 1). Their latest ESG reports underwent
comprehensive unstructured splitting and processing, involving a detailed examination of the reports'
text, images, and tables. The primary aim of this empirical study was to assess the effectiveness of
unstructured processing techniques in analyzing ESG reports from diverse industry sectors. The
findings are detailed below:
In demonstrating the efficacy of our proposed method, we selected three distinct text blocks of
diverse industries and formatting styles: Walmart's FY2023 HIGHLIGHTS, Apple's Ambitious Goals,
and Toyota's ESD Project (Table 2). The processed results are visibly discernible through "Title" and
"Cleaned Text". This methodology adeptly identifies "Title" across varied ESG report formats from
these distinct sectors and executes thorough text cleaning and organization. Notably, in the Apple
report, key themes such as "Climate Change", "Resources", and "Smarter Chemistry" were precisely
identified, exemplifying the method's robustness in handling heterogeneous data structures and
themes.
The image processing results (Table 3) indicate that, for the majority of images in ESG
reports, crucial information is stored in textual form within the images, enabling
direct text extraction. Consequently, the current functionality of our method is tailored to processing
images that do not contain embedded text, textualizing them into descriptive captions.
The processing of tables represents the most challenging aspect of our work, a facet previously
overlooked in prior research (Table 4). The post-processing results demonstrate that the tables have
been organized into concise and clear formats using the "text_as_html" method. Specifically, in
Walmart's original table, a less conspicuous table element was successfully identified and
processed as a table. Apple's original table, considerably more complex with 21 rows and 7 columns,
underwent meticulous segmentation and organization. However, because texts such as "Water",
"Waste", and "Product packaging footprint" are physically separated from the core table area, these
elements were not captured. The processing of the upper part of the table is also not fully
satisfactory, although large language models may still be able to interpret it. Toyota's
original table, being more standardized, yielded a distinctly clear and well-processed result.
Table 2 Walmart, Apple, Toyota text processing results

Company: Walmart
Title: FY2023 HIGHLIGHTS
Cleaned Text: Over 90% of assessed audit reports were rated green or yellow, and less than 2% of facilities assessed received a successive orange rating. 99% of Walmart U.S. and 99% of Sam’s Club U.S. net sales of fresh produce and floral were from suppliers endorsing the produce Ethical Charter.

Company: Apple
Title: Climate Change
Cleaned Text: Achieve carbon neutrality for our entire carbon footprint by 2030, and reach our emissions reduction target. Create all products with net zero carbon impact by 2030. Transition our entire product value chain, including manufacturing and product use, to 100 percent clean electricity by 2030.

Title: Resources
Cleaned Text: Use only recycled and renewable materials in our products and packaging, and enhance material recovery. Eliminate plastics in our packaging by 2025. Reduce water impacts in the manufacturing of our products, use of our services, and operation of our facilities. Eliminate waste sent to landfill from our corporate facilities and our suppliers.

Title: Smarter Chemistry
Cleaned Text: Integrate smarter chemistry innovation into the way we design and build our products.
Table 3 Walmart, Apple, Toyota image processing results (original image column not reproduced here)

Company: Walmart
Image Text: The image depicts a person outdoors, smiling at the camera and holding an avocado. They appear to be in a natural environment suggestive of an avocado farm, given the lush greenery and the appearance of avocado trees in the background. The individual is dressed in casual attire, which is practical for agricultural work, and appears to be either harvesting or inspecting the fruit, signifying involvement in farming or agricultural practices. This scene conveys a sense of agricultural life, potentially emphasizing the human aspect of farming and the cultivation of produce.

Company: Apple
Image Text: This image shows a person wearing blue gloves engaging in what appears to be a cleaning or maintenance activity on a piece of electronic equipment, which seems to be a computer server or data storage unit. The individual is using a yellow tool, possibly a cutter or prying device, and a white cloth, which could indicate that they are either cleaning sensitive components or preparing to perform hardware maintenance tasks such as opening the device or ensuring a dust-free environment. The compartmentalized design of the metal surface with circular perforations suggests that this is part of a system designed for efficient airflow, which is essential for cooling in high-performance computing equipment.

Company: Toyota
Image Text: The image is of a modern SUV. It exhibits a coupe-like roofline and features sporty alloy wheels, which are common design choices aimed at combining utility with a stylish aesthetic. The paint appears to be a matte or satin finish, which is a trend that has become more popular in recent years. The bodywork also shows pronounced creases and the door handles are flush with the body, indicating a focus on aerodynamics. Large wheels and low-profile tires are indicative of an emphasis on performance in addition to everyday utility. The lack of visible exhaust tips could suggest that this vehicle is an electric or hybrid model, aligning with the increasing focus on sustainability in the automotive industry.
Table 4 Walmart, Apple, Toyota table processing results
(The original tables and their "text_as_html" outputs for Walmart, Apple, and Toyota are rendered as images in the source document and are not reproduced here.)
5 Conclusion
This study presents a groundbreaking approach to processing and integrating unstructured data
from ESG reports, utilizing advanced techniques to convert these complex documents into
structured, analyzable formats. Our methodology, underpinned by the Unstructured library,
demonstrates a significant advancement over existing methods, particularly in handling the
diverse elements of PDFs such as text, images, and tables.
We have shown through empirical analysis that our method can adeptly manage various data types
found in ESG reports, ensuring that crucial information is not only preserved but also made
accessible for in-depth analysis. The method excels in processing text by effectively identifying
and cleaning titles and body sections, as demonstrated in reports from Walmart, Apple, and
Toyota. Additionally, it addresses the challenge of image processing by focusing on images
devoid of embedded text, a common feature in ESG reports.
One of the most significant achievements of our approach is its capability to handle complex
table structures, a task that has been notably challenging in previous studies. Our method's
ability to reorganize tables into a more structured and coherent format enhances the readability
and analysis of these critical data elements.
It is imperative to acknowledge certain limitations inherent in the current research. Despite the
methodology's robust performance in structuring and analyzing unstructured data from ESG
reports, challenges persist due to the considerable variability in report formats and writing styles
across different industries and corporations. Diverse page layouts can influence the efficacy of
data segmentation, though, on the whole, the method succeeds in achieving clear and precise
data division. Moreover, a notable deficiency of our approach lies in its handling of data charts
embedded with text. The current methodology does not optimally process these data-rich visual
elements, which remains an area for future enhancement.
Acknowledgements
This research was supported by one of the projects of the Erasmus Initiative: Dynamics of Inclusive
Prosperity, a joint project funded by the Dutch Research Council (NWO) and the National Natural
Science Foundation of China (NSFC): "Towards Inclusive Circular Economy: Transnational Network
for Wise-waste Cities (IWWCs)" (NSFC project number: 72061137071; NWO project number:
482.19.608), and the project "ESG Text Analysis for Assessing Carbon Disclosure Quality Based on
Large Language Models" funded by the SMP (Social Media Processing)-ZHIPU AI Large Language
Model Interdisciplinary Fund.
References
[1] Amini, M.; Bienstock, C.C.; Narcum, J.A., 2018. Status of corporate sustainability: A content
analysis of Fortune 500 companies. Business Strategy and the Environment 27(8), 1450-1461.
[2] Dhuria, S.; Taneja, H.; Taneja, K., 2016. NLP and ontology based clustering—An integrated
approach for optimal information extraction from social web, 2016 3rd International Conference
on Computing for Sustainable Global Development (INDIACom). IEEE, pp. 1765-1770.
[3] Gharehchopogh, F.S.; Khalifelu, Z.A., 2011. Analysis and evaluation of unstructured data: text
mining versus natural language processing, 2011 5th International Conference on Application of
Information and Communication Technologies (AICT). pp. 1-4.
[4] IDC, 2023. Worldwide Global DataSphere and Global StorageSphere Structured and Unstructured
Data Forecast, 2023–2027.
https://www.idc.com/getdoc.jsp?containerId=US50397723&pageType=PRINTFRIENDLY.
[5] Kang, H.; Kim, J., 2022. Analyzing and Visualizing Text Information in Corporate Sustainability
Reports Using Natural Language Processing Methods. Applied Sciences 12(11), 5614.
[6] Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih,
W.-t.; Rocktäschel, T., 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks.
Advances in Neural Information Processing Systems 33, 9459-9474.
[7] Qiu, Y.; Jin, Y., 2024. ChatGPT and finetuned BERT: A comparative study for developing
intelligent design support systems. Intelligent Systems with Applications 21, 200308.
[8] Reznic, M.; Omrani, R., A Guide to Converting Unstructured Data into Actionable ESG Scoring.
[9] Sharma, M.M.; Bala, A., 2014. An approach for frequent access pattern identification in web
usage mining, 2014 International Conference on Advances in Computing, Communications and
Informatics (ICACCI). IEEE, pp. 730-735.
[10] Smeuninx, N.; De Clerck, B.; Aerts, W., 2020. Measuring the Readability of Sustainability
Reports: A Corpus-Based Analysis Through Standard Formulae and NLP. International Journal of
Business Communication 57(1), 52-85.