AAAI Fall Symposium Series (FSS-24)

LLMasMMKG: LLM Assisted Synthetic Multi-Modal Knowledge Graph Creation for Smart City Cognitive Digital Twins

Sukanya Mandal1, Noel E. O’Connor2
1 SFI Centre for Research Training in Machine Learning (ML-Labs)
2 Insight SFI Research Centre for Data Analytics
School of Electronic Engineering
Dublin City University
Dublin, Ireland
sukanya.mandal2@mail.dcu.ie
Abstract

The concept of a Smart City (SC) Cognitive Digital Twin (CDT) presents significant potential for optimizing urban environments through sophisticated simulations, predictions, and informed decision-making. Comprehensive Knowledge Representations (KRs) that effectively integrate the diverse data streams generated by a city are crucial to realizing this potential. This paper addresses this by introducing a novel approach that leverages Large Language Models (LLMs) to automate the construction of synthetic Multi-Modal (MM) Knowledge Graphs (KGs) specifically designed for a SC CDT. Recognizing the challenges in fusing and aligning information from disparate sources, our method harnesses the power of LLMs for natural language understanding, entity recognition, and relationship extraction to seamlessly integrate data from sensor networks, social media feeds, official reports, and other relevant sources. Furthermore, we explore the use of LLM-driven synthetic data generation to address data sparsity issues, leading to more comprehensive and robust KGs. Initial outputs demonstrate the effectiveness of our approach in constructing semantically rich and interconnected synthetic KGs, highlighting the significant potential of LLMs for advancing SC CDT technology.

1 Introduction

The rapid pace of urbanization in the 21st century demands innovative approaches to manage the complexities of increasingly interconnected urban environments (Nijkamp and Kourtit 2013). Smart Cities (SCs) (Khalimon, Vikhodtseva, and Obradović 2020), characterized by their integration of advanced technologies and data-driven insights, offer a compelling solution to these challenges. Central to this vision are Cognitive Digital Twins (CDTs) (Abburu et al. 2020), which are Artificial Intelligence (AI) enabled virtual representations that model the dynamic interplay of physical and social systems within a city. These CDTs (Zheng, Lu, and Kiritsis 2022) provide a powerful platform for understanding urban complexities, enabling simulations, predictions, and informed decision-making for optimized city operations.

Constructing comprehensive Knowledge Representations (KRs) lies at the heart of effective CDT realizations for SCs. Multi-Modal Knowledge Graphs (MMKGs) have emerged as a powerful tool for this purpose (Liang et al. 2024), capable of integrating heterogeneous data from diverse urban domains, ranging from sensor networks and traffic management systems to healthcare records and social media feeds. However, building such Knowledge Graphs (KGs) for a SC CDT presents significant hurdles. The inherent diversity of data sources leads to heterogeneity in formats, semantics, and levels of detail (Lehtola et al. 2022), necessitating sophisticated alignment techniques to establish semantic correspondences and resolve inconsistencies. Furthermore, extracting meaningful relationships and implicit knowledge from unstructured data, such as social media posts and news articles, requires advanced natural language understanding capabilities.

Large Language Models (LLMs), trained on massive text and code datasets, offer a transformative solution for overcoming these challenges. LLMs excel at natural language processing tasks (Huang 2024), including entity recognition, relationship extraction (Wang et al. 2023), and text summarization (Jin et al. 2024), demonstrating a remarkable ability to discern semantic nuances and contextual dependencies. These capabilities make them ideally suited for automating or semi-automating the complex process of MMKG construction for SC CDTs.

This paper introduces a novel approach that leverages LLMs to create synthetic MMKGs specifically designed to enhance the representation and functionality of SC CDTs. Real-world urban data often suffers from limitations such as scarcity, privacy concerns due to sensitive information, and potential biases reflecting inequalities in data collection. Synthetic MMKGs offer a powerful solution by enabling us to generate large volumes of representative data while addressing issues of data scarcity, mitigating privacy risks, and promoting fairness (Qian, Cebere, and van der Schaar 2023). We investigate using LLMs for crucial tasks such as entity recognition across diverse data sources, relationship extraction from unstructured text, and the generation of synthetic data to address sparsity issues inherent in real-world datasets. The following sections delve into related work and motivation, detail our proposed methodology and implementation, and present sample outputs based on our proposed methodology. We conclude the paper with directions for future research.

Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
2 Related Work

Constructing comprehensive KGs for complex environments such as SCs presents a significant challenge (Anthopoulos 2015). While previous research has explored various approaches, ranging from manual curation to automated extraction, limitations persist. Existing work often focuses on domain-specific KGs for areas like transportation (Chen et al. 2022), (Zhang et al. 2023) or energy (Chun et al. 2020), (Chun et al. 2018), employing domain-specific ontologies and rule-based methods for data integration. However, these methods lack the flexibility and scalability required to handle the diverse and voluminous multi-modal data found in a holistic SC CDT.

Addressing this multi-modal challenge is crucial. Traditional data fusion and alignment techniques, such as schema mapping (Dsouza, Tempelmeier, and Demidova 2021), (Oliveira, Sahay, and d’Aquin 2019) and ontology alignment (Xiang et al. 2021), (Silva, Faria, and Pesquita 2022), often rely on labor-intensive handcrafted rules or statistical measures that struggle with semantic heterogeneity and ambiguity across different data modalities. While machine learning approaches like embedding-based alignment (Guo et al. 2022), (Fanourakis et al. 2023) show promise, they often necessitate large amounts of labelled data and may not generalize well to unseen data distributions.

The emergence of Large Language Models (LLMs) offers a potential solution. Recent research has demonstrated the effectiveness of LLMs in tasks such as entity linking (Shu et al. 2024), relationship extraction (Tao, Wang, and Bai 2024), (Meyer et al. 2023), and knowledge base completion (Xu et al. 2024a), (Xu et al. 2024c), (Wei et al. 2024), particularly in capturing complex semantic relationships and reasoning over unstructured text (De Bellis 2023), (Mandvikar 2023). However, the application of LLMs to construct MMKGs for SC CDTs remains largely unexplored.

A notable exception is the work reported in (Tupayachi et al. 2024), which is very similar to our approach in using an LLM to automate the creation of scenario-based ontologies for urban decision support systems. Their methodology employs the ChatGPT API as a reasoning core, leveraging methontology-based prompt tuning and transformers to generate ontologies from existing research articles and technical manuals. However, our approach differs in its focus on constructing synthetic MMKGs specifically designed for a SC CDT. We leverage LLMs not only for ontology generation but also for Multi-Modal (MM) data fusion, including the integration of sensor data and the generation of synthetic data to address sparsity issues, enabling a more comprehensive and data-driven representation of the SC.

Current LLM-based approaches predominantly focus on text-based Knowledge Extraction (KE), leaving a significant gap in effectively integrating and aligning information from diverse modalities like sensor data, images, and social media feeds – all of which are essential for a comprehensive SC CDT. Moreover, addressing data sparsity, a common issue in real-world SC datasets, requires innovative solutions like synthetic data generation, an area not adequately addressed in the context of LLM-driven KG construction.

This research aims to bridge these gaps by introducing a novel approach that leverages LLMs for constructing synthetic MMKGs specifically designed for a SC CDT (Mandal 2024). This work explores the potential of LLMs in fusing information from heterogeneous sources, aligning entities across multiple modalities, and generating synthetic data to augment KG coverage. This approach ultimately aims to support the development of more robust and comprehensive Digital Twin (DT) applications. Furthermore, this work proposes a reusable and scalable pipeline adaptable for future enhancements, as detailed in figures 1 and 2.

3 Motivation

The promise of SCs lies in their ability to leverage data for improved urban planning, resource management, and citizen services (Gharaibeh et al. 2017). Yet, this data-driven vision is hindered by the sheer complexity of the information generated. SCs are characterised by a constant stream of readings from traffic sensors, social media posts reflecting citizen sentiment, and reports from various city departments, to name but a few sources, each offering a fragmented view of the urban landscape. This inherent heterogeneity and volume of data necessitate a KR framework that goes beyond the capabilities of traditional data management techniques, leading us to explore the potential of KGs for SC CDTs.

3.1 Beyond Traditional Data - KGs for SCs

While traditional data formats like relational databases and spreadsheets are ideal for storing structured information, they struggle to adequately capture the complex relationships and interdependencies that define SC ecosystems. Ontologies (Ontotext 2024) and KGs (Ehrlinger and Wöß 2016), on the other hand, offer a powerful alternative by providing a rich semantic representation of data, facilitating seamless data integration, enabling contextual reasoning and inference, and promoting explainability and interpretability. To illustrate these advantages, let us consider how KGs could be used to represent and reason about data from traffic sensors.

Rich Semantic Representation (Ji et al. 2021): Consider a table storing traffic sensor data, including Sensor ID, Location, and Speed readings. Storing this in a relational database captures what is measured but not the relationships between things. While a relational database can efficiently store these attributes, it falls short in representing the rich connections that provide context and meaning. KGs excel in this area by representing information as interconnected triples, such as (Sensor 123, locatedAt, Intersection A) and (Intersection A, connectsTo, Street X). This interconnectedness allows for more nuanced and complex queries, such as "Find all sensors within 2 miles of an accident reported on social media", a query that requires understanding of spatial relationships and linking of data types. Furthermore, the inherent structure of KGs facilitates pattern discovery, revealing indirect connections that might otherwise go unnoticed, like how traffic congestion in one area might correlate with increased hospital visits in another.
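To make this triple-based representation concrete, the following minimal Python sketch (using the rdflib library; the namespace URI and entity names are illustrative, not part of any published ontology) builds the example triples above and answers a simple connectivity query with SPARQL.

```python
from rdflib import Graph, Namespace, RDF

# Hypothetical namespace for illustration only
EX = Namespace("http://example.org/smartcity/")

g = Graph()
g.bind("ex", EX)

# Triples mirroring the example in the text
g.add((EX.Sensor123, RDF.type, EX.TrafficSensor))
g.add((EX.Sensor123, EX.locatedAt, EX.IntersectionA))
g.add((EX.IntersectionA, EX.connectsTo, EX.StreetX))

# Query: which sensors are located at intersections that connect to Street X?
query = """
SELECT ?sensor WHERE {
    ?sensor ex:locatedAt ?intersection .
    ?intersection ex:connectsTo ex:StreetX .
}
"""
for row in g.query(query, initNs={"ex": EX}):
    print(row.sensor)  # -> http://example.org/smartcity/Sensor123
```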
Interoperability and Data Integration (Grangel-González 2019), (Melluso, Grangel-González, and Fantoni 2022): The SC landscape is characterized by data heterogeneity. Integrating information from traffic sensors, social media feeds, and weather reports into a unified system presents significant challenges for traditional databases due to differences in schemas, units, and levels of detail. KGs address this challenge by using standardized data models like the Resource Description Framework (RDF) (Pan 2009) and shared ontologies (vocabularies of a domain) (Ontoforce 2024). This creates a common language for representing data from diverse sources, e.g., allowing information from a traffic sensor and a weather report, such as temperature, to be expressed using the same ontology term. This enables direct comparison and analysis within a unified framework, thus significantly easing the integration of new data sources, as they can be readily mapped to the existing KG ontology, avoiding the need for a complete database overhaul.

Contextual Reasoning and Inference (Liu et al. 2024): Traditional databases are primarily designed for data storage and retrieval. Deriving new knowledge from the data often requires complex queries and external logic. KGs, in contrast, leverage their structured representation and semantic richness to enable built-in reasoning capabilities. This can be achieved through rule-based inference, where predefined rules, such as If (Traffic Light, isBroken, True) → (Traffic Flow, isDisrupted, True), capture logical relationships between entities and their attributes, allowing the KG to infer new knowledge directly from the data. Additionally, KGs can leverage semantic similarity to infer that seemingly different phrases, like Road Closed and Street Inaccessible, likely refer to the same event, even if the wording varies across data sources.
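As a minimal illustration of such rule-based inference, the sketch below applies the broken-traffic-light rule to a hand-written set of triples; the entities, predicate names, and rule are illustrative examples only, not part of the KG described later.

```python
# Minimal forward-chaining sketch over (subject, predicate, object) triples.
# The facts and the single rule are illustrative examples only.
facts = {
    ("TrafficLight42", "isBroken", True),
    ("TrafficLight42", "controlsFlowOn", "StreetX"),
}

def infer_disruptions(facts):
    """If a traffic light is broken, infer that the flow it controls is disrupted."""
    inferred = set()
    for subj, pred, obj in facts:
        if pred == "isBroken" and obj is True:
            for s2, p2, street in facts:
                if s2 == subj and p2 == "controlsFlowOn":
                    inferred.add((street, "trafficFlowDisrupted", True))
    return inferred

print(infer_disruptions(facts))
# {('StreetX', 'trafficFlowDisrupted', True)}
```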
Explainability and Interpretability (Schramm, Wehner, and Schmid 2023): Transparency and accountability are of paramount importance for SCs, i.e. the ability to understand the reasoning behind decisions is crucial. Machine learning models, while powerful, often operate as "black boxes", making it difficult to understand why a specific prediction was made. KGs, with their transparent and human-readable structure, provide a clear and auditable path of how conclusions are reached. This inherent explainability fosters trust in the insights derived from the KG and facilitates informed decision-making by urban planners and policymakers.

3.2 Leveraging LLMs for Enhanced KG Construction in SC Applications

This research is motivated by the limitations of existing KG construction methods in effectively addressing the complexities of SC data representations. Current approaches struggle to capture the multifaceted nature of urban environments, hindering the development of truly comprehensive and insightful CDTs. Specifically, we identify three key challenges that traditional methods struggle to overcome:

Multimodal Data Fusion (Liang et al. 2024): SCs are inherently multi-modal, with data originating from text-based sources like social media, time-series data from sensors, and even visual data from traffic cameras. Traditional KG construction methods often struggle to meaningfully combine these different modalities. Our research leverages the power of LLMs, with their inherent ability to process and understand multi-modal data, to develop novel techniques for learning joint representations that capture the intricate relationships across these different data modalities.

Automated KE (Peng et al. 2023): The sheer volume and heterogeneity of SC data make manual KG construction an arduous and error-prone task. Our work explores the use of LLMs to automate this process. By leveraging LLMs' natural language understanding capabilities, we aim to extract entities and relationships from textual sources like news articles and social media posts. Furthermore, we investigate how LLMs can learn patterns from sensor data, enabling the automatic extraction of meaningful relationships from different kinds of data points.

Synthetic Data Generation (Xu et al. 2024b): Real-world data is often incomplete, with missing data points hindering the KG's ability to provide a complete and accurate representation of the city. To address these challenges, our research explores the use of LLMs for synthetic data generation. By learning the underlying structure and patterns within the existing KG, LLMs can generate synthetic yet plausible data points to fill in gaps and enhance the KG's predictive power. This allows us to create more robust and comprehensive KGs that can better support the complex simulations and predictions required for effective SC management.

By harnessing the power of LLMs to address these challenges, this work paves the way for more intelligent, efficient, and insightful urban management, ultimately contributing to the development of more sustainable and citizen-centric urban environments.

4 Proposed Methodology

This section details the methodology employed to investigate the potential of LLMs in constructing synthetic MMKGs for SC CDTs. Recognizing the interconnected nature of SC domains, this work aims to capture and represent their complex interplay, facilitating a holistic understanding of the urban environment.

4.1 System Architecture – Core Components

Figure 1: System Architecture

Our proposed system architecture comprises four key modules, as illustrated in Figure 1, forming a workflow pipeline for LLM-assisted automated MMKG construction. Starting from the left, the core components of this architecture (in modules) are as follows:

Module 1 - Heterogeneous Data Sources and Data Processing: This module is responsible for gathering data from diverse sources reflecting various aspects of a SC, and applying pre-processing techniques to ensure data quality and consistency. Examples of these data sources include (but are not limited to):

• Smart Home Data: Sensor readings from smart homes, including temperature, energy consumption, appliance usage, and occupancy data.
• Smart Healthcare Data: Electronic health records (EHRs), anonymized patient data, hospital occupancy rates, and public health alerts.
• Smart Transportation Data: Real-time traffic data from sensors, GPS devices, and traffic cameras, as well as public transportation schedules and incident reports.
• Smart Grid Data: Energy consumption patterns, grid stability metrics, and renewable energy generation data.
• Social Media Feeds: Publicly available posts and comments from platforms like Twitter and Facebook, filtered for relevance to the target city.
• Official Reports and News Articles: Government reports, news articles, and online publications related to city events, infrastructure updates, and public announcements.

Standard data pre-processing techniques are applied to ensure data quality and consistency. Data cleaning handles missing values, removes duplicates, and corrects data errors. Data normalization standardizes units of measurement, converts data types, and resolves naming convention inconsistencies. Textual data preprocessing includes tokenization, stop word removal, stemming, and lemmatization of textual data.
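A minimal sketch of this kind of textual pre-processing, assuming NLTK with its punkt, stopwords, and wordnet resources downloaded; the helper name is illustrative rather than taken from our codebase.

```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Assumes nltk.download("punkt"), nltk.download("stopwords"),
# and nltk.download("wordnet") have been run once beforehand.
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text: str) -> list[str]:
    """Tokenize, lowercase, drop stop words and punctuation, then lemmatize."""
    tokens = word_tokenize(text.lower())
    return [
        LEMMATIZER.lemmatize(tok)
        for tok in tokens
        if tok.isalpha() and tok not in STOP_WORDS
    ]

print(preprocess_text("My smart thermostat keeps disconnecting from the Wi-Fi!"))
# e.g. ['smart', 'thermostat', 'keep', 'disconnecting']
```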
Module 2 - Multimodal Representation Learning (MMRL): MMRL involves creating representations that capture the relationships and interactions between different types of data (Zhang et al. 2019). This module transforms the preprocessed data into a unified representation, capturing semantic relationships across different data modalities. This prepares the data for effective KE by the LLM. Specific techniques are elaborated in the "Implementation and Output" section, tailored to the datasets and modalities used in this work.

Module 3 - LLM Guided KE: This module leverages LLMs for automated KE from the MM data representation. LLMs perform entity recognition (identifying relevant entities like people, locations, and events), relationship extraction (inferring relationships like locatedIn between a sensor and a building based on their co-occurrence in the data), and semantic schema mapping (determining semantic similarity between concepts across data sources, facilitating schema alignment). This ensures that data from different sources can be integrated into a unified KG.

Module 4 - KG Construction and Population: This module structures the extracted knowledge into a formal KG in RDF format. It involves representing knowledge as RDF triples, designing a hierarchical domain-specific ontology, incorporating original source data, and populating the KG with extracted entities, relationships, and attributes. This will further enable querying and reasoning over the integrated knowledge. The key steps involved in this module are:

• KR: RDF triples (subject, predicate, object) are used to express entities and relationships extracted by the LLM.
• Ontology Design: A hierarchical ontology is designed to define classes and properties for each domain (e.g., smart homes, transportation), ensuring consistency and interoperability within the KG.
• Data Incorporation: Data from the original sources is integrated into the ontology to create a KG, enriching its content and providing context for the extracted knowledge.
• KG Population: Extracted entities, relationships, and attributes are populated into the KG, adhering to the defined ontology. This results in a structured and queryable knowledge base that can be used for various downstream tasks.

The output of each module feeds into the next, culminating in an RDF file containing the ontology and the populated KG.
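The skeleton below illustrates, in simplified form, how the four modules could be chained into a single workflow ending in a serialized RDF file; the function names are placeholders for illustration and do not mirror the actual codebase API.

```python
from rdflib import Graph

# Placeholder stubs for the four modules; the real implementations are
# described in the "Implementation and Output" section.
def collect_and_preprocess_sources() -> dict:                      # Module 1
    return {"text": [], "sensor": []}

def build_multimodal_representations(data: dict) -> dict:          # Module 2
    return {"embeddings": [], **data}

def llm_guided_knowledge_extraction(repr_: dict) -> list[tuple]:   # Module 3
    return []  # list of (subject, predicate, object) triples

def construct_and_populate_kg(triples: list[tuple]) -> Graph:      # Module 4
    g = Graph()
    for triple in triples:
        g.add(triple)
    return g

if __name__ == "__main__":
    data = collect_and_preprocess_sources()
    representations = build_multimodal_representations(data)
    triples = llm_guided_knowledge_extraction(representations)
    kg = construct_and_populate_kg(triples)
    kg.serialize(destination="smart_city_kg.rdf", format="xml")  # final RDF file
```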
4.2 Benefits of the Modular System Architecture

The modular design of the proposed system architecture offers several key advantages, making it particularly well-suited for the dynamic and evolving nature of SC data:

• Reusability: Individual modules can be readily reused and adapted for various SC scenarios and tasks, promoting efficiency and consistency across applications.
• Robustness: Modularity enables easier debugging and error isolation. If one module encounters issues, it can be addressed independently without impacting the functionality of other modules, enhancing system stability.
• Scalability: Each module can be independently scaled to handle increasing data volume or complexity. This allows the system to adapt gracefully to the evolving needs of a growing SC.
• Composability: New modules can be easily integrated, or existing ones modified, to incorporate new data sources (including the integration of real-world data along with synthetic data), preprocessing steps, or KE techniques. This flexibility ensures the system remains adaptable and can readily incorporate advancements in LLMs and data availability.
• Maintainability: The modular structure promotes cleaner code organization and simplifies updates or modifications, making the system easier to maintain, update, and evolve over time.
• Microservice and Multi-Agent Compatibility: The modular design is well-suited for a microservice-based architecture, allowing distributed deployment, independent scaling and fault tolerance of each module. This modularity also supports the integration of intelligent multi-agents,
enabling more autonomous and adaptive system behavior. This further enhances the system's ability to handle increasing complexity and dynamism as the architecture evolves with additional use cases.

This modular approach collectively contributes to a more flexible, maintainable, and scalable system, capable of effectively handling the complexities of SC data management and KE.

5 Implementation and Output

This section describes a practical implementation of our proposed methodology. For the purpose of this implementation, we create and utilize synthetic data to construct a MMKG representing the interconnected domains of smart homes (Karimi et al. 2021), smart healthcare (Jokanović 2020), smart transportation (Gaur et al. 2015), and smart grids (Bonetto and Rossi 2017) within a SC context (Sánchez-Corcuera et al. 2019). We describe the generation of synthetic data for both text and sensor modalities, followed by the pre-processing steps applied to prepare the data for KE. We then detail the specific techniques used for MM representation learning, LLM-guided KE, and the final construction and population of the KG. The codebase is publicly available.¹

¹See: https://github.com/sukanyamandal/LLMasMMKG Code

Figure 2: System Architecture adopted for this implementation

5.1 Module 1 – Synthetic Data Generation

This step represents "Module 1: Heterogeneous Data Sources and Data Processing" of the system architecture as described in section 4.1 - see figure 1. Recognizing the limitations of real-world data in terms of sparsity and coverage, we generate two modalities of synthetic data - text and sensors. This synthetic data is designed to be representative of the above mentioned target domains of a SC and augments real-world datasets to enhance the coverage and representativeness of the MMKG.

Text Data Generation - To create a rich and diverse corpus of synthetic text data for our KG, we leveraged the capabilities of the GPT-4 [Generative Pre-trained Transformer 4] (Achiam et al. 2023) language model (specifically, the gpt-4-turbo engine) via the OpenAI API. We focused on generating text that reflects the language and information typically found in the above mentioned four key SC domains, such as device descriptions, social media posts, forum interactions, patient descriptions, telehealth conversations, traffic reports, news articles, and weather reports.

For each domain, we developed a set of Python functions, each designed to generate a specific type of text. These functions utilized carefully crafted prompts to guide GPT-4 in producing text that aligned with the characteristics of each domain. We also adjusted parameters like max_tokens and temperature to fine-tune the generated text. The max_tokens parameter in GPT-4 controls the maximum length of the generated text, measured in tokens (roughly corresponding to words or subwords). We tailored the max_tokens value for each text generation function to ensure appropriate output lengths while maintaining coherence and relevance. The temperature parameter in GPT-4 controls the randomness of the generated text. A higher temperature (closer to 1) results in more creative and unpredictable output, while a lower temperature (closer to 0) produces more deterministic and focused text.
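A minimal sketch of such a generation function, written against the OpenAI Python SDK (v1-style client, assuming OPENAI_API_KEY is set); the helper name and default parameter values are illustrative, and the per-function prompts and settings in our codebase differ.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_text(prompt: str, max_tokens: int = 150, temperature: float = 0.7) -> str:
    """Generate one synthetic text snippet for a given prompt."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,    # caps the length of the generated text
        temperature=temperature,  # higher -> more varied, lower -> more focused
    )
    return response.choices[0].message.content.strip()

# Example: a smart home device description (prompt taken from the text below)
device_type = "thermostat"
print(generate_text(f"Describe a smart home {device_type} with innovative features"))
```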
Smart Home: To generate realistic smart home data, we utilized GPT-4 with specifically designed prompts and parameters. For example, the function generate_device_description(device_type) creates a short description of a smart home device (e.g., "thermostat", "security camera") using the prompt: "Describe a smart home {device_type} with innovative features".

We also created functions to simulate social media interactions. The generate_social_media_post(device_type) function generates a social media post about a specified device type (e.g., "smart lighting system"). It randomly selects a prompt from a predefined list, such as "Just got a new smart {device_type} and I'm loving it! #SmartHome", "My smart {device_type} is a game-changer! So convenient. #HomeAutomation", "Having some trouble setting up my smart {device_type}. Any tips? #TechHelp", ensuring diverse post content.

Finally, the generate_forum_interaction(device_type) function simulates forum interactions by generating a question and answer related to a given device type (e.g., "refrigerator"). The question prompt is "I'm having trouble connecting my smart {device_type} to my Wi-Fi. Any advice?". GPT-4 then uses this question to generate a plausible answer.

These functions were applied to a predefined list of device types, such as "thermostat", "security camera", "smart lighting system", and "refrigerator", resulting in a diverse dataset of smart home-related text.

Smart Healthcare: For the smart healthcare domain, we created functions to generate various types of medical text. The generate_patient_description(symptoms, medical_history) function generates a brief medical description of a patient based on provided symptoms (e.g., "fatigue", "shortness of breath") and medical history (e.g., "hypertension", "diabetes"). The prompt used for this task
is "Generate a brief medical description of a patient presenting with {symptoms}, including their medical history of {medical_history}".

To simulate telehealth interactions, we developed the generate_telehealth_conversation(initial_query) function. It uses a multi-turn conversation approach, where the initial query from the patient (e.g., "Hello, I'm experiencing chest pain.") is used to initiate the conversation. GPT-4 then generates responses based on the previous conversation turns and the patient's new queries as context, simulating a realistic chatbot interaction.

Similar to the smart home domain, we also created a generate_health_forum_interaction(topic) function for the healthcare domain. This function simulates a health forum interaction by generating a question and answer based on a given health-related topic (e.g., "managing stress", "healthy diet"). GPT-4 first generates a question related to the topic and then uses that question to generate a relevant answer.

Smart Transportation: In the smart transportation domain, we focused on generating text related to traffic conditions and commutes. The generate_traffic_report(highway, time) function creates a traffic report for a major highway (e.g., "Highway 101") at a specific time of day (e.g., "rush hour"). The prompt we used for this task is "Generate a traffic report for {highway} during {time}, including the cause of any congestion and expected delays".

To simulate social media discussions about commutes, we developed the generate_commute_post(mode, sentiment) function. This function generates a social media post about a commute experience, taking the mode of transportation (e.g., "driving", "public transit") and sentiment (e.g., "positive", "negative") as input. The prompt we used for this task is "Create a social media post about a {sentiment} experience during a {mode} commute".

The generate_public_transport_schedule(route, start, end, frequency) function generates a public transport schedule based on the provided route information, start and end locations, and desired frequency of departures. We used a structured prompt format to ensure that the generated schedule includes all the specified information in a clear and organized manner.

Smart Grid: For the smart grid domain, we generated text related to energy infrastructure, news, and weather conditions. The generate_grid_description(city) function describes the composition of a city's power grid (e.g., "Los Angeles"), including traditional and renewable energy sources. The prompt used for this task is "Describe the composition of {city}'s power grid, including traditional and renewable energy sources".

To simulate energy-related news articles, we created the generate_energy_news(event) function. This function takes an energy-related event or development (e.g., "government investment in grid upgrades", "new solar farm") as input and generates a news article about it using the prompt: "Write a news article about {event} and its impact on the energy sector".

Finally, the generate_weather_report(location, parameters) function generates a weather report for a specified location (e.g., "New York City") and includes the desired weather parameters (e.g., "temperature", "wind speed", "solar radiation"). We used a structured prompt format to ensure the generated report includes the specified parameters in a clear and concise manner.

Sensor Data Generation - To complement the textual data and provide a multi-modal representation of the SC, we generated synthetic sensor data for each domain. We used Python libraries like pandas, numpy, and random to simulate realistic sensor readings, employing techniques such as sine waves, random noise, and linear interpolation.

Smart Home: For the smart home domain, we simulated readings from temperature, humidity, energy consumption, and door/window status sensors. Temperature and humidity data were generated using sine waves to mimic seasonal variations, with added random noise for realism. Energy consumption data followed a similar approach, with a sine wave simulating daily peaks and additional randomness to reflect household variations. Door/window status data was generated by randomly assigning "closed" (False) or "open" (True) states, with a higher probability of "closed" to reflect typical usage patterns.
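The sketch below illustrates this style of simulation for one year of hourly temperature readings, combining a seasonal sine wave with Gaussian noise; the baseline, amplitude, and noise level are illustrative values, not the exact constants used in our codebase.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# One year of hourly timestamps
timestamps = pd.date_range("2023-01-01", periods=365 * 24, freq="h")
hours = np.arange(len(timestamps))

# Seasonal sine wave (one cycle per year) around a 21 degree Celsius baseline,
# plus Gaussian noise to mimic sensor variation.
seasonal = 5.0 * np.sin(2 * np.pi * hours / (365 * 24))
temperature = 21.0 + seasonal + rng.normal(0, 0.5, size=len(hours))

smart_home_df = pd.DataFrame({"timestamp": timestamps, "temperature_c": temperature})
print(smart_home_df.head())
```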
Smart Healthcare: In the smart healthcare domain, we focused on heart rate and sleep hours data. Heart rate data was generated hourly, incorporating daily variations using a sine wave and adding random noise to simulate individual differences. Sleep hours data was generated daily, with random values around a typical sleep duration of 7-8 hours.

Smart Transportation: For smart transportation, we simulated GPS location (latitude and longitude) and speed data. GPS location data was generated using linear interpolation between two predefined locations (Los Angeles and San Francisco), with added random deviations to simulate realistic movement patterns. Speed data was generated randomly within a reasonable range (40-120 km/h).

Smart Grid: The smart grid domain included simulated data for energy consumption, solar generation, and wind generation. Energy consumption data was generated hourly, incorporating daily peaks and seasonal variations using sine waves and random noise. Solar generation data was generated only during daylight hours, taking into account seasonal variations in sunlight duration and intensity. Wind generation data was simulated to be more variable and less predictable, using random values and a sine wave to represent general patterns.

Data Volume and Time Range: We aimed to generate approximately 10,000 data points per domain, covering a time range of one year (365 days). The frequency of data points varies depending on the sensor type (hourly, daily, every 10 minutes).

Data Pre-processing: Before KE, all data generated undergoes data cleaning, data normalisation and textual data pre-processing as outlined previously.
5.2 Module 2 – MMRL
This module transforms the preprocessed data from diverse
sources and modalities into a unified representation that
captures semantic relationships across different data types,
preparing it for effective KE by the LLM.
In this implementation, we employ sentence embedding
models, specifically Sentence-BERT (Reimers and Gurevych
2019) [all-mpnet-base-v2 model (Transformers 2024)], to
generate dense vector representations for each text data point.
These embeddings capture the semantic meaning of the text,
allowing for comparisons and linking based on semantic
similarity. Sensor data is integrated directly into the KG, with
relationships to relevant text data points established through
the semantic similarity scores between their corresponding
text descriptions and the sentence embeddings of the text
data. This approach enables the fusion of textual and sensor
information within a unified semantic space.
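A minimal sketch of this embedding step, assuming the sentence-transformers package; the model name matches the one cited above, while the example texts and variable names are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

text_data_points = [
    "My EcoTemp thermostat keeps disconnecting from Wi-Fi",
    "Room temperature readings spiked during the afternoon",
    "Traffic on Highway 101 is backed up near the downtown exit",
]

# Dense vector representation for each text data point
embeddings = model.encode(text_data_points, convert_to_tensor=True)

# Pairwise cosine similarities used later for semantic linking
similarity_matrix = util.cos_sim(embeddings, embeddings)
print(similarity_matrix)
```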
5.3 Module 3 – LLM Guided KE

This module leverages the capabilities of LLMs to extract meaningful knowledge from the preprocessed synthetic data. In this context, "knowledge" refers to structured information about entities and their relationships within the SC domain. This knowledge is represented as facts, or triples, consisting of a subject (an entity), a predicate (the relationship), and an object (another entity or a value). For example, the fact ("EcoTemp Thermostat", "controls", "Room Temperature") represents the knowledge that a specific thermostat device controls the temperature of a room. The ability to extract and represent such knowledge in a structured format is crucial for building a comprehensive and queryable KG that can support reasoning and decision-making in a SC context.

Entity Recognition: This module employs a fine-tuned BERT model (Devlin et al. 2018) [dbmdz/bert-large-cased-finetuned-conll03-english (dbmdz 2024)] to identify relevant entities within the text data. For example, given the input "My EcoTemp thermostat keeps disconnecting from Wi-Fi" from a smart home scenario, this module would output "EcoTemp thermostat" and "Wi-Fi" as identified entities.
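This step can be sketched with the Hugging Face transformers pipeline and the checkpoint named above; the aggregation setting and post-processing shown here are illustrative choices, and the extracted spans may differ from the idealized example in the text.

```python
from transformers import pipeline

ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

text = "My EcoTemp thermostat keeps disconnecting from Wi-Fi"
for entity in ner(text):
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))
```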
Relationship Extraction: This module utilizes GPT-4 to infer relationships between entities identified in the previous step. The model is prompted with the input text and the identified entities, and is tasked with extracting relationships in the form of (subject, predicate, object), focusing on relationships relevant to SCs (e.g., "controls", "measures", "affects", "located at", "experiences"). For instance, given the input "John, a 65-year-old male, presented with complaints of fatigue and shortness of breath. He has a history of hypertension" from a smart healthcare scenario, this module would output relationships such as (John, experiences, fatigue), (John, experiences, shortness of breath), and (John, has history of, hypertension).
shortness of breath), and (John, has history of, hypertension).
In this context, ”knowledge” refers to structured information
about entities and their relationships within the SC domain. Semantic Schema Mapping: This task, as part of the LLM-
This knowledge is represented as facts, or triples, consisting guided KE process, focuses on establishing links between
of a subject (an entity), a predicate (the relationship), and text data points based on their semantic similarity. It leverages

216
the sentence embeddings generated in section 5.2. For each pair of text data points, their corresponding sentence embeddings are compared using cosine similarity. If the similarity score exceeds a predefined threshold (0.6 in this implementation), a relatedTo relationship is established between them in the KG - while this value has not been rigorously optimized for our specific dataset, it represents a balance between capturing meaningful semantic connections and avoiding an excessive number of spurious links, based on general practice in semantic similarity tasks. Future work will focus on empirically determining the optimal threshold through a systematic evaluation of precision and recall across different threshold values, tailored to the characteristics of our smart city data (Resnik 1999). This approach allows the system to identify connections between text data points based on the underlying concepts they express, even if they don't share explicit keywords or phrases. For example, given "Room Temperature" (from smart thermostat data) and "Ambient Temperature" (from energy consumption data) as input, the high similarity score between their sentence embeddings would lead to the establishment of a relatedTo relationship between these two concepts in the KG.
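A condensed sketch of this linking step; the 0.6 threshold and the relatedTo property follow the description above, while the namespace URI, identifiers, and example texts are illustrative.

```python
from itertools import combinations

from rdflib import Graph, Namespace
from sentence_transformers import SentenceTransformer, util

SMART_CITY = Namespace("http://example.org/smart_city/")  # illustrative URI
model = SentenceTransformer("all-mpnet-base-v2")
THRESHOLD = 0.6

texts = {
    "obs_home_001": "Room Temperature",
    "obs_grid_042": "Ambient Temperature",
    "obs_traffic_7": "Congestion on Highway 101 during rush hour",
}

graph = Graph()
embeddings = {key: model.encode(text, convert_to_tensor=True) for key, text in texts.items()}

# Link every pair of text data points whose cosine similarity exceeds the threshold.
for (key_a, emb_a), (key_b, emb_b) in combinations(embeddings.items(), 2):
    score = util.cos_sim(emb_a, emb_b).item()
    if score > THRESHOLD:
        graph.add((SMART_CITY[key_a], SMART_CITY.relatedTo, SMART_CITY[key_b]))

print(graph.serialize(format="turtle"))
```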
5.4 Module 4 – KG Construction and Population

KR: The extracted knowledge is formally represented as a KG using RDF. RDF triples (subject, predicate, object) are used to express entities and relationships within the graph. In the codebase we have defined custom namespaces (e.g., ONTOLOGY, SMART_CITY, SMART_HOME, etc.; see figure 9) to organize and distinguish concepts within the KG. A snapshot of the final RDF representation of the KG is shown in figure 7.

Figure 7: A snapshot of the final RDF

Ontology Design: A hierarchical ontology is designed to structure the KG, defining classes and properties for each domain to ensure consistency and facilitate interoperability. The ontology is created programmatically using the rdflib library in Python. The code is responsible for defining classes for each data type (e.g., "SmartHomeObservation", "Device", "SensorReading") and properties to represent relationships and attributes (e.g., "relatedTo", "hasSensorReading"). Data properties for sensor readings are also added, specifying their data types (e.g., "temperature" as XSD.float). The ontology is saved in RDF/XML format (smart_city_ontology.rdf). Figures 3, 4, 5, and 6 (as viewed on Webprotege (Tudorache, Vendetti, and Noy 2008)) illustrate the structure of the ontology.

Figure 3: Classes and Subclasses of the Ontology
Figure 4: Annotation Properties of the Ontology
Figure 5: Data Properties of the Ontology
Figure 6: Object Properties of the Ontology
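A compact sketch of how such an ontology can be declared with rdflib; the class and property names follow the text above, while the namespace URI and the specific subclass link are illustrative.

```python
from rdflib import Graph, Namespace, RDF, RDFS, OWL, XSD

ONTOLOGY = Namespace("http://example.org/smart_city_ontology/")  # illustrative URI

g = Graph()
g.bind("onto", ONTOLOGY)

# Classes for each data type; the subclass link is an illustrative example of the hierarchy
g.add((ONTOLOGY.SmartHomeObservation, RDF.type, OWL.Class))
g.add((ONTOLOGY.Device, RDF.type, OWL.Class))
g.add((ONTOLOGY.SensorReading, RDF.type, OWL.Class))
g.add((ONTOLOGY.SensorReading, RDFS.subClassOf, ONTOLOGY.SmartHomeObservation))

# Object properties linking related concepts
g.add((ONTOLOGY.relatedTo, RDF.type, OWL.ObjectProperty))
g.add((ONTOLOGY.hasSensorReading, RDF.type, OWL.ObjectProperty))

# Data property for numeric sensor values
g.add((ONTOLOGY.temperature, RDF.type, OWL.DatatypeProperty))
g.add((ONTOLOGY.temperature, RDFS.range, XSD.float))

g.serialize(destination="smart_city_ontology.rdf", format="xml")
```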
Synthetic Data Incorporation: After pre-processing, multimodal representation learning (Section 5.2), and LLM-guided KE (Section 5.3), the synthetic data generated using the techniques described in Section 5.1 is integrated into the KG to enhance its completeness and address potential data sparsity issues. The add_sensor_data_to_graph and add_text_data_to_graph functions in the code handle the integration of sensor and text data into the KG, aligning it with the defined ontology. These functions iterate through the sensor data, creating observation URIs based on timestamps, and add RDF triples to represent the observation type, timestamp, and sensor readings.

KG Population: The extracted entities, relationships, and attributes are populated into the KG. Each data point is represented as a node (entity) or edge (relationship) in the graph, adhering to the defined ontology. See figure 8 (as viewed on Webprotege).

Figure 8: A snapshot of the KG containing data

5.5 Key Contribution in This Implementation

This implementation showcases a practical approach for leveraging LLMs to construct MMKGs for representing SCs. It directly addresses how LLMs can be utilized to build MMKGs that effectively capture the intricate interplay within a SC CDT framework. This work focuses specifically on the creation of synthetic MMKGs covering the SC domains of smart homes, smart healthcare, smart transportation, and smart grids. The key contributions are aligned as follows:
Fusion of Heterogeneous Data Sources: The implementation successfully integrates data from diverse sources, including synthetically generated text, simulated sensor readings, and predefined ontologies (see Figure 9). This fusion of heterogeneous data into a unified KG enables cross-modal analysis and reasoning, providing a more holistic view of the SC.

Figure 9: Namespaces for Ontology and Knowledge Graph

Automated KE from MM Data: The implementation demonstrates the use of LLMs for automatically extracting entities and relationships from both textual and sensor data. A fine-tuned BERT model is used for entity recognition, GPT-4 for relationship extraction, and Sentence-BERT for semantic linking based on similarity. This automated approach significantly reduces the manual effort required for KG construction.

Construction of MMKGs: The implementation demonstrates the successful construction of a KG that integrates both structured data (sensor readings) and unstructured data (text descriptions, social media posts) within a unified graph representation using the workflow pipeline mentioned in figure 2, showcasing its MM capabilities.

Workflow Example with Synthetic Data: This implementation provides a detailed workflow example using synthetically generated data to demonstrate the effectiveness of our LLM-driven MMKG construction approach. The process of generating synthetic text and sensor data, extracting knowledge using LLMs, and populating the KG is thoroughly documented, offering a practical illustration of how our methodology can be applied. This example serves as a valuable resource for researchers and practitioners seeking to implement similar approaches for building KGs for SC applications.

Reusable and Scalable Pipeline: The implementation introduces a scalable and reusable pipeline for LLM-assisted MMKG construction. This pipeline can be easily adapted for future enhancements, such as incorporating additional data modalities (e.g., images, videos) or integrating more advanced LLM architectures. The modular design of the pipeline allows for flexibility and extensibility, making it a valuable tool for ongoing research and development in the field of SC CDTs.

Foundation for a CDT: The implementation methodology, and further the KG obtained from it, serves as a foundational input data element for the SC CDT framework outlined in (Mandal 2024). The KG, by representing entities, relationships, and events within the SC, provides a structured and semantically rich knowledge base that can be used for simulating, predicting, and understanding complex interactions within the urban environment. This directly contributes to the development of a CDT by providing a comprehensive and interconnected KR that captures the multifaceted nature of a SC.

6 Future Work

This research lays the groundwork for a future where LLMs power the creation of rich, interconnected MMKGs for SC CDT development. While our work demonstrates the feasibility of this vision, realizing its full potential requires further exploration of advanced techniques and addressing the practical challenges of real-world urban data. The following avenues for future work will build upon this foundation, exploring more advanced techniques and addressing the practical challenges of applying LLMs to large-scale, heterogeneous urban data.

6.1 Enhancing LLM-Based KE

Entity Linking: Representing data points as nodes in the graph enables the future application of LLMs, which could be used to link entities across different data sources, identifying instances where the same entity is referred to using different names or descriptions. For instance, an LLM could be used to recognize that "John" in the smart home data refers to the same entity as "Patient X" in the healthcare data.

Semantic Similarity-Based Linking: While the implementation demonstrates a basic form of relationship extraction, it could be further extended to extract more specific relationship types (e.g., "controls", "measures", "affects") by employing LLMs trained on domain-specific datasets with annotated relationships.

Data Transformation and Integration: More advanced LLM applications could be used to translate between modalities, such as generating textual descriptions of sensor data patterns or creating synthetic sensor readings based on textual descriptions.

Schema Alignment and Mapping: In a more complex scenario, LLMs could assist in aligning different schemas by identifying semantically equivalent concepts across data sources. For example, an LLM could recognize that "Room Temperature" in the smart home data corresponds to "Ambient Temperature" in the healthcare data.

Comparative Analysis of LLMs: Future work will involve a comparative analysis of different LLM architectures, including GPT-4 and other advanced models, to assess their effectiveness in various KE tasks. This analysis will focus on evaluating the accuracy, efficiency, and robustness of different LLMs in handling the specific challenges of SC data.

Human-in-the-Loop Verification: To enhance the reliability and trustworthiness of the generated KG, we plan to incorporate a human-in-the-loop approach for verifying LLM-extracted knowledge. This will involve developing interactive interfaces that allow domain experts to review, validate, and refine the extracted entities, relationships, and semantic links.
Parameter Optimization and Sensitivity Analysis: Further research will focus on optimizing the parameters used in the LLM-based KE process, including the parameters for GPT-4 and the similarity threshold for sentence embeddings. A sensitivity analysis will be conducted to assess the impact of these parameters on the quality and completeness of the generated KG.

Robustness to Noisy Data: Investigate how to make the LLM-based KE process more robust to noise, inconsistencies, and errors often present in real-world SC data.

Scalability and Generalization to Diverse City Contexts: Future work will focus on scaling our approach to handle the data volume and complexity of diverse city contexts. This includes exploring distributed processing, data partitioning, and hierarchical KG structures to manage larger datasets. Additionally, we will investigate methods for adapting the KG construction process to different city-specific ontologies, data schemas, and semantics, potentially leveraging domain adaptation techniques for LLMs.

Explainability and Trustworthiness: Investigate methods for making the LLM-based KE process more explainable and transparent, building trust in the generated KG and enabling human validation of the extracted knowledge.

Evaluating KG Quality: Future work will prioritize a comprehensive evaluation of the generated MMKG's quality. This will encompass quantitative assessments of its structural properties, semantic coherence, and utility for downstream tasks like question answering and inference, alongside qualitative assessments through expert evaluation and case studies. Additionally, we will benchmark our LLM-driven pipeline against existing KG construction methods to demonstrate its effectiveness and efficiency.

6.2 Expanding Data Modalities and Applications

Incorporating Additional Data Modalities: Expanding the KG to encompass a wider range of data sources beyond text is crucial for capturing a holistic view of the SC. This includes incorporating image data, video data, voice data, biometric data, facial and gesture data, geospatial data, real-time weather information, demographic data, and economic indicators.

Evaluating Performance in Diverse Applications: A crucial next step is to assess the impact of the constructed KG on a broader range of SC applications. This includes evaluating its utility in urban planning scenarios, resource optimization strategies, disaster response simulations, and other relevant use cases.

6.3 Reasoning and Inference

Enhancing Reasoning and Inference: Investigating the use of KG embedding techniques and reasoning mechanisms, such as graph neural networks (GNN) and rule-based inference engines, can unlock the full potential of the generated KG. This would enable more sophisticated querying, facilitate complex what-if analyses, and support advanced analytics and data-driven decision-making within the DT.

6.4 Synthetic Data and Real World Data

Refining Synthetic Data Generation: While the current techniques generate synthetic data with plausible characteristics, further refinement is crucial. This involves developing more sophisticated generative models that capture the subtle nuances, correlations, and potential anomalies present in real-world sensor data and rigorously evaluating the synthetic data against corresponding real-world datasets to identify and rectify any discrepancies or biases.

Combining Real-World and Synthetic Datasets: Exploring optimal strategies for combining real-world and synthetic datasets during the KG construction process is essential. This involves developing techniques to effectively weight and integrate data from different sources while mitigating potential biases introduced by synthetic data.

7 Conclusion

This research introduces a novel approach to constructing comprehensive and interconnected MMKGs for SC CDT development by harnessing the power of LLMs. Our methodology effectively addresses the challenges of multi-modal data fusion, automated KE, and data sparsity through synthetic data generation, paving the way for realizing the full potential of CDTs in optimizing urban environments. The use of LLMs enables the seamless integration of heterogeneous data sources, including text, sensor readings, and social media feeds, into a unified and semantically rich KR. This enhances data privacy in production CDTs by reducing reliance on sensitive real-world information during the CDT development phase.

While this work presents a significant step forward, it serves primarily as a proof-of-concept and a first step towards a broader research scope. Future research will focus on rigorous quantitative evaluation of the MMKG's performance, developing compelling real-world case studies, and expanding the KG to incorporate a wider range of data modalities. Additionally, we will explore more advanced LLM architectures, fine-tuning strategies, and reasoning mechanisms while refining synthetic data generation techniques. Critically, we will address the challenges of scalability, generalizability, and the ethical implications of using LLMs and synthetic data, including potential biases and their impact on automated decision-making. Pursuing these directions, we aim to advance the state-of-the-art in SC CDTs, empowering data-driven decisions and contributing to the development of more sustainable, efficient, and citizen-centric urban environments.

Acknowledgments

This publication has emanated from research conducted with the financial support of ML-Labs at Dublin City University under grant number 18/CRT/6183 and the Insight SFI Research Centre for Data Analytics under grant number SFI/12/RC/2289_P2.

References

Abburu, S.; Berre, A. J.; Jacoby, M.; Roman, D.; Stojanovic, L.; and Stojanovic, N. 2020. Cognitwin–hybrid and cognitive digital twins for the process industry. In 2020 IEEE International Conference on Engineering, Technology and Innovation (ICE/ITMC), 1–8. IEEE.
Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Anthopoulos, L. G. 2015. Understanding the smart city domain: A literature review. Transforming city governments for successful smart cities, 9–21.
Bonetto, R.; and Rossi, M. 2017. Smart grid for the smart city. Designing, Developing, and Facilitating Smart Cities: Urban Design to IoT Solutions, 241–263.
Chen, T.; Zhang, Y.; Qian, X.; and Li, J. 2022. A knowledge graph-based method for epidemic contact tracing in public transportation. Transportation Research Part C: Emerging Technologies, 137: 103587.
Chun, S.; Jin, X.; Seo, S.; Lee, K.-H.; Shin, Y.; and Lee, I. 2018. Knowledge graph modeling for semantic integration of energy services. In 2018 IEEE International Conference on Big Data and Smart Computing (BigComp), 732–735. IEEE.
Chun, S.; Jung, J.; Jin, X.; Seo, S.; and Lee, K.-H. 2020. Designing an integrated knowledge graph for smart energy services. The Journal of Supercomputing, 76: 8058–8085.
dbmdz. 2024. BERT Large Cased Fine-tuned on CoNLL03 English. Accessed: 2024-07-27.
De Bellis, A. 2023. Structuring the unstructured: an LLM-guided transition.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dsouza, A.; Tempelmeier, N.; and Demidova, E. 2021. Towards neural schema alignment for OpenStreetMap and knowledge graphs. In International Semantic Web Conference, 56–73. Springer.
Ehrlinger, L.; and Wöß, W. 2016. Towards a Definition of Knowledge Graphs.
Fanourakis, N.; Efthymiou, V.; Kotzinos, D.; and Christophides, V. 2023. Knowledge graph embedding methods for entity alignment: experimental review. Data Mining and Knowledge Discovery, 37(5): 2070–2137.
Gaur, A.; Scotney, B.; Parr, G.; and McClean, S. 2015. Smart city architecture and its applications based on IoT. Procedia Computer Science, 52: 1089–1094.
Gharaibeh, A.; Salahuddin, M. A.; Hussini, S. J.; Khreishah, A.; Khalil, I.; Guizani, M.; and Al-Fuqaha, A. 2017. Smart cities: A survey on data management, security, and enabling technologies. IEEE Communications Surveys & Tutorials, 19(4): 2456–2501.
Grangel-González, I. 2019. A knowledge graph based integration approach for industry 4.0. Ph.D. thesis, Universitäts- und Landesbibliothek Bonn.
Guo, L.; Zhang, Q.; Sun, Z.; Chen, M.; Hu, W.; and Chen, H. 2022. Understanding and improving knowledge graph embedding for entity alignment. In International Conference on Machine Learning, 8145–8156. PMLR.
Huang, Y. 2024. Leveraging large language models for enhanced NLP task performance through knowledge distillation and optimized training strategies. arXiv preprint arXiv:2402.09282.
Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; and Philip, S. Y. 2021. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Transactions on Neural Networks and Learning Systems, 33(2): 494–514.
Jin, H.; Zhang, Y.; Meng, D.; Wang, J.; and Tan, J. 2024. A comprehensive survey on process-oriented automatic text summarization with exploration of LLM-based methods. arXiv preprint arXiv:2403.02901.
Jokanović, V. 2020. Smart healthcare in smart cities. In Towards Smart World, 45–72. Chapman and Hall/CRC.
Karimi, R.; Farahzadi, L.; Sepasgozar, S.; Sargolzaei, S.; Sepasgozar, S. M. E.; Zareian, M.; and Nasrolahi, A. 2021. Smart built environment including smart home, smart building and smart city: definitions and applied technologies. Advances and Technologies in Building Construction and Structural Analysis, 179.
Khalimon, E. A.; Vikhodtseva, E. A.; and Obradović, V. 2020. Smart cities today and tomorrow–world experience. In Institute of Scientific Communications Conference, 1340–1347. Springer.
Lehtola, V. V.; Koeva, M.; Elberink, S. O.; Raposo, P.; Virtanen, J.-P.; Vahdatikhaki, F.; and Borsci, S. 2022. Digital twin of a city: Review of technology serving city needs. International Journal of Applied Earth Observation and Geoinformation, 114: 102915.
Liang, W.; De Meo, P.; Tang, Y.; and Zhu, J. 2024. A Survey of Multi-modal Knowledge Graphs: Technologies and Trends. ACM Computing Surveys.
Liu, X.; Mao, T.; Shi, Y.; and Ren, Y. 2024. Overview of Knowledge Reasoning for Knowledge Graph. Neurocomputing, 127571.
Mandal, S. 2024. A Privacy Preserving Federated Learning (PPFL) Based Cognitive Digital Twin (CDT) Framework for Smart Cities. Proceedings of the AAAI Conference on Artificial Intelligence, 38(21): 23399–23400.
Mandvikar, S. 2023. Augmenting intelligent document processing (IDP) workflows with contemporary large language models (LLMs). International Journal of Computer Trends and Technology, 71(10): 80–91.
Melluso, N.; Grangel-González, I.; and Fantoni, G. 2022. Enhancing industry 4.0 standards interoperability via knowledge graphs with natural language processing. Computers in Industry, 140: 103676.
Meyer, L.-P.; Stadler, C.; Frey, J.; Radtke, N.; Junghanns, K.; Meissner, R.; Dziwis, G.; Bulert, K.; and Martin, M. 2023. LLM-assisted knowledge graph engineering: Experiments with ChatGPT. In Working Conference on Artificial Intelligence Development for a Resilient and Sustainable Tomorrow, 103–115. Springer Fachmedien Wiesbaden, Wiesbaden.
Nijkamp, P.; and Kourtit, K. 2013. The “new urban Europe”: Global challenges and local responses in the urban century. European Planning Studies, 21(3): 291–315.

Oliveira, D.; Sahay, R.; and d’Aquin, M. 2019. Leveraging ontologies for knowledge graph schemas.
Ontoforce. 2024. The Significance of Ontology in Knowledge Graphs. Accessed: 2024-07-30.
Ontotext. 2024. What are ontologies? https://www.ontotext.com/knowledgehub/fundamentals/what-are-ontologies/. Accessed: 2024-05-13.
Pan, J. Z. 2009. Resource description framework. In Handbook on Ontologies, 71–90. Springer.
Peng, C.; Xia, F.; Naseriparsa, M.; and Osborne, F. 2023. Knowledge graphs: Opportunities and challenges. Artificial Intelligence Review, 56(11): 13071–13102.
Qian, Z.; Cebere, B.-C.; and van der Schaar, M. 2023. Synthcity: facilitating innovative use cases of synthetic data in different data modalities. arXiv preprint arXiv:2301.07573.
Reimers, N.; and Gurevych, I. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
Resnik, P. 1999. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11: 95–130.
Sánchez-Corcuera, R.; Nuñez-Marcos, A.; Sesma-Solance, J.; Bilbao-Jayo, A.; Mulero, R.; Zulaika, U.; Azkune, G.; and Almeida, A. 2019. Smart cities survey: Technologies, application domains and challenges for the cities of the future. International Journal of Distributed Sensor Networks, 15(6): 1550147719853984.
Schramm, S.; Wehner, C.; and Schmid, U. 2023. Comprehensible artificial intelligence on knowledge graphs: A survey. Journal of Web Semantics, 79: 100806.
Shu, D.; Chen, T.; Jin, M.; Zhang, Y.; Du, M.; and Zhang, Y. 2024. Knowledge Graph Large Language Model (KG-LLM) for Link Prediction. arXiv preprint arXiv:2403.07311.
Silva, M. C.; Faria, D.; and Pesquita, C. 2022. Matching multiple ontologies to build a knowledge graph for personalized medicine. In European Semantic Web Conference, 461–477. Springer.
Tao, Y.; Wang, Y.; and Bai, L. 2024. Graphical Reasoning: LLM-based Semi-Open Relation Extraction. arXiv preprint arXiv:2405.00216.
Transformers, S. 2024. Pretrained Models. Accessed: 2024-07-20.
Tudorache, T.; Vendetti, J.; and Noy, N. F. 2008. WebProtege: A Lightweight OWL Ontology Editor for the Web. In OWLED, volume 432, 2009.
Tupayachi, J.; Xu, H.; Omitaomu, O. A.; Camur, M. C.; Sharmin, A.; and Li, X. 2024. Towards Next-Generation Urban Decision Support Systems through AI-Powered Generation of Scientific Ontology using Large Language Models–A Case in Optimizing Intermodal Freight Transportation. arXiv preprint arXiv:2405.19255.
Wang, S.; Sun, X.; Li, X.; Ouyang, R.; Wu, F.; Zhang, T.; Li, J.; and Wang, G. 2023. GPT-NER: Named entity recognition via large language models. arXiv preprint arXiv:2304.10428.
Wei, Y.; Huang, Q.; Kwok, J. T.; and Zhang, Y. 2024. KICGPT: Large Language Model with Knowledge in Context for Knowledge Graph Completion. arXiv preprint arXiv:2402.02389.
Xiang, Y.; Zhang, Z.; Chen, J.; Chen, X.; Lin, Z.; and Zheng, Y. 2021. OntoEA: Ontology-guided entity alignment via joint knowledge graph embedding. arXiv preprint arXiv:2105.07688.
Xu, D.; Zhang, Z.; Lin, Z.; Wu, X.; Zhu, Z.; Xu, T.; Zhao, X.; Zheng, Y.; and Chen, E. 2024a. Multi-perspective Improvement of Knowledge Graph Completion with Large Language Models. arXiv preprint arXiv:2403.01972.
Xu, H.; Omitaomu, F.; Sabri, S.; Li, X.; and Song, Y. 2024b. Leveraging Generative AI for Smart City Digital Twins: A Survey on the Autonomous Generation of Data, Scenarios, 3D City Models, and Urban Designs. arXiv preprint arXiv:2405.19464.
Xu, Y.; He, S.; Chen, J.; Wang, Z.; Song, Y.; Tong, H.; Liu, K.; and Zhao, J. 2024c. Generate-on-Graph: Treat LLM as both Agent and KG in Incomplete Knowledge Graph Question Answering. arXiv preprint arXiv:2404.14741.
Zhang, Q.; Ma, Z.; Zhang, P.; and Jenelius, E. 2023. Mobility knowledge graph: review and its application in public transport. Transportation, 1–27.
Zhang, S.-F.; Zhai, J.-H.; Xie, B.-J.; Zhan, Y.; and Wang, X. 2019. Multimodal representation learning: Advances, trends and challenges. In 2019 International Conference on Machine Learning and Cybernetics (ICMLC), 1–6. IEEE.
Zheng, X.; Lu, J.; and Kiritsis, D. 2022. The emergence of cognitive digital twin: vision, challenges and opportunities. International Journal of Production Research, 60(24): 7610–7632.
