Broca's aphasia is a type of aphasia characterized by non-fluent, effortful, and fragmented speech production with relatively good comprehension. Since traditional aphasia treatment methods are often time-consuming, labour-intensive, and do not reflect real-world conversations, applying natural language processing approaches such as Large Language Models (LLMs) could potentially contribute to improving existing treatment approaches. To this end, we explore the use of sequence-to-sequence LLMs for completing fragmented Broca's aphasic sentences. We first generate synthetic Broca's aphasic data using a rule-based system designed to mirror the linguistic characteristics of Broca's aphasic speech. Using this synthetic data, we then fine-tune four pre-trained LLMs on the task of completing fragmented sentences. We evaluate our fine-tuned models on both synthetic and authentic Broca's aphasic data. We demonstrate the LLMs' capability for reconstructing fragmented sentences, with the models showing improved performance on longer input utterances. Our results highlight the LLMs' potential in advancing communication aids for individuals with Broca's aphasia and possibly other clinical populations.
The jobscripts folder contains all the jobscripts (including two helpers) needed to replicate the experiments on the Hábrók server. Note that a virtual environment should be created beforehand.
For installing the dependencies, execute the following command:
pip install -r requirements.txt
The code targets Python 3.10 and 3.11.
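As a minimal sketch of the environment setup mentioned above (the directory name .venv is an arbitrary choice, and python3 is assumed to be on your PATH):

```shell
# Create and activate a fresh virtual environment (Python 3.10 or 3.11)
# before running the pip install command above.
python3 -m venv .venv
source .venv/bin/activate
python --version
```

After activation, pip install -r requirements.txt installs the dependencies into this environment rather than system-wide.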
Note that the data setup scripts require CHA files from AphasiaBank and SBCSAE. Therefore, first retrieve those files and store them in the expected locations; see helper_preprocessing.sh for the paths.
We created a helper for the setup and pre-processing steps:
jobscripts/helper_preprocessing.sh
The helper first executes the data setup files, converting the raw CHA files into workable dataframes, and then runs the pre-processing files over these dataframes.
Similar to the data setup and pre-processing, we created a helper for generating synthetic sentences and assessing their quality automatically.
jobscripts/helper_data_quality.sh
The helper generates synthetic sentences using the SBCSAE corpus and reproduces the data evaluation as shown in Table 3 in the paper. See the corresponding bash scripts for more information such as the data paths.
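To give an intuition for the rule-based generation, the toy sketch below fragments a sentence by dropping common function words, which Broca's aphasic speech tends to omit. This is an illustration only; the word list and the function name are hypothetical, and the repo's actual rule set (run by helper_data_quality.sh) is more elaborate.

```python
# Toy illustration of rule-based fragmentation: drop common function words
# (articles, auxiliaries, prepositions). Not the repo's actual rule set.
FUNCTION_WORDS = {"the", "a", "an", "is", "are", "was", "were", "to", "of", "in", "on"}

def fragment(sentence: str) -> str:
    """Return a fragmented version of `sentence` with function words removed."""
    kept = [w for w in sentence.split() if w.lower() not in FUNCTION_WORDS]
    return " ".join(kept)

print(fragment("the dog is running in the park"))  # -> dog running park
```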
Before we can fine-tune the sentence completion models, we need to create the data splits:
jobscripts/finetuning_splits.sh 31-10-2024
The splits can be found in the data folder.
Next, we fine-tune the sentence completion models, let them generate completions for the test set, and evaluate their performance using our fine-tune script:
jobscripts/fine_tune.sh SBCSAE
Run python fine_tune_t5.py --help for more information about its parameters. For convenience, the generated completions are stored in the experiment folder.
To gain more insights into the ChrF and Cosine similarity scores for each model, run the following command:
jobscripts/analyse_comp.sh
The bash script provides descriptive statistics about the completions by each model, including the standard error, effectively recreating Table 4 in the paper.
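The kind of descriptive statistics computed here can be sketched in a few lines of pure Python; the score list below is hypothetical and merely stands in for the per-sentence ChrF or cosine similarity values produced by the pipeline.

```python
from math import sqrt
from statistics import mean, stdev

def describe(scores):
    """Mean and standard error (sample stdev / sqrt(n)) of a list of scores."""
    return mean(scores), stdev(scores) / sqrt(len(scores))

chrf_scores = [0.62, 0.58, 0.71, 0.65]  # hypothetical per-sentence ChrF scores
m, se = describe(chrf_scores)
print(f"mean={m:.3f} se={se:.3f}")  # -> mean=0.640 se=0.027
```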
The generated completions for the authentic Broca's aphasic sentences can be reproduced using the authentic completion script:
jobscripts/auth_comp.sh
See the bash script and run python authentic_completion.py --help to reuse the code with different input sentences.
Please find the generated completions for the authentic input in the experiment folder as well.