objective is to transfer HAR models from the source users to target users with realistic adoption overhead.

We find existing works perform poorly in the HAR scenario envisioned in Figure 1. Conventional supervised learning models [15, 25, 27, 64] assume the collected labeled dataset is general and thus suffer from severe performance degradation in practice. Recent self-supervised learning works [11, 12, 40, 52, 62], including those aiming at building foundation models for IMU sensing, e.g., TPN [40] and LIMU-BERT [62], however, may still overlook the data heterogeneity and overfit to specific user domains. Some domain adaptation works [4, 17, 37, 65] consider certain aspects of diversity but require fully labeled data from source users, and they still underperform when source domain labels are limited. We notice that most existing efforts focus on directly learning common features among raw data, with the implied assumption that data across different domains already share similar distributions. However, this assumption does not hold when the sensor data collected from different user groups are highly heterogeneous. As a result, most existing approaches fail to achieve satisfactory performance in practical adoptions at scale.

This paper explores the data augmentation perspective to combat data heterogeneity by incorporating physical knowledge. Most existing IMU data augmentation approaches are directly borrowed from other application domains (e.g., image or text processing [45, 46, 60]) without considering and exploiting the physics of inertial sensing, which can lead to harmful results when improperly adopted. We thoroughly study a variety of IMU data augmentation methods and classify them into three categories based on their relations with the underlying physical processes: complete, which fully aligns with physics; approximate, which captures the underlying physics but with approximate formulations; and flaky, which is not supported by the physical process and may undermine the data distribution. Data augmentation with physical priors does not introduce extra labeling overhead and generalizes data distributions. We refer to this technique as Physics-Informed Data Augmentation, as opposed to conventional data-plane approaches that disregard the underlying physical processes.

By applying the carefully designed data augmentation approaches, this paper presents UniHAR, a universal HAR framework that extracts generalizable activity-related representations from heterogeneous IMU data. UniHAR comprises two stages as shown in Figure 1: i) self-supervised learning for feature extraction with massive unlabeled data from all users, and ii) supervised training for activity recognition with limited labeled data from the source users. Catering to the nature of different augmentation methods, UniHAR only applies complete data augmentation during the feature extraction stage to align data distributions from various user groups. On the other hand, both complete and approximate data augmentations are applied during the supervised training stage to increase data diversity for better generalization.

In practical applications, UniHAR is a configurable framework that can adapt to two scenarios, i.e., the data-decentralized and data-centralized scenarios. In the data-decentralized scenario where raw data transmission is not encouraged, as illustrated in Figure 1, UniHAR integrates self-supervised and federated learning techniques to train a generalized feature extraction model. UniHAR then constructs an activity recognition model using limited but augmented labeled data. The recognition model is distributed to all users for activity inference without additional training. In the data-centralized scenario, where raw data transmissions from target users are possible, UniHAR can further leverage adversarial training techniques for improved performance.

For experiment evaluation, different from previous works [11, 12, 15, 27, 33, 40, 47, 52, 62–64], UniHAR is fully prototyped on the mobile platform. The client is deployable on Android devices and supports real-time model training and inference with locally collected data. We conduct extensive experiments with four open datasets by transferring models across datasets, i.e., the activity recognition models are trained with activity labels from only one dataset and then applied to the other three datasets without activity labels. To the best of our knowledge, such a level of heterogeneity involved in the experiment settings has not been investigated in existing studies. The results show UniHAR achieves an average HAR accuracy of 78.5%, as compared to <62% achieved by extending any existing solutions as alternatives. When raw data transmissions are allowed in the data-centralized scenario, UniHAR can achieve 82.5% average accuracy, as compared to <72% achieved with state-of-the-art solutions. The key contributions of this paper are summarized as follows:

• We consider a practical and challenging HAR scenario, where models trained from a small group of source users are adopted across massive target users with realistic adoption overhead.
• We present a thorough and comprehensive analysis of IMU data augmentation methodology and characterize physics-informed data augmentation based on the underlying physics of IMU sensing.
• We identify a novel approach that organically integrates different data augmentation methods into a self-supervised learning framework to address data heterogeneity.
• We fully prototype UniHAR on the standard mobile platform and evaluate its generalization with practical experiment settings across different datasets. The source codes are publicly available¹.

¹ https://dapowan.github.io/wands_unihar/
The virtual IMU technique [19, 20] aims at converting videos of human activity into virtual streams of IMU measurement to augment the training data, which follows legitimate physical processes. Virtual IMU, however, requires additional sensing information, including activity videos and the on-body position of the device, to reconstruct the physical states of devices and thus generate virtual IMU data. It cannot be generally applied when the additional camera sensing modality is not available.

3.2 Overview

Figure 4: UniHAR overview.

As depicted in Figure 4, UniHAR has two training stages:

■ Feature Extraction. All local unlabeled datasets are first augmented to align the distributions of heterogeneous data from various clients. To construct a generalized feature extractor (i.e., the encoder), the cloud server collaborates with all mobile clients to exploit massive augmented unlabeled data. The encoder and decoder are trained on clients individually, which learn the high-level features using self-supervised learning techniques. The cloud server combines local models and obtains a generalized model. In a nutshell, the whole process aims at solving the following problem:
including orientation 𝒒 and time interval ∆𝑡 remain the same. The acceleration of the transformed physical states is

𝑆_𝑎(𝒒′, 𝒍′, 𝒈′) = 𝒒′∗ ⊗ (𝒍′ + 𝒈′) ⊗ 𝒒′ = 𝒒∗ ⊗ ((𝒍 + 𝒈)/∥𝒈∥) ⊗ 𝒒 = 𝒂/∥𝒈∥ = 𝒂′.   (10)

The 𝐹(𝒂) only involves the known physical state 𝒈, so acceleration normalization is a complete data augmentation.

Local rotation. The placement diversity causes significant differences in the triaxial distributions of IMU data. To simulate the IMU data collected from different device orientations, this augmentation applies an extra rotation to the device and augments the orientation in the local frame by 𝒒′ = 𝐺(𝒒) = 𝒒 ⊗ ∆𝒒, where ∆𝒒 is a generated and known rotation. The observations of the transformed physical states are

𝑆_𝑎(𝒒′, 𝒍′, 𝒈′) = (𝒒 ⊗ ∆𝒒)∗ ⊗ (𝒍′ + 𝒈′) ⊗ (𝒒 ⊗ ∆𝒒) = ∆𝒒∗ ⊗ 𝒒∗ ⊗ (𝒍 + 𝒈) ⊗ 𝒒 ⊗ ∆𝒒 = ∆𝒒∗ ⊗ 𝒂 ⊗ ∆𝒒,   (11)

𝑆_𝜔(𝒒′, ∆𝑡′) = (2/∆𝑡) (𝒒_{𝒕−1} ⊗ ∆𝒒)∗ ⊗ (𝒒_𝒕 ⊗ ∆𝒒 − 𝒒_{𝒕−1} ⊗ ∆𝒒) = (2/∆𝑡) ∆𝒒∗ ⊗ 𝒒∗_{𝒕−1} ⊗ (𝒒_𝒕 − 𝒒_{𝒕−1}) ⊗ ∆𝒒 = ∆𝒒∗ ⊗ 𝝎 ⊗ ∆𝒒.   (12)

The 𝐹(·) can be designed with 𝒂′ = 𝐹(𝒂) = ∆𝒒∗ ⊗ 𝒂 ⊗ ∆𝒒 and 𝝎′ = 𝐹(𝝎) = ∆𝒒∗ ⊗ 𝝎 ⊗ ∆𝒒. The augmented observations can be derived from the original observations and the known ∆𝒒, so local rotation is a complete data augmentation. The local rotation significantly diversifies the triaxial distributions of the original readings and maintains other human motion information, e.g., the magnitude and the fluctuation pattern.

Dense sampling. Existing studies [5, 17, 40, 50, 62, 64, 65] simply divide IMU readings using low overlapping rates (e.g., zero or 50% overlapping) and as a result underutilize the data. To fully use the collected IMU data, higher overlapping rates with dense sampling may be adopted. The rationale is that most daily activities are periodic, which means any time can be viewed as the start of the motion. Dense sampling shifts observations along the temporal dimension by 𝒂′_𝒕 = 𝐹(𝒂) = 𝒂_{𝒕+𝒏}, 𝝎′_𝒕 = 𝐹(𝝎) = 𝝎_{𝒕+𝒏}, where 𝑛 is a random value. The augmented observations are partitioned with a fixed window and then put into HAR models for training, which can enlarge the number of training samples with existing sensor readings. Its physical embedding is shifting the physical states by 𝑛 accordingly, e.g., 𝒍′_𝒕 = 𝐺(𝒍) = 𝒍_{𝒕+𝒏}. The 𝐹(·) does not require unknown physical states, so dense sampling is also a complete data augmentation.

4.2.2 Approximate data augmentation. This type of data augmentation has its physical embedding, but the augmented observations depend on the approximation of original observations or known physical states.

Linear upsampling. IMU data are discrete signals and upsampling can enrich data samples. Linear upsampling interpolates physical states by 𝒒′_𝒕̃ = 𝐺(𝒒) = 𝛼𝒒_𝒕 + (1 − 𝛼)𝒒_{𝒕−1} and 𝒍′_𝒕̃ = 𝐺(𝒍) = 𝛼𝒍_𝒕 + (1 − 𝛼)𝒍_{𝒕−1}, where 𝛼 is a value within [0, 1] and 𝑡̃ = 𝛼𝑡 + (1 − 𝛼)(𝑡 − ∆𝑡) = 𝑡 − (1 − 𝛼)∆𝑡. The corresponding augmented observations are

𝑆_𝑎(𝒒′, 𝒍′, 𝒈′) = 𝒒′∗_𝒕̃ ⊗ (𝒍′_𝒕̃ + 𝒈) ⊗ 𝒒′_𝒕̃,   (13)

𝑆_𝜔(𝒒′, ∆𝑡′) = (2/∆𝑡) 𝒒′∗_{𝒕̃−1} ⊗ (𝒒_𝒕̃ − 𝒒_{𝒕̃−1}),   (14)

both involving the unknown physical states 𝒒 and 𝒍. Linear upsampling is an approximate data augmentation and its augmented observations can be approximated as

𝒂′ = 𝑆_𝑎(𝒒′, 𝒍′, 𝒈′) = 𝒒′∗_𝒕̃ ⊗ (𝛼𝒍_𝒕 + (1 − 𝛼)𝒍_{𝒕−1} + 𝒈) ⊗ 𝒒′_𝒕̃
  = 𝛼𝒒′∗_𝒕̃ ⊗ (𝒍_𝒕 + 𝒈) ⊗ 𝒒′_𝒕̃ + (1 − 𝛼)𝒒′∗_𝒕̃ ⊗ (𝒍_{𝒕−1} + 𝒈) ⊗ 𝒒′_𝒕̃
  ≈ 𝛼 𝒒∗_𝒕 ⊗ (𝒍_𝒕 + 𝒈) ⊗ 𝒒_𝒕 + (1 − 𝛼) 𝒒∗_{𝒕−1} ⊗ (𝒍_{𝒕−1} + 𝒈) ⊗ 𝒒_{𝒕−1}
  = 𝛼𝒂_𝒕 + (1 − 𝛼)𝒂_{𝒕−1},   (15)

𝝎′ = 𝑆_𝜔(𝒒′, ∆𝑡) = (2/∆𝑡) 𝒒′∗_{𝒕̃−1} ⊗ (𝛼(𝒒_𝒕 − 𝒒_{𝒕−1}) + (1 − 𝛼)(𝒒_{𝒕−1} − 𝒒_{𝒕−2}))
  ≈ (2𝛼/∆𝑡) 𝒒∗_{𝒕−1} ⊗ (𝒒_𝒕 − 𝒒_{𝒕−1}) + (2(1 − 𝛼)/∆𝑡) 𝒒∗_{𝒕−2} ⊗ (𝒒_{𝒕−1} − 𝒒_{𝒕−2})
  = 𝛼𝝎_𝒕 + (1 − 𝛼)𝝎_{𝒕−1},   (16)

where 𝒒′_𝒕̃ approximately equals 𝒒_{𝒕−1} or 𝒒_𝒕 if ∆𝑡 is small. Linear upsampling enlarges the size of the data but introduces approximation errors.

Time warping. The same type of activity may vary in duration across users due to distinct behavioral patterns. To mitigate the temporal divergence, time warping accelerates or decelerates changes of physical states in the temporal dimension, e.g., 𝒒′_𝒕 = 𝐺(𝒒) = 𝒒_{𝒌∗𝒕}, where 𝑘 is a scaling factor usually chosen within [0.8, 1.2]. The augmented observations are accordingly stretched in the temporal dimension, e.g., 𝒂′_𝒕 = 𝐹(𝒂) = 𝒂_{𝒌∗𝒕}. To facilitate such a transformation, time warping adopts linear upsampling to obtain continuous observations. Therefore, time warping is also an approximate data augmentation, which enhances temporal diversity but also introduces approximation errors.
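As a rough illustration (not the paper's released code), the following NumPy/SciPy sketch implements acceleration normalization (Eq. 10), local rotation (Eqs. 11–12), and the interpolation behind linear upsampling (Eqs. 15–16); the function names, the gravity constant, and the window shapes are our own assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

G = 9.81  # assumed gravity magnitude used for normalization

def acceleration_normalization(acc):
    # Complete augmentation (Eq. 10): scale accelerometer readings by 1/||g||
    # so that gravity has unit magnitude; gyroscope readings are unchanged.
    return acc / G

def local_rotation(acc, gyro, delta_q=None):
    # Complete augmentation (Eqs. 11-12): conjugate accelerometer and gyroscope
    # vectors by a known random rotation dq, i.e. a' = dq* ⊗ a ⊗ dq, which
    # simulates the same motion recorded under a different device orientation.
    if delta_q is None:
        delta_q = R.random()                      # the generated, known ∆q
    inv = delta_q.inv()                           # dq* ⊗ v ⊗ dq = rotate v by dq^-1
    return inv.apply(acc), inv.apply(gyro)

def linear_upsampling(acc, gyro, alpha=0.5):
    # Approximate augmentation (Eqs. 15-16): interpolate between consecutive
    # readings, a'_t ≈ α·a_t + (1-α)·a_{t-1}, yielding extra in-between samples.
    acc_mid = alpha * acc[1:] + (1 - alpha) * acc[:-1]
    gyro_mid = alpha * gyro[1:] + (1 - alpha) * gyro[:-1]
    return acc_mid, gyro_mid

# Example on one window of 120 readings x 3 axes (synthetic data).
acc = np.random.randn(120, 3) + np.array([0.0, 0.0, G])
gyro = np.random.randn(120, 3) * 0.5
acc_n = acceleration_normalization(acc)
acc_r, gyro_r = local_rotation(acc, gyro)
acc_u, gyro_u = linear_upsampling(acc, gyro)
```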
4.2.3 Flaky data augmentation. We find that many IMU data augmentation methods, although widely adopted in existing works [4, 36, 40, 52, 55, 57], do not have physical embeddings. For example, some data augmentations randomly negate observations [36, 40, 52] or reverse the observations along the temporal dimension [36, 40, 52]. The permutation [36, 40, 52, 55] slices the observation sequence within a temporal window and randomly swaps sliced segments to generate a new sequence. The shuffling [36, 40, 52] randomly rearranges the channels of sensor observations to change the triaxial distribution. These methods may not be associated with the underlying physical principles.

The jittering [36, 40, 52, 55], which adds additional random noise to the original observations, is a special flaky data augmentation. It aims at augmenting the sensor model by introducing sensor noise. But the applied noise distributions may not match the true distributions, which vary across different devices and are hard to determine [50].

Flaky data augmentations only operate on IMU observations and are not explainable with the underlying physical process. Adopting them may lead to unbounded errors in the generated data distributions.

4.3 Data Augmentation Adoption

UniHAR incorporates physics-informed data augmentation methods differently during the two stages of the framework based on their respective characteristics; Figure 6 explains the rationale.

During the feature extraction stage, although unlabeled data are abundant, they come from different users, devices, and environments and are thus subject to significant domain shift. The purpose of incorporating data augmentation is to generalize the data distributions and improve the inter-domain data representativeness (as illustrated in the top of Figure 6). On the other hand, it is challenging to control the data quality when approximation errors are introduced at scale. Therefore, UniHAR only applies complete data augmentations to unlabeled data in this stage.

During the activity recognition stage, labeled data from the source domain are utilized but they are scarce. In addition to aligning the data distributions across domains, the data augmentation is also expected to enrich the source domain labels and improve the intra-domain data representativeness.

Figure 6: Adoption of data augmentations.

5 UNIHAR ADOPTION

Putting UniHAR to practical adoption, we consider two application scenarios and make further optimizations for improved performance.

5.1 Data-decentralized Scenario

In the data-decentralized scenario, raw data transmission from the target users to the cloud server is assumed to be disallowed due to practical constraints, e.g., prohibitive processing and transmission overheads or privacy concerns.

5.1.1 Feature extraction. To extract effective features from local unlabeled datasets, UniHAR adopts self-supervised learning to train the encoder and decoder. There are several state-of-the-art self-supervised representation models [12, 40, 52, 62] for IMU data. However, many methods [12, 40, 52] are entangled with data augmentation and integrate flaky data augmentation methods, which we believe may harm the end performance. We identify LIMU-BERT [62] as an effective foundation model for IMU-based sensing and embed it into our design to build the representation model. LIMU-BERT is orthogonal to the data augmentation employed in UniHAR.

We employ acceleration normalization and local rotation for data augmentation, both being complete data augmentations. As shown in Figure 4, the encoder and decoder jointly predict the original values of the randomly masked IMU readings. Through the reconstruction task, the encoder learns the underlying relations among IMU data and extracts effective features. The Mean Square Error (MSE) loss is used to compute the differences between the original and predicted values, which is defined as follows:

ℓ_rec(𝑤; 𝑿) = (1/|𝑿|) Σ_{𝑖=1}^{|𝑿|} Σ_{𝑗∈𝑀^[𝑖]} MSE(𝑿̂^[𝑖]_{·𝑗} − 𝑿^[𝑖]_{·𝑗}),   (17)

where 𝑤 denotes the model weights of the encoder and decoder, 𝑀^[𝑖] represents the set of the position indices of masked readings for the 𝑖-th IMU sample 𝑿^[𝑖], and 𝑿̂^[𝑖] ∈ ℝ^{𝐹×𝑚} denotes the predicted data as shown in Figure 4.
50 Hz with a Samsung Galaxy S II carried on the waist.

■ MotionSense [31] dataset (abbreviated as Motion in our paper) adopted an iPhone 6s to gather accelerometer and gyroscope time-series data. 24 participants from the UK performed 6 activities (sitting, standing, walking, upstairs, downstairs, and jogging) with the device stored in their front pockets. All data were collected at a 50 Hz sampling rate.

■ Shoaib [44] et al. collected data of seven daily activities (sitting, standing, walking, walking upstairs, walking downstairs, jogging, and biking) in the Netherlands. The 10 male participants were equipped with five Samsung Galaxy SII (i9100) smartphones placed at five on-body positions (right pocket, left pocket, belt, upper arm, and wrist). The IMU readings were collected at 50 Hz.

To demonstrate the effectiveness of UniHAR across diverse datasets, we select four common activities (i.e., still, walk, walk upstairs, walk downstairs) contained in all four open datasets. For the still activity, we merge several similar activities, for example, sit and stand in the HHAR dataset, into still. In addition to the diversity of {user, device, placement, environment}, the merged dataset has general label definitions (i.e., still). The activity distributions of the four activities also differ across the four datasets.

7.1.2 Baseline models. We consider both data-decentralized and data-centralized scenarios, and compare UniHAR with relevant state-of-the-art solutions in each scenario. In the data-decentralized scenario, we extend the three most relevant approaches as baselines:
■ DCNN [63] designs a CNN-based HAR model that outperforms many traditional methods. It assumes labeled data are abundant and adequately representative.
■ TPN [40] learns features from unlabeled data by recognizing the applied data augmentations. It only requires limited data for training but with an implicit assumption that they are not biased.
■ LIMU-GRU [62] learns representations by a self-supervised autoencoder LIMU-BERT. It uses limited labeled data and assumes unbiased data distributions.
In the data-centralized scenario, we compare UniHAR with three existing unsupervised domain adaptation approaches.
■ HDCNN [17] handles domain shift by minimizing the Kullback-Leibler divergence between the fully-labeled source domain features and unlabeled target domain features.
■ FM [4] minimizes the feature distance across domains by maximum mean discrepancy [54]. It requires full supervision with adequate labeled data from the source domain.
■ XHAR [65] is an adversarial domain adaptation model and needs to select the source domain before adapting models to the unlabeled target domain.
SelfHAR [52] and ASTTL [37] are not compared because SelfHAR inherits the training scheme from TPN, which is already selected as one of the baselines, and the performance of ASTTL as originally reported in [37] is poor.

Table 2: Cross-dataset transfer setup.

Case  Source Domain (with labels)  Target Domain (without labels)
1     HHAR                         UCI, Motion, Shoaib
2     UCI                          HHAR, Motion, Shoaib
3     Motion                       HHAR, UCI, Shoaib
4     Shoaib                       HHAR, UCI, Motion

7.1.3 Cross-dataset evaluation. To demonstrate the generalizability of UniHAR, we design four cross-dataset evaluation cases and Table 2 indicates each of the cases; e.g., in case 1, UniHAR transfers models from the HHAR dataset to the other three datasets without activity labels. The clients in the source domain share a small portion of labeled data with the cloud server, while the clients in the target domain only contribute local unlabeled data. Each mobile client has a local dataset containing the IMU data collected from the same user, and our setup has a total of 73 clients. Each local dataset is partitioned into training (80%), validation (10%), and test (10%) sets. The training sets of all clients participate in the federated training process of the encoder and decoder. To train the recognizer, we randomly select a small portion of labeled samples (i.e., 1,000) from the training sets in the source domain. The validation sets are utilized to select models (e.g., the encoder and classifier) and the trained recognizers are evaluated on the test sets in the target domain.

7.1.4 Metrics. We compare the performance of HAR models with the average accuracy and F1-score of all users in the target domain, which are defined as 𝑎 = (1/|D_𝑡|) Σ_{𝑖∈D_𝑡} 𝑎_𝑖 and 𝑓 = (1/|D_𝑡|) Σ_{𝑖∈D_𝑡} 𝑓_𝑖, where 𝑎_𝑖 and 𝑓_𝑖 are the activity classification accuracy and F1-score of the 𝑖-th user, respectively.
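As a small illustration of these metrics, the per-user accuracy and F1-score can be computed with scikit-learn and then averaged over the target-domain users; treating the F1-score as the macro average over activity classes is our assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def target_domain_metrics(per_user_results):
    # per_user_results: list of (y_true, y_pred) pairs, one per target-domain user.
    # Returns the accuracy and F1-score averaged over users, as in Section 7.1.4.
    accs, f1s = [], []
    for y_true, y_pred in per_user_results:
        accs.append(accuracy_score(y_true, y_pred))
        f1s.append(f1_score(y_true, y_pred, average="macro"))  # assumed macro F1
    return float(np.mean(accs)), float(np.mean(f1s))
```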
Table 3: Performance comparison. (The two numbers in each cell are accuracy and F1-score.)
Figure 9: Accuracies with different numbers of labels.
Figure 10: Effect of data augmentations.
7.2 Overall Performance

Table 3 compares the performance of UniHAR and other baseline models in the data-decentralized and data-centralized scenarios. The two numbers in each cell denote average accuracy and F1-score, respectively. The row of UniHAR-A gives the performance of the UniHAR recognizer trained with centralized target user data. In the data-decentralized scenario, DCNN, TPN, and LIMU-GRU achieve poor performance mainly because they overlook the data diversity among the source and target domain users. In contrast, UniHAR achieves 78.5% average accuracy and 67.1% F1-score, which outperforms the best of the three baselines by at least 15%. For the data-centralized scenario, UniHAR-A also yields better accuracies and F1-scores when compared with HDCNN, FM, and XHAR in most cases. Although XHAR delivers the best performance for case 4, it is shown to be sensitive to the source domain dataset and falls much lower in the other cases. UniHAR-A, however, achieves accuracies consistently higher than 80.0% across all cases and on average outperforms the best of the three by 10% in terms of accuracy and 15% in terms of F1-score. In summary, the results demonstrate the outstanding performance of UniHAR(-A), credited to the effective data augmentations and feature extraction.

samples from the source domain. On the other hand, the models in the data-centralized scenario can achieve higher accuracies when more labeled data are employed. UniHAR-A consistently outperforms the other two models in the data-centralized scenario. UniHAR(-A) is able to achieve average accuracies of 71.0% and 72.1% in the two scenarios when only 50 labeled samples are used. The experiment results also suggest a more robust performance of UniHAR(-A).
Figure 11: Impact of data augmentation combinations. (a) Feature extraction. (b) Activity recognition.
Figure 12: Accuracies of pretraining approaches. (a) Data-decentralized. (b) Data-centralized.
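The augmentation combinations compared in Figure 11 can be pictured as a simple stage-to-augmentation mapping. The dictionary below is only an illustration that follows the adoption rule of Section 4.3 and the choices named in Section 5.1.1; the exact lists for the activity recognition stage are our assumption.

```python
# Hypothetical wiring of augmentations into UniHAR's two stages, following
# Section 4.3: only complete augmentations during self-supervised feature
# extraction; complete + approximate during supervised activity recognition;
# flaky augmentations (negation, reversion, permutation, shuffling, jittering)
# are excluded from both stages.
STAGE_AUGMENTATIONS = {
    "feature_extraction": [            # massive unlabeled data from all users
        "acceleration_normalization",  # complete (Eq. 10), per Section 5.1.1
        "local_rotation",              # complete (Eqs. 11-12), per Section 5.1.1
    ],
    "activity_recognition": [          # limited labeled data from source users
        "acceleration_normalization",  # complete (assumed)
        "local_rotation",              # complete (assumed)
        "dense_sampling",              # complete (assumed)
        "linear_upsampling",           # approximate (Eqs. 13-16, assumed)
        "time_warping",                # approximate (assumed)
    ],
}
```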
We then examine the effectiveness of the proposed way of integrating complete and approximate data augmentation in UniHAR. We compare three choices: i) using only complete data augmentation, ii) using both complete and approximate data augmentation, and iii) using all of them including flaky data augmentations, in both the feature extraction and activity recognition stages. Figure 11 plots the achieved accuracies in the data-decentralized scenario. Results show that if approximate data augmentation is adopted in the feature extraction stage, it may lead to an accuracy drop of 2.0% because of the approximation errors introduced to massive unlabeled data. On the other hand, applying approximate data augmentation in the activity recognition stage can further enrich data diversity and thus increase the accuracy by 1.9%. Flaky data augmentation, however, only introduces a negative impact in almost all cases when applied in either stage. The UniHAR performance drops by 5.7% on average when the jittering and permutation data augmentations are applied. Our results for the data-centralized scenario are similar (omitted due to page limits).

We also investigate the detailed performance gains when different data augmentation methods are adopted. The local rotation introduces the largest gains, i.e., 19.4% and 15.3% accuracy improvements in the two scenarios. The dense sampling and time warping together improve the average accuracies by 2.6% and 2.3% in the two scenarios, respectively.

7.5 Effect of Feature Extraction

We evaluate the effectiveness of feature extraction by comparing the performance of three training approaches, i.e., without any pretraining, with self-supervised learning (only available for the data-centralized scenario), and with both self-supervised and federated learning (UniHAR pretraining). Figure 12 plots the end performance of the different pretraining approaches, and the results show that the UniHAR training approach consistently achieves better performance in the data-decentralized scenario. For the data-centralized scenario, although the self-supervised training slightly outperforms the UniHAR training in case 4, it only achieves an accuracy of 71.2% in case 2. Because the encoder is sensitive to the number of samples [28], the recognizer with the biased encoder degrades significantly in case 2, where the source domain dataset UCI has the fewest samples. The experiment suggests that UniHAR(-A) can achieve more robust performance with federated training. The encoder and decoder are first initialized with the source domain dataset in the feature extraction process, which gains a 1.5% improvement in average accuracy. The possible reason is that mobile clients can better adapt a good initialized model to their local datasets [7].

7.6 Model Size and Latency

Table 4 compares UniHAR with the baseline models in terms of the number of parameters, model size, training time, and inference time. The models are optimized by lite PyTorch Mobile [35]. The training time is the time the server takes to train a mini-batch (64) of samples and the inference time is the execution time for inferring one IMU sample on a Samsung Galaxy S8. The two training times of UniHAR correspond to the models without and with the domain classifier. In summary, the model size of UniHAR is small and its training and inference time is comparable with others.

Table 4: Efficiency comparison.

Model      Para.   Size      Train. Time    Infer. Time
DCNN       17 K    76 KB     4.2 ms         0.8 ms
TPN        26 K    194 KB    7.6 ms         2.2 ms
LIMU-GRU   54 K    239 KB    8.1 ms         5.5 ms
HDCNN      28 K    118 KB    3.6 ms         0.8 ms
FM         50 K    203 KB    5.8 ms         2.9 ms
XHAR       700 K   2607 KB   17.0 ms        14.2 ms
UniHAR     15 K    78 KB     8.9, 17.6 ms   3.6 ms
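For context on how such models reach the handset, the sketch below shows the standard workflow for optimizing and exporting a scripted PyTorch model for the lite PyTorch Mobile runtime mentioned above; the toy network, window length, and file name are placeholders rather than the paper's artifacts.

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

# Placeholder recognizer: a tiny 1D-CNN over 6-channel IMU windows (acc + gyro).
model = torch.nn.Sequential(
    torch.nn.Conv1d(6, 32, kernel_size=5, padding=2),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool1d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(32, 4),            # 4 activities: still/walk/upstairs/downstairs
)
model.eval()

example = torch.randn(1, 6, 120)       # one IMU window: 6 channels x 120 readings
scripted = torch.jit.trace(model, example)
optimized = optimize_for_mobile(scripted)
optimized._save_for_lite_interpreter("har_recognizer.ptl")  # loadable on Android
```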
8 RELATED WORK

Wearable-based HAR systems [9, 10, 15, 17, 27, 37, 40, 64, 65] are ubiquitous and low-cost. Conventional HAR models [15, 43, 47, 51, 58, 63] adopt deep neural networks and achieve high performance with the help of sufficient well-annotated datasets. However, IMU data heterogeneity prevents them from achieving promising performance in practice.

Recent federated learning schemes [21–23, 34, 53] allow for distributed training without accessing raw data, but they require fully-labeled data at the target users and thus cannot be directly applied to the considered scenario.

Self-supervised learning works [14, 36, 40, 52, 57, 62] have shown effectiveness in extracting useful features from unlabeled data and thereby improving the performance of downstream HAR models. For example, the encoder models from TPN [40] and LIMU-BERT [62] may be viewed as early efforts in building "foundation" models to extract contextual features from unlabeled IMU data, with which task-specific models can achieve superior performance with limited labeled data. However, these models still require some labeled data to train HAR classifiers, which can be overfitted to specific domains and fail to achieve high performance for target users without any labeled data.

Unsupervised domain adaptation approaches have been introduced to HAR applications [17, 37, 65] to reduce the distribution divergence between different domains. Specifically, HDCNN [17] learns transferable features by minimizing the Kullback-Leibler divergence between the source and target domains. XHAR [65] extracts domain-independent features by adversarial training. Unfortunately, our experiments demonstrate that purely learning-based domain adaptation approaches fail to handle highly heterogeneous IMU data across domains and cannot achieve satisfactory performance in adapting models across different user groups.

Prior works [49, 55] devise a range of IMU data augmentation methods, e.g., random noising, to increase label size and prevent overfitting to specific domains. Recent studies [36, 40, 52, 57] have explored self-supervised learning with data augmentation techniques for leveraging unlabeled data. However, many flaky data augmentations have been adopted in those studies [36, 40, 49, 52, 55, 57], which may generate readings that do not conform to the physical sensing principles and undermine the data distributions.

Different from existing studies, UniHAR aims at building a general HAR framework, in which a representation model is first built with massive unlabeled data, and supervised training with limited labeled data is thereafter adopted to adapt the model across user domains. UniHAR specifically explores physics-informed data augmentation that aligns with the underlying physical process and constructively embeds it into different learning stages to improve both intra-domain and inter-domain data representativeness.

9 DISCUSSION

Impact of orientation representation. The core idea of physics-informed data augmentation is general, and the existence of a physical embedding is independent of the representation. While the quaternion is one of several possible ways to represent the device orientation, augmentation with a physical embedding can be expressed with other representations. For example, we may also represent the sensing models using rotation matrices: 𝒂 = 𝑹^{−1}(𝒍 + 𝒈), 𝝎 = 𝑓_𝑔^{−1}(𝑹_{𝑡−1}^{−1} 𝑹_𝑡)/∆𝑡, where 𝑹 is the rotation matrix representing the device orientation and 𝑓_𝑔^{−1} converts the rotation matrix to angular changes [48]. Taking local rotation as an example, its physical embedding is 𝑹′ = 𝑹∆𝑹, and the augmented readings are derived as

𝒂′ = (𝑹∆𝑹)^{−1}(𝒍 + 𝒈) = ∆𝑹^{−1} 𝑹^{−1}(𝒍 + 𝒈) = ∆𝑹^{−1} 𝒂,   (21)

𝝎′ = 𝑓_𝑔^{−1}((𝑹_{𝑡−1}∆𝑹)^{−1} 𝑹_𝑡 ∆𝑹)/∆𝑡 = 𝑓_𝑔^{−1}(∆𝑹^{−1} 𝑹_{𝑡−1}^{−1} 𝑹_𝑡 ∆𝑹)/∆𝑡 = 𝑓_𝑔^{−1}(∆𝑹^{−1} 𝑓_𝑔(𝝎∆𝑡) ∆𝑹)/∆𝑡 = 𝑓_𝑔^{−1}(∆𝑹^{−1} 𝑓_𝑔(𝝎) ∆𝑹) = ∆𝑹^{−1} 𝝎.   (22)

These equations demonstrate the same results as those obtained with quaternions (Equations 11 and 12).

Frequency-domain data augmentation. Some studies [26, 36] propose data augmentations in the frequency domain. To establish the physical embedding of these data augmentations, we may accordingly perform the Fourier Transform on physical states like orientation. However, frequency-domain operations can potentially violate the constraints of the orientation representation after the inverse Fourier Transform. For example, a low-pass filter on orientation quaternions can lead to non-unit quaternions and a loss of their physical meaning. Further study is needed to understand their relationships with physics-informed data augmentation.

10 CONCLUSION

In this paper, we practically adopt HAR with realistic overhead for mobile devices. The proposed UniHAR framework effectively adopts physics-informed data augmentation on massive unlabeled and limited labeled IMU data to overcome the data heterogeneity across various users. UniHAR is prototyped on the mobile platform and tested to introduce low overhead. Extensive evaluation with cross-dataset experiments demonstrates its outstanding performance compared with state-of-the-art approaches.

ACKNOWLEDGMENTS

This research is supported by the National Research Foundation, Singapore under its Industry Alignment Fund – Pre-positioning (IAF-PP) Funding Initiative, under its NRF Investigatorship (NRFI) NRF-NRFI08-2022-0010. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of National Research Foundation, Singapore. This research is also supported by Singapore MOE Tier 1 (RG88/22). Mo Li is the corresponding author.