Criterion-Related Validity of Consumer-Wearable Activity Trackers for Estimating Steps in Primary Schoolchildren under Controlled Conditions: Fit-Person Study

The purposes were to examine the criterion-related validity of the steps estimated by consumer-wearable activity trackers (wrist-worn activity trackers: Fitbit Ace 2, Garmin Vivofit Jr 2, and Xiaomi Mi Band 5; smartphone applications: Pedometer, Pedometer Pacer Health, and Google Fit/Apple Health) and their comparability in primary schoolchildren under controlled conditions. An initial sample of 66 primary schoolchildren (final sample = 56; 46.4% females), aged 9-12 years (mean = 10.4 ± 1.0 years), wore three wrist-worn activity trackers (Fitbit Ace 2, Garmin Vivofit Jr 2, and Xiaomi Mi Band 5) on their non-dominant wrist and carried two smartphones, each with three applications installed (Pedometer, Pedometer Pacer Health, and Google Fit/Apple Health for Android/iOS on a Samsung Galaxy S20+/iPhone 11 Pro Max), in simulated front trouser pockets. Steps estimated by the consumer-wearable activity trackers and steps counted independently from video by two researchers (gold standard) were recorded while the schoolchildren completed a 200-meter course under slow, normal and brisk pace walking, and running conditions. Results showed that the criterion-related validity of the step scores estimated by the three Samsung applications and the Garmin Vivofit Jr 2 was good-excellent in the four walking/running conditions (e.g., MAPE = 0.6 - 2.3%; lower 95% CI of the ICC = 0.81 - 0.99), and that these trackers were comparable. However, the Apple applications, Fitbit Ace 2, and Xiaomi Mi Band 5 showed poor criterion-related validity and comparability in some walking/running conditions (e.g., lower 95% CI of the ICC < 0.70). Nevertheless, because in real life primary schoolchildren also keep their smartphones elsewhere (e.g., in schoolbags, in their hands or even away from the body), the criterion-related validity of the Garmin Vivofit Jr 2 would potentially be considerably higher than that of the Samsung applications. The findings of the present study highlight the potential of the Garmin Vivofit Jr 2 for monitoring primary schoolchildren’s steps under controlled conditions.


Introduction
Engaging in regular physical activity (PA), especially of moderate-to-vigorous intensity, is widely acknowledged as a significant indicator of health in primary schoolchildren (World Health Organization, 2020). Furthermore, scientific evidence has also shown that total PA is favourably linked to numerous health outcomes in primary schoolchildren (Poitras et al., 2016), with steps per day being a common and reliable measure of total PA (Althof et al., 2017; Craig et al., 2010). The World Health Organization (2020) recommends that primary schoolchildren should engage in at least an average of 60 minutes per day of moderate-to-vigorous PA across the week. However, these PA guidelines are challenging to comprehend for both primary schoolchildren and their parents (Crossley et al., 2019). To address this issue, the moderate-to-vigorous PA-based recommendations have been translated into simpler step-per-day guidelines for primary schoolchildren (Mayorga-Vega et al., 2021). In particular, existing evidence suggests that primary schoolchildren should achieve at least about 10,000-12,000 steps per day (Benítez-Porres et al., 2016; Colley et al., 2012; Oliveira et al., 2017).
Consumer-wearable activity trackers have emerged as valuable tools for monitoring and promoting habitual PA among users (Casado-Robles et al., 2022). Such consumer-wearable activity trackers, including wrist-worn activity trackers, clip-on activity trackers and smartphone PA applications, are electronic devices worn on the body to monitor daily PA levels (Casado-Robles et al., 2022). The popularity of consumer-wearable activity trackers has surged in recent years, with global sales of wearable and smartphone devices exceeding 500 million and 13 billion, respectively (Laricchia, 2023a, 2023b). Given this widespread adoption and their characteristics, stakeholders, including researchers, paediatricians, physical education teachers and parents, are increasingly interested in utilizing consumer-wearable activity trackers to monitor and promote healthy PA habits in primary schoolchildren (Casado-Robles et al., 2022; Mayorga-Vega et al., 2022).
Among the diverse consumer-wearable activity trackers available, smartphone PA applications and wrist-worn activity trackers have been shown to be the types of devices most valued and used by primary schoolchildren (Mayorga-Vega et al., 2022; Viciana et al., 2022). Given that most primary schoolchildren now own smartphones that they carry with them throughout the day (Spanish National Institute of Statistics, 2023) and many PA applications are freely available (Viciana et al., 2022), smartphone PA applications hold a significant advantage as they do not require purchasing any specific device for monitoring and promoting PA. As regards the available purchase options, wrist-worn activity trackers have several advantages over others such as clip-on activity trackers, for example reporting real-time feedback that can be easily checked (Maher et al., 2017) or having greater wear compliance (Fairclough et al., 2016). Moreover, recent scientific evidence supports wrist-worn activity trackers as the most effective for promoting primary schoolchildren's daily PA (Casado-Robles et al., 2022). For these reasons, smartphone PA applications and wrist-worn activity trackers have the potential to serve as feasible tools for objectively monitoring and promoting primary schoolchildren's daily PA (Casado-Robles et al., 2022; Gil-Espinosa et al., 2022; Giurgiu et al., 2022).
Steps per day represent the most common measure for monitoring PA and personalized goal-setting for promoting PA through consumer-wearable activity trackers (Casado-Robles et al., 2022; Maher et al., 2017). However, before utilizing a particular consumer-wearable activity tracker, it is crucial to assess its validity and ensure its appropriateness for the target population (Kottner et al., 2011; Mokkink et al., 2010). Criterion-related validity of step counts estimated by consumer-wearable activity trackers should be analyzed by examining the agreement between their scores and those from the "gold standard", which currently involves video-based counting conducted by at least two observers (Johnston et al., 2021). The best-practice protocol for the validation of steps estimated by consumer-wearable activity trackers should be conducted under controlled, semi-free living, and free-living conditions (Johnston et al., 2021). The controlled testing condition, which involves participants wearing the activity trackers while completing walking/running tasks at controlled or self-selected speeds, represents the first stage in the multistage protocols for the best-practice validation of steps estimated by consumer-wearable activity trackers (Johnston et al., 2021). Furthermore, since different kinds of consumer-wearable activity trackers could be used in the same context due to economic constraints (e.g., monitoring or promoting PA in the Physical Education setting or large-scale research studies) (Brodie et al., 2018; Creaser et al., 2022), the agreement between different devices (i.e., comparability) should also be studied (Viciana et al., 2022).
In spite of the increasing use of smartphone PA applications and wrist-worn activity trackers, there is a lack of substantial evidence regarding their criterion-related validity and comparability in primary schoolchildren. To date, and to our knowledge, only two prior studies have examined the criterion-related validity of steps estimated by wrist-worn activity trackers in primary schoolchildren under controlled conditions (Godino et al., 2020; Sun et al., 2022). These studies found that the wrist-worn activity trackers Fitbit Charge HR (Godino et al., 2020), Fitbit Ace, and Moki (Sun et al., 2022) had good to excellent criterion-related validity for estimating steps. Moreover, as far as we know, no previous studies on this topic have been carried out with smartphone PA applications among primary schoolchildren. Furthermore, to the best of our knowledge, there is a lack of prior studies examining the comparability of steps estimated by smartphone PA applications and wrist-worn activity trackers in this population.
Consequently, the main purpose of the present study was to examine the criterion-related validity of the steps estimated by the consumer-wearable activity trackers (wrist-worn activity trackers: Fitbit Ace 2, Garmin Vivofit Jr 2, and Xiaomi Mi Band 5; smartphone applications: Pedometer, Pedometer Pacer Health, and Google Fit/Apple Health) in primary schoolchildren under controlled conditions. The secondary purpose of this study was to examine the comparability of the steps estimated by the above-mentioned consumer-wearable activity trackers in primary schoolchildren under controlled conditions.

Participants
The present study is reported according to the GRRAS guidelines (Kottner et al., 2011). The protocol of the present study conforms to the Declaration of Helsinki statements (64th WMA, Brazil, October 2013) and it was first approved by the Ethical Committee for Human Studies at the University of Granada. Three public primary schools located in urban areas of the province of Granada (Spain) were chosen by convenience. According to the schools' reports, all the primary schoolchildren's families had a middle socioeconomic level. The principal and the PE teachers were first contacted. Then, they were informed about the project, and permission to conduct the study was requested. After the approval of the schools was obtained, all the primary schoolchildren and their legal guardians were fully informed about the features of the project. Primary schoolchildren's verbal informed assent and their legal guardians' signed written informed consent were obtained before taking part in the study.
The present study followed a cross-sectional design. A total of 66 primary schoolchildren from 4th to 6th grade (i.e., 9-12 years old) enrolled in the selected schools were invited to participate in the present study. The following inclusion criteria were considered: a) being enrolled in the 4th to 6th grade at the primary education level (i.e., target grades according to the study aim); b) being free of any health disorder that would make them unable to engage in PA normally; c) providing the corresponding verbal informed assent of the primary schoolchildren, and d) presenting the corresponding signed written informed consent of the primary schoolchildren's legal guardians. The following exclusion criteria were considered: a) not having complete and valid data from the five wearable activity trackers, and/or b) not having complete and valid data from the video-based step count.
Anthropometric. Primary schoolchildren's body mass (kg) and height (cm) were measured in shorts and T-shirts, and barefoot, following the International Standards for Anthropometric Assessment (Stewart et al., 2011). For the body mass measure, primary schoolchildren stood in the centre of the scale (Seca, Ltd., Hamburg, Germany; accuracy = 0.1 kg) without support and with the weight distributed evenly on both feet. For the body height assessment, primary schoolchildren stood with their feet together with the heels, buttocks and upper part of the back touching the stadiometer (Holtain Ltd., Crymmych, Pembs, United Kingdom; accuracy = 0.1 cm), and with the head placed in the Frankfort plane. Each measurement was performed twice and the mean was recorded (Stewart et al., 2011). Then, the body mass index was calculated as body mass divided by body height squared (kg/m²). Finally, primary schoolchildren's body weight status was categorized by gender- and age-adjusted body mass index thresholds as overweight/obesity or non-overweight/obesity (Cole et al., 2000). There is strong evidence supporting the validity of body mass index and body weight status scores as indicators of body composition among primary schoolchildren (Cole et al., 2000).
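The body mass index calculation described above can be sketched as follows. This is only an illustration: the function names and the example measurements are hypothetical and are not taken from the study data, and the gender- and age-adjusted thresholds of Cole et al. (2000) are deliberately not reproduced here.

```python
def mean_of_two(first, second):
    """Mean of the two repeated measurements of the same variable."""
    return (first + second) / 2.0


def body_mass_index(mass_kg, height_cm):
    """Body mass (kg) divided by body height squared (m^2)."""
    if height_cm <= 0:
        raise ValueError("height must be positive")
    height_m = height_cm / 100.0  # the stadiometer reading is in cm
    return mass_kg / height_m ** 2


# Hypothetical example values for one child (not study data).
mass = mean_of_two(35.2, 35.4)
height = mean_of_two(140.0, 140.2)
bmi = body_mass_index(mass, height)
print(f"BMI = {bmi:.1f} kg/m^2")
```

The resulting value would then be compared against the corresponding gender- and age-specific cut-off to assign the body weight status category.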
Consumer-wearable activity trackers. Primary schoolchildren's steps were estimated by three wrist-worn activity trackers [Fitbit Ace 2 (Fitbit, San Francisco, CA, USA), Garmin Vivofit Jr 2 (Garmin, Kansas, KS, USA), and Xiaomi Mi Band 5 (Xiaomi, Beijing, China)] and three applications in two smartphones [Pedometer (ITO Technologies) and Pedometer Pacer Health for Android (Samsung Galaxy S20+) and iOS (iPhone 11 Pro Max); and the Google Fit application for Android (Samsung Galaxy S20+), and the Apple Health application for iOS (iPhone 11 Pro Max)]. Physical specifications of the chosen devices are as follows: Fitbit Ace 2: 2.27 x 1.00 x 0.30 cm, 20.0 g; Garmin Vivofit Jr 2: 1.1 x 1.1 x 0.9 cm, 17.5 g; Xiaomi Mi Band 5: 4.69 x 1.81 x 1.24 cm, 11.9 g; Samsung Galaxy S20+: 16.2 x 7.4 x 0.8 cm, 186 g, and iPhone 11 Pro Max: 15.8 x 7.8 x 0.8 cm, 226 g. The three chosen wrist-worn activity trackers are based on built-in tri-axial accelerometers, while the chosen smartphones have different sensors including accelerometers and gyroscopes. Each device and application has its own proprietary algorithm to estimate the step counts.
Concerning the particular activity trackers chosen, the criteria were as follows: a) the most widely used display-based activity wristband brands worldwide (Henriksen et al., 2018; IDC's Worldwide Quarterly Wearable Device Tracker reports from 2017 to 2020); b) choosing models of the devices with affordable prices (based on launch prices in Spain; Fitbit Ace 2 ≈ 70€; Garmin Vivofit Jr 2 ≈ 70€; Xiaomi Mi Band 5 ≈ 35€); c) choosing the most advanced model (at that time), and d) models designed specifically for children, when they were available (i.e., Garmin Vivofit Jr 2 and Fitbit Ace 2). For the smartphone applications, the criteria were to study: a) applications for Android and iOS, and b) the most popular and used free downloadable applications available in the application stores (based on the number of downloads and their user ratings) and the built-in applications of the corresponding smartphones (i.e., Samsung Google Fit for Android and Apple Health for iOS). As regards the specific smartphones used, the criteria were the most widely used brands worldwide (IDC's Worldwide Quarterly Wearable Device Tracker reports from 2017 to 2020) and choosing the most advanced model (at that time) for Android and iOS.
Finally, as regards the number of devices, it was considered that three wrist-worn activity trackers and two smartphones were a feasible number that did not interfere with the primary schoolchildren's movements while walking and running (i.e., natural arm and leg swing) and allowed for a correct measurement (i.e., wrist and legs adjustment). In this line, the total mass of the three wrist-worn activity trackers (37.5 g) and two smartphones (186 or 226 g in each thigh) was not high. According to the user manuals of the wrist-worn activity trackers, one device of each model was adjusted snugly on the top of primary schoolchildren's non-dominant wrist, close to and above the wrist bone (their combined width was 3.91 cm). Regarding the smartphones, one device of each model was placed in a bag (i.e., one in each of two bags), adjusted snugly with a belt, on the top and front part of the primary schoolchildren's thighs (one on each) as if they were placed in trouser pockets, so that they did not interfere with the primary schoolchildren's movements during the trials. Activity trackers were adjusted so they could not move, but overtightening was avoided.
Video-based steps count. Primary schoolchildren's gold standard step count was determined by counting steps from the video recording in slow motion (Johnston et al., 2021). Primary schoolchildren were asked to perform a 200-meter course in four different conditions. The 200-meter course was marked with cones and lines and performed inside the school on a non-slippery sport court with an oval shape and no tight turns. A digital video camera (GoPro Hero 7, California, USA) with a tripod was situated in the middle of the sports court in order to easily record the primary schoolchildren's lower limbs during the entire course from the sagittal plane. For calculating the speed and step cadence of each condition, time was measured from when the primary schoolchildren started walking/running until they crossed the finish line. The gold standard step count for each schoolchild in each condition was performed independently by two researchers through the slow-motion video recording projected on a 15.6" screen. When disagreement occurred (8.6%), these particular observations were evaluated again by the two researchers. Although most of the disagreements were simply due to an error by one of the two researchers, when disagreement still occurred, a third researcher evaluated it.
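As an illustration of how speed and step cadence could be derived from the timed 200-meter course, and of how trials needing a second look could be flagged when the two observers' counts differ, a minimal sketch follows. The function names and example values are hypothetical; the source does not state the exact formulas or units the authors used, so metres per second and steps per minute are assumptions.

```python
COURSE_LENGTH_M = 200.0  # course length stated in the protocol


def speed_m_per_s(elapsed_s, distance_m=COURSE_LENGTH_M):
    """Average speed over the course (assumed unit: m/s)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return distance_m / elapsed_s


def cadence_steps_per_min(step_count, elapsed_s):
    """Average step cadence over the course (assumed unit: steps/min)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return step_count / (elapsed_s / 60.0)


def disagreeing_trials(counts_observer_a, counts_observer_b):
    """Indices of trials where the two independent video counts differ."""
    return [
        index
        for index, (a, b) in enumerate(zip(counts_observer_a, counts_observer_b))
        if a != b
    ]


# Hypothetical example: a trial of 300 steps completed in 150 s.
print(speed_m_per_s(150.0), cadence_steps_per_min(300, 150.0))
print(disagreeing_trials([300, 310, 295], [300, 312, 295]))
```

Trials returned by `disagreeing_trials` would be the ones recounted by both observers and, if still unresolved, passed to a third researcher.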

Procedure
Evaluations were carried out during the afternoon in participants' leisure time from Monday to Friday, and then data were downloaded and batteries charged during the morning. Due to the limitations of material and human resources, about two or three primary schoolchildren per hour were evaluated one by one during each evaluation session. Data collection was carried out by the same researchers using the same instruments and protocols. Firstly, primary schoolchildren's demographic characteristics and anthropometric measurements were recorded. Then, the five devices were fitted to the primary schoolchildren. In order to avoid the relative position of the activity trackers influencing the outcomes, they were fitted in a random order varying across the primary schoolchildren (i.e., the position on the non-dominant wrist from hand to elbow for the wrist-worn activity trackers and the left/right thigh for the smartphones) (Hartung et al., 2020).
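One way the random ordering of device positions could be generated for each child is sketched below. The source does not describe the randomization procedure, so the use of a seeded pseudo-random shuffle, the seed value, and the dictionary layout are all assumptions made for illustration.

```python
import random

WRIST_TRACKERS = ["Fitbit Ace 2", "Garmin Vivofit Jr 2", "Xiaomi Mi Band 5"]
SMARTPHONES = ["Samsung Galaxy S20+", "iPhone 11 Pro Max"]


def assign_positions(rng):
    """Draw one random placement of the five devices for a single child."""
    wrist_order = WRIST_TRACKERS[:]  # copy so the constant list is untouched
    rng.shuffle(wrist_order)         # order from hand towards elbow
    phones = SMARTPHONES[:]
    rng.shuffle(phones)              # which phone goes on which thigh
    return {
        "wrist_hand_to_elbow": wrist_order,
        "left_thigh": phones[0],
        "right_thigh": phones[1],
    }


rng = random.Random(2023)  # hypothetical seed, only for reproducibility
assignment = assign_positions(rng)
print(assignment)
```

Calling `assign_positions` once per participant with the same generator yields a different placement for each child while keeping the whole sequence reproducible.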
Finally, primary schoolchildren were instructed to walk/run the 200-meter course in the following four conditions, at a continuous speed, and with a natural arm and leg swing: 1) slow pace walking; 2) normal pace walking (self-pace walking); 3) brisk pace walking; and 4) running (jogging). Participants chose their walking/running speed based on the instructions provided for each condition (e.g., for the normal pace walking condition: "Perform the course at a speed that corresponds to walking naturally, at an everyday walking pace. For example, similar to the one you follow when going from home to school"). Before starting, a demonstration was performed in order to guide each participant. When primary schoolchildren were at the starting line, the step counts shown by the activity trackers were recorded. Then, they were instructed not to move until they started walking/running. They were also asked to always start the course with the leg contralateral to the arm where the wrist-worn activity trackers were attached. Primary schoolchildren were requested to stop immediately after the finish line, and a cone was situated five meters beforehand to remind them. Then, the step counts shown by the activity trackers were registered again.
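Because the trackers report cumulative counts, the steps attributed to a trial are presumably the difference between the reading registered after the finish line and the reading recorded at the starting line. A minimal sketch under that assumption follows; the function name and the readings are hypothetical.

```python
def steps_in_trial(reading_at_start, reading_at_finish):
    """Steps a tracker accumulated between the start and finish readings."""
    if reading_at_finish < reading_at_start:
        raise ValueError("finish reading is lower than start reading")
    return reading_at_finish - reading_at_start


# Hypothetical cumulative readings (start, finish) for one trial.
readings = {
    "Fitbit Ace 2": (1200, 1475),
    "Garmin Vivofit Jr 2": (860, 1142),
}
trial_steps = {
    device: steps_in_trial(start, finish)
    for device, (start, finish) in readings.items()
}
print(trial_steps)
```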

Statistical analysis
Descriptive statistics for all the variables of the included participants were calculated. Firstly, all the assumptions of the statistical tests were checked (e.g., histograms and Q-Q plots for normality) and met. Furthermore, univariate (i.e., z beyond ± 3.0) and multivariate outliers (i.e., Mahalanobis distance) were removed. Afterward, for examining the main purpose of the present study (i.e., criterion-related validity), the agreement between the number of steps assessed by the consumer-wearable activity trackers and the video-based count (gold standard) was calculated as follows: a) Equivalence test with the 90% confidence interval (CI) method (Dixon et al., 2018); b) Limits of Agreement (LOA) with its 95% CI (Bland and Altman, 1986); c) Mean Absolute Error (MAE) (Willmott and Matsuura, 2005); d) Mean Absolute Percentage Error (MAPE) (Johnston et al., 2021), and e) Intraclass Correlation Coefficient (ICC), and its 95% CI, by a two-way random effects model with absolute agreement and single measurement [also known as ICC(2,1)] (Koo and Li, 2016). Based on previous literature, agreement values were interpreted as follows: Equivalence test, agreement is considered acceptable when the mean reference standard score is within ± 15% of the mean consumer-wearable activity tracker score (Dixon et al., 2018); MAPE, > 15.0% poor, 10.1 - 15.0% acceptable, 5.1 - 10.0% good, and 0.0 - 5.0% excellent (Johnston et al., 2021); ICC, 0.00 - 0.69 poor, 0.70 - 0.79 acceptable, 0.80 - 0.89 good, and 0.90 - 1.00 excellent (Nunnally, 1978). Based on statistical inference, each ICC value was interpreted according to its 95% CI; that is, there was a 95% chance that the true ICC value lay at any point within the 95% CI range (Koo and Li, 2016). Finally, LOA plots, in which the individual participant differences between the two scores are plotted against the respective individual means, were produced (Bland and Altman, 1995). Heteroscedasticity was also examined objectively by calculating the Pearsonʼs correlation coefficient (r) between the absolute differences and the individual means (Atkinson and Nevill, 1998). Based on Cohen's (1992) benchmarks, a correlation coefficient > 0.50 was considered as indicative of heteroscedasticity. Finally, as regards the secondary purpose of the present study (i.e., comparability), the agreement between the number of steps estimated by pairs of consumer-wearable activity trackers was examined in the same way. All statistical analyses were performed using SPSS version 25.0 for Windows (IBM® SPSS® Statistics), except for the equivalence test, for which Jamovi version 2.3 (The Jamovi project, https://www.jamovi.org) was used. The statistical significance level was set at p < 0.05.
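To make the agreement statistics named above concrete, the following sketch computes the MAE, the MAPE, the Bland-Altman bias with its limits of agreement, and a point estimate of ICC(2,1) for one tracker against the video-based count. It is not the authors' SPSS or Jamovi procedure: the function names and the example data are hypothetical, the limits of agreement use the conventional bias ± 1.96 SD of the differences, and the 95% CIs and the equivalence test are omitted.

```python
import statistics


def mean_absolute_error(device, criterion):
    """Mean of the absolute step differences (steps)."""
    return statistics.fmean([abs(d - c) for d, c in zip(device, criterion)])


def mean_absolute_percentage_error(device, criterion):
    """Mean absolute difference relative to the criterion count (%)."""
    return 100.0 * statistics.fmean(
        [abs(d - c) / c for d, c in zip(device, criterion)]
    )


def limits_of_agreement(device, criterion):
    """Return (bias, lower LOA, upper LOA) as bias +/- 1.96 SD of differences."""
    diffs = [d - c for d, c in zip(device, criterion)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def icc_2_1(device, criterion):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement."""
    rows = list(zip(device, criterion))  # one row per participant
    n, k = len(rows), 2                  # n participants, k = 2 "raters"
    grand = statistics.fmean([v for row in rows for v in row])
    row_means = [statistics.fmean(row) for row in rows]
    col_means = [statistics.fmean(col) for col in zip(*rows)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical step counts for four children (not study data).
tracker = [300, 310, 295, 320]
video = [302, 308, 300, 318]
print(mean_absolute_error(tracker, video))
print(mean_absolute_percentage_error(tracker, video))
print(limits_of_agreement(tracker, video))
print(icc_2_1(tracker, video))
```

The same functions apply unchanged to the comparability analysis by passing the step counts of two trackers instead of a tracker and the video-based count.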

General characteristics
Figure 1 shows the flow diagram of the participants throughout the study. Of the 66 primary schoolchildren that were invited to participate in the present study, 63 primary schoolchildren agreed and met the inclusion criteria. Since some primary schoolchildren met at least one exclusion criterion, the final sample consisted of 56 participants (i.e., non-compliance rate of 11.1%). Table 1 shows the general characteristics of the participants.

Criterion-related validity of the consumer-wearable activity trackers for estimating steps
Table 2 shows the criterion-related validity of the consumer-wearable activity trackers for estimating steps during controlled conditions. The results showed that the criterion-related validity of the step scores estimated by the activity trackers tended to be higher for slow pace walking, followed by running, normal pace walking and brisk pace walking. Particularly, the results showed that the criterion-related validity of the step scores estimated by the three Samsung applications was excellent in all of the four walking/running conditions (e.g., scores inside the 90% CI of the equivalence test, MAPE ≤ 5%, and 95% CI of the ICC ≥ 0.90). Similarly, the criterion-related validity of the steps estimated by the Garmin Vivofit Jr 2 was excellent, except for the 95% CI of the ICC value on the brisk pace walking condition, which was good. However, regarding the Apple applications, although most of the criterion-related validity results were excellent, the 95% CI of the ICC value on the brisk pace walking condition was poor (note that, since the three applications on the iPhone 11 Pro Max reported exactly the same step scores, results are reported as "Apple applications"). Moreover, although most of the criterion-related validity results of the steps estimated by the Xiaomi Mi Band 5 ranged from good to excellent, the 95% CI of the ICC values for the normal and brisk pace walking conditions were poor. Furthermore, for the steps estimated by the Fitbit Ace 2, the 95% CI of the ICC ranged from poor to acceptable (although scores were inside the 90% CI of the equivalence test, and MAPE values ranged from good to excellent).
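The verbal labels used in these results follow the interpretation bands given in the Statistical analysis section. A minimal sketch of that mapping is shown below. Two points are assumptions rather than statements from the source: the bands are treated as continuous (so a MAPE between 10.0 and 10.1% falls in the higher band), and the ICC label is assigned from the lower bound of its 95% CI, as the abstract's reporting suggests.

```python
def interpret_mape(mape_percent):
    """Label a MAPE value (%) using the Johnston et al. (2021) bands."""
    if mape_percent <= 5.0:
        return "excellent"
    if mape_percent <= 10.0:
        return "good"
    if mape_percent <= 15.0:
        return "acceptable"
    return "poor"


def interpret_icc(icc_lower_bound):
    """Label an ICC from the lower bound of its 95% CI (Nunnally, 1978 bands)."""
    if icc_lower_bound >= 0.90:
        return "excellent"
    if icc_lower_bound >= 0.80:
        return "good"
    if icc_lower_bound >= 0.70:
        return "acceptable"
    return "poor"


# Values quoted in the abstract: MAPE = 0.6 - 2.3%, lower 95% CI of ICC = 0.81 - 0.99.
print(interpret_mape(2.3), interpret_icc(0.81), interpret_icc(0.99))
```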

Comparability of the consumer-wearable activity trackers for estimating steps
Table 4 shows the comparability of the consumer-wearable activity trackers for estimating steps during controlled conditions. The results showed that the comparability of the step scores estimated by the activity trackers tended to be higher for slow pace walking, followed by running, normal pace walking and brisk pace walking. Particularly, the results showed that the comparability of the step scores estimated by all the activity trackers in the slow pace walking and running conditions was excellent (e.g., scores inside the 90% CI of the equivalence test, MAPE ≤ 5%, and 95% CI of the ICC ≥ 0.90), except for the 95% CI of the ICC value for all the comparisons with the Fitbit Ace 2, which was good for slow pace walking and poor for running. The comparability of the step scores estimated by all the activity trackers in the normal pace walking condition was good-excellent, except for the 95% CI of the ICC value for all the comparisons with the Fitbit Ace 2/Xiaomi Mi Band 5, which was poor, as well as between the two wrist-worn activity trackers, where it was acceptable. However, in the brisk pace walking condition, while the 95% CI of the ICC value was excellent-good for all the comparisons with the three Samsung applications and acceptable for those with the Garmin Vivofit Jr 2, the rest of the comparisons were poor (though for all the comparisons the scores were inside the 90% CI of the equivalence test, and the MAPE values were excellent). Pearsonʼs correlation coefficients did not show heteroscedasticity on any walking/running condition (r = -0.50 to 0.26), except for the comparability between the Garmin Vivofit Jr 2 and Xiaomi Mi Band 5 on the brisk pace walking condition (r = -0.53), and between the Fitbit Ace 2 and the Samsung Pedometer/Samsung Google Fit/Apple applications on the running condition (r = -0.52 to -0.51; Table 3).
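The heteroscedasticity check reported here, a Pearson correlation between the absolute differences and the individual means of two step scores, can be sketched as follows. The function names and data are hypothetical. The threshold is applied to the absolute value of r, which is an assumption inferred from the fact that r = -0.53 is flagged above while r = -0.50 is not.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def heteroscedasticity_r(scores_a, scores_b):
    """Correlate absolute differences with the individual means of two scores."""
    abs_diffs = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    means = [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]
    return pearson_r(means, abs_diffs)


def is_heteroscedastic(r, threshold=0.50):
    """Flag heteroscedasticity when |r| exceeds the Cohen-based benchmark."""
    return abs(r) > threshold


# Hypothetical data in which the error grows with the number of steps.
r = heteroscedasticity_r([100, 200, 300, 400], [101, 203, 306, 410])
print(round(r, 2), is_heteroscedastic(r))
```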

Criterion-related validity of the consumer-wearable activity trackers for estimating steps
The findings of the present study showed that the criterion-related validity of the step scores estimated by the three Samsung applications was excellent in all of the four walking/running conditions. Similarly, the criterion-related validity of the steps estimated by the Garmin Vivofit Jr 2 was excellent, except for the 95% CI of the ICC value on the brisk pace walking condition, which was good. However, although for the rest of the consumer-wearable activity trackers scores were inside the 90% CI of the equivalence test and MAPE values ranged from good to excellent, some poor ICC outcomes were observed in the present study. For instance, while for the Apple applications most of the criterion-related validity results were also excellent, on the brisk pace walking condition the ICC outcome was poor. Similarly, while the criterion-related validity results of the steps estimated by the Xiaomi Mi Band 5 were good-excellent on slow pace walking and running, on the normal and brisk pace walking conditions the ICC outcomes were poor. Finally, for the steps estimated by the Fitbit Ace 2, the ICC outcomes ranged from poor to acceptable.
In spite of the increasing use of smartphone PA applications and wrist-worn activity trackers, today there is still a lack of substantial evidence regarding their criterion-related validity in primary schoolchildren (Fuller et al., 2020; Gorzelitz et al., 2020; Johnston et al., 2021). Prior studies on the criterion-related validity of steps estimated by wrist-worn activity trackers in primary schoolchildren under controlled conditions showed similar outcomes to the present study. For instance, Godino et al. (2020) studied the criterion-related validity of the Fitbit Charge HR (non-dominant wrist) in primary schoolchildren (mean = 9.9, 9-11 years) under controlled and semi free-living conditions (14 structured activities, including sitting, stationary cycling, treadmill walking/running, stair walking, outdoor walking and agility drills), for which they used a person-worn video camera (GoPro Hero) mounted on a harness as the gold standard (two observers). Similar to the results of the present study with the Fitbit Ace 2, Godino et al. (2020) observed that, while according to the MAPE the Fitbit Charge HR had, on average, good criterion-related validity for estimating steps across the 14 activities (9.9%), the largest disagreement was found during fast walking/running [LOA = 20.5 (-19.6, 60.6)]. Likewise, these authors also found that the Fitbit Charge HR underestimated step counts [mean of the 14 activities, LOA = 11.8 (8.1, 15.6)].
In the same way, Sun et al. (2022) studied the criterion-related validity of the Fitbit Ace (left wrist) and Moki (right wrist) in primary and secondary schoolchildren (mean = 13.0, 11-13 years) under controlled conditions (3 walking activities), for which they used a smartphone camera (iPhone 8) mounted on a tripod as the gold standard (two observers). Similar to the findings of the present study with the Fitbit Ace 2, Sun et al. (2022) also observed that, according to the MAPE, the Fitbit Ace had a good-excellent criterion-related validity for estimating steps (9.5, 3.1 and 5.3%), but it underestimated step counts [LOA = 30.0 (-44.1, 104.1), 3.0 (-21.3, 27.9), and 13.0 (-32.2, 57.3)]. However, these authors found that the Moki had an excellent criterion-related validity for estimating steps (e.g., MAPE = 4.0/3.9/3.0%; systematic bias = 1.0/-4.0/-6.0). On the other hand, to our knowledge, there is no prior study examining the criterion-related validity of Garmin, Xiaomi or any other brand of wrist-worn activity trackers for estimating steps in primary schoolchildren under controlled conditions. Likewise, as far as we know, no previous studies on this topic have been carried out with smartphone PA applications in this population.
Although validity results depend on the population and testing conditions and, therefore, should not be generalized, due to the limited evidence on the criterion-related validity of wrist-worn activity trackers and smartphone PA applications for estimating steps in primary schoolchildren under controlled conditions, the findings of the present study have also been compared with the available literature on young people (under 18 years) under controlled conditions and on primary schoolchildren under free-living conditions. As regards previous studies under controlled conditions, to our knowledge, only Viciana et al. (2022) examined the criterion-related validity of steps estimated by consumer-wearable activity trackers (wrist-worn activity trackers, on the non-dominant wrist: Xiaomi Mi Band 5, Samsung Galaxy Watch Active 2, and Apple Watch Series 5; the same PA applications as in the present study) in secondary students (mean = 14.7, 12-18 years) under controlled conditions (200-m course at slow, normal and brisk pace walking, and running), for which they used a digital video camera (GoPro Hero 7) with a tripod situated in the middle of the sports court as the gold standard (two observers). Similar to the results of the present study, Viciana et al. (2022) observed that although for the examined consumer-wearable activity trackers scores were inside the 90% CI of the equivalence test and the MAPE values were excellent, some ICC outcomes were poor-acceptable. Moreover, similar to the present study, the above-mentioned authors found that while the criterion-related validity results of the steps estimated by the Xiaomi Mi Band 5 were excellent on slow pace walking and running, on the normal and brisk pace walking conditions the ICC outcomes were acceptable and poor, respectively. Similarly, Viciana et al. (2022) also observed that the three Samsung applications had good-excellent ICC outcomes under the four walking/running conditions (except for the Samsung Pedometer in the normal pace walking condition, which was poor). Furthermore, while for the Apple applications most of the criterion-related validity results were excellent, on the slow pace walking condition the outcome was acceptable (in the present study it was instead poor on the brisk pace walking condition). Finally, these authors found that the Samsung Galaxy Watch Active 2 and Apple Watch Series 5 had a good-excellent criterion-related validity for estimating steps, except on the brisk pace walking condition, where it was poor.
Regarding previous studies examining the criterion-related validity of wrist-worn activity trackers for estimating steps in primary schoolchildren under free-living conditions, to our knowledge, only four previous studies have been carried out. Similar to the present study, Mayorga-Vega et al. (2023) examined the validity of the wrist-worn activity trackers Fitbit Ace 2, Garmin Vivofit Jr 2, and Xiaomi Mi Band 5 (non-dominant wrist) in primary schoolchildren (mean = 10.4, 9-12 years), using the ActiGraph wGT3X-BT accelerometer as the reference standard (right hip). Similar to the results of the present study, these authors found that while the validity of the primary schoolchildren's daily steps estimated by the Garmin Vivofit Jr 2 and Xiaomi Mi Band 5 was good and acceptable, respectively (e.g., scores inside the 90% CI of the equivalence test, MAPE = 9.6/11.3%, and 95% CI of the ICC = 0.87/0.73), for the Fitbit Ace 2 it was poor (e.g., scores outside the 90% CI of the equivalence test, MAPE = 21.1%, and 95% CI of the ICC = 0.00). Similarly, Schmidt et al. (2022) observed that the wrist-worn activity tracker Fitbit Flex 2 (non-dominant wrist) had poor validity (ActiGraph GT9X accelerometer as the reference standard; right hip) for estimating daily steps (e.g., scores outside the 90% CI of the equivalence test; MAPE = 45.1%) in primary schoolchildren (mean = 8.1, 6-11 years), whereas Yang et al. (2019) found that a Xiaomi wrist-worn activity tracker (model not reported) had acceptable validity (ActiGraph GT3X-BT accelerometer as the reference standard; right hip) for estimating daily steps (e.g., systematic bias = 633.5) in primary schoolchildren (mean = 13.0, 10-17 years). Finally, Sirard et al. (2017) examined the validity of the Movband 2 (dominant wrist) for estimating daily steps in 6-to-12-year-old primary schoolchildren (mean = 8.6 years) using the ActiGraph GT3X+ accelerometer as the reference standard (right hip). These authors found that the Movband 2 considerably overestimated the primary schoolchildren's daily steps (i.e., by 2,190.0 steps). As regards the smartphone PA applications, however, to our knowledge, no previous study has examined their validity for estimating daily steps in primary schoolchildren under free-living conditions.
The above-mentioned previous studies under free-living conditions found that the validity of wrist-worn activity trackers for estimating steps tended to be lower than under controlled conditions. However, these apparent inconsistencies between the findings of the present study (i.e., controlled conditions) and those under free-living conditions are plausible. While in studies carried out under controlled conditions, such as the present study, primary schoolchildren are constrained to predefined activities with stable gait patterns, previous studies under free-living conditions involved a greater variability of motor patterns, including a wide range of children's daily life behaviors. Consequently, the measurement error is expected to be lower under controlled conditions than under free-living conditions (Johnston et al., 2021). In this line, systematic reviews have shown that consumer-wearable activity trackers tend to have a higher validity for estimating steps under controlled conditions than under free-living conditions (Fuller et al., 2020; Gorzelitz et al., 2020). Furthermore, the criterion-related validity of step counts estimated by consumer-wearable activity trackers should be analyzed by examining the agreement between their scores and those from the "gold standard" (Johnston et al., 2021), that is, an error-free reference standard (Bossuyt et al., 2015). Video-based step counting with at least two observers is widely considered the gold standard (Johnston et al., 2021). However, all the above-mentioned previous studies under free-living conditions used ActiGraph accelerometers as the reference standard, that is, a method that is not error-free (it normally underestimates step counts, especially in slow pace walking) for assessing step counts among primary schoolchildren (Rosenkranz et al., 2010).
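As an illustration of how agreement with a criterion can be quantified, the following minimal sketch computes the mean absolute error (MAE) and the mean absolute percentage error (MAPE) of tracker step counts against a video-based count. The function names and the step values are hypothetical and are not data from the present study; the exact computation used in the study may differ.

```python
from statistics import mean


def mae(device, criterion):
    """Mean absolute error between device and criterion counts, in steps."""
    return mean(abs(d - c) for d, c in zip(device, criterion))


def mape(device, criterion):
    """Mean absolute percentage error relative to the criterion, in percent."""
    return mean(abs(d - c) / c for d, c in zip(device, criterion)) * 100


# Hypothetical step counts for four children in one condition (not study data).
video_steps = [300, 320, 280, 310]    # criterion: video-based count
tracker_steps = [297, 324, 280, 304]  # device under test

print(f"MAE  = {mae(tracker_steps, video_steps):.2f} steps")
print(f"MAPE = {mape(tracker_steps, video_steps):.2f} %")
```

Because the MAPE divides each error by the criterion count, it would be inflated if the criterion itself undercounted steps, which is one reason an error-free reference matters.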
The findings of the present study indicate that the criterion-related validity of step scores estimated by activity trackers tended to be higher during the slow pace walking and running conditions than during normal and brisk pace walking in primary schoolchildren. Several factors could contribute to the observed differences in criterion-related validity across the walking and running conditions, such as the devices' algorithms or the biomechanics of movement. These factors collectively underscore the need for nuanced algorithm design and consideration of biomechanical variations across different walking and running conditions to enhance the overall validity of activity trackers for step counting in primary schoolchildren.
As mentioned before, the findings of the present study showed that the criterion-related validity of the step scores estimated by the three Samsung applications and the Garmin Vivofit Jr 2 was good-excellent in all four walking/running conditions. Given that most primary schoolchildren now own smartphones that they carry with them throughout the day (Spanish National Institute of Statistics, 2023) and that the three studied Samsung applications are freely available, these PA applications would hold a significant advantage, as they do not require purchasing any specific device for monitoring and promoting PA. In both the present study and the previous topic-related study with secondary students (Viciana et al., 2022), the criterion-related validity of PA applications was examined with the smartphone placed in simulated front trouser pockets. However, in real life primary schoolchildren also place their smartphones in many other places, such as back trouser pockets, schoolbags, or their hands. Likewise, smartphones are often placed somewhere away from the body. Therefore, the criterion-related validity of the step scores estimated by the three studied Samsung applications in real life could be considerably lower. In contrast, wrist-worn activity trackers such as the Garmin Vivofit Jr 2 offer a significant advantage due to their high wear compliance (Fairclough et al., 2016). Moreover, these consumer-wearable activity trackers are always worn in the same location, which matches the non-dominant wrist placement examined in the present study. Thus, in real life the validity of the Garmin Vivofit Jr 2 would potentially be considerably higher than that of the Samsung applications.

Comparability of the consumer-wearable activity trackers for estimating steps
The findings of the present study showed that the comparability of the step scores estimated by the three Samsung applications was excellent in all four walking/running conditions (except between the Samsung Pedometer and Google Fit in the brisk pace walking condition, where the ICC was good). Additionally, the comparability of the step scores estimated by the three Samsung applications and the Garmin Vivofit Jr 2 was good-excellent, except in the brisk pace walking condition, where the ICC was acceptable. However, although for the rest of the consumer-wearable activity trackers the scores were inside the 90% CI of the equivalence test and the MAPE values were excellent, some poor ICC outcomes were observed in the present study. Particularly, the comparability according to the ICC was poor for all the comparisons with the Apple applications under the brisk pace walking condition; with the Xiaomi Mi Band 5 under the normal (except with the Fitbit Ace 2, which was just acceptable) and brisk pace walking conditions; and with the Fitbit Ace 2 under the normal and brisk pace walking, and running conditions.
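For readers less familiar with the ICC, the sketch below shows one common form, the two-way, absolute-agreement, single-measurement coefficient, computed from the mean squares of a children-by-methods table. This particular ICC form is an assumption made for illustration, since the model used in the study is not restated in this section; the function name `icc_a1` and the example values are hypothetical, and confidence intervals are not computed.

```python
def icc_a1(scores):
    """ICC for absolute agreement, single measurement, two-way model.

    `scores` holds one row per child and one column per method
    (e.g., [video_count, tracker_count]).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for children (rows), methods (columns), and residual.
    ss_rows = k * sum((r - grand) ** 2 for r in row_means)
    ss_cols = n * sum((c - grand) ** 2 for c in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical data: the second method reads 10 steps higher for every child.
offset_scores = [[300, 310], [320, 330], [280, 290]]
print(round(icc_a1(offset_scores), 3))
```

In this form a constant offset between two methods lowers the coefficient even when the children are ranked identically, because systematic differences between methods count as disagreement.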
Although the use of different consumer-wearable activity trackers to monitor and promote primary schoolchildren's steps is common in contexts with economic constraints, such as Physical Education, where primary schoolchildren use their own devices (Viciana et al., 2022), to our knowledge, there are unfortunately no previous studies about the comparability of wrist-worn activity trackers and smartphone PA applications in primary schoolchildren under controlled conditions. As far as we know, only Mayorga-Vega et al. (2023) examined the comparability of steps estimated by the wrist-worn activity trackers Fitbit Ace 2, Garmin Vivofit Jr 2, and Xiaomi Mi Band 5 (non-dominant wrist) in primary schoolchildren, but under free-living conditions. These authors found that while the comparability of the daily step scores estimated by the Garmin Vivofit Jr 2 and Xiaomi Mi Band 5 was acceptable-excellent (e.g., scores inside the 90% CI of the equivalence test, MAPE ≤ 5%, and 95% CI of the ICC ≥ 0.70), it was poor between those two and the Fitbit Ace 2 (e.g., 95% CI of the ICC < 0.70). Similarly, in the present study, although the scores were inside the 90% CI of the equivalence test and the MAPE values were excellent under all the conditions, the comparability according to the ICC was excellent between the Garmin Vivofit Jr 2 and Xiaomi Mi Band 5 under the slow pace walking and running conditions, but poor under the normal and brisk pace walking conditions. Moreover, for all the conditions except slow pace walking, the comparability according to the ICC was poor between those two and the Fitbit Ace 2 (as an exception, it was just acceptable with the Xiaomi Mi Band 5 under the normal pace walking condition).
Viciana et al. (2022) also examined the comparability of wrist-worn activity trackers (Samsung Galaxy Watch Active 2, Apple Watch Series 5, and Xiaomi Mi Band 5; non-dominant wrist), but in a sample of secondary students and under free-living conditions. Although the Xiaomi Mi Band 5 and Samsung Galaxy Watch Active 2 had good-excellent comparability [e.g., MAPE = 8.4%; ICC = 0.98 (0.91-0.99)], the comparability between those two and the Apple Watch Series 5 was poor (e.g., MAPE = 19.4/23.3%). However, as far as we know, no previous studies about the comparability of step scores estimated by smartphone PA applications have been carried out in this population.
Considering that the three Samsung applications and the Garmin Vivofit Jr 2 were comparable for estimating daily steps, other aspects such as the price, technical characteristics, and options offered by the different devices could also be important reasons to select one or another for a particular aim (Viciana et al., 2022). For instance, battery duration, an attractive screen, goal settings, reminders, or the data registered in the application, among others, could be essential to consider (Casado-Robles et al., 2022). However, as mentioned before, because in real life the criterion-related validity of the Garmin Vivofit Jr 2 would potentially be considerably higher than that of the three studied Samsung applications, those applications might not be comparable with the Garmin Vivofit Jr 2 in practice. On the other hand, as regards the wrist-worn activity trackers, in settings such as Physical Education, where the only economically feasible option is for students to use their own (i.e., already purchased) devices, the findings of the present study showed that the studied wrist-worn activity trackers could not be used interchangeably to monitor and promote daily steps among primary schoolchildren.

Strengths and limitations
An important strength of the present study is that it is, to our knowledge, the first to examine the criterion-related validity of the steps estimated in primary schoolchildren under controlled conditions by recent models of wrist-worn activity trackers specifically designed for children (i.e., Fitbit Ace 2 and Garmin Vivofit Jr 2). Moreover, as far as we know, it is also the first study to examine the criterion-related validity of the steps estimated by PA smartphone applications in primary schoolchildren under controlled conditions. Furthermore, to the best of our knowledge, the present study is the first to examine the comparability of step counts estimated by smartphone PA applications and wrist-worn activity trackers among primary schoolchildren. This is another important issue for feasibility reasons because in contexts such as Physical Education or large-scale research studies, it is common to use different consumer-wearable activity trackers for different participants (Brodie et al., 2018; Creaser et al., 2022). Finally, an important consideration during the validation of consumer-wearable activity trackers is the validity of the criterion measure to which they are being compared (Johnston et al., 2021). If the criterion measure is not sufficiently valid, then criterion standard bias may be present (Umemneku Chikere et al., 2019), limiting the value that may be gained from such a study. Thus, because video-based step counting with at least two observers is considered the gold standard (i.e., an error-free reference standard) (Johnston et al., 2021), the criterion measure used in the present study represents another strength. Therefore, the present study addresses important gaps in the scientific literature to date.
However, the present study has some limitations. Firstly, a non-probability and relatively small sample was used, which limits the generalizability of the obtained outcomes to the particular studied setting (i.e., primary schoolchildren with similar characteristics). However, due to human, time and material resource restrictions, a larger probability sample could not be examined. Secondly, the best-practice protocol for the validation of steps estimated by consumer-wearable activity trackers should be conducted under controlled, semi-free-living, and free-living conditions (Johnston et al., 2021). Although the controlled testing condition represents the first stage in the multistage protocols for the best-practice validation of steps estimated by consumer-wearable activity trackers, focusing only on controlled conditions may fail to establish the ecological validity of these devices under free-living conditions (Johnston et al., 2021). However, due to human, time and material resource restrictions, the criterion-related validity under semi-free-living, and especially free-living, conditions could not be examined. Although previous studies have shown that it is feasible to use a body-worn camera as a gold standard under free-living conditions (Kelly et al., 2015), from the researcher's perspective, feasibility is compromised by the processing time that video recording requires (e.g., 600-900 minutes of video to be analyzed by two independent observers for each participant and day) (Johnston et al., 2021).
Consequently, future studies should examine the criterion-related validity (i.e., with a gold standard) of step scores in primary schoolchildren under semi-free-living and free-living conditions. Moreover, since other PA outputs such as heart rate, distance, or energy expenditure are commonly reported by consumer-wearable activity trackers, future studies should also examine the criterion-related validity (i.e., with a gold standard) of those PA outputs in primary schoolchildren.

Conclusion
The PA applications Pedometer, Pedometer Pacer Health, and Google Fit for Android (using the Samsung Galaxy S20+ placed in simulated front trouser pockets) and the wrist-worn activity tracker Garmin Vivofit Jr 2 (on the non-dominant wrist) showed good-excellent criterion-related validity for estimating steps in primary schoolchildren under controlled conditions. Moreover, those devices were comparable for estimating steps in primary schoolchildren under controlled conditions. However, the Apple applications and the wrist-worn activity trackers Fitbit Ace 2 and Xiaomi Mi Band 5 did not show acceptable criterion-related validity for estimating steps, nor were they comparable, in some walking/running conditions. Nevertheless, because in real life primary schoolchildren also place their smartphones in other places (e.g., schoolbags or hands), including somewhere away from the body, the criterion-related validity of the Samsung applications could be considerably lower. In contrast, because wrist-worn activity trackers are usually worn on the wrist, as examined in the present study, in real life the validity of the Garmin Vivofit Jr 2 would potentially be considerably higher than that of the Samsung applications. The findings of the present study highlight the potential of the Garmin Vivofit Jr 2 for monitoring primary schoolchildren's steps under controlled conditions. Future studies should examine the criterion-related validity of step scores estimated by PA applications and wrist-worn activity trackers in primary schoolchildren under free-living conditions.

Figure 1. Flow diagram of the participants throughout the study.

Figure 2. Limits of agreement plots of the consumer-wearable activity trackers for estimating steps during controlled conditions (slow pace walking condition). The middle-dashed line indicates the mean difference (systematic bias) between step scores assessed by the consumer-wearable activity trackers and the video-based count (gold standard) and the upper and lower dashed lines indicate the limits of agreement (95% confidence interval).

Figure 3. Limits of agreement plots of the consumer-wearable activity trackers for estimating steps during controlled conditions (normal pace walking condition). The middle-dashed line indicates the mean difference (systematic bias) between step scores assessed by the consumer-wearable activity trackers and the video-based count (gold standard) and the upper and lower dashed lines indicate the limits of agreement (95% confidence interval).

Figure 4. Limits of agreement plots of the consumer-wearable activity trackers for estimating steps during controlled conditions (brisk pace walking condition). The middle-dashed line indicates the mean difference (systematic bias) between step scores assessed by the consumer-wearable activity trackers and the video-based count (gold standard) and the upper and lower dashed lines indicate the limits of agreement (95% confidence interval).

Figure 5. Limits of agreement plots of the consumer-wearable activity trackers for estimating steps during controlled conditions (running condition). The middle-dashed line indicates the mean difference (systematic bias) between step scores assessed by the consumer-wearable activity trackers and the video-based count (gold standard) and the upper and lower dashed lines indicate the limits of agreement (95% confidence interval).
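The quantities drawn in limits of agreement plots such as Figures 2-5 can be sketched as follows, assuming the conventional Bland-Altman definition (mean difference ± 1.96 standard deviations of the differences). The function name and the step values are hypothetical and are not data from the present study.

```python
from statistics import mean, stdev


def limits_of_agreement(device, criterion):
    """Return (bias, lower limit, upper limit) of the device-criterion differences.

    Assumes the conventional definition: bias +/- 1.96 * SD of the differences.
    """
    diffs = [d - c for d, c in zip(device, criterion)]
    bias = mean(diffs)   # systematic bias (middle line of the plot)
    sd = stdev(diffs)    # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical step counts for four children (not study data).
video_steps = [100, 100, 100, 100]
tracker_steps = [102, 98, 101, 99]

bias, lower, upper = limits_of_agreement(tracker_steps, video_steps)
print(f"bias = {bias:.2f}, limits of agreement = [{lower:.2f}, {upper:.2f}]")
```

A positive bias would indicate that the device overestimates steps relative to the video-based count, and a negative bias that it underestimates them.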

Table 1. General characteristics of the participants.
Data are reported as mean (standard deviation) a or percentage b. PA = Physical activity.

Table 3. Pearson's correlation coefficient (r) between the absolute differences and the individual means (n = 56).
a Apple applications refers to the three applications activated on the iPhone smartphone (i.e., Pedometer, Pacer, and Apple Health), because all of them reported exactly the same step scores. * p < 0.05, ‡ p < 0.01, and † p < 0.001.

Table 4. Comparability of the consumer-wearable activity trackers for estimating steps during controlled conditions (n = 56).
LOA = Limits of Agreement; 90/95% CI = 90/95% Confidence Interval; MAE = Mean Absolute Error; MAPE = Mean Absolute Percentage Error; ICC = Intraclass Correlation Coefficient. a Apple applications refers to the three applications activated on the iPhone smartphone (i.e., Pedometer, Pacer, and Apple Health), because all of them reported exactly the same step scores.
