Reliability of questionnaire The International Fitness Scale: a systematic review and meta-analysis

ABSTRACT Objective: To perform a systematic literature review and meta-analysis investigating the reliability of The International Fitness Scale questionnaire for assessing overall physical fitness and related components. Methods: PubMed®, BIREME, SciELO, EMBASE, SPORTDiscus, LILACS and Cochrane databases were searched using the following search terms: “The International Fitness Scale”, “International Fitness Scale” and “IFIS”. Article selection and data extraction were performed according to the following eligibility criteria: reliability and/or validity study of the measurement tools of The International Fitness Scale; adoption of The International Fitness Scale as a reference criterion (gold standard); and being an original article. Study quality was assessed using the Quality Appraisal of Reliability Studies. Data analysis used the Kappa coefficient of agreement, the Cochran Q test and the Higgins I² test. Sensitivity analysis was conducted using the withdrawal model. Results: A total of seven articles were included in the analysis. Test-retest reliability coefficients ranged from 0.40 to 0.99, with most studies achieving values ≥0.60, indicative of moderate to substantial reliability. Conclusion: In spite of appropriate test-retest scores attributed to most reliability indicators, heterogeneity among the studies remained high. Therefore, further studies with low risk of bias are needed to support the reliability of the self-reported The International Fitness Scale.


❚ INTRODUCTION
Physical fitness is a predictor of health problems. Satisfactory fitness levels contribute to health problem prevention and functional capacity maintenance and improvement, and limit the development of chronic degenerative dysfunctions, leading to better quality of life. (1) Direct physical fitness measurement methods are considered the gold standard. However, these methods have limitations, such as the need for laboratories, high equipment costs, the need for a specialized team and difficult interpretation of findings. (2,3) Questionnaires are therefore an alternative for epidemiological studies, particularly in developing countries, (4) due to their user-friendly nature, low cost, reliability and reproducibility. (5) Multicenter research investigating adolescent lifestyle in Europe led to the development of the International Fitness Scale (IFIS), a self-reported questionnaire for assessing overall physical fitness and related components (cardiorespiratory fitness, muscle strength, speed/agility and flexibility). (2) This questionnaire was originally validated in the English language for adolescents aged 12 to 17 years, (2) then adapted and translated into nine languages (German, Austrian German, Greek, Flemish, French, Hungarian, Italian, Spanish and Swedish) (2) and validated for use in different populations (male and female children, youngsters and adults). (3,6-8) Results derived from IFIS revealed associations with risk factors for cardiovascular diseases and metabolic syndrome. (3,6,8) The IFIS has been employed in several international research studies. Still, instruments with accurate psychometric properties, capable of reproducing a given outcome consistently across time and space, or across different observers (reliability), are required for studies aimed at estimating physical fitness, identifying associated risk factors, analyzing relations with different outcomes, and assessing the effectiveness of training programs. (9)
Given the significance of physical fitness measurement using reliable, user-friendly instruments, and the growing interest in this field, this study set out to conduct a systematic review and meta-analysis of the available literature, in order to determine whether IFIS is a reliable tool for assessing overall physical fitness and related components.

❚ METHODS
Protocol and registration
This systematic review was conducted in compliance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) recommendations. The review protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO), under no. CRD42018117472.

Search strategy
The literature search included articles published up to September 2019 and listed in the following databases: MEDLINE via PubMed®, BIREME, Scientific Electronic Library Online (SciELO), EMBASE, SPORTDiscus, LILACS and Cochrane Central, regardless of type of study, population, language, participant age and sex, and publication date. Studies were searched using the following descriptors: "Physical Fitness" and "Self-report" (controlled) and "The International Fitness Scale"; "International Fitness Scale"; "IFIS" (non-controlled). Terms were combined using the Boolean operator (OR). The [TIAB] field code was used to limit retrieval to articles containing the selected terms in the title and abstract (Table 1).
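The way the free-text terms are combined can be illustrated with a minimal sketch. It assumes a PubMed-style syntax in which each quoted term carries the [TIAB] field code and terms are joined with OR. The helper `build_query` is hypothetical, and the exact per-database strings are those of Table 1, which are not reproduced here.

```python
def build_query(terms, field_tag="[TIAB]"):
    """Quote each term, append the field code and join everything with OR.

    Hypothetical helper for illustration only; the strings actually used
    in each database are reported in Table 1 of the article.
    """
    tagged = ['"{}"{}'.format(term, field_tag) for term in terms]
    return " OR ".join(tagged)


# Non-controlled (free-text) terms quoted in the search strategy.
free_text_terms = [
    "The International Fitness Scale",
    "International Fitness Scale",
    "IFIS",
]

query = build_query(free_text_terms)
print(query)
```

Databases that do not support field codes of this kind would need the tag dropped or replaced, which is why it is left as a parameter.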

Study selection
An assessment form developed based on inclusion and exclusion criteria and calibrated prior to screening was used for study selection. Inclusion criteria were as follows: studies addressing reliability and/or validity of the IFIS measurement instrument; original research articles involving human beings; publication in journals indexed in the selected databases. Review articles were excluded. The Mendeley Reference Manager Software (https://www.mendeley.com/) was used to ensure independent selection and assessment across reviewers.
einstein (São Paulo). 2020;18:1-9
Duplicate studies were excluded. Two blinded, independent reviewers selected studies in two steps: title and abstract screening and full text reading. In the first step, titles and abstracts were examined according to predefined eligibility criteria for identification of relevant studies. Studies selected by at least one reviewer were included in the subsequent step. These were then read in full and examined by reviewers based on eligibility criteria, using an evaluation form.
Articles selected for full text reading were submitted to cross-reference search for identification of relevant studies that might not have come up in electronic search.
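The first screening step keeps any record selected by at least one reviewer, which amounts to a set union of the two reviewers' selections. A minimal sketch with made-up record identifiers (the names `reviewer_a`, `reviewer_b` and `advance_to_full_text` are hypothetical):

```python
def advance_to_full_text(selected_by_a, selected_by_b):
    """Return records that move to full-text reading.

    A record advances if at least one reviewer selected it (set union).
    Records selected by only one reviewer are also returned separately,
    since they are the disagreements at this step.
    """
    advancing = selected_by_a | selected_by_b
    disagreements = selected_by_a ^ selected_by_b
    return advancing, disagreements


# Made-up identifiers for illustration; these are not the review's records.
reviewer_a = {"rec01", "rec02", "rec05"}
reviewer_b = {"rec02", "rec03"}

advancing, disagreements = advance_to_full_text(reviewer_a, reviewer_b)
print(sorted(advancing))
print(sorted(disagreements))
```

Using the union at the title/abstract step is the more inclusive choice: a record is only dropped when both reviewers agree it is irrelevant.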

Data extraction
Data extraction was performed according to the Cochrane Handbook for Systematic Reviews of Interventions. (10) Data extracted from studies satisfying eligibility criteria were entered into an electronic Excel spreadsheet (Microsoft Excel software; Microsoft Corporation, WA, USA). The following pieces of data were extracted: first author, title and year of publication; type of study; descriptive data (overall sample size, sample size per sex, age group, country where the study was conducted, and sampling procedures); and reliability data (Kappa values and 95% confidence intervals, 95%CI).
Two independent raters extracted descriptive and outcome data from selected articles. The GRADE System was used to examine overall quality of evidence. (11) Unresolved discrepancies between raters were examined by a third rater. Prior to data extraction, raters received training in calibration to ensure interrater consistency and data extraction spreadsheet refinement.

Methodological quality assessment: risk of bias
Methodological quality of selected studies was assessed using the Quality Appraisal of Reliability Studies (QAREL). This instrument includes 11 items in the following domains: items 1 and 2, sampling bias, participants and rater representativeness; items 3 to 7, blinding of raters; item 8, variations in order of examination; item 9, appropriate time intervals between repeated measures; item 10, correct test application and interpretation; item 11, appropriate statistical analysis. Items may be answered with "yes", "no", "unclear" or "not applicable" (items 3, 4, 5, 6 and 8); "yes" and "no" suggest good and poor study quality, respectively. (12) Inconsistencies in this study were discussed among authors and a final decision reached by consensus, according to Cochrane Handbook for Systematic Reviews recommendations. (10) In the absence of consensus, a third author was consulted, reasons for article exclusion examined, and a decision made.

Data analysis
Reliability was tested using the Kappa coefficient of agreement; sample size was used for grouped Kappa calculation. The random effects model was chosen over the fixed effects model due to varying levels of physical fitness among individuals, which may have reflected the impacts of physical activity during childhood and adolescence on adult life. (13) Kappa coefficients of agreement were interpreted as follows: none, <0.00; slight, 0.00 to 0.20; fair, 0.21 to 0.40; moderate, 0.41 to 0.60; substantial, 0.61 to 0.80; almost perfect, 0.81 to 1.00. (14) Statistical heterogeneity was investigated using the Cochran Q test (level of significance, p<0.10). Statistical inconsistency was investigated using the Higgins I² test, (15) as follows: ≤40%, low heterogeneity; 30% to 60%, moderate heterogeneity; >50% to 90%, substantial heterogeneity; and >75% to 100%, considerable heterogeneity. (10) Whenever I² >50% and tau squared (τ²) >1, with a statistically significant Cochran Q test (p<0.10), heterogeneity was rated significant and its reasons were investigated. Statistical analyses were performed using the R package meta (R version 3.5.1).
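The heterogeneity statistics named above can be sketched as follows. This is an illustrative inverse-variance calculation with the DerSimonian-Laird estimator of τ², which is an assumption: the article does not state which estimator or weighting options were used in the R package meta. The input values are made up and are not data from the included studies.

```python
import math


def random_effects_dl(estimates, variances):
    """Cochran Q, Higgins I2 (%), DerSimonian-Laird tau2 and pooled estimate.

    estimates: per-study effect estimates (e.g. Kappa coefficients)
    variances: per-study sampling variances (squared standard errors)
    """
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, estimates)) / sum_w

    # Cochran Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1

    # Higgins I2: share of total variation attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # DerSimonian-Laird between-study variance, truncated at zero.
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau2 to each study's variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / sum(re_weights)
    se = math.sqrt(1.0 / sum(re_weights))

    return {"Q": q, "df": df, "I2": i2, "tau2": tau2, "pooled": pooled, "se": se}


# Made-up Kappa values and standard errors, for illustration only.
example_kappas = [0.50, 0.70]
example_variances = [0.10 ** 2, 0.10 ** 2]
result = random_effects_dl(example_kappas, example_variances)
print(result)
```

The p-value of the Q test comes from a chi-square distribution with `df` degrees of freedom; it is omitted here to keep the sketch free of third-party dependencies.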

Sensitivity analysis
Subgroup analysis was conducted to explain study heterogeneity. Effects were divided by study population and sampling bias, then meta-regression calculation performed.

❚ RESULTS
A total of 1,999 articles were found in the selected databases. Of these, 871 (duplicates) were excluded. Title/abstract screening and full text reading included 1,128 and 23 articles respectively, with 99.2% agreement between raters. Seven of these articles satisfied eligibility criteria and were included in the quantitative narrative analysis of this meta-analysis (Figure 1).
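The selection counts reported above can be checked with simple arithmetic. In the sketch below, the numbers excluded at each step are derived by subtraction and are not stated in the text; the variable names are illustrative.

```python
# Counts reported in the text.
identified = 1999
duplicates = 871
full_text_read = 23
included = 7

# Records left for title/abstract screening after removing duplicates.
screened = identified - duplicates

# Derived by subtraction (not reported explicitly in the text).
excluded_at_screening = screened - full_text_read
excluded_at_full_text = full_text_read - included

print(screened, excluded_at_screening, excluded_at_full_text)
```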
Studies in this sample reported test-retest reliability estimates based on Kappa agreement coefficients. Time intervals between examinations ranged from 1 to 2 weeks, with 2-week intervals used in most studies (2,6,8,16,17) and 1-week intervals limited to two studies. (3,7)

Risk of bias
Inter-rater agreement regarding risk of bias was 94.8% (4 inconsistencies across 77 items examined). Overall, study participants (2,3,6,7,8,16,17) were representative of those to whom the authors intended the results to be applied (QAREL item Q2) and intervals between repeated measurements of the target variable (QAREL item Q9) were reported. As regards primary sources of bias, blinding of raters to findings of other raters or to their own previous findings, to results of the reference standard accepted for the target variable, to clinical information, to additional cues and to order of examination was not reported in any of the studies. In two studies, (2,6) tests were conducted by raters who were representative of those to whom the authors intended the results to be applied. Finally, correct test application and appropriate interpretation, as well as appropriate statistical analysis, were performed in studies in this sample (Table 3).
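The agreement figure above follows from the reported counts, as this short check shows. That 77 items correspond to seven studies times eleven QAREL items is an inference from the numbers given, not a statement in the text.

```python
studies = 7
qarel_items = 11
items_examined = studies * qarel_items  # 77, matching the reported total
inconsistencies = 4

# Percent agreement: share of items on which both raters agreed.
agreement_pct = round(100 * (items_examined - inconsistencies) / items_examined, 1)
print(agreement_pct)
```

Percent agreement does not correct for chance agreement; a chance-corrected statistic would require the raters' item-level answers, which are not reported here.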

Summary of reliability findings
According to Kappa coefficients, overall test-retest reliability ranged from 0.73 to 0.81 (substantial to almost perfect agreement). When all items assessed in selected studies were accounted for, reliability ranged from 0.40 to 0.99 (fair to almost perfect), with more than 50% (26 out of 40 items) achieving values ≥0.60 (moderate to substantial reliability) and 30% (12 out of 40 items) achieving almost perfect reliability as per Landis et al. (14) Kappa coefficients attributed to IFIS domains in selected studies were as follows: overall physical fitness: moderate, substantial and almost perfect agreement in two, four and two articles, respectively; cardiorespiratory fitness: moderate, substantial and almost perfect agreement in three articles, respectively; muscle strength: moderate, substantial, fair and almost perfect agreement in three, two, one and two articles, respectively; speed/agility: moderate, substantial and almost perfect agreement in four, one and three articles, respectively; flexibility: substantial, moderate and almost perfect agreement in three, three and two articles, respectively (Figure 2).
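The agreement labels used in this summary follow the Kappa bands given in the Data analysis section. A minimal sketch of that mapping is shown below; the function names are illustrative, and values falling between two printed bands (for example 0.205) are assigned to the lower band, which is an assumption.

```python
def landis_koch(kappa):
    """Map a Kappa coefficient to the agreement label used in this review."""
    if kappa > 1.0:
        raise ValueError("Kappa cannot exceed 1")
    if kappa < 0.0:
        return "none"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


def share(count, total):
    """Percentage of items in a category."""
    return 100 * count / total


# Extremes reported in the text.
for value in (0.40, 0.73, 0.81, 0.99):
    print(value, landis_koch(value))

# Item shares reported in the text: 26/40 and 12/40.
print(share(26, 40), share(12, 40))
```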

Sensitivity analysis
Lower Kappa coefficients attributed to the adult population compared to other subgroups in all domains suggest moderate agreement in that population (Table 4). Risk of sampling bias across studies may significantly affect agreement in the overall fitness (p<0.001), cardiorespiratory fitness (p<0.001), muscle strength (p=0.022) and flexibility (p<0.001) IFIS domains (Table 5). Studies judged stricter with regard to risk of bias as per QAREL item Q2 had lower Kappa coefficients than the other subgroups. As regards heterogeneity, meta-regression revealed that the two subgroup variables (population and risk of bias as per QAREL item Q2) together explained 85.99% of overall heterogeneity among studies (Tables 4 and 5). Summarized findings and GRADE quality classifications are presented in Table 6.

❚ DISCUSSION
Global organizations, such as the World Health Organization (WHO) and the American College of Sports Medicine (ACSM), currently recommend regular practice of moderate to vigorous physical activity for 150 minutes per week for overall physical fitness improvement. (18,19) A retrospective cohort study following up on 122,007 patients revealed that cardiorespiratory fitness is inversely associated with long term mortality. (20) In line with the findings of that study, a meta-analysis involving 2,525,827 adults revealed progressive decline in health parameters and increased obesity and related comorbidity rates as cardiorespiratory fitness decreases. (19)
Physical fitness is a health problem predictor and a modifiable indicator. It should therefore be assessed via gold-standard tests, such as cardiorespiratory fitness (ergospirometry), (21) muscle strength (isokinetic test), (22) speed/agility (20/40 m sprint test using photocell systems) (23) and flexibility (inclinometer, goniometer, Leighton flexometer, fleximeter and imaging methods, like radiography and photogrammetry). (24,25) However, application of the aforementioned tests in settings with scarce financial resources, or when specialized personnel are lacking, is not feasible and may preclude large scale studies. (26) Hence the interest in alternative, user-friendly, low-cost tool development by public health organizations and researchers working in developing countries.
This is the first systematic review and meta-analysis investigating IFIS reliability (or consistency over time) based on test-retest, which is a significant aspect of any assessment tool. Tools with low test-retest reliability are not able to detect true score changes over time. (9) Overall, findings of this study revealed that test-retest reliability of IFIS domains determined using Kappa coefficients of agreement is valid for assessing overall physical fitness and related components (cardiorespiratory fitness, muscle strength, speed/agility and flexibility), given the low variability in reliability measures and moderate to substantial scores attributed to most domains.
In this study, steps were controlled via a systematic approach and strict protocol. A comprehensive search with no restrictions regarding study type, population, language, age, sex and date of publication was also conducted. Besides other advantages of questionnaires, IFIS has significant clinical applicability, since findings are associated with directly measured cardiorespiratory fitness and risk factors for cardiovascular disease, such as adiposity and metabolic syndrome, in different populations. (3,6,8) Physical fitness assessment is also a critical indicator for ideal, personalized prescription of physical exercise. (7) In spite of acceptable Kappa coefficient values, results of this meta-analysis involve potential risk of bias and overestimation. Heterogeneity was in part attributed to test-retest reliability dispersion across different populations. Some authors reported high test-retest reliability among measures in children, whereas others reported medium and low values in adolescents and adults, respectively. Low methodological quality (QAREL items Q4-Q7) may also have compromised reliability, as selected studies in this sample failed to satisfy these criteria. (11) Also, the IFIS version used by De Moraes et al. (17) has not been validated for the Brazilian population.
High heterogeneity among items detected in sensitivity analysis indicates that health status, age group, blinding of raters, test-retest time intervals, questionnaire application instructions and understanding by volunteers (3,7) may impact study findings.
Therefore, interpretation and generalization of findings reported here must be done with caution, since this meta-analysis excluded grey literature and the few studies investigating IFIS reliability were of low methodological quality and involved high statistical heterogeneity according to grouped Kappa coefficients.
Finally, the fact that IFIS is available in nine languages must be emphasized. Should it be applied without previous adaptation and testing in samples with different characteristics from those accounted for in instrument construction and testing, cultural bias may occur. In order not to compromise findings of future Brazilian studies, application of the Portuguese version of IFIS and reference to Guidelines for Reporting Reliability and Agreement Studies (GRRAS) (27) and QAREL checklist (12) are recommended.

❚ CONCLUSION
The body of studies in this meta-analysis revealed high heterogeneity, in spite of almost perfect agreement in 30% of items and appropriate item test-retest scores in most cases, which suggests moderate to substantial reliability according to Kappa coefficients.
Hence, further studies with low risk of bias and investigating instrument reliability and health status