Evaluating mobile medical applications

The past 5 years have seen a proliferation of both mobile health apps and proposed tools to rate such apps. While these digital health tools hold great potential, concerns around privacy, efficacy and credibility, coupled with a lack of strict oversight by governing bodies, have highlighted the need for frameworks that can guide clinicians and consumers towards informed app choices. Although the US Food and Drug Administration has recognised the issue and is piloting a precertification programme that would prioritise app safety at the developer level,1 this model remains in its pilot stages, and there is still no international consensus on standards for health apps, resulting in a profusion of proposed frameworks across governments, academic institutions and commercial interests.

In 2018, our team drew on existing rating schemes to identify their salient categories and create a new framework.2 The American Psychiatric Association’s (APA) App Evaluation Model was developed by harmonising questions from 45 evaluation frameworks and selecting 38 questions that mapped to five categories: background information, privacy and security, clinical foundation, ease of use and interoperability. The APA model has subsequently been used by many diverse stakeholders, given its flexibility in guiding informed decision-making.3–7 However, that flexibility also created demand for a more applied approach that offered users concrete information rather than placing the onus entirely on the clinician or provider.

Thus, since the framework’s development, the initial 38 questions have been operationalised into 105 new objective questions that invite a binary (yes/no) or numeric response from a rater.8 These questions align with the categories proposed by the APA model but are more extensive and objective, with, for example, ‘app engagement’ operationalised into 11 different engagement styles to select from. The 105 questions are sorted into six categories (App Origin and Functionality, Inputs and Outputs, Privacy and Security, Clinical Foundation, Features and Engagement, Interoperability and Data Sharing), are intended to be answerable by any trained rater (clinician, peer or end user) and inform the public-facing Mhealth Index and Navigation Database (MIND), where users can view app attributes and compare ratings (see figure 1 below). MIND thus constitutes a new framework based on the APA model, with an accompanying public-facing database.

Figure 1: A screenshot of MIND highlighting several of the app evaluation questions (green boxes) and the ability to access more. MIND, Mhealth Index and Navigation Database.

Recent systematic reviews have illustrated the growing number of evaluation tools for digital health devices, including mobile health apps.9–11 Given the rapidly evolving health app space and the need to understand what aspects evaluation frameworks consider, we have sought to survey the landscape of existing frameworks. Our goal was to compare the categories and questions composing other frameworks to (1) identify common elements between them, (2) determine whether the gaps in evaluation frameworks identified in 2018 have since been addressed and (3) assess how reflective our team’s MIND framework is of the current landscape. We, thus, aimed to map every question from the 2018 review, as well as questions from new app evaluation frameworks that have emerged since, using the questions of MIND as a reference. While informing our own efforts around MIND, the results of this review offer broad relevance across digital health, as understanding the current state of app evaluation helps inform how any new app may be assessed, categorised, judged and adopted.

Methods

Patient and public involvement

Like the APA model, MIND shifts the app evaluation process away from finding one ‘best’ app and instead guides users towards an informed decision based on selecting, and placing value on, the clinically relevant criteria that account for the needs and preferences of each patient and case. The questions were created with input from clinicians, patients, family members, researchers and policy-makers. The goal is not for a patient or clinician to consider all 105 questions but rather to access the subset of questions that appears most appropriate for the use case at hand. Thus, thanks to its composition of discrete questions that aim to be objective and reproducible, MIND offers a useful tool for comparing evaluation frameworks. It also offers an actionable resource for any user anywhere in the world to engage with app evaluation, providing tangible results in the often more theoretical world of app evaluation.

Design

We followed a three-step process to identify and compare frameworks against MIND. This process included (1) assembling all existing frameworks for mobile medical applications, (2) separating each framework into the discrete evaluation questions that compose it and (3) mapping all questions to the 105 MIND framework questions as a reference.
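As an illustrative sketch only (not part of the published protocol), these three steps can be pictured as a simple data pipeline. The `Framework` class, the `map_to_mind` stub and the question strings below are hypothetical names of our own; in the study itself, the mapping judgement was made by human raters rather than code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Framework:
    """One evaluation framework, decomposed into its discrete questions (step 2)."""
    name: str
    questions: list[str] = field(default_factory=list)


# Hypothetical stand-in for the 105 objective MIND reference questions.
MIND_QUESTIONS = [
    "Is there a privacy policy?",
    "Can you email or export your data?",
    # ... 103 more
]


def map_to_mind(question: str) -> Optional[str]:
    """Return the MIND question this framework question corresponds to, or None.

    In the study this judgement was made by human raters; this stub only marks
    where that step sits in the pipeline.
    """
    return question if question in MIND_QUESTIONS else None


def run_pipeline(frameworks: list[Framework]) -> dict[str, dict[str, Optional[str]]]:
    """Step 3: map every question of every assembled framework to the MIND reference."""
    return {fw.name: {q: map_to_mind(q) for q in fw.questions} for fw in frameworks}
```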

Search strategy and selection criteria

We started with the 45 frameworks identified in the 2018 review by Moshi et al9 and included 34 frameworks that have emerged since our initial analysis of the space, which was conducted in 2018 and published in 2019.2 To accomplish this, we conducted an adapted scoping review based on the Moshi criteria to identify recent frameworks. Although MIND focuses on mental health apps, its considerations and categories are transferable to health apps more broadly, and, thus, there was no mental health restriction in the search terms.

References were identified through searches of PubMed, EMBASE and PsycINFO with the search terms ((mobile application) OR (smartphone app)) AND ((framework) OR (criteria) OR (rating)) and a publication date between January 2018 and October 2020. We also identified records beyond the database search by seeking frameworks mentioned in subsequent and recent reviews5 12 13 and by surveying the grey literature and government websites. Papers were selected for inclusion if they met the predetermined eligibility criteria: presenting an evaluation framework for mobile health apps with patient-facing, clinician-facing or end user-facing questions. Two reviewers (SL and JT) screened the literature separately and applied the inclusion criteria. The data extracted from the papers included: author and date of publication, source affiliation, country of origin, name of framework, study design, description of framework, intended audience/user and framework scoring system. Articles were excluded if they described the evaluation of a single app, did not present a new framework (instead reviewing the space or relying on a previous framework), focused on developers rather than clinicians or end users, focused on implementation rather than evaluation, did not concern health apps or presented a satisfaction survey rather than an evaluation framework. The data selection process is outlined in figure 2.
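For the PubMed portion of the search, the query string and date window above could in principle be scripted against the public NCBI E-utilities endpoint. The sketch below is an illustration of that possibility, not a description of how the reviewers actually ran the searches.

```python
import requests

# Search terms as reported above, restricted to the review's publication window.
QUERY = "((mobile application) OR (smartphone app)) AND ((framework) OR (criteria) OR (rating))"


def search_pubmed(query: str, retmax: int = 500) -> list[str]:
    """Return PubMed IDs matching the query via the NCBI E-utilities esearch endpoint."""
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params={
            "db": "pubmed",
            "term": query,
            "datetype": "pdat",      # filter on publication date
            "mindate": "2018/01/01",
            "maxdate": "2020/10/31",
            "retmax": retmax,
            "retmode": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]


# EMBASE and PsycINFO records, grey literature and government websites would
# still need to be gathered separately before the two-reviewer screening step.
if __name__ == "__main__":
    print(len(search_pubmed(QUERY)), "PubMed records retrieved")
```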

Figure 2: Framework identification through database searches (PubMed, EMBASE, PsycINFO) and other sources (reviews since 2018, grey literature, government websites).

The 34 frameworks identified in the search were combined with the 45 frameworks from the 2018 review for a total of 79 frameworks for consideration. To our knowledge, this list comprehensively reflects the state of the field at the time of assembly. However, we do not claim it to be exhaustive, as frameworks are constantly changing, emerging and sunsetting, with no central repository. The final list of frameworks assembled can be found in online supplemental appendix 1.


Mapping

Each resulting framework was reviewed and decomposed into a complete list of its unique questions. The 79 frameworks yielded 1701 questions in total. Several of the original 45 frameworks focused exclusively on in-depth privacy considerations (evaluating privacy and security practices rather than the app itself),14 and after eliminating these checklists, which did not facilitate app evaluation by a clinician or end user, 70 frameworks were mapped in their entirety to the MIND framework.

In mapping questions, discussion was sometimes necessary, as not every question was an exact, word-for-word match. The authors, thus, used discretion when matching questions to MIND and discussed each decision to confirm mapping placement. Two raters (SL, LS) agreed on mapping placement, and disputes were brought to a third reviewer (JT) for final consideration. For example, ‘Is data portable and interoperable?’15 was mapped to the question ‘can you email or export your data?’, ‘Connectivity’16 was mapped to ‘Does the app work offline?’ and ‘Is the arrangement and size of buttons/content on the screen zoomable if needed’17 was mapped to ‘is there at least one accessibility feature?’ Questions about suitability for the ‘target audience’ were mapped to the ‘patient-facing’ question in MIND.
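To make the rater workflow concrete, the sketch below encodes the example mappings quoted above as a lookup table and the dispute rule as a small function. The dictionary entries and the phrasing of the ‘patient-facing’ MIND item are illustrative paraphrases, not verbatim rows from the study’s mapping spreadsheet.

```python
from typing import Optional

# Hand-curated example mappings quoted in the text (framework question -> MIND question).
# The 'patient-facing' wording on the right-hand side is a paraphrase.
EXAMPLE_MAPPINGS = {
    "Is data portable and interoperable?": "Can you email or export your data?",
    "Connectivity": "Does the app work offline?",
    "Is the arrangement and size of buttons/content on the screen zoomable if needed":
        "Is there at least one accessibility feature?",
    "Is the app suitable for the target audience?": "Is the app patient-facing?",
}


def resolve(rater1: Optional[str], rater2: Optional[str],
            third_reviewer: Optional[str]) -> Optional[str]:
    """Two raters (SL, LS) must agree on a mapping; disputes go to a third reviewer (JT)."""
    return rater1 if rater1 == rater2 else third_reviewer


# Example: both raters map 'Connectivity' to the same MIND question, so no adjudication is needed.
print(resolve(EXAMPLE_MAPPINGS["Connectivity"], "Does the app work offline?", third_reviewer=None))
```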

Results

Framework type

The aim of this review was to identify and compare mobile health app rating frameworks, assessing their overlap and exploring changes and gaps relative to both previous reviews and the MIND framework. Of the 70 frameworks ultimately assessed and mapped, the majority, 39 (55.7%), offered models for evaluating mobile health apps broadly. Seven (10%) considered mental health apps, while six (8.5%) focused on apps for diabetes management. Other frameworks focused on apps for asthma, autism, concussion, COVID-19, dermatology, eating disorders, heart failure, HIV, pain management, infertility and sickle cell disease (table 1).

Table 1: Number of disease-specific and general app evaluation frameworks, with general mobile health frameworks constituting more than half of the identified frameworks.

Mapping

We mapped questions from 70 app evaluation frameworks against the six categories and 105 questions of MIND (see online supplemental appendix 2). We examined the number of frameworks that addressed each specific MIND category and identified areas of evaluation that are not addressed by MIND. Through the mapping process, we were able to gauge the most common questions and categories across different app evaluation frameworks.


We sorted the questions into MIND’s six categories (App Origin & Functionality, Inputs & Outputs, Privacy & Security, Evidence & Clinical Foundation, Features & Engagement Style and Interoperability & Data Sharing) in order to assess the most common broad areas of consideration. Across frameworks, the most common considerations were around privacy/security and clinical foundation: 43 frameworks posed at least one question about the app’s privacy protections and 57 contained at least one question to evaluate evidence base or clinical foundation, as denoted in table 2. Fifty-nine frameworks covered at least two of the MIND categories, and the majority of frameworks overlapped with at least four.

Table 2: Questions from all frameworks mapped to the reference framework (MIND) and sorted into its six categories, showing how many frameworks had questions that could be sorted into each category.

We then took a more granular look at the questions from each of the 70 frameworks, matching questions one by one to questions of the MIND framework where possible. At the level of individual questions, those about the presence of a privacy policy, security measures in place, supporting studies and patient-facing (or target population) tools were the most prevalent, appearing in 20, 25, 27 and 28 frameworks, respectively. Each of the 70 frameworks had at least one question that mapped to MIND. The most common questions, sorted into their respective categories, are depicted in figure 3 and table 3, while the full list of mapped questions can be found in online supplemental appendix 2.
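The coverage counts reported in this section amount to simple tallies over the completed mapping. The sketch below shows one way such tallies could be computed; the framework names, category labels and question strings are toy placeholders rather than rows from the study’s appendix.

```python
from collections import Counter

# Toy mapping output: framework -> set of (MIND category, MIND question) pairs it covered.
mapped = {
    "Framework A": {("Privacy & Security", "Is there a privacy policy?"),
                    ("Clinical Foundation", "Is there evidence of a supporting study?")},
    "Framework B": {("Clinical Foundation", "Is there evidence of a supporting study?"),
                    ("Features & Engagement", "Is the app patient-facing?")},
}

# How many frameworks touch each MIND category at least once (cf. table 2).
category_coverage = Counter(
    category
    for pairs in mapped.values()
    for category in {cat for cat, _ in pairs}
)

# How many frameworks pose each individual MIND question (cf. figure 3 and table 3).
question_coverage = Counter(
    question for pairs in mapped.values() for _, question in pairs
)

print(category_coverage.most_common())
print(question_coverage.most_common())
```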

Figure 3: The most commonly addressed questions, grouped within the categories of MIND. The blue triangle constitutes MIND and its six main categories, while the green trapezoid represents questions pertaining to usability or ease of use, which are not covered by MIND. MIND, Mhealth Index and Navigation Database.

Table 3: Commonly addressed questions among those that could be mapped to the MIND reference framework (blue) and those that could not (green).

Every question was examined, but not every question in every framework could be matched to a corresponding question in MIND, and some questions fell outside its six categories. For example, 18 frameworks continue to pose the subjective question ‘is the app easy to use?’, the answer to which will vary with the person and use case. MIND also omits objective questions for which answers are not readily available, such as ‘How were target users involved in the initial design and usability evaluations of the app?’18 While questions such as this are of high importance, the lack of easily accessible answers limits their present utility for app evaluation. Similarly, some questions, such as those around economic analysis, were covered by other frameworks but not by MIND and present the same dilemma in that the data needed to answer them are often lacking. Aside from subjective questions, the other pronounced absences in MIND were questions about customisability (addressed by seven other frameworks) and advertising (nine frameworks). Although MIND addresses customisability in part by encouraging raters to consider accessibility features (and some frameworks ask about the ability to customise in conjunction with accessibility features19), it neither poses a question about the user’s ability to tailor or customise app content nor asks about the presence of advertisements in an app. Other questions unaddressed by MIND concern the user’s ability to contact the producer or developer for guidance about app use; variations of this question include ‘is there a way to feedback user comments to the app developer?’ MIND also does not pose any questions regarding in-app instructions or the existence of a user guide.20 Finally, it does not ask about the speed of app functionality; one variant of this question asks, ‘is the app fast and easy to use in clinical settings?’15 Figure 3 and table 3 present additional details on categories and questions both inside and outside the MIND reference framework.

Discussion

As mobile health apps have proliferated, choosing the right one has become increasingly challenging for patients and clinicians alike. While app evaluation frameworks can help sort through the myriad of mobile health apps, the growing number of frameworks further complicates the process of evaluation. Our review examined the largest number of evaluation frameworks to date with the goal of assessing their unique characteristics and gaps, as well as their overlap with the 105 questions in MIND. We identified frameworks for evaluating a wide range of mobile health apps: some focused on mobile health in general, while others addressed specific disease domains such as asthma, heart failure, mental health or pain management.

Despite the different disease conditions they addressed, there was substantial overlap among the frameworks, especially around clinical foundation and privacy and security. The most common category addressed was clinical foundation, with 57 of the evaluation frameworks posing at least one question regarding evidence base. More than half of the frameworks also addressed privacy and/or security and app functionality or origin.

The widespread focus on clinical foundation and privacy represents a major change in the space since 2018, when our team’s initial review of 45 health app evaluation frameworks found that the most common category of consideration among the different frameworks was usability, with short-term usability heavily overrepresented compared with privacy and evidence base. In that 2018 review, 93 unique questions corresponded to short-term usability but only 10 to the presence of a privacy policy. Although many frameworks continue to consider usability, our current review suggests that the most common questions across frameworks now concern evidence, clinical foundation and privacy. This shift may reflect an increased recognition of the privacy dangers some apps may pose.

This review illustrates the challenges in conceiving a comprehensive evaluation model. A continued concern in mobile health apps is engagement,6 and it is unclear whether any framework adequately predicts engagement. Another persistent challenge is striking a balance between transparency/objectivity and subjectivity. Questions that prompt consideration of subjective user experiences may limit the generalisability and standardisation of a framework, as the questions inherently reflect the experience of the rater. An app’s ease of use, for example, will differ significantly depending on an individual’s level of comfort and experience with technology. However, subjective questions around user friendliness, visual appeal and interface design may be of greatest concern to an app user, and most predictive of engagement with an app.21 Finally, a thorough assessment of an app is only feasible if information about the app is available. For example, some questions with clinical significance, such as the consideration of how peers or target users may be involved in app development, are not easily answerable by a health app consumer. Overall, there is a need for more data and transparency when it comes to health apps. App evaluation frameworks, while thorough, rigorous and tailored to clinical app use, can only go so far without transparency on the part of app developers.22

The analysis additionally highlighted the flexibility and comprehensiveness of the MIND framework, used here as the reference framework, across diverse contexts. The MIND categories are inclusive of a wide range of frameworks and questions. Even without including any subjective questions in the mapping process, each of the 70 frameworks that were ultimately mapped had some overlap with MIND, and many of the 1701 questions ultimately included mapped exactly to a MIND question. Although MIND was initially conceptualised as an evaluation tool specifically for mental health apps, the coherence between MIND and diverse types of app evaluation frameworks, such as those for concussion,23 heart disease24 and sickle cell anaemia,25 demonstrates how the MIND categories can encompass many health domains. Condition-specific questions, for example, are a good fit for the ‘Features & Engagement’ category of MIND.

The results of our analysis suggest that while numerous new app evaluation frameworks continue to emerge, a de facto standard of common questions is naturally appearing across them. While different use cases and medical subspecialties will require unique questions to evaluate apps, a set of common questions around aspects like privacy and level of evidence is more universal. MIND appears to cover a large subset of these questions and thus may offer a useful starting point for new efforts, as well as a means to consolidate existing ones. An advantage of the more objective approach offered by MIND is that it can be represented as a research database to facilitate the discovery of apps while not conflicting with local needs, personal preferences or cultural priorities.26

Limitations

Our work is not the first to compare app evaluation frameworks. Recently, several reviews have compared how different mobile health app evaluation models address privacy,11 12 14 and another database (https://search.appcensus.io/) focuses exclusively on compiling privacy assessments of Android apps. We chose to exclude app evaluation frameworks that focused exclusively on in-depth privacy considerations and were unusable by a clinician or layperson, as our goal was more comprehensive app evaluation. This decision does not dismiss considerations of privacy and security, which are of critical importance, but rather narrows the focus to frameworks that are usable by the public today and can inform clinical decisions. In addition, MIND was initially tailored to mental health and thus does not encompass thorough disease-specific criteria for other conditions such as asthma, diabetes and sickle cell anaemia, though such questions may be easily integrated. Finally, subjective questions, especially those around ease of use and visual appeal, are difficult to standardise but may be among the most important features driving user engagement with mental health apps.21

Conclusion

Our work demonstrates the expansion of app evaluation frameworks. By illustrating how MIND overlaps with many of these existing and emerging frameworks, we suggest there is a practical need for consolidation. Although disease-specific mobile health apps require specialised app evaluation questions, concerns around accessibility, privacy, clinical foundation and interoperability are nonspecific. If the full potential of digital health is to be realised, there is a need for increased collaboration among industry, government and academia to ensure that the highest quality digital health tools reach the public. We emphasise that this effort is just a first step and highlight the need for continued interdisciplinary communication among diverse digital health stakeholders in order to best serve the public.