CTS Instruments Tools for Faculty and Student Assessment

A number of critical thinking skills inventories and measures have been developed, including the Watson-Glaser Critical Thinking Appraisal (WGCTA), the Cornell Critical Thinking Test (CCTT), the California Critical Thinking Disposition Inventory (CCTDI), the California Critical Thinking Skills Test (CCTST), the Health Science Reasoning Test (HSRT), the Professional Judgment Rating Form (PJRF), the Teaching for Thinking Student Course Evaluation Form, the Holistic Critical Thinking Scoring Rubric, and the Peer Evaluation of Group Presentation Form. Except for the Watson-Glaser Critical Thinking Appraisal and the Cornell Critical Thinking Test, all of these instruments were developed by Facione and Facione. However, all of these measures are of questionable utility for dental educators because their content is general and they lack specific dental education content (see Table 7; see the section on Critical Thinking and Assessment).

Table 7. Purposes of Critical Thinking Skills Instruments


Test Name / Purpose

Watson-Glaser Critical Thinking Appraisal-FS (WGCTA-FS)
Assesses participants’ skills in five subscales: inference, recognition of assumptions, deduction, interpretation, and evaluation of arguments.

Cornell Critical Thinking Test (CCTT)
Measures test takers’ skills in induction, credibility, prediction and experimental planning, fallacies, and deduction.

California Critical Thinking Disposition Inventory (CCTDI)
Assesses test takers’ consistent internal motivations to engage in critical thinking skills.

California Critical Thinking Skills Test (CCTST)
Provides objective measures of participants’ skills in six subscales (analysis, inference, explanation, interpretation, self-regulation, and evaluation) and an overall score for critical thinking.

The Health Science Reasoning Test (HSRT)
Assesses critical thinking skills of health science professionals and students. Measures analysis, evaluation, inference, and inductive and deductive reasoning.

Professional Judgment Rating Form (PJRF)
Measures the extent to which novices approach problems with CTS. Can be used to assess the effectiveness of training programs for individual or group evaluation.

Teaching for Thinking Student Course Evaluation Form
Used by students to rate the perceived critical thinking skills content in secondary and postsecondary classroom experiences.

Holistic Critical Thinking Scoring Rubric
Used by professors and students to rate learning outcomes or presentations on critical thinking skills and dispositions. The rubric can capture the type of target behaviors, qualities, or products that professors are interested in evaluating.

Peer Evaluation of Group Presentation Form
A common set of criteria used by peers and the instructor to evaluate student-led group presentations.


Reliability and Validity

Reliability refers to the consistency of scores: individual scores from an instrument should be the same or nearly the same from one administration of the instrument to another, so that the instrument can be assumed to be free of bias and measurement error.52 Alpha coefficients are often used to report an estimate of internal consistency. Coefficients of .70 or higher indicate that the instrument has high reliability when the stakes are moderate. Coefficients of .80 and higher are appropriate when the stakes are high. Discussions of reliability should report the major sources of error, the size of those errors, and the degree of generalizability of scores across alternate forms, administrations, or other relevant dimensions. It is also important to disclose the variance or standard deviation of measurement errors, whether expressed as one or more coefficients or as item response theory based test information functions. There are three types of reliability estimates: test-retest, parallel forms, and internal consistency. Test-retest reliability assesses the consistency of a measure when it is administered at different times. Parallel forms, or alternate forms, reliability assesses the consistency of tests that are designed in the same way from the same content domain and are administered during independent testing sessions. Internal consistency assesses the relationships across items or subsets of items within a test during a single test administration.
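To make the alpha coefficient concrete, the following minimal Python sketch computes Cronbach's alpha from a small matrix of item scores. The function name and the response data are hypothetical and do not come from any instrument discussed in this section. When every item is scored 0 or 1, the same computation gives the KR-20 coefficient.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Return Cronbach's alpha for rows of item scores (one row per examinee)."""
    k = len(scores[0])  # number of items
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two examinees")
    # Variance of each item (columns) and of the total scores (row sums).
    item_vars = [variance(col) for col in zip(*scores)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: 4 examinees x 3 dichotomous (0/1) items.
responses = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")  # alpha = 0.63
```

With these made-up responses the coefficient falls below the .70 threshold described above, so a scale behaving this way would not be regarded as sufficiently reliable even for moderate-stakes use.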

Each type of reliability estimate has its own strengths and weaknesses. These factors need to be considered when designing a study because of their potential impact on the reliability estimates chosen. Test-retest reliability is often used in studies with a pre- and post-test design with no control group. However, one disadvantage of this design is that reliability is not estimated until after the post-test has been conducted. If the reliability is too low, this result will affect the meaningfulness and usability of the scale. Parallel forms are used when a researcher is administering two similar instruments. However, the administration of two similar instruments for more complex and/or subjective constructs can complicate interpretations. Coefficients based on calculating the relationships between test items and subsets of items are not without limitations. Reliability coefficients are typically useful in comparing tests or measurement procedures. However, these comparisons are not usually straightforward. While a coefficient may show error due to scorer inconsistencies, it may not reflect variation across a succession of examinee performances or products. A coefficient may demonstrate the internal consistency of the instrument but may not reflect measurement errors associated with the examinee’s motivation, efficiency, or health. Thus, when assessing constructs using multiple measures that result in reliability estimates, testing should be conducted within a short period of time in which individuals’ attributes are likely to remain stable.

Reliability estimates are often used in statistical analyses of quasi-experimental designs. A goal of statistical research is to have measures and/or observations that are reliable. Results from varied reliability estimates will affect the statistical analyses. In test development, investigating reliability as fully as is practical is recommended. When a measure is not repeatable and consistent, it is not trustworthy or dependable. Reliability can be improved through a variety of methods. For example, internal consistency in a test can be improved by substituting more reliable items for less reliable ones. Also, increasing the number of reliable items on the test will increase the total reliability of the scale. Reliability can also be improved by standardizing the data collection process because this will reduce random error. Standardization applies not only to the data collection process but also to the raters, forms, and occasions (times). Training the raters to use systematic procedures and identical test instructions can help reduce the errors due to individual differences in how raters make judgments. Using similar test environments also helps standardize the process when parallel forms are used or when test-retest reliability is calculated.
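The effect of test length on reliability can be illustrated with the Spearman-Brown prophecy formula. This formula is not named in the text above; it is a standard result brought in here as an illustration, and it assumes that the added items are comparable in quality to the existing ones. The function names and the starting reliability of .70 are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

def length_factor_needed(reliability, target):
    """Factor by which a test must be lengthened to reach the target reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))

# Hypothetical scale with a current reliability of .70.
doubled = spearman_brown(0.70, 2)           # reliability if the item count is doubled
factor = length_factor_needed(0.70, 0.80)   # lengthening needed to reach .80

print(f"{doubled:.2f}")  # 0.82
print(f"{factor:.2f}")   # 1.71
```

Under these assumptions, a scale at .70 would need roughly 71% more items of similar quality to reach the .80 level suggested for high-stakes use.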

Establishing reliability does not pertain solely to quantitative research. In qualitative research, it can be established through confirmability, triangulation, and extensive time in the field. Confirmability occurs when more than one person analyzes the data. Triangulation refers to gathering and corroborating evidence from different individuals, types of data, and methods of data collection in descriptions and themes, so that the researcher looks at the data from multiple perspectives. Drawing on multiple sources helps ensure accuracy and increases reliability. For example, a researcher who is interested in studying self-regulated learning in middle school mathematics classrooms may interview teachers and students, and may also conduct observations in the classroom. In addition, the researcher may collect textbooks or lesson plans. Data from multiple perspectives can corroborate or refute findings. Using multiple data sources can establish reliability and enhance the accuracy of the study. Extensive time in the field is also used to establish reliability because it ensures repeated opportunities to obtain data and can enhance the consistency among the data.

Validity means that individual scores from a particular instrument are meaningful, make sense, and allow researchers to draw conclusions from the sample to the population that is being studied.52 Researchers often refer to content or face validity. Content validity, or face validity, is the extent to which questions on an instrument are representative of the possible questions that a researcher could ask about that particular content or set of skills.

Watson-Glaser Critical Thinking Appraisal-FS (WGCTA-FS)

The WGCTA-FS is a 40-item inventory that was created to replace Forms A and B of the original test, which participants reported were too long.53 This inventory assesses test takers’ skills in (a) inference: measures the extent to which an individual discriminates among degrees of truth or falsity of inferences drawn from given data; (b) recognition of assumptions: measures whether an individual recognizes whether assumptions are clearly stated; (c) deduction: measures whether an individual decides if certain conclusions follow from the information provided; (d) interpretation: measures whether an individual considers the evidence provided and determines whether generalizations based on the data are warranted; and (e) evaluation of arguments: measures whether an individual distinguishes strong and relevant arguments from weak and irrelevant ones on particular issues. Internal consistency and test-retest reliability for the WGCTA-FS are .81.

Researchers investigated the reliability and validity of the WGCTA-FS for subjects in academic fields. Participants included 586 university students. The findings showed that internal consistencies for the total WGCTA-FS among undergraduate and graduate students majoring in psychology, educational psychology, and special education ranged from .74 to .92. The correlations between course grades and total WGCTA-FS scores for all groups ranged from .24 to .62 and were significant at the p < .05 or p < .01 level.53 The WGCTA-FS was found to be a reliable and valid instrument for measuring critical thinking among this group of participants.53

Cornell Critical Thinking Test (CCTT)

There are two levels of the CCTT, X and Z. Level X is for students in grades 4-14. Level Z is for advanced and gifted high school students, undergraduate and graduate students, and adults.54 Test takers should finish Level Z, a 52-item test, in 50 minutes. Reliability estimates for Level Z range from .49 to .87 across the 42 groups who have been tested for these purposes. Measures of validity were computed in standard conditions, roughly defined as conditions that do not adversely affect test performance.54 Correlations between Level Z and other measures of critical thinking are about .50.54 The CCTT is reportedly as predictive of graduate school grades as the Graduate Record Exam (GRE), a measure of aptitude, and the Miller Analogies Test; these correlations tend to fall between .2 and .4.55

California Critical Thinking Disposition Inventory (CCTDI)

Facione and Facione have reported significant relationships between the CCTDI and the CCTST. When faculty focus on critical thinking in planning curriculum development, modest cross-sectional and longitudinal gains have been demonstrated in students’ CT skills and habits of mind.56 The 75-item CCTDI yields seven subscale scores and an overall score. The subscales and their numbers of items are as follows: analyticity (11 items), self-confidence (9 items), inquisitiveness (10 items), maturity (10 items), open-mindedness (12 items), systematicity (11 items), and truth seeking (12 items).57 The recommended cut-off score for each scale is 40, the suggested target score is 50, and the maximum score is 60. A score below 40 on a specific scale indicates weakness in that CT disposition, and a score above 50 indicates strength in that dispositional aspect. An overall score below 280 shows a serious overall deficiency in dispositions toward CT, while an overall score of 350 or higher, although rare, shows solid across-the-board strength.57

In a study of instructional strategies and their influence on the development of critical thinking among undergraduate nursing students, Tiwari, Lai, So, and Yuen reported the following:

Compared with lecture students, PBL students showed significantly greater improvement in overall CCTDI (p = .0048), Truth seeking  (p= .0008), Analyticity (p=.0368) and Critical Thinking Self-confidence (p =.0342) subscales from the first to the second time points; in overall CCTDI (p = .0083), Truth seeking (p= .0090), and Analyticity (p=.0354) subscales from the second to the third time points; and in Truth seeking (p= .0173) and Systematicity (p= .0440) subscales scores from the first to the fourth time points. 58

California Critical Thinking Skills Test (CCTST)

Studies have shown that the California Critical Thinking Skills Test captures gains in students’ critical thinking over one quarter or one semester.59 Multiple health science programs have demonstrated significant gains in students’ critical thinking using site-specific curricula.59 Studies conducted to control for re-test bias showed no testing effect from pre- to post-test means using two independent groups of CT students. Because behavioral science measures can be affected by social desirability bias (the participant’s desire to answer in ways that would please the researcher), researchers are urged to have participants take the Marlowe-Crowne Social Desirability Scale at the same time when measuring pre- to post-test changes in critical thinking skills. The CCTST is a 34-item instrument.59 This test has been correlated with the CCTDI in a sample of 1557 nursing education students. The correlation was r = .201, and the relationship between the CCTST and the CCTDI is significant at p < .001. Significant relationships between the CCTST and other measures, including the GRE Total, GRE Analytic, GRE Verbal, GRE Quantitative, the WGCTA, and the SAT Math and Verbal, have also been reported. The two forms of the CCTST, A and B, are considered statistically equivalent.59 Depending on the testing context, KR-20 alphas for these forms range from .70 to .75. The newest version is CCTST Form 2000; depending on the testing context, its KR-20 alphas range from .78 to .84.59
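A correlation as small as r = .201 can still be statistically significant when the sample is large. The following Python sketch shows why, using the reported values of r = .201 and n = 1557. It applies the standard t test for a Pearson correlation and, as a simplifying assumption, a normal approximation for the p value, which is reasonable at this sample size. The function names are hypothetical, and this is not necessarily the procedure the original authors used.

```python
import math

def correlation_t(r, n):
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def two_tailed_p_normal(t):
    """Two-tailed p value under a normal approximation (adequate for large n)."""
    return math.erfc(abs(t) / math.sqrt(2))

t = correlation_t(0.201, 1557)   # values reported in the text
p = two_tailed_p_normal(t)

print(f"t = {t:.2f}")            # t = 8.09
print(p < 0.001)                 # True
```

The test statistic is far beyond the cutoff for p < .001, yet r = .201 corresponds to only about 4% shared variance (.201 squared), so the relationship is statistically reliable but modest in size.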

The Health Science Reasoning Test (HSRT)

Items within this inventory cover the domain of the CT cognitive skills identified by a Delphi group of experts whose work resulted in the development of the CCTDI and CCTST. This test measures health science undergraduate and graduate students’ CTS. Test takers are not required to have discipline-specific health sciences content knowledge. For this reason, it may have limited utility since it does not measure critical thinking skills in specific dental education domains.60

Preliminary estimates of internal consistency show that overall KR-20 coefficients range from .77 to .83.60 The instrument has moderate reliability on the analysis and inference subscales, although the factor loadings appear adequate. The lower KR-20 coefficients on these subscales may be the result of small sample size, variance in item response, or both (see Table 8). The HSRT is a 33-item instrument. The items are set in health sciences and clinical practice contexts.60

Table 8. Estimates of Internal Consistency and Factor Loading by Subscale for HSRT

Subscale      KR-20    Factor loading
Deductive     .71      .366-.579
Analysis      .54*
Evaluation    .77

Professional Judgment Rating Form (PJRF)

The scale consists of 20 items, organized as two sets of 10 descriptors. The first set of descriptors relates primarily to the attitudinal (habits of mind) dimension of CT. The second set of descriptors relates primarily to CT skills. The rater responds yes or no to each of the 20 descriptors. The rated individual is given one point for each desirable response provided by the rater. The desirable responses are in some cases “yes” and in other cases “no.” Positive items are 1, 2, 8, 9, and 10. Negative items are 3, 4, 5, 6, and 7.61

A single rater should know the student well enough to respond to at least 17 of the 20 descriptors with confidence. If not, the validity of the ratings may be questionable. If a single rater is used, and ratings over time show some evidence of consistency, then comparisons between those ratings may be used to assess changes. If more than one rater is used, then inter-rater reliability must be established among the raters to yield meaningful results. While the PJRF can be used to assess the effectiveness of training programs for individuals or groups, participants’ actual skills are best measured by an objective tool such as the California Critical Thinking Skills Test. A brief interpretation of PJRF scores is shown below.61

Scores of 17-20 are very strong and suggest that the individual shows consistent internal motivation and mental ability to make professional judgments in the workplace.

Scores of 13-16 are positive and show that the individual demonstrates the ability and habits of mind that reflect professional judgment in the workplace.

Scores of 8-12 are marginal/ambiguous and show that the individual inconsistently demonstrates the ability and habits of mind that reflect professional judgment in the workplace.

Scores of 4-7 are negative and show that the individual lacks the mental ability or motivation for making professional judgments in the workplace.

Scores of 0-3 are very poor and show that the individual consistently lacks the thinking skills and motivation needed to make professional judgments in the workplace.61
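The scoring rule and score bands described above can be sketched in Python as follows. The text gives the positive and negative items for only one set of 10 descriptors, so this sketch assumes, purely for illustration, that both sets follow the same pattern. That assumption, the function names, and the sample ratings are hypothetical and should be checked against the rater's manual.

```python
# Keying stated in the text for one set of 10 descriptors.
POSITIVE_POSITIONS = {1, 2, 8, 9, 10}  # desirable response is "yes"; others are "no"

def build_key():
    """Desirable response for descriptors 1-20.

    ASSUMPTION: both sets of 10 descriptors share the same yes/no pattern.
    """
    key = {}
    for item in range(1, 21):
        position = (item - 1) % 10 + 1  # position within its set of 10
        key[item] = "yes" if position in POSITIVE_POSITIONS else "no"
    return key

def pjrf_score(ratings, key):
    """One point for each rated descriptor whose response is the desirable one."""
    return sum(1 for item, response in ratings.items() if key[item] == response)

def pjrf_band(score):
    """Map a 0-20 PJRF score to the interpretation bands listed above."""
    if not 0 <= score <= 20:
        raise ValueError("PJRF scores range from 0 to 20")
    if score >= 17:
        return "very strong"
    if score >= 13:
        return "positive"
    if score >= 8:
        return "marginal/ambiguous"
    if score >= 4:
        return "negative"
    return "very poor"

key = build_key()
# Hypothetical rater who answers "yes" to every descriptor.
ratings = {item: "yes" for item in range(1, 21)}
score = pjrf_score(ratings, key)
print(score, pjrf_band(score))  # 10 marginal/ambiguous
```

Because the desirable response is "no" for half of the descriptors under this keying, a rater who answers "yes" throughout produces only a marginal score. This is the kind of response pattern that mixed keying is meant to expose.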

Teaching for Thinking Student Course Evaluation Form

Course evaluation forms typically ask for agree-disagree responses to items that focus on teacher behavior. However, these questions typically do not offer information about student learning. Because contemporary thinking about curriculum includes an interest in student learning, this form was developed to address the differences in pedagogy and subject matter, learning outcomes, student demographics, and course level that are characteristic of education today. This form also grew out of a recognition of the limitations of the “one size fits all” approach to teaching evaluations. This form offers information about how a particular course enhances student knowledge, sensitivities, and dispositions (retrieved January 26, 2008 from URL: http://www.insightassessment.com/teaching/html). This form gives students an opportunity to provide feedback that can be used to improve instruction.

Holistic Critical Thinking Scoring Rubric

This assessment tool uses a four-point classification schema that lists particular opposing reasoning skills for select criteria. One advantage of a rubric is that it offers clearly delineated components and scales for evaluating outcomes. This rubric lays out for students how their CT skills will be evaluated, and it provides a consistent framework for both the professor as evaluator and the student.62 Users can add or delete any of the statements to reflect their institution’s effort to measure CT. Like most rubrics, this form is likely to have high face validity since the items tend to be relevant or descriptive of the target concept. This rubric can be used to rate student work or to assess learning outcomes. Experienced evaluators should engage in a process that leads to consensus regarding what kinds of things should be classified and in what ways.62 If the rubric is used improperly or by inexperienced evaluators, seriously unreliable results may occur.
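One common way to check whether two evaluators apply a rubric consistently is Cohen's kappa, which compares their observed agreement with the agreement expected by chance. The text above does not name this statistic; it is introduced here as an illustration only. The function name and the ratings are hypothetical, with each rating a score from 1 to 4 on a four-point rubric.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who scored the same pieces of work."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of work")
    n = len(rater_a)
    # Proportion of pieces on which the raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's score frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1 - expected)

# Hypothetical rubric scores (1-4) from two evaluators on eight student papers.
evaluator_1 = [4, 3, 3, 2, 1, 4, 2, 3]
evaluator_2 = [4, 3, 2, 2, 1, 4, 2, 4]
kappa = cohens_kappa(evaluator_1, evaluator_2)
print(f"kappa = {kappa:.2f}")  # kappa = 0.67
```

In this made-up example the evaluators agree on six of eight papers (75%), but kappa is lower because some of that agreement would be expected by chance alone.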

Peer Evaluation of Group Presentation Form

This form offers a common set of criteria to be used by peers and the instructor to evaluate student-led group presentations regarding concepts, an analysis of arguments/positions, and conclusions.63 Users have an opportunity to rate the degree to which each component was demonstrated. Also, open-ended questions give users an opportunity to cite examples of how concepts, the analysis of arguments/positions, and conclusions were demonstrated (see Table 8).

Table 8. Proposed Universal Criteria for Evaluating Students’ Critical Thinking Skills

Aside from the use of the above-mentioned assessment tools, Dexter et al. recommended that all schools develop a set of universal criteria for evaluating students’ development of critical thinking skills.64 Their rationale for the proposed criteria is that if faculty give feedback using these criteria, graduates will internalize these skills and use them to monitor their own thinking and practice (see Table 4).

52 Creswell JW. Educational research: planning, conducting and evaluating quantitative and qualitative research. Upper Saddle River, NJ: Pearson/Merrill Prentice Hall, 2008.

53 Gazella BM, Hogan L, Masten W, Stacks J, Stephens R, Zascavage V. Reliability and validity of the Watson-Glaser Critical Thinking Appraisal forms for different academic groups. J of Instructional Psych 1999; 33(2): 141-143.

54 Ennis RH, Millman J, Tomko TN. Cornell Critical Thinking Tests Level X and Level Z Manual. 5th ed. Seaside, CA: The Critical Thinking Co., 2005.

55 Linn R. Ability testing: Individual differences, prediction and differential prediction. In KA Wigdor, WR Garner (eds.), Ability Testing: Uses, Consequences, and Controversies (Part II: Documentation section). Washington, DC: National Academy Press, 1982; 355-388.

56 Facione N, Facione P. Critical Thinking Assessment in Nursing Education Programs: An Aggregate Data Analysis. Millbrae, CA: The California Academic Press, 1997.

57 Facione PA, Facione NC, Giancarlo CA. The California Critical Thinking Disposition Inventory Test Manual. Millbrae, CA: The California Academic Press, 1996.

58 Tiwari A, Lai P, So M, Yuen K. A comparison of the effects of problem-based learning and lecturing on the development of students’ critical thinking. Med Educ 2006; 40(6): 547-54.

59 Facione PA, Facione NC, Giancarlo CA. The disposition toward critical thinking: Its character, measurement, and relationship to critical thinking skill. Informal Logic 2000; 20(1): 61-84

60 Facione NC, Facione PA. The Health Sciences Reasoning Test (HSRT) Test Manual. Millbrae, CA: Insight Assessment, 2006.

61 Facione PA, Blohm SW, Facione NC, Giancarlo CAF. PJRF Professional judgment rating form novice/internship level critical thinking abilities and habits of mind rater’s manual. Millbrae, CA: The California Academic Press, 2006.

62 Facione PA, Facione NC. Holistic Critical Thinking Score Rubric. Millbrae, CA: The California Academic Press. 1994.

63 Facione NC. Critical thinking as a reasoned judgment. The Album. Millbrae, CA: Insight Assessment and The California Academic Press, 2002.

64 Dexter P, Applegate M, Backer J, Clayton K, Keffer J, Norton B, Ross B. A proposed framework for teaching and evaluating critical thinking in nursing. J of Prof Nursing 1997; 13(3): 160-167.