Assess Assessments

This section describes how you can assess your assessments. You will want to evaluate your assessments to ensure that they are accurately testing students’ knowledge or skills. A well-designed assessment is (1) linked to instructional goals and is a (2) valid and (3) reliable indicator of mastery of knowledge or skills. This section describes each of these characteristics and how you can evaluate them, as well as how you can gather feedback from your students on the quality of your assessments.

Alignment with Instructional Goals

Alignment of assessments with instructional goals enables instructors to evaluate the extent to which students have learned the intended curriculum (1). A tool to help you align your instructional goals with your assessments as you develop them is a test blueprint. A test blueprint is a chart that identifies the learning objectives the instructor wants to assess, and the cognitive processes required for students to master each objective. For example, instructional objectives may require students to recall a concept, apply the concept, or create a product that demonstrates their understanding of the concept (see Bloom’s taxonomy for further description). The level of cognition at which you want your students to master the material should inform the types of questions you use in your assessments. Pascale (n.d.) provides 12 sample test blueprints as appendices, and also recommends you share the test blueprint with your students to assist them in studying (2).
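As a rough illustration of the chart described above, a test blueprint can be sketched as a small table of item counts per objective and cognitive level. The objective names and item counts below are hypothetical placeholders, and the three levels (recall, apply, create) are simply the ones named in the example above; a real blueprint would use your own objectives and levels.

```python
# Hypothetical objectives and item counts, for illustration only.
LEVELS = ["recall", "apply", "create"]

blueprint = {
    "Objective A": {"recall": 4, "apply": 3, "create": 0},
    "Objective B": {"recall": 2, "apply": 4, "create": 1},
    "Objective C": {"recall": 1, "apply": 2, "create": 1},
}


def level_totals(blueprint):
    """Total number of items planned at each cognitive level."""
    return {
        level: sum(row.get(level, 0) for row in blueprint.values())
        for level in LEVELS
    }


def format_blueprint(blueprint):
    """Render the blueprint as a plain-text chart with row and column totals."""
    header = f"{'Objective':<14}" + "".join(f"{lvl:>8}" for lvl in LEVELS) + f"{'total':>8}"
    lines = [header]
    for objective, row in blueprint.items():
        counts = [row.get(lvl, 0) for lvl in LEVELS]
        lines.append(
            f"{objective:<14}" + "".join(f"{c:>8}" for c in counts) + f"{sum(counts):>8}"
        )
    totals = level_totals(blueprint)
    lines.append(
        f"{'total':<14}"
        + "".join(f"{totals[lvl]:>8}" for lvl in LEVELS)
        + f"{sum(totals.values()):>8}"
    )
    return "\n".join(lines)


print(format_blueprint(blueprint))
```

Reading the column totals shows at a glance whether the planned items match the level of cognition you intend to assess, for example whether a test meant to measure application is dominated by recall items.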


Validity

One definition of test validity is the “degree with which the inferences based on test scores are meaningful, useful, and appropriate” (3). Huba and Freed, on the other hand, define test validity as the extent to which information gained through testing is useful in guiding learning (1). Gronlund identified some of the most common factors that lower test validity (4). These include:

  • Inadequate sampling of the achievement to be assessed.
  • Test items that do not function as intended due to lack of relevance, ambiguity, clues, bias (e.g., a test which shows systematic differences in the results of people based on group membership such as race or gender), or inappropriate difficulty.
  • Unclear directions or improper arrangement of test items.
  • Improper administration, such as inadequate time allowed or poorly controlled conditions.
  • Subjective scoring, or objective scoring that contains computational errors.

This section provides details about assessing test validity by examining test content, response processes, and consequences of testing (5).

Test content validity refers to the extent to which assessments reflect student knowledge and adequately sample the content being assessed. Developing a test blueprint can help you more adequately sample the content to be assessed and increase the degree to which assessments reflect student mastery of the knowledge or skills being assessed. Without laying out a test blueprint in advance, you may inadvertently introduce error into your measurement. For example, if you are assessing students’ mathematical knowledge and you rely heavily on lengthy word problems, the test may capture reading ability rather than mathematical reasoning. For this reason, if your goal is to measure knowledge or skills other than reading comprehension, consider the readability level of your assessments. There are a variety of tools for checking readability level (read more about how you can check it in Microsoft Word), most of which report the results in terms of grade level.
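To make the idea of a grade-level readability score concrete, here is a minimal sketch of one widely used formula, the Flesch-Kincaid grade level. The source does not say which formula any particular tool uses, so treat this choice as an assumption. The syllable counter is a crude vowel-group heuristic, which means the scores are approximate and best used to compare items with each other.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per run of consecutive vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Approximate Flesch-Kincaid grade level of a passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


# Hypothetical word problems, for illustration only.
simple_item = "Sam has two red pens. He buys one more. How many pens does he have?"
dense_item = (
    "A manufacturing conglomerate anticipates approximately proportional "
    "increases in operational expenditures throughout consecutive quarterly "
    "reporting intervals."
)

print(round(flesch_kincaid_grade(simple_item), 1))
print(round(flesch_kincaid_grade(dense_item), 1))
```

An item whose score sits far above your students' grade level may be testing reading comprehension as much as the intended skill.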

You can also use student response processes to assess test validity. These data provide information about the reasoning that the test-takers use when they respond to a test question, essay prompt, or other type of assessment item. Although this reasoning occurs within their mind, a common method for gathering information about students’ thinking is to use think-aloud procedures. To gather think-aloud data on your assessments, ask a former student, friend, or graduate student to complete your assessment, and sit alongside them as they do. As they complete the assessment, ask them to share what they are thinking. If they are silent for more than 10 seconds, you will want to remind them to share their thoughts. Think-alouds are highly effective for identifying obstacles that may prevent students from demonstrating their knowledge (6). They can help you identify items or prompts that are not clearly written, information that cues the test-taker to the correct response, or alternative ways to interpret or solve problems which may lead to alternate correct answers.

Finally, examine the consequences of assessment, both positive and negative, intended and unintended. For example, a positive and intended outcome of assessment is that students learn more. However, unintended, negative consequences of assessment include students learning only the specific material covered by the tests, focusing on memorizing information, failing to understand it or how it can be applied, or cheating. To gather and examine consequential validity data, Gronlund suggests you ask if the assessment (4):

  • improved motivation?
  • improved performance?
  • improved self-assessment skills?
  • contributed to transfer of learning to related areas?
  • encouraged independent learning?
  • encouraged good study habits?
  • contributed to a positive attitude toward schoolwork?
  • adversely affected students in any of the above areas?

To get answers to these questions, informally interview your students or conduct follow-up surveys with them or with former students.


Reliability

Reliability refers to the consistency or stability of a measure. For example, if I weigh myself in the morning and then again at the end of the day, the two measurements should be very close if the scale is reliable. In assessment, reliability reflects whether taking the test again would result in a different score. One way that you can assess the temporal stability of your classroom assessments is to administer the same assessment twice. Computing a correlation between student scores on the two administrations will allow you to examine the extent to which the scores were related, with a higher correlation reflecting higher reliability.
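The correlation between two administrations can be computed by hand or in a spreadsheet; the sketch below uses the Pearson correlation coefficient, which is one common choice and an assumption here, since the text does not name a specific coefficient. The scores are hypothetical, and each position in the two lists must refer to the same student.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between paired scores from two administrations."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(first)
    mean_first = sum(first) / n
    mean_second = sum(second) / n
    covariance = sum(
        (a - mean_first) * (b - mean_second) for a, b in zip(first, second)
    )
    var_first = sum((a - mean_first) ** 2 for a in first)
    var_second = sum((b - mean_second) ** 2 for b in second)
    if var_first == 0 or var_second == 0:
        raise ValueError("scores must vary in both administrations")
    return covariance / sqrt(var_first * var_second)


# Hypothetical scores for the same five students, in the same order.
first_administration = [72, 85, 90, 64, 78]
second_administration = [70, 88, 91, 60, 80]

print(round(pearson_r(first_administration, second_administration), 2))
```

Values near 1 indicate that students kept roughly the same relative standing across the two administrations; values near 0 indicate little consistency.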

In addition to temporal stability, reliability also addresses the extent to which test results are free from error (4). We expect a certain amount of error in test scores due to student motivation, carelessness, or fluctuations in memory. However, this kind of error is said to be random in that its impact on test scores is unpredictable. In contrast, you want to eliminate systematic error that results from the assessment design or scoring to increase reliability. Actions to increase the reliability of your classroom assessments include:

  • Using consistent assessment administration procedures. For example, ensure all students receive the same amount of time to complete the assessment in an environment that is free from distraction.
  • Removing bad or ambiguous test items. Inexact wording or poor construction of the question can contribute to this problem; trick questions and items that hinge on trivial details also fall into this category. One indicator of a poorly functioning test item is that fewer than 60% of students answer it correctly. You should also invite student input on your assessment design by asking students to approach you during tests or to provide written feedback on their test about items that are confusing.
  • Eliminating any gender or cultural bias in test items. For example, individuals from another culture may be unfamiliar with the jargon of American football (e.g. “touchdown”). Thus, you should refrain from using these terms in word problems. Always ensure that you have provided the test-taker with all the necessary information to solve the problem.
  • Using scoring rubrics to increase consistency in grading open-ended responses such as student essays. You can also assess the reliability of your essay scoring by asking another graduate student in your field to score a sample of your essays (ideally using a rubric) and comparing the scores each of you assigned.
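The 60% indicator mentioned in the list above can be checked with a short script. This is a minimal sketch that assumes each student's answers have already been scored as 1 (correct) or 0 (incorrect); the response data and function names are hypothetical.

```python
def item_difficulty(responses):
    """Proportion of students answering each item correctly.

    `responses` holds one row per student; each row holds a 1 (correct)
    or 0 (incorrect) for every item, in the same item order.
    """
    if not responses:
        raise ValueError("need at least one student's responses")
    n_students = len(responses)
    n_items = len(responses[0])
    return [
        sum(row[i] for row in responses) / n_students for i in range(n_items)
    ]


def flag_items(responses, threshold=0.60):
    """Zero-based positions of items answered correctly by fewer than `threshold`."""
    return [
        i for i, p in enumerate(item_difficulty(responses)) if p < threshold
    ]


# Hypothetical scored responses: five students, three items.
scored = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
]

print(item_difficulty(scored))
print(flag_items(scored))
```

A flagged item is not necessarily a bad item; it is a prompt to reread the wording and check whether the content was actually taught.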

The number of items you include should be determined by the amount of time students will have to complete the assessment. A good rule of thumb is to allow students two minutes to complete a multiple-choice item. Thus, in a 90-minute class, a multiple-choice test with 45 items would be appropriate.

Student Involvement in Assessment Improvement

In addition to gathering reliability and validity evidence to improve your assessments, you can also gather feedback from students. Invite them to help you identify alternative, effective assessment methods. Also encourage them to share their input on unclear instructions, multiple-choice items, essay prompts, rubric criteria, etc. Inviting students to contribute to assessment development involves them more deeply in the process, which may help them become better performers (7).


(1) Huba, M. E. & Freed, J. E. (2000). Learner-centered assessment on college campuses. Boston, MA: Allyn and Bacon. 

(2) Pascale, P. (n.d.). Developing a table of specifications. Youngstown State University. Retrieved from Eric Document Reproduction Service (ED 115 675).

(3) Brualdi, (1999). Traditional and Modern Concepts of Validity. ERIC/AE Digest. Retrieved from

(4) Gronlund, N. E. (2003). Assessment of student achievement. Boston, MA: Allyn and Bacon.

(5) American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

(6) Johnstone, C.J., Bottsford-Miller, N.A., & Thompson, S.J. (2006). Using the think-aloud methods (cognitive labs) to evaluate test design for students with disabilities and English language learners. National Center on Educational Outcomes, University of Minnesota. 

(7) Stiggins, R.J. (2005). Student-involved assessment for learning. (4th ed.). New Jersey: Pearson, Merrill Prentice Hall.