March 24, 2025
Disclosure: This write-up is not a formal research paper but rather a blog-style synthesis based on a review of empirical literature and expert opinions in the field of higher education assessment. It is intended to offer an evidence-based perspective on rubric design and is not an exhaustive or definitive research article.
Introduction: Two-tiered or dual-layer rubrics refer to an assessment strategy in which students receive a general rubric outlining broad criteria and high-level performance descriptors, while instructors use a more detailed, internal grader-facing rubric with hyper-specific criteria for grading. The rationale behind this approach is to give students clear guidance without overly prescribing their work, while simultaneously providing graders (especially less experienced ones) with a rigorous framework to ensure consistency. This report examines empirical evidence on four key questions about two-tiered rubrics: (1) whether they improve inter-rater reliability (particularly for novice graders), (2) whether vague criteria like “college-level work,” “depth,” or “rigor” undermine consistency and fairness, (3) whether highly specific, concept-level performance expectations can serve as concrete anchors for assessing depth of understanding (without stifling creativity), and (4) whether using a detailed rubric internally (and a broader one for students) strikes a balance between standardization and avoiding cookie-cutter responses. Each claim is supported with findings from peer-reviewed studies and authoritative sources in higher education assessment.
Research strongly indicates that well-designed rubrics can improve inter-rater reliability, ensuring different graders evaluate work more consistently. This effect is especially pronounced for less experienced or non-expert assessors. In a quantitative study by Kan and Bulut, teachers graded performance tasks with and without a rubric: “When teachers used a rubric, inter-rater reliability substantially increased”. Notably, novice instructors (with little teaching experience) and veteran instructors tended to grade differently when no rubric was provided; the novices often applied stricter or inconsistent criteria compared to seasoned educators. However, once all graders used a common rubric, “the differences in the teachers’ scoring due to their teaching experience became negligible”. In other words, a detailed scoring guide helped align novices’ judgments with those of experts, creating a shared standard and improving consistency.
Such findings suggest that an internal hyper-specific rubric could function as a training tool or checklist for TAs and new graders. By spelling out expectations in fine detail (e.g. what constitutes evidence of “analysis” or “logical reasoning”), the rubric reduces the ambiguity in judging complex student work. Empirical reviews note that one cause of low inter-rater reliability is rubric descriptors being too general or subjective, leading different graders to interpret criteria in their own way. A more explicit rubric minimizes the number of independent judgments a grader must make, thereby narrowing the room for disagreement. In practical terms, this means fewer instances of two graders assigning significantly different scores to the same work due to personal criteria. In sum, providing graders with a detailed, analytic rubric (even if students only see a higher-level version) can bolster reliability by ensuring everyone “speaks the same language” when evaluating student performance. Improved reliability not only makes grading fairer but also increases confidence in the assessment’s integrity.
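To make the reliability claim concrete, the sketch below computes Cohen’s kappa, one common chance-corrected agreement statistic, for two hypothetical graders scoring the same ten essays with and without a shared rubric. The scores and the size of the gain are invented for illustration only (they are not data from the Kan and Bulut study); the point is simply what “improved inter-rater reliability” looks like when quantified.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    # Observed agreement: proportion of items both raters scored identically.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: what chance alone would produce, given each
    # rater's marginal distribution of scores.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical scores on a 1-4 scale from a novice TA and an experienced
# instructor grading the same ten essays, before and after adopting a rubric.
no_rubric_ta   = [2, 3, 1, 4, 2, 2, 3, 2, 4, 3]
no_rubric_prof = [3, 3, 2, 3, 3, 2, 3, 2, 4, 3]

rubric_ta   = [3, 3, 2, 4, 3, 2, 3, 2, 4, 3]
rubric_prof = [3, 3, 2, 3, 3, 2, 3, 2, 4, 3]

print("kappa without rubric:", round(cohen_kappa(no_rubric_ta, no_rubric_prof), 2))  # 0.41
print("kappa with rubric:   ", round(cohen_kappa(rubric_ta, rubric_prof), 2))        # 0.83
```

In this made-up example the shared rubric moves agreement from moderate to strong; published studies report the direction of this effect, not these particular numbers.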
Many traditional rubrics in higher education include vague terms such as “college-level work,” “depth,” or “rigor” to distinguish top-quality work. Empirical studies have identified vague criteria language as a threat to consistent and fair assessment. When descriptors are nebulous, graders inadvertently fall back on subjective impressions or biases. For example, a literature review by Nkhoma et al. notes that if rubric descriptions are “perceived as abstract and vague”, then “assessors will assign grades based on their overall impression rather than the criteria described”. In other words, terms like “rigorous analysis” or “in-depth discussion” without further clarification might mean one thing to one grader and something else to another. This undermines the rubric’s purpose of standardizing evaluation and can result in inconsistent grading.
Vague language also poses problems from the student’s perspective. Students – especially those new to the academic culture – may not intuitively grasp what phrases like “college-level” or “demonstrates depth” entail in practice. A program assessment handbook highlights that “many first-generation students and students from low-performing high schools do not have experience in decoding what ‘college-level’ work is”, so instructors should not rely on implicit understanding of such terms. If expectations are not explicitly spelled out, students unfamiliar with the unwritten norms of academia can be unfairly disadvantaged. For instance, telling students their essay needs “more depth” without concrete explanation leaves them guessing how to improve. Handley et al. (2013) found that terms like “critically analyse” or “synthesise” are often unclear to students and need further explanation for them to effectively use feedback.
To address these issues, researchers recommend making rubric criteria as clear and descriptive as possible. This includes using concrete qualifiers or examples rather than broad value-laden terms. Precise language and defined standards help ensure that all graders are applying the same criteria and that all students understand the targets. For example, instead of stating that a lab report must show “rigor,” a rubric could specify that it should include a thorough error analysis and reference at least two peer-reviewed sources to support methodology – tangible indicators that are less open to interpretation. As one comprehensive review puts it: “If the descriptions are perceived as abstract and vague... rubrics are not likely to be valuable in promoting learning,” and consistency suffers; therefore, “precise and descriptive language; tangible, qualitative terms; indicators and exemplars are highly recommended” when designing rubrics. In summary, empirical evidence confirms that vague criteria are problematic for both fairness and consistency, underscoring the importance of clearly defining terms like “depth” and “rigor” in assessment tools.
One proposed benefit of a dual-layer rubric is that the internal rubric can contain highly specific, concept-level performance expectations that serve as concrete anchors for evaluating advanced understanding or “depth.” Rather than relying on abstract judgments (e.g. whether an essay feels sufficiently deep or rigorous), the grader’s rubric can list specific hallmarks of deep understanding. For example, in a senior AI assignment, a broad student-facing rubric might simply include “Demonstrates deep understanding of transformer models,” whereas the internal rubric would enumerate concrete evidence of that understanding – e.g. “correctly proves a non-trivial property of the transformer’s attention mechanism” or “extends the given model to a new use-case with justified modifications.” These detailed criteria act as quantitative or observable anchors: if a student accomplishes these specific tasks, the grader can objectively conclude that the work has the desired depth.
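As a rough sketch of how such a dual-layer rubric might be organized, the structure is essentially a mapping from each broad, student-facing criterion to the specific internal indicators graders check. The criterion names, indicators, and point values below are hypothetical, loosely following the transformer-assignment example above, and are not drawn from any cited study.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One broad student-facing criterion plus its internal, grader-only indicators."""
    public_descriptor: str            # what students see
    internal_indicators: list[str]    # concrete anchors graders check
    points: int

# Hypothetical dual-layer rubric for the AI assignment described above.
rubric = [
    Criterion(
        public_descriptor="Demonstrates deep understanding of transformer models",
        internal_indicators=[
            "Correctly proves a non-trivial property of the attention mechanism",
            "Extends the given model to a new use case with justified modifications",
            "Relates design choices to at least one documented limitation",
        ],
        points=10,
    ),
    Criterion(
        public_descriptor="Communicates results clearly and supports claims with evidence",
        internal_indicators=[
            "Reports quantitative results against an appropriate baseline",
            "Discusses at least one source of error or uncertainty",
        ],
        points=5,
    ),
]

def student_facing_view(rubric):
    """Published to students: broad criteria and weights only, no internal checklist."""
    return [(c.public_descriptor, c.points) for c in rubric]

def grader_view(rubric):
    """Used by graders: the same criteria broken down into concrete indicators."""
    return [(c.public_descriptor, c.internal_indicators, c.points) for c in rubric]
```

The design choice worth noting is that the internal indicators elaborate the public criteria rather than adding new ones, which keeps the two layers aligned (a point returned to below in the discussion of transparency).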
Research supports the idea that using specific, content-related criteria can improve the objectivity and transparency of assessing complex constructs like understanding or creativity. In their review of rubric design, Dawson et al. emphasize incorporating “indicators and exemplars” into performance level descriptors to clarify what meeting a standard looks like. An indicator might be a particular skill demonstration or content element that signals mastery. By including such indicators, the rubric translates fuzzy concepts (“excellent understanding”) into concrete evidence (“explains and applies X theory to solve Y problem”). This makes the grading more quantifiable without resorting to numeric tests, and it provides a more uniform basis for all graders to judge depth.
Importantly, using specific expectations does not have to come at the cost of student creativity or flexibility, if done thoughtfully. The literature suggests that criteria can be explicit and content-specific yet still open-ended in how students fulfill them. Panadero and Jonsson (2020) describe an example in a science context: a rubric for an astronomy assignment specified that top-tier arguments should be based on empirical data with multiple observations, include counterarguments that are addressed, and use appropriate qualifiers. These are high-level scientific reasoning requirements – essentially concept-level expectations for quality – that can be measured (either the student’s argument included those elements or not). Students in that study engaged more deeply with the content to meet these criteria, but the rubric “did not tell them exactly what to do, or how they should do it,” allowing room for different approaches to satisfying the criteria. In practice, one student might fulfill the “depth” criterion by proving a novel property or theorem, while another might design an insightful experiment – both demonstrating advanced understanding anchored to specific expectations.
Thus, empirical evidence and expert analysis support the use of concrete performance anchors to operationalize “depth” or “rigor” in grading. By defining depth in terms of particular achievements or behaviors (proving a property, integrating two theories, addressing counterarguments, etc.), instructors can more fairly and consistently assess higher-order understanding. These specific anchors make the grading criteria more standardized and measurable, yet if they are properly chosen as indicators of quality rather than as an exhaustive to-do list, they still permit creative and varied student solutions.
A common concern among educators is that overly detailed rubrics might produce “cookie-cutter” responses, as students may simply follow a formula to secure a high grade. When every requirement is spelled out step-by-step, students could be incentivized to tick all the boxes in the easiest way possible, potentially stifling originality or deeper exploration. Indeed, critics of rubrics have argued that “prespecified criteria and common standards” can encourage uniformity and work “against creative self-expression”, especially in open-ended tasks. Empirical observations bear this out to an extent: for instance, Wilson (2007) reported that some writers feel rubrics “prevent students from expressing their unique approach to concepts”, essentially treating rubrics as “obstacles to good writing” because students focus on satisfying the rubric at the expense of creative risks. Highly detailed rubrics, as Panadero and Jonsson note, “may not leave enough space for creative and divergent thinking” if they dictate too much of the process or content.
Using a two-tiered rubric approach is one way to mitigate this issue. The broad, student-facing rubric provides clarity on the criteria (the aspects of performance that matter) without giving away a “paint-by-numbers” template for the assignment. This aligns with best practices for rubric design, which suggest that rubrics should guide students on quality expectations rather than serve as a step-by-step recipe. For example, rather than listing every element that a perfect project must contain, the external rubric might say “Project demonstrates a thorough exploration of the problem and innovative thinking in solutions.” This tells students what to aim for (thoroughness, innovation) but not how to do it exactly – encouraging them to make creative decisions on how to meet those criteria. Meanwhile, the internal detailed rubric ensures that when graders sit down to evaluate the work, they have a common set of concrete benchmarks (the “shared grading blueprint”) against which to measure each project. Essentially, the internal rubric guards against subjectivity and drift in grading by less experienced TAs, while the external rubric guards against constraining student creativity.
Evidence from assessment literature supports the value of this balance. Measurement experts like Wiggins have long advocated that “criteria are indicators of quality, without dictating exactly what students should do, or how they should do it.” In practice, this means rubrics (at least the versions given to students) should articulate what good work looks like in general terms – enough that students know the goals and don’t have to “read the instructor’s mind,” but not so much that every student output converges on an identical format. The scenario of providing only the general rubric to students and keeping the hyper-specific criteria internal fits this philosophy. It gives students a transparent outline of expectations (ensuring fairness and clarity), while avoiding an overly prescriptive checklist that could lead to formulaic work. In fact, an article in Educational Leadership notes that if a rubric defines quality too narrowly, it can yield “cookie-cutter products from students,” and warns that any rubric inducing that effect “is a bad one and should be shredded.” The same article (citing Popham, 2006) contrasts a task-specific, rigid rubric descriptor with a more open-ended one: the narrow rubric demanded inclusion of very particular content points, whereas the improved rubric described structural and qualitative expectations (like having a logical order and clear sections) which set a standard without specifying the exact content to include. Students thus retain freedom in how to meet the standard, and higher-order thinking is encouraged rather than a scavenger hunt for pre-listed points.
On the grader’s side, using a detailed internal rubric helps maintain a shared benchmark so that all graders reward the same qualities even if students achieve them in diverse ways. The internal rubric can list, for graders’ eyes, examples or common indicators of excellence (for instance, “an innovative solution may involve applying theory X in a new context or integrating two disciplines”). This can be coupled with grader training or calibration sessions, but even on its own, an internal guide promotes consistency. Research on rubric use shows that clearly defined scoring criteria reduce bias and variance in grading. One study demonstrated that introducing rubrics not only raised student achievement on a second assignment but also “improved marker reliability,” meaning graders were more consistent with each other. Moreover, when all graders operate from the same detailed criteria, it diminishes the influence of individual graders’ tendencies (leniency, harshness, or pet preferences). In the earlier-mentioned Kan and Bulut study, the rubric essentially neutralized differences in severity between experienced and novice graders – a clear sign that a common rubric establishes a common standard. In a dual-rubric system, the internal rubric plays this role of the common standard-setter behind the scenes, so that even if students are not following a paint-by-numbers format, the graders are all using the same ruler to measure whatever the student produces.
It is worth noting that transparency and fairness require that students are not kept entirely in the dark about important expectations. Complete secrecy of grading criteria is generally discouraged as it can lead to perceptions of arbitrariness. Balloo et al. (2018) found that if assessment criteria are “concealed from the students, they are deprived of the possibility to take full responsibility for their work”. Thus, the broad rubric given to students must genuinely communicate the key goals and values of the assignment. The internal rubric should ideally not introduce wholly new criteria that students couldn’t have anticipated; rather, it should break down the public criteria into finer points or examples for consistency’s sake. When implemented in this aligned way, the two-tier approach can achieve the dual purpose of preserving student agency and creativity (through broad, non-prescriptive guidance) and ensuring grading standardization and fairness (through detailed internal benchmarks).
Empirical research and scholarship in higher education assessment provide a nuanced endorsement for the two-tiered rubric approach. Inter-rater reliability stands to benefit: detailed internal rubrics help novice graders apply criteria like experts, improving consistency. Studies also confirm that vague terms in rubrics are problematic – they introduce subjectivity and inequity, and should be replaced or supplemented with clearer definitions. By using highly specific performance expectations internally, instructors can anchor judgments of “depth” or “rigor” to tangible evidence. When crafted as quality indicators rather than rigid rules, such criteria can raise the standard of work without extinguishing creativity. Finally, adopting a dual-layer rubric (broad for students, detailed for graders) appears to be a promising strategy to avoid the “cookie-cutter” effect teachers fear while still maintaining a common grading benchmark. In sum, a growing body of evidence suggests that two-tiered rubrics can marry the best of both worlds – offering students clarity and room for innovation, and offering instructors reliability and fairness in evaluation – provided they are carefully designed and transparently aligned. All these findings underscore a central theme: the clarity and specificity of assessment criteria, and how they are communicated, have profound impacts on learning and fairness in higher education. Each institution or instructor considering this approach should align it with these research-based principles to ensure it truly benefits both student outcomes and grading integrity.
References: (All evidence is drawn from peer-reviewed studies, literature reviews, and authoritative educational resources)