Key Principles for Creating High-Quality Assessments

Author: Sridhar Rajagopalan, Co-founder & Chief Learning Officer, Ei



Student assessments are a key part of the education system. The data and insights they generate provide valuable feedback on the performance of students, teachers, schools, and the education system in general. Additionally, because stakeholders in the system place great emphasis on good performance in examinations, improving assessments can have positive upstream effects across the system. ‘Teaching to the test’ can promote learning if the test is a good one.

Over the last two decades, Ei has created thousands of assessments at the school level. We have also studied the assessments used in our schools as well as the school-leaving exams. Based on our collective experience over this period, we share certain key principles that, if adhered to, can improve the quality of assessments (and consequently of student learning).

These principles are:

  1. Testing key concepts and core knowledge, not peripheral facts
  2. Using questions that are unfamiliar either in the way they are framed or in their context
  3. Having questions covering the entire range of difficulty in a paper
  4. Ensuring that difficult questions are based on ‘good’ sources of difficulty
  5. Using authentic data in questions
  6. Avoiding narrowly defining a large number of competencies and then mapping questions to individual competencies
  7. Using different question types to test different aspects of student competencies
  8. Providing test creators access to past student performance data, to use while designing questions
  9. Designing answer rubrics to capture errors and misconceptions
  10. Publishing post-examination analysis booklets
  11. In large-scale exams, reporting results using scaled scores and percentiles

Assessments are, and shall continue to remain, a ‘north star’ that guides the actors in the education system. They will therefore continue to have an outsized effect on the priorities of stakeholders. Adhering to these principles, we believe, will help create well-designed assessments that do not remain just a ‘necessary evil’, but positively influence the education system, and subsequently our workforce and society.

Principle 1. Testing key concepts and core knowledge, and not peripheral facts:

Examinations should primarily test a student’s understanding of key concepts. It is also okay if they test for certain facts, as long as those facts are core to the subject. For example, the concepts of compounds and mixtures and the difference between them represent a fundamental understanding of matter. The fact that the Earth’s axis is tilted from the perpendicular to the plane of revolution is a core fact which is okay to test. On the other hand, the amount by which the Earth’s polar circumference is less than its equatorial circumference is an unimportant fact and should not be tested. Yet many of our exams test peripheral or trivial facts like these.

Trivial facts should not be tested, not just because they can easily be looked up on any mobile phone, but also because they may displace core understanding. One way to test whether a certain question is a valid one to ask is to check if a high percentage (say 70%) of practising experts in the subject would answer it correctly. Every physicist will know the difference between compounds and mixtures and all about the tilt of the Earth’s axis, but most geographers would probably NOT know that the polar circumference is 72 km less than the equatorial circumference. Questions that test reasoning and higher-order thinking are also more relevant to real-world tasks and challenges today and hence more important.

The figure below shows a question testing mechanical learning vs. real learning with understanding.


Figure 1: Asking for the definition of a peninsula tests mechanical learning, while the alternative shown expects students to understand the characteristics of peninsulas even if they cannot give the textbook definition.

To summarise, not everything that can be asked should be asked. Rather, assessments should focus on concepts, and on those facts that serve as a foundation for real-life application or future learning.

Principle 2. Using questions that are unfamiliar either in the way they are framed or in their context:

Most examinations in our country, both at the school and Board level, tend to have questions that are typical or fit a standard form. Questions rarely use unusual or unfamiliar contexts or forms. So students develop the techniques and confidence to answer those standard questions (often by reproducing the solutions in the textbooks). They learn that when they encounter a question in an exam, they should ‘pattern-match’ to check which question they have seen in the textbook or class matches it closest and apply the same procedure. Unfortunately, this actually works and yields the expected result with most questions, so the ‘learning’ is reinforced. Students gain no exposure to having to respond to problems that are presented differently or need to be tackled differently. The process of first trying to understand the problem, then thinking about it and then attempting to solve it step by step, is largely unknown to them.

Thus, this is nothing more than a form of rote learning, where procedures or patterns, if not facts, are memorised. When faced with unfamiliar problems, whether in modern tests like PISA, in competitive tests, or in unexpected real-life situations, students feel flabbergasted or unprepared, or declare that the question is ‘out of syllabus’. Most students lack the confidence to even attempt such questions.

Whether we want to check that students have really learned concepts, or to prepare them for future exams, tests should contain questions that test the prescribed set of concepts in an unfamiliar way.

What do we mean by ‘unfamiliar’ questions? Questions can be unfamiliar in different ways:

  • they may be framed using real-life contexts (e.g. sports, technology, art, music, market transactions) which are not used for that concept in the textbook
  • they may be framed in the context of contemporary developments (e.g. COVID-19, cryptocurrency, an important current event) which too would be ‘new’ for them
  • they may integrate concepts taught in different subjects (e.g., show a graph recorded by a seismograph during an earthquake and ask a simple interpretation question in a mathematics test)
  • they may simply test for conceptual understanding, misconceptions or higher-order cognitive skills in any form that has not been discussed in the textbook


Figure 2: An example of an unfamiliar question that checks for conceptual understanding. The question is different from any typical question asked on a topic like evaporation; in fact, students have to figure out what concept is being tested.

Being able to apply conceptual understanding in unfamiliar contexts is a critical life-skill. Asking such questions in exams would automatically ensure their use in classroom teaching and help develop such skills.

It is important to note two important points about unfamiliar questions: Firstly, they are not necessarily difficult questions. Once students have understood the problem, it may actually be easy to solve. Secondly, all the questions in a test need not be unfamiliar. Up to 30% – 40% of questions can be familiar to students and thus answerable even by weaker students.

Finally, creating such questions may seem challenging, and it does require effort. Some tips are discussed in Box 1.

Box 1: How to create unfamiliar questions based on familiar concepts

1.    Any question in an unfamiliar context checks for a concept in a specific context. So you can either start with the concept (e.g. heat, the Pythagoras theorem, food webs or climate) or the context (e.g. the moon landing, GPS, codes and cryptography, or printing a book). But remember that real-life questions often involve multiple concepts, and that is good!

2.    A good way to develop the skill of creating unfamiliar questions is to study existing questions from tests like ASSET, PISA, TIMSS and others – all of which share examples of such questions. Here is a list of some examples of contexts and concepts that can be tested in those contexts:

Table 1: List of real-life contexts and the concepts that can be tested

Context | Concept
Sun outages, which block satellite TV signals twice a year | Orbits of satellites around the Earth and of the Earth around the Sun
Speedometers, energy meters, weighing scales, etc. | Measurement; how to read scales correctly; least count
Satellite images released by NASA, ISRO, etc.; the instruments used by astronauts | Relevant science concepts like phases of the moon, gravity, etc.
Decorative lights, mosquito swatters, etc. | Electric circuits
Arrangement of seats in auditoriums, theatres, online booking apps, etc. | Arithmetic progression, patterns and algebraic reasoning, estimation and number sense
Gears and speeds of cars | Ratio and proportional reasoning, the concept of power, speed
Google Maps, photo editing applications | Proportional reasoning, algebraic reasoning, transformations, similar triangles
Non-routine 2D and 3D shapes around us | Mensuration
Information plaques at famous sites; authentic historical documents; labels of food items; bills, etc. | Language and reading comprehension as well as data analysis
Web pages, magazine article excerpts | Reading comprehension, applied grammar, appreciation
Historical documents, objects or photographs | Historical facts, or concepts like trade, war, etc.
Samples of art, architecture or culture | Various concepts in social studies, civics, lifestyles

3.    Keeping the context and concept in mind, we now create a question in which a real-life example is drawn from the context while also covering the concept to be tested. It is okay if the example is simplified to be relevant to the appropriate age group, as long as it is based on the same principles actually used. See the example below.

[Figure: an example question created from a real-life context, as described in step 3 above]

Principle 3. Having questions covering the entire range of difficulty in a paper:

A key purpose of most examinations is to discriminate between students of different levels of ability. To do so, they must contain questions that, taken together, cover the entire range of student ability.

Since there will be test-takers with low, medium and high ability levels, the examination must be able to discriminate properly between them. To do this, there must be a good mix of easy, medium and difficult questions. Students with a poorer knowledge of the subject matter will be able to solve only the easiest questions, whereas those with a stronger grasp will answer more difficult questions, with only the highest-ability students being able to answer the most difficult ones.

How does one know the difficulty level of questions while setting them? This is not easy, and while question makers may estimate the difficulty, these estimates are often not correct. The only way to have this information is to pilot items and record the performance data. (If performance data of past items is available, the difficulty of similar items can sometimes be judged reasonably accurately. Also, if a group of experts is proficient in setting questions and then analysing the actual performance, they develop a good sense of student performance on different types of items – though regular pilots are always necessary.)

Currently, many public examinations tend to have few questions at a higher level of difficulty (and in some cases none). For example, data analysis of one past board paper shows that all the multiple-choice questions in the paper had difficulty parameters in a very narrow range, with most of them discriminating only among students of medium ability. The presence of too many easy and too few difficult questions skews the distribution of results and also leads to marks inflation (which pushes up college cut-offs and increases pressure on students, as a single mark makes a huge difference).

Figure 4 shows the difficulty distribution of questions in a recent ASSET paper. The difficulty parameter represents the average performance of the item. Thus there are items answered correctly by over 90% of students while others were answered correctly by only 12% of students. Furthermore, there are items at almost every level of intermediate difficulty.

[Figure 4: Difficulty distribution of questions in a recent ASSET paper]

If questions at all difficulty levels are properly represented in a paper, the student results will also form a normal curve (which is correct as student abilities form a normal distribution). This is a necessary (though not sufficient) condition for a good assessment.
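To make the difficulty parameter concrete, here is a minimal sketch (not Ei's actual analysis pipeline) that computes it as the proportion of students answering each item correctly and counts how many items fall into easy, medium and difficult bands. The response matrix and the band cut-offs are assumptions made purely for illustration.

```python
import numpy as np

# responses[s, i] = 1 if student s answered item i correctly, else 0
# (random placeholder data; in practice this comes from a pilot or a live exam)
responses = np.random.default_rng(0).integers(0, 2, size=(500, 40))

difficulty = responses.mean(axis=0)  # proportion correct for each item

easy = np.sum(difficulty >= 0.70)                            # most students get these right
medium = np.sum((difficulty >= 0.30) & (difficulty < 0.70))
hard = np.sum(difficulty < 0.30)                             # only the strongest students get these right

print(f"easy: {easy}, medium: {medium}, difficult: {hard}")
```

A paper following this principle would show items spread across all three bands rather than clustered in a narrow range.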

Principle 4. Ensuring that difficult questions are based on ‘good’ sources of difficulty:

Based on their content, examination questions can be difficult for ‘good’ or ‘bad’ reasons. For example, students may find certain questions difficult because they test multiple skills simultaneously. Such questions encourage students to engage in higher-order thinking and integrate aspects they have learnt. Such questions can be said to be built on good sources of difficulty.

On the other hand, questions that are based on ‘bad’ sources of difficulty may test irrelevant facts or may require students to engage in tedious calculations.

While ‘good’ sources of difficulty can encourage meaningful learning, ‘bad’ sources of difficulty may cause students to lose interest in the subject. (Box 2 lists some common good and bad sources of difficulty.)

Further, a good test should have the different good sources of difficulty well represented, with no single source over-represented, so that at an overall level the test can discriminate well across students of all ability levels.

Principle 3 highlighted the importance of assessments containing questions across a range of difficulties. Even within this range, though, questions should be based only on good sources of difficulty.

Box 2: Sources of difficulty in questions

Good sources of difficulty

a.  testing one or more misconceptions

b. requiring students to identify the relevant parts from the information provided before answering

c.  testing concepts in an unfamiliar manner

d. using age-appropriate but unseen contexts

e. applying concepts in an unfamiliar context

f.  basing questions on observations from real life

Bad sources of difficulty

a.  erroneous questions

b. using contrived examples; unrealistic situations

c.  using tricky or unnecessarily complicated question texts

d. confusing or ambiguous figures, tables or other information

e. having options that are long, complicated, ambiguous or overlapping

f.  involving unnecessarily long calculations

g.  having options that are not homogeneous in content or grammatical structure

h. confusing language (like the use of double negatives)

i.   requiring tedious working out or steps

Principle 5. Using authentic data in questions:

Questions in exams should, as far as possible, contain authentic data and examples from the real world, even in situations where the use of fictitious examples or data would otherwise suffice. The use of real-life contexts and data in examinations can make questions more engaging, and help students understand the practical importance of their education.  Therefore, in addition to testing concepts, these questions become teaching tools in themselves. Their use in examinations will also encourage teachers to structure classroom instruction accordingly. A sample item using authentic data is shown in Figure 5.


Figure 5: A sample assessment item using authentic data

For example, if scores from sporting competitions are used in a question testing the concept of averages, they should be data from actual sporting events. Similarly, when students studying geography are questioned about plate tectonics, they should be given examples of real tectonic plates and their movements if possible. In language examinations, comprehension passages can be from real texts across domains such as history, science, or economics.

Of course, in some cases, the complexity of information may need to be moderated or simplified to be suitable for the targeted class level.

Principle 6. Avoiding narrowly defining a large number of competencies and then mapping questions to individual competencies:

There seems to be a widespread but wrong notion that good education and assessments require a large number of competencies to be listed for each subject, and individual questions in assessments mapped to individual competencies. Further, some seem to believe that merely doing the above will lead to good assessments and by extension, good education. ‘Competency Based Education’, a laudable goal, is sometimes understood in this narrow sense. In our experience, listing competencies and then mapping questions to competencies are both largely mechanical steps, and may actually increase and not reduce the rote component of an assessment.

The belief that students need to acquire key competencies is valid. However, the idea that this can be achieved in a mechanistic manner – first listing competencies and then creating questions or content that maps to those competencies – is flawed. Only the quality of content and assessments can lead to good teaching or learning, not merely a mapping.

Particularly in examinations, overly specific mapping leads to the use of narrowly structured examination questions that test only particular competencies, and that too in isolation. In fact, good questions that test multiple competencies are usually excluded in such a process because they breach artificially defined boundaries for competencies, making the paper more mechanical.

This problem is present even in ‘advanced’ education systems. In the USA, for example, the Common Core was introduced to establish set standards and competencies for student education, to improve learning outcomes. Though there was a lot more to the Common Core, in many cases, it was treated by teachers merely as a list of standards to be rigidly focussed on through lessons or questions.[1]

While examination boards must establish a set of necessary skills, concepts, and learning outcomes to guide the education system, they should not be overly prescriptive in how questions test them or aim to break them into very fine sub-categories. 

Principle 7. Using different question types to test different aspects of student competencies:

We often hear debates and arguments about how certain types of questions (say objective or subjective, or multiple-choice questions) are inferior or superior to other types of questions. However, the reality is that each question type has its strengths, weaknesses, and suitability based on the subject and the goal of the assessment. It may be said that a comprehensive assessment will have a mix of various types of questions, each used for its own strengths, as described in Box 3.

[1] Loveless, T. (2021, March 18). Why Common Core failed. Brookings. Retrieved January 24, 2022, from https://www.brookings.edu/blog/brown-center-chalkboard/2021/03/18/why-common-core-failed/

Box 3: Different strengths and uses of different question formats

Different question formats have their own strengths and weaknesses and may be suitable in different situations:

Multiple-choice questions (MCQs) capture students’ misconceptions or common errors if framed well. They can be created very scientifically and scored quickly but good MCQs require effort and experience to create. They can test higher-order thinking skills too though badly designed MCQs often just test facts and may serve little purpose.

True/False questions are easy to frame and score but encourage guessing, given the 50% chance of getting them correct. Asking for an accompanying justification can help gauge student thought processes.

Blank or short answer questions are easy to frame and can be used to test key facts, terms or principles. They can be an effective method to test if students know and can express an answer concisely.

Long answer or essay-type questions can test a student’s ability to analyse information, synthesise different facts and ideas, etc. They are often the best question type for this objective. They also provide nuanced information on misconceptions and are easy to set. However, they take greater effort than other question types to evaluate and require a detailed rubric for proper marking. It is almost impossible to eliminate an element of subjectivity in their correction, though.

Technology Enhanced Items (TEIs) can improve test-takers’ engagement through the use of visual or auditory aids, for example, and provide a detailed diagnosis with standardised correction. But they require time and skills to create, and digital infrastructure to administer to students. Some examples of TEIs are shown in Figure 6.


Figure 6: Technology-enhanced items from mathematics and language. Students interact with such questions, which not only record details of these interactions but may also adapt based on student responses.

Principle 8. Providing test creators access to past student performance data, to use while designing questions:

Especially for large-scale or summative examinations, test creators should be given access to data on past assessments. This provides insights of two types: first, about which items worked well and what issues, if any, they had; and second, about the kinds of responses and errors students made.

Knowing which items functioned well and which did not helps create better items for future assessments. Item data may indicate difficulty, discrimination, the extent of guessing, the ability level of students who answered the item correctly, and the wrong responses students gave. Though all of this data may not be available for every question, each piece of information provides valuable insights. As mentioned earlier, past item data also helps question makers estimate the difficulty of similar new items and thus create questions of varying difficulty in the paper.
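As an illustration of the kind of item statistics described above, the sketch below computes a simple discrimination index (the point-biserial correlation between an item and the rest of the test) and tallies which wrong options students chose. The option labels, answer key and flagging threshold are assumptions for the example, not a prescribed format.

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(1)
# chosen[s, i] = option ('A'-'D') picked by student s on item i (placeholder data)
chosen = rng.choice(list("ABCD"), size=(500, 40))
key = rng.choice(list("ABCD"), size=40)  # correct option for each item

scored = (chosen == key).astype(float)   # 1 = correct, 0 = wrong
totals = scored.sum(axis=1)              # each student's total score

for i in range(scored.shape[1]):
    item = scored[:, i]
    rest = totals - item                                   # score on the rest of the test
    discrimination = np.corrcoef(item, rest)[0, 1]         # point-biserial correlation
    distractors = Counter(map(str, chosen[item == 0, i]))  # how often each wrong option was chosen
    if discrimination < 0.15:                              # flag weakly discriminating items for review
        print(f"Item {i + 1}: discrimination {discrimination:.2f}, distractors {dict(distractors)}")
```

Items with low discrimination, or with one distractor chosen far more often than the others, are worth a closer look before they are reused.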

(Knowing areas of student error is useful not just for assessment creators but teachers as well. Principle 10 below talks about the benefits to teachers and future students when this data is shared with them in the form of post-examination assessment booklets.)

Having misconception data in the format shown in Box 4 helps in developing good assessment items testing misconceptions and in creating plausible distractors.

Box 4: Misconception Data

The table below is a sample database of misconceptions under the topic ‘Respiration’.

S.No | Misconception statement | Extent of the misconception (Class) | Link to the assessment item
1 | Plants carry out photosynthesis only during the day and respiration occurs only at night | 62% (Class 9) | Resp_M_1
2 | Plants use carbon dioxide for respiration | 52% (Class 6) | Resp_M_2
3 | Plants do not need oxygen | 30% (Class 10) | Resp_M_3
4 | Plants need oxygen only to convert to the carbon dioxide needed for photosynthesis | 20% (Class 10) | Resp_M_4


Principle 9. Designing answer rubrics to capture errors and misconceptions:

For subjective questions in large-scale examinations that will be corrected by multiple evaluators, rubrics with clear marking guidelines should be prepared. This helps bring uniformity to the assessment of students by different evaluators.

This should ideally be done in a two-step process. First, provisional rubrics are created along with the question paper, based on discussions between question makers and select evaluators. These rubrics assign marks to different answer types. Next, once the test is completed and student answer sheets are available, a sample of them is selected and corrected by a team of experienced evaluators. Final rubrics are then made by accounting for answer types that were not covered in the provisional rubrics but were found to occur in the actual answers. The correction by senior evaluators also helps establish, by consensus, a ‘standard’ grade for each answer type; this is incorporated in the final rubric, which is shared with all evaluators.

Well-designed rubrics serve an additional purpose, and the final rubric should be designed keeping this in mind – they can capture patterns in students’ responses to subjective questions. For this, evaluators should assign codes to each answer based on its content and the misconceptions it contains. For example, A1, A2 and A3 can be codes used to classify different forms of completely correct answers; B1, B2 and B3 can classify partially correct answers; and C1, C2 and C3 can classify completely incorrect answers. This facilitates an aggregated analysis of subjective questions and builds a data pool of common misconceptions for exam creators to draw on.
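The sketch below shows one way such evaluator-assigned codes could be aggregated to surface common answer patterns. The coding scheme (A* fully correct, B* partially correct, C* incorrect) follows the example above; the question ID and the data themselves are invented purely for illustration.

```python
from collections import Counter

# Each record: (question_id, rubric code assigned by the evaluator)
coded_responses = [
    ("Q7", "A1"), ("Q7", "B2"), ("Q7", "C1"), ("Q7", "C1"),
    ("Q7", "A2"), ("Q7", "B1"), ("Q7", "C1"), ("Q7", "B2"),
]

counts = Counter(code for qid, code in coded_responses if qid == "Q7")
total = sum(counts.values())

for code, n in counts.most_common():
    print(f"{code}: {n} responses ({100 * n / total:.0f}%)")

# A 'C' code that dominates points to a widespread error or misconception
# worth adding to the misconception database (see Box 4).
```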

These rubrics should be clear and objective, with grading criteria that ensure standardisation. However, they should be used for grading only by subject matter experts who can judge subtle differences in student responses. A rubric for a PISA released item is shown in Figure 7.


Figure 7:  A good sample rubric. The stimulus is shown on the left side and the question on the top right cell. The rubric in the bottom right cell specifies codes for responses that get full, partial and no credit. Note that there could be multiple codes for different answers all representing full or partial credit if there is a plan to analyse the different answers by students.

Principle 10. Publishing post-examination analysis booklets:

After each round of examinations, an aggregated analysis of all questions should be prepared and distributed among teachers, students and parents, documenting trends in common misconceptions, sample answers, etc.  The version shared with teachers would include additional detailed analyses which would contain a list of methods to address commonly found misconceptions and errors. Both quantitative and qualitative data should be synthesised for these purposes (Box 5 provides an example of how a question representing an important misconception may be presented, along with suggestions for teachers).

This will also create transparency in the process of examinations; all stakeholders will have a clear sense of expectations from exams and can prepare or support accordingly.

These analyses should be made public in a timely manner (3-4 months after an examination cycle) so that their findings can be addressed. This is crucial for any meaningful improvement of examinations, and by extension, the education system.

[Box 5: An example of how a question representing an important misconception may be presented, along with suggestions for teachers]

Principle 11. In large-scale exams, reporting results using scaled scores and percentiles:

Examinations should use scaled rather than raw scores to report results. In simple language, scaled scores represent a student’s performance on a consistent and standardised scale, taking into account the differences in difficulty between questions. These difficulties are calculated from actual student performance. Scaled results therefore reflect student performance much more accurately than raw scores. Internationally, it is common practice to use scaled scores for most large exams. Even in India, most competitive exams for college admissions use scaled scores when reporting results.

Once scaled scores are tabulated, each student’s result should be declared as a percentile rather than a percentage, meaning that it is expressed relative to other students’ performance. This helps distinguish between student performances at a very fine level, for example between students who have scored the same raw score.
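As a small illustration of percentile reporting, the sketch below converts a set of scores into percentile ranks (the percentage of test-takers scoring below a candidate, with ties counted at half weight). The scores are invented; in a real examination they would be scaled scores computed as described above, and the exact percentile convention used is a design choice.

```python
import numpy as np

scaled_scores = np.array([512, 487, 530, 512, 478, 601, 555, 512, 490, 530])

def percentile_rank(score, all_scores):
    below = np.sum(all_scores < score)   # candidates scoring strictly below
    ties = np.sum(all_scores == score)   # candidates with the same score
    return 100 * (below + 0.5 * ties) / len(all_scores)

for s in sorted(set(scaled_scores.tolist()), reverse=True):
    print(f"scaled score {s}: percentile {percentile_rank(s, scaled_scores):.1f}")
```

Candidates with identical scaled scores receive the same percentile, while the rest of the cohort is still spread out finely along the percentile scale.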

Additionally, for public examinations like Board Exams, this makes establishing ‘cut-off’ scores easier. This is also why competitive assessments like the Joint Entrance Exam (JEE) for engineering college admissions in India use percentiles when declaring their results.

A big advantage of using scaled scores and percentiles is that they facilitate comparability between different examinations and different years. For example, students scoring, say, the 93rd percentile in 2019 and in 2016 can reliably be considered to be of similar ability, as can students scoring similar percentiles in admission tests conducted by different states. This can make processes like college admissions very fair, without having to worry about whether a particular Board is ‘strict’ or ‘lenient’.

Conclusion:

Given that assessments are and shall continue to remain a ‘north star’ that guides the actors in the education system, they will always have an outsized effect on the teaching-learning process. Because current assessments in India tend to prioritise rote learning, they negatively impact the quality of education and are therefore perceived negatively by the public at large. They are considered a ‘necessary evil’ that serves certain functional purposes (sorting students based on ability), but little else.

However, if designed well, assessments possess significant potential to effect positive change throughout the education system. They can provide key feedback on student performance that enables focussed learning remediation. They also help discriminate between students of different abilities and aptitudes and can help them make informed choices about careers. Well-designed assessments (particularly large-scale assessments) are also a key barometer of our education system that can shine a light on areas requiring improvement. They will always have a multiplier effect on the education system and, by extension, our workforce and society. Using well-designed assessments can ensure that this effect is positive.
