Issues in Educational Research, 7(1), 1997, 37-51.

Assessing teachers' effectiveness

Michael J. Dunkin
The University of Sydney

A modified version of a paper commissioned by the United Nations Educational, Scientific and Cultural Organization.

The paper contains information and discussion concerning the following conceptual aspects of assessing teachers' effectiveness: purposes; category of teachers to be assessed; conceptions of teachers' work; dimensions of teacher quality; and approach to establishing the validity of assessments. There follows a review of literature on traditional methods of assessing teachers' effectiveness (paper-and-pencil tests, performance measures, and student achievement measures) and on emergent methods (on-the-job evaluation, performance exercises, portfolios, and interviews). The paper concludes with reference to other issues that need to be considered in developing systems for evaluating teachers' effectiveness.


There are five main preliminary matters involved in arriving at a system for the evaluation of teachers. The first is the purpose of the evaluation; the second is the target category of teachers to be assessed; the third is the conception of teachers' work that is adopted; the fourth concerns the dimensions of teaching quality about which judgments are to be made; and the fifth is the approach to establishing the validity of the assessments.

Purposes

Scriven (1967) drew attention to the distinction between formative and summative evaluation. If a school system institutes a system of assessment in order to encourage the professional growth and development of its teachers, it is engaged in formative evaluation. On the other hand, if the school system establishes an accountability system of evaluation in order to select teachers to license, hire, give tenure to, promote, demote or dismiss, it is engaged in summative evaluation.

Most commentators argue that the same procedures, and the information gathered with them, cannot be used for both purposes - that teachers who may well benefit from assessment for formative reasons will not expose their deficiencies if there is a risk that summative judgments might be made about them on the basis of information obtained for formative purposes (Darling-Hammond et al., 1983; Stiggins & Duke, 1990). Stiggins (1986) commented on the value of each of these two types of evaluation from the point of view of their contribution to overall school quality:

Accountability systems strive to affect school quality by protecting students from incompetent teachers. However, because nearly all teachers are at least minimally competent, the accountability system directly affects only a very few teachers who are not competent.

Thus, if our goal is to improve general school quality - and we use only those strategies that affect a few teachers - overall school improvement is likely to be a very slow process.

Growth-oriented systems, on the other hand, have the potential of affecting all teachers - not just those few who are having problems. There is no question that all teachers can improve some dimension(s) of their performance. (pp.53-54)

The survey of teacher evaluation that was conducted by Stiggins and Duke (1990) led them to suggest that there were several necessary conditions for the teacher growth model of teacher evaluation to succeed. The first was that any summative approach remain largely independent of the formative approach. Stiggins and Duke (1990) were not dismissive of summative evaluation. Rather they argued that highly developed accountability-based evaluation protects teachers' property and rights to due process and protects the public from incompetent teachers.

Category of teachers to be assessed

Issues and methods associated with teacher evaluation depend upon the stage of professional development attained by the teachers to be evaluated. Graduates of preservice teacher education programs seeking certification or licensing could not fairly have the same standards applied to them as would experienced teachers seeking promotion to senior teacher positions. Clearly, the assessment of preservice teachers would need to be considered separately from that of novice, inservice teachers, who in turn would need to be considered separately from experienced teachers seeking career awards, promotion or merit pay.

Stiggins and Duke (1990) suggested three parallel evaluation systems. The first would be an induction system for novice teachers, focused on meeting performance standards in order to achieve tenure, using clinical supervision, annual evaluation against performance standards, induction classes, mentors, and a recognition of similarities in performance expectations for all. The second would be a remediation system for experienced teachers whose performance deficiencies need to be corrected if they are to avoid dismissal. This would involve informal and formal letters of reprimand, planned assistance by a remedial team, and clinical supervision. The third would be a professional development system for competent, experienced teachers pursuing continuing excellence in particular areas of teaching. These teachers would be involved in goal setting, receive clinical supervision, rely for feedback on a wide variety of sources, such as peers, supervisors, students and themselves, and recheck their performance standards periodically. They would respond to the different demands for performance made by different grade levels and subject areas.

Stiggins and Duke (1990) studied several cases of successful growth-oriented evaluation and considered the most important policy decision to be the distinction between the three types of teacher clientele described above. They also concluded that such an approach necessitated teacher involvement in the development of teacher evaluation systems, and that the frequency of evaluations should vary across the three groups, from annually for the first two to perhaps every four years for the last. They suggested that departmental heads, peers, central authority supervisors, outside consultants, and students could all make worthwhile contributions. They went on to prescribe training for both supervisors and teachers in a "vision" of good teaching, in effective communication and interpersonal relations, and in the gathering and analysis of data. They further recommended that the sources of data used in the evaluation be diverse, including classroom observation; student achievement data that are sensitive to particular priorities and that are used by teacher and supervisor together for the purpose of teacher growth; artefacts such as lesson plans, student work books and teacher reflections; journals; and interview responses. Finally, the authors argued for "a culture conducive to growth", for teacher involvement, mainly in order to build a climate of trust, and for the provision of adequate resources to support professional development.

Conceptions of teachers' work

Darling-Hammond, Wise, and Pease (1983) presented several conceptions of teachers' work. First, teachers' work might be conceived of as labour, whereby the teacher's task is to implement educational programs as required and to adhere to prescribed procedures and routines. Second, teaching might be seen as a craft, that is, an activity involving knowledge of specialised techniques and rules for applying them. Next, the work of the teacher might be viewed as that of a profession. In this view, a teacher would need to be able to muster not only theoretical and technical knowledge, and specialised skills and techniques, but also sound professional judgment about their application, grounded in a body of theory. Fourth, teachers' work might be considered an art, with artistry manifested in unpredictable, novel, and unconventional applications of techniques in personalised rather than standardised forms.

Darling-Hammond (1986) illustrated the relationship between concept of teachers' work and evaluation approaches by distinguishing between "bureaucratic" and "professional" concepts of teaching. She wrote:

The bureaucratic conception of teaching implies that administrators and specialists plan curriculum, and teachers implement a curriculum planned for them. Teachers' work is supervised by superiors whose job it is to make sure that teachers implement the curriculum and procedures of the school district. In the pure bureaucratic conception, teachers do not plan or inspect their work; they merely perform it.

In a more professional conception of teaching, teachers plan, conduct, and evaluate their work both individually and collectively. Teachers analyze the needs of their students, assess the resources available, take the school district's goals into account, and decide on their instructional strategies ... Evaluation of teaching is conducted largely to ensure that proper standards of practice are being employed. (p.532)

Haertel (1991) claimed that the professional model should involve assessment based on control methods similar to those used in established professions like law and medicine, involving more rigorous entrance requirements, professional practice boards, altered school administration to allow teachers greater scope for planning and decision making, professional development roles for professional associations, and new forms of assessment. On a more sceptical note, however, Scriven (1996) referred to the "professional orientation" as "the politically correct approach" (p.444).

Dimensions of teacher quality

Other important conceptual distinctions concern three aspects or dimensions of teacher quality that are commonly used in making judgments about the quality of work performed by teachers. Medley (1982) and Medley and Shannon (1994) distinguished between teacher effectiveness, teacher competence and teacher performance. Teacher effectiveness is a matter of the degree to which a teacher achieves desired effects upon students. Teacher performance is the way in which a teacher behaves in the process of teaching, while teacher competence is the extent to which the teacher possesses the knowledge and skills (competencies) defined as necessary or desirable qualifications to teach. These dimensions are important because they influence the types of evidence that are gathered in order for judgments about teachers to be made. As Medley and Shannon (1994) pointed out, the main tools used in assessing teachers' competence are paper-and-pencil tests of knowledge, the main tools for assessing teachers' performance are observational schedules and rating scales, and the main tools for assessing teachers' effectiveness involve collecting "data about the teacher's influence on the progress a specified kind of student makes toward a defined educational goal" (p.6020) and are most likely to be student achievement tests.

Approach to establishing validity of assessments

This issue concerns the debate about epistemologies that has featured in research on teaching over the last two decades. Moss (1994) distinguished between "psychometric" or "traditional" and "hermeneutic" approaches, with particular reference to "performance assessment". In a psychometric approach to assessment, judges independently score each performance without any extra knowledge about the teacher or the judgments of other judges. Scores awarded to each separate component are aggregated, and the composite score is the basis for inferences about competence, with reference to relevant criteria or norms. In a hermeneutic approach, judges have contextual knowledge in which they ground their interpretations, and they make integrative interpretations about the collected set of performances rather than scoring each component separately. Rational debate among judges occurs, multiple sources of evidence are used, and judgments are revised as part of collaborative inquiry. Moss explained the issues as follows:

Regardless of whether one is using a hermeneutic or psychometric approach to drawing and evaluating interpretations and decisions, the activity involves inference from observable parts to an unobservable whole that is implicit in the purpose and intent of the assessment. The question is whether those generalizations are best made by limiting human judgment to single performances, the results of which are then aggregated and compared with performance standards [the psychometric approach], or by expanding the role of human judgment to develop integrative interpretations based on all the relevant evidence [the hermeneutic approach]. (p.8)

Traditional methods

Paper-and-pencil tests

Haertel (1991) pointed out that there had been a "dramatic rise" in teacher testing during the 1980s but criticised the validity of such tests. He wrote:

These tests have been criticized for treating pedagogy as generic rather than subject matter specific ..., for showing poor criterion-related validity ..., or failing to address criterion-related validity altogether ..., for failing to measure many critical teaching skills ..., and for their adverse impact on minority representation in the teaching profession. (p.4)

Haertel's criticism was of the construct, predictive and consequential validity of such tests. Darling-Hammond et al. (1995) endorsed Haertel's criticisms and added that they "ignore contextualized understanding of teaching and learning" (p.51), and "present a narrow behavioristic view of teaching that ... oversimplifies the nature of teacher decision making" (p.52).

Medley and Shannon (1994) concluded that the "content" validity of the National Teacher Examinations (Educational Testing Service, 1940-1976) in the USA "was at least as high as that of any similar test" (p.6016). By that they meant that the tests measured candidates' "academic" knowledge of the subject-matter they would be called upon to teach as well as any other test. However, they went on to conclude on the basis of approximately 50 studies of predictive validity, as follows:

These findings provide no empirical support for the assumption that scores on this or any other teacher competency test contain information about teacher effectiveness. They also raise serious questions about the validity of current teacher competency tests for making decisions about prospective teachers. (p.6017)

Although the tests were seen to be acceptable in terms of one aspect of construct validity, they were seen to be unacceptable in terms of another: the inclusion of acceptable items concerning "functional" or pedagogical knowledge. Darling-Hammond (1986), for example, found in her evaluation of one of the National Teacher Examinations that only 10 percent of the items tapped knowledge of pedagogy, and that over 40 percent were so vague that no acceptable answer existed or that the "correct" answer was a matter of belief rather than knowledge of research findings. Medley and Shannon (1994) suggested that this deficiency in construct validity led to these tests lacking predictive validity, that is, association with measures of teachers' classroom performance or student achievement.

Performance measures

Good and Mulryan (1990) provided a very thorough review of the use of rating scales in evaluating teachers and found that problems in their use had persisted from the early years of the twentieth century right up to the time of their writing (1988). When Medley and Shannon (1994) reviewed the literature on the validity of observational rating scales for measuring teacher performance, they found that the best of them had high content validity. It is not clear what was meant by "content validity" in this case, but presumably it had a wider meaning than academic subject-matter knowledge and included "aspects of teacher performance known to be related to teacher effectiveness" (p.6018). Medley and Shannon concluded as follows concerning predictive validity:

There is no empirical evidence that correlations between supervisors' ratings of teacher performance and direct measures of teacher effectiveness differ from zero. Thus, they apparently do not contain the information about teacher effectiveness they are assumed to contain. (p.6018)

Good and Mulryan (1990) invoked a professional development criterion and concluded as follows:

...[T]he key role for teacher ratings in the 1980s is to expand opportunities for teachers to reflect on instruction by analytically examining classroom processes. For too long rating systems ... delineated what teachers should do and collected information about the extent to which they did it. Ratings of teacher behavior should be made not only to confirm the presence or absence of a behavior but with the recognition that many aspects of a teaching behavior are important (quality, timing, context) and that numerous teacher behaviors combine to affect student learning. (p.208)

An alternative to the rating scale approach to measuring teacher performance is the low-inference observational schedule or checklist. In commenting upon the validity of these measures of teacher performance, Haertel (1991) reported criticisms involving unreliability, especially across content areas and grade levels, poor conceptual bases, incompetence and lack of resolve by the principals who apply them, negative teacher attitudes towards them, lack of uniformity within school systems, inadequate training of school administrators in their use, trivialisation of teaching proficiency, and reinforcement of a "single, narrow conception of effective teaching" (p.5). Medley and Shannon (1994) criticised them for having less face validity, costing more to develop, and being less sensitive to classroom complexities than rating scales. However, unlike rating scales, observation schedules were less subject to halo effects on raters and had been shown to have predictive validity. Good and Mulryan (1990) concluded that the relationships identified between observed classroom behaviours and students' scores on standardised achievement tests and on criterion-referenced tests had been "small but significant." Stodolsky (1990) pointed out that "[u]sers must accept the limitations of observations as sources of evidence about teaching while recognizing that they provide a needed direct view of teaching processes in action" (p.185). Later, she concluded, "[c]lassroom observations are likely to be the centrepiece of a systematic evaluation strategy" (p.189).

Darling-Hammond et al. (1995) saw three major deficiencies in "first-generation" attempts to obtain performance measures of teachers:

  1. The rating instruments seek to promise objectivity by specifying a set of generic uniform teaching behaviors that are tallied in a small number of classroom observations. In so doing, they fail to assess the appropriateness of teaching behaviors and decisions, and they completely neglect teaching content.
  2. The assessment systems do not evaluate candidates in similar job settings and performance situations.
  3. Licensing assessments are made in part by employers who are also responsible for hiring and for granting tenure, thereby entangling licensing and employment decisions in conflicts of interest. (p.61)

Student achievement measures

Glass (1990) reported a case study of the use of pupil achievement data in the evaluation of teachers. It was the case of a school that initiated a merit pay system to reward its teachers. After stating that pupil-achievement data could not tell teachers how to teach or distinguish between good and poor teachers, Glass reached the following conclusions, among others:

Using student achievement data to evaluate teachers ... is too susceptible to intentional distortion and manipulation to engender any confidence in the data; moreover, teachers and others believe that no type of test nor any manner of statistical analysis can equate the difficulty of the teacher's task in the wide variety of circumstances in which they work. (p.239)

Medley and Shannon (1994) also expressed serious doubts about using measures of student achievement to judge teacher effectiveness. After specifying the conditions required for measuring student achievement, they hinted at the same deliberate distortions mentioned by Glass when they warned as follows:

The fact that the achievement test used to measure student achievement ... is valid is no guarantee that measures of teacher effectiveness based on that test will also be valid. On the contrary, using students' scores for such a purpose will almost certainly destroy the validity of the test... Valid measures of teacher effectiveness can be derived from students' achievement test scores only if they are used for other purposes than the evaluation of individual teachers. (p.6019)

Emergent methods

The traditional methods discussed above were described by Haertel (1991) as part of a "bureaucratic" model of teaching which was being replaced by "professional" models of teaching. Contrasted with them are newer methods of on-the-job evaluation, performance exercises and simulations, portfolios, and interviews.

On-the-job evaluation

According to Darling-Hammond et al. (1995), on-the-job evaluation has the following advantages: (1) the teacher is observed in the context in which he or she works so that the students are familiar, the work observed is part of the on-going program being followed, and family and community conditions are understood, so that the appropriateness of teaching decisions in the particular context can be judged; (2) information can be obtained about qualities that can only be observed over time and that emerge spontaneously, such as relationships with colleagues and students and ability to communicate harmoniously with parents; (3) it is particularly useful to apply during internships if the latter are structured so that specific types of tasks, experiences and evaluations are "sustained and wide-ranging" (p.73). However, given the great variety of contexts in which teaching may occur, on-the-job evaluation is unsatisfactory if used as a single performance measure for licensing new teachers, especially in a particular job context that is not structured to ensure that certain types of practice or experience occur. Indeed, Wise and Darling-Hammond (1987) were reported to have found that most methods for on-the-job evaluations for licensure were unreliable because of lack of comparability of contexts and tasks (Darling-Hammond et al., 1995).

Darling-Hammond et al. (1995) cite the Praxis III instrument developed by the Educational Testing Service as one in which reliance is placed on evaluators' judgments of the appropriateness of the teaching provided for a particular group of students. They describe the procedure thus:

As the assessors observe and rate first-year teachers or interns during one classroom period, they use a set of standards and questions to guide their observations rather than a checklist or tally sheet. The standards are in many cases more aligned with emerging professional standards than were conceptions of teaching behaviors embodied in first-generation on-the-job evaluations ... Throughout the instrument, the question of how aspects of the observed lesson are appropriate for individuals and groups of students is raised. (pp.73-74)

Darling-Hammond et al. (1995) saw on-the-job evaluation as limited because of its reliance on brief observations of teacher classroom performance, but stated that it could best be used in conjunction with several more standardised indicators of teacher development.

Performance exercises

Haertel (1990) used the term "performance exercises" to apply to "teacher assessments conducted outside of actual teaching situations, but with tasks, settings, or materials designed to simulate those of actual practice" (p.218). Examples are as follows: critique a textbook; plan a lesson; discuss or correct student homework; comment on a videotape of another teacher's performance; and discuss the use of specialised instructional materials, such as Cuisenaire rods. One of the main sources of such exercises was the Teacher Assessment Project (TAP) at Stanford University which piloted prototype exercises initially in the areas of upper elementary mathematics teaching and fifth-grade US history. These were designed around the concept of an assessment centre, at which experienced teachers might spend from 1-3 days engaged in performance exercises. Scoring systems were devised involving ratings according to criteria given differential weightings, and standards for passing various components were adopted. Later, the TAP focussed on elementary literacy teaching and explored the use of portfolios of teachers' work.

Portfolios

Haertel (1991) described the use of portfolios documenting teachers' work in the TAP. In the elementary literacy program, for example, teachers were provided with handbooks including worksheets, instructions, and suggestions for material to be included. The portfolios contained such items as overviews of 3-5 weeks of teaching, details of two or three consecutive lessons, a list of library resources, copies of handouts to students, samples of students' work, copies of chalkboard work, videotapes and audiotapes of teaching episodes, and observer notes. Candidates were asked to explain the rationales and other information behind some items and to discuss them in an interview. As before, scoring systems were again devised and criteria and standards adopted. As the TAP came to an end in the US, the National Board for Professional Teaching Standards (NBPTS) was established and has set up assessment development laboratories (ADLs) to develop assessment instruments and supporting materials for the certification of teachers in various content areas for particular age/grade ranges.

Darling-Hammond et al. (1995) commented favourably on portfolios as providing "potentially rich evidence of teacher knowledge and skills" (p.82) but saw the following disadvantages:

Many kinds of portfolio artifacts may or may not represent the actual work of the candidate. Teachers could conceivably use "canned" lesson or unit plans available in many commercial packages or district curriculum guides, syllabi obtained from other teachers, or assignments or tests developed by others. (p.82)

Even work completed collegially presents problems as it may be difficult for an assessor to know the extent and quality of a candidate's contribution to a jointly created artefact. Other problems involve "window-dressing".

One of the problems of performance assessment is the possibility that the tasks set become the focus to the neglect of the constructs that provide the rationale for the tasks. Messick (1994) discussed the potential consequences of allowing tasks rather than constructs to become the focus of performance assessment. He explained the construct approach, thus:

A construct-centered approach would begin by asking what complex of knowledge, skills, or other attributes should be assessed, presumably because they are tied to explicit or implicit objectives of instruction or are otherwise valued by society. Next, what behaviors or performances should reveal those constructs, and what tasks or situations should elicit those behaviors? Thus, the nature of the construct guides the selection or construction of relevant tasks as well as the rational development of construct-based scoring criteria and rubrics. Focusing on constructs also alerts one to the possibility of construct-irrelevant variance that might distort the task performance, its scoring, or both. (p.16)

Because it is the construct which guides the construction or choice of tasks as well as the rational development of scoring procedures, Messick recommended that where possible a construct-driven and not a task-driven approach be used in performance assessment.

Scriven (1996) expressed doubts about the extent to which performance tasks designed under the Stanford/National Board model lead to authentic assessments. Scriven wrote:

... [T]he key question about efforts like the Stanford one ... is whether they do in fact test the competencies required for effective teaching and those alone. The answer, to put it bluntly, is that they only provide a parody of such an approach. To begin with, the Stanford/National Board model clearly requires a great deal of verbalization ... that goes far beyond anything that has been shown to be required for effective teaching ... On a common sense approach, there are some pretty good teachers who aren't very good at talking about what they do or why they do it, and care about the students as well as the subject. They would certainly not pass these tests ...

... Even more obviously, the Stanford/NBPTS effort provides no test of subject matter knowledge outside the chosen micro lesson. And it provides no test of assessment skills; and so on and on. In short, a truly superficial, over-academic approach. (pp.445-446)

Interviews

Properly conducted interviews provide a great deal of information about teachers' thinking, intentions and understanding. They allow two-way exchanges between assessor and candidate, and allow the former to probe more deeply into matters emerging in earlier parts of the evaluation process. However, interviews are expensive to administer, difficult to score, and subject to bias in the form of potential effects of candidates' attributes such as race, gender, and ethnicity. Furthermore, candidates' verbal skills can have a disproportionate influence on judgments made about them, as Scriven (1990) pointed out so graphically:

Interviews are ... the chosen battleground of used-car salesmen, when what we need is a warranty. Interviews are the province of the peak performer, when what we need is a stayer. Nobody shines in an interview better than a psychopath, and the usual interviewers for school jobs are surely not competent at identifying psychopaths in an interview ... This lust to interview is illicit. (pp.93-94)

As Leinhart (1991) pointed out, Scriven's apparent exaggeration was probably designed more to provoke thought than to provide a balanced assessment of the value of interviews in teacher evaluation.

Conclusion

There are many issues concerning the evaluation of teachers beyond those explored here. Most notably, the validity of the emergent methods when they are combined into programs of evaluation seems seldom to have been scrutinised on a large scale. Messick (1994) raised the question of what constitutes validation for performance assessments. He wrote:

... [P]erformance assessments must be evaluated by the same validity criteria, both evidential and consequential, as are other assessments. Indeed, such basic assessment issues as validity, reliability, comparability, and fairness need to be uniformly addressed for all assessments because they are not just measurement principles, they are social values that have meaning and force outside of measurement wherever evaluative judgments and decisions are made. (p.13)

Darling-Hammond et al. (1995) focussed upon the licensing of teachers in arriving at a set of specifications for an assessment system. To them, such a system should pursue the following objectives:

  1. It should reflect the knowledge and skills all professionals are expected to master as a minimum requirement for responsible practice. Responsible teachers should be able to evaluate teaching and learning circumstances and make decisions in light of knowledge about teaching and learning, about the students they serve, and about their moral obligations. The assessment system thus represents a professional consensus about what kinds of abilities and commitments provide the foundations for professional standards of practice.

  2. It should be constructed so as to encourage the acquisition of the required professional knowledge, skills, and dispositions. That is, the assessment system should be designed and staged in such a way that it actually increases the probabilities that prospective teachers will acquire the desired capabilities.

  3. It should reliably and validly sort those candidates who are adequately prepared for responsible independent practice from those who are not. (p.89)

With appropriate adjustments, those specifications should be useful in the design of assessment systems for teachers at all stages of their careers, from beginners onwards.

Another issue is the legality of new approaches, which is tested when judgments made on the basis of them are challenged in the courts (Rebell, 1990). The development and validation of teaching standards is explored in a book yet to appear (Ingvarson, in press). Strike (1990) has tackled ethical issues concerning such values as privacy, due process, equity, and humaneness, and has put forward a "Bill of Rights for Teacher Evaluation" that includes the rights of the public. In addition, the economics of new approaches to teacher evaluation need to be investigated fully (Hoenack & Monk, 1990). When all of the issues mentioned above are considered in perspective, a school system should be in a good position to design valid mechanisms for assessing teachers' effectiveness.

References

Darling-Hammond, L. (1986). Teaching knowledge: How do we test it? American Educator, 10(3), 18-21, 46.

Darling-Hammond, L., Wise, A.E., & Pease, S.R. (1983). Teacher evaluation in the organizational context: A review of the literature. Review of Educational Research, 53, 285-328.

Darling-Hammond, L., Wise, A.E., & Klein, S.P. (1995). A license to teach: Building a Profession for 21st-Century schools. Boulder, CO: Westview Press.

Glass, G.V. (1990). Using student test scores to evaluate teachers. In J. Millman & L. Darling-Hammond (Eds.), The new handbook of teacher evaluation: Assessing elementary and secondary school teachers (pp.229-240). Newbury Park, CA: Sage.

Good, T.L., & Mulryan, C. (1990). Teacher ratings: A call for teacher control and self evaluation. In J. Millman & L. Darling-Hammond (Eds.), The new handbook of teacher evaluation: Assessing elementary and secondary school teachers (pp.191-215). Newbury Park, CA: Sage.

Haertel, E.H. (1990). Performance tests, simulations, and other methods. In J. Millman & L. Darling-Hammond (Eds.), The new handbook of teacher evaluation: Assessing elementary and secondary school teachers (pp.278-294). Newbury Park, CA: Sage.

Haertel, E.H. (1991). New forms of teacher assessment. In G. Grant (Ed.), Review of research in education, Vol. 17 (pp.3-29). Washington, D.C.: American Educational Research Association.

Hoenack, S.A., & Monk, D.H. (1990). Economic aspects of teacher evaluation. In J. Millman & L. Darling-Hammond (Eds.), The new handbook of teacher evaluation: Assessing elementary and secondary school teachers. Newbury Park, CA: Sage.

Ingvarson, L. (Ed.). (in press). Teacher evaluation for professional certification: The first ten years of the National Board for Professional Teaching Standards. Greenwich, CT: JAI Press.

Leinhardt, G. (1991). Evaluating The New Handbook of Teacher Evaluation. Educational Researcher, 20(6), 23-25.

Medley, D.M. (1982). Teacher competency testing and the teacher educator. Charlottesville, VA: Association of Teacher Educators and the Bureau of Educational Research, University of Virginia.

Medley, D.M., & Shannon, D.M. (1994). Teacher evaluation. In T. Husen & T.N. Postlethwaite (Eds.), The international encyclopedia of education, 2nd edn., Vol. 10 (pp.6015-6020). Oxford: Pergamon.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23 (2), 13-23.

Moss, P.A. (1994). Can there be validity without reliability? Educational Researcher, 23 (2), 5-12.

Rebell, M.A. (1990). Legal aspects of teacher evaluation. In J. Millman & L. Darling-Hammond (Eds.), The new handbook of teacher evaluation: Assessing elementary and secondary school teachers. Newbury Park, CA: Sage.

Scriven, M. (1967). The methodology of evaluation. In R.W. Tyler, R.M. Gagne, & M. Scriven (Eds.), Perspectives of curriculum evaluation. American Educational Research Association Monograph Series on Curriculum Evaluation, No. 1 (pp.39-83). Chicago: Rand McNally.

Scriven, M. (1990). Teacher selection. In J. Millman & L. Darling-Hammond (Eds.), The new handbook of teacher evaluation: Assessing elementary and secondary school teachers (pp.76-103). Newbury Park, CA: Sage.

Scriven, M. (1996). Assessment in teacher education: Getting clear on the concept. Teaching and Teacher Education, 12, 443-450.

Stiggins, R.J. (1986). Teacher evaluation: Accountability and growth - different purposes. NASSP Bulletin, 70(490), 51-58.

Stiggins, R.J., & Duke, D.L. (1990). The case for commitment to teacher growth: Research on teacher evaluation. Albany, NY: State University of New York Press.

Stodolsky, S.S. (1990). Classroom observation. In J. Millman & L. Darling-Hammond (Eds.), The new handbook of teacher evaluation: Assessing elementary and secondary school teachers, (pp.175-190). Newbury Park, CA: Sage.

Strike, K.A. (1990). The ethics of educational evaluation. In J. Millman & L. Darling-Hammond (Eds.), The new handbook of teacher evaluation: Assessing elementary and secondary school teachers (pp.175-190). Newbury Park, CA: Sage.

Author: Dr Michael (Mick) Dunkin is Visiting Professor in the School of Educational Psychology, Measurement and Technology at The University of Sydney. A former Vice-President of the NSW Institute for Educational Research and Editor of IER, he retired from the position of Professor of Teacher Education at the University of New South Wales in August, 1996. He is well known internationally as an author and editor of works on research on teaching and teacher education.

Please cite as: Dunkin, M. J. (1997). Assessing teachers' effectiveness. Issues in Educational Research, 7(1), 37-51. http://www.iier.org.au/iier7/dunkin.html


© 1997 Issues in Educational Research